Methodological techniques for dealing with missing data

American Journal of Health Studies, Spring-Summer, 2003 by Thomas W. O'Rourke

When analyzing data, it is commonplace to observe that data are not always complete for each case. Rather, some data are usually missing. In some cases the amount of missing data may be minimal; in others it may be significant. This article deals with the methodological issues related to missing data. Specific questions addressed are: What are missing data? Why are missing data important? What are the major reasons for missing data? How might missing data be prevented or minimized? How do you detect missing data? What are the types of missing data? What do you do with missing data? And, how do common software packages handle missing data?

**********

WHAT ARE MISSING DATA?

Missing data are data you intended to collect but that never made it into your database for subsequent analysis.

WHY ARE MISSING DATA IMPORTANT?

There are several reasons for being concerned about missing data. Missing data can reduce your effective sample size, which results in a loss of statistical power. That is, you may have initiated your study with a sample size of 300 people, but missing data may effectively reduce the sample size to 250 cases. As sample size decreases, the sampling variability of your estimates increases and your confidence in them decreases. The cases that remain also may no longer be representative, because missing data may introduce bias. For example, if you find that some sensitive health behavior questions, such as sexual activity or drug behavior, or just a basic demographic question such as annual income, are not answered by many people, then the responses you do get may not be representative of the population. That is, people whose behavior was socially acceptable or whose incomes were neither very low nor very high may have responded, whereas those having less socially acceptable responses or who were at very low or high incomes may have been more likely not to respond. Thus, when you analyze the health behaviors, you have a built-in bias due to missing data.
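The 300-to-250 example can be made concrete with a minimal sketch. It assumes a simple random sample and an estimated proportion of 0.5; both are assumptions chosen for illustration and are not part of the example above. The sketch compares the 95% margin of error at the two sample sizes.

```python
import math

def standard_error(p, n):
    """Standard error of a sample proportion p under simple random sampling."""
    return math.sqrt(p * (1 - p) / n)

p = 0.5            # assumed proportion, chosen only for illustration
planned_n = 300    # sample size at the start of the study
effective_n = 250  # cases remaining after missing data

se_planned = standard_error(p, planned_n)
se_effective = standard_error(p, effective_n)

# 1.96 is the normal critical value for a 95% confidence interval
print(f"margin of error, n=300: {1.96 * se_planned:.3f}")
print(f"margin of error, n=250: {1.96 * se_effective:.3f}")
print(f"relative increase: {se_effective / se_planned - 1:.1%}")
```

Because the standard error is proportional to one over the square root of n, the relative increase depends only on the ratio 300/250 and not on the assumed proportion.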

Similarly, missing data make it difficult to measure effects. Using the same example, you may have been interested in assessing the effects of income on health behaviors. However, if people at the ends of the income spectrum were more likely not to respond than those in the middle of the range, you may conclude there was no relationship between income and health behaviors when in fact one may exist. Thus, missing data can influence both the analysis and interpretation of your data.
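A small simulation can illustrate how nonresponse at the extremes of income may weaken an observed relationship. Everything in it is hypothetical: the income scale, the linear relationship with a behavior score, the noise level, and the assumption that 90% of people below 25 or above 75 on the income scale skip the item. It is a sketch of the mechanism, not an analysis of real data.

```python
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: a behavior score that rises with income, plus noise.
incomes = [rng.uniform(0, 100) for _ in range(5000)]
scores = [income + rng.gauss(0, 30) for income in incomes]

# Assumed nonresponse pattern: 90% of people at the income extremes skip the item.
responded = [
    (income, score)
    for income, score in zip(incomes, scores)
    if 25 <= income <= 75 or rng.random() > 0.9
]
obs_incomes = [income for income, _ in responded]
obs_scores = [score for _, score in responded]

r_full = pearson_r(incomes, scores)
r_observed = pearson_r(obs_incomes, obs_scores)
print(f"correlation, everyone:         {r_full:.2f}")
print(f"correlation, respondents only: {r_observed:.2f}")
```

Under these assumptions the correlation among respondents is noticeably weaker than in the full population, because most of the cases that carry the income contrast are the ones that went missing.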

WHAT ARE THE MAJOR REASONS FOR MISSING DATA?

There are many reasons for missing data. Common reasons are presented below, and ways to address them are presented in the following section. A common form of missing data is the subject's refusal to answer an item. Often this is because of question sensitivity. Sensitive questions about health behaviors, income, and illegal activities are examples. Another common reason is that the respondent simply doesn't know the answer. This may be because of memory problems (for example, not being able to remember the date of the last physical check-up) or comprehension problems (not understanding the words or constructs in the question). Sometimes the data desired are simply not applicable. For example, a questionnaire item asking a person how long ago they had a tetanus shot may result in a nonresponse by someone who never had one. Or, asking enrollees of an HMO to evaluate their most recent visit during the past twelve months will not be applicable to those who haven't been to the HMO facility during that period.
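One way to keep these reasons distinguishable at analysis time is to record a separate code for each instead of leaving the field blank. The sketch below assumes hypothetical codes (-7, -8, -9) and made-up responses to the tetanus item; any values that cannot be valid answers would serve equally well.

```python
from collections import Counter

# Hypothetical reason codes; any values that cannot be valid answers would do.
REFUSED = -7
DONT_KNOW = -8
NOT_APPLICABLE = -9
REASON_LABELS = {
    REFUSED: "refused",
    DONT_KNOW: "don't know",
    NOT_APPLICABLE: "not applicable",
}

# Made-up responses to "years since last tetanus shot"
responses = [3, REFUSED, 10, DONT_KNOW, NOT_APPLICABLE, 1, DONT_KNOW, 7]

# Separate usable answers from coded nonresponses, then tally the reasons.
valid = [r for r in responses if r not in REASON_LABELS]
reason_counts = Counter(REASON_LABELS[r] for r in responses if r in REASON_LABELS)

print(f"valid answers: {len(valid)} of {len(responses)}")
for reason, count in reason_counts.items():
    print(f"  {reason}: {count}")
```

Tallying by reason shows whether an item suffers mainly from refusals, from recall or comprehension problems, or from being asked of people to whom it does not apply.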

Missing data may be the result of questionnaire programming errors if computer-assisted interviewing is used. For example, with reference to the recent-visit question above, the respondent who hadn't visited the HMO in the past year shouldn't have been presented with the question in the first place. Data processing errors also may be responsible for missing data. The data may not have been entered, or may have been entered incorrectly. Missing data are also commonplace in studies where data are collected over time. For example, in evaluating a health promotion weight reduction intervention program, data on knowledge, attitudes, eating behaviors, and weight may be collected at the beginning and end of the program as well as six and twelve months later. At each data collection point, subject attrition is possible and cumulative, with attrition increasing with time.
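The cumulative nature of attrition can be shown with a short sketch. The enrollment of 300 and the dropouts at each collection point are made-up numbers, and the sketch assumes that subjects who drop out do not return at a later wave.

```python
# Hypothetical numbers: 300 enrollees and made-up dropouts at each collection point.
waves = ["program start", "program end", "6-month follow-up", "12-month follow-up"]
dropouts = [0, 30, 27, 24]  # subjects lost since the previous wave (assumed)

enrolled = 300
remaining = enrolled
retained = []
for wave, lost in zip(waves, dropouts):
    remaining -= lost  # losses accumulate from one wave to the next
    retained.append(remaining)
    cumulative_attrition = (enrolled - remaining) / enrolled
    print(f"{wave:20s} n={remaining:3d}  cumulative attrition={cumulative_attrition:.0%}")
```

Even though no single wave loses more than about a tenth of the subjects present at the previous wave, only 219 of the 300 enrollees in this example have data at all four points.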

HOW MIGHT MISSING DATA BE PREVENTED OR MINIMIZED?

An understanding of the reasons for missing data can help reduce the problem and thus strengthen your data. The best approach to missing data is prevention. That is, missing data can be prevented at the outset by developing a well-designed instrument with clear directions and unambiguous, answerable items. Another strategy, when data are collected by phone or personal interview, is to check that all applicable data have been collected before ending the interview. Data returned by mail questionnaire can be checked for missing data and follow-up done accordingly, although this can be a time-consuming and costly process.
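The end-of-interview check described above can be sketched as a small function that lists the applicable items still unanswered. The item names, the skip rule, and the function itself are hypothetical; a real computer-assisted interviewing system would have its own way of expressing skip patterns.

```python
def unanswered_items(answers, items, applies):
    """Return the applicable items that still have no answer.

    answers: dict of item name -> response (None or absent means unanswered)
    items:   item names in questionnaire order
    applies: dict of item name -> function(answers) -> bool, for skip rules
    """
    missing = []
    for item in items:
        rule = applies.get(item)
        if rule is not None and not rule(answers):
            continue  # item is legitimately skipped for this respondent
        if answers.get(item) is None:
            missing.append(item)
    return missing

# Hypothetical instrument: the rating question applies only to recent visitors.
items = ["age", "visited_past_12_months", "visit_rating", "annual_income"]
applies = {"visit_rating": lambda a: a.get("visited_past_12_months") == "yes"}

answers = {"age": 44, "visited_past_12_months": "no", "annual_income": None}
print(unanswered_items(answers, items, applies))
```

For this respondent only the income item is flagged; the visit rating is skipped because it does not apply, so it is not counted as missing.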


 


Content provided in partnership with Thompson Gale