How Missing Values are Addressed in Statistical Analysis
By default, most statistical analysis programs make the incorrect assumption that data is Missing Completely At Random (see Types of Missing Values for definitions and related concepts). This assumption is rarely appropriate with survey data. It is routinely made because it is the simplest assumption. In many areas of statistics, assumptions can be broken without dramatic consequences. The treatment of missing data is not such an area and incorrectly assuming data is Missing Completely At Random can lead to massively misleading results (e.g., in the case of regression, it can cause the conclusions of a model to be reversed).
When we have missing values in data, we need to go through the following process:
- Try to fix the data (e.g., re-contact respondents and get their answers). If possible, it is better to work out the correct value of the missing data. Often categories are missing because they are inapplicable. Somebody who is listed as a home maker and has income listed as Missing Data probably has no income. Often common sense tells us what the true answers must be. If a respondent has indicated that they never purchase ice cream, they may not have been asked about their frequency of buying Magnums, and we will be safe in replacing the missing value with a value of No. Similarly, if someone has indicated that they Don’t Know whether it is important to have a king sized bed in a hotel room, we can be reasonably confident in assuming that it cannot be of great importance to the respondent. And, if someone cannot remember the last time they went to the cinema, we can be reasonably confident it was not in the last week. Where common sense is not enough, we need to look for clues in answers to other questions in order to replace the missing value with a meaningful response. It is not unknown for non-commercial research institutes to have junior researchers and students read through questionnaires to determine what the likely response may have been. This practice may seem suspect, but it is probably less dangerous than ignoring the problem.
- Determine whether the missing values are best characterized as being Missing Completely At Random, Missing At Random, or Nonignorable (see Types of Missing Values).
- (Optionally) Data imputation, which involves replacing the missing values with predictions for their likely values. Imputation is always something of a last resort, and this step should only be conducted if the final step (using statistical methods that make appropriate assumptions) cannot be conducted appropriately. Most automated imputation methods implicitly assume that the data is Missing At Random.
- (Optionally) Weighting, whereby the data is weighted to correct for the missing value pattern. Theoretically this is equivalent to imputation but in practice it is a different process.
- Using statistical methods that make appropriate assumptions regarding the type of missing data. Where the statistical methods available make assumptions that are known to be incorrect it is sometimes advisable to use imputation. However, it is always theoretically preferable to use statistical methods which make appropriate assumptions, as inevitably the process of imputation is very inaccurate and these inaccuracies infect any statistical methods.
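The weighting step in the process above can be illustrated with a minimal sketch. It assumes a simple weighting-class adjustment, which is only one of several possible weighting schemes, and it assumes the data is Missing At Random within each group. The records, group names, and function name are all hypothetical.

```python
# Hypothetical survey records: (group, answer). None marks a missing answer.
# Assumption: within each group, respondents and non-respondents are alike.
records = [
    ("young", 1), ("young", 1), ("young", None), ("young", None),
    ("old", 0), ("old", 0), ("old", 0), ("old", 1),
]


def weighting_class_estimate(records):
    """Weight each respondent by (group size) / (respondents in that group)."""
    totals, responders = {}, {}
    for group, answer in records:
        totals[group] = totals.get(group, 0) + 1
        if answer is not None:
            responders[group] = responders.get(group, 0) + 1

    weighted_sum = 0.0
    weight_total = 0.0
    for group, answer in records:
        if answer is None:
            continue  # missing answers contribute nothing directly
        weight = totals[group] / responders[group]
        weighted_sum += weight * answer
        weight_total += weight
    return weighted_sum / weight_total


observed = [answer for _, answer in records if answer is not None]
unweighted = sum(observed) / len(observed)  # simply ignores the missing answers
weighted = weighting_class_estimate(records)  # corrects for who is missing

print(unweighted, weighted)
```

In this sketch the "young" group is under-represented among those who answered, so its respondents receive a larger weight and the weighted proportion moves towards the answers typical of that group.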
The rest of this page reviews the most common types of analyses that are conducted in market research and how they can be implemented depending upon the type of missing data.
Averages and percentages
When averages and percentages are computed in standard statistical software, the missing values are excluded from the analysis. This implicitly involves the assumption that the data is Missing Completely At Random. If this assumption is incorrect but the data is Missing At Random, imputation is generally the best solution.
When the missing values are Nonignorable there is little that can be done to compute meaningful averages and percentages.
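A small sketch may help show why excluding missing values can mislead when computing an average. The income figures below are hypothetical, and the example assumes that the highest earner is the one who did not answer (i.e., the data is not Missing Completely At Random).

```python
import math

# Hypothetical incomes (in thousands). The true values are never seen in practice.
true_incomes = [20.0, 30.0, 40.0, 50.0, 60.0, 100.0]

# Assumption: the highest earner declined to answer, so the value is missing.
observed_incomes = [20.0, 30.0, 40.0, 50.0, 60.0, math.nan]


def mean_excluding_missing(values):
    """Average of the non-missing values, as most software does by default."""
    present = [v for v in values if not math.isnan(v)]
    return sum(present) / len(present)


true_mean = sum(true_incomes) / len(true_incomes)
observed_mean = mean_excluding_missing(observed_incomes)

print(true_mean, observed_mean)
```

The average of the observed values is lower than the true average, because the exclusion quietly treats the missing respondent as if they were like everybody else.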
Correlations
Correlations implicitly assume that the data is either Missing Completely At Random or Missing At Random. It is generally not appropriate to compute correlations with data that is imputed. This is because:
- Good imputations use the observed correlations in the data to infer the missing values, and thus using imputed data to compute correlations involves circular logic.
- Most imputations are not very good, and the correlations computed using the imputed values will be biased, whereas without the imputation they may not be biased at all, even when the data is Missing At Random.
A simple example helps in understanding this problem. Imagine that the true values of 10 respondents are as follows. These variables clearly have a perfect correlation of 1.
Now consider a situation where the y variable is missing for respondents who have values of 6 or more, which is an example of data that is Missing At Random. Using only the data that is available, we still observe a perfect correlation, and thus our analysis is not ruined by the data being Missing At Random.
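The scenario just described can be reproduced with a short sketch. The values used here are hypothetical stand-ins (x and y both running from 1 to 10), and the sketch assumes that "values of 6 or more" refers to the x variable.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical true values: y equals x for all 10 respondents.
x = list(range(1, 11))
y_true = list(range(1, 11))

# Assumption: y is missing (None) whenever x is 6 or more.
y_observed = [y if xv < 6 else None for xv, y in zip(x, y_true)]

# Keep only the respondents with both values present.
complete = [(xv, y) for xv, y in zip(x, y_observed) if y is not None]
r_true = pearson(x, y_true)
r_complete = pearson([c[0] for c in complete], [c[1] for c in complete])

print(r_true, r_complete)
```

Both correlations come out as 1 (up to floating point rounding), which is the sense in which the analysis survives the data being Missing At Random.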
The next table shows the results computed using the SPSS Missing Value Analysis module, using the EM algorithm. Note that SPSS has done a pretty good job (and, if we had played around with the options in SPSS, we could have got it to do a better job). However, the correlation is now estimated as 0.994. At first glance that may seem almost the same as the correct value of 1. However, the missing data pattern was a really obvious one, and still the algorithm has gotten it wrong, with the consequence that we have underestimated the true relationship. With weaker correlations and more variables, the problem becomes much greater, and it is thus, in general, best not to use imputed values when computing correlations. If you read the Imputation page you will see another example of correlations, where the imputation causes the correlation to be exaggerated.
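To see how imputation can distort a correlation, the following sketch uses mean imputation, which is a much cruder method than the EM algorithm discussed above and is substituted here only because it is easy to show in a few lines. The data is hypothetical (x and y both running from 1 to 10, with y missing when x is 6 or more).

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: the true relationship is y = x, a perfect correlation.
x = list(range(1, 11))
y_observed = [float(v) if v < 6 else None for v in x]

# Mean imputation: replace each missing y with the mean of the observed y values.
present = [y for y in y_observed if y is not None]
fill_value = sum(present) / len(present)
y_imputed = [fill_value if y is None else y for y in y_observed]

r_imputed = pearson(x, y_imputed)
print(r_imputed)
```

In this sketch the correlation computed from the imputed data falls well below the true value of 1, whereas simply using the available data would have preserved it.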
When the missing values are Nonignorable there is little that can be done to compute meaningful correlations.
Principal Components Analysis
Different statistical programs make different assumptions about missing values when conducting principal components analysis. To understand the differences between these implementations it is important to understand that principal components analysis is computed from the correlation matrix (i.e., the correlations between each of the pairs of variables).[note 1]
SPSS by default has a setting of Exclude cases pairwise, which means that it computes the correlation between each pair of variables using all the respondents who have data for both variables in the pair. This involves an implicit assumption that the data is Missing Completely At Random.
An alternative approach is to compute correlations using only the respondents who have no missing values on any of the variables. This is the default in R (where it is referred to as na.exclude) and is the only option in Q. This approach to missing data is consistent with the assumptions that the data is Missing Completely At Random and sometimes Missing At Random.[note 2] SPSS can also be set to use this assumption (Options : Missing Values : Exclude cases listwise). In terms of its assumptions about the nature of the missing data, this approach is generally preferable to pairwise deletion. However, with large amounts of missing values it is often impossible to use this method.
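The difference between the two approaches is easiest to see by counting which respondents each one uses. The sketch below does only that counting; it does not reproduce the internals of SPSS, R, or Q, and the data and function names are hypothetical.

```python
from itertools import combinations

# Hypothetical data: one dict per respondent, None marks a missing value.
rows = [
    {"a": 1, "b": 2, "c": None},
    {"a": 2, "b": None, "c": 3},
    {"a": None, "b": 4, "c": 5},
    {"a": 4, "b": 5, "c": 6},
    {"a": 5, "b": 6, "c": 7},
]
variables = ["a", "b", "c"]


def listwise_cases(rows, variables):
    """Respondents with no missing values on any variable."""
    return [r for r in rows if all(r[v] is not None for v in variables)]


def pairwise_counts(rows, variables):
    """For each pair of variables, the number of respondents with both present."""
    return {
        (u, v): sum(1 for r in rows if r[u] is not None and r[v] is not None)
        for u, v in combinations(variables, 2)
    }


n_listwise = len(listwise_cases(rows, variables))
n_pairwise = pairwise_counts(rows, variables)

print(n_listwise, n_pairwise)
```

Pairwise deletion uses more respondents for each correlation, but each correlation is then based on a different subset of respondents. Listwise deletion uses a single consistent subset, at the cost of a smaller sample.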
As principal components analysis is based on correlations, and correlations are typically invalid when data imputation is involved, imputation is also not typically appropriate prior to principal components analysis. Various versions of principal components analysis have been developed which can accommodate missing values by making either Missing At Random or Missing Completely At Random assumptions, but they are not available as standard options in commonly used statistical software.
Cluster Analysis and Latent Class Analysis
Regression
By default, most regression models exclude all respondents for which there is any missing data. This is consistent with Missing Completely At Random and can be consistent with Missing At Random as well.[note 3] To appreciate how it is consistent with Missing At Random, review the earlier discussion of correlation and consider the regression model which predicts a straight line through the points (i.e., you get the same correct results if using the data which has missing values, even though they are Missing At Random).
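The point about the straight line can be checked with a minimal sketch of simple least-squares regression. The data is hypothetical (y equals x for 10 respondents, with y missing when x is 6 or more), and the function name is an assumption made for illustration.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical true data: y = x for all 10 respondents.
x = [float(v) for v in range(1, 11)]
y_true = list(x)

# Assumption: y is missing whenever x is 6 or more (Missing At Random).
y_observed = [y if xv < 6 else None for xv, y in zip(x, y_true)]

# Exclude respondents with any missing data, as most regression software does.
complete = [(xv, y) for xv, y in zip(x, y_observed) if y is not None]
full_fit = fit_line(x, y_true)
complete_fit = fit_line([c[0] for c in complete], [c[1] for c in complete])

print(full_fit, complete_fit)
```

Both fits recover the same line (an intercept of 0 and a slope of 1, up to rounding), even though half of the y values are missing.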
For the same reasons as discussed with correlation, regression using imputed data is generally a bad idea. It is difficult to envisage a situation where it is appropriate.
As with principal components analysis, regression can be conducted using an option such as the SPSS option of Exclude cases pairwise, which essentially works by computing correlations between all the variables based on all the available data. This involves making an assumption that the data is Missing Completely At Random, which is a much stronger and less plausible assumption than the one made when all the observations that contain any missing values are deleted. It is important to appreciate that, except when randomization explains the missing values, the use of the Exclude cases pairwise option is extremely difficult to justify, and is impossible to justify in situations where the missing data is caused by skips in the questionnaire or Don't know options.
When the missing data is Nonignorable, the simplest solution for regression is the same as for latent class analysis: treat the missing values as additional categories. Where the data is numeric, this can be done by creating additional variables. For example, if you have an independent variable with values of 1,2,NaN,1,3,NaN, you can replace the missing values with values of 0 and include a separate dummy variable in the regression to model the missing data. That is, the one variable is replaced by two in the analysis: the recoded variable with values of 1,2,0,1,3,0 and a dummy variable with values of 0,0,1,0,0,1.
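The recoding described above can be sketched in a few lines. The function name and the choice of 0 as the fill value follow the example in the text; everything else is an assumption made for illustration.

```python
import math


def add_missing_indicator(values, fill=0):
    """Split one variable into a filled variable and a missing-data dummy."""
    filled = [fill if math.isnan(v) else v for v in values]
    indicator = [1 if math.isnan(v) else 0 for v in values]
    return filled, indicator


# The independent variable from the example, with NaN marking missing values.
x = [1, 2, math.nan, 1, 3, math.nan]
x_filled, x_missing = add_missing_indicator(x)

print(x_filled)
print(x_missing)
```

Both resulting variables would then be included as independent variables in the regression, so that the dummy variable absorbs the effect of the value being missing.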
- [note 1] Or, more accurately, PCA can be computed from a correlation matrix. There are other algorithms.
- [note 2] It is not clear whether this is always the case or not.
- [note 3] It is not clear whether this is always the case or not.