Preparing for Basel II: common problems, practical solutions. Part 1: missing data

The RMA Journal, April 2004, by Jeffrey S. Morrison

This article continues the series introduced in the May and June 2003 issues of The RMA Journal on modeling requirements for Basel II. Previous articles focused on the fundamentals associated with building PD and LGD models, as well as the steps necessary for model validation. This and forthcoming articles discuss common challenges in model development and offer practical statistical advice in overcoming them. To make this article as helpful as possible, examples use the SAS software language.

Missing or incomplete data is perhaps the most widespread challenge facing any model builder, regardless of the industry. Whether working with results from a survey, developing a telecom model to predict churn, predicting the response to a new credit card offer, or creating a model for probability of default, you must determine a strategy for handling missing information. You may, for example, have information on one loan applicant's income but not on others. The same may be true for the consumer credit score. If you feel that income and credit score are both very important in determining future default, what do you do with those accounts where the information simply is not available? Because your loan accounting system may be less than perfect, how do you handle an obvious input error, such as an annual income of -$100 or an LTV of 500%?

The truth is that every model builder must choose some way to deal with the missing data problem before a model is estimated. As there are a variety of methods available, the task is to find the best one. The decision will affect not only the model's parameter estimates, but also their statistical reliability. The remainder of this article presents several popular approaches for handling the missing data problem, beginning with the simplest. The last approach presented reflects the latest research on the subject and is surprisingly easy to implement because of recent improvements in certain statistical software packages.

Approach 1: Fix Those Errors

Many times, information is missing because of errors in the data collection process. If 40% of bureau-score data is missing, then you need to find out why. Was there a matching problem within the data warehouse? There will always be accounts that the credit bureaus are unable to score--but 40%? For income data, was there a systematic type of error made that could be corrected with some degree of certainty? Should blanks in your data really be interpreted as zeros? Most importantly, there is no substitute for careful examination of the dataset before doing any analyses. (1)
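
The kind of examination described above can be sketched in a few lines. The following is a minimal Python sketch, not a production routine: the sample records, the field names, and the plausibility ranges in VALID_RANGES are all hypothetical and would need to be set from knowledge of your own portfolio. It counts missing and out-of-range values per field, then blanks out the implausible values (such as a negative income or an LTV of 500%) so they are treated as missing rather than as real data.

```python
# Hypothetical loan records; None marks a missing value.
loans = [
    {"id": 1, "income": 52000, "ltv": 0.80, "bureau_score": 701},
    {"id": 2, "income": -100, "ltv": 0.75, "bureau_score": None},
    {"id": 3, "income": None, "ltv": 5.00, "bureau_score": 655},
    {"id": 4, "income": 87000, "ltv": 0.60, "bureau_score": None},
]

# Assumed plausibility ranges (illustrative only, not from the article).
VALID_RANGES = {
    "income": (0, 10_000_000),
    "ltv": (0.0, 2.0),
    "bureau_score": (300, 850),
}


def audit(records, valid_ranges):
    """Return per-field counts of missing and out-of-range values."""
    report = {}
    for field, (low, high) in valid_ranges.items():
        missing = sum(1 for r in records if r[field] is None)
        invalid = sum(
            1 for r in records
            if r[field] is not None and not (low <= r[field] <= high)
        )
        report[field] = {
            "missing": missing,
            "invalid": invalid,
            "pct_missing": 100.0 * missing / len(records),
        }
    return report


def blank_out_invalid(records, valid_ranges):
    """Copy the records, replacing out-of-range values with None."""
    cleaned = []
    for r in records:
        row = dict(r)  # copy so the raw data stays untouched
        for field, (low, high) in valid_ranges.items():
            if row[field] is not None and not (low <= row[field] <= high):
                row[field] = None
        cleaned.append(row)
    return cleaned


report = audit(loans, VALID_RANGES)
cleaned = blank_out_invalid(loans, VALID_RANGES)
for field, stats in report.items():
    print(field, stats)
```

Keeping the raw records unchanged and working from a cleaned copy makes it possible to go back and correct the source system once the cause of an error is found.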

Approach 2: Delete Those Records

An obvious approach is to simply delete the records with missing information. However, doing so may create two problems.

First, if you have a number of predictor variables in your model with missing information, a deletion approach can significantly reduce the number of observations for modeling. For example, you may start off with 10,000 observations containing both defaulted and non-defaulted accounts, yet a deletion approach could leave a data set of only 1,200. Most regression routines do this automatically, skipping any record where a predictor variable is missing; the practice is sometimes referred to as case-wise deletion.

Second, such an approach could bias your estimates. This happens when the accounts with missing information differ from the accounts with complete information, so that the sample left after deletion no longer mirrors the population it is meant to represent. In other words, the sample needs to be as representative as possible of the overall characteristics of the general population. If the bias is severe, it could affect forecasting accuracy when you get to the validation stage of the process.
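
Both problems can be checked before committing to deletion. The following minimal Python sketch uses a small hypothetical set of accounts (the field names and values are invented for illustration). It applies case-wise deletion, reports how many records survive, and compares the default rate of the retained records with that of the dropped ones. A large gap between the two rates is a warning sign that deletion would leave an unrepresentative sample.

```python
# Hypothetical accounts: default flag plus two predictors; None = missing.
accounts = [
    {"default": 0, "income": 61000, "bureau_score": 720},
    {"default": 0, "income": 48000, "bureau_score": 690},
    {"default": 1, "income": None, "bureau_score": 610},
    {"default": 0, "income": 75000, "bureau_score": None},
    {"default": 1, "income": None, "bureau_score": None},
    {"default": 0, "income": 53000, "bureau_score": 705},
    {"default": 1, "income": 39000, "bureau_score": None},
    {"default": 0, "income": 66000, "bureau_score": 731},
]
PREDICTORS = ["income", "bureau_score"]


def is_complete(record):
    """True when no predictor is missing for this record."""
    return all(record[name] is not None for name in PREDICTORS)


def default_rate(records):
    """Share of records flagged as defaulted."""
    return sum(r["default"] for r in records) / len(records)


# Case-wise deletion keeps only the fully populated records.
complete = [r for r in accounts if is_complete(r)]
dropped = [r for r in accounts if not is_complete(r)]

print("records kept:", len(complete), "of", len(accounts))
print("default rate, all records: ", default_rate(accounts))
print("default rate, kept records:", default_rate(complete))
print("default rate, dropped:     ", default_rate(dropped))
```

In this invented data the dropped accounts default far more often than the retained ones, so a model fitted only to complete records would see a much healthier portfolio than the real one.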

Approach 3: Substitute a Proxy for Missing Information

Since data is such a rare commodity in the model-building world, most modelers do not choose the second approach to handling missing information. Perhaps the most common approach is to substitute the mean, median, or mode of the variables for which you do have valid information. For example, if you are missing information on the credit bureau score, and the average bureau score in your sample is 689, then you would substitute that value each time you encounter missing data. The same procedure could be done using the median value of existing predictors. The decision to use the median rather than the mean value as a proxy could be empirically determined by looking at the validation results. In other words, try it both ways!
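
Trying it both ways takes very little code. The following is a minimal Python sketch using only the standard library; the list of bureau scores is hypothetical, chosen so that the observed mean matches the 689 used in the example above, and the function name impute is an illustrative helper, not part of any package.

```python
import statistics

# Hypothetical bureau scores; None marks a missing value.
scores = [720, None, 655, 689, None, 701, 640, 729]


def impute(values, method="mean"):
    """Fill missing entries with the mean or median of the observed ones.

    Returns the filled list and the proxy value that was used.
    """
    observed = [v for v in values if v is not None]
    if method == "mean":
        proxy = statistics.mean(observed)
    elif method == "median":
        proxy = statistics.median(observed)
    else:
        raise ValueError("method must be 'mean' or 'median'")
    filled = [proxy if v is None else v for v in values]
    return filled, proxy


mean_filled, mean_proxy = impute(scores, "mean")
median_filled, median_proxy = impute(scores, "median")

print("mean proxy:  ", mean_proxy)
print("median proxy:", median_proxy)
```

Each filled list can then be passed to the same modeling routine, and the validation results compared to decide which proxy to keep.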

However, the proxy method can also cause a problem in the model-building process. Suppose the credit bureau score is missing for 20% of accounts. If we substitute 689 for every missing value, one-fifth of the records carry an identical score, so the overall variability of the predictor is made artificially low. Since this variance is crucial in determining whether a variable is statistically significant, you could be inadvertently adding statistical significance where none exists. As a result, at the end of the model-building process, you may have predictors in the model that shouldn't be there.
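
The shrinkage in variance is easy to see numerically. The following minimal Python sketch uses a hypothetical list of bureau scores whose observed mean is 689 and compares the sample variance of the observed values with the variance after the missing entries are filled with that mean. The filled-in values add nothing to the sum of squared deviations, yet they raise the record count, so the measured variance falls.

```python
import statistics

# Hypothetical bureau scores; None marks a missing value.
scores = [720, None, 655, 689, None, 701, 640, 729]

observed = [s for s in scores if s is not None]
proxy = statistics.mean(observed)

# Mean substitution: every missing score becomes the same number.
filled = [proxy if s is None else s for s in scores]

# Sample variance (n - 1 in the denominator) before and after substitution.
var_before = statistics.variance(observed)
var_after = statistics.variance(filled)

print("variance of observed scores: ", var_before)
print("variance after substitution: ", var_after)
print("ratio (after / before):      ", var_after / var_before)
```

With n_obs observed values and n_total records after filling, the ratio works out to (n_obs - 1) / (n_total - 1), which is 5/7 for this invented data. The more values are missing, the more the variance is understated.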

Content provided in partnership with Thompson Gale