Preparing for Basel II: common problems, practical solutions. Part 1: missing data
The RMA Journal, April 2004, by Jeffrey S. Morrison
This article continues the series introduced in the May and June 2003 issues of The RMA Journal on modeling requirements for Basel II. Previous articles focused on the fundamentals associated with building PD and LGD models, as well as the steps necessary for model validation. This and forthcoming articles discuss common challenges in model development and offer practical statistical advice in overcoming them. To make this article as helpful as possible, examples use the SAS software language.
Missing or incomplete data is perhaps the most widespread challenge facing any model builder, regardless of the industry. Whether working with results from a survey, developing a telecom model to predict churn, predicting the response to a new credit card offer, or creating a model for probability of default, you must determine a strategy for handling missing information. You may, for example, have information on one loan applicant's income but not on others. The same may be true for the consumer credit score. If you feel that income and credit score are both very important in determining future default, what do you do with those accounts where the information simply is not available? Because your loan accounting system may be less than perfect, how do you handle an obvious input error, such as an annual income of -$100 or an LTV of 500%?
The truth is that every model builder must select some way to deal with the missing data problem before a model is estimated. As there are a variety of methods available, the task is to find the best one. The decision will affect not only the model estimates or parameters, but also their statistical reliability. The remainder of this article presents several popular approaches for handling the missing data problem, beginning with the simplest. The last approach presented reflects the latest research on the subject and is made surprisingly easy to implement because of some recent improvements in certain statistical software packages.
Approach 1: Fix Those Errors
Many times, information is missing because of errors in the data collection process. If 40% of bureau-score data is missing, then you need to find out why. Was there a matching problem within the data warehouse? There will always be accounts that the credit bureaus are unable to score--but 40%? For income data, was there a systematic type of error made that could be corrected with some degree of certainty? Should blanks in your data really be interpreted as zeros? Most importantly, there is no substitute for careful examination of the dataset before doing any analyses. (1)
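The kind of data audit described above can be sketched in a few lines. The article's own examples use SAS; the sketch below uses Python purely for illustration. The records, field names, and plausibility ranges are all hypothetical assumptions, not values from the article. It treats obviously invalid entries (such as an income of -$100 or an LTV of 500%) as missing and then reports the share of missing values per field, so that a suspiciously high rate can be investigated before modeling.

```python
# Hypothetical loan records; None marks a missing value.
records = [
    {"income": 52000, "ltv": 0.80, "bureau_score": 701},
    {"income": -100, "ltv": 0.75, "bureau_score": None},   # negative income: input error
    {"income": None, "ltv": 5.00, "bureau_score": 655},    # LTV of 500%: input error
    {"income": 61000, "ltv": 0.90, "bureau_score": None},
]

# Assumed plausibility ranges (illustrative only, not taken from the article).
VALID_RANGES = {
    "income": (0, 10_000_000),
    "ltv": (0.0, 2.0),
    "bureau_score": (300, 850),
}


def flag_invalid(records, valid_ranges):
    """Return copies of the records with out-of-range values set to None."""
    cleaned = []
    for rec in records:
        new = dict(rec)
        for field, (low, high) in valid_ranges.items():
            value = new.get(field)
            if value is not None and not (low <= value <= high):
                new[field] = None
        cleaned.append(new)
    return cleaned


def missing_rates(records):
    """Return the share of missing (None) values for each field."""
    n = len(records)
    return {f: sum(r[f] is None for r in records) / n for f in records[0]}


cleaned = flag_invalid(records, VALID_RANGES)
rates = missing_rates(cleaned)
for field, rate in rates.items():
    print(f"{field}: {rate:.0%} missing")
```

Flagging invalid values as missing, rather than silently correcting them, keeps the decision about how to treat them separate from the audit itself.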
Approach 2: Delete Those Records
An obvious approach is to simply delete the records with missing information. However, doing so may create two problems.
First, if you have a number of predictor variables in your model with missing information, a deletion approach can significantly reduce the number of observations for modeling. For example, you may start off with 10,000 observations containing both defaulted and non-defaulted accounts. However, a deletion approach could result in a data set of 1,200. This is sometimes referred to as case-wise deletion, and most regression routines apply it automatically by skipping any record in which a predictor variable is missing.
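A minimal Python sketch of case-wise deletion shows how quickly observations disappear when several predictors each have some missing values. The predictor names, the 10% missing rate per predictor, and the simulated data are assumptions chosen for illustration; they are not the article's figures.

```python
import random

random.seed(0)

N_RECORDS = 10_000
# Hypothetical predictor names.
PREDICTORS = ["income", "bureau_score", "ltv", "months_on_book", "utilization"]
MISSING_RATE = 0.10  # assumed: each predictor is independently missing 10% of the time

# Simulated data: 1.0 stands in for any observed value, None for a missing one.
rows = [
    {name: (None if random.random() < MISSING_RATE else 1.0) for name in PREDICTORS}
    for _ in range(N_RECORDS)
]


def complete_cases(rows, predictors):
    """Case-wise deletion: keep only rows where every predictor is present."""
    return [r for r in rows if all(r[p] is not None for p in predictors)]


kept = complete_cases(rows, PREDICTORS)

# With independent missingness the expected surviving share is
# (1 - rate) ** number_of_predictors, here 0.9 ** 5, or about 59%.
expected_share = (1 - MISSING_RATE) ** len(PREDICTORS)
print(f"kept {len(kept)} of {N_RECORDS} rows; expected share about {expected_share:.0%}")
```

Even a modest missing rate per variable compounds across predictors, which is why the usable sample can end up far smaller than the starting one.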
Second, such an approach could bias your estimates. This happens when the records with missing information differ systematically from the complete ones, so that the sample left after deletion no longer reflects the population to which the model will be applied. In other words, the sample needs to be as representative as possible of the overall characteristics of the general population. If the bias is severe, it could affect forecasting accuracy when you get to the validation stage of the process.
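The bias can be seen in a small constructed example, sketched here in Python. All counts and income values are hypothetical: income is assumed to be missing for half of the defaulted accounts but only a tenth of the non-defaulted ones, so deleting incomplete records understates the default rate.

```python
def build_accounts():
    """Build 100 hypothetical accounts: 20 defaulted, 80 non-defaulted."""
    accounts = []
    for i in range(20):  # defaulted: income missing for 10 of 20
        accounts.append({"defaulted": 1, "income": None if i < 10 else 40_000})
    for i in range(80):  # non-defaulted: income missing for 8 of 80
        accounts.append({"defaulted": 0, "income": None if i < 8 else 60_000})
    return accounts


def default_rate(accounts):
    """Share of accounts flagged as defaulted."""
    return sum(a["defaulted"] for a in accounts) / len(accounts)


accounts = build_accounts()
complete = [a for a in accounts if a["income"] is not None]

full_rate = default_rate(accounts)      # 20 / 100
complete_rate = default_rate(complete)  # 10 / 82

print(f"default rate, all accounts:        {full_rate:.1%}")
print(f"default rate, complete cases only: {complete_rate:.1%}")
```

A model estimated on the 82 complete cases would be fitted to a population with noticeably fewer defaults than the one it is meant to describe.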
Approach 3: Substitute a Proxy for Missing Information
Since data is such a rare commodity in the model-building world, most modelers do not choose the second approach to handling missing information. Perhaps the most common approach is to substitute the mean, median, or mode of the variables for which you do have valid information. For example, if you are missing information on the credit bureau score, and the average bureau score in your sample is 689, then you would substitute that value each time you encounter missing data. The same procedure could be done using the median value of existing predictors. The decision to use the median rather than the mean value as a proxy could be empirically determined by looking at the validation results. In other words, try it both ways!
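As a minimal sketch of the proxy approach, the Python function below fills missing values with either the mean or the median of the observed values. The function name and the sample scores are hypothetical; the scores were chosen so that the observed mean equals the 689 used in the example above.

```python
from statistics import mean, median


def impute(values, strategy="mean"):
    """Replace None entries with the mean or median of the observed values.

    Returns the filled list and the fill value that was used.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to compute a proxy from")
    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values], fill


# Hypothetical bureau scores; None marks a missing score.
scores = [640, None, 689, 720, None, 707]

mean_filled, mean_fill = impute(scores, "mean")
median_filled, median_fill = impute(scores, "median")

print("mean proxy:  ", mean_fill, mean_filled)
print("median proxy:", median_fill, median_filled)
```

Running both strategies and comparing the resulting models on a validation sample is one way to act on the "try it both ways" advice.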
However, the proxy method, too, can cause a problem in the model-building process. Let's say that 20% of our credit bureau scores are missing. If we substitute 689 for every missing value, the overall variability of the predictor is made artificially low, because one-fifth of the records now share a single identical value. Since this variance is crucial in determining whether a variable is statistically significant in the model-building process, you could be inadvertently adding statistical significance where none exists. Therefore, at the end of the model-building process, you may have predictors in the model that shouldn't be there.
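The shrinkage in variance is easy to verify numerically. The Python sketch below uses ten hypothetical bureau scores, two of them (20%) missing, with an observed mean of 689. It compares the sample variance of the observed scores with the variance after mean substitution.

```python
from statistics import mean, variance

# Hypothetical bureau scores; 2 of 10 (20%) are missing.
scores = [620, None, 650, 680, 700, None, 710, 730, 724, 698]

observed = [s for s in scores if s is not None]
fill = mean(observed)  # 689 for this sample
imputed = [fill if s is None else s for s in scores]

var_before = variance(observed)  # sample variance of the 8 observed scores
var_after = variance(imputed)    # sample variance after mean substitution

# The imputed values sit exactly at the mean, so they add nothing to the sum
# of squared deviations while increasing the divisor from 7 to 9.
# The ratio is therefore (8 - 1) / (10 - 1) = 7/9, or about 0.78.
ratio = var_after / var_before

print(f"variance before: {var_before:.1f}")
print(f"variance after:  {var_after:.1f}")
print(f"ratio:           {ratio:.3f}")
```

The larger the share of substituted values, the more the measured variance understates the true variability of the predictor.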