SOI sampling methodology and data limitations: appendix

Statistics of Income Bulletin, Spring, 2007

This appendix discusses typical sampling procedures used in most Statistics of Income (SOI) programs. Aspects covered briefly include sampling criteria, selection techniques, methods of estimation, and sampling variability. Some of the nonsampling error limitations of the data are also described, as well as the tabular conventions employed.

Additional information on sample design and data limitations for specific SOI studies can be found in the separate SOI reports (see pages 319-320 at the end of this Bulletin). More technical information is available, on request, by writing to the Director, Statistics of Income Division RAS:S, Internal Revenue Service, P.O. Box 2608, Washington, DC 20013-2608.

Sample Criteria and Selection of Returns

Statistics compiled for the SOI studies are generally based on stratified probability samples of income tax returns or other forms filed with the Internal Revenue Service (IRS). The statistics do not reflect any changes made by the taxpayer through an amended return or by the IRS as a result of an audit. As returns are filed and processed for tax purposes, they are assigned to sampling classes (strata) based on such criteria as industry, presence or absence of a tax form or schedule, and various income factors or other measures of economic size (such as total assets, total receipts, size of gift, and size of estate). The samples are selected from each stratum over the appropriate filing periods. Thus, sample selection for a given study can continue for several calendar years (3 for corporations, because of the incidence of fiscal (noncalendar) year reporting and extensions of filing time). Because sampling must take place before the population size is known precisely, the rates of sample selection within each stratum are fixed. This means, in practice, that both the population size and the sample size can differ from those planned. However, these factors do not compromise the validity of the estimates.

The probability of a return's selection depends on its sample class or stratum and may range from a fraction of 1 percent to 100 percent. Considerations in determining the selection probability for each stratum include the number of returns in the stratum, the diversity of returns in the stratum, and interest in the stratum as a separate subject of study. All this is subject to constraints based on the estimated processing costs or the target size of the total sample for the program.

For most SOI studies, returns are designated by computer from the IRS Master Files based on the taxpayer identification number (TIN), which is either the Social Security number (SSN) or the Employer Identification Number (EIN). A fixed and essentially random number is associated with each possible TIN. If that random number falls into the range of numbers specified for a return's sample stratum, then the return is selected and processed for the study. Otherwise, the return is counted (for estimation purposes), but not selected. In some cases, the TIN is used directly by matching specified digits of it against a predetermined list for the sample stratum. A match is required for designation.
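
The designation rule described above can be sketched in Python. This is only an illustration under stated assumptions, not the IRS procedure: the SHA-256 hash standing in for the "fixed and essentially random number," the stratum names, and the sampling rates are all hypothetical.

```python
import hashlib

# Hypothetical stratum sampling rates (illustrative values, not SOI's).
STRATUM_RATES = {"small": 0.001, "medium": 0.10, "large": 1.0}


def tin_to_unit_interval(tin: str) -> float:
    """Map a TIN to a fixed, essentially random number in [0, 1).

    Assumption: a cryptographic hash is used as the stand-in for the
    fixed number associated with each TIN.
    """
    digest = hashlib.sha256(tin.encode("utf-8")).digest()
    # Keep 53 bits so the quotient is exactly representable and below 1.0.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def is_selected(tin: str, stratum: str) -> bool:
    """Select the return if its TIN's number falls in the range [0, rate)."""
    return tin_to_unit_interval(tin) < STRATUM_RATES[stratum]


def designate(returns):
    """Process (tin, stratum) pairs as they are filed.

    Every return is counted for estimation purposes; only returns whose
    TIN number falls in the stratum's range are selected.
    """
    population_counts = {}
    selected = []
    for tin, stratum in returns:
        population_counts[stratum] = population_counts.get(stratum, 0) + 1
        if is_selected(tin, stratum):
            selected.append((tin, stratum))
    return population_counts, selected
```

Because the number attached to a TIN never changes, a TIN that is designated under a given stratum rate in one year is designated again in later years under the same rate. The realized sample size, however, is not fixed in advance, since it depends on which TINs happen to file.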

Under either method of selection, the TINs designated from one year's sample are, for the most part, selected for the next year's, so that a very high proportion of the returns selected in the current year's sample are from taxpayers whose previous years' returns were included in earlier samples. This longitudinal character of the sample design improves the estimates of change from one year to the next.

Method of Estimation

As noted above, the probability with which a return is selected for inclusion in a sample depends on the sampling rate prescribed for the stratum in which it is classified. "Weights" are computed by dividing the count of returns filed for a given stratum by the number of sample returns for that same stratum. These weights are usually adjusted for unavailable returns and outliers, and may be trimmed. Weights are used to adjust for the various sampling rates used, relative to the population--the lower the rate, the larger the weight. For some studies, it is possible to improve the estimates by subdividing the original sampling classes into "poststrata," based on additional criteria or refinements of those used in the original stratification. Weights are then computed for these poststrata using additional population counts. The data on each sample return in a stratum are then multiplied by that stratum's weight, and the weighted data are summed to produce the published statistical totals.
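
The basic weighting arithmetic can be illustrated with a short Python sketch. The stratum names, counts, and dollar amounts below are invented for the example, and the sketch omits the adjustments for unavailable returns, outliers, trimming, and poststratification mentioned above.

```python
def stratum_weights(population_counts, sample_counts):
    """Weight for each stratum: returns filed divided by sample returns."""
    return {h: population_counts[h] / sample_counts[h] for h in sample_counts}


def estimate_total(sample, weights):
    """Sum the weighted data over all sample returns.

    `sample` is a list of (stratum, value) pairs, one per sample return.
    """
    return sum(weights[h] * value for h, value in sample)


# Hypothetical data: 1,000 small returns filed, 4 sampled; 2 large, both sampled.
population_counts = {"small": 1000, "large": 2}
sample = [
    ("small", 10.0), ("small", 20.0), ("small", 30.0), ("small", 40.0),
    ("large", 1000.0), ("large", 3000.0),
]
sample_counts = {"small": 4, "large": 2}

weights = stratum_weights(population_counts, sample_counts)
# weights == {"small": 250.0, "large": 1.0}: the lower the rate, the larger the weight.
total = estimate_total(sample, weights)  # 250 * 100 + 1 * 4000 = 29000.0
```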

Sampling Variability

The particular sample used in a study is only one of a large number of possible random samples that could have been selected using the same sample design. Estimates derived from the different samples usually vary. The standard error of the estimate is a measure of the variation among the estimates from all possible samples and is used to measure the precision with which an estimate from a particular sample approximates the average result of the possible samples. The sample estimate and an estimate of its standard error permit the construction of interval estimates with a prescribed level of confidence that the interval includes the actual population value.
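
As a rough illustration of how an estimate and its standard error yield an interval estimate, the following Python sketch uses the textbook variance formula for a stratified simple random sample and a normal approximation. Both are assumptions made for the example; the variance estimators actually used in a given SOI study may differ, and the data are invented.

```python
import math
from statistics import NormalDist, variance


def stratified_total_and_se(population_counts, sample_values):
    """Estimate a population total and its standard error.

    Assumes simple random sampling without replacement within each stratum.
    `sample_values` maps each stratum to the list of values observed on its
    sample returns (at least two per stratum unless fully sampled).
    """
    total = 0.0
    var = 0.0
    for h, values in sample_values.items():
        N, n = population_counts[h], len(values)
        total += N * sum(values) / n
        if n < N:  # a fully sampled stratum contributes no sampling variance
            var += N**2 * (1 - n / N) * variance(values) / n
    return total, math.sqrt(var)


def confidence_interval(estimate, se, level=0.95):
    """Normal-approximation interval: estimate +/- z * standard error."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return estimate - z * se, estimate + z * se


# Hypothetical data: 1,000 small returns filed, 4 sampled; 2 large, both sampled.
population_counts = {"small": 1000, "large": 2}
sample_values = {"small": [10.0, 20.0, 30.0, 40.0], "large": [1000.0, 3000.0]}

estimate, se = stratified_total_and_se(population_counts, sample_values)
low, high = confidence_interval(estimate, se, level=0.95)
```

With so few sample returns in the "small" stratum, the standard error is large relative to the estimate, which is the situation the interval is meant to expose.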

 
