Identifying the generating distribution of business and economics data: an empirical method

Journal of the Academy of Business and Economics, March, 2004 by Laurence R. Takeuchi, Joseph Richards

ABSTRACT

A general framework for empirically identifying familial membership within the Pearson system of distributions is proposed. Specifically, the research addresses the problem of constructing a point estimate and a non-parametric confidence interval estimate of familial membership. Most identification procedures, such as the Chi-Square and Empirical Distribution Function goodness-of-fit tests, require the data to support or negate membership in a specific hypothesized family. The presented approach instead uses bootstrap re-sampling techniques to help identify the likely generating family of the data.

1. INTRODUCTION

The study of systems of probability distributions first appeared in the literature in the late 1800s, when Karl Pearson (1895) began to question the basic assumption of normal theory. His empirical studies revealed non-normal characteristics to be inherent features of many populations. Initially, many of Pearson's contemporaries doubted the need for curves (systems of distributions) other than the normal density. However, by the turn of the century, theoreticians and empiricists had accepted the possibility of non-normality and had begun to explore various typologies for alternative distributions. Of these typologies, the Pearson (1902a; 1902b) system of probability distributions became the most cited.

The Pearson system, as well as complementary systems of distributions such as Johnson's S_U and S_B system (Johnson 1949), is particularly relevant to research in economic and business disciplines when the phenomenon under inquiry is assumed to be governed by a probability law. Often, the specific law in question is unknown, but its realizations (data) are available. In such instances these systems provide a general familial space of distributions in which one member, i.e., a point in the space, can be identified as the "most likely" candidate generating the observed phenomenon. Thus, knowledge of general systems of distributions (families) can aid the researcher in addressing the basic problem: Given a sample of independent, identically distributed (iid) observations, how does one empirically determine the implied generating family?

In much economics and business research, however, the problem of familial identification is circumvented by imposing an explicit a priori functional form for the generating family, e.g., X is normally distributed. Mathematical tractability, e.g., closure of normally distributed variables under addition, often determines the choice of the assumed family, rather than empirical or theoretical arguments. Simplifying the problem in this manner reduces the general research issue of familial identification to parameter estimation within a given family; in the case of the normal density family, the estimation of μ and σ. Imposing an a priori distribution can be costly in terms of the congruity of the research model with the phenomena being studied, and hence the usefulness of the research. While there are many procedures to reject a specific distribution, such as normality, a more useful approach would be to determine the family or families of distributions that are consistent with the data.

We can formally state the simplified problem as follows. Let the random component of the phenomenon of interest be represented by a random variable X whose observations are iid. Further, let the distribution of X belong to a family of distributions {f(x; ω): ω ∈ Ω}. Each member f(x; ω) of this family is uniquely defined (identified) by the value of the parameter set ω ∈ Ω. For example, assume that the distribution of X follows the normal probability law ψ(x; μ, σ). The law ψ(x; μ, σ) can be viewed as a general family of distributions {ψ(x; ω): ω ∈ Ω}, where ω = (μ, σ) and Ω = {(μ, σ) | −∞ < μ < ∞, 0 < σ < ∞}. Since a unique member of this family exists for each pair (μ, σ) ∈ Ω, the simplified research problem is to decide, on the basis of data, which member or members of the assumed family "best" represent the distribution of X. Thus, the problem of within-family identification is one of statistical estimation: estimating the underlying parameter(s) that uniquely identify the member of the assumed family implied by the data. This approach, however, begs the question of the empirical or conceptual validity of the a priori familial specification. That question is important when parameter estimation, and hence membership inference, depends on the general family in question.
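The within-family estimation step described above can be illustrated for the normal family. The following is a minimal sketch, assuming maximum likelihood as the estimation method (the text does not prescribe one); the function name `fit_normal_mle` and the simulated data are hypothetical and serve only as illustration.

```python
import math
import random

def fit_normal_mle(sample):
    """Return maximum-likelihood estimates (mu_hat, sigma_hat) for the normal family."""
    n = len(sample)
    mu_hat = sum(sample) / n
    # The maximum-likelihood estimate of sigma divides by n, not n - 1.
    sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in sample) / n)
    return mu_hat, sigma_hat

# Hypothetical data: 500 iid draws from a normal law with mu = 10 and sigma = 2.
rng = random.Random(0)
data = [rng.gauss(10.0, 2.0) for _ in range(500)]

# The pair (mu_hat, sigma_hat) picks out one member of the assumed family.
mu_hat, sigma_hat = fit_normal_mle(data)
```

Note that the estimates are meaningful only if the normal family was the right choice to begin with; the code selects a member of the assumed family and says nothing about whether the family itself fits.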

This paper proposes a statistical procedure to identify the likely candidate probability distribution(s) that generated a set of observed data. The procedure is useful when data realizations are assumed to be governed by a single, unknown, continuous distribution having membership in the Pearson system of probability densities. Using this procedure, a researcher can construct a point estimate that identifies a single Pearson class (family) and, more informatively, a joint confidence interval estimate that identifies the classes of distributions that could have generated the observed data. The procedure employs a computationally intensive technique referred to as the "bootstrap" (Efron, 1985; Efron, 1982; Efron & Tibshirani, 1993; Efron & Tibshirani, 1986).
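The general idea of bootstrap re-sampling for familial identification can be sketched as follows. This section does not specify the paper's estimator, so the sketch rests on stated assumptions: family membership is summarized by the moment ratios β1 (squared skewness) and β2 (kurtosis), which are the usual coordinates for locating Pearson families, and separate percentile intervals stand in for a joint confidence region. All function names and the simulated data are hypothetical.

```python
import random

def moment_ratios(sample):
    """Return (beta1, beta2): squared skewness and kurtosis from sample central moments."""
    n = len(sample)
    mean = sum(sample) / n
    m2 = sum((x - mean) ** 2 for x in sample) / n
    m3 = sum((x - mean) ** 3 for x in sample) / n
    m4 = sum((x - mean) ** 4 for x in sample) / n
    # Assumes the sample is not constant, so m2 > 0.
    return m3 ** 2 / m2 ** 3, m4 / m2 ** 2

def bootstrap_moment_ratios(sample, n_boot=1000, seed=0):
    """Recompute (beta1, beta2) on n_boot resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(sample)
    return [moment_ratios(rng.choices(sample, k=n)) for _ in range(n_boot)]

def percentile_interval(values, alpha=0.05):
    """Simple percentile interval taken from the sorted bootstrap replicates."""
    ordered = sorted(values)
    last = len(ordered) - 1
    return ordered[int((alpha / 2) * last)], ordered[int((1 - alpha / 2) * last)]

# Hypothetical data: 300 iid draws from an exponential law (an assumption for illustration).
rng = random.Random(1)
data = [rng.expovariate(1.0) for _ in range(300)]

point_estimate = moment_ratios(data)        # point estimate of (beta1, beta2)
replicates = bootstrap_moment_ratios(data)  # bootstrap distribution of the pair
beta1_interval = percentile_interval([b1 for b1, _ in replicates])
beta2_interval = percentile_interval([b2 for _, b2 in replicates])
```

Under these assumptions, the point estimate places the data at one location in the (β1, β2) plane, while the spread of the replicates indicates which neighboring families remain plausible. A joint region, as opposed to the two marginal intervals computed here, would require treating the replicate pairs together.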


 

Content provided in partnership with Thomson Gale