Regression Analysis for Prediction: Understanding the Process

Cardiopulmonary Physical Therapy Journal, Sep 2009 by Palmer, Phillip B, O'Connell, Dennis G

ABSTRACT

Research related to cardiorespiratory fitness often uses regression analysis in order to predict cardiorespiratory status or future outcomes. Reading these studies can be tedious and difficult unless the reader has a thorough understanding of the processes used in the analysis. This feature seeks to "simplify" the process of regression analysis for prediction in order to help readers understand this type of study more easily. Examples of the use of this statistical technique are provided in order to facilitate better understanding.

INTRODUCTION

Graded, maximal exercise tests that directly measure maximum oxygen consumption (VO2 max) are impractical in most physical therapy clinics because they require expensive equipment and personnel trained to administer the tests. Performing these tests in the clinic may also require medical supervision. As a result, researchers have sought to develop exercise and non-exercise models that would allow clinicians to predict VO2 max without having to measure oxygen uptake directly. In most cases, the investigators use regression analysis to develop their prediction models.

Regression analysis is a statistical technique for determining the relationship between a single dependent (criterion) variable and one or more independent (predictor) variables. The analysis yields a predicted value for the criterion resulting from a linear combination of the predictors. According to Pedhazur,15 regression analysis has 2 uses in scientific literature: prediction, including classification, and explanation. The following provides a brief review of the use of regression analysis for prediction. Specific emphasis is given to the selection of the predictor variables (assessing model efficiency and accuracy) and crossvalidation (assessing model stability). The discussion is not intended to be exhaustive. For a more thorough explanation of regression analysis, the reader is encouraged to consult one of many books written about this statistical technique (eg, Fox;5 Kleinbaum, Kupper, & Muller;12 Pedhazur;15 and Weisberg16). Examples of the use of regression analysis for prediction are drawn from a study by Bradshaw et al.3 In this study, the researchers' stated purpose was to develop an equation for prediction of cardiorespiratory fitness (CRF) based on non-exercise (N-EX) data.
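
To make the phrase "a linear combination of the predictors" concrete, the following is a minimal sketch of an ordinary least squares fit, written in plain Python. It is an illustration only and is not the procedure used by Bradshaw et al. The function names (solve_linear_system, fit_least_squares, predict) and the small data set are hypothetical; the data are synthetic values constructed so that the criterion equals 10 + 2(x1) - 1(x2) exactly.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Build the augmented matrix [A | b] as a fresh copy.
    M = [list(map(float, A[i])) + [float(b[i])] for i in range(n)]
    for col in range(n):
        # Choose the row with the largest entry in this column as the pivot.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("predictors are linearly dependent")
        M[col], M[pivot] = M[pivot], M[col]
        # Eliminate this column from every row below the pivot.
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back-substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_least_squares(X, y):
    """Return [intercept, b1, ..., bk] minimizing squared prediction error."""
    # Prepend a constant 1.0 to each row so the first weight is the intercept.
    rows = [[1.0] + list(map(float, row)) for row in X]
    k = len(rows[0])
    # Normal equations: (X'X) b = X'y.
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear_system(XtX, Xty)


def predict(coefs, row):
    """Predicted criterion score: intercept plus weighted sum of predictors."""
    return coefs[0] + sum(b * x for b, x in zip(coefs[1:], row))


# Synthetic (made-up) data: two predictors, five subjects.
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]]
y = [10.0 + 2.0 * x1 - 1.0 * x2 for x1, x2 in X]

coefs = fit_least_squares(X, y)
new_subject_prediction = predict(coefs, [6.0, 2.0])
```

Because the synthetic criterion is an exact linear function of the predictors, the fitted weights recover the generating values. Real data contain error, so the fitted weights would only approximate the underlying relationship.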

SELECTING THE CRITERION (OUTCOME MEASURE)

The first step in regression analysis is to determine the criterion variable. Pedhazur15 suggests that the criterion have acceptable measurement qualities (ie, reliability and validity). Bradshaw et al3 used VO2 max as the criterion of choice for their model and measured it using a maximum graded exercise test (GXT) developed by George.6 George6 indicated that his protocol for testing compared favorably with the Bruce protocol in terms of predictive ability and had good test-retest reliability (ICC = .98 - .99). The American College of Sports Medicine indicates that measurement of VO2 max is the "gold standard" for measuring cardiorespiratory fitness.1 These facts support that the criterion selected by Bradshaw et al3 was appropriate and meets the requirements for acceptable reliability and validity.

SELECTING THE PREDICTORS: MODEL EFFICIENCY

Once the criterion has been selected, predictor variables should be identified (model selection). The aim of model selection is to minimize the number of predictors which account for the maximum variance in the criterion.15 In other words, the most efficient model maximizes the value of the coefficient of determination (R²). This coefficient estimates the amount of variance in the criterion score accounted for by a linear combination of the predictor variables. The higher the value is for R², the less error or unexplained variance and, therefore, the better prediction. R² is dependent on the multiple correlation coefficient (R), which describes the relationship between the observed and predicted criterion scores. If there is no difference between the predicted and observed scores, R equals 1.00. This represents a perfect prediction with no error and no unexplained variance (R² = 1.00). When R equals 0.00, there is no relationship between the predictor(s) and the criterion and no variance in scores has been explained (R² = 0.00). The chosen variables cannot predict the criterion. The goal of model selection is, as stated previously, to develop a model that results in the highest estimated value for R².
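
The two limiting cases described above can be checked with a short sketch. The functions multiple_r and r_squared are hypothetical names, and the observed scores are made-up values rather than data from any study. R is computed here as the Pearson correlation between observed and predicted scores, and the coefficient of determination as 1 minus the ratio of unexplained to total variation. For least squares predictions from a model with an intercept, this second quantity equals the square of R; for arbitrary predictions the two can differ.

```python
import math


def multiple_r(observed, predicted):
    """Correlation between observed and predicted criterion scores."""
    n = len(observed)
    mean_obs = sum(observed) / n
    mean_pred = sum(predicted) / n
    cov = sum((o - mean_obs) * (p - mean_pred)
              for o, p in zip(observed, predicted))
    ss_obs = sum((o - mean_obs) ** 2 for o in observed)
    ss_pred = sum((p - mean_pred) ** 2 for p in predicted)
    if ss_obs == 0 or ss_pred == 0:
        # The correlation is undefined when either series is constant.
        # Returning 0.0 is a convention adopted for this sketch only.
        return 0.0
    return cov / math.sqrt(ss_obs * ss_pred)


def r_squared(observed, predicted):
    """Proportion of criterion variance accounted for by the predictions."""
    mean_obs = sum(observed) / len(observed)
    ss_total = sum((o - mean_obs) ** 2 for o in observed)
    ss_error = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return 1.0 - ss_error / ss_total


# Made-up criterion scores for five subjects.
observed = [30.0, 35.0, 40.0, 45.0, 50.0]

# Case 1: predictions match observations, so no variance is left unexplained.
perfect = list(observed)

# Case 2: every subject is assigned the sample mean, so the predictors
# contribute nothing and no variance is explained.
mean_only = [sum(observed) / len(observed)] * len(observed)

r_perfect = multiple_r(observed, perfect)
r2_perfect = r_squared(observed, perfect)
r2_mean_only = r_squared(observed, mean_only)
```

In the first case both R and R² come out at 1.00, and in the second R² is 0.00, matching the two extremes described in the text.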

According to Pedhazur,15 the value of R is often overestimated. The reasons for this are beyond the scope of this discussion; however, the degree of overestimation is affected by sample size. The larger the ratio is between the number of predictors and subjects, the larger the overestimation. To account for this, sample sizes should be large and there should be 15 to 30 subjects per predictor.11,15 Of course, the most effective way to determine optimal sample size is through statistical power analysis.11,15
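
The effect of the predictor-to-subject ratio can be illustrated with the widely used adjusted R² correction, 1 - (1 - R²)(N - 1)/(N - k - 1), where N is the number of subjects and k the number of predictors. This formula is not given in the passage above and is offered here only as one common way to express the shrinkage. The function names and the example values are hypothetical.

```python
def adjusted_r_squared(r2, n_subjects, n_predictors):
    """Adjusted R-squared: 1 - (1 - R^2)(N - 1) / (N - k - 1)."""
    denom = n_subjects - n_predictors - 1
    if denom <= 0:
        raise ValueError("need more subjects than predictors plus one")
    return 1.0 - (1.0 - r2) * (n_subjects - 1) / denom


def subjects_per_predictor(n_subjects, n_predictors):
    """Ratio to compare against the suggested 15 to 30 subjects per predictor."""
    return n_subjects / n_predictors


# Hypothetical example: the same sample R-squared of 0.50 with 4 predictors,
# obtained once from a small sample and once from a larger one.
small_sample = adjusted_r_squared(0.50, n_subjects=21, n_predictors=4)
large_sample = adjusted_r_squared(0.50, n_subjects=101, n_predictors=4)

small_ratio = subjects_per_predictor(21, 4)
large_ratio = subjects_per_predictor(101, 4)
```

With about 5 subjects per predictor the adjusted value falls well below the sample R², while with about 25 subjects per predictor the correction is much smaller. This is consistent with the recommendation to keep the number of subjects large relative to the number of predictors.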


 
