Does accountability work? - correspondence
Education Next, Fall, 2003 by Audrey Amrein, David Berliner
Margaret Raymond and Eric Hanushek harshly criticize (see "High-Stakes Research," Feature, Summer 2003) our study of high-stakes testing policies. Before reporting the results from our study, the New York Times journalist obtained feedback from our study's external reviewers as well as from scholars and advocates who support high-stakes testing. Raymond and Hanushek ask the media, "Why not bring in some outside expertise to review such a report before heralding its arrival?" Actually, the media did.
Our study analyzed data across multiple indicators of academic achievement, not simply the National Assessment of Educational Progress (NAEP). Yet Raymond and Hanushek's review looked only at the results from our analysis of NAEP, ignoring the effects that high-school graduation exams have had on college-admissions tests like the SAT and on participation and performance in Advanced Placement courses. They also ignored the fact that high-school graduation exams have resulted in increased dropout rates and an increasing use of the General Educational Development, or GED, tests as a substitute for a high-school diploma. Do these consistently negative effects matter when assessing high-stakes testing? We think so.
Raymond and Hanushek discard our findings on the basis that our methods were flawed. All of our findings were derived using one of the strongest designs in empirical research--the archival time-series analysis, a method that some claim is second in quality only to a true controlled experiment. An archival time-series analysis is simple enough that readers do not need a background in statistics to understand the underlying logic. Readers need not get caught up in more-complicated analyses, such as significance testing, effect sizes, and even regression--statistical methods that Raymond and Hanushek criticize us for not using. However, many statistics textbooks recommend against using complicated statistical methods with archival time-series analyses.
Raymond and Hanushek throw the bias card into their critique, writing, "When a report is commissioned by an organization like the Great Lakes Center for Education Research and Practice, a Midwestern group sponsored by six state affiliates of the National Education Association, it would seem to call for a reasonable dose of skepticism." Not mentioned by Raymond and Hanushek is the fact that the research was originally funded by the Rockefeller Foundation and was published in a peer-reviewed scholarly journal six months before the consortium of teacher unions released this version of the study. The fact that teacher unions backed the study had no impact on its conclusions.
Raymond and Hanushek claim that the "accumulated literature" supports the conclusion that "student performance on the available measures, usually state tests, improves after accountability reforms are introduced." We believe that is patently false. We conducted a thorough review of the literature on high-stakes testing and found very few articles that would support such a proposition.
AUDREY AMREIN, DAVID BERLINER
Arizona State University
Tempe, Arizona
Margaret Raymond and Eric Hanushek respond: The assertion that "archival time-series analysis" is second in quality only to a true controlled experiment is ludicrous. Long ago, in their classic discussion of research design, Donald Campbell and Julian Stanley said that the time-series design "rarely has accepted status in the enumerations of available experimental designs in the social sciences." The obvious inability of simplistic historical approaches to establish "experimental isolation"--to rule out other factors that might have influenced the observed outcomes--opens up results from such analyses to significant interpretative questions.
Another problem with Amrein and Berliner's study is that they did not define an adequate comparison group. Instead, they compared student-performance trends in (some of) the states that adopted high-stakes testing with the average gain among states participating in NAEP--a trend that partially reflects the gains among high-stakes states, thereby corrupting the analysis. Amazingly, they make no attempt to defend this faulty approach. Instead, they trumpet the fact that they reached similar conclusions when they applied the same troubled analysis to other measures of student performance, such as SAT scores and dropout rates. When we applied Amrein and Berliner's own time-series methodology to the data (with an appropriate comparison group of states that have not adopted high-stakes testing), their conclusions were completely reversed. Yet Amrein and Berliner don't even address this. Their response ignores the egregious errors in implementation that we identified, namely the fact that they threw out a majority of the state observations, miscoded outcome information, and completely confused the sequence of test introduction and achievement measurement in several states.
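The comparison-group point above can be illustrated with a small arithmetic sketch. All numbers here are hypothetical and are not taken from either study; the function name and the equal-weighting assumption are ours. If a share p of the states in the overall average adopted high-stakes testing, then the gap between those states and the overall average shrinks to (1 - p) times the gap against states that did not adopt it.

```python
def gap_vs_overall_average(gain_treated, gain_untreated, treated_share):
    """Gap between high-stakes states and an all-state average that includes them.

    Assumes (hypothetically) that the overall average is a simple weighted mean
    of the two groups, with `treated_share` the fraction of high-stakes states.
    """
    overall = treated_share * gain_treated + (1 - treated_share) * gain_untreated
    return gain_treated - overall


# Hypothetical score gains, chosen only to show the arithmetic.
gain_treated = 4.0      # average gain in high-stakes states
gain_untreated = 1.0    # average gain in states without high-stakes tests
treated_share = 0.5     # half of the states in the average are high-stakes

clean_gap = gain_treated - gain_untreated
contaminated_gap = gap_vs_overall_average(gain_treated, gain_untreated, treated_share)

print(f"gap vs. untreated states: {clean_gap:.1f}")
print(f"gap vs. overall average:  {contaminated_gap:.1f}")
```

With these made-up inputs, the measured gap against the overall average is half the gap against the untreated states, which is why the choice of comparison group can change the apparent size of an effect.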
We know of no legitimate statistical text that argues it is irrelevant to use tests of statistical significance to guard against random fluctuations in the data--in this case, scores on tests of student performance. Each administration of the NAEP involves a different group of students, a different set of test questions, and a different testing environment. Across test administrations, these differences can lead to random changes in scores that bear little relation to actual changes in students' knowledge and skills. The purpose of tests of statistical significance is to determine whether results reflect genuine changes in performance or simply random fluctuation.
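As a minimal sketch of what such a test looks like, the following compares two average scores from independent test administrations using their standard errors. The scores and standard errors are hypothetical, the function names are ours, and the normal approximation is an assumption; this is not the procedure used in either study.

```python
import math


def z_for_score_change(mean_a, se_a, mean_b, se_b):
    """z statistic for the change between two independent sample means."""
    return (mean_b - mean_a) / math.sqrt(se_a ** 2 + se_b ** 2)


def two_sided_p(z):
    """Two-sided p-value under a standard normal approximation."""
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical average scale scores and standard errors for two administrations.
z = z_for_score_change(mean_a=224.0, se_a=1.2, mean_b=226.0, se_b=1.3)
p = two_sided_p(z)

print(f"z = {z:.2f}, p = {p:.3f}")
if p < 0.05:
    print("Change is unlikely to be random fluctuation alone.")
else:
    print("Change is within the range of random fluctuation.")
```

In this made-up case a two-point gain does not clear the conventional 0.05 threshold, so it could not be distinguished from sampling noise.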