Stop Misusing Tests to Evaluate Teachers


Monty Neill

In the spring of 1998, concerted action by North Carolina teachers forced the state legislature to revise a law requiring teachers and principals working in the 15 schools with the lowest student scores on statewide tests to take a “competency” test to determine whether they would be retained or fired.1 That law was part of a new wave of “accountability” legislation now flooding the nation. Thus, North Carolina provides a case study on using student test scores to evaluate teachers.

Under the 1997 Excellent Schools Act, North Carolina schools were to be ranked based on their student scores on state exams. In the 1997-98 school year, the state assigned “special assistance teams” to the 15 lowest-ranking schools. The teachers and principals of those schools were to take a “general knowledge” test. Those who did not pass after three tries would be dismissed. The law would have expanded the program to more than 100 schools in 1999-2000.

With the initial exam scheduled for June 12, 1998, the North Carolina Association of Educators (NCAE) filed a class-action suit, declared that it would support a boycott if 90% of those required to take the test joined in, and mounted an extensive lobbying campaign to change the law. The State Board of Education agreed that the testing idea was a mistake. At the last minute, the legislature approved and the governor signed a bill revising the law. Now, teachers at schools deemed “low-performing” on the basis of student scores will be evaluated by the special assistance teams, but only those teachers given the lowest rankings and found by the special assistance teams to be lacking in “general knowledge” will be required to take the test. Those teachers who fail will receive training at state expense and will have one more opportunity to pass.

The idea of judging teacher competence based on student test scores is not new. In England, “payment by results”—“the results of the examination of individual children”—was initiated in 1862 but discarded in 1890 because of its own results: cramming and the “overpressuring” of students as well as the stifling of teachers.2 Teacher evaluation by student test scores was also tried intermittently—but usually briefly—in the United States. Now, however, the notion is gaining ground. For example, Texas has enacted legislation to use student scores to evaluate its 250,000 teachers. Nationwide, by the spring of 1997, student test scores were being used to reward or sanction schools or school districts in 14 states, with six or more pending.3

What is the problem? Shouldn’t teachers be held accountable for successfully teaching their students? Shouldn’t students be saved from chronically dysfunctional schools and incompetent teachers? Phrased so, the only fair answer to these questions would be a resounding “Yes.” Indeed, the organization of which I am Executive Director, FairTest, and other educational reformers support accountability and the right of districts and states to intervene in failing schools. Nevertheless, the use of student scores for such high stakes purposes is not a good solution to the problem of inadequate schools.


Problems with “Accountability” Laws

The best predictors of student test scores are family income and the mother’s level of education. The environment of the school as a whole, the resources available to teachers, class size, and other factors are all beyond the control of individual teachers but likewise affect student learning.

The 15 schools targeted for teacher testing in North Carolina all have overwhelmingly low-income populations.4 A similar situation exists in Chicago, where “re-engineered” high schools serve almost entirely very low-income students, and a large proportion of the schools are virtually all black.5 Opponents of the Excellent Schools Act argued that if teachers in such schools were to be tested, good teachers would seek employment in schools with higher-scoring students, while other good teachers would avoid these low-scoring schools. Thus, the educational opportunities of the low-income students would be diminished rather than improved. Apparently, North Carolina legislators found this to be a powerful enough argument to revise the original law.

A more sophisticated version of using student scores to evaluate teachers is the “value added” approach pioneered in Tennessee and Kentucky. In this approach, the focus is placed on score gains rather than on absolute scores. In Kentucky, schools are evaluated based on the aggregate gains of their students. According to some observers, however, the state’s reward program led to a climate of fear and substantial teacher hostility to the student testing program.6 Fueled in part by teacher anger, Kentucky’s student testing program was revised by the legislature. The state added a multiple-choice test, but the use of student assessment to determine rewards remains, though in somewhat modified form.7

Tennessee evaluates, but does not reward or sanction, teachers based on student test scores. Its value added approach uses multiple-choice tests to measure the learning gains of each student in each class. Thus, the state can determine the ability of a teacher to induce greater or lesser amounts of tested learning—or, perhaps, just how well teachers prepare their students for test taking.8

The approach used in these two states reduces the problem of holding teachers accountable for problems outside their control, but it does not eliminate it. For example, the students of a good teacher in a weak school are likely to be affected by the school’s systemic problems. Still, in the same or similar schools, teachers who do a particularly poor job of helping students improve on the tests can be identified. However, this test-based approach still contains two major problems: it often fails to correctly identify the actual problems, and it leads to narrow teaching to the test with harmful educational consequences.


Misidentifying Problems

For student test scores to be used in evaluating teachers, such use must accord with generally accepted measurement standards, or in other words, be validated. Often such validation is not attempted, or is done poorly. In North Carolina, the state’s own validation study concluded that student scores “cannot be [used to] directly assess individual teacher performance.”9 An invalid test means that any inference drawn from the results—for example, that a teacher lacks knowledge or cannot teach effectively—is very likely to be false. Such unjustified conclusions can lead to unfair and often harmful consequences.

North Carolina’s Excellent Schools Act was meant to solve the perceived problem of low student test scores. However, student scores on tests do not provide information about the causes of those scores. If a student does not do well on a history exam, there are many possible reasons why: the student did not pay attention in class or do the homework, the course was not aligned well with the test, the teacher did not know the subject well, or the teacher knew the subject but could not teach it well.

The North Carolina law assumed that teachers in schools with low scores lacked “general knowledge.” However, the special assistance teams that worked in these schools found this assumption to be incorrect: “During the past year, 55 teachers were evaluated as a ‘3,’ the lowest rating, but 27 of them moved to higher rankings on subsequent evaluations. Of the remaining 28 teachers, eight retired and the other 20 did not lack general knowledge.”10

Taking action without knowing the cause of a problem is likely to lead to mistakes. For example, not only would teachers in North Carolina have been required to take a test that was not relevant, real problems might have been ignored while state authorities focused on the results of the teacher test.

Teaching to the Test

Tailoring curriculum and instruction to fit the test is common in the United States.11 A frequent consequence is that whole subjects or parts of subjects that are not tested—sometimes including social studies—are either not taught or are de-emphasized. This pertains not only to content, but also to thinking skills. For example, a study found that the test then used by Arizona, the Iowa Tests of Basic Skills, covered on average only 26% of the state’s mandated curriculum in the tested subjects. Since the test was all multiple choice, only the lower levels of the curriculum focusing on rote learning were tested.12 A separate set of studies found that many schools, including schools in Arizona, focused only on what was tested, meaning that most of the curriculum was ignored.13 This was in a situation in which scores were released to the public, but no sanctions existed for students, teachers, or schools. With higher stakes, tests are even more likely to dominate curriculum and instruction.

Many states have produced academic standards and revised their tests. In some instances, where tests include more extended open-ended questions, this may have led to positive changes in instruction. However, most state exams still fail to adequately match the state standards, particularly those requiring higher order thinking in and across the subject disciplines. Most state tests are still largely multiple-choice and short-answer, and they have not led to positive changes. The Arizona example of limiting instruction to fit a narrow test remains common, and it is more common in schools with large percentages of low-income or minority-group students.14 In short, education is being undermined in the name of test-driven accountability of students, teachers, administrators, and schools.


The Need for Real Solutions

An appropriate accountability mechanism for a school’s performance must be fair to teachers and based on comprehensive measures of student learning and of the school context. It also must support important student learning and be congruent with professional development and school improvement. For example, a high-quality, classroom-based student assessment system can provide rich information about student learning while supporting improved professional development.15 When combined with school and district contextual information, a basis for fairer teacher evaluation is established.

Schools, with support from districts and states, must create structures to facilitate dialogue about curriculum, instruction, assessment, and student learning. Such discussion is an essential prerequisite for sustained professional development and for creating a true community of learners.16 Within this context, most teachers can continuously improve. Unfortunately, some cannot, and they need to be counseled into an alternative career or, if necessary, removed.

It may be difficult to use what is designed to be primarily a school and educator improvement process as a tool for removing a colleague. However, teachers do not owe each other continued employment at the expense of children. Moreover, we expect that the inability or refusal to improve will be rare.

Districts or states have a responsibility to ensure that dysfunctional institutions improve. The Cross City Campaign, a multi-city educational reform network, has developed Principles of Accountability that outline key issues based on shared and reciprocal responsibilities rather than simply top-down control. It has also created a set of Intervention Standards to guide interventions when they are needed. These include the use of multiple indicators from many sources over time; a fair and mutually respectful public process that engages all stakeholders; and intervention conducted in a manner that builds a school’s capacity to improve teaching and learning. The multiple indicators include assessment information about student learning. Such information should not be limited to test scores; indeed, where the tests are not compatible with high-quality schooling, reliance on tests should be eliminated or minimized.17

The Cross City Campaign’s principles and standards are not structured to address individual teacher evaluation. Such evaluation, however, can be done best within a structure similar to that developed by Cross City to evaluate schools. For example, Baltimore has begun to develop a teacher accountability program rooted in high quality professional development for teachers that includes use of student portfolios.18

It is important to recognize that, even given a fair set of assessments in place of one narrow exam to rely on, holding teachers and schools accountable still must be done within the context of those things for which it is reasonable to hold them accountable. Schools and teachers are often evaluated as if poverty, racism, and their social consequences were of no importance. Thus, children who lack food, shelter, and security are viewed as being just as able to progress in school as children with wealthy or middle-class parents. Recognizing this should not let teachers off the hook for doing a good job, but it does require that a good job be measured in terms of the social context. If society wants better results for poor children, it must address issues of poverty and provide teachers and schools with adequate support, not just test students and teachers.


Organizing for Defense and Positive Change

If teachers are to successfully withstand improper uses of tests and inappropriate accountability procedures, they will have to organize a strong and clear resistance. Such resistance must include teacher willingness to participate in genuine accountability efforts.

Organizing implies finding allies and thinking strategically. Unfortunately, by accepting the unreasonable, damaging, and educationally invalid testing of students and of prospective teachers, too many education organizations have pushed away potential allies. For example, in North Carolina, the state teachers’ association recently said it would not take a stance on a proposed test for grade two students, although the organization has a policy opposing the testing of young children, largely because such tests are developmentally inappropriate.19

Compromises of this kind not only put the weight on children, but also give away the strategic high ground by accepting for students the kind of testing that teachers object to for themselves.20 This has weakened teachers’ clout on the immediate issue of not being held accountable by test scores. Accepting such student testing also distracts attention from creating true, high-quality communities of learners. It sends the message that complex problems and school improvement can be addressed with more testing, and in so doing, leaves everyone more vulnerable to the demands for more and tougher testing.

Stopping the mania for testing will not be easy, but reversing cycles of test emphasis has been done in the past and can be done again. Organizations of educators must take a strong position against the increased use of standardized testing for both students and teachers, while supporting genuine educational reform and accountability. If they do not, the damage to students, schools, and the teaching profession will only continue.



1. The events are summarized in “N. C. Lawmakers Alter Testing Plan,” Fair Test Examiner (Summer 1998): 16; “N. C. Teachers Sue Over State Test,” Fair Test Examiner (Spring 1998): 12; the Examiner articles relied on news reports from North Carolina papers and on court papers and affidavits.

2. W. M. Haney, G. C. Madaus, and R. Lyons, The Fractured Marketplace for Standardized Testing (Boston: Kluwer Academic Publishers, 1993), 269-271.

3. E. Roeber, L. Bond, and S. Connealy, Annual Survey of State Student Assessment Programs, Fall 1997, Data on 1996-97 Statewide Student Assessment Programs, Vol. II (Washington, DC: Council of Chief State School Officers, 1998), 244-253.

4. Fair Test Examiner articles.


5. G. N. Schmidt, “Chicago to ‘Re-engineer’ More High Schools,” Substance (July 1999): 1, 29.

6. “Kentucky’s Assessment Program Faces Problems, Stays the Course,” Fair Test Examiner (Fall 1997): 8-10; E. Miller, “Early Reports from Kentucky on Cash Rewards for ‘Successful Schools’ Reveal Many Problems,” The Harvard Education Letter (January/February 1996): 1-3.

7. “Assessment and Accountability,” Insights: the Journal of the Kentucky Association of School Councils (January 1999): 1-9.

8. Roeber et al.; M. Neill, Testing Our Children (Cambridge, MA: FairTest, 1997), 153-156.

9. “N. C. Teachers Sue...”

10. “N. C. Lawmakers...”

11. G. F. Madaus, “The Influence of Testing on the Curriculum,” in L. N. Tanner, ed., Critical Issues in the Curriculum. 87th Yearbook of the National Society for the Study of Education, Part I (Chicago: University of Chicago Press, 1988), 83-121; G. F. Madaus, M. M. West, M. C. Harmon, R. G. Lomax, and K. A. Viator, The Influence of Testing on Teaching Math and Science in Grades 4-12 (SPA8954759) (Chestnut Hill, MA: Boston College, Center for the Study of Testing, Evaluation, and Educational Policy, 1992).

12. N. L. Noggle, Report on the Match of the Standardized Tests to the Arizona Essential Skills (Tempe, AZ: College of Education, 1987), cited in T. Haladyna, N. Haas, and J. Allison, “Continuing Tensions in Standardized Testing,” Childhood Education: Infancy Through Early Adolescence 74, No. 5: 262-263.

13. M. L. Smith and C. Rottenberg, “Unintended Consequences of External Testing in Elementary Schools,” Educational Measurement: Issues and Practices (Winter 1991): 7-11.

14. Madaus et al.; N. Medina and D. M. Neill, Fallout from the Testing Explosion (Cambridge, MA: FairTest, 3rd ed., 1990); Neill (1997).

15. National Forum on Assessment, Principles and Indicators for Student Assessment Systems (Cambridge, MA: FairTest, 1995); M. Neill et al., Implementing Performance Assessments (Cambridge, MA: FairTest, 1995).

16. J. W. Little, “Teachers’ Professional Development in a Climate of Educational Reform,” Educational Evaluation and Policy Analysis 15, No. 2 (1993).

17. Cross City Campaign for Urban School Reform, Beyond Finger Pointing and Test Scores (forthcoming). Address: 407 S. Dearborn St., Suite 1725, Chicago, IL 60605.

18. This effort is based on ideas developed in Charlotte Danielson, Enhancing Professional Practice (Alexandria, VA: ASCD, 1996).

19. Personal communication (August 25, 1998).

20. The draft North Carolina legislation did include provisions that the test results in grade two not be used for high stakes.


Monty Neill, Ed.D., is Executive Director of the National Center for Fair Open Testing (FairTest), 342 Broadway, Cambridge, MA 02139; tel: (617) 864-4810.

©1999 National Council for the Social Studies. All rights reserved.