Gates on Teacher Evaluation

Originally posted on the National Journal Education Experts Blog

Wonky statistical studies like the Gates Foundation’s Measures of Effective Teaching project are often soporific. But with some three dozen states and thousands of public school systems crafting new teacher evaluation systems under federal financial incentives, the final report of the 3-year, $45-million Gates investigation is nothing if not timely.

The study of some 3,000 teachers in Denver, Dallas and several other urban school systems puts to rest the notion that we can’t objectively measure teachers’ contribution to student learning, the thing that matters most in schools. By measuring students’ performance and then randomly assigning those students to teachers in the study the following year, MET researchers, led by Harvard’s Tom Kane, were able to identify who was getting results and who wasn’t.

The study established beyond doubt that the only defensible way to get a fair reading on teacher performance using student test scores is to gauge how students did in the prior year or (preferably) years and compare that performance with how they do in the year you’re evaluating teachers, an approach called a value-added metric. Simply comparing teachers’ student test scores in a single year (a tactic that some school systems continue to use) yields many very flawed evaluations, the MET researchers report.
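To make the idea concrete, here is a minimal sketch of one simple way to compute a value-added score. It is not the MET researchers' model, which the report describes in far more detail. It assumes a single prior-year score per student and a plain linear prediction, and the function name `value_added` and the sample data are hypothetical, invented purely for illustration.

```python
from statistics import mean

def value_added(records):
    """Estimate a simple value-added score per teacher.

    records: list of (teacher, prior_score, current_score) tuples.
    Assumption: one prior-year score per student and a straight-line
    prediction; real value-added models use more years and more controls.
    """
    priors = [p for _, p, _ in records]
    currents = [c for _, _, c in records]
    p_bar, c_bar = mean(priors), mean(currents)

    # Least-squares line predicting this year's score from last year's.
    variance = sum((p - p_bar) ** 2 for p in priors)
    slope = sum((p - p_bar) * (c - c_bar)
                for p, c in zip(priors, currents)) / variance
    intercept = c_bar - slope * p_bar

    # A teacher's score is the average amount by which his or her
    # students beat (or fall short of) their predicted scores.
    gaps = {}
    for teacher, prior, current in records:
        predicted = intercept + slope * prior
        gaps.setdefault(teacher, []).append(current - predicted)
    return {teacher: mean(g) for teacher, g in gaps.items()}

# Hypothetical data: both teachers start with similar students,
# but Teacher A's students gain ten points and Teacher B's gain none.
sample = [
    ("Teacher A", 50, 60), ("Teacher A", 70, 80),
    ("Teacher B", 50, 50), ("Teacher B", 70, 70),
]
scores = value_added(sample)
```

In this made-up example Teacher A comes out above zero and Teacher B below it, because the comparison is with what each student was predicted to score, not with raw scores in a single year.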

But the final MET report also upends the notion that student test scores can defensibly serve as the sole or even the dominant measure of teacher performance. The study found that the more heavily you rely on test scores, the more the ratings fluctuate erroneously from year to year, and the less accurately you can measure teachers’ ability to raise students’ scores on tests more intellectually demanding than those used by most states today. That flaw is going to loom a lot larger as national testing consortia introduce new, tougher tests in the next few years.

Contradicting the claims of some school reformers, the MET researchers found that well-constructed classroom observations can accurately distinguish teachers’ ability to raise student achievement, as can student surveys (which are a new and promising gauge of teacher effectiveness, though one not extensively tested in high-stakes situations where jobs are on the line).

The best strategy—the one that yields the most accurate results—they found, combines classroom observations, student surveys, and measures of student achievement into an evaluation package. Another finding: multiple observations by multiple observers are stronger than a single observation by a single person, or even multiple observations by a single person.

Notably, the MET team combined the three different ways of measuring teacher performance in different configurations and found that weighting observations, surveys and student achievement equally came closest to identifying teachers’ true performance, and did so from one year to the next. (In their report, they expressed confidence in having student test scores count for up to 50 percent of teachers’ evaluations.)
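The weighting choice is simple arithmetic, and a short sketch may help. The code below is illustrative only: it assumes each of the three measures has already been converted to a common scale (for example, standard deviations above or below the average teacher), and the function name `composite_score` and the sample numbers are hypothetical rather than drawn from the report.

```python
def composite_score(observation, survey, achievement,
                    weights=(1 / 3, 1 / 3, 1 / 3)):
    """Combine three measures into one rating using the given weights.

    Assumption: all three inputs are already on the same scale.
    weights are (observation, survey, achievement) and must sum to 1.
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    w_obs, w_survey, w_ach = weights
    return w_obs * observation + w_survey * survey + w_ach * achievement

# Hypothetical teacher: strong observations, decent student surveys,
# below-average test-score gains.
equal = composite_score(0.6, 0.3, -0.3)
# The same teacher with test scores counting for 50 percent.
test_heavy = composite_score(0.6, 0.3, -0.3, weights=(0.25, 0.25, 0.5))
```

Changing the weights changes the rating for the same teacher, which is why the choice of configuration matters.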

The study focused on the statistical advantages of combining observations, surveys and student achievement. But there are other reasons why combining the methods is valuable. The first is that only a fraction of teachers teach tested grades and subjects, leaving school districts unable to use student test scores to evaluate upwards of 85 percent of their teachers. The second is that individual teachers (not to mention their unions) are a lot more sympathetic to so-called multiple measures than to test-score-driven evaluations. That’s important because teacher buy-in, districts are learning (often the hard way), is critical to the sustainability of new evaluation systems.

And if you want to use evaluations to improve teachers’ performance rather than merely to remove bad apples, then you’ve got to use tools that identify things that teachers can improve, something that value-added scores alone don’t do very well. You should want that, because removing bad teachers is a necessary but ultimately insufficient way to build a stronger teaching force, especially as the Common Core State Standards ratchet up expectations in every classroom. An important step in this direction would be to create new standards for classroom observations (called rubrics) that focus on the content that teachers teach, rather than merely on the way they teach.

Importantly, the MET study and others have found that teachers want to improve and want help improving their craft. The MET team videotaped the study’s 3,000 teachers at work in their classrooms, and the teachers were thrilled with the feedback they got.

Comprehensive evaluation systems of the sort that Kane and his colleagues recommend are pricier than the cursory, drive-by principal visits that pass for evaluations in most school systems, and are more expensive than value-added gauges of teacher performance. The District of Columbia school district spends about $1,000 a year (not counting principals’ time) to evaluate a teacher under one of the most comprehensive systems in the country.

There are ways to reduce these costs without diminishing the accuracy of the ratings (fewer observations for top-performing teachers is one strategy). But you get what you pay for. The MET study suggests that only comprehensive systems measure a teacher’s performance reliably from one year to the next.

And comprehensive models would allow school systems to target other monies a lot more effectively. We spend something like $20 billion a year on teacher professional development and much of it is dreck, of little value to most teachers. Spending money on comprehensive evaluations would help school districts to get a lot more for their professional-development monies by allowing them to target their support to individual teachers’ needs identified through the evaluation process. It makes sense, in other words, to view comprehensive new evaluation systems as an investment, not merely an expense.