Tuesday, January 8, 2013

Principals Can't Tell Who the Good Teachers Are

As you probably know, Memphis is one of the sites where the Gates Foundation's Measures of Effective Teaching Project (MET) is researching some central questions about how we can figure out who effective teachers are.  This puts Memphis in what should be the awkward position of making policy decisions based on as-yet-incomplete research.

Well, it's probably a good thing, then, that the Gates Foundation has finally finished the last stage of its research.  Here's the executive summary.  You know, since districts and states across the country have made huge, mainly controversial changes on the basis of what the Gates Foundation was pretty sure the research was going to say.

The Huffington Post has a story about the research results.  (You can compare the HuffPo report to the glowing review of the research in the Commercial Appeal.)  HuffPo summarizes the results as "teacher observation less reliable than test scores" in measuring effective teaching.  The Gates Foundation research has now found that effective teaching can be measured - through some combination of teacher observation, test scores, and student surveys.  The basic methodology is that teachers had a baseline year (2009-10) to establish their effectiveness - the Foundation calls this "produc[ing] estimates of teaching effectiveness for each teacher."  The report states:  "we adjusted those [original] measures [of teaching effectiveness] for the backgrounds and prior achievement of the students in each class.  But, without random assignment, we had no way to know if the adjustments we made were sufficient to discern the markers of effective teaching from the unmeasured aspects of students' backgrounds."  So the next year, academic year 2010-11, "we then randomly assigned a classroom of students to each participating teacher."

By "background", the Foundation is obliquely referring to students' socioeconomic backgrounds.  The researchers at RAND (the ostensibly independent corporation that Gates Foundation hired to conduct the research) hoped to answer two questions:  "First, did students actually learn more when randomly assigned to the teachers who seemed more effective when we evaluated them the prior year?  And second, did the magnitude of the difference in student outcomes following random assignment correspond with expectations?"  The short answer, according to the RAND researchers, is yes and yes.

Interesting.  Should we be surprised that the Gates Foundation research finds what the Gates Foundation thought it would find?  Or that it cost $50 million to find it out?

A couple of things jump out at me.  The first has to do with the role of teacher observations and the principals who conduct them.  Traditionally, teachers were judged only on the basis of their principal's once-every-several-years announced-in-advance observation.  Teachers generally performed very well in these observed lessons.  Where students did not make "adequate yearly progress" but teachers were uniformly rated as effective and highly effective, critics found a disconnect.  The solution has been to require that student standardized test results be included in teachers' overall assessment.  So - no surprise that the researchers have found that principals actually do not adequately assess their own teachers in the classroom observations.  Districts have tried a number of methods to "norm" their evaluators - they have to take a class before they can be evaluators, they have to grade a video of a teacher within an accepted range of how the researchers have "objectively" graded the video, etc.  But despite all of that "norming", principals' observations are viewed as unreliable.

Now the summary doesn't just come out and say that - instead, it states that "adding a second observer increases reliability significantly more than having the same observer score an additional lesson."  The researchers go on to qualify this statement in three ways.  First, they acknowledge that it may be too expensive to have enough observers if each has to observe a full-length lesson, so portions of lessons could be observed instead.

The second and third qualifying statements both have to do with principals, and embedded in the second is what underlies the Foundation's concern:  "although school administrators rate their own teachers somewhat higher than do outside observers, how they rank their teachers' practice is very similar and teachers' own administrators actually discern bigger differences in teaching practice, which increases reliability."  Essentially, principals do not objectively recognize effective teaching, but they can recognize how teachers within a school stack up against each other.  Talk about a back-handed compliment.  So even if principals don't actually know whether they have a good teacher, if they observe all of the teachers at their school, then they can tell you who is better than whom.  So the principals' observations still have some value.

But then there's the third qualifying statement - the crux of the Foundation's concern:  "adding observations by observers from outside a teacher's school to those carried out by a teacher's own administrator can provide an ongoing check against in-school bias."  There's the there.  Principals can't be trusted to adequately judge their own teachers.  If you don't believe the Gates Foundation, just ask Tennessee's Commissioner of Education Kevin Huffman, who says that Tennessee's teacher observation scores are consistently inflated.

I have mixed feelings about principal evaluations to start with - I'm not really all that opposed to administrators from outside the school doing the observations.  I've known enough bad principals - who unfairly reward their favorites and unfairly criticize their less-favored - that I've viewed favoritism as a bigger problem than generalized inflation.  I guess I assumed that principals could objectively observe teachers, but chose not to.  Maybe it's time to re-think that.  But I do think that principals understand teachers' fears about the new evaluation systems that so heavily weight student achievement, and the principals do, after all, walk a fine line between supervision and maintaining working relationships with their subordinate teachers.  Therefore, some inflation - just for good employee relations - should not be all that surprising.  Principals don't always get a lot of training in how to walk that fine line, and they are human.  And we know that principals are not selected because they are the best teachers, though they are now required to be instructional leaders whatever their teaching success.  Still thinking about this one - but suspicious of a "measurement" that encourages the devaluation of the observations of the one person who most likely understands the teacher, the students, and the realities of teaching and learning in a particular school - and in some neighborhoods, the dangers of just getting to school.  Teachers - how do you look at your principals in this particular role?

The second thing that jumps out at me is the methodology.  So the idea is that a teacher has her principal-assigned class the first year, and the RAND randomly-assigned class the second year to see if the teacher can meet or exceed the expectations based on their baseline year.  But these aren't just any students who are assigned - these are students that attend the same school.  Teachers are not assigned new schools, just new kids.  Why is this important?  Because - at least in Memphis - we use "neighborhood schools".  Generally, schools serve kids that live near the school.  So the students are generally going to be more similar to each other than dissimilar.  At least in Memphis, our neighborhoods are still fairly racially segregated, and definitely defined by socioeconomic status. 

So let's tease that out.  If you're a good 4th grade teacher at Richland Elementary, you'd probably be a good 3rd grade teacher there, too.  And you would likely do well with a different set of 4th graders at the same school.  But let's acknowledge that a 4th grade teacher at Richland Elementary could be hard-pressed to be an equally great 4th grade teacher at Coro Lake Elementary.  The Gates Foundation acknowledges that "as a practical matter," they could not "randomly assign students or teachers to a different school site."  I think what students the teacher teaches matters as much as how well the teacher teaches - and while value-added measures get closer to this issue than traditional did-the-student-meet-this-benchmark testing, none of the testing addresses how what is happening in students' lives outside of the classroom impacts what is possible inside of it.

Underlying this all is my fundamental disagreement that a student's achievement should be measured based on that student's performance on any particular standardized test.  We know, from the dozens of studies on the SAT and the ACT, that the single best predictor of a student's performance is that student's socioeconomic background.  I'm just not convinced that standardized tests measure how well a teacher taught something, or even how well a student learned something.  And they definitely do not measure a number of other things that are hugely important, including creative and critical thinking.  Don't get me wrong - student achievement is hugely important on its own - I'm just not sold that the testing products being sold to our school districts are the correct tool to measure it.

Still more to read in this executive summary - working to get the full report.  I'm very interested in what the Gates Foundation calls its ambiguous results on "peer effects" (the demographics and average prior achievement of each student's classmates), as well as the "trade-offs" of the different models for how we can weight the various measures of teacher effectiveness - for example, test scores as 50% versus 33% versus less.  Let me know what you think about it.
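To see why those weighting "trade-offs" matter, it helps to reduce the question to simple arithmetic: the same teacher gets a different composite score depending on how heavily test scores count.  The component scores and the 1-5 scale below are made up for illustration - only the 50%-versus-one-third test-score weights echo the models the executive summary compares.

```python
# Hypothetical sketch of how composite-weighting choices shift a teacher's
# overall effectiveness score. The component scores are invented; the two
# weighting schemes mirror the 50% vs. one-third test-score options.

def composite(observation, test_scores, survey, weights):
    """Weighted average of the three measures (weights should sum to 1)."""
    w_obs, w_test, w_survey = weights
    return w_obs * observation + w_test * test_scores + w_survey * survey

# One hypothetical teacher: strong observations, weaker test-score gains.
obs, tests, survey = 4.5, 2.8, 3.9   # each on a 1-5 scale

heavy_test = composite(obs, tests, survey, (0.25, 0.50, 0.25))
equal_thirds = composite(obs, tests, survey, (1/3, 1/3, 1/3))

print(f"50% test-score weight: {heavy_test:.2f}")   # 3.50
print(f"Equal thirds:          {equal_thirds:.2f}") # 3.73
```

Even this toy version shows the stakes: a teacher whose observations outpace her test-score gains drops noticeably when the test-score weight rises from a third to half.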


  1. It’s a safe assumption that some principals’ evaluating is prone to inflation simply because even with detailed evaluation rubrics, training, etc., observations have a degree of subjectivity. That doesn’t mean it’s a widespread problem or that principals who do their jobs well can’t evaluate teachers.

    Ed reformers conveniently ignore the well-documented problems with validity and reliability when applying VAM to individual teachers or assert that making test scores just one of “multiple measures” mitigates VAM’s limitations. “Hey, 100% of your high-stakes evaluation isn’t based on junk science, just 40-50%!”

    If someone develops a valid, reliable model based on test scores, then I’m fine with including it as part of my evaluation. But VAM isn’t such a model.

    The TN Dept of Ed, like the Gates Foundation, apparently holds value-added sacrosanct, and any disparity between VAM scores and evaluation scores is clearly the result of biased principals. See the following quotes from last summer's TEAM report:

    “In many cases, evaluators are telling teachers they exceed expectations in their observation feedback when in fact student outcomes paint a very different picture. This behavior skirts managerial responsibility and ensures that districts fail to align professional development for teachers in a way that focuses on the greatest areas of need” (p. 32).

    “This disparity between student results and observations signifies an unequal application of the evaluation system throughout the state” (p. 32).

    Page 33 mentions that Tennessee “leads the nation in available data on teacher performance and effectiveness” and that it possesses a “tremendous amount of student outcome data received through TVAAS.”

    However, there’s no mention anywhere of considering the validity or reliability of the TVAAS-based data!

    Although the report focuses on the disparity between the Level 1 scores, the chart on page 32 clearly indicates disparities at Level 4 and Level 5 as well:

    Level 4: TVAAS 11.9% / Observation 53%
    Level 5: TVAAS 31.9% / Observation 23.2%

    If the issue is actually observation score inflation, why then does the department not cite Level 4, the level with the largest disparity? Or if the source of error is the observation, not the TVAAS, why not argue that more teachers should’ve received a 5 for their observation score?
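The commenter's point can be checked against the chart's own numbers: the Level 4 gap between the two distributions is several times the Level 5 gap, and it runs in the observation-heavy direction. A quick sketch (percentages taken from the page-32 chart quoted above; the labels are mine):

```python
# Gap between the share of teachers each measure places at Levels 4 and 5,
# using the percentages quoted from page 32 of the TEAM report.
levels = {
    "Level 4": {"tvaas": 11.9, "observation": 53.0},
    "Level 5": {"tvaas": 31.9, "observation": 23.2},
}

gaps = {name: pct["observation"] - pct["tvaas"] for name, pct in levels.items()}

for name, gap in gaps.items():
    side = "observation-heavy" if gap > 0 else "TVAAS-heavy"
    print(f"{name}: gap of {abs(gap):.1f} percentage points ({side})")
```

The Level 4 gap works out to about 41 percentage points versus about 9 at Level 5, which is the commenter's question in a nutshell: if inflation is the story, Level 4 is the level the department should be citing.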

  2. The Gates Foundation also said that 33% of a teacher's evaluation should be based on the student survey.

    This is patently ridiculous. In the lower grades, the students generally like their teacher and want to mark the "right" answer on those surveys - they just don't know how! (Yes, they do give the survey, in the form of a bubble-in test, to Kindergartners.)

    In the upper grades, students in a challenging school generally will NOT like their teacher if she is a good teacher who knows how to control her class. They do know how to fill out the survey, and will give their teacher bad results.

    In neither case does it make sense to use the survey as 33% of a teacher's evaluation.