

How Well Do Professional Reference Ratings Predict Teacher Performance?

Greg Thorson · May 19 · 5 min read

This study asks whether professional reference ratings can predict future teacher performance. Using data from Spokane Public Schools between 2015 and 2018, the authors analyzed structured reference ratings for over 3,500 teacher applicants, linking them to subsequent teacher performance measures: evaluation scores and value-added outcomes in math and English Language Arts (ELA). They found that reference ratings significantly predict teacher evaluation outcomes, with a one standard deviation increase in ratings associated with a gain of 13 percent of a standard deviation in evaluation scores. Ratings modestly predicted math value-added (2 percent of a standard deviation) but did not predict ELA value-added. Predictive validity was strongest for experienced applicants and for references from supervisors or colleagues.


Full Citation and Link to Article

Goldhaber, D., Grout, C., & Wolff, M. (2025). How Well Do Professional Reference Ratings Predict Teacher Performance? Education Finance and Policy, 20(2), 236–258. https://doi.org/10.1162/edfp_a_00421


Extended Summary

Central Research Question


This study investigates a practical and under-researched question in education hiring: Do professional reference ratings accurately predict subsequent teacher performance? The authors explore whether structured, categorical assessments from references can serve as a reliable, low-cost tool for informing school districts’ hiring decisions. Specifically, they examine whether reference ratings for teacher applicants are predictive of two key performance indicators—classroom observation scores (TPEP ratings) and student test-based value-added scores in math and English Language Arts (ELA). The study also investigates how predictive validity varies based on rater type (e.g., supervisor vs. colleague) and applicant experience level (novice vs. experienced).


Previous Literature


While the use of professional references and letters of recommendation is widespread in education hiring, empirical evidence linking references’ assessments to later job performance is limited and mixed. Earlier studies (Aamodt et al. 1993; McCarthy and Goffin 2001; Liu et al. 2009) suggested some predictive value of references, but their samples were small and their contexts specialized (e.g., military or corporate internships). Research on teacher hiring practices has increasingly demonstrated that structured screening tools can predict outcomes, but these are often costly to implement (e.g., the TeachDC or LAUSD systems). Letters of recommendation, in contrast, are unstructured and tend to be uniformly positive, making them difficult to interpret; Goldhaber et al. (2017) previously found that letters influenced districts’ hiring rubric scores but were hard to evaluate systematically. This prompted interest in structured reference ratings as a potentially more useful and scalable alternative.


Data


The study uses applicant and employee data from Spokane Public Schools (SPS) in Washington State from 2015 to 2018. Applicants were asked to submit contact information for at least three professional references. After submitting a letter of recommendation, each reference was directed to a structured online survey that asked them to rate the applicant on six teaching-related competencies—such as instructional skills, student engagement, and classroom management—on a percentile-based scale ranging from “Average” to “Among the best encountered in my career.” Each reference also provided an overall rating and identified the competencies on which the applicant was strongest and weakest.


In total, structured reference ratings were collected for 3,588 unique applicants. These were matched to post-hire teacher performance data from three sources: (1) district-level evaluations under Washington’s Teacher/Principal Evaluation Program (TPEP), (2) statewide administrative data on demographics and employment, and (3) student test scores used to generate teacher value-added scores in math and ELA. The analytic sample includes 757 applicants with TPEP ratings and about 269 with value-added data in math and ELA. Applicants who were never subsequently observed teaching in Washington public schools were excluded from the performance analyses.


Methods


The main independent variable is a “GRM” score: a summative measure of the six categorical reference ratings, generated using a graded response model that accounts for variation in rating difficulty and discrimination across items. The study also examines the predictive value of the overall reference rating as a categorical variable.
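
For readers unfamiliar with the technique, below is a minimal sketch of a standard (Samejima) graded response model, consistent with the paper’s description of item-level difficulty and discrimination; the notation is ours, not the authors’. Here Y_ij is the rating applicant j receives on competency item i:

```latex
% Samejima's graded response model (standard formulation; notation ours).
% theta_j : latent quality of applicant j
% a_i     : discrimination of competency item i
% b_ik    : difficulty threshold for rating category k of item i
\Pr(Y_{ij} \ge k \mid \theta_j) = \frac{1}{1 + \exp\{-a_i(\theta_j - b_{ik})\}},
\qquad
\Pr(Y_{ij} = k \mid \theta_j) = \Pr(Y_{ij} \ge k \mid \theta_j) - \Pr(Y_{ij} \ge k+1 \mid \theta_j)
```

Under a model of this form, the “GRM score” is the estimated latent quality θ_j, which weights each of the six ratings by how sharply it discriminates among applicants rather than averaging them equally.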


Three main outcome variables are analyzed:


  1. Teacher performance evaluations (TPEP), which include eight competency-based ratings.

  2. Teacher value-added scores in math.

  3. Teacher value-added scores in ELA.


Linear regression models were used to estimate the relationship between reference ratings and each performance measure. The models controlled for teacher characteristics such as gender, race, degree level, and years of experience. Additional analyses included reference fixed effects to control for differences in how lenient or stringent individual raters were. The authors also tested whether predictive validity varied based on rater type or applicant experience (novice vs. experienced). To account for possible sample selection bias (since only a subset of applicants were hired and subsequently evaluated), a Heckman two-step correction was applied using instruments related to job competition.
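
Since the paper’s code is not public, here is a minimal, self-contained sketch of the three estimation steps just described—OLS with applicant controls, rater fixed effects, and a Heckman two-step correction—using synthetic data. All variable names (grm, tpep, rater_id, n_competitors, etc.) are illustrative assumptions, not the paper’s actual fields.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 2000

# Synthetic applicant pool; all field names are illustrative assumptions.
df = pd.DataFrame({
    "grm": rng.standard_normal(n),            # standardized GRM reference score
    "years_exp": rng.integers(0, 20, n),      # applicant experience
    "female": rng.integers(0, 2, n),          # demographic control
    "rater_id": rng.integers(0, 150, n),      # which reference supplied the rating
    "n_competitors": rng.integers(1, 30, n),  # instrument: competition for the job
})

# Selection into the evaluated sample: only some applicants are hired.
latent = 0.3 * df["grm"] - 0.05 * df["n_competitors"] + rng.standard_normal(n)
df["hired"] = (latent > 0).astype(int)

# TPEP outcome, observed only for the hired (true GRM effect set to 0.13 SD).
df["tpep"] = 0.13 * df["grm"] + 0.02 * df["years_exp"] + rng.standard_normal(n)
df.loc[df["hired"] == 0, "tpep"] = np.nan
hired = df[df["hired"] == 1]

# (1) OLS of the performance measure on the GRM score plus applicant controls.
ols = smf.ols("tpep ~ grm + years_exp + female", data=hired).fit()
print("OLS GRM coefficient:", ols.params["grm"])

# (2) Rater fixed effects: identifies the GRM effect from comparisons among
# applicants rated by the same reference (rater leniency/stringency absorbed).
fe = smf.ols("tpep ~ grm + years_exp + female + C(rater_id)", data=hired).fit()
print("FE GRM coefficient:", fe.params["grm"])

# (3) Heckman two-step: probit for being hired (first stage, using the
# competition instrument), then the inverse Mills ratio as an added regressor.
probit = smf.probit("hired ~ grm + n_competitors", data=df).fit(disp=False)
xb = probit.fittedvalues                    # linear index X'beta
df["imr"] = norm.pdf(xb) / norm.cdf(xb)     # inverse Mills ratio
heck = smf.ols("tpep ~ grm + years_exp + female + imr",
               data=df[df["hired"] == 1]).fit()
print("Heckman-corrected GRM coefficient:", heck.params["grm"])
```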


Findings/Effect Sizes


The study finds that professional reference ratings are predictive of teacher performance, though the strength of this relationship depends on the outcome measure, rater type, and applicant experience level.


  1. TPEP Evaluations:


    • A one standard deviation increase in the GRM score is associated with an increase of 13 percent of a standard deviation in TPEP scores (a worked interpretation of these standardized effect sizes appears after this list).

    • Teachers rated “Among the best” (top 1%) on the overall reference scale performed 36 percent of a standard deviation better than those rated “Very good.”

    • When rater fixed effects were included, the effect size increased to 23 percent of a standard deviation, indicating that how a reference rated an applicant relative to the other applicants they rated was especially informative.


  2. Value-Added in Math:


    • A one standard deviation increase in the GRM was associated with an increase of 2 percent of a standard deviation in math value-added scores.

    • Teachers rated in the top 1% had math value-added scores roughly 6 percent of a standard deviation higher than those rated “Very good.”

    • With rater fixed effects, the GRM coefficient increased to 6.9 percent of a standard deviation.


  3. Value-Added in ELA:


    • Reference ratings were not significantly predictive of ELA value-added scores in the main models.

    • Some rater types (e.g., colleagues and instructional coaches) showed weak predictive value, but results were inconsistent and less interpretable.


  4. Heterogeneity by Applicant Experience:


    • Reference ratings were significantly predictive of performance for experienced applicants but not for novices.

    • This suggests that references may be better able to judge the teaching of applicants who have an established classroom track record than the potential of those who do not.


  5. Heterogeneity by Rater Type:


    • Ratings from Principals, Instructional Coaches/Department Chairs, and Colleagues were significantly predictive of TPEP performance.

    • Principals’ ratings also predicted math value-added; Instructional Coaches’ ratings predicted ELA value-added.

    • Ratings from Cooperating Teachers, University Supervisors, or “Other” references were not significantly predictive.


  6. Heckman Selection Models:


    • The analysis did not find evidence that selection into the sample (i.e., being hired and evaluated) introduced significant bias.

    • If anything, selection bias would slightly understate the strength of the observed relationships.
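
To make the standardized effect sizes above concrete, the headline TPEP result converts to raw evaluation points as sketched below; the 0.40-point standard deviation is a hypothetical value chosen only for illustration, not a figure from the paper.

```latex
% "13 percent of a standard deviation," expressed in raw evaluation points.
% sigma_TPEP = 0.40 points is hypothetical, used only for illustration.
\Delta\widehat{\mathrm{TPEP}}
  = 0.13 \times \sigma_{\mathrm{TPEP}}
  \approx 0.13 \times 0.40
  = 0.052 \ \text{evaluation points per 1 SD increase in GRM}
```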



Conclusion


This study offers some of the strongest evidence to date that structured reference ratings can provide meaningful insights into teacher applicant quality. The findings suggest that categorical assessments from references—particularly supervisors and colleagues—can predict future classroom performance, especially when evaluating experienced applicants. While the predictive power is moderate, the cost of collecting this information is low, making it a promising tool for improving teacher hiring practices.


However, the study also identifies clear limitations. Reference ratings are not predictive of ELA value-added scores, and they are not useful for evaluating novice applicants. Additionally, ratings are highly skewed toward positive evaluations (“cheerleading”), which may limit their informativeness. The results underscore the importance of considering both who is providing the reference and how experienced the applicant is when interpreting such ratings.


Ultimately, while structured reference ratings are not a silver bullet, they can be a valuable addition to hiring systems—particularly in resource-constrained districts where more intensive screening tools are not feasible. Further research could examine how these ratings interact with other applicant data or how hiring managers respond to reference information when it is made available during decision-making.




bottom of page