“Don’t they have means, such as classroom observations, that could be put to use in learning about teacher quality?”
If the classroom observations are to provide fair and accurate assessments, they may require a large time commitment from the observing faculty, who ought to attend multiple class sessions and review teaching materials. Attending one or two classes, reviewing a syllabus, and checking grade distributions for one course won’t suffice. Will allowances be made for that effort, which will reduce the time available for other professional responsibilities?
“… similar answers from multiple students to specific qualitative questions about teaching can indicate particular teaching problems ….” They can indeed, but published research on a variety of student surveys indicates that there is good reason for considerable caution, given the prevalence of various kinds of bias and of errors in students’ recall. Some of the relevant citations:
In 2007, Stephen R. Porter began to raise serious questions about the trustworthiness of factual information reported in student survey results.
Stephen R. Porter, “Do College Student Surveys Have Any Validity?” The Review of Higher Education v35 n1 (Fall 2011) 45–76
“In this article, I argue that the typical college student survey question has minimal validity and that our field requires an ambitious research program to reestablish the foundation of quantitative research on students. Our surveys lack validity because (a) they assume that college students can easily report information about their behaviors and attitudes, when the standard model of human cognition and survey response clearly suggests they cannot, (b) existing research using college students suggests they have problems correctly answering even simple questions about factual information, and (c) much of the evidence that higher education scholars cite as evidence of validity and reliability actually demonstrates the opposite. I choose the National Survey of Student Engagement (NSSE) for my critical examination of college student survey validity ….”
{Cited in an early, but not the final, version by Porter:
Garry, M., Sharman, S. J., Feldman, J., Marlatt, G. A., & Loftus, E. F. (2002). “Examining memory for heterosexual college students’ sexual experiences using an electronic mail diary.” Health Psychology, 21(6), 629-634. Abstract: To examine memory for sexual experiences, the authors asked 37 sexually active, nonmonogamous, heterosexual college students to complete an e-mail diary every day for 1 month. The diary contained questions about their sexual behaviors. Six to 12 months later, they returned for a surprise memory test, which contained questions about their sexual experiences from the diary phase …. [Except for their accurate recollection of the (low) frequency of anal sex, the students grossly overestimated the frequency of vaginal or oral sex, by as much as a factor of four; men and women did not differ significantly in their overestimates.]}
Stephen R. Porter, Corey Rumann, and Jason Pontius, “The Validity of Student Engagement Survey Questions: Can We Accurately Measure Academic Challenge?” New Directions for Institutional Research n150 (Summer 2011) 87-98 DOI: 10.1002/ir.391
This chapter examines the validity of several questions about academic challenge taken from the National Survey of Student Engagement. We compare student self-reports about the number of books assigned to the same number derived from course syllabi, finding little relationship between the two measures.
Stephen R. Porter, “Self-Reported Learning Gains: A Theory and Test of College Student Survey Response,” Research in Higher Education (Nov 2012) DOI: 10.1007/s11162-012-9277-0
Abstract: Recent studies have asserted that self-reported learning gains (SRLG) are valid measures of learning, because gains in specific content areas vary across academic disciplines as theoretically predicted. In contrast, other studies find no relationship between actual and self-reported gains in learning, calling into question the validity of SRLG. I reconcile these two divergent sets of literature by proposing a theory of college student survey response that relies on the belief-sampling model of attitude formation. This theoretical approach demonstrates how students can easily construct answers to SRLG questions that will result in theoretically consistent differences in gains across academic majors, while at the same time lacking the cognitive ability to accurately report their actual learning gains. Four predictions from the theory are tested, using data from the 2006–2009 Wabash National Study. Contrary to previous research, I find little evidence as to the construct and criterion validity of SRLG questions.
Porter’s conclusions about student surveys receive strong support in this doctoral dissertation:
William R. Standish, III, A Validation Study of Self-Reported Behavior: Can College Student Self-Reports of Behavior Be Accepted as Being Self-Evident? (2017)
https://repository.lib.ncsu.edu/bitstream/handle/1840.20/33607/etd.pdf?sequence=1
Abstract excerpt: This validation study of self-reported behaviors compares institution-reported, transactional data to corresponding self-reported academic performance, class attendance, and co-curricular participation from a sample of 6,000 students, using the Model of the Response Process by Tourangeau (1984, 1987). Response bias, observed as measurement error, is significant in 11 of the 13 questions asked and evaluated in this study. Socially desirable behaviors include campus recreation facility (CRF) use and academic success, which are overstated by as much as three times. Nonresponse bias, observed as nonresponse error, is also significant in 11 of the same 13 questions asked and evaluated, with high-GPA and participatory students overrepresented in the survey statistic. For most of the questions, measurement error and nonresponse error combine to misstate behavior by at least 20%. The behaviors most affected are CRF use, which is overstated by 112% to 248%; semester GPA, with self-reports of 3.36 versus an actual value of 3.04; and co-curricular participation, which is misstated by between -21% and +46%. This validation study sufficiently demonstrates that measurement error and nonresponse error are present in the self-reported data collected for the commonly studied topics in higher education that were represented by the 13 questions. Researchers using self-reported data cannot presume the survey statistic to be an unbiased estimate of actual behavior or that it is generalizable to larger populations.
[Porter was a committee member, but not the dissertation director. Before beginning his doctoral dissertation research, Dr. Standish was an experienced higher education data analyst.]
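For readers unfamiliar with the two error terms in the Standish abstract, the arithmetic is: survey statistic minus population truth equals nonresponse error (respondents differ from the full population) plus measurement error (respondents misreport their own behavior). The short Python sketch below is a purely hypothetical illustration of that decomposition; the response rates and overreporting factors are invented for the example and are not taken from the dissertation.

```python
import numpy as np

# Hypothetical illustration of the nonresponse/measurement error decomposition;
# every number below is invented, not drawn from the Standish dissertation.
rng = np.random.default_rng(1)
n = 6_000

# "True" behavior known from institutional (transactional) records,
# e.g., campus recreation facility visits per month.
actual = rng.poisson(4, size=n).astype(float)

# Nonresponse: heavier users are assumed more likely to answer the survey.
p_respond = np.clip(0.3 + 0.05 * actual, 0, 0.9)
responded = rng.random(n) < p_respond

# Measurement error: respondents are assumed to overstate their own use.
reported = actual[responded] * rng.normal(1.6, 0.3, size=responded.sum())

population_truth = actual.mean()          # what an unbiased estimate should recover
respondent_truth = actual[responded].mean()
survey_statistic = reported.mean()        # what the survey actually reports

nonresponse_error = respondent_truth - population_truth
measurement_error = survey_statistic - respondent_truth
total_error = survey_statistic - population_truth

print(f"population truth:  {population_truth:.2f}")
print(f"nonresponse error: {nonresponse_error:+.2f}")
print(f"measurement error: {measurement_error:+.2f}")
print(f"overstatement:     {total_error / population_truth:+.0%}")
```

The decomposition itself is standard survey methodology; the point of the toy numbers is only to show how two modest biases, overrepresentation of heavy users and overreporting by those who respond, compound into a survey statistic well above the institutional benchmark.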
Porter is not the only researcher who has highlighted problems with student surveys as information sources:
N. A. Bowman, “Can 1st-Year College Students Accurately Report Their Learning and Development?” American Educational Research Journal v47 n2 (2010) 466-496.
Abstract: Many higher education studies use self-reported gains as indicators of college student learning and development. However, the evidence regarding the validity of these indicators is quite mixed. It is proposed that the temporal nature of the assessment—whether students are asked to report their current attributes or how their attributes have changed over time—best accounts for students’ (in)ability to make accurate judgments. Using a longitudinal sample of over 3,000 first-year college students, this study compares self-reported gains and longitudinal gains that are measured either objectively or subjectively. Across several cognitive and noncognitive outcomes, the correlations between self-reported and longitudinal gains are small or virtually zero, and regression analyses using these two forms of assessment yield divergent results.
See also:
N. A. Bowman & T. E. Seifert, “Can College Students Accurately Assess What Affects Their Learning and Development?” Journal of College Student Development v52 n3 (May-Jun 2011) 270-290; and N. A. Bowman, “Examining Systematic Errors in Predictors of College Student Self-Reported Gains,” New Directions for Institutional Research n150 (Summer 2011).
Shana K. Carpenter, Amber E. Witherby and Sarah K. Tauber, “On Students’ (Mis)judgments of Learning and Teaching Effectiveness,” Journal of Applied Research in Memory and Cognition 9 (2020) 137–151.
Abstract: Students’ judgments of their own learning are often misled by intuitive yet false ideas about how people learn. In educational settings, learning experiences that minimize effort and increase the appearance of fluency, engagement, and enthusiasm often inflate students’ estimates of their own learning, but do not always enhance their actual learning. We review the research on these “illusions of learning,” how they can mislead students’ evaluations of the effectiveness of their instructors, and how students’ evaluations of teaching effectiveness can be biased by factors unrelated to teaching. We argue that the heavy reliance on student evaluations of teaching in decisions about faculty hiring and promotion might encourage teaching practices that boost students’ subjective ratings of teaching effectiveness, but do not enhance — and may even undermine — students’ learning and their development of metacognitive skills.
General Audience Summary: As the changing landscape of education provides more freedom and flexibility in the options available to students, it is becoming increasingly important that students be able to successfully evaluate and manage their own learning. This is easier said than done, however, because students often misjudge their own learning of a given topic to be better than it actually is. This common tendency toward overconfidence can be further bolstered by a number of intuitive but misleading factors that enhance students’ subjective impressions of how much they have learned, without always enhancing their actual learning. Students believe, for example, that they learn best from enthusiastic and engaging instructors who provide smooth and well-polished lectures that do not require active class participation. Such factors, although they readily inflate students’ judgments of their own learning, do not consistently enhance students’ actual learning. They also inflate students’ evaluations of the effectiveness of their instructors. Indeed, students’ evaluations of teaching effectiveness can be poor predictors of their actual learning in their courses, and these evaluations can be biased by external factors unrelated to student learning, such as an instructor’s gender, age, attractiveness, and grading leniency. Given the heavy reliance on student evaluations of teaching effectiveness in decisions regarding faculty hiring and promotion, faculty may be incentivized to adopt teaching approaches that boost their evaluations but do not enhance — and could even undermine — students’ academic success.
Despite the relatively small size of the populations involved, both of the following studies highlight disturbing possibilities:
M. Oliver-Hoyo, “Two Groups in the Same Class: Different Grades,” Journal of College Science Teaching v38 n1 (2008) 37-39. [Recounts the experience of one award-winning instructor who taught two sections in the same classroom at the same time and received significantly different evaluations from the two sections. One of the matters about which there was disagreement was how available the instructor was outside of class. All students were informed in the same way about the instructor’s office hours, she kept them, and many students from each section made use of them. There was a significant association between receiving lower grades and rating the instructor as significantly less available during office hours.]
Robert J. Youmans and Benjamin D. Jee, “Fudging the Numbers: Distributing Chocolate Influences Student Evaluations of an Undergraduate Course,” Teaching of Psychology v34 n4 (2007) 245-247.
Abstract: Student evaluations provide important information about teaching effectiveness. Research has shown that student evaluations can be mediated by unintended aspects of a course. In this study, we examined whether an event unrelated to a course would increase student evaluations. Six discussion sections completed course evaluations administered by an independent experimenter. The experimenter offered chocolate to 3 sections [immediately] before they completed the evaluations. Overall, students offered chocolate gave more positive evaluations than students not offered chocolate. This result highlights the need to standardize evaluation procedures to control for the influence of external factors on student evaluations.
See also: Michael Hessler et al., “Availability of [chocolate] cookies during an academic course session affects evaluation of teaching,” Medical Education (2018) DOI: 10.1111/medu.13627, 9 pp.
Concerning student evaluations of teaching, it is also worth noting that the results can easily be unfair even if the evaluations are unbiased, relatively reliable, and valid:
Justin Esarey and Natalie Valdes, “Unbiased, Reliable, and Valid Student Evaluations Can Still Be Unfair,” Assessment & Evaluation in Higher Education (published online 20 Feb 2020) DOI: 10.1080/02602938.2020.1724875
Abstract: Scholarly debate about student evaluations of teaching (SETs) often focuses on whether SETs are valid, reliable and unbiased. In this article, we assume the most optimistic conditions for SETs that are supported by the empirical literature. Specifically, we assume that SETs are moderately correlated with teaching quality (student learning and instructional best practices), highly reliable, and do not systematically discriminate on any instructionally irrelevant basis. We use computational simulation to show that, under ideal circumstances, even careful and judicious use of SETs to assess faculty can produce an unacceptably high error rate: (a) a large difference in SET scores fails to reliably identify the best teacher in a pairwise comparison, and (b) more than a quarter of faculty with evaluations at or below the 20th percentile are above the median in instructional quality. These problems are attributable to imprecision in the relationship between SETs and instructor quality that exists even when they are moderately correlated. Our simulation indicates that evaluating instruction using multiple imperfect measures, including but not limited to SETs, can produce a fairer and more useful result compared to using SETs alone.
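The core of the Esarey and Valdes argument is easy to see in a toy Monte Carlo exercise. The Python sketch below is not the authors’ code and simplifies their design: it assumes true instructor quality is standard normal, draws SET scores that correlate with quality at roughly r = 0.4 (an assumed “moderate” value), and then checks how often the higher-rated instructor in a randomly chosen pair really is the better teacher, and what share of instructors in the bottom fifth of SET scores are nonetheless above the median in true quality.

```python
import numpy as np

# Toy Monte Carlo illustration (not the authors' simulation).
# Assumption: true instructor quality is standard normal, and observed SET
# scores correlate with it at about r = 0.4 ("moderately correlated").
rng = np.random.default_rng(0)
n_faculty = 100_000
r = 0.4

quality = rng.standard_normal(n_faculty)
noise = rng.standard_normal(n_faculty)
sets = r * quality + np.sqrt(1 - r**2) * noise  # SET scores with corr ~ r to quality

# (a) Pairwise comparisons: how often does the instructor with the higher
#     SET score also have the higher true quality?
i = rng.integers(0, n_faculty, size=50_000)
j = rng.integers(0, n_faculty, size=50_000)
correct = ((sets[i] > sets[j]) == (quality[i] > quality[j])).mean()
print(f"higher SET identifies the better teacher: {correct:.1%} of pairs")

# (b) Of faculty at or below the 20th percentile of SET scores,
#     what share are actually above the median in true quality?
low_set = sets <= np.percentile(sets, 20)
above_median = quality > np.median(quality)
print(f"bottom-20% SET but above-median quality: {above_median[low_set].mean():.1%}")
```

Under these assumptions the pairwise comparison is right only about 63% of the time and roughly a quarter or more of the lowest-rated quintile turn out to be above-median teachers, which is the pattern the abstract describes; the exact figures depend entirely on the assumed correlation, and the authors’ own analysis additionally conditions on large SET differences.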
[Despite having been on the faculty of the same institution as Porter and Oliver-Hoyo, I have never met either of them in person. I’ve corresponded briefly with Porter and Bowman. I met Dr. Standish once during an in-person committee meeting.]

