Last month I participated in the annual Reading of the Advanced Placement US Government exams. In May, over 300,000 students in US high schools took this exam, which consists of sixty multiple-choice questions and four free-response questions, usually in the hope of earning college credit or placement in a higher-level course. The multiple-choice items are easily dealt with, but someone has to score the free-response questions. Enter the AP Readers, a collection of several hundred college professors and high school teachers who voluntarily spend a week sitting in a convention center, scoring essays for eight hours a day, seven days straight. Sound awful? It gets worse: we actually score responses to the same essay question, over and over again. On a good day, I score about 450 essays, all answering the same question.
So why have I put myself through this tedious exercise for nine of the last ten years?
Besides the perks (decent pay, great co-workers, neat locations, and fun evenings), this job has taught me more about reliability and validity than almost any other experience. In turn, this has helped me teach these subjects to my methods students in a way that they really get. They might not feel up to evaluating the Freedom House survey measures, but they do understand the frustration involved in school grades. Exploring an exam and how it is graded (sorry, scored) is something that they already have the practical knowledge and understanding to appreciate, and it helps us home in on what reliability and validity really mean.
To explain, I need to do a deep dive into the process of how we score these exams.
First, you’ll notice that I am deliberately using the word ‘score’. AP exams are not graded. We use extensive rubrics to determine whether students have earned points for various parts of their responses. Each response earns a score ranging from zero to as many as six points, depending on the question. The rubric is designed by experts, who train all of the Readers on how to apply it to essay responses. We spend about four hours training, and then move to individually scoring live exams. No grades are ever given, and every AP Reader knows the frustration of reading an essay by a student who knows what they are talking about, but does not quite hit the required points in the rubric. I have read four-page zeroes and half-page sixes, and my beliefs about the student’s knowledge or effort do not matter: the rubric determines the score. There are a lot of jokes that go around about ‘worshipping the rubric’ and ‘drinking the Kool-Aid’ on the rubric; the truth is that the rubric, for this short week, is the source of all answers, and must be obeyed. Joking about it is simply one way of dealing with the reality that our own opinions do not matter very much.
The rubric, therefore, is our tool of validity. It and only it can determine whether a student has earned points on a given question. Sometimes I disagree with the rubric, quite strongly. When I do, it is usually because I question its content validity: I find it incomplete, failing to award points for some responses that I think are indeed valid. Sometimes the use of one word rather than another is the difference between earning a point and not earning it. Once the powers-that-be approve the rubric, however, it is set and cannot be changed, except in how we interpret it.
This brings us to reliability, which is at the core of what we do at the Reading. The goal is to ensure that, given a set rubric, an essay will receive the score it earns, regardless of who scores it or when. An essay scored early in the day on Day 1 by Reader X should receive the same score as if it were scored by Reader Y at 4:55pm on Day 7. Reliability is increased by three methods. The first, as already mentioned, is hours of group training on the rubric, with plenty of sample essays to score and compare notes on with other Readers. The second is the backchecking process. Throughout the scoring period, the Table Leaders (the individuals in charge of each table of eight to ten Readers) read a random 10-20% of the essays each Reader scores, checking to make sure that the Reader is correctly applying the rubric. Incorrect scores are altered, and the Table Leaders then talk over the misapplied point with the Reader to ensure that they interpret that part of the rubric correctly in the future. Finally, as scores are processed, the Chief Reader and Question Leaders keep tabs on each scorer’s statistics, and if a Reader’s scores are substantially higher or lower than the average across the entire Question, that Reader is flagged for a conversation with their Table Leader to see where the mistakes are being made. Consistently inaccurate Readers are generally not invited back in future years.
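For methods students who think better in code, the third check can be sketched in a few lines of Python. This is only an illustration, not the College Board's actual procedure: the function name, the data layout, and above all the threshold are my own assumptions, since the Reading does not publish what counts as "substantially" higher or lower.

```python
from statistics import mean

def flag_readers(scores_by_reader, threshold=1.0):
    """Return readers whose mean score differs from the question-wide mean
    by more than `threshold` points (the threshold is a made-up value)."""
    # Pool every score on the question to get the overall average.
    all_scores = [s for scores in scores_by_reader.values() for s in scores]
    overall = mean(all_scores)

    flagged = {}
    for reader, scores in scores_by_reader.items():
        diff = mean(scores) - overall
        if abs(diff) > threshold:
            # Positive means the reader scores high, negative means low.
            flagged[reader] = round(diff, 2)
    return flagged

# Invented scores on a six-point question, for illustration only.
example = {
    "reader_x": [2, 3, 3, 4, 2, 3],
    "reader_y": [3, 3, 2, 4, 3, 3],
    "reader_z": [5, 4, 5, 6, 4, 5],
}

print(flag_readers(example))
```

A difference in averages is only a prompt for a conversation, which is how the Reading treats it: a Reader could simply have drawn a stronger stack of essays.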
For students who doubt the importance of reliability and validity, it can be really useful to describe this process and the amount of time, energy, and money that goes into ensuring these two features in the AP exam reading. Some of your students may have taken AP exams, but all of them have wondered about exactly how their academic work has been graded in the past. Talking through this model, and contrasting it with the unstructured and less accountable ways in which their instructors usually grade, helps them see the value of these core concepts of methodology. You can also make the discussion more active by asking your students to score the exams themselves using the same materials the Readers use. The College Board makes all the previous years’ exams available at AP Central. For both the US Government and Politics and the Comparative Government and Politics exams, you can find the free-response questions, scoring rubrics, and sample essays on that site. Choose a question, hand out the rubric and sample essays, and ask the students to score the essays individually and then compare their scores with each other. They will quickly realize that even with the same data and the same measurement device, they come up with radically different scores. It is a very effective way to get students to quickly understand the role that reliability and validity play in measurement.
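If you want to put a number on how much your students disagree, a simple measure is the share of rater pairs that gave an essay exactly the same score, along with the spread of scores on each essay. The sketch below is one hypothetical way to compute both; the function names and the classroom scores are invented for illustration, and it assumes every student scored the same essays in the same order.

```python
from itertools import combinations

def pairwise_agreement(scores_by_rater):
    """Fraction of (rater pair, essay) comparisons with identical scores."""
    matches = 0
    total = 0
    for a, b in combinations(scores_by_rater, 2):
        for score_a, score_b in zip(scores_by_rater[a], scores_by_rater[b]):
            total += 1
            matches += (score_a == score_b)
    return matches / total

def score_ranges(scores_by_rater):
    """Highest minus lowest score given to each essay."""
    per_essay = zip(*scores_by_rater.values())
    return [max(scores) - min(scores) for scores in per_essay]

# Invented scores: three students each score the same four sample essays.
classroom = {
    "student_a": [3, 5, 1, 4],
    "student_b": [3, 4, 1, 2],
    "student_c": [2, 5, 1, 4],
}

print(pairwise_agreement(classroom))  # 0.5
print(score_ranges(classroom))        # [1, 1, 0, 2]
```

Exact agreement is a blunt measure, because it ignores agreement that happens by chance, but it is easy to compute on a whiteboard and makes the reliability problem concrete.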
And if you are interested in becoming a Reader yourself, consider applying. The main criterion is that you have taught the introductory undergraduate college course in US or Comparative Government and Politics in the recent past. There’s also an exam on Research, by the way, for my fellow methods instructors. Send a CV and a syllabus and you can join us next year at the annual Reading.
Hi Amanda, I enjoyed your post. I just thought I’d add a bit of information since I was looking into this recently. The third open-ended question for the 2016 comparative government exam asked students about the difference between correlation and causation. If an instructor were to do the activity you suggested, this question seems ideal. (In fact, I plan to incorporate your recommendation into my course fall semester and will use this question.)