How far are you willing to go to secure better exam results for your students?
I’ve been reading around the topic of educational assessment over half term. One of the points of agreement between most people writing on the subject is about the purpose of assessment which, they argue, is to enable the assessor to infer what has been learnt by the student. The literature then understandably goes on to talk a great deal about inference validity and the reliability of various assessment methods.
Beyond these technical discussions, some authors then venture into the question of what use will be made of the data gained. Will the data be used to help the teacher adjust their teaching, to place students in an appropriate class, to inform parents about their child’s progress, or to draw conclusions about the impact of a teacher, teaching team, or a school? The way the data is used is often considered as a separate issue from assessment design and implementation, but of course it is not. The purpose of an assessment – the reason it is being carried out – is almost always known in advance of the assessment taking place; test setting is therefore rarely an unadulterated exercise.
The author who is perhaps most up front about this is Daniel Koretz, in his excellent book ‘Measuring Up’ (2008). Koretz dedicates a chapter to the topic of test inflation, the most concerning example of the way data use corrupts the assessment process.
Koretz, like others before him, reminds us of Campbell’s law: “The more any quantitative social indicator is used for social decision-making, the more apt it will be to distort and corrupt the social processes it is intended to monitor.” He provides numerous non-educational examples of this law in action, including how flight punctuality targets resulted in airlines extending their published flight times, the U.S. Postal Service employing extra staff dedicated to delivering post to the addresses monitored for delivery times, and cardiologists who decline to operate on high-risk patients. Each example illustrates that when we raise the stakes by making a measure the focus of accountability, we warp behaviour. Koretz then presents the evidence relating to how exam scores are inflated once the stakes are raised. Through the use of reference tests, he illustrates how public examinations over-estimate the actual mastery of the domain of knowledge the exam claims to be testing.
But what is the mechanism by which this inflation occurs?
Koretz proposes seven types of test preparation, i.e. behavioural responses by teachers and schools to the pressure, perceived or real, to increase test outcomes. He places these in rough order of the degree to which they undermine valid inferences about an individual’s mastery of a domain of knowledge:
- Working more effectively
- Teaching more
- Working harder
- Reallocation
- Alignment
- Coaching
- Cheating
Arguably, each of the above behaviours is incentivised by high-stakes testing. Which, if any, of these behaviours should concern us?
I suggest that there are at least four reasons why we may not welcome some of these changes in behaviour. First, we may be wary of the workload and psychological impact on either the teacher or student (pressure). Second, we might be concerned if the benefits accrued result in a disbenefit to others in the system (opportunity cost). Third, we may feel uncomfortable if an unfair advantage in the assessment is conferred on one type of student over another (inequity). Fourth – and possibly most importantly – we may be concerned if the behaviour distracts students from seeking to master the whole domain, i.e. they prioritise test performance over learning the subject (undermining learning).
Items 1 to 3 on Koretz’s list are least open to criticism. If examinations, such as KS2 tests, GCSEs or A Levels, incentivise more effective teaching practices, increased teaching times, or harder working students, then our only caution may be to monitor any ill-effects on wellbeing.
‘Reallocation’ (item 4) may also be innocuous. The term is used to mean a reallocation of resources towards teaching content which is valued, this value being indicated by its possible inclusion in the assessment. If this means teachers focus on key concepts rather than peripheral or tangential material, then this is probably a good thing. However, if schools begin to make timetabling decisions whereby the ‘strongest’ teachers are deployed disproportionately in externally examined year groups, we may object on the grounds that other year groups are disadvantaged.
But it is numbers 5 to 7 on Koretz’s list that raise the most concerns because they begin to undermine the validity of inferences about which students have learnt which material.
‘Alignment’ goes beyond reallocation of resources: it means intentionally and disproportionately teaching what is expected to be in the test. This assumes some knowledge of the likely test content. Of course, in external exams, significant efforts are made to avoid teachers knowing what is in the exam. However, this does not prevent teachers spending time analysing past papers and second-guessing what will come up. There is also a finite range of ways that knowledge can be tested given the ‘style’ of exam paper. You will often hear teachers tell students that ‘this is unlikely to come up in the exam’ or ‘this is a common essay question’. This behaviour may be passed off as understandable and mostly harmless. However, it confers advantage on students whose teacher is more experienced, has taught this particular syllabus for longer, or even marks for the exam board. I would be interested to see a study into the outcomes of students taught by lead examiners, but I think I know what it would show.
Once alignment begins to address question style, we are moving towards what Koretz calls ‘coaching’. Coaching takes preparation for an exam one step further by explicitly teaching students the techniques they need to access marks on papers. This is a widespread practice in schools. Since the removal of National Curriculum levels and their replacement in some schools with the use of GCSE grades all the way down to Year 7, explicit teaching of GCSE exam technique can begin before GCSE content is even being studied. However, I would question whether this becomes counterproductive as it assumes that the ability to answer GCSE questions is generic and can be abstracted from the content. An early focus on ‘answering a 6 mark question’, for example, may distract from actually learning what is needed to answer the question. But as students near final exams, the proliferation of coaching (such as the ‘walking, talking mocks’ advocated by the likes of PiXL) suggests that teachers believe that coaching confers benefits. Advocates of such approaches may argue that schools are simply equipping their students with the tools to show what they have learnt. This may be true to an extent, but when significant curriculum time is taken up with coaching exam technique rather than teaching or revising subject content, we should question what is lost.
I do not wish to venture into the murkier world of cheating, but we know it happens. It is at the end of a spectrum of behavioural responses to high stakes testing which range from the beneficial to the questionable. Individually, each of the latter items on Koretz’s list can be justified and played down, but we have good evidence to suggest that, in aggregate, the result is that high stakes testing does not tell us what we think it tells us, either about the mastery of a subject by an individual or the performance of groups or cohorts over time.
But how much does this matter? And how much can we realistically do about it?
Let’s take the second question first. If we are minded to address these issues, we can either lower the stakes of the tests in question, or take steps to further mitigate unintended consequences. I would support a little of both. Take GCSEs, for example. If the intention is to indicate suitability for progression to post-16 destinations, the current system is a very expensive and excessive way of doing it, which could easily be simplified. However, GCSEs serve other purposes too, not least as an indicator of school effectiveness (which they measure poorly). It is this purpose which pushes up the stakes for schools, who inevitably pass this on to students. There is no easy answer to this problem. It would be a retrograde step to make school outcomes less transparent to parents, for example. However, we could do more to communicate what this data does and doesn’t imply about school performance, and be less punitive with schools whose intake loads the dice against them. In the meantime, we might take a firmer stance against some of the least desirable practices, such as schools condoning excessive teaching to the test. We should remember that it wasn’t always like this. The focus on exam results has steadily increased since the 1980s and there have been many undesirable consequences of this. However, neither should we look on the past through rose-tinted glasses. The lack of transparency over school standards allowed poor practices to persist in many parts of the system. There is no sweet spot along this spectrum, but we should seek to correct the system when it veers off centre.
And then there is the question of how much all this matters. I think it does, mainly because we owe it to students to make their experience of schooling and assessment as fair and constructive as possible. This means that assessment should serve the curriculum, not vice versa, and that both should serve the needs of students. This requires that we align the incentives within the system as far as possible towards promoting better teaching and better run schools. For those of us working in schools, Koretz’s analysis should give us pause for thought. To what extent are we undermining our own efforts to impart knowledge and create a fascination with our chosen subject? How do we reconcile our principles with the natural desire to help every student get the best grade possible? How we respond to high-stakes assessment is a test of our priorities.
Having published this, it occurred to me that there is an implicit idea about incentives that may not be clear for those in the room who aren’t economists. So to be more explicit…
It is perfectly rational for teachers and schools to respond to the incentives in the system by engaging in the behaviours Koretz lists (including cheating, if there are no consequences to doing so), and in that sense we should not attribute ‘blame’ to them for doing so. For example, teachers have two interests in mind: their own and their students’. They want their students to get high grades because it helps the students and it reflects well on them. Therefore, if the teacher is in possession of information which would confer benefits to a student’s exam performance, how could we expect them not to convey it?
Similarly, schools are incentivised to ensure students achieve good results, so it is rational that they should employ methods to achieve this, such as encouraging teachers to run extra classes. It becomes irrational only when the cost of doing so outweighs the benefits.
This is important because we are unlikely to significantly change behaviour by appealing to some moral instinct of either teachers or school leaders. That is not to say that they are not curtailing their behaviour already due to an ethical belief, but that trying to achieve wholesale change through the mechanism of moralising is unlikely to be effective. Much more effective would be to change the incentives which motivate them.
In my final paragraph, I play on moral instincts in contradiction to my point above. However, this is really a rhetorical device. My key point is earlier in the paragraph when I say that aligning incentives with the desired outcomes is the way we need to proceed. Those of you who are ethicists rather than economists may well prefer my closing rhetoric, however!