What can we infer from an end of year test?

This is the third blog post in what has turned out to be a trilogy on the theme of ‘What can we infer…?’. Episode 1 ripped into the practice of examining exercise books as a means of establishing how much students have learnt (here). Episode 2 critiqued Ofsted’s approach to lesson observation (here). Episode 3, this post, will be more positive in tone. I hope to set out some thoughts about how end-of-year tests might inform actions to improve teaching and learning.

We have recently introduced end-of-year tests at key stage 3. Our main reason for doing so was a concern that, since the removal of National Curriculum levels, we do not have consistent and reliable information about the attainment of students, and the information we do have has created considerable workload for teachers. We have also been influenced by the work of academics and authors on the subject of assessment, such as Daisy Christodoulou and Professor Rob Coe, and the synthesis of research evidence which points towards the importance of retrieval practice and a knowledge-rich curriculum (Tom Sherrington’s reading list is as good as any place to start if you aren’t already informed about research). We have been convinced that well-designed, summative tests are probably the best way of establishing a reliable estimation of learning gains and that such tests will also focus our minds (and students’ attention) on exactly what knowledge should be learnt.

Designing the assessments

The point of an assessment is to be able to infer something from the outcome. In the case of a summative assessment, which is what an end-of-year test is, we are attempting to infer what has been learnt. This contrasts with a formative assessment where the inference is about what should happen next. These two forms of assessment are often considered to be different things, serving different purposes. However, in reality all assessments can be used formatively. Whilst an end-of-year exam might not perform well as a diagnostic tool (i.e. an assessment designed to work out what students do or don’t understand, and specifically what misconceptions exist), it will produce data which can be used to inform future actions. For example, the outcomes of summative assessments might inform a review of the curriculum and teaching, lead teachers to alter schemes of work or revisit topics, or be recorded and passed on to next year’s teachers as a profile for a class to inform their planning. I will return to these possible inferences and actions later, but firstly I would like to consider how we ensure that the assessment is sufficiently robust to allow such summative and formative conclusions to be drawn.

For an end-of-year test to generate reliable data on the attainment of students, we decided that certain conditions must be met:

  1. There should be a large domain of knowledge which had been taught and would be assessed, which means that the assessment should only happen infrequently. The domain we defined was the curriculum content for the entire academic year.
  2. The assessment should sample in a broad and balanced way from the specified domain.
  3. The exam questions should be skilfully written to allow students a fair chance to show what they know. We undertook a quality assurance exercise for the exam papers to allow for varying skill levels of those writing the papers.
  4. The questions/tasks should be discriminating i.e. allow for the full range of attainment to be assessed such that it is possible to ascertain high, medium and low achievers.
  5. The papers should not be seen by teachers to ensure that they could not teach to the test. In reality, the paper writer was usually one of the teachers so it was difficult to create the conditions for a completely blind test.
  6. The exams should be sat in formal test conditions i.e. all students at the same time to avoid any tipping off their friends, in silence, supervised to prevent cheating, timed and with consistent instructions.
  7. Students would write a candidate number on their paper, not their name, and papers would be distributed randomly among teachers to mark. Anonymous papers would avoid unconscious bias.
  8. Marking should be standardised to reduce inconsistency.
  9. Students should be encouraged to limit revision so that the test would indicate what had been retained from a year’s learning rather than indicating how much preparation they had undertaken.
  10. Results would be expressed as a scaled score to aid comparability.

These requirements were necessary, we felt, but were also problematic. We had underestimated the varying skill levels of those writing tests and the work involved in quality assuring the papers. The logistics of administering the tests, blind-marking papers and standardising marking were challenging. To alleviate some of the burden on teachers we employed invigilators for the tests. We also encouraged subjects to make good use of multiple choice questions and we invested in optical mark-reader technology to process answer sheets and generate analysis of the results. The financial costs involved were therefore not insignificant.

Having completed test weeks for Year 7 and Year 8, we are now in the process of marking and processing the results. Our attention now turns to what, if anything, we can reliably infer from the test data.

Were the tests fit-for-purpose?

Before going any further in drawing inferences from the test results, we have to question whether all the tests were fit-for-purpose. How would we know?

Our first feedback came from the students themselves as they sat the tests. We were looking for whether:

  • Some questions contained mistakes or were difficult for students to answer, even if they appeared to have the right knowledge. Given that exam boards produce papers with the occasional bad question, even after employing ‘expert’ question writers and putting the papers through multiple levels of quality checking, it was no surprise that we might do the same.
  • Some questions took longer to answer than was anticipated and the time spent was disproportionate to the mark allocation.
  • Some papers were too long or too short for the time allocated, meaning that many students either didn’t finish the paper or had lots of time to spare.
  • Some students reported significant difficulty answering virtually any questions (where there were few ‘entry level’ questions) or reported papers which they found ‘too easy’. This would raise questions which we would want to seek answers to when analysing test results.

Fortunately, whilst there were instances of the above, concerns were not widespread. The ‘best’ designed papers appeared to be accessible to all students, discriminating, and possible to complete within the allocated time, and they drew few complaints from students. Most subject papers passed the ‘student test’.

The second source of feedback will come from the test results. Here is what we expect to see from tests which are fit-for-purpose:

  1. The results should be normally distributed. A discriminating paper should contain questions which rank students as one might expect from a cohort of children in a non-selective school. We should see a small number of students at the top end of the mark distribution, a small number at the bottom end and a peak somewhere in the middle. If a different pattern emerges we might question whether the exam was designed well. For example, if there are a number of students achieving high marks, a large number achieving very low marks and virtually none in the middle, we might question the accessibility of the paper for low to middle attaining students.
  2. The normal distribution should not be skewed towards high or low marks, such that the curve gets ‘cut off’ on the right or left. This would indicate that the paper was too hard or too easy.
  3. The question-level analysis should indicate that there were a few ‘very hard’ questions and a few ‘very easy’ questions, with the bulk being somewhere in between.
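These expectations can be checked mechanically once the marks are in. Below is a minimal sketch using only the Python standard library; the function name, thresholds and mark format are illustrative assumptions, not a description of our actual analysis tooling.

```python
# Hypothetical sanity checks on a paper's raw marks (one integer per student).
from statistics import mean, stdev

def distribution_checks(marks, max_mark):
    """Return simple indicators of whether a paper discriminated well."""
    m, s = mean(marks), stdev(marks)
    # Sample skewness: positive => marks bunched at the low end (paper too hard),
    # negative => bunched at the high end (paper too easy), near zero => no 'cut-off'.
    skew = mean(((x - m) / s) ** 3 for x in marks)
    return {
        "mean_pct": 100 * m / max_mark,  # mean mark as a percentage of the maximum
        "skewness": skew,
        "floor_pct": 100 * sum(x <= 0.2 * max_mark for x in marks) / len(marks),   # % scoring 20% or less
        "ceiling_pct": 100 * sum(x >= 0.9 * max_mark for x in marks) / len(marks), # % scoring 90% or more
    }
```

A well-designed paper should show a mid-range mean, skewness near zero, and small floor and ceiling percentages; large values at either extreme would prompt the questions about accessibility and difficulty described above.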

Our initial analysis suggests that there are some sets of results which raise questions about paper design. In these cases, we will need to exercise caution about what inferences we draw from the results.

An alternative hypothesis for anomalous results is that the fault lies not in the test but in the curriculum and/or teaching. For example, if we find papers which have resulted in lots of students getting very low marks it may be that the test is fit for assessing the knowledge in question but that the curriculum was overly ambitious. Alternatively, if the data shows a significant divide between those getting high marks and those getting low marks it may raise questions about whether teaching has been accessible to only a proportion of students, perhaps the high prior attainers, whilst low attainers have fallen significantly behind.

Clearly anomalous data will raise many questions and it will be difficult to judge what has led to unusual patterns of data. What will be important is not to leap to conclusions or apportion blame. Whether we need to change how we assess, how we teach or what we teach, it is better to know that change is needed and to have the opportunity to learn from the data.

Cutting the data

Once we have established confidence in our data, what can we begin to infer from it?

The recency effect

One way of cutting the data might be to look to see how students have performed on topics taught recently compared to those taught earlier in the year. One would expect that topics taught some time ago would not be remembered as well by students. As a school, we are beginning to implement spaced retrieval practice having been very much in a ‘teach it and move on’ mode, particularly at KS3. We therefore expect to detect a recency effect on test scores. If correct, this discovery will inform our discussions around spaced retrieval practice, which might have implications for curriculum planning and assessment practices.
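One way to make this cut concrete: if each question is tagged with the term in which its topic was taught, average question facility (the mean proportion of available marks achieved across the cohort) can be compared term by term. The sketch below assumes a simple tagged-question format; all field names and numbers are invented for illustration.

```python
# Group question facility by the term in which each topic was taught.
from collections import defaultdict
from statistics import mean

def facility_by_term(questions):
    """Average question facility grouped by the term the topic was taught."""
    by_term = defaultdict(list)
    for q in questions:
        by_term[q["term_taught"]].append(q["facility"])
    return {term: mean(vals) for term, vals in by_term.items()}

# Illustrative data only:
questions = [
    {"term_taught": "autumn", "facility": 0.42},
    {"term_taught": "autumn", "facility": 0.48},
    {"term_taught": "spring", "facility": 0.55},
    {"term_taught": "summer", "facility": 0.71},
]
```

A clear gradient from autumn to summer would be consistent with knowledge fading over time, which would support the case for spaced retrieval practice.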

Question level analysis

Analysing the marks question by question, or grouping questions by topic, may indicate patterns of strength or weakness across the cohort or particular classes. This data needs to be treated cautiously as many students achieving low marks on a question might indicate a poorly worded question rather than a lack of knowledge on the part of the students. However, if low marks are achieved in a number of questions on the same concept or topic we may infer that students’ knowledge is not secure. This may lead us to question the time allocated to a topic in the scheme of work or to change how this topic is taught. Insights arising from diagnostic assessments during the year may help us understand what difficulties students are facing in their understanding.
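The cautionary rule above, that a topic should not be condemned on the evidence of a single question, can be expressed as a simple filter. A sketch, with illustrative field names and thresholds:

```python
# Flag a topic only when several of its questions have low facility, so that
# a single badly worded question does not condemn the whole topic.
from collections import defaultdict

def weak_topics(questions, threshold=0.4, min_questions=2):
    """Topics where at least `min_questions` questions fall below `threshold` facility."""
    low = defaultdict(int)
    for q in questions:
        if q["facility"] < threshold:
            low[q["topic"]] += 1
    return sorted(t for t, n in low.items() if n >= min_questions)
```

With a threshold of 0.4, a topic where two or more questions each attracted under 40% of the available marks would be flagged for review; one low-scoring question alone would not.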

Question type analysis

As mentioned before, we made use of multiple choice questions in many tests as an efficient way to test a wide range of subject content, and to reduce the marking load on staff. However, most tests contained questions which required a written response too. One of the limitations of multiple choice questions is that students can often reach the correct answer through mere familiarity with the knowledge tested. For example, in Computing we may want to test whether students know the correct name for types of networks. In multiple choice format, students will see the correct answer (‘topology’) and may choose this as they recognise it as a term which was used by their teacher. On the other hand, an open question asking what term is used is likely to be answered correctly by far fewer students. To get the answer they must have it fairly securely stored in memory, associated with the right concept, and be able to retrieve this knowledge. Arguably, more sophisticated multiple choice formats are not necessarily ‘easier’, such as questions with more than one correct answer or those which require students to work through a problem or logical reasoning to reach the answer. For example, a question asking students to add binary numbers and choose the answer from a list of denary options will require a multi-stage calculation. By categorising question types and slicing the data to show the mark distribution for each, we can make inferences about the relative difficulty of each question type. This will help us design papers which contain the right mix of questions, some at entry level and others which even the highest attainers will find challenging.
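To illustrate the multi-stage reasoning such a question demands, here is the worked calculation behind an item of the kind described; the specific numbers are invented for illustration.

```python
# A binary-addition item: 'What is 1011 + 0110? (answer options given in denary)'
# The student must add in binary, then convert the result to denary to pick an option.
a, b = 0b1011, 0b0110        # 11 and 6 in denary
total = a + b                # binary addition: 1011 + 0110 = 10001
assert bin(total) == "0b10001"
assert total == 17           # 17 in denary: the option the student must select
```

Even in multiple choice format, recognising the right option here requires carrying out the calculation, not merely recognising a familiar term.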

Individuals

Obviously we will want to look at how individual students have performed in the tests. We are conscious of not drawing simplistic or unwarranted conclusions about students based on their score. As mentioned previously, we decided to use scaled scores and have set 100 as the mean, with a range of 70 to 130. Statistically, there must be some significance testing before drawing conclusions about the relative performance of students. What we will look for is whether students fall in the average range of scores, or whether their performance is significantly below or above the cohort. Once this is established, we will look for unexpected results. As Rob Coe states here, an assessment should have the ability to surprise you. We will look for surprises. For example, which students have achieved scores significantly below or above what one might expect from past assessments? Such surprises will raise questions, not provide answers. Was the student unwell when they took the test? Alternatively, does previously strong performance on class work indicate an ability to deliver what is expected whilst failing to retain knowledge in long term memory?
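For illustration, one common scaling convention (an assumption on my part, not necessarily the method we adopted) maps the cohort mean to 100 with 15 points per standard deviation, clamped to the 70 to 130 range; a ‘surprise’ can then be flagged as a score far from a student’s previous scaled score. All names and the surprise margin below are illustrative.

```python
# Convert raw marks to scaled scores: cohort mean -> 100, 15 points per SD,
# clamped to the 70-130 range.
from statistics import mean, stdev

def scaled_scores(raw_marks):
    m, s = mean(raw_marks), stdev(raw_marks)
    return [max(70, min(130, round(100 + 15 * (x - m) / s))) for x in raw_marks]

def is_surprising(scaled, previous_scaled, margin=15):
    """Flag a result more than `margin` points away from a student's past score."""
    return abs(scaled - previous_scaled) > margin
```

Under this convention roughly 95% of students fall between 70 and 130, and a movement of more than one standard deviation relative to past assessments is the kind of ‘surprise’ worth investigating rather than a verdict in itself.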

Class comparison

We teach in predominantly mixed-ability groups therefore some comparison between classes is possible. Assuming the prior attainment profile of each class is broadly similar, one might expect to see a similar distribution of results across classes for the same subject. Significant differences in the normal distribution curve might indicate a teacher or group effect. For example, we have two tutor groups in one year group (who are taught as tutor groups for most subjects) who have a reputation for being more challenging in terms of their behaviour. One might expect to see this effect borne out in test scores for these classes. We may also want to look for signs of varying levels of performance of classes taught by non-specialists, ‘split’ groups (i.e. with more than one teacher) or by inexperienced teachers. This type of analysis has many caveats, not least that correlation does not mean causation; just because a less experienced teacher’s class does much less well in a test it does not necessarily mean that this was because of the experience level of the teacher. As with individual student scores, the data will raise questions, not provide answers. However, patterns across the school may provide a useful insight into the impact of staffing and timetable decisions.

In summary, given end of year tests are usually considered ‘summative’, there are a number of inferences possible which may inform future action. Depending on what the data shows, we may be able to:

  • make adjustments to future tests to increase validity and reliability
  • build in opportunities for spaced retrieval practice in schemes of work where there is evidence of knowledge-fading for earlier topics
  • change teaching approaches and resources to improve understanding of topics where students’ knowledge is less secure
  • build in ‘re-cap’ lessons into the next year’s schemes of learning for the tested cohort where they need to strengthen their knowledge before moving on
  • adjust the time allocation for topics
  • profile classes to inform next year’s teachers
  • flag anomalies with pastoral teams so they can look for patterns across subjects and monitor students going forwards
  • establish estimates of the impact of decisions such as split classes and non-specialist teachers

We are at an early stage in implementing more formal tests and have yet to explore all of the implications. So far, the process has been intriguing, informative and challenging. We will soon be in a position to begin to answer the question ‘What can we infer from an end-of-year test?’

Concluding the trilogy

How do we know that students are learning and making progress? In these blog posts I have explored three sources of information which leadership teams and Ofsted might turn to in order to answer this question: exercise books, lesson observations and test data.

My conclusion is that scrutiny of students’ work and scrutiny of teaching are both highly problematic indicators of learning, progress or the quality of provision. However, well-designed and carefully implemented formal tests may shine a light on learning.

If this conclusion is correct, it casts doubt on Ofsted’s current focus on work scrutiny and observation, and their decision to eschew internal assessment data. I would advocate the exact opposite of their current practice. If Ofsted insist on continuing to judge the quality of teaching and learning, and if they continue to form their judgement on outcomes based on the progress of current students, I would suggest that the most reliable evidence will come from assessment data. However, this assessment data will only be reliable if the school has a robust and defensible approach to assessment. Ofsted should examine closely how schools are assessing students’ knowledge and, if the inferences drawn from these assessments are justified, what they are doing with this information. Assessment data should be held in the highest esteem when judging the quality of provision, and exercise books and observations treated cautiously and as subservient to this superior indicator.

I appreciate that my views are counter to conventional wisdom, but conventional wisdom should always be open to challenge.
