Can you instinctively know the grade of a piece of work?

I have been thinking about how we might persuade colleagues to revisit their initial approaches to assessment at work. It is only once we truly understand the ingrained habits and assumptions that we can begin to encourage a genuine reorientation. I am also conscious that recent education debate is dominated by a so-called ‘progressive’ versus ‘neo-traditionalist’ dichotomy. Believing, as I habitually do, that where two opposed positions are set out the best approaches usually emerge from the grey area between them, I want to better understand teachers’ responses to ‘new’ assessment theories, to critically interrogate them, and to ensure that we are moving in the right direction.

My titular question derives from a discussion with a senior colleague who had been raising questions of assessment with the English and Maths faculties. He mused that introducing new ideas about assessment might be a challenge, given an ingrained belief among staff that they could instinctively feel the grade of a piece of work. This coincided with two separate colleagues asking me, last week, what grade I thought a piece of work deserved: one was a history essay, the other an Extended Project. This suggested to me that such a belief spreads further across the school, and, I would expect, across the profession more widely. You only have to look at lesson activities that offer students grade-based learning objectives for evidence of this.

I wasn’t sure what to make of this proposition. It seemed logical, on the face of it. Could colleagues who have been in the profession for years, if not decades (as most of mine have), not build up a reliable method of applying grade-based descriptors from the sheer volume of exemplars they have seen in their careers? This seems highly probable in the case of Maths, where a difficulty model of progression prevails. Questions in Maths can sit at ‘grade seven’ or ‘grade nine’ skill levels. This is not so in the humanities, or perhaps in English.

Still, surely a teacher could look at enough essays to sense which engage in analysis of sufficient complexity to merit a higher grade than another? My gut instinct is that teachers can do this. In essence, teachers build certain expectations, certain criteria, into their minds and operate some form of comparative judgement. But is it desirable? I’m not sure, but the issues seem to be as follows:

  • Colleagues who believe they can instinctively apply a grade to a piece of work are typically experienced and in positions of middle management. This is not the case, necessarily, for those in their department who are perhaps considerably less experienced. Middle managers need to provide the structures to enable these colleagues to assess work accurately, and ascertain its particular quality relative to the rest of the class, year-group or cohort.
  • However experienced a teaching colleague might be, the new 9-1 specifications are radically different from their predecessors. This might be more true of some subjects than others, but as a ‘Modern World historian’ I can say the new rubrics represent a genuine revolution in both content and assessment structures. Teachers are likely to have mastered the progression model, but they clearly cannot assess work against externally set grades at this stage.
  • For schools and departments to generate data to plan teaching and interventions, summative judgements about students’ work need to be reliably produced. I mean this in the sense that even if every ‘dart thrown’ is wide of the bullseye, each should be wide by the same amount. This is more likely to be the case with work that is assessed by a department rather than by individual teachers, and it is best facilitated by methods such as comparative judgement, rather than teachers awarding grades against an internal gut judgement and/or level-based criteria.
  • Even if you could instinctively suggest that an extended piece of writing was at a particular grade, is this assessment part of a broader set of structures that allows for a skill-level and content-level analysis of a pupil’s knowledge, updated in real time? For the belief that you can give a piece of work a grade to hold, the task in question must be of the sort that matches the final exam. It suggests a teaching approach that is not breaking down the skills and knowledge required and building these up gradually over time. Instead, students are being subjected to terminal exam-style questions, which, as Christodoulou has demonstrated, distorts the teaching process and is inherently inaccurate where teachers begin to ‘teach to the test.’
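The ‘darts’ point can be made concrete with a toy example (all marks invented): a marker who is consistently three marks too harsh still preserves the rank order of the class, while an erratic but unbiased marker does not.

```python
# Hypothetical 'true' marks for five essays (illustrative numbers only).
true_marks = [12, 15, 18, 21, 24]

# Marker A: every dart lands wide by the same amount (3 marks too harsh).
marker_a = [m - 3 for m in true_marks]

# Marker B: unbiased on average, but erratic from essay to essay.
marker_b = [14, 13, 20, 18, 26]

def rank_order(marks):
    """Positions of the essays, best first."""
    return sorted(range(len(marks)), key=lambda i: -marks[i])

# A consistent bias preserves the rank order; scatter scrambles it.
print(rank_order(true_marks) == rank_order(marker_a))  # True
print(rank_order(true_marks) == rank_order(marker_b))  # False
```

The design point is that a consistent departmental bias can be corrected later, but individual scatter cannot.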

I’m also left with a desire to swing at the biggest assumption of all lying beneath the titular question. Why would you ever want to generate a grade for a student at all? Is it even necessary? Questions for those smarter, and more important, than me.



Colourful Comparative Judgement

Further Refinements Using Comparative Judgement

I recently wrote about my experiments this year with Comparative Judgement, which are worth a read here. I spoke of making a further refinement in practice to make it easier to use with my sixth form essays.

Now that there are no longer AS exams to generate reliable data on our students, we have conducted a second wave of formal assessments, as a series of ‘end of year’ exams. A source essay was set for the AQA ‘2M Wars and Welfare’ unit relating to the General Strike, and students’ essays were scanned in to be judged comparatively. Comparative Judgement is supposed to be quick. Dr Chris Wheadon has remarked that it should be possible to make a reliable judgement on which essay is best within thirty seconds. However, we were finding it difficult to do this. There are several points of comparison to make, and in previous rounds of comparing essays it was difficult to determine which essay was best when, for example, one essay had made strong use of own knowledge to evaluate the accuracy of claims made in a source but another had a robust and rigorous dissection of the provenance of the source. Therefore, we decided to mark up the essays by highlighting the following ‘key elements’ that we determined were essential to judging the quality of the essay:

  • Use of precise own knowledge, integrated with the source content.
  • Comments relating to the provenance of the source.
  • Comments relating to the tone of the source & how this affects source utility.


This led to a discussion of how this could practically be used when making a comparison. We determined, first of all, that where essays might initially appear equal, but one did not have even coverage across all three areas, the one with the broader coverage would be judged best. In theory, we would have been doing this anyway, but marking up the essays beforehand made it significantly easier to spot and therefore to judge.
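The tie-break rule we settled on can be sketched in code. This is a minimal sketch; the counts and function names are my own invention, not part of any marking software.

```python
# Highlight counts per essay for the three 'key elements':
# [own knowledge, provenance, tone]. Numbers are invented.

def coverage_breadth(counts):
    """How many of the three key elements an essay addresses at all."""
    return sum(1 for c in counts if c > 0)

def tie_break(essay_a, essay_b):
    """Where two essays look equal overall, prefer the one with
    broader coverage; otherwise fall back on holistic judgement."""
    a, b = coverage_breadth(essay_a), coverage_breadth(essay_b)
    if a != b:
        return "A" if a > b else "B"
    return "either"

# Essay A covers all three elements thinly; Essay B covers one deeply.
print(tie_break([2, 1, 1], [5, 0, 0]))  # A
```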

We were also able to resolve other tensions when making judgements. It became clear that all of our students had deployed an excellent range of knowledge when determining the accuracy and validity of the arguments made by the sources. It was therefore easier to compare precisely the quality of students’ evaluation of the provenance of each source, with a visual guide to which parts of the lengthy essays to read.

The use of colour was therefore valuable in helping us extract some general strengths and weaknesses of essays across the set. The significant points of comparison were easier to spot, and there were things we were looking for when making a judgement which were not coming up, e.g. precise comments on the purpose of the source. It also emerged that we were not highlighting something of great importance to us: arguments sharply judging the value of the source. Our students were excellent at ripping apart the sources, commenting on their accuracy, reliability and so on, but were not using these ideas to craft arguments about the utility of the sources to a historian. In essence, some were not answering the question. This gave rise to a feedback task, where students were invited either to pick out for themselves where these arguments were in their essays, or to re-draft passages to insert such arguments.

Impact on the Students

Students also responded extremely positively to the colour coding of their essays. From initial intrigue about what each of the colours represented, once a key had been discerned, they were alive with discussions about what constituted a quality discussion about the value of the provenance of the source. Immediately, students were responding to comments about the structure and balance of their essays. Without prompting, they were looking across their tables to see if there were strong examples that they could look at, to support them with their own development.

This is certainly what we would want to see from sixth form students. By guiding students towards recognising certain parts of their essays, we had unlocked their potential to act as independent learners. This was in marked contrast to previous attempts at delivering feedback, where I had more generically suggested that they read through their essays themselves and look to reconcile the feedback given with what they had actually written. In over five sides of writing, this isn’t particularly helpful. Instead, individual students were focussing on what made their essays unique. Meanwhile, conversations about how limited students’ discussions of the purpose and provenance of the sources were took on new meaning, as they could very quickly see how far their writing differed from what I was suggesting had been required. As suggested, I was most pleased with students debating what they really should have said. Some students challenged my highlighting. One high-attaining student in particular instantly recognised that she had not discussed the provenance of the two sources at all, but queried whether some of her remarks might have qualified. This led to a meaningful dialogue about why some of her suggestions did not ‘count’ as such. She immediately amended her answer to include some more relevant points, and her understanding of what it means to dissect the provenance of a source was enhanced. The feedback appeared to be doing its job. However, only the next source essay can begin to show how far this assertion is true.


Making Good Progress

How can we put the lessons from Making Good Progress into practice?

I had originally intended this to be a follow-up to my two blogs reviewing the new Robert Peal series of textbooks. However, I think the ideas contained in Daisy Christodoulou’s book demonstrate weaknesses in the design of most schools’ assessment models and deserve far wider application. There has been a refreshing focus on models and theories of assessment in education discourse recently. However, it has only served to depress me that we’re doing it wrong! Time for some optimism, and to start thinking about the next steps towards accurately assessing our pupils’ work.

You would need to read the book in full, of course, to see Daisy’s evidence base and her full analysis of the problems with assessment in schools. I have written a particularly thorough summary of Daisy’s book that I would be keen to discuss with anyone who wishes to get in touch: it is a PowerPoint slide summary of each chapter. However, I would suggest that Daisy’s unique contributions and most important ideas are as follows:

  • Descriptor-led assessments are unreliable in getting an accurate idea of the quality of a piece of work.
  • Assessment grades are supposed to have a ‘shared meaning’. We need to be able to make reliable inferences from assessment grades. This is not the case if we simply aggregate levels applied to work in lessons, or to ‘end of topic’ pieces of work, and then report these aggregate grades. Daisy calls this banking, where students get the credit for learning something in the short-run but we do not know if it has stuck over time. I would suggest this is one of our biggest flaws, as teachers. We test learning too soon, rather than looking for a change in long-term thinking.
  • Summative assessments need to be strategically designed. We cannot use formative and summative assessments for the same task. Instead, we need to design a ‘summative assessment’ as the end goal. The final task, for example a GCSE exam question, needs to be broken down as finely as possible into its constituent knowledge and skill requirements. These then need to be built up over time, and assessed in a formative style, in a fashion that gives students opportunities for deliberate practice, and to attempt particular tasks again.


What Daisy proposes as a solution is an integrated model of assessment. A model which takes into account the differences between formative and summative assessments, and where every assessment is designed with reference to its ultimate purpose. What this looks like would be:

  • Formative assessments which are “specific, frequent, repetitive and recorded as raw marks.”
    • These would be regular tests, likely multiple-choice questions, where all students are supposed to get high marks and marks are unlikely to be recorded. Recording marks starts to blur the lines between formative assessment and summative assessment.
  • Summative assessments which are standard tests taken in standard conditions, sample a large domain and distinguish between pupils. They would also be infrequent: one term of work is not a wide enough domain to reliably assess.
    • For ‘quality model’ subjects, such as English and the Humanities, these can be made particularly reliable through the use of comparative judgement. You could, and should, read more about it here. Daisy also suggests that we should use scaled scores, generated through nationally standardised assessments or comparative judgement. These would have the advantage of providing scores that can be compared across years, and class averages can provide valuable data for evaluating the efficacy of teaching. I must confess that I need to understand the construction of ‘scaled scores’ better before I can meaningfully apply this information to my teaching practice. I would welcome the suggestion of a useful primer.
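For what it is worth, one common construction of a scaled score is a simple linear transform of the raw measures onto a chosen mean and standard deviation. The sketch below assumes that construction and uses invented numbers; it is an illustration, not any exam board’s actual method.

```python
import statistics

def scale_scores(raw, target_mean=100, target_sd=15):
    """Linearly rescale raw measures (e.g. comparative-judgement
    'true scores') so the cohort has a chosen mean and standard
    deviation. The target values here are illustrative only."""
    mu = statistics.mean(raw)
    sigma = statistics.pstdev(raw)
    return [target_mean + target_sd * (x - mu) / sigma for x in raw]

raw_measures = [-1.2, -0.4, 0.1, 0.6, 0.9]  # invented CJ measures
print([round(s) for s in scale_scores(raw_measures)])  # [76, 92, 102, 112, 118]
```

Note that a within-cohort transform like this only supports comparisons across years if the raw measures themselves are anchored to a common scale, which is exactly what national standardisation is for.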


I’m starting to think about how I could meaningfully apply these lessons to a history department. Daisy suggests that the starting point is to have an effective understanding of the progression model. I think this is something that the history teaching community is already strong on, though the model remains contested which is no bad thing. However, the lack of standardisation across the history teaching community means we are unlikely to build up a bank of standardised summative assessments which we could use to meaningfully compare pupils’ work across schools, to diagnose weaknesses with our own students’ performance. This is something for academy chains and the Historical Association to perhaps tackle. I might be wrong, but I think this is something PiXL seem to be doing in Maths, and Dr Chris Wheadon is setting the foundations for in English. This isn’t something that can be designed at the individual department level.

Where teachers can more easily work together is on the construction of a “formative item bank”. This would consist of a series of multiple-choice questions that will expose students’ thinking on a topic, tease out misconceptions, and judge understanding. Invariably, students’ conceptual thinking in history is undermined by a lack of substantive knowledge. Only once teachers undertake this task, which surely must be a collective effort, can we discern the extent to which this style of formative assessment can detect first and second-order knowledge. Some adaptations might be required. We can then integrate this formative assessment with an appropriate model of summative assessments where the power of collective action on the part of history teachers will undoubtedly be even greater.

I shall therefore spend my holidays thinking about, among other things, what the first steps I need to take as a teacher are to develop such a bank of formative material, and how I would need to shape the structure of summative assessments across the various Key Stages. I intend to write more on this subject. I think it is at the very core of ensuring that we maximise the potential of the new knowledge-rich curriculums many are advocating. Of what use is such a curriculum if we do not have an accurate understanding of how far students are grasping its material?


Using Comparative Judgement

Some practical reflections on its use in practice

I was first made aware of Comparative Judgement as a method of assessment last year, through one of David Didau’s informative blogposts. I had always meant to get around to using it, but was put off by a fear of the technology. I have regularly compared scripts when awarding marks, and had on occasion sought to put together some sort of rank order, before being brought back to by my Deputy Head, and fellow A-level history teacher, to mark some Y12 mock essays.

Having had some new, functional photocopiers installed with a scanning function, I was willing to press ahead. I shall outline the process for the uninitiated and then offer a simple evaluation of its value. I’ll probe these thoughts more deeply later in the week.

The Process

  1. Scan in the exam scripts. Really easy if you have a ‘scan to USB’ function on your photocopiers. I’ve become a dab hand at this. You’ll want to use an easy code (like P12 for the twelfth student in 8P) to name the files, rather than perhaps typing in all of their names. Each essay/piece of work needs to be scanned separately. It took me about 15 minutes to scan in 46 sixth form mock scripts.
  2. Upload the scripts to a new task on, which is free to use.
  3. Get judging. It took a Luddite such as myself a little while to find this function. Bizarrely, the web address to access the scripts is located in a section called ‘judges’, but once there you simply click left or right, depending on which script is better in your opinion. Nomoremarking recommends going with your gut and taking less than 30 seconds to make a judgement. In practice, this was true of some Y8 essays I’ve compared, but sixth form essays took an average of three minutes to judge.
  4. The data coming in is easy to read. You are provided with a downloadable readout of the rank order of your pupils. It also comes with an ‘Infit’ score to consider which essays the software is less confident in placing. This is often where you have invited multiple judges, and you have perhaps implicitly disagreed on its value.
  5. Apply some marks. I have been less sure of this. However, I’ve read a selection of essays, found some on the level boundaries, applied marks, and then distributed the marks evenly throughout the levels.
    On essays where the Infit score is above 1.0 (indicating unreliable judgements) we’ve had some really interesting discussions about the merits of the essays, what we should be looking for and then manually awarded marks using an exam board mark scheme. I think it is clearly going to be valuable if you bank scripts from year to year with marks you are confident with, and feed them in – this should save time in awarding marks, if you have essays with firm marks already in the mix.
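Step 5 above, reading a few anchor essays against the mark scheme and spreading marks evenly between them, might be sketched like this. The anchor positions and marks are hypothetical.

```python
def distribute_marks(n_essays, anchors):
    """Interpolate marks across a rank order (position 0 = best)
    from a few anchor essays whose marks were read manually against
    the mark scheme. Anchors here are hypothetical."""
    anchors = sorted(anchors)
    marks = {}
    for (p0, m0), (p1, m1) in zip(anchors, anchors[1:]):
        for p in range(p0, p1 + 1):
            marks[p] = round(m0 + (m1 - m0) * (p - p0) / (p1 - p0))
    return [marks[p] for p in range(n_essays)]

# Best essay anchored at 24 marks, the middle at 16, the weakest at 8.
print(distribute_marks(9, [(0, 24), (4, 16), (8, 8)]))
# [24, 22, 20, 18, 16, 14, 12, 10, 8]
```

The more anchors you have with marks you trust, from banked scripts for example, the less the interpolation has to assume.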

Dare I say it, judging essays has become fun. The clicking gamifies marking and I’m in a scramble to meet my marking quota. We have found that multiplying the number of scripts by three to determine the total number of judgements to be made, and dividing this evenly between the team of markers, works fine. In practice, essays are being compared against others around eight times, and we’re achieving a reliability score of over 0.8, which David Didau says is the goal, and which is in excess of that achieved by national examinations.
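The workload arithmetic described above is simple enough to sketch; the team size in the example is invented.

```python
import math

def judging_plan(n_scripts, n_markers, per_script=3):
    """Total pairwise judgements (scripts x 3, the rule of thumb
    above) and each marker's even share, rounded up."""
    total = n_scripts * per_script
    return total, math.ceil(total / n_markers)

# e.g. the 46 scanned mock scripts split between a team of 4 markers.
print(judging_plan(46, 4))  # (138, 35)
```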

Strengths

  • Marks are awarded with great confidence, and a reliable set of data on the rank order of the class is valuable for a range of accountability measures and for considering further interventions.
  • It’s quick. It doesn’t sound like it, but marking those mocks could easily have consumed eight days at 15-20 minutes an essay. Three minutes an essay, plus scanning, plus determining marks (30 minutes when you have no known marks within the judgement process) is significantly quicker.
  • There is less marking bias, especially if you ask pupils to submit essays with a code (see step one above) rather than naming them.
  • I have thought much more carefully about what I’m really looking for in essays. I think this has already led me to be clearer with my classes about how they need to develop their essays.

Weaknesses

  • It is difficult to overcome the urge to write individualised comments on essays. Students (and SLT?) need to expect feedback where this isn’t the case. This feeds in with Christodoulou’s recent work on separating out formative and summative assessments.
  • Transforming electronically judged essays into generic feedback for pupils requires careful thought. I’m still refining this.
  • Essays that ‘miss the wash’ are troublesome to fit reliably into the process. This is probably more frustrating than the end of the world.
  • Getting an entire team on board might be more difficult than using the software individually. If your marking procedure is out of step with other staff, as a head of department you can still have little confidence in the reliability of the marks generated.


How I intend to develop my use of comparative judgement further

  • Ask students to highlight key areas of the script. This might involve showing the mark scheme, and asking them to pick out the five sentences they most want the examiner to see. This should speed up comparisons. Before I stuck my first essays through the process, I had already put formative comments on them. These were a useful aid in passing comment.
  • Banking essays for next year with secure marks attached to them. This should eliminate significant amounts of time transforming the rank order into marks.
  • Get students to submit work electronically. I am in the midst of getting KS3 to do this with an outcome task to a unit of work. I’m not sure how valuable this will be. Paper, pen and scanning seems to be less hassle, so far.**
  • Learn what this anchoring business is, which seems to be taking comparative judgement to the next level by connecting subsequent pieces of work together. If I get to the bottom of this, I’ll blog on it.


Comparative judgement seems to be a valuable tool for making summative judgements on the quality of pupils’ work. It does not replace marking or feedback, but these should be steps on the road towards a final piece of work; that is where comparative judgement fits in.

Your thoughts and questions are invited.


** Update – I have now discovered that will not accept Word documents. They need to be PDF files, which throws this plan of mine out of the window.

WLFS Conference – Part One

Knowledge is not an end in itself

The West London Free School history conference was an excellent opportunity to discuss a knowledge-rich approach to history teaching. I have often valued a truism presented by Mike Hughes at an INSET session I attended in my NQT year: that any task is only as good as the quality of the dialogue it provokes. It is with this in mind that one should think about this conference. While I may evaluate the ideas presented in a series of blog posts, the conference as a whole was excellent in promoting discussions between history colleagues, expanding the horizons of many of them. Lunchtime* was characterised by colleagues discussing what they could, should and would take back with them to their own departments.

The day started with a curious introduction from head teacher Hywel Jones. It is always good to see a senior leadership team supporting staff in their extracurricular endeavours, and it was clear from the outset that a coherent philosophy is ingrained throughout the school. The message was a strong one: knowledge is valued in this school, and passing on a knowledge-rich curriculum is vital to students’ success. It was left to Christine Counsell to introduce some nuance on the precise role and importance of knowledge in the curriculum.

I do not intend to summarise the contents of Christine’s speech in full. I will, however, point out the key messages that I took away.

  • Christine had an important point to make about humility. I think this is important in the context of the current pedagogical debates taking place through blogs and on Twitter. Many straw men are being set up: heinous schools out there which deny students knowledge. I am not sure how far that is true, not in 2017 at least. Concrete evidence would be welcome. Certain commentators could take heed of this note of humility.
  • Assessment theory has been absent thus far from recent debates on knowledge. So far as it has been discussed, it has been in a limited way that has brought low-stakes testing back into fashion. I have always been keen to remind my own students, particularly those preparing for external assessments, that those tests take a sample of students’ domain knowledge, which needs to be built up over time. The upshot is that we need to be laying the foundations of knowledge, and the tools for manipulating it, over time.
    • Christine helpfully reminded us of ‘timeline tests’ where students need to plot their knowledge, in chronological order, to reach a threshold standard before taking a summative test. These ideas are not new, but a valuable reminder. Whilst submerged in curriculum reform at KS4 and KS5 it is easy to forget such basics.
  • This speech also helpfully distilled recent cognitive psychology on knowledge and its role in learning. Students need ‘fingertip’ knowledge to help support more advanced, second-order thinking (a theme that was not as prevalent during the day as it could have been). Once students have used and deployed that knowledge, they are left with “residue knowledge”. Recent work on knowledge organisers, I think, has taken quite a short-termist approach and my sense was that the WLFS put more emphasis on what “residue knowledge” we want students to have in five years and perhaps ten years after their schooling. It was at this point in Christine’s speech that I reflected on Fordham’s post on what knowledge is cumulatively sufficient; what is the relationship between the little details and the big picture? As a history community, our debates could be infused with more of this thinking and language.
  • The expression “cumulative assessment” was used, but not really elaborated upon. This too is perhaps important. I would contend that we want students’ residual knowledge to be a broad, overarching framework of the past. Students should be able to see the broad arcs of change in history, and analyse those. Frameworks of knowledge** were not discussed, but cumulative assessments seem as though they might be a useful step towards constructing them.


* I live my life through my stomach, and thoroughly enjoyed the lunch offering! An excellent effort by the school’s catering team.

**I use the expression in a Shemiltian sense. See Nick Dennis’ blog for a brief introduction to frameworks of the past.