Why Testing Matters More Than Marketing
Any AI company can publish accuracy numbers. The harder, more useful question is: accurate on what? Under what conditions? Marked against which criteria, by whose standard, on which subjects?
Before GradeDrive was made available to teachers, we ran it through extensive testing on the subjects and exam formats that secondary school and sixth form teachers actually use. Not curated examples designed to perform well. Real papers, real student handwriting, real mark schemes from the three major UK exam boards — AQA, OCR, and Edexcel — across GCSE and A-level STEM.
This post describes what that testing involved, what it revealed, and why the results continue to improve as more teachers use the platform.
What We Tested
STEM subjects present the hardest test case for AI marking. They combine handwritten responses with mathematical notation, diagrams, multi-step calculations, and mark schemes that require contextual interpretation — not just string matching.
Testing covered GCSE Biology, Chemistry, and Physics across all three major exam boards. At A-level, testing included Biology, Chemistry, Physics, and Mathematics, again across AQA, OCR, and Edexcel papers. Within each subject, we tested across different question types: short-answer factual recall, structured questions with part marks, extended response questions, and calculation-based questions requiring shown working.
The three exam boards each have distinct mark scheme conventions. AQA mark schemes for science tend to use point-based marking with explicit alternative answers; OCR schemes frequently include "credit worthy" alternatives that require interpretive judgement; Edexcel uses a mixture of point-based and levels-of-response marking, particularly for extended writing. GradeDrive was tested against all three approaches without any modification to how papers were uploaded — the same workflow, the same interface, applied to each board's format.
What STEM Papers Actually Demand
It is worth being precise about why STEM is a meaningful test of any marking system.
A GCSE Chemistry paper might include a student writing out a balanced equation by hand, with the coefficients placed above or to the side of formulae in ways that differ from how they would appear in typed text. A Physics calculation might show five lines of working, only the last of which contains the answer that earns the mark — with the preceding lines providing the context that makes the answer interpretable. A Biology extended response might span half a page of handwritten prose, mixing correct scientific terminology with informal phrasing, and earn marks at specific points within a continuous argument.
None of these are edge cases. They are the normal texture of GCSE and A-level STEM responses. A marking system that handles them inconsistently is not a reliable assistant — it is a source of additional work, because every unexpected result requires investigation.
GradeDrive's system reads mathematical notation in context, interprets multi-step calculations with reference to the method marks available, and handles handwritten responses across the full range of secondary school legibility. Testing identified specific question types that required refinement — long calculation chains where the AI needed to apply error-carried-forward logic, and extended responses where mark scheme wording was ambiguous enough to require clarification from the teacher before processing. Both were addressed in development before the platform was opened to general use.
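To make error-carried-forward concrete, here is a minimal Python sketch. It is an illustration only, not GradeDrive's implementation: the function name `mark_with_ecf`, the two-step question, and the tolerance are all assumptions. The idea is that each step is checked against what follows from the student's own previous value, so one early slip costs one mark rather than every mark after it.

```python
import math
from typing import Callable, List


def mark_with_ecf(
    given: float,
    steps: List[Callable[[float], float]],
    student_values: List[float],
    rel_tol: float = 1e-3,
) -> List[int]:
    """Award one mark per step, carrying the student's own value forward.

    Each step maps the previous value to the expected next value. A step
    earns its mark if the student's value matches what the step produces
    from the student's previous value, even if that value was itself wrong.
    """
    marks = []
    previous = given
    for step, value in zip(steps, student_values):
        expected = step(previous)
        marks.append(1 if math.isclose(value, expected, rel_tol=rel_tol) else 0)
        previous = value  # carry the student's value forward, right or wrong
    return marks


# Hypothetical question: a 2 kg mass accelerates at 3 m/s^2.
# Step 1: force F = m * a. Step 2: work done over 5 m, W = F * d.
steps = [lambda m: m * 3.0, lambda f: f * 5.0]

# The student miscalculates the force as 5 N, then multiplies correctly.
ecf_marks = mark_with_ecf(2.0, steps, [5.0, 25.0])
```

Here `ecf_marks` is `[0, 1]`: the force mark is lost, but the work-done mark is still awarded because the method was applied correctly to the student's own figure.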
Calibrated by Real Teachers
The most important element of GradeDrive's accuracy is not the AI model itself. It is the calibration process that shapes how the AI applies mark schemes in practice.
Every mark scheme contains interpretive decisions that are not made explicit in the document itself. An AQA Biology mark scheme might accept "breaks hydrogen bonds" as equivalent to a more formal molecular description. An Edexcel Maths scheme might award a method mark for a particular sequence of steps, even if the final answer is wrong. These judgements are made by experienced examiners, and then passed down to teachers who apply them thousands of times across a marking season until they become intuitive.
GradeDrive's system is calibrated against these real teacher judgements. Before processing a new set of papers, teachers can provide guidance on specific questions — flagging where mark scheme wording is ambiguous, specifying whether particular alternative phrasings should be credited, or identifying question types where the AI's interpretation needs adjustment. This calibration information informs how GradeDrive processes the rest of the set.
The result is a system that improves with use. The first time a teacher uploads A-level Chemistry papers from a particular exam board, the AI applies its trained understanding of that board's conventions. The second time, it also applies anything the teacher clarified in the first round. Over time, the system builds a working model of how that teacher, for that subject, for that mark scheme, makes marking decisions — which is exactly how consistency is achieved in human marking too.
What Testing Revealed About Exam Board Differences
One of the more practically useful findings from testing was how significantly mark scheme conventions vary between exam boards — and how much that variation matters for AI accuracy.
OCR A-level Biology, for example, uses extended mark schemes with "indicative content" sections that list acceptable points without specifying which are required. A student might write three of the six listed points, and whether those three earn full marks depends on whether they cover the right conceptual ground — a judgement that requires understanding what each point is actually testing, not just pattern-matching against a list.
AQA GCSE Physics calculation questions, by contrast, are often highly structured: the mark scheme specifies the formula, the substitution, the rearrangement, and the final answer, each as a separate marking point. This is easier for an AI system to apply consistently, but it still requires accurate reading of multi-step working shown in different formats.
Edexcel papers frequently include data-interpretation questions where students must extract a value from a graph or table, use it in a calculation, and comment on what the result means. The marking spans three different cognitive operations — reading, calculating, evaluating — each with its own mark scheme criteria.
GradeDrive was tested and refined on all three approaches. The platform does not treat every exam board's papers the same way; it applies the conventions appropriate to the mark scheme it is working with.
Accuracy, Oversight, and the Teacher's Role
It would be misleading to claim that AI marking achieves human-level accuracy on every question type in every subject. The current state of the technology is more nuanced than that — and the honest picture is also more useful.
On short-answer and calculation questions with unambiguous mark schemes, GradeDrive's agreement rate with experienced human markers is consistently high. On extended response questions, and on questions where mark scheme wording allows for genuine interpretive variation, the system's suggestions require more frequent review and override by the teacher.
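One simple way to define an "agreement rate" is the share of responses on which the AI's suggested mark exactly matches the human marker's mark, reported separately for each question type. The sketch below assumes that definition. The function name, the record layout, and the sample records are invented for illustration and are not GradeDrive test data.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def agreement_by_type(records: Iterable[Tuple[str, int, int]]) -> Dict[str, float]:
    """Return the exact-agreement rate per question type.

    Each record is (question_type, ai_mark, human_mark).
    """
    matches = defaultdict(int)
    totals = defaultdict(int)
    for question_type, ai_mark, human_mark in records:
        totals[question_type] += 1
        if ai_mark == human_mark:
            matches[question_type] += 1
    return {qt: matches[qt] / totals[qt] for qt in totals}


# Made-up records, for illustration only.
sample = [
    ("calculation", 3, 3),
    ("calculation", 2, 2),
    ("calculation", 1, 2),
    ("calculation", 4, 4),
    ("extended", 4, 5),
    ("extended", 6, 6),
]
rates = agreement_by_type(sample)
```

With these invented records, `rates` is `{"calculation": 0.75, "extended": 0.5}`. Splitting by question type matters because a single overall figure would hide exactly the gap between calculation and extended-response questions described above.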
This is why GradeDrive is designed as a marking assistant, not a marking replacement. Every AI-suggested mark is presented to the teacher for review before it is finalised. The interface shows the student's response, the mark scheme criteria, and the AI's reasoning side by side — so the teacher is not simply approving a number but reviewing a marking decision with full context.
The efficiency gain comes from the fact that the majority of marking decisions — on the straightforward questions that make up most of a paper — are made correctly the first time and can be confirmed quickly. The teacher's attention is freed to focus on the responses that genuinely require professional judgement, rather than being spread equally across every question regardless of complexity.
Always Improving
GradeDrive is under active development. The AI models that power the marking system are updated regularly, incorporating advances in large language model capabilities, improvements to handwriting recognition, and refinements in mathematical notation processing.
Each update is tested against the same STEM benchmark sets used in the original evaluation, to ensure that improvements in one area do not introduce regressions in another. When the underlying models improve — and they are improving quickly — GradeDrive's marking accuracy improves with them.
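A regression check of this kind can be sketched in a few lines of Python. This is an assumed shape, not a description of GradeDrive's actual tooling: the function name `find_regressions`, the benchmark set names, the tolerance, and the scores are all hypothetical.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.01,
) -> List[str]:
    """List benchmark sets where the candidate scores worse than baseline.

    A set counts as regressed if its score falls by more than `tolerance`,
    or if the candidate has no score for it at all.
    """
    regressed = []
    for name, old_score in baseline.items():
        new_score = candidate.get(name)
        if new_score is None or old_score - new_score > tolerance:
            regressed.append(name)
    return sorted(regressed)


# Hypothetical scores, for illustration only.
baseline = {"gcse_physics": 0.90, "alevel_chemistry": 0.85, "alevel_maths": 0.88}
candidate = {"gcse_physics": 0.93, "alevel_chemistry": 0.80, "alevel_maths": 0.88}
regressions = find_regressions(baseline, candidate)
```

In this made-up example the candidate improves on GCSE Physics but drops on A-level Chemistry, so `regressions` is `["alevel_chemistry"]` and the update would be held back until the drop is understood.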
Teacher feedback is also a formal part of the development loop. When teachers override the AI's suggested mark, that correction is recorded (with identifying information removed) and used to identify patterns where the system's training can be improved. The most common override types (particular question formats, specific mark scheme phrasings, edge-case response types) become the focus of the next round of calibration work.
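As a rough illustration of how override patterns might be surfaced, the sketch below counts anonymised overrides by exam board and question format and returns the most frequent combinations. The grouping key, the function name, and the sample data are assumptions made for this example.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_override_patterns(
    overrides: Iterable[Tuple[str, str]], n: int = 2
) -> List[Tuple[Tuple[str, str], int]]:
    """Count overrides by (exam board, question format); return the top n."""
    return Counter(overrides).most_common(n)


# Made-up, anonymised override records, for illustration only.
overrides = [
    ("OCR", "extended response"),
    ("AQA", "calculation"),
    ("OCR", "extended response"),
    ("Edexcel", "data interpretation"),
    ("OCR", "extended response"),
    ("AQA", "calculation"),
]
top = top_override_patterns(overrides)
```

For this invented sample, the most frequent pattern is OCR extended response with three overrides, followed by AQA calculation with two.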
The system teachers use today is more accurate than the one tested twelve months ago. The one available twelve months from now will be more accurate still. That trajectory is the honest measure of what AI marking technology is becoming — not a snapshot claim about current performance, but a commitment to improvement that is built into how the platform is developed.
Try GradeDrive on your next set of STEM papers — free trial, no credit card required.
GradeDrive Team
The GradeDrive team is made up of educators, engineers, and product designers on a mission to reduce teacher workload through focused AI tools.