Large Language Models Are Disrupting Both Sides of Standardised Testing. That Changes Everything.

Standardised testing has operated on a stable assumption for over a century: humans write the questions, humans answer them, and the gap between a test-taker’s preparation and the exam’s difficulty is what produces a meaningful score. Large language models have broken that assumption on both ends. They can now generate test items that are psychometrically indistinguishable from expert-written ones. They can also answer those items well enough to pass most professional and academic examinations. The implications for educational measurement are not incremental—they are structural.

When AI Writes the Test

Automated item generation is not new. Testing organisations have used template-based systems for decades to produce question variants at scale. But the arrival of LLMs has fundamentally changed what is possible. A 2025 large-scale field study spanning 91 university classes and nearly 1,700 students found that AI-generated exam questions—produced through iterative cycles of LLM generation, critique, and refinement—performed comparably to expert-created items on Item Response Theory analyses. The difficulty distributions were statistically similar. The discrimination parameters were in range. The items worked.
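For readers unfamiliar with the machinery behind those comparisons, the two-parameter logistic (2PL) model at the heart of most IRT analyses is compact enough to sketch. The Python snippet below is a minimal illustration, not the study's pipeline: the item parameters are invented, and a real calibration would estimate them from response data rather than hard-code them.

```python
import numpy as np

def p_correct(theta, a, b):
    """2PL model: probability that a test-taker with ability theta
    answers an item with discrimination a and difficulty b correctly."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Invented item parameters for two small pools (illustrative only,
# not estimates from the cited study).
expert_items = {"a": np.array([1.2, 0.9, 1.5]), "b": np.array([-0.4, 0.1, 0.8])}
ai_items     = {"a": np.array([1.1, 1.0, 1.4]), "b": np.array([-0.3, 0.2, 0.7])}

# Crude pool-level comparison of difficulty (b) and discrimination (a).
for name, pool in (("expert", expert_items), ("ai", ai_items)):
    print(f"{name:6s} mean b = {pool['b'].mean():+.2f}, mean a = {pool['a'].mean():.2f}")

# Item characteristic curve for one item on an ability grid, e.g. to
# overlay curves from both pools and check whether they align.
thetas = np.linspace(-3, 3, 7)
print(np.round(p_correct(thetas, expert_items["a"][0], expert_items["b"][0]), 3))
```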

A separate Spring 2025 case study at Franklin University Switzerland tested a human-in-the-loop framework in which instructors collaborated with ChatGPT to generate parallel exam variants. The researchers found that combining psychometric rigour with LLM flexibility produced items that maintained equivalence in difficulty while providing the uniqueness needed to deter cheating, a practical concern that has only intensified as students themselves have gained access to the same tools.

When AI Takes the Test

The other side of the equation is more unsettling. Research from Virginia Tech published in 2025 concluded that OpenAI’s reasoning models have effectively rendered unproctored online assessments unreliable. The models perform well on personality tests, situational judgement tests, verbal ability assessments, and quantitative reasoning—the last of which had previously been considered a safe haven from AI-assisted cheating. In a University of Reading experiment, 94 per cent of AI-written exam submissions went entirely undetected by human markers.

Survey data from 2023 found that nearly one-third of students admitted to using ChatGPT for coursework. By 2024, the UK recorded 7,000 formal cases of AI-assisted cheating at universities, triple the previous year's total. Educational Testing Service (ETS), which administers the GRE, has responded by shortening that exam to under two hours, adopting a multistage adaptive algorithm, and investing in AI-driven identity verification systems such as ENTRUST for its TOEFL platform. But these are defensive measures against an offensive capability that is advancing faster than detection methods can follow.

The Measurement Problem

For researchers in computational education and psychometrics, the deeper issue is not cheating—it is construct validity. If a language model can achieve a high score on an exam designed to measure human reasoning, the question is whether the exam is actually measuring reasoning, or something else entirely. LLMs do not reason the way humans do. They predict token sequences. When those predictions produce correct answers on assessments calibrated for human cognition, it suggests the items may be testing pattern recognition more than the constructs they claim to assess.

This creates an unusual research opportunity. By analysing which items LLMs solve easily and which they fail, researchers can identify questions that genuinely require human-specific cognitive processes: working memory under time pressure, contextual inference from lived experience, or spatial reasoning grounded in physical interaction. Publicly available practice test questions and answers offer an accessible window into the kinds of standardised items currently in circulation, and comparing human performance patterns on such items against LLM outputs is an active line of investigation in educational AI research.
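One concrete way to run that comparison is classical item analysis: score humans and LLM runs on the same items and contrast per-item proportion correct. The sketch below uses randomly generated response matrices as placeholders for real scored data, so the specific numbers mean nothing; only the shape of the analysis is the point.

```python
import numpy as np

# Stand-in 0/1 response matrices: rows are respondents, columns are items.
# Real data would come from scored answers on a shared set of practice items.
rng = np.random.default_rng(0)
human_responses = rng.integers(0, 2, size=(200, 40))  # 200 students, 40 items
llm_responses = rng.integers(0, 2, size=(10, 40))     # 10 independent LLM runs

# Classical per-item difficulty: proportion correct in each group.
human_p = human_responses.mean(axis=0)
llm_p = llm_responses.mean(axis=0)

# Items where humans outperform the model by the widest margin are the
# most interesting candidates for "human-specific" constructs.
gap = human_p - llm_p
top = np.argsort(gap)[::-1][:5]
for item in top:
    print(f"item {item:2d}: human p = {human_p[item]:.2f}, LLM p = {llm_p[item]:.2f}")
```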

What Comes Next

The testing industry is moving toward adaptive assessment with real-time item calibration, where difficulty adjusts dynamically based on a test-taker’s responses. Some researchers have proposed incorporating LLMs directly into testing—not as threats to defend against, but as tools integrated into new assessment paradigms that measure how effectively a person can collaborate with AI rather than compete against it.
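In computerised adaptive testing, the usual selection rule is maximum Fisher information: after each response, re-estimate the test-taker's ability and administer the unseen item that is most informative at that estimate. The sketch below assumes a pre-calibrated 2PL item bank with invented parameters and omits the ability re-estimation step, exposure control, and content balancing that production systems layer on top.

```python
import numpy as np

def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta:
    I(theta) = a**2 * P(theta) * (1 - P(theta))."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a**2 * p * (1.0 - p)

# Invented, pre-calibrated item bank (discrimination a, difficulty b).
a = np.array([0.8, 1.2, 1.5, 1.0, 2.0])
b = np.array([-1.0, -0.3, 0.0, 0.6, 1.2])

administered = [2]   # items already shown in this session
theta_hat = 0.4      # provisional ability estimate after those items

# Pick the unadministered item that is most informative at theta_hat.
info = item_information(theta_hat, a, b)
info[administered] = -np.inf
next_item = int(np.argmax(info))
print(f"next item: {next_item} (information {info.max():.3f})")
```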

Whether the field moves toward AI-native assessment design or retreats to proctored, controlled environments, one thing is clear: the century-old model of fixed-item, fixed-time, unsupervised testing is no longer viable. The models that can write the test and take the test have made sure of that. What replaces it will be determined by the researchers willing to treat this not as a crisis of integrity, but as a fundamental measurement problem worth solving.
