From Imitation to Insight: Why process rather than persuasion will be the key to the Next Turing Tests

Written by Bernardo Villegas Moreno, Drew Calcagno and Noah Broestl


Seventy-five years after Alan Turing proposed the imitation game, the King’s E-Lab, alongside the Leverhulme Centre for the Future of Intelligence, hosted the ‘Next Turing Tests’ conference at King’s College, Cambridge, where Turing studied and later became a Fellow.

The aim was to ask a deceptively simple question: what should we test now?

Across keynotes, panels, and workshops, a common thread emerged: it’s time to move beyond the benchmark of persuasive imitation and towards a multi-faceted and procedure-based set of evaluations which test a model’s ability to learn under constraints, the processes behind behaviour, and the social value of AI systems in the world.

Part One of this write-up outlined how the conversations from the first day of the conference unfolded, touching on insights from experts in behavioural science, the future of work and trust. What follows in this Part Two is our read-out from the second day, when the consensus continued: Turing’s test is not “dead,” but it has become a limiting metaphor.

Drawing on panel sessions with historians, cognitive scientists, ethicists, economists and practitioners, we outline two overarching themes to frame future thinking and identify the five measurement pillars required for the next generation of tests.

Ultimately, to make progress we need tests that examine how systems learn, reason, and impact people and not just whether they can convince us they’re human.

From Pure Imitation with One-Dimensional Evaluation to a Portfolio Approach with Uneven Performance Metrics

The original Turing test conflates deception with intelligence and privileges surface-level performance over mechanism, embodiment, need-sensitivity and social context. Throughout the day, participants repeatedly argued for a shift to a new kind of “portfolio” approach by which multiple complementary tests together produce a more faithful picture of the capability and consequence of AI. This means combining controlled diagnostics with field-grounded evaluations and reporting not only accuracy but also the structure of errors, speed–accuracy trade-offs and sensitivity to perturbations.

Setting up this point in the first presentation of the day, Dermot Turing’s historical arc reminded the room that the “test” was always a provocation. As benchmarks have proliferated, the field’s centre of gravity has drifted towards incremental leaderboard gains, producing a kind of “gamification” that obscures whether systems are actually learning generalisable skills. The lesson from history is to keep our evaluations anchored in substantive questions, not just scores.

Alongside a shift to a collection of substantive measures in place of a singular test, Tom Griffiths introduced a further practical point: models exhibit “jagged” profiles shaped by output probability biases and task frequency effects. Empirically, performance often tracks how likely an output is in training data and how frequently a task type appears. In other words, a model might do well on some tasks and poorly on others, not because it’s “smarter” or “dumber,” but because it is more likely to produce outputs that were common in its training data and because it performs better on types of tasks that it has been trained on more frequently. Reasoning augmentations can rebalance these influences, but they do not eliminate the effects.

Instead of assuming that AI systems are equally good at everything, we should recognise that their abilities are uneven, or “jagged”, and our evaluations should therefore pair tasks that differ only in these dimensions and require teams to disclose reasoning–alignment trade-offs. The goal is not to punish jaggedness, but to map it transparently so decision-makers understand where systems are brittle or overconfident.

Such mapping reframes expectations. Instead of hoping for monolithic “general intelligence,” we document uneven strengths, report open plots and data and match systems to settings that fit their actual profile.
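To make the idea of mapping jaggedness concrete, here is a minimal sketch of our own (not a proposal made at the conference) of how an evaluator might score matched task pairs that differ only in how common the expected output is in training data, reporting the gap per task rather than a single aggregate number. The `evaluate` callable and the task-pair format are hypothetical placeholders.

```python
# Minimal sketch (ours): mapping a "jagged" capability profile by pairing tasks
# that are matched on skill but differ in output probability / task frequency.
# `evaluate` is a hypothetical callable returning accuracy on a list of prompts.

def jaggedness_report(evaluate, task_pairs):
    """task_pairs: list of (task_name, common_variant_prompts, rare_variant_prompts)."""
    report = {}
    for name, common_prompts, rare_prompts in task_pairs:
        acc_common = evaluate(common_prompts)   # outputs frequent in training data
        acc_rare = evaluate(rare_prompts)       # same skill, low-probability outputs
        report[name] = {
            "common": acc_common,
            "rare": acc_rare,
            "gap": acc_common - acc_rare,       # a large gap signals brittleness, not skill
        }
    return report
```

Publishing the full per-task report, rather than a single averaged score, is what lets decision-makers see where a system is brittle.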

These two overarching themes threaded through the day, making a clear case for the frame within which new tests must be set. Beyond them, five key features emerged to guide the design of new tests, setting the agenda for the years to come.

Five Pillars of Measurement for the Next Generation of Turing Tests

Process signatures

Purpose: to collect and publish mechanism-revealing diagnostics, including response-time (RT) distributions, error structures and speed–accuracy profiles.

The next Turing Tests should focus on process-based diagnostics and, in doing so, measure the “how,” not just the “what”. Process-based diagnostics aim to distinguish mechanism families by examining cognitive signatures such as response-time distributions, Stroop-like interference, subitising limits, error correlations and speed–accuracy trade-offs. In other words, such tests help us to determine whether, when a model gets the right answer, it reasoned, retrieved, or guessed. Publishing full distributions and normative human ranges would allow evaluators to detect spurious lookalikes and mechanism spoofing. The upshot is accountability for, and clarity in, the route taken, not just the destination.
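As a rough illustration of what publishing such signatures could look like (our sketch, not a specification discussed at the conference), an evaluator might log per-item latency and correctness and release the full distributions alongside a speed–accuracy summary. The `model_answer` callable and the item format below are assumptions.

```python
import statistics
import time

def process_signature(model_answer, items):
    """Collect per-item response times and correctness for a candidate system.

    `model_answer(prompt)` is a hypothetical callable; `items` is a list of
    (prompt, expected_answer) pairs. Full distributions are returned, not just
    a mean, so error structure and RT shape can be compared to human norms.
    """
    latencies, correct = [], []
    for prompt, expected in items:
        start = time.perf_counter()
        answer = model_answer(prompt)
        latencies.append(time.perf_counter() - start)
        correct.append(answer == expected)
    return {
        "rt_distribution": latencies,                     # publish in full
        "accuracy": sum(correct) / len(correct),
        "rt_median": statistics.median(latencies),
        "speed_accuracy": list(zip(latencies, correct)),  # supports trade-off curves
    }
```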

This matters for deployment. In safety-critical contexts, different mechanisms support different guarantees. A system that “thinks fast” by exploiting frequency priors might excel on common tasks but fail under shift; a system that “thinks slow” might trade fluency for robustness. Only process-based diagnostics make these trade-offs legible.

Learning under constraints

Purpose: to score exploration quality, sample efficiency, and causal discovery in preregistered, out-of-distribution environments with explicit resource limits.

One theme across the day, inspired by developmental psychology, was to test for the kind of curiosity and causal model-building seen in children. Rather than rewarding pattern completion on familiar distributions, child-like tests would measure how agents explore unfamiliar environments, identify what they do not know and intervene to learn efficiently. Suggested metrics for such tests included empowerment gain, intervention information, out-of-distribution transfer and sample efficiency, all under explicit resource constraints so that strategies matter as much as outcomes.

Crucially, to reduce gaming, environments should be preregistered with randomised latent structure and delayed rewards. These tasks ask less “can you imitate an answer?” and more “can you discover what matters here?”. This is a shift which aligns evaluation with genuine scientific curiosity rather than conversational mimicry.
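As a sketch of the simplest version of this, under our own assumptions rather than any design agreed at the conference, an evaluation harness could enforce an explicit interaction budget and score sample efficiency; richer metrics such as empowerment gain or intervention information would need a fuller environment model. The `agent` and `env` interfaces below are hypothetical.

```python
# Minimal sketch (ours) of a budgeted evaluation loop. `env.reset`, `env.observation`,
# `env.step` and `agent.act` are hypothetical interfaces; rewards may be delayed
# and the environment's latent structure randomised, as suggested at the conference.

def constrained_learning_score(agent, env, budget, success_threshold):
    env.reset()
    interactions = 0
    solved_at = None
    while interactions < budget:
        action = agent.act(env.observation())
        reward, done = env.step(action)
        interactions += 1
        if done and reward >= success_threshold:
            solved_at = interactions            # fewer interactions = better sample efficiency
            break
    return {
        "solved": solved_at is not None,
        "interactions_used": solved_at if solved_at is not None else budget,
        "sample_efficiency": (budget - solved_at) / budget if solved_at else 0.0,
    }
```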

Causal model-building

Purpose: to reward agents that identify latent structure, plan interventions and generalise beyond training distributions, not just those that autocomplete.

Most AI models today, especially LLMs, are correlational learners. This means that they excel at “autocomplete”: given a lot of training data, they predict what usually comes next, without necessarily understanding why things happen. A causal model, by contrast, tries to represent how and why things in the world relate. It can answer not just “what happens next?” but also: “What would happen if I changed X?”, “Why did Y happen?” and “What would happen if I did something different next time?”.
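To make the distinction concrete, here is a small toy example of our own (not presented at the conference) using a three-variable structural model in which a confounder Z drives both X and Y, and X also causes Y. Conditioning on an observed value of X gives a different answer from intervening to set X, which is exactly the gap a causal test would probe.

```python
import random

# Toy structural causal model (illustrative only):
#   Z ~ N(0, 1);  X = Z + small noise;  Y = 2*X + Z + small noise
# Observing X = 1 suggests Z is also high, so E[Y | X = 1] is close to 3.
# Intervening to set X = 1 cuts the Z -> X link, so E[Y | do(X = 1)] is close to 2.

def sample(do_x=None):
    z = random.gauss(0, 1)
    x = do_x if do_x is not None else z + random.gauss(0, 0.01)
    y = 2 * x + z + random.gauss(0, 0.01)
    return x, y

draws = [sample() for _ in range(100_000)]
observed = [y for x, y in draws if abs(x - 1.0) < 0.05]         # condition on seeing X near 1
intervened = [sample(do_x=1.0)[1] for _ in range(100_000)]      # force X = 1

print("E[Y | X near 1]  ~", round(sum(observed) / len(observed), 2))      # confounded, ~3
print("E[Y | do(X = 1)] ~", round(sum(intervened) / len(intervened), 2))  # causal effect only, ~2
```

A purely correlational learner answers only the first question; a system with a causal model can answer both.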

A practical proposal for workplaces, for example, is to evaluate systems by how they justify assignment decisions: the HOW, WHO, and WHEN of delegating tasks across humans and machines. A “reverse Turing” allocation test would then require contestability, fairness metrics and regulatory constraints. Instead of asking whether a model can pass as a human, we can then also ask whether its decisions are transparent, auditable, and aligned with organisational values and law.

Social value and safety

Purpose: to include field outcomes, such as well-being, autonomy, persuasion footprint, safety incidents and equity impacts, backed by independent oversight.

Human-like demeanour in AI is not inevitable; it is a design choice, and one with psychosocial consequences. The conference explored both the risks and the potential benefits of this.

On the risk side, themes from the workshops on Day One filtered through into broader social and relational concerns. These included a lack of clarity in ownership and “offboarding” harms when relationships are severed; worrying trends in teen safety; the subtle influence that AI can have on societal beliefs and choices (known as “persuasion footprints”); and the ease with which dependency on AI systems may be normalised as systems become friendlier and more human-like. Social benefits were discussed less often, but emphasis was placed on the alleviation of loneliness and the provision of powerful learning support, though even in these areas the evidence remains mixed and often self-reported.

Given the weight of these risks, the safeguards proposed as the focus of future tests were concrete: age-gating, crisis escalation protocols, persuasion audits, and independent oversight of field deployments. Above all, evaluation should include a “social value readout” that measures impact on well-being, autonomy and equity, not just on task success.
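As a purely illustrative sketch of what a “social value readout” might record per deployment (the field names and example values below are our placeholders, not an agreed standard), the point is simply that field outcomes sit alongside task benchmarks in the published results.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative schema (ours, not a standard) for a per-deployment "social value readout",
# reported alongside accuracy-style benchmarks.

@dataclass
class SocialValueReadout:
    deployment: str
    period: str
    wellbeing_delta: float                  # change on a validated well-being scale
    autonomy_index: float                   # e.g. share of decisions users report as their own
    persuasion_footprint: float             # measured shift in user beliefs or choices
    safety_incidents: int                   # crisis escalations, age-gating failures, etc.
    equity_gaps: List[str] = field(default_factory=list)  # groups with measurably worse outcomes

# Hypothetical example values, for illustration only.
readout = SocialValueReadout(
    deployment="companion-app-pilot", period="2025-Q3",
    wellbeing_delta=0.3, autonomy_index=0.82,
    persuasion_footprint=0.05, safety_incidents=2,
    equity_gaps=["non-native speakers"],
)
```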

There are real stakes to anthropomimesis in AI, and the wariness among the experts in the room spoke loudly of the caution with which we need to proceed in this area.

Energy accountability

Purpose: to report energy per capability, log safety events and test robustness under realistic sensor and compute budgets. 

Providing the impetus for the final measurement pillar, Jonnie Penn highlighted the “metabolism” of AI: energy use, hardware supply chains and incentives that reward value extraction. A clear need emerged to embed energy accountability into evaluation, for example by reporting energy budgets per capability and testing robustness under sensor and compute limits. This links performance to planetary and social costs and expands “fitness” to include sustainability by design.
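A minimal sketch, under our own assumptions, of what “energy per capability” reporting could look like: total measured energy for an evaluation run divided by tasks actually solved, so efficiency gains that come from guessing more are not rewarded. The `run_benchmark` and `measured_energy_joules` hooks are hypothetical; real numbers would come from hardware counters or metered infrastructure.

```python
# Sketch (ours): reporting energy per capability rather than accuracy alone.
# `run_benchmark(tasks)` returns a list of booleans (solved or not) and
# `measured_energy_joules()` returns total energy for the run; both are hypothetical hooks.

def energy_per_capability(run_benchmark, measured_energy_joules, tasks):
    results = run_benchmark(tasks)
    solved = sum(results)
    energy = measured_energy_joules()
    return {
        "tasks_solved": solved,
        "accuracy": solved / len(tasks),
        "energy_joules": energy,
        "joules_per_solved_task": energy / solved if solved else float("inf"),
    }
```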

Why this matters now

Contemporary language models can sometimes persuade more people that they are human than actual humans do: a striking fact that underlines why persuasion-by-chat is a poor proxy for intelligence. The final challenge raised on stage, of whether a system could produce a new scientific synthesis and explain how it got there to satisfy sceptics, only reiterates the need for verifiable processes, not just impressive results. The next generation of tests for AI should reward agents that learn and act well in the world, with accountability for their mechanisms and impacts.

On this 75th anniversary of the imitation game, the most Turing-like thing we can do is to keep asking better questions. The Next Turing Tests are less a single bar to clear than a living protocol: plural, interdisciplinary and oriented towards the futures we actually want to build.

A final note on consciousness: can tests really tell us whether systems are conscious? The agreed stance was ultimately modest: tests can shift beliefs but not settle metaphysics. Participants throughout the day recommended tracking how public language adapts to new system behaviours, while designing measures that operationalise capacities plausibly related to conscious processing, all without overclaiming. Embodiment may matter, but it is not the only path; what’s testable are functional marks, not final answers to age-old questions.


Bernardo Villegas Moreno is currently pursuing his PhD in Human-Inspired Artificial Intelligence at the Center for Human-Inspired AI (CHIA) at the University of Cambridge, part of the Trace Lab, supervised by Dr. Umang Bhatt and co-supervised by Prof. Anna Korhonen. He has a particular interest in the cognitive and social impacts of AI adoption and how culture shapes them, with a research focus on the externalities of human-AI interaction, tracking how AI use influences knowledge transmission, productivity, confidence, equity, and moral reasoning at scale. Bernardo has a background in Sociology (Pontifical Catholic University of Ecuador) and Data Science (University of Edinburgh), enabling him to bridge the gap between disciplines, understand social dynamics and human behaviour, and then design computational systems that respond to these realities.

Drew Calcagno is a researcher focused on evaluating multi-national security deployments of artificial intelligence models and an AI PhD candidate at the University of Cambridge. Previously, Drew managed ML products and frontier policies at Google’s research lab. He’s also a former government official and naval officer, having served at the White House, at the Pentagon, and aboard a forward-deployed warship. At those posts, he wrote national artificial intelligence policy for the Chief Technology Officer of the United States and managed computer vision products for the Office of the Undersecretary of Defense for Intelligence. Drew is a graduate of the University of Oxford as a Rotary Scholar, the University of London - SOAS as a Fulbright Scholar, and the US Naval Academy with distinction.

Noah Broestl is an AI researcher focused on the intersection of technology and human values, addressing the theoretical, philosophical, and technical challenges related to the practical impact of AI systems. He holds a unique dual role as a Partner and Associate Director of Responsible AI at Boston Consulting Group (BCG) and a PhD Candidate at the Center for Human-Inspired AI (CHIA) at the University of Cambridge, where he focuses on the affordances and influence of generative AI systems. At BCG, he advises global organizations on managing the risks associated with AI. Prior to BCG, Noah held many roles across Google, including leading safety evaluations for Google's first generative AI application, Bard, and leading counter-abuse efforts on Google Maps. Noah holds a master's degree (distinction) in Practical Ethics from the University of Oxford, where his thesis addressed the requirements for explanations from AI systems to foster trust and protect personal autonomy. He served as an Intelligence Analyst with the U.S. Air Force.

 