Teaching AI to Play Chess with the Law
Written by CounterClaim Eagle (Julia Volovich)
CounterClaim Eagle is an AI Co-Counsel tool which leverages advanced LLM agents and a risk simulator to challenge opposing arguments with precision and depth. It tests counterclaims against precedent, highlights weaknesses in rival positions, and simulates litigation outcomes using Monte-Carlo methods. By programmatically analysing large volumes of case material, it crafts data-backed rebuttals and delivers them with visual clarity. Built-in safeguards—such as citation-backed responses, code-driven calculations, and programmatic prompting—ensure reliable, scalable performance even on complex or lengthy legal documents.
In competitive chess, a single weak pawn can compromise an entire structure. One miscalculation in a long combination, and the game is lost. The world of high-stakes international arbitration is no different. It is a legal battleground where arguments over billions in corporate investment collide with claims of national sovereignty, and where the outcome can dictate environmental policy and access to essential resources for entire populations. A single weak argument can unravel a meticulously constructed case.
The LLM x Law Hackathon, hosted by King's Entrepreneurship Lab and Stanford CodeX at the University of Cambridge, presented a range of legal challenges to tackle. For me, one stood out: building an AI co-counsel for a high-stakes counterclaim. It offered an ideal testbed to explore whether structured, mathematical thinking could be encoded into an AI to create a genuine strategic partner for legal professionals.
The task was to build a solution for a complex environmental counterclaim, using a dataset provided courtesy of Jus Mundi, a leading legal-tech company offering AI-powered analytics in international law and arbitration. The dataset was vast: nearly 250 MB of dense legal cases, some 100 to 1,000 times more than even the latest, most advanced LLMs can process in a single pass.
The approach, which was named CounterClaim Eagle, was built on a few core principles.
Scaling Precedent Data: What Global Embeddings Can't Do
The first obstacle was the sheer volume of data. A common approach is Retrieval-Augmented Generation (RAG): documents are split into chunks, each chunk is converted into a high-dimensional vector (an embedding), and the most similar chunks are retrieved at query time. This method is fundamentally flawed for legal text. The meaning of a legal clause is highly context-dependent, and that context may sit many pages away; ripping a clause out on its own is like trying to understand a complex novel by reading random, disconnected sentences. It is in fact worse than that, because legal texts often contain addenda that explicitly alter the meaning of other parts of the text.
A more robust technique was needed. This is where experimentation with a novel approach bore fruit: using a large language model as an orchestrator. Instead of chunking, the system programmatically generates a recursive "Table of Contents" for each case file and each decision within it. A large-context model, Google's Gemini 2.5 Flash, which was fully released just last week, is then prompted by the orchestrator to navigate this structure dynamically. It can request specific sections and analyse the connections between them, ensuring the AI has the full and effective context needed to read the law like a lawyer: holistically.
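To make the idea concrete, here is a minimal sketch of such a navigation loop. It is illustrative rather than the actual CounterClaim Eagle code: the `TocNode` structure, the `llm_call` callable, and the REQUEST/ANSWER protocol are assumptions standing in for the real orchestrator and prompts.

```python
from dataclasses import dataclass, field

@dataclass
class TocNode:
    title: str                         # e.g. "Award, Part VI: The Counterclaim"
    span: tuple[int, int]              # character offsets of the section in the source file
    children: list["TocNode"] = field(default_factory=list)

def render_toc(node: TocNode, depth: int = 0) -> str:
    """Flatten the recursive Table of Contents into an outline the model can read."""
    lines = ["  " * depth + f"- {node.title}"]
    for child in node.children:
        lines.append(render_toc(child, depth + 1))
    return "\n".join(lines)

def iter_nodes(node: TocNode):
    yield node
    for child in node.children:
        yield from iter_nodes(child)

def navigate(case_text: str, toc: TocNode, question: str, llm_call, max_steps: int = 8) -> str:
    """Let the model request sections by title until it has enough context to answer."""
    gathered: list[str] = []
    for _ in range(max_steps):
        reply = llm_call(
            f"Question: {question}\n\n"
            f"Table of Contents:\n{render_toc(toc)}\n\n"
            f"Sections read so far:\n{''.join(gathered) or '(none)'}\n\n"
            "Reply 'REQUEST <section title>' to read another section, "
            "or 'ANSWER <your analysis>' once the context is sufficient."
        )
        if reply.startswith("ANSWER"):
            return reply.removeprefix("ANSWER").strip()
        title = reply.removeprefix("REQUEST").strip()
        match = next((n for n in iter_nodes(toc) if n.title == title), None)
        if match is not None:
            start, end = match.span
            gathered.append(f"\n\n## {match.title}\n{case_text[start:end]}")
    # Fall back to answering with whatever has been gathered so far.
    return llm_call(f"Question: {question}\n\nContext:{''.join(gathered)}")
```

The key design choice is that retrieval follows the document's own structure rather than vector similarity, so an addendum that modifies an earlier clause can be pulled in alongside it.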
This initial stage takes several minutes to execute and makes hundreds of LLM calls, dozens of them in parallel. As is common with new approaches, not everything worked on the first try. Fortunately, Google Cloud generously provided LLM inference credits and request quotas, which enabled solid experimentation.
Sharpening Reasoning: Beyond Nash Equilibrium
Once the AI could read, it needed to learn how to reason. The initial chess analogy is powerful, but it has its limits. Breakthroughs like AlphaZero demonstrated that in two-player, zero-sum games such as chess and Go, an AI can achieve superhuman ability through self-play, converging on a near-perfect "minimax equilibrium" strategy, which is a specific type of Nash equilibrium for such games.
But legal reasoning happens in a larger decision space than a zero-sum game. One isn't playing against a perfectly rational machine following fixed rules. The opponent is a human legal team with its own biases, interpretations, and strategies, and the stakes are high. International arbitration is not a game! A "Game Theory Optimal" argument, perfect in a vacuum, might be brittle against an unexpected, "exploitative" line of attack that targets a specific assumption.
Therefore, the AI's reasoning cannot be about finding a single, perfect strategy. It must be adaptive. Last month, DeepMind published an algorithm called AlphaEvolve, which works by having LLMs propose and refine candidate solutions. It has been used to discover novel mathematical results, demonstrating that it can produce reasoning well beyond what was in its training set.
The core of CounterClaim Eagle is an adversarial modelling loop inspired by these ideas. It is an internal process of iterative refinement: the system generates an argument, then prompts itself to find the strongest counter-argument, specifically modelling the types of arguments an opponent might favour. The loop is implemented with an orchestrator architecture in which Gemini 2.5 Pro acts as the 'strategist', tasking the faster Gemini 2.5 Flash with specific analytical 'drills'. This forces a level of logical rigour that single-pass generation cannot achieve, and it simulates the intellectual pressure of a real, adaptive legal debate.
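A rough sketch of what such a loop can look like is below. The `strategist` and `analyst` callables, the prompts, and the fixed number of rounds are illustrative assumptions, not the production implementation.

```python
def adversarial_refine(case_summary: str, strategist, analyst, rounds: int = 3) -> str:
    """Iteratively harden an argument by generating and answering counter-arguments.

    `strategist` and `analyst` are placeholder callables for a slower, stronger model
    (the planner, e.g. Gemini 2.5 Pro) and a faster model used for focused 'drills'
    (e.g. Gemini 2.5 Flash).
    """
    argument = strategist(
        f"Draft the strongest counterclaim argument for this case:\n{case_summary}"
    )
    for _ in range(rounds):
        # The strategist plays opposing counsel: find the sharpest rebuttal.
        attack = strategist(
            "You are opposing counsel. Identify the single weakest assumption in this "
            f"argument and attack it:\n{argument}"
        )
        # The faster model runs a focused 'drill' against precedent for that weakness.
        evidence = analyst(
            f"Find precedent and facts that address this attack:\n{attack}\n\n"
            f"Case summary:\n{case_summary}"
        )
        # The strategist revises the argument in light of the attack and the evidence.
        argument = strategist(
            "Revise the argument so that it withstands the attack.\n\n"
            f"Argument:\n{argument}\n\nAttack:\n{attack}\n\nSupporting evidence:\n{evidence}"
        )
    return argument
```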
Quantifying Risks with Monte Carlo Simulation
Legal strategy is fraught with uncertainty. Lawyers often rely on intuition to gauge risk, but human minds are notoriously poor at reasoning about probability. This is where mathematics offers a powerful tool: the Monte Carlo simulation. Named after the casino in Monaco, the method maps out a distribution of possible outcomes by running a scenario thousands of times under randomly sampled assumptions.
In CounterClaim Eagle, the LLM's role is not to perform the calculation itself, a task at which language models are weak. Instead, it acts as the bridge between qualitative legal argument and quantitative analysis. It reads the case facts, identifies the key variables, and uses them to populate the parameters of a programmatic risk simulator. The lawyer can then "play" with these assumptions, instantly seeing a visual representation of how changing one variable affects the probability of success. It transforms vague risk into a tangible, explorable landscape.
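As a concrete illustration, here is a minimal Monte Carlo sketch of the kind of simulator the LLM might parametrise. The variable names (`p_jurisdiction`, `p_liability`, the damage bounds) and the uniform damages distribution are simplifying assumptions, not the actual model used in CounterClaim Eagle.

```python
import numpy as np

def simulate_outcomes(params: dict, n_runs: int = 10_000, seed: int = 0) -> dict:
    """Run a simple Monte Carlo over a few case variables extracted by the LLM."""
    rng = np.random.default_rng(seed)
    # Sample each uncertain step of the argument chain independently.
    jurisdiction = rng.random(n_runs) < params["p_jurisdiction"]
    liability = rng.random(n_runs) < params["p_liability"]
    damages = rng.uniform(params["damages_low"], params["damages_high"], n_runs)
    awarded = np.where(jurisdiction & liability, damages, 0.0)
    return {
        "p_success": float((awarded > 0).mean()),
        "expected_award": float(awarded.mean()),
        "award_p5": float(np.percentile(awarded, 5)),
        "award_p95": float(np.percentile(awarded, 95)),
    }

# The lawyer "plays" with an assumption: lower the liability estimate and re-run.
baseline = simulate_outcomes({"p_jurisdiction": 0.80, "p_liability": 0.55,
                              "damages_low": 40e6, "damages_high": 120e6})
pessimistic = simulate_outcomes({"p_jurisdiction": 0.80, "p_liability": 0.35,
                                 "damages_low": 40e6, "damages_high": 120e6})
print(baseline["p_success"], pessimistic["p_success"])
```

Because thousands of scenarios are sampled, the output is a distribution rather than a single verdict, which is what makes the risk landscape explorable.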
Wrapping it up for the End User Experience
The system runs in two stages. Conceptually, it mirrors the workflow of a world-class legal team, separating deep preparation from real-time strategic work.
- Deep Analysis: First, the system performs a computationally intensive, one-time analysis of the case, evaluating it against the entire dataset of precedent cases. This is analogous to a team of junior counsel spending days reading every document, cross-referencing precedents, and preparing a foundational briefing. This stage, which takes about 10 minutes to complete, uses hundreds of LLM calls, dozens of them in parallel (a sketch of this fan-out appears after this list), to produce a structured knowledge database specific to the case, complete with citations, related precedents, and argument chains.
- Interactive Strategy Session: With this deep analysis complete, the system transforms into a real-time strategic partner. The lawyer is presented with a suite of interactive tools: a detailed report, a Case Law Matrix for visual comparisons, and a case-specialised chatbot for quick queries. Most importantly, the Causation Analysis and Strategy Dashboard provide a visual graph of the argument structure, highlighting potential points of failure identified through the adversarial process. The Risk Simulator then allows the lawyer to "play" with variables and immediately understand the probabilistic outcomes of different strategic choices.
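For the deep-analysis stage, the parallel fan-out might look roughly like the sketch below, assuming an asynchronous `llm_call_async` client and a flat list of precedent identifiers; both are placeholders rather than the real pipeline.

```python
import asyncio

async def analyse_precedent(case_id: str, target_summary: str, llm_call_async) -> dict:
    """One 'junior counsel' task: compare a single precedent against the live case."""
    analysis = await llm_call_async(
        f"Compare precedent {case_id} with the following counterclaim and list the "
        f"relevant holdings with citations:\n{target_summary}"
    )
    return {"case_id": case_id, "analysis": analysis}

async def deep_analysis(precedent_ids: list[str], target_summary: str,
                        llm_call_async, max_parallel: int = 20) -> list[dict]:
    """Fan out hundreds of calls, a few dozen at a time, into a structured knowledge base."""
    semaphore = asyncio.Semaphore(max_parallel)

    async def bounded(case_id: str) -> dict:
        async with semaphore:
            return await analyse_precedent(case_id, target_summary, llm_call_async)

    return await asyncio.gather(*(bounded(cid) for cid in precedent_ids))

# Typically invoked once per case, e.g.:
# knowledge_base = asyncio.run(deep_analysis(ids, summary, llm_call_async))
```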
The Experience
Winning first prize was a tremendous honour. The Hackathon was an incredible opportunity to explore applications of some of the most recent advances in generative AI systems to a problem space with profound societal and financial impact.
I am deeply grateful to Jus Mundi for crafting a challenge equipped with a real-world, large-scale dataset. It was this data that truly allowed for a meaningful test of how game theory and risk quantification can guide large language models. I am very grateful to Julia Zeidan of Jus Mundi for helping me understand the problem space and the data format, and for her invaluable advice on real-world legal challenges.
My sincere thanks also go to the organisers, King's Entrepreneurship Lab and Stanford CodeX, and in particular Zarja Hude, for creating such an intellectually stimulating event. I am also grateful to Mahesh Yadav for helping me structure the post-Hackathon pitch for the LegalTech conference and for helping me think about how to turn this project into a real product.
This project was an intense, rewarding sprint. The potential for these techniques is vast, and I look forward to continuing to explore this space.
Julia Volovich is a mathematics undergraduate at Trinity College, University of Cambridge. She is interested in game theory and probability, large-scale machine learning, and mathematical modelling of risk. Julia loves programming and watching loss curves go down.