The Metric Trap: Why Co-Evolution Is the Missing Piece in LLM-Driven Discovery
Subtitle: Every EvoLLM system hits the same wall. Biology solved it billions of years ago.
Date: February 2026
In three years, a family of systems combining LLMs with evolutionary algorithms – FunSearch, AlphaEvolve, ELM, ReEvo, Eureka, ShinkaEvolve – has broken a 56-year-old matrix multiplication record, discovered new extremal combinatorics bounds that had stood for two decades, optimized Google’s data center scheduling, and designed reward functions that outperform human experts. The recipe is simple: generate a population of candidate programs, evaluate them against a fitness function, select the best, mutate with an LLM, repeat. Evolution powered by the most capable code generators ever built. This is a new paradigm for automated discovery, and the results are not incremental.
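To make that recipe concrete, here is a minimal sketch of the shared loop. The `fitness` and `llm_mutate` callables are placeholders I'm assuming for illustration, not any particular system's API:

```python
import random
from typing import Callable, List

def evolve(seed_programs: List[str],
           fitness: Callable[[str], float],    # fixed, hand-written metric
           llm_mutate: Callable[[str], str],   # stand-in for an LLM mutation call
           generations: int = 100) -> str:
    """Skeleton of the fixed-fitness EvoLLM loop: evaluate, select, mutate, repeat."""
    population = list(seed_programs)
    size = len(population)
    for _ in range(generations):
        # Evaluate every candidate against the predetermined fitness function.
        population.sort(key=fitness, reverse=True)
        # Keep the top half as parents, discard the rest.
        parents = population[: max(1, size // 2)]
        # Ask the LLM to propose variations on randomly chosen parents.
        children = [llm_mutate(random.choice(parents)) for _ in range(size - len(parents))]
        population = parents + children
    return max(population, key=fitness)
```

Everything interesting about a given system lives inside `llm_mutate` and, crucially, inside `fitness`, which someone has to write by hand before evolution can begin.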
But every single one of these systems shares a limitation so fundamental that it’s hiding in plain sight: they all require someone to hand them the answer key.
Every EvoLLM system, without exception, depends on a fixed, predetermined fitness function – a numeric score that tells the system whether one candidate is better than another. For mathematics, this is fine: you can verify whether a proposed cap set is valid, count its elements, and return a number. For code optimization, you can run benchmarks. For combinatorial problems, you can measure solution quality against known bounds.
But what about problems where nobody can write down the fitness function? Drug design where “therapeutic value” resists quantification. Scientific hypotheses where “interestingness” is the real goal. Creative artifacts where quality is recognized but not formalized. Software architecture where “good design” involves tradeoffs that no single metric captures.
These aren’t edge cases. They’re most problems. The EvoLLM revolution has so far operated exclusively in the narrow band of domains where evaluation is trivially computable. The moment you step outside that band, the entire paradigm breaks.
This is what I call the metric trap: the assumption that progress requires a fixed, external measure of quality. It’s so deeply embedded in how we think about optimization that questioning it feels like questioning gravity. But there’s a field that escaped the metric trap billions of years ago. It produced every adaptation you can name, including the brain you’re using to read this sentence. It did so without a fitness function designed by anyone.
That field is biology. And its mechanism is co-evolution.
The Landscape of EvoLLM (and Its Shared Blind Spot)
To understand what’s missing, we need to understand what exists. Here is the current EvoLLM ecosystem, organized by what each system evolves and how it evaluates:
System            What It Evolves     How It Evaluates
----------------  ------------------  ---------------------------------
FunSearch         Function stubs      Numeric score (math verification)
AlphaEvolve       Whole programs      Evaluator scripts (benchmarks)
ELM               Python programs     Task fitness (Sodarace sim)
ReEvo             Heuristics          COP solution quality
EoH               Thought+code pairs  Problem-specific metrics
Eureka            Reward functions    RL training performance
ShinkaEvolve      Programs            Numeric evaluators
The AI Scientist  ML experiments      LLM-based paper review
DGM               Coding agents       SWE-bench / Polyglot scores
QDAIF             Creative text       LLM quality judgments
Notice the pattern. The left column is wildly diverse – functions, programs, heuristics, reward functions, entire experiments. The right column is monotonous. Fixed metrics. Static evaluators. Scores that don’t change as the population evolves.
Two partial exceptions stand out. QDAIF uses an LLM to judge the quality and diversity of creative text, replacing the numeric fitness function with natural-language evaluation. The AI Scientist uses an LLM-based reviewer to evaluate generated papers. These hint at the direction I’m arguing for. But in both cases, the evaluator is frozen – the same LLM with the same prompt, applying the same criteria throughout the entire evolutionary run. The judge never adapts to what it’s seen. It never gets harder to impress. It never co-evolves.
This is like running a track meet where the qualifying time never changes. Early on, it’s challenging. After a few generations, every runner clears the bar. The system converges, stagnates, and eventually just shuffles between equally-good-enough solutions. In evolutionary computation, this is called loss of selection pressure – and it’s the death of open-ended discovery.
When “Impossible” Evaluations Fell to Simple Mechanisms
Before proposing a solution, let’s study a pattern from history. Multiple times in AI, a field was stuck because nobody could formalize a quality metric – and the breakthrough came not from writing a better metric, but from making the evaluation itself adaptive.
Generative Adversarial Networks (2014). Before GANs, generative modeling for images was stuck. Nobody could write down a differentiable loss function that captured “this image looks real.” Attempts used pixel-wise reconstruction loss, which produced blurry averages. The breakthrough wasn’t a better pixel metric. It was the realization that you could train a discriminator to distinguish real from generated images, and use that discriminator’s judgment as the loss function for the generator. The evaluator co-evolved with the thing being evaluated. As the generator improved, the discriminator got better at detecting fakes, maintaining selection pressure indefinitely. No one ever had to formalize “image quality” – the adversarial dynamic discovered it implicitly.
Traditional approach:                GAN approach:
[Generator] --> [Fixed Metric]       [Generator] <--> [Discriminator]
     |                |                   |                  |
     +--- improve --->+                   +--- arms race ----+
                                             (both improve)
Self-Play in Games (2016-2017). Before AlphaGo Zero, game-playing AI required human expert games for training and hand-crafted evaluation functions. How do you write a heuristic that captures “good Go play” when even professionals can’t fully articulate their intuitions? AlphaGo Zero eliminated both requirements by playing against itself. The opponent was the evaluation function, and it co-evolved with the agent. As the agent improved, so did the opponent it trained against. This produced superhuman play without any human knowledge beyond the rules – and critically, without anyone ever defining what “good play” means mathematically.
Next-Token Prediction (2018-present). Nobody knew how to formalize “language understanding,” “reasoning ability,” or “common sense.” The breakthrough wasn’t formalizing these concepts – it was discovering that predicting the next token, an objective that sounds almost comically simple, implicitly requires modeling all of them. The “metric” was not designed to capture intelligence, yet it produced it. The connection to co-evolution is looser here, but the key property holds: the evaluation signal comes from the structure of the data itself, not from a human-designed rubric for “understanding.”
The pattern across all these cases: the field was stuck on “we can’t write down the metric,” and the solution was “stop trying to write down the metric.” Instead, make the evaluation emerge from a dynamic process – adversarial training, self-play, prediction, or learned embeddings. The evaluation co-evolves with (or is implicitly defined by) the thing being evaluated.
EvoLLM hasn’t had its GAN moment yet. It’s still in the “pixel-wise loss” era – using fixed, externally defined metrics. The question is: what would co-evolutionary EvoLLM look like?
Biology’s Answer: The Red Queen and the Endless Arms Race
Before designing the AI version, let’s understand how biology escaped the metric trap. In nature, there is no fitness function. There is no external evaluator computing a score. Instead, fitness emerges from interactions between organisms and their environment – and the environment includes other evolving organisms.
The Red Queen hypothesis, proposed by Leigh Van Valen in 1973, captures this: organisms must constantly adapt and evolve not merely to gain reproductive advantage, but simply to survive against other evolving organisms. It’s named after the Red Queen in Through the Looking-Glass: “It takes all the running you can do, to keep in the same place.”
Consider predator-prey co-evolution. Cheetahs evolve to run faster. Gazelles evolve to run faster in response. Neither species is evaluated against a fixed “speed metric” – they are evaluated against each other, and the evaluation changes as both evolve. This produces an arms race that drives continuous innovation in locomotion, sensing, camouflage, and strategy.
Or consider host-parasite co-evolution. Parasites evolve to exploit hosts. Hosts evolve immune defenses. Parasites evolve to evade those defenses. This generates extraordinary molecular complexity – far more than either lineage would produce in isolation. The parallel to GANs is not merely metaphorical; one study (Kucharavy et al., 2020) explicitly modeled GAN training as host-pathogen dynamics and found that biologically inspired co-evolutionary training improved stability.
The key properties of biological co-evolution that make it work:
- No fixed metric. Fitness is relational, not absolute. Being “fast” only matters relative to what you’re running from or chasing.
- Selection pressure is self-sustaining. As one side improves, the other must improve to keep up. There is no convergence to a “solved” state.
- Novelty is rewarded implicitly. A mutation that exploits a new niche or an unexplored vulnerability succeeds precisely because the other side hasn’t adapted to it yet.
- Complexity ratchets upward. Each adaptation creates new selection pressures, which drive further adaptation, which creates further pressures. This is how you get from single-celled organisms to brains.
A recent paper on open-ended cultural evolution (Winters & Charbonneau, 2025) formalized this insight: open-ended growth is “extremely rare, historically contingent, and only possible when cultural systems and search spaces co-evolve.” When either the solutions or the selection pressures are static, evolution plateaus. Only mutual co-evolution produces unbounded novelty.
This is precisely the property that current EvoLLM systems lack.
The Convergence: Co-Evolution Is Already Arriving
The argument I’m making is not purely theoretical. In the past eighteen months, a wave of papers has begun building co-evolutionary EvoLLM systems – though the field hasn’t yet recognized this as a coherent movement. Let me catalog the key developments.
Digital Red Queen (Kumar et al., January 2026). This paper makes the connection explicit. It uses an LLM to evolve assembly-like programs (“warriors”) that compete against each other in Core War, a Turing-complete adversarial environment. In each round, the LLM evolves a new warrior to defeat all previous ones. There is no fixed fitness function – fitness is defined by competition against the evolving population. The result: warriors become increasingly general over time, developing strategies that work against a wide range of opponents. The paper reports behavioral convergence across independent runs, analogous to convergent evolution in biology, and explicitly calls for “shifting from static objectives to dynamic Red Queen objectives.”
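Stripped of the Core War specifics, the scoring rule being described is relative rather than absolute. A rough sketch of that idea under my own naming, where `battle` is a hypothetical match function returning 1.0 for a win and 0.0 otherwise, not the paper's actual harness:

```python
from typing import Callable, List

def red_queen_fitness(warrior: str,
                      past_champions: List[str],
                      battle: Callable[[str, str], float]) -> float:
    """Fitness is the win rate against every previously accepted champion,
    so the bar rises automatically as the population's history grows."""
    if not past_champions:
        return 1.0  # the first warrior is unopposed
    return sum(battle(warrior, rival) for rival in past_champions) / len(past_champions)
```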
ASRO: Algorithm Space Response Oracles (Ke et al., January 2026). This paper reframes heuristic discovery as a two-player zero-sum game between a solver and an instance generator. Both sides maintain growing strategy pools and evolve via LLM-based best-response oracles. The solver evolves heuristics; the generator evolves problem instances designed to break those heuristics. This is adversarial co-evolution applied directly to the EvoLLM paradigm. Across multiple combinatorial optimization domains, ASRO consistently outperforms static-training baselines, achieving “substantially improved generalization and robustness on diverse and out-of-distribution instances.”
Foundation Model Self-Play (FMSP; Dharna, Lu & Clune, July 2025). Jeff Clune's group – responsible for some of the most important work on open-endedness – introduced self-play into the foundation model evolution paradigm. Their Quality-Diversity Self-Play (QDSP) creates diverse populations of high-quality strategies through competitive self-play, with an LLM generating code for both attacker and defender. In a pursuer-evader game, QDSP "surpasses strong human-designed strategies." In an LLM jailbreaking scenario, it automatically red-teams and patches LLM defenses. The paper represents a direct synthesis of quality-diversity algorithms (from the open-endedness tradition) with adversarial self-play.
GAR: Generative Adversarial Reinforcement Learning (Wang et al., October 2025). Applied to formal theorem proving, GAR jointly trains a problem composer and a solver in an adversarial loop. The composer generates increasingly difficult theorems calibrated to the solver’s current ability. This is implicit curriculum learning via co-evolution – the difficulty of the “fitness function” (the theorems to prove) adapts to the solver’s capability. Results show consistent improvements on MiniF2F and ProofNet benchmarks.
MADE: Evolution Without an Oracle (Zhao et al., November 2025). This paper directly addresses the metric trap. It asks: “Can evolution thrive in a purely subjective landscape governed solely by LLM judges?” MADE replaces the oracle (fixed fitness function) with multi-agent LLM evaluation, using “Problem Specification” to decompose vague goals into verifiable sub-requirements. The result: over 50% improvement on software requirement satisfaction compared to baselines. The paper frames this as a “fundamental paradigm shift: moving from optimizing computable metrics to describable qualities.”
CoCoEvo (Li et al., February 2025). Co-evolution of programs and test cases. When no pre-defined test cases exist, CoCoEvo generates both programs and tests from natural language descriptions, then co-evolves them. Programs that pass more tests survive; tests that discriminate between programs survive. This is a clean instantiation of predator-prey dynamics: programs are “prey” trying to pass tests, tests are “predators” trying to catch bugs.
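Those two selection rules reduce to a pair of complementary scores. A minimal illustration of the idea, my own simplification rather than CoCoEvo's actual formulation, with `passes` standing in for a hypothetical harness that runs one test against one program:

```python
from typing import Callable, List

def program_score(program: str, tests: List[str],
                  passes: Callable[[str, str], bool]) -> float:
    """Prey: a program is as fit as the fraction of the current tests it passes."""
    return sum(passes(program, t) for t in tests) / len(tests)

def test_score(test: str, programs: List[str],
               passes: Callable[[str, str], bool]) -> float:
    """Predator: a test is most useful when it splits the population,
    catching some programs while letting others through."""
    pass_rate = sum(passes(p, test) for p in programs) / len(programs)
    return 1.0 - abs(2.0 * pass_rate - 1.0)  # peaks when half the programs fail
```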
LLM-POET (Aki et al., June 2024). The original POET (Paired Open-Ended Trailblazer) algorithm co-evolved environments and agents. LLM-POET replaces the environment generator with an LLM, producing more complex and diverse environments than the original CPPN-based approach. The result: a 34% increase in performance gain from co-evolution, driven by agents learning more diverse skills from more complex environments.
Socratic-Zero (Wang et al., September 2025). Three LLM agents co-evolve: a Teacher that generates increasingly difficult questions based on the Solver’s weaknesses, a Solver that learns from both successful and failed attempts, and a Generator that distills the Teacher’s curriculum design strategy. Starting from just 100 seed questions, the resulting 8B-parameter model achieves gains of +20.2 percentage points over prior methods across seven math benchmarks.
These papers span different institutions, different problem domains, and different co-evolutionary architectures. But they all share the same insight: when you let the evaluation co-evolve with the solutions, you get better results than when you fix the evaluation in advance. And notably, many of them achieve this on problems where defining a fixed evaluation was the original bottleneck.
What Co-Evolutionary EvoLLM Looks Like
Let me synthesize these developments into a concrete architecture for co-evolutionary LLM-driven evolution. This isn’t speculative – every component exists in at least one published system. The contribution is the synthesis.
+------------------+             +-------------------+
|     SOLUTION     |             |     EVALUATOR     |
|    POPULATION    |             |    POPULATION     |
|                  |             |                   |
| Programs,        |    <--->    | Test cases,       |
| heuristics,      |   compete   | problem instances,|
| designs,         |             | critic LLMs,      |
| hypotheses       |             | reward functions  |
+--------+---------+             +---------+---------+
         |                                 |
    LLM mutates                       LLM mutates
     solutions                          evaluators
         |                                 |
+--------v---------+             +---------v---------+
|    Selection:    |             |    Selection:     |
| Survive against  |             |   Discriminate    |
| current          |             | between current   |
| evaluators       |             | solutions         |
+------------------+             +-------------------+
         |                                 |
         +---------- QD Archive -----------+
         |     (MAP-Elites / Novelty)      |
         +---------------------------------+
The key components:
Dual populations. Instead of one population of solutions evaluated against a fixed metric, maintain two co-evolving populations: solutions and evaluators. The evaluators can be test cases (CoCoEvo), problem instances (ASRO), competing agents (Digital Red Queen, FMSP), critic LLMs with evolving prompts (MADE), or reward functions (Eureka + evolution).
Adversarial selection. Solutions are selected based on performance against the current evaluator population. Evaluators are selected based on their ability to discriminate – to create separation between good and bad solutions. An evaluator that gives everything the same score is useless and gets eliminated. An evaluator that identifies meaningful quality differences survives.
LLM-driven mutation on both sides. The same LLM (or different LLMs) mutate both solutions and evaluators. This is critical: the LLM’s semantic understanding allows it to generate meaningful variations on evaluative criteria, not just on solutions. It can create test cases that probe edge cases, problem instances that exploit known weaknesses, or evaluation rubrics that capture neglected quality dimensions.
Quality-diversity archiving. Instead of optimizing a single metric, maintain a diverse archive of both solutions and evaluators using MAP-Elites or novelty search. This prevents the adversarial dynamic from collapsing into a single arms race and encourages exploration of the full space.
Escalation pressure. As solutions improve, evaluators must become more discriminating to remain useful. As evaluators become more stringent, solutions must become more robust. This creates the self-sustaining selection pressure that fixed metrics lack.
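A minimal sketch of the whole loop follows, under explicit assumptions: `score(solution, evaluator)` abstracts whatever the evaluator population consists of (test runs, competitive matches, critic judgments), the two `llm_mutate_*` callables stand in for LLM calls, `behavior_descriptor` maps a solution to a MAP-Elites-style grid cell, and evaluator selection uses score spread as a crude proxy for discrimination. This is an illustration of the architecture, not any published system's implementation.

```python
import random
from statistics import mean, pstdev
from typing import Callable, Dict, List, Tuple

def coevolve(solutions: List[str],
             evaluators: List[str],
             score: Callable[[str, str], float],         # quality of a solution under one evaluator
             llm_mutate_solution: Callable[[str], str],  # stand-in for an LLM call
             llm_mutate_evaluator: Callable[[str], str], # stand-in for an LLM call
             behavior_descriptor: Callable[[str], Tuple[int, int]],
             generations: int = 50) -> Dict[Tuple[int, int], Tuple[float, str]]:
    """Dual-population co-evolution with a MAP-Elites-style archive of solutions."""
    archive: Dict[Tuple[int, int], Tuple[float, str]] = {}  # cell -> (fitness, elite solution)

    for _ in range(generations):
        # Adversarial selection, solution side: survive against the *current* evaluators.
        sol_fit = {s: mean(score(s, e) for e in evaluators) for s in solutions}
        parents = sorted(solutions, key=sol_fit.get, reverse=True)[: max(1, len(solutions) // 2)]

        # Adversarial selection, evaluator side: keep evaluators whose scores actually
        # separate the current solutions; zero spread means zero information.
        ev_fit = {e: pstdev([score(s, e) for s in solutions]) for e in evaluators}
        judges = sorted(evaluators, key=ev_fit.get, reverse=True)[: max(1, len(evaluators) // 2)]

        # Quality-diversity archiving: keep the best solution seen in each behavior cell.
        for s in parents:
            cell = behavior_descriptor(s)
            if cell not in archive or sol_fit[s] > archive[cell][0]:
                archive[cell] = (sol_fit[s], s)

        # LLM-driven mutation on both sides refills the populations; escalation pressure
        # comes from each side being selected against the other's survivors.
        solutions = parents + [llm_mutate_solution(random.choice(parents))
                               for _ in range(len(solutions) - len(parents))]
        evaluators = judges + [llm_mutate_evaluator(random.choice(judges))
                               for _ in range(len(evaluators) - len(judges))]

    return archive
```

The returned archive plays the role of the QD Archive box in the diagram above: a growing map of diverse, high-quality solutions rather than a single champion.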
What does this buy you concretely? Consider three scenarios:
Scenario 1: Evolving better software. Instead of evaluating code against a fixed test suite, co-evolve the test suite alongside the code. The LLM generates both program mutations and new test cases. Tests that catch bugs in the current best programs survive. Programs that pass the most tests survive. The result: adversarial test generation that continuously probes for weaknesses, and programs that become robust against an expanding frontier of tests. CoCoEvo already demonstrates this works.
Scenario 2: Evolving scientific hypotheses. Instead of evaluating hypotheses against a fixed metric (p-value, effect size), co-evolve hypotheses alongside a population of critic LLMs with different evaluation perspectives. Some critics focus on novelty, others on explanatory power, others on testability. Hypotheses that satisfy diverse critics survive. Critics that successfully identify low-quality hypotheses survive. The result: a self-sharpening evaluation system that can judge “interestingness” without anyone having to define it upfront.
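One way such a critic population could be wired up, purely as an assumption-laden sketch: the prompts below are invented examples, `llm_judge` stands in for whatever LLM call returns a numeric score, and the weakest-link aggregation is just one plausible way to reward hypotheses that satisfy diverse critics.

```python
from typing import Callable, Dict

# Invented example prompts; in a full system these would themselves be evolved.
CRITIC_PROMPTS: Dict[str, str] = {
    "novelty":     "Score 0-10: how novel is this hypothesis relative to known results?",
    "explanation": "Score 0-10: how much would this hypothesis explain if it were true?",
    "testability": "Score 0-10: how feasibly could this hypothesis be tested?",
}

def hypothesis_fitness(hypothesis: str,
                       llm_judge: Callable[[str, str], float]) -> float:
    """A hypothesis must satisfy *diverse* critics, so take the weakest-link score:
    excelling on novelty cannot compensate for being untestable."""
    return min(llm_judge(prompt, hypothesis) for prompt in CRITIC_PROMPTS.values())
```

Critic selection would then mirror the evaluator side of the loop above: prompts whose scores fail to separate strong hypotheses from weak ones get replaced.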
Scenario 3: Evolving AI systems themselves. The Darwin Godel Machine already evolves coding agents that improve on benchmarks. Now imagine co-evolving both the agents and the benchmarks. Agents that solve harder benchmarks survive. Benchmarks that challenge the best agents survive. This is open-ended improvement without anyone having to predict what capabilities matter.
The Open-Endedness Horizon
Jeff Clune’s 2019 paper on AI-Generating Algorithms identified three pillars needed for automatically generating general AI: meta-learning architectures, meta-learning the learning algorithms themselves, and generating effective learning environments. The third pillar – generating environments – is essentially the evaluator side of co-evolution. OMNI-EPIC (Faldor et al., 2024), from Clune’s group, uses foundation models to generate environments that are learnable and interesting, taking a step toward this third pillar.
But OMNI-EPIC still uses a fixed notion of “interesting” (encoded in an LLM prompt that doesn’t change). The Digital Red Queen and ASRO take the further step of letting the environment-generation process itself evolve in response to the learner. DeepMind’s position paper on open-endedness (Hughes et al., 2024) argues this is not optional but necessary: “open-endedness is an essential property of any artificial superhuman intelligence.”
The open-ended cultural evolution work (Winters & Charbonneau, 2025) provides the theoretical grounding. Their computational model demonstrates that open-ended growth requires:
- Co-evolution between cultural systems and search spaces (neither can be static)
- Stochastic perturbation strong enough to prevent equilibrium (novelty/diversity pressure)
- Selection strong enough to maintain effectiveness (quality pressure)
This maps directly onto the quality-diversity co-evolutionary architecture described above. The QD archive provides the diversity pressure. The adversarial selection provides the quality pressure. The dual co-evolution prevents either side from stagnating.
The recent “Agentic Evolution” position paper (Lin et al., January 2026) pushes this further, proposing an “evolution-scaling hypothesis”: the capacity for adaptation scales with compute allocated to evolution, just as capability scales with training compute. If this is correct, co-evolutionary EvoLLM systems may exhibit their own scaling laws – more compute applied to the adversarial dynamic produces increasingly capable and general solutions, without anyone having to design better metrics.
The Bitter Lesson, Revisited
Rich Sutton’s “Bitter Lesson” of AI research is that methods leveraging computation ultimately beat methods leveraging human knowledge. I’d propose a corollary for the EvoLLM era:
Systems that evolve their own evaluation ultimately beat systems that rely on human-designed evaluation.
The current generation of EvoLLM tools is astonishingly effective within its domain. AlphaEvolve, FunSearch, and their descendants will continue to produce important results on problems with clean metrics. But the frontier will belong to systems where the evaluation itself is under evolutionary pressure.
This is not a radical claim. It’s the lesson of GANs, self-play, and arguably of natural selection itself. When you remove the ceiling of a fixed evaluator, you remove the ceiling on what evolution can discover.
The pieces are arriving: LLMs powerful enough to generate both solutions and evaluators, quality-diversity algorithms to maintain exploration, adversarial dynamics to sustain selection pressure, and enough compute to run the whole thing. Digital Red Queen, ASRO, Foundation Model Self-Play, CoCoEvo, MADE – these are the early examples. They already outperform their fixed-metric counterparts.
The metric trap is real, and it’s the bottleneck. Co-evolution is the exit.
The question isn’t whether EvoLLM will become co-evolutionary. It’s how soon.
References
Core EvoLLM Systems
- Lehman, J. et al. “Evolution through Large Models.” arXiv: 2206.08896 (2022).
- Romera-Paredes, B. et al. “Mathematical discoveries from program search with large language models.” Nature 625, 468-475 (2024).
- Novikov, A. et al. “AlphaEvolve: A coding agent for scientific and algorithmic discovery.” arXiv: 2506.13131 (2025).
- Ye, H. et al. “ReEvo: Large Language Models as Hyper-Heuristics with Reflective Evolution.” arXiv: 2402.01145 (2024).
- Ma, Y.J. et al. “Eureka: Human-Level Reward Design via Coding Large Language Models.” arXiv: 2310.12931 (2023).
- Lange, R.T. et al. “ShinkaEvolve: Towards Open-Ended and Sample-Efficient Program Evolution.” arXiv: 2509.19349 (2025).
- Liu, F. et al. “Evolution of Heuristics: Towards Efficient Automatic Algorithm Design Using Large Language Model.” arXiv: 2401.02051 (2024).
- Lu, C. et al. “The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery.” arXiv: 2408.06292 (2024).
- Zhang, J. et al. “Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents.” arXiv: 2505.22954 (2025).
- Bradley, H. et al. “Quality-Diversity through AI Feedback.” arXiv: 2310.13032 (2023).
Co-Evolutionary and Adversarial Approaches
- Kumar, A. et al. “Digital Red Queen: Adversarial Program Evolution in Core War with LLMs.” arXiv: 2601.03335 (2026).
- Ke, X. et al. “Game-Theoretic Co-Evolution for LLM-Based Heuristic Discovery.” arXiv: 2601.22896 (2026).
- Dharna, A., Lu, C. & Clune, J. “Foundation Model Self-Play: Open-Ended Strategy Innovation via Foundation Models.” arXiv: 2507.06466 (2025).
- Wang, R. et al. “GAR: Generative Adversarial Reinforcement Learning for Formal Theorem Proving.” arXiv: 2510.11769 (2025).
- Zhao, Z. et al. “Evolution without an Oracle: Driving Effective Evolution with LLM Judges.” arXiv: 2511.19489 (2025).
- Li, K. et al. “CoCoEvo: Co-Evolution of Programs and Test Cases to Enhance Code Generation.” arXiv: 2502.10802 (2025).
- Aki, F. et al. “LLM-POET: Evolving Complex Environments using Large Language Models.” arXiv: 2406.04663 (2024).
- Wang, S. et al. “Socratic-Zero: Bootstrapping Reasoning via Data-Free Agent Co-evolution.” arXiv: 2509.24726 (2025).
- Anne, T. et al. “Adversarial Coevolutionary Illumination with Generational Adversarial MAP-Elites.” arXiv: 2505.06617 (2025).
- Guo, P., Zhang, Q. & Lin, X. “CoEvo: Continual Evolution of Symbolic Solutions Using Large Language Models.” arXiv: 2412.18890 (2024).
- Huang, Z. et al. “CALM: Co-evolution of Algorithms and Language Model for Automatic Heuristic Design.” arXiv: 2505.12285 (2025).
Open-Endedness and Foundations
- Clune, J. “AI-GAs: AI-generating algorithms, an alternate paradigm for producing general artificial intelligence.” arXiv: 1905.10985 (2019).
- Faldor, M. et al. “OMNI-EPIC: Open-endedness via Models of human Notions of Interestingness with Environments Programmed in Code.” arXiv: 2405.15568 (2024).
- Hughes, E. et al. “Open-Endedness is Essential for Artificial Superhuman Intelligence.” arXiv: 2406.04268 (2024).
- Lin, M. et al. “Position: Agentic Evolution is the Path to Evolving LLMs.” arXiv: 2602.00359 (2026).
- Winters, J. & Charbonneau, M. “Modelling the emergence of open-ended cultural evolution.” arXiv: 2508.04828 (2025).
- Wen, Y. et al. “Language Games as the Pathway to Artificial Superhuman Intelligence.” arXiv: 2501.18924 (2025).
- Wang, C. et al. “Evolution of Benchmark: Black-Box Optimization Benchmark Design through Large Language Model.” arXiv: 2601.21877 (2026).
Historical Precedents
- Goodfellow, I. et al. “Generative Adversarial Nets.” NeurIPS (2014).
- Silver, D. et al. “Mastering the game of Go without human knowledge.” Nature 550, 354-359 (2017).
- Kucharavy, A. et al. “Host-Pathogen Co-evolution Inspired Algorithm Enables Robust GAN Training.” arXiv: 2006.04720 (2020).
- Van Valen, L. “A New Evolutionary Law.” Evolutionary Theory 1, 1-30 (1973).
Further Reading
- For the EvoLLM landscape: A growing knowledge base of mine tracking all major projects, applications, and results with cross-references is available at the evollm-knowledgeBase repository on GitHub.
- For open-endedness theory: Stanley, K.O. & Lehman, J. Why Greatness Cannot Be Planned: The Myth of the Objective (2015) – a book-length argument against fixed objectives in search and discovery.
- For quality-diversity: Cully, A. & Demiris, Y. “Quality and diversity optimization: A unifying modular framework.” IEEE TEVC (2018) – foundational QD work that underpins many of the systems discussed here.
- For co-evolution in AI safety: Liu, M. et al. “Chasing Moving Targets with Online Self-Play Reinforcement Learning for Safer Language Models.” arXiv: 2506.07468 (2025) – applies the co-evolutionary insight specifically to AI safety alignment.
- For the self-improvement angle: SIMA 2 (DeepMind, 2025) demonstrates an embodied agent using Gemini to generate its own tasks and rewards for autonomous skill learning – a form of evaluator self-generation approaching co-evolution.