What a Null Result Taught Us About AI Agent Evaluation

Q: What Happened in the FastMCP Refactor Test?

Target: FastMCP session ID refactor in . Open and unimplemented at the time of the experiment. Rubric: 6 binary criteria, pre-registered before any runs were examined. 9 of 10 delegates produced valid output. ran 93 messages and wrote no output file. Session log analysis confirmed the file-write instruction appeared only at the end of the delegate prompt, and the model drifted past it. This is a delegate authoring flaw with a known fix: bookend critical instructions at the start and end. Across the 9 valid runs, 5 of 6 criteria scored 100% in both groups. The only variance was C5 (must-not constraint violations): control 5.50/6, treatment 5.80/6, delta +0.30. The treatment scored marginally higher, but at n=4 vs n=5 with binary outcomes, Fisher’s exact test is degenerate (p = 1.0). Full per-criterion scores are in the raw data. The task was too easy. The rubric could not discriminate. We needed a harder target.

Q: Did a Stricter Methodology Change the Result?

Target: , a tree-sitter AST (Abstract Syntax Tree)-based security scanner evaluation. Harder task, requiring synthesis from source code rather than retrieval from issue text. Unimplemented when tested. Rubric: 7 binary criteria. C5, C6, and C7 required the delegate to read and reason about actual source code, not just summarize the issue. Pre-registered before any runs began. Methodology improvements over Experiment 1: Blinded file naming from the start ( through with sealed )Mann-Whitney U test pre-specified (two-tailed, alpha = 0.05)Wall-clock latency recorded per delegate Results: Table 1: Experiment 2 results. Zero variance in either group. Mann-Whitney U = 12.5, p = 1.0 (degenerate: complete ties, test cannot be evaluated). Every Scout, in every run, in both groups, scored 7/7. Even C5, C6, and C7, the synthesis criteria we specifically designed to require source code reasoning, hit 100% across the board. The 17.8% latency difference is in the expected direction (longer prompt, longer prefill), which is consistent with the paper’s Anthropic-specific latency caveat. At scale, that delta compounds: doubled tokens cost money, and the added prefill time costs throughput across every agent invocation. But n=5 cannot support any inference here, and the finding is further confounded by an infrastructure issue we discovered afterward. The scores told us nothing. The session logs told us something we did not expect.

A Google Research paper demonstrates that repeating the entire user prompt verbatim can lift accuracy by up to 76 percentage points at zero output cost. No chain-of-thought overhead. No reasoning budget. Just send the same instruction twice.

We ran 20 parallel agents across two experiments: 10 per experiment, 5 control vs. 5 treatment, blind-scored against a pre-registered rubric.

We found nothing. The nothing is the finding.

Contents

What Did the Paper Claim?
- Why Our Agent Seemed Like a Good Candidate
- Why This Matters for Engineering Teams
How Did We Design the Test?
What Happened in the FastMCP Refactor Test?
Did a Stricter Methodology Change the Result?
What Infrastructure Confound Did We Miss?
Why Did Both Experiments Hit 100%?
- Where the Boundary Falls
What Did We Learn About AI Evaluation Design?
When Should You Use Prompt Repetition?
- What Transfers to Your Team
Did Prompt Repetition Change Anything Else?
References

What Did the Paper Claim?

A 2025 paper by Leviathan et al. at Google Research proposes a simple technique: repeat the entire user prompt once, verbatim, before sending to the model.

The mechanism is structural, not empirical. Decoder-only transformers use causal masking: each token attends only to tokens before it. In a single-pass prompt, early tokens never see later context. Repeating the prompt creates a second copy where every token attends to the full instruction during prefill. This reduces the positional attention decay documented as the “lost in the middle” phenomenon (Liu et al., 2023). This is a fundamental limitation of the decoder-only architecture, not a quirk of specific benchmarks. A 675B-parameter Mixture-of-Experts frontier model and a 3B-active-parameter small language model (SLM) (NVIDIA, 2025) share it equally.

Standard prompt (single pass):
  Token 1  sees: [Token 1]
  Token 5  sees: [Token 1, 2, 3, 4, 5]
  Token 50 sees: [Token 1, 2, ..., 50]
  --> Early tokens are blind to later context

Repeated prompt (two copies):
  Token 51 sees: [Token 1, 2, ..., 50, 51]
  Token 55 sees: [Token 1, 2, ..., 50, 51, 52, 53, 54, 55]
  --> Every token in the second copy attends to the full first copy
  --> Full context available during prefill, zero generation costcausal-masking.txt

Code Snippet 1: Causal masking creates an asymmetry where early tokens cannot attend to later context. Repeating the prompt gives the second copy full visibility over the first.

The reported gains:

Gemini 2.0 Flash-Lite on NameIndex: 21.33% to 97.33% accuracy
GSM8K and MMLU-Pro gains across Gemini 2.0 Flash, GPT-4o, Claude 3.7 Sonnet, DeepSeek V3, and others
Input tokens double; output tokens unchanged in fixed-format benchmarks (no latency increase, unlike chain-of-thought)

The paper positions this as a Pareto improvement over reasoning-heavy approaches: same output budget, better accuracy.

Why Our Agent Seemed Like a Good Candidate

Our Scout delegate, the research agent in our subagent architecture (full recipe), runs on claude-haiku-4-5 at temperature 0.5 with extended thinking off. Haiku 4.5 is structurally a non-reasoning model (extended thinking is opt-in, not default), making it precisely the class of LLM the paper’s title targets.

The paper tested Claude 3 Haiku alongside six other models; its strongest gains came from Gemini 2.0 Flash-Lite and GPT-4o-mini. We tested Claude 4.5 Haiku, a different model generation. Anthropic does not publish architectural details for either model. Whether the technique transfers across generations is an open question this experiment cannot answer, because our ceiling effects prevented any treatment from showing lift.

Why This Matters for Engineering Teams

Teams adopt AI techniques from papers without field-testing them first. BCG reports that 50% of companies are stagnating with AI (BCG, 2025), partly because they ship optimizations without measuring baselines. Shipping an unvalidated prompt change to production would cost more: doubled input tokens on every request, with no accuracy gain to show for it. As we covered in observability for AI agents, optimizing without measuring before and after is flying blind.

How Did We Design the Test?

Both experiments shared the same core structure: 10 parallel async Scout delegates, split 5 control vs. 5 treatment, scored blind against a pre-registered rubric. For detailed methodology and raw data, see Supplementary Materials.

# Shared config across all 10 delegates
model: claude-haiku-4-5
temperature: 0.5
extensions:
  - developer
  - context7
  - brave_search
output: "scout-run-{{ id }}.json"delegate-config.yaml

Code Snippet 2: Shared delegate configuration. All 10 runs use the same model, temperature, and extensions.

Control group: standard Scout instructions (~3,805 characters, instructions x1).

Treatment group: instructions repeated verbatim (~7,633 characters, instructions x2), mimicking the paper’s <QUERY><QUERY> pattern applied to the agent’s system prompt.

The orchestrator spawned all 10 delegates simultaneously via Goose’s background task system and handed off structured JSON. A separate blind-scoring delegate received only the output files (no group labels) and scored each against the rubric. Group assignments were sealed in a label map before scoring began.

What Happened in the FastMCP Refactor Test?

Target: FastMCP session ID refactor in math-mcp-learning-server. Open and unimplemented at the time of the experiment.

Rubric: 6 binary criteria, pre-registered before any runs were examined.

9 of 10 delegates produced valid output. control-1 ran 93 messages and wrote no output file. Session log analysis confirmed the file-write instruction appeared only at the end of the delegate prompt, and the model drifted past it. This is a delegate authoring flaw with a known fix: bookend critical instructions at the start and end.

Across the 9 valid runs, 5 of 6 criteria scored 100% in both groups. The only variance was C5 (must-not constraint violations): control 5.50/6, treatment 5.80/6, delta +0.30. The treatment scored marginally higher, but at n=4 vs n=5 with binary outcomes, Fisher’s exact test is degenerate (p = 1.0). Full per-criterion scores are in the raw data.

The task was too easy. The rubric could not discriminate. We needed a harder target.

Did a Stricter Methodology Change the Result?

Target: aptu#737, a tree-sitter AST (Abstract Syntax Tree)-based security scanner evaluation. Harder task, requiring synthesis from source code rather than retrieval from issue text. Unimplemented when tested.

Rubric: 7 binary criteria. C5, C6, and C7 required the delegate to read and reason about actual source code, not just summarize the issue. Pre-registered before any runs began.

Methodology improvements over Experiment 1:

Blinded file naming from the start (scout-run-01.json through scout-run-10.json with sealed label-map.json)
Mann-Whitney U test pre-specified (two-tailed, alpha = 0.05)
Wall-clock latency recorded per delegate

Results:

Group	Score	Wall-clock median
Control (x1)	7/7 all runs	6m 21s
Treatment (x2)	7/7 all runs	7m 29s
Delta	0	+17.8%

Table 1: Experiment 2 results. Zero variance in either group. Mann-Whitney U = 12.5, p = 1.0 (degenerate: complete ties, test cannot be evaluated).

Every Scout, in every run, in both groups, scored 7/7. Even C5, C6, and C7, the synthesis criteria we specifically designed to require source code reasoning, hit 100% across the board.

The 17.8% latency difference is in the expected direction (longer prompt, longer prefill), which is consistent with the paper’s Anthropic-specific latency caveat. At scale, that delta compounds: doubled tokens cost money, and the added prefill time costs throughput across every agent invocation. But n=5 cannot support any inference here, and the finding is further confounded by an infrastructure issue we discovered afterward.

The scores told us nothing. The session logs told us something we did not expect.

What Infrastructure Confound Did We Miss?

Post-hoc session log analysis revealed a confound present in both experiments.

Goose enforces a hard cap of 5 concurrent background delegates. When all 10 delegates were spawned simultaneously, runs 06-10 hit the cap and were queued into a second batch after runs 01-05 completed.

The resulting batch structure was unbalanced:

Batch	Runs	Control	Treatment	Condition
1 (runs 01-05)	5	C1, C2, C3	T1, T2	Resource contested
2 (runs 06-10)	5	C4, C5	T3, T4, T5	No contention

Table 2: The 5-delegate concurrency cap split 10 simultaneous spawns into two unbalanced batches. Exact run assignments are in the raw data.

Treatment delegates landed disproportionately in the less-contested second batch, making any latency comparison between groups uninterpretable. Accuracy was unaffected (ceiling effects dominated regardless), but the exposure is worth naming: pre-registration does not protect against runtime infrastructure behavior you did not know existed.

The confound matters for latency. But the bigger question is why accuracy showed zero variance in the first place.

Why Did Both Experiments Hit 100%?

Two experiments, two rubrics designed to be harder than the last, two 100% results.

This is itself a finding. A well-designed Scout delegate on a well-scoped engineering issue is already operating above the baseline accuracy threshold where prompt repetition shows lift. The paper’s largest gains came from synthetic positional tasks, NameIndex (Leviathan et al., 2025), where the answer is a name buried in a list. Real engineering issues, even unimplemented ones, give the agent structured context, code references, and acceptance criteria. The agent finds what it needs without help from the prefill geometry.

This is the core finding: prompt repetition solves an attention problem that well-scoped engineering tasks do not have. The technique’s value is real, but the paper’s benchmarks do not cover agentic engineering tasks. Our experiments tested that boundary. When the agent already has structured context pointing it to the right code, repeating the instruction adds input tokens without adding accuracy signal. Understanding where SLMs succeed and fail on their own is not academic: hybrid architectures like SMART (Kim et al., 2025) use SLMs as the primary reasoning engine, with LLMs intervening only at critical junctures. Every prompt-level optimization that improves standalone SLM accuracy reduces how often the expensive backstop fires.

For teams evaluating prompt techniques at scale, the implication is financial: doubling input tokens across every agent invocation is a measurable cost increase. If your agents already converge correctly on well-scoped tasks, that spend returns nothing. Embracing negative results as a research practice (Berger et al., 2024) prevents exactly this kind of waste: publication bias toward positive results means the null findings that would have saved you the experiment often go unpublished.

Where the Boundary Falls

The gap is between task types, not between models:

Positional retrieval tasks (NameIndex, needle-in-haystack): high positional attention decay, repetition helps
Structured engineering tasks (scoped issues with code context): low positional decay, Scout already converges correctly

Dimension	Paper (Leviathan et al.)	Experiment 1	Experiment 2
Task type	Standard + custom retrieval (MMLU-Pro, NameIndex, others)	Issue analysis (FastMCP refactor)	Source code synthesis (AST scanner)
Model	Gemini 2.0 Flash-Lite, Claude 3 Haiku, 5 others	Claude 4.5 Haiku	Claude 4.5 Haiku
Sample size	McNemar test on full benchmark datasets (7 benchmarks, 7 models)	n=4 vs n=5 (1 dropped)	n=5 vs n=5
Accuracy delta	47/70 pairs improved, 0 regressed; +76pp on NameIndex (Flash-Lite)	+0.30 (noise)	0.00 (ceiling)
Confounds	None reported	Delegate authoring failure	5-delegate batch cap

Table 3: Comparison of experimental conditions. The paper’s gains concentrate on positional retrieval tasks; our structured engineering tasks hit ceiling effects before any treatment could show lift.

Designing a rubric that discriminates between good and very good on the second category is harder than it looks. Both of ours failed. The criteria require synthesis and judgment under genuine ambiguity, not retrieval from a well-scoped document.

What Did We Learn About AI Evaluation Design?

Rubric Design Is Harder Than Experiment Design

We iterated twice and hit the ceiling both times. A 7-point rubric with “source code synthesis” criteria is not automatically harder. It depends on whether the task actually creates ambiguity the agent must resolve. Ours did not.

A practical calibration target: if your scoring delegate can answer any criterion by reading the issue alone (without running the code), the criterion will not discriminate.

Infrastructure Behavior Is a Confounder

The 5-delegate cap is undocumented. It is enforced as a hard rejection in source (GOOSE_MAX_BACKGROUND_TASKS defaults to 5), with no queuing or retry. Excess delegates are dropped, not deferred. It silently split our groups into unbalanced batches. This category of confound (runtime resource limits, queue behavior, model routing) is endemic to agent systems and invisible without structured logging.

Future experiments: spawn delegates in explicit batches of 5 with documented batch assignments. Record session IDs. Treat infrastructure state as a variable, not background noise.

Delegate Authoring Has a Turn-Length Problem

Long sessions drift from instructions that appear only once. The control-1 failure (93 messages, no output) demonstrated the fix: bookend critical actions at both the start and end of delegate prompts. This class of failure is predictable and preventable, but only if you treat delegate prompt structure as part of your experimental design.

The blind scoring infrastructure proved its value here. Each run produced a structured justification the scorer generated without knowing group assignment:

{
  "run_id": "run-01",
  "C1": 1, "C2": 1, "C3": 1, "C4": 1,
  "C5": 1, "C6": 1, "C7": 1, "total": 7,
  "justifications": {
    "C1": "Issues #735/#736 explicitly identified as regex limitation.",
    "C5": "Backward compatibility addressed via hybrid approach.",
    "C7": "Synthesis connects tree-sitter AST parsing to existing rules."
  }
}scorer-output.json

Code Snippet 3: Blind scorer output for a single run. Each criterion includes a justification generated without knowledge of group assignment.

When Should You Use Prompt Repetition?

The null result is not a failure of the paper. Prompt repetition won 47 out of 70 benchmark-model combinations with zero losses (Leviathan et al., 2025). The technique works. The question is where.

The paper’s gains concentrate on benchmarks with positional retrieval components: NameIndex, MiddleMatch, options-first multiple choice. Tasks where the answer depends on information placement in the context window. The paper also notes a neutral-to-slight effect with reasoning prompts (5 wins, 1 loss, 22 neutral with step-by-step). Reasoning appears to compensate for the same attention decay that repetition addresses.

The industry trend is not exclusively toward reasoning models. Capable SLMs are gaining ground. NVIDIA’s Nemotron 3 Nano (NVIDIA, 2025) activates 3 billion of its 30 billion parameters per token, delivering 3.3x the throughput of Qwen3-30B on a single H200, designed explicitly for multi-agent systems at scale. RLP (Hatamizadeh et al., 2025) embeds reinforcement learning into pretraining itself, lifting math and science accuracy by 19% on a 1.7B-parameter model without post-training reasoning. These models are non-reasoning by default. The causal masking limitation that prompt repetition addresses is structural to the decoder-only architecture all of them share. Their users are also the most cost-sensitive to doubled input tokens. Every token matters when you are optimizing for throughput at the edge.

Our null result came from the other side of that boundary: structured engineering tasks where the agent already has scoped context, code references, and acceptance criteria. The ceiling was in the task, not the technique.

Decision flowchart for evaluating prompt repetition: check task type, measure baseline, run controlled experiment, adopt only with measurable accuracy gain

Figure 1: Decision flowchart for evaluating prompt repetition against your workload.

What Transfers to Your Team

Three things that transfer directly to any team evaluating AI agent behavior:

Baseline accuracy determines whether any prompt technique has room to work. Measure it before testing an optimization.
Infrastructure constraints are confounder candidates. Audit your delegate system’s limits before attributing latency or throughput differences to treatment variables.
Rubric discrimination is the bottleneck. Two rubrics, two ceiling effects. If your scoring criteria can be satisfied by reading the issue description alone, the rubric will not discriminate.

Did Prompt Repetition Change Anything Else?

One observation worth noting: treatment agents in both experiments used fewer output tokens and messages to reach the same scores. This is consistent with information-theoretic expectations; redundancy in the input reduces decoder uncertainty at each generation step, which should reduce exploratory turns in an agentic loop. The original paper’s benchmarks (MMLU, GSM8K) produce fixed-format answers where output length does not vary, making this effect invisible. Agentic workloads, where the model decides how many turns to take, may be where the efficiency signal surfaces.

The economics are also different than single-turn benchmarks suggest: in a multi-turn session, the doubled prompt adds single-digit overhead to accumulated input, not 100%. The growing conversation history dominates each API call. In our data, treatment agents used 2.5% fewer total input tokens and 13% fewer output tokens despite the longer prompt, because each avoided turn eliminates an entire context window from the running total. The effect is confounded and too small to draw conclusions, but it is a pattern worth investigating with a discriminating rubric.

References

BCG, “AI Adoption Puzzle: Why Usage Is Up But Impact Is Not” (2025) — https://www.bcg.com/publications/2025/ai-adoption-puzzle-why-usage-up-impact-not
Berger et al., “Position: Embracing Negative Results in Machine Learning” (2024) — https://arxiv.org/abs/2406.03980
Clouatre, H., “Orchestrating AI Agents: A Subagent Architecture for Code” (2025) — https://clouatre.ca/posts/orchestrating-ai-agents-subagent-architecture/
Clouatre, H., “Prompt Repetition Experiments: Supplementary Materials” (2026) — https://github.com/clouatre-labs/prompt-repetition-experiments
Hatamizadeh, A. et al., “RLP: Reinforcement as a Pretraining Objective” (2025) — https://arxiv.org/abs/2510.01265
Kim, Y. et al., “Guiding Reasoning in Small Language Models with LLM Assistance” (2025) — https://arxiv.org/abs/2504.09923
Leviathan, Y., Kalman, M., and Matias, Y., “Prompt Repetition Improves Non-Reasoning LLMs” (2025) — https://arxiv.org/abs/2512.14982
Liu et al., “Lost in the Middle: How Language Models Use Long Contexts” (2023) — https://arxiv.org/abs/2307.03172
NVIDIA, “Nemotron 3: Efficient and Open Intelligence” (2025) — https://arxiv.org/abs/2512.20856