Tag: evaluation

All the articles with the tag "evaluation".

SRE for AI Agents: Error Budgets, Trust, and 90 Trials

9 Mar, 2026 • Updated 23 Mar, 2026

Can an AI agent predict scope without hallucinating? We ran 90 trials. It added 1.7 phantom files per change. Error budgets and trust ladders are the gate.

What a Null Result Taught Us About AI Agent Evaluation

23 Feb, 2026 • Updated 28 Mar, 2026

We tested prompt repetition on 20 parallel AI agents across two experiments. Ceiling effects dominated both. The null result is a finding about evaluation design.