Tag: evaluation
All the articles with the tag "evaluation".
-
SRE for AI Agents: Error Budgets, Trust, and 90 Trials
• Updated. Can an AI agent predict scope without hallucinating? We ran 90 trials. The agent added 1.7 phantom files per change. Error budgets and trust ladders are the gate.
-
What a Null Result Taught Us About AI Agent Evaluation
• Updated. We tested prompt repetition on 20 parallel AI agents. Ceiling effects dominated both experiments. The null result is a finding about evaluation design.