
SRE for AI Agents: Error Budgets, Trust, and 90 Trials


AI tooling budgets hit record highs. We ran 90 file-prediction trials to measure what an AI agent gets wrong before it touches production. The model predicted 1.7 files beyond the actual change set on average, even on a well-structured codebase. SRE is not ceremony. It is the empirical gate between velocity and blast radius.


Why Is AI Widening the Dev/Ops Gap?

Toil is the repetitive, manual operational work that scales with system load rather than adding lasting value. Catchpoint’s annual SRE reports tracked toil rising from 25% to 34% between 2024 and 2026, the first sustained increase in five years. Over the same period, 92% of developers report that AI tools expand the blast radius of bad code that needs to be debugged (DevOps.com, 2025). Defensive pipeline architectures can close part of this gap, but they address the pipeline, not the production governance layer.

Three root causes keep surfacing in post-mortems:

- More frequent deploys.
- More autonomous agents multiplying those deploys.
- Review capacity that has not scaled to match.

METR found experienced developers took 19% longer with AI tools despite perceiving a 20% speedup (Becker et al., 2025).

The Perception Gap: Dashboards vs. Practitioner Reality

The perception gap makes this harder to fix. Directors reviewing dashboards see ticket counts drop and declare victory. Practitioners on the ground feel increased friction because toil shifted from “boring but predictable” to “novel and unpredictable.”

| Dimension | AI Accelerated | Human Judgment Required |
| --- | --- | --- |
| Code generation | Higher code volume | Architectural review |
| Test creation | Unit test scaffolding | Integration test design |
| Deploy frequency | Higher deploy cadence | Change risk assessment |
| Incident detection | Faster alert correlation | Root cause judgment |
| Compliance | Automated scanning | Regulatory interpretation |

Table 1: AI accelerates delivery tasks (left column), but human judgment gaps (right column) are where incidents originate.


Figure 1: AI investment and measured toil both climbing, 2021-2026.

How Did We Measure AI’s Scope Creep?

We ran 90 file-prediction trials against tobymao/sqlglot, an MIT-licensed SQL transpiler with 9k+ stars: 30 merged PRs stratified across simple, medium, and complex tiers, 3 predictions each, using a single Claude Sonnet 4.6 Bedrock call with no agent loop or retrieval. Given a GitHub issue description and the repository file tree, the model predicted which files a human engineer modified. Scope hallucination counts files predicted beyond the human’s actual change set. For full data, see Supplementary Materials.

| Tier | Precision / Recall | F1 | Scope Creep |
| --- | --- | --- | --- |
| Simple (1-2 files) | 0.645 / 0.850 | 0.708 | 1.3 files |
| Medium (3-5 files) | 0.540 / 0.585 | 0.552 | 2.2 files |
| Complex (6-15 files) | 0.769 / 0.673 | 0.712 | 1.6 files |

Table 2: Results by complexity tier, 30 PRs x 3 runs = 90 total predictions. F1 is the harmonic mean of precision and recall. Full methodology, metrics, and per-tier results are in the Supplementary Materials.
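The metrics in Table 2 reduce to set comparisons between the predicted and actual change sets. A minimal sketch of the per-trial scoring (the file names in the example are hypothetical, not from our dataset):

```python
def score_prediction(predicted: set[str], actual: set[str]) -> dict:
    """Score one trial: the model's predicted files vs. the human change set."""
    hits = predicted & actual
    precision = len(hits) / len(predicted) if predicted else 0.0
    recall = len(hits) / len(actual) if actual else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # Scope creep: files predicted beyond the human's actual change set.
    scope_creep = len(predicted - actual)
    return {"precision": precision, "recall": recall,
            "f1": f1, "scope_creep": scope_creep}

# Hypothetical trial: the model drags in one extra shared file.
result = score_prediction(
    predicted={"sqlglot/dialects/duckdb.py",
               "tests/dialects/test_duckdb.py",
               "sqlglot/dialects/dialect.py"},
    actual={"sqlglot/dialects/duckdb.py",
            "tests/dialects/test_duckdb.py"},
)
# precision 0.667, recall 1.0, f1 0.8, scope_creep 1
```

The scope creep figure in the last column is this count averaged over a tier's 30 trials.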

Why Medium-Tier PRs Underperformed

The non-monotonic curve is the headline finding: Jaccard similarity scored 0.60 for simple PRs, 0.41 for medium, and 0.58 for complex. Medium PRs are the hardest tier: too many candidate files to guess by elimination, yet not enough structural regularity to infer the change set from sqlglot’s dialect-file conventions. Complex PRs scored highest because sqlglot’s rigid directory structure makes multi-file change sets predictable from the dialect name alone. 12 of 30 PRs failed with a Jaccard score below 0.5, spread across all tiers; no tier is immune. In a shadow-mode deployment, every over-predicted file is a false positive the reviewer must filter before the change reaches production.
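The per-PR score behind those 0.60 / 0.41 / 0.58 tier numbers is Jaccard similarity: intersection over union of the two file sets. A minimal sketch with hypothetical file names:

```python
def jaccard(predicted: set[str], actual: set[str]) -> float:
    """Intersection-over-union of predicted vs. actual change sets."""
    union = predicted | actual
    return len(predicted & actual) / len(union) if union else 1.0

# Hypothetical medium-tier miss: 2 of 4 predictions correct, 1 file missed.
score = jaccard({"a.py", "b.py", "c.py", "d.py"},
                {"a.py", "b.py", "e.py"})
# 2 shared files over 5 in the union = 0.4, below the 0.5 failure line
```

Jaccard penalizes both over-prediction and missed files symmetrically, which is why it is a harsher summary than precision or recall alone.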

What Does SRE Mean in a Regulated Enterprise?

SRE is not DevOps with a different name. The distinction is structural. In regulated financial services, production reliability carries regulatory weight: OSFI’s B-13 guideline mandates technology risk management with board-level accountability, and the EU’s DORA regulation sets equivalent requirements for operational resilience across European financial services (European Parliament and Council, 2022).

Error Budgets as Compliance Evidence

SRE answers this with a reliability contract. Error budgets define how much unreliability a service can tolerate before feature work stops. SLOs (service level objectives) make reliability measurable rather than aspirational. Blameless postmortems treat incidents as system failures, not personnel failures. Google codified this framework in 2016 and enterprises have since adapted it, but in regulated environments the stakes include regulatory censure, not just customer churn.
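The arithmetic behind an error budget is deliberately simple. A sketch with illustrative numbers (a 99.5% deploy-success SLO and 400 deploys in a 30-day window are assumptions, not figures from this article):

```python
def error_budget(slo: float, events: int) -> int:
    """How many failures the SLO tolerates across `events` attempts."""
    return int(events * (1 - slo))

def budget_remaining(slo: float, events: int, failures: int) -> int:
    """Failures still tolerable before feature work stops."""
    return error_budget(slo, events) - failures

# Illustrative: 400 deploys in the window under a 99.5% success SLO.
budget = error_budget(0.995, 400)        # 2 failed deploys allowed
left = budget_remaining(0.995, 400, 1)   # 1 failure so far -> 1 left
```

When `budget_remaining` hits zero, the contract says feature work pauses; that is the enforcement mechanism that makes the SLO more than aspiration.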

Platform Engineering provides capability: the tools, the internal developer platform, the golden paths. SRE provides accountability: the error budgets, the incident response, the production governance. The question is whether that accountability holds when the agent making changes is not human.

How Does SRE Act as AI’s Production Conscience?

Deploying an AI agent is a reliability problem, not a monitoring problem. Monitoring tells you something broke; a reliability framework tells you how much breakage you can tolerate, who caused it, and whether to keep going.

Decision Provenance and Error Budget Separation

Decision provenance, the AI observability requirement that every agent action links to its inputs, reasoning, and authorization chain, goes beyond logging what an agent did. You need to trace why it made a choice, what context it consumed, and which prior decisions influenced the outcome. Without this, debugging an autonomous system is archaeology, not engineering. Under OSFI B-13 and DORA, an agent action without decision provenance is not just a debugging gap; it is a compliance liability.

Separate error budgets for AI-generated changes keep machine-authored deployments from hiding behind human baselines. If an AI agent burns through its error budget, its write permissions get revoked automatically, not the entire team’s. Our results showed 0.769 precision even on the best-performing tier, meaning roughly 1 in 4 predicted files was wrong. That error rate needs its own budget.

groups:
  - name: sli.deploys
    rules:
      - record: sli:deploy_success:ratio1h
        expr: |
          sum(rate(deploy_success_total{author_type="ai"}[1h]))
          / sum(rate(deploys_total{author_type="ai"}[1h]))
        labels:
          author_type: ai
          slo_target: "0.995" # Stricter than human baseline of 0.990
      - record: sli:deploy_success:ratio1h
        expr: |
          sum(rate(deploy_success_total{author_type="human"}[1h]))
          / sum(rate(deploys_total{author_type="human"}[1h]))
        labels:
          author_type: human
          slo_target: "0.990"
      - alert: AIChangeErrorBudgetBurnRate
        expr: |
          (1 - sli:deploy_success:ratio1h{author_type="ai"})
          / (1 - 0.995) > 14.4
        for: 5m
        labels:
          severity: critical
          team: sre

Code Snippet 1: Prometheus recording rules and burn-rate alert for AI-authored deployments. Separate SLIs per author type; 14.4x burn-rate threshold (Google, 2018).

Production teams consistently trade agent capability for reliability, preferring narrower but predictable automation over broad but brittle autonomy (Pan et al., 2026). Separate error budgets formalize that trade-off.
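Where the 14.4x threshold comes from: burn rate is the observed error rate divided by the budgeted error rate (1 - SLO), and 14.4x is the conventional fast-burn threshold at which a 30-day budget is exhausted in roughly two days (Google, 2018). A sketch of the arithmetic, with an illustrative failure rate:

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """Observed error rate divided by the budgeted error rate (1 - SLO)."""
    return error_rate / (1 - slo)

def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """At this burn rate, hours until the whole window's budget is gone."""
    return window_days * 24 / rate

# Illustrative: AI deploys failing 7.2% of the time against a 99.5% SLO.
rate = burn_rate(0.072, 0.995)      # 14.4x: crosses the alert threshold
hours = hours_to_exhaustion(14.4)   # ~50h: 30-day budget gone in ~2 days
```

A burn rate of 1.0 means the budget lasts exactly the window; anything sustained above it is borrowed reliability.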

Blast Radius Containment and the Trust Ladder

Blast radius containment means progressive rollout gates. No agent ships to 100% on day one. The trust ladder is a graduated set of AI guardrails where each rung grants broader blast radius only after the agent demonstrates reliability at the current level:

1. Read-only: the agent observes and recommends; it writes nothing.
2. Shadow mode: the agent proposes changes that are recorded and compared against what humans actually ship, but never applied.
3. Supervised write: scoped writes, such as a single canary deployment, with human review.
4. Autonomous: broad write access, still gated by the agent's own error budget.

Our experiment is a proxy for what shadow mode catches. At the medium tier, where Jaccard dropped to 0.409, shadow mode would have flagged more than half the predicted change set as incorrect. The specific SLO threshold is yours to define; what matters is that it is explicit, measured, and tied to your error budget rather than a gut feeling.
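A shadow-mode gate can be sketched as: score each recorded proposal against the change the human actually shipped, then promote off shadow mode only when a rolling window clears the SLO you chose. The function names and the 0.5 threshold below are illustrative, not prescriptive:

```python
def shadow_score(proposed: set[str], shipped: set[str]) -> float:
    """Jaccard agreement between the agent's proposal and the human change."""
    union = proposed | shipped
    return len(proposed & shipped) / len(union) if union else 1.0

def promote(scores: list[float], slo: float = 0.5, window: int = 30) -> bool:
    """Leave shadow mode only if the rolling mean over a full window clears the SLO."""
    recent = scores[-window:]
    return len(recent) >= window and sum(recent) / len(recent) >= slo

# Illustrative: 30 shadow trials averaging 0.41, like our medium tier.
stays_in_shadow = not promote([0.41] * 30)  # True: does not clear the gate
```

The window requirement matters as much as the threshold: three lucky proposals should not promote an agent any more than three unlucky ones should demote it.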

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: ai-agent-readonly
rules:
  - apiGroups: [""]
    resources: [pods, services, configmaps]
    verbs: [get, list, watch]
  - apiGroups: [apps]
    resources: [deployments, replicasets]
    verbs: [get, list, watch]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: ai-agent-scoped-write
rules:
  - apiGroups: [""]
    resources: [pods, services, configmaps]
    verbs: [get, list, watch]
  - apiGroups: [apps]
    resources: [deployments]
    verbs: [get, list, watch, update, patch]
    resourceNames: [canary-payments]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: ai-agent-production-write
rules:
  - apiGroups: [""]
    resources: [pods, services, configmaps]
    verbs: [get, list, watch, create, update, patch, delete]
  - apiGroups: [apps]
    resources: [deployments, replicasets]
    verbs: [get, list, watch, create, update, patch, delete]

Code Snippet 2: Kubernetes RBAC ClusterRoles for each trust ladder tier. Promotion from readonly to scoped-write to production-write is a ServiceAccount rebinding; demotion reverses it.


Figure 2: Trust ladder for agentic AI: read-only, shadow mode, supervised write, autonomous.

Accuracy alone cannot distinguish an agent that fails on a fixed subset of tasks from one that fails unpredictably at the same rate (Rabanser et al., 2026). Our 90 runs confirmed consistency (27 of 30 PRs showed zero variance across runs, 3 showed near-zero) but exposed robustness and safety gaps on complex refactoring tasks. Consistent failures are exactly what shadow mode is designed to catch: the model’s errors are systematic, not random, and a human reviewer can filter them. Early evidence supports this approach: STRATUS, a multi-agent SRE system operating under similar progressive constraints, achieved a 1.5x improvement over baselines in automated failure mitigation (Chen et al., 2025).
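The consistency check behind the "27 of 30 PRs showed zero variance" claim is cheap to compute: score each of a PR's three runs and look at the spread. A sketch (the PR identifiers are hypothetical):

```python
from statistics import pvariance

def run_consistency(run_scores: dict[str, list[float]]) -> dict[str, float]:
    """Population variance of per-run Jaccard scores for each PR.

    Zero variance means the model fails (or succeeds) the same way every
    run: systematic errors a shadow-mode reviewer can learn to filter.
    """
    return {pr: pvariance(scores) for pr, scores in run_scores.items()}

variances = run_consistency({
    "pr-simple-01": [0.60, 0.60, 0.60],   # identical runs -> zero variance
    "pr-complex-07": [0.55, 0.57, 0.55],  # near-zero variance
})
```

This is the accuracy-vs-predictability distinction in miniature: two agents with the same mean score but different variances are very different risks.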

Why Does Platform Maturity Gate AI Readiness?

The 2025 DORA report is explicit: AI’s impact depends on the quality of the underlying organizational system. Bolt AI onto a fragile platform and you get faster fragility. An AI agent that auto-scales a misconfigured service does not fix the misconfiguration; it scales the blast radius. The AIOpsLab framework shows agent performance varies significantly with the quality of instrumented infrastructure underneath (Chen et al., 2025).

The maturity sequence matters. Build the IDP (internal developer platform) first, layer SRE practices including LLMOps telemetry for token consumption, latency, and decision traces on top, then introduce agentic AI. Skip a step and the agents inherit your tech debt at machine speed.

The Learning Time Deficit

Only 6% of SREs have dedicated, protected learning time (Catchpoint, 2026). You cannot build an SRE practice when the people staffing it have no time to learn the discipline. Concretely, 10% protected time means one half-day per week where an SRE studies agent failure modes, reviews postmortems from other teams, or shadow-tests a new observability tool without on-call interruptions. The organizations with the lowest toil trends treat learning hours like error budgets: protected, measured, and non-negotiable.

| Maturity Level | Platform State | SRE State | AI Readiness |
| --- | --- | --- | --- |
| Foundation | Manual provisioning | Reactive ops, no SLOs | Not ready |
| Standardized | Self-service IDP | SLOs defined, error budgets | Read-only agents |
| Measured | Golden paths adopted | Toil tracked, burn-rate alerts | Shadow mode agents |
| Optimized | Platform-as-product | Blameless culture, SLO-driven | Supervised write agents |
| Autonomous | Full self-service | Proactive reliability | Agentic AI with guardrails |

Table 3: Platform + SRE maturity levels and what each unlocks.


Figure 3: Maturity sequence, platform engineering and SRE prerequisites gate each AI readiness level.

Read-only agents need SLOs because without a defined “good,” the agent cannot distinguish signal from noise. Supervised write agents need blameless culture because humans must feel safe overriding the machine.

Where Should You Start?

Enforce the prerequisites before enabling agentic AI on any service:

package sre.ai.readiness

import rego.v1

default allow_agentic_ai := false

allow_agentic_ai if {
    input.slo_defined
    input.error_budget_policy
    input.shadow_period_days >= 30
    input.decision_provenance
    input.rollback_automated
    input.toil_measured
}

Code Snippet 3: OPA policy gate enforcing the minimum bar before any service receives agentic AI write access. Shadow period and decision provenance are the two gates most commonly skipped in practice.

The Four Actions in Order

Four actions, in order: measure your toil, define SLOs and an error budget policy, run agents in shadow mode, then grant scoped write access. Start with the toil audit, the only prerequisite you can measure without instrumentation already in place. Measure first. Then automate. Each rung you skip is a gap in your audit trail and a risk your board will eventually ask about.


References


