A 2025 analysis found security flaws in 45% of AI-generated code outputs when context was underspecified (Zigler, 2025). AI accelerated the part of development that was never the bottleneck.
This is the bottleneck inversion: implementation was already fast relative to the time spent on debugging, architecture, and review. AI compressed it further. The slow parts, the ones that require genuine domain understanding, got harder to do well when the code volume feeding into them increased without a corresponding increase in qualified human attention. The result is not a productivity gain. It is a productivity redistribution that benefits teams with strong review discipline and harms those without it.
The structural response is not better prompting. It is a governance layer that enforces human decision points at every stage where judgment matters: before code is written, before it is committed, and before it ships. This post describes that layer in concrete terms.
Contents
- Why Does AI-Assisted Development Create an Accountability Gap?
- How Does AI Shift Time From Implementation to Judgment?
- How Do Structured Workflows Enforce Human Decision Points?
- How Do You Maintain Accountability When AI Writes the Code?
- What Do Measured Time Savings Actually Look Like?
- When Does This Approach Work — and When Does It Not?
- What Does This Mean for Technical Leaders?
- References
Why Does AI-Assisted Development Create an Accountability Gap?
Software developers spend roughly 11% of their time coding; the rest is distributed across debugging, architecture, reviews, meetings, and operational tasks (Kumar et al., 2025). Coding was never the binding constraint.
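A back-of-envelope calculation makes the constraint concrete. The sketch below applies Amdahl's law to two figures cited in this post: the 11% coding share (Kumar et al., 2025) and the 55.8% speedup from the controlled trial (Peng et al., 2023). Combining numbers from two different studies is an assumption made purely for illustration, as is reading "55.8% faster" as a 55.8% reduction in coding time. The function name is hypothetical.

```python
def overall_time_saved(coding_share: float, coding_time_reduction: float) -> float:
    """Fraction of total working time saved when only the coding share speeds up."""
    if not (0.0 <= coding_share <= 1.0 and 0.0 <= coding_time_reduction <= 1.0):
        raise ValueError("both arguments must be fractions between 0 and 1")
    # Non-coding time is untouched; coding time shrinks by the given reduction.
    remaining = (1.0 - coding_share) + coding_share * (1.0 - coding_time_reduction)
    return 1.0 - remaining


# Assumed inputs: 11% of time spent coding, coding time cut by 55.8%.
saved = overall_time_saved(0.11, 0.558)
# Upper bound: coding time eliminated entirely.
ceiling = overall_time_saved(0.11, 1.0)

print(f"time saved with a 55.8% coding speedup: {saved:.1%}")
print(f"time saved if coding took zero time:    {ceiling:.1%}")
```

Under these assumptions, total time drops by roughly 6%, and it cannot drop by more than 11% however fast generation becomes. The remaining 89% is where review, debugging, and design live.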
The Volume Problem
When AI tools eliminate implementation friction, code volume increases. More pull requests, more changed lines, more design decisions embedded in generated output. But the human capacity for careful review does not scale with code volume. A reviewer evaluating 200 lines of handwritten code may face 800 lines of AI-generated output covering the same feature, with subtler assumptions embedded in the structure.
The Synthesis Degradation Finding
CHI 2025 research on AI-assisted development found that synthesis tasks, the core of architecture and design, show the steepest degradation under cognitive offloading (Shukla, Bui, Parsons et al., 2025). Reviewers who routinely approve AI output without deep engagement gradually lose the ability to catch what they once would have caught.
The accountability gap is structural: code is now generated faster than human judgment can validate it. Without explicit controls, the natural equilibrium is faster shipping with lower comprehension per merged line.
How Does AI Shift Time From Implementation to Judgment?
The Controlled Trial Baseline
The original case for AI coding assistance rested on controlled experiments. A randomized controlled trial with 95 developers found task completion 55.8% faster with AI assistance (Peng et al., 2023). That result held for greenfield, well-specified programming tasks. Later research complicates it.
Where the Research Diverges
METR’s 2025 study found that experienced developers working on real tasks with AI tools took 19% longer while perceiving a 20% speedup (Becker et al., 2025). The gap between perceived and measured performance is itself informative: the feeling of productivity increased while actual throughput declined. That is the metacognitive failure mode that governance controls are designed to interrupt.
Table 1: Time allocation shift under AI-assisted development. Implementation shrinks; review and judgment must grow to hold quality constant.
| Activity | Traditional | AI-Assisted | Direction |
|---|---|---|---|
| Implementation | High (bottleneck) | Low (AI-generated) | Shrinks |
| Debugging own code | High | Lower (less handwritten code) | Shrinks |
| Review and judgment | Low (limited by implementation time) | Must grow to match volume | Must grow |
| Architecture and design | Fragmented | Consolidated | Grows |
| Cognitive synthesis | Frequent, shallow | Fewer, deeper sessions required | Deepens |
Metacognition as the Hidden Variable
Prather et al. (2024) found that metacognitive skill, not experience level, determines whether a developer benefits or is harmed by AI assistance. The skill in question is the ability to accurately assess what you understand and what you do not. Developers with strong metacognition catch AI errors because they notice when generated code does something they cannot explain. Developers without it ship the errors because the code looked plausible.
Karpathy’s distinction between vibe coding and agentic engineering (Karpathy, 2025) maps directly onto this: vibe coding raises the floor for simple tasks and lowers the ceiling for complex ones. Agentic engineering, with explicit spec design, diff review, eval design, security oversight, and quality judgment, preserves the ceiling. The difference is whether human judgment is structurally required or merely available.
How Do Structured Workflows Enforce Human Decision Points?
The recipe workflow used in this practice formalizes the judgment gates that distinguish agentic engineering from vibe coding. A recipe is a YAML workflow definition that codifies process: AI handles analysis, research, and implementation; the human approves direction at mandatory STOP points before any code is written.
The Five-Phase Recipe
Each phase requires explicit human approval before the next begins:
- ANALYZE: Understand the codebase, identify problem scope and affected files
- RESEARCH: Explore 2-3 solution approaches with trade-offs; human selects direction
- PLAN: Detailed implementation plan reviewed before any code is written
- IMPLEMENT: Code, tests, documentation; AI executes against approved plan
- PREPARE: PR creation, branch verification, push; human approves before opening
Code Snippet 1: GATE pattern from production recipe. AI presents constrained options; human selects direction before any code is written.
## Phase 1: RESEARCH
Understand scope and constraints:
- Read issue/PR description, linked discussions
- Identify affected files with `rg` and `analyze`
- Note CI requirements, test patterns, coding standards
### GATE: Research Summary
**STOP - Present to user:**
- Problem statement (1-2 sentences)
- Affected files and scope
- Constraints discovered (CI, tests, dependencies)
- 2-3 possible approaches with trade-offs
**ASK:** "Which approach do you prefer?"
File: ~/.config/goose/recipes/goose-coder.yaml
This pattern ensures that architectural direction is always a human decision, not an inference from incomplete context. The full recipe is available at goose-coder.yaml on GitHub Gist.
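The gating logic itself is small enough to sketch in code. The example below is not the recipe's implementation (the recipe is a YAML prompt definition); it is a minimal model, under the assumption that each phase may start only once every earlier phase has a recorded human approval. `RecipeRun`, `GateError`, and the approver name are hypothetical.

```python
from dataclasses import dataclass, field

PHASES = ("ANALYZE", "RESEARCH", "PLAN", "IMPLEMENT", "PREPARE")


class GateError(RuntimeError):
    """Raised when a phase is started before all earlier phases were approved."""


@dataclass
class RecipeRun:
    approvals: dict = field(default_factory=dict)  # phase -> approver name

    def approve(self, phase: str, approver: str) -> None:
        """Record that a named human approved the output of a phase."""
        if phase not in PHASES:
            raise ValueError(f"unknown phase: {phase}")
        if not approver.strip():
            raise ValueError("approval must name a human approver")
        self.approvals[phase] = approver

    def can_start(self, phase: str) -> bool:
        """A phase may start only if every earlier phase has an approval."""
        idx = PHASES.index(phase)
        return all(p in self.approvals for p in PHASES[:idx])

    def start(self, phase: str) -> str:
        if not self.can_start(phase):
            # The earliest unapproved phase is the gate that blocks progress.
            missing = next(p for p in PHASES if p not in self.approvals)
            raise GateError(f"{phase} blocked: {missing} has not been approved")
        return f"{phase} started"


run = RecipeRun()
run.start("ANALYZE")
run.approve("ANALYZE", "alice")
run.start("RESEARCH")
try:
    run.start("IMPLEMENT")  # RESEARCH and PLAN are still unapproved
except GateError as exc:
    print(exc)
```

The design choice worth noting is that approval is data, not a prompt convention: a skipped gate raises an error instead of depending on the agent remembering to stop.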
Hard Blocks vs. Guidelines
Governance controls operate at three layers. AGENTS.md is the policy declaration: it instructs the agent on commit conventions, identity requirements, and policy boundaries before any code is written. Local git hooks enforce the same contract at commit time.
Table 2: Three-layer governance enforcement stack
| Layer | Mechanism | Controls |
|---|---|---|
| Policy declaration | AGENTS.md | Commit conventions, identity requirements, policy boundaries |
| Commit-time | git hooks | Conventional commits, DCO sign-off, protected branch block |
| Repository | Branch rulesets, code owner review | GPG signing, SLSA provenance, OpenSSF certification |
Repository controls enforce the rules again at the server: branch rulesets, required code owner review, GPG signing, and provenance attestation. All three layers are intentionally aligned: a well-configured agent should never trip a hook. REPO-STANDARDS.md documents the full pipeline.
Code Snippet 2: Global commit-msg hook enforces conventional commits and DCO. Hard blocks prevent non-compliant commits from reaching review.
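#!/usr/bin/env bash
# Assumed setup (not shown in the original excerpt): git passes the path of
# the commit message file as the first argument; validate its first line.
COMMIT_MSG_FILE="$1"
COMMIT_MSG="$(head -n 1 "$COMMIT_MSG_FILE")"
# Conventional commit format (the type list is elided in this excerpt)
CONVENTIONAL_REGEX='^(feat|fix|docs|...)(\([a-z0-9_-]+\))?(!)?: .{1,100}$'
if ! printf '%s\n' "$COMMIT_MSG" | grep -qE "$CONVENTIONAL_REGEX"; then
  echo "BLOCKED: Commit message must follow conventional format" >&2
  exit 1
fi
# DCO required: at least one Signed-off-by line in the message
if ! grep -q "^Signed-off-by:" "$COMMIT_MSG_FILE"; then
  echo "BLOCKED: Missing DCO (Signed-off-by)" >&2
  exit 1
fi
File: ~/.githooks/commit-msg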
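The same two checks can also be expressed in a unit-testable form, which is convenient if the policy needs to run in CI as well as locally. The sketch below is written in Python rather than shell and is not part of the original hook. The type list in `ASSUMED_TYPES` is an illustrative assumption, because the snippet above elides the real list, and `check_commit_message` is a hypothetical name.

```python
import re

# Assumption: illustrative type list; the hook above elides the real one.
ASSUMED_TYPES = ("feat", "fix", "docs", "refactor", "test", "chore")

# Same shape as the hook's regex: type, optional scope, optional "!", subject.
SUBJECT_RE = re.compile(
    r"^(?:" + "|".join(ASSUMED_TYPES) + r")(\([a-z0-9_-]+\))?(!)?: .{1,100}$"
)
# Same DCO test as the hook: any line beginning with "Signed-off-by:".
SIGNOFF_RE = re.compile(r"^Signed-off-by:", re.MULTILINE)


def check_commit_message(message: str) -> list[str]:
    """Return a list of policy violations; an empty list means the message passes."""
    problems = []
    lines = message.splitlines()
    subject = lines[0] if lines else ""
    if not SUBJECT_RE.match(subject):
        problems.append("subject must follow conventional commit format")
    if not SIGNOFF_RE.search(message):
        problems.append("missing DCO (Signed-off-by)")
    return problems


good = "feat(matrix): add determinant tool\n\nSigned-off-by: Jane Doe <jane@example.com>\n"
print(check_commit_message(good))            # no violations
print(check_commit_message("update stuff"))  # both checks fail
```

Returning a list of violations instead of exiting on the first one lets a caller report every problem in a single pass.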
How Do You Maintain Accountability When AI Writes the Code?
When any portion of code is AI-generated, the accountability question sharpens. The human who submits and certifies the code bears full responsibility for what it does, regardless of how much of it was generated. Making that accountability explicit requires controls that a policy document alone cannot provide. AI_POLICY.md formalizes four of them.
DORA 2025 confirms that AI acts as an organizational capability amplifier: the greatest return on investment accrues to teams with strong review discipline and platform engineering foundations, not to teams that simply adopt the tools (DORA / Google Cloud, 2025). The governance layer described here is precisely those foundations.
Figure 1: Governance chain. Each step adds a verifiable artifact; the chain is only as strong as the named human at the review gate.
Authorship and Attribution
Industry practice varies: the Linux kernel requires Assisted-by disclosure; Claude Code adds Co-Authored-By: Claude by default. The position here is accountability over attribution. DCO sign-off is a responsibility certification, not an originality one: the committer certifies they have the right to submit the change and understand its contents. That certification is only honest if the reviewer has engaged with the code. As AI generates more of it, the reviewer role shifts from syntax-checker to spec-verifier and security judge: a higher-accountability function, not a diminishing one.
Commit Integrity
All commits are GPG-signed. A GPG signature cryptographically binds the commit to the holder of a verified signing key, so any later alteration of signed history or attempt to impersonate a contributor fails signature verification instead of passing silently. Combined with DCO, every commit carries a named, verified human who certified its contents. Together, these two controls make the audit trail tamper-evident: you can verify not just what changed but who vouched for it.
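An audit of both controls can be automated. The following sketch assumes commit records have already been exported as (hash, signature status, message) tuples, for instance with git's `%H`, `%G?`, and `%B` log format placeholders. The function name, the record type, and the policy of accepting only status `G` are assumptions for illustration.

```python
from typing import Iterable, NamedTuple


class CommitRecord(NamedTuple):
    sha: str
    signature_status: str  # git's %G? code, e.g. "G" good, "N" none, "B" bad
    message: str


def audit_commits(records: Iterable[CommitRecord]) -> dict[str, list[str]]:
    """Map each non-compliant commit hash to the controls it fails."""
    failures: dict[str, list[str]] = {}
    for rec in records:
        problems = []
        # Assumed policy: only a good, fully valid signature ("G") is accepted.
        if rec.signature_status != "G":
            problems.append(f"signature status {rec.signature_status!r}")
        # DCO: at least one line starting with the Signed-off-by trailer.
        if not any(line.startswith("Signed-off-by:") for line in rec.message.splitlines()):
            problems.append("missing Signed-off-by")
        if problems:
            failures[rec.sha] = problems
    return failures


# Hypothetical sample history.
history = [
    CommitRecord("a1b2c3", "G", "fix: handle empty input\n\nSigned-off-by: Jane Doe <jane@example.com>"),
    CommitRecord("d4e5f6", "N", "feat: add tool"),
]
print(audit_commits(history))
```

Only the second commit is reported, and it fails both controls: it is unsigned and carries no sign-off.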
Review Accountability
Every non-trivial change requires a named human reviewer. The PR checklist includes an explicit attestation: the reviewer has read every line and can explain it. This standard matters specifically because AI-generated code can be syntactically correct and pass all tests while containing subtle assumptions the reviewer would catch if they engaged deeply. Approving on the basis of CI green alone is the failure mode the attestation is designed to prevent.
Build Provenance and Certification
SLSA Level 3 build provenance establishes a verifiable chain from source to artifact. Every build produces a signed attestation of what was compiled, from which commit, with which toolchain. OpenSSF Best Practices certification documents that these controls are maintained continuously.
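One part of consuming such an attestation is checking that the artifact in hand is the one the attestation describes. The sketch below assumes the provenance has already been signature-verified and decoded into an in-toto statement, that is, a dictionary whose `subject` list carries `sha256` digests. Signature verification itself is out of scope here, and the function name and sample data are hypothetical.

```python
import hashlib


def artifact_matches_provenance(artifact: bytes, statement: dict) -> bool:
    """True if the artifact's SHA-256 appears among the statement's subjects."""
    digest = hashlib.sha256(artifact).hexdigest()
    subjects = statement.get("subject", [])
    return any(s.get("digest", {}).get("sha256") == digest for s in subjects)


# Hypothetical artifact and a matching statement, assumed already verified.
artifact = b"example wheel contents"
statement = {
    "_type": "https://in-toto.io/Statement/v1",
    "predicateType": "https://slsa.dev/provenance/v1",
    "subject": [
        {
            "name": "example-0.1.0-py3-none-any.whl",
            "digest": {"sha256": hashlib.sha256(artifact).hexdigest()},
        }
    ],
}

print(artifact_matches_provenance(artifact, statement))     # True
print(artifact_matches_provenance(b"tampered", statement))  # False
```

A digest match is only meaningful once the statement's signature has been checked against a trusted builder identity; without that step, anyone could produce a matching statement.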
These controls are not compliance theater. They are the structural response to the accountability gap: when code is generated faster than it can be understood, the governance layer is the only verified signal that a qualified human evaluated what ships.
What Do Measured Time Savings Actually Look Like?
Two production examples provide concrete data. Both are single-author measurements on a controlled codebase without a comparison group. They are illustrative, not generalizable.
CI Modernization (PR #52)
math-mcp-learning-server had no CI workflow. The judgment call: build from scratch or adapt patterns from a similar project. AI identified Ruff, uv, and pytest-cov as the right stack. Review covered the risk assessment, tooling fit, and zero-regression confirmation. Result: approximately 20 minutes versus an estimated 3-4 hours, with CI runtime at 5 seconds and 67 tests passing at 83% coverage. Source: PR #52.
Matrix Operations Feature (PR #109)
Five matrix operation tools with NumPy integration. The judgment call: implement incrementally or batch with shared validation patterns. AI identified the common infrastructure needs: dimension validation, ToolError handling, DoS prevention via size limits. Review covered API design, error handling conventions, and security boundaries. Result: 2 minutes from PR creation to merge, 5 tools, 21 tests, 395 lines. Source: PR #109.
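The shared validation pattern might look roughly like the following. This is a sketch of the idea, not the code from PR #109: `ToolError` is defined locally as a stand-in for the server's error type, `MAX_DIM` is an assumed limit (the post does not state the real one), and the function names are hypothetical. It uses plain lists to stay dependency-free, whereas the actual tools use NumPy.

```python
class ToolError(Exception):
    """Local stand-in for the server's tool error type (assumption)."""


MAX_DIM = 100  # Assumed size limit; the real limit is not stated in the post.


def validate_matrix(matrix: list, name: str = "matrix") -> tuple[int, int]:
    """Check that a matrix is non-empty, rectangular, numeric, and within size limits."""
    if not isinstance(matrix, list) or not matrix:
        raise ToolError(f"{name} must be a non-empty list of rows")
    # Check the row count before iterating, so oversized input is rejected cheaply.
    if len(matrix) > MAX_DIM:
        raise ToolError(f"{name} exceeds {MAX_DIM} rows")
    first = matrix[0]
    if not isinstance(first, list) or not first:
        raise ToolError(f"{name} rows must be non-empty lists")
    cols = len(first)
    if cols > MAX_DIM:
        raise ToolError(f"{name} exceeds {MAX_DIM} columns")
    for row in matrix:
        if not isinstance(row, list) or len(row) != cols:
            raise ToolError(f"{name} rows must all have {cols} entries")
        for value in row:
            # bool is a subclass of int, so it is excluded explicitly.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                raise ToolError(f"{name} entries must be numbers")
    return len(matrix), cols


def validate_multiply(a: list, b: list) -> tuple[int, int]:
    """Validate both operands and return the shape of the product a @ b."""
    a_rows, a_cols = validate_matrix(a, "a")
    b_rows, b_cols = validate_matrix(b, "b")
    if a_cols != b_rows:
        raise ToolError(f"cannot multiply {a_rows}x{a_cols} by {b_rows}x{b_cols}")
    return a_rows, b_cols


print(validate_multiply([[1, 2, 3]], [[1], [2], [3]]))  # (1, 1)
```

Centralizing these checks is what makes batching five tools reviewable: the reviewer verifies the security boundary once instead of five times.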
Table 3: Measured time savings from two production examples (single-author, no control group)
| Task | AI-Assisted | Traditional Estimate | Savings |
|---|---|---|---|
| CI modernization (PR #52) | ~20 min | 3-4 hours | ~90% |
| Matrix operations, 5 tools (PR #109) | 2 min | 1-2 hours | ~97% |
At 10 infrastructure tasks per month, savings at this rate recover approximately 60 hours per year per engineer, which works out to about 30 minutes saved per task, a conservative average relative to Table 3. That estimate depends entirely on reviewer quality being sufficient to catch AI errors; the governance controls in the previous section are what make that condition hold.
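The arithmetic behind Table 3 and the annual estimate can be reproduced directly. The sketch below uses only figures from this post, recomputes the savings range for each example, and backs out the per-task saving that the 60-hour figure implies. The function names are hypothetical.

```python
def savings_fraction(ai_minutes: float, baseline_minutes: float) -> float:
    """Share of the baseline time that the AI-assisted run avoided."""
    return 1.0 - ai_minutes / baseline_minutes


def annual_hours_saved(tasks_per_month: int, minutes_saved_per_task: float) -> float:
    """Extrapolate a per-task saving to hours per year."""
    return tasks_per_month * 12 * minutes_saved_per_task / 60.0


# Table 3 inputs in minutes: (AI-assisted, low estimate, high estimate).
examples = {
    "CI modernization (PR #52)": (20, 180, 240),
    "Matrix operations (PR #109)": (2, 60, 120),
}
for label, (ai, low, high) in examples.items():
    print(f"{label}: {savings_fraction(ai, low):.1%} to {savings_fraction(ai, high):.1%}")

# Per-task saving implied by 60 hours/year at 10 tasks/month.
implied_minutes = 60 * 60 / (10 * 12)
print(f"implied saving per task: {implied_minutes:.0f} min")
print(f"check: {annual_hours_saved(10, implied_minutes):.0f} hours/year")
```

Because the traditional times are estimates and not measurements, the percentage ranges are only as reliable as those estimates.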
GitHub / Kalliamvakou (2024) found a 26% overall productivity increase across 4,867 developers, with 60-75% reporting increased job fulfillment. These aggregate figures align with the production examples directionally. The governance layer is what separates an individual positive experience from a reproducible organizational outcome.
When Does This Approach Work — and When Does It Not?
Table 4: Task type fit for AI-assisted development, based on production experience
| Task Type | Fit | Evidence |
|---|---|---|
| CI/DevOps automation | High | 20 min vs 3-4 hrs (PR #52) |
| Feature implementation with established patterns | High | 2 min for 5 tools (PR #109) |
| Boilerplate and scaffolding | High | Common pattern in both PRs |
| Legacy code: analysis and documentation | High | aptu-coder: structured agent access without full context load |
| Greenfield architecture | Medium | More judgment gates needed |
| Security-sensitive code | Low | Context underspecification risk |
| Regex and parsing logic | Low | Subtle bugs compound |
| Legacy code: generation and modification | Low | Hallucination risk without grounding context |
Security-Sensitive Tasks
The security-sensitive category warrants elaboration. The 45% security flaw rate from Veracode (reported in Zigler, 2025) applies specifically when context is underspecified: incomplete specifications, ambiguous threat models, or missing documentation of invariants. It is not a universal finding about all AI-generated code. Well-specified codebases with clear security requirements produce substantially better results. Reviewer attestation and SLSA provenance are the mechanisms for verifying that the higher-quality path was actually taken.
The Candidate Generation Model
The critical success factor is consistent across task types: the human evaluating AI proposals must have sufficient expertise to recognize errors. DORA 2025 confirms that AI amplifies existing organizational capability rather than substituting for it (DORA / Google Cloud, 2025). A team without strong review culture will see AI increase their defect rate, not decrease it.
Backlund (2024) frames this as a candidate generation problem: in large-scale projects it is infeasible to thoroughly research every decision, but more candidates increase the likelihood that the ideal solution is among them. AI excels at candidate generation. Human judgment determines which candidate ships.
What Does This Mean for Technical Leaders?
The Perception Gap Risk
The perception gap measured by METR, where developers felt faster while working slower (Becker et al., 2025), is an organizational risk, not just an individual one. Teams adopting AI tools without governance controls may be degrading engineering throughput while believing they are improving it.
Implementation was already a small fraction of developer time (Kumar et al., 2025). AI compresses it further. The question is whether the time freed is reinvested in the judgment tasks that determine output quality.
Three Conditions for Success
The governance layer described here works when three conditions hold: reviewer expertise is sufficient to evaluate AI proposals, the workflow enforces human approval at structural decision points, and cryptographic controls make accountability non-repudiable.
DORA 2025 is direct on this: teams with strong engineering foundations see the gains; teams without them see increased complexity, larger pull requests, and architectural drift (DORA / Google Cloud, 2025). The governance layer is what creates the foundation.
This is Part 1 of a two-post series. Part 2 scales these controls across multi-agent workflows: Orchestrating AI Agents: Subagent Architecture.
References
- Backlund, Emil, “A Cost-Based Decision Framework for Software Engineers” (2024) — https://www.emilbacklund.com/p/a-cost-based-decision-framework-for
- Becker et al. / METR, “Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity” (2025) — https://arxiv.org/abs/2507.09089
- DORA / Google Cloud, “State of AI-assisted Software Development 2025” (2025) — https://dora.dev/dora-report-2025
- GitHub / Kalliamvakou, “Research: quantifying GitHub Copilot’s impact on developer productivity and happiness” (2024) — https://github.blog/news-insights/research/research-quantifying-github-copilots-impact-on-developer-productivity-and-happiness/
- Karpathy, Andrej, “Software Is Changing (Again)” (2025) — https://x.com/karpathy/status/1886192184808149082
- Kumar et al., “Time Warp: The Gap Between Developers’ Ideal vs Actual Workweeks in an AI-Driven Era” (2025) — https://arxiv.org/abs/2502.15287
- Peng et al., “The Impact of AI on Developer Productivity: Evidence from GitHub Copilot” (2023) — https://arxiv.org/abs/2302.06590
- Prather et al., “The Effects of GitHub Copilot on Computing Students’ Programming Effectiveness, Efficiency, and Processes in Brownfield Programming Tasks” (2024) — https://arxiv.org/abs/2506.10051
- Shukla, Bui, Parsons et al., “De-skilling, Cognitive Offloading, and Misplaced Responsibilities: Potential Ironies of AI-Assisted Design” (2025) — https://doi.org/10.1145/3706599.3719931
- Zigler, Andrew, “Mise en Place for Agentic Coding: Deliberate Preparation as Context Engineering Methodology” (2025) — https://arxiv.org/abs/2605.05400