AI-Assisted Development: The Accountability Layer

A 2025 analysis found security flaws in 45% of AI-generated code outputs when context was underspecified (Zigler, 2025). AI accelerated the part of development that was never the bottleneck.

Implementation was already fast relative to debugging, architecture, and review. AI compressed it further. The slow parts, the ones requiring genuine domain understanding, got harder when code volume grew without a corresponding rise in qualified human attention. The result is not a productivity gain but a redistribution that benefits teams with strong review discipline and harms those without it.

The structural response is not better prompting. It is a governance layer that enforces human decision points where judgment matters: at authoring, at commit, and at ship. This post describes that layer in concrete terms.

Contents

Why Does AI-Assisted Development Create an Accountability Gap?
- The Volume Problem
- The Synthesis Degradation Finding
How Does AI Shift Time From Implementation to Judgment?
How Do Structured Workflows Enforce Human Decision Points?
- The Five-Phase Recipe
- Hard Blocks vs. Guidelines
How Do You Maintain Accountability When AI Writes the Code?
What Do Measured Time Savings Actually Look Like?
- CI Modernization (PR #52)
- Matrix Operations Feature (PR #109)
When Does This Approach Work — and When Does It Not?
- Security-Sensitive Tasks
- The Candidate Generation Model
What Does This Mean for Technical Leaders?
- The Perception Gap Risk
- Three Conditions for Success
References

Why Does AI-Assisted Development Create an Accountability Gap?

Software developers spend roughly 11% of their time coding; the rest is distributed across debugging, architecture, reviews, meetings, and operational tasks (Kumar et al., 2025). Coding was never the binding constraint.

The Volume Problem

When AI tools eliminate implementation friction, code volume increases. More pull requests, more changed lines, more design decisions embedded in generated output. But the human capacity for careful review does not scale with code volume. A reviewer evaluating 200 lines of handwritten code may face 800 lines of AI-generated output covering the same feature, with subtler assumptions embedded in the structure.

The Synthesis Degradation Finding

CHI 2025 research on AI-assisted design found that synthesis tasks, the core of architecture and design, show the steepest degradation under cognitive offloading (Shukla, Bui, Parsons et al., 2025). Reviewers who routinely approve AI output without deep engagement gradually lose the ability to catch what they once would have caught.

The accountability gap is structural: code is now generated faster than human judgment can validate it. Without explicit controls, the natural equilibrium is faster shipping with lower comprehension per merged line.

How Does AI Shift Time From Implementation to Judgment?

The Controlled Trial Baseline

The original case for AI coding assistance rested on controlled experiments. A randomized controlled trial with 95 developers found task completion 55.8% faster with AI assistance (Peng et al., 2023). That result held for greenfield, well-specified programming tasks. Later research complicates it.

Where the Research Diverges

METR’s 2025 study found that experienced developers working on real tasks with AI tools took 19% longer while perceiving a 20% speedup (Becker et al., 2025). The gap between perceived and measured performance is itself informative: the feeling of productivity increased while actual throughput declined. That is the metacognitive failure mode that governance controls are designed to interrupt.

**Table 1:** Time allocation shift under AI-assisted development. Implementation shrinks; review and judgment must grow to hold quality constant.

| Activity | Traditional | AI-Assisted | Direction |
| --- | --- | --- | --- |
| Implementation | High (bottleneck) | Low (AI-generated) | Shrinks |
| Debugging own code | High | Lower (less handwritten code) | Shrinks |
| Review and judgment | Low (limited by implementation time) | Must grow to match volume | Must grow |
| Architecture and design | Fragmented | Consolidated | Grows |
| Cognitive synthesis | Frequent, shallow | Fewer, deeper sessions required | Deepens |

Metacognition as the Hidden Variable

Prather et al. (2024) found that metacognitive skill, not experience level, determines whether a developer benefits or is harmed by AI assistance. The skill in question is the ability to accurately assess what you understand and what you do not. Developers with strong metacognition catch AI errors because they notice when generated code does something they cannot explain. Developers without it ship the errors because the code looked plausible.

Karpathy’s distinction between vibe coding and agentic engineering (Karpathy, 2025) maps directly onto this: vibe coding raises the floor for simple tasks and lowers the ceiling for complex ones. Agentic engineering, with explicit spec design, diff review, eval design, security oversight, and quality judgment, preserves the ceiling. The difference is whether human judgment is structurally required or merely available.

How Do Structured Workflows Enforce Human Decision Points?

The recipe workflow used in this practice formalizes the judgment gates that distinguish agentic engineering from vibe coding. A recipe is a YAML workflow definition that codifies process: AI handles analysis, research, and implementation; the human approves direction at mandatory STOP points before any code is written.

The Five-Phase Recipe

Each phase requires explicit human approval before the next begins:

ANALYZE: Understand the codebase, identify problem scope and affected files
RESEARCH: Explore 2-3 solution approaches with trade-offs; human selects direction
PLAN: Detailed implementation plan reviewed before any code is written
IMPLEMENT: Code, tests, documentation; AI executes against approved plan
PREPARE: PR creation, branch verification, push; human approves before opening
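
The gating rule above can be sketched as a small state machine. This is a minimal illustration, not the recipe's actual implementation: the phase names come from the list above, while `RecipeRun`, `GateError`, and the approver name are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Phase names taken from the five-phase list above.
PHASES = ["ANALYZE", "RESEARCH", "PLAN", "IMPLEMENT", "PREPARE"]


class GateError(RuntimeError):
    """Raised when work or approval is attempted out of order."""


@dataclass
class RecipeRun:
    # (phase, approver) pairs, in the order approvals were given.
    approvals: List[Tuple[str, str]] = field(default_factory=list)

    def open_phase(self) -> Optional[str]:
        """Return the first phase not yet approved, or None when all are done."""
        if len(self.approvals) == len(PHASES):
            return None
        return PHASES[len(self.approvals)]

    def start(self, phase: str) -> str:
        """Permit work only on the open phase; later phases stay blocked."""
        if phase != self.open_phase():
            raise GateError(f"cannot start {phase}: open phase is {self.open_phase()}")
        return f"running {phase}"

    def approve(self, phase: str, approver: str) -> None:
        """Record a named human approval for the open phase."""
        if not approver.strip():
            raise GateError("approval requires a named human")
        if phase != self.open_phase():
            raise GateError(f"cannot approve {phase}: open phase is {self.open_phase()}")
        self.approvals.append((phase, approver))
```

With this shape, `RecipeRun().start("IMPLEMENT")` raises `GateError` until ANALYZE, RESEARCH, and PLAN have each been approved by a named person, which is the property the STOP points are meant to guarantee.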

Code Snippet 1: GATE pattern from production recipe. AI presents constrained options; human selects direction before any code is written.

## Phase 1: RESEARCH

Understand scope and constraints:
- Read issue/PR description, linked discussions
- Identify affected files with `rg` and `analyze`
- Note CI requirements, test patterns, coding standards

### GATE: Research Summary  

**STOP - Present to user:**
- Problem statement (1-2 sentences)
- Affected files and scope
- Constraints discovered (CI, tests, dependencies)
- 2-3 possible approaches with trade-offs

**ASK:** "Which approach do you prefer?"

~/.config/goose/recipes/goose-coder.yaml

This pattern ensures that architectural direction is always a human decision, not an inference from incomplete context. The full recipe is available at goose-coder.yaml on GitHub Gist.

Hard Blocks vs. Guidelines

Governance controls operate at three layers. AGENTS.md is the policy declaration: it instructs the agent on commit conventions, identity requirements, and policy boundaries before any code is written. Local git hooks enforce the same contract at commit time.

**Table 2:** Three-layer governance enforcement stack

| Layer | Mechanism | Controls |
| --- | --- | --- |
| Policy declaration | AGENTS.md | Commit conventions, identity requirements, policy boundaries |
| Commit-time | git hooks | Conventional commits, DCO sign-off, protected branch block |
| Repository | Branch rulesets, code owner review | GPG signing, SLSA provenance, OpenSSF certification |

Repository controls enforce the rules again at the server: branch rulesets, required code owner review, GPG signing, and provenance attestation. All three layers are intentionally aligned: a well-configured agent should never trip a hook. REPO-STANDARDS.md documents the full pipeline.

Code Snippet 2: Global commit-msg hook enforces conventional commits and DCO. Hard blocks prevent non-compliant commits from reaching review.

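#!/usr/bin/env bash
# Assumed setup (not shown in the original excerpt):
# git passes the path of the commit message file as the first argument.
COMMIT_MSG_FILE="$1"
# Validate the subject line only; body lines are free-form.
COMMIT_MSG=$(head -n 1 "$COMMIT_MSG_FILE")

# Conventional commit format ("..." marks types elided in this excerpt)
CONVENTIONAL_REGEX='^(feat|fix|docs|...)(\([a-z0-9_-]+\))?(!)?: .{1,100}$'

if ! echo "$COMMIT_MSG" | grep -qE "$CONVENTIONAL_REGEX"; then
    echo "BLOCKED: Commit message must follow conventional format"
    exit 1
fi

# DCO required: a Signed-off-by trailer must appear somewhere in the message
if ! grep -q "^Signed-off-by:" "$COMMIT_MSG_FILE"; then
    echo "BLOCKED: Missing DCO (Signed-off-by)"
    exit 1
fi

~/.githooks/commit-msg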
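
The same two checks are easier to unit-test outside a shell. The sketch below mirrors the hook's logic in Python under a few assumptions: only the three commit types visible in the excerpt are listed (the rest are elided in the source and left out here), and `check_commit_message` is an illustrative name, not part of the published hook.

```python
import re

# Only the types visible in the hook excerpt; the source elides the rest,
# so extend this tuple to match your own hook.
TYPES = ("feat", "fix", "docs")
CONVENTIONAL = re.compile(
    r"^(" + "|".join(TYPES) + r")(\([a-z0-9_-]+\))?(!)?: .{1,100}$"
)
# Same test as the hook: any line beginning with the trailer key.
SIGNOFF = re.compile(r"^Signed-off-by:", re.MULTILINE)


def check_commit_message(message: str) -> list:
    """Return blocking problems; an empty list means the commit passes."""
    problems = []
    lines = message.splitlines()
    subject = lines[0] if lines else ""
    if not CONVENTIONAL.match(subject):
        problems.append("subject is not in conventional commit format")
    if not SIGNOFF.search(message):
        problems.append("missing DCO Signed-off-by trailer")
    return problems
```

Returning a list of problems rather than exiting on the first one lets a test suite assert on each rule separately; a hook wrapper would exit non-zero whenever the list is non-empty.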

How Do You Maintain Accountability When AI Writes the Code?

When any portion of code is AI-generated, the accountability question sharpens. The human who submits and certifies the code bears full responsibility for what it does, regardless of how much of it was generated. Making that accountability explicit requires controls that a policy document alone cannot provide. AI_POLICY.md formalizes four of them.

DORA 2025 confirms that AI acts as an organizational capability amplifier: the greatest return on investment accrues to teams with strong review discipline and platform engineering foundations, not to teams that simply adopt the tools (DORA / Google Cloud, 2025). The governance layer described here is precisely those foundations.

**Figure 1:** Governance chain from AI proposal to GPG-signed commit to SLSA provenance to OpenSSF certification. Each step adds a verifiable artifact; the chain is only as strong as the named human at the review gate.

Authorship and Attribution

Industry practice varies: the Linux kernel requires Assisted-by disclosure; Claude Code adds Co-Authored-By: Claude by default. The position here is accountability over attribution. DCO (Developer Certificate of Origin) sign-off is a responsibility certification, not an originality one: the committer certifies they have the right to submit the change and understand its contents. That certification is only honest if the reviewer has engaged with the code. As AI generates more of it, the reviewer role shifts from syntax-checker to spec-verifier and security judge: a higher-accountability function, not a diminishing one.

Commit Integrity

All commits are GPG-signed. A GPG signature binds the commit to the holder of a specific private key, so history rewritten after signing, or a commit forged under a contributor's name, shows up as a missing or invalid signature rather than passing silently. Combined with DCO, every commit carries a named, verified human who certified its contents. These two controls together make the audit trail tamper-evident: you can verify not just what changed but who vouched for it.
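
One way to audit this locally is to read git's own signature status. `git log --format='%H %G?'` prints each commit hash followed by a one-letter status (`G` for a good signature, `N` for none, `B` for bad, and so on). The sketch below parses that output; the function name and the choice to accept only `G` are assumptions for illustration, not the policy of any specific repository.

```python
# Status letters come from git's %G? placeholder. Accepting only "G" is an
# assumed policy: it rejects unsigned (N), bad (B), and unverifiable (E)
# commits, as well as signatures of unknown validity (U).
ACCEPTED = {"G"}


def commits_failing_signature_policy(log_output: str) -> list:
    """Parse '<hash> <status>' lines and return hashes that fail the policy."""
    failures = []
    for line in log_output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        commit, status = parts
        if status not in ACCEPTED:
            failures.append(commit)
    return failures
```

Feeding it the output of `git log --format='%H %G?' origin/main..HEAD` gives the list of commits on a branch that would not satisfy a signed-commits rule, before the server rejects the push.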

Review Accountability

Every non-trivial change requires a named human reviewer. The PR checklist includes an explicit attestation: the reviewer has read every line and can explain it. This standard matters specifically because AI-generated code can be syntactically correct and pass all tests while containing subtle assumptions the reviewer would catch if they engaged deeply. Approving on the basis of CI green alone is the failure mode the attestation is designed to prevent.
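
The attestation can also be checked mechanically before a merge is allowed. This is a minimal sketch: the checklist wording below is hypothetical (the actual PR template is not quoted here), and `attested` is an illustrative helper.

```python
import re

# Hypothetical checklist wording; substitute the line from your PR template.
ATTESTATION = "I have read every line of this change and can explain it"

# Matches a ticked markdown task-list item: "- [x] <attestation text>".
_PATTERN = re.compile(
    r"^\s*[-*] \[[xX]\] " + re.escape(ATTESTATION),
    re.MULTILINE,
)


def attested(pr_body: str) -> bool:
    """True only if the attestation checkbox is present and ticked."""
    return bool(_PATTERN.search(pr_body))
```

A check like this proves only that the box was ticked, not that the reading happened; it makes the claim explicit and attributable, which is the point of the attestation.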

Build Provenance and Certification

SLSA (Supply-chain Levels for Software Artifacts, a build integrity framework) Level 3 build provenance establishes a verifiable chain from source to artifact. Every build produces a signed attestation of what was compiled, from which commit, with which toolchain. OpenSSF (Open Source Security Foundation) Best Practices certification documents that these controls are maintained continuously.
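
The core of a provenance check is matching an artifact's digest against the subjects listed in its attestation. The sketch below assumes the attestation payload follows the in-toto Statement layout, where each entry in `subject` carries a `digest` map with a `sha256` key. It deliberately does not verify the signature on the attestation, so it illustrates the idea rather than replacing a dedicated verifier.

```python
import hashlib
import json


def artifact_matches_statement(artifact: bytes, statement_json: str) -> bool:
    """True if the artifact's SHA-256 appears among the statement's subjects.

    Assumes an in-toto style statement: {"subject": [{"digest": {"sha256": ...}}]}.
    Signature verification is out of scope for this sketch.
    """
    statement = json.loads(statement_json)
    digest = hashlib.sha256(artifact).hexdigest()
    return any(
        subject.get("digest", {}).get("sha256") == digest
        for subject in statement.get("subject", [])
    )
```

If the bytes on disk differ from what the build attested to, the digests no longer match and the function returns `False`, which is the tamper-evidence property provenance is meant to provide.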

These controls are not compliance theater. They are the structural response to the accountability gap: when code is generated faster than it can be understood, the governance layer is the only verified signal that a qualified human evaluated what ships.

What Do Measured Time Savings Actually Look Like?

Two production examples provide concrete data. Both are single-author measurements on a controlled codebase without a comparison group. They are illustrative, not generalizable. These gains held because the governance controls in the previous section ensured reviewer quality was sufficient to catch AI errors; without that condition, the same speed would produce a different outcome.

CI Modernization (PR #52)

math-mcp-learning-server had no CI workflow. The judgment call: build from scratch or adapt patterns from a similar project. AI identified Ruff, uv, and pytest-cov as the right stack. Review covered the risk assessment, tooling fit, and zero-regression confirmation. Result: approximately 20 minutes versus an estimated 3-4 hours, with CI runtime at 5 seconds and 67 tests passing at 83% coverage. Source: PR #52.

Matrix Operations Feature (PR #109)

Five matrix operation tools with NumPy integration. The judgment call: implement incrementally or batch with shared validation patterns. AI identified the common infrastructure needs: dimension validation, ToolError handling, DoS prevention via size limits. Review covered API design, error handling conventions, and security boundaries. Result: 2 minutes from PR creation to merge, 5 tools, 21 tests, 395 lines. Source: PR #109.

**Table 3:** Measured time savings from two production examples (single-author, no control group)

| Task | AI-Assisted | Traditional Estimate | Savings |
| --- | --- | --- | --- |
| CI modernization (PR #52) | ~20 min | 3-4 hours | ~90% |
| Matrix operations, 5 tools (PR #109) | 2 min | 1-2 hours | ~95% |

At 10 infrastructure tasks per month, savings at this rate recover approximately 60 hours per year per engineer. That estimate depends entirely on reviewer quality being sufficient to catch AI errors; the governance controls in the previous section are what make that condition hold.
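
The arithmetic behind these figures can be checked in a few lines. This is a sketch with illustrative function names; the 30-minutes-per-task value is an assumption back-solved from the 60-hour estimate, not a number stated above.

```python
def percent_saved(assisted_min: float, traditional_min: float) -> float:
    """Share of the traditional time that was saved, as a percentage."""
    return 100.0 * (1.0 - assisted_min / traditional_min)


def hours_per_year(tasks_per_month: int, minutes_saved_per_task: float) -> float:
    """Annual hours recovered for a given monthly task count."""
    return tasks_per_month * 12 * minutes_saved_per_task / 60.0


# CI modernization: 20 minutes against the 3-4 hour estimate.
low = percent_saved(20, 180)    # about 88.9
high = percent_saved(20, 240)   # about 91.7

# 10 tasks a month at an assumed 30 minutes saved per task.
annual = hours_per_year(10, 30)  # 60.0
```

Read this way, 60 hours per year corresponds to about half an hour saved per task, which is conservative relative to the per-task savings in Table 3.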

GitHub / Kalliamvakou (2024) found a 26% overall productivity increase across 4,867 developers, with 60-75% reporting increased job fulfillment. These aggregate figures align with the production examples directionally. The governance layer is what separates an individual positive experience from a reproducible organizational outcome.

When Does This Approach Work — and When Does It Not?

**Table 4:** Task type fit for AI-assisted development, based on production experience

| Task Type | Fit | Evidence |
| --- | --- | --- |
| CI/DevOps automation | High | 20 min vs 3-4 hrs (PR #52) |
| Feature implementation with established patterns | High | 2 min for 5 tools (PR #109) |
| Boilerplate and scaffolding | High | Common pattern in both PRs |
| Legacy code: analysis and documentation | High | aptu-coder: structured agent access without full context load |
| Greenfield architecture | Medium | More judgment gates needed |
| Security-sensitive code | Low | Context underspecification risk |
| Regex and parsing logic | Low | Subtle bugs compound |
| Legacy code: generation and modification | Low | Hallucination risk without grounding context |

Security-Sensitive Tasks

The security-sensitive category warrants elaboration. The 45% security flaw rate from Veracode (reported in Zigler, 2025) applies specifically when context is underspecified: incomplete specifications, ambiguous threat models, or missing documentation of invariants. It is not a universal finding about all AI-generated code. Well-specified codebases with clear security requirements produce substantially better results. Reviewer attestation and SLSA provenance are the mechanism for verifying that the higher-quality path was actually taken.

The Candidate Generation Model

The critical success factor is consistent across task types: the human evaluating AI proposals must have sufficient expertise to recognize errors. DORA 2025 confirms that AI amplifies existing organizational capability rather than substituting for it (DORA / Google Cloud, 2025). A team without strong review culture will see AI increase their defect rate, not decrease it.

Backlund (2024) frames this as a candidate generation problem: in large-scale projects it is infeasible to thoroughly research every decision, but more candidates increase the likelihood that the ideal solution is among them. AI excels at candidate generation. Human judgment determines which candidate ships.

What Does This Mean for Technical Leaders?

The Perception Gap Risk

The gap METR measured between perceived and actual performance (Becker et al., 2025) is an organizational risk, not just an individual one. Teams adopting AI tools without governance controls may be degrading engineering throughput while believing they are improving it.

Implementation was already a small fraction of developer time (Kumar et al., 2025). AI compresses it further. The question is whether the time freed is reinvested in the judgment tasks that determine output quality.

Three Conditions for Success

The governance layer described here works when three conditions hold: reviewer expertise is sufficient to evaluate AI proposals, the workflow enforces human approval at structural decision points, and cryptographic controls make accountability non-repudiable.

DORA 2025 is direct on this: teams with strong engineering foundations see the gains; teams without them see increased complexity, larger pull requests, and architectural drift (DORA / Google Cloud, 2025). The governance layer is what creates the foundation.

This is Part 1 of a two-post series. Part 2 scales these controls across multi-agent workflows: Orchestrating AI Agents: Subagent Architecture.

The practical test is simple: audit your last ten pull requests and ask how many reviewers could explain every line of AI-generated code before approving. That gap between “CI passed” and “I understand this” is where the accountability layer lives.

References

Backlund, Emil, “A Cost-Based Decision Framework for Software Engineers” (2024) — https://www.emilbacklund.com/p/a-cost-based-decision-framework-for
Becker et al. / METR, “Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity” (2025) — https://arxiv.org/abs/2507.09089
DORA / Google Cloud, “State of AI-assisted Software Development 2025” (2025) — https://dora.dev/dora-report-2025
GitHub / Kalliamvakou, “Research: quantifying GitHub Copilot’s impact on developer productivity and happiness” (2024) — https://github.blog/news-insights/research/research-quantifying-github-copilots-impact-on-developer-productivity-and-happiness/
Karpathy, Andrej, “Software Is Changing (Again)” (2025) — https://x.com/karpathy/status/1886192184808149082
Kumar et al., “Time Warp: The Gap Between Developers’ Ideal vs Actual Workweeks in an AI-Driven Era” (2025) — https://arxiv.org/abs/2502.15287
Peng et al., “The Impact of AI on Developer Productivity: Evidence from GitHub Copilot” (2023) — https://arxiv.org/abs/2302.06590
Prather et al., “The Effects of GitHub Copilot on Computing Students’ Programming Effectiveness, Efficiency, and Processes in Brownfield Programming Tasks” (2024) — https://arxiv.org/abs/2506.10051
Shukla, Bui, Parsons et al., “De-skilling, Cognitive Offloading, and Misplaced Responsibilities: Potential Ironies of AI-Assisted Design” (2025) — https://doi.org/10.1145/3706599.3719931
Zigler, Andrew, “Mise en Place for Agentic Coding: Deliberate Preparation as Context Engineering Methodology” (2025) — https://arxiv.org/abs/2605.05400