Agentic XP: Moving Rigour Left

The Pull Request Is No Longer a Reliable Control

As AI-assisted coding has become routine, a structural problem has emerged in modern engineering workflows. The pull request (PR) no longer functions as a reliable mechanism for establishing correctness. This is not a failure of individual diligence, but of workflow design. Under the volume and velocity of AI generation, PRs fail because the cognitive load makes careful reasoning increasingly difficult to sustain over time.

When the system proposing a solution also defines the criteria by which it is evaluated, verifiability collapses. The problem is not just that errors occur, but that the chain between intent, decision and outcome can no longer be reliably reconstructed.

Under this model, review shifts from verification to a judgement call based on plausibility. In 2025, GitHub reported 43.2 million monthly merged pull requests, a 23% year-over-year increase driven by AI adoption (GitHub, 2025). However, this volume has not yielded equivalent value. Analysis confirms that merge success rates are declining as teams struggle to filter redundant slop from the surge in agent-generated code (GitClear, 2025; Purdie, 2025).

In earlier work, I described this as a breakdown in reasoning integrity. In engineering terms, it reflects a deeper issue: inspection has been pushed to the end of the process, where it no longer scales.

What Extreme Programming Understood

Extreme Programming (XP) is often remembered for its rituals: pair programming, test-driven development (TDD) and small batch delivery. Its core insight was far more fundamental: correctness is established continuously during creation rather than inspected after the fact (Beck, 2004). XP was about maintaining alignment so that correctness emerged during development rather than being inferred later.

In traditional XP, two humans share responsibility in real time. One focuses on implementation while the other maintains context, constraints and future implications. Tests are written first as executable definitions of what must be true. This approach historically reduced defect density by 40% to 90% by moving the cost of inspection to the point of creation (Beck, 2002).

When practiced rigorously, this discipline reduced the need for post hoc review. PRs were reserved for exceptional changes because validity was established through TDD, pairing, and immediate integration (Beck, 2004; Duvall et al., 2007).

While this discipline has faded in favour of the pull request, the conditions that made it valuable have intensified. Agentic XP preserves the principle of continuous correctness while reallocating the division of labour between humans and AI. It does not automate XP; it transforms the mechanism by which rigour is enforced.

Agentic XP is not just AI-assisted coding. It is a shift in where correctness is established. Instead of prompting a model and hoping for the best, the human defines the success criteria first. The system then iterates until the output is verified against those metrics.

AI Changes the Division of Labour, Not the Responsibility

AI systems can now produce code at a speed that overwhelms traditional review workflows. This has led some teams to treat review as optional, or to rely on surface-level checks.

That response is understandable, but incorrect.

There is a practical conservation effect at work. If rigour is removed from the end of the process, it must appear earlier or the system degrades. The question is not whether rigour exists but whether it is explicit, testable, and accountable.

In the presence of AI, the XP division of labour can be restored, but the roles change.

Figure 1: Where Rigour Lives: Traditional vs. Agentic XP

Where Rigour Lives: Traditional vs. Agentic XP

Agentic XP: Definition Before Optimisation

In an AI-mediated workflow the human and the system do not collaborate through conversation. They collaborate through a contract that exists independently of any single interaction.

The human takes responsibility for definition:

Inputs and outputs
Types, schemas, and invariants
Failure modes and prohibited behaviour
The criteria that determine success

The AI takes responsibility for optimisation:

Exploring implementation strategies
Iterating toward the defined criteria
Producing code that satisfies the contract

This is not prompt engineering in the conversational sense. It is closer to compilation.

This makes the definition layer containing contracts, metrics, and evaluation pipelines the new foundational artifact. Its creation becomes the primary engineering challenge, demanding a discipline distinct from both traditional coding and prompt engineering.

Signatures as Contracts

The human defines the shape of a valid solution before any implementation exists. This includes what information is provided, what form the output must take and which constraints are non-negotiable. This shift from conversational "prompting" to declarative "programming" is embodied in frameworks like DSPy, where the human defines signatures (Input/Output contracts) and metrics (Success criteria) rather than raw instructions (Stanford NLP, 2025).

The signature does not specify how the problem should be solved. It specifies what a correct solution must look like.

This establishes the first guardrail.

Tests as Metrics

In traditional TDD tests fail until the implementation satisfies them. In an agentic workflow, tests function as metrics. They score outputs against explicit criteria.

These metrics can include:

Static analysis and type safety
Complexity bounds
Behavioural preservation
Explicit stylistic or safety constraints

Crucially, these metrics are written by humans and evaluated mechanically. The system generating the solution does not define success. It is measured against it.

A metric failure indicates not just incorrect behaviour, but a breach of an explicit statement of intent.

This restores intent fidelity. The definition of “done” exists independently of the system attempting to satisfy it. It can be inspected, challenged and revised without rerunning the work.

The Metric Is the Primary Attack Surface

Any system that optimises against a metric will attempt to exploit it. In agentic systems, the metric effectively becomes the primary attack surface. Agentic XP does not solve the fundamental risk that a proxy ceases to be a reliable measure when used as a target, as formalised in the variants of Goodhart’s Law (Manheim & Garrabrant, 2019). Instead, it forces teams to confront this risk earlier, when failure is cheaper and assumptions are still inspectable.

Writing a metric that cannot be gamed by a capable model is not a routine task. It is a Tier-1 engineering challenge, often harder than writing the implementation it replaces. A weak metric does not merely fail to enforce correctness but actively creates false confidence.

Agentic XP does not eliminate this risk. It makes it explicit. By moving optimisation behind a human-defined metric the primary failure mode shifts from hidden implementation defects to visible, auditable definition errors.

The Hidden Layer – Who Builds the Loops?

Agentic XP introduces a hidden layer comprising three distinct responsibilities:

Definition: Capturing intent and non-negotiable constraints.
Audit: Stress-testing metrics against gaming and misalignment.
Infrastructure: Maintaining the optimisation and evaluation pipelines.

This layer is essential: without it, “metric auditing” falls back into informal review and the intended rigour is lost.

Optimisation as a Compile Loop

Given a contract and a metric the AI’s role is not creative exploration in the abstract. It is constrained optimisation. Modern tooling allows this process to run iteratively, with systems like TextGrad treating metric failures as textual gradients that refine the solution until the contract is satisfied (Yuksekgonul, 2025). Frameworks like these provide the syntax but operationalising this at scale requires building and governing new production-grade pipelines for optimisation and evaluation. This is a significant platform investment.

The system proposes a candidate solution, evaluates it against the metric, adjusts and iterates until the contract is satisfied. This loop may execute many times before a result is presented to a human.

Because the optimisation loop is driven by externally defined metrics rather than internal model confidence, each iteration reduces variance rather than increasing it. By the time the output is reviewed, baseline correctness is no longer in question. Syntax, basic behaviour and defined constraints have already been enforced computationally.

This shifts human effort away from inspection and toward judgement.

The Cost of the Loop

Iterative optimisation is not free. The compute, token and tooling costs must be considered and balanced against the efficiency gains, regardless of how responsibilities are structured within the team or organisation.

Why This Reduces the Pull Request Bottleneck

In a traditional pull request, reviewers are asked to do several things at once:

Verify correctness
Enforce style
Detect unintended consequences
Assess architectural alignment

Under AI-scale output, this becomes impractical.

In an agentic XP workflow most mechanical checks occur before human attention is required. Style, structure, and baseline correctness are enforced by metrics. Repetition is handled computationally.

What remains is strategic review. Questions of long-term direction, boundary definition and fundamental assumptions are not suited to automated evaluation.

The pull request does not disappear. Its function changes.

For routine, well-bounded work, merging can be automated once the metric threshold is met. For novel or architectural changes, review becomes a high-bandwidth consultation focused on intent rather than correctness.

This is a direct application of the guardrails over gates principle. Inspection is replaced with constraint.

This represents a transition to management by exception. By automating the verification of baseline correctness, the PR ceases to be a gate for the mundane and becomes a high-value consultation on the exceptional.

The Resulting Shift in Skills

When verification moves from human inspection to executable definition, the skills required to ensure correctness change accordingly.

This model does not eliminate responsibility. It concentrates it earlier in the process.

The Shift to Metric Auditing

When implementation is automated, the primary engineering function shifts from writing code to auditing definitions. This transition is critical because AI adoption has reached 90% while developer distrust remains high at 46%. This "Trust Paradox" can lead to significant verification overhead if metrics are not robust (DORA, 2025; Metr.org, 2025).

This audit function focuses on the integrity of the definition:

Can the metric be gamed?
Does it encode the right trade-offs?
Which failure modes remain untested?
Are the underlying assumptions still valid?

This work is less visible than code review but far higher leverage. It also changes the nature of mentorship. Experienced practitioners lead this audit while teaching the discipline to others. This process of learning to define constraints rather than just correcting code acts as a necessary counter to the contextual blindness I described in The Senior Janitor.

Maintaining Shared Context

As coding throughput increases, context becomes the limiting factor. Examples, constraints and shared definitions determine whether optimisation produces value or noise.

Stewardship of this context shifts from throughput tracking toward maintaining:

Consistency: Ensuring contracts, metrics and examples remain coherent.
Relevance: Aligning constraints with current system and business goals.
Efficiency: Keeping optimisation loops within economic limits, including compute and token budgets.

Limits and Failure Modes

This model is not a universal solution. Its success is bounded by:

The Definition of Intent: Work resistant to executable specification, such as exploratory design or refactoring legacy systems with unknown constraints.
The Strength of Metrics: The risk that a poorly designed metric becomes a target to be gamed, creating false confidence.
The Fidelity of the Loop: The cost and reliability of the iterative optimisation pipeline itself.

Metrics can also be wrong. They can encode incorrect assumptions or omit critical constraints. Agentic XP does not eliminate failure, but it replaces opaque review fatigue with explicit, inspectable constraints. Responsibility is not removed; it is made explicit.

Conclusion: Rigour Still Sets the Pace

AI has not removed the need for discipline in software engineering. It has exposed where that discipline must live.

If correctness is left to late-stage inspection, teams will drown in output they cannot meaningfully verify. If rigour is moved upstream into contracts, metrics, and definitions then scale becomes possible without abandoning validity or accountability.

This is not a return to Extreme Programming as it was practiced decades ago. It is an adaptation of its core insight: correctness is something you design for, not something you hope to detect later.

The tools have changed. The responsibility has not.

References

Beck, K. (2002). Test-Driven Development: By Example. Addison-Wesley.

Beck, K. (2004). Extreme Programming Explained: Embrace Change (2nd ed.). Addison-Wesley.

Duvall, P., Matyas, S., & Glover, A. (2007). Continuous Integration: Improving Software Quality and Reducing Risk. Addison-Wesley.

DORA. (2025). The State of AI-assisted Software Development: 2025 Research Report. Google Cloud / DevOps Research and Assessment. https://cloud.google.com/resources/content/2025-dora-ai-assisted-software-development-report

GitClear. (2025). AI Assistant Code Quality: 2025 Research. GitClear. https://www.gitclear.com/ai_assistant_code_quality_2025_research

GitHub. (2025, October 29). Octoverse: A new developer joins GitHub every second as AI leads TypeScript to #1. GitHub Blog. https://github.blog/news-insights/octoverse/octoverse-a-new-developer-joins-github-every-second-as-ai-leads-typescript-to-1/

Metr.org. (2025, July 10). Experienced open-source developers took 19% longer to complete tasks with AI assistance. METR Blog. https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/

Manheim, D., & Garrabrant, S. (2019). Categorizing variants of Goodhart's Law. arXiv. https://arxiv.org/abs/1803.04585

Purdie, S. (2025, December 20). VS Code PR Analysis: Mapping the AI Agent Surge. GitHub. https://github.com/SimonPurdie/VS-Code-PR-Analysis

Stanford NLP. (2025). DSPy: Programming—not prompting—LMs. DSPy.ai. https://dspy.ai/

Yuksekgonul, M., et al. (2025). TextGrad: Automatic "Differentiation" via Text. Stanford University. https://github.com/zou-group/textgrad