Intent-Driven Development: Measuring Intent Fidelity

Pop art illustration showing a professional woman reviewing intent fidelity metrics while AI systems operate in the background, with risk dials moving from stop to caution to trust, representing measured and governed Intent-Driven Development success.

How do you know when it’s safe to trust AI more?

You’ve invested in Intent-Driven Development (IDD).

You’re writing specifications that separate human intent from AI implementation. You’ve introduced explicit risk dials at key decision points. Your framework survives both model evolution and architectural change.

Now comes the question every senior leader eventually asks:

How do we know it’s actually working?

This question marks the dividing line between organisations stuck in pilot purgatory and the small minority that scale AI successfully. The organisations that succeed don’t guess when to increase automation. They measure. They adjust based on evidence. They move from caution to confidence through proof, not promises.

In Intent-Driven Development (IDD), intent fidelity is the primary control metric. It tells you when AI systems are behaving as intended, when they are not, and when it is safe to trust them more.

This article defines how intent fidelity is measured, and how those measurements give leaders objective confidence to move risk dials from 🔴 to 🟡 to 🟢.

Why the 94% Stay Stuck

In traditional software development, teams could “try it and see.” The cost of failure was bounded: a few developers, a few weeks, a limited blast radius.

Agentic AI changes that equation entirely.

A single agent can generate thousands of lines of code in minutes, deploy infrastructure, modify databases, and integrate across systems. The potential impact of a mistake is no longer measured in developer-hours.

Organisations freeze between two competing fears:

  • Move too slowly → Competitors who adopt AI faster gain the advantage
  • Move too fast → A major failure damages trust and derails adoption

The small percentage of organisations that move beyond pilots resolve this tension through measurement.

They increase automation deliberately, guided by evidence of intent fidelity: the degree to which AI implementation aligns with clearly articulated human intent.

When intent fidelity is high and stable, automation increases safely. When intent fidelity degrades, control tightens immediately.

Measurement makes trust explicit rather than assumed.

The Four Dimensions of Intent Fidelity

Intent fidelity is not a single score. AI systems fail in different ways, so fidelity must be measured across four distinct dimensions. Without all four, blind spots remain.

Common failure patterns include:

  • Building the right thing incorrectly (high completeness, low correctness)
  • Building the wrong thing correctly (low alignment, high correctness)
  • Building something that works once but degrades over time (low consistency)

Measuring all four dimensions gives a complete, actionable picture.

1. Completeness

Did the AI implementation address everything specified in the intent?

Completeness checks whether all success criteria, constraints, ethical considerations, and validation scenarios defined in the IDD specification were implemented.

Example (shopping cart specification):

  • Cart persistence across sessions? ✅
  • Accessible from any device? ✅
  • Performance under 200ms? ✅
  • GDPR deletion capability? ✅
  • No dark patterns? ✅

Completeness: 100%
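Completeness lends itself to mechanical scoring: the fraction of specified criteria the implementation addressed. A minimal sketch (the criterion names mirror the shopping-cart example above; in practice the boolean results would come from your validation suite, not be hand-written):

```python
def completeness(criteria: dict[str, bool]) -> float:
    """Return the percentage of specified intent criteria that were addressed."""
    if not criteria:
        raise ValueError("specification defines no criteria")
    return 100 * sum(criteria.values()) / len(criteria)

# Results for the shopping-cart specification above (all criteria satisfied).
cart_spec = {
    "cart persistence across sessions": True,
    "accessible from any device": True,
    "performance under 200ms": True,
    "GDPR deletion capability": True,
    "no dark patterns": True,
}
print(f"Completeness: {completeness(cart_spec):.0f}%")  # → Completeness: 100%
```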

Red flags: missing features, ignored constraints, skipped ethical considerations.

2. Correctness

Does the implementation actually work as specified?

Correctness measures whether the implementation:

  • Passes validation tests
  • Handles edge cases
  • Fails safely
  • Meets performance and security requirements

Example:

  • Cart persistence works ✅
  • Loads in 850ms ❌ (spec required <200ms)
  • Breaks when cart exceeds 100 items ❌ (should fail gracefully)

Correctness: 67%

Red flags: edge-case failures, performance regressions, security vulnerabilities.

3. Alignment

Does the implementation reflect the real intent behind the specification?

Alignment is the most critical, and least automatable, dimension. It requires human judgment about whether the solution solves the right problem, not just the stated one.

Example:

Intent: “Enable users to save items for future purchase.”

  • Implementation A: Saves cart for 24 hours, then deletes ❌
  • Implementation B: Saves cart indefinitely until user deletes ✅

Only the second implementation reflects the underlying user intent.

Alignment failures are not tooling failures; they are intent-interpretation failures.

Red flags: stakeholder feedback of “that’s not what we meant,” domain model violations, technically correct but conceptually wrong solutions.

4. Consistency

Does the implementation fit coherently within the existing system?

Consistency measures adherence to architectural patterns, domain conventions, and system design principles.

Example:

  • System standard: event-driven state changes
  • AI implementation: direct database writes ❌
  • No domain events emitted ❌

Red flags: architectural violations, divergent patterns, silent introduction of technical debt.

Calculating Intent Fidelity

Intent fidelity combines all four dimensions into a single, trackable metric.

Intent Fidelity =
(Correctness × 0.35) + (Completeness × 0.25) + (Alignment × 0.25) + (Consistency × 0.15)

Why these weights?

  • Correctness (35%): broken systems fail regardless of intent
  • Completeness (25%): missing intent creates silent gaps
  • Alignment (25%): solving the wrong problem wastes all effort
  • Consistency (15%): refactorable, but still costly

The specific weights matter less than consistency and coverage. Organisations may tune them, but all four dimensions must remain present.

Example:

  • Correctness: 85%
  • Completeness: 100%
  • Alignment: 90%
  • Consistency: 95%

Intent Fidelity = 91.5%
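The weighted sum above is straightforward to encode. A sketch using the default weights from this article (organisations would tune them, but all four dimensions must stay present):

```python
# Default dimension weights from the intent fidelity formula.
WEIGHTS = {"correctness": 0.35, "completeness": 0.25,
           "alignment": 0.25, "consistency": 0.15}

def intent_fidelity(scores: dict[str, float],
                    weights: dict[str, float] = WEIGHTS) -> float:
    """Weighted intent fidelity; refuses to score if any dimension is missing."""
    missing = weights.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return round(sum(scores[d] * w for d, w in weights.items()), 2)

# The worked example above.
print(intent_fidelity({"correctness": 85, "completeness": 100,
                       "alignment": 90, "consistency": 95}))  # → 91.5
```

Raising on a missing dimension, rather than defaulting it to zero, enforces the rule that a score without all four dimensions leaves blind spots.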

Measuring Human Intent Fidelity (The Baseline Principle)

In Intent-Driven Development, intent fidelity is measured at the implementation boundary, not at the actor. Whether an implementation is produced by:

  • a human engineer
  • an AI agent
  • or a human–AI collaboration

…it is evaluated against the same four dimensions: completeness, correctness, alignment, and consistency.

This is a foundational rule of IDD: Humans are not exempt from measurement, and AI is not held to a harsher standard. Intent is the contract, and all implementations are judged against it.

Why the Human Baseline Matters

Without measuring human implementations:

  • AI has no credible reference point
  • failures are misattributed to tooling rather than unclear intent
  • organisations mistake anecdote for governance

IDD requires a human baseline to establish what “good” looks like in practice.

This baseline is not the best engineer on their best day. It is a representative view of how intent has historically been implemented across the organisation.

Interpreting the Results

Human and AI implementations use the same scoring model – but interpretation differs.

When human intent fidelity is low, root causes typically include:

  • incomplete or ambiguous intent specifications
  • undocumented domain assumptions
  • coordination and time-pressure effects

When AI intent fidelity is low, root causes typically include:

  • specification gaps
  • domain modelling weaknesses
  • inappropriate autonomy for the task

A low score does not indict the actor; it diagnoses the system.

When both human and AI scores are low, the issue is neither people nor machines.
It is intent quality.

The Executive Insight

Intent-Driven Development does not ask leaders to trust AI more than humans.

It asks them to trust measurement more than intuition.

Holding humans and AI to the same intent fidelity standard makes governance fair, defensible, and scalable.

How to Measure: The Comparative Build Method

The most reliable way to measure intent fidelity is comparison against a human baseline.

Process:

  1. Parallel implementation: Same IDD specification implemented by both human and AI
  2. Compare outcomes: Identical validation, performance tests, and stakeholder review
  3. Analyse divergence: Differences are categorised as:
    1. AI limitation (maintain 🔴)
    2. Equally valid alternative (acceptable)
    3. Improvement (AI outperforms baseline)
    4. Specification ambiguity (fix intent, not AI)

This works because AI is not measured against perfection; it is measured against what “good” already looks like in the organisation.

When to Move Risk Dials

Intent fidelity scores determine when automation can safely increase.

Single-Agent Systems

Stay at 🔴 (Full human review):

  • Intent fidelity <85%
  • Any alignment failures
  • First 10–20 implementations of a new task type
  • Security, compliance, or ethical concerns

Move to 🟡 (Spot-check):

  • 85–95% fidelity sustained across 20+ implementations
  • No recent alignment failures
  • Well-bounded task category

Move to 🟢 (Monitoring):

  • 95% fidelity sustained across 50+ implementations
  • Automated validation reliably detects issues
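The single-agent thresholds above reduce to a simple decision rule. A sketch, assuming `scores` holds the fidelity percentages for a task category’s recent implementations and `alignment_failure` flags any recent alignment miss (all names here are illustrative, not from a real library):

```python
def recommend_dial(scores: list[float], alignment_failure: bool,
                   sensitive: bool = False) -> str:
    """Map a task category's fidelity history onto a risk dial setting."""
    n = len(scores)
    # 🔴: low or unsustained fidelity, any alignment failure, too few
    # implementations, or security/compliance/ethical sensitivity.
    if sensitive or alignment_failure or n < 20 or min(scores) < 85:
        return "🔴 full human review"
    # 🟢: 95%+ fidelity sustained across 50+ implementations.
    if n >= 50 and min(scores) >= 95:
        return "🟢 monitoring"
    # 🟡: 85–95% sustained across 20+ implementations.
    return "🟡 spot-check"
```

Using `min(scores)` treats “sustained” strictly: one bad implementation in the window drops the recommendation, which errs on the side of caution.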

Multi-Agent Systems

Different agents earn trust at different rates.

  • Test agents: move to 🟢 fastest
  • Security agents: often remain 🔴 permanently
  • Backend / frontend agents: earn 🟡 after 20–30 successes
  • Architect / coordinator agents: remain 🔴 longest

Architectural mistakes carry the highest long-term cost.

Demonstrating ROI to Leadership

Measurement enables translation from technical metrics to business outcomes.

The timelines below illustrate how measurement compresses uncertainty over time. They describe confidence progression, not fixed delivery schedules.

Velocity Improvement

Before IDD:
2 developers × 3 weeks = 6 developer-weeks

After IDD (🟡 stage):

  • Specification: 2 days
  • AI implementation + spot-check: 2 days

Result: ~6× velocity improvement with maintained quality
(Illustrative, directionally consistent with early adopters.)

Risk Reduction

Without measurement:

  • 15% AI-generated changes cause production issues
  • Cost per issue: $50k
  • Annual cost: $750k

With intent fidelity measurement:

  • 2% issue rate
  • Annual cost: $100k
  • ≈ $650k annual risk reduction
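The risk-reduction arithmetic works out as follows, assuming roughly 100 AI-generated production changes per year (the volume implied by the figures above; substitute your own):

```python
changes_per_year = 100      # assumed annual volume of AI-generated changes
cost_per_issue = 50_000     # cost per production issue, per the figures above

unmeasured = changes_per_year * 0.15 * cost_per_issue  # 15% issue rate: $750k
measured = changes_per_year * 0.02 * cost_per_issue    # 2% issue rate: $100k

print(int(unmeasured - measured))  # → 650000
```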

Confidence to Scale

Time to production confidence:

  • Without measurement: 6–12 months stuck in pilots
  • With intent fidelity: predictable progression to scale within 9 months

Measurement Cadence

Measure deliberately, not continuously.

  • Weeks 1–4: measure every implementation (baseline)
  • Months 2–3: measure every third implementation
  • Months 4–6: selective measurement by category
  • Month 7+: quarterly sampling and incident-driven reviews

How This Fits the Full IDD Journey

  • Article 1: AI builds fast, but is it building the right thing?
  • Article 2: IDD integrates UCD, DDD, BDD, TDD around intent
  • Article 3: Risk dials provide explicit human control
  • Article 4: IDD survives model evolution
  • Article 5: IDD scales across architectures and agents
  • Article 6: Intent fidelity measurement provides evidence to scale safely

This is how AI adoption is de-risked:

  • Separate the stable (human intent) from the fluid (AI implementation)
  • Govern with explicit risk dials
  • Survive inevitable evolution
  • Measure intent fidelity
  • Scale where evidence supports it

The organisations that succeed do all five.

Your Next Steps

If you are implementing IDD:

  • Month 1: establish baseline intent fidelity
  • Months 2–3: identify patterns, improve specifications
  • Months 4–6: move to 🟡 selectively where evidence supports it
  • Month 7+: scale with confidence, backed by data

You don’t need to trust AI blindly.
You don’t need to stay cautious indefinitely.

You measure intent fidelity.
You adjust risk dials based on evidence.
You scale where data supports it.

This is how confidence replaces caution.

#IntentDrivenDevelopment #IDD #IntentFidelity #AIGovernance #EvidenceBasedAI #TechLeadership

Check out the other articles in this series …

