
AI Productization Report: Measuring Agents in Production

The First Large-Scale Study of What Actually Works in Enterprise AI Agent Deployments


Report Series: AI Productization Deep Dives
Report Number: 003
Date: December 2025

Source

Paper: "Measuring Agents in Production" (arXiv:2512.04123)
Authors: Melissa Z. Pan et al. (25 authors), UC Berkeley
Paper Date: December 2, 2025


Executive Summary

This report analyzes the first large-scale systematic study of AI agents running in production environments. The Berkeley research team surveyed 306 practitioners and conducted 20 in-depth case studies across 26 industry domains, providing unprecedented visibility into what actually works when organizations deploy AI agents at scale.

Headline Findings

  • 68% of production agents execute at most 10 steps before human intervention

  • 70% rely on prompting off-the-shelf models rather than fine-tuning

  • 74% depend primarily on human evaluation

  • Reliability, not governance or compliance, is the top deployment challenge

The Central Insight


Production agents are far simpler than academic literature suggests. While research papers showcase complex multi-agent systems with dozens of steps and sophisticated reasoning chains, real-world deployments favor controllable, human-supervised systems that prioritize reliability over autonomy.


This isn't a failure of ambition — it's pragmatic engineering. Organizations have learned that simpler agents with robust human oversight deliver more consistent value than complex autonomous systems that fail unpredictably.


What This Means for Practitioners

  1. Start simple: 10-step agents with human checkpoints outperform ambitious autonomous designs

  2. Skip fine-tuning (initially): Prompting off-the-shelf models gets you to production faster

  3. Invest in evaluation: Human-in-the-loop and LLM-as-judge pipelines are table stakes

  4. Expect reliability challenges: This is the #1 blocker — plan for it from day one

  5. Measure productivity gains: This is how successful teams justify continued investment


Study Methodology

Research Design

The Berkeley team employed a mixed-methods approach combining quantitative survey data with qualitative case study interviews. This dual approach provides both statistical breadth and operational depth.


Survey Component:

  • 306 valid responses from practitioners working on AI agents

  • Filtered to production and pilot systems only (excludes prototypes, research artifacts, retired systems)

  • Structured questions: single-select, multi-select, and numeric formats

  • Minimal post-processing required due to structured format


Case Study Component:

  • 20 in-depth interviews with teams operating production agents

  • Interview duration: 30-90 minutes each

  • Interview teams: 2-5 organizationally neutral interviewers per session

  • Semi-structured protocol covering 11 topic areas:

    • System architecture

    • Evaluation mechanisms

    • Deployment challenges

    • Operational requirements

    • Measurable agent value

    • (Plus 6 additional areas)


Rigor and Validation

The research team implemented several quality controls:

  • Cross-validation: Final summaries validated among all interviewers

  • Anonymization: All data anonymized per confidentiality agreements

  • Aggregate presentation: Findings presented in aggregate to protect individual organizations

  • Recording protocols: Interviews recorded based on participant preferences, with human note-takers


Filtering Criteria

The study distinguishes between deployment stages:

Stage | Definition | Included?
--- | --- | ---
Production | Fully deployed, used by target end users in live environments | ✅ Yes
Pilot | Deployed to controlled user groups for evaluation or phased rollout | ✅ Yes
Prototype | Development artifacts not yet deployed | ❌ No
Research | Academic or experimental systems | ❌ No
Retired | Previously deployed but no longer active | ❌ No

This filtering ensures findings reflect real operational experience rather than aspirational designs.


Domain Coverage

The study spans 26 distinct domains, providing cross-industry perspective. While specific domain breakdowns are anonymized, the case studies include:

  • Human resources operations

  • Cloud infrastructure management

  • Business analytics

  • Customer support (voice assistance)

  • Financial services

  • Healthcare operations

  • Software development

  • And 19+ additional domains



Key Findings: Architecture


Finding 1: Agents Execute Fewer Steps Than Expected

68% of production agents execute at most 10 steps before requiring human intervention.

This finding challenges the prevailing narrative of autonomous agents operating independently for extended periods. In practice, successful production agents are:

  • Short-loop systems: Complete discrete tasks, then checkpoint with humans

  • Human-supervised: Designed with intervention points, not despite them

  • Bounded in scope: Tackle well-defined subtasks rather than open-ended goals

Why this matters: Teams often over-engineer initial agent deployments, building for autonomy they don't need. Starting with 5-10 step workflows and adding complexity incrementally produces better outcomes than launching with ambitious multi-step designs.
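
To make the pattern concrete, here is a minimal sketch of a bounded, human-checkpointed agent loop. The `call_model` and `execute_tool` hooks are hypothetical stand-ins for your LLM client and tool layer (not from the paper); the 10-step cap mirrors the survey finding.

```python
# Minimal sketch of a bounded, human-checkpointed agent loop.
# `call_model` and `execute_tool` are hypothetical hooks; the 10-step cap
# mirrors the finding that 68% of production agents take at most 10 steps
# before a human steps in.

MAX_STEPS = 10

def run_bounded_agent(task: str, call_model, execute_tool) -> dict:
    history = [{"role": "user", "content": task}]
    for step in range(MAX_STEPS):
        action = call_model(history)          # expected to return {"type": ..., ...}
        if action["type"] == "final_answer":
            # Checkpoint: surface the result for human review instead of
            # acting on it autonomously.
            return {"status": "awaiting_human_review",
                    "answer": action["content"],
                    "steps_used": step + 1}
        result = execute_tool(action["tool"], action["args"])
        history.append({"role": "tool", "content": result})
    # Step budget exhausted: escalate rather than keep looping.
    return {"status": "escalated_to_human", "steps_used": MAX_STEPS,
            "history": history}
```

Raising MAX_STEPS then becomes an explicit, reviewable change made only as reliability evidence accumulates.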


Finding 2: Prompting Dominates Over Fine-Tuning

70% of production agents rely on prompting off-the-shelf models instead of weight tuning.

Fine-tuning, while powerful, introduces operational complexity:

  • Training infrastructure requirements

  • Model versioning and deployment pipelines

  • Ongoing maintenance as base models evolve

  • Evaluation challenges for custom weights

Production teams have discovered that well-crafted prompts on frontier models often match or exceed fine-tuned performance for their specific use cases — with dramatically lower operational overhead.

Implication: Unless you have a compelling reason to fine-tune (proprietary data, extreme latency requirements, cost optimization at massive scale), start with prompting. You can always fine-tune later if needed.
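
In practice, "prompting an off-the-shelf model" often means treating a versioned prompt template as the deployable artifact. A minimal sketch, assuming a hypothetical `call_model(system, user)` wrapper around whatever model API you use; the triage task and JSON schema are illustrative.

```python
# Sketch: the "prompt is the artifact" approach. No weight tuning, just a
# versioned template rendered against an off-the-shelf model.
# `call_model` is a hypothetical wrapper around your LLM API of choice.

PROMPT_VERSION = "support-triage-v3"   # versioned and rolled back like code

SYSTEM_PROMPT = """You are a support-triage assistant.
Classify the ticket, draft a reply, and flag anything that needs a human agent.
Answer in JSON with keys: category, draft_reply, escalate."""

def triage_ticket(ticket_text: str, call_model) -> str:
    user_prompt = f"Ticket:\n{ticket_text}\n\nRespond with JSON only."
    return call_model(system=SYSTEM_PROMPT, user=user_prompt)
```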


Finding 3: Multi-Model Architectures Reflect Operations, Not Task Complexity

The study reveals a counterintuitive insight about multi-model deployments:

"Multi-model architectures can emerge from lifecycle management needs rather than complex reasoning requirements for the agent task."

Organizations run multiple models for operational reasons:

Reason | Description
--- | ---
Model migration | Legacy models maintained alongside newer versions during transitions
Behavioral consistency | Agent scaffolds and evaluation suites depend on specific model behaviors
Governance routing | Subtasks routed to different endpoints based on access levels
Gradual rollout | New models tested on subset of traffic before full deployment

Key insight: If you see a production system using 3-4 models, don't assume the task requires that complexity. It may simply reflect prudent operational practices around change management.
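
A minimal sketch of operationally driven model routing. The endpoint names and the 10% canary split are illustrative assumptions: the subtask's data classification drives governance routing, and a traffic fraction drives gradual rollout.

```python
# Sketch of a router where model choice is driven by operations
# (governance tier and gradual rollout), not by task complexity.
import random

ENDPOINTS = {
    "restricted": "internal-model-a",     # data that must stay on a governed endpoint
    "general_stable": "vendor-model-v1",  # pinned for behavioral consistency
    "general_canary": "vendor-model-v2",  # new version under gradual rollout
}

def pick_endpoint(subtask: dict, canary_fraction: float = 0.10) -> str:
    if subtask.get("data_classification") == "restricted":
        return ENDPOINTS["restricted"]        # governance routing
    if random.random() < canary_fraction:
        return ENDPOINTS["general_canary"]    # gradual rollout to a traffic slice
    return ENDPOINTS["general_stable"]
```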


Finding 4: Architectural Complexity Correlates with Deployment Stage

The full survey data (including prototypes and research agents) shows a heavier tail toward agents using more distinct models. However, this complexity diminishes as systems move toward production:

Stage | Architectural Tendency
--- | ---
Research/Prototype | Many models, complex architectures
Pilot | Moderate complexity, some consolidation
Production | Simpler architectures, fewer models


This pattern suggests that production pressures force architectural simplification — complex designs that work in development often prove unmaintainable in production.

Key Findings: Evaluation


Finding 5: Human Evaluation Dominates

74% of production agents depend primarily on human evaluation.

Despite significant investment in automated evaluation, human judgment remains the gold standard for assessing agent output quality. The breakdown of evaluation methods:

Method | Usage Rate | Notes
--- | --- | ---
Human-in-the-loop | 74.2% | Dominant approach across domains
LLM-as-a-judge | 51.6% | Growing but not yet replacing humans
Rule-based verification | 42.9% | Useful for structured outputs only

Why humans still dominate: Production agents handle tasks requiring nuanced judgment — customer support, HR operations, business analytics — where rule-based verification proves insufficient and LLM judges lack domain expertise.
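
A minimal sketch of an LLM-as-a-judge check layered alongside (not instead of) human review, assuming a hypothetical `call_model` client and an illustrative 1-5 rubric; low-scoring outputs are routed to the human queue.

```python
# Sketch of an LLM-as-a-judge grader used to prioritize, not replace,
# human review. `call_model` is a hypothetical LLM client; the rubric
# and threshold are illustrative.
import json

JUDGE_PROMPT = """You are grading an AI agent's answer.
Task: {task}
Agent answer: {answer}
Score 1-5 for correctness and 1-5 for helpfulness.
Reply as JSON: {{"correctness": int, "helpfulness": int, "rationale": str}}"""

def judge(task: str, answer: str, call_model) -> dict:
    raw = call_model(JUDGE_PROMPT.format(task=task, answer=answer))
    scores = json.loads(raw)
    # Anything scoring 3 or below on either axis goes to the human queue.
    scores["needs_human_review"] = min(scores["correctness"],
                                       scores["helpfulness"]) <= 3
    return scores
```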


Finding 6: No "Five 9s" for Agents

A striking finding from the case studies:

"No team reports applying standard production reliability metrics such as five 9s availability to their agent systems."

Traditional software reliability metrics (99.999% uptime) don't translate to AI agents. Instead, evaluation centers on:

  • Output correctness: Did the agent produce the right answer?

  • Response quality: Was the output well-formed and useful?

  • Task completion: Did the agent achieve the user's goal?

Implication: Don't try to force traditional SRE metrics onto agent systems. Develop agent-native quality metrics that reflect actual user value.
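
A minimal sketch of what agent-native metrics can look like when computed from interaction logs instead of uptime-style SLOs; the logged fields are assumptions about what an agent's telemetry captures.

```python
# Sketch of agent-native quality metrics computed over logged interactions.
from dataclasses import dataclass

@dataclass
class Interaction:
    correct: bool          # did the agent produce the right answer?
    goal_completed: bool   # did the user achieve their goal?
    escalated: bool        # was the task handed to a human?

def agent_quality_metrics(logs: list[Interaction]) -> dict:
    n = len(logs) or 1     # avoid division by zero on an empty log
    return {
        "output_correctness_rate": sum(i.correct for i in logs) / n,
        "goal_completion_rate": sum(i.goal_completed for i in logs) / n,
        "escalation_rate": sum(i.escalated for i in logs) / n,
    }
```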


Finding 7: Evaluation Pipelines Are Converging

Despite diverse domains and organizational contexts, the study reveals a consistent evaluation pattern emerging across teams. The five-stage pipeline below appeared independently across HR, cloud infrastructure, analytics, customer support, and other domains:


[Figure: the five-stage evaluation pipeline]

Pipeline characteristics:

  • Extends from development through production runtime

  • Creates continuous feedback loop (Stage 5 feeds back to Stage 2)

  • Enables ongoing quality assessment without manual review of every interaction


Research opportunity: The convergence of nearly identical pipelines across diverse contexts suggests opportunities for:

  • Reusable data ingestion pipelines

  • Standardized curation methods for golden sets

  • Synthetic generation techniques for evaluation datasets
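
As a sketch of the development-time end of such a pipeline, the following golden-set regression gate (with hypothetical `run_agent` and `grade` hooks and an illustrative 90% threshold) blocks prompt or model changes that regress against curated cases.

```python
# Sketch of a golden-set regression gate. `run_agent` executes the agent on a
# case; `grade` can be exact-match, rule-based, or an LLM-as-judge call.

def evaluate_on_golden_set(golden_set: list[dict], run_agent, grade,
                           min_pass_rate: float = 0.90) -> dict:
    results = []
    for case in golden_set:                      # each case: {"input": ..., "expected": ...}
        output = run_agent(case["input"])
        results.append(grade(output, case["expected"]))
    pass_rate = sum(results) / max(len(results), 1)
    return {
        "pass_rate": pass_rate,
        # Gate prompt or model-version changes on the golden set.
        "release_blocked": pass_rate < min_pass_rate,
    }
```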


Finding 8: Baseline Comparisons Are Uncommon

Only 38.7% of survey respondents compare their deployed agents against non-agentic baselines (existing software, traditional workflows, or human execution).

This is a missed opportunity. Without baseline comparisons, teams cannot:

  • Quantify the value agents provide

  • Identify regression when agents underperform traditional approaches

  • Make informed build-vs-buy decisions

Recommendation: Always establish a baseline before deploying agents. Even a simple "human doing this task manually" baseline provides essential context for measuring agent value.
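
A minimal sketch of the comparison, using illustrative numbers: measure the same task under the existing process and under agent assistance, then report time saved and the error-rate delta.

```python
# Sketch: quantify agent value against a non-agentic baseline.
# The numbers below are illustrative; the structure is the point.

def compare_to_baseline(agent: dict, baseline: dict) -> dict:
    """agent/baseline: {"minutes_per_task": float, "error_rate": float}"""
    return {
        "time_saved_pct": round(100 * (1 - agent["minutes_per_task"]
                                       / baseline["minutes_per_task"]), 1),
        "error_rate_delta": round(agent["error_rate"] - baseline["error_rate"], 4),
    }

# Example: manual process takes 120 min with 2% errors; agent-assisted takes
# 15 min with 3% errors -> 87.5% time saved, +1pp errors to weigh against it.
print(compare_to_baseline({"minutes_per_task": 15, "error_rate": 0.03},
                          {"minutes_per_task": 120, "error_rate": 0.02}))
```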


Key Findings: Challenges


Finding 9: Reliability Is the #1 Challenge

When asked about top development challenges, survey respondents across all agent stages ranked concerns as follows:

Challenge Category | Selection Rate
--- | ---
Core Technical Focus | 37.9%
Compliance | 17.0%
Governance | 3.4%

"Core Technical Focus" encompasses reliability challenges — ensuring agents produce correct outputs consistently. This dominates over governance and compliance concerns by a wide margin.

Why reliability trumps governance: You can't govern an agent that doesn't work reliably. Teams are discovering that fundamental correctness challenges must be solved before higher-level concerns become relevant.


Finding 10: Reliability Challenges Are Multifaceted

The reliability challenge breaks down into several sub-problems:

Ensuring correctness:

  • Agents produce plausible but wrong outputs

  • Edge cases trigger unexpected behaviors

  • Context limitations cause information loss

Evaluating correctness:

  • Ground truth is often unavailable or ambiguous

  • Human evaluation doesn't scale

  • Automated metrics don't capture real quality

Maintaining correctness:

  • Model updates change agent behavior

  • Prompt drift over time

  • Data distribution shifts


Finding 11: Complex Tasks Require Human Judgment

The dominance of human-in-the-loop evaluation (74.2%) reflects a fundamental reality:

"Production agents already handle complex tasks beyond classification, entity resolution, or pattern matching. These agents operate in domains requiring nuanced judgment where rule-based methods prove insufficient."

Examples from case studies:

  • Customer support voice assistance: Requires understanding context, emotion, and appropriate escalation

  • HR operations: Involves sensitive decisions with legal and ethical implications

  • Business analytics: Demands domain expertise to interpret ambiguous data

Implication: Don't expect to fully automate evaluation for complex agent tasks. Budget for ongoing human review as a feature, not a bug.


Case Study Highlights

The 20 in-depth case studies provide rich operational detail. While anonymized, several patterns emerge across representative examples:


Case Study Pattern A: Customer Service Voice Agent

Attribute | Detail
--- | ---
Domain | Customer support
Architecture | Single model, prompting-based
Steps before human | 5-8 typical
Evaluation | Human review of call transcripts + customer satisfaction scores

Key learnings:

  • Voice adds complexity (ASR errors compound with LLM errors)

  • Escalation triggers are critical safety mechanisms

  • Customer satisfaction correlates weakly with "correct" answers — tone matters


Case Study Pattern B: Cloud Infrastructure Assistant

Attribute | Detail
--- | ---
Domain | DevOps / SRE
Architecture | Multi-model (different models for different subtasks)
Steps before human | 8-12 typical
Evaluation | Golden command sets + human review of suggested changes

Key learnings:

  • High-stakes actions (delete, modify) require human approval (see the approval-gate sketch below)

  • Explanations matter as much as actions

  • Teams maintain "shadow mode" for weeks before enabling actions
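
A minimal sketch of the approval gate referenced above, with illustrative action names and hypothetical `execute` and `request_approval` hooks: read-only actions run automatically, while mutating actions wait for a human decision alongside the agent's explanation.

```python
# Sketch of an approval gate for high-stakes infrastructure actions.
# Action names, hooks, and the HIGH_STAKES set are illustrative.

HIGH_STAKES = {"delete", "modify", "scale_down"}

def dispatch(action: dict, execute, request_approval) -> dict:
    if action["verb"] in HIGH_STAKES:
        # Block until a human approves (or rejects) the proposed change,
        # together with the agent's explanation of why it wants to make it.
        approved = request_approval(action,
                                    explanation=action.get("rationale", ""))
        if not approved:
            return {"status": "rejected_by_human"}
    return {"status": "executed", "result": execute(action)}
```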


Case Study Pattern C: Business Analytics Agent

Attribute | Detail
--- | ---
Domain | Finance / Analytics
Architecture | Single model with RAG
Steps before human | 3-5 typical
Evaluation | SME review + comparison to manual analysis

Key learnings:

  • Numeric accuracy is non-negotiable

  • Citation/sourcing critical for trust

  • Users prefer conservative agents that say "I don't know"


Case Study Pattern D: HR Operations Assistant

Attribute | Detail
--- | ---
Domain | Human Resources
Architecture | Single model, heavy prompt engineering
Steps before human | 2-4 typical (very short loops)
Evaluation | Legal/HR review + employee feedback

Key learnings:

  • Compliance requirements drive ultra-short loops

  • Audit trails mandatory for all recommendations

  • Agents handle gathering/summarization; humans make decisions


Case Study Pattern E: Software Development Agent

Attribute | Detail
--- | ---
Domain | Engineering
Architecture | Multi-model (code gen + review)
Steps before human | 10-15 typical
Evaluation | Code review + test pass rates + developer satisfaction

Key learnings:

  • Highest step counts in the study — developers tolerate more autonomy

  • But: all code still goes through human review before merge

  • Value measured in time saved, not code quality (humans ensure quality)


Case Study Pattern F: Document Processing Agent

Attribute | Detail
--- | ---
Domain | Legal / Compliance
Architecture | Single model with structured output
Steps before human | 4-6 typical
Evaluation | Attorney review + accuracy sampling

Key learnings:

  • Extraction accuracy must exceed 95% for adoption

  • Confidence scores help prioritize human review (see the triage sketch below)

  • Agents accelerate review but don't replace it
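
A minimal sketch of confidence-based review triage, with an illustrative 0.95 threshold and assumed extraction-record fields: low-confidence extractions are surfaced first for attorney review.

```python
# Sketch: use per-field confidence to prioritize human review.
# Field names and the 0.95 threshold are illustrative assumptions.

REVIEW_THRESHOLD = 0.95

def triage_extractions(extractions: list[dict]) -> list[dict]:
    """extractions: [{"doc_id": ..., "field": ..., "value": ..., "confidence": float}]"""
    needs_review = [e for e in extractions if e["confidence"] < REVIEW_THRESHOLD]
    # Least confident first, so reviewers spend time where it matters most.
    return sorted(needs_review, key=lambda e: e["confidence"])
```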


Cross-Case Patterns

Pattern | Frequency | Description
--- | --- | ---
Short loops in high-stakes domains | High | HR, finance, healthcare use 2-5 step agents
Shadow mode before production | High | Weeks/months of side-by-side comparison
Explanation requirements | High | Users want to know "why," not just "what"
Escalation as feature | Universal | Every production agent has escalation paths
Human-final-decision | Near-universal | Agents recommend; humans decide
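
The "shadow mode before production" pattern above can be sketched as a harness that runs the agent in parallel with the incumbent process, logs the comparison, and never lets agent output (or agent failures) reach users; the function hooks are hypothetical.

```python
# Sketch of a shadow-mode harness: the agent runs alongside the existing
# process, its output is logged and compared, but only the incumbent
# result is returned to users.
import logging

log = logging.getLogger("shadow_mode")

def handle_request(request, incumbent_process, agent, compare):
    live_result = incumbent_process(request)   # what users actually get
    try:
        shadow_result = agent(request)         # agent output, never surfaced
        log.info("shadow_diff",
                 extra={"match": compare(live_result, shadow_result)})
    except Exception:                          # agent failures must not affect users
        log.exception("shadow agent failed")
    return live_result
```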

Data & Demographics

Survey Population

The study captured responses from practitioners across the agent development lifecycle:


Role Distribution:

  • Engineers / Developers

  • Product Managers

  • Data Scientists / ML Engineers

  • Technical Leads / Architects

  • Operations / SRE

  • Research Scientists

(Specific percentages anonymized in source paper)


Deployment Stage Distribution

Stage | Description
--- | ---
Production | Live deployment with real users
Pilot | Controlled rollout for evaluation
Development | Active building (excluded from main analysis)
Research | Experimental (excluded from main analysis)

The analysis focuses on deployed agents (production + pilot) to ensure findings reflect operational reality.


Domain Coverage

26 domains represented, including:

Sector | Example Domains
--- | ---
Enterprise Operations | HR, Finance, Legal, Procurement
Technical | DevOps, SRE, Software Development
Customer-Facing | Support, Sales, Marketing
Specialized | Healthcare, Manufacturing, Logistics

This breadth suggests findings generalize across industries, not just tech-forward sectors.


Agent Characteristics

Typical production agent profile:

Characteristic | Typical Value
--- | ---
Steps before human | ≤10 (68%)
Model approach | Prompting (70%)
Primary evaluation | Human (74%)
Model count | 1-2 (most common)
Deployment age | Months to 1 year

Benefits Realized

When asked about benefits from deployed agents, respondents selected:

Benefit | Selection Rate
--- | ---
Increasing Productivity | 73%
Other benefits | Varies
Operational Stability | Lowest

"Increasing productivity" — completing tasks faster than previous approaches — is the dominant realized benefit. "Operational stability" (mitigating risk, accelerating failure recovery) is least often selected.


Interpretation: Current production agents excel at speed improvements but haven't yet demonstrated reliability advantages over traditional systems.


Strategic Implications


For Engineering Leaders

1. Reset complexity expectations

The 68% / 10-step finding should calibrate your planning. If your team is designing a 50-step autonomous agent for v1, reconsider. Successful production deployments:

  • Start with bounded scope (5-10 steps)

  • Add human checkpoints deliberately

  • Expand autonomy incrementally based on observed reliability

2. Defer fine-tuning decisions

With 70% of production agents using prompting alone, fine-tuning should be a later optimization, not a launch requirement. Fine-tune when you have:

  • Sufficient production data to train on

  • Clear evidence prompting is the bottleneck

  • Infrastructure to maintain custom models

3. Budget for human evaluation

The 74% human evaluation finding isn't a temporary state — it reflects fundamental task complexity. Plan for:

  • Ongoing human review capacity

  • Tooling to make human review efficient

  • Feedback loops from reviewers to improve agents


For Product Leaders

4. Position agents as productivity tools

The 73% "increasing productivity" finding provides your value proposition. Frame agents as:

  • "Complete this task in 10 minutes instead of 2 hours"

  • NOT "Fully autonomous system that replaces humans"

Users and buyers respond to concrete time savings more than abstract autonomy promises.

5. Design for human-in-the-loop from day one

Human oversight isn't a compromise — it's a feature. Products that embrace human-agent collaboration will outperform those that promise (and fail to deliver) full autonomy.

6. Establish baselines before launch

Join the 38.7% who compare against baselines. Before deploying:

  • Measure current task completion time

  • Document current error rates

  • Establish user satisfaction baseline


For AI/ML Teams

7. Invest in evaluation infrastructure

The convergent evaluation pipeline (golden sets → feedback → SME → LLM-judge) represents emerging best practice. Build or adopt:

  • Golden set management tooling

  • Feedback collection mechanisms

  • SME review workflows

  • LLM-as-judge pipelines

8. Develop agent-native reliability metrics

Traditional software metrics don't apply. Define metrics that capture:

  • Output correctness rate

  • Appropriate escalation rate

  • User goal completion rate

  • Time to human intervention

9. Plan for model migration complexity

Multi-model architectures often reflect migration challenges, not task requirements. Prepare for:

  • Behavioral differences across model versions

  • Evaluation suite dependencies on specific models

  • Gradual rollout requirements


For Executives

10. Reliability is the bottleneck

The 37.9% "Core Technical Focus" finding identifies where investment is needed. Compliance (17.0%) and governance (3.4%) are important but secondary — you can't govern unreliable systems.

11. Expect iteration, not instant success

Production agents require ongoing refinement. Budget for:

  • Multiple deployment iterations

  • Continuous evaluation

  • Incremental capability expansion

12. Measure what matters

Productivity gains (73% cite this) provide the clearest ROI signal. Track:

  • Time saved per task

  • Tasks completed per period

  • User satisfaction with agent assistance


Related Work

This study fills a critical gap in the AI agent literature. Prior work falls into several categories:


Industry Reports

Organizations like PwC, Capgemini, McKinsey, and Microsoft have published agent-related surveys focusing on:

  • Organizational readiness

  • Market trends

  • Executive perspectives

  • Technology adoption patterns

These provide valuable context but lack engineering-level technical detail.


Practitioner Surveys

LangChain's "State of AI Agents 2024" surveyed 1,300+ professionals on agent motivations and challenges. This study differs in:

  • Scope: Focuses specifically on production/pilot systems

  • Depth: Includes 20 in-depth case studies

  • Technical detail: Captures architecture, evaluation, and operational data


Academic Agent Literature

Extensive research examines LLM-powered agents from theoretical and benchmark perspectives. However, academic work typically:

  • Evaluates on research benchmarks, not production metrics

  • Studies prototype systems, not deployed agents

  • Focuses on capability, not reliability


How This Study Differs

Dimension | Prior Work | This Study
--- | --- | ---
Systems studied | Prototypes, benchmarks | Production deployments
Data source | Benchmarks, papers | Practitioner surveys, interviews
Focus | Capabilities | Reliability, operations
Perspective | What's possible | What works

Conclusion

"Measuring Agents in Production" provides the most comprehensive empirical view of deployed AI agents to date. The findings challenge assumptions prevalent in research and industry discourse:

  • Agents are simpler than expected — 10 steps, not 100

  • Prompting beats fine-tuning — for 70% of production use cases

  • Humans remain essential — 74% rely on human evaluation

  • Reliability is the bottleneck — not governance, not compliance


For practitioners, the message is clear: start simple, embrace human oversight, invest in evaluation, and measure productivity gains. The path to production runs through pragmatic engineering, not ambitious autonomy.

References

Primary Source:

  • Melissa Z. Pan et al., "Measuring Agents in Production," arXiv:2512.04123, December 2, 2025.

Related Papers from Repository:

  • "Intuition to Evidence: Measuring AI's True Impact on Developer Productivity" (arXiv:2509.19708)

  • "Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity" (arXiv:2507.09089)

  • "Enterprise Large Language Model Evaluation Benchmark" (arXiv:2506.20274)


Arindam Banerji, PhD

 
 
 
