AI Productization Report: Measuring Agents in Production
- Arindom Banerjee
- Dec 7
The First Large-Scale Study of What Actually Works in Enterprise AI Agent Deployments
Report Series: AI Productization Deep Dives
Report Number: 003
Date: December 2025
Source Paper: "Measuring Agents in Production" (arXiv:2512.04123)
Authors: Melissa Z. Pan et al. (25 authors), UC Berkeley
Paper Date: December 2, 2025
Executive Summary
This report analyzes the first large-scale systematic study of AI agents running in production environments. The Berkeley research team surveyed 306 practitioners and conducted 20 in-depth case studies across 26 industry domains, providing unprecedented visibility into what actually works when organizations deploy AI agents at scale.
Headline Findings
68% of production agents execute at most 10 steps before requiring human intervention
70% rely on prompting off-the-shelf models rather than fine-tuning
74% depend primarily on human evaluation of agent outputs
Reliability, not governance or compliance, is the top deployment challenge
73% of respondents cite increased productivity as the main realized benefit
The Central Insight
Production agents are far simpler than academic literature suggests. While research papers showcase complex multi-agent systems with dozens of steps and sophisticated reasoning chains, real-world deployments favor controllable, human-supervised systems that prioritize reliability over autonomy.
This isn't a failure of ambition — it's pragmatic engineering. Organizations have learned that simpler agents with robust human oversight deliver more consistent value than complex autonomous systems that fail unpredictably.
What This Means for Practitioners
Start simple: 10-step agents with human checkpoints outperform ambitious autonomous designs
Skip fine-tuning (initially): Prompting off-the-shelf models gets you to production faster
Invest in evaluation: Human-in-the-loop and LLM-as-judge pipelines are table stakes
Expect reliability challenges: This is the #1 blocker — plan for it from day one
Measure productivity gains: This is how successful teams justify continued investment
Study Methodology
Research Design
The Berkeley team employed a mixed-methods approach combining quantitative survey data with qualitative case study interviews. This dual approach provides both statistical breadth and operational depth.
Survey Component:
306 valid responses from practitioners working on AI agents
Filtered to production and pilot systems only (excludes prototypes, research artifacts, retired systems)
Structured questions: single-select, multi-select, and numeric formats
Minimal post-processing required due to structured format
Case Study Component:
20 in-depth interviews with teams operating production agents
Interview duration: 30-90 minutes each
Interview teams: 2-5 organizationally neutral interviewers per session
Semi-structured protocol covering 11 topic areas:
System architecture
Evaluation mechanisms
Deployment challenges
Operational requirements
Measurable agent value
(Plus 6 additional areas)
Rigor and Validation
The research team implemented several quality controls:
Cross-validation: Final summaries validated among all interviewers
Anonymization: All data anonymized per confidentiality agreements
Aggregate presentation: Findings presented in aggregate to protect individual organizations
Recording protocols: Interviews recorded based on participant preferences, with human note-takers
Filtering Criteria
The study distinguishes between deployment stages:
Stage | Definition | Included? |
Production | Fully deployed, used by target end users in live environments | ✅ Yes |
Pilot | Deployed to controlled user groups for evaluation or phased rollout | ✅ Yes |
Prototype | Development artifacts not yet deployed | ❌ No |
Research | Academic or experimental systems | ❌ No |
Retired | Previously deployed but no longer active | ❌ No |
This filtering ensures findings reflect real operational experience rather than aspirational designs.
Domain Coverage
The study spans 26 distinct domains, providing cross-industry perspective. While specific domain breakdowns are anonymized, the case studies include:
Human resources operations
Cloud infrastructure management
Business analytics
Customer support (voice assistance)
Financial services
Healthcare operations
Software development
And 19+ additional domains

Key Findings: Architecture
Finding 1: Agents Execute Fewer Steps Than Expected
68% of production agents execute at most 10 steps before requiring human intervention.
This finding challenges the prevailing narrative of autonomous agents operating independently for extended periods. In practice, successful production agents are:
Short-loop systems: Complete discrete tasks, then checkpoint with humans
Human-supervised: Designed with intervention points, not despite them
Bounded in scope: Tackle well-defined subtasks rather than open-ended goals
Why this matters: Teams often over-engineer initial agent deployments, building for autonomy they don't need. Starting with 5-10 step workflows and adding complexity incrementally produces better outcomes than launching with ambitious multi-step designs.
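To make the pattern concrete, here is a minimal Python sketch of a bounded agent loop with a human checkpoint; `run_step`, `request_human_review`, and the `AgentState` shape are hypothetical stand-ins for your own model call and review channel, not anything specified in the paper.

```python
# Minimal sketch of a bounded, human-supervised agent loop.
from dataclasses import dataclass, field

MAX_STEPS = 10  # mirrors the <=10-step pattern reported for 68% of production agents


@dataclass
class AgentState:
    goal: str
    history: list = field(default_factory=list)
    done: bool = False


def run_step(state: AgentState) -> str:
    """Placeholder for one model/tool call; returns that step's output."""
    return f"step output for: {state.goal}"


def request_human_review(state: AgentState) -> None:
    """Placeholder checkpoint: surface state.history to a human reviewer."""
    pass


def run_bounded_agent(goal: str) -> AgentState:
    state = AgentState(goal=goal)
    for _ in range(MAX_STEPS):
        state.history.append(run_step(state))
        if state.done:
            break
    # Checkpoint with a human instead of continuing autonomously.
    request_human_review(state)
    return state


if __name__ == "__main__":
    print(run_bounded_agent("summarize open support tickets").history[:2])
```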
Finding 2: Prompting Dominates Over Fine-Tuning
70% of production agents rely on prompting off-the-shelf models instead of weight tuning.
Fine-tuning, while powerful, introduces operational complexity:
Training infrastructure requirements
Model versioning and deployment pipelines
Ongoing maintenance as base models evolve
Evaluation challenges for custom weights
Production teams have discovered that well-crafted prompts on frontier models often match or exceed fine-tuned performance for their specific use cases — with dramatically lower operational overhead.
Implication: Unless you have a compelling reason to fine-tune (proprietary data, extreme latency requirements, cost optimization at massive scale), start with prompting. You can always fine-tune later if needed.
Finding 3: Multi-Model Architectures Reflect Operations, Not Task Complexity
The study reveals a counterintuitive insight about multi-model deployments:
"Multi-model architectures can emerge from lifecycle management needs rather than complex reasoning requirements for the agent task."
Organizations run multiple models for operational reasons:
Reason | Description |
Model migration | Legacy models maintained alongside newer versions during transitions |
Behavioral consistency | Agent scaffolds and evaluation suites depend on specific model behaviors |
Governance routing | Subtasks routed to different endpoints based on access levels |
Gradual rollout | New models tested on subset of traffic before full deployment |
Key insight: If you see a production system using 3-4 models, don't assume the task requires that complexity. It may simply reflect prudent operational practices around change management.
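As an illustration of how such architectures arise operationally, the sketch below routes subtasks across models for the governance, legacy-compatibility, and gradual-rollout reasons in the table above; the model names, routing table, and canary share are illustrative assumptions.

```python
# Sketch of a multi-model setup driven by operations (migration, governance,
# gradual rollout) rather than task complexity.
import random

ROUTES = {
    # Governance routing: sensitive subtasks pinned to an approved endpoint.
    "hr_summary": "approved-internal-model",
    # Behavioral consistency: a legacy scaffold still depends on the old model.
    "legacy_report": "legacy-model-v1",
}

# Gradual rollout: send a small share of default traffic to the newer model.
CANARY_SHARE = 0.10


def pick_model(subtask: str) -> str:
    if subtask in ROUTES:
        return ROUTES[subtask]
    return "new-model-v2" if random.random() < CANARY_SHARE else "current-model-v1"


if __name__ == "__main__":
    for task in ["hr_summary", "legacy_report", "chat"]:
        print(task, "->", pick_model(task))
```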
Finding 4: Architectural Complexity Correlates with Deployment Stage
The full survey data (including prototypes and research agents) shows a heavier tail toward agents using more distinct models. However, this complexity diminishes as systems move toward production:
Stage | Architectural Tendency |
Research/Prototype | Many models, complex architectures |
Pilot | Moderate complexity, some consolidation |
Production | Simpler architectures, fewer models |

This pattern suggests that production pressures force architectural simplification — complex designs that work in development often prove unmaintainable in production.
Key Findings: Evaluation
Finding 5: Human Evaluation Dominates
74% of production agents depend primarily on human evaluation.
Despite significant investment in automated evaluation, human judgment remains the gold standard for assessing agent output quality. The breakdown of evaluation methods:
Method | Usage Rate | Notes |
Human-in-the-loop | 74.2% | Dominant approach across domains |
LLM-as-a-judge | 51.6% | Growing but not yet replacing humans |
Rule-based verification | 42.9% | Useful for structured outputs only |
Why humans still dominate: Production agents handle tasks requiring nuanced judgment — customer support, HR operations, business analytics — where rule-based verification proves insufficient and LLM judges lack domain expertise.
Finding 6: No "Five 9s" for Agents
A striking finding from the case studies:
"No team reports applying standard production reliability metrics such as five 9s availability to their agent systems."
Traditional software reliability metrics (99.999% uptime) don't translate to AI agents. Instead, evaluation centers on:
Output correctness: Did the agent produce the right answer?
Response quality: Was the output well-formed and useful?
Task completion: Did the agent achieve the user's goal?
Implication: Don't try to force traditional SRE metrics onto agent systems. Develop agent-native quality metrics that reflect actual user value.
Finding 7: Evaluation Pipelines Are Converging
Despite diverse domains and organizational contexts, the study reveals a consistent evaluation pattern emerging across teams: a five-stage pipeline that appeared independently across HR, cloud infrastructure, analytics, customer support, and other domains.
Pipeline characteristics:
Extends from development through production runtime
Creates continuous feedback loop (Stage 5 feeds back to Stage 2)
Enables ongoing quality assessment without manual review of every interaction
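A minimal sketch of the golden-set portion of such a pipeline is shown below, assuming hypothetical `call_agent` and `llm_judge` callables; the golden set, rubric, and scoring are illustrative, not taken from the paper.

```python
# Sketch of a golden-set evaluation harness feeding an LLM-as-judge stage.
from typing import Callable

GOLDEN_SET = [
    {"input": "Reset my VPN access", "reference": "Steps to reset VPN access ..."},
    {"input": "Summarize Q3 spend",  "reference": "Q3 spend summary ..."},
]


def evaluate(call_agent: Callable[[str], str],
             llm_judge: Callable[[str, str, str], float]) -> float:
    """Score agent outputs against curated references; return the mean judge score."""
    scores = []
    for case in GOLDEN_SET:
        output = call_agent(case["input"])
        # The judge compares output to the reference (e.g. a 0.0-1.0 rubric score).
        scores.append(llm_judge(case["input"], output, case["reference"]))
    return sum(scores) / len(scores)


if __name__ == "__main__":
    fake_agent = lambda q: f"answer to: {q}"
    fake_judge = lambda q, out, ref: 1.0 if ref.split()[0].lower() in out.lower() else 0.0
    print(f"mean judge score: {evaluate(fake_agent, fake_judge):.2f}")
```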
Research opportunity: The convergence of nearly identical pipelines across diverse contexts suggests opportunities for:
Reusable data ingestion pipelines
Standardized curation methods for golden sets
Synthetic generation techniques for evaluation datasets
Finding 8: Baseline Comparisons Are Uncommon
Only 38.7% of survey respondents compare their deployed agents against non-agentic baselines (existing software, traditional workflows, or human execution).
This is a missed opportunity. Without baseline comparisons, teams cannot:
Quantify the value agents provide
Identify regression when agents underperform traditional approaches
Make informed build-vs-buy decisions
Recommendation: Always establish a baseline before deploying agents. Even a simple "human doing this task manually" baseline provides essential context for measuring agent value.
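As a sketch of what even a lightweight baseline comparison can look like, the snippet below contrasts per-task time and rework rates for a manual workflow versus an agent; all numbers are illustrative placeholders.

```python
# Sketch of a baseline comparison: the same tasks measured with the existing
# manual workflow and with the agent.
from statistics import mean

baseline_runs = [  # minutes per task and whether the result needed rework
    {"minutes": 118, "error": False},
    {"minutes": 135, "error": True},
    {"minutes": 102, "error": False},
]
agent_runs = [
    {"minutes": 12, "error": False},
    {"minutes": 9,  "error": True},
    {"minutes": 14, "error": False},
]


def summarize(runs):
    return mean(r["minutes"] for r in runs), mean(r["error"] for r in runs)


base_time, base_err = summarize(baseline_runs)
agent_time, agent_err = summarize(agent_runs)
print(f"time saved per task: {base_time - agent_time:.0f} min")
print(f"error rate: baseline {base_err:.0%} vs agent {agent_err:.0%}")
```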
Key Findings: Challenges
Finding 9: Reliability Is the #1 Challenge
When asked about top development challenges, survey respondents across all agent stages ranked concerns as follows:
Challenge Category | Selection Rate |
Core Technical Focus | 37.9% |
Compliance | 17.0% |
Governance | 3.4% |
"Core Technical Focus" encompasses reliability challenges — ensuring agents produce correct outputs consistently. This dominates over governance and compliance concerns by a wide margin.
Why reliability trumps governance: You can't govern an agent that doesn't work reliably. Teams are discovering that fundamental correctness challenges must be solved before higher-level concerns become relevant.
Finding 10: Reliability Challenges Are Multifaceted
The reliability challenge breaks down into several sub-problems:
Ensuring correctness:
Agents produce plausible but wrong outputs
Edge cases trigger unexpected behaviors
Context limitations cause information loss
Evaluating correctness:
Ground truth is often unavailable or ambiguous
Human evaluation doesn't scale
Automated metrics don't capture real quality
Maintaining correctness:
Model updates change agent behavior
Prompt drift over time
Data distribution shifts
Finding 11: Complex Tasks Require Human Judgment
The dominance of human-in-the-loop evaluation (74.2%) reflects a fundamental reality:
"Production agents already handle complex tasks beyond classification, entity resolution, or pattern matching. These agents operate in domains requiring nuanced judgment where rule-based methods prove insufficient."
Examples from case studies:
Customer support voice assistance: Requires understanding context, emotion, and appropriate escalation
HR operations: Involves sensitive decisions with legal and ethical implications
Business analytics: Demands domain expertise to interpret ambiguous data
Implication: Don't expect to fully automate evaluation for complex agent tasks. Budget for ongoing human review as a feature, not a bug.
Case Study Highlights
The 20 in-depth case studies provide rich operational detail. While anonymized, several patterns emerge across representative examples:
Case Study Pattern A: Customer Service Voice Agent
Attribute | Detail |
Domain | Customer support |
Architecture | Single model, prompting-based |
Steps before human | 5-8 typical |
Evaluation | Human review of call transcripts + customer satisfaction scores |
Key learnings:
Voice adds complexity (ASR errors compound with LLM errors)
Escalation triggers are critical safety mechanisms
Customer satisfaction correlates weakly with "correct" answers — tone matters
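A simplified sketch of what escalation triggers for a voice agent might look like follows; the signal names and thresholds are assumptions for illustration, not values reported in the case study.

```python
# Sketch of escalation triggers for a voice support agent.
from dataclasses import dataclass


@dataclass
class TurnSignals:
    asr_confidence: float          # speech recognition confidence for the caller turn
    sentiment: float               # -1.0 (angry) .. 1.0 (happy)
    failed_attempts: int           # times the agent failed to resolve the intent
    caller_asked_for_human: bool


def should_escalate(s: TurnSignals) -> bool:
    return (
        s.caller_asked_for_human
        or s.asr_confidence < 0.6      # ASR errors compound with LLM errors
        or s.sentiment < -0.5          # tone matters as much as correctness
        or s.failed_attempts >= 2
    )


if __name__ == "__main__":
    print(should_escalate(TurnSignals(0.9, -0.7, 0, False)))  # True: negative tone
```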
Case Study Pattern B: Cloud Infrastructure Assistant
Attribute | Detail |
Domain | DevOps / SRE |
Architecture | Multi-model (different models for different subtasks) |
Steps before human | 8-12 typical |
Evaluation | Golden command sets + human review of suggested changes |
Key learnings:
High-stakes actions (delete, modify) require human approval
Explanations matter as much as actions
Teams maintain "shadow mode" for weeks before enabling actions
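The approval-gate and shadow-mode behaviors might be wired up roughly as in the sketch below; the action list, `SHADOW_MODE` flag, and hooks are illustrative assumptions.

```python
# Sketch of a high-stakes action gate with a shadow mode.
HIGH_RISK_VERBS = {"delete", "modify", "scale-down"}
SHADOW_MODE = True  # log proposed actions for weeks before executing anything


def handle_action(verb: str, target: str, explanation: str,
                  human_approves, execute, log) -> None:
    log(f"proposed: {verb} {target} because {explanation}")
    if SHADOW_MODE:
        return  # side-by-side comparison only; nothing is executed
    if verb in HIGH_RISK_VERBS and not human_approves(verb, target, explanation):
        log(f"blocked: {verb} {target} (no human approval)")
        return
    execute(verb, target)


if __name__ == "__main__":
    handle_action("delete", "vm-1043", "idle for 30 days",
                  human_approves=lambda *a: False,
                  execute=lambda v, t: None,
                  log=print)
```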
Case Study Pattern C: Business Analytics Agent
Attribute | Detail |
Domain | Finance / Analytics |
Architecture | Single model with RAG |
Steps before human | 3-5 typical |
Evaluation | SME review + comparison to manual analysis |
Key learnings:
Numeric accuracy is non-negotiable
Citation/sourcing critical for trust
Users prefer conservative agents that say "I don't know"
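A conservative answer policy of this kind could look roughly like the following sketch, where the citation requirement and confidence threshold are illustrative assumptions.

```python
# Sketch of a conservative answer policy: require citations or abstain.
from dataclasses import dataclass, field


@dataclass
class Draft:
    answer: str
    citations: list = field(default_factory=list)  # document/table identifiers
    confidence: float = 0.0


def finalize(draft: Draft) -> str:
    if draft.confidence < 0.7 or not draft.citations:
        return "I don't know. Please review the underlying data manually."
    return f"{draft.answer} (sources: {', '.join(draft.citations)})"


if __name__ == "__main__":
    print(finalize(Draft("Q3 revenue grew 4.2%", ["finance_db.q3_summary"], 0.86)))
    print(finalize(Draft("Churn doubled", [], 0.55)))
```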
Case Study Pattern D: HR Operations Assistant
Attribute | Detail |
Domain | Human Resources |
Architecture | Single model, heavy prompt engineering |
Steps before human | 2-4 typical (very short loops) |
Evaluation | Legal/HR review + employee feedback |
Key learnings:
Compliance requirements drive ultra-short loops
Audit trails mandatory for all recommendations
Agents handle gathering/summarization; humans make decisions
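An audit trail for such recommendations might be as simple as the sketch below; the log schema and field names are assumptions for illustration.

```python
# Sketch of an audit trail: every agent recommendation is logged, and the
# final decision stays with a human reviewer.
import json
from datetime import datetime, timezone


def record_recommendation(case_id: str, summary: str, recommendation: str,
                          reviewer: str, audit_log_path: str = "hr_audit.log") -> None:
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "case_id": case_id,
        "agent_summary": summary,            # the agent gathers and summarizes
        "agent_recommendation": recommendation,
        "decision_owner": reviewer,          # a human makes the final call
        "decision": "pending_human_review",
    }
    with open(audit_log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")


if __name__ == "__main__":
    record_recommendation("CASE-201", "Leave request exceeds policy by 2 days",
                          "Escalate to HR business partner",
                          reviewer="hr_lead@example.com")
```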
Case Study Pattern E: Software Development Agent
Attribute | Detail |
Domain | Engineering |
Architecture | Multi-model (code gen + review) |
Steps before human | 10-15 typical |
Evaluation | Code review + test pass rates + developer satisfaction |
Key learnings:
Highest step counts in the study — developers tolerate more autonomy
But: all code still goes through human review before merge
Value measured in time saved, not code quality (humans ensure quality)
Case Study Pattern F: Document Processing Agent
Attribute | Detail |
Domain | Legal / Compliance |
Architecture | Single model with structured output |
Steps before human | 4-6 typical |
Evaluation | Attorney review + accuracy sampling |
Key learnings:
Extraction accuracy must exceed 95% for adoption
Confidence scores help prioritize human review
Agents accelerate review but don't replace it
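Confidence-based triage might look roughly like the sketch below; the extraction records and review threshold are illustrative.

```python
# Sketch of confidence-based triage: low-confidence extractions are queued
# for attorney review first.
extractions = [  # field name, extracted value, model confidence
    {"field": "effective_date", "value": "2024-03-01", "confidence": 0.99},
    {"field": "termination_clause", "value": "90 days notice", "confidence": 0.62},
    {"field": "governing_law", "value": "Delaware", "confidence": 0.91},
]

REVIEW_THRESHOLD = 0.95  # tied to the ~95% accuracy bar mentioned above

needs_review = sorted(
    (e for e in extractions if e["confidence"] < REVIEW_THRESHOLD),
    key=lambda e: e["confidence"],  # least confident first
)
for e in needs_review:
    print(f"review: {e['field']} = {e['value']!r} ({e['confidence']:.0%})")
```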
Cross-Case Patterns
Pattern | Frequency | Description |
Short loops in high-stakes domains | High | HR, finance, healthcare use 2-5 step agents |
Shadow mode before production | High | Weeks/months of side-by-side comparison |
Explanation requirements | High | Users want to know "why," not just "what" |
Escalation as feature | Universal | Every production agent has escalation paths |
Human-final-decision | Near-universal | Agents recommend; humans decide |
Data & Demographics
Survey Population
The study captured responses from practitioners across the agent development lifecycle:
Role Distribution:
Engineers / Developers
Product Managers
Data Scientists / ML Engineers
Technical Leads / Architects
Operations / SRE
Research Scientists
(Specific percentages anonymized in source paper)
Deployment Stage Distribution
Stage | Description |
Production | Live deployment with real users |
Pilot | Controlled rollout for evaluation |
Development | Active building (excluded from main analysis) |
Research | Experimental (excluded from main analysis) |
The analysis focuses on deployed agents (production + pilot) to ensure findings reflect operational reality.
Domain Coverage
26 domains represented, including:
Sector | Example Domains |
Enterprise Operations | HR, Finance, Legal, Procurement |
Technical | DevOps, SRE, Software Development |
Customer-Facing | Support, Sales, Marketing |
Specialized | Healthcare, Manufacturing, Logistics |
This breadth suggests findings generalize across industries, not just tech-forward sectors.
Agent Characteristics
Typical production agent profile:
Characteristic | Typical Value |
Steps before human | ≤10 (68%) |
Model approach | Prompting (70%) |
Primary evaluation | Human (74%) |
Model count | 1-2 (most common) |
Deployment age | Months to 1 year |
Benefits Realized
When asked about benefits from deployed agents, respondents selected:
Benefit | Selection Rate |
Increasing Productivity | 73% |
Other benefits | Varies |
Operational Stability | Lowest |
"Increasing productivity" — completing tasks faster than previous approaches — is the dominant realized benefit. "Operational stability" (mitigating risk, accelerating failure recovery) is least often selected.
Interpretation: Current production agents excel at speed improvements but haven't yet demonstrated reliability advantages over traditional systems.
Strategic Implications
For Engineering Leaders
1. Reset complexity expectations
The 68% / 10-step finding should calibrate your planning. If your team is designing a 50-step autonomous agent for v1, reconsider. Successful production deployments:
Start with bounded scope (5-10 steps)
Add human checkpoints deliberately
Expand autonomy incrementally based on observed reliability
2. Defer fine-tuning decisions
With 70% of production agents using prompting alone, fine-tuning should be a later optimization, not a launch requirement. Fine-tune when you have:
Sufficient production data to train on
Clear evidence prompting is the bottleneck
Infrastructure to maintain custom models
3. Budget for human evaluation
The 74% human evaluation finding isn't a temporary state — it reflects fundamental task complexity. Plan for:
Ongoing human review capacity
Tooling to make human review efficient
Feedback loops from reviewers to improve agents
For Product Leaders
4. Position agents as productivity tools
The 73% "increasing productivity" finding provides your value proposition. Frame agents as:
"Complete this task in 10 minutes instead of 2 hours"
NOT "Fully autonomous system that replaces humans"
Users and buyers respond to concrete time savings more than abstract autonomy promises.
5. Design for human-in-the-loop from day one
Human oversight isn't a compromise — it's a feature. Products that embrace human-agent collaboration will outperform those that promise (and fail to deliver) full autonomy.
6. Establish baselines before launch
Join the 38.7% who compare against baselines. Before deploying:
Measure current task completion time
Document current error rates
Establish user satisfaction baseline
For AI/ML Teams
7. Invest in evaluation infrastructure
The convergent evaluation pipeline (golden sets → feedback → SME → LLM-judge) represents emerging best practice. Build or adopt:
Golden set management tooling
Feedback collection mechanisms
SME review workflows
LLM-as-judge pipelines
8. Develop agent-native reliability metrics
Traditional software metrics don't apply. Define metrics that capture:
Output correctness rate
Appropriate escalation rate
User goal completion rate
Time to human intervention
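As a starting point, these metrics can be computed directly from interaction logs, as in the sketch below; the log schema is an illustrative assumption.

```python
# Sketch of agent-native reliability metrics computed from interaction logs.
interactions = [
    {"correct": True,  "escalated": False, "goal_met": True,  "steps_to_human": 6},
    {"correct": False, "escalated": True,  "goal_met": False, "steps_to_human": 3},
    {"correct": True,  "escalated": False, "goal_met": True,  "steps_to_human": 8},
]

n = len(interactions)
metrics = {
    "output_correctness_rate": sum(i["correct"] for i in interactions) / n,
    "escalation_rate": sum(i["escalated"] for i in interactions) / n,
    "goal_completion_rate": sum(i["goal_met"] for i in interactions) / n,
    "avg_steps_to_human": sum(i["steps_to_human"] for i in interactions) / n,
}
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```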
9. Plan for model migration complexity
Multi-model architectures often reflect migration challenges, not task requirements. Prepare for:
Behavioral differences across model versions
Evaluation suite dependencies on specific models
Gradual rollout requirements
For Executives
10. Reliability is the bottleneck
The 37.9% "Core Technical Focus" finding identifies where investment is needed. Governance and compliance (3.4% + 17%) are important but secondary — you can't govern unreliable systems.
11. Expect iteration, not instant success
Production agents require ongoing refinement. Budget for:
Multiple deployment iterations
Continuous evaluation
Incremental capability expansion
12. Measure what matters
Productivity gains (73% cite this) provide the clearest ROI signal. Track:
Time saved per task
Tasks completed per period
User satisfaction with agent assistance
Related Work
This study fills a critical gap in the AI agent literature. Prior work falls into several categories:
Industry Reports
Organizations like PwC, Capgemini, McKinsey, and Microsoft have published agent-related surveys focusing on:
Organizational readiness
Market trends
Executive perspectives
Technology adoption patterns
These provide valuable context but lack engineering-level technical detail.
Practitioner Surveys
LangChain's "State of AI Agents 2024" surveyed 1,300+ professionals on agent motivations and challenges. This study differs in:
Scope: Focuses specifically on production/pilot systems
Depth: Includes 20 in-depth case studies
Technical detail: Captures architecture, evaluation, and operational data
Academic Agent Literature
Extensive research examines LLM-powered agents from theoretical and benchmark perspectives. However, academic work typically:
Evaluates on research benchmarks, not production metrics
Studies prototype systems, not deployed agents
Focuses on capability, not reliability
How This Study Differs
Dimension | Prior Work | This Study |
Systems studied | Prototypes, benchmarks | Production deployments |
Data source | Benchmarks, papers | Practitioner surveys, interviews |
Focus | Capabilities | Reliability, operations |
Perspective | What's possible | What works |
Conclusion
"Measuring Agents in Production" provides the most comprehensive empirical view of deployed AI agents to date. The findings challenge assumptions prevalent in research and industry discourse:
Agents are simpler than expected — 10 steps, not 100
Prompting beats fine-tuning — for 70% of production use cases
Humans remain essential — 74% rely on human evaluation
Reliability is the bottleneck — not governance, not compliance
For practitioners, the message is clear: start simple, embrace human oversight, invest in evaluation, and measure productivity gains. The path to production runs through pragmatic engineering, not ambitious autonomy.
References
Primary Source:
Pan, M.Z. et al. (2025). "Measuring Agents in Production." arXiv:2512.04123. https://arxiv.org/abs/2512.04123
Related Papers from Repository:
"Intuition to Evidence: Measuring AI's True Impact on Developer Productivity" (arXiv:2509.19708)
"Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity" (arXiv:2507.09089)
"Enterprise Large Language Model Evaluation Benchmark" (arXiv:2506.20274)
Arindam Banerji, PhD