AI Productization Report: Measuring Agents in Production
- Arindom Banerjee
- Dec 7
The First Large-Scale Study of What Actually Works in Enterprise AI Agent Deployments
Report Series: AI Productization Deep Dives
Report Number: 003
Date: December 2025
Source Paper: "Measuring Agents in Production" (arXiv:2512.04123)
Authors: Melissa Z. Pan et al. (25 authors), UC Berkeley
Paper Date: December 2, 2025
Executive Summary
This report analyzes the first large-scale systematic study of AI agents running in production environments. The Berkeley research team surveyed 306 practitioners and conducted 20 in-depth case studies across 26 industry domains, providing unprecedented visibility into what actually works when organizations deploy AI agents at scale.
Headline Findings
68% of production agents execute at most 10 steps before requiring human intervention
70% rely on prompting off-the-shelf models rather than fine-tuning
74% depend primarily on human evaluation of agent outputs
Reliability, not governance or compliance, is the top deployment challenge
73% of respondents cite increased productivity as the main realized benefit
The Central Insight
Production agents are far simpler than academic literature suggests. While research papers showcase complex multi-agent systems with dozens of steps and sophisticated reasoning chains, real-world deployments favor controllable, human-supervised systems that prioritize reliability over autonomy.
This isn't a failure of ambition — it's pragmatic engineering. Organizations have learned that simpler agents with robust human oversight deliver more consistent value than complex autonomous systems that fail unpredictably.
What This Means for Practitioners
Start simple: 10-step agents with human checkpoints outperform ambitious autonomous designs
Skip fine-tuning (initially): Prompting off-the-shelf models gets you to production faster
Invest in evaluation: Human-in-the-loop and LLM-as-judge pipelines are table stakes
Expect reliability challenges: This is the #1 blocker — plan for it from day one
Measure productivity gains: This is how successful teams justify continued investment
Study Methodology
Research Design
The Berkeley team employed a mixed-methods approach combining quantitative survey data with qualitative case study interviews. This dual approach provides both statistical breadth and operational depth.
Survey Component:
306 valid responses from practitioners working on AI agents
Filtered to production and pilot systems only (excludes prototypes, research artifacts, retired systems)
Structured questions: single-select, multi-select, and numeric formats
Minimal post-processing required due to structured format
Case Study Component:
20 in-depth interviews with teams operating production agents
Interview duration: 30-90 minutes each
Interview teams: 2-5 organizationally neutral interviewers per session
Semi-structured protocol covering 11 topic areas:
System architecture
Evaluation mechanisms
Deployment challenges
Operational requirements
Measurable agent value
(Plus 6 additional areas)
Rigor and Validation
The research team implemented several quality controls:
Cross-validation: Final summaries validated among all interviewers
Anonymization: All data anonymized per confidentiality agreements
Aggregate presentation: Findings presented in aggregate to protect individual organizations
Recording protocols: Interviews recorded based on participant preferences, with human note-takers
Filtering Criteria
The study distinguishes between deployment stages:
Stage | Definition | Included? |
Production | Fully deployed, used by target end users in live environments | ✅ Yes |
Pilot | Deployed to controlled user groups for evaluation or phased rollout | ✅ Yes |
Prototype | Development artifacts not yet deployed | ❌ No |
Research | Academic or experimental systems | ❌ No |
Retired | Previously deployed but no longer active | ❌ No |
This filtering ensures findings reflect real operational experience rather than aspirational designs.
Domain Coverage
The study spans 26 distinct domains, providing cross-industry perspective. While specific domain breakdowns are anonymized, the case studies include:
Human resources operations
Cloud infrastructure management
Business analytics
Customer support (voice assistance)
Financial services
Healthcare operations
Software development
And 19+ additional domains

Key Findings: Architecture
Finding 1: Agents Execute Fewer Steps Than Expected
68% of production agents execute at most 10 steps before requiring human intervention.
This finding challenges the prevailing narrative of autonomous agents operating independently for extended periods. In practice, successful production agents are:
Short-loop systems: Complete discrete tasks, then checkpoint with humans
Human-supervised: Designed with intervention points, not despite them
Bounded in scope: Tackle well-defined subtasks rather than open-ended goals
Why this matters: Teams often over-engineer initial agent deployments, building for autonomy they don't need. Starting with 5-10 step workflows and adding complexity incrementally produces better outcomes than launching with ambitious multi-step designs.
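To make the pattern concrete, here is a minimal Python sketch of a bounded agent loop with a human checkpoint; `run_step`, `request_human_review`, and the `AgentState` shape are hypothetical stand-ins for your own model call and review channel, not anything specified in the paper.

```python
# Minimal sketch of a bounded, human-supervised agent loop.
from dataclasses import dataclass, field

MAX_STEPS = 10  # mirrors the <=10-step pattern reported for 68% of production agents


@dataclass
class AgentState:
    goal: str
    history: list = field(default_factory=list)
    done: bool = False


def run_step(state: AgentState) -> str:
    """Placeholder for one model/tool call; returns that step's output."""
    return f"step output for: {state.goal}"


def request_human_review(state: AgentState) -> None:
    """Placeholder checkpoint: surface state.history to a human reviewer."""
    pass


def run_bounded_agent(goal: str) -> AgentState:
    state = AgentState(goal=goal)
    for _ in range(MAX_STEPS):
        state.history.append(run_step(state))
        if state.done:
            break
    # Checkpoint with a human instead of continuing autonomously.
    request_human_review(state)
    return state


if __name__ == "__main__":
    print(run_bounded_agent("summarize open support tickets").history[:2])
```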
Finding 2: Prompting Dominates Over Fine-Tuning
70% of production agents rely on prompting off-the-shelf models instead of weight tuning.
Fine-tuning, while powerful, introduces operational complexity:
Training infrastructure requirements
Model versioning and deployment pipelines
Ongoing maintenance as base models evolve
Evaluation challenges for custom weights
Production teams have discovered that well-crafted prompts on frontier models often match or exceed fine-tuned performance for their specific use cases — with dramatically lower operational overhead.
Implication: Unless you have a compelling reason to fine-tune (proprietary data, extreme latency requirements, cost optimization at massive scale), start with prompting. You can always fine-tune later if needed.
Finding 3: Multi-Model Architectures Reflect Operations, Not Task Complexity
The study reveals a counterintuitive insight about multi-model deployments:
"Multi-model architectures can emerge from lifecycle management needs rather than complex reasoning requirements for the agent task."
Organizations run multiple models for operational reasons:
Reason | Description |
Model migration | Legacy models maintained alongside newer versions during transitions |
Behavioral consistency | Agent scaffolds and evaluation suites depend on specific model behaviors |
Governance routing | Subtasks routed to different endpoints based on access levels |
Gradual rollout | New models tested on subset of traffic before full deployment |
Key insight: If you see a production system using 3-4 models, don't assume the task requires that complexity. It may simply reflect prudent operational practices around change management.
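As an illustration of how such architectures arise operationally, the sketch below routes subtasks across models for the governance, legacy-compatibility, and gradual-rollout reasons in the table above; the model names, routing table, and canary share are illustrative assumptions.

```python
# Sketch of a multi-model setup driven by operations (migration, governance,
# gradual rollout) rather than task complexity.
import random

ROUTES = {
    # Governance routing: sensitive subtasks pinned to an approved endpoint.
    "hr_summary": "approved-internal-model",
    # Behavioral consistency: a legacy scaffold still depends on the old model.
    "legacy_report": "legacy-model-v1",
}

# Gradual rollout: send a small share of default traffic to the newer model.
CANARY_SHARE = 0.10


def pick_model(subtask: str) -> str:
    if subtask in ROUTES:
        return ROUTES[subtask]
    return "new-model-v2" if random.random() < CANARY_SHARE else "current-model-v1"


if __name__ == "__main__":
    for task in ["hr_summary", "legacy_report", "chat"]:
        print(task, "->", pick_model(task))
```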
Finding 4: Architectural Complexity Correlates with Deployment Stage
The full survey data (including prototypes and research agents) shows a heavier tail toward agents using more distinct models. However, this complexity diminishes as systems move toward production:
Stage | Architectural Tendency |
Research/Prototype | Many models, complex architectures |
Pilot | Moderate complexity, some consolidation |
Production | Simpler architectures, fewer models |

This pattern suggests that production pressures force architectural simplification — complex designs that work in development often prove unmaintainable in production.
Key Findings: Evaluation
Finding 5: Human Evaluation Dominates
74% of production agents depend primarily on human evaluation.
Despite significant investment in automated evaluation, human judgment remains the gold standard for assessing agent output quality. The breakdown of evaluation methods:
Method | Usage Rate | Notes |
Human-in-the-loop | 74.2% | Dominant approach across domains |
LLM-as-a-judge | 51.6% | Growing but not yet replacing humans |
Rule-based verification | 42.9% | Useful for structured outputs only |
Why humans still dominate: Production agents handle tasks requiring nuanced judgment — customer support, HR operations, business analytics — where rule-based verification proves insufficient and LLM judges lack domain expertise.
Finding 6: No "Five 9s" for Agents
A striking finding from the case studies:
"No team reports applying standard production reliability metrics such as five 9s availability to their agent systems."
Traditional software reliability metrics (99.999% uptime) don't translate to AI agents. Instead, evaluation centers on:
Output correctness: Did the agent produce the right answer?
Response quality: Was the output well-formed and useful?
Task completion: Did the agent achieve the user's goal?
Implication: Don't try to force traditional SRE metrics onto agent systems. Develop agent-native quality metrics that reflect actual user value.
Finding 7: Evaluation Pipelines Are Converging
Despite diverse domains and organizational contexts, the study reveals a consistent evaluation pattern emerging across teams: a five-stage pipeline that appeared independently across HR, cloud infrastructure, analytics, customer support, and other domains.
Pipeline characteristics:
Extends from development through production runtime
Creates continuous feedback loop (Stage 5 feeds back to Stage 2)
Enables ongoing quality assessment without manual review of every interaction
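A minimal sketch of the golden-set portion of such a pipeline is shown below, assuming hypothetical `call_agent` and `llm_judge` callables; the golden set, rubric, and scoring are illustrative, not taken from the paper.

```python
# Sketch of a golden-set evaluation harness feeding an LLM-as-judge stage.
from typing import Callable

GOLDEN_SET = [
    {"input": "Reset my VPN access", "reference": "Steps to reset VPN access ..."},
    {"input": "Summarize Q3 spend",  "reference": "Q3 spend summary ..."},
]


def evaluate(call_agent: Callable[[str], str],
             llm_judge: Callable[[str, str, str], float]) -> float:
    """Score agent outputs against curated references; return the mean judge score."""
    scores = []
    for case in GOLDEN_SET:
        output = call_agent(case["input"])
        # The judge compares output to the reference (e.g. a 0.0-1.0 rubric score).
        scores.append(llm_judge(case["input"], output, case["reference"]))
    return sum(scores) / len(scores)


if __name__ == "__main__":
    fake_agent = lambda q: f"answer to: {q}"
    fake_judge = lambda q, out, ref: 1.0 if ref.split()[0].lower() in out.lower() else 0.0
    print(f"mean judge score: {evaluate(fake_agent, fake_judge):.2f}")
```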
Research opportunity: The convergence of nearly identical pipelines across diverse contexts suggests opportunities for:
Reusable data ingestion pipelines
Standardized curation methods for golden sets
Synthetic generation techniques for evaluation datasets
Finding 8: Baseline Comparisons Are Uncommon
Only 38.7% of survey respondents compare their deployed agents against non-agentic baselines (existing software, traditional workflows, or human execution).
This is a missed opportunity. Without baseline comparisons, teams cannot:
Quantify the value agents provide
Identify regression when agents underperform traditional approaches
Make informed build-vs-buy decisions
Recommendation: Always establish a baseline before deploying agents. Even a simple "human doing this task manually" baseline provides essential context for measuring agent value.
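As a sketch of what even a lightweight baseline comparison can look like, the snippet below contrasts per-task time and rework rates for a manual workflow versus an agent; all numbers are illustrative placeholders.

```python
# Sketch of a baseline comparison: the same tasks measured with the existing
# manual workflow and with the agent.
from statistics import mean

baseline_runs = [  # minutes per task and whether the result needed rework
    {"minutes": 118, "error": False},
    {"minutes": 135, "error": True},
    {"minutes": 102, "error": False},
]
agent_runs = [
    {"minutes": 12, "error": False},
    {"minutes": 9,  "error": True},
    {"minutes": 14, "error": False},
]


def summarize(runs):
    return mean(r["minutes"] for r in runs), mean(r["error"] for r in runs)


base_time, base_err = summarize(baseline_runs)
agent_time, agent_err = summarize(agent_runs)
print(f"time saved per task: {base_time - agent_time:.0f} min")
print(f"error rate: baseline {base_err:.0%} vs agent {agent_err:.0%}")
```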
Key Findings: Challenges
Finding 9: Reliability Is the #1 Challenge
When asked about top development challenges, survey respondents across all agent stages ranked concerns as follows:
Challenge Category | Selection Rate |
Core Technical Focus | 37.9% |
Compliance | 17.0% |
Governance | 3.4% |
"Core Technical Focus" encompasses reliability challenges — ensuring agents produce correct outputs consistently. This dominates over governance and compliance concerns by a wide margin.
Why reliability trumps governance: You can't govern an agent that doesn't work reliably. Teams are discovering that fundamental correctness challenges must be solved before higher-level concerns become relevant.
Finding 10: Reliability Challenges Are Multifaceted
The reliability challenge breaks down into several sub-problems:
Ensuring correctness:
Agents produce plausible but wrong outputs
Edge cases trigger unexpected behaviors
Context limitations cause information loss
Evaluating correctness:
Ground truth is often unavailable or ambiguous
Human evaluation doesn't scale
Automated metrics don't capture real quality
Maintaining correctness:
Model updates change agent behavior
Prompt drift over time
Data distribution shifts
Finding 11: Complex Tasks Require Human Judgment
The dominance of human-in-the-loop evaluation (74.2%) reflects a fundamental reality:
"Production agents already handle complex tasks beyond classification, entity resolution, or pattern matching. These agents operate in domains requiring nuanced judgment where rule-based methods prove insufficient."
Examples from case studies:
Customer support voice assistance: Requires understanding context, emotion, and appropriate escalation
HR operations: Involves sensitive decisions with legal and ethical implications
Business analytics: Demands domain expertise to interpret ambiguous data
Implication: Don't expect to fully automate evaluation for complex agent tasks. Budget for ongoing human review as a feature, not a bug.
Case Study Highlights
The 20 in-depth case studies provide rich operational detail. While anonymized, several patterns emerge across representative examples:
Case Study Pattern A: Customer Service Voice Agent
Attribute | Detail |
Domain | Customer support |
Architecture | Single model, prompting-based |
Steps before human | 5-8 typical |
Evaluation | Human review of call transcripts + customer satisfaction scores |
Key learnings:
Voice adds complexity (ASR errors compound with LLM errors)
Escalation triggers are critical safety mechanisms
Customer satisfaction correlates weakly with "correct" answers — tone matters
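A simplified sketch of what escalation triggers for a voice agent might look like follows; the signal names and thresholds are assumptions for illustration, not values reported in the case study.

```python
# Sketch of escalation triggers for a voice support agent.
from dataclasses import dataclass


@dataclass
class TurnSignals:
    asr_confidence: float          # speech recognition confidence for the caller turn
    sentiment: float               # -1.0 (angry) .. 1.0 (happy)
    failed_attempts: int           # times the agent failed to resolve the intent
    caller_asked_for_human: bool


def should_escalate(s: TurnSignals) -> bool:
    return (
        s.caller_asked_for_human
        or s.asr_confidence < 0.6      # ASR errors compound with LLM errors
        or s.sentiment < -0.5          # tone matters as much as correctness
        or s.failed_attempts >= 2
    )


if __name__ == "__main__":
    print(should_escalate(TurnSignals(0.9, -0.7, 0, False)))  # True: negative tone
```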
Case Study Pattern B: Cloud Infrastructure Assistant
Attribute | Detail |
Domain | DevOps / SRE |
Architecture | Multi-model (different models for different subtasks) |
Steps before human | 8-12 typical |
Evaluation | Golden command sets + human review of suggested changes |
Key learnings:
High-stakes actions (delete, modify) require human approval
Explanations matter as much as actions
Teams maintain "shadow mode" for weeks before enabling actions
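The approval-gate and shadow-mode behaviors might be wired up roughly as in the sketch below; the action list, `SHADOW_MODE` flag, and hooks are illustrative assumptions.

```python
# Sketch of a high-stakes action gate with a shadow mode.
HIGH_RISK_VERBS = {"delete", "modify", "scale-down"}
SHADOW_MODE = True  # log proposed actions for weeks before executing anything


def handle_action(verb: str, target: str, explanation: str,
                  human_approves, execute, log) -> None:
    log(f"proposed: {verb} {target} because {explanation}")
    if SHADOW_MODE:
        return  # side-by-side comparison only; nothing is executed
    if verb in HIGH_RISK_VERBS and not human_approves(verb, target, explanation):
        log(f"blocked: {verb} {target} (no human approval)")
        return
    execute(verb, target)


if __name__ == "__main__":
    handle_action("delete", "vm-1043", "idle for 30 days",
                  human_approves=lambda *a: False,
                  execute=lambda v, t: None,
                  log=print)
```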
Case Study Pattern C: Business Analytics Agent
Attribute | Detail |
Domain | Finance / Analytics |
Architecture | Single model with RAG |
Steps before human | 3-5 typical |
Evaluation | SME review + comparison to manual analysis |
Key learnings:
Numeric accuracy is non-negotiable
Citation/sourcing critical for trust
Users prefer conservative agents that say "I don't know"
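A conservative answer policy of this kind could look roughly like the following sketch, where the citation requirement and confidence threshold are illustrative assumptions.

```python
# Sketch of a conservative answer policy: require citations or abstain.
from dataclasses import dataclass, field


@dataclass
class Draft:
    answer: str
    citations: list = field(default_factory=list)  # document/table identifiers
    confidence: float = 0.0


def finalize(draft: Draft) -> str:
    if draft.confidence < 0.7 or not draft.citations:
        return "I don't know. Please review the underlying data manually."
    return f"{draft.answer} (sources: {', '.join(draft.citations)})"


if __name__ == "__main__":
    print(finalize(Draft("Q3 revenue grew 4.2%", ["finance_db.q3_summary"], 0.86)))
    print(finalize(Draft("Churn doubled", [], 0.55)))
```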
Case Study Pattern D: HR Operations Assistant
Attribute | Detail |
Domain | Human Resources |
Architecture | Single model, heavy prompt engineering |
Steps before human | 2-4 typical (very short loops) |
Evaluation | Legal/HR review + employee feedback |
Key learnings:
Compliance requirements drive ultra-short loops
Audit trails mandatory for all recommendations
Agents handle gathering/summarization; humans make decisions
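An audit trail for such recommendations might be as simple as the sketch below; the log schema and field names are assumptions for illustration.

```python
# Sketch of an audit trail: every agent recommendation is logged, and the
# final decision stays with a human reviewer.
import json
from datetime import datetime, timezone


def record_recommendation(case_id: str, summary: str, recommendation: str,
                          reviewer: str, audit_log_path: str = "hr_audit.log") -> None:
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "case_id": case_id,
        "agent_summary": summary,            # the agent gathers and summarizes
        "agent_recommendation": recommendation,
        "decision_owner": reviewer,          # a human makes the final call
        "decision": "pending_human_review",
    }
    with open(audit_log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")


if __name__ == "__main__":
    record_recommendation("CASE-201", "Leave request exceeds policy by 2 days",
                          "Escalate to HR business partner",
                          reviewer="hr_lead@example.com")
```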
Case Study Pattern E: Software Development Agent
Attribute | Detail |
Domain | Engineering |
Architecture | Multi-model (code gen + review) |
Steps before human | 10-15 typical |
Evaluation | Code review + test pass rates + developer satisfaction |
Key learnings:
Highest step counts in the study — developers tolerate more autonomy
But: all code still goes through human review before merge
Value measured in time saved, not code quality (humans ensure quality)
Case Study Pattern F: Document Processing Agent
Attribute | Detail |
Domain | Legal / Compliance |
Architecture | Single model with structured output |
Steps before human | 4-6 typical |
Evaluation | Attorney review + accuracy sampling |
Key learnings:
Extraction accuracy must exceed 95% for adoption
Confidence scores help prioritize human review
Agents accelerate review but don't replace it
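Confidence-based triage might look roughly like the sketch below; the extraction records and review threshold are illustrative.

```python
# Sketch of confidence-based triage: low-confidence extractions are queued
# for attorney review first.
extractions = [  # field name, extracted value, model confidence
    {"field": "effective_date", "value": "2024-03-01", "confidence": 0.99},
    {"field": "termination_clause", "value": "90 days notice", "confidence": 0.62},
    {"field": "governing_law", "value": "Delaware", "confidence": 0.91},
]

REVIEW_THRESHOLD = 0.95  # tied to the ~95% accuracy bar mentioned above

needs_review = sorted(
    (e for e in extractions if e["confidence"] < REVIEW_THRESHOLD),
    key=lambda e: e["confidence"],  # least confident first
)
for e in needs_review:
    print(f"review: {e['field']} = {e['value']!r} ({e['confidence']:.0%})")
```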
Cross-Case Patterns
Pattern | Frequency | Description |
Short loops in high-stakes domains | High | HR, finance, healthcare use 2-5 step agents |
Shadow mode before production | High | Weeks/months of side-by-side comparison |
Explanation requirements | High | Users want to know "why," not just "what" |
Escalation as feature | Universal | Every production agent has escalation paths |
Human-final-decision | Near-universal | Agents recommend; humans decide |
Data & Demographics
Survey Population
The study captured responses from practitioners across the agent development lifecycle:
Role Distribution:
Engineers / Developers
Product Managers
Data Scientists / ML Engineers
Technical Leads / Architects
Operations / SRE
Research Scientists
(Specific percentages anonymized in source paper)
Deployment Stage Distribution
Stage | Description |
Production | Live deployment with real users |
Pilot | Controlled rollout for evaluation |
Development | Active building (excluded from main analysis) |
Research | Experimental (excluded from main analysis) |
The analysis focuses on deployed agents (production + pilot) to ensure findings reflect operational reality.
Domain Coverage
26 domains represented, including:
Sector | Example Domains |
Enterprise Operations | HR, Finance, Legal, Procurement |
Technical | DevOps, SRE, Software Development |
Customer-Facing | Support, Sales, Marketing |
Specialized | Healthcare, Manufacturing, Logistics |
This breadth suggests findings generalize across industries, not just tech-forward sectors.
Agent Characteristics
Typical production agent profile:
Characteristic | Typical Value |
Steps before human | ≤10 (68%) |
Model approach | Prompting (70%) |
Primary evaluation | Human (74%) |
Model count | 1-2 (most common) |
Deployment age | Months to 1 year |
Benefits Realized
When asked about benefits from deployed agents, respondents selected:
Benefit | Selection Rate |
Increasing Productivity | 73% |
Other benefits | Varies |
Operational Stability | Lowest |
"Increasing productivity" — completing tasks faster than previous approaches — is the dominant realized benefit. "Operational stability" (mitigating risk, accelerating failure recovery) is least often selected.
Interpretation: Current production agents excel at speed improvements but haven't yet demonstrated reliability advantages over traditional systems.
Strategic Implications
For Engineering Leaders
1. Reset complexity expectations
The 68% / 10-step finding should calibrate your planning. If your team is designing a 50-step autonomous agent for v1, reconsider. Successful production deployments:
Start with bounded scope (5-10 steps)
Add human checkpoints deliberately
Expand autonomy incrementally based on observed reliability
2. Defer fine-tuning decisions
With 70% of production agents using prompting alone, fine-tuning should be a later optimization, not a launch requirement. Fine-tune when you have:
Sufficient production data to train on
Clear evidence prompting is the bottleneck
Infrastructure to maintain custom models
3. Budget for human evaluation
The 74% human evaluation finding isn't a temporary state — it reflects fundamental task complexity. Plan for:
Ongoing human review capacity
Tooling to make human review efficient
Feedback loops from reviewers to improve agents
For Product Leaders
4. Position agents as productivity tools
The 73% "increasing productivity" finding provides your value proposition. Frame agents as:
"Complete this task in 10 minutes instead of 2 hours"
NOT "Fully autonomous system that replaces humans"
Users and buyers respond to concrete time savings more than abstract autonomy promises.
5. Design for human-in-the-loop from day one
Human oversight isn't a compromise — it's a feature. Products that embrace human-agent collaboration will outperform those that promise (and fail to deliver) full autonomy.
6. Establish baselines before launch
Join the 38.7% who compare against baselines. Before deploying:
Measure current task completion time
Document current error rates
Establish user satisfaction baseline
For AI/ML Teams
7. Invest in evaluation infrastructure
The convergent evaluation pipeline (golden sets → feedback → SME → LLM-judge) represents emerging best practice. Build or adopt:
Golden set management tooling
Feedback collection mechanisms
SME review workflows
LLM-as-judge pipelines
8. Develop agent-native reliability metrics
Traditional software metrics don't apply. Define metrics that capture:
Output correctness rate
Appropriate escalation rate
User goal completion rate
Time to human intervention
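As a starting point, these metrics can be computed directly from interaction logs, as in the sketch below; the log schema is an illustrative assumption.

```python
# Sketch of agent-native reliability metrics computed from interaction logs.
interactions = [
    {"correct": True,  "escalated": False, "goal_met": True,  "steps_to_human": 6},
    {"correct": False, "escalated": True,  "goal_met": False, "steps_to_human": 3},
    {"correct": True,  "escalated": False, "goal_met": True,  "steps_to_human": 8},
]

n = len(interactions)
metrics = {
    "output_correctness_rate": sum(i["correct"] for i in interactions) / n,
    "escalation_rate": sum(i["escalated"] for i in interactions) / n,
    "goal_completion_rate": sum(i["goal_met"] for i in interactions) / n,
    "avg_steps_to_human": sum(i["steps_to_human"] for i in interactions) / n,
}
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```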
9. Plan for model migration complexity
Multi-model architectures often reflect migration challenges, not task requirements. Prepare for:
Behavioral differences across model versions
Evaluation suite dependencies on specific models
Gradual rollout requirements
For Executives
10. Reliability is the bottleneck
The 37.9% "Core Technical Focus" finding identifies where investment is needed. Governance and compliance (3.4% + 17%) are important but secondary — you can't govern unreliable systems.
11. Expect iteration, not instant success
Production agents require ongoing refinement. Budget for:
Multiple deployment iterations
Continuous evaluation
Incremental capability expansion
12. Measure what matters
Productivity gains (73% cite this) provide the clearest ROI signal. Track:
Time saved per task
Tasks completed per period
User satisfaction with agent assistance
Related Work
This study fills a critical gap in the AI agent literature. Prior work falls into several categories:
Industry Reports
Organizations like PwC, Capgemini, McKinsey, and Microsoft have published agent-related surveys focusing on:
Organizational readiness
Market trends
Executive perspectives
Technology adoption patterns
These provide valuable context but lack engineering-level technical detail.
Practitioner Surveys
LangChain's "State of AI Agents 2024" surveyed 1,300+ professionals on agent motivations and challenges. This study differs in:
Scope: Focuses specifically on production/pilot systems
Depth: Includes 20 in-depth case studies
Technical detail: Captures architecture, evaluation, and operational data
Academic Agent Literature
Extensive research examines LLM-powered agents from theoretical and benchmark perspectives. However, academic work typically:
Evaluates on research benchmarks, not production metrics
Studies prototype systems, not deployed agents
Focuses on capability, not reliability
How This Study Differs
Dimension | Prior Work | This Study |
Systems studied | Prototypes, benchmarks | Production deployments |
Data source | Benchmarks, papers | Practitioner surveys, interviews |
Focus | Capabilities | Reliability, operations |
Perspective | What's possible | What works |
Conclusion
"Measuring Agents in Production" provides the most comprehensive empirical view of deployed AI agents to date. The findings challenge assumptions prevalent in research and industry discourse:
Agents are simpler than expected — 10 steps, not 100
Prompting beats fine-tuning — for 70% of production use cases
Humans remain essential — 74% rely on human evaluation
Reliability is the bottleneck — not governance, not compliance
For practitioners, the message is clear: start simple, embrace human oversight, invest in evaluation, and measure productivity gains. The path to production runs through pragmatic engineering, not ambitious autonomy.
References
Primary Source:
Pan, M.Z. et al. (2025). "Measuring Agents in Production." arXiv:2512.04123. https://arxiv.org/abs/2512.04123
Related Papers from Repository:
"Intuition to Evidence: Measuring AI's True Impact on Developer Productivity" (arXiv:2509.19708)
"Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity" (arXiv:2507.09089)
"Enterprise Large Language Model Evaluation Benchmark" (arXiv:2506.20274)
Arindam Banerji, PhD