
RAG-MCP: Taming Tool Bloat in the MCP Era

Design and evaluation of a retrieval-driven MCP selector for large tool registries


Paper: RAG-MCP: Mitigating Prompt Bloat in LLM Tool Selection via Retrieval-Augmented Generation


Executive Summary

RAG-MCP addresses a critical scalability challenge facing modern LLM systems: the "prompt bloat" problem that emerges when large language models must select from hundreds or thousands of external tools. The paper introduces a Retrieval-Augmented Generation framework that dynamically retrieves only the most relevant tools from an external index, achieving remarkable results: ≈50% reduction in prompt tokens (49.2% precisely) and 3.2× improvement in tool selection accuracy (43.13% vs 13.62% baseline). This work is particularly timely given the rapid proliferation of Model Context Protocol (MCP) servers—which grew from zero to over 4,400 implementations in just five months following Anthropic's November 2024 release (and has since grown to 17,000+ servers as of late 2025).


The fundamental insight is elegant: instead of overwhelming the LLM with all available tool descriptions, treat tool discovery as a retrieval problem analogous to how RAG systems retrieve relevant passages from large corpora. This architectural shift transforms an intractable context management problem into a scalable, production-ready solution.


Critical Finding: Both baselines and RAG-MCP show sharp performance degradation once the candidate MCP pool exceeds roughly 100 tools, empirically establishing retrieval-based selection as mandatory rather than optional beyond that scale.


1. Problem Context & Motivation


1.1 The Tool Proliferation Challenge

Modern LLMs are increasingly augmented with external tools to overcome their fundamental limitations—static knowledge cutoffs and inability to perform real-world actions. While function calling and tool use have become standard capabilities (exemplified by GPT-4's function calling API, Claude's computer use, and the broader ecosystem), a new bottleneck has emerged: scalability of tool selection.


The problem manifests in two critical ways:

Prompt Bloat: Including descriptions for dozens or hundreds of tools exhausts the context window. Even with 100K+ token contexts (Claude 2, GPT-4 Turbo), listing detailed schemas, parameters, and usage examples for 100+ tools can consume 50-80% of available context, leaving insufficient room for actual task reasoning.


Selection Complexity: As the tool count grows, the LLM faces increasing cognitive load in distinguishing between similar tools with nuanced differences. Research cited in the paper shows even frontier models like GPT-4 and Claude make critical errors—hallucinating non-existent APIs or selecting inappropriate tools—when presented with large toolsets.


1.2 The Model Context Protocol (MCP) Catalyst

Anthropic's introduction of MCP in November 2024 transformed this from a theoretical concern to an urgent practical problem. MCP standardizes how AI systems connect to external data sources and tools through a universal protocol—essentially "USB-C for AI applications." The protocol's rapid adoption created an explosion of available tools:

  • 4,400+ MCP servers listed on mcp.so as of April 2025 (paper-time), with many more (17,000+ servers) as of late 2025

  • Major adoption by OpenAI (March 2025), Google DeepMind, and development tools (Replit, Sourcegraph, Zed)

  • Pre-built connectors for Google Drive, Slack, GitHub, Postgres, Puppeteer, Stripe


This ecosystem growth, while beneficial for capability expansion, dramatically exacerbates the prompt bloat problem. The authors frame it as an "N×M integration problem": each of N AI applications potentially needs to handle every one of M tools, and M is now in the thousands.


1.3 Why Existing Approaches Fall Short

Prior work on LLM tool use (Toolformer, ReAct, Gorilla, WebGPT) focused on how models learn to use tools, not how to scale tool discovery. These approaches assume:

  • Small, curated toolsets (typically 5-20 tools)

  • Hand-picked tools for specific domains

  • Static tool registries

None address the dynamic, large-scale tool selection problem that MCP's success created.


2. Approach & Technical Design


2.1 Core Architecture: Three-Step Pipeline

RAG-MCP introduces a retrieval-first architecture that decouples tool discovery from tool execution:

Step 1: Semantic Retrieval

  • User query encoded using lightweight LLM-based retriever (implementation uses Qwen-max-0125)

  • Semantic search over external vector index containing all MCP metadata

  • Returns top-k most relevant MCP candidates based on cosine similarity

  • Key innovation: Tool descriptions represented in same semantic space as user queries

Step 2: Validation (Optional)

  • Generated few-shot example queries for each retrieved MCP

  • Basic compatibility testing ("sanity check") before invocation

  • Ensures functional correctness without full activation

  • Filters out false positive retrievals


Step 3: Selective Invocation

  • Only the single best MCP description injected into main LLM prompt

  • Includes full tool-use parameters and schemas

  • LLM performs planning and execution without concern for tool discovery

  • Dramatically reduced context consumption
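
Since no official implementation of RAG-MCP has been released, the following is a minimal Python sketch of the three-step flow just described. All names here (embed, search, validate, answer_with_tool) are illustrative placeholders supplied by the caller, not the authors' code.

    # Minimal sketch of the RAG-MCP pipeline: retrieve -> (optionally) validate -> invoke.
    # The embedder, index lookup, validator, and LLM client are passed in as callables
    # because the paper does not publish an implementation; every name is illustrative.
    from typing import Callable, Sequence

    def rag_mcp_answer(
        user_query: str,
        embed: Callable[[str], Sequence[float]],            # lightweight query encoder
        search: Callable[[Sequence[float], int], list],     # vector-index top-k lookup
        validate: Callable[[dict, str], bool],              # optional sanity check
        answer_with_tool: Callable[[str, dict], str],       # main LLM given one tool schema
        k: int = 3,
    ) -> str:
        # Step 1: semantic retrieval over the external MCP index.
        candidates = search(embed(user_query), k)

        # Step 2 (optional): keep only candidates that pass a basic compatibility test.
        viable = [mcp for mcp in candidates if validate(mcp, user_query)] or candidates

        # Step 3: inject only the single best MCP description into the task prompt.
        best_mcp = viable[0]
        return answer_with_tool(user_query, best_mcp)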


2.2 Technical Implementation Details


Vector Indexing:

  • Each MCP's metadata (name, description, parameters, usage examples) embedded into dense vectors

  • External index maintained separately from LLM inference

  • New MCPs added by simply indexing their metadata—no LLM retraining required
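
The mechanics of the index can be illustrated in a few lines of Python. The paper's retriever is Qwen-based; the open-source embedding model and the tiny in-memory index below are stand-ins chosen only to show the pattern (embed metadata once, then rank by cosine similarity).

    # Illustrative-only sketch: embed MCP metadata once, then rank by cosine similarity.
    # The embedding model is a stand-in; the paper uses a Qwen-based retriever.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    encoder = SentenceTransformer("all-MiniLM-L6-v2")

    mcps = [  # toy registry entries; real metadata also includes parameters and examples
        {"name": "web_search", "description": "Search the web and return result snippets."},
        {"name": "github_repo", "description": "Query GitHub repositories, issues, and pull requests."},
    ]

    texts = [f"{m['name']}: {m['description']}" for m in mcps]
    index = encoder.encode(texts, normalize_embeddings=True)    # shape: (num_tools, dim)

    def retrieve(query: str, k: int = 1):
        q = encoder.encode([query], normalize_embeddings=True)[0]
        scores = index @ q                                      # cosine similarity (unit vectors)
        top = np.argsort(-scores)[:k]
        return [(mcps[i], float(scores[i])) for i in top]

    print(retrieve("find open issues in my repository", k=1))

Adding a new MCP is then just another encoder.encode call appended to the index, with no change to the LLM itself.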


Retriever Design:

  • Lightweight compared to main task LLM

  • Optimized for speed (semantic search vs. full LLM reasoning)

  • Can be updated/fine-tuned independently of main model


Resource Efficiency Benefits:

  • Unlike conventional MCP clients that instantiate all registered servers before interaction

  • RAG-MCP activates only selected MCP server on-demand

  • Eliminates infrastructure bottleneck of simultaneous server instantiation

  • Critical for production deployment with thousands of potential tools


2.3 Design Philosophy: RAG Principles for Tool Selection

The approach draws direct inspiration from Retrieval-Augmented Generation (Lewis et al., NeurIPS 2020):

Traditional RAG                                     | RAG-MCP
Retrieve relevant passages from a knowledge corpus  | Retrieve relevant tools from the MCP registry
Avoid feeding all of Wikipedia to the model         | Avoid feeding all tool descriptions to the model
Dynamic, query-specific context                     | Dynamic, task-specific tool selection
External index of documents                         | External index of tool metadata

This parallelism is more than metaphorical—it represents a fundamental architectural pattern for managing large-scale external resources in LLM systems.


2.4 Why RAG-MCP Works: Three Mechanisms


1. Prompt Token Efficiency:

  • Selective tool injection reduces context from thousands to hundreds of tokens

  • Preserves room for actual task reasoning and multi-turn interaction

  • Enables cost-effective scaling (fewer prompt tokens = lower API costs)

2. Cognitive Load Reduction:

  • LLM reasoning focused on 1-2 relevant tools instead of hundreds

  • Eliminates decision paralysis from overwhelming choices

  • Reduces hallucination risk from tool confusion

3. Resource Efficiency / Infrastructure Impact:

  • Unlike naive MCP clients that instantiate all servers up front, RAG-MCP only activates the selected MCP per query

  • Reduces startup overhead and makes it feasible to support very large tool registries

  • Critical for production deployments where simultaneous instantiation of thousands of servers would be infeasible


3. Experimental Design & Evaluation


3.1 The MCP Stress Test

Inspired by the Needle-in-a-Haystack (NIAH) test for long-context evaluation, the authors designed a novel stress test to quantify tool selection degradation:

Setup:

  • Vary N (number of candidate MCPs) from 1 to 11,100 in 26 intervals

  • Each trial: 1 ground-truth MCP + (N-1) distractor MCPs randomly sampled from 4,400+ registry

  • Task: WebSearch operations drawn from a fixed set of MCPBench WebSearch queries

  • Measure: Selection accuracy, task success rate, prompt tokens, latency
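
A single stress-test trial can be sketched as follows. The registry, query set, and selection routine are caller-supplied placeholders (the paper's harness is not public), so this only illustrates the "one needle, N-1 distractors" construction.

    # Sketch of one stress-test trial: 1 ground-truth MCP mixed with (N-1) distractors.
    # `registry`, `queries`, and `run_selection` are placeholders supplied by the caller.
    import random

    def run_trial(n_candidates, registry, queries, run_selection):
        query, ground_truth = random.choice(queries)          # fixed WebSearch task set
        distractors = random.sample(
            [m for m in registry if m is not ground_truth], n_candidates - 1)
        pool = distractors + [ground_truth]
        random.shuffle(pool)                                   # ground-truth position is random

        selected, prompt_tokens = run_selection(query, pool)   # method under test
        return {"n": n_candidates,
                "selection_correct": selected is ground_truth,
                "prompt_tokens": prompt_tokens}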


Key Finding: Sharp non-monotonic performance degradation:

  • High success (>90%) when N < 30

  • Mid-range variability (31-70 MCPs): intermittent failures from semantic overlap

  • Critical threshold: Performance collapse beyond ~100 tools, with a few "islands" of success

  • Severe degradation continues as N increases to thousands


Visualization: Heat map showing per-trial success (yellow) vs. failure (purple) across MCP positions reveals clear threshold effects—model performance doesn't degrade linearly but rather exhibits phase transitions at critical scales.


3.2 Benchmark Evaluation: MCPBench

Dataset: WebSearch subset of MCPBench (evaluation report by Luo et al., 2025)

  • 20 independent trials per baseline

  • Success criterion: >10 correct answers out of 20

  • Up to 10 rounds of interaction with MCP servers per trial


Baselines Compared:

  1. Blank Conditioning (Naive)

    • Prompt LLM with all MCP descriptions at once

    • No pre-filtering

    • Represents current common practice

  2. Actual Match (Keyword)

    • Simple keyword matching on task description and MCP metadata

    • Pre-filter candidates before prompting

    • Represents basic optimization

  3. RAG-MCP (Proposed)

    • Semantic retrieval from vector index

    • Top-k candidate selection

    • Inject only best candidate
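
For contrast with semantic retrieval, a rough sketch of a keyword pre-filter in the spirit of Actual Match is shown below. The paper does not spell out its exact matching rule, so the token-overlap score here is an assumption used only to make the comparison concrete.

    # Rough sketch of a keyword pre-filter ("Actual Match"-style baseline).
    # Token overlap is an assumed proxy; the paper does not specify the exact rule.
    def keyword_score(query: str, mcp: dict) -> int:
        query_tokens = set(query.lower().split())
        meta_tokens = set(f"{mcp['name']} {mcp['description']}".lower().split())
        return len(query_tokens & meta_tokens)

    def actual_match(query: str, registry: list, k: int = 1) -> list:
        return sorted(registry, key=lambda m: keyword_score(query, m), reverse=True)[:k]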


Automated Evaluation:

  • The paper describes using DeepSeek-v3 as evaluator, but also later refers to a Llama-based "Llama as Judge" verifier. We treat this as a minor editorial inconsistency; the core setup assumes an automated LLM judge rather than human labeling.

  • Controlled network environment (eliminate connectivity failures)

  • Qwen-max-0125 as base LLM for all methods

  • Metrics: Accuracy, Avg Prompt Tokens, Avg Completion Tokens


Evaluation Context: Note that RAG-MCP is evaluated only on the WebSearch subset of MCPBench; Luo et al. (2025) show that even the best MCPs (e.g., Bing Web Search) reach ~64% accuracy and that MCP itself does not automatically outperform traditional function-call APIs. RAG-MCP's gains are therefore about tool selection and prompt management, not about fundamentally stronger MCP servers.


4. Results & Key Findings

4.1 Quantitative Performance

Baseline            | Accuracy (%) | Avg Prompt Tokens | Avg Completion Tokens
RAG-MCP             | 43.13        | 1,084             | 78.14
Actual Match        | 18.20        | 1,646             | 23.60
Blank Conditioning  | 13.62        | 2,133.84          | 162.25


Key Insights:

  1. 3.2× Accuracy Improvement: RAG-MCP (43.13%) vs. Blank Conditioning baseline (13.62%)

  2. ≈50% Token Reduction: 1,084 tokens vs. 2,133.84 (Blank Conditioning) = 49.2% reduction

  3. Superior to Keyword Matching: 2.4× better than Actual Match (18.20%)

The combination of higher accuracy and lower token usage is particularly notable—efficiency typically trades off against performance, but RAG-MCP achieves both.
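
As a quick sanity check, the headline ratios follow directly from the table values:

    # Recomputing the reported gains from the table above.
    blank_tokens, rag_tokens = 2133.84, 1084
    print(f"prompt-token reduction: {1 - rag_tokens / blank_tokens:.1%}")  # 49.2%
    print(f"accuracy vs. Blank Conditioning: {43.13 / 13.62:.1f}x")        # 3.2x
    print(f"accuracy vs. Actual Match: {43.13 / 18.20:.1f}x")              # 2.4x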


4.2 Scaling Analysis


Stress Test Results:

  • Near-ceiling performance (>90% accuracy) up to ~30 MCPs

  • Degradation zone: 31-100 MCPs (intermittent failures)

  • Collapse zone: >100 MCPs (retrieval precision issues dominate)


Implications:

  • Current approaches (Blank Conditioning, keyword matching) completely fail beyond 100 tools

  • RAG-MCP maintains viability through semantic understanding but still degrades

  • Future work needed on hierarchical retrieval for extreme scales (>1000 tools)


4.3 Ablation & Error Analysis

What Makes RAG-MCP Work?

  1. Semantic vs. Keyword Retrieval: 

    • RAG-MCP's 2.4× advantage over Actual Match shows semantic understanding crucial

    • Keyword matching misses synonyms, contextual relevance, functional equivalence

  2. Top-k Selection Strategy: 

    • Paper doesn't extensively ablate k values

    • Defaults to k=1 (single best tool)

    • Open question: Would k=3-5 improve robustness?

  3. Retriever Quality Dependency: 

    • System only as good as its retrieval step

    • False negatives in top-k = guaranteed task failure

    • No recovery mechanism if correct tool missed


Common Failure Modes:

  1. Semantic Ambiguity: 

    • Multiple MCPs with overlapping functionality

    • Query doesn't disambiguate intent

    • Example: "search the web" could match general search, academic search, product search

  2. Out-of-Distribution Queries: 

    • Novel task compositions not seen in training

    • Tool descriptions optimized for common use cases

    • Edge cases may not surface in retrieval

  3. Metadata Quality: 

    • Poor MCP descriptions lead to bad embeddings

    • Inconsistent naming conventions across tools

    • Missing usage examples reduce match quality


5. Architectural Deep Dive


5.1 System Components

External Vector Index:

  • Stores embeddings for all MCP metadata

  • Updated asynchronously as new tools register

  • Supports rapid semantic search (<100ms latency)

  • Can leverage existing vector databases (Pinecone, Weaviate, Chroma)


Retrieval Pipeline:

  1. Query encoding (user intent → dense vector)

  2. Similarity search (cosine distance in embedding space)

  3. Top-k selection (configurable k, default k=1)

  4. Metadata extraction (pull full tool schemas for selected MCPs)


Execution Layer:

  • Receives compact prompt (user query + selected tool description)

  • Performs standard tool calling / function invocation

  • No awareness of broader MCP ecosystem

  • Operates identically to single-tool scenarios
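
Concretely, the execution layer sees a request like the one sketched below, containing the user query plus the one selected MCP's schema and nothing else. The field names mirror common function-calling APIs but are illustrative, not a specific vendor's schema.

    # Sketch of the compact request handed to the execution LLM: only the single
    # selected MCP's schema is included, regardless of how large the registry is.
    # Field names are illustrative, not a specific vendor's function-calling format.
    def build_execution_request(user_query: str, selected_mcp: dict) -> dict:
        return {
            "messages": [
                {"role": "system",
                 "content": "You may call the single tool described below when it helps."},
                {"role": "user", "content": user_query},
            ],
            "tools": [{
                "name": selected_mcp["name"],
                "description": selected_mcp["description"],
                "parameters": selected_mcp["schema"],   # full JSON schema for this MCP only
            }],
        }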


5.2 Production Considerations

Latency Profile:

  • Retrieval overhead: ~50-100ms (vector search)

  • Total added latency: <200ms in most cases

  • Negligible compared to LLM inference time (seconds)

  • Acceptable for real-time applications

Cost Structure:

  • Reduced prompt tokens directly lower API costs

  • 50% token reduction ≈ 50% cost savings on input

  • Retrieval infrastructure cost minimal (vector DB hosting)

  • Net positive economics at scale

Reliability & Robustness:

  • Single point of failure: Vector index availability

  • Mitigation: Replicated indices, caching strategies

  • Fallback option: Degrade to keyword matching if retrieval fails

  • Monitoring critical: Track retrieval precision, tool selection accuracy
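
The fallback idea above can be wrapped around any retriever. In the sketch below, semantic_retrieve and keyword_retrieve stand for the two retrieval styles sketched earlier in this writeup; the logging setup is illustrative.

    # Sketch of the degradation path: fall back to keyword matching when the
    # vector index is unavailable. The two retrievers are caller-supplied.
    import logging

    log = logging.getLogger("rag_mcp")

    def retrieve_with_fallback(query, registry, semantic_retrieve, keyword_retrieve, k=1):
        try:
            return semantic_retrieve(query, k=k)           # primary path: vector search
        except Exception as err:                           # e.g. index unreachable
            log.warning("vector retrieval failed, falling back to keywords: %s", err)
            return keyword_retrieve(query, registry, k=k)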

Extensibility:

  • Adding new tools: Simply embed and index metadata (no retraining)

  • Updating tools: Re-embed description, update index

  • Tool deprecation: Remove from index

  • Zero downtime for model itself


5.3 Integration Patterns

Scenario 1: IDE Code Assistants

  • Register 100+ MCPs (GitHub, StackOverflow, documentation, package managers)

  • User query: "How to implement OAuth in FastAPI?"

  • RAG-MCP retrieves: FastAPI docs MCP, OAuth tutorial MCP

  • LLM generates code with accurate, contextual tool use

Scenario 2: Enterprise Support Chatbot

  • Register 500+ MCPs (Salesforce, Jira, Confluence, email, calendar, knowledge base)

  • User query: "Find all high-priority bugs assigned to my team"

  • RAG-MCP retrieves: Jira query MCP

  • LLM executes precise API call without seeing other 499 tools

Scenario 3: Personal AI Assistant

  • Register 50+ MCPs (Google Drive, Gmail, Calendar, Notion, Spotify, weather, news)

  • User query: "Summarize my emails from this morning and create a to-do list"

  • RAG-MCP retrieves: Gmail MCP, task management MCP

  • LLM orchestrates multi-step workflow


6. Comparison with Alternative Approaches


6.1 Naive Approaches (Status Quo)

Full Context Injection:

  • Load all tool descriptions into prompt

  • Works for <20 tools, fails for >50

  • Exhausts context window

  • High cognitive load on LLM

Manual Curation:

  • Humans pre-select relevant tool subset per task

  • Doesn't scale (requires domain expertise per query)

  • Labor-intensive, slow iteration


6.2 Structured Approaches

Hierarchical Tool Organization:

  • Organize tools into categories (e.g., "data sources," "computation," "communication")

  • Two-stage selection: category → specific tool

  • Pros: Reduces search space

  • Cons: Requires manual taxonomy, rigid structure, doesn't adapt to cross-category needs
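
A two-stage selector of this kind might look like the sketch below; the taxonomy layout and the rank function (same call shape as the keyword matcher sketched earlier) are assumptions, since the paper does not implement this alternative.

    # Sketch of the two-stage (category -> tool) alternative: pick a category first,
    # then rank tools only within it. Taxonomy layout and `rank` are assumed.
    def hierarchical_select(query, taxonomy, rank, k=1):
        # Stage 1: choose the most relevant category by its description.
        categories = [{"name": name, "description": meta["description"]}
                      for name, meta in taxonomy.items()]
        best_category = rank(query, categories, k=1)[0]["name"]
        # Stage 2: rank tools only within the chosen category.
        return rank(query, taxonomy[best_category]["tools"], k=k)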

Rule-Based Routing:

  • If-then rules map query patterns to tools

  • Example: "Send email" → Email MCP

  • Pros: Deterministic, fast

  • Cons: Brittle, doesn't generalize, maintenance nightmare at scale


6.3 Learning-Based Alternatives

Fine-Tuned Tool Selectors:

  • Train small model specifically for tool selection

  • Pros: Potentially very accurate if trained on domain data

  • Cons: Requires labeled data, needs retraining for new tools, less flexible

Reinforcement Learning:

  • Learn tool selection policy via trial-and-error

  • Pros: Can discover non-obvious tool combinations

  • Cons: Sample inefficient, slow to adapt, complex to implement

RAG-MCP's Advantages Over Alternatives:

  • Zero-shot generalization to new tools (just add to index)

  • Leverages pre-trained LLM semantics (no task-specific training)

  • Prompt cost stays roughly constant as the registry grows (only the selected tool is injected), while retrieval itself scales sublinearly with tool count

  • Practical infrastructure (vector DB is commodity technology)


7. Limitations & Future Work

7.1 Acknowledged Limitations

Single-Task Evaluation:

  • Only tested on WebSearch subset of MCPBench

  • Doesn't cover multi-tool workflows

  • Limited task diversity (no code generation, data analysis, content creation)

  • Unknown performance on tool chains or compositions

Single Base LLM:

  • Experiments use Qwen-max-0125 exclusively

  • Generalization to GPT-4, Claude, Llama, Gemini unclear

  • Model-specific biases may affect results

No Human Evaluation:

  • Automated judge may miss nuanced failures

  • User satisfaction, task completion quality not assessed

  • Real-world usability unknown


Implementation Availability: As of this writing, no official RAG-MCP implementation has been open-sourced; reproducibility depends on re-implementing their retrieval + MCPBench setup from the paper.


7.2 Technical Gaps

Retrieval Precision Ceiling:

  • Quality of semantic retrieval crucial

  • What if retriever fails to surface correct tool in top-k?

  • Error propagation from retrieval to execution stage

Validation Step Unclear:

  • "Optional" validation mentioned but not deeply evaluated

  • Trade-offs not fully characterized

  • When is validation necessary vs. overhead?


7.3 Open Research Questions

  1. Optimal Retriever Design: 

    • Lightweight LLM (Qwen) vs. specialized embedding models?

    • Fine-tuning retriever on tool selection data?

    • Multi-vector representations for tools?

  2. Dynamic k Selection: 

    • How many tools to retrieve varies by query complexity

    • Can this be learned/predicted?

    • Trade-off between coverage and precision

  3. Multi-Tool Scenarios: 

    • Real workflows often require tool chaining

    • How to retrieve sequences of tools?

    • Tool dependency graphs?

  4. Human-in-the-Loop:

    • User feedback on tool selection

    • Active learning for retrieval improvement

    • Personalized tool preferences

  5. Cross-Model Portability: 

    • Does same index work for GPT-4, Claude, Llama?

    • Model-specific vs. universal retrievers?


8. Industry & Research Impact

8.1 Immediate Industry Applications

Enterprise AI Platforms:

  • Salesforce, Microsoft Copilot, Google Workspace AI

  • Can now support comprehensive tool ecosystems

  • Enables "AI assistant with access to everything"

Developer Tools:

  • IDEs (Cursor, Windsurf, Replit)

  • Can offer hundreds of code-related MCP tools

  • Reduces prompt engineering burden

Vertical AI Solutions:

  • Healthcare: Medical databases, EHR systems, diagnostic tools

  • Finance: Market data, trading APIs, compliance tools

  • Legal: Case databases, document management, research tools


8.2 Research Trajectory

Short-Term (6-12 months):

  • Replication studies with different LLMs

  • Extensions to multi-tool workflows

  • Hierarchical retrieval variants

  • Real-world deployment studies

Medium-Term (1-2 years):

  • Learned retrievers fine-tuned on tool selection

  • Tool usage patterns and popularity metrics

  • Personalization and context-aware retrieval

  • Integration with agent frameworks (AutoGPT, LangGraph)

Long-Term (2-5 years):

  • Tool composition and workflow synthesis

  • Automated tool discovery and registration

  • Cross-modal tool use (vision, audio, robotics)

  • Standardization beyond MCP (Tool Discovery Protocol?)


8.3 Broader AI Implications

Paradigm Shift: From "what can this model do?" to "what can this model access?"

  • Success increasingly depends on ecosystem connectivity

  • Model capabilities plateau; tool access becomes differentiator

Democratization: Smaller models with good retrieval can rival larger models

  • 7B model + comprehensive tools > 70B model + limited tools

  • Reduces compute requirements for capable agents

Safety Considerations:

  • Tool access increases potential for harm

  • RAG-MCP's selective activation may improve safety (smaller attack surface)

  • But retrieval errors could lead to unintended tool use


9. Conclusion & Recommendations

9.1 Key Takeaways

  1. Problem Validation: Prompt bloat is real, measurable, and worsening with ecosystem growth

  2. Solution Viability: RAG-MCP demonstrates retrieval-based tool selection works at scale

  3. Production Readiness: Architecture addresses practical deployment concerns (cost, latency, extensibility)

  4. Ecosystem Enabler: Removes ceiling on MCP ecosystem growth, validating universal protocol vision

  5. Critical Scaling Threshold: Empirically, both baselines and RAG-MCP show sharp degradation once the candidate MCP pool exceeds roughly 100 tools; this supports treating retrieval-based selection as mandatory rather than optional beyond that scale


9.2 For Practitioners

Suggested Actions:

  • Implement RAG-MCP pattern for systems with >20 tools

  • Build/maintain tool metadata indices

  • Monitor retrieval quality metrics

  • Plan for scaling beyond current tool counts


Long-Term Steps:

  • Invest in semantic tool representation

  • Develop tool discovery infrastructure

  • Consider hierarchical organization for large tool libraries

  • Prepare for multi-tool workflow orchestration


9.3 For Researchers

High-Priority Areas:

  1. Retriever optimization for tool selection

  2. Multi-tool workflow retrieval

  3. Evaluation benchmarks for large-scale tool use

  4. Safety and robustness in retrieved tool execution


Open Datasets Needed:

  • Large-scale tool selection benchmarks

  • Real-world tool usage patterns

  • Multi-tool workflow traces


9.4 Final Assessment

RAG-MCP represents a critical architectural contribution to the productionization of tool-augmented LLMs. While the core insight—apply RAG to tool discovery—is conceptually simple, its execution demonstrates deep understanding of both the technical challenge (prompt bloat, selection complexity) and the ecosystem dynamics (MCP proliferation, production constraints).

The paper's timing is excellent, arriving just as the MCP ecosystem reaches an inflection point where naive approaches fail. Its 3.2× accuracy improvement and ≈50% token reduction are not incremental gains but paradigm-enabling results that make previously intractable applications feasible.

Rating: 4.5/5

  • Novel application of RAG principles ⭐⭐⭐⭐⭐

  • Rigorous experimental methodology ⭐⭐⭐⭐⭐

  • Production-ready architecture ⭐⭐⭐⭐⭐

  • Limited task diversity in evaluation ⭐⭐⭐

  • Needs broader LLM validation ⭐⭐⭐⭐


Recommended for: Researchers and practitioners working on tool-augmented LLMs, agentic AI systems, MCP integration, and production AI deployment.

References

Primary Paper: Gan, T., & Sun, Q. (2025). RAG-MCP: Mitigating Prompt Bloat in LLM Tool Selection via Retrieval-Augmented Generation. arXiv:2505.03275.


Related Work:

  • Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS.

  • Schick, T., et al. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools. NeurIPS.

  • Yao, S., et al. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. ICLR.

  • Patil, S.G., et al. (2024). Gorilla: Large Language Model Connected with Massive APIs. NeurIPS.

  • Luo, Z., et al. (2025). Evaluation Report on MCP Servers. arXiv:2504.11094.


Paper: RAG-MCP: Mitigating Prompt Bloat in LLM Tool Selection

Authors: Tiantian Gan and Qiyao Sun (Beijing University of Posts and Telecommunications + Queen Mary University of London)
Publication Date: May 6, 2025
arXiv: https://arxiv.org/abs/2505.03275


Arindam Banerji

 
 
 

bottom of page