RAG-MCP: Taming Tool Bloat in the MCP Era
- Arindom Banerjee
- Nov 22
- 12 min read
Design and evaluation of a retrieval-driven MCP selector for large tool registries
Paper: RAG-MCP: Mitigating Prompt Bloat in LLM Tool Selection via Retrieval-Augmented Generation
Executive Summary
RAG-MCP addresses a critical scalability challenge facing modern LLM systems: the "prompt bloat" problem that emerges when large language models must select from hundreds or thousands of external tools. The paper introduces a Retrieval-Augmented Generation framework that dynamically retrieves only the most relevant tools from an external index, achieving remarkable results: ≈50% reduction in prompt tokens (49.2% precisely) and 3.2× improvement in tool selection accuracy (43.13% vs 13.62% baseline). This work is particularly timely given the rapid proliferation of Model Context Protocol (MCP) servers—which grew from zero to over 4,400 implementations in just five months following Anthropic's November 2024 release (and has since grown to 17,000+ servers as of late 2025).
The fundamental insight is elegant: instead of overwhelming the LLM with all available tool descriptions, treat tool discovery as a retrieval problem analogous to how RAG systems retrieve relevant passages from large corpora. This architectural shift transforms an intractable context management problem into a scalable, production-ready solution.
Critical Finding: Both baselines and RAG-MCP show sharp performance degradation once the candidate MCP pool exceeds roughly 100 tools, empirically establishing retrieval-based selection as mandatory rather than optional beyond that scale.
1. Problem Context & Motivation
1.1 The Tool Proliferation Challenge
Modern LLMs are increasingly augmented with external tools to overcome their fundamental limitations—static knowledge cutoffs and inability to perform real-world actions. While function calling and tool use have become standard capabilities (exemplified by GPT-4's function calling API, Claude's computer use, and the broader ecosystem), a new bottleneck has emerged: scalability of tool selection.
The problem manifests in two critical ways:
Prompt Bloat: Including descriptions for dozens or hundreds of tools exhausts the context window. Even with 100K+ token contexts (Claude 2, GPT-4 Turbo), listing detailed schemas, parameters, and usage examples for 100+ tools can consume 50-80% of available context, leaving insufficient room for actual task reasoning.
Selection Complexity: As the tool count grows, the LLM faces increasing cognitive load in distinguishing between similar tools with nuanced differences. Research cited in the paper shows even frontier models like GPT-4 and Claude make critical errors—hallucinating non-existent APIs or selecting inappropriate tools—when presented with large toolsets.
1.2 The Model Context Protocol (MCP) Catalyst
Anthropic's introduction of MCP in November 2024 transformed this from a theoretical concern to an urgent practical problem. MCP standardizes how AI systems connect to external data sources and tools through a universal protocol—essentially "USB-C for AI applications." The protocol's rapid adoption created an explosion of available tools:
4,400+ MCP servers listed on mcp.so as of April 2025 (when the paper was written), growing to 17,000+ servers by late 2025
Major adoption by OpenAI (March 2025), Google DeepMind, and development tools (Replit, Sourcegraph, Zed)
Pre-built connectors for Google Drive, Slack, GitHub, Postgres, Puppeteer, Stripe
This ecosystem growth, while beneficial for capability expansion, dramatically exacerbates the prompt bloat problem. The authors note this creates an "N×M integration problem"—every AI application potentially needing to handle thousands of tools.
1.3 Why Existing Approaches Fall Short
Prior work on LLM tool use (Toolformer, ReAct, Gorilla, WebGPT) focused on how models learn to use tools, not how to scale tool discovery. These approaches assume:
Small, curated toolsets (typically 5-20 tools)
Hand-picked tools for specific domains
Static tool registries
None address the dynamic, large-scale tool selection problem that MCP's success created.
2. Approach & Technical Design

2.1 Core Architecture: Three-Step Pipeline
RAG-MCP introduces a retrieval-first architecture that decouples tool discovery from tool execution:
Step 1: Semantic Retrieval
User query encoded using lightweight LLM-based retriever (implementation uses Qwen-max-0125)
Semantic search over external vector index containing all MCP metadata
Returns top-k most relevant MCP candidates based on cosine similarity
Key innovation: Tool descriptions represented in same semantic space as user queries
Step 2: Validation (Optional)
Generated few-shot example queries for each retrieved MCP
Basic compatibility testing ("sanity check") before invocation
Ensures functional correctness without full activation
Filters out false positive retrievals
Step 3: Selective Invocation
Only the single best MCP description injected into main LLM prompt
Includes full tool-use parameters and schemas
LLM performs planning and execution without concern for tool discovery
Dramatically reduced context consumption
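To make the flow concrete, here is a minimal Python sketch of the three steps above. The `retrieve`, `sanity_check`, and `llm` callables and the metadata keys (`name`, `description`, `schema`) are illustrative assumptions; the paper does not release an implementation.

```python
from typing import Callable, List

def rag_mcp_answer(
    query: str,
    retrieve: Callable[[str, int], List[dict]],   # Step 1: semantic search over MCP metadata
    sanity_check: Callable[[dict, str], bool],    # Step 2: optional compatibility test
    llm: Callable[[str], str],                    # Step 3: the main task LLM
    k: int = 3,
) -> str:
    """Hedged sketch of the RAG-MCP pipeline; all components are placeholders."""
    candidates = retrieve(query, k)                        # top-k MCP candidates
    viable = [m for m in candidates if sanity_check(m, query)] or candidates
    best = viable[0]                                       # single best MCP survives
    prompt = (                                             # only its schema enters the prompt
        f"Tool: {best['name']}\nDescription: {best['description']}\n"
        f"Parameters: {best['schema']}\n\nTask: {query}"
    )
    return llm(prompt)
```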
2.2 Technical Implementation Details
Vector Indexing:
Each MCP's metadata (name, description, parameters, usage examples) embedded into dense vectors
External index maintained separately from LLM inference
New MCPs added by simply indexing their metadata—no LLM retraining required
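A minimal sketch of such an index, assuming an arbitrary text-embedding function `embed` (any sentence-embedding model would do for illustration). The `McpIndex` class and field names are hypothetical, not the paper's code:

```python
from typing import Callable, Dict, List, Tuple

def mcp_to_document(mcp: Dict[str, str]) -> str:
    """Flatten one MCP's metadata into a single text document for embedding."""
    return (
        f"{mcp['name']}\n{mcp['description']}\n"
        f"Parameters: {mcp.get('parameters', '')}\n"
        f"Examples: {mcp.get('examples', '')}"
    )

class McpIndex:
    """Maps tool name -> (embedding, metadata). Adding a tool is a plain insert;
    the task LLM is never retrained."""

    def __init__(self, embed: Callable[[str], List[float]]):
        self.embed = embed                 # any text-embedding function
        self.entries: Dict[str, Tuple[List[float], Dict[str, str]]] = {}

    def add(self, mcp: Dict[str, str]) -> None:
        self.entries[mcp["name"]] = (self.embed(mcp_to_document(mcp)), mcp)

    def update(self, mcp: Dict[str, str]) -> None:
        self.add(mcp)                      # re-embed the revised metadata, overwrite in place

    def remove(self, name: str) -> None:
        self.entries.pop(name, None)
```

Because the index lives outside the model, registering, updating, or retiring a tool is a pure data operation.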
Retriever Design:
Lightweight compared to main task LLM
Optimized for speed (semantic search vs. full LLM reasoning)
Can be updated/fine-tuned independently of main model
Resource Efficiency Benefits:
Unlike conventional MCP clients that instantiate all registered servers before interaction
RAG-MCP activates only selected MCP server on-demand
Eliminates infrastructure bottleneck of simultaneous server instantiation
Critical for production deployment with thousands of potential tools
2.3 Design Philosophy: RAG Principles for Tool Selection
The approach draws direct inspiration from Retrieval-Augmented Generation (Lewis et al., NeurIPS 2020):
| Traditional RAG | RAG-MCP |
| --- | --- |
| Retrieve relevant passages from knowledge corpus | Retrieve relevant tools from MCP registry |
| Avoid feeding entire Wikipedia to model | Avoid feeding all tool descriptions to model |
| Dynamic, query-specific context | Dynamic, task-specific tool selection |
| External index of documents | External index of tool metadata |
This parallelism is more than metaphorical—it represents a fundamental architectural pattern for managing large-scale external resources in LLM systems.
2.4 Why RAG-MCP Works: Three Mechanisms
1. Prompt Token Efficiency:
Selective tool injection reduces context from thousands to hundreds of tokens
Preserves room for actual task reasoning and multi-turn interaction
Enables cost-effective scaling (fewer prompt tokens = lower API costs)
2. Cognitive Load Reduction:
LLM reasoning focused on 1-2 relevant tools instead of hundreds
Eliminates decision paralysis from overwhelming choices
Reduces hallucination risk from tool confusion
3. Resource Efficiency / Infrastructure Impact:
Unlike naive MCP clients that instantiate all servers up front, RAG-MCP only activates the selected MCP per query
Reduces startup overhead and makes it feasible to support very large tool registries
Critical for production deployments where simultaneous instantiation of thousands of servers would be infeasible
3. Experimental Design & Evaluation

3.1 The MCP Stress Test
Inspired by the Needle-in-a-Haystack (NIAH) test for long-context evaluation, the authors designed a novel stress test to quantify tool selection degradation:
Setup:
Vary N (number of candidate MCPs) from 1 to 11,100 in 26 intervals
Each trial: 1 ground-truth MCP + (N-1) distractor MCPs randomly sampled from 4,400+ registry
Task: WebSearch operations drawn from a fixed set of MCPBench WebSearch queries
Measure: Selection accuracy, task success rate, prompt tokens, latency
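The protocol can be sketched as a small harness. This is not the authors' code: the `registry` entries (dicts with `name` and `example_query`) and the `select_tool` callable are assumptions for illustration.

```python
import random

def stress_trial(ground_truth: dict, registry: list, n: int, select_tool) -> bool:
    """One trial: hide 1 ground-truth MCP among n-1 randomly sampled distractors."""
    # assumes n <= len(registry); the paper also synthesizes larger pools
    distractors = random.sample([t for t in registry if t is not ground_truth], n - 1)
    pool = distractors + [ground_truth]
    random.shuffle(pool)
    chosen = select_tool(ground_truth["example_query"], pool)
    return chosen["name"] == ground_truth["name"]

def stress_curve(registry: list, pool_sizes: list, trials: int, select_tool) -> dict:
    """Selection accuracy as a function of candidate-pool size N."""
    curve = {}
    for n in pool_sizes:
        wins = sum(
            stress_trial(random.choice(registry), registry, n, select_tool)
            for _ in range(trials)
        )
        curve[n] = wins / trials
    return curve
```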
Key Finding: Sharp non-monotonic performance degradation:
High success (>90%) when N < 30
Mid-range variability (31-70 MCPs): intermittent failures from semantic overlap
Critical threshold: Performance collapse beyond ~100 tools, with a few "islands" of success
Severe degradation continues as N increases to thousands
Visualization: Heat map showing per-trial success (yellow) vs. failure (purple) across MCP positions reveals clear threshold effects—model performance doesn't degrade linearly but rather exhibits phase transitions at critical scales.
3.2 Benchmark Evaluation: MCPBench
Dataset: WebSearch subset of MCPBench (evaluation report by Luo et al., 2025)
20 independent trials per baseline
Success criterion: >10 correct answers out of 20
Up to 10 rounds of interaction with MCP servers per trial
Baselines Compared:
Blank Conditioning (Naive)
Prompt LLM with all MCP descriptions at once
No pre-filtering
Represents current common practice
Actual Match (Keyword)
Simple keyword matching on task description and MCP metadata
Pre-filter candidates before prompting
Represents basic optimization
RAG-MCP (Proposed)
Semantic retrieval from vector index
Top-k candidate selection
Inject only best candidate
Automated Evaluation:
The paper describes using DeepSeek-v3 as evaluator, but also later refers to a Llama-based "Llama as Judge" verifier. We treat this as a minor editorial inconsistency; the core setup assumes an automated LLM judge rather than human labeling.
Controlled network environment (eliminate connectivity failures)
Qwen-max-0125 as base LLM for all methods
Metrics: Accuracy, Avg Prompt Tokens, Avg Completion Tokens
Evaluation Context: Note that RAG-MCP is evaluated only on the WebSearch subset of MCPBench; Luo et al. (2025) show that even the best MCPs (e.g., Bing Web Search) reach ~64% accuracy and that MCP itself does not automatically outperform traditional function-call APIs. RAG-MCP's gains are therefore about tool selection and prompt management, not about fundamentally stronger MCP servers.
4. Results & Key Findings
4.1 Quantitative Performance
| Baseline | Accuracy (%) | Avg Prompt Tokens | Avg Completion Tokens |
| --- | --- | --- | --- |
| RAG-MCP | 43.13 | 1,084 | 78.14 |
| Actual Match | 18.20 | 1,646 | 23.60 |
| Blank Conditioning | 13.62 | 2,133.84 | 162.25 |
Key Insights:
3.2× Accuracy Improvement: RAG-MCP (43.13%) vs. Blank Conditioning baseline (13.62%)
≈50% Token Reduction: 1,084 tokens vs. 2,133.84 (Blank Conditioning) = 49.2% reduction
Superior to Keyword Matching: 2.4× better than Actual Match (18.20%)
The combination of higher accuracy and lower token usage is particularly notable—efficiency typically trades off against performance, but RAG-MCP achieves both.
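A quick arithmetic check of how the headline figures follow from the table above:

```python
# Reported values from the results table
rag_acc, blank_acc = 43.13, 13.62
rag_tokens, blank_tokens = 1084.0, 2133.84

print(f"Accuracy ratio:  {rag_acc / blank_acc:.2f}x")            # ~3.17x, rounded to 3.2x
print(f"Token reduction: {1 - rag_tokens / blank_tokens:.1%}")   # ~49.2%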
4.2 Scaling Analysis
Stress Test Results:
Strong performance (>90% accuracy) up to ~30 MCPs
Degradation zone: 31-100 MCPs (intermittent failures)
Collapse zone: >100 MCPs (retrieval precision issues dominate)
Implications:
Current approaches (Blank Conditioning, keyword matching) completely fail beyond 100 tools
RAG-MCP maintains viability through semantic understanding but still degrades
Future work needed on hierarchical retrieval for extreme scales (>1000 tools)
4.3 Ablation & Error Analysis
What Makes RAG-MCP Work?
Semantic vs. Keyword Retrieval:
RAG-MCP's 2.4× advantage over Actual Match shows that semantic understanding is crucial
Keyword matching misses synonyms, contextual relevance, functional equivalence
Top-k Selection Strategy:
Paper doesn't extensively ablate k values
Defaults to k=1 (single best tool)
Open question: Would k=3-5 improve robustness?
Retriever Quality Dependency:
System only as good as its retrieval step
False negatives in top-k = guaranteed task failure
No recovery mechanism if correct tool missed
Common Failure Modes:
Semantic Ambiguity:
Multiple MCPs with overlapping functionality
Query doesn't disambiguate intent
Example: "search the web" could match general search, academic search, product search
Out-of-Distribution Queries:
Novel task compositions not seen in training
Tool descriptions optimized for common use cases
Edge cases may not surface in retrieval
Metadata Quality:
Poor MCP descriptions lead to bad embeddings
Inconsistent naming conventions across tools
Missing usage examples reduce match quality
5. Architectural Deep Dive
5.1 System Components
External Vector Index:
Stores embeddings for all MCP metadata
Updated asynchronously as new tools register
Supports rapid semantic search (<100ms latency)
Can leverage existing vector databases (Pinecone, Weaviate, Chroma)
Retrieval Pipeline:
Query encoding (user intent → dense vector)
Similarity search (cosine distance in embedding space)
Top-k selection (configurable k, default k=1)
Metadata extraction (pull full tool schemas for selected MCPs)
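A pure-Python sketch of this query-time path, assuming the same name → (embedding, metadata) index layout as in Section 2.2; the function names are illustrative:

```python
import math
from typing import Dict, List, Tuple

def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k_tools(
    query_vec: List[float],
    index: Dict[str, Tuple[List[float], dict]],   # tool name -> (embedding, metadata)
    k: int = 1,                                   # the paper defaults to the single best tool
) -> List[Tuple[str, float, dict]]:
    """Rank indexed MCPs by cosine similarity to the encoded query and keep the top k."""
    scored = [(name, cosine(query_vec, vec), meta) for name, (vec, meta) in index.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]
```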
Execution Layer:
Receives compact prompt (user query + selected tool description)
Performs standard tool calling / function invocation
No awareness of broader MCP ecosystem
Operates identically to single-tool scenarios
5.2 Production Considerations
Latency Profile:
Retrieval overhead: ~50-100ms (vector search)
Total added latency: <200ms in most cases
Negligible compared to LLM inference time (seconds)
Acceptable for real-time applications
Cost Structure:
Reduced prompt tokens directly lower API costs
50% token reduction ≈ 50% cost savings on input
Retrieval infrastructure cost minimal (vector DB hosting)
Net positive economics at scale
Reliability & Robustness:
Single point of failure: Vector index availability
Mitigation: Replicated indices, caching strategies
Fallback option: Degrade to keyword matching if retrieval fails
Monitoring critical: Track retrieval precision, tool selection accuracy
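One way to realize the keyword-matching fallback mentioned above, sketched with an assumed `semantic_search` callable and an arbitrary confidence threshold (`min_score=0.3` is illustrative, not from the paper):

```python
def keyword_score(query: str, description: str) -> float:
    """Fraction of query words that also appear in the tool description."""
    q, d = set(query.lower().split()), set(description.lower().split())
    return len(q & d) / len(q) if q else 0.0

def select_tool_with_fallback(query: str, tools: list, semantic_search=None, min_score: float = 0.3) -> str:
    """tools: list of dicts with 'name' and 'description' keys."""
    if semantic_search is not None:
        try:
            name, score = semantic_search(query)      # (best tool name, similarity score)
            if score >= min_score:
                return name
        except Exception:
            pass                                      # e.g. vector index unreachable
    # Degrade gracefully: rank by keyword overlap with each tool's description
    return max(tools, key=lambda t: keyword_score(query, t["description"]))["name"]
```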
Extensibility:
Adding new tools: Simply embed and index metadata (no retraining)
Updating tools: Re-embed description, update index
Tool deprecation: Remove from index
Zero downtime for model itself
5.3 Integration Patterns
Scenario 1: IDE Code Assistants
Register 100+ MCPs (GitHub, StackOverflow, documentation, package managers)
User query: "How to implement OAuth in FastAPI?"
RAG-MCP retrieves: FastAPI docs MCP, OAuth tutorial MCP
LLM generates code with accurate, contextual tool use
Scenario 2: Enterprise Support Chatbot
Register 500+ MCPs (Salesforce, Jira, Confluence, email, calendar, knowledge base)
User query: "Find all high-priority bugs assigned to my team"
RAG-MCP retrieves: Jira query MCP
LLM executes precise API call without seeing other 499 tools
Scenario 3: Personal AI Assistant
Register 50+ MCPs (Google Drive, Gmail, Calendar, Notion, Spotify, weather, news)
User query: "Summarize my emails from this morning and create a to-do list"
RAG-MCP retrieves: Gmail MCP, task management MCP
LLM orchestrates multi-step workflow
6. Comparison with Alternative Approaches
6.1 Naive Approaches (Status Quo)
Full Context Injection:
Load all tool descriptions into prompt
Works for <20 tools, fails for >50
Exhausts context window
High cognitive load on LLM
Manual Curation:
Humans pre-select relevant tool subset per task
Doesn't scale (requires domain expertise per query)
Labor-intensive, slow iteration
6.2 Structured Approaches
Hierarchical Tool Organization:
Organize tools into categories (e.g., "data sources," "computation," "communication")
Two-stage selection: category → specific tool
Pros: Reduces search space
Cons: Requires manual taxonomy, rigid structure, doesn't adapt to cross-category needs
Rule-Based Routing:
If-then rules map query patterns to tools
Example: "Send email" → Email MCP
Pros: Deterministic, fast
Cons: Brittle, doesn't generalize, maintenance nightmare at scale
6.3 Learning-Based Alternatives
Fine-Tuned Tool Selectors:
Train small model specifically for tool selection
Pros: Potentially very accurate if trained on domain data
Cons: Requires labeled data, needs retraining for new tools, less flexible
Reinforcement Learning:
Learn tool selection policy via trial-and-error
Pros: Can discover non-obvious tool combinations
Cons: Sample inefficient, slow to adapt, complex to implement
RAG-MCP's Advantages Over Alternatives:
Zero-shot generalization to new tools (just add to index)
Leverages pre-trained LLM semantics (no task-specific training)
Prompt size stays roughly constant as the tool count grows, and approximate nearest-neighbor search keeps retrieval cost sub-linear in registry size
Practical infrastructure (vector DB is commodity technology)
7. Limitations & Future Work
7.1 Acknowledged Limitations
Single-Task Evaluation:
Only tested on WebSearch subset of MCPBench
Doesn't cover multi-tool workflows
Limited task diversity (no code generation, data analysis, content creation)
Unknown performance on tool chains or compositions
Single Base LLM:
Experiments use Qwen-max-0125 exclusively
Generalization to GPT-4, Claude, Llama, Gemini unclear
Model-specific biases may affect results
No Human Evaluation:
Automated judge may miss nuanced failures
User satisfaction, task completion quality not assessed
Real-world usability unknown
Implementation Availability: As of this writing, no official RAG-MCP implementation has been open-sourced; reproducibility depends on re-implementing their retrieval + MCPBench setup from the paper.
7.2 Technical Gaps
Retrieval Precision Ceiling:
Quality of semantic retrieval crucial
What if retriever fails to surface correct tool in top-k?
Error propagation from retrieval to execution stage
Validation Step Unclear:
"Optional" validation mentioned but not deeply evaluated
Trade-offs not fully characterized
When is validation necessary vs. overhead?
7.3 Open Research Questions
Optimal Retriever Design:
Lightweight LLM (Qwen) vs. specialized embedding models?
Fine-tuning retriever on tool selection data?
Multi-vector representations for tools?
Dynamic k Selection:
How many tools to retrieve varies by query complexity
Can this be learned/predicted?
Trade-off between coverage and precision
Multi-Tool Scenarios:
Real workflows often require tool chaining
How to retrieve sequences of tools?
Tool dependency graphs?
Human-in-the-Loop:
User feedback on tool selection
Active learning for retrieval improvement
Personalized tool preferences
Cross-Model Portability:
Does same index work for GPT-4, Claude, Llama?
Model-specific vs. universal retrievers?
8. Industry & Research Impact
8.1 Immediate Industry Applications
Enterprise AI Platforms:
Salesforce, Microsoft Copilot, Google Workspace AI
Can now support comprehensive tool ecosystems
Enables "AI assistant with access to everything"
Developer Tools:
IDEs (Cursor, Windsurf, Replit)
Can offer hundreds of code-related MCP tools
Reduces prompt engineering burden
Vertical AI Solutions:
Healthcare: Medical databases, EHR systems, diagnostic tools
Finance: Market data, trading APIs, compliance tools
Legal: Case databases, document management, research tools
8.2 Research Trajectory
Short-Term (6-12 months):
Replication studies with different LLMs
Extensions to multi-tool workflows
Hierarchical retrieval variants
Real-world deployment studies
Medium-Term (1-2 years):
Learned retrievers fine-tuned on tool selection
Tool usage patterns and popularity metrics
Personalization and context-aware retrieval
Integration with agent frameworks (AutoGPT, LangGraph)
Long-Term (2-5 years):
Tool composition and workflow synthesis
Automated tool discovery and registration
Cross-modal tool use (vision, audio, robotics)
Standardization beyond MCP (Tool Discovery Protocol?)
8.3 Broader AI Implications
Paradigm Shift: From "what can this model do?" to "what can this model access?"
Success increasingly depends on ecosystem connectivity
Model capabilities plateau; tool access becomes differentiator
Democratization: Smaller models with good retrieval can rival larger models
7B model + comprehensive tools > 70B model + limited tools
Reduces compute requirements for capable agents
Safety Considerations:
Tool access increases potential for harm
RAG-MCP's selective activation may improve safety (smaller attack surface)
But retrieval errors could lead to unintended tool use
9. Conclusion & Recommendations
9.1 Key Takeaways
Problem Validation: Prompt bloat is real, measurable, and worsening with ecosystem growth
Solution Viability: RAG-MCP demonstrates retrieval-based tool selection works at scale
Production Readiness: Architecture addresses practical deployment concerns (cost, latency, extensibility)
Ecosystem Enabler: Removes ceiling on MCP ecosystem growth, validating universal protocol vision
Critical Scaling Threshold: Empirically, both baselines and RAG-MCP show sharp degradation once the candidate MCP pool exceeds roughly 100 tools; this supports treating retrieval-based selection as mandatory rather than optional beyond that scale
9.2 For Practitioners
Suggested Actions:
Implement RAG-MCP pattern for systems with >20 tools
Build/maintain tool metadata indices
Monitor retrieval quality metrics
Plan for scaling beyond current tool counts
Long-Term Steps:
Invest in semantic tool representation
Develop tool discovery infrastructure
Consider hierarchical organization for large tool libraries
Prepare for multi-tool workflow orchestration
9.3 For Researchers
High-Priority Areas:
Retriever optimization for tool selection
Multi-tool workflow retrieval
Evaluation benchmarks for large-scale tool use
Safety and robustness in retrieved tool execution
Open Datasets Needed:
Large-scale tool selection benchmarks
Real-world tool usage patterns
Multi-tool workflow traces
9.4 Final Assessment
RAG-MCP represents a critical architectural contribution to the productionization of tool-augmented LLMs. While the core insight—apply RAG to tool discovery—is conceptually simple, its execution demonstrates deep understanding of both the technical challenge (prompt bloat, selection complexity) and the ecosystem dynamics (MCP proliferation, production constraints).
The paper's timing is excellent, arriving just as the MCP ecosystem reaches an inflection point where naive approaches fail. Its 3.2× accuracy improvement and ≈50% token reduction are not incremental gains but paradigm-enabling results that make previously intractable applications feasible.
Rating: 4.5/5
Novel application of RAG principles ⭐⭐⭐⭐⭐
Rigorous experimental methodology ⭐⭐⭐⭐⭐
Production-ready architecture ⭐⭐⭐⭐⭐
Limited task diversity in evaluation ⭐⭐⭐
Needs broader LLM validation ⭐⭐⭐⭐
Recommended for: Researchers and practitioners working on tool-augmented LLMs, agentic AI systems, MCP integration, and production AI deployment.
References
Primary Paper: Gan, T., & Sun, Q. (2025). RAG-MCP: Mitigating Prompt Bloat in LLM Tool Selection via Retrieval-Augmented Generation. arXiv:2505.03275.
Model Context Protocol:
Anthropic (2024). Introducing the Model Context Protocol. https://www.anthropic.com/news/model-context-protocol
Anthropic (2025). Code Execution with MCP. https://www.anthropic.com/engineering/code-execution-with-mcp
Related Work:
Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS.
Schick, T., et al. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools. NeurIPS.
Yao, S., et al. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. ICLR.
Patil, S.G., et al. (2024). Gorilla: Large Language Model Connected with Massive APIs. NeurIPS.
Luo, Z., et al. (2025). Evaluation Report on MCP Servers. arXiv:2504.11094.
Additional Context:
MCP Server Registry: https://mcp.so/ (4,400+ servers as of April 2025, 17,000+ as of late 2025)
MCP Documentation: https://docs.anthropic.com/en/docs/agents-and-tools/mcp
MCP GitHub: https://github.com/modelcontextprotocol
Paper: RAG-MCP: Mitigating Prompt Bloat in LLM Tool Selection
Authors: Tiantian Gan and Qiyao Sun (Beijing University of Posts and Telecommunications; Queen Mary University of London)
Publication Date: May 6, 2025
arXiv: https://arxiv.org/abs/2505.03275
Arindam Banerji


