
RAG-MCP: Taming Tool Bloat in the MCP Era

Design and evaluation of a retrieval-driven MCP selector for large tool registries


Paper: RAG-MCP: Mitigating Prompt Bloat in LLM Tool Selection via Retrieval-Augmented Generation


Executive Summary

RAG-MCP addresses a critical scalability challenge facing modern LLM systems: the "prompt bloat" problem that emerges when large language models must select from hundreds or thousands of external tools. The paper introduces a Retrieval-Augmented Generation framework that dynamically retrieves only the most relevant tools from an external index, achieving remarkable results: ≈50% reduction in prompt tokens (49.2% precisely) and 3.2× improvement in tool selection accuracy (43.13% vs 13.62% baseline). This work is particularly timely given the rapid proliferation of Model Context Protocol (MCP) servers—which grew from zero to over 4,400 implementations in just five months following Anthropic's November 2024 release (and has since grown to 17,000+ servers as of late 2025).


The fundamental insight is elegant: instead of overwhelming the LLM with all available tool descriptions, treat tool discovery as a retrieval problem analogous to how RAG systems retrieve relevant passages from large corpora. This architectural shift transforms an intractable context management problem into a scalable, production-ready solution.


Critical Finding: Both baselines and RAG-MCP show sharp performance degradation once the candidate MCP pool exceeds roughly 100 tools, empirically establishing retrieval-based selection as mandatory rather than optional beyond that scale.


1. Problem Context & Motivation


1.1 The Tool Proliferation Challenge

Modern LLMs are increasingly augmented with external tools to overcome their fundamental limitations—static knowledge cutoffs and inability to perform real-world actions. While function calling and tool use have become standard capabilities (exemplified by GPT-4's function calling API, Claude's computer use, and the broader ecosystem), a new bottleneck has emerged: scalability of tool selection.


The problem manifests in two critical ways:

Prompt Bloat: Including descriptions for dozens or hundreds of tools exhausts the context window. Even with 100K+ token contexts (Claude 2, GPT-4 Turbo), listing detailed schemas, parameters, and usage examples for 100+ tools can consume 50-80% of available context, leaving insufficient room for actual task reasoning.


Selection Complexity: As the tool count grows, the LLM faces increasing cognitive load in distinguishing between similar tools with nuanced differences. Research cited in the paper shows even frontier models like GPT-4 and Claude make critical errors—hallucinating non-existent APIs or selecting inappropriate tools—when presented with large toolsets.


1.2 The Model Context Protocol (MCP) Catalyst

Anthropic's introduction of MCP in November 2024 transformed this from a theoretical concern to an urgent practical problem. MCP standardizes how AI systems connect to external data sources and tools through a universal protocol—essentially "USB-C for AI applications." The protocol's rapid adoption created an explosion of available tools:

  • 4,400+ MCP servers listed on mcp.so as of April 2025 (paper-time), with many more (17,000+ servers) as of late 2025

  • Major adoption by OpenAI (March 2025), Google DeepMind, and development tools (Replit, Sourcegraph, Zed)

  • Pre-built connectors for Google Drive, Slack, GitHub, Postgres, Puppeteer, Stripe


This ecosystem growth, while beneficial for capability expansion, dramatically exacerbates the prompt bloat problem. The authors frame it as an "N×M integration problem": each of N AI applications potentially needs to handle every one of M tools, and M is now in the thousands.


1.3 Why Existing Approaches Fall Short

Prior work on LLM tool use (Toolformer, ReAct, Gorilla, WebGPT) focused on how models learn to use tools, not how to scale tool discovery. These approaches assume:

  • Small, curated toolsets (typically 5-20 tools)

  • Hand-picked tools for specific domains

  • Static tool registries

None address the dynamic, large-scale tool selection problem that MCP's success created.


2. Approach & Technical Design


2.1 Core Architecture: Three-Step Pipeline

RAG-MCP introduces a retrieval-first architecture that decouples tool discovery from tool execution:

Step 1: Semantic Retrieval

  • User query encoded using lightweight LLM-based retriever (implementation uses Qwen-max-0125)

  • Semantic search over external vector index containing all MCP metadata

  • Returns top-k most relevant MCP candidates based on cosine similarity

  • Key innovation: Tool descriptions represented in same semantic space as user queries

Step 2: Validation (Optional)

  • Generated few-shot example queries for each retrieved MCP

  • Basic compatibility testing ("sanity check") before invocation

  • Ensures functional correctness without full activation

  • Filters out false positive retrievals


Step 3: Selective Invocation

  • Only the single best MCP description injected into main LLM prompt

  • Includes full tool-use parameters and schemas

  • LLM performs planning and execution without concern for tool discovery

  • Dramatically reduced context consumption
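
Since no official implementation of RAG-MCP has been released, the following is a minimal Python sketch of the three-step flow just described. All names here (embed, search, validate, answer_with_tool) are illustrative placeholders supplied by the caller, not the authors' code.

    # Minimal sketch of the RAG-MCP pipeline: retrieve -> (optionally) validate -> invoke.
    # The embedder, index lookup, validator, and LLM client are passed in as callables
    # because the paper does not publish an implementation; every name is illustrative.
    from typing import Callable, Sequence

    def rag_mcp_answer(
        user_query: str,
        embed: Callable[[str], Sequence[float]],            # lightweight query encoder
        search: Callable[[Sequence[float], int], list],     # vector-index top-k lookup
        validate: Callable[[dict, str], bool],              # optional sanity check
        answer_with_tool: Callable[[str, dict], str],       # main LLM given one tool schema
        k: int = 3,
    ) -> str:
        # Step 1: semantic retrieval over the external MCP index.
        candidates = search(embed(user_query), k)

        # Step 2 (optional): keep only candidates that pass a basic compatibility test.
        viable = [mcp for mcp in candidates if validate(mcp, user_query)] or candidates

        # Step 3: inject only the single best MCP description into the task prompt.
        best_mcp = viable[0]
        return answer_with_tool(user_query, best_mcp)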


2.2 Technical Implementation Details


Vector Indexing:

  • Each MCP's metadata (name, description, parameters, usage examples) embedded into dense vectors

  • External index maintained separately from LLM inference

  • New MCPs added by simply indexing their metadata—no LLM retraining required
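
The mechanics of the index can be illustrated in a few lines of Python. The paper's retriever is Qwen-based; the open-source embedding model and the tiny in-memory index below are stand-ins chosen only to show the pattern (embed metadata once, then rank by cosine similarity).

    # Illustrative-only sketch: embed MCP metadata once, then rank by cosine similarity.
    # The embedding model is a stand-in; the paper uses a Qwen-based retriever.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    encoder = SentenceTransformer("all-MiniLM-L6-v2")

    mcps = [  # toy registry entries; real metadata also includes parameters and examples
        {"name": "web_search", "description": "Search the web and return result snippets."},
        {"name": "github_repo", "description": "Query GitHub repositories, issues, and pull requests."},
    ]

    texts = [f"{m['name']}: {m['description']}" for m in mcps]
    index = encoder.encode(texts, normalize_embeddings=True)    # shape: (num_tools, dim)

    def retrieve(query: str, k: int = 1):
        q = encoder.encode([query], normalize_embeddings=True)[0]
        scores = index @ q                                      # cosine similarity (unit vectors)
        top = np.argsort(-scores)[:k]
        return [(mcps[i], float(scores[i])) for i in top]

    print(retrieve("find open issues in my repository", k=1))

Adding a new MCP is then just another encoder.encode call appended to the index, with no change to the LLM itself.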


Retriever Design:

  • Lightweight compared to main task LLM

  • Optimized for speed (semantic search vs. full LLM reasoning)

  • Can be updated/fine-tuned independently of main model


Resource Efficiency Benefits:

  • Unlike conventional MCP clients that instantiate all registered servers before interaction

  • RAG-MCP activates only selected MCP server on-demand

  • Eliminates infrastructure bottleneck of simultaneous server instantiation

  • Critical for production deployment with thousands of potential tools


2.3 Design Philosophy: RAG Principles for Tool Selection

The approach draws direct inspiration from Retrieval-Augmented Generation (Lewis et al., NeurIPS 2020):

Traditional RAG                                     | RAG-MCP
Retrieve relevant passages from a knowledge corpus  | Retrieve relevant tools from the MCP registry
Avoid feeding all of Wikipedia to the model         | Avoid feeding all tool descriptions to the model
Dynamic, query-specific context                     | Dynamic, task-specific tool selection
External index of documents                         | External index of tool metadata

This parallelism is more than metaphorical—it represents a fundamental architectural pattern for managing large-scale external resources in LLM systems.


2.4 Why RAG-MCP Works: Three Mechanisms


1. Prompt Token Efficiency:

  • Selective tool injection reduces context from thousands to hundreds of tokens

  • Preserves room for actual task reasoning and multi-turn interaction

  • Enables cost-effective scaling (fewer prompt tokens = lower API costs)

2. Cognitive Load Reduction:

  • LLM reasoning focused on 1-2 relevant tools instead of hundreds

  • Eliminates decision paralysis from overwhelming choices

  • Reduces hallucination risk from tool confusion

3. Resource Efficiency / Infrastructure Impact:

  • Unlike naive MCP clients that instantiate all servers up front, RAG-MCP only activates the selected MCP per query

  • Reduces startup overhead and makes it feasible to support very large tool registries

  • Critical for production deployments where simultaneous instantiation of thousands of servers would be infeasible


3. Experimental Design & Evaluation


3.1 The MCP Stress Test

Inspired by the Needle-in-a-Haystack (NIAH) test for long-context evaluation, the authors designed a novel stress test to quantify tool selection degradation:

Setup:

  • Vary N (number of candidate MCPs) from 1 to 11,100 in 26 intervals

  • Each trial: 1 ground-truth MCP + (N-1) distractor MCPs randomly sampled from 4,400+ registry

  • Task: WebSearch operations drawn from a fixed set of MCPBench WebSearch queries

  • Measure: Selection accuracy, task success rate, prompt tokens, latency
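
A single stress-test trial can be sketched as follows. The registry, query set, and selection routine are caller-supplied placeholders (the paper's harness is not public), so this only illustrates the "one needle, N-1 distractors" construction.

    # Sketch of one stress-test trial: 1 ground-truth MCP mixed with (N-1) distractors.
    # `registry`, `queries`, and `run_selection` are placeholders supplied by the caller.
    import random

    def run_trial(n_candidates, registry, queries, run_selection):
        query, ground_truth = random.choice(queries)          # fixed WebSearch task set
        distractors = random.sample(
            [m for m in registry if m is not ground_truth], n_candidates - 1)
        pool = distractors + [ground_truth]
        random.shuffle(pool)                                   # ground-truth position is random

        selected, prompt_tokens = run_selection(query, pool)   # method under test
        return {"n": n_candidates,
                "selection_correct": selected is ground_truth,
                "prompt_tokens": prompt_tokens}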


Key Finding: Sharp non-monotonic performance degradation:

  • High success (>90%) when N < 30

  • Mid-range variability (31-70 MCPs): intermittent failures from semantic overlap

  • Critical threshold: Performance collapse beyond ~100 tools, with a few "islands" of success

  • Severe degradation continues as N increases to thousands


Visualization: Heat map showing per-trial success (yellow) vs. failure (purple) across MCP positions reveals clear threshold effects—model performance doesn't degrade linearly but rather exhibits phase transitions at critical scales.


3.2 Benchmark Evaluation: MCPBench

Dataset: WebSearch subset of MCPBench (evaluation report by Luo et al., 2025)

  • 20 independent trials per baseline

  • Success criterion: >10 correct answers out of 20

  • Up to 10 rounds of interaction with MCP servers per trial


Baselines Compared:

  1. Blank Conditioning (Naive)

    • Prompt LLM with all MCP descriptions at once

    • No pre-filtering

    • Represents current common practice

  2. Actual Match (Keyword)

    • Simple keyword matching on task description and MCP metadata

    • Pre-filter candidates before prompting

    • Represents basic optimization

  3. RAG-MCP (Proposed)

    • Semantic retrieval from vector index

    • Top-k candidate selection

    • Inject only best candidate
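
For contrast with semantic retrieval, a rough sketch of a keyword pre-filter in the spirit of Actual Match is shown below. The paper does not spell out its exact matching rule, so the token-overlap score here is an assumption used only to make the comparison concrete.

    # Rough sketch of a keyword pre-filter ("Actual Match"-style baseline).
    # Token overlap is an assumed proxy; the paper does not specify the exact rule.
    def keyword_score(query: str, mcp: dict) -> int:
        query_tokens = set(query.lower().split())
        meta_tokens = set(f"{mcp['name']} {mcp['description']}".lower().split())
        return len(query_tokens & meta_tokens)

    def actual_match(query: str, registry: list, k: int = 1) -> list:
        return sorted(registry, key=lambda m: keyword_score(query, m), reverse=True)[:k]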


Automated Evaluation:

  • The paper describes using DeepSeek-v3 as evaluator, but also later refers to a Llama-based "Llama as Judge" verifier. We treat this as a minor editorial inconsistency; the core setup assumes an automated LLM judge rather than human labeling.

  • Controlled network environment (eliminate connectivity failures)

  • Qwen-max-0125 as base LLM for all methods

  • Metrics: Accuracy, Avg Prompt Tokens, Avg Completion Tokens


Evaluation Context: Note that RAG-MCP is evaluated only on the WebSearch subset of MCPBench; Luo et al. (2025) show that even the best MCPs (e.g., Bing Web Search) reach ~64% accuracy and that MCP itself does not automatically outperform traditional function-call APIs. RAG-MCP's gains are therefore about tool selection and prompt management, not about fundamentally stronger MCP servers.


4. Results & Key Findings

4.1 Quantitative Performance

Baseline            | Accuracy (%) | Avg Prompt Tokens | Avg Completion Tokens
RAG-MCP             | 43.13        | 1,084             | 78.14
Actual Match        | 18.20        | 1,646             | 23.60
Blank Conditioning  | 13.62        | 2,133.84          | 162.25


Key Insights:

  1. 3.2× Accuracy Improvement: RAG-MCP (43.13%) vs. Blank Conditioning baseline (13.62%)

  2. ≈50% Token Reduction: 1,084 tokens vs. 2,133.84 (Blank Conditioning) = 49.2% reduction

  3. Superior to Keyword Matching: 2.4× better than Actual Match (18.20%)

The combination of higher accuracy and lower token usage is particularly notable—efficiency typically trades off against performance, but RAG-MCP achieves both.
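
As a quick sanity check, the headline ratios follow directly from the table values:

    # Recomputing the reported gains from the table above.
    blank_tokens, rag_tokens = 2133.84, 1084
    print(f"prompt-token reduction: {1 - rag_tokens / blank_tokens:.1%}")  # 49.2%
    print(f"accuracy vs. Blank Conditioning: {43.13 / 13.62:.1f}x")        # 3.2x
    print(f"accuracy vs. Actual Match: {43.13 / 18.20:.1f}x")              # 2.4x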


4.2 Scaling Analysis


Stress Test Results:

  • Near-ceiling performance (>90% accuracy) up to ~30 MCPs

  • Degradation zone: 31-100 MCPs (intermittent failures)

  • Collapse zone: >100 MCPs (retrieval precision issues dominate)


Implications:

  • Current approaches (Blank Conditioning, keyword matching) completely fail beyond 100 tools

  • RAG-MCP maintains viability through semantic understanding but still degrades

  • Future work needed on hierarchical retrieval for extreme scales (>1000 tools)


4.3 Ablation & Error Analysis

What Makes RAG-MCP Work?

  1. Semantic vs. Keyword Retrieval: 

    • RAG-MCP's 2.4× advantage over Actual Match shows semantic understanding crucial

    • Keyword matching misses synonyms, contextual relevance, functional equivalence

  2. Top-k Selection Strategy: 

    • Paper doesn't extensively ablate k values

    • Defaults to k=1 (single best tool)

    • Open question: Would k=3-5 improve robustness?

  3. Retriever Quality Dependency: 

    • System only as good as its retrieval step

    • False negatives in top-k = guaranteed task failure

    • No recovery mechanism if correct tool missed


Common Failure Modes:

  1. Semantic Ambiguity: 

    • Multiple MCPs with overlapping functionality

    • Query doesn't disambiguate intent

    • Example: "search the web" could match general search, academic search, product search

  2. Out-of-Distribution Queries: 

    • Novel task compositions not seen in training

    • Tool descriptions optimized for common use cases

    • Edge cases may not surface in retrieval

  3. Metadata Quality: 

    • Poor MCP descriptions lead to bad embeddings

    • Inconsistent naming conventions across tools

    • Missing usage examples reduce match quality


5. Architectural Deep Dive


5.1 System Components

External Vector Index:

  • Stores embeddings for all MCP metadata

  • Updated asynchronously as new tools register

  • Supports rapid semantic search (<100ms latency)

  • Can leverage existing vector databases (Pinecone, Weaviate, Chroma)


Retrieval Pipeline:

  1. Query encoding (user intent → dense vector)

  2. Similarity search (cosine distance in embedding space)

  3. Top-k selection (configurable k, default k=1)

  4. Metadata extraction (pull full tool schemas for selected MCPs)


Execution Layer:

  • Receives compact prompt (user query + selected tool description)

  • Performs standard tool calling / function invocation

  • No awareness of broader MCP ecosystem

  • Operates identically to single-tool scenarios
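
Concretely, the execution layer sees a request like the one sketched below, containing the user query plus the one selected MCP's schema and nothing else. The field names mirror common function-calling APIs but are illustrative, not a specific vendor's schema.

    # Sketch of the compact request handed to the execution LLM: only the single
    # selected MCP's schema is included, regardless of how large the registry is.
    # Field names are illustrative, not a specific vendor's function-calling format.
    def build_execution_request(user_query: str, selected_mcp: dict) -> dict:
        return {
            "messages": [
                {"role": "system",
                 "content": "You may call the single tool described below when it helps."},
                {"role": "user", "content": user_query},
            ],
            "tools": [{
                "name": selected_mcp["name"],
                "description": selected_mcp["description"],
                "parameters": selected_mcp["schema"],   # full JSON schema for this MCP only
            }],
        }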


5.2 Production Considerations

Latency Profile:

  • Retrieval overhead: ~50-100ms (vector search)

  • Total added latency: <200ms in most cases

  • Negligible compared to LLM inference time (seconds)

  • Acceptable for real-time applications

Cost Structure:

  • Reduced prompt tokens directly lower API costs

  • 50% token reduction ≈ 50% cost savings on input

  • Retrieval infrastructure cost minimal (vector DB hosting)

  • Net positive economics at scale

Reliability & Robustness:

  • Single point of failure: Vector index availability

  • Mitigation: Replicated indices, caching strategies

  • Fallback option: Degrade to keyword matching if retrieval fails

  • Monitoring critical: Track retrieval precision, tool selection accuracy
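
The fallback idea above can be wrapped around any retriever. In the sketch below, semantic_retrieve and keyword_retrieve stand for the two retrieval styles sketched earlier in this writeup; the logging setup is illustrative.

    # Sketch of the degradation path: fall back to keyword matching when the
    # vector index is unavailable. The two retrievers are caller-supplied.
    import logging

    log = logging.getLogger("rag_mcp")

    def retrieve_with_fallback(query, registry, semantic_retrieve, keyword_retrieve, k=1):
        try:
            return semantic_retrieve(query, k=k)           # primary path: vector search
        except Exception as err:                           # e.g. index unreachable
            log.warning("vector retrieval failed, falling back to keywords: %s", err)
            return keyword_retrieve(query, registry, k=k)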

Extensibility:

  • Adding new tools: Simply embed and index metadata (no retraining)

  • Updating tools: Re-embed description, update index

  • Tool deprecation: Remove from index

  • Zero downtime for model itself


5.3 Integration Patterns

Scenario 1: IDE Code Assistants

  • Register 100+ MCPs (GitHub, StackOverflow, documentation, package managers)

  • User query: "How to implement OAuth in FastAPI?"

  • RAG-MCP retrieves: FastAPI docs MCP, OAuth tutorial MCP

  • LLM generates code with accurate, contextual tool use

Scenario 2: Enterprise Support Chatbot

  • Register 500+ MCPs (Salesforce, Jira, Confluence, email, calendar, knowledge base)

  • User query: "Find all high-priority bugs assigned to my team"

  • RAG-MCP retrieves: Jira query MCP

  • LLM executes precise API call without seeing other 499 tools

Scenario 3: Personal AI Assistant

  • Register 50+ MCPs (Google Drive, Gmail, Calendar, Notion, Spotify, weather, news)

  • User query: "Summarize my emails from this morning and create a to-do list"

  • RAG-MCP retrieves: Gmail MCP, task management MCP

  • LLM orchestrates multi-step workflow


6. Comparison with Alternative Approaches


6.1 Naive Approaches (Status Quo)

Full Context Injection:

  • Load all tool descriptions into prompt

  • Works for <20 tools, fails for >50

  • Exhausts context window

  • High cognitive load on LLM

Manual Curation:

  • Humans pre-select relevant tool subset per task

  • Doesn't scale (requires domain expertise per query)

  • Labor-intensive, slow iteration


6.2 Structured Approaches

Hierarchical Tool Organization:

  • Organize tools into categories (e.g., "data sources," "computation," "communication")

  • Two-stage selection: category → specific tool

  • Pros: Reduces search space

  • Cons: Requires manual taxonomy, rigid structure, doesn't adapt to cross-category needs
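
A two-stage selector of this kind might look like the sketch below; the taxonomy layout and the rank function (same call shape as the keyword matcher sketched earlier) are assumptions, since the paper does not implement this alternative.

    # Sketch of the two-stage (category -> tool) alternative: pick a category first,
    # then rank tools only within it. Taxonomy layout and `rank` are assumed.
    def hierarchical_select(query, taxonomy, rank, k=1):
        # Stage 1: choose the most relevant category by its description.
        categories = [{"name": name, "description": meta["description"]}
                      for name, meta in taxonomy.items()]
        best_category = rank(query, categories, k=1)[0]["name"]
        # Stage 2: rank tools only within the chosen category.
        return rank(query, taxonomy[best_category]["tools"], k=k)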

Rule-Based Routing:

  • If-then rules map query patterns to tools

  • Example: "Send email" → Email MCP

  • Pros: Deterministic, fast

  • Cons: Brittle, doesn't generalize, maintenance nightmare at scale


6.3 Learning-Based Alternatives

Fine-Tuned Tool Selectors:

  • Train small model specifically for tool selection

  • Pros: Potentially very accurate if trained on domain data

  • Cons: Requires labeled data, needs retraining for new tools, less flexible

Reinforcement Learning:

  • Learn tool selection policy via trial-and-error

  • Pros: Can discover non-obvious tool combinations

  • Cons: Sample inefficient, slow to adapt, complex to implement

RAG-MCP's Advantages Over Alternatives:

  • Zero-shot generalization to new tools (just add to index)

  • Leverages pre-trained LLM semantics (no task-specific training)

  • Prompt cost stays roughly constant as the registry grows (only the selected tool is injected), while retrieval itself scales sublinearly with tool count

  • Practical infrastructure (vector DB is commodity technology)


7. Limitations & Future Work

7.1 Acknowledged Limitations

Single-Task Evaluation:

  • Only tested on WebSearch subset of MCPBench

  • Doesn't cover multi-tool workflows

  • Limited task diversity (no code generation, data analysis, content creation)

  • Unknown performance on tool chains or compositions

Single Base LLM:

  • Experiments use Qwen-max-0125 exclusively

  • Generalization to GPT-4, Claude, Llama, Gemini unclear

  • Model-specific biases may affect results

No Human Evaluation:

  • Automated judge may miss nuanced failures

  • User satisfaction, task completion quality not assessed

  • Real-world usability unknown


Implementation Availability: As of this writing, no official RAG-MCP implementation has been open-sourced; reproducibility depends on re-implementing their retrieval + MCPBench setup from the paper.


7.2 Technical Gaps

Retrieval Precision Ceiling:

  • Quality of semantic retrieval crucial

  • What if retriever fails to surface correct tool in top-k?

  • Error propagation from retrieval to execution stage

Validation Step Unclear:

  • "Optional" validation mentioned but not deeply evaluated

  • Trade-offs not fully characterized

  • When is validation necessary vs. overhead?


7.3 Open Research Questions

  1. Optimal Retriever Design: 

    • Lightweight LLM (Qwen) vs. specialized embedding models?

    • Fine-tuning retriever on tool selection data?

    • Multi-vector representations for tools?

  2. Dynamic k Selection: 

    • How many tools to retrieve varies by query complexity

    • Can this be learned/predicted?

    • Trade-off between coverage and precision

  3. Multi-Tool Scenarios: 

    • Real workflows often require tool chaining

    • How to retrieve sequences of tools?

    • Tool dependency graphs?

  4. Human-in-the-Loop:

    • User feedback on tool selection

    • Active learning for retrieval improvement

    • Personalized tool preferences

  5. Cross-Model Portability: 

    • Does same index work for GPT-4, Claude, Llama?

    • Model-specific vs. universal retrievers?


8. Industry & Research Impact

8.1 Immediate Industry Applications

Enterprise AI Platforms:

  • Salesforce, Microsoft Copilot, Google Workspace AI

  • Can now support comprehensive tool ecosystems

  • Enables "AI assistant with access to everything"

Developer Tools:

  • IDEs (Cursor, Windsurf, Replit)

  • Can offer hundreds of code-related MCP tools

  • Reduces prompt engineering burden

Vertical AI Solutions:

  • Healthcare: Medical databases, EHR systems, diagnostic tools

  • Finance: Market data, trading APIs, compliance tools

  • Legal: Case databases, document management, research tools


8.2 Research Trajectory

Short-Term (6-12 months):

  • Replication studies with different LLMs

  • Extensions to multi-tool workflows

  • Hierarchical retrieval variants

  • Real-world deployment studies

Medium-Term (1-2 years):

  • Learned retrievers fine-tuned on tool selection

  • Tool usage patterns and popularity metrics

  • Personalization and context-aware retrieval

  • Integration with agent frameworks (AutoGPT, LangGraph)

Long-Term (2-5 years):

  • Tool composition and workflow synthesis

  • Automated tool discovery and registration

  • Cross-modal tool use (vision, audio, robotics)

  • Standardization beyond MCP (Tool Discovery Protocol?)


8.3 Broader AI Implications

Paradigm Shift: From "what can this model do?" to "what can this model access?"

  • Success increasingly depends on ecosystem connectivity

  • Model capabilities plateau; tool access becomes differentiator

Democratization: Smaller models with good retrieval can rival larger models

  • 7B model + comprehensive tools > 70B model + limited tools

  • Reduces compute requirements for capable agents

Safety Considerations:

  • Tool access increases potential for harm

  • RAG-MCP's selective activation may improve safety (smaller attack surface)

  • But retrieval errors could lead to unintended tool use


9. Conclusion & Recommendations

9.1 Key Takeaways

  1. Problem Validation: Prompt bloat is real, measurable, and worsening with ecosystem growth

  2. Solution Viability: RAG-MCP demonstrates retrieval-based tool selection works at scale

  3. Production Readiness: Architecture addresses practical deployment concerns (cost, latency, extensibility)

  4. Ecosystem Enabler: Removes ceiling on MCP ecosystem growth, validating universal protocol vision

  5. Critical Scaling Threshold: Empirically, both baselines and RAG-MCP show sharp degradation once the candidate MCP pool exceeds roughly 100 tools; this supports treating retrieval-based selection as mandatory rather than optional beyond that scale


9.2 For Practitioners

Suggested Actions:

  • Implement RAG-MCP pattern for systems with >20 tools

  • Build/maintain tool metadata indices

  • Monitor retrieval quality metrics

  • Plan for scaling beyond current tool counts


Long-Term Steps:

  • Invest in semantic tool representation

  • Develop tool discovery infrastructure

  • Consider hierarchical organization for large tool libraries

  • Prepare for multi-tool workflow orchestration


9.3 For Researchers

High-Priority Areas:

  1. Retriever optimization for tool selection

  2. Multi-tool workflow retrieval

  3. Evaluation benchmarks for large-scale tool use

  4. Safety and robustness in retrieved tool execution


Open Datasets Needed:

  • Large-scale tool selection benchmarks

  • Real-world tool usage patterns

  • Multi-tool workflow traces


9.4 Final Assessment

RAG-MCP represents a critical architectural contribution to the productionization of tool-augmented LLMs. While the core insight—apply RAG to tool discovery—is conceptually simple, its execution demonstrates deep understanding of both the technical challenge (prompt bloat, selection complexity) and the ecosystem dynamics (MCP proliferation, production constraints).

The paper's timing is excellent, arriving just as the MCP ecosystem reaches an inflection point where naive approaches fail. Its 3.2× accuracy improvement and ≈50% token reduction are not incremental gains but paradigm-enabling results that make previously intractable applications feasible.

Rating: 4.5/5

  • Novel application of RAG principles ⭐⭐⭐⭐⭐

  • Rigorous experimental methodology ⭐⭐⭐⭐⭐

  • Production-ready architecture ⭐⭐⭐⭐⭐

  • Limited task diversity in evaluation ⭐⭐⭐

  • Needs broader LLM validation ⭐⭐⭐⭐


Recommended for: Researchers and practitioners working on tool-augmented LLMs, agentic AI systems, MCP integration, and production AI deployment.

References

Primary Paper: Gan, T., & Sun, Q. (2025). RAG-MCP: Mitigating Prompt Bloat in LLM Tool Selection via Retrieval-Augmented Generation. arXiv:2505.03275.


Related Work:

  • Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS.

  • Schick, T., et al. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools. NeurIPS.

  • Yao, S., et al. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. ICLR.

  • Patil, S.G., et al. (2024). Gorilla: Large Language Model Connected with Massive APIs. NeurIPS.

  • Luo, Z., et al. (2025). Evaluation Report on MCP Servers. arXiv:2504.11094.


Paper: RAG-MCP: Mitigating Prompt Bloat in LLM Tool Selection

Authors: Tiantian Gan and Qiyao Sun (Beijing University of Posts and Telecommunications + Queen Mary University of London)
Publication Date: May 6, 2025
arXiv: https://arxiv.org/abs/2505.03275


Arindam Banerji

 
 
 

bottom of page