
Rethinking AI Agent Architecture: A Case Study in Code Execution Over Protocol Abstraction

  • Writer: Hive Research Institute

An empirical analysis of achieving 98% token reduction through architectural evolution




The Convergence of Independent Discovery


One week before Anthropic published their groundbreaking blog post on November 4, 2025¹, an engineer had independently reached the same conclusion and implemented the solution they would soon advocate. The empirical results were striking—the agent not only produced significantly better results and operated more autonomously, but it consumed 98% fewer tokens. This represented not incremental improvement but a fundamental rethinking of how AI agents should interact with external systems.


The Model Context Protocol (MCP), despite becoming an industry standard since its November 2024 launch², had revealed fundamental limitations that were constraining agent performance. This independent implementation would soon be validated by one of the leading AI research organizations in the world, demonstrating that multiple practitioners were converging on the same architectural insights through practical experience.


Deconstructing the MCP Problem: Systematic Analysis


The Model Context Protocol represents an open standard for connecting AI agents to external systems³. At the architectural level, MCPs function as APIs—the primary distinction being that standard APIs are typically developed for human developers to use, while MCPs are designed for AI agent consumption. On a technical level, the differences are minimal.


The breakthrough with MCPs wasn't the protocol itself but rather its emergence as an industry standard. Since its launch, adoption has been rapid: the community has built thousands of MCP servers, SDKs are available for all major programming languages, and the industry has adopted MCP as the de-facto standard for connecting agents to tools and data⁴. This standardization enabled developers to collaborate and share tools more efficiently, creating substantial initial value.


However, as developers routinely built agents with access to hundreds or thousands of tools across dozens of MCP servers⁵, systematic performance degradation became observable. Loading all tool definitions upfront and passing intermediate results through the context window was measurably slowing down agents and dramatically increasing computational costs.


Quantifying Token Waste: A Two-Factor Analysis


Anthropic's engineering team identified two primary sources of token waste in their November 2025 analysis⁶. First, tool definitions overload the context window. When an agent connects to an MCP server, that server typically exposes twenty to thirty different tools⁷, and production implementations commonly connect five to six MCP servers simultaneously.


Each tool definition includes a description and the parameters required to call it. A typical definition might look like the sketch below:


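The field and tool names in this sketch are illustrative assumptions rather than the exact schema from Anthropic's post:

    // Hypothetical tool definition as it would be presented to the model.
    // Names follow common MCP conventions but are illustrative only.
    const getDocumentTool = {
      name: "google_drive__get_document",
      description:
        "Retrieves a document from Google Drive by its ID and returns its full text content.",
      inputSchema: {
        type: "object",
        properties: {
          documentId: { type: "string", description: "The ID of the document to retrieve" },
        },
        required: ["documentId"],
      },
    };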

Even when the agent required only one tool, it carried the definitions of every tool from every connected MCP server in its context window⁸. This created a triple penalty: increased computational cost, increased latency, and potentially increased hallucination rates, all for functionality the agent might never use. In cases where agents were connected to thousands of tools, they needed to process hundreds of thousands of tokens before even reading a request⁹.


The second factor driving token consumption was intermediate tool results. For example, retrieving a transcript from Google Drive might return fifty thousand tokens¹⁰. Larger documents could exceed context window limits entirely. The agent might only require the first paragraph of the transcript, yet was forced to process the entire document through its context window. As Anthropic documented: "For a 2-hour sales meeting, that could mean processing an additional 50,000 tokens"¹¹.


This wasn't theoretical inefficiency—production systems processing high volumes of daily transactions were experiencing economically unsustainable token costs, and agent performance was degrading measurably as context windows filled with unused tool definitions and unnecessary data.


The Code Execution Solution: Architectural Redesign


What Anthropic proposed—and what had been independently implemented—was code execution enabling agents to call MCP servers on demand¹². The architecture structured this through hierarchical folder organization:


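The specific servers and file names below are illustrative, based on the structure described later in this article (a servers directory with one subdirectory per MCP server and one TypeScript file per tool):

    servers/
    ├── google-drive/
    │   ├── getDocument.ts
    │   └── index.ts
    └── notion/
        ├── createPage.ts
        ├── updatePage.ts
        └── index.ts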

Each tool corresponds to a file containing the implementation:


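As a sketch rather than the exact code from the source, each file can export a thin typed wrapper that forwards the call through the MCP client; the callMCPTool helper and the type names are assumptions:

    // servers/google-drive/getDocument.ts (illustrative sketch)
    import { callMCPTool } from "../../client";  // hypothetical shared MCP client helper

    export interface GetDocumentInput {
      documentId: string;
    }

    export interface GetDocumentResponse {
      content: string;
    }

    // The description and parameters live in this file; the agent reads it
    // only when it actually needs the tool.
    export async function getDocument(
      input: GetDocumentInput
    ): Promise<GetDocumentResponse> {
      return callMCPTool<GetDocumentResponse>("google_drive__get_document", input);
    }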

When the agent required a specific tool, it could import that tool from the appropriate folder and execute it in code¹³. This approach eliminated passing all tools into the agent's context window—only the required tool was loaded.
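
The code the agent then writes at runtime can be as small as importing the single wrapper it needs; a minimal sketch, reusing the hypothetical getDocument wrapper above:

    // Agent-generated script (illustrative): only the one needed tool enters context.
    import { getDocument } from "./servers/google-drive/getDocument";

    const doc = await getDocument({ documentId: "abc123" });

    // Log only what the model actually needs to see.
    console.log(doc.content.slice(0, 500));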


The results were immediate and quantifiable. As Anthropic documented, this approach "reduces the token usage from 150,000 tokens to 2,000 tokens—a time and cost saving of 98.7%"¹⁴. The independent implementation achieved virtually identical results, validating the architectural approach through parallel discovery.


Practical Implementation: Building a Sales Operations Agent


A comprehensive implementation tutorial demonstrated the practical application of this architectural approach³⁹. The test case involved building a sales operations agent that could read meeting transcripts from Google Drive and attach them to a CRM system (Notion) without actually reading the full contents of the file⁴⁰.


Architecture Components:

  • One primary agent with code execution capabilities

  • Two built-in tools: IPython interpreter and persistent shell tool⁴¹

  • Connection to Google Drive MCP server

  • Connection to Notion MCP server (functioning as CRM)⁴²


The implementation utilized a specialized command called "MCP code execution" that enabled developers to add MCP servers using the new pattern⁴³. This command provided comprehensive instructions to AI coding assistants (like Cursor or Claude Code) about the architectural pattern, linking to Anthropic's blog post and relevant resources.


Implementation Process:

  1. Server Directory Creation: The system created a servers directory containing subdirectories for each MCP server (Google Drive, Notion)⁴⁴

  2. Tool File Generation: Individual tools were saved as TypeScript files with descriptions and parameters embedded in code rather than passed to the agent's context window⁴⁵

  3. Authentication Setup: OAuth authentication was configured for servers requiring it⁴⁶

  4. Tool Count: 15 tools generated for Notion MCP server, 4 tools for Google Drive⁴⁷


The critical distinction: tool descriptions were saved in code files that agents read on-demand, rather than loading all tool definitions into the context window upfront⁴⁸.


Prompting Strategy: The Key to Effective Implementation


The implementation revealed that prompting strategy is crucial for this architectural approach⁴⁹. The agent's instructions must clearly define the operational process:


Recommended Workflow:

  1. Check available skills in the /mnt/skills folder first⁵⁰

  2. Use existing skill if found for the specific task

  3. If no skills exist, read only the one tool file needed

  4. Combine multiple tools as necessary to complete the task

  5. Provide suggestions for new skills to be added⁵¹


This workflow enables progressive skill development, where agents build capabilities over time rather than starting from scratch with each task.
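
As a rough illustration rather than the tutorial's exact prompt, this workflow can be encoded directly in the agent's system instructions:

    You have a code execution environment. MCP tools are available as
    TypeScript files under ./servers/<server-name>/<tool-name>.ts.

    1. Before anything else, list /mnt/skills and check whether an existing
       skill already covers the task.
    2. If a matching skill exists, run it rather than re-deriving the workflow.
    3. Otherwise, read only the tool files you actually need, write a short
       script that combines them, and execute it.
    4. Keep large intermediate results inside the execution environment; log
       only summaries or the specific fields you need.
    5. After completing a new multi-step task, suggest saving it as a skill
       in /mnt/skills.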


Cloudflare's Validation: Converging Evidence


Cloudflare published similar findings in September 2025, referring to code execution with MCP as "Code Mode"¹⁵. Their analysis reached identical conclusions: "LLMs are better at writing code to call MCP, than at calling MCP directly"¹⁶.


Cloudflare's research identified why this approach proved superior: "LLMs have seen a lot of code. They have not seen a lot of 'tool calls'. In fact, the tool calls they have seen are probably limited to a contrived training set constructed by the LLM's own developers, in order to try to train it. Whereas they have seen real-world code from millions of open source projects"¹⁷.


Their team found that agents could "handle many more tools, and more complex tools, when those tools are presented as a TypeScript API rather than directly"¹⁸. This empirical validation from multiple independent sources—Anthropic, Cloudflare, and independent implementations—provided compelling evidence for the architectural shift.


Progressive Disclosure: Eliminating Context Window Constraints


The token savings were only the first of several observable improvements. The architecture also enabled progressive disclosure, removing context window constraints entirely¹⁹. Models excel at navigating filesystems, allowing them to read tool definitions on demand rather than loading them all upfront.


Alternatively, a search_tools function could be added to enable agents to find relevant definitions dynamically²⁰. The agent could access thousands of MCP servers, using search capabilities to discover the specific MCP required for any given task. This fundamentally altered the scalability equation from being constrained by context window size to being constrained only by search and discovery capabilities—a significantly more tractable problem.
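
A minimal sketch of such a helper, assuming tool wrappers live under a servers directory and a simple keyword match over file contents is sufficient (the function and paths are hypothetical):

    // searchTools.ts (illustrative): lets the agent discover tool files on demand.
    import { readdirSync, readFileSync } from "node:fs";
    import { join } from "node:path";

    export function searchTools(query: string, root = "./servers"): string[] {
      const needle = query.toLowerCase();
      const matches: string[] = [];
      for (const server of readdirSync(root)) {
        const dir = join(root, server);
        for (const file of readdirSync(dir)) {
          if (!file.endsWith(".ts")) continue;
          const source = readFileSync(join(dir, file), "utf8").toLowerCase();
          // Naive keyword match over the file name and its embedded description.
          if (file.toLowerCase().includes(needle) || source.includes(needle)) {
            matches.push(join(dir, file));
          }
        }
      }
      return matches;
    }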


Context-Efficient Tool Results: Data Filtering in Execution Environment


When working with large datasets, agents could filter and transform results in code before returning them²¹. Consider fetching a 10,000-row spreadsheet:


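A sketch of the filtering step, assuming a hypothetical getSheet wrapper that returns rows as objects; the wrapper, field names, and filter condition are illustrative:

    // Agent-generated script (illustrative): filter in the sandbox, not in context.
    import { getSheet } from "./servers/google-drive/getSheet";  // hypothetical wrapper

    const { rows } = await getSheet({ sheetId: "abc123" });      // e.g. 10,000 rows
    const pending = rows.filter((row) => row.status === "pending");

    // Only this small slice ever reaches the model's context window.
    console.log(`Found ${pending.length} matching rows; first 5:`);
    console.log(pending.slice(0, 5));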

The agent processes five rows instead of 10,000²². Similar patterns work for aggregations, joins across multiple data sources, or extracting specific fields—all without bloating the context window.


Privacy Architecture: Enterprise-Grade Data Protection


The privacy benefits became immediately apparent in enterprise client implementations. Organizations requiring data protection don't permit sensitive information exposure to third-party model providers²³. Traditional MCP connections automatically expose all data during API interactions.


However, with agents using MCP servers in code, intermediate results stay in the execution environment by default²⁴. The agent only sees what is explicitly logged or returned, meaning sensitive data can flow through workflows without entering the model's context.


For even more sensitive workloads, the agent harness can tokenize sensitive data automatically²⁵. For example, when importing customer contact details from a spreadsheet into Salesforce, the MCP client can intercept data and tokenize personally identifiable information before it reaches the model. Real email addresses, phone numbers, and names flow from source to destination without passing through the model, preventing accidental logging or processing of sensitive data.
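
A minimal sketch of how such a harness might work, assuming a simple regex-based scheme; the hooks and token format are assumptions rather than a documented API:

    // Illustrative PII tokenization in the MCP client harness (not a documented API).
    const EMAIL = /[\w.+-]+@[\w-]+\.[\w.]+/g;
    const PHONE = /\+?\d[\d\s().-]{7,}\d/g;

    const vault = new Map<string, string>();  // token -> real value, kept outside the model
    let counter = 0;

    function tokenize(text: string): string {
      const swap = (value: string, kind: string): string => {
        const token = `<${kind}_${++counter}>`;
        vault.set(token, value);
        return token;
      };
      return text
        .replace(EMAIL, (m) => swap(m, "EMAIL"))
        .replace(PHONE, (m) => swap(m, "PHONE"));
    }

    // Tool results are scrubbed before they reach the model; when the agent later
    // writes the data onward (e.g. into Salesforce), the harness restores the
    // real values from the vault.
    function detokenize(text: string): string {
      let out = text;
      for (const [token, value] of vault) out = out.replaceAll(token, value);
      return out;
    }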


State Persistence and Skills: Emergent Capabilities


The most transformative benefit emerged through state persistence and skill development capabilities. Code execution with filesystem access allows agents to maintain state across operations²⁶. Agents can write intermediate results to files, enabling them to resume work and track progress.


More significantly, agents can persist their own code as reusable functions²⁷:


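A sketch of this pattern, assuming the agent has write access to a skills directory; the file name, skill contents, and wrappers are illustrative:

    // Agent-generated code (illustrative): persist a working sequence as a skill.
    import { writeFileSync } from "node:fs";

    const skill = `
    import { getDocument } from "../servers/google-drive/getDocument";
    import { createPage } from "../servers/notion/createPage";

    // Skill: attach a meeting transcript from Google Drive to a Notion CRM page.
    export async function attachTranscriptToCrm(documentId: string, pageTitle: string) {
      const doc = await getDocument({ documentId });
      return createPage({ title: pageTitle, content: doc.content });
    }
    `;

    writeFileSync("/mnt/skills/attach-transcript-to-crm.ts", skill.trimStart());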

This ties closely to Anthropic's concept of Skills—folders of reusable instructions, scripts, and resources for models to improve performance on specialized tasks²⁸. Over time, this allows agents to build toolboxes of higher-level capabilities, evolving the scaffolding needed for optimal performance.


Persistent Storage: The Mount Directory Implementation


The practical implementation revealed the importance of persistent storage for enabling agent evolution⁵². The /mnt directory serves as persistent storage where agents can save skills and reference them across different chat sessions⁵³. Without this capability, agents would be unable to build and reuse skills over time.


When an agent successfully completes a complex task, it can propose creating a new skill, save it to the /mnt/skills directory, and reference it in future operations⁵⁴. This creates a compounding effect where agents become more capable with each task completion.
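
In a later session the agent can then skip the discovery work entirely; a sketch that reuses the hypothetical skill file from the earlier example:

    // Agent-generated code in a later chat session (illustrative).
    import { readdirSync } from "node:fs";
    import { attachTranscriptToCrm } from "/mnt/skills/attach-transcript-to-crm";

    // Step 1 of the workflow: check which skills already exist.
    console.log(readdirSync("/mnt/skills"));

    // Step 2: a matching skill was found, so execute it directly instead of
    // re-reading individual tool files.
    await attachTranscriptToCrm("abc123", "Acme Corp discovery call transcript");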


Skill Evolution Example:


  • First execution: Agent reads individual tool files, combines them, completes task (12,000 tokens consumed)⁵⁵

  • Skill creation: Agent saves successful workflow as reusable skill

  • Subsequent executions: Agent finds existing skill, executes it directly (4,000 tokens consumed)⁵⁶


This represents a 67% token reduction from first to subsequent executions, demonstrating the compounding benefits of skill development.


Empirical Testing: Real-World Token Consumption Analysis


Comprehensive testing of the sales operations agent provided concrete data on token consumption across different architectural approaches⁵⁷:


Test Scenario: Copy transcript from Google Drive and paste into Notion CRM page

  • Direct MCP tool calling: 32,000 tokens⁵⁸

  • Code execution, first run: 12,000 tokens⁵⁹

  • Code execution with existing skill: 4,000 tokens⁶⁰

Key Observations:


  • The direct MCP approach consumed most of its tokens as output tokens (the most expensive kind) because the agent literally typed out the entire transcript⁶¹

  • Code execution first run showed excessive tool calls and unnecessary file reads, indicating need for better prompting⁶²

  • Skill-based execution demonstrated dramatic efficiency gains, validating the agent evolution paradigm⁶³


Tracing and Performance Monitoring


The implementation framework included tracing capabilities enabled by default⁶⁴. Performance analysis revealed:


Direct MCP Agent:

  • Total consumption: 32,000 tokens

  • High output token count (most expensive)

  • Agent manually typed entire transcript content⁶⁵


Code Execution Agent (First Run):

  • Total consumption: 12,000 tokens

  • Excessive tool calls and file reads

  • Successful task completion despite inefficiencies⁶⁶


Code Execution Agent (With Skill):

  • Total consumption: 4,000 tokens

  • Minimal tool calls, direct skill execution

  • 87.5% reduction from traditional MCP approach⁶⁷


These metrics validated Anthropic's theoretical projections while revealing practical optimization opportunities through improved prompting strategies.


Empirical Limitations: Systematic Trade-off Analysis


While both Anthropic and Cloudflare effectively explained the benefits of this approach, comprehensive analysis requires acknowledging limitations. The first limitation is reduced reliability at the individual tool call level. When agents generate code for each tool invocation, error probability increases²⁹.


The practical implementation confirmed this concern. The agent executed code "way too many times," read "way too many unnecessary files," and took longer than expected to perform simple tasks on first attempt⁶⁸. However, the agent still successfully completed tasks and demonstrated capacity for self-improvement through skill creation.


The second limitation is increased infrastructure overhead. As Anthropic notes: "Running agent-generated code requires a secure execution environment with appropriate sandboxing, resource limits, and monitoring. These infrastructure requirements add operational overhead and security considerations that direct tool calls avoid"³⁰.


This represents the "biggest downside" of the code execution approach⁶⁹. Organizations must establish:


  • Secure sandbox environments for code execution

  • Resource limits and monitoring systems

  • Persistent storage infrastructure for skill development

  • Authentication and credential management systems⁷⁰


Cloudflare addressed this challenge through their Workers platform, using V8 isolates instead of containers³¹. Isolates are significantly lighter-weight and faster to start—taking mere milliseconds and consuming only a few megabytes of memory³². This enables creating a new isolate for every piece of code the agent runs, with negligible overhead.


The Fundamental Principle: Minimizing Abstraction Layers


The core insight emerging from multiple independent implementations: all digital interfaces ultimately reduce to code and files. This principle explains why code execution approaches demonstrate superior performance. As Cloudflare articulated: "Making an LLM perform tasks with tool calling is like putting Shakespeare through a month-long class in Mandarin and then asking him to write a play in it. It's just not going to be his best work"³³.


Agents have achieved increasingly sophisticated code generation capabilities over recent years. Continuing to create abstractions on top of other abstractions for agent consumption contradicts the fundamental capabilities of modern AI systems. Each abstraction layer added to an agent measurably reduces that agent's autonomy³⁴. The primary purpose of agent systems is autonomous task execution. Protocol abstraction layers actively work against this fundamental objective.


As the implementation tutorial concluded: "This is the new paradigm. Just let your agents run the code. We no longer have to create these abstractions on top of other abstractions because agents can just generate the code to do whatever they need by themselves"⁷¹.


Use Case Appropriateness: Practical Deployment Guidance


This architectural shift doesn't eliminate all use cases for traditional MCP implementations. For customer support applications where APIs are relatively simple and don't require data transformations—merely transmitting customer support tickets—traditional MCPs remain appropriate³⁵.


The implementation analysis provided specific guidance: "Don't use it for simple agents like customer support. Save it for more sophisticated general agents like analytics, research, or operations. For simple agents, it still doesn't make sense"⁷².


Recommended Use Cases for Code Execution:

  • Analytics agents requiring data transformation and filtering

  • Research agents combining multiple data sources

  • Operations agents executing multi-step workflows

  • General-purpose agents requiring skill development⁷³


Continued Use Cases for Traditional MCP:

  • Simple customer support ticket creation

  • Single-step data retrieval without transformation

  • Applications with minimal tool requirements⁷⁴


Implementation Analysis: Empirical Lessons


Through implementations prior to Anthropic's public validation, several critical lessons about architectural transition emerged. First, the migration requires systematic planning. Organizations cannot simply switch from MCP-based architecture to code execution. Infrastructure requirements are substantial, and agent behavior changes significantly.


Second, reliability concerns are empirically valid but operationally manageable. While code generation introduces potential failure points, autonomous error correction capabilities of modern agents largely compensate for this. Production systems demonstrated that while individual tool calls might fail more frequently, overall task completion rates actually improved because agents could adapt their approach when encountering errors.


Third, prompting is key⁷⁵. The implementation revealed that "while LLMs aren't trained for this new method yet, you have to carefully describe how to use this new pattern, otherwise they're going to make a lot of mistakes"⁷⁶. Proper instruction design directly impacts:


  • Number of unnecessary tool calls

  • File read efficiency

  • Task completion time

  • Token consumption optimization


Fourth, token savings compound dramatically at scale. The 98% reduction in token consumption was not merely a cost saving; it enabled entirely new use cases that were economically infeasible under the previous architecture. Tasks that previously consumed millions of tokens can now run on tens of thousands.


Production Readiness Assessment


The comprehensive testing led to a clear verdict: "Yes, this approach is ready for production, but it requires proper prompting"⁷⁷. The benefits far outweigh the costs when implemented correctly:


Benefits:

  • 87.5% token reduction (traditional to skill-based execution)

  • Unlimited tool scalability through progressive disclosure

  • Agent evolution through skill development

  • Enterprise-grade privacy through execution environment isolation

  • Autonomous capability improvement over time⁷⁸


Requirements:

  • Proper prompting strategy clearly defining workflows

  • Secure sandbox infrastructure for code execution

  • Persistent storage for skill development

  • Monitoring and tracing systems for performance optimization⁷⁹


The implementation demonstrated that "with this repo, the benefits far outweigh the costs. You get way more autonomy and flexibility with only a minor drop in reliability"⁸⁰.


Infrastructure Solutions: Addressing the Deployment Challenge


The infrastructure overhead represents a significant barrier to adoption. However, specialized platforms have emerged to address this challenge. The implementation tutorial noted: "We're the only platform on the market that supports everything you need to run this out of the box"⁸¹, referring to platforms providing:


  • Pre-configured sandbox environments

  • Persistent storage infrastructure (/mnt directory implementation)

  • Built-in tracing and monitoring

  • Authentication management systems

  • Deployment automation⁸²

This infrastructure support reduces the barrier to entry, enabling organizations to adopt code execution architectures without building custom sandbox and monitoring systems.


Comparative Performance Analysis


  • Anthropic's published example: 150,000 tokens reduced to 2,000 tokens (98.7% reduction)¹⁴

  • Sales operations agent, direct MCP vs. skill-based execution: 32,000 tokens reduced to 4,000 tokens (87.5% reduction)⁸³ ⁸⁵

  • Sales operations agent, first code-execution run vs. skill-based execution: 12,000 tokens reduced to 4,000 tokens (66.7% reduction)⁸⁴ ⁸⁵

Systemic Implications: Agent Evolution Paradigm


This architectural shift represents more than efficiency improvements. It enables a new generation of truly autonomous agents capable of learning, adapting, and evolving their capabilities over time. State persistence and skill development capabilities fundamentally change what's achievable with AI agent systems.


Organizations are now deploying agents that don't merely execute predefined workflows—they develop proprietary tools, optimize their own processes, and continuously improve performance metrics. This represents the distinction between automation and genuine artificial intelligence.


The practical implementation demonstrated this evolution in action: an agent that initially required 12,000 tokens to complete a task created a skill that reduced subsequent executions to 4,000 tokens⁸⁶. This self-improvement capability compounds over time as agents build increasingly sophisticated skill libraries.


Future Trajectory: Next-Generation Architecture


The combination of code execution, quantum computing integration, and complex adaptive systems creates possibilities for agent architectures that were computationally infeasible months ago. Future implementations will likely address how code-executing agents can leverage quantum optimization for complex decision-making across millions of simultaneous interactions.


Systematic Recommendations: Implementation Framework


For developers and organizations building AI agent systems, the empirical evidence supports clear recommendations: begin transitioning to code execution architectures immediately. The benefits are too substantial to ignore, and the competitive advantage of more autonomous, efficient agents will only increase in significance.


Implementation Checklist:


  1. Infrastructure Setup

    • Establish secure sandbox environment (or utilize specialized platform)

    • Configure persistent storage for skill development

    • Implement tracing and monitoring systems⁸⁷


  2. Prompting Strategy

    • Define clear workflow for tool discovery and skill usage

    • Specify when to create new skills

    • Provide examples of efficient tool usage patterns⁸⁸


  3. Testing Protocol

    • Compare token consumption against traditional MCP baseline

    • Monitor unnecessary tool calls and file reads

    • Validate skill creation and reuse functionality⁸⁹


  4. Optimization Iteration

    • Refine prompts based on tracing data

    • Identify and eliminate inefficient patterns

    • Build skill library for common operations⁹⁰


The infrastructure overhead is real and measurable. Reliability considerations require systematic attention. However, the 87.5%+ token reduction in production scenarios, unlimited tool scalability, privacy architecture benefits, and agent evolution capabilities make this transition inevitable. The question isn't whether to implement this shift, but how rapidly organizations can execute the migration.


Conclusion: The Parsimony Principle in Practice


The optimal solution often involves removing layers rather than adding them. The most powerful architecture frequently proves to be the simplest: allowing the agent to write code directly. This principle—often called Occam's Razor in scientific methodology—applies with particular force to AI agent architecture.


The empirical evidence from multiple independent sources is conclusive. Code execution architectures demonstrate superior performance across multiple dimensions:


  • Token efficiency: 87.5% reduction in production scenarios (32,000 to 4,000 tokens)⁹¹

  • Scalability: unlimited tools through progressive disclosure

  • Privacy protection: enterprise-grade through execution environment isolation

  • Autonomous capability development: self-improving through skill creation

  • Economic viability: enabling use cases previously infeasible due to token costs⁹²


The practical implementation validated theoretical projections while revealing important nuances: proper prompting is essential, infrastructure overhead is significant but solvable, and the benefits compound over time as agents develop skill libraries.


As Anthropic concludes: "Although many of the problems here feel novel—context management, tool composition, state persistence—they have known solutions from software engineering. Code execution applies these established patterns to agents, letting them use familiar programming constructs to interact with MCP servers more efficiently"³⁸.


Organizations that recognize and act on this architectural shift will establish significant competitive advantages in the rapidly evolving AI agent landscape. The paradigm has shifted: let agents write code directly rather than creating abstractions on top of abstractions. The empirical evidence demonstrates this approach is not just theoretically superior but practically deployable today.


References and Citations


¹ Anthropic Engineering Team. "Code execution with MCP: Building more efficient agents." Anthropic Blog, November 4, 2025. https://www.anthropic.com/engineering/code-execution-with-mcp

² Model Context Protocol. "Introduction to MCP." Model Context Protocol Documentation, November 2024. https://modelcontextprotocol.io/docs/getting-started/intro

³ Anthropic (2025). "The Model Context Protocol (MCP) is an open standard for connecting AI agents to external systems."

⁴ Anthropic (2025). "Since launching MCP in November 2024, adoption has been rapid: the community has built thousands of MCP servers, SDKs are available for all major programming languages, and the industry has adopted MCP as the de-facto standard."

⁵ Anthropic (2025). "Today developers routinely build agents with access to hundreds or thousands of tools across dozens of MCP servers."

⁶ Anthropic (2025). "As MCP usage scales, there are two common patterns that can increase agent cost and latency: 1. Tool definitions overload the context window; 2. Intermediate tool results consume additional tokens."

⁷ Estimated from typical MCP server implementations and Anthropic documentation

⁸ Anthropic (2025). "Most MCP clients load all tool definitions upfront directly into context, exposing them to the model using a direct tool-calling syntax."

⁹ Anthropic (2025). "Tool descriptions occupy more context window space, increasing response time and costs. In cases where agents are connected to thousands of tools, they'll need to process hundreds of thousands of tokens before reading a request."

¹⁰ Anthropic (2025). Example of Google Drive transcript retrieval consuming 50,000 tokens

¹¹ Anthropic (2025). Direct quote from blog post

¹² Anthropic (2025). "With code execution environments becoming more common for agents, a solution is to present MCP servers as code APIs rather than direct tool calls."

¹³ Anthropic (2025). Code example showing TypeScript file structure for MCP tools

¹⁴ Anthropic (2025). "This reduces the token usage from 150,000 tokens to 2,000 tokens—a time and cost saving of 98.7%."

¹⁵ Varda, Kenton and Pai, Sunil. "Code Mode: the better way to use MCP." Cloudflare Blog, September 26, 2025. https://blog.cloudflare.com/code-mode/

¹⁶ Cloudflare (2025). Direct quote from blog post

¹⁷ Cloudflare (2025). "LLMs have seen a lot of code. They have not seen a lot of 'tool calls'."

¹⁸ Cloudflare (2025). "We found agents are able to handle many more tools, and more complex tools, when those tools are presented as a TypeScript API rather than directly."

¹⁹ Anthropic (2025). "Models are great at navigating filesystems. Presenting tools as code on a filesystem allows models to read tool definitions on-demand, rather than reading them all up-front."

²⁰ Anthropic (2025). "Alternatively, a search_tools tool can be added to the server to find relevant definitions."

²¹ Anthropic (2025). "When working with large datasets, agents can filter and transform results in code before returning them."

²² Anthropic (2025). Example showing 10,000-row spreadsheet filtering to 5 rows

²³ Anthropic (2025). "When agents use code execution with MCP, intermediate results stay in the execution environment by default."

²⁴ Anthropic (2025). "This way, the agent only sees what you explicitly log or return, meaning data you don't wish to share with the model can flow through your workflow without ever entering the model's context."

²⁵ Anthropic (2025). "For even more sensitive workloads, the agent harness can tokenize sensitive data automatically."

²⁶ Anthropic (2025). "Code execution with filesystem access allows agents to maintain state across operations."

²⁷ Anthropic (2025). "Agents can also persist their own code as reusable functions."

²⁸ Anthropic (2025). "This ties in closely to the concept of Skills, folders of reusable instructions, scripts, and resources for models to improve performance on specialized tasks."

²⁹ Implementation observations and Anthropic (2025) acknowledgment of trade-offs

³⁰ Anthropic (2025). Direct quote on infrastructure requirements

³¹ Cloudflare (2025). "The Cloudflare Workers platform has always been based on V8 isolates, that is, isolated JavaScript runtimes powered by the V8 JavaScript engine."

³² Cloudflare (2025). "Isolates are far more lightweight than containers. An isolate can start in a handful of milliseconds using only a few megabytes of memory."

³³ Cloudflare (2025). Direct quote comparing tool calling to language learning

³⁴ Implementation observations validated by both Anthropic and Cloudflare analysis

³⁵ Anthropic (2025) and practical implementation guidance

³⁶ Anthropic (2025). Direct measurement from blog post example

³⁷ Anthropic (2025). Confirmed 98.7% token reduction measurement

³⁸ Anthropic (2025). Concluding statement from blog post

³⁹ Implementation Tutorial. "MCP Code Execution Implementation Guide." YouTube video transcript, November 2025.

⁴⁰ Implementation Tutorial (2025). "The agent that we will build today is the sales operations agent from Anthropic's blog post that can read meeting transcripts and then attach them into your CRM without actually reading the contents of the file."

⁴¹ Implementation Tutorial (2025). "Create a sales ops agent with two built-in tools: IPython interpreter and persistent shell tool."

⁴² Implementation Tutorial (2025). Architecture description: Google Drive MCP and Notion MCP connections

⁴³ Implementation Tutorial (2025). "I added a new special command called MCP code execution, which you can use to add MCP servers to an agent using this new pattern."

⁴⁴ Implementation Tutorial (2025). "You can see how now cursor created the server directory with the code for our servers."

⁴⁵ Implementation Tutorial (2025). "This server is made here directly in code instead of passing this description and the arguments directly into the agent's context window."

⁴⁶ Implementation Tutorial (2025). "You might also need to authenticate the server if it's using OAuth."

⁴⁷ Implementation Tutorial (2025). "Cursor created 15 tools for Notion MCP server and four tools for Google Drive."

⁴⁸ Implementation Tutorial (2025). "Whenever our agent needs to use this tool, it will simply read this file and then it will see the whole description and the arguments in order to use this tool."

⁴⁹ Implementation Tutorial (2025). "Prompting is key with this technique."

⁵⁰ Implementation Tutorial (2025). "The first step is to always check available skills in the /mnt/skills folder."

⁵¹ Implementation Tutorial (2025). Workflow description including skill checking, tool reading, and skill suggestion

⁵² Implementation Tutorial (2025). "The reason you need to use this mount directory is because we now have persistent storage on our platform."

⁵³ Implementation Tutorial (2025). "In this mount directory essentially your agents can now save files like this and reference them across different chats."

⁵⁴ Implementation Tutorial (2025). "The agent told me that it saved the skills in the mount skills directory."

⁵⁵ Implementation Tutorial (2025). Token consumption for code execution first run

⁵⁶ Implementation Tutorial (2025). Token consumption for skill-based execution

⁵⁷ Implementation Tutorial (2025). "Let's take a look at the traces and analyze the costs."

⁵⁸ Implementation Tutorial (2025). "This agent consumed 32,000 tokens. This is just insane."

⁵⁹ Implementation Tutorial (2025). "With this new approach, the amount of tokens is only 12,000."

⁶⁰ Implementation Tutorial (2025). "When the agent used one of the existing skills, it only consumed 4,000 tokens to perform the same task, which is now around 10 times less."

⁶¹ Implementation Tutorial (2025). "Most of these tokens aren't even input tokens. It's also a lot of output tokens, which are extremely expensive."

⁶² Implementation Tutorial (2025). "This agent, as you can see, just performs way too many unnecessary tool calls."

⁶³ Implementation Tutorial (2025). Skill-based execution validation and efficiency gains

⁶⁴ Implementation Tutorial (2025). "Tracing is enabled by default in our framework."

⁶⁵ Implementation Tutorial (2025). Direct MCP agent performance analysis

⁶⁶ Implementation Tutorial (2025). Code execution first run performance analysis

⁶⁷ Implementation Tutorial (2025). Skill-based execution performance analysis

⁶⁸ Implementation Tutorial (2025). "The agent executed code way too many times. It read way too many unnecessary files and it just took too long to perform a simple task on the first attempt."

⁶⁹ Implementation Tutorial (2025). "Infrastructure overhead is actually the biggest downside of this approach."

⁷⁰ Implementation Tutorial (2025). Infrastructure requirements discussion

⁷¹ Implementation Tutorial (2025). "This is the new paradigm. Just let your agents run the code."

⁷² Implementation Tutorial (2025). "Don't use it for simple agents like customer support. Save it for more sophisticated general agents."

⁷³ Implementation Tutorial (2025). Use case recommendations for code execution approach

⁷⁴ Implementation Tutorial (2025). Use case recommendations for traditional MCP

⁷⁵ Implementation Tutorial (2025). "Prompting here is key."

⁷⁶ Implementation Tutorial (2025). "While the LLMs aren't trained for this new method yet, you have to carefully describe how to use this new pattern."

⁷⁷ Implementation Tutorial (2025). "My final verdict is yes, this approach is ready for production, but it requires proper prompting."

⁷⁸ Implementation Tutorial (2025). Benefits enumeration and analysis

⁷⁹ Implementation Tutorial (2025). Requirements and considerations for production deployment

⁸⁰ Implementation Tutorial (2025). "The benefits far outweigh the costs. You get way more autonomy and flexibility with only a minor drop in reliability."

⁸¹ Implementation Tutorial (2025). "We're the only platform on the market that supports everything you need to run this out of the box."

⁸² Implementation Tutorial (2025). Platform capabilities description

⁸³ Implementation Tutorial (2025). Direct MCP token consumption measurement

⁸⁴ Implementation Tutorial (2025). Code execution first run token consumption measurement

⁸⁵ Implementation Tutorial (2025). Skill-based execution token consumption measurement

⁸⁶ Implementation Tutorial (2025). Agent self-improvement demonstration through skill creation

⁸⁷ Implementation Tutorial (2025). Infrastructure setup requirements

⁸⁸ Implementation Tutorial (2025). Prompting strategy recommendations

⁸⁹ Implementation Tutorial (2025). Testing protocol description

⁹⁰ Implementation Tutorial (2025). Optimization iteration process

⁹¹ Implementation Tutorial (2025). Production token reduction measurements (32,000 to 4,000 tokens)

⁹² Implementation Tutorial (2025). Comprehensive benefits analysis and economic viability assessment



This comprehensive analysis synthesizes empirical findings from multiple independent sources—Anthropic, Cloudflare, and practical implementations—providing an evidence-based framework for understanding the architectural evolution from traditional MCP tool calling to code execution approaches in AI agent systems. The addition of practical implementation data validates theoretical projections while revealing critical considerations for production deployment.
