Claude Managed Agents and MCP in production

Running Claude agents in a demo is straightforward. Running them reliably under production load — with cost controls, secure sandboxes, private network access, and real monitoring — is a different problem entirely. This guide walks you through deploying Claude Managed Agents backed by MCP connectors in a production environment, covering the patterns that actually hold up at scale.

If you’re new to the orchestration model, start with the Claude Managed Agents multi-agent orchestration guide before continuing here.


What changes when you move Claude agents to production

In development, you call the API, get a response, and move on. In production, you’re dealing with:

  • Concurrent agent runs that share token budgets
  • Tool calls hitting internal APIs behind VPCs
  • Runaway agents that burn through credits on a single bad prompt
  • No visibility into which sub-agent failed and why

Claude Managed Agents (available via the Anthropic API with claude-opus-4-8 and claude-sonnet-4-5 as of mid-2026) give you a structured way to handle these concerns. The agent runtime manages tool use loops, sub-agent delegation, and state — but you still own the infrastructure around it.


Configuring a Managed Agent with effort levels and token caching

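The first production lever you should pull is effort level. Claude’s API supports a thinking budget (budget_tokens) that caps how much extended reasoning the model applies before responding; "effort" in the code below is simply a label mapped to a budget. A larger budget tends to help on hard tasks, but it also consumes more tokens and therefore costs more.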

import anthropic

client = anthropic.Anthropic()

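MAX_TOKENS = 16000  # total output cap, thinking included

def run_agent(task: str, effort: str = "medium"):
    # Thinking budget per effort label. The API minimum is 1024 tokens, and
    # budget_tokens must stay below max_tokens, so "high" is capped under
    # MAX_TOKENS. To go higher, raise MAX_TOKENS and consider streaming.
    budget_map = {
        "low": 1024,
        "medium": 8000,
        "high": 12000,
    }
    if effort not in budget_map:
        raise ValueError(f"unknown effort level: {effort!r}")

    response = client.beta.messages.create(
        model="claude-opus-4-8",
        max_tokens=MAX_TOKENS,
        thinking={
            "type": "enabled",
            "budget_tokens": budget_map[effort],
        },
        system="You are a production operations agent. Be concise and precise.",
        messages=[{"role": "user", "content": task}],
        betas=["managed-tools-2025-06-01"],
    )
    return response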

For high-volume workflows — like processing 500 support tickets per day — use low effort for classification tasks and high effort only when a ticket escalates. This alone can cut token costs by 60-70% on typical workloads.
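
One way to encode that routing rule is a small lookup that picks the effort label before calling run_agent. This is only a sketch: the stage names and the escalated flag are hypothetical and would come from your own ticket schema.

```python
# Hypothetical ticket stages mapped to effort labels; adjust to your schema.
EFFORT_BY_STAGE = {
    "classification": "low",
    "triage": "medium",
    "escalation": "high",
}


def choose_effort(stage: str, escalated: bool = False) -> str:
    """Return the effort label to pass to run_agent for a ticket."""
    if escalated:
        # An escalated ticket always gets the largest thinking budget.
        return "high"
    # Unknown stages fall back to "medium" rather than failing the run.
    return EFFORT_BY_STAGE.get(stage, "medium")
```

A call site would then look like run_agent(ticket_text, effort=choose_effort("classification")).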

Token caching is the other side of the cost equation. If your system prompt is long (tool schemas, API docs, persona instructions), mark it as cacheable:

response = client.beta.messages.create(
    model="claude-opus-4-8",
    max_tokens=8000,
    system=[
        {
            "type": "text",
            "text": LARGE_SYSTEM_PROMPT,  # 2000+ tokens
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": task}],
    betas=["managed-tools-2025-06-01", "prompt-caching-2024-07-31"],
)

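Ephemeral cache entries have a 5-minute TTL by default, and each cache hit refreshes it. For agents that process tasks in batches, structure your queue so consecutive tasks run within that window: cache reads are billed at roughly 10% of the normal input rate, while the first request that writes the cache costs somewhat more than an uncached one.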
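
To get a feel for when batching pays off, the arithmetic can be sketched as below. The 10% read rate comes from the figure above; the 1.25x write premium is an assumption, so check both against current pricing before relying on the result.

```python
def relative_input_cost(
    prompt_tokens: int,
    runs: int,
    read_rate: float = 0.10,   # cache reads, relative to the base input rate
    write_rate: float = 1.25,  # assumed premium for the first (cache-writing) call
) -> float:
    """Cost of a cached prompt prefix over `runs` calls, relative to no caching.

    Returns 1.0 when caching makes no difference, less than 1.0 when it saves
    money. Assumes every call after the first lands inside the cache TTL.
    """
    if runs < 1:
        raise ValueError("runs must be at least 1")
    uncached = prompt_tokens * runs
    cached = prompt_tokens * write_rate + prompt_tokens * read_rate * (runs - 1)
    return cached / uncached
```

Under these assumed rates a single isolated call is slightly more expensive with caching than without, while ten calls inside the window bring the prefix cost down to roughly a fifth. That is the reason to group tasks rather than trickle them through the queue.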


Setting up MCP connectors for private network access

MCP (Model Context Protocol) is how you give Claude access to tools and data sources. In production, the challenge is that your tools often live inside a VPC — not accessible from Anthropic’s infrastructure.

The solution is an MCP tunnel: a lightweight server you run at the edge that proxies requests from Claude’s managed runtime to your internal services.

Here’s a minimal MCP server deployed to Cloudflare Workers that tunnels to an internal API:

// mcp-tunnel/worker.js
export default {
  async fetch(request, env) {
    const url = new URL(request.url);

    // Validate the shared secret Anthropic sends
    const authHeader = request.headers.get("X-MCP-Auth");
    if (authHeader !== env.MCP_SHARED_SECRET) {
      return new Response("Unauthorized", { status: 401 });
    }

    if (url.pathname === "/tools/list") {
      return Response.json({
        tools: [
          {
            name: "query_internal_db",
            description: "Query the internal customer database",
            inputSchema: {
              type: "object",
              properties: {
                sql: { type: "string", description: "Read-only SQL query" },
              },
              required: ["sql"],
            },
          },
        ],
      });
    }

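    if (url.pathname === "/tools/call") {
      const body = await request.json();
      // Forward to your internal API via a private tunnel (e.g., Cloudflare Tunnel)
      const internalResponse = await fetch(
        `${env.INTERNAL_API_URL}/query`,
        {
          method: "POST",
          headers: { "Content-Type": "application/json" },
          body: JSON.stringify(body.input),
        }
      );
      // Surface upstream failures instead of parsing an error page as JSON
      if (!internalResponse.ok) {
        return Response.json(
          { content: [{ type: "text", text: `Internal API returned ${internalResponse.status}` }] },
          { status: 502 }
        );
      }
      const data = await internalResponse.json();
      return Response.json({ content: [{ type: "text", text: JSON.stringify(data) }] });
    }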

    return new Response("Not Found", { status: 404 });
  },
};

Deploy this with wrangler deploy, set up a Cloudflare Tunnel pointing to your internal service, and you have a secure bridge — no inbound firewall rules needed on your VPC.

Register the MCP server when creating your agent session:

response = client.beta.messages.create(
    model="claude-opus-4-8",
    max_tokens=8000,
    tools=[
        {
            "type": "mcp",
            "server_url": "https://your-worker.workers.dev",
            "server_label": "internal-db",
            "headers": {"X-MCP-Auth": MCP_SHARED_SECRET},
            "allowed_tools": ["query_internal_db"],
        }
    ],
    messages=[{"role": "user", "content": task}],
    betas=["managed-tools-2025-06-01"],
)

The allowed_tools filter is important — don’t expose every tool your MCP server has to every agent task.
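
A minimal sketch of per-task filtering follows. The dictionary shape simply mirrors the registration example above and has not been checked against the API separately; the task-type names and the TOOLS_BY_TASK mapping are hypothetical.

```python
# Hypothetical mapping from task type to the MCP tools that task may call.
TOOLS_BY_TASK = {
    "customer_lookup": ["query_internal_db"],
    "summarize_ticket": [],  # needs no internal access at all
}


def mcp_tool_config(task_type: str, server_url: str, secret: str) -> list[dict]:
    """Build the tools list for one task, exposing only allowlisted tools."""
    allowed = TOOLS_BY_TASK.get(task_type, [])
    if not allowed:
        # Unknown or tool-less tasks get no MCP server at all.
        return []
    return [
        {
            "type": "mcp",
            "server_url": server_url,
            "server_label": "internal-db",
            "headers": {"X-MCP-Auth": secret},
            "allowed_tools": list(allowed),
        }
    ]
```

Defaulting to an empty list means a newly added task type has no access to internal tools until someone grants it explicitly.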


Integrating with AWS and Google Workspace via MCP

For AWS, the pattern is to wrap Boto3 calls in an MCP server running on a Lambda or ECS task with an appropriate IAM role. Avoid giving the agent broad permissions — scope the IAM policy to exactly what the task needs:

# iam-policy.yaml
Version: "2012-10-17"
Statement:
  - Effect: Allow
    Action:
      - s3:GetObject
      - s3:ListBucket
    Resource:
      - arn:aws:s3:::your-data-bucket
      - arn:aws:s3:::your-data-bucket/*
  - Effect: Allow
    Action:
      - cloudwatch:PutMetricData
    Resource: "*"
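
The tool functions behind such an MCP server can stay equally narrow. The sketch below assumes a Boto3 S3 client is passed in (for example boto3.client("s3")). The function names, the hard-coded bucket, and the size cap are illustrative choices, not part of any SDK.

```python
ALLOWED_BUCKET = "your-data-bucket"  # the same bucket the IAM policy grants


def list_keys(s3_client, prefix: str = "", limit: int = 50) -> list[str]:
    """List up to `limit` object keys under `prefix` in the allowed bucket."""
    resp = s3_client.list_objects_v2(
        Bucket=ALLOWED_BUCKET, Prefix=prefix, MaxKeys=limit
    )
    # "Contents" is absent when nothing matches the prefix.
    return [obj["Key"] for obj in resp.get("Contents", [])]


def read_object(s3_client, key: str, max_bytes: int = 1_000_000) -> str:
    """Return at most `max_bytes` of an object as text.

    The cap keeps a single tool call from flooding the agent's context.
    """
    resp = s3_client.get_object(Bucket=ALLOWED_BUCKET, Key=key)
    return resp["Body"].read(max_bytes).decode("utf-8", errors="replace")
```

Because the bucket name is fixed in code and not taken from the tool input, the agent cannot ask for a different bucket even if the IAM policy is later loosened by mistake.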

For Google Workspace, use a service account with domain-wide delegation, and expose Gmail/Drive/Sheets operations through your MCP server. The key is to implement scoped sessions: when Claude calls read_gmail, your MCP server checks which user context the request is for and only returns that user’s data.

# mcp_google_workspace.py
from googleapiclient.discovery import build
from google.oauth2 import service_account

def read_gmail(user_email: str, query: str, max_results: int = 10):
    creds = service_account.Credentials.from_service_account_file(
        "service-account.json",
        scopes=["https://www.googleapis.com/auth/gmail.readonly"],
    ).with_subject(user_email)  # impersonate the specific user

    service = build("gmail", "v1", credentials=creds)
    results = service.users().messages().list(
        userId="me", q=query, maxResults=max_results
    ).execute()
    return results.get("messages", [])

Never let the agent pass arbitrary user_email values from user input directly to with_subject() — validate against an allowlist first.
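
A minimal version of that check might look like the following. The addresses in ALLOWED_MAILBOXES are placeholders, and validated_subject is a hypothetical helper you would call before read_gmail.

```python
# Placeholder allowlist; in practice load this from configuration.
ALLOWED_MAILBOXES = frozenset({"support@example.com", "billing@example.com"})


def validated_subject(user_email: str) -> str:
    """Return the normalized address if it is allowlisted, else raise."""
    normalized = user_email.strip().lower()
    if normalized not in ALLOWED_MAILBOXES:
        # Refuse before any credentials are built for impersonation.
        raise PermissionError(f"mailbox not allowlisted: {normalized!r}")
    return normalized
```

The tool handler would then call read_gmail(validated_subject(requested_email), query), so a prompt-injected address fails before with_subject() is ever reached.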


Monitoring agent performance in high-volume workflows

Standard application monitoring doesn’t capture what you need for agents. You want to know: which tool calls are slow, which tasks exhaust their token budget, and where the agent is making unexpected decisions.

Log structured data on every agent run:

import time
import json

def run_monitored_agent(task_id: str, task: str, effort: str = "medium"):
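    start = time.monotonic()  # monotonic clock: unaffected by system clock changes

    response = run_agent(task, effort)

    duration = time.monotonic() - start
    input_tokens = response.usage.input_tokens
    output_tokens = response.usage.output_tokens
    # The field can be missing or None when nothing was read from cache
    cache_read = getattr(response.usage, "cache_read_input_tokens", None) or 0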

    log_entry = {
        "task_id": task_id,
        "effort": effort,
        "duration_seconds": round(duration, 2),
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "cache_read_tokens": cache_read,
        "stop_reason": response.stop_reason,
        "model": response.model,
    }

    print(json.dumps(log_entry))  # ship to CloudWatch / Datadog / Loki
    return response

Send these logs to CloudWatch or Datadog, then set alerts on:

  • stop_reason == "max_tokens" — agent ran out of budget, result may be incomplete
  • duration_seconds > 30 — investigate which tool call is blocking
  • cache_read_tokens == 0 — your caching isn’t hitting, check task timing
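
If you want to prototype these rules before configuring them in your monitoring tool, they can be sketched as a plain function over the log entry. The alert names and the alerts_for helper are hypothetical; the field names match the log_entry dictionary above.

```python
def alerts_for(entry: dict, max_duration: float = 30.0) -> list[str]:
    """Return the names of the alert rules that one log entry triggers."""
    alerts = []
    if entry.get("stop_reason") == "max_tokens":
        alerts.append("truncated")   # output hit the token cap
    if entry.get("duration_seconds", 0) > max_duration:
        alerts.append("slow")        # likely a blocking tool call
    if entry.get("cache_read_tokens", 0) == 0:
        alerts.append("cache_miss")  # prompt cache was not hit
    return alerts
```

A single cache miss is expected on the first run of each batch, so in practice the cache rule is better evaluated as a rate over a time window than per entry.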

For automated workflows like CI/CD pipelines or scheduled DevOps routines, wire these metrics into your existing observability stack rather than building a separate dashboard.


Key takeaways

  • Match effort to task complexity: use budget_tokens to avoid paying for extended reasoning on simple classification or extraction tasks.
  • Use MCP tunnels for private access: Cloudflare Workers + Cloudflare Tunnel is the lowest-friction way to expose internal APIs to Claude’s managed runtime without opening your VPC.
  • Log structured agent telemetry: stop_reason, token usage, and cache hit rate are the three metrics that tell you whether your agent deployment is healthy and cost-efficient.

The patterns here pair naturally with building your first agent — if you want to go deeper on the Python side before wiring up MCP connectors, the Python and Claude agent fundamentals guide is a solid starting point. For the MCP server setup in local dev before you push to production, see building custom MCP servers for local development workflows.