
13 -- Production Checklist

Before you go live, walk through this list.

Security

  • [ ] TLS termination configured (reverse proxy or load balancer)
  • [ ] auth.enabled: true and auth.allow_anonymous: false
  • [ ] API keys created for each service principal
  • [ ] RBAC roles assigned following least privilege
  • [ ] Tool access policies set for sensitive tools
  • [ ] Secrets use environment variable interpolation (${VAR}), not plain text in config
  • [ ] Docker MCP servers use read_only: true and network: none where possible
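
The secrets item above can be checked mechanically before a deploy. The following is a minimal sketch, not part of MCP-Hangar: it scans config text line by line and flags keys whose names suggest a secret but whose values are not a `${VAR}` reference. The hint list, the function name `find_plaintext_secrets`, and the sample keys are all assumptions made for illustration; the scan is a plain text heuristic, not a YAML parser.

```python
import re

# Matches a value that is exactly one ${VAR}-style reference, e.g. ${API_KEY}.
ENV_REF = re.compile(r"^\$\{[A-Za-z_][A-Za-z0-9_]*\}$")

# Hypothetical key-name fragments that suggest a secret; adjust to your config.
SECRET_HINTS = ("key", "secret", "token", "password")


def find_plaintext_secrets(config_text):
    """Return (line_number, key) pairs whose value looks like a literal secret."""
    findings = []
    for number, line in enumerate(config_text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#") or ":" not in stripped:
            continue
        key, _, value = stripped.partition(":")
        key = key.strip().lstrip("- ").strip()
        # Drop a trailing inline comment, then surrounding whitespace and quotes.
        value = value.split(" #", 1)[0].strip().strip("'\"")
        if not value:
            continue  # a mapping header such as "auth:" has no scalar value
        looks_secret = any(hint in key.lower() for hint in SECRET_HINTS)
        if looks_secret and not ENV_REF.match(value):
            findings.append((number, key))
    return findings


# Illustrative input only; these key names are not taken from the real schema.
sample = """\
auth:
  enabled: true
  api_key: ${HANGAR_API_KEY}
  db_password: hunter2
"""

for line_number, key in find_plaintext_secrets(sample):
    print(f"line {line_number}: '{key}' appears to hold a plain-text value")
```

A heuristic like this will produce false positives (any key containing "key" is flagged), so treat its output as a prompt for manual review rather than a gate.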

Reliability

  • [ ] Health checks enabled on all MCP servers (health_check_interval_s)
  • [ ] Circuit breaker thresholds tuned (max_consecutive_failures)
  • [ ] MCP server groups configured for critical MCP servers (at least 2 members)
  • [ ] min_healthy set to match your SLA requirements
  • [ ] Idle TTL set appropriately (300s for subprocess, 600s for containers)
  • [ ] Rate limiting enabled to prevent overload
  • [ ] Event store configured (event_store.driver: sqlite)

Observability

  • [ ] Prometheus scraping /metrics endpoint
  • [ ] Grafana dashboards imported from monitoring/grafana/
  • [ ] Alertmanager rules configured for:
      • MCP server state transitions to DEAD
      • Circuit breaker OPEN events
      • Health check failure rate above threshold
      • Tool call error rate above threshold
  • [ ] Structured JSON logging enabled (MCP_JSON_LOGS=true)
  • [ ] Log level set to INFO for production (MCP_LOG_LEVEL=INFO)

Configuration

  • [ ] Config file reviewed for correctness (no validate subcommand exists)
  • [ ] Hot-reload tested via mcp-hangar add API (no SIGHUP handler exists)
  • [ ] Environment-specific configs separated (dev/staging/prod)

Deployment

  • [ ] Running behind a reverse proxy (nginx, Caddy, Envoy)
  • [ ] Health probe endpoints exposed for orchestrator (/health/live, /health/ready, /health/startup)
  • [ ] Graceful shutdown configured (SIGTERM handling)
  • [ ] Resource limits set (memory, CPU) for container deployments
  • [ ] Persistent volume for event store SQLite database
  • [ ] Docker image pinned to specific version tag, not latest
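
To confirm that the three health probe endpoints listed above are reachable through your proxy, a small smoke test helps. This is a sketch under assumptions: the base URL is a placeholder, the helper names `http_status` and `check_probes` are invented here, and a 2xx status is assumed to mean "healthy", which is the usual convention but is not stated in this recipe.

```python
import urllib.error
import urllib.request

# Probe paths as listed in the Deployment checklist.
PROBE_PATHS = ("/health/live", "/health/ready", "/health/startup")


def http_status(url, timeout=2.0):
    """Return the HTTP status code for a GET request, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # non-2xx responses still carry a status code
    except (urllib.error.URLError, OSError):
        return None  # connection refused, DNS failure, timeout, ...


def check_probes(base_url, fetch=http_status):
    """Map each probe path to True when it answers with a 2xx status."""
    results = {}
    for path in PROBE_PATHS:
        status = fetch(base_url.rstrip("/") + path)
        results[path] = status is not None and 200 <= status < 300
    return results
```

Calling `check_probes("http://localhost:8000")` (host and port are placeholders) returns a dictionary keyed by probe path. The `fetch` parameter exists so the check can be exercised with a stub instead of a live server.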

Kubernetes (if applicable)

The MCP-Hangar Operator is an external component shipped from hangar-operator. See Recipe 11 for install instructions.

  • [ ] MCP-Hangar Operator installed (see Recipe 11 prerequisites)
  • [ ] CRDs applied (MCPServer, MCPServerGroup, MCPDiscoverySource)
  • [ ] Kubernetes RBAC configured for the operator service account
  • [ ] Network policies restricting MCP server-to-MCP server communication
  • [ ] Resource requests and limits in Helm values
  • [ ] PodDisruptionBudget for Hangar deployment

Testing

  • [ ] Failover tested: kill a primary MCP server, verify backup takes over
  • [ ] Cold start tested: invoke a tool on a cold MCP server, verify latency
  • [ ] Rate limit tested: flood API, verify 429 responses
  • [ ] Auth tested: invalid key returns 401, insufficient role returns 403
  • [ ] Config reload tested: edit config.yaml, verify changes apply
  • [ ] Recovery tested: kill all MCP servers, verify they reinitialize
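
The auth and rate limit items above lend themselves to a scripted check. The sketch below is transport-agnostic: `call` is a hypothetical function you supply that sends one request with the given API key and returns the HTTP status code, since this recipe does not name a specific endpoint to test against. The helper names `find_mismatches` and `flood` are likewise invented for illustration.

```python
from collections import Counter


def find_mismatches(call, cases):
    """Run each (label, key, expected_status) case and return those that differ."""
    mismatches = []
    for label, key, expected in cases:
        actual = call(key)
        if actual != expected:
            mismatches.append((label, expected, actual))
    return mismatches


def flood(call, key, requests=200):
    """Send requests back to back and count the status codes returned."""
    return Counter(call(key) for _ in range(requests))
```

A typical run passes cases such as `("invalid key", "not-a-real-key", 401)` and `("insufficient role", low_privilege_key, 403)` to `find_mismatches` and expects an empty list, then calls `flood` with a valid key and checks that the returned counter contains at least one 429. The request count of 200 is arbitrary; pick a value above your configured rate limit.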

Runbook

  • [ ] Incident response documented
  • [ ] MCP server restart procedure documented
  • [ ] Config rollback procedure documented
  • [ ] Contact list for MCP server owners maintained