13 -- Production Checklist
Before you go live, walk through this list.
Security
- [ ] TLS termination configured (reverse proxy or load balancer)
- [ ] `auth.enabled: true` and `auth.allow_anonymous: false`
- [ ] API keys created for each service principal
- [ ] RBAC roles assigned with least-privilege
- [ ] Tool access policies set for sensitive tools
- [ ] Secrets use environment variable interpolation (`${VAR}`), not plain text in config
- [ ] Docker MCP servers use `read_only: true` and `network: none` where possible
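
Some of the security items can be checked mechanically once the config is parsed into a dictionary. The following is a minimal sketch, not part of MCP-Hangar: the function names, the list of secret-looking key fragments, and the nesting of the example config are all assumptions made for illustration. Only `auth.enabled`, `auth.allow_anonymous`, and the `${VAR}` form come from the checklist.

```python
import re

# Matches values that consist solely of an environment placeholder such as ${API_KEY}.
ENV_REF = re.compile(r"^\$\{[A-Za-z_][A-Za-z0-9_]*\}$")

# Hypothetical key-name fragments that suggest a value is a secret.
SECRET_HINTS = ("key", "token", "secret", "password")


def audit_security(config: dict) -> list[str]:
    """Return human-readable findings; an empty list means nothing was flagged."""
    findings = []
    auth = config.get("auth", {})
    if auth.get("enabled") is not True:
        findings.append("auth.enabled should be true")
    if auth.get("allow_anonymous") is not False:
        findings.append("auth.allow_anonymous should be false")
    findings.extend(_plaintext_secrets(config, path=""))
    return findings


def _plaintext_secrets(node, path: str) -> list[str]:
    """Walk nested dicts and lists, flagging secret-looking strings that are not ${VAR}."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else str(key)
            looks_secret = any(hint in str(key).lower() for hint in SECRET_HINTS)
            if isinstance(value, str) and looks_secret:
                if not ENV_REF.match(value):
                    found.append(f"{child} looks like a plain-text secret")
            else:
                found.extend(_plaintext_secrets(value, child))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            found.extend(_plaintext_secrets(item, f"{path}[{index}]"))
    return found
```

To use it, load your config with any YAML parser and pass the resulting dictionary to `audit_security`. The key-name heuristic will miss secrets stored under unusual names, so treat a clean result as a hint rather than proof.
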
Reliability
- [ ] Health checks enabled on all MCP servers (`health_check_interval_s`)
- [ ] Circuit breaker thresholds tuned (`max_consecutive_failures`)
- [ ] MCP Server groups configured for critical MCP servers (at least 2 members)
- [ ] `min_healthy` set to match your SLA requirements
- [ ] Idle TTL set appropriately (300s for subprocess, 600s for containers)
- [ ] Rate limiting enabled to prevent overload
- [ ] Event store configured (`event_store.driver: sqlite`)
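
The group-related items lend themselves to a quick sanity check. The sketch below assumes a hypothetical layout in which each group is a dictionary with a `members` list and a `min_healthy` integer; the real config schema may differ, so adapt the lookups before relying on it.

```python
def audit_groups(groups: dict) -> list[str]:
    """Flag groups that have fewer than 2 members or an unusable min_healthy value.

    `groups` maps a group name to a dict with "members" and "min_healthy"
    (a hypothetical layout used only for this illustration).
    """
    findings = []
    for name, group in groups.items():
        members = group.get("members", [])
        min_healthy = group.get("min_healthy")
        if len(members) < 2:
            findings.append(f"{name}: needs at least 2 members, has {len(members)}")
        if min_healthy is None:
            findings.append(f"{name}: min_healthy is not set")
        elif not 1 <= min_healthy <= len(members):
            findings.append(
                f"{name}: min_healthy={min_healthy} is outside 1..{len(members)}"
            )
    return findings
```

A group whose `min_healthy` equals its member count passes this check but has no headroom: a single failure takes it below the threshold. Whether that is acceptable depends on your SLA.
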
Observability
- [ ] Prometheus scraping `/metrics` endpoint
- [ ] Grafana dashboards imported from `monitoring/grafana/`
- [ ] Alertmanager rules configured for:
    - MCP server state transitions to DEAD
    - Circuit breaker OPEN events
    - Health check failure rate above threshold
    - Tool call error rate above threshold
- [ ] Structured JSON logging enabled (`MCP_JSON_LOGS=true`)
- [ ] Log level set to `INFO` for production (`MCP_LOG_LEVEL=INFO`)
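
The two logging items can be verified from inside the deployment environment before startup. This is a small sketch under the assumption that the values must match the documented ones exactly (`true` and `INFO`); the function name is made up for this example, and the application itself may be more lenient about case.

```python
def audit_logging_env(env) -> list[str]:
    """Compare logging-related variables against the values the checklist recommends.

    `env` is any mapping of variable names to strings, for example os.environ.
    """
    findings = []
    if env.get("MCP_JSON_LOGS") != "true":
        findings.append("MCP_JSON_LOGS should be 'true' for structured JSON logs")
    if env.get("MCP_LOG_LEVEL") != "INFO":
        findings.append("MCP_LOG_LEVEL should be 'INFO' in production")
    return findings
```

Calling `audit_logging_env(os.environ)` in a pre-deploy script returns an empty list when both variables are set as recommended.
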
Configuration
- [ ] Config file reviewed for correctness (no `validate` subcommand exists)
- [ ] Hot-reload tested via `mcp-hangar add` API (no SIGHUP handler exists)
- [ ] Environment-specific configs separated (dev/staging/prod)
Deployment
- [ ] Running behind a reverse proxy (nginx, Caddy, Envoy)
- [ ] Health probe endpoints exposed for orchestrator (`/health/live`, `/health/ready`, `/health/startup`)
- [ ] Graceful shutdown configured (SIGTERM handling)
- [ ] Resource limits set (memory, CPU) for container deployments
- [ ] Persistent volume for event store SQLite database
- [ ] Docker image pinned to specific version tag, not `latest`
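
Two of the deployment items are easy to script: probing the health endpoints and checking that an image reference is pinned. The sketch below uses only the Python standard library. It assumes a healthy probe answers with HTTP 200, which the checklist does not state, and the helper names are illustrative rather than anything MCP-Hangar provides.

```python
import urllib.error
import urllib.request

# Probe paths as listed in the checklist.
PROBE_PATHS = ("/health/live", "/health/ready", "/health/startup")


def http_status(url: str, timeout: float = 3.0) -> int:
    """Return the HTTP status of a GET request, or 0 if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # non-2xx responses still carry a status code
    except (urllib.error.URLError, OSError):
        return 0


def check_probes(base_url: str, fetch=http_status) -> dict[str, bool]:
    """Map each probe path to True if it answered 200 (assumed to mean healthy)."""
    base = base_url.rstrip("/")
    return {path: fetch(base + path) == 200 for path in PROBE_PATHS}


def is_pinned(image: str) -> bool:
    """True if the image has a digest or an explicit tag other than 'latest'."""
    if "@sha256:" in image:
        return True
    last_segment = image.rsplit("/", 1)[-1]  # ignore a registry host:port prefix
    if ":" not in last_segment:
        return False  # no tag means Docker falls back to 'latest'
    tag = last_segment.rsplit(":", 1)[1]
    return tag not in ("", "latest")
```

Point `check_probes` at whatever address your orchestrator uses to reach Hangar. The `fetch` parameter exists so the function can be exercised with a stub instead of a live server.
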
Kubernetes (if applicable)
The MCP-Hangar Operator is an external component shipped from hangar-operator. See Recipe 11 for install instructions.
- [ ] MCP-Hangar Operator installed (see Recipe 11 prerequisites)
- [ ] CRDs applied (`MCPServer`, `MCPServerGroup`, `MCPDiscoverySource`)
- [ ] RBAC (Kubernetes) configured for operator service account
- [ ] Network policies restricting MCP server-to-MCP server communication
- [ ] Resource requests and limits in Helm values
- [ ] PodDisruptionBudget for Hangar deployment
Testing
- [ ] Failover tested: kill a primary MCP server, verify backup takes over
- [ ] Cold start tested: invoke a tool on a cold MCP server, verify latency
- [ ] Rate limit tested: flood API, verify 429 responses
- [ ] Auth tested: invalid key returns 401, insufficient role returns 403
- [ ] Config reload tested: edit config.yaml, verify changes apply
- [ ] Recovery tested: kill all MCP servers, verify they reinitialize
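
The auth and rate-limit tests above can be captured as small reusable checks. This sketch deliberately takes the HTTP call as a function argument, because the request path and header format of the API are not specified here; you would supply a callable that sends a real request with the given key and returns the status code. All names in it are hypothetical.

```python
from typing import Callable


def verify_auth(
    status_for: Callable[[str], int], invalid_key: str, limited_key: str
) -> list[str]:
    """Check that an invalid key yields 401 and an under-privileged key yields 403.

    `status_for(key)` must perform one request with that key and return the status.
    """
    failures = []
    got = status_for(invalid_key)
    if got != 401:
        failures.append(f"invalid key: expected 401, got {got}")
    got = status_for(limited_key)
    if got != 403:
        failures.append(f"insufficient role: expected 403, got {got}")
    return failures


def verify_rate_limit(send: Callable[[], int], burst: int) -> bool:
    """Send up to `burst` requests back to back; pass as soon as one returns 429."""
    return any(send() == 429 for _ in range(burst))
```

Pick a `burst` comfortably above your configured limit. If `verify_rate_limit` returns `False`, either the limit is higher than expected or rate limiting is not active.
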
Runbook
- [ ] Incident response documented
- [ ] MCP Server restart procedure documented
- [ ] Config rollback procedure documented
- [ ] Contact list for MCP server owners maintained