Observability Guide¶
This guide covers MCP Hangar's observability features: metrics, tracing, logging, and health checks.
Table of Contents¶
- Quick Start
- Monitoring Stack
- Metrics
- Grafana Dashboards
- Alerting
- Tracing
- Langfuse Integration
- Logging
- Health Checks
- SLIs/SLOs
- Troubleshooting
- Best Practices
Quick Start¶
Prerequisites¶
# Core package
pip install mcp-hangar
# For full observability support
pip install "mcp-hangar[observability]"
Start Monitoring Stack¶
The monitoring stack is in monitoring/ and includes Prometheus, Grafana, and Alertmanager:
# Using Docker Compose
cd monitoring
docker compose up -d
# Using Podman
cd monitoring
podman compose up -d
Access dashboards:
| Service | URL | Credentials |
|---|---|---|
| Grafana | http://localhost:3000 | admin / admin |
| Prometheus | http://localhost:9090 | - |
| Alertmanager | http://localhost:9093 | - |
Start MCP Hangar with Metrics¶
# HTTP mode (exposes /metrics endpoint)
mcp-hangar serve --http --port 8000
# With custom config
MCP_CONFIG=config.yaml mcp-hangar serve --http --port 8000
Verify metrics are exposed:
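One way to script this check is sketched below. It assumes Hangar is listening on localhost:8000 as in the commands above; the helper names `fetch_metrics` and `hangar_metric_names` are illustrative and not part of MCP Hangar. Only the parsing helper runs without a live server.

```python
import urllib.request


def fetch_metrics(url: str = "http://localhost:8000/metrics") -> str:
    """Download the Prometheus text exposition from a running instance."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")


def hangar_metric_names(exposition: str, prefix: str = "mcp_hangar_") -> set[str]:
    """Return the sample names in the exposition text that start with prefix."""
    names = set()
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # A sample line looks like `name{labels} value` or `name value`.
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name.startswith(prefix):
            names.add(name)
    return names


# Hand-written sample in the Prometheus text format (not real Hangar output).
SAMPLE = """\
# HELP mcp_hangar_tool_calls_total Total tool invocations
# TYPE mcp_hangar_tool_calls_total counter
mcp_hangar_tool_calls_total{mcp_server="math",tool="add",status="success"} 12
python_gc_objects_collected_total{generation="0"} 345
"""
```

Against a live instance, `hangar_metric_names(fetch_metrics())` should return a non-empty set once Hangar has registered its metrics.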
Monitoring Stack¶
Architecture¶
+----------------+     scrape      +------------+
|   MCP Hangar   |---------------->| Prometheus |
| :8000/metrics  |                 |   :9090    |
+----------------+                 +-----+------+
                                         |
                                         | query
                                         v
                                   +------------+
                                   |  Grafana   |
                                   |   :3000    |
                                   +------------+

+----------------+     alerts      +-------------+
|   Prometheus   |---------------->| Alertmanager|
|  alert rules   |                 |    :9093    |
+----------------+                 +-------------+
Configuration Files¶
| File | Purpose |
|---|---|
| `monitoring/docker-compose.yaml` | Container orchestration |
| `monitoring/prometheus/prometheus.yaml` | Scrape configuration |
| `monitoring/prometheus/alerts.yaml` | Alert rules |
| `monitoring/alertmanager/alertmanager.yaml` | Notification routing |
| `monitoring/grafana/provisioning/` | Dashboard/datasource provisioning |
| `monitoring/grafana/dashboards/` | Pre-built dashboard JSON files |
Prometheus Configuration¶
The default configuration scrapes MCP Hangar every 10 seconds:
# monitoring/prometheus/prometheus.yaml
scrape_configs:
  - job_name: 'mcp-hangar'
    static_configs:
      - targets: ['host.docker.internal:8000']
        labels:
          service: 'mcp-hangar'
          tier: 'application'
    metrics_path: /metrics
    scrape_interval: 10s
    scrape_timeout: 5s
For Kubernetes deployments, use service discovery:
scrape_configs:
  - job_name: 'mcp-hangar'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: mcp-hangar
        action: keep
Metrics¶
MCP Hangar exports Prometheus metrics at /metrics. All metrics use the mcp_hangar_ prefix.
Currently Exported Metrics¶
Tool Invocations¶
| Metric | Type | Labels | Description |
|---|---|---|---|
| `mcp_hangar_tool_calls_total` | Counter | mcp_server, tool, status | Total tool invocations |
| `mcp_hangar_tool_call_duration_seconds` | Histogram | mcp_server, tool | Invocation latency (buckets: 0.01-30s) |
| `mcp_hangar_tool_call_errors_total` | Counter | mcp_server, tool, error_type | Failed invocations by error type |
Example queries:
# Tool call rate by mcp_server
sum(rate(mcp_hangar_tool_calls_total[5m])) by (mcp_server)
# P95 latency by tool
histogram_quantile(0.95, sum(rate(mcp_hangar_tool_call_duration_seconds_bucket[5m])) by (le, tool))
# Error rate
sum(rate(mcp_hangar_tool_call_errors_total[5m])) / sum(rate(mcp_hangar_tool_calls_total[5m]))
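To make the P95 query less opaque, here is a minimal sketch of the bucket interpolation that `histogram_quantile` performs, simplified from the behavior described in the Prometheus documentation. The bucket counts are made-up illustration values, not output from MCP Hangar, and the Python function is a teaching aid rather than a replacement for the PromQL function.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with (math.inf, total).
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # the lowest bucket is assumed to start at 0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Linear interpolation between the bucket's lower and upper bound.
            return prev_bound + (upper - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = upper, count
    return prev_bound


# Hypothetical cumulative counts for le = 0.1, 0.5, 1.0, +Inf.
EXAMPLE_BUCKETS = [(0.1, 50.0), (0.5, 90.0), (1.0, 98.0), (math.inf, 100.0)]
p95 = histogram_quantile(0.95, EXAMPLE_BUCKETS)
```

With these counts the 95th observation lands in the 0.5-1.0 bucket, so the estimate is interpolated inside that bucket. This is also why the result can never be more precise than the bucket boundaries allow.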
Batch Invocations¶
| Metric | Type | Labels | Description |
|---|---|---|---|
| `mcp_hangar_batch_calls_total` | Counter | result | Batch invocations (success/failure) |
| `mcp_hangar_batch_duration_seconds` | Histogram | - | Batch execution time |
| `mcp_hangar_batch_size` | Histogram | - | Number of calls per batch |
| `mcp_hangar_batch_cancellations_total` | Counter | - | Cancelled batches |
| `mcp_hangar_batch_circuit_breaker_rejections_total` | Counter | - | Circuit breaker rejections |
| `mcp_hangar_batch_concurrency` | Gauge | - | Current parallel executions |
Example queries:
# Batch success rate
sum(rate(mcp_hangar_batch_calls_total{result="success"}[5m]))
/ sum(rate(mcp_hangar_batch_calls_total[5m]))
# Average batch size
rate(mcp_hangar_batch_size_sum[5m]) / rate(mcp_hangar_batch_size_count[5m])
Health Checks¶
| Metric | Type | Labels | Description |
|---|---|---|---|
| `mcp_hangar_health_checks_total` | Counter | mcp_server, result | Health check executions |
| `mcp_hangar_health_check_duration_seconds` | Histogram | mcp_server | Health check latency |
| `mcp_hangar_health_check_consecutive_failures` | Gauge | mcp_server | Current consecutive failure count |
Example queries:
# Unhealthy mcp_servers (>2 consecutive failures)
mcp_hangar_health_check_consecutive_failures > 2
# Health check success rate
sum(rate(mcp_hangar_health_checks_total{result="healthy"}[5m])) by (mcp_server)
/ sum(rate(mcp_hangar_health_checks_total[5m])) by (mcp_server)
MCP Server Lifecycle¶
| Metric | Type | Labels | Description |
|---|---|---|---|
| `mcp_hangar_mcp_server_state` | Gauge | mcp_server | Current state (0=cold, 1=initializing, 2=ready, 3=degraded, 4=dead) |
| `mcp_hangar_mcp_server_up` | Gauge | mcp_server | 1 if MCP server is reachable |
| `mcp_hangar_mcp_server_starts_total` | Counter | mcp_server | MCP server start attempts |
| `mcp_hangar_mcp_server_initialized` | Gauge | mcp_server | 1 if MCP server has been initialized |
| `mcp_hangar_mcp_server_cold_start_seconds` | Histogram | mcp_server | Cold start latency |
| `mcp_hangar_mcp_server_cold_start_in_progress` | Gauge | mcp_server | 1 if cold start is in progress |
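When consuming `mcp_hangar_mcp_server_state` outside Grafana, the numeric encoding from the table can be mirrored in a small enum. This is a sketch; the names `McpServerState` and `describe_state` are hypothetical helpers, not identifiers from the MCP Hangar codebase.

```python
from enum import IntEnum


class McpServerState(IntEnum):
    """Numeric values reported by the mcp_hangar_mcp_server_state gauge."""
    COLD = 0
    INITIALIZING = 1
    READY = 2
    DEGRADED = 3
    DEAD = 4


def describe_state(value: float) -> str:
    """Map a gauge sample to a lowercase state name, or 'unknown'."""
    try:
        return McpServerState(int(value)).name.lower()
    except ValueError:
        return "unknown"
```

Prometheus returns sample values as floats, which is why `describe_state` converts to `int` before the lookup.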
Discovery¶
| Metric | Type | Labels | Description |
|---|---|---|---|
| `mcp_hangar_discovery_mcp_servers` | Gauge | source | Discovered MCP servers per source |
| `mcp_hangar_discovery_registrations_total` | Counter | source | New registrations |
| `mcp_hangar_discovery_errors_total` | Counter | source | Errors by source |
| `mcp_hangar_discovery_cycle_duration_seconds` | Histogram | source | Discovery cycle duration |
HTTP Transport¶
| Metric | Type | Labels | Description |
|---|---|---|---|
| `mcp_hangar_http_requests_total` | Counter | method, status | HTTP requests to remote MCP servers |
| `mcp_hangar_http_request_duration_seconds` | Histogram | method | HTTP request latency |
| `mcp_hangar_http_connections` | Gauge | mcp_server | Active HTTP connections |
Rate Limiting¶
| Metric | Type | Labels | Description |
|---|---|---|---|
| `mcp_hangar_rate_limit_hits_total` | Counter | principal | Rate limit rejections |
GC (Garbage Collection)¶
| Metric | Type | Labels | Description |
|---|---|---|---|
| `mcp_hangar_gc_cycles_total` | Counter | - | GC cycle executions |
| `mcp_hangar_gc_cycle_duration_seconds` | Histogram | - | GC cycle duration |
Grafana Dashboards¶
Pre-built dashboards are provisioned automatically from monitoring/grafana/dashboards/:
Overview Dashboard¶
File: overview.json URL: http://localhost:3000/d/mcp-hangar-overview
Provides high-level system health:
- Request rate and error rate trends
- Latency percentiles (P50, P95, P99)
- MCP Server health status
- Batch invocation success/failure rates
- Health check results
- GC cycle performance
MCP Server Details Dashboard¶
File: MCP server-details.json URL: http://localhost:3000/d/mcp-hangar-MCP server-details
Deep dive into individual MCP servers:
- Tool call breakdown by tool name
- Per-tool latency histograms
- Error distribution by type
- Health check history
- Consecutive failure tracking
Alerts Dashboard¶
File: alerts.json URL: http://localhost:3000/d/mcp-hangar-alerts
Alert monitoring and trends:
- Active alerts by severity
- Alert condition trends (error rate, latency, health)
- Historical alert timeline
Importing Dashboards Manually¶
If not using provisioning:
- Open Grafana at http://localhost:3000
- Go to Dashboards > Import
- Upload JSON file from monitoring/grafana/dashboards/
- Select Prometheus data source
- Click Import
Alerting¶
Alert Configuration¶
Alert rules are defined in monitoring/prometheus/alerts.yaml and organized by severity:
Critical Alerts (Page On-Call)¶
| Alert | Condition | For | Description |
|---|---|---|---|
| `MCPHangarNotResponding` | `up{job="mcp-hangar"} == 0` | 1m | Service unreachable |
| `MCPHangarHighErrorRate` | Error rate > 10% | 2m | Significant failures |
| `MCPHangarBatchHighFailureRate` | Batch failure > 20% | 3m | Batch operations failing |
| `MCPHangarCircuitBreakerTripped` | CB rejections > 10/5m | 2m | MCP Server isolated |
| `MCPHangarProviderUnhealthy` | Consecutive failures > 5 | 2m | MCP Server critically unhealthy |
Warning Alerts (Investigate)¶
| Alert | Condition | For | Description |
|---|---|---|---|
| `MCPHangarHighConsecutiveFailures` | Consecutive failures > 2 | 2m | Health check issues |
| `MCPHangarHealthCheckSlow` | P95 health check > 5s | 5m | Slow health checks |
| `MCPHangarHighLatencyP95` | P95 latency > 3s | 5m | Performance degradation |
| `MCPHangarHighLatencyP99` | P99 latency > 5s | 5m | Tail latency issues |
| `MCPHangarHighLatencyByTool` | P95 per-tool > 5s | 5m | Specific tool slow |
| `MCPHangarFrequentColdStarts` | Start rate > 0.1/s | 10m | Consider increasing idle_ttl |
| `MCPHangarBatchSlowExecution` | P95 batch > 30s | 5m | Slow batch processing |
| `MCPHangarBatchHighCancellationRate` | Cancellation > 10% | 5m | Batches timing out |
| `MCPHangarBatchSizeTooLarge` | P95 size > 50 | 5m | Consider smaller batches |
| `MCPHangarGCSlowCycles` | P95 GC > 0.5s | 5m | GC performance issue |
| `MCPHangarHighMemoryUsage` | Memory > 2GB | 10m | Memory pressure |
| `MCPHangarHighCPUUsage` | CPU > 80% | 10m | CPU saturation |
Info Alerts (Tracking)¶
| Alert | Condition | Description |
|---|---|---|
| `MCPHangarMcpServerStarted` | Any MCP server start | MCP Server lifecycle event |
| `MCPHangarHighToolCallVolume` | Rate > 100/s | High traffic notification |
Alertmanager Configuration¶
Configure notification routing in monitoring/alertmanager/alertmanager.yaml:
route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
    - match:
        severity: warning
      receiver: 'slack'

receivers:
  - name: 'default'
    webhook_configs:
      - url: 'http://your-webhook-endpoint'
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: '<your-service-key>'
  - name: 'slack'
    slack_configs:
      - api_url: '<your-slack-webhook-url>'
        channel: '#mcp-hangar-alerts'
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
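The routing tree above can be read as "first matching child route wins, otherwise fall back to the top-level receiver". The sketch below models only that rule for this particular config; `pick_receiver` is an illustrative name, and grouping, `continue`, and nested routes are deliberately left out.

```python
# Child routes in the order they appear in alertmanager.yaml.
ROUTES = [
    ({"severity": "critical"}, "pagerduty"),
    ({"severity": "warning"}, "slack"),
]
DEFAULT_RECEIVER = "default"


def pick_receiver(labels: dict[str, str]) -> str:
    """Return the receiver of the first child route whose matchers all equal the alert's labels."""
    for match, receiver in ROUTES:
        if all(labels.get(key) == value for key, value in match.items()):
            return receiver
    # No child route matched: the top-level route's receiver handles the alert.
    return DEFAULT_RECEIVER
```

Under this model, info-level alerts such as `MCPHangarHighToolCallVolume` end up at the `default` webhook receiver, assuming they carry a `severity: info` label.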
Testing Alerts¶
Verify alert rules are loaded:
# Check Prometheus rules
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[].name'
# Check for firing alerts
curl -s http://localhost:9090/api/v1/alerts | jq '.data.alerts[] | select(.state=="firing")'
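Where `jq` is not available, the same filter can be sketched in Python against the Prometheus `/api/v1/alerts` response. The function names and the sample payload are illustrative; the payload is hand-written in the shape that endpoint returns, not captured from a real deployment.

```python
import json
import urllib.request


def fetch_alerts(base_url: str = "http://localhost:9090") -> dict:
    """Fetch the current alerts from the Prometheus HTTP API."""
    with urllib.request.urlopen(f"{base_url}/api/v1/alerts", timeout=5) as resp:
        return json.load(resp)


def firing_alerts(payload: dict) -> list[str]:
    """Return the alertname label of every alert whose state is 'firing'."""
    alerts = payload.get("data", {}).get("alerts", [])
    return [
        alert.get("labels", {}).get("alertname", "<unnamed>")
        for alert in alerts
        if alert.get("state") == "firing"
    ]


# Hand-written example payload (illustration only).
SAMPLE_PAYLOAD = {
    "status": "success",
    "data": {
        "alerts": [
            {"labels": {"alertname": "MCPHangarHighErrorRate", "severity": "critical"},
             "state": "firing"},
            {"labels": {"alertname": "MCPHangarHighLatencyP95", "severity": "warning"},
             "state": "pending"},
        ]
    },
}
```

Against a live Prometheus, `firing_alerts(fetch_alerts())` lists the alerts that have passed their `for` duration; pending alerts are excluded, as in the `jq` filter.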
Tracing¶
OpenTelemetry Integration¶
MCP Hangar supports distributed tracing via OpenTelemetry. Every tool invocation produces an OTEL span carrying MCP governance attributes (mcp.server.id, mcp.tool.name, mcp.tool.status, enforcement context, and identity context when available).
For the full MCP attribute taxonomy, partner backend recipes (OTEL Collector, OpenLIT, Langfuse, Grafana), and reference docker-compose setups, see: OpenTelemetry Integrations.
from mcp_hangar.observability import init_tracing, trace_span

# Initialize once at startup
init_tracing(
    service_name="mcp-hangar",
    otlp_endpoint="http://localhost:4317",
)

# Create spans for operations
# (req_id and do_work() stand in for your own request ID and handler)
with trace_span("process_request", {"request.id": req_id}) as span:
    span.add_event("checkpoint_reached")
    result = do_work()
MCP Governance Attributes on Spans¶
TracedMcpServerService automatically creates an OTEL span for each tool invocation with standard MCP governance attributes via set_governance_attributes():
from mcp_hangar.observability.conventions import McpServer, MCP, set_governance_attributes

# set_governance_attributes(span, ...) sets all applicable attributes in one call.
# None values are omitted -- no empty strings pollute OTLP backends.
set_governance_attributes(
    span,
    mcp_server_id="math",
    tool_name="add",
    user_id="alice",          # optional
    session_id="sess-42",     # optional
    policy_result="allow",    # optional
    enforcement_action=None,  # omitted from span
)
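The "None values are omitted" behavior can be pictured with a standalone sketch. This is not the implementation of `set_governance_attributes`; `build_attributes` is a hypothetical helper, and it uses only the attribute keys named earlier in this section (`mcp.server.id`, `mcp.tool.name`, `mcp.tool.status`).

```python
from typing import Optional


def build_attributes(
    server_id: str,
    tool_name: str,
    status: Optional[str] = None,
) -> dict[str, str]:
    """Collect span attributes, dropping any whose value is None."""
    candidates = {
        "mcp.server.id": server_id,
        "mcp.tool.name": tool_name,
        "mcp.tool.status": status,
    }
    # Filtering here means the backend never receives empty or null attributes.
    return {key: value for key, value in candidates.items() if value is not None}
```

Dropping absent values at the source keeps backend queries simple: a filter on `mcp.tool.status` only ever sees spans where a status was actually recorded.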
OTLP Audit Export¶
Security-relevant domain events (tool invocations, MCP server state transitions) are automatically exported as OTLP log records when OTEL_EXPORTER_OTLP_ENDPOINT is set. This is handled by OTLPAuditExporter and OTLPAuditEventHandler -- no additional configuration needed.
Events exported:
- ToolInvocationCompleted / ToolInvocationFailed -- with MCP server, tool, status, duration, caller identity, cost attribution
- McpServerStateChanged -- with MCP server, from_state, to_state
Caller identity attributes (mcp.caller.type, mcp.caller.id, mcp.caller.roles) are automatically propagated from the event's identity_context when available.
Cost attributes (mcp.cost.cents, mcp.cost.model, mcp.cost.input_tokens, mcp.cost.output_tokens) are included when cost attribution is configured.
Compliance Export Formats (Enterprise)¶
Enterprise deployments can export audit events in SIEM-compatible formats alongside OTLP. Available exporters (in src/mcp_hangar/compliance/):
| Format | Class | Use Case |
|---|---|---|
| CEF | CEFExporter | ArcSight, QRadar, Splunk via CEF |
| JSON-lines | JSONLinesExporter | Splunk HEC, Elasticsearch, custom pipelines |
| LEEF | LEEFExporter | IBM QRadar native format |
| Syslog (RFC 5424) | SyslogExporter | Any syslog-compatible SIEM |
All exporters implement the IAuditExporter protocol and output to file, callback, or stderr (for container log collection). Configure via the compliance bootstrap.
Environment Variables¶
| Variable | Default | Description |
|---|---|---|
| `MCP_TRACING_ENABLED` | true | Enable/disable tracing |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | http://localhost:4317 | OTLP collector endpoint (also activates OTLP audit export) |
| `OTEL_SERVICE_NAME` | mcp-hangar | Service name in traces |
Trace Context Propagation¶
W3C TraceContext is automatically propagated across agent -> Hangar -> MCP server boundaries:
- Inbound: BatchExecutor extracts traceparent from call metadata, creating child spans linked to the agent's root trace.
- Outbound: HttpClient injects traceparent into outbound HTTP headers when calling remote MCP servers.
- Stdio: Not supported (JSON-RPC over stdin/stdout has no header mechanism).
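For readers unfamiliar with the header itself, the sketch below parses a W3C `traceparent` value and derives a child header that keeps the trace ID. It is independent of Hangar's own propagation code; the function names are illustrative, and it accepts only the four-field layout defined for version `00`.

```python
import re
import secrets
from typing import Optional

# version - trace-id (32 hex) - parent-id (16 hex) - trace-flags (2 hex)
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: str) -> Optional[tuple[str, str, bool]]:
    """Return (trace_id, parent_span_id, sampled), or None if the header is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid according to the specification.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return trace_id, parent_id, bool(int(flags, 16) & 0x01)


def child_traceparent(header: str) -> Optional[str]:
    """Build a header for a child span: same trace ID, freshly generated span ID."""
    parsed = parse_traceparent(header)
    if parsed is None:
        return None
    trace_id, _parent_id, sampled = parsed
    return f"00-{trace_id}-{secrets.token_hex(8)}-{'01' if sampled else '00'}"


EXAMPLE = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
```

Keeping the trace ID while replacing the span ID is what links a child span to the agent's root trace on the inbound and outbound paths described above.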
Manual propagation is also available:
from mcp_hangar.observability import inject_trace_context, extract_trace_context
# Inject into outgoing requests
headers = {}
inject_trace_context(headers)
# Extract from incoming requests
context = extract_trace_context(request_headers)
Langfuse Integration¶
MCP Hangar integrates with Langfuse for LLM-specific observability.
Configuration¶
export MCP_LANGFUSE_ENABLED=true
export LANGFUSE_PUBLIC_KEY=pk-lf-...
export LANGFUSE_SECRET_KEY=sk-lf-...
export LANGFUSE_HOST=https://cloud.langfuse.com
Or via config.yaml:
observability:
  langfuse:
    enabled: true
    public_key: ${LANGFUSE_PUBLIC_KEY}
    secret_key: ${LANGFUSE_SECRET_KEY}
    host: https://cloud.langfuse.com
    sample_rate: 1.0
Trace Propagation¶
from mcp_hangar.application.services import TracedMcpServerService

# traced_service is an already-constructed TracedMcpServerService instance
result = traced_service.invoke_tool(
    mcp_server_id="math",
    tool_name="add",
    arguments={"a": 1, "b": 2},
    trace_id="your-langfuse-trace-id",
    user_id="user-123",
    session_id="session-456",
)
See ADR-007 for architectural details.
Logging¶
Structured Logging¶
MCP Hangar uses structlog for structured JSON logging:
{
  "timestamp": "2026-02-03T10:30:00.123Z",
  "level": "info",
  "event": "tool_invoked",
  "mcp_server": "math",
  "tool": "add",
  "duration_ms": 150,
  "service": "mcp-hangar"
}
Configuration¶
logging:
  level: INFO          # DEBUG, INFO, WARNING, ERROR
  json_format: true    # JSON output for log aggregation
Environment variable:
Log Correlation¶
Include trace IDs for correlation with distributed traces:
from mcp_hangar.observability import get_current_trace_id
from mcp_hangar.logging_config import get_logger
logger = get_logger(__name__)
logger.info("processing", trace_id=get_current_trace_id())
Health Checks¶
HTTP Endpoints¶
| Endpoint | Purpose | Use Case |
|---|---|---|
| `/health/live` | Liveness | Container restart decisions |
| `/health/ready` | Readiness | Traffic routing |
| `/health/startup` | Startup | Initial boot gate |
Response Format¶
{
  "status": "healthy",
  "checks": [
    {
      "name": "mcp_servers",
      "status": "healthy",
      "duration_ms": 1.2
    }
  ],
  "version": "0.6.3",
  "uptime_seconds": 3600.5
}
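A client that consumes this response might summarize it as sketched below. The helper names are illustrative, and treating every status other than "healthy" as failing is an assumption, since "healthy" is the only status value shown here.

```python
import json


def failing_checks(body: str) -> list[str]:
    """Return the names of checks whose status is anything other than 'healthy'."""
    doc = json.loads(body)
    return [
        check.get("name", "<unnamed>")
        for check in doc.get("checks", [])
        if check.get("status") != "healthy"
    ]


def is_healthy(body: str) -> bool:
    """True when the top-level status is 'healthy' and no individual check fails."""
    doc = json.loads(body)
    return doc.get("status") == "healthy" and not failing_checks(body)


# The response shown above, plus a hand-written degraded variant for comparison.
HEALTHY_BODY = json.dumps({
    "status": "healthy",
    "checks": [{"name": "mcp_servers", "status": "healthy", "duration_ms": 1.2}],
    "version": "0.6.3",
    "uptime_seconds": 3600.5,
})
DEGRADED_BODY = json.dumps({
    "status": "unhealthy",
    "checks": [{"name": "mcp_servers", "status": "unhealthy", "duration_ms": 950.0}],
})
```

Listing the failing check names separately makes the result usable in log lines or alert annotations, not only as a pass/fail gate.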
Kubernetes Configuration¶
livenessProbe:
  httpGet:
    path: /health/live
    port: 8000
  initialDelaySeconds: 5
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8000
  initialDelaySeconds: 10
  periodSeconds: 5
SLIs/SLOs¶
Service Level Indicators¶
| SLI | Metric | Measurement |
|---|---|---|
| Availability | Service up | up{job="mcp-hangar"} |
| Latency | Tool call duration | P95 < 3s |
| Error Rate | Failed invocations | Error rate < 1% |
| Batch Success | Batch completion | Success rate > 95% |
Recommended SLOs¶
| SLI | Target | Window |
|---|---|---|
| Availability | 99.9% | 30 days |
| Latency (P95) | < 3s | 5 minutes |
| Error Rate | < 1% | 5 minutes |
| Batch Success | > 95% | 5 minutes |
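It can help to translate the availability target into wall-clock time. The arithmetic below is a sketch with an illustrative function name; it assumes downtime is measured in whole-service minutes over the window.

```python
def allowed_downtime_minutes(availability_target: float, window_days: int) -> float:
    """Minutes of downtime permitted by an availability target over a window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1.0 - availability_target)


# The recommended SLO from the table: 99.9% over 30 days.
budget = allowed_downtime_minutes(0.999, 30)
```

For 99.9% over 30 days this comes to roughly 43 minutes, which is a useful sanity check when choosing the `for` durations on critical alerts.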
PromQL Queries¶
# Availability (service up ratio over 30d)
avg_over_time(up{job="mcp-hangar"}[30d])
# Error budget remaining
1 - (
sum(increase(mcp_hangar_tool_call_errors_total[30d]))
/ sum(increase(mcp_hangar_tool_calls_total[30d]))
) / 0.01
# P95 latency
histogram_quantile(0.95,
sum(rate(mcp_hangar_tool_call_duration_seconds_bucket[5m])) by (le)
)
# Batch success rate
sum(rate(mcp_hangar_batch_calls_total{result="success"}[5m]))
/ sum(rate(mcp_hangar_batch_calls_total[5m]))
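The error budget query divides the observed error ratio by the 1% SLO and subtracts the result from 1. The same arithmetic is sketched in Python below with made-up request counts; returning 1.0 when there is no traffic is a choice made for this sketch, whereas the PromQL expression would yield no meaningful value in that case.

```python
def error_budget_remaining(errors: float, total: float, slo_error_rate: float = 0.01) -> float:
    """Fraction of the error budget left: 1.0 is untouched, 0.0 is spent, negative is overspent."""
    if total == 0:
        return 1.0  # no traffic in the window, so nothing has been consumed
    return 1.0 - (errors / total) / slo_error_rate


# Hypothetical 30-day totals: 50 failed calls out of 10,000.
remaining = error_budget_remaining(errors=50, total=10_000)
```

A 0.5% error ratio against a 1% SLO leaves about half the budget; a 2% ratio gives a negative result, meaning the budget is overspent.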
Troubleshooting¶
Metrics Not Visible¶
- Verify that the /metrics endpoint on the Hangar HTTP port responds
- Check Prometheus targets at http://localhost:9090/targets
- Verify network connectivity (use host.docker.internal for Docker on Mac/Windows)
Alerts Not Firing¶
- Check that alert rules are loaded (see Testing Alerts)
- Verify metrics exist for alert expressions
- Check that Prometheus can reach Alertmanager
High Consecutive Failures¶
If MCPHangarHighConsecutiveFailures fires:
- Check MCP server logs for errors
- Verify MCP server command/configuration
- Restart the MCP server, either by restarting Hangar or by invoking the MCP server (the first tool call triggers a cold start)
MCP Server Start Errors¶
Common patterns and fixes:
| Error | Cause | Fix |
|---|---|---|
| `ModuleNotFoundError` | Missing dependency | `pip install <package>` |
| `FileNotFoundError` | Wrong path | Check command in config |
| `PermissionError` | Not executable | `chmod +x <script>` |
| Exit code 137 | OOM killed | Increase memory limits |
Best Practices¶
Metrics¶
- Monitor the right things - Focus on user-facing SLIs
- Set appropriate retention - 15 days for metrics, 7 days for traces
- Avoid high cardinality - Don't use unbounded values as labels
Alerting¶
- Create runbooks - Document response procedures
- Start conservative - Tune thresholds based on baseline
- Test regularly - Verify notification channels work
- Use severity correctly - Critical = page, Warning = ticket
Dashboards¶
- Layer information - Overview -> Details -> Debug
- Include time selectors - Allow drilling into incidents
- Add annotations - Mark deployments and incidents
Production Readiness Checklist¶
- [ ] Prometheus scraping MCP Hangar metrics
- [ ] Grafana dashboards imported and working
- [ ] Alertmanager configured with notification routes
- [ ] Critical alerts tested (e.g., stop service, verify page)
- [ ] Runbooks created for each alert
- [ ] Log aggregation configured (ELK, Loki, etc.)
- [ ] Tracing enabled and traces visible in Jaeger/Langfuse