Observability Guide

This guide covers MCP Hangar's observability features: metrics, tracing, logging, and health checks.

Quick Start

Prerequisites

# Core package
pip install mcp-hangar

# For full observability support
pip install "mcp-hangar[observability]"

Start Monitoring Stack

The monitoring stack is in monitoring/ and includes Prometheus, Grafana, and Alertmanager:

# Using Docker Compose
cd monitoring
docker compose up -d

# Using Podman
cd monitoring
podman compose up -d

Access dashboards:

| Service | URL | Credentials |
|---|---|---|
| Grafana | http://localhost:3000 | admin / admin |
| Prometheus | http://localhost:9090 | - |
| Alertmanager | http://localhost:9093 | - |

Start MCP Hangar with Metrics

# HTTP mode (exposes /metrics endpoint)
mcp-hangar serve --http --port 8000

# With custom config
MCP_CONFIG=config.yaml mcp-hangar serve --http --port 8000

Verify metrics are exposed:

curl http://localhost:8000/metrics | grep mcp_hangar
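
The same check can be scripted. The following is a minimal sketch that filters a Prometheus text-format payload for the mcp_hangar_ prefix, assuming the payload has already been fetched (for example with the curl command above). The function name and the sample label values are illustrative, not part of MCP Hangar.

```python
def hangar_metric_names(exposition_text: str, prefix: str = "mcp_hangar_") -> set[str]:
    """Return the distinct sample names in a /metrics payload that start with prefix."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # A sample line looks like "name{labels} value" or "name value".
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name.startswith(prefix):
            names.add(name)
    return names


# Hypothetical payload; the label values are made up for illustration.
SAMPLE = """\
# HELP mcp_hangar_tool_calls_total Total tool invocations
# TYPE mcp_hangar_tool_calls_total counter
mcp_hangar_tool_calls_total{mcp_server="math",tool="add",status="success"} 42
python_gc_objects_collected_total{generation="0"} 123
mcp_hangar_batch_concurrency 0
"""

print(sorted(hangar_metric_names(SAMPLE)))
```

An empty result for a live endpoint would suggest that Hangar is not running in HTTP mode or that the wrong port is being queried.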

Monitoring Stack

Architecture

+----------------+     scrape      +------------+
|  MCP Hangar    |---------------->| Prometheus |
|  :8000/metrics |                 |   :9090    |
+----------------+                 +-----+------+
                                         |
                                         | query
                                         v
                                   +------------+
                                   |  Grafana   |
                                   |   :3000    |
                                   +------------+

+----------------+     alerts      +-------------+
|  Prometheus    |---------------->| Alertmanager|
|  alert rules   |                 |    :9093    |
+----------------+                 +-------------+

Configuration Files

| File | Purpose |
|---|---|
| monitoring/docker-compose.yaml | Container orchestration |
| monitoring/prometheus/prometheus.yaml | Scrape configuration |
| monitoring/prometheus/alerts.yaml | Alert rules |
| monitoring/alertmanager/alertmanager.yaml | Notification routing |
| monitoring/grafana/provisioning/ | Dashboard/datasource provisioning |
| monitoring/grafana/dashboards/ | Pre-built dashboard JSON files |

Prometheus Configuration

The default configuration scrapes MCP Hangar every 10 seconds:

# monitoring/prometheus/prometheus.yaml
scrape_configs:
  - job_name: 'mcp-hangar'
    static_configs:
      - targets: ['host.docker.internal:8000']
        labels:
          service: 'mcp-hangar'
          tier: 'application'
    metrics_path: /metrics
    scrape_interval: 10s
    scrape_timeout: 5s

For Kubernetes deployments, use service discovery:

scrape_configs:
  - job_name: 'mcp-hangar'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: mcp-hangar
        action: keep

Metrics

MCP Hangar exports Prometheus metrics at /metrics. All metrics use the mcp_hangar_ prefix.

Currently Exported Metrics

Tool Invocations

| Metric | Type | Labels | Description |
|---|---|---|---|
| mcp_hangar_tool_calls_total | Counter | mcp_server, tool, status | Total tool invocations |
| mcp_hangar_tool_call_duration_seconds | Histogram | mcp_server, tool | Invocation latency (buckets: 0.01-30s) |
| mcp_hangar_tool_call_errors_total | Counter | mcp_server, tool, error_type | Failed invocations by error type |

Example queries:

# Tool call rate by mcp_server
sum(rate(mcp_hangar_tool_calls_total[5m])) by (mcp_server)

# P95 latency by tool
histogram_quantile(0.95, sum(rate(mcp_hangar_tool_call_duration_seconds_bucket[5m])) by (le, tool))

# Error rate
sum(rate(mcp_hangar_tool_call_errors_total[5m])) / sum(rate(mcp_hangar_tool_calls_total[5m]))
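
The P95 query relies on histogram_quantile, which estimates a quantile by linear interpolation inside the bucket that contains the target rank. The sketch below reproduces that estimate in plain Python to show why results depend on bucket boundaries. The bucket bounds and counts are hypothetical, and the function is a simplified illustration, not Prometheus's implementation.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative histogram buckets.

    buckets holds (upper_bound, cumulative_count) pairs sorted by upper_bound,
    with the last pair being the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The rank falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            # Interpolate linearly between the bucket's lower and upper bounds.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


# Hypothetical cumulative counts for a latency histogram (bounds in seconds).
BUCKETS = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
print(histogram_quantile(0.95, BUCKETS))  # roughly 0.81 seconds
```

Because the estimate can never exceed the highest finite bucket bound, latencies above that bound are under-reported; this matters when choosing alert thresholds close to the top bucket.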

Batch Invocations

| Metric | Type | Labels | Description |
|---|---|---|---|
| mcp_hangar_batch_calls_total | Counter | result | Batch invocations (success/failure) |
| mcp_hangar_batch_duration_seconds | Histogram | - | Batch execution time |
| mcp_hangar_batch_size | Histogram | - | Number of calls per batch |
| mcp_hangar_batch_cancellations_total | Counter | - | Cancelled batches |
| mcp_hangar_batch_circuit_breaker_rejections_total | Counter | - | Circuit breaker rejections |
| mcp_hangar_batch_concurrency | Gauge | - | Current parallel executions |

Example queries:

# Batch success rate
sum(rate(mcp_hangar_batch_calls_total{result="success"}[5m]))
/ sum(rate(mcp_hangar_batch_calls_total[5m]))

# Average batch size
rate(mcp_hangar_batch_size_sum[5m]) / rate(mcp_hangar_batch_size_count[5m])

Health Checks

| Metric | Type | Labels | Description |
|---|---|---|---|
| mcp_hangar_health_checks_total | Counter | mcp_server, result | Health check executions |
| mcp_hangar_health_check_duration_seconds | Histogram | mcp_server | Health check latency |
| mcp_hangar_health_check_consecutive_failures | Gauge | mcp_server | Current consecutive failure count |

Example queries:

# Unhealthy mcp_servers (>2 consecutive failures)
mcp_hangar_health_check_consecutive_failures > 2

# Health check success rate
sum(rate(mcp_hangar_health_checks_total{result="healthy"}[5m])) by (mcp_server)
/ sum(rate(mcp_hangar_health_checks_total[5m])) by (mcp_server)

MCP Server Lifecycle

| Metric | Type | Labels | Description |
|---|---|---|---|
| mcp_hangar_mcp_server_state | Gauge | mcp_server | Current state (0=cold, 1=initializing, 2=ready, 3=degraded, 4=dead) |
| mcp_hangar_mcp_server_up | Gauge | mcp_server | 1 if MCP server is reachable |
| mcp_hangar_mcp_server_starts_total | Counter | mcp_server | MCP server start attempts |
| mcp_hangar_mcp_server_initialized | Gauge | mcp_server | 1 if MCP server has been initialized |
| mcp_hangar_mcp_server_cold_start_seconds | Histogram | mcp_server | Cold start latency |
| mcp_hangar_mcp_server_cold_start_in_progress | Gauge | mcp_server | 1 if cold start is in progress |
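
When consuming mcp_hangar_mcp_server_state outside Grafana, the numeric gauge value has to be mapped back to a state name. A small sketch of that mapping follows; the enum and function names are hypothetical, and only the value-to-state pairs come from the table above.

```python
from enum import IntEnum


class McpServerState(IntEnum):
    """Numeric values of the mcp_hangar_mcp_server_state gauge (names are illustrative)."""
    COLD = 0
    INITIALIZING = 1
    READY = 2
    DEGRADED = 3
    DEAD = 4


def describe_state(gauge_value: float) -> str:
    """Translate a scraped gauge sample into a lowercase state name."""
    return McpServerState(int(gauge_value)).name.lower()


print(describe_state(2.0))  # ready
print(describe_state(4))    # dead
```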

Discovery

| Metric | Type | Labels | Description |
|---|---|---|---|
| mcp_hangar_discovery_mcp_servers | Gauge | source | Discovered MCP servers per source |
| mcp_hangar_discovery_registrations_total | Counter | source | New registrations |
| mcp_hangar_discovery_errors_total | Counter | source | Errors by source |
| mcp_hangar_discovery_cycle_duration_seconds | Histogram | source | Discovery cycle duration |

HTTP Transport

| Metric | Type | Labels | Description |
|---|---|---|---|
| mcp_hangar_http_requests_total | Counter | method, status | HTTP requests to remote MCP servers |
| mcp_hangar_http_request_duration_seconds | Histogram | method | HTTP request latency |
| mcp_hangar_http_connections | Gauge | mcp_server | Active HTTP connections |

Rate Limiting

| Metric | Type | Labels | Description |
|---|---|---|---|
| mcp_hangar_rate_limit_hits_total | Counter | principal | Rate limit rejections |

GC (Garbage Collection)

| Metric | Type | Labels | Description |
|---|---|---|---|
| mcp_hangar_gc_cycles_total | Counter | - | GC cycle executions |
| mcp_hangar_gc_cycle_duration_seconds | Histogram | - | GC cycle duration |

Grafana Dashboards

Pre-built dashboards are provisioned automatically from monitoring/grafana/dashboards/:

Overview Dashboard

File: overview.json URL: http://localhost:3000/d/mcp-hangar-overview

Provides high-level system health:

  • Request rate and error rate trends
  • Latency percentiles (P50, P95, P99)
  • MCP Server health status
  • Batch invocation success/failure rates
  • Health check results
  • GC cycle performance

MCP Server Details Dashboard

File: mcp-server-details.json URL: http://localhost:3000/d/mcp-hangar-mcp-server-details

Deep dive into individual MCP servers:

  • Tool call breakdown by tool name
  • Per-tool latency histograms
  • Error distribution by type
  • Health check history
  • Consecutive failure tracking

Alerts Dashboard

File: alerts.json URL: http://localhost:3000/d/mcp-hangar-alerts

Alert monitoring and trends:

  • Active alerts by severity
  • Alert condition trends (error rate, latency, health)
  • Historical alert timeline

Importing Dashboards Manually

If not using provisioning:

  1. Open Grafana at http://localhost:3000
  2. Go to Dashboards > Import
  3. Upload JSON file from monitoring/grafana/dashboards/
  4. Select Prometheus data source
  5. Click Import

Alerting

Alert Configuration

Alert rules are defined in monitoring/prometheus/alerts.yaml and organized by severity:

Critical Alerts (Page On-Call)

| Alert | Condition | For | Description |
|---|---|---|---|
| MCPHangarNotResponding | up{job="mcp-hangar"} == 0 | 1m | Service unreachable |
| MCPHangarHighErrorRate | Error rate > 10% | 2m | Significant failures |
| MCPHangarBatchHighFailureRate | Batch failure > 20% | 3m | Batch operations failing |
| MCPHangarCircuitBreakerTripped | CB rejections > 10/5m | 2m | MCP Server isolated |
| MCPHangarProviderUnhealthy | Consecutive failures > 5 | 2m | MCP Server critically unhealthy |

Warning Alerts (Investigate)

| Alert | Condition | For | Description |
|---|---|---|---|
| MCPHangarHighConsecutiveFailures | Consecutive failures > 2 | 2m | Health check issues |
| MCPHangarHealthCheckSlow | P95 health check > 5s | 5m | Slow health checks |
| MCPHangarHighLatencyP95 | P95 latency > 3s | 5m | Performance degradation |
| MCPHangarHighLatencyP99 | P99 latency > 5s | 5m | Tail latency issues |
| MCPHangarHighLatencyByTool | P95 per-tool > 5s | 5m | Specific tool slow |
| MCPHangarFrequentColdStarts | Start rate > 0.1/s | 10m | Consider increasing idle_ttl |
| MCPHangarBatchSlowExecution | P95 batch > 30s | 5m | Slow batch processing |
| MCPHangarBatchHighCancellationRate | Cancellation > 10% | 5m | Batches timing out |
| MCPHangarBatchSizeTooLarge | P95 size > 50 | 5m | Consider smaller batches |
| MCPHangarGCSlowCycles | P95 GC > 0.5s | 5m | GC performance issue |
| MCPHangarHighMemoryUsage | Memory > 2GB | 10m | Memory pressure |
| MCPHangarHighCPUUsage | CPU > 80% | 10m | CPU saturation |

Info Alerts (Tracking)

| Alert | Condition | Description |
|---|---|---|
| MCPHangarMcpServerStarted | Any MCP server start | MCP Server lifecycle event |
| MCPHangarHighToolCallVolume | Rate > 100/s | High traffic notification |

Alertmanager Configuration

Configure notification routing in monitoring/alertmanager/alertmanager.yaml:

route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
    - match:
        severity: warning
      receiver: 'slack'

receivers:
  - name: 'default'
    webhook_configs:
      - url: 'http://your-webhook-endpoint'

  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: '<your-service-key>'

  - name: 'slack'
    slack_configs:
      - api_url: '<your-slack-webhook-url>'
        channel: '#mcp-hangar-alerts'
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
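
To make the routing behaviour concrete, here is a simplified Python model of how the route tree above selects a receiver: child routes are tried in order, the first one whose match labels all equal the alert's labels wins, and unmatched alerts fall back to the top-level receiver. This is an illustrative sketch only; it ignores grouping, regex matchers, nested routes, and the continue option.

```python
# Child routes from the configuration above, in declaration order.
ROUTES = [
    ({"severity": "critical"}, "pagerduty"),
    ({"severity": "warning"}, "slack"),
]
DEFAULT_RECEIVER = "default"


def pick_receiver(alert_labels: dict[str, str]) -> str:
    """Return the receiver of the first child route matching the alert's labels."""
    for match, receiver in ROUTES:
        if all(alert_labels.get(key) == value for key, value in match.items()):
            return receiver
    # No child route matched: the root route's receiver handles the alert.
    return DEFAULT_RECEIVER


print(pick_receiver({"alertname": "MCPHangarNotResponding", "severity": "critical"}))  # pagerduty
print(pick_receiver({"alertname": "MCPHangarHighLatencyP95", "severity": "warning"}))  # slack
print(pick_receiver({"alertname": "MCPHangarMcpServerStarted", "severity": "info"}))   # default
```

With this configuration, info-level alerts reach only the default webhook, so that endpoint should be something low-noise.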

Testing Alerts

Verify alert rules are loaded:

# Check Prometheus rules
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[].name'

# Check for firing alerts
curl -s http://localhost:9090/api/v1/alerts | jq '.data.alerts[] | select(.state=="firing")'
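
For automated checks (for example in a smoke test), the same filter can be applied in Python to the JSON returned by /api/v1/alerts. The sketch below works on an already-decoded response; the function name and the sample payload are illustrative.

```python
def firing_alert_names(payload: dict) -> list[str]:
    """Return the sorted, de-duplicated alert names that are currently firing."""
    alerts = payload.get("data", {}).get("alerts", [])
    return sorted(
        {alert["labels"]["alertname"] for alert in alerts if alert.get("state") == "firing"}
    )


# Hypothetical response body, shaped like the Prometheus alerts API output.
SAMPLE_RESPONSE = {
    "status": "success",
    "data": {
        "alerts": [
            {"labels": {"alertname": "MCPHangarHighErrorRate", "severity": "critical"}, "state": "firing"},
            {"labels": {"alertname": "MCPHangarHighLatencyP95", "severity": "warning"}, "state": "pending"},
        ]
    },
}

print(firing_alert_names(SAMPLE_RESPONSE))  # ['MCPHangarHighErrorRate']
```

Alerts in the pending state have met their condition but not yet for the full "For" duration, so they are excluded here just as in the jq filter.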

Tracing

OpenTelemetry Integration

MCP Hangar supports distributed tracing via OpenTelemetry. Every tool invocation produces an OTEL span carrying MCP governance attributes (mcp.server.id, mcp.tool.name, mcp.tool.status, enforcement context, and identity context when available).

For the full MCP attribute taxonomy, partner backend recipes (OTEL Collector, OpenLIT, Langfuse, Grafana), and reference docker-compose setups, see: OpenTelemetry Integrations.

from mcp_hangar.observability import init_tracing, trace_span

# Initialize once at startup
init_tracing(
    service_name="mcp-hangar",
    otlp_endpoint="http://localhost:4317",
)

# Create spans for operations
with trace_span("process_request", {"request.id": req_id}) as span:
    span.add_event("checkpoint_reached")
    result = do_work()

MCP Governance Attributes on Spans

TracedMcpServerService automatically creates an OTEL span for each tool invocation with standard MCP governance attributes via set_governance_attributes():

from mcp_hangar.observability.conventions import McpServer, MCP, set_governance_attributes

# set_governance_attributes(span, ...) sets all applicable attributes in one call.
# None values are omitted -- no empty strings pollute OTLP backends.
set_governance_attributes(
    span,
    mcp_server_id="math",
    tool_name="add",
    user_id="alice",           # optional
    session_id="sess-42",      # optional
    policy_result="allow",     # optional
    enforcement_action=None,   # omitted from span
)

OTLP Audit Export

Security-relevant domain events (tool invocations, MCP server state transitions) are automatically exported as OTLP log records when OTEL_EXPORTER_OTLP_ENDPOINT is set. This is handled by OTLPAuditExporter and OTLPAuditEventHandler -- no additional configuration needed.

Events exported:

  • ToolInvocationCompleted / ToolInvocationFailed -- with MCP server, tool, status, duration, caller identity, cost attribution
  • McpServerStateChanged -- with MCP server, from_state, to_state

Caller identity attributes (mcp.caller.type, mcp.caller.id, mcp.caller.roles) are automatically propagated from the event's identity_context when available.

Cost attributes (mcp.cost.cents, mcp.cost.model, mcp.cost.input_tokens, mcp.cost.output_tokens) are included when cost attribution is configured.

Compliance Export Formats (Enterprise)

Enterprise deployments can export audit events in SIEM-compatible formats alongside OTLP. Available exporters (in src/mcp_hangar/compliance/):

| Format | Class | Use Case |
|---|---|---|
| CEF | CEFExporter | ArcSight, QRadar, Splunk via CEF |
| JSON-lines | JSONLinesExporter | Splunk HEC, Elasticsearch, custom pipelines |
| LEEF | LEEFExporter | IBM QRadar native format |
| Syslog (RFC 5424) | SyslogExporter | Any syslog-compatible SIEM |

All exporters implement the IAuditExporter protocol and output to file, callback, or stderr (for container log collection). Configure via the compliance bootstrap.

Environment Variables

| Variable | Default | Description |
|---|---|---|
| MCP_TRACING_ENABLED | true | Enable/disable tracing |
| OTEL_EXPORTER_OTLP_ENDPOINT | http://localhost:4317 | OTLP collector endpoint (also activates OTLP audit export) |
| OTEL_SERVICE_NAME | mcp-hangar | Service name in traces |

Trace Context Propagation

W3C TraceContext is automatically propagated across agent -> Hangar -> MCP server boundaries:

  • Inbound: BatchExecutor extracts traceparent from call metadata, creating child spans linked to the agent's root trace.
  • Outbound: HttpClient injects traceparent into outbound HTTP headers when calling remote MCP servers.
  • Stdio: Not supported (JSON-RPC over stdin/stdout has no header mechanism).

Manual propagation is also available:

from mcp_hangar.observability import inject_trace_context, extract_trace_context

# Inject into outgoing requests
headers = {}
inject_trace_context(headers)

# Extract from incoming requests
context = extract_trace_context(request_headers)
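
When debugging propagation, it helps to know what the traceparent header looks like on the wire. In the W3C Trace Context format it has four hyphen-separated hex fields: version, trace ID (32 characters), parent span ID (16 characters), and flags (2 characters, where the lowest bit marks the trace as sampled). The following standard-library sketch validates and decodes such a header; it is independent of MCP Hangar's helpers, and its names are illustrative.

```python
import re
from typing import NamedTuple, Optional

# Shape of a version-00 traceparent header: version-traceid-parentid-flags.
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Decode a traceparent header, returning None if it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version "ff" and all-zero IDs are invalid according to the specification.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return TraceParent(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))


parsed = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(parsed.trace_id, parsed.sampled)
```

If the trace ID seen by the remote MCP server differs from the one the agent sent, the header was probably dropped or regenerated somewhere along the path.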

Langfuse Integration

MCP Hangar integrates with Langfuse for LLM-specific observability.

Configuration

export MCP_LANGFUSE_ENABLED=true
export LANGFUSE_PUBLIC_KEY=pk-lf-...
export LANGFUSE_SECRET_KEY=sk-lf-...
export LANGFUSE_HOST=https://cloud.langfuse.com

Or via config.yaml:

observability:
  langfuse:
    enabled: true
    public_key: ${LANGFUSE_PUBLIC_KEY}
    secret_key: ${LANGFUSE_SECRET_KEY}
    host: https://cloud.langfuse.com
    sample_rate: 1.0

Trace Propagation

from mcp_hangar.application.services import TracedMcpServerService

# traced_service is assumed to be an already-constructed TracedMcpServerService instance.
result = traced_service.invoke_tool(
    mcp_server_id="math",
    tool_name="add",
    arguments={"a": 1, "b": 2},
    trace_id="your-langfuse-trace-id",
    user_id="user-123",
    session_id="session-456",
)

See ADR-007 for architectural details.

Logging

Structured Logging

MCP Hangar uses structlog for structured JSON logging:

{
  "timestamp": "2026-02-03T10:30:00.123Z",
  "level": "info",
  "event": "tool_invoked",
  "mcp_server": "math",
  "tool": "add",
  "duration_ms": 150,
  "service": "mcp-hangar"
}
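
To illustrate how a log line of this shape is produced, here is a sketch that uses only the Python standard library instead of structlog. It is not MCP Hangar's logging setup; the formatter class, the "fields" convention, and the logger name are assumptions made for the example.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with event-specific fields merged in."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc)
            .isoformat(timespec="milliseconds"),
            "level": record.levelname.lower(),
            "event": record.getMessage(),
            "service": "mcp-hangar",
        }
        # Structured fields are passed via extra={"fields": {...}} (a convention of this sketch).
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


# Write to an in-memory stream so the output can be inspected.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("hangar_demo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

logger.info("tool_invoked", extra={"fields": {"mcp_server": "math", "tool": "add", "duration_ms": 150}})
entry = json.loads(stream.getvalue())
print(entry["event"], entry["mcp_server"], entry["duration_ms"])
```

The timestamp in this sketch ends in +00:00 rather than Z; both denote UTC.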

Configuration

logging:
  level: INFO          # DEBUG, INFO, WARNING, ERROR
  json_format: true    # JSON output for log aggregation

Environment variable:

MCP_LOG_LEVEL=DEBUG mcp-hangar serve --http

Log Correlation

Include trace IDs for correlation with distributed traces:

from mcp_hangar.observability import get_current_trace_id
from mcp_hangar.logging_config import get_logger

logger = get_logger(__name__)
logger.info("processing", trace_id=get_current_trace_id())

Health Checks

HTTP Endpoints

| Endpoint | Purpose | Use Case |
|---|---|---|
| /health/live | Liveness | Container restart decisions |
| /health/ready | Readiness | Traffic routing |
| /health/startup | Startup | Initial boot gate |

Response Format

{
  "status": "healthy",
  "checks": [
    {
      "name": "mcp_servers",
      "status": "healthy",
      "duration_ms": 1.2
    }
  ],
  "version": "0.6.3",
  "uptime_seconds": 3600.5
}
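
A client such as a deploy script can use this structure to report which checks are failing, not just the overall status. The sketch below assumes the response has already been decoded from JSON and that any status other than "healthy" counts as a failure; the "unhealthy" value in the second sample is an assumption, since only the healthy case is shown above.

```python
def failing_checks(response: dict) -> list[str]:
    """Return the names of checks whose status is anything other than "healthy"."""
    return [
        check["name"]
        for check in response.get("checks", [])
        if check.get("status") != "healthy"
    ]


# The healthy response shown above.
HEALTHY = {
    "status": "healthy",
    "checks": [{"name": "mcp_servers", "status": "healthy", "duration_ms": 1.2}],
    "version": "0.6.3",
    "uptime_seconds": 3600.5,
}

# Hypothetical failing response; the "unhealthy" status string is assumed.
FAILING = {
    "status": "unhealthy",
    "checks": [{"name": "mcp_servers", "status": "unhealthy", "duration_ms": 3.4}],
}

print(failing_checks(HEALTHY))  # []
print(failing_checks(FAILING))  # ['mcp_servers']
```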

Kubernetes Configuration

livenessProbe:
  httpGet:
    path: /health/live
    port: 8000
  initialDelaySeconds: 5
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /health/ready
    port: 8000
  initialDelaySeconds: 10
  periodSeconds: 5

SLIs/SLOs

Service Level Indicators

| SLI | Metric | Measurement |
|---|---|---|
| Availability | Service up | up{job="mcp-hangar"} |
| Latency | Tool call duration | P95 < 3s |
| Error Rate | Failed invocations | Error rate < 1% |
| Batch Success | Batch completion | Success rate > 95% |

Service Level Objectives

| SLI | Target | Window |
|---|---|---|
| Availability | 99.9% | 30 days |
| Latency (P95) | < 3s | 5 minutes |
| Error Rate | < 1% | 5 minutes |
| Batch Success | > 95% | 5 minutes |

PromQL Queries

# Availability (service up ratio over 30d)
avg_over_time(up{job="mcp-hangar"}[30d])

# Error budget remaining
1 - (
  sum(increase(mcp_hangar_tool_call_errors_total[30d]))
  / sum(increase(mcp_hangar_tool_calls_total[30d]))
) / 0.01

# P95 latency
histogram_quantile(0.95,
  sum(rate(mcp_hangar_tool_call_duration_seconds_bucket[5m])) by (le)
)

# Batch success rate
sum(rate(mcp_hangar_batch_calls_total{result="success"}[5m]))
/ sum(rate(mcp_hangar_batch_calls_total[5m]))
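
The arithmetic behind the availability target and the error budget query can be checked by hand. The sketch below mirrors the "error budget remaining" expression (1 minus the observed error ratio divided by the 1% allowance) and converts the 99.9% availability target into allowed downtime. The function names and the sample counts are illustrative.

```python
def error_budget_remaining(errors: float, total: float, allowed_error_rate: float = 0.01) -> float:
    """Fraction of the error budget left; 1.0 means untouched, below 0 means overspent."""
    if total == 0:
        return 1.0  # No traffic, so no budget has been consumed.
    return 1 - (errors / total) / allowed_error_rate


def allowed_downtime_minutes(availability_target: float, window_days: int) -> float:
    """Downtime permitted by an availability target over the given window."""
    return (1 - availability_target) * window_days * 24 * 60


# 25 failed calls out of 10,000 is a 0.25% error rate, a quarter of the 1% allowance.
print(error_budget_remaining(25, 10_000))   # about 0.75
# 99.9% over 30 days leaves roughly 43 minutes of downtime.
print(allowed_downtime_minutes(0.999, 30))  # about 43.2
```

A negative result from the first function means the window's budget is exhausted, which is usually the signal to pause risky changes.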

Troubleshooting

Metrics Not Visible

  1. Verify endpoint:
curl http://localhost:8000/metrics | head -20
  2. Check Prometheus targets at http://localhost:9090/targets
  3. Verify network connectivity (use host.docker.internal for Docker on Mac/Windows)

Alerts Not Firing

  1. Check alert rules loaded:
curl http://localhost:9090/api/v1/rules | jq '.data.groups[].name'
  2. Verify metrics exist for alert expressions
  3. Check Alertmanager connectivity (the v1 API was removed in recent Alertmanager releases, so use v2):
curl http://localhost:9093/api/v2/status

High Consecutive Failures

If MCPHangarHighConsecutiveFailures fires:

  1. Check MCP server logs for errors
  2. Verify MCP server command/configuration
  3. Restart the MCP server, either by restarting Hangar or by invoking one of its tools (the first tool call triggers a cold start), then check its state:
mcp-hangar status

MCP Server Start Errors

Common patterns and fixes:

| Error | Cause | Fix |
|---|---|---|
| ModuleNotFoundError | Missing dependency | pip install <package> |
| FileNotFoundError | Wrong path | Check command in config |
| PermissionError | Not executable | chmod +x <script> |
| Exit code 137 | OOM killed | Increase memory limits |
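
Exit code 137 follows the common convention that a process terminated by a signal is reported as 128 plus the signal number: 137 is 128 + 9, and signal 9 is SIGKILL, which is what the Linux OOM killer sends. The sketch below decodes such codes on POSIX systems; the function name is illustrative.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain a process exit code using the 128 + signal-number convention."""
    if code > 128:
        try:
            return f"killed by {signal.Signals(code - 128).name}"
        except ValueError:
            # The offset does not correspond to a signal known on this platform.
            return f"exit code {code} (unknown signal {code - 128})"
    return "success" if code == 0 else f"exited with status {code}"


print(describe_exit_code(137))  # killed by SIGKILL
print(describe_exit_code(1))    # exited with status 1
```

SIGKILL can also come from an operator or orchestrator, so confirm an out-of-memory kill in the container runtime or kernel logs before raising memory limits.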

Best Practices

Metrics

  1. Monitor the right things - Focus on user-facing SLIs
  2. Set appropriate retention - 15 days for metrics, 7 days for traces
  3. Avoid high cardinality - Don't use unbounded values as labels

Alerting

  1. Create runbooks - Document response procedures
  2. Start conservative - Tune thresholds based on baseline
  3. Test regularly - Verify notification channels work
  4. Use severity correctly - Critical = page, Warning = ticket

Dashboards

  1. Layer information - Overview -> Details -> Debug
  2. Include time selectors - Allow drilling into incidents
  3. Add annotations - Mark deployments and incidents

Production Readiness Checklist

  • [ ] Prometheus scraping MCP Hangar metrics
  • [ ] Grafana dashboards imported and working
  • [ ] Alertmanager configured with notification routes
  • [ ] Critical alerts tested (e.g., stop service, verify page)
  • [ ] Runbooks created for each alert
  • [ ] Log aggregation configured (ELK, Loki, etc.)
  • [ ] Tracing enabled and traces visible in Jaeger/Langfuse