Monitoring

Monitor RevenProx health and performance.

Health Endpoint

Check proxy health:

curl http://localhost:8080/health

Response:

{
  "status": "healthy",
  "connections": 12345,
  "uptime_seconds": 86400
}
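
As a rough illustration of how an external probe might interpret this payload, the sketch below parses a response body and reports whether the proxy considers itself healthy. It assumes only the fields shown above. The function name `is_healthy` is hypothetical, and because status values other than "healthy" are not documented here, anything else is treated as unhealthy.

```python
import json


def is_healthy(body: str) -> bool:
    """Return True only when the /health payload reports status "healthy"."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        # A body that is not valid JSON is treated as a failed check.
        return False
    return isinstance(payload, dict) and payload.get("status") == "healthy"


# Sample body copied from the response shown above.
sample = '{"status": "healthy", "connections": 12345, "uptime_seconds": 86400}'
print(is_healthy(sample))
```

In practice the body would come from an HTTP client calling the health URL. It is passed in as a string here so the check stays testable without a running proxy.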

Key Metrics

Connection Metrics

| Metric | Description | Type |
|---|---|---|
| connections_active | Current open connections | Gauge |
| connections_total | Total connections since start | Counter |
| connections_rejected | Rejected connections | Counter |
| connection_duration_seconds | Connection lifetime histogram | Histogram |

Message Metrics

| Metric | Description | Type |
|---|---|---|
| messages_sent_total | Total messages delivered | Counter |
| messages_dropped_total | Messages dropped (backpressure) | Counter |
| bytes_sent_total | Total bytes transmitted | Counter |
| message_latency_ms | Message delivery latency | Histogram |

Authentication Metrics

| Metric | Description | Type |
|---|---|---|
| auth_requests_total | Total authentication requests | Counter |
| auth_cache_hits | Cache hit count | Counter |
| auth_cache_misses | Cache miss count | Counter |
| auth_failures_total | Authentication failures | Counter |
| circuit_breaker_state | Circuit breaker status (0=closed, 1=open) | Gauge |

System Metrics

| Metric | Description | Type |
|---|---|---|
| memory_used_bytes | Current memory usage | Gauge |
| cpu_usage_percent | CPU utilization | Gauge |
| fd_used | File descriptors in use | Gauge |
| goroutines | Active goroutines/threads | Gauge |

Distributed State Metrics

| Metric | Description | Type |
|---|---|---|
| peers_connected | Connected peer proxies | Gauge |
| sync_messages_sent | Sync messages transmitted | Counter |
| sync_lag_seconds | Time since last successful sync | Gauge |
| topics_active | Active topics | Gauge |

Prometheus Integration

Metrics Endpoint

Fetch metrics in the Prometheus text exposition format:

curl http://localhost:8080/metrics

Output:

# HELP revenprox_connections_active Current active connections
# TYPE revenprox_connections_active gauge
revenprox_connections_active 12345

# HELP revenprox_messages_sent_total Total messages sent
# TYPE revenprox_messages_sent_total counter
revenprox_messages_sent_total 987654321
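
For ad hoc checks outside Prometheus, the text output above can be read with a few lines of code. The following is a minimal sketch, not a full parser for the exposition format: it skips comment lines and assumes each sample line is a name (optionally with labels) followed by a single value, with no trailing timestamp. The function name `parse_metrics` is hypothetical.

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Map each sample name (including any label set) to its value."""
    samples: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # Assumption: the value is the last space-separated field.
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples


# Sample text copied from the output shown above.
output = """\
# HELP revenprox_connections_active Current active connections
# TYPE revenprox_connections_active gauge
revenprox_connections_active 12345

# HELP revenprox_messages_sent_total Total messages sent
# TYPE revenprox_messages_sent_total counter
revenprox_messages_sent_total 987654321
"""
print(parse_metrics(output))
```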

Prometheus Configuration

scrape_configs:
  - job_name: 'revenprox'
    static_configs:
      - targets: ['proxy1:8080', 'proxy2:8080', 'proxy3:8080']
    scrape_interval: 15s
    metrics_path: /metrics

Kubernetes ServiceMonitor

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: revenprox
spec:
  selector:
    matchLabels:
      app: revenprox
  endpoints:
  - port: http
    path: /metrics
    interval: 15s

Grafana Dashboards

Connection Overview

{
  "panels": [
    {
      "title": "Active Connections",
      "targets": [{
        "expr": "revenprox_connections_active"
      }]
    },
    {
      "title": "Connection Rate",
      "targets": [{
        "expr": "rate(revenprox_connections_total[5m])"
      }]
    },
    {
      "title": "Rejected Connections",
      "targets": [{
        "expr": "rate(revenprox_connections_rejected[5m])"
      }]
    }
  ]
}
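
The `rate(...)` expressions in these panels turn ever-increasing counters into per-second rates. The sketch below shows the core idea on a list of `(timestamp, value)` samples: sum the increases between consecutive samples, treat any drop as a counter reset, and divide by the elapsed time. It is a simplification. Prometheus additionally extrapolates towards the window boundaries, which this sketch omits, and the function name `simple_rate` is hypothetical.

```python
def simple_rate(samples: list[tuple[float, float]]) -> float:
    """Approximate per-second increase of a counter over (timestamp, value) samples."""
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A lower value than before means the counter restarted from zero,
        # so the whole current value counts as new increase.
        increase += cur if cur < prev else cur - prev
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


# Three scrapes, 60 seconds apart, with a steady 60 new connections per scrape.
steady = [(0.0, 100.0), (60.0, 160.0), (120.0, 220.0)]
print(simple_rate(steady))
```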

Message Throughput

{
  "panels": [
    {
      "title": "Messages/sec",
      "targets": [{
        "expr": "rate(revenprox_messages_sent_total[1m])"
      }]
    },
    {
      "title": "Bytes/sec",
      "targets": [{
        "expr": "rate(revenprox_bytes_sent_total[1m])"
      }]
    },
    {
      "title": "Message Latency P99",
      "targets": [{
        "expr": "histogram_quantile(0.99, sum by (le) (rate(revenprox_message_latency_ms_bucket[5m])))"
      }]
    }
  ]
}
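
The latency panel relies on `histogram_quantile`, which estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The sketch below reproduces that idea for a single set of buckets. The bucket boundaries and counts are made-up illustration values, not RevenProx defaults, and the function is a simplified stand-in rather than Prometheus's exact implementation.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate quantile q from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound and end with a +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for i, (bound, count) in enumerate(buckets):
        if count >= rank:
            if math.isinf(bound):
                # The rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[i - 1][0] if i > 0 else math.nan
            if count == prev_count:
                return bound
            # Linear interpolation within the bucket that holds the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical latency buckets in milliseconds: (le, cumulative count).
latency_buckets = [(10.0, 50.0), (50.0, 90.0), (100.0, 99.0), (math.inf, 100.0)]
print(histogram_quantile(0.95, latency_buckets))
```

Because the estimate is interpolated, its accuracy depends on how finely the buckets are spaced around the latency target.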

Alerting

Prometheus Alert Rules

groups:
- name: revenprox
  rules:
  # High connection count
  - alert: RevenProxHighConnections
    expr: revenprox_connections_active > 80000
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High connection count on {{ $labels.instance }}"

  # Connection rejections
  - alert: RevenProxConnectionsRejected
    expr: rate(revenprox_connections_rejected[5m]) > 0
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Connections being rejected on {{ $labels.instance }}"

  # Authentication failures spike
  - alert: RevenProxAuthFailures
    expr: rate(revenprox_auth_failures_total[5m]) > 100
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High authentication failure rate"

  # Circuit breaker open
  - alert: RevenProxCircuitBreakerOpen
    expr: revenprox_circuit_breaker_state == 1
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Circuit breaker open - auth webhook failing"

  # Memory usage high
  - alert: RevenProxMemoryHigh
    expr: revenprox_memory_used_bytes / revenprox_memory_limit_bytes > 0.85
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Memory usage above 85%"

  # Peer disconnected
  - alert: RevenProxPeerDisconnected
    expr: revenprox_peers_connected < 2
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Peer proxy disconnected"

Logging

Log Levels

| Level | Description |
|---|---|
| debug | Detailed debugging information |
| info | Normal operational messages |
| warn | Warning conditions |
| error | Error conditions |

Configuration

log_level = "info"

Log Format

[2024-01-15T10:30:45Z INFO] HTTP server started on 0.0.0.0:8080
[2024-01-15T10:30:46Z INFO] Connected to peer tcp://proxy2:5555
[2024-01-15T10:31:00Z WARN] Rate limit exceeded for IP 192.168.1.100
[2024-01-15T10:32:15Z ERROR] Webhook verification failed: timeout
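
When these text logs need to be filtered or forwarded, the bracketed prefix can be split into fields. The sketch below assumes every line follows the `[timestamp LEVEL] message` shape shown above, with a UTC timestamp ending in `Z`. The helper name `parse_log_line` is hypothetical, and lines that do not match are returned as `None` rather than guessed at.

```python
import re
from datetime import datetime, timezone
from typing import Optional

# Matches "[2024-01-15T10:30:45Z INFO] message text".
LINE = re.compile(r"^\[(?P<ts>\S+) (?P<level>[A-Z]+)\] (?P<message>.*)$")


def parse_log_line(line: str) -> Optional[dict]:
    """Split one log line into timestamp, level, and message fields."""
    match = LINE.match(line.strip())
    if match is None:
        return None
    try:
        ts = datetime.strptime(match["ts"], "%Y-%m-%dT%H:%M:%SZ")
    except ValueError:
        return None
    return {
        "timestamp": ts.replace(tzinfo=timezone.utc),
        "level": match["level"],
        "message": match["message"],
    }


# Sample line copied from the log excerpt above.
entry = parse_log_line(
    "[2024-01-15T10:31:00Z WARN] Rate limit exceeded for IP 192.168.1.100"
)
print(entry["level"], entry["message"])
```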

Structured Logging

JSON output, intended for log aggregation systems:

{
  "timestamp": "2024-01-15T10:30:45Z",
  "level": "INFO",
  "message": "HTTP server started",
  "bind_address": "0.0.0.0:8080",
  "proxy_id": "proxy-1"
}

Distributed Tracing

OpenTelemetry Integration

[telemetry]
enabled = true
endpoint = "http://jaeger:14268/api/traces"
service_name = "revenprox"

Trace Context

Propagate trace context with the W3C traceparent request header:

GET /events/topic HTTP/1.1
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
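
The `traceparent` value packs four hyphen-separated hex fields: version, trace ID, parent span ID, and trace flags. As a sketch of how a client or test harness might validate and unpack it, the code below handles the version `00` layout from the W3C Trace Context specification. The function name `parse_traceparent` is hypothetical, and headers from future versions with extra fields are rejected rather than interpreted.

```python
import re
from typing import Optional

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header: str) -> Optional[dict]:
    """Unpack a traceparent header value, or return None if it is invalid."""
    match = TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The spec forbids version "ff" and all-zero trace or parent IDs.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        # Bit 0 of the flags byte is the "sampled" flag.
        "sampled": bool(int(flags, 16) & 0x01),
    }


# Header value copied from the request shown above.
ctx = parse_traceparent("00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01")
print(ctx["trace_id"], ctx["sampled"])
```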

Performance Monitoring

Connection Pool Stats

curl http://localhost:8080/debug/pool

Response:

{
  "active_connections": 12345,
  "idle_connections": 234,
  "total_created": 45678,
  "total_closed": 33333
}

Message Queue Stats

curl http://localhost:8080/debug/queues

Response:

{
  "total_queues": 5000,
  "avg_queue_depth": 12,
  "max_queue_depth": 89,
  "messages_pending": 60000
}
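
The sample values from the two debug endpoints happen to be internally consistent: created minus closed connections equals the active count, and queue count times average depth equals the pending total. The sketch below turns that observation into a quick sanity check. These relationships are an assumption inferred from the sample numbers, not a documented guarantee, and the helper names are hypothetical.

```python
def open_connections(pool: dict) -> int:
    """Connections created but not yet closed, per the pool counters."""
    return pool["total_created"] - pool["total_closed"]


def implied_pending(queues: dict) -> int:
    """Pending messages implied by queue count and average depth."""
    return queues["total_queues"] * queues["avg_queue_depth"]


# Sample payloads copied from the responses shown above.
pool = {
    "active_connections": 12345,
    "idle_connections": 234,
    "total_created": 45678,
    "total_closed": 33333,
}
queues = {
    "total_queues": 5000,
    "avg_queue_depth": 12,
    "max_queue_depth": 89,
    "messages_pending": 60000,
}

print(open_connections(pool) == pool["active_connections"])
print(implied_pending(queues) == queues["messages_pending"])
```

A persistent mismatch between the derived and reported figures could be worth investigating, although a rounded average depth would make the second check approximate on a live system.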

SLA Monitoring

Key SLIs

| SLI | Target | Measurement |
|---|---|---|
| Availability | 99.9% | Health check success rate |
| Latency P99 | < 100ms | Message delivery time |
| Error rate | < 0.1% | Failed requests / total |

SLO Dashboard

Track error budget:

Error Budget = 100% - SLO Target
Error Rate = (Errors / Total Requests) * 100
Remaining = Error Budget - Error Rate
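
A worked example can make the budget arithmetic concrete. The sketch below uses the common convention that the error budget is the failure rate the SLO permits (100% minus the SLO target) and compares it with the observed error rate. The request counts are made-up illustration values, and the function name `error_budget_report` is hypothetical.

```python
import math


def error_budget_report(slo_target_pct: float, errors: int, total: int) -> dict:
    """Compare the observed error rate with the budget implied by an SLO target."""
    budget_pct = 100.0 - slo_target_pct           # allowed failure rate
    error_rate_pct = 100.0 * errors / total if total else 0.0
    consumed = error_rate_pct / budget_pct if budget_pct > 0 else math.inf
    return {
        "budget_pct": budget_pct,
        "error_rate_pct": error_rate_pct,
        "remaining_pct": budget_pct - error_rate_pct,
        "consumed_fraction": consumed,
    }


# Hypothetical window: 250 failed requests out of 1,000,000 against a 99.9% SLO.
report = error_budget_report(99.9, errors=250, total=1_000_000)
print(report)
```

With these numbers the budget is about 0.1%, the observed error rate is 0.025%, and roughly a quarter of the budget has been used.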

Next Steps