Monitoring

Monitor RevenProx health and performance.

Health Endpoint

Check proxy health:

curl http://localhost:8080/health

Response:

{
  "status": "healthy",
  "connections": 12345,
  "uptime_seconds": 86400
}
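For automated checks, the health response can be evaluated in a few lines. The following is a minimal sketch, assuming the endpoint returns the JSON shown above and that any `status` other than `"healthy"` should count as unhealthy; the helper names `is_healthy` and `fetch_health` are illustrative and not part of RevenProx.

```python
import json
import urllib.request


def is_healthy(payload: dict) -> bool:
    """Return True only when the payload reports status "healthy"."""
    return payload.get("status") == "healthy"


def fetch_health(url: str = "http://localhost:8080/health",
                 timeout: float = 2.0) -> dict:
    """Fetch and decode the health endpoint.

    Raises OSError (including URLError and timeouts) on network failure
    and ValueError if the body is not valid JSON.
    """
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example use in a probe script (needs a running proxy):
#   ok = is_healthy(fetch_health())
```

Keeping the decision logic in `is_healthy` separate from the network call makes it easy to test without a running proxy.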

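Key Metrics

The tables below list metric names without the `revenprox_` prefix used in the Prometheus output. For example, `connections_active` is exported as `revenprox_connections_active`.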

Connection Metrics

Metric	Description	Type
`connections_active`	Current open connections	Gauge
`connections_total`	Total connections since start	Counter
`connections_rejected`	Rejected connections	Counter
`connection_duration_seconds`	Connection lifetime histogram	Histogram

Message Metrics

Metric	Description	Type
`messages_sent_total`	Total messages delivered	Counter
`messages_dropped_total`	Messages dropped (backpressure)	Counter
`bytes_sent_total`	Total bytes transmitted	Counter
`message_latency_ms`	Message delivery latency	Histogram

Authentication Metrics

Metric	Description	Type
`auth_requests_total`	Total authentication requests	Counter
`auth_cache_hits`	Cache hit count	Counter
`auth_cache_misses`	Cache miss count	Counter
`auth_failures_total`	Authentication failures	Counter
`circuit_breaker_state`	Circuit breaker status (0=closed, 1=open)	Gauge
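The two cache counters are most useful as a ratio. A small sketch of how a hit ratio could be derived from two successive scrapes of `auth_cache_hits` and `auth_cache_misses`; the function name and the choice to return `None` when no lookups occurred are assumptions made for illustration.

```python
def cache_hit_ratio(hits_before, misses_before, hits_after, misses_after):
    """Hit ratio over the interval between two scrapes of the counters.

    Returns None when no lookups happened in the interval.
    """
    hits = hits_after - hits_before
    misses = misses_after - misses_before
    if hits < 0 or misses < 0:
        # Counters only go up; a drop usually means the process restarted.
        raise ValueError("counter went backwards between scrapes")
    lookups = hits + misses
    return hits / lookups if lookups else None
```

Working on deltas rather than raw totals gives the recent hit ratio instead of the average since startup.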

System Metrics

Metric	Description	Type
`memory_used_bytes`	Current memory usage	Gauge
`cpu_usage_percent`	CPU utilization	Gauge
`fd_used`	File descriptors in use	Gauge
`goroutines`	Active goroutines/threads	Gauge

Distributed State Metrics

Metric	Description	Type
`peers_connected`	Connected peer proxies	Gauge
`sync_messages_sent`	Sync messages transmitted	Counter
`sync_lag_seconds`	Time since last successful sync	Gauge
`topics_active`	Active topics	Gauge

Prometheus Integration

Metrics Endpoint

RevenProx exposes metrics in Prometheus text format at /metrics:

curl http://localhost:8080/metrics

Output:

# HELP revenprox_connections_active Current active connections
# TYPE revenprox_connections_active gauge
revenprox_connections_active 12345

# HELP revenprox_messages_sent_total Total messages sent
# TYPE revenprox_messages_sent_total counter
revenprox_messages_sent_total 987654321
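For quick scripting without a Prometheus server, the text output can be read directly. This is a simplified sketch, assuming each sample line has the form `name value` with no trailing timestamp; it is not a full parser for the exposition format, and `parse_metrics` is a hypothetical helper name.

```python
SAMPLE = """\
# HELP revenprox_connections_active Current active connections
# TYPE revenprox_connections_active gauge
revenprox_connections_active 12345

# HELP revenprox_messages_sent_total Total messages sent
# TYPE revenprox_messages_sent_total counter
revenprox_messages_sent_total 987654321
"""


def parse_metrics(text: str) -> dict:
    """Map each sample name (including any label set) to its float value."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # Split on the last space so label values containing spaces survive.
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples
```

For production use, a maintained client library parser is a safer choice than this sketch.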

Prometheus Configuration

scrape_configs:
  - job_name: 'revenprox'
    static_configs:
      - targets: ['proxy1:8080', 'proxy2:8080', 'proxy3:8080']
    scrape_interval: 15s
    metrics_path: /metrics

Kubernetes ServiceMonitor

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: revenprox
spec:
  selector:
    matchLabels:
      app: revenprox
  endpoints:
  - port: http
    path: /metrics
    interval: 15s

Grafana Dashboards

Connection Overview

{
  "panels": [
    {
      "title": "Active Connections",
      "targets": [{
        "expr": "revenprox_connections_active"
      }]
    },
    {
      "title": "Connection Rate",
      "targets": [{
        "expr": "rate(revenprox_connections_total[5m])"
      }]
    },
    {
      "title": "Rejected Connections",
      "targets": [{
        "expr": "rate(revenprox_connections_rejected[5m])"
      }]
    }
  ]
}

Message Throughput

{
  "panels": [
    {
      "title": "Messages/sec",
      "targets": [{
        "expr": "rate(revenprox_messages_sent_total[1m])"
      }]
    },
    {
      "title": "Bytes/sec",
      "targets": [{
        "expr": "rate(revenprox_bytes_sent_total[1m])"
      }]
    },
    {
      "title": "Message Latency P99",
      "targets": [{
        "expr": "histogram_quantile(0.99, sum by (le) (rate(revenprox_message_latency_ms_bucket[5m])))"
      }]
    }
  ]
}
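To make the P99 panel more concrete, the sketch below reimplements the linear interpolation that `histogram_quantile` applies to cumulative buckets. It is an illustration of the technique rather than Prometheus code; the bucket boundaries in the usage note are invented, and the input is assumed to be a sorted list of `(upper_bound, cumulative_count)` pairs ending with the `+Inf` bucket.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper_bound,
    with at least one finite bucket followed by a final +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need finite buckets followed by a +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations
    rank = q * total
    lower, below = 0.0, 0.0  # the first bucket is assumed to start at 0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                # Quantile lies beyond the last finite bound: report that bound.
                return buckets[-2][0]
            # Interpolate linearly inside the bucket that contains the rank.
            return lower + (upper - lower) * (rank - below) / (count - below)
        lower, below = upper, count
    return math.nan  # unreachable for well-formed input
```

With buckets `[(10.0, 50), (50.0, 90), (100.0, 99), (math.inf, 100)]` in milliseconds, the median estimate falls at the top of the first bucket. The result is only as precise as the bucket layout, so bucket boundaries near the 100 ms target matter most.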

Alerting

Prometheus Alert Rules

groups:
- name: revenprox
  rules:
  # High connection count
  - alert: RevenProxHighConnections
    expr: revenprox_connections_active > 80000
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High connection count on {{ $labels.instance }}"

  # Connection rejections
  - alert: RevenProxConnectionsRejected
    expr: rate(revenprox_connections_rejected[5m]) > 0
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Connections being rejected on {{ $labels.instance }}"

  # Authentication failures spike (more than 100 failures per second, averaged over 5m)
  - alert: RevenProxAuthFailures
    expr: rate(revenprox_auth_failures_total[5m]) > 100
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High authentication failure rate"

  # Circuit breaker open
  - alert: RevenProxCircuitBreakerOpen
    expr: revenprox_circuit_breaker_state == 1
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Circuit breaker open - auth webhook failing"

  # Memory usage high (assumes a revenprox_memory_limit_bytes gauge is exported; it is not listed under System Metrics)
  - alert: RevenProxMemoryHigh
    expr: revenprox_memory_used_bytes / revenprox_memory_limit_bytes > 0.85
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Memory usage above 85%"

  # Peer disconnected (threshold assumes a three-proxy cluster, where each proxy expects 2 peers)
  - alert: RevenProxPeerDisconnected
    expr: revenprox_peers_connected < 2
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Peer proxy disconnected"

Logging

Log Levels

Level	Description
`debug`	Detailed debugging information
`info`	Normal operational messages
`warn`	Warning conditions
`error`	Error conditions

Configuration

log_level = "info"

Log Format

[2024-01-15T10:30:45Z INFO] HTTP server started on 0.0.0.0:8080
[2024-01-15T10:30:46Z INFO] Connected to peer tcp://proxy2:5555
[2024-01-15T10:31:00Z WARN] Rate limit exceeded for IP 192.168.1.100
[2024-01-15T10:32:15Z ERROR] Webhook verification failed: timeout
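The bracketed format above is regular enough to parse with a single expression. A minimal sketch, assuming every line follows the `[timestamp LEVEL] message` shape shown in the samples and that timestamps are always UTC with a trailing `Z`; `parse_log_line` is an illustrative name.

```python
import re
from datetime import datetime, timezone

# [2024-01-15T10:30:45Z INFO] message text
LOG_LINE = re.compile(r"^\[(?P<ts>\S+) (?P<level>[A-Z]+)\] (?P<message>.*)$")


def parse_log_line(line: str):
    """Split one text log line into timestamp, level, and message.

    Returns None for lines that do not match the expected shape.
    """
    m = LOG_LINE.match(line)
    if m is None:
        return None
    ts = datetime.strptime(m["ts"], "%Y-%m-%dT%H:%M:%SZ")
    return {
        "timestamp": ts.replace(tzinfo=timezone.utc),
        "level": m["level"],
        "message": m["message"],
    }
```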

Structured Logging

For log aggregation, log entries can be emitted as JSON:

{
  "timestamp": "2024-01-15T10:30:45Z",
  "level": "INFO",
  "message": "HTTP server started",
  "bind_address": "0.0.0.0:8080",
  "proxy_id": "proxy-1"
}
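JSON logs make level-based filtering straightforward. The sketch below assumes one JSON object per line (the example above is pretty-printed for readability) and uses the level ordering from the Log Levels table; `filter_logs` is a hypothetical helper.

```python
import json

# Severity order taken from the Log Levels table.
LEVELS = {"debug": 0, "info": 1, "warn": 2, "error": 3}


def filter_logs(lines, min_level="warn"):
    """Return decoded records whose level is at or above min_level."""
    threshold = LEVELS[min_level]
    kept = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(record, dict):
            continue  # skip JSON values that are not objects
        level = str(record.get("level", "")).lower()
        if LEVELS.get(level, -1) >= threshold:
            kept.append(record)
    return kept
```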

Distributed Tracing

OpenTelemetry Integration

[telemetry]
enabled = true
endpoint = "http://jaeger:14268/api/traces"
service_name = "revenprox"

Trace Context

Propagate trace context with the W3C `traceparent` request header:

GET /events/topic HTTP/1.1
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
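The `traceparent` value has four dash-separated hex fields: version, trace ID, parent (span) ID, and flags. A small sketch of validating and splitting a version `00` header follows; how RevenProx itself handles the header is not specified here, and `parse_traceparent` is an illustrative name.

```python
import re

# version(2 hex) - trace-id(32 hex) - parent-id(16 hex) - flags(2 hex)
TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header: str):
    """Parse a version-00 style traceparent value; return None if invalid."""
    m = TRACEPARENT.match(header.strip())
    if m is None:
        return None
    version, trace_id, parent_id, flags = m.groups()
    # Version ff and all-zero IDs are defined as invalid by the W3C spec.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),  # lowest bit = sampled
    }
```

For the header shown above, the flags field `01` means the trace is marked as sampled.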

Performance Monitoring

Connection Pool Stats

curl http://localhost:8080/debug/pool

{
  "active_connections": 12345,
  "idle_connections": 234,
  "total_created": 45678,
  "total_closed": 33333
}

Message Queue Stats

curl http://localhost:8080/debug/queues

{
  "total_queues": 5000,
  "avg_queue_depth": 12,
  "max_queue_depth": 89,
  "messages_pending": 60000
}
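The queue stats can be reduced to a couple of derived figures. The sketch below assumes `avg_queue_depth` is the arithmetic mean of pending messages over all queues, which matches the sample numbers (60000 / 5000 = 12) but is not stated by the endpoint; the function name and output keys are illustrative.

```python
def queue_summary(stats: dict) -> dict:
    """Derive mean depth and hot-queue skew from /debug/queues output."""
    total = stats["total_queues"]
    if total == 0:
        return {"mean_depth": 0.0, "max_to_mean": None}
    mean_depth = stats["messages_pending"] / total
    # A high ratio suggests a few slow consumers rather than global backlog.
    skew = stats["max_queue_depth"] / mean_depth if mean_depth else None
    return {"mean_depth": mean_depth, "max_to_mean": skew}
```

A rising mean depth points at overall backpressure, while a rising max-to-mean ratio points at individual slow subscribers.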

SLA Monitoring

Key SLIs

SLI	Target	Measurement
Availability	99.9%	Health check success rate
Latency P99	< 100ms	Message delivery time
Error rate	< 0.1%	Failed requests / total

SLO Dashboard

Track error budget:

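Error Budget = 100% - SLO Target
Current Error Rate = (Errors / Total Requests) * 100
Remaining = Error Budget - Current Error Rate

For example, a 99.9% target gives a 0.1% error budget; a current error rate of 0.04% leaves 0.06% remaining.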
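As a worked example of error budget tracking, the sketch below treats the budget as the allowed failure percentage (100% minus the SLO target) and compares it with the observed error rate. The function name and return keys are assumptions for illustration, and the request counts would come from whatever counters back the dashboard.

```python
def error_budget(slo_target_pct: float, errors: int, total: int) -> dict:
    """Compute error budget figures, all expressed in percent.

    slo_target_pct: e.g. 99.9 for a 99.9% SLO.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    budget_pct = 100.0 - slo_target_pct          # allowed failure rate
    error_rate_pct = errors / total * 100.0      # observed failure rate
    remaining_pct = budget_pct - error_rate_pct  # negative means SLO breached
    consumed = error_rate_pct / budget_pct if budget_pct > 0 else None
    return {
        "budget_pct": budget_pct,
        "error_rate_pct": error_rate_pct,
        "remaining_pct": remaining_pct,
        "consumed_fraction": consumed,
    }
```

With a 99.9% target, 50 errors in 100,000 requests use about half of the budget. A negative `remaining_pct` indicates the SLO has been breached for the measured window.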