Monitoring
Monitor RevenProx health and performance.
Health Endpoint
Check proxy health:
Response:
Key Metrics
Connection Metrics
| Metric | Description | Type |
|---|---|---|
| connections_active | Current open connections | Gauge |
| connections_total | Total connections since start | Counter |
| connections_rejected | Rejected connections | Counter |
| connection_duration_seconds | Connection lifetime histogram | Histogram |
Message Metrics
| Metric | Description | Type |
|---|---|---|
| messages_sent_total | Total messages delivered | Counter |
| messages_dropped_total | Messages dropped (backpressure) | Counter |
| bytes_sent_total | Total bytes transmitted | Counter |
| message_latency_ms | Message delivery latency | Histogram |
Authentication Metrics
| Metric | Description | Type |
|---|---|---|
| auth_requests_total | Total authentication requests | Counter |
| auth_cache_hits | Cache hit count | Counter |
| auth_cache_misses | Cache miss count | Counter |
| auth_failures_total | Authentication failures | Counter |
| circuit_breaker_state | Circuit breaker status (0=closed, 1=open) | Gauge |
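The two cache counters are most useful combined into a hit ratio. The following is a minimal sketch, assuming `auth_cache_hits` and `auth_cache_misses` count the same set of lookups; the function name is hypothetical and not part of RevenProx.

```python
def cache_hit_ratio(hits: float, misses: float) -> float:
    """Return the fraction of auth lookups served from cache, in [0, 1]."""
    total = hits + misses
    if total == 0:
        return 0.0  # no lookups recorded yet; avoid dividing by zero
    return hits / total


# Example with made-up counter values: 90 hits and 10 misses.
ratio = cache_hit_ratio(90, 10)
print(f"auth cache hit ratio: {ratio:.0%}")
```

A falling ratio generally means more requests reach the authentication backend, so it is worth reading alongside `auth_failures_total` and `circuit_breaker_state`.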
System Metrics
| Metric | Description | Type |
|---|---|---|
| memory_used_bytes | Current memory usage | Gauge |
| cpu_usage_percent | CPU utilization | Gauge |
| fd_used | File descriptors in use | Gauge |
| goroutines | Active goroutines/threads | Gauge |
Distributed State Metrics
| Metric | Description | Type |
|---|---|---|
| peers_connected | Connected peer proxies | Gauge |
| sync_messages_sent | Sync messages transmitted | Counter |
| sync_lag_seconds | Time since last successful sync | Gauge |
| topics_active | Active topics | Gauge |
Prometheus Integration
Metrics Endpoint
Expose metrics in Prometheus format:
Output:
# HELP revenprox_connections_active Current active connections
# TYPE revenprox_connections_active gauge
revenprox_connections_active 12345
# HELP revenprox_messages_sent_total Total messages sent
# TYPE revenprox_messages_sent_total counter
revenprox_messages_sent_total 987654321
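For quick checks outside Prometheus, output like the sample above can be read with a few lines of Python. This is a sketch only: it handles label-free samples without timestamps, which is all the sample shows, and `parse_metrics` is a hypothetical helper rather than a RevenProx or Prometheus API.

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Parse label-free samples from Prometheus text exposition output."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and "# HELP" / "# TYPE" comments
        parts = line.split()
        # parts[0] is the metric name, parts[1] the sample value.
        samples[parts[0]] = float(parts[1])
    return samples


SAMPLE = """\
# HELP revenprox_connections_active Current active connections
# TYPE revenprox_connections_active gauge
revenprox_connections_active 12345
# HELP revenprox_messages_sent_total Total messages sent
# TYPE revenprox_messages_sent_total counter
revenprox_messages_sent_total 987654321
"""

metrics = parse_metrics(SAMPLE)
print(metrics["revenprox_connections_active"])
```

Samples that carry labels need a real exposition-format parser, since label values may contain spaces.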
Prometheus Configuration
scrape_configs:
  - job_name: 'revenprox'
    static_configs:
      - targets: ['proxy1:8080', 'proxy2:8080', 'proxy3:8080']
    scrape_interval: 15s
    metrics_path: /metrics
Kubernetes ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: revenprox
spec:
  selector:
    matchLabels:
      app: revenprox
  endpoints:
    - port: http
      path: /metrics
      interval: 15s
Grafana Dashboards
Connection Overview
{
  "panels": [
    {
      "title": "Active Connections",
      "targets": [{
        "expr": "revenprox_connections_active"
      }]
    },
    {
      "title": "Connection Rate",
      "targets": [{
        "expr": "rate(revenprox_connections_total[5m])"
      }]
    },
    {
      "title": "Rejected Connections",
      "targets": [{
        "expr": "rate(revenprox_connections_rejected[5m])"
      }]
    }
  ]
}
Message Throughput
{
  "panels": [
    {
      "title": "Messages/sec",
      "targets": [{
        "expr": "rate(revenprox_messages_sent_total[1m])"
      }]
    },
    {
      "title": "Bytes/sec",
      "targets": [{
        "expr": "rate(revenprox_bytes_sent_total[1m])"
      }]
    },
    {
      "title": "Message Latency P99",
      "targets": [{
        "expr": "histogram_quantile(0.99, sum by (le) (rate(revenprox_message_latency_ms_bucket[5m])))"
      }]
    }
  ]
}
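The P99 panel relies on `histogram_quantile`, which estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The sketch below is a simplified illustration of that idea, not Prometheus's exact implementation, and the bucket boundaries are made up for the example.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with a +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            # Linear interpolation between the bucket's lower and upper bound.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical latency buckets in milliseconds: (upper bound, cumulative count).
latency_buckets = [(10.0, 50.0), (50.0, 90.0), (100.0, 99.0), (math.inf, 100.0)]
p99 = histogram_quantile(0.99, latency_buckets)
p50 = histogram_quantile(0.50, latency_buckets)
print(p50, p99)
```

Because the estimate is interpolated, its precision depends on how finely the buckets cover the latency range of interest.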
Alerting
Prometheus Alert Rules
groups:
  - name: revenprox
    rules:
      # High connection count
      - alert: RevenProxHighConnections
        expr: revenprox_connections_active > 80000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High connection count on {{ $labels.instance }}"
      # Connection rejections
      - alert: RevenProxConnectionsRejected
        expr: rate(revenprox_connections_rejected[5m]) > 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Connections being rejected on {{ $labels.instance }}"
      # Authentication failures spike
      - alert: RevenProxAuthFailures
        expr: rate(revenprox_auth_failures_total[5m]) > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High authentication failure rate"
      # Circuit breaker open
      - alert: RevenProxCircuitBreakerOpen
        expr: revenprox_circuit_breaker_state == 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Circuit breaker open - auth webhook failing"
      # Memory usage high
      # (relies on revenprox_memory_limit_bytes, which is not listed under System Metrics)
      - alert: RevenProxMemoryHigh
        expr: revenprox_memory_used_bytes / revenprox_memory_limit_bytes > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage above 85%"
      # Peer disconnected
      - alert: RevenProxPeerDisconnected
        expr: revenprox_peers_connected < 2
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Peer proxy disconnected"
Logging
Log Levels
| Level | Description |
|---|---|
| debug | Detailed debugging information |
| info | Normal operational messages |
| warn | Warning conditions |
| error | Error conditions |
Configuration
Log Format
[2024-01-15T10:30:45Z INFO] HTTP server started on 0.0.0.0:8080
[2024-01-15T10:30:46Z INFO] Connected to peer tcp://proxy2:5555
[2024-01-15T10:31:00Z WARN] Rate limit exceeded for IP 192.168.1.100
[2024-01-15T10:32:15Z ERROR] Webhook verification failed: timeout
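The sample lines follow a `[timestamp LEVEL] message` shape, which makes them straightforward to split with a regular expression. The sketch below infers the pattern from the four sample lines only; `parse_log_line` is a hypothetical helper, and real output may contain shapes the samples do not show.

```python
import re
from datetime import datetime, timezone
from typing import Optional

# Pattern inferred from the sample lines: "[<timestamp> <LEVEL>] <message>".
LOG_LINE = re.compile(r"^\[(?P<timestamp>\S+) (?P<level>[A-Z]+)\] (?P<message>.*)$")


def parse_log_line(line: str) -> Optional[dict]:
    """Split one text log line into timestamp, level and message."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None  # line does not follow the expected shape
    fields = match.groupdict()
    # The sample timestamps are UTC, marked by the trailing "Z".
    fields["timestamp"] = datetime.strptime(
        fields["timestamp"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    return fields


entry = parse_log_line(
    "[2024-01-15T10:31:00Z WARN] Rate limit exceeded for IP 192.168.1.100"
)
print(entry["level"], entry["message"])
```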
Structured Logging
For JSON output suited to log aggregation:
{
  "timestamp": "2024-01-15T10:30:45Z",
  "level": "INFO",
  "message": "HTTP server started",
  "bind_address": "0.0.0.0:8080",
  "proxy_id": "proxy-1"
}
Distributed Tracing
OpenTelemetry Integration
Trace Context
Propagate trace context in headers:
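This page does not show which headers RevenProx propagates. As an illustration only, the sketch below assumes the W3C Trace Context `traceparent` header, which is the default propagation format in OpenTelemetry; the helper names are hypothetical.

```python
import re
import secrets

# W3C traceparent, version 00: "00-<32 hex trace id>-<16 hex parent id>-<2 hex flags>".
TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace with random trace and span ids."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex characters
    span_id = secrets.token_hex(8)    # 8 random bytes -> 16 hex characters
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def child_traceparent(parent: str) -> str:
    """Keep the incoming trace id and flags, but issue a fresh span id."""
    match = TRACEPARENT.match(parent)
    if match is None:
        return new_traceparent()  # malformed header: start a new trace instead
    trace_id, _, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


incoming = new_traceparent()
outgoing = child_traceparent(incoming)
print({"traceparent": outgoing})
```

Keeping the trace id constant across hops is what lets a tracing backend stitch the proxy's span into the caller's trace.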
Performance Monitoring
Connection Pool Stats
{
  "active_connections": 12345,
  "idle_connections": 234,
  "total_created": 45678,
  "total_closed": 33333
}
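In the sample above, `total_created` minus `total_closed` equals `active_connections` (45678 - 33333 = 12345). Whether RevenProx guarantees that relationship is an assumption, but it makes a cheap sanity check on the stats. A minimal sketch, with a hypothetical `summarize_pool` helper:

```python
import json

SAMPLE = """
{
  "active_connections": 12345,
  "idle_connections": 234,
  "total_created": 45678,
  "total_closed": 33333
}
"""


def summarize_pool(stats: dict) -> dict:
    """Derive the open-connection count from the lifetime counters."""
    open_by_counters = stats["total_created"] - stats["total_closed"]
    return {
        "open_by_counters": open_by_counters,
        # Assumed invariant: every created connection is either closed or active.
        "consistent": open_by_counters == stats["active_connections"],
    }


summary = summarize_pool(json.loads(SAMPLE))
print(summary)
```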
Message Queue Stats
SLA Monitoring
Key SLIs
| SLI | Target | Measurement |
|---|---|---|
| Availability | 99.9% | Health check success rate |
| Latency P99 | < 100ms | Message delivery time |
| Error rate | < 0.1% | Failed requests / total |
SLO Dashboard
Track error budget:
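An error budget is the amount of failure an SLO still permits. The sketch below works through the arithmetic for the 99.9% availability target in the table above; the request counts are made up, and the function names are hypothetical.

```python
def allowed_downtime_minutes(slo: float, days: int = 30) -> float:
    """Minutes of downtime the SLO tolerates over a window of `days` days."""
    return (1.0 - slo) * days * 24 * 60


def error_budget(slo: float, total: int, failed: int) -> dict:
    """Summarize error-budget consumption for a request-based SLO."""
    allowed = (1.0 - slo) * total  # failed requests the SLO tolerates
    if allowed == 0:
        consumed = float("inf") if failed else 0.0
    else:
        consumed = failed / allowed
    return {
        "allowed_failures": allowed,
        "consumed": consumed,           # 1.0 means the budget is spent
        "remaining": 1.0 - consumed,
    }


# Hypothetical window: 1,000,000 requests, 250 of them failed.
budget = error_budget(slo=0.999, total=1_000_000, failed=250)
downtime = allowed_downtime_minutes(0.999)
print(f"budget consumed: {budget['consumed']:.1%}, "
      f"downtime allowance: {downtime:.1f} min per 30 days")
```

At 99.9%, roughly one request in a thousand may fail, so a window of one million requests tolerates about 1,000 failures before the budget is exhausted.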