Skip to content

Distributed State Configuration

Configure state synchronization for horizontally scaled deployments.

Overview

When running multiple RevenProx instances, distributed state ensures:

  • Subscriptions are synchronized across all instances
  • Events reach clients regardless of which proxy they're connected to
  • Consistent view of active topics and connections

NNG Configuration

NNG (nanomsg-next-generation) handles inter-proxy communication.

listen_addresses

Addresses to listen for peer connections.

[nng]
listen_addresses = ["tcp://*:5555"]

Formats: - tcp://*:5555 - Listen on all interfaces - tcp://192.168.1.10:5555 - Listen on specific interface - ipc:///tmp/revenprox.sock - Unix socket (same host)

peer_addresses

Addresses of peer proxies to connect to.

[nng]
peer_addresses = [
  "tcp://proxy-2.internal:5555",
  "tcp://proxy-3.internal:5555"
]

Tip

Use DNS names or service discovery for dynamic peer lists.

Buffer Sizes

Configure message buffers:

[nng]
pub_buffer_size = 65536    # Publisher buffer (bytes)
sub_buffer_size = 131072   # Subscriber buffer (bytes)

Larger buffers handle burst traffic but use more memory.

reconnect_interval_ms

Time between reconnection attempts to failed peers.

[nng]
reconnect_interval_ms = 5000

heartbeat_interval_sec

Interval for peer health checks.

[nng]
heartbeat_interval_sec = 30

Message Batching

Batch messages for efficiency:

[nng]
message_batch_size = 100       # Messages per batch
message_batch_timeout_ms = 50  # Max wait time

Distributed State Configuration

gossip_interval_sec

Interval for gossip protocol updates.

[distributed_state]
gossip_interval_sec = 30

Lower values = faster sync, higher network overhead.

Bloom Filter Settings

Bloom filters enable efficient subscription lookups:

[distributed_state]
bloom_filter_capacity = 1000000   # Expected subscriptions
bloom_filter_fpr = 0.01           # False positive rate (1%)

Capacity planning: - Set to expected max subscriptions - Higher capacity = more memory - Memory usage: capacity * -ln(fpr) / (ln(2)^2) / 8 bytes

vector_clock_cleanup_interval_sec

Interval for cleaning up old vector clock entries.

[distributed_state]
vector_clock_cleanup_interval_sec = 3600

max_subscription_events

Maximum subscription events to track for sync.

[distributed_state]
max_subscription_events = 100000

merkle_tree_rebuild_threshold

Trigger Merkle tree rebuild after this many changes.

[distributed_state]
merkle_tree_rebuild_threshold = 1000

peer_timeout_sec

Time before considering a peer dead.

[distributed_state]
peer_timeout_sec = 180

Deployment Topologies

Two-Node Active-Active

┌─────────────┐     NNG      ┌─────────────┐
│   Proxy A   │◄────────────►│   Proxy B   │
│ :8080/:5555 │              │ :8080/:5555 │
└─────────────┘              └─────────────┘

Proxy A config:

[nng]
listen_addresses = ["tcp://*:5555"]
peer_addresses = ["tcp://proxy-b:5555"]

Proxy B config:

[nng]
listen_addresses = ["tcp://*:5555"]
peer_addresses = ["tcp://proxy-a:5555"]

Multi-Node Mesh

        ┌─────────────┐
        │   Proxy A   │
        └──────┬──────┘
    ┌──────────┼──────────┐
    │          │          │
┌───▼───┐  ┌───▼───┐  ┌───▼───┐
│Proxy B│  │Proxy C│  │Proxy D│
└───────┘  └───────┘  └───────┘

Each proxy connects to all others:

[nng]
listen_addresses = ["tcp://*:5555"]
peer_addresses = [
  "tcp://proxy-a:5555",
  "tcp://proxy-b:5555",
  "tcp://proxy-c:5555",
  "tcp://proxy-d:5555"
]

Kubernetes Deployment

Use headless service for peer discovery:

apiVersion: v1
kind: Service
metadata:
  name: revenprox-peers
spec:
  clusterIP: None
  selector:
    app: revenprox
  ports:
    - port: 5555
      name: nng

Config:

[nng]
listen_addresses = ["tcp://*:5555"]
# Use DNS SRV records or init container for peer discovery

Consistency Model

RevenProx uses eventual consistency:

  1. Local First: Operations apply locally immediately
  2. Async Sync: Changes propagate to peers asynchronously
  3. Conflict Resolution: Last-write-wins with vector clocks

Sync Guarantees

Operation Local Remote
Subscribe Immediate < gossip_interval
Unsubscribe Immediate < gossip_interval
Event delivery Immediate < message_batch_timeout

Monitoring

Key Metrics

Metric Description
peers_connected Number of active peer connections
sync_lag_ms Time since last successful sync
bloom_filter_fpr Actual false positive rate
gossip_messages_sent Gossip messages transmitted

Health Checks

Verify peer connectivity:

# Check peer status
curl http://localhost:8080/health/peers

Troubleshooting

Peers Not Connecting

  1. Check network connectivity between hosts
  2. Verify firewall allows NNG port (default 5555)
  3. Check peer addresses are correct
  4. Review logs for connection errors

Slow Synchronization

  1. Reduce gossip_interval_sec
  2. Increase message_batch_size
  3. Check network latency between peers
  4. Monitor sync_lag_ms metric

Memory Usage High

  1. Reduce bloom_filter_capacity
  2. Lower max_subscription_events
  3. Decrease buffer sizes
  4. Check for subscription leaks

Example: Production Cluster

# Unique ID per instance
proxy_id = "proxy-prod-1"

[nng]
listen_addresses = ["tcp://*:5555"]
peer_addresses = [
  "tcp://proxy-prod-2.internal:5555",
  "tcp://proxy-prod-3.internal:5555"
]
pub_buffer_size = 131072
sub_buffer_size = 262144
reconnect_interval_ms = 2000
heartbeat_interval_sec = 15

[distributed_state]
gossip_interval_sec = 10
bloom_filter_capacity = 5000000
bloom_filter_fpr = 0.001
max_subscription_events = 500000
peer_timeout_sec = 60

Next Steps