Monitoring

Monitor your Machineuse deployment with built-in metrics and integrations.

Metrics Overview

Machineuse collects and stores metrics using DuckDB for time-series analytics.

Cluster Metrics

  • Total instances
  • Running/dormant/failed instances
  • Node health status
  • Resource utilization

Node Metrics

  • CPU utilization
  • Memory usage
  • Disk I/O
  • Network throughput
  • Instance count

Instance Metrics

  • CPU percentage
  • Memory usage
  • Disk usage
  • Network I/O
  • Activity status

Accessing Metrics

CLI

# Cluster overview
machineuse-cli cluster status

# Node metrics
machineuse-cli metrics --node worker-1

# Instance metrics
machineuse-cli metrics --instance abc123def456

# Historical data
machineuse-cli metrics --node worker-1 --period 24h

API

# Cluster metrics
curl http://localhost:8000/v2/metrics

# Node metrics
curl "http://localhost:8000/v2/metrics?node_id=worker-1"

# Time-series data
curl "http://localhost:8000/v2/metrics?period=24h"
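The same requests can be issued from a script. The following is a minimal Python sketch, assuming the base URL and query parameters shown in the curl examples. The helper names metrics_url and fetch_metrics are hypothetical, and combining node_id with period in a single request is an assumption that mirrors the CLI's --node and --period flags.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8000"  # host and port taken from the curl examples


def metrics_url(node_id=None, period=None, base_url=BASE_URL):
    """Build a /v2/metrics URL, adding only the parameters that are set."""
    params = {}
    if node_id is not None:
        params["node_id"] = node_id
    if period is not None:
        params["period"] = period
    query = urlencode(params)  # handles escaping of special characters
    return f"{base_url}/v2/metrics" + (f"?{query}" if query else "")


def fetch_metrics(node_id=None, period=None, timeout=5.0):
    """Fetch metrics and decode the JSON body (requires a running API server)."""
    with urlopen(metrics_url(node_id, period), timeout=timeout) as resp:
        return json.load(resp)
```

Building the query string with urlencode avoids the shell-quoting pitfalls of hand-written URLs.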

Response Example

{
  "cluster": {
    "total_instances": 38,
    "running_instances": 35,
    "dormant_instances": 3,
    "failed_instances": 0,
    "nodes_online": 2,
    "nodes_offline": 1
  },
  "resources": {
    "avg_cpu_percent": 56.0,
    "avg_memory_percent": 70.0,
    "avg_disk_percent": 45.0
  },
  "timeseries": {
    "interval": "5m",
    "timestamps": ["2026-03-19T10:00:00Z", "2026-03-19T10:05:00Z"],
    "cpu": [45.0, 52.0],
    "memory": [65.0, 68.0],
    "instances": [35, 36]
  }
}
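As an illustration of how the parallel arrays in timeseries line up, the Python sketch below pairs each timestamp with its samples and derives two simple aggregates. It assumes the payload has exactly the shape shown above. The function name summarize_metrics and the derived fields peak_cpu and running_ratio are hypothetical and not part of the API.

```python
import json

# Trimmed copy of the response example above.
SAMPLE_RESPONSE = """
{
  "cluster": {"total_instances": 38, "running_instances": 35},
  "timeseries": {
    "interval": "5m",
    "timestamps": ["2026-03-19T10:00:00Z", "2026-03-19T10:05:00Z"],
    "cpu": [45.0, 52.0],
    "memory": [65.0, 68.0],
    "instances": [35, 36]
  }
}
"""


def summarize_metrics(payload):
    """Zip the parallel time-series arrays into one record per timestamp."""
    ts = payload["timeseries"]
    points = [
        {"timestamp": t, "cpu": c, "memory": m, "instances": n}
        for t, c, m, n in zip(ts["timestamps"], ts["cpu"], ts["memory"], ts["instances"])
    ]
    total = payload["cluster"]["total_instances"]
    running = payload["cluster"]["running_instances"]
    return {
        "points": points,
        "peak_cpu": max(ts["cpu"], default=None),  # None when the series is empty
        "running_ratio": running / total if total else None,
    }


summary = summarize_metrics(json.loads(SAMPLE_RESPONSE))
```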

Health Checks

Service Health

curl http://localhost:8000/health

Response:

{
  "status": "healthy",
  "components": {
    "api": "up",
    "storage": "up",
    "messaging": "up",
    "scheduler": "up"
  },
  "instances_count": 38,
  "running_instances": 35
}
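A probe script might reduce this response to a pass/fail result. The Python sketch below is one way to do that. It assumes any component value other than "up" indicates a problem; the documentation only shows "up", so other values are a guess. The helper names are hypothetical.

```python
def failing_components(health):
    """Return the names of components whose state is anything but "up"."""
    components = health.get("components", {})
    return sorted(name for name, state in components.items() if state != "up")


def is_healthy(health):
    """True only if the overall status is "healthy" and every component is up."""
    return health.get("status") == "healthy" and not failing_components(health)


# The response shown above, as a Python dict.
sample = {
    "status": "healthy",
    "components": {"api": "up", "storage": "up", "messaging": "up", "scheduler": "up"},
    "instances_count": 38,
    "running_instances": 35,
}
```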

Node Health

Nodes send heartbeats to the control plane. View node status:

machineuse-cli nodes list

Unhealthy indicators:

  • Last heartbeat > 60 seconds ago
  • High resource utilization (>90%)
  • Failed instance ratio > 10%
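The indicators above can be expressed as a small check. The Python sketch below is illustrative only: the function name, its parameters, and the reason strings are hypothetical, and it treats each limit as a strict "greater than" comparison as written in the list. How Machineuse evaluates these conditions internally is not documented here.

```python
from datetime import datetime, timedelta, timezone

HEARTBEAT_TIMEOUT = timedelta(seconds=60)  # last heartbeat > 60 seconds ago
MAX_UTILIZATION_PERCENT = 90.0             # resource utilization > 90%
MAX_FAILED_RATIO = 0.10                    # failed instance ratio > 10%


def unhealthy_reasons(last_heartbeat, utilization_percent,
                      failed_instances, total_instances, now=None):
    """Return the list of indicators a node trips; an empty list means healthy."""
    now = now or datetime.now(timezone.utc)
    reasons = []
    if now - last_heartbeat > HEARTBEAT_TIMEOUT:
        reasons.append("stale heartbeat")
    if utilization_percent > MAX_UTILIZATION_PERCENT:
        reasons.append("high resource utilization")
    # Skip the ratio check for nodes with no instances to avoid dividing by zero.
    if total_instances > 0 and failed_instances / total_instances > MAX_FAILED_RATIO:
        reasons.append("high failed instance ratio")
    return reasons
```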

Alerting

Threshold Configuration

Set alerting thresholds under the monitoring key of your configuration:

{
  "monitoring": {
    "cpu_warning_threshold": 80,
    "cpu_critical_threshold": 95,
    "memory_warning_threshold": 80,
    "memory_critical_threshold": 95,
    "disk_warning_threshold": 80,
    "disk_critical_threshold": 90
  }
}

Alert Levels

Level     Description
info      Informational message
warning   Approaching threshold
critical  Threshold exceeded
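To show how the thresholds and alert levels might fit together, here is a Python sketch that maps a utilization percentage to a level. It uses the threshold values from the configuration example. Treating a value at or above a threshold as having reached it, and returning None below the warning threshold, are assumptions. The function name alert_level is hypothetical.

```python
# Threshold values copied from the "monitoring" configuration example.
MONITORING = {
    "cpu_warning_threshold": 80,
    "cpu_critical_threshold": 95,
    "memory_warning_threshold": 80,
    "memory_critical_threshold": 95,
    "disk_warning_threshold": 80,
    "disk_critical_threshold": 90,
}


def alert_level(resource, percent, config=MONITORING):
    """Return "critical", "warning", or None for a resource's utilization."""
    # Check the critical threshold first so it takes precedence over warning.
    if percent >= config[f"{resource}_critical_threshold"]:
        return "critical"
    if percent >= config[f"{resource}_warning_threshold"]:
        return "warning"
    return None  # below both thresholds: no alert (assumption)
```

Note that with these values a disk at 92% is already critical, while CPU and memory at 92% are still at the warning level.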

Log Files

Locations

Component     Location
API Server    journalctl -u container-manager
Worker Agent  /var/lib/machineuse/logs/agent.log
Nginx         /var/log/nginx/access.log

Log Levels

Set log level via environment variable:

export MACHINEUSE_LOG_LEVEL=DEBUG

Levels: DEBUG, INFO, WARNING, ERROR

Structured Logging

Logs are JSON-formatted for easy parsing:

{
  "timestamp": "2026-03-19T10:30:00.123Z",
  "level": "INFO",
  "component": "scheduler",
  "message": "Instance scheduled",
  "instance_id": "abc123def456",
  "node_id": "worker-1"
}
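Because each record is JSON, logs can be filtered with a few lines of code. The Python sketch below assumes one JSON object per line in the log file, which the pretty-printed example above does not confirm. The numeric ranks are only a way to order the four documented levels, and the function name filter_logs is hypothetical.

```python
import json

# Rank the documented levels so they can be compared.
LEVELS = {"DEBUG": 10, "INFO": 20, "WARNING": 30, "ERROR": 40}


def filter_logs(lines, min_level="WARNING", component=None):
    """Yield parsed records at or above min_level, optionally for one component."""
    threshold = LEVELS[min_level]
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(record, dict):
            continue
        if LEVELS.get(record.get("level"), 0) < threshold:
            continue
        if component is not None and record.get("component") != component:
            continue
        yield record
```

For example, list(filter_logs(open("/var/lib/machineuse/logs/agent.log"), "ERROR")) would collect error records from the worker agent log.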

Prometheus Integration

Machineuse exposes metrics in the Prometheus text format at /metrics:

curl http://localhost:8000/metrics

Example metrics:

# HELP machineuse_instances_total Total number of instances
# TYPE machineuse_instances_total gauge
machineuse_instances_total{status="running"} 35
machineuse_instances_total{status="dormant"} 3

# HELP machineuse_node_cpu_percent Node CPU utilization
# TYPE machineuse_node_cpu_percent gauge
machineuse_node_cpu_percent{node="worker-1"} 45.2
machineuse_node_cpu_percent{node="worker-2"} 67.8
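For quick checks without a Prometheus server, the output above can be parsed directly. The Python sketch below is a deliberately simplified reader for lines of the form name{label="value"} number. It ignores comment lines and does not handle escaped quotes in label values or trailing timestamps, so the official Prometheus client libraries are preferable for anything beyond ad hoc inspection. The name parse_samples is hypothetical.

```python
import re

# metric_name, optional {labels}, then a numeric value.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Return a list of (metric_name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples
```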

Prometheus Configuration

scrape_configs:
  - job_name: 'machineuse'
    static_configs:
      - targets: ['localhost:8000']
    metrics_path: '/metrics'

Grafana Dashboards

Import the included Grafana dashboard for visualization:

# Dashboard JSON location
ls dashboards/grafana-machineuse.json

Dashboard panels:

  • Cluster overview
  • Node resource utilization
  • Instance lifecycle events
  • Historical trends

Debugging

Common Issues

High Memory Usage

# Check memory usage on the node
machineuse-cli metrics --node worker-1

# Find memory-heavy instances
curl "http://localhost:8000/v2/instances?sort=memory_desc&limit=10"

Slow Instance Creation

# Check scheduler logs
journalctl -u container-manager | grep scheduler

# View pending instances
curl "http://localhost:8000/v2/instances?status=creating"

Node Offline

# Check node status
machineuse-cli nodes list

# View node logs
ssh worker-1 journalctl -u machineuse-agent

Best Practices

  1. Set up alerting for critical thresholds
  2. Monitor disk space, because snapshots consume storage
  3. Review idle instances regularly for cleanup
  4. Use distributed mode for high availability
  5. Configure log rotation to prevent disk fill