Monitoring¶
Monitor your Machineuse deployment with built-in metrics and integrations.
Metrics Overview¶
Machineuse collects metrics and stores them in DuckDB for time-series analytics.
Cluster Metrics¶
- Total instances
- Running/dormant/failed instances
- Node health status
- Resource utilization
Node Metrics¶
- CPU utilization
- Memory usage
- Disk I/O
- Network throughput
- Instance count
Instance Metrics¶
- CPU percentage
- Memory usage
- Disk usage
- Network I/O
- Activity status
Accessing Metrics¶
CLI¶
# Cluster overview
machineuse-cli cluster status
# Node metrics
machineuse-cli metrics --node worker-1
# Instance metrics
machineuse-cli metrics --instance abc123def456
# Historical data
machineuse-cli metrics --node worker-1 --period 24h
API¶
# Cluster metrics
curl http://localhost:8000/v2/metrics
# Node metrics
curl "http://localhost:8000/v2/metrics?node_id=worker-1"
# Time-series data
curl "http://localhost:8000/v2/metrics?period=24h"
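The same request URLs can be built from a script. The following is a minimal sketch using only the Python standard library, assuming the endpoint and the `node_id` and `period` query parameters from the curl examples above. The helper name `metrics_url` is hypothetical, and whether both parameters may be combined in a single request is not confirmed here.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8000/v2/metrics"  # endpoint from the curl examples


def metrics_url(node_id=None, period=None):
    """Build a metrics URL; hypothetical helper, not part of Machineuse."""
    params = {}
    if node_id is not None:
        params["node_id"] = node_id
    if period is not None:
        params["period"] = period
    # urlencode escapes the values, so unusual node names stay URL-safe
    return BASE_URL + ("?" + urlencode(params) if params else "")


print(metrics_url())
print(metrics_url(node_id="worker-1"))
print(metrics_url(period="24h"))
```

Building the query string with `urlencode` avoids the shell-quoting pitfalls of hand-written URLs that contain `?` and `&`.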
Response Example¶
{
  "cluster": {
    "total_instances": 38,
    "running_instances": 35,
    "dormant_instances": 3,
    "failed_instances": 0,
    "nodes_online": 2,
    "nodes_offline": 1
  },
  "resources": {
    "avg_cpu_percent": 56.0,
    "avg_memory_percent": 70.0,
    "avg_disk_percent": 45.0
  },
  "timeseries": {
    "interval": "5m",
    "timestamps": ["2026-03-19T10:00:00Z", "2026-03-19T10:05:00Z"],
    "cpu": [45.0, 52.0],
    "memory": [65.0, 68.0],
    "instances": [35, 36]
  }
}
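A response of this shape can be summarized in a few lines. The sketch below works on a trimmed copy of the example above and assumes that `timestamps` and each series in `timeseries` are parallel arrays of equal length, which the example suggests but does not state. The helper name `peak` is hypothetical.

```python
import json

# Trimmed copy of the response example
RESPONSE = """
{
  "cluster": {"total_instances": 38, "running_instances": 35,
              "dormant_instances": 3, "failed_instances": 0},
  "timeseries": {
    "interval": "5m",
    "timestamps": ["2026-03-19T10:00:00Z", "2026-03-19T10:05:00Z"],
    "cpu": [45.0, 52.0],
    "memory": [65.0, 68.0],
    "instances": [35, 36]
  }
}
"""


def peak(series_name, payload):
    """Return (timestamp, value) for the highest sample in one series."""
    ts = payload["timeseries"]
    # Assumption: timestamps and series are parallel arrays, so zip pairs them
    return max(zip(ts["timestamps"], ts[series_name]), key=lambda pair: pair[1])


payload = json.loads(RESPONSE)
cluster = payload["cluster"]
running_ratio = cluster["running_instances"] / cluster["total_instances"]

print(peak("cpu", payload))
print(f"{running_ratio:.1%} of instances running")
```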
Health Checks¶
Service Health¶
Response:
{
  "status": "healthy",
  "components": {
    "api": "up",
    "storage": "up",
    "messaging": "up",
    "scheduler": "up"
  },
  "instances_count": 38,
  "running_instances": 35
}
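A script consuming this response might flag any component that is not reported as `up`. The sketch below is illustrative only: the documentation shows just the `"up"` state, so the `"down"` value in the second example is an assumption, and `down_components` is a hypothetical helper name.

```python
import json

# Copy of the health response shown above
HEALTH = json.loads("""
{
  "status": "healthy",
  "components": {"api": "up", "storage": "up", "messaging": "up", "scheduler": "up"},
  "instances_count": 38,
  "running_instances": 35
}
""")


def down_components(health):
    """List components whose state is anything other than "up"."""
    return sorted(
        name for name, state in health["components"].items() if state != "up"
    )


print(down_components(HEALTH))

# Hypothetical degraded response; "down" is an assumed state value
degraded = {"components": {"api": "up", "storage": "down"}}
print(down_components(degraded))
```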
Node Health¶
Nodes send heartbeats to the control plane. View node status:
Unhealthy indicators:
- Last heartbeat > 60 seconds ago
- High resource utilization (>90%)
- Failed instance ratio > 10%
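The three indicators can be expressed as a simple check. This is a sketch under stated assumptions, not Machineuse's actual health logic: the node field names (`last_heartbeat`, `cpu_percent`, `memory_percent`, `instance_count`, `failed_instances`) are hypothetical, and "resource utilization" is taken here to mean the higher of CPU and memory.

```python
from datetime import datetime, timedelta, timezone

HEARTBEAT_TIMEOUT = timedelta(seconds=60)  # last heartbeat > 60 seconds ago
MAX_UTILIZATION = 90.0                     # resource utilization > 90%
MAX_FAILED_RATIO = 0.10                    # failed instance ratio > 10%


def unhealthy_reasons(node, now):
    """Return the indicators a node trips; field names are hypothetical."""
    reasons = []
    if now - node["last_heartbeat"] > HEARTBEAT_TIMEOUT:
        reasons.append("stale heartbeat")
    # Assumption: utilization means the higher of CPU and memory
    if max(node["cpu_percent"], node["memory_percent"]) > MAX_UTILIZATION:
        reasons.append("high utilization")
    total = node["instance_count"]
    # Skip the ratio check for empty nodes to avoid dividing by zero
    if total and node["failed_instances"] / total > MAX_FAILED_RATIO:
        reasons.append("failed instance ratio")
    return reasons


now = datetime(2026, 3, 19, 10, 30, tzinfo=timezone.utc)
node = {
    "last_heartbeat": now - timedelta(seconds=75),
    "cpu_percent": 45.2,
    "memory_percent": 93.0,
    "instance_count": 20,
    "failed_instances": 1,
}
print(unhealthy_reasons(node, now))
```

Returning the list of tripped indicators, rather than a single boolean, keeps the reason for an unhealthy verdict visible to whoever reads the alert.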
Alerting¶
Threshold Configuration¶
Configure alerting thresholds in your configuration:
{
  "monitoring": {
    "cpu_warning_threshold": 80,
    "cpu_critical_threshold": 95,
    "memory_warning_threshold": 80,
    "memory_critical_threshold": 95,
    "disk_warning_threshold": 80,
    "disk_critical_threshold": 90
  }
}
Alert Levels¶
| Level | Description |
|---|---|
| info | Informational message |
| warning | Approaching threshold |
| critical | Threshold exceeded |
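To illustrate how the thresholds and levels might fit together, the sketch below maps a utilization percentage to an alert level using the threshold keys from the configuration above. The comparison semantics are assumptions: values at or above a threshold are treated as having reached it, and anything below the warning threshold is reported as `info`. The function name `alert_level` is hypothetical.

```python
# Threshold values copied from the "monitoring" configuration block
MONITORING = {
    "cpu_warning_threshold": 80,
    "cpu_critical_threshold": 95,
    "memory_warning_threshold": 80,
    "memory_critical_threshold": 95,
    "disk_warning_threshold": 80,
    "disk_critical_threshold": 90,
}


def alert_level(resource, percent, config=MONITORING):
    """Map a utilization percentage to an alert level (assumed semantics)."""
    # Check the critical threshold first so it takes precedence over warning
    if percent >= config[f"{resource}_critical_threshold"]:
        return "critical"
    if percent >= config[f"{resource}_warning_threshold"]:
        return "warning"
    return "info"


for resource, value in [("cpu", 56.0), ("memory", 85.0), ("disk", 92.0)]:
    print(resource, value, alert_level(resource, value))
```

Note that the disk critical threshold is lower than the CPU and memory ones, so 92% is `critical` for disk but only `warning` for CPU.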
Log Files¶
Locations¶
| Component | Location |
|---|---|
| API Server | journalctl -u container-manager |
| Worker Agent | /var/lib/machineuse/logs/agent.log |
| Nginx | /var/log/nginx/access.log |
Log Levels¶
Set log level via environment variable:
Levels: DEBUG, INFO, WARNING, ERROR
Structured Logging¶
Logs are JSON-formatted for easy parsing:
{
  "timestamp": "2026-03-19T10:30:00.123Z",
  "level": "INFO",
  "component": "scheduler",
  "message": "Instance scheduled",
  "instance_id": "abc123def456",
  "node_id": "worker-1"
}
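Because each record is a JSON object, logs can be filtered with a short script. The sketch below assumes one JSON object per line, which the example suggests but does not confirm, and uses the level order listed under Log Levels. The second sample record is hypothetical, as is the helper name `filter_logs`.

```python
import json

LEVELS = ["DEBUG", "INFO", "WARNING", "ERROR"]  # order from the Log Levels section


def filter_logs(lines, min_level="WARNING"):
    """Yield parsed records at or above min_level, skipping unparseable lines."""
    threshold = LEVELS.index(min_level)
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # e.g. a stack trace or a truncated line
        if not isinstance(record, dict):
            continue  # valid JSON, but not a log record
        level = record.get("level")
        if level in LEVELS and LEVELS.index(level) >= threshold:
            yield record


lines = [
    # Record from the example above
    '{"timestamp": "2026-03-19T10:30:00.123Z", "level": "INFO", '
    '"component": "scheduler", "message": "Instance scheduled", '
    '"instance_id": "abc123def456", "node_id": "worker-1"}',
    # Hypothetical error record, for illustration only
    '{"timestamp": "2026-03-19T10:31:00.000Z", "level": "ERROR", '
    '"component": "scheduler", "message": "example failure"}',
    "not a JSON line",
]

for record in filter_logs(lines):
    print(record["level"], record["component"], record["message"])
```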
Prometheus Integration¶
Export metrics for Prometheus:
Example metrics:
# HELP machineuse_instances_total Total number of instances
# TYPE machineuse_instances_total gauge
machineuse_instances_total{status="running"} 35
machineuse_instances_total{status="dormant"} 3
# HELP machineuse_node_cpu_percent Node CPU utilization
# TYPE machineuse_node_cpu_percent gauge
machineuse_node_cpu_percent{node="worker-1"} 45.2
machineuse_node_cpu_percent{node="worker-2"} 67.8
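For quick checks without a Prometheus server, the sample lines above can be read with a small parser. This is a deliberately minimal sketch: it handles only simple `name{label="value"} number` lines like those shown, not escaped label values or trailing timestamps. For production use, a maintained parser such as the one shipped with the `prometheus_client` package is the safer choice.

```python
import re

SAMPLE = """\
# HELP machineuse_instances_total Total number of instances
# TYPE machineuse_instances_total gauge
machineuse_instances_total{status="running"} 35
machineuse_instances_total{status="dormant"} 3
# HELP machineuse_node_cpu_percent Node CPU utilization
# TYPE machineuse_node_cpu_percent gauge
machineuse_node_cpu_percent{node="worker-1"} 45.2
machineuse_node_cpu_percent{node="worker-2"} 67.8
"""

LINE_RE = re.compile(
    r"^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$"
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Parse simple exposition lines into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match:
            labels = dict(LABEL_RE.findall(match["labels"] or ""))
            samples.append((match["name"], labels, float(match["value"])))
    return samples


samples = parse_samples(SAMPLE)
total = sum(v for name, _, v in samples if name == "machineuse_instances_total")
print(samples[0])
print(total)
```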
Prometheus Configuration¶
scrape_configs:
  - job_name: 'machineuse'
    static_configs:
      - targets: ['localhost:8000']
    metrics_path: '/metrics'
Grafana Dashboards¶
Import the included Grafana dashboard for visualization:
Dashboard panels:
- Cluster overview
- Node resource utilization
- Instance lifecycle events
- Historical trends
Debugging¶
Common Issues¶
High Memory Usage
# Check memory usage on the node
machineuse-cli metrics --node worker-1
# Find memory-heavy instances
curl "http://localhost:8000/v2/instances?sort=memory_desc&limit=10"
Slow Instance Creation
# Check scheduler logs
journalctl -u container-manager | grep scheduler
# View pending instances
curl "http://localhost:8000/v2/instances?status=creating"
Node Offline
# Check node status
machineuse-cli nodes list
# View node logs
ssh worker-1 journalctl -u machineuse-agent
Best Practices¶
- Set up alerting for critical thresholds
- Monitor disk space, since snapshots consume storage
- Review idle instances regularly for cleanup
- Use distributed mode for high availability
- Configure log rotation to prevent disk fill