Monitoring & Alerts

Built-in monitoring with no external services required. Voxeltron collects CPU, memory, network, and disk I/O metrics from every Docker container, aggregates logs, and fires alerts when thresholds are breached.

Metrics Collection

The MetricsCollector polls the Docker stats API at a configurable interval, capturing per-container resource usage. Data is persisted in the MetricsStore (backed by SQLite) for historical queries and dashboard rendering.

Collected Metrics

  • CPU — usage percentage per core and aggregate
  • Memory — RSS, cache, swap, and limit
  • Network — bytes and packets in/out per interface
  • Disk I/O — read/write bytes and operations
  • Container state — running, stopped, restarting, OOM-killed
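The CPU percentage above is not reported directly by Docker; it has to be derived from two consecutive stats samples using the delta formula the Docker stats API documents. A minimal sketch of that calculation (the sample dicts mirror the relevant slice of the Docker stats JSON):

```python
def cpu_percent(sample: dict, prev: dict) -> float:
    """Aggregate CPU usage % between two consecutive Docker stats samples.

    Container CPU time consumed relative to total system CPU time over the
    polling interval, scaled by the number of online CPUs.
    """
    cpu_delta = (sample["cpu_stats"]["cpu_usage"]["total_usage"]
                 - prev["cpu_stats"]["cpu_usage"]["total_usage"])
    system_delta = (sample["cpu_stats"]["system_cpu_usage"]
                    - prev["cpu_stats"]["system_cpu_usage"])
    if system_delta <= 0:
        return 0.0
    online_cpus = sample["cpu_stats"].get("online_cpus", 1)
    return (cpu_delta / system_delta) * online_cpus * 100.0

# Two samples one polling interval apart (nanosecond counters)
prev = {"cpu_stats": {"cpu_usage": {"total_usage": 1_000_000},
                      "system_cpu_usage": 10_000_000}}
sample = {"cpu_stats": {"cpu_usage": {"total_usage": 2_000_000},
                        "system_cpu_usage": 20_000_000,
                        "online_cpus": 4}}
print(cpu_percent(sample, prev))  # 40.0
```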

Log Aggregation

Container logs are streamed in real time via the Docker API and indexed for fast retrieval. Logs are queryable by deployment, time range, and level (debug, info, warn, error).

# Query logs via the TUI
voxeltron logs my-app --since 1h --level error

# Stream live logs
voxeltron logs my-app --follow
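The level-and-time filter the CLI applies can be sketched as a simple predicate over indexed log entries. This is an illustration of the query semantics, not Voxeltron's internal code; the entry field names are assumptions:

```python
from datetime import datetime, timedelta, timezone

# Log levels in increasing severity, matching the four levels above
LEVELS = {"debug": 0, "info": 1, "warn": 2, "error": 3}

def filter_logs(entries, min_level="info", since=None):
    """Return entries at or above min_level that are newer than `since`."""
    floor = LEVELS[min_level]
    return [e for e in entries
            if LEVELS[e["level"]] >= floor
            and (since is None or e["ts"] >= since)]

now = datetime.now(timezone.utc)
entries = [
    {"ts": now - timedelta(hours=2),   "level": "error", "msg": "old failure"},
    {"ts": now - timedelta(minutes=5), "level": "info",  "msg": "healthy"},
    {"ts": now - timedelta(minutes=1), "level": "error", "msg": "recent failure"},
]

# Equivalent of: voxeltron logs my-app --since 1h --level error
recent_errors = filter_logs(entries, min_level="error",
                            since=now - timedelta(hours=1))
print([e["msg"] for e in recent_errors])  # ['recent failure']
```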

Alerting

Threshold-based alerts fire when resource usage exceeds configured limits. Each alert rule targets a specific metric and can route notifications to one or more channels.

CPU Alert

Fires when CPU usage exceeds a threshold (e.g. > 90%) for a sustained duration.

Memory Alert

Fires when memory usage exceeds a threshold (e.g. > 85%) to catch leaks before OOM.

Error Rate Alert

Fires when the error log rate exceeds a count per window (e.g. > 50 errors/min).

Disk Alert

Fires when disk usage exceeds a threshold to prevent storage exhaustion.

Alert Channels

  • Webhook — POST JSON payloads to any HTTP endpoint
  • Email — SMTP-based notifications with configurable recipients
  • Slack — delivered via WASM plugins for full customization
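For the webhook channel, the notification is a JSON document POSTed to the configured URL. The exact payload schema isn't specified here, so the field names below are illustrative assumptions about what such a payload might carry:

```python
import json
from datetime import datetime, timezone

def build_webhook_payload(rule_name, metric, value, threshold, deployment):
    """Assemble an alert notification payload (field names are hypothetical)."""
    return {
        "alert": rule_name,
        "metric": metric,
        "value": value,
        "threshold": threshold,
        "deployment": deployment,
        "fired_at": datetime.now(timezone.utc).isoformat(),
    }

payload = build_webhook_payload("high-cpu", "cpu_percent", 94.2, 90.0, "my-app")
body = json.dumps(payload)  # this is what gets POSTed to the webhook URL
print(payload["alert"], payload["value"])  # high-cpu 94.2
```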

Configuration

Configure monitoring in /etc/voxeltron/config.toml under the [monitoring] section:

[monitoring]
enabled = true
interval_seconds = 15          # Metrics polling interval
retention_days = 30            # How long to keep historical data
log_level_filter = "info"      # Minimum log level to index

[monitoring.alerts]
enabled = true
evaluation_interval = "60s"    # How often alert rules are evaluated

[[monitoring.alerts.rules]]
name = "high-cpu"
metric = "cpu_percent"
threshold = 90.0
duration = "5m"
channels = ["webhook", "email"]

[[monitoring.alerts.rules]]
name = "high-memory"
metric = "memory_percent"
threshold = 85.0
duration = "2m"
channels = ["webhook"]

[[monitoring.alerts.rules]]
name = "error-spike"
metric = "error_rate"
threshold = 50.0               # errors per minute
duration = "1m"
channels = ["webhook", "email"]

[monitoring.alerts.channels.webhook]
url = "https://hooks.example.com/voxeltron"

[monitoring.alerts.channels.email]
smtp_host = "smtp.example.com"
smtp_port = 587
from = "alerts@example.com"
to = ["ops@example.com"]

AI Integration

The built-in AI DevOps agent can interact with the monitoring subsystem through dedicated tools:

  • query_metrics — retrieve CPU, memory, and network metrics for a deployment over a time range
  • query_logs — search and filter container logs by deployment, level, and time window
  • list_alerts — enumerate active and resolved alerts, including firing status and history

These tools allow the AI agent to diagnose issues, correlate metrics with log events, and surface actionable insights without manual investigation.
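To make the tool surface concrete, here is what a query_metrics definition might look like in the JSON-schema style most agent frameworks use. The parameter names are assumptions inferred from the description above, not Voxeltron's actual schema:

```python
# Hypothetical tool definition; parameter names are illustrative only.
QUERY_METRICS_TOOL = {
    "name": "query_metrics",
    "description": ("Retrieve CPU, memory, and network metrics "
                    "for a deployment over a time range."),
    "parameters": {
        "type": "object",
        "properties": {
            "deployment": {"type": "string"},
            "metric": {"type": "string",
                       "enum": ["cpu_percent", "memory_percent",
                                "network_bytes", "disk_io"]},
            "since": {"type": "string", "description": "e.g. '1h', '30m'"},
        },
        "required": ["deployment", "metric"],
    },
}

print(QUERY_METRICS_TOOL["name"])  # query_metrics
```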