AI Integration

Voxeltron has a built-in AI panel powered by Crush (Charmbracelet's AI toolkit). Hit ctrl+a in the TUI and ask your AI anything about your infrastructure. It can see your logs, config, and deploy history — and it can actually take action.

The AI runs in an agentic mode: it can propose and execute actions (deploy, rollback, scale, rotate certs) with your confirmation on each step. Think of it as a DevOps engineer that lives in your terminal and never needs a coffee break.

Supported Models

Model               Provider      Notes
Claude 3.5 Sonnet   Anthropic     Recommended; best reasoning
GPT-4o              OpenAI        Strong alternative
Gemini 1.5 Pro      Google        Large context window
Ollama (local)      Self-hosted   No API key needed, runs locally

Configure the AI Panel

$ voxeltron ai configure

? Provider: Anthropic
? API key: (hidden input)
? Model: claude-3-5-sonnet-20241022
? Max tokens per request: 4096

✓ AI configured. Press ctrl+a in the TUI to open the panel.

Or edit the config file directly:

# ~/.config/voxeltron/config.toml
[ai]
provider = "anthropic"
model = "claude-3-5-sonnet-20241022"
api_key_env = "ANTHROPIC_API_KEY"  # reads from env var
max_tokens = 4096
agentic = true

Using the AI Panel

In the TUI, press ctrl+a. The panel opens with context already loaded:

┌─ Voxeltron AI ────────────────────────────────────────────┐
│ Context loaded:                                           │
│   • web-app: last 50 log lines                            │
│   • web-app: config, env vars (values redacted), health   │
│   • Deploy history: last 5 deploys                        │
│                                                           │
│ > _                                                       │
└───────────────────────────────────────────────────────────┘

Example queries

why is web-app using so much memory?
show me errors from the last hour
roll back to yesterday's deploy
add a health check endpoint and redeploy
my postgres is slow — check the logs and suggest indexes
rotate the TLS cert for api.example.com

Agentic actions

When the AI wants to take action, it asks for confirmation:

AI: I can see web-app has been OOMing every 4h.
    The memory limit is 256MB but the Node.js heap
    is growing to 280MB before the OOM killer hits.

    I recommend increasing the memory limit to 512MB
    and adding --max-old-space-size=400 to NODE_OPTIONS.

    Proposed action:
      voxeltron service update web-app --memory 512mb
      voxeltron env set web-app NODE_OPTIONS="--max-old-space-size=400"
      voxeltron redeploy web-app

    Execute these 3 actions? [y/N]: y

✓ Memory limit updated
✓ Environment variable set
✓ Zero-downtime redeploy complete

The AI never takes action without your explicit y confirmation. Each proposed action is shown clearly before execution. You can review and reject individual steps.
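
The [y/N] prompt defaults to No: anything other than an explicit y leaves the proposed actions unexecuted. A rough Python sketch of that rule is below. The function names are hypothetical, and treating an uppercase Y as approval is an assumption, not documented Voxeltron behavior.

```python
from typing import Callable


def is_confirmed(answer: str) -> bool:
    """Only an explicit 'y' approves; empty input or anything else means No.

    Accepting uppercase 'Y' is an assumption made for this sketch.
    """
    return answer.strip().lower() == "y"


def run_if_confirmed(
    actions: list[str],
    answer: str,
    execute: Callable[[str], str],
) -> list[str]:
    """Run every proposed action in order, but only after an explicit 'y'."""
    if not is_confirmed(answer):
        return []  # nothing is executed without approval
    return [execute(action) for action in actions]


if __name__ == "__main__":
    proposed = [
        "voxeltron service update web-app --memory 512mb",
        "voxeltron redeploy web-app",
    ]
    # The executor here only echoes the command; it does not run anything.
    print(run_if_confirmed(proposed, "", lambda cmd: f"ran: {cmd}"))
    print(run_if_confirmed(proposed, "y", lambda cmd: f"ran: {cmd}"))
```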

BYOK vs Cloud Quota

You can always bring your own API key (BYOK) — the AI panel works entirely on the client side. Pro and Teams plans also include a monthly token quota via Voxeltron Cloud if you don't want to manage keys:

# Use your own key (no Voxeltron charge or quota; provider billing applies)
$ ANTHROPIC_API_KEY=sk-... voxeltron

# Use Voxeltron Cloud quota (Pro/Teams)
$ voxeltron ai auth login  # authenticates with Voxeltron Cloud

AI Tool Registry

The AI panel has access to 10 built-in tools that let it inspect and act on your infrastructure. Tools are organized by permission level — the AI can only use tools within the grant configured for your project:

Level    Description                     Tools
Read     Observe only, no side effects   query_logs, query_metrics, list_alerts, project_status, list_deployments
Safe     Recoverable actions             container_action (restart), create_backup
Deploy   Service-affecting changes       scale_service, rollback
Full     Security-sensitive operations   rotate_secret

Set the tool permission level per project in your config:

# ~/.config/voxeltron/config.toml
[ai.tools]
grant = "safe"  # one of: read, safe, deploy, full
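
One way to picture the grant setting is as a ceiling on an ordered list of levels. The Python sketch below assumes the levels are cumulative (a safe grant also includes every read tool), which the text implies but does not state outright. The tool-to-level mapping is copied from the registry above; the function names are hypothetical.

```python
# Levels from least to most privileged. Cumulative ordering is an assumption.
GRANT_ORDER = ["read", "safe", "deploy", "full"]

# Minimum level required for each of the 10 built-in tools.
TOOL_LEVEL = {
    "query_logs": "read",
    "query_metrics": "read",
    "list_alerts": "read",
    "project_status": "read",
    "list_deployments": "read",
    "container_action": "safe",  # listed at this level for restart only
    "create_backup": "safe",
    "scale_service": "deploy",
    "rollback": "deploy",
    "rotate_secret": "full",
}


def allowed_tools(grant: str) -> set[str]:
    """Return every tool whose required level is at or below the grant."""
    limit = GRANT_ORDER.index(grant)  # raises ValueError for an unknown grant
    return {
        tool
        for tool, level in TOOL_LEVEL.items()
        if GRANT_ORDER.index(level) <= limit
    }


def can_use(tool: str, grant: str) -> bool:
    """True if the tool exists and falls within the configured grant."""
    return tool in allowed_tools(grant)


if __name__ == "__main__":
    for grant in GRANT_ORDER:
        print(grant, sorted(allowed_tools(grant)))
```

Under this reading, a project with grant = "safe" could restart a container or create a backup, but a rollback request would be refused until the grant is raised to deploy.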

When the AI invokes a tool, you see exactly what it does:

AI: Checking the current state of web-app.

   tool: query_logs project=web-app lines=50 severity=error

   I found 23 OOM errors in the last hour. Let me check the
   memory metrics.

   tool: query_metrics project=web-app metric=memory_usage range=1h

   Memory usage is averaging 245MB against a 256MB limit.
   The container is being OOM-killed roughly every 15 minutes.

Skills (Multi-Step Workflows)

Skills are multi-step workflows that chain multiple tool calls together to solve common operational tasks. Each skill defines a sequence of steps, handles errors, and streams real-time progress to the TUI.

Voxeltron ships with 4 built-in skills:

Skill                 Description
health-check          Full-stack health sweep: container status, resource usage, endpoint latency, recent errors
deployment-analysis   Analyze a deploy: diff changes, check rollout status, compare error rates before and after
log-investigation     Structured log triage: pattern detection, error correlation, root cause suggestions
cost-optimizer        Identify over-provisioned services, idle containers, and potential savings

Trigger a skill by asking the AI, or run it directly. When the AI runs a skill, the output looks like this:

AI: Running skill: health-check on web-app

+ Check container status ........... running
+ Check resource usage ............. memory 94%
+ Check endpoint latency ........... p99 120ms
+ Scan recent errors ............... 23 OOM events

Summary: web-app is functionally healthy but under memory pressure.
  Recommendation: increase memory limit from 256MB to 512MB.

Progress streams in real time: each step appears as it completes, so you always know where the skill is in its workflow.
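
The step-by-step streaming described here maps naturally onto a generator that yields one result per finished step. The Python sketch below is illustrative only: run_skill and the step tuples are hypothetical, and stopping at the first failed step is an assumption about how a skill might handle errors.

```python
from typing import Callable, Iterator

# A step is a display name plus a check that returns a short detail string.
Step = tuple[str, Callable[[], str]]


def run_skill(steps: list[Step]) -> Iterator[tuple[str, str, str]]:
    """Yield (name, status, detail) as soon as each step finishes."""
    for name, check in steps:
        try:
            detail = check()
        except Exception as err:
            yield name, "failed", str(err)
            return  # assumption: later steps are skipped after a failure
        yield name, "ok", detail


if __name__ == "__main__":
    # Stand-in checks with fixed results; a real skill would call tools here.
    demo_steps: list[Step] = [
        ("Check container status", lambda: "running"),
        ("Check resource usage", lambda: "memory 94%"),
        ("Check endpoint latency", lambda: "p99 120ms"),
    ]
    # Each line is printed as its step completes, not after the whole run.
    for name, status, detail in run_skill(demo_steps):
        marker = "+" if status == "ok" else "x"
        print(f"{marker} {name} ... {detail}")
```

Because the generator yields after every step, the caller can render each line immediately instead of waiting for the whole skill to finish.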

Runbooks (Approval-Gated Automation)

Runbooks extend skills with approval gates and automatic rollback. They are designed for operations that make real changes to your infrastructure and need human sign-off before each destructive step.

Voxeltron ships with 3 built-in runbooks:

Runbook                      Description
disk-cleanup                 Find and remove dangling images, old build layers, and unused volumes
restart-loop-detection       Detect crash-looping containers, diagnose root cause, apply fixes
certificate-expiry-renewal   Check TLS cert expiry, trigger ACME renewal, verify new cert

Every runbook step that modifies state pauses for your approval. If you reject a step, the runbook rolls back any changes it already made:

Runbook: disk-cleanup
  + Identify old images ............. Found 12 dangling images
  > Delete old images ...

  +--------------------------------------------------+
  | Approval required: Delete old images             |
  | This will free 5GB                               |
  | [y] Approve  [n] Reject                          |
  +--------------------------------------------------+

Runbook approval gates are mandatory and cannot be bypassed. Every destructive action requires an explicit [y] before the runbook proceeds. Rejecting a step triggers automatic rollback of prior changes in that runbook session.
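
The approve-or-roll-back behavior can be sketched as a loop that remembers which steps it has applied and undoes them in reverse order when an approval is refused. This is a minimal Python illustration under stated assumptions: the Step class, its undo callback, and run_runbook are all hypothetical and do not describe Voxeltron's internals.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    """Hypothetical runbook step with an action and its compensating undo."""
    name: str
    apply: Callable[[], None]
    undo: Callable[[], None]
    modifies_state: bool = True


def run_runbook(steps: list[Step], approve: Callable[[str], bool]) -> bool:
    """Run steps in order; a rejected approval rolls back earlier steps."""
    applied: list[Step] = []
    for step in steps:
        # Read-only steps run without a prompt; state changes need approval.
        if step.modifies_state and not approve(step.name):
            for done in reversed(applied):
                done.undo()  # undo in reverse order of application
            return False
        step.apply()
        applied.append(step)
    return True


if __name__ == "__main__":
    log: list[str] = []
    steps = [
        Step("Identify old images", lambda: log.append("scan"),
             lambda: log.append("undo scan"), modifies_state=False),
        Step("Delete old images", lambda: log.append("delete"),
             lambda: log.append("undo delete")),
    ]
    # Reject every prompt: the scan runs, the delete does not, the scan is undone.
    print(run_runbook(steps, approve=lambda name: False), log)
```

Undoing in reverse order means a later step's cleanup never depends on state that an earlier undo has already removed.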