Operations Runbook

Deployment

Prerequisites

  • Production server with Docker + Docker Compose
  • All required environment variables in .env.production
  • ENCRYPTION_KEY must be identical across deployments (tenant API keys are encrypted at rest with it, so a changed key makes them undecryptable)

Production Deploy (Docker Compose)

ssh root@46.225.232.35
cd /opt/nyxcore
git pull
docker compose -f docker-compose.production.yml build --no-cache app
docker compose -f docker-compose.production.yml up -d app

Or use the deploy script:

./scripts/deploy.sh

Post-Deploy

  1. Verify health: curl http://localhost:3000/api/v1/health
  2. Check logs: docker compose -f docker-compose.production.yml logs -f app --tail 50
  3. Verify the Redis connection (rate limiting is fail-open, so the app keeps working without Redis, but requests are no longer throttled)

First-Time Setup

npm run db:push                            # Apply full schema
psql "$DATABASE_URL" -f prisma/rls.sql     # Apply RLS policies
npm run db:seed                            # Seed default tenant + built-in personas

Production Schema Changes

Never use db push on production — it drops pgvector embedding columns. Use the safe migration script:

./scripts/db-migrate-safe.sh              # Show SQL diff, prompt before applying
./scripts/db-migrate-safe.sh --dry-run    # Preview only
./scripts/db-migrate-safe.sh --apply      # Apply without confirmation

For direct SQL on production:

docker exec -it nyxcore-postgres-1 psql -U nyxcore -d nyxcore

Infrastructure

Development

| Service | Image | Port | Health Check |
| --- | --- | --- | --- |
| Next.js | - | 3000 | GET / |
| PostgreSQL | pgvector/pgvector:pg16 | 5432 | pg_isready -U nyxcore |
| Redis | redis:7-alpine | 6379 | redis-cli ping |

npm run docker:up        # Start postgres + redis
npm run docker:down      # Stop (preserves volumes)
docker compose down -v   # Stop + delete data volumes

Production Stack

All services run via docker-compose.production.yml on root@46.225.232.35:

| Service | Image | Purpose |
| --- | --- | --- |
| app | Custom (Dockerfile) | Next.js application |
| postgres | pgvector/pgvector:pg16 | Primary database |
| redis | redis:7-alpine | Rate limiting |
| ollama | ollama/ollama:latest | Local LLM inference |
| ipcha | Custom (ipcha/Dockerfile) | IPCHA claim verification |
| ckb | ghcr.io/simplyliz/ckb:latest | Code Knowledge Base |
| traefik | traefik:latest | Reverse proxy + TLS (Let's Encrypt) |
| prometheus | prom/prometheus:v2.53.0 | Metrics (30d retention) |
| grafana | grafana/grafana:11.3.0 | Dashboards (grafana.nyxcore.cloud) |
| loki | grafana/loki:3.4.2 | Log aggregation (7d retention) |
| alloy | grafana/alloy:v1.7.1 | Docker log collection (DSGVO-compliant) |
| node-exporter | prom/node-exporter:v1.8.0 | Host metrics |
| cadvisor | gcr.io/cadvisor/cadvisor:v0.49.1 | Container metrics |
| postgres-exporter | prometheuscommunity/postgres-exporter:v0.15.0 | PostgreSQL metrics |
| redis-exporter | oliver006/redis_exporter:v1.62.0 | Redis metrics |

Database

  • pgvector: Extension v0.8.1, HNSW index on workflow_insights.embedding (m=16, ef_construction=64, vector_cosine_ops)
  • RLS: Row-Level Security policies in prisma/rls.sql enforce tenant isolation
  • Backups: scripts/db-backup.sh or pg_dump; embedding column requires pgvector extension on restore

Rate Limiting

| Endpoint Type | Limit | Window |
| --- | --- | --- |
| General API | 100 req | 1 min |
| LLM operations | 10 req | 1 min |

Rate limiting is fail-open — if Redis is unavailable, requests are allowed through. Monitor Redis connectivity if you see unusual traffic patterns.
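To make the fail-open behavior concrete, here is a minimal sketch of a limiter that allows the request whenever the counter store throws. It is not the application's actual implementation: the CounterStore interface, the key format, and both store classes are hypothetical, and the store is synchronous for brevity (a real Redis client would be asynchronous). Only the limits and the 1-minute window come from the table above.

```typescript
// Hypothetical store interface; a real Redis client would be asynchronous.
interface CounterStore {
  // Increments the counter for `key` and returns the count in the current window.
  increment(key: string, windowSeconds: number): number;
}

type EndpointType = "general" | "llm";

// Requests allowed per window, taken from the rate limiting table.
const LIMITS: Record<EndpointType, number> = { general: 100, llm: 10 };
const WINDOW_SECONDS = 60;

function isAllowed(store: CounterStore, clientId: string, type: EndpointType): boolean {
  try {
    const count = store.increment(`ratelimit:${type}:${clientId}`, WINDOW_SECONDS);
    return count <= LIMITS[type];
  } catch {
    // Fail-open: if the store is unreachable, let the request through.
    return true;
  }
}

// In-memory stand-in for Redis, used only to exercise the sketch.
// It never expires counters, so it models a single window.
class MemoryStore implements CounterStore {
  private counts = new Map<string, number>();

  increment(key: string, _windowSeconds: number): number {
    const next = (this.counts.get(key) ?? 0) + 1;
    this.counts.set(key, next);
    return next;
  }
}

// Simulates a Redis outage: every call fails.
class BrokenStore implements CounterStore {
  increment(): number {
    throw new Error("store unavailable");
  }
}
```

With a MemoryStore, the 11th "llm" call from the same client in one window is rejected. With a BrokenStore, every call is allowed, which is why a Redis outage shows up as unthrottled traffic rather than as errors.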

Monitoring

Dashboards

  • Grafana: https://grafana.nyxcore.cloud (login: admin / GRAFANA_ADMIN_PASSWORD)
  • Prometheus: Internal on nyxcore-net, queried by Grafana
  • Loki: Log queries via Grafana Explore

Key Metrics to Watch

  • Audit logs: audit_logs table tracks all significant actions (API key changes, PR creation, workflow execution)
  • Workflow step costs: workflow_steps.costEstimate — aggregate to track LLM spend
  • Token usage: workflow_steps.tokenUsage — monitor for cost anomalies
  • Failed workflows: SELECT count(*) FROM workflows WHERE status = 'failed'
  • Container resources: cAdvisor dashboards in Grafana
  • Host metrics: Node Exporter dashboards (CPU, memory, disk)

Prometheus LLM Metrics

Application-level metrics exposed via prom-client at /api/v1/metrics:

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| nyxcore_llm_call_duration_seconds | Histogram | provider, model, method | LLM API call latency |
| nyxcore_llm_calls_total | Counter | provider, model, status | Total LLM API calls |
| nyxcore_llm_tokens_total | Counter | provider, model, type | Tokens consumed (prompt/completion) |
| nyxcore_llm_cost_usd_total | Counter | provider, model | Estimated LLM cost in USD |
| nyxcore_workflow_duration_seconds | Histogram | status | Workflow execution duration |
| nyxcore_workflow_steps_total | Counter | status | Workflow steps executed |
| nyxcore_active_workflows | Gauge | (none) | Currently running workflows |
| nyxcore_discussion_duration_seconds | Histogram | mode | Discussion processing duration |
| nyxcore_active_discussions | Gauge | (none) | Currently active discussions |
| nyxcore_sse_connections_active | Gauge | endpoint | Active SSE connections |
| nyxcore_rate_limit_hits_total | Counter | source | Rate limit hits |
| nyxcore_http_request_duration_seconds | Histogram | method, route, status_code | HTTP request latency |
| nyxcore_http_requests_total | Counter | method, route, status_code | Total HTTP requests |

The LLM metrics are wired in via the withMetrics() wrapper in src/server/services/llm/resolve-provider.ts: every resolveProvider() call automatically instruments the returned provider.

Log Collection

  • Alloy collects Docker container logs and forwards to Loki
  • Logs include JSON-parsed fields: level, status, method, path, duration
  • DSGVO-compliant: Client IPs are anonymized (IPv4 last octet → 0, IPv6 truncated)
  • Health-check and metrics paths are auto-dropped to reduce noise
  • Application logs: docker compose -f docker-compose.production.yml logs -f app
  • Audit trail: audit_logs table (queryable via Prisma Studio: npm run db:studio)
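The IP anonymization rule above can be sketched as a small pure function. This is an illustration, not the Alloy pipeline configuration: the function names are hypothetical, and because the runbook only says IPv6 addresses are "truncated", the choice to keep the first four groups (a /64 prefix) is an assumption. Addresses with an embedded IPv4 suffix are not treated specially.

```typescript
// Assumption: keep the first 4 groups (a /64 prefix) of an IPv6 address.
const IPV6_GROUPS_KEPT = 4;

// Expands a "::"-compressed IPv6 address into its 8 groups.
function expandIpv6(ip: string): string[] {
  const [head = "", tail = ""] = ip.split("::");
  const headGroups = head === "" ? [] : head.split(":");
  if (!ip.includes("::")) return headGroups;
  const tailGroups = tail === "" ? [] : tail.split(":");
  const missing = Math.max(0, 8 - headGroups.length - tailGroups.length);
  return [...headGroups, ...new Array<string>(missing).fill("0"), ...tailGroups];
}

function anonymizeIp(ip: string): string {
  if (ip.includes(":")) {
    // IPv6: drop everything after the kept prefix.
    return expandIpv6(ip).slice(0, IPV6_GROUPS_KEPT).join(":") + "::";
  }
  const octets = ip.split(".");
  if (octets.length !== 4) return ip; // not an IPv4 literal: leave unchanged
  octets[3] = "0"; // IPv4: zero the last octet
  return octets.join(".");
}

// anonymizeIp("203.0.113.42")  returns "203.0.113.0"
// anonymizeIp("2001:db8::1")   returns "2001:db8:0:0::"
```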

Docker Maintenance

# Prune unused images and build cache (frees disk space)
docker system prune -af && docker builder prune -af

Common Issues

"No GitHub token configured"

Cause: The tenant has no GitHub PAT stored in the API keys vault.

Fix: Go to Admin > API Keys and add a GitHub personal access token with the repo and read:org scopes.

"Workflow stuck in running state"

Cause: The SSE connection dropped or the server restarted mid-execution.

Fix:

UPDATE workflows SET status = 'paused' WHERE status = 'running' AND updated_at < NOW() - INTERVAL '30 minutes';
UPDATE workflow_steps SET status = 'pending' WHERE status = 'running' AND updated_at < NOW() - INTERVAL '30 minutes';

Then resume from the dashboard.

"ENCRYPTION_KEY mismatch"

Cause: ENCRYPTION_KEY changed between deployments, so stored API keys can no longer be decrypted.

Fix: Restore the original key, or have tenants re-enter their API keys in Admin > API Keys.
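The following sketch shows why a changed key is unrecoverable without the original: AES-256-GCM is authenticated, so decrypting with the wrong key fails outright instead of returning garbage. It uses Node's built-in crypto module. The function names and the storage layout (IV, then auth tag, then ciphertext) are assumptions for illustration, not the application's actual format, and the key is assumed to be 32 raw bytes.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

const IV_BYTES = 12;  // 96-bit nonce, the recommended size for GCM
const TAG_BYTES = 16; // default GCM authentication tag length

// Encrypts a secret; assumed layout of the result: iv | authTag | ciphertext.
function encryptSecret(plaintext: string, key: Buffer): Buffer {
  const iv = randomBytes(IV_BYTES);
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  return Buffer.concat([iv, cipher.getAuthTag(), ciphertext]);
}

// Decrypts a blob produced by encryptSecret; throws if the key does not match.
function decryptSecret(blob: Buffer, key: Buffer): string {
  const iv = blob.subarray(0, IV_BYTES);
  const tag = blob.subarray(IV_BYTES, IV_BYTES + TAG_BYTES);
  const ciphertext = blob.subarray(IV_BYTES + TAG_BYTES);
  const decipher = createDecipheriv("aes-256-gcm", key, iv);
  decipher.setAuthTag(tag);
  // final() verifies the auth tag and throws when the key or data is wrong.
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]).toString("utf8");
}
```

Calling decryptSecret with the key used for encryption returns the original string. Calling it with any other 32-byte key throws during final(), which is the failure mode behind this error.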

"prisma db push fails with column conflict"

Cause: A required column was added to a table that already has rows.

Fix: Add @default("") to the column in schema.prisma, or make it optional (?). See CLAUDE.md Prisma Gotchas.

"RLS blocks all queries"

Cause: The RLS policies are not applied, or tenantId is not set in the session context.

Fix: Re-apply the policies:

psql "$DATABASE_URL" -f prisma/rls.sql

Then verify that the enforceTenant middleware is in the tRPC chain.

"Docker build fails with ERESOLVE"

Cause: An eslint peer dependency conflict between eslint@8 and eslint-config-next@16.

Fix: The Dockerfile already uses npm ci --legacy-peer-deps. If new dependencies cause similar conflicts, make sure --legacy-peer-deps is set in both the deps and builder stages.

"Vector search returns no results"

Cause: The workflow_insights.embedding column is an Unsupported type in Prisma, so it can only be read or written through raw SQL.

Fix: Use Prisma.sql for vector operations, not Prisma client methods. Check that embeddings are generated via text-embedding-3-small (1536 dimensions).
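A common source of empty results is an embedding with the wrong dimension or a malformed vector parameter. The helper below is a minimal sketch, with a hypothetical name, that validates the dimension and serializes an embedding into pgvector's text format before it is passed to raw SQL. The commented query is an assumed usage: the prisma client variable, the id column, and the LIMIT are illustrative, while the <=> cosine distance operator matches the vector_cosine_ops index described in the Database section.

```typescript
// Dimension of text-embedding-3-small vectors, as stated above.
const EMBEDDING_DIMENSIONS = 1536;

// Serializes an embedding into pgvector's text format, e.g. "[0.1,0.2,0.3]".
function toVectorLiteral(embedding: number[]): string {
  if (embedding.length !== EMBEDDING_DIMENSIONS) {
    throw new Error(`expected ${EMBEDDING_DIMENSIONS} dimensions, got ${embedding.length}`);
  }
  if (!embedding.every(Number.isFinite)) {
    throw new Error("embedding contains a non-finite value");
  }
  return `[${embedding.join(",")}]`;
}

// Hypothetical usage with a PrismaClient instance named `prisma`
// (the `id` column and the LIMIT are assumptions):
//
//   const literal = toVectorLiteral(queryEmbedding);
//   const rows = await prisma.$queryRaw`
//     SELECT id FROM workflow_insights
//     ORDER BY embedding <=> ${literal}::vector
//     LIMIT 5`;
```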

Rollback

Application Rollback

  1. Deploy the previous build artifact / container image
  2. No database rollback needed if no migrations were run

Database Rollback

If a migration was applied:

prisma migrate resolve --rolled-back <migration_name>

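This command only marks the migration as rolled back in Prisma's migration history; it does not undo any SQL. Revert the schema changes manually, because Prisma does not roll back migrations automatically.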

For db:push (dev only): There is no built-in rollback. Restore from backup or manually revert schema changes.

Emergency: Disable a Feature

To disable LLM-dependent features without downtime, remove the provider's API key from the tenant's key vault (Admin > API Keys). The BYOK pattern means no key = no LLM calls.

Secrets Management

| Secret | Storage | Rotation |
| --- | --- | --- |
| AUTH_SECRET | Environment variable | Rotate anytime; invalidates existing sessions |
| ENCRYPTION_KEY | Environment variable | Cannot rotate without re-encrypting all stored API keys |
| Tenant API keys | api_keys table, AES-256-GCM encrypted | Tenants manage via Admin > API Keys |
| STRIPE_SECRET_KEY | Environment variable | Rotate via Stripe Dashboard; update .env.production |
| STRIPE_WEBHOOK_SECRET | Environment variable | Re-create webhook in Stripe Dashboard if rotated |
| GitHub PAT | api_keys table, encrypted | Rotate via GitHub + re-enter in Admin |

Letter-to-Blog Pipeline

Session checkpoints in .memory/letter_*.md are auto-converted to blog drafts:

python .github/scripts/blog_gen.py                               # Latest letter
python .github/scripts/blog_gen.py --file letter_20260227_0001.md # Specific file

The GitHub Actions workflow .github/workflows/vibe_publisher.yml auto-creates PRs when letters are pushed to .memory/.