Operations Runbook

Deployment

Prerequisites

Production server with Docker + Docker Compose
All required environment variables in .env.production
ENCRYPTION_KEY must stay the same across deployments (stored API keys are encrypted at rest with it and cannot be decrypted with a different key)

Production Deploy (Docker Compose)

ssh root@46.225.232.35
cd /opt/nyxcore
git pull
docker compose -f docker-compose.production.yml build --no-cache app
docker compose -f docker-compose.production.yml up -d app

Or use the deploy script:

./scripts/deploy.sh

Post-Deploy

Verify health: curl http://localhost:3000/api/v1/health
Check logs: docker compose -f docker-compose.production.yml logs -f app --tail 50
Verify the Redis connection (rate limiting is fail-open, so the app keeps working without Redis, but without any rate limits)
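
The health check can be scripted so that a deploy only counts as done once the app answers. The following is a minimal sketch, not part of the repository: it assumes the health endpoint returns HTTP 200 when the app is ready, and the names http_status and wait_until_healthy are hypothetical.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code of a GET request, or 0 if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return 0  # connection refused, DNS failure, timeout, ...


def wait_until_healthy(
    url: str,
    attempts: int = 10,
    delay_seconds: float = 3.0,
    fetch: Callable[[str], int] = http_status,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `url` until it returns 200 or the attempts run out."""
    for attempt in range(1, attempts + 1):
        if fetch(url) == 200:
            return True
        if attempt < attempts:
            sleep(delay_seconds)  # give the container time to finish starting
    return False
```

For example, wait_until_healthy("http://localhost:3000/api/v1/health") returns True as soon as the endpoint answers 200 and False after ten failed attempts. The fetch and sleep parameters exist only so the loop can be exercised without a running server.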

First-Time Setup

npm run db:push          # Apply full schema
psql "$DATABASE_URL" -f prisma/rls.sql   # Apply RLS policies
npm run db:seed          # Seed default tenant + built-in personas

Production Schema Changes

Never run db push (npm run db:push) against production: it drops the pgvector embedding columns. Use the safe migration script instead:

./scripts/db-migrate-safe.sh              # Show SQL diff, prompt before applying
./scripts/db-migrate-safe.sh --dry-run    # Preview only
./scripts/db-migrate-safe.sh --apply      # Apply without confirmation

For direct SQL on production:

docker exec nyxcore-postgres-1 psql -U nyxcore -d nyxcore

Infrastructure

Development

Service	Image	Port	Health Check
Next.js	—	3000	`GET /`
PostgreSQL	`pgvector/pgvector:pg16`	5432	`pg_isready -U nyxcore`
Redis	`redis:7-alpine`	6379	`redis-cli ping`

npm run docker:up        # Start postgres + redis
npm run docker:down      # Stop (preserves volumes)
docker compose down -v   # Stop + delete data volumes

Production Stack

All services run via docker-compose.production.yml on root@46.225.232.35:

Service	Image	Purpose
`app`	Custom (Dockerfile)	Next.js application
`postgres`	`pgvector/pgvector:pg16`	Primary database
`redis`	`redis:7-alpine`	Rate limiting
`ollama`	`ollama/ollama:latest`	Local LLM inference
`ipcha`	Custom (`ipcha/Dockerfile`)	IPCHA claim verification
`ckb`	`ghcr.io/simplyliz/ckb:latest`	Code Knowledge Base
`traefik`	`traefik:latest`	Reverse proxy + TLS (Let's Encrypt)
`prometheus`	`prom/prometheus:v2.53.0`	Metrics (30d retention)
`grafana`	`grafana/grafana:11.3.0`	Dashboards (`grafana.nyxcore.cloud`)
`loki`	`grafana/loki:3.4.2`	Log aggregation (7d retention)
`alloy`	`grafana/alloy:v1.7.1`	Docker log collection (DSGVO-compliant)
`node-exporter`	`prom/node-exporter:v1.8.0`	Host metrics
`cadvisor`	`gcr.io/cadvisor/cadvisor:v0.49.1`	Container metrics
`postgres-exporter`	`prometheuscommunity/postgres-exporter:v0.15.0`	PostgreSQL metrics
`redis-exporter`	`oliver006/redis_exporter:v1.62.0`	Redis metrics

Database

pgvector: Extension v0.8.1, HNSW index on workflow_insights.embedding (m=16, ef_construction=64, vector_cosine_ops)
RLS: Row-Level Security policies in prisma/rls.sql enforce tenant isolation
Backups: scripts/db-backup.sh or pg_dump; the pgvector extension must be available in the target database on restore, because the embedding column depends on it
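
For reference, vector_cosine_ops makes the index serve pgvector's cosine distance operator (<=>), which is 1 minus the cosine similarity. The sketch below computes the same quantity in plain Python, which can help when sanity-checking search results by hand. cosine_distance is an illustrative helper, not code from the application.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine distance: 0 for same direction, 1 for orthogonal, 2 for opposite."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)
```

Smaller values mean closer matches, so a nearest-neighbour query orders by this distance in ascending order.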

Rate Limiting

Endpoint Type	Limit	Window
General API	100 req	1 min
LLM operations	10 req	1 min

Rate limiting is fail-open: if Redis is unavailable, requests are allowed through without any limit. Check Redis connectivity first if you see unusual traffic patterns.
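
The fail-open behaviour can be illustrated with a small sketch. It assumes a fixed-window counter, which may not be the algorithm the application actually uses, and it replaces Redis with a hypothetical in-memory store. StoreUnavailable, InMemoryStore, and allow_request are made-up names.

```python
import time
from typing import Callable, Dict, Tuple


class StoreUnavailable(Exception):
    """Raised when the counter backend (Redis in production) cannot be reached."""


class InMemoryStore:
    """Hypothetical stand-in for Redis: counts hits per (key, window)."""

    def __init__(self) -> None:
        self.counts: Dict[Tuple[str, int], int] = {}
        self.available = True

    def incr(self, key: str, window_id: int) -> int:
        if not self.available:
            raise StoreUnavailable("counter store is down")
        slot = (key, window_id)
        self.counts[slot] = self.counts.get(slot, 0) + 1
        return self.counts[slot]


def allow_request(
    store: InMemoryStore,
    key: str,
    limit: int,
    window_seconds: int = 60,
    now: Callable[[], float] = time.time,
) -> bool:
    """Return True if the request may proceed under a fixed-window limit."""
    window_id = int(now() // window_seconds)
    try:
        count = store.incr(key, window_id)
    except StoreUnavailable:
        return True  # fail-open: a broken store must not block traffic
    return count <= limit
```

With limit=10 this mirrors the LLM operations row: the eleventh call inside one minute is rejected, but every call is accepted while the store is down. That is why a Redis outage shows up as missing throttling rather than as errors.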

Monitoring

Dashboards

Grafana: https://grafana.nyxcore.cloud (login: admin / GRAFANA_ADMIN_PASSWORD)
Prometheus: Internal on nyxcore-net, queried by Grafana as a data source
Loki: Log queries via Grafana Explore

Key Metrics to Watch

Audit logs: audit_logs table tracks all significant actions (API key changes, PR creation, workflow execution)
Workflow step costs: workflow_steps.costEstimate — aggregate to track LLM spend
Token usage: workflow_steps.tokenUsage — monitor for cost anomalies
Failed workflows: SELECT count(*) FROM workflows WHERE status = 'failed'
Container resources: cAdvisor dashboards in Grafana
Host metrics: Node Exporter dashboards (CPU, memory, disk)
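
The two database-backed checks above can be combined into one small report. The sketch below uses an in-memory SQLite database as a stand-in for PostgreSQL so it runs anywhere; the table and column names follow the text above, but the real schema may map them differently (for example to snake_case), and spend_and_failures is a hypothetical helper.

```python
import sqlite3
from typing import Tuple


def spend_and_failures(conn: sqlite3.Connection) -> Tuple[float, int]:
    """Return (total estimated LLM cost, number of failed workflows)."""
    total_cost = conn.execute(
        'SELECT COALESCE(SUM("costEstimate"), 0) FROM workflow_steps'
    ).fetchone()[0]
    failed = conn.execute(
        "SELECT count(*) FROM workflows WHERE status = 'failed'"
    ).fetchone()[0]
    return float(total_cost), int(failed)


# Demo data only: three steps (one without a cost) and three workflows.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE workflow_steps (id INTEGER PRIMARY KEY, "costEstimate" REAL)')
conn.execute("CREATE TABLE workflows (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    'INSERT INTO workflow_steps ("costEstimate") VALUES (?)',
    [(0.25,), (0.5,), (None,)],
)
conn.executemany(
    "INSERT INTO workflows (status) VALUES (?)",
    [("completed",), ("failed",), ("failed",)],
)
```

The same two SELECT statements should carry over to psql once the column names are confirmed against the production schema.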

Prometheus Application Metrics

Application-level metrics exposed via prom-client at /api/v1/metrics:

Metric	Type	Labels	Description
`nyxcore_llm_call_duration_seconds`	Histogram	provider, model, method	LLM API call latency
`nyxcore_llm_calls_total`	Counter	provider, model, status	Total LLM API calls
`nyxcore_llm_tokens_total`	Counter	provider, model, type	Tokens consumed (prompt/completion)
`nyxcore_llm_cost_usd_total`	Counter	provider, model	Estimated LLM cost in USD
`nyxcore_workflow_duration_seconds`	Histogram	status	Workflow execution duration
`nyxcore_workflow_steps_total`	Counter	status	Workflow steps executed
`nyxcore_active_workflows`	Gauge	—	Currently running workflows
`nyxcore_discussion_duration_seconds`	Histogram	mode	Discussion processing duration
`nyxcore_active_discussions`	Gauge	—	Currently active discussions
`nyxcore_sse_connections_active`	Gauge	endpoint	Active SSE connections
`nyxcore_rate_limit_hits_total`	Counter	source	Rate limit hits
`nyxcore_http_request_duration_seconds`	Histogram	method, route, status_code	HTTP request latency
`nyxcore_http_requests_total`	Counter	method, route, status_code	Total HTTP requests

The LLM metrics are wired via the withMetrics() wrapper in src/server/services/llm/resolve-provider.ts: every resolveProvider() call automatically instruments the returned provider.
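
To spot-check spend without opening Grafana, the text returned by /api/v1/metrics can be summed per label. This is a minimal sketch, assuming the endpoint serves the standard Prometheus text exposition format (which prom-client produces). The sample values and the provider and model names are made up, and sum_by_label is a hypothetical helper.

```python
import re
from collections import defaultdict
from typing import Dict

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')

SAMPLE = """\
# HELP nyxcore_llm_cost_usd_total Estimated LLM cost in USD
# TYPE nyxcore_llm_cost_usd_total counter
nyxcore_llm_cost_usd_total{provider="provider-a",model="model-x"} 1.5
nyxcore_llm_cost_usd_total{provider="provider-b",model="model-y"} 0.25
nyxcore_llm_cost_usd_total{provider="provider-a",model="model-z"} 0.5
nyxcore_active_workflows 3
"""


def sum_by_label(text: str, metric: str, label: str) -> Dict[str, float]:
    """Sum all samples of `metric`, grouped by the value of `label`."""
    totals: Dict[str, float] = defaultdict(float)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if not match or match.group("name") != metric:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        totals[labels.get(label, "")] += float(match.group("value"))
    return dict(totals)
```

Calling sum_by_label(SAMPLE, "nyxcore_llm_cost_usd_total", "provider") groups the three cost samples into one total per provider. In practice the text would come from an HTTP GET against the metrics endpoint instead of the SAMPLE constant.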

Log Collection

Alloy collects Docker container logs and forwards to Loki
Logs include JSON-parsed fields: level, status, method, path, duration
DSGVO-compliant: Client IPs are anonymized (IPv4 last octet → 0, IPv6 truncated)
Health-check and metrics paths are auto-dropped to reduce noise
Application logs: docker compose -f docker-compose.production.yml logs -f app
Audit trail: audit_logs table (queryable via Prisma Studio: npm run db:studio)
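
The anonymization rule can be expressed in a few lines, which is useful for checking that a log line was scrubbed correctly. This sketch only illustrates the rule; it is not the Alloy pipeline configuration. The IPv6 truncation length is not stated above, so the /64 prefix used here is an assumption, and anonymize_ip is a hypothetical helper.

```python
import ipaddress


def anonymize_ip(raw: str, ipv6_prefix: int = 64) -> str:
    """Zero the last IPv4 octet, or keep only the leading IPv6 prefix bits."""
    addr = ipaddress.ip_address(raw)
    # /24 keeps the first three octets of an IPv4 address.
    prefix = 24 if addr.version == 4 else ipv6_prefix
    # strict=False lets us pass a host address and get its enclosing network.
    network = ipaddress.ip_network(f"{addr}/{prefix}", strict=False)
    return str(network.network_address)
```

An address such as 203.0.113.77 becomes 203.0.113.0. If a full client address ever appears in Loki, the anonymization stage in the collector is the first thing to check.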

Docker Maintenance

# Prune unused images and build cache (frees disk space)
docker system prune -af && docker builder prune -af

Common Issues

"No GitHub token configured"

Cause: Tenant has no GitHub PAT stored in the API keys vault.
Fix: Go to Admin > API Keys and add a GitHub personal access token with the repo and read:org scopes.

"Workflow stuck in running state"

Cause: SSE connection dropped or server restarted mid-execution.
Fix: Reset workflows and steps that have been stuck in running for more than 30 minutes:

UPDATE workflows SET status = 'paused' WHERE status = 'running' AND updated_at < NOW() - INTERVAL '30 minutes';
UPDATE workflow_steps SET status = 'pending' WHERE status = 'running' AND updated_at < NOW() - INTERVAL '30 minutes';

Then resume from the dashboard.

"ENCRYPTION_KEY mismatch"

Cause: ENCRYPTION_KEY changed between deployments, so stored API keys can no longer be decrypted.
Fix: Restore the original key, or have tenants re-enter their API keys in Admin > API Keys.
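
One way to catch a changed key before it causes decryption failures is to compare a short fingerprint of ENCRYPTION_KEY across deployments. This is a sketch under the assumption that the key is a high-entropy random value, so a truncated hash reveals nothing useful about it. key_fingerprint and same_key are hypothetical helpers, not part of the application.

```python
import hashlib
import hmac


def key_fingerprint(key: str) -> str:
    """Return a short, non-reversible identifier for a secret key."""
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:12]


def same_key(fingerprint_a: str, fingerprint_b: str) -> bool:
    """Compare two fingerprints in constant time."""
    return hmac.compare_digest(fingerprint_a, fingerprint_b)
```

Recording key_fingerprint(os.environ["ENCRYPTION_KEY"]) before a deploy and comparing it afterwards shows whether the key changed, without ever printing the key itself.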

"prisma db push fails with column conflict"

Cause: Adding a required column to a table with existing rows.
Fix: Add @default("") to the column in schema.prisma, or make it optional (?). See CLAUDE.md Prisma Gotchas.

"RLS blocks all queries"

Cause: RLS policies not applied, or tenantId not set in session context.
Fix: Re-apply the policies:

psql "$DATABASE_URL" -f prisma/rls.sql

Also verify that the enforceTenant middleware is in the tRPC chain, since it is what sets the tenant for each request.

"Docker build fails with ERESOLVE"

Cause: eslint peer dependency conflict between eslint@8 and eslint-config-next@16.
Fix: The Dockerfile already uses npm ci --legacy-peer-deps. If new dependencies cause similar conflicts, make sure --legacy-peer-deps is set in both the deps and builder stages.

"Vector search returns no results"

Cause: The workflow_insights.embedding column has the Unsupported type in Prisma, so Prisma client methods cannot read or filter on it.
Fix: Use raw SQL built with Prisma.sql for vector operations, not Prisma client methods. Check that embeddings are generated via text-embedding-3-small (1536 dimensions).
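
A quick dimension check often explains empty or failing searches, because an embedding with the wrong length cannot match the 1536-dimension column. The sketch below validates a vector and formats it in pgvector's text input format ([x1,x2,...]) for use as a raw SQL parameter. to_pgvector_literal is an illustrative helper and is not taken from the codebase.

```python
import math
from typing import Sequence

EXPECTED_DIMENSIONS = 1536  # output size of text-embedding-3-small, per the text above


def to_pgvector_literal(
    embedding: Sequence[float],
    expected_dimensions: int = EXPECTED_DIMENSIONS,
) -> str:
    """Validate an embedding and render it as a pgvector text literal."""
    if len(embedding) != expected_dimensions:
        raise ValueError(
            f"expected {expected_dimensions} dimensions, got {len(embedding)}"
        )
    if not all(math.isfinite(x) for x in embedding):
        raise ValueError("embedding contains NaN or infinite values")
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"
```

The resulting string can be passed as a bound parameter and cast to vector inside the raw query, which avoids interpolating numbers into the SQL text by hand.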

Rollback

Application Rollback

Deploy the previous build artifact / container image
No database rollback needed if no migrations were run

Database Rollback

If a migration was applied:

prisma migrate resolve --rolled-back <migration_name>

This only updates Prisma's migration history. You must then revert the SQL changes manually, because Prisma does not roll back migrations automatically.

For db:push (dev only): There is no built-in rollback. Restore from backup or manually revert schema changes.

Emergency: Disable a Feature

To disable LLM-dependent features without downtime, remove the provider's API key from the tenant's key vault (Admin > API Keys). Because providers follow the BYOK (bring-your-own-key) pattern, no key means no LLM calls.

Secrets Management

Secret	Storage	Rotation
`AUTH_SECRET`	Environment variable	Rotate anytime; invalidates existing sessions
`ENCRYPTION_KEY`	Environment variable	Cannot rotate without re-encrypting all stored API keys
Tenant API keys	`api_keys` table, AES-256-GCM encrypted	Tenants manage via Admin > API Keys
`STRIPE_SECRET_KEY`	Environment variable	Rotate via Stripe Dashboard; update `.env.production`
`STRIPE_WEBHOOK_SECRET`	Environment variable	Re-create webhook in Stripe Dashboard if rotated
GitHub PAT	`api_keys` table, encrypted	Rotate via GitHub + re-enter in Admin

Letter-to-Blog Pipeline

Session checkpoints in .memory/letter_*.md are auto-converted to blog drafts:

python .github/scripts/blog_gen.py                               # Latest letter
python .github/scripts/blog_gen.py --file letter_20260227_0001.md # Specific file

The GitHub Actions workflow .github/workflows/vibe_publisher.yml auto-creates PRs when letters are pushed to .memory/.
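
How "latest letter" is determined is not documented here. One plausible rule, sketched below, is to pick the highest date and sequence number from file names shaped like letter_YYYYMMDD_NNNN.md. That pattern is inferred from the single example above, and latest_letter is a hypothetical helper, not the logic inside blog_gen.py.

```python
import re
from typing import Iterable, Optional

# Assumed naming scheme: letter_<8-digit date>_<4-digit sequence>.md
LETTER_RE = re.compile(r"^letter_(\d{8})_(\d{4})\.md$")


def latest_letter(names: Iterable[str]) -> Optional[str]:
    """Return the file name with the highest (date, sequence), or None."""
    candidates = []
    for name in names:
        match = LETTER_RE.match(name)
        if match:
            # Zero-padded digits sort correctly as strings.
            candidates.append(((match.group(1), match.group(2)), name))
    return max(candidates)[1] if candidates else None
```

Feeding it the file names found in .memory/ gives the letter that would be passed to --file. Names that do not match the pattern are ignored.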