Operations Runbook
Deployment
Prerequisites
- Production server with Docker + Docker Compose
- All required environment variables in .env.production
- ENCRYPTION_KEY must be the same across deployments (keys are encrypted at rest)
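A pre-deploy check can catch a missing variable before the containers are rebuilt. The following is a minimal sketch, not part of the repository: the REQUIRED_KEYS list is an assumption drawn from variables this runbook mentions, and the parser only handles plain KEY=VALUE lines.

```python
from pathlib import Path

# Assumed list, based on variables named in this runbook; extend as needed.
REQUIRED_KEYS = ["DATABASE_URL", "AUTH_SECRET", "ENCRYPTION_KEY"]


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(text: str, required=REQUIRED_KEYS) -> list[str]:
    """Return the required keys that are absent or empty."""
    env = parse_env(text)
    return [key for key in required if not env.get(key)]


if __name__ == "__main__":
    path = Path(".env.production")
    if path.exists():
        missing = missing_keys(path.read_text())
        print("missing:", missing if missing else "none")
```

Run it from /opt/nyxcore before building; an empty result only shows the keys are present, not that their values are correct.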
Production Deploy (Docker Compose)
ssh root@46.225.232.35
cd /opt/nyxcore
git pull
docker compose -f docker-compose.production.yml build --no-cache app
docker compose -f docker-compose.production.yml up -d app
Or use the deploy script:
./scripts/deploy.sh
Post-Deploy
- Verify health: curl http://localhost:3000/api/v1/health
- Check logs: docker compose -f docker-compose.production.yml logs -f app --tail 50
- Verify Redis connection (rate limiting is fail-open, so the app works without it)
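Because the container may take a few seconds to start, a single curl can fail spuriously. The sketch below polls the health endpoint with retries. It assumes the endpoint answers HTTP 200 when healthy, which this runbook does not state; the function names and retry defaults are illustrative.

```python
import time
import urllib.error
import urllib.request

HEALTH_URL = "http://localhost:3000/api/v1/health"  # endpoint from the list above


def http_status(url: str, timeout: float = 3.0) -> int:
    """Return the HTTP status code, or 0 if the request could not complete."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # server answered with a non-2xx status
    except (urllib.error.URLError, OSError):
        return 0  # connection refused, timeout, DNS failure, ...


def wait_for_health(url=HEALTH_URL, attempts=10, delay=3.0,
                    fetch=http_status, sleep=time.sleep) -> bool:
    """Poll until fetch(url) returns 200 or the attempts run out."""
    for attempt in range(1, attempts + 1):
        if fetch(url) == 200:  # assumption: 200 means healthy
            return True
        if attempt < attempts:
            sleep(delay)
    return False
```

The fetch and sleep parameters are injectable so the loop can be exercised without a running server.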
First-Time Setup
npm run db:push # Apply full schema
psql $DATABASE_URL -f prisma/rls.sql # Apply RLS policies
npm run db:seed # Seed default tenant + built-in personas
Production Schema Changes
Never use db push on production — it drops pgvector embedding columns. Use the safe migration script:
./scripts/db-migrate-safe.sh # Show SQL diff, prompt before applying
./scripts/db-migrate-safe.sh --dry-run # Preview only
./scripts/db-migrate-safe.sh --apply # Apply without confirmation
For direct SQL on production:
docker exec nyxcore-postgres-1 psql -U nyxcore -d nyxcore
Infrastructure
Development
| Service | Image | Port | Health Check |
|---|---|---|---|
| Next.js | — | 3000 | GET / |
| PostgreSQL | pgvector/pgvector:pg16 | 5432 | pg_isready -U nyxcore |
| Redis | redis:7-alpine | 6379 | redis-cli ping |
npm run docker:up # Start postgres + redis
npm run docker:down # Stop (preserves volumes)
docker compose down -v # Stop + delete data volumes
Production Stack
All services run via docker-compose.production.yml on root@46.225.232.35:
| Service | Image | Purpose |
|---|---|---|
| app | Custom (Dockerfile) | Next.js application |
| postgres | pgvector/pgvector:pg16 | Primary database |
| redis | redis:7-alpine | Rate limiting |
| ollama | ollama/ollama:latest | Local LLM inference |
| ipcha | Custom (ipcha/Dockerfile) | IPCHA claim verification |
| ckb | ghcr.io/simplyliz/ckb:latest | Code Knowledge Base |
| traefik | traefik:latest | Reverse proxy + TLS (Let's Encrypt) |
| prometheus | prom/prometheus:v2.53.0 | Metrics (30d retention) |
| grafana | grafana/grafana:11.3.0 | Dashboards (grafana.nyxcore.cloud) |
| loki | grafana/loki:3.4.2 | Log aggregation (7d retention) |
| alloy | grafana/alloy:v1.7.1 | Docker log collection (DSGVO-compliant) |
| node-exporter | prom/node-exporter:v1.8.0 | Host metrics |
| cadvisor | gcr.io/cadvisor/cadvisor:v0.49.1 | Container metrics |
| postgres-exporter | prometheuscommunity/postgres-exporter:v0.15.0 | PostgreSQL metrics |
| redis-exporter | oliver006/redis_exporter:v1.62.0 | Redis metrics |
Database
- pgvector: Extension v0.8.1, HNSW index on workflow_insights.embedding (m=16, ef_construction=64, vector_cosine_ops)
- RLS: Row-Level Security policies in prisma/rls.sql enforce tenant isolation
- Backups: scripts/db-backup.sh or pg_dump; the embedding column requires the pgvector extension on restore
Rate Limiting
| Endpoint Type | Limit | Window |
|---|---|---|
| General API | 100 req | 1 min |
| LLM operations | 10 req | 1 min |
Rate limiting is fail-open — if Redis is unavailable, requests are allowed through. Monitor Redis connectivity if you see unusual traffic patterns.
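The fail-open behaviour can be illustrated with a small sketch. This is not the application's implementation: the fixed-window algorithm, the InMemoryStore stand-in for Redis, and all names here are assumptions chosen to show the decision path when the store is unreachable.

```python
import time


class StoreUnavailable(Exception):
    """Stands in for a Redis connection error in this sketch."""


class InMemoryStore:
    """Hypothetical stand-in for Redis: counts hits per window key."""

    def __init__(self):
        self.counts = {}
        self.available = True

    def incr(self, key: str) -> int:
        if not self.available:
            raise StoreUnavailable("store unreachable")
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key]


def allow_request(store, client_id: str, limit: int = 100,
                  window_s: int = 60, now=None) -> bool:
    """Fixed-window check that allows the request if the store is down."""
    now = time.time() if now is None else now
    bucket = int(now // window_s)  # all requests in the same window share a key
    try:
        count = store.incr(f"rl:{client_id}:{bucket}")
    except StoreUnavailable:
        return True  # fail-open: no store means no limit is enforced
    return count <= limit
```

The trade-off is availability over protection: during a Redis outage the limits in the table above are not enforced at all, which is why Redis connectivity is worth alerting on.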
Monitoring
Dashboards
- Grafana: https://grafana.nyxcore.cloud (login: admin / GRAFANA_ADMIN_PASSWORD)
- Prometheus: Internal on nyxcore-net, queried by Grafana as a data source
- Loki: Log queries via Grafana Explore
Key Metrics to Watch
- Audit logs: audit_logs table tracks all significant actions (API key changes, PR creation, workflow execution)
- Workflow step costs: workflow_steps.costEstimate (aggregate to track LLM spend)
- Token usage: workflow_steps.tokenUsage (monitor for cost anomalies)
- Failed workflows: SELECT count(*) FROM workflows WHERE status = 'failed'
- Container resources: cAdvisor dashboards in Grafana
- Host metrics: Node Exporter dashboards (CPU, memory, disk)
Prometheus LLM Metrics
Application-level metrics exposed via prom-client at /api/v1/metrics:
| Metric | Type | Labels | Description |
|---|---|---|---|
| nyxcore_llm_call_duration_seconds | Histogram | provider, model, method | LLM API call latency |
| nyxcore_llm_calls_total | Counter | provider, model, status | Total LLM API calls |
| nyxcore_llm_tokens_total | Counter | provider, model, type | Tokens consumed (prompt/completion) |
| nyxcore_llm_cost_usd_total | Counter | provider, model | Estimated LLM cost in USD |
| nyxcore_workflow_duration_seconds | Histogram | status | Workflow execution duration |
| nyxcore_workflow_steps_total | Counter | status | Workflow steps executed |
| nyxcore_active_workflows | Gauge | — | Currently running workflows |
| nyxcore_discussion_duration_seconds | Histogram | mode | Discussion processing duration |
| nyxcore_active_discussions | Gauge | — | Currently active discussions |
| nyxcore_sse_connections_active | Gauge | endpoint | Active SSE connections |
| nyxcore_rate_limit_hits_total | Counter | source | Rate limit hits |
| nyxcore_http_request_duration_seconds | Histogram | method, route, status_code | HTTP request latency |
| nyxcore_http_requests_total | Counter | method, route, status_code | Total HTTP requests |
Wired via withMetrics() wrapper in src/server/services/llm/resolve-provider.ts — every resolveProvider() call automatically instruments the returned provider.
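The wrapper pattern can be sketched independently of the real code. The example below is an illustration in Python, not a port of the TypeScript in resolve-provider.ts: the Metrics class stands in for a Prometheus registry, EchoProvider is a made-up provider, and the status values "success" and "error" are assumptions.

```python
import time
from collections import defaultdict


class Metrics:
    """Minimal in-process stand-in for a Prometheus registry."""

    def __init__(self):
        self.calls_total = defaultdict(int)   # (provider, model, status) -> count
        self.durations = defaultdict(list)    # (provider, model, method) -> seconds


class EchoProvider:
    """Hypothetical provider used only for this demonstration."""

    name = "echo"

    def complete(self, model: str, prompt: str) -> str:
        if not prompt:
            raise ValueError("empty prompt")
        return prompt.upper()


def with_metrics(provider, metrics: Metrics):
    """Return a proxy whose complete() records latency and outcome."""

    class Instrumented:
        name = provider.name

        def complete(self, model: str, prompt: str) -> str:
            start = time.perf_counter()
            status = "success"
            try:
                return provider.complete(model, prompt)
            except Exception:
                status = "error"
                raise  # the caller still sees the original exception
            finally:
                elapsed = time.perf_counter() - start
                metrics.durations[(provider.name, model, "complete")].append(elapsed)
                metrics.calls_total[(provider.name, model, status)] += 1

    return Instrumented()
```

Recording in a finally block means failed calls are counted and timed as well, so error rates and latency stay comparable across providers.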
Log Collection
- Alloy collects Docker container logs and forwards to Loki
- Logs include JSON-parsed fields: level, status, method, path, duration
- DSGVO-compliant: Client IPs are anonymized (IPv4 last octet → 0, IPv6 truncated)
- Health-check and metrics paths are auto-dropped to reduce noise
- Application logs: docker compose -f docker-compose.production.yml logs -f app
- Audit trail: audit_logs table (queryable via Prisma Studio: npm run db:studio)
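The IP anonymization rule in the list above can be sketched with the standard library. This is an illustration of the rule, not the Alloy pipeline configuration; in particular the /48 IPv6 prefix is an assumption, since the runbook only says IPv6 addresses are truncated.

```python
import ipaddress


def anonymize_ip(raw: str, v6_prefix: int = 48) -> str:
    """Zero the last IPv4 octet, or truncate IPv6 to v6_prefix bits."""
    addr = ipaddress.ip_address(raw)
    # /24 keeps the first three IPv4 octets; the IPv6 prefix length is assumed.
    prefix = 24 if addr.version == 4 else v6_prefix
    network = ipaddress.ip_network(f"{addr}/{prefix}", strict=False)
    return str(network.network_address)
```

For example, anonymize_ip("203.0.113.77") yields "203.0.113.0". This is handy for checking that a sampled log line from Loki no longer contains a full client address.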
Docker Maintenance
# Prune unused images and build cache (frees disk space)
docker system prune -af && docker builder prune -af
Common Issues
"No GitHub token configured"
Cause: Tenant has no GitHub PAT stored in API keys vault.
Fix: Go to Admin > API Keys > Add a GitHub personal access token with repo and read:org scopes.
"Workflow stuck in running state"
Cause: SSE connection dropped or server restarted mid-execution.
Fix:
UPDATE workflows SET status = 'paused' WHERE status = 'running' AND updated_at < NOW() - INTERVAL '30 minutes';
UPDATE workflow_steps SET status = 'pending' WHERE status = 'running' AND updated_at < NOW() - INTERVAL '30 minutes';
Then resume from the dashboard.
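If this reset is scripted, both updates are best applied in one transaction so a workflow is never paused while its steps stay marked as running. The sketch below shows that idea against an in-memory SQLite database purely for illustration; production runs PostgreSQL, and the function name and timestamp format (ISO 8601 text) are assumptions.

```python
import sqlite3
from datetime import datetime, timedelta


def reset_stale(conn: sqlite3.Connection, now: datetime,
                max_age: timedelta = timedelta(minutes=30)) -> tuple[int, int]:
    """Pause stale workflows and reset their steps; returns rows changed."""
    cutoff = (now - max_age).isoformat()
    with conn:  # commits on success, rolls back if either update raises
        workflows = conn.execute(
            "UPDATE workflows SET status = 'paused' "
            "WHERE status = 'running' AND updated_at < ?",
            (cutoff,),
        ).rowcount
        steps = conn.execute(
            "UPDATE workflow_steps SET status = 'pending' "
            "WHERE status = 'running' AND updated_at < ?",
            (cutoff,),
        ).rowcount
    return workflows, steps
```

Returning the affected row counts makes it easy to log how many workflows were touched before resuming them from the dashboard.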
"ENCRYPTION_KEY mismatch"
Cause: ENCRYPTION_KEY changed between deployments. Stored API keys can't be decrypted.
Fix: Restore the original key, or have tenants re-enter their API keys in Admin > API Keys.
"prisma db push fails with column conflict"
Cause: Adding a required column to a table with existing rows.
Fix: Add @default("") to the column in schema.prisma, or make it optional (?). See CLAUDE.md Prisma Gotchas.
"RLS blocks all queries"
Cause: RLS policies not applied, or tenantId not set in session context.
Fix:
psql $DATABASE_URL -f prisma/rls.sql
Verify enforceTenant middleware is in the tRPC chain.
"Docker build fails with ERESOLVE"
Cause: eslint peer dependency conflict between eslint@8 and eslint-config-next@16.
Fix: Dockerfile already uses npm ci --legacy-peer-deps. If adding new dependencies that cause similar conflicts, ensure --legacy-peer-deps is on both deps and builder stages.
"Vector search returns no results"
Cause: workflow_insights.embedding column is Unsupported in Prisma — must use raw SQL.
Fix: Use Prisma.sql for vector operations, not Prisma client methods. Check that embeddings are generated via text-embedding-3-small (1536 dimensions).
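When debugging empty results, it helps to confirm the query vector has the expected dimension and to run a raw cosine search by hand. The sketch below is a Python illustration, not the application's Prisma code: the id column and the psycopg-style named placeholders are assumptions, while the bracketed text format and the <=> cosine-distance operator come from pgvector.

```python
EXPECTED_DIM = 1536  # text-embedding-3-small, per the fix above


def to_pgvector_literal(embedding: list[float], dim: int = EXPECTED_DIM) -> str:
    """Format an embedding as pgvector text input, e.g. '[0.1,0.2]'."""
    if len(embedding) != dim:
        raise ValueError(f"expected {dim} dimensions, got {len(embedding)}")
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


# Raw cosine search. Ordering by <=> matches the vector_cosine_ops HNSW index.
# The "id" column and %(name)s placeholder style are assumptions.
SEARCH_SQL = """
SELECT id, 1 - (embedding <=> %(query)s::vector) AS similarity
FROM workflow_insights
ORDER BY embedding <=> %(query)s::vector
LIMIT %(limit)s
"""
```

A dimension mismatch raises an error before the query runs, which separates "wrong embedding model" from "no matching rows".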
Rollback
Application Rollback
- Deploy the previous build artifact / container image
- No database rollback needed if no migrations were run
Database Rollback
If a migration was applied:
prisma migrate resolve --rolled-back <migration_name>
Then manually revert the SQL changes. Prisma does not auto-rollback migrations.
For db:push (dev only): There is no built-in rollback. Restore from backup or manually revert schema changes.
Emergency: Disable a Feature
To disable LLM-dependent features without downtime, remove the provider's API key from the tenant's key vault (Admin > API Keys). The BYOK pattern means no key = no LLM calls.
Secrets Management
| Secret | Storage | Rotation |
|---|---|---|
| AUTH_SECRET | Environment variable | Rotate anytime; invalidates existing sessions |
| ENCRYPTION_KEY | Environment variable | Cannot rotate without re-encrypting all stored API keys |
| Tenant API keys | api_keys table, AES-256-GCM encrypted | Tenants manage via Admin > API Keys |
| STRIPE_SECRET_KEY | Environment variable | Rotate via Stripe Dashboard; update .env.production |
| STRIPE_WEBHOOK_SECRET | Environment variable | Re-create webhook in Stripe Dashboard if rotated |
| GitHub PAT | api_keys table, encrypted | Rotate via GitHub + re-enter in Admin |
Letter-to-Blog Pipeline
Session checkpoints in .memory/letter_*.md are auto-converted to blog drafts:
python .github/scripts/blog_gen.py # Latest letter
python .github/scripts/blog_gen.py --file letter_20260227_0001.md # Specific file
The GitHub Actions workflow .github/workflows/vibe_publisher.yml auto-creates PRs when letters are pushed to .memory/.
