14. Code Analysis Pipeline
The Code Analysis pipeline is a 3-phase system that scans repositories, detects structural and behavioral patterns via LLM, and generates documentation artifacts. It underpins both the Auto-Fix and Refactor pipelines by producing the RepositoryFile index and CodePattern records they depend on.
Pipeline Architecture
Orchestrator
runAnalysis() in src/server/services/code-analysis/analysis-runner.ts is the top-level AsyncGenerator<AnalysisEvent> that coordinates all three phases. Each phase can be independently skipped via config flags:
```typescript
interface AnalysisRunOptions {
  runId: string;
  repositoryId: string;
  tenantId: string;
  userId: string;
  skipScan?: boolean;     // Skip Phase 1
  skipPatterns?: boolean; // Skip Phase 2
  skipDocs?: boolean;     // Skip Phase 3
  provider?: string;
  model?: string;
}
```
Status transitions: pending -> scanning -> analyzing -> generating -> completed. A run that throws at any point is marked failed instead.
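
A small sketch can make the interaction between the skip flags and the status sequence concrete. The helper `planStatuses` is hypothetical and not part of the codebase, and it assumes that a skipped phase simply contributes no status of its own; the real runAnalysis() may record statuses differently.

```typescript
type RunStatus =
  | "pending" | "scanning" | "analyzing" | "generating" | "completed" | "failed";

interface SkipFlags {
  skipScan?: boolean;
  skipPatterns?: boolean;
  skipDocs?: boolean;
}

// Hypothetical helper: list the statuses a successful run would pass through.
// Assumption: a skipped phase contributes no status of its own.
function planStatuses(flags: SkipFlags): RunStatus[] {
  const statuses: RunStatus[] = ["pending"];
  if (!flags.skipScan) statuses.push("scanning");
  if (!flags.skipPatterns) statuses.push("analyzing");
  if (!flags.skipDocs) statuses.push("generating");
  statuses.push("completed");
  return statuses;
}
```

Under that assumption, a run with skipScan and skipDocs set would pass through pending, analyzing, and completed only.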
SSE Streaming
The route at src/app/api/v1/events/code-analysis/[id]/route.ts authenticates the request, looks up the CodeAnalysisRun, extracts config, and streams AnalysisEvent objects as SSE. If the generator throws, the run is marked failed (best-effort DB update) and an error event is sent to the client.
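
For readers unfamiliar with the SSE wire format, the sketch below shows how a single event could be framed. The `SketchEvent` shape and both function names are assumptions made for illustration; the actual AnalysisEvent type and route code are not shown in this document.

```typescript
// Assumed event shape; the real AnalysisEvent type is not shown here.
interface SketchEvent {
  type: string;
  [key: string]: unknown;
}

// Serialize one event as a Server-Sent Events frame: a single "data:" line
// holding JSON, terminated by a blank line.
function toSseFrame(event: SketchEvent): string {
  return `data: ${JSON.stringify(event)}\n\n`;
}

// Hypothetical error frame of the kind a route could send when the generator throws.
function errorFrame(err: unknown): string {
  const message = err instanceof Error ? err.message : String(err);
  return toSseFrame({ type: "error", message });
}
```

Because JSON.stringify escapes newlines, each event fits on one `data:` line, which keeps the framing simple.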
Phase 1: Repository Scanner
scanRepository() in src/server/services/code-analysis/scanner.ts:
- Resolves the tenant's GitHub token via `resolveGitHubToken()`
- Fetches the full file tree with `fetchRepoTree()`
- Deletes any existing `RepositoryFile` records (full re-sync)
- Processes files in batches of 50, optionally fetching content for source files
Content Budget Constants
| Constant | Value | Description |
|---|---|---|
| `MAX_CONTENT_FETCH` | 200 | Maximum source files to fetch content for |
| `MAX_FILE_SIZE` | 100,000 | Skip files > 100KB for content analysis |
| `BATCH_SIZE` | 50 | Files per DB insert batch |
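
The three constants could combine roughly as follows. This is a sketch under stated assumptions: `TreeEntry`, `chunk`, and `selectForContentFetch` are hypothetical names, and files are picked in tree order, which the real scanner may not do.

```typescript
const MAX_CONTENT_FETCH = 200;
const MAX_FILE_SIZE = 100_000;
const BATCH_SIZE = 50;

// Assumed shape of one entry in the fetched file tree.
interface TreeEntry {
  path: string;
  size: number;
  isSource: boolean;
}

// Split items into consecutive batches of at most `size` elements.
function chunk<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Pick the paths whose content would be fetched: source files no larger than
// MAX_FILE_SIZE, in tree order, stopping once MAX_CONTENT_FETCH is reached.
function selectForContentFetch(entries: TreeEntry[]): Set<string> {
  const selected = new Set<string>();
  for (const entry of entries) {
    if (selected.size >= MAX_CONTENT_FETCH) break;
    if (entry.isSource && entry.size <= MAX_FILE_SIZE) selected.add(entry.path);
  }
  return selected;
}
```

A caller would then iterate over `chunk(entries, BATCH_SIZE)` and insert each batch, fetching content only for paths in the selected set.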
Source File Detection
Only files with recognized source extensions are eligible for content fetching:
```typescript
const SOURCE_EXTENSIONS = new Set([
  "ts", "tsx", "js", "jsx", "mjs", "cjs",
  "py", "rb", "go", "rs", "java", "kt", "swift",
  "cs", "cpp", "c", "h", "hpp", "php",
  "vue", "svelte", "dart", "ex", "exs",
]);
```
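
A minimal sketch of how the set might be consulted follows. `isSourceFile` is a hypothetical helper, and lowercasing the extension is an assumption about how the scanner treats case.

```typescript
const SOURCE_EXTENSIONS = new Set([
  "ts", "tsx", "js", "jsx", "mjs", "cjs",
  "py", "rb", "go", "rs", "java", "kt", "swift",
  "cs", "cpp", "c", "h", "hpp", "php",
  "vue", "svelte", "dart", "ex", "exs",
]);

// Hypothetical helper: true when the file's final extension is in the set.
function isSourceFile(path: string): boolean {
  const name = path.split("/").pop() ?? "";
  const dot = name.lastIndexOf(".");
  // No extension at all, or a dotfile such as ".gitignore".
  if (dot <= 0) return false;
  return SOURCE_EXTENSIONS.has(name.slice(dot + 1).toLowerCase());
}
```

Only the last extension is considered, so a file such as `lib/util.test.ts` still counts as eligible here; whether it is treated as a test is a separate classification step.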
File Indexer
extractFileMetadata() in src/server/services/code-analysis/file-indexer.ts classifies each file into:
| Category | Detection Method |
|---|---|
| `source` | Default for recognized language extensions |
| `config` | Matches patterns like `.env*`, `*config.*`, `Dockerfile`, `.github/` |
| `test` | Matches `*.test.*`, `*.spec.*`, `__tests__/`, `*.e2e.*` |
| `docs` | Matches `*.md`, `docs/`, `README`, `CHANGELOG`, `LICENSE` |
| `other` | Unrecognized language |
Language detection maps 50+ file extensions to language names. Special filenames (Dockerfile, Makefile, .gitignore) are handled explicitly.
The CategorizedFiles utility provides aggregate summaries by language and category for downstream consumers.
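
The category rules from the table can be sketched as a single function. The precedence used here (test, then config, then docs, then source or other) is an assumption, as are the function name and the exact regular expressions; extractFileMetadata() may order or match things differently.

```typescript
type FileCategory = "source" | "config" | "test" | "docs" | "other";

// Hypothetical classifier following the table's patterns.
// `hasKnownLanguage` stands in for the result of language detection.
function categorize(path: string, hasKnownLanguage: boolean): FileCategory {
  const name = path.split("/").pop() ?? "";

  // *.test.*, *.spec.*, *.e2e.*, or anything under __tests__/
  if (/\.(test|spec|e2e)\./.test(name) || path.includes("__tests__/")) {
    return "test";
  }
  // .env*, *config.*, Dockerfile, or anything under .github/
  if (
    name.startsWith(".env") ||
    /config\./.test(name) ||
    name === "Dockerfile" ||
    path.includes(".github/")
  ) {
    return "config";
  }
  // *.md, docs/, README, CHANGELOG, LICENSE
  if (
    name.endsWith(".md") ||
    path.startsWith("docs/") ||
    /^(README|CHANGELOG|LICENSE)/.test(name)
  ) {
    return "docs";
  }
  return hasKnownLanguage ? "source" : "other";
}
```

Checking test patterns first means `src/utils/date.test.ts` is classified as a test even though its extension is a recognized source language.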
Phase 2: Pattern Detection
detectPatterns() in src/server/services/code-analysis/pattern-detector.ts sends batches of source files to the LLM and merges the patterns it reports.
Eight Pattern Types
```typescript
type PatternType =
  | "architecture"   // Module organization, layer separation
  | "naming"         // File, function, variable naming conventions
  | "error-handling" // Try/catch patterns, error types
  | "testing"        // Test patterns, coverage approaches
  | "dependency"     // Import patterns, external coupling
  | "security"       // Input validation, auth, secrets
  | "performance"    // Performance patterns or anti-patterns
  | "style";         // Code style, formatting conventions
```
Batching Strategy
Source files are fetched from GitHub and split into character-budgeted batches:
| Constant | Value | Description |
|---|---|---|
| `BATCH_CHAR_BUDGET` | 50,000 | ~12.5k tokens per batch |
| `MAX_FILE_CHARS` | 10,000 | Truncate individual files at this limit |
| `MAX_SOURCE_FILES` | 100 | Max files to analyze total |
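
One way to read these limits is as a greedy packing loop. The sketch below is illustrative only: `SourceFile` and `buildBatches` are hypothetical names, and the choice to start a new batch as soon as the next file would exceed the budget is an assumption.

```typescript
const BATCH_CHAR_BUDGET = 50_000;
const MAX_FILE_CHARS = 10_000;
const MAX_SOURCE_FILES = 100;

// Assumed shape of a fetched source file.
interface SourceFile {
  path: string;
  content: string;
}

// Greedy packing: take at most MAX_SOURCE_FILES files, truncate each to
// MAX_FILE_CHARS, and close the current batch when the next file would
// push it past BATCH_CHAR_BUDGET.
function buildBatches(files: SourceFile[]): SourceFile[][] {
  const batches: SourceFile[][] = [];
  let current: SourceFile[] = [];
  let used = 0;

  for (const file of files.slice(0, MAX_SOURCE_FILES)) {
    const content = file.content.slice(0, MAX_FILE_CHARS);
    if (current.length > 0 && used + content.length > BATCH_CHAR_BUDGET) {
      batches.push(current);
      current = [];
      used = 0;
    }
    current.push({ path: file.path, content });
    used += content.length;
  }
  if (current.length > 0) batches.push(current);
  return batches;
}
```

With these values a batch holds at least five files, since each truncated file is at most 10,000 characters against a 50,000-character budget.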
LLM Prompt Structure
Each batch is sent with a structured prompt requesting JSON output:
```typescript
{
  patterns: [{
    type: PatternType,
    title: string,        // 5-10 words
    description: string,  // 2-4 sentences
    evidence: string[],   // 1-3 file paths or code excerpts
    confidence: number,   // 0.0 to 1.0
    frequency: number,    // How many files exhibit this
    filePaths: string[],  // Files where detected
    tags: string[],       // 2-4 keywords
  }]
}
```
Custom rules can be injected via PatternRule[]:
```typescript
interface PatternRule {
  name: string;
  description: string;
  condition: string; // Natural-language condition for the LLM
}
```
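
Since the condition is natural language, custom rules presumably end up as text in the prompt. The formatter below is a hypothetical illustration of that idea; the real prompt wording and the way detectPatterns() injects rules are not shown in this document.

```typescript
interface PatternRule {
  name: string;
  description: string;
  condition: string; // Natural-language condition for the LLM
}

// Hypothetical formatter: renders rules as a numbered prompt section.
// Returns an empty string when there are no custom rules.
function renderRules(rules: PatternRule[]): string {
  if (rules.length === 0) return "";
  const lines = rules.map(
    (rule, i) =>
      `${i + 1}. ${rule.name}: ${rule.description} (applies when: ${rule.condition})`,
  );
  return ["Custom rules:", ...lines].join("\n");
}
```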
Deduplication
deduplicatePatterns() merges patterns across batches by keying on type + normalized title. When duplicates are found:
- Confidence: takes the higher value
- Frequency: sums the counts
- Evidence: merges and caps at 5 entries
- FilePaths and tags: set-union
Results are sorted by confidence descending.
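
The merge rules above can be sketched as follows. The `Pattern` shape is reduced to the fields involved in merging, and the title normalization (trim, lowercase, collapse whitespace) is an assumption; deduplicatePatterns() may normalize differently.

```typescript
interface Pattern {
  type: string;
  title: string;
  confidence: number;
  frequency: number;
  evidence: string[];
  filePaths: string[];
  tags: string[];
}

// Assumed normalization: trim, lowercase, collapse runs of whitespace.
function normalizeTitle(title: string): string {
  return title.trim().toLowerCase().replace(/\s+/g, " ");
}

function union(a: string[], b: string[]): string[] {
  return Array.from(new Set([...a, ...b]));
}

// Merge patterns that share type + normalized title, then sort by confidence.
function dedupe(patterns: Pattern[]): Pattern[] {
  const byKey = new Map<string, Pattern>();
  for (const p of patterns) {
    const key = `${p.type}:${normalizeTitle(p.title)}`;
    const seen = byKey.get(key);
    if (!seen) {
      // Copy so the input objects are never mutated.
      byKey.set(key, {
        ...p,
        evidence: p.evidence.slice(0, 5),
        filePaths: union(p.filePaths, []),
        tags: union(p.tags, []),
      });
      continue;
    }
    seen.confidence = Math.max(seen.confidence, p.confidence); // higher value wins
    seen.frequency += p.frequency;                             // counts are summed
    seen.evidence = [...seen.evidence, ...p.evidence].slice(0, 5); // capped at 5
    seen.filePaths = union(seen.filePaths, p.filePaths);
    seen.tags = union(seen.tags, p.tags);
  }
  return Array.from(byKey.values()).sort((a, b) => b.confidence - a.confidence);
}
```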
Response Parsing
parsePatternResponse() handles markdown-fenced JSON (```json), validates that each pattern has a valid type, title, and description, and clamps confidence to [0.0, 1.0].
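
A defensive parser along these lines could look like the sketch below. It is not the actual parsePatternResponse(): the function name, the reduced `ParsedPattern` shape, returning an empty list on malformed JSON, and defaulting a missing confidence to 0 are all assumptions.

```typescript
const PATTERN_TYPES = new Set([
  "architecture", "naming", "error-handling", "testing",
  "dependency", "security", "performance", "style",
]);

interface ParsedPattern {
  type: string;
  title: string;
  description: string;
  confidence: number;
}

// Three backticks, built indirectly so this example can sit inside a fence.
const FENCE = "`".repeat(3);
const FENCED_JSON = new RegExp(FENCE + "(?:json)?\\s*([\\s\\S]*?)" + FENCE);

function parseResponse(raw: string): ParsedPattern[] {
  // Unwrap a markdown fence if the model added one.
  const fenced = raw.match(FENCED_JSON);
  const jsonText = (fenced ? fenced[1] : raw).trim();

  let data: unknown;
  try {
    data = JSON.parse(jsonText);
  } catch {
    return []; // Assumption: malformed output yields no patterns.
  }

  const list = (data as { patterns?: unknown } | null)?.patterns;
  if (!Array.isArray(list)) return [];

  const result: ParsedPattern[] = [];
  for (const item of list) {
    if (typeof item !== "object" || item === null) continue;
    const p = item as Record<string, unknown>;
    if (typeof p.type !== "string" || !PATTERN_TYPES.has(p.type)) continue;
    if (typeof p.title !== "string" || typeof p.description !== "string") continue;
    const c = typeof p.confidence === "number" ? p.confidence : 0;
    result.push({
      type: p.type,
      title: p.title,
      description: p.description,
      confidence: Math.min(1, Math.max(0, c)), // clamp to [0.0, 1.0]
    });
  }
  return result;
}
```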
Phase 3: Documentation Generation
generateDocs() in src/server/services/code-analysis/doc-generator.ts generates each requested document with a type-specific LLM prompt.
Five Document Types
```typescript
type DocType = "readme" | "api" | "architecture" | "onboarding" | "changelog";
```
Default generation produces readme, architecture, and onboarding. Each type has a tailored prompt that includes the repository file tree, primary languages, and detected patterns.
| DocType | LLM Temperature | Max Tokens | Key Content |
|---|---|---|---|
| `readme` | 0.3 | 8192 | Title, features, tech stack, setup, structure |
| `api` | 0.3 | 8192 | Endpoints, request/response formats, auth |
| `architecture` | 0.3 | 8192 | System overview, components, data flow, decisions |
| `onboarding` | 0.3 | 8192 | Prerequisites, setup, workflow, debugging |
| `changelog` | 0.3 | 8192 | Keep a Changelog format, version history |
Quality Scoring
scoreDocQuality() produces a 0.0-1.0 score based on heuristic checks against a 10-point rubric:
Universal checks (7 points max):
- Length: +1 for >500 chars, +1 for >2000, +1 for >5000
- Headings: +1 for >=2, +1 for >=5
- Code blocks: +1 for >=1
- Lists: +1 for >=3 items
Type-specific checks (3 points max per type):
- `readme`: install/setup, usage/example, contributing/license
- `api`: endpoint/route keywords, request/response, auth/token
- `architecture`: component/module/layer, data flow/design, decision/trade-off
- `onboarding`: prerequisite/requirement, step/guide, debug/troubleshoot
- `changelog`: added/changed/fixed, version numbers, unreleased
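
The rubric can be sketched as a point counter divided by ten. This is an approximation, not the actual scoreDocQuality(): the function name and every regular expression below are assumptions about how headings, code blocks, list items, and keywords might be detected.

```typescript
type DocType = "readme" | "api" | "architecture" | "onboarding" | "changelog";

// Three keyword checks per type, one point each. The expressions are guesses.
const TYPE_CHECKS: Record<DocType, RegExp[]> = {
  readme: [/install|setup/i, /usage|example/i, /contributing|license/i],
  api: [/endpoint|route/i, /request|response/i, /auth|token/i],
  architecture: [/component|module|layer/i, /data flow|design/i, /decision|trade-?off/i],
  onboarding: [/prerequisite|requirement/i, /step|guide/i, /debug|troubleshoot/i],
  changelog: [/added|changed|fixed/i, /\d+\.\d+\.\d+/, /unreleased/i],
};

// Matches a code-fence line (three backticks at line start), built indirectly
// so this example can sit inside a fence.
const FENCE_LINE = new RegExp("^" + "`".repeat(3), "gm");

function scoreDoc(content: string, docType: DocType): number {
  let points = 0;

  // Length: one point per threshold exceeded (3 max).
  for (const min of [500, 2000, 5000]) {
    if (content.length > min) points++;
  }
  // Headings: markdown ATX headings (2 max).
  const headings = (content.match(/^#{1,6}[ \t]/gm) ?? []).length;
  if (headings >= 2) points++;
  if (headings >= 5) points++;
  // Code blocks: a pair of fence lines counts as one block (1 max).
  const fenceLines = (content.match(FENCE_LINE) ?? []).length;
  if (Math.floor(fenceLines / 2) >= 1) points++;
  // Lists: bulleted or numbered items (1 max).
  const listItems = (content.match(/^[ \t]*(?:[-*+]|\d+\.)[ \t]/gm) ?? []).length;
  if (listItems >= 3) points++;
  // Type-specific keywords (3 max).
  for (const check of TYPE_CHECKS[docType]) {
    if (check.test(content)) points++;
  }
  return points / 10;
}
```

Because the universal checks cap at 7 and the type-specific checks at 3, the result always falls between 0.0 and 1.0 without extra clamping.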
Data Model
CodeAnalysisRun
| Column | Type | Description |
|---|---|---|
| `id` | UUID | Primary key |
| `tenantId` | UUID | Tenant scope |
| `repositoryId` | UUID | Target repository |
| `status` | String | `pending` / `scanning` / `analyzing` / `generating` / `completed` / `failed` |
| `config` | Json? | `{ skipScan, skipPatterns, skipDocs, provider, model }` |
| `stats` | Json? | `{ filesScanned, patternsFound, docsGenerated, totalTokens, totalCost }` |
CodePattern
| Column | Type | Description |
|---|---|---|
| `id` | UUID | Primary key |
| `repositoryId` | UUID | Repository |
| `runId` | UUID | Analysis run |
| `type` | String | One of 8 pattern types |
| `title` | String | Pattern name |
| `description` | Text | Detailed explanation |
| `evidence` | String[] | File paths or code excerpts |
| `confidence` | Float | 0.0 to 1.0 |
| `frequency` | Int | Files exhibiting this pattern |
| `filePaths` | String[] | Where the pattern was detected |
| `tags` | String[] | Keywords |
GeneratedDoc
| Column | Type | Description |
|---|---|---|
| `id` | UUID | Primary key |
| `tenantId` | UUID | Tenant scope |
| `repositoryId` | UUID | Repository |
| `runId` | UUID? | Analysis run (nullable for standalone) |
| `docType` | String | One of 5 doc types |
| `title` | String | Generated title |
| `content` | Text | Full markdown content |
| `qualityScore` | Float? | 0.0 to 1.0 |
| `status` | String | `draft` / `approved` / `archived` |
| `model` | String? | LLM model used |
| `provider` | String? | LLM provider used |
RepositoryFile
| Column | Type | Description |
|---|---|---|
| `id` | UUID | Primary key |
| `repositoryId` | UUID | Parent repository |
| `path` | String | File path in repo |
| `language` | String | Detected language (e.g., "typescript") |
| `category` | String | `source` / `config` / `test` / `docs` / `other` |
| `lineCount` | Int | Line count (0 if content not fetched) |
| `sizeBytes` | Int | Byte size (0 if content not fetched) |
| `isTest` | Boolean | Matches test patterns |
| `isConfig` | Boolean | Matches config patterns |
