14. Code Analysis Pipeline

The Code Analysis pipeline is a 3-phase system that scans repositories, detects structural and behavioral patterns via LLM, and generates documentation artifacts. It underpins both the Auto-Fix and Refactor pipelines by producing the RepositoryFile index and CodePattern records they depend on.

Pipeline Architecture

flowchart TB
    subgraph Phase1["Phase 1: Scan"]
        A[fetchRepoTree] --> B[extractFileMetadata per file]
        B --> C[Batch insert RepositoryFile records]
    end
    subgraph Phase2["Phase 2: Pattern Detection"]
        D[Fetch source file content] --> E[Build character-budgeted batches]
        E --> F[LLM analysis per batch]
        F --> G[Parse + deduplicate patterns]
        G --> H[Save CodePattern records]
    end
    subgraph Phase3["Phase 3: Doc Generation"]
        I[Build repo context from files + patterns]
        I --> J[LLM generates each doc type]
        J --> K[Score quality]
        K --> L[Save GeneratedDoc records]
    end
    Phase1 --> Phase2 --> Phase3

Orchestrator

runAnalysis() in src/server/services/code-analysis/analysis-runner.ts is the top-level async generator, typed as AsyncGenerator<AnalysisEvent>, that coordinates all three phases. Each phase can be skipped independently via the option flags below:

interface AnalysisRunOptions {
  runId: string;
  repositoryId: string;
  tenantId: string;
  userId: string;
  skipScan?: boolean;      // Skip Phase 1
  skipPatterns?: boolean;   // Skip Phase 2
  skipDocs?: boolean;       // Skip Phase 3
  provider?: string;
  model?: string;
}

Status transitions: pending -> scanning -> analyzing -> generating -> completed. If the generator throws during any phase, the run is marked failed instead.
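
As a rough illustration of how the skip flags could shape the status sequence, here is a minimal sketch. It assumes that a skipped phase never sets its status; `plannedStatuses`, `runAnalysisSketch`, and the `StatusEvent` shape are hypothetical names for this example, since the real AnalysisEvent type is not shown here.

```typescript
type RunStatus =
  | "pending" | "scanning" | "analyzing" | "generating" | "completed" | "failed";

interface SkipFlags {
  skipScan?: boolean;
  skipPatterns?: boolean;
  skipDocs?: boolean;
}

// Assumption: a skipped phase contributes no status of its own.
function plannedStatuses(flags: SkipFlags): RunStatus[] {
  const statuses: RunStatus[] = ["pending"];
  if (!flags.skipScan) statuses.push("scanning");
  if (!flags.skipPatterns) statuses.push("analyzing");
  if (!flags.skipDocs) statuses.push("generating");
  statuses.push("completed");
  return statuses;
}

// Hypothetical event shape used only for this sketch.
interface StatusEvent {
  type: "status";
  status: RunStatus;
}

// Yields one status event per planned step. A real orchestrator would run
// the phase and persist the status before yielding.
async function* runAnalysisSketch(flags: SkipFlags): AsyncGenerator<StatusEvent> {
  for (const status of plannedStatuses(flags)) {
    yield { type: "status", status };
  }
}
```

A caller would consume the generator with `for await (const event of runAnalysisSketch({ skipScan: true })) { ... }`.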

SSE Streaming

The route at src/app/api/v1/events/code-analysis/[id]/route.ts authenticates the request, looks up the CodeAnalysisRun, extracts config, and streams AnalysisEvent objects as SSE. If the generator throws, the run is marked failed (best-effort DB update) and an error event is sent to the client.
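
The streaming behavior described above can be sketched as follows. This is an illustration under assumptions, not the route's actual code: `SketchEvent`, `toSseFrame`, `sseFrames`, and the `markFailed` callback are hypothetical names, and the only fixed part is the SSE wire format (an `event:` line, a `data:` line, and a blank line).

```typescript
// Hypothetical event shape; the real AnalysisEvent is not shown in this document.
interface SketchEvent {
  type: string;
  [key: string]: unknown;
}

// Serializes one event as a Server-Sent Events frame.
// JSON.stringify escapes newlines, so a single data: line is sufficient.
function toSseFrame(event: SketchEvent): string {
  return `event: ${event.type}\ndata: ${JSON.stringify(event)}\n\n`;
}

// Wraps an event source and converts a thrown error into a final error frame.
async function* sseFrames(
  events: AsyncIterable<SketchEvent>,
  markFailed: () => Promise<void>,
): AsyncGenerator<string> {
  try {
    for await (const event of events) {
      yield toSseFrame(event);
    }
  } catch (err) {
    // Best-effort DB update: a failure here must not mask the original error.
    await markFailed().catch(() => undefined);
    const message = err instanceof Error ? err.message : String(err);
    yield toSseFrame({ type: "error", message });
  }
}
```

Swallowing the `markFailed` rejection keeps the client-facing error event intact even when the database is unavailable.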

Phase 1: Repository Scanner

scanRepository() in src/server/services/code-analysis/scanner.ts:

Resolves the tenant's GitHub token via resolveGitHubToken()
Fetches the full file tree with fetchRepoTree()
Deletes any existing RepositoryFile records (full re-sync)
Processes files in batches of 50, optionally fetching content for source files

Content Budget Constants

Constant	Value	Description
`MAX_CONTENT_FETCH`	200	Maximum source files to fetch content for
`MAX_FILE_SIZE`	100,000	Skip content analysis for files larger than 100,000 bytes (~100 KB)
`BATCH_SIZE`	50	Files per DB insert batch
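
To make the insert batching concrete, here is a minimal chunking sketch. `chunk` is a hypothetical helper; the scanner's own batching code is not shown in this document.

```typescript
const BATCH_SIZE = 50;

// Splits a list into consecutive batches of at most `size` items.
function chunk<T>(items: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError("size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Example: 120 file paths produce batches of 50, 50, and 20.
const examplePaths = Array.from({ length: 120 }, (_, i) => `src/file-${i}.ts`);
const exampleBatches = chunk(examplePaths, BATCH_SIZE);
```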

Source File Detection

Only files with recognized source extensions are eligible for content fetching:

const SOURCE_EXTENSIONS = new Set([
  "ts", "tsx", "js", "jsx", "mjs", "cjs",
  "py", "rb", "go", "rs", "java", "kt", "swift",
  "cs", "cpp", "c", "h", "hpp", "php",
  "vue", "svelte", "dart", "ex", "exs",
]);
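
A possible eligibility check that combines the extension set with the content budget constants is sketched below. `extensionOf` and `shouldFetchContent` are hypothetical helpers, and lowercasing the extension is an assumption of this sketch.

```typescript
const MAX_CONTENT_FETCH = 200;
const MAX_FILE_SIZE = 100000;

const SOURCE_EXTENSIONS = new Set([
  "ts", "tsx", "js", "jsx", "mjs", "cjs",
  "py", "rb", "go", "rs", "java", "kt", "swift",
  "cs", "cpp", "c", "h", "hpp", "php",
  "vue", "svelte", "dart", "ex", "exs",
]);

// Returns the lowercased extension, or "" for dotfiles and extensionless names.
function extensionOf(path: string): string {
  const name = path.slice(path.lastIndexOf("/") + 1);
  const dot = name.lastIndexOf(".");
  return dot > 0 ? name.slice(dot + 1).toLowerCase() : "";
}

// A file qualifies while the fetch budget is not exhausted, the file is
// within the size limit, and its extension is a recognized source extension.
function shouldFetchContent(
  path: string,
  sizeBytes: number,
  fetchedSoFar: number,
): boolean {
  return (
    fetchedSoFar < MAX_CONTENT_FETCH &&
    sizeBytes <= MAX_FILE_SIZE &&
    SOURCE_EXTENSIONS.has(extensionOf(path))
  );
}
```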

File Indexer

extractFileMetadata() in src/server/services/code-analysis/file-indexer.ts classifies each file into:

Category	Detection Method
`source`	Default for recognized language extensions
`config`	Matches patterns like `.env`, `config.*`, `Dockerfile`, `.github/`
`test`	Matches `.test.`, `.spec.`, `__tests__/`, `.e2e.`
`docs`	Matches `*.md`, `docs/`, `README`, `CHANGELOG`, `LICENSE`
`other`	Unrecognized language

Language detection maps 50+ file extensions to language names. Special filenames (Dockerfile, Makefile, .gitignore) are handled explicitly.
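
The classification rules in the table can be sketched roughly as follows. The precedence order (test, then config, then docs, then source) is an assumption of this sketch, the language map is a heavily abbreviated stand-in for the real 50+ entries, and `detectLanguage` and `classifyFile` are hypothetical names.

```typescript
type FileCategory = "source" | "config" | "test" | "docs" | "other";

// Abbreviated, illustrative maps; the language names are assumptions.
const LANGUAGE_BY_EXTENSION = new Map<string, string>([
  ["ts", "typescript"], ["tsx", "typescript"], ["py", "python"],
  ["go", "go"], ["rs", "rust"],
]);
const SPECIAL_FILENAMES = new Map<string, string>([
  ["Dockerfile", "dockerfile"], ["Makefile", "makefile"], [".gitignore", "gitignore"],
]);

function baseName(path: string): string {
  return path.slice(path.lastIndexOf("/") + 1);
}

// Special filenames are checked before extensions.
function detectLanguage(path: string): string | null {
  const name = baseName(path);
  const special = SPECIAL_FILENAMES.get(name);
  if (special !== undefined) return special;
  const dot = name.lastIndexOf(".");
  if (dot <= 0) return null;
  return LANGUAGE_BY_EXTENSION.get(name.slice(dot + 1).toLowerCase()) ?? null;
}

function classifyFile(path: string): FileCategory {
  const name = baseName(path);
  const isTest =
    [".test.", ".spec.", ".e2e."].some((marker) => name.includes(marker)) ||
    path.includes("__tests__/");
  if (isTest) return "test";
  const isConfig =
    name.startsWith(".env") || name.startsWith("config.") ||
    name === "Dockerfile" || path.startsWith(".github/");
  if (isConfig) return "config";
  const isDocs =
    name.endsWith(".md") || path.startsWith("docs/") || path.includes("/docs/") ||
    /^(README|CHANGELOG|LICENSE)/.test(name);
  if (isDocs) return "docs";
  return detectLanguage(path) !== null ? "source" : "other";
}
```

Checking test patterns first means a file such as `src/app.test.ts` is not reported as plain source even though its extension is recognized.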

The CategorizedFiles utility provides aggregate summaries by language and category for downstream consumers.

Phase 2: Pattern Detection

detectPatterns() in src/server/services/code-analysis/pattern-detector.ts implements this phase.

Eight Pattern Types

type PatternType =
  | "architecture"    // Module organization, layer separation
  | "naming"          // File, function, variable naming conventions
  | "error-handling"  // Try/catch patterns, error types
  | "testing"         // Test patterns, coverage approaches
  | "dependency"      // Import patterns, external coupling
  | "security"        // Input validation, auth, secrets
  | "performance"     // Performance patterns or anti-patterns
  | "style";          // Code style, formatting conventions

Batching Strategy

Source files are fetched from GitHub and split into character-budgeted batches:

Constant	Value	Description
`BATCH_CHAR_BUDGET`	50,000	~12.5k tokens per batch
`MAX_FILE_CHARS`	10,000	Truncate individual files at this limit
`MAX_SOURCE_FILES`	100	Max files to analyze total
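
One way to build character-budgeted batches from these constants is the greedy packing sketched below. The packing strategy and the `SourceFile` and `buildBatches` names are assumptions; the detector's actual algorithm is not shown in this document.

```typescript
const BATCH_CHAR_BUDGET = 50000;
const MAX_FILE_CHARS = 10000;
const MAX_SOURCE_FILES = 100;

interface SourceFile {
  path: string;
  content: string;
}

// Greedy packing: truncate each file, then start a new batch whenever adding
// the next file would exceed the character budget.
function buildBatches(files: readonly SourceFile[]): SourceFile[][] {
  const batches: SourceFile[][] = [];
  let current: SourceFile[] = [];
  let used = 0;
  for (const file of files.slice(0, MAX_SOURCE_FILES)) {
    const content = file.content.slice(0, MAX_FILE_CHARS);
    if (current.length > 0 && used + content.length > BATCH_CHAR_BUDGET) {
      batches.push(current);
      current = [];
      used = 0;
    }
    current.push({ path: file.path, content });
    used += content.length;
  }
  if (current.length > 0) batches.push(current);
  return batches;
}
```

With files truncated to 10,000 characters and a 50,000-character budget, each batch holds at most five maximum-size files.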

LLM Prompt Structure

Each batch is sent with a structured prompt requesting JSON output:

{
  patterns: [{
    type: PatternType,
    title: string,          // 5-10 words
    description: string,    // 2-4 sentences
    evidence: string[],     // 1-3 file paths or code excerpts
    confidence: number,     // 0.0 to 1.0
    frequency: number,      // How many files exhibit this
    filePaths: string[],    // Files where detected
    tags: string[],         // 2-4 keywords
  }]
}

Custom rules can be injected via PatternRule[]:

interface PatternRule {
  name: string;
  description: string;
  condition: string;  // Natural-language condition for the LLM
}
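
As an illustration of how custom rules might be rendered into the prompt, consider the sketch below. The prompt wording and the `renderCustomRules` helper are hypothetical; only the PatternRule shape comes from the interface above.

```typescript
interface PatternRule {
  name: string;
  description: string;
  condition: string; // Natural-language condition for the LLM
}

// Renders rules as a numbered prompt section; returns "" when there are none.
function renderCustomRules(rules: readonly PatternRule[]): string {
  if (rules.length === 0) return "";
  const lines = rules.map(
    (rule, i) =>
      `${i + 1}. ${rule.name}: ${rule.description} (report when: ${rule.condition})`,
  );
  return ["Additional custom rules to check:", ...lines].join("\n");
}

const exampleRule: PatternRule = {
  name: "no-any",
  description: "Avoid the any type",
  condition: "a file uses any in a type annotation",
};
```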

Deduplication

deduplicatePatterns() merges patterns across batches by keying on type + normalized title. When duplicates are found:

Confidence: takes the higher value
Frequency: sums the counts
Evidence: merges and caps at 5 entries
FilePaths and tags: set-union

Results are sorted by confidence descending.
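
The merge rules above can be sketched as follows. The title normalization (trim, lowercase, collapse whitespace) and the removal of repeated evidence entries are assumptions of this sketch, and `deduplicatePatternsSketch` is a hypothetical stand-in rather than the real function.

```typescript
interface DetectedPattern {
  type: string;
  title: string;
  description: string;
  evidence: string[];
  confidence: number;
  frequency: number;
  filePaths: string[];
  tags: string[];
}

const MAX_EVIDENCE = 5;

// Assumed normalization: trim, lowercase, collapse internal whitespace.
function normalizeTitle(title: string): string {
  return title.trim().toLowerCase().replace(/\s+/g, " ");
}

function union(a: readonly string[], b: readonly string[]): string[] {
  return Array.from(new Set([...a, ...b]));
}

function deduplicatePatternsSketch(
  patterns: readonly DetectedPattern[],
): DetectedPattern[] {
  const byKey = new Map<string, DetectedPattern>();
  for (const pattern of patterns) {
    const key = `${pattern.type}:${normalizeTitle(pattern.title)}`;
    const existing = byKey.get(key);
    if (existing === undefined) {
      // Store a copy so the caller's objects are never mutated.
      byKey.set(key, {
        ...pattern,
        evidence: pattern.evidence.slice(0, MAX_EVIDENCE),
        filePaths: union(pattern.filePaths, []),
        tags: union(pattern.tags, []),
      });
      continue;
    }
    existing.confidence = Math.max(existing.confidence, pattern.confidence);
    existing.frequency += pattern.frequency;
    existing.evidence = union(existing.evidence, pattern.evidence).slice(0, MAX_EVIDENCE);
    existing.filePaths = union(existing.filePaths, pattern.filePaths);
    existing.tags = union(existing.tags, pattern.tags);
  }
  return Array.from(byKey.values()).sort((a, b) => b.confidence - a.confidence);
}
```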

Response Parsing

parsePatternResponse() handles markdown-fenced JSON (```json), validates that each pattern has a valid type, title, and description, and clamps confidence to [0.0, 1.0].
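
A tolerant parser along these lines might look like the sketch below. It is not the real parsePatternResponse(): returning an empty list on malformed JSON and defaulting a missing confidence to 0.5 are assumptions, and the result is reduced to four fields for brevity.

```typescript
const PATTERN_TYPES = [
  "architecture", "naming", "error-handling", "testing",
  "dependency", "security", "performance", "style",
] as const;
type PatternType = (typeof PATTERN_TYPES)[number];

interface ParsedPattern {
  type: PatternType;
  title: string;
  description: string;
  confidence: number;
}

// Three backticks, built programmatically to keep this example fence-safe.
const FENCE = "`".repeat(3);
const FENCED_JSON = new RegExp(FENCE + "(?:json)?\\s*([\\s\\S]*?)" + FENCE);

// Returns the fenced payload if present, otherwise the raw text.
function stripFence(raw: string): string {
  const match = raw.match(FENCED_JSON);
  return (match?.[1] ?? raw).trim();
}

function parsePatternResponseSketch(raw: string): ParsedPattern[] {
  let data: unknown;
  try {
    data = JSON.parse(stripFence(raw));
  } catch {
    return []; // Assumption: malformed output yields no patterns.
  }
  const list = (data as { patterns?: unknown } | null)?.patterns;
  if (!Array.isArray(list)) return [];
  const result: ParsedPattern[] = [];
  for (const item of list as unknown[]) {
    if (typeof item !== "object" || item === null) continue;
    const candidate = item as Record<string, unknown>;
    const rawType = candidate["type"];
    const title = candidate["title"];
    const description = candidate["description"];
    const confidence = candidate["confidence"];
    const type = PATTERN_TYPES.find((t) => t === rawType);
    if (type === undefined) continue;
    if (typeof title !== "string" || typeof description !== "string") continue;
    // Clamp to [0.0, 1.0]; the 0.5 default is an assumption of this sketch.
    const score =
      typeof confidence === "number" && Number.isFinite(confidence)
        ? Math.min(1, Math.max(0, confidence))
        : 0.5;
    result.push({ type, title, description, confidence: score });
  }
  return result;
}
```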

Phase 3: Documentation Generation

generateDocs() in src/server/services/code-analysis/doc-generator.ts implements this phase.

Five Document Types

type DocType = "readme" | "api" | "architecture" | "onboarding" | "changelog";

Default generation produces readme, architecture, and onboarding. Each type has a tailored prompt that includes the repository file tree, primary languages, and detected patterns.

DocType	LLM Temperature	Max Tokens	Key Content
`readme`	0.3	8192	Title, features, tech stack, setup, structure
`api`	0.3	8192	Endpoints, request/response formats, auth
`architecture`	0.3	8192	System overview, components, data flow, decisions
`onboarding`	0.3	8192	Prerequisites, setup, workflow, debugging
`changelog`	0.3	8192	Keep a Changelog format, version history

Quality Scoring

scoreDocQuality() produces a 0.0-1.0 score based on heuristic checks against a 10-point rubric:

Universal checks (7 points max):

Length: +1 for >500 chars, +1 for >2000, +1 for >5000
Headings: +1 for >=2, +1 for >=5
Code blocks: +1 for >=1
Lists: +1 for >=3 items

Type-specific checks (3 points max per type):

readme: install/setup, usage/example, contributing/license
api: endpoint/route keywords, request/response, auth/token
architecture: component/module/layer, data flow/design, decision/trade-off
onboarding: prerequisite/requirement, step/guide, debug/troubleshoot
changelog: added/changed/fixed, version numbers, unreleased
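
The rubric above can be approximated with the sketch below. The regular expressions are illustrative guesses at the keyword checks, not the patterns used by scoreDocQuality(), and `scoreDocQualitySketch` is a hypothetical name.

```typescript
type DocType = "readme" | "api" | "architecture" | "onboarding" | "changelog";

// Three keyword checks per doc type, one point each (illustrative patterns).
const TYPE_CHECKS: Record<DocType, RegExp[]> = {
  readme: [/install|setup/i, /usage|example/i, /contributing|license/i],
  api: [/endpoint|route/i, /request|response/i, /auth|token/i],
  architecture: [/component|module|layer/i, /data flow|design/i, /decision|trade-?off/i],
  onboarding: [/prerequisite|requirement/i, /step|guide/i, /debug|troubleshoot/i],
  changelog: [/added|changed|fixed/i, /\d+\.\d+\.\d+/, /unreleased/i],
};

const FENCE_MARKER = "`".repeat(3); // built programmatically to stay fence-safe

function scoreDocQualitySketch(content: string, docType: DocType): number {
  let points = 0;
  // Length: one point per threshold exceeded.
  for (const threshold of [500, 2000, 5000]) {
    if (content.length > threshold) points += 1;
  }
  // Headings: markdown ATX headings at the start of a line.
  const headings = (content.match(/^#{1,6}\s/gm) ?? []).length;
  if (headings >= 2) points += 1;
  if (headings >= 5) points += 1;
  // Code blocks: two fence markers make one block.
  const fenceMarkers = content.split(FENCE_MARKER).length - 1;
  if (Math.floor(fenceMarkers / 2) >= 1) points += 1;
  // Lists: bulleted or numbered items.
  const listItems = (content.match(/^[ \t]*(?:[-*+]|\d+\.)\s/gm) ?? []).length;
  if (listItems >= 3) points += 1;
  // Type-specific checks: up to three points.
  for (const check of TYPE_CHECKS[docType]) {
    if (check.test(content)) points += 1;
  }
  return points / 10;
}
```

Dividing by 10 maps the 7 universal points plus 3 type-specific points onto the 0.0-1.0 range.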

Data Model

CodeAnalysisRun

Column	Type	Description
`id`	`UUID`	Primary key
`tenantId`	`UUID`	Tenant scope
`repositoryId`	`UUID`	Target repository
`status`	`String`	`pending` / `scanning` / `analyzing` / `generating` / `completed` / `failed`
`config`	`Json?`	`{ skipScan, skipPatterns, skipDocs, provider, model }`
`stats`	`Json?`	`{ filesScanned, patternsFound, docsGenerated, totalTokens, totalCost }`

CodePattern

Column	Type	Description
`id`	`UUID`	Primary key
`repositoryId`	`UUID`	Repository
`runId`	`UUID`	Analysis run
`type`	`String`	One of 8 pattern types
`title`	`String`	Pattern name
`description`	`Text`	Detailed explanation
`evidence`	`String[]`	File paths or code excerpts
`confidence`	`Float`	0.0 to 1.0
`frequency`	`Int`	Files exhibiting this pattern
`filePaths`	`String[]`	Where the pattern was detected
`tags`	`String[]`	Keywords

GeneratedDoc

Column	Type	Description
`id`	`UUID`	Primary key
`tenantId`	`UUID`	Tenant scope
`repositoryId`	`UUID`	Repository
`runId`	`UUID?`	Analysis run (nullable for standalone)
`docType`	`String`	One of 5 doc types
`title`	`String`	Generated title
`content`	`Text`	Full markdown content
`qualityScore`	`Float?`	0.0 to 1.0
`status`	`String`	`draft` / `approved` / `archived`
`model`	`String?`	LLM model used
`provider`	`String?`	LLM provider used

RepositoryFile

Column	Type	Description
`id`	`UUID`	Primary key
`repositoryId`	`UUID`	Parent repository
`path`	`String`	File path in repo
`language`	`String`	Detected language (e.g., `"typescript"`)
`category`	`String`	`source` / `config` / `test` / `docs` / `other`
`lineCount`	`Int`	Line count (0 if content not fetched)
`sizeBytes`	`Int`	Byte size (0 if content not fetched)
`isTest`	`Boolean`	Matches test patterns
`isConfig`	`Boolean`	Matches config patterns