14. Code Analysis Pipeline

The Code Analysis pipeline is a three-phase system that scans a repository, detects structural and behavioral patterns with an LLM, and generates documentation artifacts. It underpins both the Auto-Fix and Refactor pipelines by producing the RepositoryFile index and CodePattern records they depend on.

Pipeline Architecture

flowchart TB
  subgraph Phase1["Phase 1: Scan"]
    A[fetchRepoTree] --> B[extractFileMetadata per file]
    B --> C[Batch insert RepositoryFile records]
  end
  subgraph Phase2["Phase 2: Pattern Detection"]
    D[Fetch source file content] --> E[Build character-budgeted batches]
    E --> F[LLM analysis per batch]
    F --> G[Parse + deduplicate patterns]
    G --> H[Save CodePattern records]
  end
  subgraph Phase3["Phase 3: Doc Generation"]
    I[Build repo context from files + patterns]
    I --> J[LLM generates each doc type]
    J --> K[Score quality]
    K --> L[Save GeneratedDoc records]
  end
  Phase1 --> Phase2 --> Phase3

Orchestrator

runAnalysis() in src/server/services/code-analysis/analysis-runner.ts is the top-level AsyncGenerator<AnalysisEvent> that coordinates all three phases. Each phase can be independently skipped via config flags:

interface AnalysisRunOptions {
  runId: string;
  repositoryId: string;
  tenantId: string;
  userId: string;
  skipScan?: boolean;      // Skip Phase 1
  skipPatterns?: boolean;   // Skip Phase 2
  skipDocs?: boolean;       // Skip Phase 3
  provider?: string;
  model?: string;
}

Status transitions: pending -> scanning -> analyzing -> generating -> completed. A run whose generator throws is marked failed instead of completed.
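To make the interaction between the skip flags and the status transitions concrete, here is a minimal sketch. plannedStatuses is a hypothetical helper, not part of the codebase, and it assumes that a skipped phase never enters its corresponding status; the real runner may behave differently.

```typescript
// Statuses a CodeAnalysisRun can hold, per the Data Model section.
type RunStatus =
  | "pending" | "scanning" | "analyzing" | "generating" | "completed" | "failed";

// Subset of AnalysisRunOptions that controls which phases run.
interface SkipFlags {
  skipScan?: boolean;
  skipPatterns?: boolean;
  skipDocs?: boolean;
}

// Hypothetical helper: the happy-path status sequence for a given set of flags.
// Assumption: a skipped phase does not pass through its status at all.
function plannedStatuses(flags: SkipFlags): RunStatus[] {
  const statuses: RunStatus[] = ["pending"];
  if (!flags.skipScan) statuses.push("scanning");
  if (!flags.skipPatterns) statuses.push("analyzing");
  if (!flags.skipDocs) statuses.push("generating");
  statuses.push("completed");
  return statuses;
}
```

Under this assumption, a run with skipPatterns set would go from scanning straight to generating.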

SSE Streaming

The route at src/app/api/v1/events/code-analysis/[id]/route.ts authenticates the request, looks up the CodeAnalysisRun, extracts config, and streams AnalysisEvent objects as SSE. If the generator throws, the run is marked failed (best-effort DB update) and an error event is sent to the client.
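As an illustration of the wire format, the sketch below serializes an event into a Server-Sent Events frame. The AnalysisEventSketch shape and both function names are assumptions made for this example; the actual AnalysisEvent type and route implementation are not shown in this section.

```typescript
// Assumed event shape, for illustration only.
interface AnalysisEventSketch {
  type: string;
  data: Record<string, unknown>;
}

// An SSE frame is a set of "field: value" lines terminated by a blank line.
// JSON.stringify emits no raw newlines, so the payload fits on one data line.
function toSseFrame(event: AnalysisEventSketch): string {
  return `event: ${event.type}\ndata: ${JSON.stringify(event.data)}\n\n`;
}

// Frame sent to the client when the generator throws (hypothetical payload).
function errorFrame(err: unknown): string {
  const message = err instanceof Error ? err.message : String(err);
  return toSseFrame({ type: "error", data: { message } });
}
```

A client using EventSource could then subscribe to named events such as error with addEventListener.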

Phase 1: Repository Scanner

scanRepository() in src/server/services/code-analysis/scanner.ts:

  1. Resolves the tenant's GitHub token via resolveGitHubToken()
  2. Fetches the full file tree with fetchRepoTree()
  3. Deletes any existing RepositoryFile records (full re-sync)
  4. Processes files in batches of 50, optionally fetching content for source files
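The batching in step 4 can be sketched as a generic chunking helper. The name chunk is hypothetical; only the batch size of 50 comes from this document.

```typescript
const BATCH_SIZE = 50;

// Splits a list into consecutive batches of at most `size` items.
function chunk<T>(items: readonly T[], size: number = BATCH_SIZE): T[][] {
  if (size <= 0) throw new RangeError("size must be positive");
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}
```

Each resulting batch would then map to one bulk insert of RepositoryFile records, so a 120-file tree produces three inserts.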

Content Budget Constants

| Constant | Value | Description |
|---|---|---|
| MAX_CONTENT_FETCH | 200 | Maximum number of source files to fetch content for |
| MAX_FILE_SIZE | 100,000 bytes | Files larger than 100 KB are skipped for content analysis |
| BATCH_SIZE | 50 | Files per DB insert batch |

Source File Detection

Only files with recognized source extensions are eligible for content fetching:

const SOURCE_EXTENSIONS = new Set([
  "ts", "tsx", "js", "jsx", "mjs", "cjs",
  "py", "rb", "go", "rs", "java", "kt", "swift",
  "cs", "cpp", "c", "h", "hpp", "php",
  "vue", "svelte", "dart", "ex", "exs",
]);
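A minimal sketch of how the extension set and the content budget constants might combine. TreeEntry, isSourceFile, and selectForContentFetch are hypothetical names, and the case-insensitive extension match is an assumption.

```typescript
const MAX_CONTENT_FETCH = 200;
const MAX_FILE_SIZE = 100000; // bytes

const SOURCE_EXTENSIONS = new Set([
  "ts", "tsx", "js", "jsx", "mjs", "cjs",
  "py", "rb", "go", "rs", "java", "kt", "swift",
  "cs", "cpp", "c", "h", "hpp", "php",
  "vue", "svelte", "dart", "ex", "exs",
]);

// Hypothetical shape of one entry returned by the tree fetch.
interface TreeEntry {
  path: string;
  size: number;
}

function isSourceFile(path: string): boolean {
  const fileName = path.slice(path.lastIndexOf("/") + 1);
  const dot = fileName.lastIndexOf(".");
  if (dot <= 0) return false; // no extension, or a dotfile such as ".gitignore"
  // Assumption: extensions are matched case-insensitively.
  return SOURCE_EXTENSIONS.has(fileName.slice(dot + 1).toLowerCase());
}

// Keeps source files within the size limit, capped at the fetch budget.
function selectForContentFetch(entries: readonly TreeEntry[]): TreeEntry[] {
  return entries
    .filter((entry) => isSourceFile(entry.path) && entry.size <= MAX_FILE_SIZE)
    .slice(0, MAX_CONTENT_FETCH);
}
```

Files that fail either check would still be indexed by path; they are only excluded from content fetching.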

File Indexer

extractFileMetadata() in src/server/services/code-analysis/file-indexer.ts classifies each file into:

| Category | Detection Method |
|---|---|
| source | Default for recognized language extensions |
| config | Matches patterns like .env*, *config.*, Dockerfile, .github/ |
| test | Matches *.test.*, *.spec.*, __tests__/, *.e2e.* |
| docs | Matches *.md, docs/, README, CHANGELOG, LICENSE |
| other | Unrecognized language |

Language detection maps 50+ file extensions to language names. Special filenames (Dockerfile, Makefile, .gitignore) are handled explicitly.
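The category rules above could be approximated as follows. This is a sketch only: the regular expressions are a guess at the listed glob patterns, the precedence (test, then config, then docs, then source) is an assumption, and the language map is a small illustrative subset of the 50+ extensions.

```typescript
type FileCategory = "source" | "config" | "test" | "docs" | "other";

// Illustrative subset of the extension-to-language map.
const LANGUAGE_BY_EXTENSION = new Map<string, string>([
  ["ts", "typescript"], ["tsx", "typescript"], ["js", "javascript"],
  ["py", "python"], ["go", "go"], ["rs", "rust"],
]);

// *.test.*, *.spec.*, *.e2e.*, or anything under __tests__/
const TEST_PATTERN = /(\.(test|spec|e2e)\.[^/]+$)|(^|\/)__tests__\//;
// .env*, *config.*, Dockerfile, or anything under .github/
const CONFIG_PATTERN = /(^|\/)(\.env[^/]*|[^/]*config\.[^/]+|Dockerfile)$|(^|\/)\.github\//;
// *.md, anything under docs/, or README / CHANGELOG / LICENSE files
const DOCS_PATTERN = /\.md$|(^|\/)docs\/|(^|\/)(README|CHANGELOG|LICENSE)[^/]*$/;

function categorize(path: string): FileCategory {
  // Assumed precedence: the first matching rule wins.
  if (TEST_PATTERN.test(path)) return "test";
  if (CONFIG_PATTERN.test(path)) return "config";
  if (DOCS_PATTERN.test(path)) return "docs";
  const extension = path.slice(path.lastIndexOf(".") + 1).toLowerCase();
  return LANGUAGE_BY_EXTENSION.has(extension) ? "source" : "other";
}
```

Checking test patterns first means a file such as src/config.test.ts is treated as a test rather than as config, which is one plausible reading of the table.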

The CategorizedFiles utility provides aggregate summaries by language and category for downstream consumers.

Phase 2: Pattern Detection

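detectPatterns() in src/server/services/code-analysis/pattern-detector.ts fetches source file content, groups it into character-budgeted batches, sends each batch to the LLM, and then parses, deduplicates, and saves the resulting patterns.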

Eight Pattern Types

type PatternType =
  | "architecture"    // Module organization, layer separation
  | "naming"          // File, function, variable naming conventions
  | "error-handling"  // Try/catch patterns, error types
  | "testing"         // Test patterns, coverage approaches
  | "dependency"      // Import patterns, external coupling
  | "security"        // Input validation, auth, secrets
  | "performance"     // Performance patterns or anti-patterns
  | "style";          // Code style, formatting conventions

Batching Strategy

Source files are fetched from GitHub and split into character-budgeted batches:

| Constant | Value | Description |
|---|---|---|
| BATCH_CHAR_BUDGET | 50,000 | Characters per batch (~12.5k tokens) |
| MAX_FILE_CHARS | 10,000 | Individual files are truncated at this many characters |
| MAX_SOURCE_FILES | 100 | Maximum number of files analyzed in total |
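One way to build character-budgeted batches from these constants is greedy packing, sketched below. The SourceFile shape and buildBatches are hypothetical, and the greedy strategy is an assumption; the actual implementation may pack files differently.

```typescript
const BATCH_CHAR_BUDGET = 50000;
const MAX_FILE_CHARS = 10000;
const MAX_SOURCE_FILES = 100;

// Hypothetical shape of a fetched source file.
interface SourceFile {
  path: string;
  content: string;
}

// Greedy packing: truncate each file, then start a new batch whenever
// adding the next file would exceed the character budget.
function buildBatches(files: readonly SourceFile[]): SourceFile[][] {
  const batches: SourceFile[][] = [];
  let current: SourceFile[] = [];
  let usedChars = 0;

  for (const file of files.slice(0, MAX_SOURCE_FILES)) {
    const content = file.content.slice(0, MAX_FILE_CHARS);
    if (current.length > 0 && usedChars + content.length > BATCH_CHAR_BUDGET) {
      batches.push(current);
      current = [];
      usedChars = 0;
    }
    current.push({ path: file.path, content });
    usedChars += content.length;
  }
  if (current.length > 0) batches.push(current);
  return batches;
}
```

Because MAX_FILE_CHARS is a fifth of BATCH_CHAR_BUDGET, every batch holds at least five files unless the input runs out, and 100 maximally sized files yield at most 20 LLM calls.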

LLM Prompt Structure

Each batch is sent with a structured prompt requesting JSON output:

{
  patterns: [{
    type: PatternType,
    title: string,          // 5-10 words
    description: string,    // 2-4 sentences
    evidence: string[],     // 1-3 file paths or code excerpts
    confidence: number,     // 0.0 to 1.0
    frequency: number,      // How many files exhibit this
    filePaths: string[],    // Files where detected
    tags: string[],         // 2-4 keywords
  }]
}

Custom rules can be injected via PatternRule[]:

interface PatternRule {
  name: string;
  description: string;
  condition: string;  // Natural-language condition for the LLM
}

Deduplication

deduplicatePatterns() merges patterns across batches by keying on type + normalized title. When duplicates are found:

  • Confidence: takes the higher value
  • Frequency: sums the counts
  • Evidence: merges and caps at 5 entries
  • FilePaths and tags: set-union

Results are sorted by confidence descending.
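The merge rules above can be sketched as follows. The function is suffixed with Sketch to make clear it is not the project's implementation; the title normalization (trim, lowercase, collapse whitespace) and the choice to keep the first-seen title and description are assumptions.

```typescript
interface DetectedPattern {
  type: string;
  title: string;
  description: string;
  evidence: string[];
  confidence: number;
  frequency: number;
  filePaths: string[];
  tags: string[];
}

const MAX_EVIDENCE = 5;

// Assumed normalization: trim, lowercase, collapse runs of whitespace.
function patternKey(pattern: DetectedPattern): string {
  return `${pattern.type}:${pattern.title.trim().toLowerCase().replace(/\s+/g, " ")}`;
}

function union(a: readonly string[], b: readonly string[]): string[] {
  return Array.from(new Set([...a, ...b]));
}

function deduplicatePatternsSketch(patterns: readonly DetectedPattern[]): DetectedPattern[] {
  const merged = new Map<string, DetectedPattern>();
  for (const pattern of patterns) {
    const key = patternKey(pattern);
    const existing = merged.get(key);
    if (!existing) {
      // Copy so the caller's objects are never mutated.
      merged.set(key, {
        ...pattern,
        evidence: pattern.evidence.slice(0, MAX_EVIDENCE),
        filePaths: union(pattern.filePaths, []),
        tags: union(pattern.tags, []),
      });
      continue;
    }
    existing.confidence = Math.max(existing.confidence, pattern.confidence);
    existing.frequency += pattern.frequency;
    existing.evidence = [...existing.evidence, ...pattern.evidence].slice(0, MAX_EVIDENCE);
    existing.filePaths = union(existing.filePaths, pattern.filePaths);
    existing.tags = union(existing.tags, pattern.tags);
  }
  return Array.from(merged.values()).sort((a, b) => b.confidence - a.confidence);
}
```

Note that summing frequency can overcount when the same file appears in more than one batch; filePaths, being a set union, does not have that problem.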

Response Parsing

parsePatternResponse() handles markdown-fenced JSON (```json), validates that each pattern has a valid type, title, and description, and clamps confidence to [0.0, 1.0].
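A minimal sketch of this kind of defensive parsing, under stated assumptions: malformed JSON yields an empty list rather than an error, a missing confidence defaults to 0, and only the four validated fields are carried over. The Sketch suffix marks the function as illustrative, not the project's parsePatternResponse().

```typescript
const PATTERN_TYPES = [
  "architecture", "naming", "error-handling", "testing",
  "dependency", "security", "performance", "style",
];

interface ParsedPattern {
  type: string;
  title: string;
  description: string;
  confidence: number;
}

// Built at runtime so this example never contains a literal code fence.
const FENCE = "`".repeat(3);
const FENCED_JSON = new RegExp(FENCE + "(?:json)?\\s*([\\s\\S]*?)" + FENCE);

function parsePatternResponseSketch(raw: string): ParsedPattern[] {
  // Prefer the contents of a fenced block if the model wrapped its JSON in one.
  const jsonText = (raw.match(FENCED_JSON)?.[1] ?? raw).trim();

  let parsed: unknown;
  try {
    parsed = JSON.parse(jsonText);
  } catch {
    return []; // Assumption: unparseable replies are dropped, not thrown.
  }

  const candidates = (parsed as { patterns?: unknown } | null)?.patterns;
  if (!Array.isArray(candidates)) return [];

  const result: ParsedPattern[] = [];
  for (const candidate of candidates) {
    if (typeof candidate !== "object" || candidate === null) continue;
    const { type, title, description, confidence } = candidate as Record<string, unknown>;
    if (typeof type !== "string" || PATTERN_TYPES.indexOf(type) === -1) continue;
    if (typeof title !== "string" || title.trim() === "") continue;
    if (typeof description !== "string" || description.trim() === "") continue;
    const value = typeof confidence === "number" && isFinite(confidence) ? confidence : 0;
    result.push({ type, title, description, confidence: Math.min(1, Math.max(0, value)) });
  }
  return result;
}
```

Dropping invalid entries one at a time, instead of rejecting the whole reply, keeps a single malformed pattern from discarding an otherwise usable batch.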

Phase 3: Documentation Generation

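generateDocs() in src/server/services/code-analysis/doc-generator.ts builds a repository context from the indexed files and detected patterns, asks the LLM to generate each requested document type, scores the result, and saves it as a GeneratedDoc record.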

Five Document Types

type DocType = "readme" | "api" | "architecture" | "onboarding" | "changelog";

Default generation produces readme, architecture, and onboarding. Each type has a tailored prompt that includes the repository file tree, primary languages, and detected patterns.

| DocType | LLM Temperature | Max Tokens | Key Content |
|---|---|---|---|
| readme | 0.3 | 8192 | Title, features, tech stack, setup, structure |
| api | 0.3 | 8192 | Endpoints, request/response formats, auth |
| architecture | 0.3 | 8192 | System overview, components, data flow, decisions |
| onboarding | 0.3 | 8192 | Prerequisites, setup, workflow, debugging |
| changelog | 0.3 | 8192 | Keep a Changelog format, version history |

Quality Scoring

scoreDocQuality() produces a 0.0-1.0 score based on heuristic checks against a 10-point rubric:

Universal checks (7 points max):

  • Length: +1 for >500 chars, +1 for >2000, +1 for >5000
  • Headings: +1 for >=2, +1 for >=5
  • Code blocks: +1 for >=1
  • Lists: +1 for >=3 items

Type-specific checks (3 points max per type):

  • readme: install/setup, usage/example, contributing/license
  • api: endpoint/route keywords, request/response, auth/token
  • architecture: component/module/layer, data flow/design, decision/trade-off
  • onboarding: prerequisite/requirement, step/guide, debug/troubleshoot
  • changelog: added/changed/fixed, version numbers, unreleased
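The rubric can be sketched as a point counter divided by 10. The regular expressions below are guesses at how each keyword check might be written, the Markdown detection (ATX headings, fenced blocks, bullet or numbered list items) is an assumption, and scoreDocQualitySketch is an illustrative name rather than the real function.

```typescript
type DocType = "readme" | "api" | "architecture" | "onboarding" | "changelog";

// One regex per type-specific check; the keywords come from the list above.
const TYPE_CHECKS: Record<DocType, RegExp[]> = {
  readme: [/install|setup/i, /usage|example/i, /contributing|license/i],
  api: [/endpoint|route/i, /request|response/i, /auth|token/i],
  architecture: [/component|module|layer/i, /data flow|design/i, /decision|trade-?off/i],
  onboarding: [/prerequisite|requirement/i, /step|guide/i, /debug|troubleshoot/i],
  changelog: [/added|changed|fixed/i, /\d+\.\d+\.\d+/, /unreleased/i],
};

function countMatches(content: string, pattern: RegExp): number {
  return (content.match(pattern) ?? []).length;
}

function scoreDocQualitySketch(content: string, docType: DocType): number {
  let points = 0;

  // Universal checks (7 points max).
  if (content.length > 500) points++;
  if (content.length > 2000) points++;
  if (content.length > 5000) points++;

  const headings = countMatches(content, /^#{1,6}\s/gm);
  if (headings >= 2) points++;
  if (headings >= 5) points++;

  // Two fence lines (open and close) make one code block.
  const fenceLines = countMatches(content, /^[ \t]*`{3}/gm);
  if (Math.floor(fenceLines / 2) >= 1) points++;

  const listItems = countMatches(content, /^[ \t]*(?:[-*+]|\d+\.)\s/gm);
  if (listItems >= 3) points++;

  // Type-specific checks (3 points max).
  for (const check of TYPE_CHECKS[docType]) {
    if (check.test(content)) points++;
  }

  return points / 10;
}
```

Since every check is a presence test, the score measures structural completeness rather than accuracy; a document can score 1.0 and still describe the repository incorrectly.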

Data Model

CodeAnalysisRun

| Column | Type | Description |
|---|---|---|
| id | UUID | Primary key |
| tenantId | UUID | Tenant scope |
| repositoryId | UUID | Target repository |
| status | String | pending / scanning / analyzing / generating / completed / failed |
| config | Json? | { skipScan, skipPatterns, skipDocs, provider, model } |
| stats | Json? | { filesScanned, patternsFound, docsGenerated, totalTokens, totalCost } |

CodePattern

| Column | Type | Description |
|---|---|---|
| id | UUID | Primary key |
| repositoryId | UUID | Repository |
| runId | UUID | Analysis run |
| type | String | One of 8 pattern types |
| title | String | Pattern name |
| description | Text | Detailed explanation |
| evidence | String[] | File paths or code excerpts |
| confidence | Float | 0.0 to 1.0 |
| frequency | Int | Number of files exhibiting this pattern |
| filePaths | String[] | Where the pattern was detected |
| tags | String[] | Keywords |

GeneratedDoc

| Column | Type | Description |
|---|---|---|
| id | UUID | Primary key |
| tenantId | UUID | Tenant scope |
| repositoryId | UUID | Repository |
| runId | UUID? | Analysis run (nullable for standalone docs) |
| docType | String | One of 5 doc types |
| title | String | Generated title |
| content | Text | Full markdown content |
| qualityScore | Float? | 0.0 to 1.0 |
| status | String | draft / approved / archived |
| model | String? | LLM model used |
| provider | String? | LLM provider used |

RepositoryFile

| Column | Type | Description |
|---|---|---|
| id | UUID | Primary key |
| repositoryId | UUID | Parent repository |
| path | String | File path in repo |
| language | String | Detected language (e.g., "typescript") |
| category | String | source / config / test / docs / other |
| lineCount | Int | Line count (0 if content not fetched) |
| sizeBytes | Int | Byte size (0 if content not fetched) |
| isTest | Boolean | Matches test patterns |
| isConfig | Boolean | Matches config patterns |