AI-Powered Repository Documentation Generation • Multi-Language Support • Architecture-Aware Analysis
Generate holistic, structured documentation for large-scale codebases • Cross-module interactions • Visual artifacts and diagrams
Quick Start • CLI Commands • Output Structure • Paper
# Install from source
pip install git+https://github.com/FSoft-AI4Code/CodeWiki.git
# Verify installation
codewiki --version
CodeWiki supports multiple LLM providers: OpenAI-compatible, Anthropic, AWS Bedrock, and Azure OpenAI.
# Anthropic
codewiki config set \
--api-key YOUR_API_KEY \
--base-url https://api.anthropic.com \
--main-model claude-sonnet-4 \
--cluster-model claude-sonnet-4 \
--fallback-model glm-4p5
# Azure OpenAI
codewiki config set \
--provider azure-openai \
--api-key YOUR_AZURE_KEY \
--base-url https://YOUR_RESOURCE.openai.azure.com \
--azure-deployment YOUR_DEPLOYMENT \
--main-model gpt-4o \
--cluster-model gpt-4o
# AWS Bedrock
codewiki config set \
--provider bedrock \
--aws-region us-east-1 \
--main-model anthropic.claude-sonnet-4-v2:0 \
--cluster-model anthropic.claude-sonnet-4-v2:0
# Navigate to your project
cd /path/to/your/project
# Generate documentation
codewiki generate
# Generate with HTML viewer for GitHub Pages
codewiki generate --github-pages --create-branch
That's it! Your documentation will be generated in ./docs/ with comprehensive repository-level analysis.
CodeWiki is an open-source framework for automated repository-level documentation across eight programming languages. It generates holistic, architecture-aware documentation that captures not only individual functions but also their cross-file, cross-module, and system-level interactions.
| Innovation | Description | Impact |
|---|---|---|
| Hierarchical Decomposition | Dynamic programming-inspired strategy that preserves architectural context | Handles codebases of arbitrary size (86K-1.4M LOC tested) |
| Recursive Agentic System | Adaptive multi-agent processing with dynamic delegation capabilities | Maintains quality while scaling to repository-level scope |
| Multi-Modal Synthesis | Generates textual documentation, architecture diagrams, data flows, and sequence diagrams | Comprehensive understanding from multiple perspectives |
🐍 Python • ☕ Java • 🟨 JavaScript • 🔷 TypeScript • ⚙️ C • 🔧 C++ • 🪟 C# • 🎯 Kotlin
# Set up your API configuration
codewiki config set \
--api-key <your-api-key> \
--base-url <provider-url> \
--main-model <model-name> \
--cluster-model <model-name> \
--fallback-model <model-name>
# Configure max token settings
codewiki config set --max-tokens 32768 --max-token-per-module 36369 --max-token-per-leaf-module 16000
# Configure max depth for hierarchical decomposition
codewiki config set --max-depth 3
# Show current configuration
codewiki config show
# Validate your configuration
codewiki config validate
# Basic generation
codewiki generate
# Custom output directory
codewiki generate --output ./documentation
# Create git branch for documentation
codewiki generate --create-branch
# Generate HTML viewer for GitHub Pages
codewiki generate --github-pages
# Enable verbose logging
codewiki generate --verbose
# Full-featured generation
codewiki generate --create-branch --github-pages --verbose
# Incremental update (only regenerate changed modules since last run)
codewiki generate --update
CodeWiki supports customization for language-specific projects and documentation styles:
# C# project: only analyze .cs files, exclude test directories
codewiki generate --include "*.cs" --exclude "Tests,Specs,*.test.cs"
# Focus on specific modules with architecture-style docs
codewiki generate --focus "src/core,src/api" --doc-type architecture
# Add custom instructions for the AI agent
codewiki generate --instructions "Focus on public APIs and include usage examples"
- --include: When specified, ONLY these patterns are used (replaces defaults completely)
  - Example: --include "*.cs" will analyze ONLY .cs files
  - If omitted, all supported file types are analyzed
  - Supports glob patterns: *.py, src/**/*.ts, *.{js,jsx}
- --exclude: When specified, patterns are MERGED with default ignore patterns
  - Example: --exclude "Tests,Specs" will exclude these directories AND still exclude .git, __pycache__, node_modules, etc.
  - Default patterns include: .git, node_modules, __pycache__, *.pyc, bin/, dist/, and many more
  - Supports multiple formats:
    - Exact names: Tests, .env, config.local
    - Glob patterns: *.test.js, *_test.py, *.min.*
    - Directory patterns: build/, dist/, coverage/
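The replace-versus-merge behavior described above can be illustrated with a small Python sketch. This is not CodeWiki's implementation: the default lists are shortened placeholders, the function names are hypothetical, and matching is done on file and directory names only (so path globs such as src/**/*.ts and brace patterns such as *.{js,jsx} are not handled here).

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Assumed, shortened defaults for illustration only; the real lists are longer.
DEFAULT_INCLUDE = ["*.py", "*.java", "*.js", "*.ts", "*.c", "*.cpp", "*.cs", "*.kt"]
DEFAULT_EXCLUDE = [".git", "node_modules", "__pycache__", "*.pyc", "bin/", "dist/"]


def effective_patterns(include=None, exclude=None):
    """--include replaces the defaults; --exclude is merged with them."""
    inc = list(include) if include else list(DEFAULT_INCLUDE)
    exc = DEFAULT_EXCLUDE + list(exclude or [])
    return inc, exc


def is_excluded(path, patterns):
    """True if any component of the path matches an exclude pattern."""
    parts = PurePosixPath(path).parts
    for pattern in patterns:
        name = pattern.rstrip("/")  # "build/" is compared as the name "build"
        if any(fnmatchcase(part, name) for part in parts):
            return True
    return False


def select_files(paths, include=None, exclude=None):
    """Keep paths whose file name matches an include pattern and is not excluded."""
    inc, exc = effective_patterns(include, exclude)
    return [
        p for p in paths
        if any(fnmatchcase(PurePosixPath(p).name, pat) for pat in inc)
        and not is_excluded(p, exc)
    ]


files = ["src/App.cs", "Tests/AppTest.cs", "node_modules/x/index.js", "tools/gen.py"]
print(select_files(files, include=["*.cs"], exclude=["Tests"]))  # ['src/App.cs']
```

Passing only an exclude list keeps every default include pattern active, which is the asymmetry the two options are meant to have.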
Save your preferred settings as defaults:
# Set include patterns for C# projects
codewiki config agent --include "*.cs"
# Exclude test projects by default (merged with default excludes)
codewiki config agent --exclude "Tests,Specs,*.test.cs"
# Set focus modules
codewiki config agent --focus "src/core,src/api"
# Set default documentation type
codewiki config agent --doc-type architecture
# View current agent settings
codewiki config agent
# Clear all agent settings
codewiki config agent --clear
| Option | Description | Behavior | Example |
|---|---|---|---|
| --include | File patterns to include | Replaces defaults | *.cs, *.py, src/**/*.ts |
| --exclude | Patterns to exclude | Merges with defaults | Tests,Specs, *.test.js, build/ |
| --focus | Modules to document in detail | Standalone option | src/core,src/api |
| --doc-type | Documentation style | Standalone option | api, architecture, user-guide, developer |
| --instructions | Custom agent instructions | Standalone option | Free-form text |
CodeWiki allows you to configure maximum token limits for LLM calls. This is useful for:
- Adapting to different model context windows
- Controlling costs by limiting response sizes
- Optimizing for faster response times
# Set max tokens for LLM responses (default: 32768)
codewiki config set --max-tokens 16384
# Set max tokens for module clustering (default: 36369)
codewiki config set --max-token-per-module 40000
# Set max tokens for leaf modules (default: 16000)
codewiki config set --max-token-per-leaf-module 20000
# Set max depth for hierarchical decomposition (default: 2)
codewiki config set --max-depth 3
# Override at runtime for a single generation
codewiki generate --max-tokens 16384 --max-token-per-module 40000 --max-depth 3
| Option | Description | Default |
|---|---|---|
| --max-tokens | Maximum output tokens for LLM response | 32768 |
| --max-token-per-module | Input tokens threshold for module clustering | 36369 |
| --max-token-per-leaf-module | Input tokens threshold for leaf modules | 16000 |
| --max-depth | Maximum depth for hierarchical decomposition | 2 |
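One way to picture how saved settings and runtime overrides combine is a layered merge: built-in defaults first, then values saved with codewiki config set, then flags passed to codewiki generate. The sketch below assumes that precedence and uses hypothetical key names (underscored versions of the flags); it is not taken from the CodeWiki source.

```python
# Defaults as listed in the options table; the key names are hypothetical.
DEFAULTS = {
    "max_tokens": 32768,
    "max_token_per_module": 36369,
    "max_token_per_leaf_module": 16000,
    "max_depth": 2,
}


def resolve_settings(saved=None, runtime=None):
    """Merge defaults < saved config < runtime flags; None means 'not set'."""
    settings = dict(DEFAULTS)
    for layer in (saved or {}, runtime or {}):
        for key, value in layer.items():
            if key not in DEFAULTS:
                raise KeyError(f"unknown setting: {key}")
            if value is None:
                continue  # flag not given at this layer, keep the lower layer
            if not isinstance(value, int) or value <= 0:
                raise ValueError(f"{key} must be a positive integer")
            settings[key] = value
    return settings


# Saved max depth of 3, with a one-off lower output limit for a single run.
print(resolve_settings(saved={"max_depth": 3}, runtime={"max_tokens": 16384}))
```

Under this assumption a runtime flag affects only the current run, while the saved value remains in place for the next one.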
- API keys: Securely stored in system keychain (macOS Keychain, Windows Credential Manager, Linux Secret Service). Falls back to ~/.codewiki/credentials.json in headless/container environments. Set CODEWIKI_NO_KEYRING=1 to force file-based storage.
- Settings & Agent Instructions: ~/.codewiki/config.json
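The storage rules above amount to a short decision procedure, sketched here in Python. The function name and the keyring_available argument are hypothetical; how CodeWiki actually detects a usable keychain is not described in this document.

```python
import os


def choose_key_storage(keyring_available, env=None):
    """Return 'keyring' or 'file' following the storage rules described above."""
    env = os.environ if env is None else env
    if env.get("CODEWIKI_NO_KEYRING") == "1":
        return "file"      # file-based storage explicitly forced
    if keyring_available:
        return "keyring"   # system keychain is usable
    return "file"          # headless/container fallback


print(choose_key_storage(keyring_available=True, env={}))
```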
Generated documentation includes both textual descriptions and visual artifacts for comprehensive understanding.
- Repository overview with architecture guide
- Module-level documentation with API references
- Usage examples and implementation patterns
- Cross-module interaction analysis
- System architecture diagrams (Mermaid)
- Data flow visualizations
- Dependency graphs and module relationships
- Sequence diagrams for complex interactions
./docs/
├── overview.md # Repository overview (start here!)
├── module1.md # Module documentation
├── module2.md # Additional modules...
├── module_tree.json # Hierarchical module structure
├── first_module_tree.json # Initial clustering result
├── metadata.json # Generation metadata
└── index.html # Interactive viewer (with --github-pages)
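A quick way to confirm that a run produced the layout above is to check for the fixed file names and list the module pages. The following is a minimal sketch; the helper name is hypothetical, and only the file names come from the tree above.

```python
from pathlib import Path

# File names from the output tree; index.html is only expected with --github-pages.
EXPECTED = ["overview.md", "module_tree.json", "first_module_tree.json", "metadata.json"]


def check_docs(docs_dir="docs", github_pages=False):
    """Return (missing_files, module_pages) for a generated docs directory."""
    root = Path(docs_dir)
    expected = EXPECTED + (["index.html"] if github_pages else [])
    missing = [name for name in expected if not (root / name).is_file()]
    # Every Markdown file other than the overview is treated as a module page.
    module_pages = sorted(p.name for p in root.glob("*.md") if p.name != "overview.md")
    return missing, module_pages


missing, pages = check_docs("docs")
print("missing:", missing)
print("module pages:", pages)
```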
CodeWiki has been evaluated on CodeWikiBench, the first benchmark specifically designed for repository-level documentation quality assessment.
| Language Category | CodeWiki (Sonnet-4) | DeepWiki | Improvement |
|---|---|---|---|
| High-Level (Python, JS, TS) | 79.14% | 68.67% | +10.47% |
| Managed (C#, Java) | 68.84% | 64.80% | +4.04% |
| Systems (C, C++) | 53.24% | 56.39% | -3.15% |
| Overall Average | 68.79% | 64.06% | +4.73% |
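The Improvement column is the difference between the two scores in percentage points, not a relative change. The short check below recomputes it from the table values; Decimal is used so the two-decimal figures are reproduced exactly.

```python
from decimal import Decimal

# (category, CodeWiki score, DeepWiki score), copied from the table above.
ROWS = [
    ("High-Level (Python, JS, TS)", "79.14", "68.67"),
    ("Managed (C#, Java)", "68.84", "64.80"),
    ("Systems (C, C++)", "53.24", "56.39"),
    ("Overall Average", "68.79", "64.06"),
]


def improvement(codewiki, deepwiki):
    """Difference in percentage points, formatted with an explicit sign."""
    delta = Decimal(codewiki) - Decimal(deepwiki)
    return f"{delta:+.2f}"


for name, cw, dw in ROWS:
    print(f"{name}: {improvement(cw, dw)}")
```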
| Repository | Language | LOC | CodeWiki-Sonnet-4 | DeepWiki | Improvement |
|---|---|---|---|---|---|
| All-Hands-AI--OpenHands | Python | 229K | 82.45% | 73.04% | +9.41% |
| puppeteer--puppeteer | TypeScript | 136K | 83.00% | 64.46% | +18.54% |
| sveltejs--svelte | JavaScript | 125K | 71.96% | 68.51% | +3.45% |
| Unity-Technologies--ml-agents | C# | 86K | 79.78% | 74.80% | +4.98% |
| elastic--logstash | Java | 117K | 57.90% | 54.80% | +3.10% |
View comprehensive results: See paper for complete evaluation on 21 repositories spanning all supported languages.
CodeWiki employs a three-stage process for comprehensive documentation generation:
- Hierarchical Decomposition: Uses dynamic programming-inspired algorithms to partition repositories into coherent modules while preserving architectural context across multiple granularity levels.
- Recursive Multi-Agent Processing: Implements adaptive multi-agent processing with dynamic task delegation, allowing the system to handle complex modules at scale while maintaining quality.
- Multi-Modal Synthesis: Integrates textual descriptions with visual artifacts including architecture diagrams, data-flow representations, and sequence diagrams for comprehensive understanding.
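To make the first stage more concrete, here is a deliberately simplified Python sketch of threshold-driven decomposition. It is not CodeWiki's algorithm: it splits by directory rather than by analyzed dependencies, and the function name, the input shape (a mapping from path to token count), and the use of the two limits as stop conditions are all assumptions made for illustration.

```python
from collections import defaultdict


def decompose(files, max_tokens_per_module=36369, max_depth=2, depth=0):
    """Split {path: token_count} by directory until a group fits the limit.

    Leaves look like {"tokens": int, "files": [...]}; split modules look
    like {"tokens": int, "children": {name: subtree}}.
    """
    total = sum(files.values())
    if total <= max_tokens_per_module or depth >= max_depth:
        return {"tokens": total, "files": sorted(files)}

    groups = defaultdict(dict)
    for path, tokens in files.items():
        parts = path.split("/")
        # Files sitting directly at this level go into a "(root)" group.
        key = parts[depth] if len(parts) > depth + 1 else "(root)"
        groups[key][path] = tokens

    if len(groups) == 1:
        # Only one group: splitting would not reduce the size, keep a leaf.
        return {"tokens": total, "files": sorted(files)}

    return {
        "tokens": total,
        "children": {
            name: decompose(group, max_tokens_per_module, max_depth, depth + 1)
            for name, group in sorted(groups.items())
        },
    }


repo = {
    "src/core/a.py": 30000,
    "src/core/b.py": 20000,
    "src/api/c.py": 10000,
    "setup.py": 500,
}
tree = decompose(repo)
print(sorted(tree["children"]))                  # ['(root)', 'src']
print(sorted(tree["children"]["src"]["children"]))  # ['api', 'core']
```

In this example src/core still exceeds the token limit, but the depth cap stops further splitting, which is the trade-off a maximum depth setting introduces.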
┌─────────────────┐ ┌──────────────────┐ ┌─────────────────┐
│ Codebase │───▶│ Hierarchical │───▶│ Multi-Agent │
│ Analysis │ │ Decomposition │ │ Processing │
└─────────────────┘ └──────────────────┘ └─────────────────┘
│ │
▼ ▼
┌─────────────────┐ ┌──────────────────┐ ┌─────────────────┐
│ Visual │◀───│ Multi-Modal │◀───│ Structured │
│ Artifacts │ │ Synthesis │ │ Content │
└─────────────────┘ └──────────────────┘ └─────────────────┘
- Python 3.12+
- Node.js (for Mermaid diagram validation)
- LLM API access (Anthropic Claude, OpenAI, Azure OpenAI, AWS Bedrock)
- Git (for branch creation features)
- MCP Server - Model Context Protocol server for IDE integrations
- Docker Deployment - Containerized deployment instructions
- Development Guide - Project structure, architecture, and contributing guidelines
- CodeWikiBench - Repository-level documentation benchmark
- Live Demo - Interactive demo and examples
- Paper - Full research paper with detailed methodology and results
- Citation - How to cite CodeWiki in your research
If you use CodeWiki in your research, please cite:
@misc{hoang2025codewikievaluatingaisability,
title={CodeWiki: Evaluating AI's Ability to Generate Holistic Documentation for Large-Scale Codebases},
author={Anh Nguyen Hoang and Minh Le-Anh and Bach Le and Nghi D. Q. Bui},
year={2025},
eprint={2510.24428},
archivePrefix={arXiv},
primaryClass={cs.SE},
url={https://arxiv.org/abs/2510.24428},
}
This project is licensed under the MIT License.
This repository is a fork of the outstanding work done by the team at FSoft AI4Code. Huge thanks and full credit go to the original authors:
CodeWiki — Evaluating AI's Ability to Generate Holistic Documentation for Large-Scale Codebases by Anh Nguyen Hoang, Minh Le-Anh, Bach Le, and Nghi D. Q. Bui.
Their research, framework design, and open-source contributions have made this project possible. If you use this work, please consider citing the original paper and starring the upstream repository.
This fork is maintained by WebCafeTech and will be extended with new designs, features, and integrations over time.
🌐 GitHub Pages: https://webcafetech.github.io/AutoCodeWikiAI/

