Skip to content

WebCafeTech/AutoCodeWikiAI

 
 

Repository files navigation

CodeWiki: Evaluating AI's Ability to Generate Holistic Documentation for Large-Scale Codebases

AI-Powered Repository Documentation GenerationMulti-Language SupportArchitecture-Aware Analysis

Generate holistic, structured documentation for large-scale codebases • Cross-module interactions • Visual artifacts and diagrams

Python version License: MIT GitHub stars arXiv

Quick StartCLI CommandsOutput StructurePaper

CodeWiki Framework


Quick Start

1. Install CodeWiki

# Install from source
pip install git+https://github.com/FSoft-AI4Code/CodeWiki.git

# Verify installation
codewiki --version

2. Configure Your Environment

CodeWiki supports multiple LLM providers: OpenAI-compatible, Anthropic, AWS Bedrock, and Azure OpenAI.

# Anthropic
codewiki config set \
  --api-key YOUR_API_KEY \
  --base-url https://api.anthropic.com \
  --main-model claude-sonnet-4 \
  --cluster-model claude-sonnet-4 \
  --fallback-model glm-4p5

# Azure OpenAI
codewiki config set \
  --provider azure-openai \
  --api-key YOUR_AZURE_KEY \
  --base-url https://YOUR_RESOURCE.openai.azure.com \
  --azure-deployment YOUR_DEPLOYMENT \
  --main-model gpt-4o \
  --cluster-model gpt-4o

# AWS Bedrock
codewiki config set \
  --provider bedrock \
  --aws-region us-east-1 \
  --main-model anthropic.claude-sonnet-4-v2:0 \
  --cluster-model anthropic.claude-sonnet-4-v2:0

3. Generate Documentation

# Navigate to your project
cd /path/to/your/project

# Generate documentation
codewiki generate

# Generate with HTML viewer for GitHub Pages
codewiki generate --github-pages --create-branch

That's it! Your documentation will be generated in ./docs/ with comprehensive repository-level analysis.

Usage Example

CLI Usage Example


What is CodeWiki?

CodeWiki is an open-source framework for automated repository-level documentation across eight programming languages. It generates holistic, architecture-aware documentation that captures not only individual functions but also their cross-file, cross-module, and system-level interactions.

Key Innovations

Innovation Description Impact
Hierarchical Decomposition Dynamic programming-inspired strategy that preserves architectural context Handles codebases of arbitrary size (86K-1.4M LOC tested)
Recursive Agentic System Adaptive multi-agent processing with dynamic delegation capabilities Maintains quality while scaling to repository-level scope
Multi-Modal Synthesis Generates textual documentation, architecture diagrams, data flows, and sequence diagrams Comprehensive understanding from multiple perspectives

Supported Languages

🐍 Python☕ Java🟨 JavaScript🔷 TypeScript⚙️ C🔧 C++🪟 C#🎯 Kotlin


CLI Commands

Configuration Management

# Set up your API configuration
codewiki config set \
  --api-key <your-api-key> \
  --base-url <provider-url> \
  --main-model <model-name> \
  --cluster-model <model-name> \
  --fallback-model <model-name>

# Configure max token settings
codewiki config set --max-tokens 32768 --max-token-per-module 36369 --max-token-per-leaf-module 16000

# Configure max depth for hierarchical decomposition
codewiki config set --max-depth 3

# Show current configuration
codewiki config show

# Validate your configuration
codewiki config validate

Documentation Generation

# Basic generation
codewiki generate

# Custom output directory
codewiki generate --output ./documentation

# Create git branch for documentation
codewiki generate --create-branch

# Generate HTML viewer for GitHub Pages
codewiki generate --github-pages

# Enable verbose logging
codewiki generate --verbose

# Full-featured generation
codewiki generate --create-branch --github-pages --verbose

# Incremental update (only regenerate changed modules since last run)
codewiki generate --update

Customization Options

CodeWiki supports customization for language-specific projects and documentation styles:

# C# project: only analyze .cs files, exclude test directories
codewiki generate --include "*.cs" --exclude "Tests,Specs,*.test.cs"

# Focus on specific modules with architecture-style docs
codewiki generate --focus "src/core,src/api" --doc-type architecture

# Add custom instructions for the AI agent
codewiki generate --instructions "Focus on public APIs and include usage examples"

Pattern Behavior (Important!)

  • --include: When specified, ONLY these patterns are used (replaces defaults completely)

    • Example: --include "*.cs" will analyze ONLY .cs files
    • If omitted, all supported file types are analyzed
    • Supports glob patterns: *.py, src/**/*.ts, *.{js,jsx}
  • --exclude: When specified, patterns are MERGED with default ignore patterns

    • Example: --exclude "Tests,Specs" will exclude these directories AND still exclude .git, __pycache__, node_modules, etc.
    • Default patterns include: .git, node_modules, __pycache__, *.pyc, bin/, dist/, and many more
    • Supports multiple formats:
      • Exact names: Tests, .env, config.local
      • Glob patterns: *.test.js, *_test.py, *.min.*
      • Directory patterns: build/, dist/, coverage/

Setting Persistent Defaults

Save your preferred settings as defaults:

# Set include patterns for C# projects
codewiki config agent --include "*.cs"

# Exclude test projects by default (merged with default excludes)
codewiki config agent --exclude "Tests,Specs,*.test.cs"

# Set focus modules
codewiki config agent --focus "src/core,src/api"

# Set default documentation type
codewiki config agent --doc-type architecture

# View current agent settings
codewiki config agent

# Clear all agent settings
codewiki config agent --clear
Option Description Behavior Example
--include File patterns to include Replaces defaults *.cs, *.py, src/**/*.ts
--exclude Patterns to exclude Merges with defaults Tests,Specs, *.test.js, build/
--focus Modules to document in detail Standalone option src/core,src/api
--doc-type Documentation style Standalone option api, architecture, user-guide, developer
--instructions Custom agent instructions Standalone option Free-form text

Token Settings

CodeWiki allows you to configure maximum token limits for LLM calls. This is useful for:

  • Adapting to different model context windows
  • Controlling costs by limiting response sizes
  • Optimizing for faster response times
# Set max tokens for LLM responses (default: 32768)
codewiki config set --max-tokens 16384

# Set max tokens for module clustering (default: 36369)
codewiki config set --max-token-per-module 40000

# Set max tokens for leaf modules (default: 16000)
codewiki config set --max-token-per-leaf-module 20000

# Set max depth for hierarchical decomposition (default: 2)
codewiki config set --max-depth 3

# Override at runtime for a single generation
codewiki generate --max-tokens 16384 --max-token-per-module 40000 --max-depth 3
Option Description Default
--max-tokens Maximum output tokens for LLM response 32768
--max-token-per-module Input tokens threshold for module clustering 36369
--max-token-per-leaf-module Input tokens threshold for leaf modules 16000
--max-depth Maximum depth for hierarchical decomposition 2

Configuration Storage

  • API keys: Securely stored in system keychain (macOS Keychain, Windows Credential Manager, Linux Secret Service). Falls back to ~/.codewiki/credentials.json in headless/container environments. Set CODEWIKI_NO_KEYRING=1 to force file-based storage.
  • Settings & Agent Instructions: ~/.codewiki/config.json

Documentation Output

Generated documentation includes both textual descriptions and visual artifacts for comprehensive understanding.

Textual Documentation

  • Repository overview with architecture guide
  • Module-level documentation with API references
  • Usage examples and implementation patterns
  • Cross-module interaction analysis

Visual Artifacts

  • System architecture diagrams (Mermaid)
  • Data flow visualizations
  • Dependency graphs and module relationships
  • Sequence diagrams for complex interactions

Output Structure

./docs/
├── overview.md              # Repository overview (start here!)
├── module1.md               # Module documentation
├── module2.md               # Additional modules...
├── module_tree.json         # Hierarchical module structure
├── first_module_tree.json   # Initial clustering result
├── metadata.json            # Generation metadata
└── index.html               # Interactive viewer (with --github-pages)

Experimental Results

CodeWiki has been evaluated on CodeWikiBench, the first benchmark specifically designed for repository-level documentation quality assessment.

Performance by Language Category

Language Category CodeWiki (Sonnet-4) DeepWiki Improvement
High-Level (Python, JS, TS) 79.14% 68.67% +10.47%
Managed (C#, Java) 68.84% 64.80% +4.04%
Systems (C, C++) 53.24% 56.39% -3.15%
Overall Average 68.79% 64.06% +4.73%

Results on Representative Repositories

Repository Language LOC CodeWiki-Sonnet-4 DeepWiki Improvement
All-Hands-AI--OpenHands Python 229K 82.45% 73.04% +9.41%
puppeteer--puppeteer TypeScript 136K 83.00% 64.46% +18.54%
sveltejs--svelte JavaScript 125K 71.96% 68.51% +3.45%
Unity-Technologies--ml-agents C# 86K 79.78% 74.80% +4.98%
elastic--logstash Java 117K 57.90% 54.80% +3.10%

View comprehensive results: See paper for complete evaluation on 21 repositories spanning all supported languages.


How It Works

Architecture Overview

CodeWiki employs a three-stage process for comprehensive documentation generation:

  1. Hierarchical Decomposition: Uses dynamic programming-inspired algorithms to partition repositories into coherent modules while preserving architectural context across multiple granularity levels.

  2. Recursive Multi-Agent Processing: Implements adaptive multi-agent processing with dynamic task delegation, allowing the system to handle complex modules at scale while maintaining quality.

  3. Multi-Modal Synthesis: Integrates textual descriptions with visual artifacts including architecture diagrams, data-flow representations, and sequence diagrams for comprehensive understanding.

Data Flow

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Codebase      │───▶│  Hierarchical    │───▶│  Multi-Agent    │
│   Analysis      │    │  Decomposition   │    │  Processing     │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                                │                        │
                                ▼                        ▼
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Visual        │◀───│  Multi-Modal     │◀───│  Structured     │
│   Artifacts     │    │  Synthesis       │    │  Content        │
└─────────────────┘    └──────────────────┘    └─────────────────┘

Requirements

  • Python 3.12+
  • Node.js (for Mermaid diagram validation)
  • LLM API access (Anthropic Claude, OpenAI, Azure OpenAI, AWS Bedrock)
  • Git (for branch creation features)

Additional Resources

Documentation & Guides

Academic Resources

  • Paper - Full research paper with detailed methodology and results
  • Citation - How to cite CodeWiki in your research

Citation

If you use CodeWiki in your research, please cite:

@misc{hoang2025codewikievaluatingaisability,
      title={CodeWiki: Evaluating AI's Ability to Generate Holistic Documentation for Large-Scale Codebases},
      author={Anh Nguyen Hoang and Minh Le-Anh and Bach Le and Nghi D. Q. Bui},
      year={2025},
      eprint={2510.24428},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2510.24428},
}

Star History

Star History Chart


License

This project is licensed under the MIT License.


Acknowledgments

This repository is a fork of the outstanding work done by the team at FSoft AI4Code. Huge thanks and full credit go to the original authors:

CodeWikiEvaluating AI's Ability to Generate Holistic Documentation for Large-Scale Codebases by Anh Nguyen Hoang, Minh Le-Anh, Bach Le, and Nghi D. Q. Bui.

Their research, framework design, and open-source contributions have made this project possible. If you use this work, please consider citing the original paper and starring the upstream repository.

This fork is maintained by WebCafeTech and will be extended with new designs, features, and integrations over time.

🌐 GitHub Pages: https://webcafetech.github.io/AutoCodeWikiAI/

About

[ACL 2026] Open-source framework for holistic, structured repository-level documentation across multilingual codebases

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages

  • Python 92.5%
  • HTML 7.4%
  • Dockerfile 0.1%