Lesson 22 of 46 ~25 min
Course progress
0%

Structuring Documents for Optimal Retrieval

Learn patterns for organizing large documents, codebases, and datasets within the context window for maximum retrieval accuracy.

Loading a million tokens into the context is easy. Getting the model to reliably find and use the right information is the hard part. This lesson teaches you structuring patterns that maximize retrieval accuracy.

The Document Map Pattern

Place a table of contents at the beginning of your context so the model can efficiently locate sections:

def build_codebase_context(project_path: str) -> str:
    """Build a structured context from a codebase with a document map."""
    files = collect_files(project_path)

    # Build document map
    doc_map = "## Document Map\n\n"
    doc_map += "| # | File | Lines | Purpose |\n"
    doc_map += "|---|------|-------|---------|\n"
    for i, f in enumerate(files):
        doc_map += f"| {i+1} | {f.path} | {f.line_count} | {f.purpose} |\n"

    # Build full content with markers
    content = doc_map + "\n\n## Full Source Code\n\n"
    for i, f in enumerate(files):
        content += f"### [{i+1}] {f.path}\n"
        content += f"```{f.language}\n{f.content}\n```\n\n"

    return content

The Sectioned Context Pattern

For non-code content (legal, research, documentation), use clear hierarchical sections:

## SECTION A: Case Background [Priority: HIGH]
[Content here]

## SECTION B: Witness Depositions [Priority: MEDIUM]
### B.1: Deposition of John Smith (2026-01-15)
[Content]
### B.2: Deposition of Jane Doe (2026-01-18)
[Content]

## SECTION C: Expert Reports [Priority: HIGH]
[Content here]

## SECTION D: Exhibits [Priority: LOW]
[Content here]

Key principles:

  • Number sections for easy reference
  • Add priority indicators so the model knows where to focus
  • Use consistent formatting across all sections
  • Keep related content together — do not interleave unrelated documents

The Chunked Codebase Pattern

For codebases that exceed even the 1M context, chunk intelligently by module:

def chunk_codebase(project_path: str, max_tokens: int = 900_000) -> list[str]:
    """Split codebase into optimal chunks based on module boundaries."""
    modules = detect_modules(project_path)  # Group by directory/package

    chunks = []
    current_chunk = []
    current_tokens = 0

    for module in modules:
        module_tokens = count_tokens(module.content)
        if current_tokens + module_tokens > max_tokens and current_chunk:
            chunks.append(assemble_chunk(current_chunk))
            current_chunk = []
            current_tokens = 0
        current_chunk.append(module)
        current_tokens += module_tokens

    if current_chunk:
        chunks.append(assemble_chunk(current_chunk))

    return chunks

What to Include vs. Exclude

Not everything belongs in the context. Be strategic:

Include:

  • Source code files relevant to the task
  • Configuration files that affect behavior
  • Test files (they document expected behavior)
  • README and architectural documentation
  • Database schemas and migration files

Exclude:

  • node_modules/, vendor/, dependency source code
  • Build artifacts (dist/, .next/, __pycache__/)
  • Binary files, images, compiled assets
  • Lock files (package-lock.json, yarn.lock)
  • IDE configuration (.idea/, .vscode/ — unless relevant)
  • Git history
EXCLUDE_PATTERNS = [
    "node_modules/**", "dist/**", "build/**", "__pycache__/**",
    "*.lock", "*.min.js", "*.map", "*.wasm", "*.png", "*.jpg",
    ".git/**", ".idea/**", ".vscode/**", "coverage/**",
]

Token Counting

Always count tokens before sending to avoid hitting limits:

from anthropic import Anthropic

client = Anthropic()

def count_tokens(text: str) -> int:
    """Count tokens using the Anthropic tokenizer."""
    return client.count_tokens(text)

# Check before sending
context = build_codebase_context("./my-project")
token_count = count_tokens(context)
print(f"Context size: {token_count:,} tokens")

if token_count > 900_000:  # Leave 100K headroom for response
    print("⚠️ Context too large — need to chunk or trim")

In the next lesson, you will process an entire codebase for architecture analysis and cross-file refactoring.