Architecture
System Overview
TreeDex has six core modules, each implemented identically in Python and TypeScript:
┌─────────────────────────────────────────────────────┐
│ Document Loaders │
│ PDFLoader · TextLoader · HTMLLoader · DOCXLoader │
└──────────────────────┬──────────────────────────────┘
│
┌────────────▼────────────┐
│ PDF Parser │
│ extract_pages() │
│ extract_toc() │
│ [H1][H2][H3] markers │
│ group_pages() │
└────────┬───────┬────────┘
│ │
┌─────────▼─┐ ┌▼──────────────────────┐
│ ToC found │ │ No ToC → LLM path │
│ Zero LLM │ │ Heading-guided │
│ calls │ │ Capped continuation │
└─────┬──────┘ └─────────┬──────────────┘
│ │
┌────▼────────────────────▼─────┐
│ Tree Builder │
│ toc_to_sections() │
│ repair_orphans() │
│ list_to_tree() │
│ assign_page_ranges() │
│ assign_node_ids() │
│ embed_text_in_tree() │
└──────────────┬────────────────┘
│
┌──────────────▼────────────────┐
│ TreeDex Core │
│ .query() .save() .load() │
└──────────────┬────────────────┘
│
┌──────────────▼────────────────┐
│ QueryResult │
│ .context .node_ids │
│ .page_ranges .reasoning │
│ .answer (agentic mode) │
└───────────────────────────────┘
Module Map
| Module | Python | TypeScript | LOC | Purpose |
|---|---|---|---|---|
| Core | treedex/core.py | src/core.ts | 303 / 398 | Main TreeDex class |
| LLM Backends | treedex/llm_backends.py | src/llm-backends.ts | 700 / 738 | 18+ provider integrations |
| Loaders | treedex/loaders.py | src/loaders.ts | 153 / 190 | Format-specific document loading |
| PDF Parser | treedex/pdf_parser.py | src/pdf-parser.ts | 133 / 154 | PDF extraction, ToC, headings |
| Prompts | treedex/prompts.py | src/prompts.ts | 91 / 105 | LLM prompt templates |
| Tree Builder | treedex/tree_builder.py | src/tree-builder.ts | 120 / 156 | Flat → hierarchical tree |
| Tree Utils | treedex/tree_utils.py | src/tree-utils.ts | 152 / 188 | Traversal, serialization |
| Types | — | src/types.ts | — / 42 | TypeScript type definitions |
Indexing Pipeline
Step 1: Document Loading
pages = auto_loader("doc.pdf", detect_headings=True)
# Returns: [{"page_num": 0, "text": "...", "token_count": 342}, ...]
Each loader returns a flat list of Page objects with pre-computed token counts. Non-PDF formats split text into synthetic pages by character count (default: 3,000 chars/page).
Step 2: Smart Detection (PDF only)
ToC Extraction:
toc = extract_toc("doc.pdf")
# Returns: [{"level": 1, "title": "Intro", "physical_index": 0}, ...] or None
Uses PDF bookmarks/outline metadata. If 3+ entries found, the tree is built directly — no LLM needed.
Font-Size Heading Detection:
- Sample up to 50 pages, collect all font sizes weighted by character count
- Most-used size = body text. Sizes > body + 0.5pt = headings
- Top 3 heading sizes →
[H1],[H2],[H3](largest first) - Inject markers before heading lines in page text
Step 3: Page Grouping
groups = group_pages(pages, max_tokens=20000, overlap=1)
Pages are batched into token-budget groups. Each page is wrapped in XML tags:
<physical_index_0>Page 0 text here...</physical_index_0>
<physical_index_1>Page 1 text here...</physical_index_1>
Adjacent groups overlap by 1 page to catch sections spanning boundaries.
Step 4: Structure Extraction
- Group 1: Full extraction prompt → LLM returns
[{structure, title, physical_index}, ...] - Groups 2+: Continuation prompt with capped context (top-level outline + last 30 sections)
- Result: Flat list of sections with hierarchical numbering (
"1","1.1","1.2.3")
Step 5: Orphan Repair
sections = repair_orphans(sections)
If "2.3.1" exists but "2.3" doesn’t, a synthetic "2.3" node is inserted.
Step 6: Tree Construction
tree = list_to_tree(sections) # Build hierarchy
assign_page_ranges(tree, len(pages)) # Set start/end indices
assign_node_ids(tree) # DFS: "0001", "0002", ...
embed_text_in_tree(tree, pages) # Populate node.text
Query Pipeline
result = index.query("What methods were used?")
- Strip tree — Remove
node.textfrom tree copy (minimize tokens) - LLM selection — Send stripped tree + question → LLM returns
{node_ids, reasoning} - Text lookup — O(1) node map lookup for each selected node
- Optional answer — If
agentic=True, generate answer from retrieved context - Return —
QueryResultwith context, node IDs, page ranges, reasoning
Data Types
Page
interface Page {
page_num: number; // 0-indexed
text: string; // Page content (may include [H1]/[H2]/[H3] markers)
token_count: number; // Pre-computed cl100k_base tokens
images?: PageImage[]; // Optional extracted images
}
TreeNode
interface TreeNode {
structure: string; // "1", "1.1", "1.2.3"
title: string; // Section title
physical_index: number; // Page where section starts
nodes: TreeNode[]; // Children
start_index?: number; // Page range start (assigned after tree build)
end_index?: number; // Page range end
node_id?: string; // "0001", "0002" (4-digit, DFS order)
text?: string; // Concatenated page text (assigned after tree build)
}
IndexData (JSON format)
interface IndexData {
version: "1.0";
framework: "TreeDex";
tree: TreeNode[]; // Without text (stripped for storage)
pages: Page[]; // Without images (descriptions already in text)
}
Complexity
| Operation | Time | Space |
|---|---|---|
extract_pages() | O(pages) | O(pages × text) |
extract_toc() | O(toc entries) | O(entries) |
group_pages() | O(pages) | O(groups × text) |
list_to_tree() | O(n) | O(n) |
repair_orphans() | O(n × depth) | O(inserts) |
assign_page_ranges() | O(n) | O(1) in-place |
embed_text_in_tree() | O(n × pages/node) | O(text) |
query() | O(1 LLM + n) | O(context) |
save() / load() | O(n + pages) | O(JSON) |
Where n = tree nodes (typically 10-500).
Next: API Reference →