Case Studies
1. Hierarchy Fix — Before vs After
Problem: On a 21-page research paper, the old extraction treated all 41 sections as top-level nodes. Subsections like “1.1 Background” were at the same level as “1 Introduction”.
After v0.1.5 (PDF ToC extraction):
| Metric | Before (flat) | After (hierarchy) | Change |
|---|---|---|---|
| Root nodes | 41 | 10 | -75.6% |
| Max depth | 1 | 3 | 3x deeper |
| Child nodes | 0 | 31 | Proper nesting |
| LLM calls | 1+ | 0 | 100% saved |
Output:
[0001] 1: Introduction (pages 1-1)
[0002] 1.1: Background (pages 1-1)
[0003] 1.2: Limitations of vector-based RAG (pages 1-1)
[0004] 1.3: Our contribution (pages 2-2)
[0005] 2: Related Work (pages 2-4)
[0006] 2.1: Retrieval-augmented generation (pages 2-2)
[0007] 2.2: Document chunking strategies (pages 3-3)
[0008] 2.3: Structured and hierarchical retrieval (pages 3-3)
...
[0011] 3: System Architecture (pages 5-8)
[0012] 3.1: Architecture overview (pages 5-5)
[0013] 3.2: Document loading layer (pages 5-5)
[0014] 3.3: Page grouping with token budget (pages 6-6)
[0015] 3.4: LLM-based structure extraction (pages 7-7)
[0016] 3.5: Tree construction (pages 7-7)
[0017] 3.6: Query retrieval (pages 7-7)
[0018] 3.7: LLM backend abstraction (pages 8-8)
10 correctly nested chapters with proper subsections. Zero LLM cost.
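The nesting step can be illustrated with a minimal sketch. It assumes the ToC arrives as flat `(level, title, page)` entries, which is the shape PDF libraries such as PyMuPDF's `Document.get_toc()` return; whether TreeDex reads the ToC this way is not confirmed here, and `Node` and `nest_toc` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    title: str
    page: int
    children: list["Node"] = field(default_factory=list)


def nest_toc(entries):
    """Turn flat (level, title, page) ToC entries into a nested tree.

    Levels start at 1. A stack holds the current ancestor chain, so each
    entry attaches to the nearest preceding entry with a smaller level.
    """
    roots = []
    stack = []  # (level, node) pairs, outermost ancestor first
    for level, title, page in entries:
        node = Node(title, page)
        # Drop ancestors that sit at the same level or deeper.
        while stack and stack[-1][0] >= level:
            stack.pop()
        if stack:
            stack[-1][1].children.append(node)
        else:
            roots.append(node)
        stack.append((level, node))
    return roots


# Hypothetical ToC entries mirroring the first lines of the output above.
toc = [
    (1, "Introduction", 1),
    (2, "Background", 1),
    (2, "Limitations of vector-based RAG", 1),
    (2, "Our contribution", 2),
    (1, "Related Work", 2),
    (2, "Retrieval-augmented generation", 2),
]
roots = nest_toc(toc)
print(len(roots), [len(r.children) for r in roots])
```

Because the levels come straight from the bookmarks, no LLM call is needed to decide which entries are chapters and which are subsections.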
2. Heading Detection Impact
What the LLM used to see (plain text):
1 Introduction 1.1 Background Large Language Models (LLMs),
accessible primarily through web APIs, have become foundational
components of modern web information systems...
Everything on one line. No hierarchy signals. The LLM has to guess whether “1.1 Background” is a chapter or subsection.
What the LLM sees now (with heading markers):
[H2] 1 Introduction
[H3] 1.1 Background
Large Language Models (LLMs), accessible primarily through web
APIs, have become foundational components of modern web
information systems...
The [H2] and [H3] markers come from font-size analysis:
- Title (17.2pt) → [H1]
- Chapter headings (12.0pt) → [H2]
- Section headings (11.0pt) → [H3]
- Body text (10.0pt) → no marker
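A rough sketch of how such a font-size mapping could work is shown here. It assumes each line comes with its font size, treats the most common size as body text, and ranks the larger sizes from largest to smallest. The function name `mark_headings` and this ranking rule are assumptions for illustration, not TreeDex's confirmed implementation.

```python
from collections import Counter


def mark_headings(lines, max_levels=3):
    """Prefix heading lines with [H1]..[H3] based on font size.

    `lines` is a list of (text, font_size) pairs. The most common size is
    treated as body text; larger sizes are ranked from largest to smallest.
    """
    body_size = Counter(size for _, size in lines).most_common(1)[0][0]
    heading_sizes = sorted({s for _, s in lines if s > body_size}, reverse=True)
    level_of = {s: i + 1 for i, s in enumerate(heading_sizes[:max_levels])}
    marked = []
    for text, size in lines:
        level = level_of.get(size)  # None for body text and small print
        marked.append(f"[H{level}] {text}" if level else text)
    return marked


# Sizes taken from the list above; the title text is a placeholder.
lines = [
    ("Paper title", 17.2),
    ("1 Introduction", 12.0),
    ("1.1 Background", 11.0),
    ("Large Language Models (LLMs), accessible primarily through web", 10.0),
    ("APIs, have become foundational components of modern web", 10.0),
    ("information systems...", 10.0),
]
marked = mark_headings(lines)
print("\n".join(marked))
```

Ranking relative to the body size, instead of hard-coding point values, keeps the sketch usable on documents set in a different base font size.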
Stats for the research paper:
- 3 [H1] markers (title)
- 12 [H2] markers (chapters)
- 31 [H3] markers (sections)
- Token overhead: only +314 tokens (+2.7%)
The prompt explicitly instructs the LLM:
- [H1] = top-level chapters (structure: “1”, “2”)
- [H2] = sections (structure: “1.1”, “1.2”)
- [H3] = subsections (structure: “1.1.1”, “1.1.2”)
3. Capped Continuation Context
Scenario: Indexing a 500-page textbook. The LLM processes page groups sequentially. By group 50 of 56, ~900 sections have been extracted.
Old approach:
// Sent to LLM as "previous structure" — ALL 900+ sections
[
{"structure": "1", "title": "Chapter 1", "physical_index": 0},
{"structure": "1.1", "title": "Section 1.1", "physical_index": 2},
{"structure": "1.1.1", "title": "Subsection 1.1.1", "physical_index": 3},
// ... 897 more sections
{"structure": "8.5.3", "title": "Last extracted", "physical_index": 489}
]
// = 317,200 tokens just for context!
This exceeds most model context windows, so the input gets truncated or the LLM hallucinates.
New approach (capped):
{
"top_level_sections": [
{"structure": "1", "title": "Chapter 1", "physical_index": 0},
{"structure": "2", "title": "Chapter 2", "physical_index": 30},
// ... only 15 top-level entries
],
"recent_sections (last 30)": [
{"structure": "8.4.2", "title": "...", "physical_index": 475},
// ... last 30 sections in detail
],
"total_sections_so_far": 976,
"last_structure_id": "8.5.3"
}
// = 31,200 tokens — fits comfortably
| Document | Old Context | Capped Context | Savings |
|---|---|---|---|
| 100 pages | 9,750 tok | 4,800 tok | 50.8% |
| 300 pages | 117,200 tok | 19,200 tok | 83.6% |
| 500 pages | 317,200 tok | 31,200 tok | 90.2% |
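The capped structure from this case study can be assembled with a few lines of code. The sketch below assumes sections are dicts with a `structure` key and that top-level entries are the ones whose ID contains no dot; `cap_context` is a hypothetical helper name, not a documented TreeDex function.

```python
def cap_context(sections, recent=30):
    """Build a bounded "previous structure" summary for the next LLM call.

    `sections` is the full list of extracted section dicts in document order.
    Only top-level entries and the last `recent` entries are kept in detail.
    """
    top_level = [s for s in sections if "." not in s["structure"]]
    return {
        "top_level_sections": top_level,
        f"recent_sections (last {recent})": sections[-recent:] if recent > 0 else [],
        "total_sections_so_far": len(sections),
        "last_structure_id": sections[-1]["structure"] if sections else None,
    }


# Synthetic input: 8 chapters with 5 sections each (48 entries in total).
sections = []
for chapter in range(1, 9):
    sections.append({"structure": str(chapter), "title": f"Chapter {chapter}"})
    for sub in range(1, 6):
        sections.append(
            {"structure": f"{chapter}.{sub}", "title": f"Section {chapter}.{sub}"}
        )

context = cap_context(sections, recent=30)
print(context["total_sections_so_far"], context["last_structure_id"])
```

The size of the result grows with the number of chapters plus a fixed window, not with the total number of sections, which is why the savings in the table increase with document length.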
4. Orphan Repair
Scenario: The LLM processes chunk 8 and outputs "2.3.1" — but chunks 1-7 never produced "2" or "2.3". Without repair, "2.3.1" becomes a root node.
Mild case (1 missing parent):
| Input | After repair |
|---|---|
| 1 — Introduction | 1 — Introduction |
| 1.1 — Background | 1.1 — Background |
| 2.1 — Data (no "2") | 2 — Section 2 ← synthetic |
| | 2.1 — Data |
Severe case (deep orphan chain):
| Input | After repair |
|---|---|
| 1 — Introduction | 1 — Introduction |
| 1.1 — Background | 1.1 — Background |
| 2.3.1 — Deep orphan | 2 — Section 2 ← synthetic |
| 3.1.2 — Another orphan | 2.3 — Section 2.3 ← synthetic |
| 4 — Conclusion | 2.3.1 — Deep orphan |
| | 3 — Section 3 ← synthetic |
| | 3.1 — Section 3.1 ← synthetic |
| | 3.1.2 — Another orphan |
| | 4 — Conclusion |
The 5 input sections become 9 after repair, with 4 synthetic parents inserted. The tree now has a correct 3-level hierarchy instead of leaving the 2 orphans ("2.3.1" and "3.1.2") as root nodes.
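The repair in the severe case can be reproduced with a short sketch. It assumes sections are `(structure, title)` pairs in document order and that synthetic parents are titled "Section <id>", as in the example above; `repair_orphans` is a hypothetical name and this is one possible implementation, not necessarily TreeDex's.

```python
def repair_orphans(sections):
    """Insert synthetic parents so every structure ID has its ancestors.

    Returns (structure, title, is_synthetic) triples in document order.
    """
    seen = set()
    repaired = []
    for structure, title in sections:
        parts = structure.split(".")
        # Walk ancestors from the outermost in: "2", then "2.3", for "2.3.1".
        for depth in range(1, len(parts)):
            ancestor = ".".join(parts[:depth])
            if ancestor not in seen:
                seen.add(ancestor)
                repaired.append((ancestor, f"Section {ancestor}", True))
        seen.add(structure)
        repaired.append((structure, title, False))
    return repaired


severe = [
    ("1", "Introduction"),
    ("1.1", "Background"),
    ("2.3.1", "Deep orphan"),
    ("3.1.2", "Another orphan"),
    ("4", "Conclusion"),
]
repaired = repair_orphans(severe)
for structure, title, synthetic in repaired:
    print(structure, title, "(synthetic)" if synthetic else "")
```

In this sketch a real section that shows up later with the same ID as an already inserted synthetic parent is not merged with it; handling that case would need an extra pass.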
5. TreeDex vs Vector DB RAG
Feature Comparison
| Dimension | TreeDex | Vector DB (Chroma/Pinecone) |
|---|---|---|
| Index structure | Hierarchical tree | Flat vector space |
| Storage format | JSON (human-readable) | Vector database (opaque) |
| Retrieval method | LLM navigates tree | Cosine similarity |
| Preserves structure | Chapters → sections → subsections | No hierarchy |
| Source attribution | Exact page ranges | Approximate chunk IDs |
| Infrastructure | None (just JSON files) | Database server |
| Dependencies | 1 LLM API | 1 LLM + 1 embedding + 1 DB |
| Debugging | Read the JSON tree | Query embedding space |
| Cost per index | N LLM calls (or 0 with ToC) | N embedding calls |
| Cost per query | 1 LLM call | 1 embedding + 1 LLM call |
When to Use TreeDex
- Structured documents: papers, textbooks, manuals, reports, legal docs
- Need exact page-level attribution
- Want a human-inspectable index
- Don’t want to run a vector database
- PDFs with bookmarks (zero LLM indexing cost)
When to Use Vector DB
- Unstructured content: chat logs, mixed media, knowledge bases
- Need sub-sentence matching
- Already have embedding infrastructure
- Documents with no inherent hierarchy
Performance Profile
| Metric | TreeDex | Vector DB |
|---|---|---|
| Index build time | Seconds (LLM calls) | Seconds (embedding calls) |
| Index size | 20-100 KB (JSON) | Varies (vectors) |
| Query latency | 1 LLM call (~0.5-3s) | 1 similarity search (~50ms) + 1 LLM call |
| Accuracy on structured docs | High (preserves hierarchy) | Medium (loses structure) |
| Accuracy on unstructured | Medium | High (semantic matching) |