Case Studies

1. Hierarchy Fix — Before vs After

Problem: On a 21-page research paper, the old extraction treated all 41 sections as top-level nodes. Subsections like “1.1 Background” were at the same level as “1 Introduction”.

After v0.1.5 (PDF ToC extraction):

Metric        Before (flat)   After (hierarchy)   Change
Root nodes    41              10                  -75.6%
Max depth     1               3                   3x deeper
Child nodes   0               31                  Proper nesting
LLM calls     1+              0                   100% saved

Output:

[0001] 1: Introduction (pages 1-1)
  [0002] 1.1: Background (pages 1-1)
  [0003] 1.2: Limitations of vector-based RAG (pages 1-1)
  [0004] 1.3: Our contribution (pages 2-2)
[0005] 2: Related Work (pages 2-4)
  [0006] 2.1: Retrieval-augmented generation (pages 2-2)
  [0007] 2.2: Document chunking strategies (pages 3-3)
  [0008] 2.3: Structured and hierarchical retrieval (pages 3-3)
  ...
[0011] 3: System Architecture (pages 5-8)
  [0012] 3.1: Architecture overview (pages 5-5)
  [0013] 3.2: Document loading layer (pages 5-5)
  [0014] 3.3: Page grouping with token budget (pages 6-6)
  [0015] 3.4: LLM-based structure extraction (pages 7-7)
  [0016] 3.5: Tree construction (pages 7-7)
  [0017] 3.6: Query retrieval (pages 7-7)
  [0018] 3.7: LLM backend abstraction (pages 8-8)

10 correctly nested chapters with proper subsections. Zero LLM cost.
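
The nesting step can be sketched roughly as follows. This is a minimal illustration, not TreeDex's actual code: it assumes the PDF bookmarks are already available as flat (level, title, page) entries, which is the shape PyMuPDF's `Document.get_toc()` returns, and the names `Node` and `nest_toc` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    structure: str  # dotted ID such as "1.2", derived from position in the ToC
    title: str
    page: int
    children: list = field(default_factory=list)


def nest_toc(entries):
    """Turn flat (level, title, page) entries into a list of root nodes."""
    roots = []
    stack = []  # stack[i] is the most recent node at depth i + 1
    for level, title, page in entries:
        # Clamp jumps (e.g. level 1 followed by level 3) to one level deeper.
        level = max(1, min(level, len(stack) + 1))
        del stack[level - 1:]  # close previous siblings and anything deeper
        siblings = stack[-1].children if stack else roots
        prefix = stack[-1].structure + "." if stack else ""
        node = Node(f"{prefix}{len(siblings) + 1}", title, page)
        siblings.append(node)
        stack.append(node)
    return roots


toc = [
    (1, "Introduction", 1),
    (2, "Background", 1),
    (2, "Limitations of vector-based RAG", 1),
    (1, "Related Work", 2),
    (2, "Retrieval-augmented generation", 2),
]
roots = nest_toc(toc)
```

Because the bookmarks already carry a level for each entry, no LLM call is needed to recover the hierarchy in this sketch.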


2. Heading Detection Impact

What the LLM used to see (plain text):

1 Introduction  1.1 Background  Large Language Models (LLMs),
accessible primarily through web APIs, have become foundational
components of modern web information systems...

Headings and body text run together on one line, with no hierarchy signals. The LLM has to guess whether “1.1 Background” is a chapter or a subsection.

What the LLM sees now (with heading markers):

[H2] 1 Introduction
[H3] 1.1 Background
Large Language Models (LLMs), accessible primarily through web
APIs, have become foundational components of modern web
information systems...

The [H2] and [H3] markers come from font-size analysis:

  • Title (17.2pt) → [H1]
  • Chapter headings (12.0pt) → [H2]
  • Section headings (11.0pt) → [H3]
  • Body text (10.0pt) → no marker
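
A minimal sketch of the size-to-marker mapping, assuming each extracted line arrives as a (text, font_size) pair. The thresholds are taken from the sizes listed above for this one paper, and the names `SIZE_TO_MARKER`, `mark_line`, and `mark_lines` are hypothetical. How TreeDex actually derives the thresholds for a given document is not shown here.

```python
# Hypothetical thresholds based on the sizes listed for this paper.
# Ordered from largest to smallest so the first match wins.
SIZE_TO_MARKER = [
    (17.0, "[H1]"),  # title
    (12.0, "[H2]"),  # chapter headings
    (11.0, "[H3]"),  # section headings
]


def mark_line(text, font_size):
    """Prefix a line with a heading marker if its font size is large enough."""
    for min_size, marker in SIZE_TO_MARKER:
        if font_size >= min_size:
            return f"{marker} {text}"
    return text  # body text gets no marker


def mark_lines(lines):
    """lines: iterable of (text, font_size) pairs, one per extracted line."""
    return "\n".join(mark_line(text, size) for text, size in lines)


page = [
    ("1 Introduction", 12.0),
    ("1.1 Background", 11.0),
    ("Large Language Models (LLMs), accessible primarily through web", 10.0),
]
marked = mark_lines(page)
```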

Stats for the research paper:

  • 3 [H1] markers (title)
  • 12 [H2] markers (chapters)
  • 31 [H3] markers (sections)
  • Token overhead: only +314 tokens (+2.7%)

The prompt explicitly instructs the LLM:

[H1] = top-level chapters (structure: “1”, “2”)
[H2] = sections (structure: “1.1”, “1.2”)
[H3] = subsections (structure: “1.1.1”, “1.1.2”)


3. Capped Continuation Context

Scenario: Indexing a 500-page textbook. The LLM processes page groups sequentially. By group 50 of 56, ~900 sections have been extracted.

Old approach:

// Sent to LLM as "previous structure": ALL 900+ sections
[
  {"structure": "1", "title": "Chapter 1", "physical_index": 0},
  {"structure": "1.1", "title": "Section 1.1", "physical_index": 2},
  {"structure": "1.1.1", "title": "Subsection 1.1.1", "physical_index": 3},
  // ... 897 more sections
  {"structure": "8.5.3", "title": "Last extracted", "physical_index": 489}
]
// = 317,200 tokens just for context!

This exceeds most model context windows, so the request is truncated or the LLM hallucinates.

New approach (capped):

{
  "top_level_sections": [
    {"structure": "1", "title": "Chapter 1", "physical_index": 0},
    {"structure": "2", "title": "Chapter 2", "physical_index": 30},
    // ... only 15 top-level entries
  ],
  "recent_sections (last 30)": [
    {"structure": "8.4.2", "title": "...", "physical_index": 475},
    // ... last 30 sections in detail
  ],
  "total_sections_so_far": 976,
  "last_structure_id": "8.5.3"
}
// = 31,200 tokens, fits comfortably

Document    Old Context   Capped Context   Savings
100 pages   9,750 tok     4,800 tok        50.8%
300 pages   117,200 tok   19,200 tok       83.6%
500 pages   317,200 tok   31,200 tok       90.2%
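
The capping logic can be sketched in a few lines. This is an illustration under assumptions rather than the library's implementation: the function name `build_capped_context` is hypothetical, top-level sections are taken to be those whose structure ID contains no dot, and the key `recent_sections` is a simplified stand-in for the key shown in the example above.

```python
def build_capped_context(sections, recent=30):
    """Summarise previously extracted sections for the next LLM call.

    sections: list of dicts with "structure", "title", "physical_index",
    in document order.
    """
    # Assumption: a structure ID without a dot ("1", "2") is a top-level entry.
    top_level = [s for s in sections if "." not in s["structure"]]
    return {
        "top_level_sections": top_level,
        # Guard recent=0, because sections[-0:] would return the whole list.
        "recent_sections": sections[-recent:] if recent > 0 else [],
        "total_sections_so_far": len(sections),
        "last_structure_id": sections[-1]["structure"] if sections else None,
    }


# Build a synthetic history: 3 chapters with 20 sections each.
history = []
for chapter in range(1, 4):
    history.append({"structure": str(chapter), "title": f"Chapter {chapter}",
                    "physical_index": chapter * 30})
    for sec in range(1, 21):
        history.append({"structure": f"{chapter}.{sec}", "title": "...",
                        "physical_index": chapter * 30 + sec})

context = build_capped_context(history)
```

The context size then grows with the number of chapters plus a fixed window, not with the total number of sections extracted so far.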

4. Orphan Repair

Scenario: The LLM processes chunk 8 and outputs "2.3.1" — but chunks 1-7 never produced "2" or "2.3". Without repair, "2.3.1" becomes a root node.

Mild case (1 missing parent):

Input:                    After repair:
1    — Introduction       1    — Introduction
1.1  — Background         1.1  — Background
2.1  — Data (no "2")      2    — Section 2      ← synthetic
                          2.1  — Data

Severe case (deep orphan chain):

Input:                      After repair:
1    — Introduction         1    — Introduction
1.1  — Background           1.1  — Background
2.3.1 — Deep orphan         2    — Section 2      ← synthetic
3.1.2 — Another orphan      2.3  — Section 2.3    ← synthetic
4    — Conclusion           2.3.1 — Deep orphan
                            3    — Section 3      ← synthetic
                            3.1  — Section 3.1    ← synthetic
                            3.1.2 — Another orphan
                            4    — Conclusion

5 input sections → 9 after repair, with 4 synthetic parents inserted. The tree now has a correct 3-level hierarchy instead of 2 orphaned root nodes.
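
A minimal sketch of the repair pass, assuming sections arrive in document order as dicts with "structure" and "title" keys. The function name `repair_orphans` and the "synthetic" flag are hypothetical, and the sketch assumes that a real parent, when one exists, appears before its children.

```python
def repair_orphans(sections):
    """Insert synthetic parents so every dotted ID has all its ancestors."""
    seen = set()
    repaired = []
    for section in sections:
        parts = section["structure"].split(".")
        # Walk the ancestors from shallowest to deepest: "2", then "2.3", ...
        for depth in range(1, len(parts)):
            ancestor = ".".join(parts[:depth])
            if ancestor not in seen:
                seen.add(ancestor)
                repaired.append({
                    "structure": ancestor,
                    "title": f"Section {ancestor}",
                    "synthetic": True,  # hypothetical flag marking inserted nodes
                })
        seen.add(section["structure"])
        repaired.append(section)
    return repaired


severe = [
    {"structure": "1", "title": "Introduction"},
    {"structure": "1.1", "title": "Background"},
    {"structure": "2.3.1", "title": "Deep orphan"},
    {"structure": "3.1.2", "title": "Another orphan"},
    {"structure": "4", "title": "Conclusion"},
]
fixed = repair_orphans(severe)
```

Running this on the severe case above yields the same nine entries in the same order, with four of them marked synthetic.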


5. TreeDex vs Vector DB RAG

Feature Comparison

Dimension            TreeDex                            Vector DB (Chroma/Pinecone)
Index structure      Hierarchical tree                  Flat vector space
Storage format       JSON (human-readable)              Vector database (opaque)
Retrieval method     LLM navigates tree                 Cosine similarity
Preserves structure  Chapters → sections → subsections  No hierarchy
Source attribution   Exact page ranges                  Approximate chunk IDs
Infrastructure       None (just JSON files)             Database server
Dependencies         1 LLM API                          1 LLM + 1 embedding + 1 DB
Debugging            Read the JSON tree                 Query embedding space
Cost per index       N LLM calls (or 0 with ToC)        N embedding calls
Cost per query       1 LLM call                         1 embedding + 1 LLM call

When to Use TreeDex

  • Structured documents: papers, textbooks, manuals, reports, legal docs
  • Need exact page-level attribution
  • Want a human-inspectable index
  • Don’t want to run a vector database
  • PDFs with bookmarks (zero LLM indexing cost)

When to Use Vector DB

  • Unstructured content: chat logs, mixed media, knowledge bases
  • Need sub-sentence matching
  • Already have embedding infrastructure
  • Documents with no inherent hierarchy

Performance Profile

Metric                       TreeDex                     Vector DB
Index build time             Seconds (LLM calls)         Seconds (embedding calls)
Index size                   20-100 KB (JSON)            Varies (vectors)
Query latency                1 LLM call (~0.5-3s)        1 similarity search (~50ms) + 1 LLM call
Accuracy on structured docs  High (preserves hierarchy)  Medium (loses structure)
Accuracy on unstructured     Medium                      High (semantic matching)

TreeDex © 2024-2026 Mithun Gowda B. MIT License.
