API Reference

TreeDex

The main class for building and querying document indices.

Constructor

# Python
TreeDex(tree: list[dict], pages: list[dict], llm=None)
// TypeScript
new TreeDex(tree: TreeNode[], pages: Page[], llm: BaseLLM | null = null)

You typically don’t call the constructor directly — use the factory methods below.


Factory Methods

from_file / fromFile

Build an index from a document file. This is the primary entry point.

# Python
index = TreeDex.from_file(
    path: str,
    llm: BaseLLM,
    loader=None,             # Custom loader (auto-detect if None)
    max_tokens: int = 20000, # Token budget per page group
    overlap: int = 1,        # Page overlap between groups
    verbose: bool = True,    # Print progress
    extract_images: bool = False  # Extract images for vision LLM
)
// TypeScript
const index = await TreeDex.fromFile(path, llm, {
  loader?,          // Custom loader
  maxTokens?: 20000,
  overlap?: 1,
  verbose?: true,
  extractImages?: false,
});

Pipeline:

  1. Check for PDF ToC → if found, build tree directly (0 LLM calls)
  2. Load pages with heading detection (PDFs without ToC)
  3. Group pages by token budget
  4. LLM extracts structure per group
  5. Repair orphans → build tree
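Under the assumption that the ToC check gates all LLM work, the dispatch can be sketched as follows (`build_index` and the fixed 3-page grouping are illustrative stand-ins, not the real TreeDex internals):

```python
def build_index(pages, toc=None, max_tokens=20000):
    """Illustrative stand-in for the from_file dispatch (not the real
    TreeDex implementation): a usable ToC short-circuits all LLM work."""
    if toc and len(toc) >= 3:
        # Step 1: build the tree straight from the ToC -- 0 LLM calls.
        return {"source": "toc", "llm_calls": 0}
    # Steps 2-5: group pages, then one structure-extraction LLM call
    # per group (grouping shown here as a toy fixed 3-page split).
    groups = [pages[i:i + 3] for i in range(0, len(pages), 3)]
    return {"source": "llm", "llm_calls": len(groups)}
```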

from_pages / fromPages

Build from pre-extracted pages (skip document loading).

index = TreeDex.from_pages(pages, llm, max_tokens=20000, overlap=1, verbose=True)
const index = await TreeDex.fromPages(pages, llm, { maxTokens, overlap, verbose });

from_tree / fromTree

Create from an existing tree and pages (no processing).

index = TreeDex.from_tree(tree, pages, llm)
const index = TreeDex.fromTree(tree, pages, llm);

load

Load a previously saved index from JSON.

index = TreeDex.load("index.json", llm=llm)
const index = await TreeDex.load("index.json", llm);

Instance Methods

query

Retrieve relevant sections for a question.

result = index.query(
    question: str,
    llm=None,            # Override LLM (uses constructor LLM if None)
    agentic: bool = False # Generate an answer from context
) -> QueryResult
const result = await index.query(question, {
  llm?,       // Override LLM
  agentic?,   // Generate answer
});
// Or shorthand: await index.query(question, llm)

save

Export the index to a JSON file.

path = index.save("index.json")  # Returns the path
const path = await index.save("index.json");

The saved JSON contains the tree structure (without embedded text) and all pages. Text is re-embedded on load().

show_tree / showTree

Pretty-print the tree structure.

index.show_tree()

Output:

[0001] 1: Introduction (pages 1-4)
  [0002] 1.1: Background (pages 1-2)
  [0003] 1.2: Motivation (pages 3-4)
[0004] 2: Methods (pages 5-12)
  ...

stats

Return index statistics.

stats = index.stats()
# {
#   "total_pages": 21,
#   "total_tokens": 11710,
#   "total_nodes": 41,
#   "leaf_nodes": 32,
#   "root_sections": 10
# }

find_large_sections / findLargeSections

Find sections that exceed size thresholds.

large = index.find_large_sections(max_pages=10, max_tokens=20000)
const large = index.findLargeSections({ maxPages: 10, maxTokens: 20000 });

QueryResult

Returned by index.query().

Property       Python         Node.js        Type          Description
Context        .context       .context       str           Concatenated text from selected nodes
Node IDs       .node_ids      .nodeIds       list[str]     IDs of selected tree nodes
Page ranges    .page_ranges   .pageRanges    list[tuple]   [(start, end), ...] (0-indexed)
Pages string   .pages_str     .pagesStr      str           Human-readable: "pages 5-8, 12-15"
Reasoning      .reasoning     .reasoning     str           LLM’s explanation
Answer         .answer        .answer        str           Generated answer (agentic mode only)

PDF Parser Functions

extract_toc / extractToc

Extract table of contents from PDF bookmarks.

toc = extract_toc("doc.pdf")
# Returns: [{"level": 1, "title": "Intro", "physical_index": 0}, ...] or None
const toc = await extractToc("doc.pdf");
// Returns: TocEntry[] | null

Returns None/null if the PDF has fewer than 3 ToC entries.

extract_pages / extractPages

Extract text from each page of a PDF.

pages = extract_pages(
    "doc.pdf",
    extract_images=False,    # Extract images as base64
    detect_headings=False    # Inject [H1]/[H2]/[H3] markers
)
const pages = await extractPages("doc.pdf", {
  extractImages: false,
  detectHeadings: false,
});

group_pages / groupPages

Split pages into token-budget groups.

groups = group_pages(pages, max_tokens=20000, overlap=1)
# Returns: list of tagged text strings
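The budgeting logic can be sketched as below. This is a simplified stand-in: token counts are approximated by whitespace splitting (the real function uses cl100k_base), and it returns groups of page dicts rather than the tagged text strings the real function emits.

```python
def group_pages(pages, max_tokens=20000, overlap=1):
    """Sketch of token-budget grouping with page overlap.

    `pages` is a list of {"page": int, "text": str}. Token counts are
    approximated by whitespace splitting here; the real function uses
    the cl100k_base encoding and returns tagged text strings.
    """
    count = lambda text: len(text.split())
    groups, current, budget = [], [], 0
    for page in pages:
        tokens = count(page["text"])
        if current and budget + tokens > max_tokens:
            groups.append(current)
            # Carry the last `overlap` pages into the next group so that
            # sections spanning a group boundary are seen twice.
            current = current[-overlap:] if overlap else []
            budget = sum(count(p["text"]) for p in current)
        current.append(page)
        budget += tokens
    if current:
        groups.append(current)
    return groups
```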

count_tokens / countTokens

Count tokens using cl100k_base encoding.

n = count_tokens("Hello world")  # → 2

Tree Builder Functions

toc_to_sections / tocToSections

Convert ToC entries to numbered sections.

sections = toc_to_sections([
    {"level": 1, "title": "Intro", "physical_index": 0},
    {"level": 2, "title": "Background", "physical_index": 2},
])
# → [{"structure": "1", "title": "Intro", ...},
#    {"structure": "1.1", "title": "Background", ...}]
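The numbering scheme above can be reproduced with a per-level counter; this is a minimal sketch of the idea, not the library's exact code:

```python
def toc_to_sections(toc):
    """Sketch: turn level-based ToC entries into dotted section numbers.

    A level-1 entry bumps the top counter; a deeper entry extends the
    current prefix ("1" -> "1.1" -> "1.1.1", ...).
    """
    counters = []  # one counter per nesting level
    sections = []
    for entry in toc:
        level = entry["level"]
        # Drop counters deeper than this entry, then bump/start this level.
        counters = counters[:level]
        if len(counters) < level:
            counters += [0] * (level - len(counters))
        counters[level - 1] += 1
        sections.append({
            "structure": ".".join(str(c) for c in counters),
            "title": entry["title"],
            "physical_index": entry["physical_index"],
        })
    return sections
```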

repair_orphans / repairOrphans

Insert synthetic parent nodes for orphaned subsections.

repaired = repair_orphans([
    {"structure": "1", "title": "Intro", "physical_index": 0},
    {"structure": "2.3.1", "title": "Deep", "physical_index": 5},
])
# Inserts "2" and "2.3" as synthetic parents
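A sketch of the repair: every dotted prefix of a structure number must exist, and missing prefixes become placeholder nodes (the placeholder title format here is an assumption):

```python
def repair_orphans(sections):
    """Sketch: insert synthetic parents for sections whose ancestors
    are missing. For "2.3.1", the prefixes "2" and "2.3" must exist."""
    seen = set()
    repaired = []
    for sec in sections:
        parts = sec["structure"].split(".")
        for depth in range(1, len(parts)):
            prefix = ".".join(parts[:depth])
            if prefix not in seen:
                seen.add(prefix)
                repaired.append({
                    "structure": prefix,
                    "title": f"(section {prefix})",  # synthetic placeholder title
                    "physical_index": sec["physical_index"],
                })
        seen.add(sec["structure"])
        repaired.append(sec)
    return repaired
```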

list_to_tree / listToTree

Convert flat sections to a hierarchical tree.

tree = list_to_tree(sections)
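The nesting rule is purely structural: "1.1" goes under "1". A sketch, assuming children live under a "nodes" field (the field name is an assumption) and parents precede children, as after repair_orphans:

```python
def list_to_tree(sections):
    """Sketch: nest flat numbered sections by their dotted structure."""
    roots, by_structure = [], {}
    for sec in sections:
        node = dict(sec, nodes=[])  # children collected under "nodes"
        by_structure[sec["structure"]] = node
        # "2.3.1" -> parent key "2.3"; a top-level "2" has no parent.
        parent_key = sec["structure"].rpartition(".")[0]
        if parent_key in by_structure:
            by_structure[parent_key]["nodes"].append(node)
        else:
            roots.append(node)
    return roots
```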

assign_page_ranges / assignPageRanges

Set start_index and end_index on each node.

assign_page_ranges(tree, total_pages=21)
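One plausible way to derive the ranges (a sketch under the assumption that a node runs until the next section starts, and that a parent must span its children; the real walk order may differ):

```python
def assign_page_ranges(tree, total_pages):
    """Sketch: set start_index/end_index from physical_index values."""
    flat = []
    def collect(nodes):
        for n in nodes:
            flat.append(n)
            collect(n.get("nodes", []))
    collect(tree)
    # A node runs from its own start page to the page before the next
    # section in reading order starts (or the end of the document).
    for i, node in enumerate(flat):
        node["start_index"] = node["physical_index"]
        nxt = flat[i + 1]["physical_index"] if i + 1 < len(flat) else total_pages
        node["end_index"] = max(node["start_index"], nxt - 1)
    # A parent must span all of its children.
    def widen(nodes):
        for n in nodes:
            if n.get("nodes"):
                widen(n["nodes"])
                n["end_index"] = max(n["end_index"], n["nodes"][-1]["end_index"])
    widen(tree)
    return tree
```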

assign_node_ids / assignNodeIds

DFS traversal, assigns sequential IDs ("0001", "0002", …).

assign_node_ids(tree)
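A minimal sketch of the DFS numbering (the "node_id" field name is an assumption; the zero-padded format matches the IDs shown in show_tree output):

```python
def assign_node_ids(tree):
    """Sketch: DFS traversal assigning zero-padded sequential IDs."""
    counter = 0
    def walk(nodes):
        nonlocal counter
        for node in nodes:
            counter += 1
            node["node_id"] = f"{counter:04d}"  # "0001", "0002", ...
            walk(node.get("nodes", []))
    walk(tree)
    return tree
```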

embed_text_in_tree / embedTextInTree

Populate each node’s text field by concatenating pages in its range.

embed_text_in_tree(tree, pages)
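A sketch of the embedding step, assuming pages are {"text": str} dicts in page order and node ranges are inclusive (the join separator is an assumption):

```python
def embed_text_in_tree(tree, pages):
    """Sketch: fill each node's text from pages[start_index..end_index]."""
    def walk(nodes):
        for node in nodes:
            chunk = pages[node["start_index"]: node["end_index"] + 1]
            node["text"] = "\n".join(p["text"] for p in chunk)
            walk(node.get("nodes", []))
    walk(tree)
    return tree
```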

find_large_nodes / findLargeNodes

Return nodes exceeding page or token thresholds.

large = find_large_nodes(tree, max_pages=10, max_tokens=20000, pages=pages)
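The threshold check can be sketched as a recursive filter; token counting is approximated here by whitespace splitting rather than cl100k_base:

```python
def find_large_nodes(tree, max_pages=10, max_tokens=20000, pages=None):
    """Sketch: collect nodes whose page span or token count exceeds a
    threshold (tokens approximated by whitespace splitting)."""
    large = []
    def walk(nodes):
        for node in nodes:
            span = node["end_index"] - node["start_index"] + 1
            tokens = len(node.get("text", "").split())
            if span > max_pages or tokens > max_tokens:
                large.append(node)
            walk(node.get("nodes", []))
    walk(tree)
    return large
```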

Tree Utility Functions

create_node_mapping / createNodeMapping

Build a flat {node_id: node} map for O(1) lookups.

node_map = create_node_mapping(tree)
node = node_map["0005"]
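A sketch of the flattening, assuming the "node_id" and "nodes" field names used elsewhere on this page:

```python
def create_node_mapping(tree):
    """Sketch: flatten the tree into {node_id: node} for O(1) lookup."""
    mapping = {}
    def walk(nodes):
        for node in nodes:
            mapping[node["node_id"]] = node
            walk(node.get("nodes", []))
    walk(tree)
    return mapping
```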

strip_text_from_tree / stripTextFromTree

Deep copy the tree with all text fields removed (for LLM prompts).

stripped = strip_text_from_tree(tree)
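The deep-copy-then-drop approach can be sketched as follows (the original tree must stay untouched, since only the prompt copy should lose its text):

```python
import copy

def strip_text_from_tree(tree):
    """Sketch: deep-copy the tree and drop every node's text field, so
    the bare structure can go into an LLM prompt."""
    stripped = copy.deepcopy(tree)
    def walk(nodes):
        for node in nodes:
            node.pop("text", None)
            walk(node.get("nodes", []))
    walk(stripped)
    return stripped
```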

collect_node_texts / collectNodeTexts

Concatenate text from specific nodes.

text = collect_node_texts(["0005", "0008"], node_map)
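A sketch of the concatenation (the blank-line separator is an assumption):

```python
def collect_node_texts(node_ids, node_map):
    """Sketch: join the text of the selected nodes, in the given order.
    The blank-line separator is an assumption."""
    return "\n\n".join(node_map[nid]["text"] for nid in node_ids)
```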

extract_json / extractJson

Robust JSON extraction from LLM output. Handles:

  • Raw JSON
  • Markdown code blocks
  • Trailing commas
  • Text before/after JSON

data = extract_json('Some text ```json\n[{"a": 1}]\n``` more text')
# → [{"a": 1}]
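The recovery steps can be sketched with the stdlib alone: prefer a fenced block if one exists, trim surrounding prose, strip trailing commas, then parse. A simplified illustration, not the library's exact code:

```python
import json
import re

def extract_json(raw):
    """Sketch: pull the first JSON value out of messy LLM output,
    handling code fences, surrounding prose, and trailing commas."""
    # Prefer the contents of a fenced code block, if present.
    fence = re.search(r"```(?:json)?\s*(.*?)```", raw, re.DOTALL)
    text = fence.group(1) if fence else raw
    # Trim prose before the first bracket/brace and after the last one.
    start = min((i for i in (text.find("["), text.find("{")) if i != -1),
                default=0)
    end = max(text.rfind("]"), text.rfind("}")) + 1 or len(text)
    text = text[start:end]
    # Remove trailing commas before a closing bracket/brace.
    text = re.sub(r",\s*([\]}])", r"\1", text)
    return json.loads(text)
```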

count_nodes / countNodes

n = count_nodes(tree)  # Total nodes including children

get_leaf_nodes / getLeafNodes

leaves = get_leaf_nodes(tree)  # Nodes with no children

print_tree / printTree

Print the tree structure.

print_tree(tree)
# [0001] 1: Introduction (pages 1-4)
#   [0002] 1.1: Background (pages 1-2)

Document Loaders

auto_loader / autoLoader

pages = auto_loader("doc.pdf", extract_images=False, detect_headings=False)
const pages = await autoLoader("doc.pdf", { extractImages, detectHeadings });

Auto-detects format by extension: .pdf, .txt, .md, .html, .htm, .docx.

PDFLoader

loader = PDFLoader(extract_images=False, detect_headings=False)
pages = loader.load("doc.pdf")

TextLoader

loader = TextLoader(chars_per_page=3000)
pages = loader.load("doc.txt")

HTMLLoader

loader = HTMLLoader(chars_per_page=3000)
pages = loader.load("doc.html")

DOCXLoader

loader = DOCXLoader(chars_per_page=3000)
pages = loader.load("doc.docx")

Next: LLM Backends →



TreeDex © 2024-2026 Mithun Gowda B. MIT License.
