API Reference
TreeDex
The main class for building and querying document indices.
Constructor
# Python
TreeDex(tree: list[dict], pages: list[dict], llm=None)
// TypeScript
new TreeDex(tree: TreeNode[], pages: Page[], llm: BaseLLM | null = null)
You typically don’t call the constructor directly — use the factory methods below.
Factory Methods
from_file / fromFile
Build an index from a document file. This is the primary entry point.
# Python
index = TreeDex.from_file(
path: str,
llm: BaseLLM,
loader=None, # Custom loader (auto-detect if None)
max_tokens: int = 20000, # Token budget per page group
overlap: int = 1, # Page overlap between groups
verbose: bool = True, # Print progress
extract_images: bool = False # Extract images for vision LLM
)
// TypeScript
const index = await TreeDex.fromFile(path, llm, {
loader?, // Custom loader
maxTokens?: 20000,
overlap?: 1,
verbose?: true,
extractImages?: false,
});
Pipeline:
- Check for a PDF ToC → if found, build the tree directly (0 LLM calls)
- Otherwise, load pages with heading detection
- Group pages by token budget
- The LLM extracts structure per group
- Repair orphans → build the tree
from_pages / fromPages
Build from pre-extracted pages (skip document loading).
index = TreeDex.from_pages(pages, llm, max_tokens=20000, overlap=1, verbose=True)
const index = await TreeDex.fromPages(pages, llm, { maxTokens, overlap, verbose });
from_tree / fromTree
Create from an existing tree and pages (no processing).
index = TreeDex.from_tree(tree, pages, llm)
const index = TreeDex.fromTree(tree, pages, llm);
load
Load a previously saved index from JSON.
index = TreeDex.load("index.json", llm=llm)
const index = await TreeDex.load("index.json", llm);
Instance Methods
query
Retrieve relevant sections for a question.
result = index.query(
question: str,
llm=None, # Override LLM (uses constructor LLM if None)
agentic: bool = False # Generate an answer from context
) -> QueryResult
const result = await index.query(question, {
llm?, // Override LLM
agentic?, // Generate answer
});
// Or shorthand: await index.query(question, llm)
save
Export the index to a JSON file.
path = index.save("index.json") # Returns the path
const path = await index.save("index.json");
The saved JSON contains the tree structure (without embedded text) and all pages. Text is re-embedded on load().
show_tree / showTree
Pretty-print the tree structure.
index.show_tree()
Output:
[0001] 1: Introduction (pages 1-4)
[0002] 1.1: Background (pages 1-2)
[0003] 1.2: Motivation (pages 3-4)
[0004] 2: Methods (pages 5-12)
...
stats
Return index statistics.
stats = index.stats()
# {
# "total_pages": 21,
# "total_tokens": 11710,
# "total_nodes": 41,
# "leaf_nodes": 32,
# "root_sections": 10
# }
find_large_sections / findLargeSections
Find sections that exceed size thresholds.
large = index.find_large_sections(max_pages=10, max_tokens=20000)
const large = index.findLargeSections({ maxPages: 10, maxTokens: 20000 });
QueryResult
Returned by index.query().
| Property | Python | Node.js | Type | Description |
|---|---|---|---|---|
| Context | .context | .context | str | Concatenated text from selected nodes |
| Node IDs | .node_ids | .nodeIds | list[str] | IDs of selected tree nodes |
| Page ranges | .page_ranges | .pageRanges | list[tuple] | [(start, end), ...] (0-indexed) |
| Pages string | .pages_str | .pagesStr | str | Human-readable: "pages 5-8, 12-15" |
| Reasoning | .reasoning | .reasoning | str | LLM’s explanation |
| Answer | .answer | .answer | str | Generated answer (agentic mode only) |
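Since `page_ranges` is 0-indexed while `pages_str` is human-readable (1-indexed), a minimal sketch of how the two might relate (the helper name is hypothetical, not part of the library):

```python
def format_page_ranges(page_ranges):
    """Render 0-indexed (start, end) tuples as a 1-indexed, human-readable string."""
    parts = [f"{start + 1}-{end + 1}" for start, end in page_ranges]
    return "pages " + ", ".join(parts)

# format_page_ranges([(4, 7), (11, 14)]) → "pages 5-8, 12-15"
```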
PDF Parser Functions
extract_toc / extractToc
Extract table of contents from PDF bookmarks.
toc = extract_toc("doc.pdf")
# Returns: [{"level": 1, "title": "Intro", "physical_index": 0}, ...] or None
const toc = await extractToc("doc.pdf");
// Returns: TocEntry[] | null
Returns None/null if the PDF has fewer than 3 ToC entries.
extract_pages / extractPages
Extract text from each page of a PDF.
pages = extract_pages(
"doc.pdf",
extract_images=False, # Extract images as base64
detect_headings=False # Inject [H1]/[H2]/[H3] markers
)
const pages = await extractPages("doc.pdf", {
extractImages: false,
detectHeadings: false,
});
group_pages / groupPages
Split pages into token-budget groups.
groups = group_pages(pages, max_tokens=20000, overlap=1)
# Returns: list of tagged text strings
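A minimal sketch of the grouping behavior, assuming each page is a dict with a `text` field and using whitespace word count as a stand-in for real token counting. It returns lists of page dicts rather than the tagged text strings the real function produces:

```python
def group_pages_sketch(pages, max_tokens=20000, overlap=1,
                       count=lambda t: len(t.split())):
    """Greedily pack pages into groups under max_tokens, carrying `overlap`
    trailing pages into the next group for continuity across boundaries."""
    groups, current, current_tokens = [], [], 0
    for page in pages:
        tokens = count(page["text"])
        if current and current_tokens + tokens > max_tokens:
            groups.append(current)
            # current[-0:] would keep the whole list, so guard overlap == 0
            current = current[-overlap:] if overlap else []
            current_tokens = sum(count(p["text"]) for p in current)
        current.append(page)
        current_tokens += tokens
    if current:
        groups.append(current)
    return groups
```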
count_tokens / countTokens
Count tokens using cl100k_base encoding.
n = count_tokens("Hello world") # → 2
Tree Builder Functions
toc_to_sections / tocToSections
Convert ToC entries to numbered sections.
sections = toc_to_sections([
{"level": 1, "title": "Intro", "physical_index": 0},
{"level": 2, "title": "Background", "physical_index": 2},
])
# → [{"structure": "1", "title": "Intro", ...},
# {"structure": "1.1", "title": "Background", ...}]
repair_orphans / repairOrphans
Insert synthetic parent nodes for orphaned subsections.
repaired = repair_orphans([
{"structure": "1", "title": "Intro", "physical_index": 0},
{"structure": "2.3.1", "title": "Deep", "physical_index": 5},
])
# Inserts "2" and "2.3" as synthetic parents
list_to_tree / listToTree
Convert flat sections to a hierarchical tree.
tree = list_to_tree(sections)
assign_page_ranges / assignPageRanges
Set start_index and end_index on each node.
assign_page_ranges(tree, total_pages=21)
assign_node_ids / assignNodeIds
Depth-first traversal that assigns zero-padded sequential IDs ("0001", "0002", …).
assign_node_ids(tree)
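The traversal order can be sketched as follows, assuming each node is a dict with an optional `children` list (an illustration, not the library's code):

```python
def assign_node_ids_sketch(tree, counter=None):
    """Depth-first walk that stamps each node with a zero-padded node_id."""
    counter = counter if counter is not None else [0]
    for node in tree:
        counter[0] += 1
        node["node_id"] = f"{counter[0]:04d}"
        assign_node_ids_sketch(node.get("children", []), counter)
    return tree
```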
embed_text_in_tree / embedTextInTree
Populate each node’s text field by concatenating pages in its range.
embed_text_in_tree(tree, pages)
find_large_nodes / findLargeNodes
Return nodes exceeding page or token thresholds.
large = find_large_nodes(tree, max_pages=10, max_tokens=20000, pages=pages)
Tree Utility Functions
create_node_mapping / createNodeMapping
Build a flat {node_id: node} map for O(1) lookups.
node_map = create_node_mapping(tree)
node = node_map["0005"]
strip_text_from_tree / stripTextFromTree
Deep copy the tree with all text fields removed (for LLM prompts).
stripped = strip_text_from_tree(tree)
collect_node_texts / collectNodeTexts
Concatenate text from specific nodes.
text = collect_node_texts(["0005", "0008"], node_map)
extract_json / extractJson
Robust JSON extraction from LLM output. Handles:
- Raw JSON
- Markdown code blocks
- Trailing commas
- Text before/after JSON
data = extract_json('Some text ```json\n[{"a": 1}]\n``` more text')
# → [{"a": 1}]
count_nodes / countNodes
n = count_nodes(tree) # Total nodes including children
get_leaf_nodes / getLeafNodes
leaves = get_leaf_nodes(tree) # Nodes with no children
print_tree / printTree
print_tree(tree)
# [0001] 1: Introduction (pages 1-4)
# [0002] 1.1: Background (pages 1-2)
Document Loaders
auto_loader / autoLoader
pages = auto_loader("doc.pdf", extract_images=False, detect_headings=False)
const pages = await autoLoader("doc.pdf", { extractImages, detectHeadings });
Auto-detects format by extension: .pdf, .txt, .md, .html, .htm, .docx.
PDFLoader
loader = PDFLoader(extract_images=False, detect_headings=False)
pages = loader.load("doc.pdf")
TextLoader
loader = TextLoader(chars_per_page=3000)
pages = loader.load("doc.txt")
HTMLLoader
loader = HTMLLoader(chars_per_page=3000)
pages = loader.load("doc.html")
DOCXLoader
loader = DOCXLoader(chars_per_page=3000)
pages = loader.load("doc.docx")
Next: LLM Backends →