API Reference
TreeDex
The main class for building and querying document indices.
Constructor
# Python
TreeDex(tree: list[dict], pages: list[dict], llm=None)
// TypeScript
new TreeDex(tree: TreeNode[], pages: Page[], llm: BaseLLM | null = null)
You typically don’t call the constructor directly — use the factory methods below.
Factory Methods
from_file / fromFile
Build an index from a document file. This is the primary entry point.
# Python
index = TreeDex.from_file(
path: str,
llm: BaseLLM,
loader=None, # Custom loader (auto-detect if None)
max_tokens: int = 20000, # Token budget per page group
overlap: int = 1, # Page overlap between groups
verbose: bool = True, # Print progress
extract_images: bool = False # Extract images for vision LLM
)
// TypeScript
const index = await TreeDex.fromFile(path, llm, {
loader?, // Custom loader
maxTokens?: 20000,
overlap?: 1,
verbose?: true,
extractImages?: false,
});
Pipeline:
- Check for a PDF ToC → if found, build the tree directly (0 LLM calls)
- Otherwise, load pages with heading detection
- Group pages by token budget
- The LLM extracts structure per group
- Repair orphans → build the tree
from_pages / fromPages
Build from pre-extracted pages (skip document loading).
index = TreeDex.from_pages(pages, llm, max_tokens=20000, overlap=1, verbose=True)
const index = await TreeDex.fromPages(pages, llm, { maxTokens, overlap, verbose });
from_tree / fromTree
Create from an existing tree and pages (no processing).
index = TreeDex.from_tree(tree, pages, llm)
const index = TreeDex.fromTree(tree, pages, llm);
load
Load a previously saved index from JSON.
index = TreeDex.load("index.json", llm=llm)
const index = await TreeDex.load("index.json", llm);
Instance Methods
query
Retrieve relevant sections for a question.
result = index.query(
question: str,
llm=None, # Override LLM (uses constructor LLM if None)
agentic: bool = False # Generate an answer from context
) -> QueryResult
const result = await index.query(question, {
llm?, // Override LLM
agentic?, // Generate answer
});
// Or shorthand: await index.query(question, llm)
save
Export the index to a JSON file.
path = index.save("index.json") # Returns the path
const path = await index.save("index.json");
The saved JSON contains the tree structure (without embedded text) and all pages. Text is re-embedded on load().
show_tree / showTree
Pretty-print the tree structure.
index.show_tree()
Output:
[0001] 1: Introduction (pages 1-4)
[0002] 1.1: Background (pages 1-2)
[0003] 1.2: Motivation (pages 3-4)
[0004] 2: Methods (pages 5-12)
...
stats
Return index statistics.
stats = index.stats()
# {
# "total_pages": 21,
# "total_tokens": 11710,
# "total_nodes": 41,
# "leaf_nodes": 32,
# "root_sections": 10
# }
find_large_sections / findLargeSections
Find sections that exceed size thresholds.
large = index.find_large_sections(max_pages=10, max_tokens=20000)
const large = index.findLargeSections({ maxPages: 10, maxTokens: 20000 });
QueryResult
Returned by index.query().
| Property | Python | Node.js | Type | Description |
|---|---|---|---|---|
| Context | .context | .context | str | Concatenated text from selected nodes |
| Node IDs | .node_ids | .nodeIds | list[str] | IDs of selected tree nodes |
| Page ranges | .page_ranges | .pageRanges | list[tuple] | [(start, end), ...] (0-indexed) |
| Pages string | .pages_str | .pagesStr | str | Human-readable: "pages 5-8, 12-15" |
| Reasoning | .reasoning | .reasoning | str | LLM’s explanation |
| Answer | .answer | .answer | str | Generated answer (agentic mode only) |
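Since `page_ranges` is 0-indexed while `pages_str` is human-readable (1-indexed), a minimal sketch of how the two might relate (the helper name is hypothetical, not part of the library):

```python
def format_page_ranges(page_ranges):
    """Render 0-indexed (start, end) tuples as a 1-indexed, human-readable string."""
    parts = [f"{start + 1}-{end + 1}" for start, end in page_ranges]
    return "pages " + ", ".join(parts)

# format_page_ranges([(4, 7), (11, 14)]) → "pages 5-8, 12-15"
```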
PDF Parser Functions
extract_toc / extractToc
Extract table of contents from PDF bookmarks.
toc = extract_toc("doc.pdf")
# Returns: [{"level": 1, "title": "Intro", "physical_index": 0}, ...] or None
const toc = await extractToc("doc.pdf");
// Returns: TocEntry[] | null
Returns None/null if the PDF has fewer than 3 ToC entries.
extract_pages / extractPages
Extract text from each page of a PDF.
pages = extract_pages(
"doc.pdf",
extract_images=False, # Extract images as base64
detect_headings=False # Inject [H1]/[H2]/[H3] markers
)
const pages = await extractPages("doc.pdf", {
extractImages: false,
detectHeadings: false,
});
group_pages / groupPages
Split pages into token-budget groups.
groups = group_pages(pages, max_tokens=20000, overlap=1)
# Returns: list of tagged text strings
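A minimal sketch of the grouping behavior, assuming each page is a dict with a `text` field and using whitespace word count as a stand-in for real token counting. It returns lists of page dicts rather than the tagged text strings the real function produces:

```python
def group_pages_sketch(pages, max_tokens=20000, overlap=1,
                       count=lambda t: len(t.split())):
    """Greedily pack pages into groups under max_tokens, carrying `overlap`
    trailing pages into the next group for continuity across boundaries."""
    groups, current, current_tokens = [], [], 0
    for page in pages:
        tokens = count(page["text"])
        if current and current_tokens + tokens > max_tokens:
            groups.append(current)
            # current[-0:] would keep the whole list, so guard overlap == 0
            current = current[-overlap:] if overlap else []
            current_tokens = sum(count(p["text"]) for p in current)
        current.append(page)
        current_tokens += tokens
    if current:
        groups.append(current)
    return groups
```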
count_tokens / countTokens
Count tokens using cl100k_base encoding.
n = count_tokens("Hello world") # → 2
Tree Builder Functions
toc_to_sections / tocToSections
Convert ToC entries to numbered sections.
sections = toc_to_sections([
{"level": 1, "title": "Intro", "physical_index": 0},
{"level": 2, "title": "Background", "physical_index": 2},
])
# → [{"structure": "1", "title": "Intro", ...},
# {"structure": "1.1", "title": "Background", ...}]
repair_orphans / repairOrphans
Insert synthetic parent nodes for orphaned subsections.
repaired = repair_orphans([
{"structure": "1", "title": "Intro", "physical_index": 0},
{"structure": "2.3.1", "title": "Deep", "physical_index": 5},
])
# Inserts "2" and "2.3" as synthetic parents
list_to_tree / listToTree
Convert flat sections to a hierarchical tree.
tree = list_to_tree(sections)
assign_page_ranges / assignPageRanges
Set start_index and end_index on each node.
assign_page_ranges(tree, total_pages=21)
assign_node_ids / assignNodeIds
Depth-first traversal that assigns zero-padded sequential IDs ("0001", "0002", …).
assign_node_ids(tree)
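The traversal order can be sketched as follows, assuming each node is a dict with an optional `children` list (an illustration, not the library's code):

```python
def assign_node_ids_sketch(tree, counter=None):
    """Depth-first walk that stamps each node with a zero-padded node_id."""
    counter = counter if counter is not None else [0]
    for node in tree:
        counter[0] += 1
        node["node_id"] = f"{counter[0]:04d}"
        assign_node_ids_sketch(node.get("children", []), counter)
    return tree
```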
embed_text_in_tree / embedTextInTree
Populate each node’s text field by concatenating pages in its range.
embed_text_in_tree(tree, pages)
find_large_nodes / findLargeNodes
Return nodes exceeding page or token thresholds.
large = find_large_nodes(tree, max_pages=10, max_tokens=20000, pages=pages)
Tree Utility Functions
create_node_mapping / createNodeMapping
Build a flat {node_id: node} map for O(1) lookups.
node_map = create_node_mapping(tree)
node = node_map["0005"]
strip_text_from_tree / stripTextFromTree
Deep copy the tree with all text fields removed (for LLM prompts).
stripped = strip_text_from_tree(tree)
collect_node_texts / collectNodeTexts
Concatenate text from specific nodes.
text = collect_node_texts(["0005", "0008"], node_map)
extract_json / extractJson
Robust JSON extraction from LLM output. Handles:
- Raw JSON
- Markdown code blocks
- Trailing commas
- Text before/after JSON
data = extract_json('Some text ```json\n[{"a": 1}]\n``` more text')
# → [{"a": 1}]
count_nodes / countNodes
n = count_nodes(tree) # Total nodes including children
get_leaf_nodes / getLeafNodes
leaves = get_leaf_nodes(tree) # Nodes with no children
print_tree / printTree
print_tree(tree)
# [0001] 1: Introduction (pages 1-4)
# [0002] 1.1: Background (pages 1-2)
Document Loaders
auto_loader / autoLoader
pages = auto_loader("doc.pdf", extract_images=False, detect_headings=False)
const pages = await autoLoader("doc.pdf", { extractImages, detectHeadings });
Auto-detects format by extension: .pdf, .txt, .md, .html, .htm, .docx.
PDFLoader
loader = PDFLoader(extract_images=False, detect_headings=False)
pages = loader.load("doc.pdf")
TextLoader
loader = TextLoader(chars_per_page=3000)
pages = loader.load("doc.txt")
HTMLLoader
loader = HTMLLoader(chars_per_page=3000)
pages = loader.load("doc.html")
DOCXLoader
loader = DOCXLoader(chars_per_page=3000)
pages = loader.load("doc.docx")
Next: LLM Backends →