Indexing Pipeline Guide

The indexing pipeline processes documents through 7 sequential steps to build a knowledge representation for query operations.

Pipeline Overview

The indexing process follows a sequential architecture where each step builds upon previous outputs:

flowchart TD
    A["📄 Input Documents"]

    subgraph Phase1[" "]
        direction TB
        P1["Phase 1: Text Processing"]
        B["Step 1: Text Units"]
        P1 --> B
    end

    subgraph Phase2[" "]
        direction TB
        P2["Phase 2: Knowledge Extraction"]
        C["Step 2: Knowledge Graph"]
        D["Step 3: Communities"]
        P2 --> C
        C --> D
    end

    subgraph Phase3[" "]
        direction TB
        P3["Phase 3: Artifact Generation"]
        E["Step 4: Reports"]
        F["Step 5: Entities"]
        G["Step 6: Relationships"]
        H["Step 7: Enhanced Text"]
        P3 --> E
        P3 --> F
        P3 --> G
        P3 --> H
    end

    I["🔍 Search Ready System"]

    A --> Phase1
    Phase1 --> Phase2
    Phase2 --> Phase3
    Phase3 --> I

    classDef phaseTitle fill:#f9f9f9,stroke:#333,stroke-width:2px,color:#000
    classDef stepBox fill:#e3f2fd,stroke:#1976d2,stroke-width:2px,color:#000
    classDef inputOutput fill:#fff3e0,stroke:#f57c00,stroke-width:2px,color:#000

    class P1,P2,P3 phaseTitle
    class B,C,D,E,F,G,H stepBox
    class A,I inputOutput

Step 1: Text Unit Extraction

Purpose: Convert documents into manageable, analyzable text units

What Happens

Documents are loaded and their content extracted
Text is split into overlapping chunks (typically 1000-1500 tokens)
Each chunk gets a unique identifier and document reference

Key Configuration

Chunk Size: Configurable (example: 1200 tokens with 100 token overlap)
Splitting Strategy: Depends on chosen TextSplitter implementation
Metadata Preservation: Document source information maintained

Input → Output

Raw Document: "50-page annual report.pdf"
    ↓
Text Units: 30+ chunks, each with:
   - Unique ID (uuid-001, uuid-002, ...)
   - Document ID reference
   - Text content (1-2 pages worth)
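The chunking described above can be sketched as follows. This is a minimal illustration, not the pipeline's actual TextSplitter: the `TextUnit` class and `split_into_text_units` function are hypothetical names, and whitespace-separated words stand in for real tokenizer tokens.

```python
import uuid
from dataclasses import dataclass
from typing import List


@dataclass
class TextUnit:
    id: str           # unique identifier for the chunk
    document_id: str  # reference back to the source document
    text: str


def split_into_text_units(text: str, document_id: str,
                          chunk_size: int = 1200, overlap: int = 100) -> List[TextUnit]:
    """Split text into overlapping chunks of at most chunk_size tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    tokens = text.split()  # assumption: whitespace words approximate tokens
    step = chunk_size - overlap
    units: List[TextUnit] = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        units.append(TextUnit(str(uuid.uuid4()), document_id, " ".join(window)))
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return units


# A synthetic 3000-token document yields three chunks with the defaults.
units = split_into_text_units(" ".join(f"w{i}" for i in range(3000)), "doc-001")
```

Each chunk starts `chunk_size - overlap` tokens after the previous one, so consecutive chunks share exactly `overlap` tokens.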

Step 2: Graph Generation

Purpose: Extract knowledge from text units and build a unified knowledge graph

This step has three sub-processes:

2a. Entity & Relationship Extraction

Process: LLM analyzes each text unit individually
Extraction: Identifies people, organizations, concepts, and their relationships
Output: Individual NetworkX graph per text unit

2b. Graph Merging

Process: Combines all individual graphs into one master graph
Challenge: Same entities mentioned across multiple text units need consolidation
Solution: Collects every description of the same entity or relationship into a list, which Step 2c then summarizes
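A minimal sketch of the merging idea, using plain dictionaries in place of the NetworkX graphs so that it runs with the standard library alone. The `merge_extractions` function and the shape of the per-text-unit results are assumptions made for illustration, and the descriptions are placeholders.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical per-text-unit result: (entities, relationships).
# Entities map name -> description; relationships map (source, target) -> description.
Extraction = Tuple[Dict[str, str], Dict[Tuple[str, str], str]]


def merge_extractions(extractions: List[Extraction]):
    """Collect every description of the same entity or relationship into a list."""
    entity_descriptions = defaultdict(list)
    relationship_descriptions = defaultdict(list)
    for entities, relationships in extractions:
        for name, description in entities.items():
            entity_descriptions[name].append(description)
        for edge, description in relationships.items():
            relationship_descriptions[edge].append(description)
    return dict(entity_descriptions), dict(relationship_descriptions)


extractions = [
    ({"Ratan Tata": "description from text unit 1"},
     {("Ratan Tata", "Tata Group"): "relationship seen in text unit 1"}),
    ({"Ratan Tata": "description from text unit 2", "Tata Group": "description from text unit 2"},
     {("Ratan Tata", "Tata Group"): "relationship seen in text unit 2"}),
]
merged_entities, merged_relationships = merge_extractions(extractions)
```

After merging, an entity mentioned in several text units carries one description per mention, which is the input the summarization sub-process expects.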

2c. Description Summarization

Process: LLM summarizes multiple descriptions into clean, unified descriptions
Result: A single, clear, comprehensive description for each entity and relationship

Input → Output

Text Unit: "Ratan Tata served as Chairman..."
    ↓ (Entity Extraction)
Individual Graph: [Ratan Tata] --[served_as_chairman]--> [Tata Group]
    ↓ (Graph Merging + Summarization)
Unified Graph: Clean entities and relationships with summarized descriptions

Step 3: Community Detection

Purpose: Discover thematic clusters of related entities

Algorithm

Method: Hierarchical Leiden community detection
Input: The unified knowledge graph from Step 2
Process: Groups entities that are highly connected to each other
Levels: Creates multi-level community hierarchy (Level 0, 1, 2...)

Algorithm Capabilities

Automatic Topic Discovery: Finds themes without manual categorization
Hierarchical Understanding: Broad themes → Specific sub-topics
Scalability: Designed to handle both small and large graphs

Example Output

Community Structure:
   Level 0: "Business Leadership" (Ratan Tata, N.R. Narayana Murthy, ...)
   Level 1: "Technology Leaders" (subset of business leaders)
   Level 2: "Software Industry" (even more specific)
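One way to represent such a hierarchy and look up an entity's memberships is sketched below. It does not run Leiden itself. The `Community` class, the `communities_for` function, and the assignment of entities to the lower levels are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Community:
    id: int
    level: int
    title: str
    parent: Optional[int]  # id of the enclosing community, None at Level 0
    entities: List[str]


def communities_for(entity: str, hierarchy: List[Community]) -> List[int]:
    """Return the ids of all communities containing the entity, broadest first."""
    matches = [c for c in hierarchy if entity in c.entities]
    return [c.id for c in sorted(matches, key=lambda c: c.level)]


# Illustrative membership only; the real assignment comes from the detection step.
hierarchy = [
    Community(0, 0, "Business Leadership", None, ["Ratan Tata", "N.R. Narayana Murthy"]),
    Community(1, 1, "Technology Leaders", 0, ["N.R. Narayana Murthy"]),
    Community(2, 2, "Software Industry", 1, ["N.R. Narayana Murthy"]),
]
```

Here `communities_for("N.R. Narayana Murthy", hierarchy)` walks from the broad theme down to the most specific one, while an entity that appears only at Level 0 gets a single id.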

Step 4: Community Reports

Purpose: Generate human-readable summaries for each community

Process

Analysis: LLM examines all entities and relationships within each community
Synthesis: Creates comprehensive summaries explaining what the community represents
Insights: Identifies key patterns, relationships, and notable findings

Report Structure

Each community report contains:

Title: Descriptive name for the community
Summary: Overview of what this community represents
Rating: Impact severity rating (float between 0 and 10)
Rating Explanation: Single-sentence explanation of the impact rating
Findings: List of key insights with summaries and explanations
Content: Full markdown-formatted report text
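The report fields can be modelled as a small data class. This is a sketch under assumptions: the class names, the snake_case field names, and the layout produced by `render_markdown` are hypothetical, and the sample text is placeholder content.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Finding:
    summary: str
    explanation: str


@dataclass
class CommunityReport:
    title: str
    summary: str
    rating: float
    rating_explanation: str
    findings: List[Finding] = field(default_factory=list)
    content: str = ""

    def __post_init__(self) -> None:
        # The rating is documented as a float between 0 and 10.
        if not 0.0 <= self.rating <= 10.0:
            raise ValueError("rating must be between 0 and 10")


def render_markdown(report: CommunityReport) -> str:
    """Build a markdown body from the structured fields (layout is an assumption)."""
    lines = [f"# {report.title}", "", report.summary, ""]
    for finding in report.findings:
        lines += [f"## {finding.summary}", "", finding.explanation, ""]
    return "\n".join(lines).rstrip() + "\n"


report = CommunityReport(
    title="Business Leadership",
    summary="Placeholder summary of the community.",
    rating=7.5,
    rating_explanation="Placeholder explanation of the rating.",
    findings=[Finding("Placeholder insight", "Placeholder explanation.")],
)
report.content = render_markdown(report)
```

Validating the rating at construction time keeps out-of-range values from reaching downstream consumers.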

Report Generation

Thematic Analysis: Understand major themes in your data
Content Summaries: High-level overviews of document collections
Pattern Recognition: Discover trends across your document collection

Step 5: Entity Artifacts

Purpose: Create searchable, structured entity records

Process

Conversion: Transforms graph nodes into structured data records
Enrichment: Adds community membership and importance metrics
Vectorization: Generates embeddings for semantic similarity search
Storage: Stores in vector database for efficient retrieval
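To make the similarity-search idea concrete, here is a toy in-memory lookup by cosine similarity. The three-dimensional vectors are made up for illustration; a real deployment would use an embedding model and a vector database, neither of which is shown here.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_entities(query: List[float], index: Dict[str, List[float]],
                     k: int = 2) -> List[Tuple[str, float]]:
    """Return the k entity titles whose vectors are most similar to the query."""
    scored = [(title, cosine_similarity(query, vector)) for title, vector in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical embeddings; real ones have hundreds of dimensions.
index = {
    "Ratan Tata": [0.9, 0.1, 0.0],
    "Tata Group": [0.8, 0.2, 0.1],
    "Software Industry": [0.0, 0.2, 0.9],
}
results = nearest_entities([1.0, 0.0, 0.0], index, k=2)
```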

Entity Record Structure

Field	Description	Example
`title`	Display name	`"Ratan Tata"`
`id`	Unique identifier	`"uuid-entity-001"`
`type`	Entity category	`"PERSON"`
`description`	Summarized description	`"Former Chairman of Tata Group..."`
`degree`	Connection count	`3` (connected to 3 other entities)
`text_unit_ids`	Source references	`["uuid-text-001", "uuid-text-045", ...]`
`communities`	Community membership	`[1, 2]` (community IDs)
`graph_embedding`	Graph embedding vector	`null` (if not generated)
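The conversion from graph nodes to records might look like the sketch below. `build_entity_records` is a hypothetical helper, the `type` and `text_unit_ids` fields are omitted for brevity, and "Entity X" and "Entity Y" are placeholder neighbours.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_entity_records(
    descriptions: Dict[str, str],
    edges: List[Tuple[str, str]],
    communities: Dict[str, List[int]],
) -> List[dict]:
    """Turn graph nodes into flat records with a degree and community list."""
    neighbours = defaultdict(set)
    for source, target in edges:
        neighbours[source].add(target)
        neighbours[target].add(source)
    records = []
    for index, (name, description) in enumerate(descriptions.items(), start=1):
        records.append({
            "title": name,
            "id": f"uuid-entity-{index:03d}",  # placeholder id in the style of the table
            "description": description,
            "degree": len(neighbours[name]),    # number of distinct connected entities
            "communities": communities.get(name, []),
            "graph_embedding": None,            # left empty when not generated
        })
    return records


descriptions = {
    "Ratan Tata": "Former Chairman of Tata Group...",
    "Tata Group": "(description omitted)",
}
edges = [("Ratan Tata", "Tata Group"), ("Ratan Tata", "Entity X"), ("Ratan Tata", "Entity Y")]
records = build_entity_records(descriptions, edges, {"Ratan Tata": [1, 2]})
```

With three edges touching "Ratan Tata", the first record has a `degree` of 3, matching the example row in the table.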

Step 6: Relationship Artifacts

Purpose: Create searchable, structured relationship records

Process

Conversion: Transforms graph edges into structured data records
Ranking: Calculates relationship importance as the sum of the source and target entity degrees
Traceability: Maintains links to source text units
Organization: Structures for efficient relationship queries

Relationship Record Structure

Field	Description	Example
`source`	Starting entity	`"Ratan Tata"`
`target`	Ending entity	`"Tata Group"`
`source_id`	Source entity ID	`"uuid-entity-001"`
`target_id`	Target entity ID	`"uuid-entity-004"`
`id`	Relationship ID	`"uuid-rel-001"`
`description`	Relationship summary	`"Served as Chairman from 1991-2012"`
`rank`	Importance score	`6` (sum of source and target degrees)
`text_unit_ids`	Source references	`["uuid-text-001", "uuid-text-003", ...]`
`source_degree`	Source entity degree	`3`
`target_degree`	Target entity degree	`3`
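The `rank` field in the table is the sum of the two endpoint degrees. A minimal sketch of that calculation follows; `build_relationship_records` is a hypothetical helper, and the ID fields for the endpoints and the `text_unit_ids` field are left out.

```python
from typing import Dict, List, Tuple


def build_relationship_records(
    edges: List[Tuple[str, str, str]],
    degrees: Dict[str, int],
) -> List[dict]:
    """Turn (source, target, description) edges into ranked records."""
    records = []
    for index, (source, target, description) in enumerate(edges, start=1):
        source_degree = degrees.get(source, 0)
        target_degree = degrees.get(target, 0)
        records.append({
            "id": f"uuid-rel-{index:03d}",  # placeholder id in the style of the table
            "source": source,
            "target": target,
            "description": description,
            "source_degree": source_degree,
            "target_degree": target_degree,
            "rank": source_degree + target_degree,
        })
    return records


relationships = build_relationship_records(
    [("Ratan Tata", "Tata Group", "Served as Chairman from 1991-2012")],
    {"Ratan Tata": 3, "Tata Group": 3},
)
```

With both endpoints at degree 3, the record's `rank` is 6, as in the example row.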

Step 7: Text Unit Artifacts (Enrichment)

Purpose: Link original text back to extracted knowledge

Process

Enhancement: Takes original text units from Step 1
Entity Linking: Adds references to entities found in each text unit
Relationship Linking: Adds references to relationships found in each text unit
Bidirectional Navigation: Enables text ↔ graph navigation

Enhanced Text Unit Structure

Field	Description	Example
`id`	Original unit ID	`"uuid-text-unit-001"`
`document_id`	Source document	`"uuid-doc-001"`
`text_unit`	Original text content	`"Ratan Tata served as Chairman..."`
`entity_ids`	Referenced entities	`["uuid-entity-001", "uuid-entity-004"]`
`relationship_ids`	Referenced relationships	`["uuid-rel-001"]`
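The linking step amounts to inverting the `text_unit_ids` lists stored on entities and relationships. The sketch below assumes records are plain dictionaries with the field names from the tables; `enrich_text_units` is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, List


def enrich_text_units(text_units: List[dict], entities: List[dict],
                      relationships: List[dict]) -> List[dict]:
    """Attach entity and relationship ids to the text units that mention them."""

    def invert(records: List[dict]) -> Dict[str, List[str]]:
        by_unit = defaultdict(list)
        for record in records:
            for unit_id in record["text_unit_ids"]:
                by_unit[unit_id].append(record["id"])
        return by_unit

    entity_ids = invert(entities)
    relationship_ids = invert(relationships)
    return [
        {**unit,
         "entity_ids": entity_ids.get(unit["id"], []),
         "relationship_ids": relationship_ids.get(unit["id"], [])}
        for unit in text_units
    ]


text_units = [
    {"id": "uuid-text-unit-001", "document_id": "uuid-doc-001",
     "text_unit": "Ratan Tata served as Chairman..."},
    {"id": "uuid-text-unit-002", "document_id": "uuid-doc-001", "text_unit": "..."},
]
entities = [
    {"id": "uuid-entity-001", "text_unit_ids": ["uuid-text-unit-001"]},
    {"id": "uuid-entity-004", "text_unit_ids": ["uuid-text-unit-001"]},
]
relationships = [{"id": "uuid-rel-001", "text_unit_ids": ["uuid-text-unit-001"]}]
enriched = enrich_text_units(text_units, entities, relationships)
```

Because the entity and relationship records already point at their source text units, the reverse links come from a single pass over each collection.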

Why This Matters

Source Verification: Trace any insight back to original text
Context Building: Find all text mentioning specific entities
Query Enhancement: Rich context for better answers

Data Flow & Dependencies

flowchart LR
    subgraph Input[" "]
        direction TB
        I1["📄 Input"]
        A["Step 1<br/>Text Units"]
        I1 --> A
    end

    subgraph Processing[" "]
        direction TB
        P1["⚙️ Processing"]
        B["Step 2<br/>Knowledge Graph"]
        C["Step 3<br/>Communities"]
        P1 --> B
        B --> C
    end

    subgraph Output[" "]
        direction TB
        O1["📊 Output"]
        D["Step 4<br/>Reports"]
        E["Step 5<br/>Entities"]
        F["Step 6<br/>Relationships"]
        G["Step 7<br/>Enhanced Text"]
        O1 --> D
        O1 --> E
        O1 --> F
        O1 --> G
    end

    Input --> Processing
    Processing --> Output

    classDef sectionTitle fill:#f9f9f9,stroke:#333,stroke-width:2px,color:#000,font-weight:bold
    classDef component fill:#e3f2fd,stroke:#1976d2,stroke-width:1px,color:#000

    class I1,P1,O1 sectionTitle
    class A,B,C,D,E,F,G component

Simple Dependencies

Steps 4 and 5 depend on Step 3 (Communities)
Steps 5, 6, and 7 depend on Step 2 (Knowledge Graph)
Step 7 also depends on Step 1 (Text Units), whose records it enriches with references
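These dependencies can be written down as a graph and ordered with the standard library's `graphlib` (Python 3.9+). The sketch adds two edges taken from the step descriptions rather than from the list above: Step 2 reads the text units from Step 1, and Step 3 takes the graph from Step 2 as input.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps whose output it needs.
dependencies = {
    1: set(),
    2: {1},      # graph extraction reads text units
    3: {2},      # community detection runs on the unified graph
    4: {3},
    5: {2, 3},
    6: {2},
    7: {1, 2},
}

# static_order() yields one valid execution order; others may exist.
order = list(TopologicalSorter(dependencies).static_order())
```

Any order it returns runs every step after the steps it depends on, and `TopologicalSorter` raises `CycleError` if a circular dependency is introduced by mistake.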

Customization Points

Each step can be customized for specific use cases:

Step	Customization Options	Use Cases
Step 1	Chunk size, overlap, splitting strategy	Domain-specific text formats
Step 2	LLM prompts, entity types, extraction rules	Specialized knowledge domains
Step 3	Community detection algorithm, resolution	Different clustering needs
Step 4	Report templates, analysis depth	Custom reporting requirements
Step 5	Vector embedding model, metadata fields	Specialized search needs
Step 6	Ranking algorithms, relationship types	Domain-specific relationships
Step 7	Enrichment strategies, linking rules	Custom text-graph connections

Related Documentation

Query System: Documentation for searching knowledge graphs with Local vs Global search strategies
Documentation Index: Return to the documentation overview

Related Resources

Architecture Overview: System design and concepts
Data Flow Examples: Real data transformations
Advanced Examples: Component-level customization

The indexing pipeline creates the foundation for intelligent querying. Each step contributes to building a comprehensive knowledge representation that enables both precise factual queries and strategic analytical insights.
