Semantic Boundary Detection for Improving RAG on Real-time Agents
Learn how semantic boundary detection can enhance RAG systems for real-time AI agents by improving context management and information processing efficiency.
Real-time AI agents face a critical challenge: they need to process and understand vast amounts of unbounded data while maintaining context and making quick decisions. Traditional Retrieval Augmented Generation (RAG) approaches often struggle with this balance, either missing crucial information or becoming computationally expensive. This post discusses semantic boundary detection as an elegant solution to this problem, offering a way to "intelligently" chunk and process information in real-time without sacrificing understanding.
When building realtime agentic systems at scale, doing RAG is not an option but a necessity. This is mostly because the data the agent encounters is not bounded. In other words, a conversational experience might accumulate an unbounded number of messages in the agent's context window, and if the agent does tool calling, some tools might return unbounded amounts of data.
This is where semantic boundary detection comes in. By detecting the semantic boundaries in the data, we can break down the data into more manageable chunks. This allows us to do RAG in a way that is more likely to capture the key information that we need to make a decision.
The tradeoffs
The tradeoff in any agentic system comes down to computational effort vs. accuracy. For example, you can do fixed-size chunking, which takes little computational effort, but it might not be sophisticated enough to keep chunks semantically coherent. On the other hand, you can do agentic RAG, which offloads the whole process to a sub-agent that tries to form an understanding of the data, but this is far more computationally expensive and doesn't suit a realtime experience.
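For reference, fixed-size chunking really is only a few lines, which is exactly why it's cheap. The `chunkFixed` name and the character-based `size`/`overlap` parameters below are illustrative, not from any particular library:

```typescript
// A minimal fixed-size chunker for contrast: fast, but it will happily
// cut a sentence (or a thought) in half. `size` and `overlap` are in
// characters; `overlap` must be smaller than `size`.
function chunkFixed(text: string, size: number, overlap = 0): string[] {
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += size - overlap) {
    chunks.push(text.slice(i, i + size));
  }
  return chunks;
}

console.log(chunkFixed("abcdefghij", 4, 1));
// → ["abcd", "defg", "ghij", "j"]
```

Note how the last chunk is a fragment and none of the boundaries respect meaning; that is the accuracy cost we trade for speed.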
Core Concept
Semantic boundary detection finds natural breaks in text where topics or meanings shift. Unlike simple sentence splitting, it considers the semantic relationship between sentences to determine where content should be divided. We think this is a good middle ground between cheap syntactic chunking and fully agentic semantic approaches.
How It Works: Step by Step
1. Sentence Segmentation
First, break the text into individual sentences. This step is crucial because it forms the foundation for all subsequent semantic analysis. We use a sentence tokenizer that can handle various punctuation patterns and edge cases:
This code accomplishes three key things:
- Uses a sentence tokenizer to properly split text into individual sentences
- Handles various punctuation patterns while preserving sentence integrity
- Works across different languages with varying sentence boundary markers
import { SentenceTokenizer } from 'sentence-tokenizer'; // npm install sentence-tokenizer
const tokenizer = new SentenceTokenizer();
tokenizer.setEntry(text);
const sentences = tokenizer.getSentences();
2. Create Sliding Windows
Generate groups of sentences using a sliding window approach. This technique is essential for maintaining context awareness. Each group contains the current sentence (anchor) plus surrounding context, allowing us to understand the semantic flow of ideas:
The sliding window implementation achieves three objectives:
- Ensures context preservation between chunks through overlapping windows
- Uses windowSize parameter to control the amount of surrounding context
- Balances context depth against computational requirements
interface SentenceGroup {
anchor: string;
context: string[];
start: number;
end: number;
}
function createSentenceGroups(
sentences: string[],
windowSize: number = 3
): SentenceGroup[] {
const groups: SentenceGroup[] = [];
for (let i = 0; i < sentences.length; i++) {
const start = Math.max(0, i - Math.floor(windowSize/2));
const end = Math.min(sentences.length, i + Math.floor(windowSize/2) + 1);
groups.push({
anchor: sentences[i],
context: sentences.slice(start, end),
start,
end
});
}
return groups;
}
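To make the windowing arithmetic concrete, here is a standalone check of what the function above produces for four placeholder sentences and the default windowSize of 3:

```typescript
// Standalone check of the windowing arithmetic: with windowSize = 3,
// each context spans [max(0, i - 1), min(n, i + 2)) around anchor i.
const sentences = ["S0.", "S1.", "S2.", "S3."];
const windowSize = 3;
const windows = sentences.map((_, i) => {
  const start = Math.max(0, i - Math.floor(windowSize / 2));
  const end = Math.min(sentences.length, i + Math.floor(windowSize / 2) + 1);
  return sentences.slice(start, end);
});
console.log(windows);
// → [["S0.","S1."], ["S0.","S1.","S2."], ["S1.","S2.","S3."], ["S2.","S3."]]
```

The first and last windows are shorter because they are clamped at the edges of the text; every interior anchor sees one neighbor on each side.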
3. Generate Embeddings
Create embeddings for each sentence group. These embeddings capture the semantic meaning of the entire group, allowing us to detect topic shifts and meaning changes:
The embedding generation step does two things:
- Converts each sentence group into a dense vector representation for semantic comparison
- Captures the meaning of each anchor sentence within its surrounding context
async function generateEmbeddings(
groups: SentenceGroup[],
model: EmbeddingModel
): Promise<number[][]> {
const embeddings: number[][] = [];
for (const group of groups) {
const text = group.context.join(' ');
const embedding = await model.embed(text);
embeddings.push(embedding);
}
return embeddings;
}
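The EmbeddingModel type above is left abstract on purpose: any provider client can be adapted to it. Here is the interface we're assuming, plus a toy deterministic stand-in that is useful only for testing the plumbing (the `toyVector` helper is hypothetical, and a real system would call an actual embedding API):

```typescript
// Assumed interface: anything that maps text to a fixed-length vector.
interface EmbeddingModel {
  embed(text: string): Promise<number[]>;
}

// Toy stand-in: hashes words into an 8-dimensional bag-of-words vector.
// Deterministic and dependency-free, but semantically meaningless; swap
// in a real model (OpenAI, Cohere, a local encoder) in practice.
function toyVector(text: string, dims = 8): number[] {
  const v = new Array(dims).fill(0);
  for (const word of text.toLowerCase().split(/\s+/)) {
    let h = 0;
    for (let i = 0; i < word.length; i++) h = (h * 31 + word.charCodeAt(i)) | 0;
    v[Math.abs(h) % dims] += 1;
  }
  return v;
}

const toyModel: EmbeddingModel = {
  embed: async (text) => toyVector(text),
};
```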
4. Calculate Semantic Distances
Compare adjacent embeddings to find where meaning shifts significantly. We convert cosine similarity into a distance (1 minus similarity), so larger values mean a bigger semantic jump:
function cosineSimilarity(a: number[], b: number[]): number {
let dot = 0, normA = 0, normB = 0;
for (let i = 0; i < a.length; i++) {
dot += a[i] * b[i];
normA += a[i] * a[i];
normB += b[i] * b[i];
}
return dot / (Math.sqrt(normA) * Math.sqrt(normB) || 1);
}
function calculateDistances(
embeddings: number[][]
): number[] {
const distances: number[] = [];
for (let i = 1; i < embeddings.length; i++) {
// Cosine distance: 0 means identical meaning, values near 1 mean unrelated
const distance = 1 - cosineSimilarity(
embeddings[i],
embeddings[i-1]
);
distances.push(distance);
}
return distances;
}
5. Detect Boundaries
Find points where the semantic distance exceeds a threshold; each such index marks the end of one chunk and the start of the next:
function findBoundaries(
distances: number[],
threshold: number = 0.15
): number[] {
const boundaries: number[] = [];
for (let i = 0; i < distances.length; i++) {
// distances[i] measures the jump between windows i and i + 1
if (distances[i] > threshold) {
boundaries.push(i);
}
}
return boundaries;
}
Why This Matters for Autonomous Agents
Autonomous agents processing large amounts of data need to understand context and meaning shifts in real-time. Traditional RAG systems with fixed chunking can miss important semantic boundaries or create arbitrary breaks in meaning. This approach offers several advantages for autonomous agents:
- Dynamic Understanding: Agents can process text more naturally, adapting to content flow
- Better Context Preservation: Semantic boundaries ensure context isn't lost between chunks
- Improved Retrieval: More meaningful chunks lead to better search and retrieval results
- Real-time Processing: Agents can process streaming data while maintaining semantic coherence
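Putting steps 1 through 5 together, the whole pipeline fits in one function. This sketch is simplified so it runs standalone: it embeds each sentence on its own (windowSize of 1, no sliding context), uses an exact bag-of-words embedding instead of a real model, and uses 1 minus cosine similarity as the distance. The names `bagOfWords`, `cosine`, and `chunkBySemanticBoundaries` are illustrative; in a real agent you would substitute your embedding model and tune the threshold:

```typescript
// Toy embedding: exact bag-of-words over the vocabulary of the inputs.
// Collision-free and deterministic, but only useful for illustration.
function bagOfWords(texts: string[]): number[][] {
  const vocab = new Map<string, number>();
  const tokenized = texts.map((t) => t.toLowerCase().split(/\s+/));
  for (const words of tokenized)
    for (const w of words)
      if (!vocab.has(w)) vocab.set(w, vocab.size);
  return tokenized.map((words) => {
    const v = new Array(vocab.size).fill(0);
    for (const w of words) v[vocab.get(w)!] += 1;
    return v;
  });
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

// Sentences in, semantically grouped chunks out. A new chunk starts
// wherever the cosine distance between neighbors exceeds the threshold.
function chunkBySemanticBoundaries(
  sentences: string[],
  threshold = 0.9
): string[][] {
  if (sentences.length === 0) return [];
  const embeddings = bagOfWords(sentences);
  const chunks: string[][] = [];
  let current: string[] = [sentences[0]];
  for (let i = 1; i < sentences.length; i++) {
    const distance = 1 - cosine(embeddings[i], embeddings[i - 1]);
    if (distance > threshold) {
      chunks.push(current);
      current = [];
    }
    current.push(sentences[i]);
  }
  chunks.push(current);
  return chunks;
}
```

With a toy word-overlap embedding, sentences that share no words are maximally distant, so a topic change like cats to stocks produces a boundary while same-topic neighbors stay grouped. Real embeddings capture similarity beyond shared surface words, which is the whole point of the technique.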
Real-time Implementation
For real-time processing, consider these optimizations that are particularly relevant for autonomous agents:
- Batch Processing: Process sentences in small batches rather than one at a time.
class SemanticBoundaryDetector {
private buffer: string[] = [];
private batchSize: number = 5;
constructor(private model: EmbeddingModel) {}
addSentence(sentence: string): void {
this.buffer.push(sentence);
}
async processBatch(): Promise<number[]> {
if (this.buffer.length < this.batchSize) {
return [];
}
const groups = createSentenceGroups(this.buffer);
const embeddings = await generateEmbeddings(groups, this.model);
const distances = calculateDistances(embeddings);
const boundaries = findBoundaries(distances);
// Clear processed sentences
this.buffer = this.buffer.slice(this.batchSize);
return boundaries;
}
}
- Sliding Window Optimization: Only compute new embeddings for new sentences:
This optimized detector maintains a rolling window of embeddings to minimize computation:
class OptimizedDetector {
private embeddings: number[][] = [];
private recent: string[] = [];
constructor(private model: EmbeddingModel, private threshold = 0.15) {}
async processNewSentence(sentence: string): Promise<void> {
const newGroup = this.createGroupWithContext(sentence);
const newEmbedding = await this.model.embed(newGroup.join(' '));
// Only store the last N embeddings
this.embeddings = [...this.embeddings.slice(-5), newEmbedding];
// Compare with the previous embedding
if (this.embeddings.length > 1) {
const distance = 1 - cosineSimilarity(
this.embeddings[this.embeddings.length - 1],
this.embeddings[this.embeddings.length - 2]
);
this.checkBoundary(distance);
}
}
// Keep a short rolling context window around the newest sentence
private createGroupWithContext(sentence: string): string[] {
this.recent = [...this.recent.slice(-2), sentence];
return this.recent;
}
private checkBoundary(distance: number): void {
if (distance > this.threshold) {
// Emit a boundary event, flush the current chunk, etc.
}
}
}
- Caching: Cache embeddings for frequently seen sentence patterns:
The cached detector remembers frequently seen patterns to reduce embedding computations:
class CachedDetector {
private cache = new Map<string, number[]>();
constructor(private model: EmbeddingModel) {}
async getEmbedding(text: string): Promise<number[]> {
const hash = this.hashText(text);
if (this.cache.has(hash)) {
return this.cache.get(hash)!;
}
const embedding = await this.model.embed(text);
this.cache.set(hash, embedding);
return embedding;
}
// Cheap, deterministic string hash (djb2); good enough as a cache key
private hashText(text: string): string {
let h = 5381;
for (let i = 0; i < text.length; i++) {
h = ((h << 5) + h + text.charCodeAt(i)) | 0;
}
return h.toString(36);
}
}
System Flow
Below is a visualization of how semantic boundary detection processes text streams in real-time:
flowchart TD
  subgraph Input
    A[Text Stream] --> B[Text Buffer]
  end
  subgraph Preprocessing
    B --> C[Sentence Segmentation]
    C --> D[Create Sliding Windows]
  end
  subgraph Semantic Processing
    D --> E[Generate Embeddings]
    E --> F[Calculate Distances]
    F --> G[Detect Boundaries]
  end
  subgraph Optimization
    H[Embedding Cache] <-.-> E
    I[Batch Processor] <-.-> D
  end
  subgraph Output
    G --> J[Semantic Chunks]
    J --> K[Vector Store]
  end
  class A,B,K storage
  class C,D,E,F,G process
  class H,I optimization
Alternative Chunking Strategies
While semantic boundary detection offers powerful capabilities for autonomous agents, there are several other approaches we didn't explore.
- Topic Modeling-Based Chunking: Leverages algorithms like LDA or BERTopic to create chunks based on detected topics within the text. This approach excels with documents that have clear thematic structures, though it can be computationally expensive for real-time applications.
- Graph-Based Chunking: Represents sentences as nodes in a graph, with edge weights indicating semantic similarity. By applying community detection algorithms, we can identify natural clusters of related content. This method is particularly effective for highly interconnected content.
- Hierarchical Chunking: Creates a tree structure of content, maintaining multiple levels of granularity simultaneously. This approach allows for flexible retrieval at different levels of detail, making it particularly useful for documents with clear hierarchical organization.
- Attention-Based Chunking: Uses transformer attention patterns to identify natural breaks in content. While computationally intensive, it often produces highly accurate results that align well with human-perceived content boundaries.