Semantic Chunking with LangChain: A Step-by-Step Guide

Learn how to split text into semantically similar chunks using the langchain_experimental library. This tutorial will guide you through the installation, setup, and usage of the SemanticChunker, covering different methods for determining breakpoints.

Introduction

Semantic chunking involves splitting text based on semantic similarity, ensuring that grouped sentences convey coherent meaning. This tutorial leverages the SemanticChunker from the langchain_experimental library to achieve effective text segmentation.
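To build intuition before touching the library, here is a rough sketch of the core idea: embed each sentence, measure the distance between consecutive embeddings, and start a new chunk wherever the distance spikes. The toy 2-D vectors and the 0.5 threshold below are made up for illustration; real embeddings come from a model such as OpenAI's.

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """1 minus the cosine similarity between two embedding vectors."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def split_on_distance(sentences, embeddings, threshold):
    """Group consecutive sentences, starting a new chunk whenever the
    embedding distance to the previous sentence exceeds the threshold."""
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine_distance(embeddings[i - 1], embeddings[i]) > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks

# Toy 2-D "embeddings": the first two sentences point one way, the last two another.
sentences = ["Cats purr.", "Kittens nap.", "Stocks rose.", "Markets rallied."]
embeddings = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]])

print(split_on_distance(sentences, embeddings, threshold=0.5))
# ['Cats purr. Kittens nap.', 'Stocks rose. Markets rallied.']
```

The distance jumps only between the second and third sentences, so the text splits into two topically coherent chunks.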

Goals

  • Install necessary dependencies.
  • Load example data.
  • Create and configure a semantic text splitter.
  • Split text into meaningful chunks.
  • Explore different methods for determining chunk breakpoints.

Tutorial Outline

Step 1: Install Dependencies

First, install the required packages. This tutorial uses langchain_experimental and langchain_openai.

!pip install --quiet langchain_experimental langchain_openai
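SemanticChunker calls the OpenAI embeddings API, so an OpenAI API key must be available before any of the later steps run. A small helper of my own (not part of LangChain) that fails fast when the key is missing:

```python
import os

def require_openai_key() -> str:
    """Return the OpenAI API key from the environment, failing fast if unset."""
    key = os.environ.get("OPENAI_API_KEY")
    if not key:
        raise EnvironmentError("Set OPENAI_API_KEY before creating OpenAIEmbeddings.")
    return key
```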

Step 2: Load Example Data

For this tutorial, we’ll use a sample text file (adjust the path below to wherever your copy lives):

with open("../../state_of_the_union.txt") as f:
    state_of_the_union = f.read()

Step 3: Create the Text Splitter

Next, we’ll create a SemanticChunker instance with OpenAI embeddings.

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

text_splitter = SemanticChunker(OpenAIEmbeddings())

Step 4: Split the Text

Now, use the splitter to create document chunks and print the first chunk to see the result.

docs = text_splitter.create_documents([state_of_the_union])
print(docs[0].page_content)

Step 5: Explore Breakpoints

Semantic chunking hinges on deciding where one chunk ends and the next begins. SemanticChunker supports several strategies for setting this breakpoint threshold; we will explore three: Percentile, Standard Deviation, and Interquartile.

Method 1: Percentile

In the percentile method, the distances between all consecutive sentence embeddings are computed, and a split occurs at every distance greater than the chosen percentile (the 95th by default).

text_splitter = SemanticChunker(OpenAIEmbeddings(), breakpoint_threshold_type="percentile")
docs = text_splitter.create_documents([state_of_the_union])
print(docs[0].page_content)
print(len(docs))
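Under the hood, the percentile method amounts to thresholding the distances between consecutive sentence embeddings. A minimal sketch with made-up distances (95 matches what I understand to be the default breakpoint_threshold_amount for this mode):

```python
import numpy as np

# Hypothetical distances between consecutive sentence embeddings.
distances = np.array([0.05, 0.08, 0.62, 0.07, 0.71, 0.06, 0.09])

# Split wherever a distance exceeds the chosen percentile of all distances.
threshold = np.percentile(distances, 95)
breakpoints = [i for i, d in enumerate(distances) if d > threshold]
print(threshold, breakpoints)
```

Only the largest spike clears the 95th percentile, so a single breakpoint is produced.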

Method 2: Standard Deviation

This method splits wherever a distance exceeds the mean by a specified number of standard deviations (three by default).

text_splitter = SemanticChunker(OpenAIEmbeddings(), breakpoint_threshold_type="standard_deviation")
docs = text_splitter.create_documents([state_of_the_union])
print(docs[0].page_content)
print(len(docs))

Method 3: Interquartile

The interquartile method sets the split threshold using the interquartile range of the distances (the mean plus 1.5 × IQR by default).

text_splitter = SemanticChunker(OpenAIEmbeddings(), breakpoint_threshold_type="interquartile")
docs = text_splitter.create_documents([state_of_the_union])
print(docs[0].page_content)
print(len(docs))
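The standard-deviation and interquartile methods differ only in how the threshold is derived from the same distances. A sketch using the same hypothetical distances as before (the multipliers 3 and 1.5 reflect what I understand to be the library's defaults):

```python
import numpy as np

# Hypothetical distances between consecutive sentence embeddings.
distances = np.array([0.05, 0.08, 0.62, 0.07, 0.71, 0.06, 0.09])

# Standard-deviation method: threshold = mean + k * std (k defaults to 3).
sd_threshold = distances.mean() + 3 * distances.std()

# Interquartile method: threshold = mean + k * IQR (k defaults to 1.5).
q1, q3 = np.percentile(distances, [25, 75])
iqr_threshold = distances.mean() + 1.5 * (q3 - q1)

print(sd_threshold, iqr_threshold)
```

With these numbers the interquartile threshold sits below the largest spike while the standard-deviation threshold sits above every distance, which is why the two methods can produce different chunk counts on the same text.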

Advanced Configurations

Custom Thresholds

For more control, each method accepts a breakpoint_threshold_amount parameter that tightens or loosens the splitting criterion. For example, lowering the percentile produces more, smaller chunks:

text_splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=90,
)
docs = text_splitter.create_documents([state_of_the_union])
print(docs[0].page_content)

Analyzing Results

Finally, you can print all the generated chunks (here, from whichever splitter you ran last) to inspect the segmentation:

for doc in docs:
    print(doc.page_content)
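Printing every chunk gets unwieldy for long documents. A small helper of my own (not part of LangChain) that summarizes chunk lengths makes it easier to compare the splitting methods side by side:

```python
def chunk_stats(chunks):
    """Summarize chunk lengths (in characters) to compare splitting methods."""
    lengths = [len(c) for c in chunks]
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": sum(lengths) / len(lengths),
    }

# With real output, pass the page_content of each document:
# chunk_stats([doc.page_content for doc in docs])
print(chunk_stats(["short chunk", "a much longer chunk of text here"]))
```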

Complete Code

Here’s the complete code for reference:

# Install Dependencies
!pip install --quiet langchain_experimental langchain_openai

# Load Example Data
with open("../../state_of_the_union.txt") as f:
    state_of_the_union = f.read()

# Create Text Splitter
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

text_splitter = SemanticChunker(OpenAIEmbeddings())

# Split Text
docs = text_splitter.create_documents([state_of_the_union])
print(docs[0].page_content)

# Percentile Method
text_splitter = SemanticChunker(OpenAIEmbeddings(), breakpoint_threshold_type="percentile")
docs = text_splitter.create_documents([state_of_the_union])
print(docs[0].page_content)
print(len(docs))

# Standard Deviation Method
text_splitter = SemanticChunker(OpenAIEmbeddings(), breakpoint_threshold_type="standard_deviation")
docs = text_splitter.create_documents([state_of_the_union])
print(docs[0].page_content)
print(len(docs))

# Interquartile Method
text_splitter = SemanticChunker(OpenAIEmbeddings(), breakpoint_threshold_type="interquartile")
docs = text_splitter.create_documents([state_of_the_union])
print(docs[0].page_content)
print(len(docs))

# Custom Thresholds (a lower percentile produces more, smaller chunks)
text_splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=90,
)
docs = text_splitter.create_documents([state_of_the_union])
print(docs[0].page_content)

# Analyzing Results
for doc in docs:
    print(doc.page_content)

This tutorial provides a structured approach to implementing semantic chunking using the langchain_experimental library. Experiment with different methods and configurations to optimize text segmentation for your needs.
