Tutorial: Semantic Chunking with LangChain Experimental
Learn how to split text into semantically similar chunks using the langchain_experimental library. This tutorial will guide you through the installation, setup, and usage of the SemanticChunker, covering different methods for determining breakpoints.
Overview
Introduction
Semantic chunking involves splitting text based on semantic similarity, ensuring that grouped sentences convey coherent meaning. This tutorial leverages the SemanticChunker from the langchain_experimental library to achieve effective text segmentation.
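The core idea can be sketched in a few lines: embed each sentence, measure the distance between consecutive embeddings, and start a new chunk wherever that distance jumps above a threshold. The sketch below is only an illustration of the idea with made-up two-dimensional "embeddings", not the library's actual implementation:

```python
import numpy as np

def cosine_distance(a, b):
    # 1 minus the cosine similarity between two vectors.
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def naive_semantic_split(sentences, embeddings, threshold=0.5):
    # Start a new chunk whenever adjacent sentences drift apart semantically.
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine_distance(embeddings[i - 1], embeddings[i]) > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks

# Toy vectors: the first two sentences point the same way, the third differs.
sentences = ["Cats purr.", "Kittens meow.", "Stocks fell sharply."]
embeddings = [np.array([1.0, 0.0]), np.array([0.9, 0.1]), np.array([0.0, 1.0])]
print(naive_semantic_split(sentences, embeddings, threshold=0.5))
```

SemanticChunker applies the same principle, but with real embedding models and statistically derived thresholds, which the breakpoint methods below control.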
Goals
- Install necessary dependencies.
- Load example data.
- Create and configure a semantic text splitter.
- Split text into meaningful chunks.
- Explore different methods for determining chunk breakpoints.
Tutorial Outline
Step 1: Install Dependencies
First, install the required packages. This tutorial uses langchain_experimental and langchain_openai.
!pip install --quiet langchain_experimental langchain_openai
Step 2: Load Example Data
For this tutorial, we'll use a sample text file. Here's how to load the data:
with open("../../state_of_the_union.txt") as f:
    state_of_the_union = f.read()
Step 3: Create the Text Splitter
Next, we'll create a SemanticChunker instance with OpenAI embeddings. Note that OpenAIEmbeddings reads your API key from the OPENAI_API_KEY environment variable, so make sure it is set.
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings
text_splitter = SemanticChunker(OpenAIEmbeddings())
Step 4: Split the Text
Now, use the splitter to create document chunks and print the first chunk to see the result.
docs = text_splitter.create_documents([state_of_the_union])
print(docs[0].page_content)
Step 5: Explore Breakpoints
Semantic chunking hinges on deciding where one chunk should end and the next begin. We will explore three methods for setting this breakpoint threshold: Percentile, Standard Deviation, and Interquartile.
Method 1: Percentile
In the percentile method, the embedding distances between all consecutive sentences are calculated, and a split is made at every distance greater than the specified percentile.
text_splitter = SemanticChunker(OpenAIEmbeddings(), breakpoint_threshold_type="percentile")
docs = text_splitter.create_documents([state_of_the_union])
print(docs[0].page_content)
print(len(docs))
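To make the percentile rule concrete, here is how such a threshold can be computed over a hypothetical list of consecutive-sentence distances (95 is the default breakpoint_threshold_amount for this type at the time of writing; check your installed version):

```python
import numpy as np

# Hypothetical cosine distances between consecutive sentences.
distances = np.array([0.05, 0.06, 0.07, 0.05, 0.90, 0.06, 0.04, 0.08, 0.85, 0.05])

# Split wherever a distance exceeds the 95th percentile of all distances.
threshold = np.percentile(distances, 95)
breakpoints = [i for i, d in enumerate(distances) if d > threshold]
print(threshold, breakpoints)  # only the single largest jump clears the bar
```

Each breakpoint index marks the gap between sentence i and sentence i+1 where a new chunk begins.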
Method 2: Standard Deviation
This method splits the text wherever a distance exceeds the mean by a specified number of standard deviations.
text_splitter = SemanticChunker(OpenAIEmbeddings(), breakpoint_threshold_type="standard_deviation")
docs = text_splitter.create_documents([state_of_the_union])
print(docs[0].page_content)
print(len(docs))
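A quick numeric sketch (with hypothetical distance values) shows why this method tends to be conservative: with the default of 3 standard deviations, even clear outliers may fall below the cutoff in a small sample, producing fewer, larger chunks:

```python
import numpy as np

# Hypothetical cosine distances between consecutive sentences.
distances = np.array([0.05, 0.06, 0.07, 0.05, 0.90, 0.06, 0.04, 0.08, 0.85, 0.05])

# Split wherever a distance exceeds the mean by 3 standard deviations
# (3 is the default breakpoint_threshold_amount for this type).
threshold = distances.mean() + 3 * distances.std()
breakpoints = [i for i, d in enumerate(distances) if d > threshold]
print(threshold, breakpoints)  # no distance clears the 3-sigma bar here
```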
Method 3: Interquartile
The interquartile method uses the interquartile range (IQR) of the distances to determine split points: any distance above the mean by more than a scaled IQR becomes a breakpoint.
text_splitter = SemanticChunker(OpenAIEmbeddings(), breakpoint_threshold_type="interquartile")
docs = text_splitter.create_documents([state_of_the_union])
print(docs[0].page_content)
print(len(docs))
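As a numeric sketch with hypothetical distances (1.5 is the default scaling factor for this type at the time of writing), the threshold is the mean plus a scaled IQR, which here flags both large jumps:

```python
import numpy as np

# Hypothetical cosine distances between consecutive sentences.
distances = np.array([0.05, 0.06, 0.07, 0.05, 0.90, 0.06, 0.04, 0.08, 0.85, 0.05])

# Split wherever a distance exceeds the mean plus 1.5 times the
# interquartile range of all distances.
q1, q3 = np.percentile(distances, [25, 75])
threshold = distances.mean() + 1.5 * (q3 - q1)
breakpoints = [i for i, d in enumerate(distances) if d > threshold]
print(threshold, breakpoints)  # both large jumps exceed the threshold
```

Because the IQR ignores outliers when measuring spread, this method usually splits more aggressively than the standard-deviation method on the same data.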
Advanced Configurations
Custom Breakpoint Thresholds
For more control, you can adjust the cutoff with the breakpoint_threshold_amount parameter. Its meaning depends on breakpoint_threshold_type: a percentile for "percentile", a number of standard deviations for "standard_deviation", and an IQR scaling factor for "interquartile".
text_splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=90.0,
)
docs = text_splitter.create_documents([state_of_the_union])
print(docs[0].page_content)
Analyzing Results
Finally, you can analyze all the generated chunks to understand the segmentation better.
for doc in docs:
    print(doc.page_content)
Complete Code
Here's the complete code for reference:
# Install Dependencies
!pip install --quiet langchain_experimental langchain_openai
# Load Example Data
with open("../../state_of_the_union.txt") as f:
    state_of_the_union = f.read()
# Create Text Splitter
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings
text_splitter = SemanticChunker(OpenAIEmbeddings())
# Split Text
docs = text_splitter.create_documents([state_of_the_union])
print(docs[0].page_content)
# Percentile Method
text_splitter = SemanticChunker(OpenAIEmbeddings(), breakpoint_threshold_type="percentile")
docs = text_splitter.create_documents([state_of_the_union])
print(docs[0].page_content)
print(len(docs))
# Standard Deviation Method
text_splitter = SemanticChunker(OpenAIEmbeddings(), breakpoint_threshold_type="standard_deviation")
docs = text_splitter.create_documents([state_of_the_union])
print(docs[0].page_content)
print(len(docs))
# Interquartile Method
text_splitter = SemanticChunker(OpenAIEmbeddings(), breakpoint_threshold_type="interquartile")
docs = text_splitter.create_documents([state_of_the_union])
print(docs[0].page_content)
print(len(docs))
# Custom Breakpoint Threshold (example)
text_splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=90.0,
)
docs = text_splitter.create_documents([state_of_the_union])
print(docs[0].page_content)
# Analyzing Results
for doc in docs:
    print(doc.page_content)
This tutorial provides a structured approach to implementing semantic chunking using the langchain_experimental library. Experiment with different methods and configurations to optimize text segmentation for your needs.