ETL Pipeline
The Extract, Transform, and Load (ETL) framework serves as the backbone of data processing within the Retrieval Augmented Generation (RAG) use case.
The ETL pipeline orchestrates the flow from raw data sources to a structured vector store, ensuring data is in the optimal format for retrieval by the AI model.
The RAG use case augments the capabilities of generative models by retrieving relevant information from a body of data and using it to enhance the quality and relevance of the generated output.
API Overview
There are three main components of the ETL pipeline:

- DocumentReader, which implements Supplier<List<Document>>
- DocumentTransformer, which implements Function<List<Document>, List<Document>>
- DocumentWriter, which implements Consumer<List<Document>>
The Document class contains text and metadata and is created from PDFs, text files, and other document types via the DocumentReader.
To construct a simple ETL pipeline, you can chain together an instance of each type.
Let’s say we have the following instances of those three ETL types:

- PagePdfDocumentReader, an implementation of DocumentReader
- TokenTextSplitter, an implementation of DocumentTransformer
- VectorStore, an implementation of DocumentWriter
To perform a basic load of data into a vector database for use with the Retrieval Augmented Generation pattern, use the following code in Java functional-style syntax:
vectorStore.accept(tokenTextSplitter.apply(pdfReader.get()));
Alternatively, you can use method names that are more naturally expressive for the domain:
vectorStore.write(tokenTextSplitter.split(pdfReader.read()));
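Under the hood, this chain is plain Supplier/Function/Consumer composition. The following self-contained sketch illustrates the same wiring; it uses a simplified stand-in Document record and hypothetical in-memory components rather than Spring AI's actual classes:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Supplier;

// Simplified stand-in for Spring AI's Document class.
record Document(String text) {}

class EtlPipelineSketch {

    // Runs a miniature extract -> transform -> load chain and returns the sink.
    static List<Document> runPipeline() {
        // Extract: a Supplier<List<Document>> playing the DocumentReader role.
        Supplier<List<Document>> reader =
                () -> List.of(new Document("alpha beta"), new Document("gamma delta"));

        // Transform: a Function playing the DocumentTransformer role;
        // here it splits each document into one document per word.
        Function<List<Document>, List<Document>> splitter = docs -> {
            List<Document> out = new ArrayList<>();
            for (Document doc : docs) {
                for (String word : doc.text().split("\\s+")) {
                    out.add(new Document(word));
                }
            }
            return out;
        };

        // Load: a Consumer playing the DocumentWriter role.
        List<Document> store = new ArrayList<>();
        Consumer<List<Document>> writer = store::addAll;

        // Same shape as vectorStore.accept(tokenTextSplitter.apply(pdfReader.get())).
        writer.accept(splitter.apply(reader.get()));
        return store;
    }

    public static void main(String[] args) {
        System.out.println(runPipeline().size());
    }
}
```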
Getting Started
To begin creating a Spring AI RAG application, follow these steps:
- Download the latest Spring CLI Release and follow the installation instructions.
- To create a simple OpenAI-based application, use the command:
  spring boot new --from ai-rag --name myrag
- Consult the generated README.md file for guidance on obtaining an OpenAI API Key and running your first AI RAG application.
ETL Interfaces and Implementations
The ETL pipeline is composed of the following interfaces and implementations. A detailed ETL class diagram is shown in the ETL Class Diagram section.
DocumentReader
Provides a source of documents from diverse origins.
public interface DocumentReader extends Supplier<List<Document>> {
default List<Document> read() {
return get();
}
}
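Implementing a custom reader only requires providing get(). Here is a minimal, self-contained sketch; the DocumentReader interface is reproduced from above for completeness, the Document record is a simplified stand-in for Spring AI's class, and the reader itself (one document per non-blank line of a string) is hypothetical:

```java
import java.util.List;
import java.util.function.Supplier;

// Simplified stand-in for Spring AI's Document class.
record Document(String text) {}

// Mirrors the DocumentReader contract: Supplier<List<Document>> plus a read() alias.
interface DocumentReader extends Supplier<List<Document>> {
    default List<Document> read() {
        return get();
    }
}

// Hypothetical reader that produces one Document per non-blank line of a string.
class StringLineDocumentReader implements DocumentReader {

    private final String source;

    StringLineDocumentReader(String source) {
        this.source = source;
    }

    @Override
    public List<Document> get() {
        return source.lines()
                .filter(line -> !line.isBlank())
                .map(Document::new)
                .toList();
    }
}
```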
JsonReader
The JsonReader parses documents in JSON format.
Example:
@Component
class MyAiAppComponent {
private final Resource resource;
MyAiAppComponent(@Value("classpath:bikes.json") Resource resource) {
this.resource = resource;
}
List<Document> loadJsonAsDocuments() {
JsonReader jsonReader = new JsonReader(resource, "description");
return jsonReader.read();
}
}
TextReader
The TextReader
processes plain text documents.
Example:
@Component
class MyTextReader {
private final Resource resource;
MyTextReader(@Value("classpath:text-source.txt") Resource resource) {
this.resource = resource;
}
List<Document> loadText() {
TextReader textReader = new TextReader(resource);
textReader.getCustomMetadata().put("filename", "text-source.txt");
return textReader.read();
}
}
PagePdfDocumentReader
The PagePdfDocumentReader uses the Apache PdfBox library to parse PDF documents.
Example:
@Component
public class MyPagePdfDocumentReader {
List<Document> getDocsFromPdf() {
PagePdfDocumentReader pdfReader = new PagePdfDocumentReader("classpath:/sample1.pdf",
PdfDocumentReaderConfig.builder()
.withPageTopMargin(0)
.withPageExtractedTextFormatter(ExtractedTextFormatter.builder()
.withNumberOfTopTextLinesToDelete(0)
.build())
.withPagesPerDocument(1)
.build());
return pdfReader.read();
}
}
ParagraphPdfDocumentReader
The ParagraphPdfDocumentReader
uses the PDF catalog (e.g. TOC) information to split the input PDF into text paragraphs and output a single Document
per paragraph.
NOTE: Not all PDF documents contain the PDF catalog.
Example:
@Component
public class MyParagraphPdfDocumentReader {
List<Document> getDocsFromPdfWithCatalog() {
PdfDocumentReader pdfReader = new ParagraphPdfDocumentReader("classpath:/sample1.pdf",
PdfDocumentReaderConfig.builder()
.withPageTopMargin(0)
.withPageExtractedTextFormatter(ExtractedTextFormatter.builder()
.withNumberOfTopTextLinesToDelete(0)
.build())
.withPagesPerDocument(1)
.build());
return pdfReader.read();
}
}
TikaDocumentReader
The TikaDocumentReader
uses Apache Tika to extract text from a variety of document formats, such as PDF, DOC/DOCX, PPT/PPTX, and HTML. For a comprehensive list of supported formats, refer to the Tika documentation.
Example:
@Component
class MyTikaDocumentReader {
private final Resource resource;
MyTikaDocumentReader(@Value("classpath:/word-sample.docx")
Resource resource) {
this.resource = resource;
}
List<Document> loadText() {
TikaDocumentReader tikaDocumentReader = new TikaDocumentReader(resource);
return tikaDocumentReader.read();
}
}
DocumentTransformer
Transforms a batch of documents as part of the processing workflow.
public interface DocumentTransformer extends Function<List<Document>, List<Document>> {
default List<Document> transform(List<Document> documents) {
return apply(documents);
}
}
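A custom transformer only needs to implement apply(). The sketch below is self-contained: the Document record is a simplified stand-in for Spring AI's class, the interface mirrors the contract above, and the chunking transformer (splitting on whitespace-separated words) is a hypothetical, much cruder cousin of TokenTextSplitter:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Simplified stand-in for Spring AI's Document class.
record Document(String text) {}

// Mirrors the DocumentTransformer contract.
interface DocumentTransformer extends Function<List<Document>, List<Document>> {
    default List<Document> transform(List<Document> documents) {
        return apply(documents);
    }
}

// Hypothetical transformer that splits each document into chunks of at most
// maxWords whitespace-separated words (a crude stand-in for token splitting).
class WordChunkSplitter implements DocumentTransformer {

    private final int maxWords;

    WordChunkSplitter(int maxWords) {
        this.maxWords = maxWords;
    }

    @Override
    public List<Document> apply(List<Document> documents) {
        List<Document> chunks = new ArrayList<>();
        for (Document doc : documents) {
            String[] words = doc.text().split("\\s+");
            for (int i = 0; i < words.length; i += maxWords) {
                int end = Math.min(i + maxWords, words.length);
                chunks.add(new Document(String.join(" ", List.of(words).subList(i, end))));
            }
        }
        return chunks;
    }
}
```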
DocumentWriter
Manages the final stage of the ETL process, preparing documents for storage.
public interface DocumentWriter extends Consumer<List<Document>> {
default void write(List<Document> documents) {
accept(documents);
}
}
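A custom writer only needs to implement accept(). The following self-contained sketch collects documents in memory, which can be handy as a test double in place of a real vector store; the Document record is a simplified stand-in for Spring AI's class and the writer itself is hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Simplified stand-in for Spring AI's Document class.
record Document(String text) {}

// Mirrors the DocumentWriter contract.
interface DocumentWriter extends Consumer<List<Document>> {
    default void write(List<Document> documents) {
        accept(documents);
    }
}

// Hypothetical writer that collects documents in memory, e.g. for testing
// an ETL pipeline without a real vector store.
class InMemoryDocumentWriter implements DocumentWriter {

    private final List<Document> sink = new ArrayList<>();

    @Override
    public void accept(List<Document> documents) {
        sink.addAll(documents);
    }

    List<Document> written() {
        return List.copyOf(sink);
    }
}
```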
VectorStore
Provides integration with various vector stores. See Vector DB Documentation for a full listing.