ETL Pipeline

The Extract, Transform, and Load (ETL) framework serves as the backbone of data processing within the Retrieval Augmented Generation (RAG) use case.

The ETL pipeline orchestrates the flow from raw data sources to a structured vector store, ensuring data is in the optimal format for retrieval by the AI model.

The RAG use case augments the capabilities of generative models by retrieving relevant information from a body of data and using it to enhance the quality and relevance of the generated output.

API Overview

There are three main components of the ETL pipeline:

  • DocumentReader that implements Supplier<List<Document>>

  • DocumentTransformer that implements Function<List<Document>, List<Document>>

  • DocumentWriter that implements Consumer<List<Document>>

The Document class contains text and metadata and is created from PDFs, text files and other document types via the DocumentReader.
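
For illustration, a Document can also be constructed directly. This is a minimal sketch, assuming the two-argument constructor that takes the text content and a metadata map:

Document document = new Document("Spring AI provides an ETL framework.",
		Map.of("source", "inline-example")); // "source" is a hypothetical metadata key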

To construct a simple ETL pipeline, you can chain together an instance of each type.

Let’s say we have the following instances of those three ETL types:

  • PagePdfDocumentReader an implementation of DocumentReader

  • TokenTextSplitter an implementation of DocumentTransformer

  • VectorStore an implementation of DocumentWriter

To perform the basic loading of data into a Vector Database for use with the Retrieval Augmented Generation pattern, use the following code.

vectorStore.accept(tokenTextSplitter.apply(pdfReader.get()));

Getting Started

To begin creating a Spring AI RAG application, follow these steps:

  1. Download the latest Spring CLI Release and follow the installation instructions.

  2. To create a simple OpenAI-based application, use the command:

    spring boot new --from ai-rag --name myrag

  3. Consult the generated README.md file for guidance on obtaining an OpenAI API Key and running your first AI RAG application.

ETL Interfaces and Implementations

DocumentReader

Provides a source of documents from diverse origins.

public interface DocumentReader extends Supplier<List<Document>> {

}
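
Any data source can be adapted by implementing this interface. As a minimal, hypothetical sketch, a reader that serves in-memory documents:

public class InMemoryDocumentReader implements DocumentReader {

	@Override
	public List<Document> get() {
		// Wrap static strings in Documents; a real reader would pull from files, APIs, etc.
		return List.of(new Document("Spring AI provides an ETL framework."));
	}
}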

JsonReader

The JsonReader parses documents in JSON format. The key names passed to its constructor (such as "description" in the example below) select which JSON fields to use as document content.

Example:

@Component
public class MyAiApp {

	@Value("classpath:bikes.json") // This is the json document to load
	private Resource resource;

	List<Document> loadJsonAsDocuments() {
		JsonReader jsonReader = new JsonReader(resource, "description");
		return jsonReader.get();
	}
}

TextReader

The TextReader processes plain text documents.

Example:

@Component
public class MyTextReader {

	@Value("classpath:text-source.txt") // This is the text document to load
	private Resource resource;

	List<Document> loadText() {
		TextReader textReader = new TextReader(resource);
		textReader.getCustomMetadata().put("filename", "text-source.txt");

		return textReader.get();
	}
}

PagePdfDocumentReader

The PagePdfDocumentReader uses the Apache PdfBox library to parse PDF documents.

Example:

@Component
public class MyPagePdfDocumentReader {

	List<Document> getDocsFromPdf() {

		PagePdfDocumentReader pdfReader = new PagePdfDocumentReader("classpath:/sample1.pdf",
				PdfDocumentReaderConfig.builder()
					.withPageTopMargin(0)
					.withPageExtractedTextFormatter(ExtractedTextFormatter.builder()
						.withNumberOfTopTextLinesToDelete(0)
						.build())
					.withPagesPerDocument(1)
					.build());

		return pdfReader.get();
    }

}

ParagraphPdfDocumentReader

The ParagraphPdfDocumentReader uses the PDF catalog (e.g., the table of contents) information to split the input PDF into text paragraphs and outputs a single Document per paragraph. NOTE: Not all PDF documents contain the PDF catalog.

Example:

@Component
public class MyParagraphPdfDocumentReader {

	List<Document> getDocsFromPdfWithCatalog() {

		ParagraphPdfDocumentReader pdfReader = new ParagraphPdfDocumentReader("classpath:/sample1.pdf",
				PdfDocumentReaderConfig.builder()
					.withPageTopMargin(0)
					.withPageExtractedTextFormatter(ExtractedTextFormatter.builder()
						.withNumberOfTopTextLinesToDelete(0)
						.build())
					.withPagesPerDocument(1)
					.build());

		return pdfReader.get();
	}
}

TikaDocumentReader

The TikaDocumentReader uses Apache Tika to extract text from a variety of document formats, such as PDF, DOC/DOCX, PPT/PPTX, and HTML. For a comprehensive list of supported formats, refer to the Tika documentation.

Example:

@Component
public class MyTikaDocumentReader {

	@Value("classpath:/word-sample.docx") // This is the Word document to load
	private Resource resource;

	List<Document> loadText() {
		TikaDocumentReader tikaDocumentReader = new TikaDocumentReader(resource);
		return tikaDocumentReader.get();
	}
}

DocumentTransformer

Transforms a batch of documents as part of the processing workflow.

public interface DocumentTransformer extends Function<List<Document>, List<Document>> {

}
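
Custom transformations are likewise just implementations of this interface. A hypothetical sketch that lowercases document text, assuming Document exposes getContent() and getMetadata() accessors:

public class LowerCaseTransformer implements DocumentTransformer {

	@Override
	public List<Document> apply(List<Document> documents) {
		// Build new Documents with normalized content, preserving metadata.
		return documents.stream()
			.map(doc -> new Document(doc.getContent().toLowerCase(), doc.getMetadata()))
			.toList();
	}
}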

TextSplitter

The TextSplitter is an abstract base class that helps divide documents to fit the AI model’s context window.

TokenTextSplitter

Splits documents while preserving token-level integrity.
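
A minimal sketch of its use, assuming documents were produced by one of the readers above:

TokenTextSplitter tokenTextSplitter = new TokenTextSplitter();
List<Document> chunks = tokenTextSplitter.apply(documents); // one large Document in, many smaller chunks out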

ContentFormatTransformer

Ensures uniform content formats across all documents.
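
A sketch of its use, assuming the transformer is constructed with a ContentFormatter such as DefaultContentFormatter:

ContentFormatTransformer contentFormatTransformer =
		new ContentFormatTransformer(DefaultContentFormatter.defaultConfig());
List<Document> formatted = contentFormatTransformer.apply(documents);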

KeywordMetadataEnricher

Augments documents with essential keyword metadata.
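
A sketch of its use, assuming the enricher is constructed with a ChatClient and the number of keywords to extract; the model generates the keywords, which are stored in each document's metadata:

KeywordMetadataEnricher keywordEnricher = new KeywordMetadataEnricher(chatClient, 5);
List<Document> enriched = keywordEnricher.apply(documents);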

SummaryMetadataEnricher

Enriches documents with summarization metadata for enhanced retrieval.
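
A sketch of its use, assuming a constructor that takes a ChatClient and the summary types to generate (for the current, previous, and/or next document):

SummaryMetadataEnricher summaryEnricher = new SummaryMetadataEnricher(chatClient,
		List.of(SummaryMetadataEnricher.SummaryType.CURRENT));
List<Document> enriched = summaryEnricher.apply(documents);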

DocumentWriter

Manages the final stage of the ETL process, preparing documents for storage.

public interface DocumentWriter extends Consumer<List<Document>> {

}
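
Beyond vector stores, any sink can implement this interface. A hypothetical sketch that writes document text to standard output, assuming Document exposes a getContent() accessor:

public class ConsoleDocumentWriter implements DocumentWriter {

	@Override
	public void accept(List<Document> documents) {
		// Print each document's content; a real writer would persist it to a store.
		documents.forEach(document -> System.out.println(document.getContent()));
	}
}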

Available Implementations

There is an implementation for each of the Vector Stores that Spring AI supports, e.g. PineconeVectorStore.

See Vector DB Documentation for a full listing.