ETL Pipeline

The Extract, Transform, and Load (ETL) framework serves as the backbone of data processing within the Retrieval Augmented Generation (RAG) use case.

The ETL pipeline orchestrates the flow from raw data sources to a structured vector store, ensuring data is in the optimal format for retrieval by the AI model.

The RAG use case augments the capabilities of generative models by retrieving relevant information from a body of data to enhance the quality and relevance of the generated output.

API Overview

There are three main components of the ETL pipeline:

  • DocumentReader that implements Supplier<List<Document>>

  • DocumentTransformer that implements Function<List<Document>, List<Document>>

  • DocumentWriter that implements Consumer<List<Document>>

The Document class contains text and metadata and is created from PDFs, text files and other document types via the DocumentReader.
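
Outside of a reader, a Document can also be constructed directly. Below is a minimal sketch, assuming the Document(String, Map<String, Object>) constructor; the content and metadata values are illustrative:

import java.util.Map;

import org.springframework.ai.document.Document;

class DocumentCreationExample {

    Document createDocument() {
        // Text content plus arbitrary metadata key/value pairs
        return new Document("Spring AI provides an ETL pipeline for RAG.",
                Map.of("source", "example"));
    }
}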

To construct a simple ETL pipeline, you can chain together an instance of each type.

[Diagram: ETL pipeline]

Let’s say we have the following instances of those three ETL types:

  • PagePdfDocumentReader an implementation of DocumentReader

  • TokenTextSplitter an implementation of DocumentTransformer

  • VectorStore an implementation of DocumentWriter

To perform the basic loading of data into a vector database for use with the Retrieval Augmented Generation pattern, use the following code in Java functional-style syntax:

vectorStore.accept(tokenTextSplitter.apply(pdfReader.get()));

Alternatively, you can use method names that are more naturally expressive for the domain:

vectorStore.write(tokenTextSplitter.split(pdfReader.read()));
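
Assembled as a Spring component, the same pipeline might look like the following sketch; the bean wiring, class name, and file name are illustrative, and it assumes the single-argument PagePdfDocumentReader constructor with default configuration:

import org.springframework.ai.reader.pdf.PagePdfDocumentReader;
import org.springframework.ai.transformer.splitter.TokenTextSplitter;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.stereotype.Component;

@Component
class RagDataLoader {

    private final VectorStore vectorStore;

    RagDataLoader(VectorStore vectorStore) {
        this.vectorStore = vectorStore;
    }

    void load() {
        // Extract: read each PDF page as a Document
        PagePdfDocumentReader pdfReader = new PagePdfDocumentReader("classpath:/sample1.pdf");
        // Transform: split pages into token-sized chunks
        TokenTextSplitter splitter = new TokenTextSplitter();
        // Load: embed and store the chunks in the vector database
        vectorStore.write(splitter.split(pdfReader.read()));
    }
}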

Getting Started

To begin creating a Spring AI RAG application, follow these steps:

  1. Download the latest Spring CLI Release and follow the installation instructions.

  2. To create a simple OpenAI-based application, use the command:

    spring boot new --from ai-rag --name myrag

  3. Consult the generated README.md file for guidance on obtaining an OpenAI API Key and running your first AI RAG application.

ETL Interfaces and Implementations

The ETL pipeline is composed of the following interfaces and implementations. A detailed class diagram is shown in the ETL Class Diagram section.

DocumentReader

Provides a source of documents from diverse origins.

public interface DocumentReader extends Supplier<List<Document>> {

    default List<Document> read() {
        return get();
    }
}
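
Because DocumentReader is a Supplier<List<Document>>, a custom reader only needs to implement get(). Here is a minimal sketch; the class below is hypothetical and not part of Spring AI:

import java.util.List;

import org.springframework.ai.document.Document;
import org.springframework.ai.document.DocumentReader;

class InMemoryDocumentReader implements DocumentReader {

    private final String text;

    InMemoryDocumentReader(String text) {
        this.text = text;
    }

    @Override
    public List<Document> get() {
        // Wrap the in-memory string in a single Document
        return List.of(new Document(text));
    }
}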

JsonReader

The JsonReader parses documents in JSON format.

Example:

@Component
class MyAiAppComponent {

    private final Resource resource;

    MyAiAppComponent(@Value("classpath:bikes.json") Resource resource) {
        this.resource = resource;
    }

    List<Document> loadJsonAsDocuments() {
        // "description" names the JSON field whose value becomes the document text
        JsonReader jsonReader = new JsonReader(resource, "description");
        return jsonReader.read();
    }
}

TextReader

The TextReader processes plain text documents.

Example:

@Component
class MyTextReader {

    private final Resource resource;

    MyTextReader(@Value("classpath:text-source.txt") Resource resource) {
        this.resource = resource;
    }

    List<Document> loadText() {
        TextReader textReader = new TextReader(resource);
        // Custom metadata is copied onto every Document the reader produces
        textReader.getCustomMetadata().put("filename", "text-source.txt");
        return textReader.read();
    }
}

PagePdfDocumentReader

The PagePdfDocumentReader uses the Apache PdfBox library to parse PDF documents.

Example:

@Component
public class MyPagePdfDocumentReader {

    List<Document> getDocsFromPdf() {
        PagePdfDocumentReader pdfReader = new PagePdfDocumentReader("classpath:/sample1.pdf",
                PdfDocumentReaderConfig.builder()
                    .withPageTopMargin(0)
                    .withPageExtractedTextFormatter(ExtractedTextFormatter.builder()
                        .withNumberOfTopTextLinesToDelete(0)
                        .build())
                    // Emit one Document per PDF page
                    .withPagesPerDocument(1)
                    .build());

        return pdfReader.read();
    }
}

ParagraphPdfDocumentReader

The ParagraphPdfDocumentReader uses the PDF catalog (e.g., the table of contents) to split the input PDF into text paragraphs and outputs a single Document per paragraph. NOTE: Not all PDF documents contain a PDF catalog.

Example:

@Component
public class MyParagraphPdfDocumentReader {

    List<Document> getDocsFromPdfWithCatalog() {
        // Requires a PDF that contains a catalog; see the note above
        ParagraphPdfDocumentReader pdfReader = new ParagraphPdfDocumentReader("classpath:/sample1.pdf",
                PdfDocumentReaderConfig.builder()
                    .withPageTopMargin(0)
                    .withPageExtractedTextFormatter(ExtractedTextFormatter.builder()
                        .withNumberOfTopTextLinesToDelete(0)
                        .build())
                    .withPagesPerDocument(1)
                    .build());

        return pdfReader.read();
    }
}

TikaDocumentReader

The TikaDocumentReader uses Apache Tika to extract text from a variety of document formats, such as PDF, DOC/DOCX, PPT/PPTX, and HTML. For a comprehensive list of supported formats, refer to the Tika documentation.

Example:

@Component
class MyTikaDocumentReader {

    private final Resource resource;

    MyTikaDocumentReader(@Value("classpath:/word-sample.docx") Resource resource) {
        this.resource = resource;
    }

    List<Document> loadText() {
        TikaDocumentReader tikaDocumentReader = new TikaDocumentReader(resource);
        return tikaDocumentReader.read();
    }
}

DocumentTransformer

Transforms a batch of documents as part of the processing workflow.

public interface DocumentTransformer extends Function<List<Document>, List<Document>> {

    default List<Document> transform(List<Document> documents) {
        return apply(documents);
    }
}
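
Likewise, a custom transformer only needs to implement apply. Here is a minimal sketch of a hypothetical transformer that stamps each document with a metadata entry; it assumes getMetadata() returns a mutable map, which holds for the default Document implementation:

import java.util.List;

import org.springframework.ai.document.Document;
import org.springframework.ai.document.DocumentTransformer;

class TimestampTransformer implements DocumentTransformer {

    @Override
    public List<Document> apply(List<Document> documents) {
        // Stamp each document, then pass the batch through otherwise unchanged
        documents.forEach(doc ->
                doc.getMetadata().put("processedAt", System.currentTimeMillis()));
        return documents;
    }
}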

TextSplitter

The TextSplitter is an abstract base class that helps divide documents to fit the AI model’s context window.

TokenTextSplitter

Splits documents while preserving token-level integrity.
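
A minimal usage sketch, assuming the no-argument constructor and its default chunking settings:

import java.util.List;

import org.springframework.ai.document.Document;
import org.springframework.ai.transformer.splitter.TokenTextSplitter;

class TokenSplitterExample {

    List<Document> split(List<Document> documents) {
        // Chunk boundaries are computed in tokens rather than characters
        TokenTextSplitter splitter = new TokenTextSplitter();
        return splitter.split(documents);
    }
}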

ContentFormatTransformer

Ensures uniform content formats across all documents.

KeywordMetadataEnricher

Augments documents with essential keyword metadata.

SummaryMetadataEnricher

Enriches documents with summarization metadata for enhanced retrieval.

DocumentWriter

Manages the final stage of the ETL process, preparing documents for storage.

public interface DocumentWriter extends Consumer<List<Document>> {

    default void write(List<Document> documents) {
        accept(documents);
    }
}
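
A custom writer only needs to implement accept. Here is a minimal sketch of a hypothetical writer that prints each document; a real writer would persist the batch somewhere durable:

import java.util.List;

import org.springframework.ai.document.Document;
import org.springframework.ai.document.DocumentWriter;

class ConsoleDocumentWriter implements DocumentWriter {

    @Override
    public void accept(List<Document> documents) {
        // Illustrative only: dump each document to standard output
        documents.forEach(System.out::println);
    }
}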

FileDocumentWriter

Persists documents to a file.
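
A minimal usage sketch, assuming a single-argument constructor that takes the output file name; the file name here is illustrative:

import java.util.List;

import org.springframework.ai.document.Document;
import org.springframework.ai.writer.FileDocumentWriter;

class FileWriterExample {

    void writeToFile(List<Document> documents) {
        // Writes the documents' content to the named file
        FileDocumentWriter writer = new FileDocumentWriter("output.txt");
        writer.write(documents);
    }
}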

VectorStore

Provides integration with various vector stores. See Vector DB Documentation for a full listing.
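
Since VectorStore implements DocumentWriter, either accept or the write alias shown earlier can terminate the pipeline. A minimal sketch, assuming a VectorStore instance configured by your chosen vector database integration:

import java.util.List;

import org.springframework.ai.document.Document;
import org.springframework.ai.vectorstore.VectorStore;

class VectorStoreWriteExample {

    void store(VectorStore vectorStore, List<Document> documents) {
        // Embeds the documents and persists them in the backing vector store
        vectorStore.write(documents);
    }
}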

ETL Class Diagram

The following class diagram illustrates the ETL interfaces and implementations.

[Diagram: ETL interfaces and implementations]