Class ParagraphPdfDocumentReader

java.lang.Object
org.springframework.ai.reader.pdf.ParagraphPdfDocumentReader
All Implemented Interfaces:
Supplier<List<Document>>, DocumentReader

public class ParagraphPdfDocumentReader extends Object implements DocumentReader
Uses the PDF catalog information (for example, the table of contents) to split the input PDF into text paragraphs, and outputs a single Document per paragraph. The class reads and processes PDF documents using the Apache PDFBox library, which parses the PDF content and converts it into text paragraphs. Each paragraph is returned as its own Document object.
Author:
Christian Tzolov
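The catalog-driven splitting described above can be pictured with a small stand-alone sketch. It does not use Spring AI or PDFBox, and every name in it (TocSplitSketch, TocEntry, Paragraph, split) is hypothetical. It assumes each table-of-contents entry records the page where its section starts and that sections begin on page boundaries; the actual reader works from the PDF outline and may split at a finer granularity than whole pages.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Stand-alone illustration only; not the Spring AI implementation. */
class TocSplitSketch {

    /** Hypothetical TOC entry: a title and the 1-based page where the section starts. */
    static final class TocEntry {
        final String title;
        final int startPage;

        TocEntry(String title, int startPage) {
            this.title = title;
            this.startPage = startPage;
        }
    }

    /** Hypothetical output unit, standing in for "one Document per paragraph". */
    static final class Paragraph {
        final String title;
        final int startPage;
        final int endPage;
        final String text;

        Paragraph(String title, int startPage, int endPage, String text) {
            this.title = title;
            this.startPage = startPage;
            this.endPage = endPage;
            this.text = text;
        }
    }

    /**
     * Produces one Paragraph per TOC entry. A section runs from its start page to the
     * page before the next entry starts, or to the last page for the final entry.
     * Assumes entries are sorted by startPage and refer to existing pages.
     */
    static List<Paragraph> split(List<TocEntry> toc, List<String> pages) {
        List<Paragraph> result = new ArrayList<>();
        for (int i = 0; i < toc.size(); i++) {
            TocEntry entry = toc.get(i);
            int start = entry.startPage;
            int end = (i + 1 < toc.size()) ? toc.get(i + 1).startPage - 1 : pages.size();
            if (end < start) {
                end = start; // two entries starting on the same page share that page
            }
            StringBuilder text = new StringBuilder();
            for (int p = start; p <= end; p++) {
                if (text.length() > 0) {
                    text.append('\n');
                }
                text.append(pages.get(p - 1)); // pages are 1-based in the TOC
            }
            result.add(new Paragraph(entry.title, start, end, text.toString()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<TocEntry> toc = Arrays.asList(new TocEntry("Intro", 1), new TocEntry("Usage", 3));
        List<String> pages = Arrays.asList("page one", "page two", "page three");
        for (Paragraph p : split(toc, pages)) {
            System.out.println(p.title + " (pages " + p.startPage + "-" + p.endPage + ")");
        }
    }
}
```

In this simplified model, a three-page input with entries starting on pages 1 and 3 yields two paragraphs: one covering pages 1 to 2 and one covering page 3.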
  • Field Details

    • document

      protected final org.apache.pdfbox.pdmodel.PDDocument document
    • resourceFileName

      protected String resourceFileName
  • Constructor Details

    • ParagraphPdfDocumentReader

      public ParagraphPdfDocumentReader(String resourceUrl)
      Constructs a ParagraphPdfDocumentReader using a resource URL.
      Parameters:
      resourceUrl - The URL of the PDF resource.
    • ParagraphPdfDocumentReader

      public ParagraphPdfDocumentReader(org.springframework.core.io.Resource pdfResource)
      Constructs a ParagraphPdfDocumentReader using a resource.
      Parameters:
      pdfResource - The PDF resource.
    • ParagraphPdfDocumentReader

      public ParagraphPdfDocumentReader(String resourceUrl, PdfDocumentReaderConfig config)
      Constructs a ParagraphPdfDocumentReader using a resource URL and a configuration.
      Parameters:
      resourceUrl - The URL of the PDF resource.
      config - The configuration for PDF document processing.
    • ParagraphPdfDocumentReader

      public ParagraphPdfDocumentReader(org.springframework.core.io.Resource pdfResource, PdfDocumentReaderConfig config)
      Constructs a ParagraphPdfDocumentReader using a resource and a configuration.
      Parameters:
      pdfResource - The PDF resource.
      config - The configuration for PDF document processing.
  • Method Details