Class ParagraphPdfDocumentReader
java.lang.Object
org.springframework.ai.reader.pdf.ParagraphPdfDocumentReader
All Implemented Interfaces:
- Supplier<List<Document>>
- DocumentReader
Uses the PDF catalog (e.g. TOC) information to split the input PDF into text paragraphs and outputs a single Document per paragraph.
This class provides methods for reading and processing PDF documents. It uses the Apache PDFBox library to parse PDF content and convert it into text paragraphs. The paragraphs are grouped into Document objects.
Author:
- Christian Tzolov, Heonwoo Kim
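The TOC-driven splitting described above can be pictured with a small, dependency-free sketch. It is not the library's implementation: TocEntry, Chunk, textBetween, and split are hypothetical stand-ins, and plain text lines take the place of PDF pages. Each outline entry marks where a paragraph starts, the text up to the next entry becomes one chunk, and the last entry runs to the end of the input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TocSplitSketch {

    /** Hypothetical stand-in for a TOC entry: a title plus the index of its first line. */
    static final class TocEntry {
        final String title;
        final int startLine;

        TocEntry(String title, int startLine) {
            this.title = title;
            this.startLine = startLine;
        }
    }

    /** Hypothetical stand-in for the per-paragraph output document. */
    static final class Chunk {
        final String title;
        final String text;

        Chunk(String title, String text) {
            this.title = title;
            this.text = text;
        }
    }

    /** Joins the lines from the start of 'from' up to, but excluding, the start of 'to'. */
    static String textBetween(List<String> lines, TocEntry from, TocEntry to) {
        return String.join("\n", lines.subList(from.startLine, to.startLine));
    }

    /** Emits one chunk per TOC entry, skipping entries whose text is empty. */
    static List<Chunk> split(List<String> lines, List<TocEntry> toc) {
        List<Chunk> chunks = new ArrayList<>();
        for (int i = 0; i < toc.size(); i++) {
            TocEntry from = toc.get(i);
            // The last entry has no successor, so it runs to the end of the input.
            TocEntry to = (i + 1 < toc.size()) ? toc.get(i + 1) : new TocEntry("(end)", lines.size());
            String text = textBetween(lines, from, to);
            if (!text.trim().isEmpty()) {
                chunks.add(new Chunk(from.title, text));
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("Intro", "Welcome.", "Setup", "Step one.", "Step two.");
        List<TocEntry> toc = Arrays.asList(new TocEntry("Intro", 0), new TocEntry("Setup", 2));
        for (Chunk chunk : split(lines, toc)) {
            System.out.println("[" + chunk.title + "] " + chunk.text.replace("\n", " / "));
        }
    }
}
```

The textBetween helper in this sketch plays a role loosely analogous to getTextBetweenParagraphs: the boundaries come from two neighbouring entries rather than from the text itself.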
Field Summary
Fields (Modifier and Type, Field):
- protected final org.apache.pdfbox.pdmodel.PDDocument document
- protected String resourceFileName
Constructor Summary
Constructors (Constructor, Description):
- ParagraphPdfDocumentReader(String resourceUrl): Constructs a ParagraphPdfDocumentReader using a resource URL.
- ParagraphPdfDocumentReader(String resourceUrl, PdfDocumentReaderConfig config): Constructs a ParagraphPdfDocumentReader using a resource URL and a configuration.
- ParagraphPdfDocumentReader(org.springframework.core.io.Resource pdfResource): Constructs a ParagraphPdfDocumentReader using a resource.
- ParagraphPdfDocumentReader(org.springframework.core.io.Resource pdfResource, PdfDocumentReaderConfig config): Constructs a ParagraphPdfDocumentReader using a resource and a configuration.
Method Summary
Methods (Modifier and Type, Method, Description):
- protected void addMetadata(ParagraphManager.Paragraph from, ParagraphManager.Paragraph to, Document document)
- List<Document> get(): Reads and processes the PDF document to extract paragraphs.
- String getTextBetweenParagraphs(ParagraphManager.Paragraph fromParagraph, ParagraphManager.Paragraph toParagraph)
- protected Document toDocument
Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface org.springframework.ai.document.DocumentReader: read
Field Details
document
protected final org.apache.pdfbox.pdmodel.PDDocument document
resourceFileName
protected String resourceFileName
Constructor Details
ParagraphPdfDocumentReader
public ParagraphPdfDocumentReader(String resourceUrl)
Constructs a ParagraphPdfDocumentReader using a resource URL.
Parameters:
- resourceUrl - The URL of the PDF resource.
ParagraphPdfDocumentReader
public ParagraphPdfDocumentReader(org.springframework.core.io.Resource pdfResource)
Constructs a ParagraphPdfDocumentReader using a resource.
Parameters:
- pdfResource - The PDF resource.
ParagraphPdfDocumentReader
public ParagraphPdfDocumentReader(String resourceUrl, PdfDocumentReaderConfig config)
Constructs a ParagraphPdfDocumentReader using a resource URL and a configuration.
Parameters:
- resourceUrl - The URL of the PDF resource.
- config - The configuration for PDF document processing.
ParagraphPdfDocumentReader
public ParagraphPdfDocumentReader(org.springframework.core.io.Resource pdfResource, PdfDocumentReaderConfig config)
Constructs a ParagraphPdfDocumentReader using a resource and a configuration.
Parameters:
- pdfResource - The PDF resource.
- config - The configuration for PDF document processing.
Method Details
get
public List<Document> get()
Reads and processes the PDF document to extract paragraphs.
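Because the class implements Supplier<List<Document>>, callers can treat a reader as a plain supplier and call get() on it. The sketch below illustrates that contract only: SimpleReader, FixedParagraphReader, and countParagraphs are hypothetical stand-ins, String takes the place of Document to keep the example dependency-free, and the delegation of read() to get() is an assumption rather than something this page states.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

class ReaderContractSketch {

    /** Hypothetical stand-in for DocumentReader; String replaces Document here. */
    interface SimpleReader extends Supplier<List<String>> {
        // Assumption: read() is a convenience alias that delegates to get().
        default List<String> read() {
            return get();
        }
    }

    /** Hypothetical reader that returns one entry per already-split paragraph. */
    static final class FixedParagraphReader implements SimpleReader {
        private final List<String> paragraphs;

        FixedParagraphReader(List<String> paragraphs) {
            this.paragraphs = paragraphs;
        }

        @Override
        public List<String> get() {
            return paragraphs;
        }
    }

    /** Works with any supplier of paragraph lists, not only a specific reader type. */
    static int countParagraphs(Supplier<List<String>> source) {
        return source.get().size();
    }

    public static void main(String[] args) {
        FixedParagraphReader reader = new FixedParagraphReader(Arrays.asList("First.", "Second."));
        System.out.println("paragraphs: " + countParagraphs(reader));
    }
}
```

Coding against Supplier keeps downstream steps independent of how the paragraphs were produced, which is the design choice the implemented interfaces suggest.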
toDocument
addMetadata
protected void addMetadata(ParagraphManager.Paragraph from, ParagraphManager.Paragraph to, Document document)
getTextBetweenParagraphs
public String getTextBetweenParagraphs(ParagraphManager.Paragraph fromParagraph, ParagraphManager.Paragraph toParagraph)