Class ParagraphPdfDocumentReader
java.lang.Object
org.springframework.ai.reader.pdf.ParagraphPdfDocumentReader
- All Implemented Interfaces:
Supplier<List<Document>>, DocumentReader
Uses the PDF catalog (e.g. TOC) information to split the input PDF into text paragraphs and output a single Document per paragraph.
This class provides methods for reading and processing PDF documents. It uses the
Apache PDFBox library for parsing PDF content and converting it into text paragraphs.
The paragraphs are grouped into Document objects.
- Author:
Christian Tzolov
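The TOC-driven splitting described above can be illustrated with a minimal, library-free sketch. Everything in it is a hypothetical stand-in: `TocEntry`, `Chunk`, and `split` are not Spring AI or PDFBox types, and the character offsets replace whatever position information the real PDF outline carries. It only shows the general idea of cutting one text at each TOC boundary and emitting one chunk per section.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins; these are NOT the Spring AI or PDFBox types.
class TocSplitSketch {

    // One TOC entry: a section title and the offset where its text starts.
    record TocEntry(String title, int startOffset) {}

    // One output chunk, analogous to "a single Document per paragraph".
    record Chunk(String title, String text) {}

    // Cut the text at each TOC offset; a section ends where the next one begins.
    static List<Chunk> split(String fullText, List<TocEntry> toc) {
        List<Chunk> chunks = new ArrayList<>();
        for (int i = 0; i < toc.size(); i++) {
            int start = toc.get(i).startOffset();
            int end = (i + 1 < toc.size()) ? toc.get(i + 1).startOffset() : fullText.length();
            chunks.add(new Chunk(toc.get(i).title(), fullText.substring(start, end).trim()));
        }
        return chunks;
    }

    public static void main(String[] args) {
        String text = "Intro text. Methods text. Results text.";
        List<TocEntry> toc = List.of(
                new TocEntry("Intro", 0),
                new TocEntry("Methods", 12),
                new TocEntry("Results", 26));
        split(text, toc).forEach(c -> System.out.println(c.title() + " -> " + c.text()));
    }
}
```

A consequence of this approach, at least in the sketch, is that the quality of the chunks depends entirely on the TOC entries: a document without usable outline information gives the splitter nothing to cut on.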
Field Summary
protected final org.apache.pdfbox.pdmodel.PDDocument document
protected String resourceFileName
Constructor Summary
ParagraphPdfDocumentReader(String resourceUrl)
    Constructs a ParagraphPdfDocumentReader using a resource URL.
ParagraphPdfDocumentReader(String resourceUrl, PdfDocumentReaderConfig config)
    Constructs a ParagraphPdfDocumentReader using a resource URL and a configuration.
ParagraphPdfDocumentReader(org.springframework.core.io.Resource pdfResource)
    Constructs a ParagraphPdfDocumentReader using a resource.
ParagraphPdfDocumentReader(org.springframework.core.io.Resource pdfResource, PdfDocumentReaderConfig config)
    Constructs a ParagraphPdfDocumentReader using a resource and a configuration.
Method Summary
protected void addMetadata(ParagraphManager.Paragraph from, ParagraphManager.Paragraph to, Document document)
List<Document> get()
    Reads and processes the PDF document to extract paragraphs.
String getTextBetweenParagraphs(ParagraphManager.Paragraph fromParagraph, ParagraphManager.Paragraph toParagraph)
protected Document toDocument
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface org.springframework.ai.document.DocumentReader
read
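Because the class implements both Supplier<List<Document>> and DocumentReader, it exposes get() directly and inherits read(). The sketch below shows how such a pairing can work in plain Java. `SketchReader` and `SupplierReaderSketch` are hypothetical stand-ins, not the Spring AI types, and the delegation of read() to get() is an assumption that this page does not state.

```java
import java.util.List;
import java.util.function.Supplier;

// Hypothetical stand-in for a reader interface; NOT the Spring AI DocumentReader.
// Assumption: read() is a convenience alias that delegates to get().
interface SketchReader extends Supplier<List<String>> {
    default List<String> read() {
        return get();
    }
}

class SupplierReaderSketch {

    // Code that accepts a Supplier can consume a reader without knowing its concrete type.
    static int countChunks(Supplier<List<String>> source) {
        return source.get().size();
    }

    public static void main(String[] args) {
        // The lambda supplies get(); read() comes from the default method.
        SketchReader reader = () -> List.of("paragraph one", "paragraph two");
        System.out.println(reader.read());
        System.out.println(countChunks(reader));
    }
}
```

Under that assumption, callers can use whichever method name reads better at the call site, and the reader can be passed anywhere a plain Supplier is expected.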
Field Details

document
protected final org.apache.pdfbox.pdmodel.PDDocument document

resourceFileName
protected String resourceFileName

Constructor Details

ParagraphPdfDocumentReader
public ParagraphPdfDocumentReader(String resourceUrl)
Constructs a ParagraphPdfDocumentReader using a resource URL.
- Parameters:
resourceUrl - The URL of the PDF resource.

ParagraphPdfDocumentReader
public ParagraphPdfDocumentReader(org.springframework.core.io.Resource pdfResource)
Constructs a ParagraphPdfDocumentReader using a resource.
- Parameters:
pdfResource - The PDF resource.

ParagraphPdfDocumentReader
public ParagraphPdfDocumentReader(String resourceUrl, PdfDocumentReaderConfig config)
Constructs a ParagraphPdfDocumentReader using a resource URL and a configuration.
- Parameters:
resourceUrl - The URL of the PDF resource.
config - The configuration for PDF document processing.

ParagraphPdfDocumentReader
public ParagraphPdfDocumentReader(org.springframework.core.io.Resource pdfResource, PdfDocumentReaderConfig config)
Constructs a ParagraphPdfDocumentReader using a resource and a configuration.
- Parameters:
pdfResource - The PDF resource.
config - The configuration for PDF document processing.

Method Details

get
public List<Document> get()
Reads and processes the PDF document to extract paragraphs.

toDocument

addMetadata
protected void addMetadata(ParagraphManager.Paragraph from, ParagraphManager.Paragraph to, Document document)

getTextBetweenParagraphs
public String getTextBetweenParagraphs(ParagraphManager.Paragraph fromParagraph, ParagraphManager.Paragraph toParagraph)