Class ParagraphPdfDocumentReader
java.lang.Object
org.springframework.ai.reader.pdf.ParagraphPdfDocumentReader
All Implemented Interfaces:
Supplier<List<Document>>, DocumentReader
Uses the PDF catalog (e.g. TOC) information to split the input PDF into text paragraphs and outputs a single Document per paragraph.
This class provides methods for reading and processing PDF documents. It uses the Apache PDFBox library to parse PDF content and convert it into text paragraphs. The paragraphs are grouped into Document objects.
Author:
Christian Tzolov
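The catalog-based splitting described above can be illustrated with a small, self-contained sketch. It is not this class's implementation and does not use PDFBox or Spring AI: `TocSplitSketch`, `TocEntry`, `Chunk`, and `split` are hypothetical names introduced only for illustration, and the sketch assumes that every TOC entry starts on a new page and that entries are sorted by page.

```java
import java.util.ArrayList;
import java.util.List;

public class TocSplitSketch {

    /** Hypothetical stand-in for a TOC entry: a section title and the index of its first page. */
    public record TocEntry(String title, int startPage) {}

    /** Hypothetical stand-in for the per-paragraph output; not Spring AI's Document class. */
    public record Chunk(String title, String text) {}

    /**
     * Emits one Chunk per TOC entry. Each chunk holds the text of the pages from the
     * entry's first page up to (but not including) the first page of the next entry.
     * Assumes entries are sorted by startPage and that each section starts on a new page.
     */
    public static List<Chunk> split(List<String> pages, List<TocEntry> toc) {
        List<Chunk> chunks = new ArrayList<>();
        for (int i = 0; i < toc.size(); i++) {
            TocEntry current = toc.get(i);
            // The last entry runs to the end of the document.
            int end = (i + 1 < toc.size()) ? toc.get(i + 1).startPage() : pages.size();
            String text = String.join("\n", pages.subList(current.startPage(), end));
            chunks.add(new Chunk(current.title(), text));
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Page texts and TOC entries are made-up sample data.
        List<String> pages = List.of("Intro text", "Setup part 1", "Setup part 2", "Usage text");
        List<TocEntry> toc = List.of(
                new TocEntry("Introduction", 0),
                new TocEntry("Setup", 1),
                new TocEntry("Usage", 3));
        for (Chunk chunk : split(pages, toc)) {
            System.out.println(chunk.title() + " -> " + chunk.text().replace("\n", " | "));
        }
    }
}
```

In this sample the "Setup" chunk spans two pages because the next TOC entry starts on page index 3. A real PDF outline can point to positions inside a page, which this sketch deliberately ignores.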
Constructor Summary
Constructor and Description
ParagraphPdfDocumentReader(String resourceUrl)
    Constructs a ParagraphPdfDocumentReader using a resource URL.
ParagraphPdfDocumentReader(String resourceUrl, PdfDocumentReaderConfig config)
    Constructs a ParagraphPdfDocumentReader using a resource URL and a configuration.
ParagraphPdfDocumentReader(org.springframework.core.io.Resource pdfResource)
    Constructs a ParagraphPdfDocumentReader using a resource.
ParagraphPdfDocumentReader(org.springframework.core.io.Resource pdfResource, PdfDocumentReaderConfig config)
    Constructs a ParagraphPdfDocumentReader using a resource and a configuration.
Method Summary
Modifier and Type    Method and Description
List<Document>    get()
    Reads and processes the PDF document to extract paragraphs.
String    getTextBetweenParagraphs(ParagraphManager.Paragraph fromParagraph, ParagraphManager.Paragraph toParagraph)
Constructor Details
ParagraphPdfDocumentReader
public ParagraphPdfDocumentReader(String resourceUrl)
Constructs a ParagraphPdfDocumentReader using a resource URL.
Parameters:
resourceUrl - The URL of the PDF resource.
ParagraphPdfDocumentReader
public ParagraphPdfDocumentReader(org.springframework.core.io.Resource pdfResource)
Constructs a ParagraphPdfDocumentReader using a resource.
Parameters:
pdfResource - The PDF resource.
ParagraphPdfDocumentReader
public ParagraphPdfDocumentReader(String resourceUrl, PdfDocumentReaderConfig config)
Constructs a ParagraphPdfDocumentReader using a resource URL and a configuration.
Parameters:
resourceUrl - The URL of the PDF resource.
config - The configuration for PDF document processing.
ParagraphPdfDocumentReader
public ParagraphPdfDocumentReader(org.springframework.core.io.Resource pdfResource, PdfDocumentReaderConfig config)
Constructs a ParagraphPdfDocumentReader using a resource and a configuration.
Parameters:
pdfResource - The PDF resource.
config - The configuration for PDF document processing.
Method Details
get
public List<Document> get()
Reads and processes the PDF document to extract paragraphs.
getTextBetweenParagraphs
public String getTextBetweenParagraphs(ParagraphManager.Paragraph fromParagraph, ParagraphManager.Paragraph toParagraph)