Class TikaDocumentReader
java.lang.Object
org.springframework.ai.reader.tika.TikaDocumentReader
All Implemented Interfaces:
Supplier<List<Document>>, DocumentReader
A document reader that leverages Apache Tika to extract text from a variety of document
formats, such as PDF, DOC/DOCX, PPT/PPTX, and HTML. For a comprehensive list of
supported formats, refer to: https://tika.apache.org/2.9.0/formats.html.
This reader returns the extracted text without any additional formatting. All
extracted text is encapsulated in a Document instance.
If you require more specialized handling for PDFs, consider using the
PagePdfDocumentReader or ParagraphPdfDocumentReader.
Author:
Christian Tzolov
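As a minimal sketch of typical usage, assuming the Spring AI Tika reader module and Spring Core are on the classpath: the reader is built from a Resource and queried through `get()`, the method inherited from `Supplier<List<Document>>`. The class name `TikaReadExample` and the helper `readAll` are hypothetical names chosen for illustration.

```java
import java.nio.file.Path;
import java.util.List;

import org.springframework.ai.document.Document;
import org.springframework.ai.reader.tika.TikaDocumentReader;
import org.springframework.core.io.FileSystemResource;

public class TikaReadExample {

    // Hypothetical helper: extracts the text of one file into Document instances.
    static List<Document> readAll(Path file) {
        TikaDocumentReader reader = new TikaDocumentReader(new FileSystemResource(file));
        // get() is the Supplier<List<Document>> method implemented by the reader.
        return reader.get();
    }

    public static void main(String[] args) {
        List<Document> documents = readAll(Path.of(args[0]));
        System.out.println("Extracted " + documents.size() + " document(s)");
    }
}
```

The same call shape should apply to any format Tika can detect, since the reader takes only the resource and no format hint.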
Field Summary
Modifier and Type | Field | Description
static final String | METADATA_SOURCE | Metadata key representing the source of the document.
Constructor Summary
Constructor | Description
TikaDocumentReader(String resourceUrl) | Constructor initializing the reader with a given resource URL.
TikaDocumentReader(String resourceUrl, ExtractedTextFormatter textFormatter) | Constructor initializing the reader with a given resource URL and a text formatter.
TikaDocumentReader(org.springframework.core.io.Resource resource) | Constructor initializing the reader with a resource.
TikaDocumentReader(org.springframework.core.io.Resource resource, ExtractedTextFormatter textFormatter) | Constructor initializing the reader with a resource and a text formatter.
TikaDocumentReader(org.springframework.core.io.Resource resource, ContentHandler contentHandler, ExtractedTextFormatter textFormatter) | Constructor initializing the reader with a resource, content handler, and a text formatter.
Method Summary
Field Details
METADATA_SOURCE
public static final String METADATA_SOURCE
Metadata key representing the source of the document.
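The constant can be used to look up where a document came from. The sketch below assumes that the reader stores an entry under this key in each Document's metadata map, and that at least one document is returned. The exact value stored (a file name, a URI, or something else) depends on the resource and is not specified here. `TikaSourceExample` and `sourceOf` are hypothetical names.

```java
import java.nio.file.Path;
import java.util.List;

import org.springframework.ai.document.Document;
import org.springframework.ai.reader.tika.TikaDocumentReader;
import org.springframework.core.io.FileSystemResource;

public class TikaSourceExample {

    // Hypothetical helper: returns the metadata entry stored under METADATA_SOURCE
    // for the first extracted document, or null if the key is absent.
    static Object sourceOf(Path file) {
        List<Document> documents = new TikaDocumentReader(new FileSystemResource(file)).get();
        return documents.get(0).getMetadata().get(TikaDocumentReader.METADATA_SOURCE);
    }

    public static void main(String[] args) {
        System.out.println("source = " + sourceOf(Path.of(args[0])));
    }
}
```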
Constructor Details
TikaDocumentReader
public TikaDocumentReader(String resourceUrl)
Constructor initializing the reader with a given resource URL.
Parameters:
resourceUrl - URL to the resource
TikaDocumentReader
public TikaDocumentReader(String resourceUrl, ExtractedTextFormatter textFormatter)
Constructor initializing the reader with a given resource URL and a text formatter.
Parameters:
resourceUrl - URL to the resource
textFormatter - Formatter for the extracted text
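A sketch of this constructor with a custom formatter follows. It makes two assumptions: that the URL string is resolved with Spring's usual resource prefixes (for example `file:` or `classpath:`), and that the `ExtractedTextFormatter` builder methods shown exist under these names in the Spring AI version in use. `TikaFormatterExample` and `readTrimmed` are hypothetical names.

```java
import java.util.List;

import org.springframework.ai.document.Document;
import org.springframework.ai.reader.ExtractedTextFormatter;
import org.springframework.ai.reader.tika.TikaDocumentReader;

public class TikaFormatterExample {

    // Hypothetical helper: reads a resource by URL and drops the first and last
    // line of the extracted text, e.g. a repeated header and footer.
    static List<Document> readTrimmed(String resourceUrl) {
        ExtractedTextFormatter formatter = ExtractedTextFormatter.builder()
            .withNumberOfTopTextLinesToDelete(1)
            .withNumberOfBottomTextLinesToDelete(1)
            .build();
        return new TikaDocumentReader(resourceUrl, formatter).get();
    }

    public static void main(String[] args) {
        // Example argument (assumed form): file:/path/to/report.docx
        List<Document> documents = readTrimmed(args[0]);
        System.out.println("Extracted " + documents.size() + " document(s)");
    }
}
```

When no trimming is needed, `ExtractedTextFormatter.defaults()` can be passed instead of a custom builder configuration.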
TikaDocumentReader
public TikaDocumentReader(org.springframework.core.io.Resource resource)
Constructor initializing the reader with a resource.
Parameters:
resource - Resource pointing to the document
TikaDocumentReader
public TikaDocumentReader(org.springframework.core.io.Resource resource, ExtractedTextFormatter textFormatter)
Constructor initializing the reader with a resource and a text formatter.
Parameters:
resource - Resource pointing to the document
textFormatter - Formatter for the extracted text
TikaDocumentReader
public TikaDocumentReader(org.springframework.core.io.Resource resource, ContentHandler contentHandler, ExtractedTextFormatter textFormatter)
Constructor initializing the reader with a resource, content handler, and a text formatter.
Parameters:
resource - Resource pointing to the document
contentHandler - Handler to manage content extraction
textFormatter - Formatter for the extracted text
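A sketch of the three-argument constructor follows, assuming that `ContentHandler` is the SAX interface `org.xml.sax.ContentHandler` and that the reader takes the extracted text from the handler once parsing finishes. It passes Apache Tika's `ToXMLContentHandler`, which serializes the parsed document as XHTML instead of plain text. `TikaHandlerExample` and `readAsXhtml` are hypothetical names.

```java
import java.nio.file.Path;
import java.util.List;

import org.apache.tika.sax.ToXMLContentHandler;
import org.springframework.ai.document.Document;
import org.springframework.ai.reader.ExtractedTextFormatter;
import org.springframework.ai.reader.tika.TikaDocumentReader;
import org.springframework.core.io.FileSystemResource;

public class TikaHandlerExample {

    // Hypothetical helper: extracts a file with a handler that emits XHTML markup.
    static List<Document> readAsXhtml(Path file) {
        // A fresh handler per reader, because handlers accumulate output as they parse.
        ToXMLContentHandler handler = new ToXMLContentHandler();
        TikaDocumentReader reader = new TikaDocumentReader(
            new FileSystemResource(file), handler, ExtractedTextFormatter.defaults());
        return reader.get();
    }

    public static void main(String[] args) {
        List<Document> documents = readAsXhtml(Path.of(args[0]));
        System.out.println("Extracted " + documents.size() + " document(s)");
    }
}
```

Keeping the markup can help when headings or tables matter to downstream splitting. The formatter is still applied to whatever string the handler produces.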
Method Details