Class TikaDocumentReader
java.lang.Object
org.springframework.ai.reader.tika.TikaDocumentReader
All Implemented Interfaces:
Supplier<List<Document>>, DocumentReader
A document reader that leverages Apache Tika to extract text from a variety of document
formats, such as PDF, DOC/DOCX, PPT/PPTX, and HTML. For a comprehensive list of
supported formats, refer to: https://tika.apache.org/3.0.0/formats.html.
This reader provides the extracted text directly, without any additional formatting. All
extracted text is encapsulated within a Document instance.
If you require more specialized handling for PDFs, consider using the
PagePdfDocumentReader or ParagraphPdfDocumentReader.
Author:
Christian Tzolov
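The class description above can be illustrated with a minimal sketch. It assumes the Spring AI Tika reader module and its Tika dependencies are on the classpath, and it uses a small in-memory text resource as a stand-in for a real PDF, DOCX, PPTX, or HTML file. The class name TikaReaderSketch and the helper extract are hypothetical. Document.getText() is assumed to be the text accessor, as in recent Spring AI releases; older milestones may expose a differently named accessor.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

import org.springframework.ai.document.Document;
import org.springframework.ai.reader.tika.TikaDocumentReader;
import org.springframework.core.io.ByteArrayResource;
import org.springframework.core.io.Resource;

// Hypothetical example class, not part of Spring AI.
public class TikaReaderSketch {

    // Extracts the text of the given resource into Document instances.
    static List<Document> extract(Resource resource) {
        TikaDocumentReader reader = new TikaDocumentReader(resource);
        // get() comes from Supplier<List<Document>>; read() from DocumentReader
        // is the equivalent entry point.
        return reader.get();
    }

    public static void main(String[] args) {
        // In-memory plain text stands in for a real document file.
        Resource resource = new ByteArrayResource(
                "Apache Tika extracts plain text from many formats."
                        .getBytes(StandardCharsets.UTF_8));

        for (Document document : extract(resource)) {
            // Assumption: getText() is the text accessor in the Spring AI version in use.
            System.out.println(document.getText());
        }
    }
}
```

In an application the resource would more typically point at a file or classpath entry; the reader chooses the Tika parser from the detected content type, so the calling code stays the same across formats.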
Field Summary
- static final String METADATA_SOURCE
  Metadata key representing the source of the document.
Constructor Summary
- TikaDocumentReader(String resourceUrl)
  Constructor initializing the reader with a given resource URL.
- TikaDocumentReader(String resourceUrl, ExtractedTextFormatter textFormatter)
  Constructor initializing the reader with a given resource URL and a text formatter.
- TikaDocumentReader(org.springframework.core.io.Resource resource)
  Constructor initializing the reader with a resource.
- TikaDocumentReader(org.springframework.core.io.Resource resource, ExtractedTextFormatter textFormatter)
  Constructor initializing the reader with a resource and a text formatter.
- TikaDocumentReader(org.springframework.core.io.Resource resource, ContentHandler contentHandler, ExtractedTextFormatter textFormatter)
  Constructor initializing the reader with a resource, content handler, and a text formatter.
Method Summary
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface org.springframework.ai.document.DocumentReader
read
Field Details
METADATA_SOURCE
public static final String METADATA_SOURCE
Metadata key representing the source of the document.
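A short sketch of how the METADATA_SOURCE key might be used to look up where a document came from. The class TikaSourceSketch and its helpers are hypothetical, and the example assumes the reader stores an entry under this key for each document it returns. The exact value recorded (file name, URI, or something else) is not specified here, so the code prints whatever it finds.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

import org.springframework.ai.document.Document;
import org.springframework.ai.reader.tika.TikaDocumentReader;
import org.springframework.core.io.ByteArrayResource;
import org.springframework.core.io.Resource;

// Hypothetical example class, not part of Spring AI.
public class TikaSourceSketch {

    // Wraps text in an in-memory resource that reports a file name.
    static Resource namedResource(String filename, String text) {
        return new ByteArrayResource(text.getBytes(StandardCharsets.UTF_8)) {
            @Override
            public String getFilename() {
                return filename;
            }
        };
    }

    // Looks up the source entry using the constant instead of a string literal.
    static Object sourceOf(Document document) {
        return document.getMetadata().get(TikaDocumentReader.METADATA_SOURCE);
    }

    public static void main(String[] args) {
        Resource resource = namedResource("notes.txt", "Some extracted text.");
        List<Document> documents = new TikaDocumentReader(resource).get();
        for (Document document : documents) {
            // Assumption: the reader populated the source entry for this document.
            System.out.println("source = " + sourceOf(document));
        }
    }
}
```

Referring to the constant keeps downstream code independent of the literal key string.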
Constructor Details
TikaDocumentReader
public TikaDocumentReader(String resourceUrl)
Constructor initializing the reader with a given resource URL.
Parameters:
resourceUrl - URL to the resource
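The following sketch shows one way the URL-based constructor might be called. It assumes the string is resolved through Spring's usual resource loading, so that prefixes such as file: and classpath: are understood; that resolution behavior is an assumption, not something stated above. The class TikaUrlSketch and the helper readFromUrl are hypothetical. A temporary file is created so the example does not depend on any existing document.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

import org.springframework.ai.document.Document;
import org.springframework.ai.reader.tika.TikaDocumentReader;

// Hypothetical example class, not part of Spring AI.
public class TikaUrlSketch {

    // Builds the reader from a resource URL string and reads it.
    static List<Document> readFromUrl(String resourceUrl) {
        return new TikaDocumentReader(resourceUrl).read();
    }

    public static void main(String[] args) throws IOException {
        // Write a throwaway text file and address it with a file: URL.
        Path file = Files.createTempFile("tika-sample", ".txt");
        Files.writeString(file, "Text loaded through a file: URL.");
        try {
            List<Document> documents = readFromUrl(file.toUri().toString());
            System.out.println("Documents read: " + documents.size());
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```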
TikaDocumentReader
public TikaDocumentReader(String resourceUrl, ExtractedTextFormatter textFormatter)
Constructor initializing the reader with a given resource URL and a text formatter.
Parameters:
resourceUrl - URL to the resource
textFormatter - Formatter for the extracted text
TikaDocumentReader
public TikaDocumentReader(org.springframework.core.io.Resource resource)
Constructor initializing the reader with a resource.
Parameters:
resource - Resource pointing to the document
TikaDocumentReader
public TikaDocumentReader(org.springframework.core.io.Resource resource, ExtractedTextFormatter textFormatter)
Constructor initializing the reader with a resource and a text formatter. This constructor creates a BodyContentHandler that allows reading large PDFs (constrained only by available memory).
Parameters:
resource - Resource pointing to the document
textFormatter - Formatter for the extracted text
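As a sketch of how a text formatter might be supplied, the example below builds an ExtractedTextFormatter that drops the first line of the extracted text (for instance a repeated header) and passes it to the reader. The class TikaFormatterSketch and the helper readWithoutHeaderLine are hypothetical, and the choice of one deleted line is arbitrary. The builder method used is assumed to be available in the Spring AI version in use.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

import org.springframework.ai.document.Document;
import org.springframework.ai.reader.ExtractedTextFormatter;
import org.springframework.ai.reader.tika.TikaDocumentReader;
import org.springframework.core.io.ByteArrayResource;
import org.springframework.core.io.Resource;

// Hypothetical example class, not part of Spring AI.
public class TikaFormatterSketch {

    // Reads the resource and post-processes the text with a custom formatter.
    static List<Document> readWithoutHeaderLine(Resource resource) {
        ExtractedTextFormatter formatter = ExtractedTextFormatter.builder()
                .withNumberOfTopTextLinesToDelete(1) // illustrative value
                .build();
        return new TikaDocumentReader(resource, formatter).get();
    }

    public static void main(String[] args) {
        String text = String.join(System.lineSeparator(),
                "CONFIDENTIAL HEADER", "First body line.", "Second body line.");
        Resource resource = new ByteArrayResource(text.getBytes(StandardCharsets.UTF_8));

        List<Document> documents = readWithoutHeaderLine(resource);
        System.out.println("Documents read: " + documents.size());
    }
}
```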
TikaDocumentReader
public TikaDocumentReader(org.springframework.core.io.Resource resource, ContentHandler contentHandler, ExtractedTextFormatter textFormatter)
Constructor initializing the reader with a resource, content handler, and a text formatter.
Parameters:
resource - Resource pointing to the document
contentHandler - Handler to manage content extraction
textFormatter - Formatter for the extracted text
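A minimal sketch of passing an explicit content handler, assuming Apache Tika's BodyContentHandler is used. Its int constructor takes a write limit in characters, and -1 disables the limit. The class TikaHandlerSketch and the helper readWithLimit are hypothetical, and ExtractedTextFormatter.defaults() is used only to keep the formatting step neutral.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

import org.apache.tika.sax.BodyContentHandler;
import org.springframework.ai.document.Document;
import org.springframework.ai.reader.ExtractedTextFormatter;
import org.springframework.ai.reader.tika.TikaDocumentReader;
import org.springframework.core.io.ByteArrayResource;
import org.springframework.core.io.Resource;
import org.xml.sax.ContentHandler;

// Hypothetical example class, not part of Spring AI.
public class TikaHandlerSketch {

    // Reads the resource with a caller-chosen write limit for the body text.
    static List<Document> readWithLimit(Resource resource, int writeLimit) {
        // -1 disables Tika's write limit; a positive value caps the number of
        // characters the handler will accept.
        ContentHandler handler = new BodyContentHandler(writeLimit);
        return new TikaDocumentReader(resource, handler, ExtractedTextFormatter.defaults()).get();
    }

    public static void main(String[] args) {
        Resource resource = new ByteArrayResource(
                "A document that could be arbitrarily long.".getBytes(StandardCharsets.UTF_8));

        List<Document> documents = readWithLimit(resource, -1);
        System.out.println("Documents read: " + documents.size());
    }
}
```

Supplying the handler explicitly is mainly useful when the amount of extracted text should be bounded, or when a different SAX handler is needed for extraction.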
Method Details