Class TikaDocumentReader

java.lang.Object
org.springframework.ai.reader.tika.TikaDocumentReader
All Implemented Interfaces:
Supplier<List<Document>>, DocumentReader

public class TikaDocumentReader extends Object implements DocumentReader
A document reader that uses Apache Tika to extract text from a variety of document formats, such as PDF, DOC/DOCX, PPT/PPTX, and HTML. For the full list of supported formats, see https://tika.apache.org/2.9.0/formats.html. The reader returns the extracted text as-is, without any additional formatting, and wraps it in a Document instance. If you need more specialized handling for PDFs, consider using PagePdfDocumentReader or ParagraphPdfDocumentReader.
Author:
Christian Tzolov
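As a rough illustration of the description above, the following sketch reads a small in-memory text resource and prints what the reader returns. It assumes the Spring AI Tika reader module and Spring Core are on the classpath. The class name TikaQuickStart and the helper read are made up for this example, and Document.getText() is assumed to be the text accessor (some older Spring AI releases name it getContent()).

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

import org.springframework.ai.document.Document;
import org.springframework.ai.reader.tika.TikaDocumentReader;
import org.springframework.core.io.ByteArrayResource;
import org.springframework.core.io.Resource;

// Hypothetical example class; not part of Spring AI.
public class TikaQuickStart {

    // Wraps the given string in an in-memory resource and lets Tika extract its text.
    static List<Document> read(String content) {
        Resource resource = new ByteArrayResource(content.getBytes(StandardCharsets.UTF_8));
        // TikaDocumentReader is a Supplier<List<Document>>, so get() triggers the extraction.
        return new TikaDocumentReader(resource).get();
    }

    public static void main(String[] args) {
        for (Document doc : read("Hello from Apache Tika.")) {
            // Assumption: getText() is the accessor in the Spring AI version in use.
            System.out.println(doc.getText());
            System.out.println(doc.getMetadata());
        }
    }
}
```

The same pattern should apply to binary formats such as DOCX or PDF by swapping the in-memory resource for one that points at the file.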
  • Field Details

    • METADATA_SOURCE

      public static final String METADATA_SOURCE
      Metadata key representing the source of the document.
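A minimal sketch of looking up this key on an extracted document. It assumes the reader stores the resource's file name under METADATA_SOURCE, which this page does not spell out, so treat the returned value as something to verify. The class TikaSourceMetadata and the helper sourceOf are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

import org.springframework.ai.document.Document;
import org.springframework.ai.reader.tika.TikaDocumentReader;
import org.springframework.core.io.ByteArrayResource;
import org.springframework.core.io.Resource;

// Hypothetical example class; not part of Spring AI.
public class TikaSourceMetadata {

    // Reads the content through Tika and returns whatever is stored under METADATA_SOURCE.
    static Object sourceOf(String filename, String content) {
        // An in-memory resource has no file name by default, so one is supplied here.
        Resource resource = new ByteArrayResource(content.getBytes(StandardCharsets.UTF_8)) {
            @Override
            public String getFilename() {
                return filename;
            }
        };
        Document doc = new TikaDocumentReader(resource).get().get(0);
        return doc.getMetadata().get(TikaDocumentReader.METADATA_SOURCE);
    }

    public static void main(String[] args) {
        System.out.println(sourceOf("notes.txt", "Some notes."));
    }
}
```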
  • Constructor Details

    • TikaDocumentReader

      public TikaDocumentReader(String resourceUrl)
      Constructor initializing the reader with a given resource URL.
      Parameters:
      resourceUrl - URL to the resource
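The URL-based constructor might be used as sketched below with a "file:" URL built from a temporary file. Spring-style locations such as "classpath:/sample.docx" are assumed to work the same way, but that is not stated on this page. TikaFromUrl and readFile are hypothetical names.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

import org.springframework.ai.document.Document;
import org.springframework.ai.reader.tika.TikaDocumentReader;

// Hypothetical example class; not part of Spring AI.
public class TikaFromUrl {

    // Converts the path to a "file:" URL string and hands it to the reader.
    static List<Document> readFile(Path file) {
        String resourceUrl = file.toUri().toString();
        return new TikaDocumentReader(resourceUrl).get();
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("tika-sample", ".txt");
        try {
            Files.writeString(file, "Text stored in a temporary file.");
            List<Document> docs = readFile(file);
            System.out.println("Documents read: " + docs.size());
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```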
    • TikaDocumentReader

      public TikaDocumentReader(String resourceUrl, ExtractedTextFormatter textFormatter)
      Constructor initializing the reader with a given resource URL and a text formatter.
      Parameters:
      resourceUrl - URL to the resource
      textFormatter - Formatter for the extracted text
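One way to supply a text formatter is sketched here: the formatter is configured to drop the first line of the extracted text, for instance a repeated header. The builder method withNumberOfTopTextLinesToDelete comes from ExtractedTextFormatter; the exact effect on a given document depends on line separators and on how Tika emits the text, so the output is worth checking. TikaWithFormatter and readWithoutFirstLine are hypothetical names.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

import org.springframework.ai.document.Document;
import org.springframework.ai.reader.ExtractedTextFormatter;
import org.springframework.ai.reader.tika.TikaDocumentReader;

// Hypothetical example class; not part of Spring AI.
public class TikaWithFormatter {

    // Reads the file via its URL and asks the formatter to delete the top text line.
    static List<Document> readWithoutFirstLine(Path file) {
        ExtractedTextFormatter formatter = ExtractedTextFormatter.builder()
            .withNumberOfTopTextLinesToDelete(1)
            .build();
        return new TikaDocumentReader(file.toUri().toString(), formatter).get();
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("tika-formatter", ".txt");
        try {
            Files.writeString(file, String.join(System.lineSeparator(),
                    "REPORT HEADER", "First line of the body.", "Second line of the body."));
            for (Document doc : readWithoutFirstLine(file)) {
                // Assumption: getText() is the accessor in the Spring AI version in use.
                System.out.println(doc.getText());
            }
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```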
    • TikaDocumentReader

      public TikaDocumentReader(org.springframework.core.io.Resource resource)
      Constructor initializing the reader with a resource.
      Parameters:
      resource - Resource pointing to the document
    • TikaDocumentReader

      public TikaDocumentReader(org.springframework.core.io.Resource resource, ExtractedTextFormatter textFormatter)
      Constructor initializing the reader with a resource and a text formatter. This constructor creates a BodyContentHandler that allows reading large PDFs (constrained only by available memory).
      Parameters:
      resource - Resource pointing to the document
      textFormatter - Formatter for the extracted text
    • TikaDocumentReader

      public TikaDocumentReader(org.springframework.core.io.Resource resource, ContentHandler contentHandler, ExtractedTextFormatter textFormatter)
      Constructor initializing the reader with a resource, content handler, and a text formatter.
      Parameters:
      resource - Resource pointing to the document
      contentHandler - Handler to manage content extraction
      textFormatter - Formatter for the extracted text
  • Method Details