org.springframework.ai.reader.tika.TikaDocumentReader

All Implemented Interfaces: Supplier<List<Document>>, DocumentReader

public class TikaDocumentReader extends Object implements DocumentReader

A document reader that uses Apache Tika to extract text from a variety of document formats, such as PDF, DOC/DOCX, PPT/PPTX, and HTML. For a comprehensive list of supported formats, see https://tika.apache.org/3.0.0-BETA2/formats.html. The reader returns the extracted text as-is, without additional formatting, encapsulated in a Document instance. If you need more specialized handling for PDFs, consider using PagePdfDocumentReader or ParagraphPdfDocumentReader.

Author: Christian Tzolov
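
A minimal usage sketch, assuming the Spring AI Tika reader module and its Apache Tika dependencies are on the classpath. The class name TikaReaderQuickStart, the readSample helper, and the temporary file are illustrative and not part of the API. It reads a small text file through the Resource-based constructor and prints the value stored under METADATA_SOURCE.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

import org.springframework.ai.document.Document;
import org.springframework.ai.reader.tika.TikaDocumentReader;
import org.springframework.core.io.FileSystemResource;

public class TikaReaderQuickStart {

    // Illustrative helper: writes a small text file and reads it back through the reader.
    static List<Document> readSample() throws Exception {
        Path file = Files.createTempFile("tika-sample", ".txt");
        Files.writeString(file, "Hello from Tika", StandardCharsets.UTF_8);
        try {
            // Resource-based constructor; the default text formatter is used.
            TikaDocumentReader reader = new TikaDocumentReader(new FileSystemResource(file));
            return reader.get();
        } finally {
            Files.deleteIfExists(file);
        }
    }

    public static void main(String[] args) throws Exception {
        for (Document doc : readSample()) {
            // METADATA_SOURCE is the metadata key for the document's source.
            System.out.println(doc.getMetadata().get(TikaDocumentReader.METADATA_SOURCE));
        }
    }
}
```

The same call pattern is expected to apply to PDF, DOCX, or HTML resources; only the Resource changes.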

Field Summary

| Modifier and Type | Field | Description |
| --- | --- | --- |
| static final String | METADATA_SOURCE | Metadata key representing the source of the document. |

Constructor Summary

| Constructor | Description |
| --- | --- |
| TikaDocumentReader(String resourceUrl) | Constructor initializing the reader with a given resource URL. |
| TikaDocumentReader(String resourceUrl, ExtractedTextFormatter textFormatter) | Constructor initializing the reader with a given resource URL and a text formatter. |
| TikaDocumentReader(org.springframework.core.io.Resource resource) | Constructor initializing the reader with a resource. |
| TikaDocumentReader(org.springframework.core.io.Resource resource, ExtractedTextFormatter textFormatter) | Constructor initializing the reader with a resource and a text formatter. |
| TikaDocumentReader(org.springframework.core.io.Resource resource, ContentHandler contentHandler, ExtractedTextFormatter textFormatter) | Constructor initializing the reader with a resource, content handler, and a text formatter. |

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| List<Document> | get() | Extracts and returns the list of documents from the resource. |

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface org.springframework.ai.document.DocumentReader
read

Field Details
- METADATA_SOURCE
  
  public static final String METADATA_SOURCE
  
  Metadata key representing the source of the document.
  See Also:
  
  Constant Field Values
Constructor Details
- TikaDocumentReader
  
  public TikaDocumentReader(String resourceUrl)
  
  Constructor initializing the reader with a given resource URL.
  
  Parameters:
  
  resourceUrl - URL to the resource
- TikaDocumentReader
  
  public TikaDocumentReader(String resourceUrl, ExtractedTextFormatter textFormatter)
  
  Constructor initializing the reader with a given resource URL and a text formatter.
  
  Parameters:
  
  resourceUrl - URL to the resource
  
  textFormatter - Formatter for the extracted text
- TikaDocumentReader
  
  public TikaDocumentReader(org.springframework.core.io.Resource resource)
  
  Constructor initializing the reader with a resource.
  
  Parameters:
  
  resource - Resource pointing to the document
- TikaDocumentReader
  
  public TikaDocumentReader(org.springframework.core.io.Resource resource, ExtractedTextFormatter textFormatter)
  
  Constructor initializing the reader with a resource and a text formatter. This constructor creates a BodyContentHandler that allows reading large PDFs, constrained only by available memory.
  
  Parameters:
  
  resource - Resource pointing to the document
  
  textFormatter - Formatter for the extracted text
- TikaDocumentReader
  
  public TikaDocumentReader(org.springframework.core.io.Resource resource, ContentHandler contentHandler, ExtractedTextFormatter textFormatter)
  
  Constructor initializing the reader with a resource, content handler, and a text formatter.
  
  Parameters:
  
  resource - Resource pointing to the document
  
  contentHandler - Handler to manage content extraction
  
  textFormatter - Formatter for the extracted text
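
The constructor variants above can be sketched as follows. This is an illustration under assumptions rather than a recommended configuration: the class TikaReaderOptions and both helper methods are hypothetical, BodyContentHandler(int) is Apache Tika's constructor taking a write limit in characters, and ExtractedTextFormatter.defaults() and the builder method withNumberOfTopTextLinesToDelete are assumed to be available in the Spring AI version in use.

```java
import org.apache.tika.sax.BodyContentHandler;
import org.springframework.ai.reader.ExtractedTextFormatter;
import org.springframework.ai.reader.tika.TikaDocumentReader;
import org.springframework.core.io.Resource;
import org.xml.sax.ContentHandler;

public class TikaReaderOptions {

    // Hypothetical helper: supplies a content handler with an explicit write limit
    // (in characters) instead of the handler the two-argument constructor creates.
    static TikaDocumentReader cappedReader(Resource resource, int maxChars) {
        ContentHandler handler = new BodyContentHandler(maxChars);
        return new TikaDocumentReader(resource, handler, ExtractedTextFormatter.defaults());
    }

    // Hypothetical helper: configures a formatter intended to drop the first
    // line of the extracted text, for example a repeated header line.
    static TikaDocumentReader headerStrippingReader(Resource resource) {
        ExtractedTextFormatter formatter = ExtractedTextFormatter.builder()
            .withNumberOfTopTextLinesToDelete(1)
            .build();
        return new TikaDocumentReader(resource, formatter);
    }
}
```

Supplying a handler explicitly trades the memory-bound behavior of the two-argument constructor for a fixed cap, which may be preferable when input size is not trusted.
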
Method Details
- get
  
  public List<Document> get()
  
  Extracts and returns the list of documents from the resource.
  
  Specified by:
  
  get in interface Supplier<List<Document>>
  
  Returns:
  
  The list of extracted Document instances
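
Because get is specified by Supplier<List<Document>> and read is inherited from DocumentReader, a reader can be passed wherever a Supplier is expected. The sketch below models that relationship with stand-in types only: SketchReader and countDocuments are hypothetical, String stands in for Document, and the assumption that read simply delegates to get is labeled in the code.

```java
import java.util.List;
import java.util.function.Supplier;

public class ReaderContractSketch {

    // Stand-in for DocumentReader: a Supplier with a read() convenience method.
    // Assumption: read() simply delegates to get().
    interface SketchReader extends Supplier<List<String>> {
        default List<String> read() {
            return get();
        }
    }

    // Accepts any Supplier, so a reader can be used where a Supplier is expected.
    static int countDocuments(Supplier<List<String>> source) {
        return source.get().size();
    }

    public static void main(String[] args) {
        SketchReader reader = () -> List.of("extracted text");
        System.out.println(reader.read().equals(reader.get())); // prints "true"
        System.out.println(countDocuments(reader));             // prints "1"
    }
}
```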
