Class ExtractedTextFormatter

java.lang.Object
org.springframework.ai.reader.ExtractedTextFormatter

public final class ExtractedTextFormatter extends Object
A utility to reformat extracted text content before encapsulating it in a Document. This formatter provides the following functionalities:
  • Left alignment of text
  • Removal of specified lines from the beginning and end of content
  • Consolidation of consecutive blank lines
An instance of this formatter can be customized using the ExtractedTextFormatter.Builder nested class.
Author:
Christian Tzolov
  • Method Details

    • builder

      public static ExtractedTextFormatter.Builder builder()
      Provides an instance of the builder for this formatter.
      Returns:
      an instance of the builder.
    • defaults

      public static ExtractedTextFormatter defaults()
      Provides a default instance of the formatter.
      Returns:
      default instance of the formatter.
    • trimAdjacentBlankLines

      public static String trimAdjacentBlankLines(String pageText)
      Replaces multiple adjacent blank lines with a single blank line.
      Parameters:
      pageText - text to adjust the blank lines for.
      Returns:
      The same text with adjacent blank lines collapsed.
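      To illustrate the behavior described above, here is a minimal standalone sketch. It is not the library's implementation: the class name BlankLineSketch, the method collapseBlankLines, and the regular expression are assumptions made for this example, and it assumes "\n" line endings.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of ExtractedTextFormatter.
final class BlankLineSketch {

    // Two or more consecutive blank lines (lines holding only spaces or tabs).
    private static final Pattern BLANK_RUN = Pattern.compile("(?m)(^[ \\t]*\\n){2,}");

    // Collapses each run of blank lines into a single blank line.
    static String collapseBlankLines(String text) {
        if (text == null) {
            return null;
        }
        return BLANK_RUN.matcher(text).replaceAll("\n");
    }
}
```

      In this sketch a single blank line between paragraphs is left alone; only runs of two or more blank lines are reduced.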
    • alignToLeft

      public static String alignToLeft(String pageText)
      Aligns the provided text to the left side.
      Parameters:
      pageText - text to align.
      Returns:
      The same text, aligned to the left side.
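      One plausible reading of left alignment is that leading spaces and tabs are stripped from every line. The sketch below shows that interpretation only; LeftAlignSketch and alignLeft are hypothetical names, and the actual method may treat whitespace differently.

```java
// Hypothetical helper, not part of ExtractedTextFormatter.
final class LeftAlignSketch {

    // Removes leading spaces and tabs from each line; line breaks are kept.
    static String alignLeft(String text) {
        if (text == null) {
            return null;
        }
        // (?m) makes ^ match at the start of every line, not just the input.
        return text.replaceAll("(?m)^[ \\t]+", "");
    }
}
```

      Whitespace inside a line is untouched in this sketch, so column layouts within the text are preserved.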
    • deleteBottomTextLines

      @Deprecated(forRemoval=true, since="1.0.0-M5") public static String deleteBottomTextLines(String pageText, int numberOfLines)
      Deprecated, for removal: This API element is subject to removal in a future version.
      Removes the specified number of lines from the bottom part of the text.
      Parameters:
      pageText - Text to remove lines from.
      numberOfLines - Number of lines to remove.
      Returns:
      The text with the specified number of lines stripped from the bottom.
    • deleteBottomTextLines

      public static String deleteBottomTextLines(String pageText, int numberOfLines, String lineSeparator)
      Removes the specified number of lines from the bottom part of the text.
      Parameters:
      pageText - Text to remove lines from.
      numberOfLines - Number of lines to remove.
      lineSeparator - The line separator to use when identifying lines in the text.
      Returns:
      The text with the specified number of lines stripped from the bottom.
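      The effect of removing lines from the bottom with an explicit separator can be sketched as follows. This is an illustration under assumptions, not the library code: BottomLinesSketch and dropLastLines are hypothetical names, and the handling of null, blank, or non-positive input is chosen for the example because the documentation above does not specify it.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical helper, not part of ExtractedTextFormatter.
final class BottomLinesSketch {

    // Drops the last `count` lines, where lines are delimited by lineSeparator.
    static String dropLastLines(String text, int count, String lineSeparator) {
        // Assumption: null, blank, or nothing-to-remove input is returned as is.
        if (text == null || text.isBlank() || count <= 0) {
            return text;
        }
        // Quote the separator so it is matched literally; -1 keeps trailing empty lines.
        String[] lines = text.split(Pattern.quote(lineSeparator), -1);
        int keep = Math.max(0, lines.length - count);
        return String.join(lineSeparator, Arrays.copyOfRange(lines, 0, keep));
    }
}
```

      Passing the separator explicitly avoids depending on the platform default, which matters when the extracted text was produced on a different operating system.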
    • deleteTopTextLines

      @Deprecated(forRemoval=true, since="1.0.0-M5") public static String deleteTopTextLines(String pageText, int numberOfLines)
      Deprecated, for removal: This API element is subject to removal in a future version.
      Removes a specified number of lines from the top part of the given text.

      This method takes a text and trims it by removing a certain number of lines from the top. If the provided text is null or contains only whitespace, it will be returned as is. If the number of lines to remove exceeds the actual number of lines in the text, the result will be an empty string.

      The method identifies lines based on the system's line separator, making it compatible with different platforms.

      Parameters:
      pageText - The text from which the top lines need to be removed. If this is null, empty, or consists only of whitespace, it will be returned unchanged.
      numberOfLines - The number of lines to remove from the top of the text. If this exceeds the actual number of lines in the text, an empty string will be returned.
      Returns:
      The text with the specified number of lines removed from the top.
    • deleteTopTextLines

      public static String deleteTopTextLines(String pageText, int numberOfLines, String lineSeparator)
      Removes a specified number of lines from the top part of the given text.

      This method takes a text and trims it by removing a certain number of lines from the top. If the provided text is null or contains only whitespace, it will be returned as is. If the number of lines to remove exceeds the actual number of lines in the text, the result will be an empty string.

      The method identifies lines based on the provided line separator, so the caller controls how lines are delimited regardless of platform.

      Parameters:
      pageText - The text from which the top lines need to be removed. If this is null, empty, or consists only of whitespace, it will be returned unchanged.
      numberOfLines - The number of lines to remove from the top of the text. If this exceeds the actual number of lines in the text, an empty string will be returned.
      lineSeparator - The line separator to use when identifying lines in the text.
      Returns:
      The text with the specified number of lines removed from the top.
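      The documented rules (null or whitespace-only text returned unchanged, an empty string when more lines are requested than exist) can be sketched as below. TopLinesSketch and dropFirstLines are hypothetical names, and returning the text unchanged for a non-positive count is an assumption of this example rather than documented behavior.

```java
// Hypothetical helper, not part of ExtractedTextFormatter.
final class TopLinesSketch {

    // Drops the first `count` lines, where lines are delimited by lineSeparator.
    static String dropFirstLines(String text, int count, String lineSeparator) {
        // Null or whitespace-only text is returned as is (count <= 0 is an assumption).
        if (text == null || text.isBlank() || count <= 0) {
            return text;
        }
        int start = 0;
        for (int i = 0; i < count; i++) {
            int next = text.indexOf(lineSeparator, start);
            if (next < 0) {
                // Fewer separators than lines requested: nothing is left.
                return "";
            }
            start = next + lineSeparator.length();
        }
        return text.substring(start);
    }
}
```

      Scanning with indexOf rather than splitting avoids building an array of lines when only a prefix has to be skipped.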
    • format

      public String format(String pageText)
      Formats the provided text according to the formatter's configuration.
      Parameters:
      pageText - Text to be formatted.
      Returns:
      Formatted text.
    • format

      public String format(String pageText, int pageNumber)
      Formats the provided text based on the formatter's configuration, considering the page number.
      Parameters:
      pageText - Text to be formatted.
      pageNumber - Page number of the provided text.
      Returns:
      Formatted text.