Class PDFLayoutTextStripperByArea

java.lang.Object
org.apache.pdfbox.contentstream.PDFStreamEngine
org.apache.pdfbox.text.PDFTextStripper
org.springframework.ai.reader.pdf.layout.ForkPDFLayoutTextStripper
org.springframework.ai.reader.pdf.layout.PDFLayoutTextStripperByArea

public class PDFLayoutTextStripperByArea extends ForkPDFLayoutTextStripper
Re-implements the PDFLayoutTextStripperByArea on top of the PDFLayoutTextStripper instead of the original PDFTextStripper. This class allows cropping pages (e.g., removing headers, footers, and empty space between pages) while extracting layout text, preserving the PDF's internal text formatting.
Author:
Christian Tzolov
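
The cropping mentioned in the class description comes down to choosing a rectangle that leaves out the header and footer bands. The following is a minimal sketch of that geometry only, using the standard java.awt.geom.Rectangle2D type. The class name CropRegionSketch, the method bodyRegion, and the margin values are hypothetical and not part of this API.

```java
import java.awt.geom.Rectangle2D;

final class CropRegionSketch {

    // Builds a region covering the full page width minus a top and a bottom margin.
    // Coordinates follow the Java convention (y == 0 is the top of the page).
    static Rectangle2D bodyRegion(double pageWidth, double pageHeight,
                                  double topMargin, double bottomMargin) {
        double height = pageHeight - topMargin - bottomMargin;
        if (height <= 0) {
            throw new IllegalArgumentException("Margins leave no room for text");
        }
        return new Rectangle2D.Double(0, topMargin, pageWidth, height);
    }

    public static void main(String[] args) {
        // Assumed example: a 612 x 792 pt page with a 50 pt header and a 40 pt footer.
        Rectangle2D body = bodyRegion(612, 792, 50, 40);
        System.out.println(body);
    }
}
```

A rectangle built this way could then be registered with addRegion(String, Rectangle2D), processed with extractRegions(PDPage), and read back with getTextForRegion(String).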
  • Field Summary

    Fields inherited from class org.springframework.ai.reader.pdf.layout.ForkPDFLayoutTextStripper

    DEBUG, OUTPUT_SPACE_CHARACTER_WIDTH_IN_PT

    Fields inherited from class org.apache.pdfbox.text.PDFTextStripper

    charactersByArticle, document, LINE_SEPARATOR, output
  • Constructor Summary

    Constructors
    Constructor
    Description
    PDFLayoutTextStripperByArea()
    Constructor.
  • Method Summary

    Modifier and Type
    Method
    Description
    void
    addRegion(String regionName, Rectangle2D rect)
    Add a new region to group text by.
    protected float
    computeFontHeight(org.apache.pdfbox.pdmodel.font.PDFont arg0)
     
    void
    extractRegions(org.apache.pdfbox.pdmodel.PDPage page)
    Process the page to extract the region text.
    List<String>
    getRegions()
    Get the list of regions that have been set up.
    String
    getTextForRegion(String regionName)
    Get the text for the region. This should be called after extractRegions().
    protected void
    processTextPosition(org.apache.pdfbox.text.TextPosition text)
    void
    removeRegion(String regionName)
    Delete a region to group text by.
    final void
    setShouldSeparateByBeads(boolean aShouldSeparateByBeads)
    This method does nothing in this derived class, because beads and regions are incompatible.
    protected void
    showGlyph(org.apache.pdfbox.util.Matrix arg0, org.apache.pdfbox.pdmodel.font.PDFont arg1, int arg2, org.apache.pdfbox.util.Vector arg3)
     
    protected void
    writePage()
    This will print the processed page text to the output stream.

    Methods inherited from class org.springframework.ai.reader.pdf.layout.ForkPDFLayoutTextStripper

    processPage

    Methods inherited from class org.apache.pdfbox.text.PDFTextStripper

    endArticle, endDocument, endPage, getAddMoreFormatting, getArticleEnd, getArticleStart, getAverageCharTolerance, getCharactersByArticle, getCurrentPageNo, getDropThreshold, getEndBookmark, getEndPage, getIndentThreshold, getLineSeparator, getListItemPatterns, getOutput, getPageEnd, getPageStart, getParagraphEnd, getParagraphStart, getSeparateByBeads, getSortByPosition, getSpacingTolerance, getStartBookmark, getStartPage, getSuppressDuplicateOverlappingText, getText, getWordSeparator, matchPattern, processPages, setAddMoreFormatting, setArticleEnd, setArticleStart, setAverageCharTolerance, setDropThreshold, setEndBookmark, setEndPage, setIndentThreshold, setLineSeparator, setListItemPatterns, setPageEnd, setPageStart, setParagraphEnd, setParagraphStart, setSortByPosition, setSpacingTolerance, setStartBookmark, setStartPage, setSuppressDuplicateOverlappingText, setWordSeparator, startArticle, startArticle, startDocument, startPage, writeCharacters, writeLineSeparator, writePageEnd, writePageStart, writeParagraphEnd, writeParagraphSeparator, writeParagraphStart, writeString, writeString, writeText, writeWordSeparator

    Methods inherited from class org.apache.pdfbox.contentstream.PDFStreamEngine

    addOperator, applyTextAdjustment, beginMarkedContentSequence, beginText, decreaseLevel, endMarkedContentSequence, endText, getAppearance, getCurrentPage, getGraphicsStackSize, getGraphicsState, getInitialMatrix, getLevel, getResources, getTextLineMatrix, getTextMatrix, increaseLevel, operatorException, processAnnotation, processChildStream, processOperator, processOperator, processSoftMask, processTilingPattern, processTilingPattern, processTransparencyGroup, processType3Stream, restoreGraphicsStack, restoreGraphicsState, saveGraphicsStack, saveGraphicsState, setLineDashPattern, setTextLineMatrix, setTextMatrix, showAnnotation, showFontGlyph, showForm, showText, showTextString, showTextStrings, showTransparencyGroup, showType3Glyph, transformedPoint, transformWidth, unsupportedOperator

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Constructor Details

    • PDFLayoutTextStripperByArea

      public PDFLayoutTextStripperByArea() throws IOException
      Constructor.
      Throws:
      IOException - If there is an error loading properties.
  • Method Details

    • setShouldSeparateByBeads

      public final void setShouldSeparateByBeads(boolean aShouldSeparateByBeads)
      This method does nothing in this derived class, because beads and regions are incompatible. Beads are ignored when stripping by area.
      Overrides:
      setShouldSeparateByBeads in class org.apache.pdfbox.text.PDFTextStripper
      Parameters:
      aShouldSeparateByBeads - The new grouping of beads.
    • addRegion

      public void addRegion(String regionName, Rectangle2D rect)
      Add a new region to group text by.
      Parameters:
      regionName - The name of the region.
      rect - The rectangle area to retrieve the text from. The y-coordinates are Java coordinates (y == 0 is top), not PDF coordinates (y == 0 is bottom).
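
Because the region uses a top-left origin while PDF measurements use a bottom-left origin, a rectangle measured in PDF coordinates has to be flipped vertically before it is passed to addRegion. The following is a minimal sketch of that conversion, assuming the page height is known in the same units as the rectangle. The class RegionCoordinatesSketch and the method toJavaCoordinates are hypothetical helpers and not part of this API.

```java
import java.awt.geom.Rectangle2D;

final class RegionCoordinatesSketch {

    // Converts a rectangle given in PDF coordinates (origin bottom-left, y grows upward)
    // into Java coordinates (origin top-left, y grows downward).
    // pdfRect.getY() is taken to be the lower edge of the rectangle in PDF space.
    static Rectangle2D toJavaCoordinates(Rectangle2D pdfRect, double pageHeight) {
        double upperEdgeInPdf = pdfRect.getY() + pdfRect.getHeight();
        double yFromTop = pageHeight - upperEdgeInPdf;
        return new Rectangle2D.Double(pdfRect.getX(), yFromTop,
                pdfRect.getWidth(), pdfRect.getHeight());
    }

    public static void main(String[] args) {
        // Assumed example: a 50 pt band at the very top of a 792 pt tall page,
        // expressed in PDF coordinates (lower edge at y = 742).
        Rectangle2D pdfHeader = new Rectangle2D.Double(0, 742, 612, 50);
        Rectangle2D javaHeader = toJavaCoordinates(pdfHeader, 792);
        System.out.println(javaHeader);
    }
}
```

Only the y-coordinate changes. The x-coordinate, width, and height carry over unchanged.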
    • removeRegion

      public void removeRegion(String regionName)
      Delete a region to group text by. If the region does not exist, this method does nothing.
      Parameters:
      regionName - The name of the region to delete.
    • getRegions

      public List<String> getRegions()
      Get the list of regions that have been set up.
      Returns:
      A list of java.lang.String objects to identify the region names.
    • getTextForRegion

      public String getTextForRegion(String regionName)
      Get the text for the region. This should be called after extractRegions().
      Parameters:
      regionName - The name of the region to get the text from.
      Returns:
      The text that was identified in that region.
    • extractRegions

      public void extractRegions(org.apache.pdfbox.pdmodel.PDPage page) throws IOException
      Process the page to extract the region text.
      Parameters:
      page - The page to extract the regions from.
      Throws:
      IOException - If there is an error while extracting text.
    • processTextPosition

      protected void processTextPosition(org.apache.pdfbox.text.TextPosition text)
      Overrides:
      processTextPosition in class org.apache.pdfbox.text.PDFTextStripper
    • writePage

      protected void writePage() throws IOException
      This will print the processed page text to the output stream.
      Overrides:
      writePage in class ForkPDFLayoutTextStripper
      Throws:
      IOException - If there is an error writing the text.
    • showGlyph

      protected void showGlyph(org.apache.pdfbox.util.Matrix arg0, org.apache.pdfbox.pdmodel.font.PDFont arg1, int arg2, org.apache.pdfbox.util.Vector arg3) throws IOException
      Overrides:
      showGlyph in class org.apache.pdfbox.contentstream.PDFStreamEngine
      Throws:
      IOException
    • computeFontHeight

      protected float computeFontHeight(org.apache.pdfbox.pdmodel.font.PDFont arg0) throws IOException
      Throws:
      IOException