Class ForkPDFLayoutTextStripper

java.lang.Object
org.apache.pdfbox.contentstream.PDFStreamEngine
org.apache.pdfbox.text.PDFTextStripper
org.springframework.ai.reader.pdf.layout.ForkPDFLayoutTextStripper
Direct Known Subclasses:
PDFLayoutTextStripperByArea

public class ForkPDFLayoutTextStripper extends org.apache.pdfbox.text.PDFTextStripper
Extends PDFTextStripper to extract text from PDF pages with custom formatting. It processes text lines, sorts text positions, and manages line breaks.
Author:
Jonathan Link
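
The class description mentions sorting text positions and managing line breaks. The following is a minimal, standard-library-only sketch of that general idea, not this class's actual implementation. The names LayoutSketch, Glyph, CELL_WIDTH_PT, and LINE_TOLERANCE_PT are hypothetical, the tolerance and cell width values are assumptions, and coordinates are assumed to be in points with y increasing down the page. It requires Java 16 or later for records.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class LayoutSketch {

    /** Hypothetical stand-in for a positioned piece of text; not a PDFBox type. */
    record Glyph(String text, double x, double y) {}

    // Assumed width of one output character cell, in points.
    static final double CELL_WIDTH_PT = 4.0;
    // Assumed vertical tolerance for treating glyphs as part of the same line.
    static final double LINE_TOLERANCE_PT = 2.0;

    static String render(List<Glyph> glyphs) {
        // Step 1: order glyphs top to bottom.
        List<Glyph> byY = new ArrayList<>(glyphs);
        byY.sort(Comparator.comparingDouble(Glyph::y));

        // Step 2: start a new line whenever y moves beyond the tolerance
        // relative to the first glyph of the current line.
        List<List<Glyph>> lines = new ArrayList<>();
        for (Glyph g : byY) {
            List<Glyph> last = lines.isEmpty() ? null : lines.get(lines.size() - 1);
            if (last == null || Math.abs(g.y() - last.get(0).y()) > LINE_TOLERANCE_PT) {
                last = new ArrayList<>();
                lines.add(last);
            }
            last.add(g);
        }

        // Step 3: within each line, order left to right and pad with spaces
        // so that each glyph starts at the column matching its x position.
        StringBuilder out = new StringBuilder();
        for (List<Glyph> line : lines) {
            line.sort(Comparator.comparingDouble(Glyph::x));
            StringBuilder row = new StringBuilder();
            for (Glyph g : line) {
                int column = (int) Math.round(g.x() / CELL_WIDTH_PT);
                while (row.length() < column) {
                    row.append(' ');
                }
                row.append(g.text());
            }
            out.append(row).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = List.of(
                new Glyph("Total", 0, 20.5),
                new Glyph("Name", 0, 10),
                new Glyph("Qty", 40, 10.4),
                new Glyph("7", 40, 20));
        System.out.print(render(glyphs));
    }
}
```

With the sample input in main, the two columns line up at character column 10, so the output is "Name      Qty" followed by "Total     7". The point of the sketch is only that horizontal position is mapped to a character column, which is what keeps tabular layouts readable in plain text.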
  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    static final boolean
     
    static final int
     

    Fields inherited from class org.apache.pdfbox.text.PDFTextStripper

    charactersByArticle, document, LINE_SEPARATOR, output
  • Constructor Summary

    Constructors
    Constructor
    Description
    ForkPDFLayoutTextStripper()
    Constructor
  • Method Summary

    Modifier and Type
    Method
    Description
    protected float
    computeFontHeight(org.apache.pdfbox.pdmodel.font.PDFont arg0)
     
    void
    processPage(org.apache.pdfbox.pdmodel.PDPage page)
     
    protected void
    showGlyph(org.apache.pdfbox.util.Matrix arg0, org.apache.pdfbox.pdmodel.font.PDFont arg1, int arg2, org.apache.pdfbox.util.Vector arg3)
     
    protected void
    writePage()
     

    Methods inherited from class org.apache.pdfbox.text.PDFTextStripper

    endArticle, endDocument, endPage, getAddMoreFormatting, getArticleEnd, getArticleStart, getAverageCharTolerance, getCharactersByArticle, getCurrentPageNo, getDropThreshold, getEndBookmark, getEndPage, getIndentThreshold, getLineSeparator, getListItemPatterns, getOutput, getPageEnd, getPageStart, getParagraphEnd, getParagraphStart, getSeparateByBeads, getSortByPosition, getSpacingTolerance, getStartBookmark, getStartPage, getSuppressDuplicateOverlappingText, getText, getWordSeparator, matchPattern, processPages, processTextPosition, setAddMoreFormatting, setArticleEnd, setArticleStart, setAverageCharTolerance, setDropThreshold, setEndBookmark, setEndPage, setIndentThreshold, setLineSeparator, setListItemPatterns, setPageEnd, setPageStart, setParagraphEnd, setParagraphStart, setShouldSeparateByBeads, setSortByPosition, setSpacingTolerance, setStartBookmark, setStartPage, setSuppressDuplicateOverlappingText, setWordSeparator, startArticle, startArticle, startDocument, startPage, writeCharacters, writeLineSeparator, writePageEnd, writePageStart, writeParagraphEnd, writeParagraphSeparator, writeParagraphStart, writeString, writeString, writeText, writeWordSeparator

    Methods inherited from class org.apache.pdfbox.contentstream.PDFStreamEngine

    addOperator, applyTextAdjustment, beginMarkedContentSequence, beginText, decreaseLevel, endMarkedContentSequence, endText, getAppearance, getCurrentPage, getGraphicsStackSize, getGraphicsState, getInitialMatrix, getLevel, getResources, getTextLineMatrix, getTextMatrix, increaseLevel, operatorException, processAnnotation, processChildStream, processOperator, processOperator, processSoftMask, processTilingPattern, processTilingPattern, processTransparencyGroup, processType3Stream, restoreGraphicsStack, restoreGraphicsState, saveGraphicsStack, saveGraphicsState, setLineDashPattern, setTextLineMatrix, setTextMatrix, showAnnotation, showFontGlyph, showForm, showText, showTextString, showTextStrings, showTransparencyGroup, showType3Glyph, transformedPoint, transformWidth, unsupportedOperator

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Field Details

  • Constructor Details

    • ForkPDFLayoutTextStripper

      public ForkPDFLayoutTextStripper() throws IOException
      Constructor
      Throws:
      IOException
  • Method Details

    • processPage

      public void processPage(org.apache.pdfbox.pdmodel.PDPage page) throws IOException
      Overrides:
      processPage in class org.apache.pdfbox.text.PDFTextStripper
      Parameters:
      page - page to parse
      Throws:
      IOException
    • writePage

      protected void writePage() throws IOException
      Overrides:
      writePage in class org.apache.pdfbox.text.PDFTextStripper
      Throws:
      IOException
    • showGlyph

      protected void showGlyph(org.apache.pdfbox.util.Matrix arg0, org.apache.pdfbox.pdmodel.font.PDFont arg1, int arg2, org.apache.pdfbox.util.Vector arg3) throws IOException
      Overrides:
      showGlyph in class org.apache.pdfbox.contentstream.PDFStreamEngine
      Throws:
      IOException
    • computeFontHeight

      protected float computeFontHeight(org.apache.pdfbox.pdmodel.font.PDFont arg0) throws IOException
      Throws:
      IOException