Class ForkPDFLayoutTextStripper
java.lang.Object
org.apache.pdfbox.contentstream.PDFStreamEngine
org.apache.pdfbox.text.PDFTextStripper
org.springframework.ai.reader.pdf.layout.ForkPDFLayoutTextStripper
- Direct Known Subclasses:
PDFLayoutTextStripperByArea
public class ForkPDFLayoutTextStripper
extends org.apache.pdfbox.text.PDFTextStripper
This class extends PDFTextStripper to provide custom text extraction and formatting
capabilities for PDF pages. It includes features like processing text lines, sorting
text positions, and managing line breaks.
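The description above mentions processing text lines, sorting text positions, and managing line breaks. A minimal sketch of that general idea follows, using only the Java standard library. It is not the class's actual implementation: `Glyph` and `LineSketch` are hypothetical names introduced for illustration, and the sketch assumes y is measured from the top of the page and that glyphs on the same line share the same rounded y value.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a positioned piece of text; not a PDFBox type.
final class Glyph {
    final String text;
    final float x; // left edge in points
    final float y; // distance from the top of the page in points (assumption)

    Glyph(String text, float x, float y) {
        this.text = text;
        this.x = x;
        this.y = y;
    }
}

final class LineSketch {
    // Groups glyphs into lines by rounded y, sorts each line left to right,
    // and joins the lines with a line break.
    static String render(List<Glyph> glyphs) {
        // TreeMap keeps the lines ordered from the top of the page downward.
        Map<Integer, List<Glyph>> lines = new TreeMap<>();
        for (Glyph g : glyphs) {
            lines.computeIfAbsent(Math.round(g.y), k -> new ArrayList<>()).add(g);
        }
        List<String> rendered = new ArrayList<>();
        for (List<Glyph> line : lines.values()) {
            // Sort the text positions within a line by their x coordinate.
            line.sort(Comparator.comparingDouble(g -> g.x));
            StringBuilder sb = new StringBuilder();
            for (Glyph g : line) {
                sb.append(g.text);
            }
            rendered.add(sb.toString());
        }
        return String.join("\n", rendered);
    }
}
```

Glyphs may arrive in any order, as they can in a PDF content stream; grouping before sorting is what lets the output follow reading order rather than drawing order.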
- Author:
Jonathan Link
Field Summary
Fields (Modifier and Type, Field)
- static final boolean DEBUG
- static final int OUTPUT_SPACE_CHARACTER_WIDTH_IN_PT
Fields inherited from class org.apache.pdfbox.text.PDFTextStripper
charactersByArticle, document, LINE_SEPARATOR, output
Constructor Summary
Constructors
- ForkPDFLayoutTextStripper
Method Summary
Methods (Modifier and Type, Method)
- protected float computeFontHeight(org.apache.pdfbox.pdmodel.font.PDFont arg0)
- void processPage(org.apache.pdfbox.pdmodel.PDPage page)
- protected void showGlyph(org.apache.pdfbox.util.Matrix arg0, org.apache.pdfbox.pdmodel.font.PDFont arg1, int arg2, org.apache.pdfbox.util.Vector arg3)
- protected void writePage()
Methods inherited from class org.apache.pdfbox.text.PDFTextStripper
beginMarkedContentSequence, endArticle, endDocument, endMarkedContentSequence, endPage, getAddMoreFormatting, getArticleEnd, getArticleStart, getAverageCharTolerance, getCharactersByArticle, getCurrentPageNo, getDropThreshold, getEndBookmark, getEndPage, getIgnoreContentStreamSpaceGlyphs, getIndentThreshold, getLineSeparator, getListItemPatterns, getOutput, getPageEnd, getPageStart, getParagraphEnd, getParagraphStart, getSeparateByBeads, getSortByPosition, getSpacingTolerance, getStartBookmark, getStartPage, getSuppressDuplicateOverlappingText, getText, getWordSeparator, matchPattern, processPages, processTextPosition, setAddMoreFormatting, setArticleEnd, setArticleStart, setAverageCharTolerance, setDropThreshold, setEndBookmark, setEndPage, setIgnoreContentStreamSpaceGlyphs, setIndentThreshold, setLineSeparator, setListItemPatterns, setPageEnd, setPageStart, setParagraphEnd, setParagraphStart, setShouldSeparateByBeads, setSortByPosition, setSpacingTolerance, setStartBookmark, setStartPage, setSuppressDuplicateOverlappingText, setWordSeparator, startArticle, startArticle, startDocument, startPage, writeCharacters, writeLineSeparator, writePageEnd, writePageStart, writeParagraphEnd, writeParagraphSeparator, writeParagraphStart, writeString, writeString, writeText, writeWordSeparator
Methods inherited from class org.apache.pdfbox.contentstream.PDFStreamEngine
addOperator, applyTextAdjustment, beginText, decreaseLevel, endText, getAppearance, getCurrentPage, getGraphicsStackSize, getGraphicsState, getInitialMatrix, getLevel, getResources, getTextLineMatrix, getTextMatrix, increaseLevel, isShouldProcessColorOperators, markedContentPoint, operatorException, processAnnotation, processChildStream, processOperator, processOperator, processSoftMask, processTilingPattern, processTilingPattern, processTransparencyGroup, processType3Stream, restoreGraphicsStack, restoreGraphicsState, saveGraphicsStack, saveGraphicsState, setLineDashPattern, setTextLineMatrix, setTextMatrix, showAnnotation, showFontGlyph, showForm, showText, showTextString, showTextStrings, showTransparencyGroup, showType3Glyph, transformedPoint, transformWidth, unsupportedOperator
Field Details
DEBUG
public static final boolean DEBUG
OUTPUT_SPACE_CHARACTER_WIDTH_IN_PT
public static final int OUTPUT_SPACE_CHARACTER_WIDTH_IN_PT
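The constant's name suggests that one space character in the output stands for a fixed width in points on the page. The sketch below shows how such a constant could map a horizontal position to an output column. This is an assumption about the intent, not the class's documented behavior: `ColumnSketch`, `columnFor`, and `place` are hypothetical names, and the value 4 is chosen only for illustration because this page does not state the constant's value.

```java
final class ColumnSketch {
    // Illustrative value only; the real constant's value is not given on this page.
    static final int SPACE_WIDTH_PT = 4;

    // Converts a horizontal position in points to a zero-based output column.
    static int columnFor(float xInPt) {
        return (int) (xInPt / SPACE_WIDTH_PT);
    }

    // Pads the line with spaces up to the target column, then appends the text.
    // If the line already extends past that column, the text is simply appended.
    static String place(String line, float xInPt, String text) {
        StringBuilder sb = new StringBuilder(line);
        int column = columnFor(xInPt);
        while (sb.length() < column) {
            sb.append(' ');
        }
        sb.append(text);
        return sb.toString();
    }
}
```

With a width of 4 points per space, text that starts 20 points from the left edge lands at column 5, so horizontal gaps on the page become runs of spaces in the extracted text.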
Constructor Details
ForkPDFLayoutTextStripper
Constructor.
- Throws:
IOException
Method Details
processPage
public void processPage(org.apache.pdfbox.pdmodel.PDPage page) throws IOException
- Overrides:
processPage in class org.apache.pdfbox.text.PDFTextStripper
- Parameters:
page - page to parse
- Throws:
IOException
writePage
protected void writePage() throws IOException
- Overrides:
writePage in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
IOException
showGlyph
protected void showGlyph(org.apache.pdfbox.util.Matrix arg0, org.apache.pdfbox.pdmodel.font.PDFont arg1, int arg2, org.apache.pdfbox.util.Vector arg3) throws IOException
- Overrides:
showGlyph in class org.apache.pdfbox.contentstream.PDFStreamEngine
- Throws:
IOException
computeFontHeight
protected float computeFontHeight(org.apache.pdfbox.pdmodel.font.PDFont arg0) throws IOException
- Throws:
IOException