Class ForkPDFLayoutTextStripper
java.lang.Object
org.apache.pdfbox.contentstream.PDFStreamEngine
org.apache.pdfbox.text.PDFTextStripper
org.springframework.ai.reader.pdf.layout.ForkPDFLayoutTextStripper
- Direct Known Subclasses:
PDFLayoutTextStripperByArea
public class ForkPDFLayoutTextStripper
extends org.apache.pdfbox.text.PDFTextStripper
Extends PDFTextStripper to provide custom text extraction and formatting for PDF
pages. It processes text lines, sorts text positions, and manages line breaks.
- Author:
- Jonathan Link
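The description above mentions sorting text positions and managing line breaks. As a rough illustration of that idea only, the following self-contained sketch sorts positioned glyphs, groups them into lines, and maps each x-coordinate to a character column using a fixed space width. It is not this class's implementation: `LayoutSketch`, `Glyph`, `SPACE_WIDTH_PT` (set to 4 here purely for illustration), and `LINE_TOLERANCE_PT` are hypothetical names and values, and y is assumed to be measured from the top of the page.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative sketch only; not part of PDFBox or Spring AI. */
final class LayoutSketch {

    /** Hypothetical positioned piece of text (x and y in points, y measured from the top). */
    static final class Glyph {
        final String text;
        final double x;
        final double y;

        Glyph(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Assumed width of one output space character, in points. */
    static final int SPACE_WIDTH_PT = 4;

    /** Assumed vertical tolerance, in points, for treating glyphs as one line. */
    static final double LINE_TOLERANCE_PT = 2.0;

    static String render(List<Glyph> glyphs) {
        // Sort top to bottom so that consecutive glyphs can be grouped into lines.
        List<Glyph> byY = new ArrayList<>(glyphs);
        byY.sort(Comparator.comparingDouble((Glyph g) -> g.y));

        List<List<Glyph>> lines = new ArrayList<>();
        double lineY = 0;
        for (Glyph g : byY) {
            // Start a new line when the glyph is too far below the current line's first glyph.
            if (lines.isEmpty() || Math.abs(g.y - lineY) > LINE_TOLERANCE_PT) {
                lines.add(new ArrayList<>());
                lineY = g.y;
            }
            lines.get(lines.size() - 1).add(g);
        }

        StringBuilder out = new StringBuilder();
        for (List<Glyph> line : lines) {
            // Within a line, order glyphs left to right.
            line.sort(Comparator.comparingDouble((Glyph g) -> g.x));
            StringBuilder row = new StringBuilder();
            for (Glyph g : line) {
                // Convert the x-coordinate into a character column and pad up to it.
                int column = (int) Math.round(g.x / SPACE_WIDTH_PT);
                while (row.length() < column) {
                    row.append(' ');
                }
                row.append(g.text);
            }
            out.append(row).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = List.of(
                new Glyph("Qty", 40, 10.5),
                new Glyph("Name", 0, 10),
                new Glyph("9", 40, 100.5),
                new Glyph("Total", 0, 100));
        System.out.print(render(glyphs));
    }
}
```

Padding with spaces rather than emitting a single word separator is what keeps columns aligned in the text output; the trade-off is that the result depends on the assumed space width matching the document's typical glyph width.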
-
Field Summary
Modifier and Type       Field                                 Description
static final boolean    DEBUG
static final int        OUTPUT_SPACE_CHARACTER_WIDTH_IN_PT
Fields inherited from class org.apache.pdfbox.text.PDFTextStripper
charactersByArticle, document, LINE_SEPARATOR, output
-
Constructor Summary
-
Method Summary
Modifier and Type    Method    Description
protected float      computeFontHeight(org.apache.pdfbox.pdmodel.font.PDFont arg0)
void                 processPage(org.apache.pdfbox.pdmodel.PDPage page)
protected void       showGlyph(org.apache.pdfbox.util.Matrix arg0, org.apache.pdfbox.pdmodel.font.PDFont arg1, int arg2, org.apache.pdfbox.util.Vector arg3)
protected void       writePage()
Methods inherited from class org.apache.pdfbox.text.PDFTextStripper
endArticle, endDocument, endPage, getAddMoreFormatting, getArticleEnd, getArticleStart, getAverageCharTolerance, getCharactersByArticle, getCurrentPageNo, getDropThreshold, getEndBookmark, getEndPage, getIndentThreshold, getLineSeparator, getListItemPatterns, getOutput, getPageEnd, getPageStart, getParagraphEnd, getParagraphStart, getSeparateByBeads, getSortByPosition, getSpacingTolerance, getStartBookmark, getStartPage, getSuppressDuplicateOverlappingText, getText, getWordSeparator, matchPattern, processPages, processTextPosition, setAddMoreFormatting, setArticleEnd, setArticleStart, setAverageCharTolerance, setDropThreshold, setEndBookmark, setEndPage, setIndentThreshold, setLineSeparator, setListItemPatterns, setPageEnd, setPageStart, setParagraphEnd, setParagraphStart, setShouldSeparateByBeads, setSortByPosition, setSpacingTolerance, setStartBookmark, setStartPage, setSuppressDuplicateOverlappingText, setWordSeparator, startArticle, startArticle, startDocument, startPage, writeCharacters, writeLineSeparator, writePageEnd, writePageStart, writeParagraphEnd, writeParagraphSeparator, writeParagraphStart, writeString, writeString, writeText, writeWordSeparator
Methods inherited from class org.apache.pdfbox.contentstream.PDFStreamEngine
addOperator, applyTextAdjustment, beginMarkedContentSequence, beginText, decreaseLevel, endMarkedContentSequence, endText, getAppearance, getCurrentPage, getGraphicsStackSize, getGraphicsState, getInitialMatrix, getLevel, getResources, getTextLineMatrix, getTextMatrix, increaseLevel, operatorException, processAnnotation, processChildStream, processOperator, processOperator, processSoftMask, processTilingPattern, processTilingPattern, processTransparencyGroup, processType3Stream, restoreGraphicsStack, restoreGraphicsState, saveGraphicsStack, saveGraphicsState, setLineDashPattern, setTextLineMatrix, setTextMatrix, showAnnotation, showFontGlyph, showForm, showText, showTextString, showTextStrings, showTransparencyGroup, showType3Glyph, transformedPoint, transformWidth, unsupportedOperator
-
Field Details
-
DEBUG
public static final boolean DEBUG
-
OUTPUT_SPACE_CHARACTER_WIDTH_IN_PT
public static final int OUTPUT_SPACE_CHARACTER_WIDTH_IN_PT
-
-
Constructor Details
-
ForkPDFLayoutTextStripper
Constructor
- Throws:
IOException
-
-
Method Details
-
processPage
- Overrides:
processPage
in class org.apache.pdfbox.text.PDFTextStripper
- Parameters:
page - page to parse
- Throws:
IOException
-
writePage
- Overrides:
writePage
in class org.apache.pdfbox.text.PDFTextStripper
- Throws:
IOException
-
showGlyph
protected void showGlyph(org.apache.pdfbox.util.Matrix arg0, org.apache.pdfbox.pdmodel.font.PDFont arg1, int arg2, org.apache.pdfbox.util.Vector arg3) throws IOException
- Overrides:
showGlyph
in class org.apache.pdfbox.contentstream.PDFStreamEngine
- Throws:
IOException
-
computeFontHeight
- Throws:
IOException
-