Class PDFLayoutTextStripperByArea
java.lang.Object
org.apache.pdfbox.contentstream.PDFStreamEngine
org.apache.pdfbox.text.PDFTextStripper
org.springframework.ai.reader.pdf.layout.ForkPDFLayoutTextStripper
org.springframework.ai.reader.pdf.layout.PDFLayoutTextStripperByArea
Re-implements the PDFLayoutTextStripperByArea on top of the PDFLayoutTextStripper
instead of the original PDFTextStripper.
This class allows cropping pages (e.g., removing headers, footers, and between-page
empty spaces) while extracting layout text, preserving the PDF's internal text
formatting.
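The cropping described above comes down to choosing a rectangle that leaves out the header and footer bands. The following is a minimal sketch of that calculation, not part of this class: `PageCropSketch`, `bodyRegion`, and the page and band sizes are hypothetical names and values chosen for illustration. It uses only `java.awt.geom.Rectangle2D` and assumes Java coordinates (y == 0 at the top of the page), as required by `addRegion`.

```java
import java.awt.geom.Rectangle2D;

/** Hypothetical helper (not part of Spring AI or PDFBox). */
final class PageCropSketch {

    private PageCropSketch() {
    }

    /**
     * Returns the page body as a rectangle in Java coordinates (y == 0 is top).
     * All values are in points; the header and footer heights are caller-chosen assumptions.
     */
    static Rectangle2D bodyRegion(double pageWidth, double pageHeight,
                                  double headerHeight, double footerHeight) {
        double bodyHeight = pageHeight - headerHeight - footerHeight;
        if (pageWidth <= 0 || bodyHeight <= 0) {
            throw new IllegalArgumentException("Header and footer leave no body area");
        }
        // Start just below the header band and stop just above the footer band.
        return new Rectangle2D.Double(0, headerHeight, pageWidth, bodyHeight);
    }

    public static void main(String[] args) {
        // Example values: a 612 x 792 pt page with a 72 pt header and a 36 pt footer.
        Rectangle2D body = bodyRegion(612, 792, 72, 36);
        System.out.println(body);
        // With a stripper instance and a PDPage at hand, the rectangle would be used
        // with the methods documented on this page, in this order:
        //   stripper.addRegion("body", body);
        //   stripper.extractRegions(page);
        //   String text = stripper.getTextForRegion("body");
    }
}
```

The region name "body" is arbitrary; any name works as long as the same name is passed to `getTextForRegion` after `extractRegions` has run.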
- Author:
- Christian Tzolov
-
Field Summary
Fields inherited from class org.springframework.ai.reader.pdf.layout.ForkPDFLayoutTextStripper
DEBUG, OUTPUT_SPACE_CHARACTER_WIDTH_IN_PT
Fields inherited from class org.apache.pdfbox.text.PDFTextStripper
charactersByArticle, document, LINE_SEPARATOR, output
-
Constructor Summary
-
Method Summary
Modifier and Type, Method: Description
- void addRegion(String regionName, Rectangle2D rect): Add a new region to group text by.
- protected float computeFontHeight(org.apache.pdfbox.pdmodel.font.PDFont arg0)
- void extractRegions(org.apache.pdfbox.pdmodel.PDPage page): Process the page to extract the region text.
- List<String> getRegions(): Get the list of regions that have been set up.
- String getTextForRegion(String regionName): Get the text for the region. This should be called after extractRegions().
- protected void processTextPosition(org.apache.pdfbox.text.TextPosition text)
- void removeRegion(String regionName): Delete a region to group text by.
- final void setShouldSeparateByBeads(boolean aShouldSeparateByBeads): This method does nothing in this derived class, because beads and regions are incompatible.
- protected void showGlyph(org.apache.pdfbox.util.Matrix arg0, org.apache.pdfbox.pdmodel.font.PDFont arg1, int arg2, org.apache.pdfbox.util.Vector arg3)
- protected void writePage(): This will print the processed page text to the output stream.
Methods inherited from class org.springframework.ai.reader.pdf.layout.ForkPDFLayoutTextStripper
processPage
Methods inherited from class org.apache.pdfbox.text.PDFTextStripper
endArticle, endDocument, endPage, getAddMoreFormatting, getArticleEnd, getArticleStart, getAverageCharTolerance, getCharactersByArticle, getCurrentPageNo, getDropThreshold, getEndBookmark, getEndPage, getIndentThreshold, getLineSeparator, getListItemPatterns, getOutput, getPageEnd, getPageStart, getParagraphEnd, getParagraphStart, getSeparateByBeads, getSortByPosition, getSpacingTolerance, getStartBookmark, getStartPage, getSuppressDuplicateOverlappingText, getText, getWordSeparator, matchPattern, processPages, setAddMoreFormatting, setArticleEnd, setArticleStart, setAverageCharTolerance, setDropThreshold, setEndBookmark, setEndPage, setIndentThreshold, setLineSeparator, setListItemPatterns, setPageEnd, setPageStart, setParagraphEnd, setParagraphStart, setSortByPosition, setSpacingTolerance, setStartBookmark, setStartPage, setSuppressDuplicateOverlappingText, setWordSeparator, startArticle, startArticle, startDocument, startPage, writeCharacters, writeLineSeparator, writePageEnd, writePageStart, writeParagraphEnd, writeParagraphSeparator, writeParagraphStart, writeString, writeString, writeText, writeWordSeparator
Methods inherited from class org.apache.pdfbox.contentstream.PDFStreamEngine
addOperator, applyTextAdjustment, beginMarkedContentSequence, beginText, decreaseLevel, endMarkedContentSequence, endText, getAppearance, getCurrentPage, getGraphicsStackSize, getGraphicsState, getInitialMatrix, getLevel, getResources, getTextLineMatrix, getTextMatrix, increaseLevel, operatorException, processAnnotation, processChildStream, processOperator, processOperator, processSoftMask, processTilingPattern, processTilingPattern, processTransparencyGroup, processType3Stream, restoreGraphicsStack, restoreGraphicsState, saveGraphicsStack, saveGraphicsState, setLineDashPattern, setTextLineMatrix, setTextMatrix, showAnnotation, showFontGlyph, showForm, showText, showTextString, showTextStrings, showTransparencyGroup, showType3Glyph, transformedPoint, transformWidth, unsupportedOperator
-
Constructor Details
-
PDFLayoutTextStripperByArea
Constructor.
- Throws:
IOException
- If there is an error loading properties.
-
-
Method Details
-
setShouldSeparateByBeads
public final void setShouldSeparateByBeads(boolean aShouldSeparateByBeads)
This method does nothing in this derived class, because beads and regions are incompatible. Beads are ignored when stripping by area.
- Overrides:
setShouldSeparateByBeads
in class org.apache.pdfbox.text.PDFTextStripper
- Parameters:
aShouldSeparateByBeads
- The new grouping of beads.
-
addRegion
Add a new region to group text by.
- Parameters:
regionName
- The name of the region.
rect
- The rectangle area to retrieve the text from. The y-coordinates are Java coordinates (y == 0 is top), not PDF coordinates (y == 0 is bottom).
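Because `rect` is expected in Java coordinates, a rectangle measured in PDF coordinates has to be flipped vertically before it is passed to `addRegion`. The sketch below shows one way to do that. `RegionCoordinatesSketch` and `toJavaCoordinates` are hypothetical names, and the code assumes an unrotated page whose lower-left corner is at the PDF origin (0, 0); pages with a rotation or an offset crop box would need extra handling.

```java
import java.awt.geom.Rectangle2D;

/** Hypothetical helper (not part of Spring AI or PDFBox). */
final class RegionCoordinatesSketch {

    private RegionCoordinatesSketch() {
    }

    /**
     * Converts a rectangle given in PDF coordinates (origin at the bottom-left,
     * x/y naming the rectangle's lower-left corner) into Java coordinates
     * (origin at the top-left, x/y naming the rectangle's upper-left corner).
     * Assumes an unrotated page whose lower-left corner is at (0, 0).
     */
    static Rectangle2D toJavaCoordinates(Rectangle2D pdfRect, double pageHeight) {
        // The rectangle's top edge sits (y + height) above the page bottom,
        // so it is pageHeight - (y + height) below the page top.
        double javaY = pageHeight - (pdfRect.getY() + pdfRect.getHeight());
        return new Rectangle2D.Double(pdfRect.getX(), javaY,
                pdfRect.getWidth(), pdfRect.getHeight());
    }

    public static void main(String[] args) {
        // Example values: a 200 x 42 pt box whose lower-left corner is at (50, 700)
        // on a page that is 792 pt tall.
        Rectangle2D pdfRect = new Rectangle2D.Double(50, 700, 200, 42);
        System.out.println(toJavaCoordinates(pdfRect, 792));
    }
}
```

Only the y-coordinate changes; x, width, and height are the same in both systems.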
-
removeRegion
Delete a region to group text by. If the region does not exist, this method does nothing.
- Parameters:
regionName
- The name of the region to delete.
-
getRegions
Get the list of regions that have been set up.
- Returns:
- A list of java.lang.String objects to identify the region names.
-
getTextForRegion
Get the text for the region. This should be called after extractRegions().
- Parameters:
regionName
- The name of the region to get the text from.
- Returns:
- The text that was identified in that region.
-
extractRegions
Process the page to extract the region text.
- Parameters:
page
- The page to extract the regions from.
- Throws:
IOException
- If there is an error while extracting text.
-
processTextPosition
protected void processTextPosition(org.apache.pdfbox.text.TextPosition text)
- Overrides:
processTextPosition
in class org.apache.pdfbox.text.PDFTextStripper
-
writePage
This will print the processed page text to the output stream.
- Overrides:
writePage
in class ForkPDFLayoutTextStripper
- Throws:
IOException
- If there is an error writing the text.
-
showGlyph
protected void showGlyph(org.apache.pdfbox.util.Matrix arg0, org.apache.pdfbox.pdmodel.font.PDFont arg1, int arg2, org.apache.pdfbox.util.Vector arg3) throws IOException
- Overrides:
showGlyph
in class org.apache.pdfbox.contentstream.PDFStreamEngine
- Throws:
IOException
-
computeFontHeight
- Throws:
IOException
-