I think I did something similar in 2018 that you might be able to use; see the
FilteredTextStripper class in ExtractText.java. That one only extracts
text with angle 0:

/**
 * TextStripper that only processes glyphs that have angle 0.
 */
class FilteredTextStripper extends PDFTextStripper
{
    FilteredTextStripper() throws IOException
    {
    }

    @Override
    protected void processTextPosition(TextPosition text)
    {
        int angle = ExtractText.getAngle(text);
        if (angle == 0)
        {
            super.processTextPosition(text);
        }
    }
}

static int getAngle(TextPosition text)
{
    // should this become a part of TextPosition?
    Matrix m = text.getTextMatrix().clone();
    m.concatenate(text.getFont().getFontMatrix());
    return (int) Math.round(Math.toDegrees(Math.atan2(m.getShearY(), m.getScaleY())));
}

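In case the atan2 part is not obvious, here is a small standalone sketch of
the same formula on raw matrix entries, with no PDFBox involved. The class and
method names (GlyphAngleSketch, angleOf, angleOfRotation) are made up for
illustration, and it assumes that getShearY() and getScaleY() correspond to
the b and d entries of a PDF matrix [a b c d e f], so that a pure rotation by
theta gives b = sin(theta) and d = cos(theta).

```java
public class GlyphAngleSketch
{
    /**
     * Same arithmetic as getAngle, applied to raw matrix entries.
     * Assumption: shearY is the "b" entry and scaleY is the "d" entry
     * of a PDF matrix [a b c d e f].
     */
    static int angleOf(double shearY, double scaleY)
    {
        return (int) Math.round(Math.toDegrees(Math.atan2(shearY, scaleY)));
    }

    /**
     * Hypothetical helper: builds the b and d entries of a rotation by the
     * given number of degrees, scaled by a positive factor (e.g. a font size),
     * and returns the rounded angle.
     */
    static int angleOfRotation(double degrees, double scale)
    {
        double theta = Math.toRadians(degrees);
        // Rotation in PDF notation: [cos(theta) sin(theta) -sin(theta) cos(theta) 0 0]
        return angleOf(scale * Math.sin(theta), scale * Math.cos(theta));
    }

    public static void main(String[] args)
    {
        // Upright text, a diagonal watermark, and vertical text.
        System.out.println(angleOfRotation(0, 1));
        System.out.println(angleOfRotation(45, 12));
        System.out.println(angleOfRotation(90, 12));
        // A slightly skewed glyph still rounds to 0 and would pass the filter.
        System.out.println(angleOfRotation(0.3, 12));
    }
}
```

Because both entries are multiplied by the same positive scale, the font size
drops out of atan2, and rounding to an integer keeps small floating point
differences from making upright text look rotated. To keep or drop other
angles, changing the "angle == 0" test in processTextPosition should be
enough.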
Tilman
On 05.03.2024 11:52, Hengyu Weng wrote:
Sometimes the watermark overlaps with normal text that we want to
extract, so it would be great if it were possible to insert a filter and skip
some useless TextPositions (e.g. the text of the watermark may have a
rotation). I think the 'writePage' method in 'PDFTextStripper' is an
appropriate place to add this filter, but I found it difficult to
override this method as it refers to a lot of private members, and
PDFTextStripper extends LegacyPDFStreamEngine, which is a non-public class,
so I cannot copy and modify it.
Currently I'm embedding the source code of pdfbox so that I can modify the
above classes. I believe it would be much better if you could
officially add an insertion point or some hooks to them.
Thank you.