I think I did something similar in 2018 that you might be able to use: see the FilteredTextStripper class in ExtractText.java. It only extracts text with angle 0.

/**
 * TextStripper that only processes glyphs that have angle 0.
 */
class FilteredTextStripper extends PDFTextStripper
{
    FilteredTextStripper() throws IOException
    {
    }

    @Override
    protected void processTextPosition(TextPosition text)
    {
        int angle = ExtractText.getAngle(text);
        if (angle == 0)
        {
            super.processTextPosition(text);
        }
    }
}



    static int getAngle(TextPosition text)
    {
        // should this become a part of TextPosition?
        Matrix m = text.getTextMatrix().clone();
        m.concatenate(text.getFont().getFontMatrix());
        return (int) Math.round(Math.toDegrees(Math.atan2(m.getShearY(), m.getScaleY())));
    }
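
To make the arithmetic in getAngle() concrete, here is a minimal stand-alone sketch of the same formula without any PDFBox types. The class AngleFilterSketch, the methods angleDegrees and isRoughlyHorizontal, and the tolerance parameter are hypothetical names used only for illustration; they are not part of PDFBox. It assumes that a pure rotation by theta puts sin(theta) in the shearY entry and cos(theta) in the scaleY entry of the combined matrix. A tolerance check like this might help if the normal text is slightly skewed, but that is a guess about your documents, not something the original filter does.

```java
// Stand-alone sketch: plain Java only, mirrors the arithmetic of getAngle().
final class AngleFilterSketch
{
    private AngleFilterSketch()
    {
    }

    /**
     * Same formula as getAngle(): rotation in whole degrees, computed from the
     * shearY and scaleY entries of the combined text/font matrix.
     */
    static int angleDegrees(double shearY, double scaleY)
    {
        return (int) Math.round(Math.toDegrees(Math.atan2(shearY, scaleY)));
    }

    /**
     * Hypothetical variant of the "angle == 0" test: accepts glyphs whose
     * rounded angle is within toleranceDegrees of horizontal.
     */
    static boolean isRoughlyHorizontal(double shearY, double scaleY, int toleranceDegrees)
    {
        return Math.abs(angleDegrees(shearY, scaleY)) <= toleranceDegrees;
    }

    public static void main(String[] args)
    {
        // Assumed watermark rotation of 45 degrees, for illustration only.
        double r = Math.toRadians(45);
        System.out.println("unrotated: " + angleDegrees(0.0, 1.0));
        System.out.println("rotated:   " + angleDegrees(Math.sin(r), Math.cos(r)));
        System.out.println("kept:      " + isRoughlyHorizontal(Math.sin(r), Math.cos(r), 2));
    }
}
```

In a stripper subclass, the two arguments would come from m.getShearY() and m.getScaleY() exactly as in getAngle() above.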


Tilman


On 05.03.2024 11:52, Hengyu Weng wrote:
Sometimes the watermark overlaps the normal text that we want to
extract, so it would be great if it were possible to insert a filter and
skip unwanted TextPositions (e.g. the watermark text may be rotated).
I think the 'writePage' method in 'PDFTextStripper' is an appropriate
place to add this filter, but I found it difficult to override because it
refers to many private members, and PDFTextStripper extends
LegacyPDFStreamEngine, which is a non-public class, so I cannot copy
and modify it.

Currently I'm embedding the PDFBox source code so that I can modify the
above classes. I believe it would be much better if you could officially
add an insertion point or some hooks to them.

Thank you.


