Hi all,

I'm trying to scrape highlighted text from a large number of PDFs, using
PDFBox (from Jython). I've come close, but have hit a bit of a wall.

I can access the annotations (PDAnnotationTextMarkup) and get their bounding
box in user space coordinates through a getRectangle() method. I then use
PDFTextStripperByArea to mark the region and fetch text from it. However,
the text I get is from a different portion of the page. This suggests to me
that the coordinate systems might be different.

If I use the PrintTextLocations example in org.pdfbox.examples.util, it
gives every piece of text with its location. The locations given here agree
with those I get from scraping text with PDFTextStripperByArea, i.e. they
use the same coordinate system.
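
My working guess, which I have not confirmed, is that the annotation
rectangle is in PDF user space (origin at the bottom-left, y increasing
upwards), while the stripper expects a Java-style rectangle (origin at the
top-left, y increasing downwards). Below is a minimal sketch of that flip in
plain Python. The function name and the page_height argument are my own, and
it assumes an unrotated page whose visible area starts at (0, 0):

```python
def user_space_to_top_left(llx, lly, urx, ury, page_height):
    """Flip a PDF user-space rectangle into top-left-origin coordinates.

    llx, lly, urx, ury: lower-left and upper-right corners of the
    annotation rectangle, in user space (origin bottom-left, y up).
    page_height: height of the page in the same units (assumed to be
    the media box height, with no rotation or crop offset).

    Returns (x, y, width, height) with the origin at the top-left
    corner of the page and y increasing downwards.
    """
    width = urx - llx
    height = ury - lly
    x = llx
    # The top edge of the box in user space becomes its y origin
    # once the vertical axis is flipped.
    y = page_height - ury
    return (x, y, width, height)
```

If this guess is right, the four values would go into a
java.awt.geom.Rectangle2D.Float, which is then passed to the stripper's
addRegion() in place of the raw annotation rectangle.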

Can anyone suggest the appropriate transform between these systems?

Thanks,
Lars
