Scraping pdf annotations

Lars Yencken Thu, 30 Apr 2009 22:05:36 -0700

Hi all,

I'm trying to scrape highlighted text from a large number of PDFs, using
PDFBox (and Jython). I've come close, but have hit a bit of a wall.


I can access the annotations (PDAnnotationTextMarkup) and get their bounding
box in user space coordinates through a getRectangle() method. I then use
PDFTextStripperByArea to mark the region and fetch text from it. However,
the text I get is from a different portion of the page. This suggests to me
that the coordinate systems might be different.

If I use the PrintTextLocations example in org.pdfbox.examples.util, it
gives every piece of text with its location. The locations given here agree
with those I get from scraping text with PDFTextStripperByArea, i.e. they
use the same coordinate system.
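
To make the question concrete, here is a minimal sketch of the kind of
transform I have in mind. It rests on assumptions I have not confirmed: that
the annotation rectangle is in PDF user space (origin at the bottom-left,
y increasing upwards), that the text stripper uses a top-left origin with y
increasing downwards, and that the page is unrotated with no crop box
offset. The function name and the 792-point page height are hypothetical,
chosen only for illustration.

```python
def annotation_rect_to_stripper_region(llx, lly, urx, ury, page_height):
    """Flip a bottom-left-origin rectangle into a top-left-origin region.

    Takes the lower-left and upper-right corners of the rectangle plus the
    page height, all in points, and returns (x, y, width, height) where
    (x, y) is the top-left corner measured from the top of the page.
    Assumes an unrotated page with no crop box offset.
    """
    width = urx - llx
    height = ury - lly
    x = llx
    # The top edge of the rectangle (ury) becomes its distance from the page top.
    y = page_height - ury
    return (x, y, width, height)


# Hypothetical highlight near the top of a 612 x 792 point page.
region = annotation_rect_to_stripper_region(100.0, 700.0, 300.0, 720.0, 792.0)
print(region)  # (100.0, 72.0, 200.0, 20.0)
```

In Jython the four corner values would come from the PDRectangle that
getRectangle() returns (getLowerLeftX() and friends), and the result would
be wrapped in a java.awt.geom.Rectangle2D before being passed to the
stripper's addRegion().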

Can anyone suggest the appropriate transform between these systems?

Thanks,
Lars
