Hi,

On 04.01.2012 12:53, Ilija Pavlic wrote:
I am having issues with coordinates. The PDFTextStripperByArea region
seems to be pushed too high.

Consider the following example snippet:

...
     PDPage page = (PDPage) allPages.get(0);
     PDFTextStripperByArea stripper = new PDFTextStripperByArea();

     // define region for extraction -- the coordinates and dimensions
     // are x, y, width, height
     Rectangle region = new Rectangle((int) x, (int)y, (int)width, (int)height);
     stripper.addRegion("test region", region);

     // overlay the region with a cyan rectangle to check if I got the
     // coordinates and dimensions right
     PDPageContentStream contentStream =
         new PDPageContentStream(document, page, true, true);
     contentStream.setNonStrokingColor( Color.CYAN );
     contentStream.fillRect( (int)x, (int)y, (int)width, (int)height );
     contentStream.close();

     // extract the text from the defined region
     stripper.extractRegions(page);
     String content = stripper.getTextForRegion("test region");
...
     document.save(...);
...

The cyan rectangle overlays the desired region nicely. On the other
hand, the stripper misses a couple of lines at the bottom of the rectangle
and includes a couple of lines above the rectangle. What is going on?
Maybe it is an issue with the current transformation matrix. You should
probably use the PDPageContentStream constructor that takes five parameters
and set the last one to "true". See [1] for further information.
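
To illustrate what a leftover transformation matrix would do, here is a
minimal sketch that uses only java.awt.geom and no PDFBox classes. The
class name CtmDemo, the helper toPageSpace and the 40-unit translation are
hypothetical and chosen purely for illustration; whether the document in
question actually leaves such a matrix behind is an assumption.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Rectangle2D;

// Hypothetical demo class, not part of PDFBox.
class CtmDemo {

    // Maps a rectangle through a transformation matrix and returns the
    // bounding box of the result, i.e. where the rectangle would end up
    // on the page if that matrix were still active.
    static Rectangle2D toPageSpace(AffineTransform ctm, Rectangle2D rect) {
        return ctm.createTransformedShape(rect).getBounds2D();
    }

    public static void main(String[] args) {
        // Example values standing in for x, y, width, height.
        Rectangle2D rect = new Rectangle2D.Double(100, 500, 200, 50);

        // With an identity matrix the rectangle stays where it was defined.
        AffineTransform identity = new AffineTransform();

        // Assumed leftover matrix: a translation of 40 units along y.
        AffineTransform leftover = AffineTransform.getTranslateInstance(0, 40);

        System.out.println("identity: " + toPageSpace(identity, rect));
        System.out.println("leftover: " + toPageSpace(leftover, rect));
    }
}
```

With the assumed translation, the same numbers describe an area that is
shifted by 40 units along y, so something drawn through the appended
content stream and a region defined from the same x, y, width and height
need not cover the same part of the page.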

Thank you,
Ilija.

BR
Andreas Lehmkühler

[1] https://issues.apache.org/jira/browse/PDFBOX-854
