Hi,

On 04.01.2012 12:53, Ilija Pavlic wrote:
I am having issues with coordinates. The PDFTextStripperByArea region seems to be pushed too high. Consider the following example snippet:

    ...
    PDPage page = (PDPage) allPages.get(0);
    PDFTextStripperByArea stripper = new PDFTextStripperByArea();

    // define region for extraction -- the coordinates and dimensions are x, y, width, height
    Rectangle region = new Rectangle((int) x, (int) y, (int) width, (int) height);
    stripper.addRegion("test region", region);

    // overlay the region with a cyan rectangle to check if I got the coordinates and dimensions right
    PDPageContentStream contentStream = new PDPageContentStream(document, page, true, true);
    contentStream.setNonStrokingColor(Color.CYAN);
    contentStream.fillRect((int) x, (int) y, (int) width, (int) height);
    contentStream.close();

    // extract the text from the defined region
    stripper.extractRegions(page);
    String content = stripper.getTextForRegion("test region");
    ...
    document.save(...);
    ...

The cyan rectangle overlays the desired region nicely. The stripper, on the other hand, misses a couple of lines at the bottom of the rectangle and includes a couple of lines above it. What is going on?
Maybe an issue with the current transformation matrix? You should probably use the PDPageContentStream constructor that takes 5 parameters, setting the last one to "true". See [1] for further information.
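To illustrate the suspicion about the current transformation matrix, here is a minimal sketch using only java.awt.geom, not PDFBox itself. It shows how a matrix left over from the existing page content would move a rectangle that is appended afterwards. The class name CtmDemo, the helper applyCtm, and the 30-unit translation are hypothetical values chosen for illustration; whether a leftover matrix is actually the cause in this case is not confirmed.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Rectangle2D;

class CtmDemo {
    /** Maps a rectangle through the given matrix and returns its bounding box. */
    static Rectangle2D applyCtm(AffineTransform ctm, Rectangle2D rect) {
        return ctm.createTransformedShape(rect).getBounds2D();
    }

    public static void main(String[] args) {
        // Region as requested by the caller: x, y, width, height (made-up numbers).
        Rectangle2D region = new Rectangle2D.Double(50, 100, 200, 80);

        // Assumption: the existing content stream left a translation of
        // 30 units in y active when the new content is appended.
        AffineTransform leftover = AffineTransform.getTranslateInstance(0, 30);

        // With a reset graphics context the matrix is the identity.
        AffineTransform identity = new AffineTransform();

        System.out.println("leftover CTM: " + applyCtm(leftover, region));
        System.out.println("reset CTM:    " + applyCtm(identity, region));
    }
}
```

With the leftover matrix, the rectangle ends up 30 units away from the requested y position, while the identity matrix leaves it where it was requested. Resetting the graphics context before appending is meant to guarantee the second case.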
Thank you, Ilija.
BR
Andreas Lehmkühler

[1] https://issues.apache.org/jira/browse/PDFBOX-854

