I have opened a new issue: https://issues.apache.org/jira/browse/PDFBOX-2775
This will be tricky... there are almost never problems with text extraction (except fonts).
Tilman

On 25.04.2015 at 07:24, Andrew Munn wrote:
Processing this doc: http://www.oracle.com/technetwork/java/jaf-1-150219.pdf

I am getting this:

x=33
y=159
w=216
h=43
page=1
getting text from page #1 of 21 in doc.
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 3
    at java.util.Vector.get(Vector.java:748)
    at org.apache.pdfbox.text.PDFTextStripper.processTextPosition(PDFTextStripper.java:903)
    at org.apache.pdfbox.text.PDFTextStripperByArea.processTextPosition(PDFTextStripperByArea.java:132)
    at org.apache.pdfbox.text.PDFTextStreamEngine.showGlyph(PDFTextStreamEngine.java:229)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.showText(PDFStreamEngine.java:690)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.showTextStrings(PDFStreamEngine.java:600)
    at org.apache.pdfbox.contentstream.operator.text.ShowTextAdjusted.process(ShowTextAdjusted.java:38)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:802)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:464)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:438)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:149)
    at org.apache.pdfbox.text.PDFTextStreamEngine.processPage(PDFTextStreamEngine.java:117)
    at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:347)
    at org.apache.pdfbox.text.PDFTextStripperByArea.extractRegions(PDFTextStripperByArea.java:113)

Code is:

String textFromBox(PDDocument doc, int x, int y, int w, int h, int page) throws IOException {
    System.out.println("x="+x);
    System.out.println("y="+y);
    System.out.println("w="+w);
    System.out.println("h="+h);
    System.out.println("page="+page);
    PDFTextStripperByArea stripper = new PDFTextStripperByArea();
    Rectangle rect = new Rectangle(x, y - h, w, h);
    stripper.addRegion("region", rect);
    int pageCount = doc.getDocumentCatalog().getPages().getCount();
    System.out.println("getting text from page #" + page + " of " + pageCount + " in doc.");
    if (page <= pageCount) {
        PDPage pp = (PDPage) doc.getDocumentCatalog().getPages().get(page - 1);
        stripper.extractRegions(pp);
        String text = stripper.getTextForRegion("region");
        System.out.println("text=" + text);
        return text;
    } else {
        return "No page #" + page;
    }
}

Thanks!

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

