> Hmm, because it's you, I'll try it myself :-)

Thank you, Tilman!

> You can't really know for sure with the classic text extraction, but you
> could use the extractTextByArea example with the rect coordinates.

Based on your example, though, I think this should work. If I cache the rectangle coordinates (Rectangle2D) before processing the page and then test whether the rectangle contains each TextPosition in writeString(String text, List<TextPosition> textPositions), this might work? I still have to implement it to test the idea.

The key from your example is to subtract the rectangle's y values from the height of the page.

PDAnnotationLink's rectangle is:
    (lowerleftx) 69.75 : (lly) 440.83 : (upperrightx) 153.45 : (upry) 415.38

TEXT: This is a hyperlink
    x: 72.024    xDirAdj: 72.024    y: 425.93  yDirAdj: 425.93  height: 5.52  heightDir: 5.52
    x: 77.40048  xDirAdj: 77.40048  y: 425.93  yDirAdj: 425.93  height: 5.52  heightDir: 5.52
    ....

So the "This is a hyperlink" text goes from 425.93 -> 431.45 on the y axis, which now fits within the rectangle's y range of 415.38 to 440.83.

Does this seem right, or did I just happen to get it right on this one doc?

-----Original Message-----
From: Tilman Hausherr [mailto:[email protected]]
Sent: Thursday, July 7, 2016 12:55 PM
To: [email protected]
Subject: Re: associating text with a PDActionURI?

Here's code that works. For some reason, I can't take the rectangle as it is; I have to flip the coordinates. I wonder if this is documented. The coordinates in the PDF are PDF coordinates (bottom is y = 0), but the coordinates I had to use are flipped (top is y = 0).

Tilman

package org.apache.pdfbox.examples.util;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.text.PDFTextStripperByArea;

import java.awt.geom.Rectangle2D;
import java.io.File;
import java.io.IOException;

/**
 * This is an example on how to extract text from a specific area on the PDF document.
 *
 * @author Ben Litchfield
 */
public final class ExtractTextByArea
{
    private ExtractTextByArea()
    {
        //utility class and should not be constructed.
    }

    /**
     * This will print the document's text in a certain area.
     *
     * @param args The command line arguments.
     *
     * @throws IOException If there is an error parsing the document.
     */
    public static void main( String[] args ) throws IOException
    {
        if( args.length != 1 )
        {
            usage();
        }
        else
        {
            PDDocument document = null;
            try
            {
                document = PDDocument.load( new File(args[0]) );
                PDFTextStripperByArea stripper = new PDFTextStripperByArea();
                stripper.setSortByPosition( true );

                // Flip the annotation rectangle from PDF coordinates (y = 0 at the
                // bottom) to the top-origin coordinates the stripper expects.
                float pageHeight = document.getPage(0).getCropBox().getHeight();
                Rectangle2D rect = new Rectangle2D.Float(
                        69.75f, pageHeight - 376.62f,
                        153.45f - 69.75f, 376.62f - 351.17f);
                /////////////////////////////////////////////////

                stripper.addRegion( "class1", rect );
                PDPage firstPage = document.getPage(0);
                stripper.extractRegions( firstPage );
                System.out.println( "Text in the area:" + rect );
                System.out.println( stripper.getTextForRegion( "class1" ) );
            }
            finally
            {
                if( document != null )
                {
                    document.close();
                }
            }
        }
    }

    /**
     * This will print the usage for this document.
     */
    private static void usage()
    {
        System.err.println( "Usage: java " + ExtractTextByArea.class.getName() + " <input-pdf>" );
    }
}

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
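For anyone following the thread, here is a minimal sketch of the flip-and-contain check discussed above, using only java.awt.geom so it runs without PDFBox. The class name LinkRectCheck and the helpers flip and containsGlyph are hypothetical names made up for illustration. The 792 page height is an assumption inferred from the numbers in the thread (415.38 + 376.62); in real code it would come from the page's crop box as in Tilman's example, and the x/y values would come from each TextPosition inside writeString rather than being hard-coded.

```java
import java.awt.geom.Rectangle2D;

public class LinkRectCheck {

    // Convert a rectangle given in PDF coordinates (y = 0 at the bottom of the
    // page) into top-origin coordinates (y = 0 at the top of the page).
    static Rectangle2D.Float flip(float llx, float lly, float urx, float ury,
                                  float pageHeight) {
        return new Rectangle2D.Float(llx, pageHeight - ury, urx - llx, ury - lly);
    }

    // Hypothetical helper: true if a glyph's reported (x, y) point lies inside
    // the flipped link rectangle.
    static boolean containsGlyph(Rectangle2D rect, float x, float y) {
        return rect.contains(x, y);
    }

    public static void main(String[] args) {
        float pageHeight = 792f; // assumption, see note above

        // Raw rectangle values taken from Tilman's example code.
        Rectangle2D.Float rect = flip(69.75f, 351.17f, 153.45f, 376.62f, pageHeight);
        System.out.println("Flipped rectangle: " + rect);

        // First glyph of "This is a hyperlink" from the dump in this thread.
        System.out.println("Glyph inside link rectangle: "
                + containsGlyph(rect, 72.024f, 425.93f));
    }
}
```

Testing a single point per glyph is the simplest possible check. Whether the point, the full glyph box, or some overlap threshold is the right test for a given document is still an open question here.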

