RE: associating text with a PDActionURI?

Allison, Timothy B. Thu, 07 Jul 2016 12:49:51 -0700

> Hmm, because it's you, I'll try it myself :-)

Thank you, Tilman!


> You can't really know for sure with the classic text extraction, but you 
> could use the extractTextByArea example with the rect coordinates.

Based on your example, though, I think this should work. If I cache the 
rectangle coordinates (Rectangle2D) before processing the page and then test 
whether the rectangle contains each TextPosition in writeString(String 
text, List<TextPosition> textPositions), I should be able to tell which text 
belongs to the link. I still have to implement this to test the idea.

The key from your example is to subtract the rectangle's y values from the 
height of the page.
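
A minimal sketch of that flip, using only java.awt.geom so it runs without 
PDFBox. The class and method names are made up for illustration. In PDFBox the 
four corner values would presumably come from the annotation's PDRectangle and 
the page height from the crop box, as in Tilman's example below. The page 
height of 792 is an assumption inferred from the numbers in this thread 
(415.38 + 376.62).

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper, not part of PDFBox.
final class LinkRectFlipper
{
    private LinkRectFlipper()
    {
        // utility class
    }

    /**
     * Converts a rectangle given by its PDF corners (y = 0 at the bottom of the
     * page) into a Rectangle2D whose y = 0 is at the top of the page.
     */
    static Rectangle2D.Float toTopLeftOrigin( float llx, float lly,
                                              float urx, float ury,
                                              float pageHeight )
    {
        float x = llx;
        float y = pageHeight - ury;   // top edge, measured down from the page top
        float width = urx - llx;
        float height = ury - lly;
        return new Rectangle2D.Float( x, y, width, height );
    }

    public static void main( String[] args )
    {
        // Corner values from this thread; 792 is the assumed page height.
        Rectangle2D.Float r = toTopLeftOrigin( 69.75f, 351.17f, 153.45f, 376.62f, 792f );
        System.out.println( "flipped y range: " + r.y + " to " + (r.y + r.height) );
    }
}
```

With these inputs the flipped rectangle spans roughly 415.38 to 440.83 on the 
y axis, which are the values quoted below.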

PDAnnotationLink's rectangle is:
 (lowerleftx) 69.75 : (lly) 440.83 : (upperrightx) 153.45 : (upry) 415.38

TEXT: This is a hyperlink 
x: 72.024 xDirAdj: 72.024 y: 425.93 yDirAdj: 425.93 height: 5.52 heightDir: 5.52
x: 77.40048 xDirAdj: 77.40048 y: 425.93 yDirAdj: 425.93 height: 5.52 heightDir: 5.52
....
So the "This is a hyperlink" text goes from 425.93 -> 431.45 (y + height) on 
the y axis, which now fits within the rectangle's y range of 415.38 to 440.83.
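
As a rough sketch of that check, again without PDFBox: the x, y and height 
arguments stand in for what TextPosition reports (xDirAdj, yDirAdj, heightDir), 
and the class and method names are hypothetical. It treats the glyph as 
spanning y to y + height; whether the glyph really extends above or below 
yDirAdj is an assumption I have not verified.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper, not part of PDFBox.
final class GlyphInLinkCheck
{
    private GlyphInLinkCheck()
    {
        // utility class
    }

    /**
     * Returns true if both ends of the glyph's assumed vertical span
     * (y and y + height) at the given x fall inside the rectangle.
     */
    static boolean glyphInside( Rectangle2D rect, double x, double y, double height )
    {
        return rect.contains( x, y ) && rect.contains( x, y + height );
    }

    public static void main( String[] args )
    {
        // Flipped link rectangle from this thread: x 69.75..153.45, y 415.38..440.83
        Rectangle2D rect = new Rectangle2D.Double(
                69.75, 415.38, 153.45 - 69.75, 440.83 - 415.38 );

        // First glyph of "This is a hyperlink"
        System.out.println( glyphInside( rect, 72.024, 425.93, 5.52 ) );  // true
        // A made-up glyph further down the page
        System.out.println( glyphInside( rect, 72.024, 500.0, 5.52 ) );   // false
    }
}
```

In a PDFTextStripper subclass the same test would presumably run once per 
TextPosition inside writeString, against each cached link rectangle.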

Does this seem right, or did I happen to get this right on this one doc?


-----Original Message-----
From: Tilman Hausherr [mailto:[email protected]] 
Sent: Thursday, July 7, 2016 12:55 PM
To: [email protected]
Subject: Re: associating text with a PDActionURI?

Here's code that works. For some reason, I can't take the rectangle as it is; 
I have to flip the coordinates. I wonder if this is documented. 
The coordinates in the PDF are PDF coordinates (bottom is y = 0), but the 
coordinates I had to use have y = 0 at the top.

Tilman

package org.apache.pdfbox.examples.util;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.text.PDFTextStripperByArea;

import java.awt.geom.Rectangle2D;
import java.io.File;
import java.io.IOException;

/**
  * This is an example on how to extract text from a specific area on the PDF
  * document.
  *
  * @author Ben Litchfield
  */
public final class ExtractTextByArea
{
     private ExtractTextByArea()
     {
         //utility class and should not be constructed.
     }


     /**
      * This will print the document's text in a certain area.
      *
      * @param args The command line arguments.
      *
      * @throws IOException If there is an error parsing the document.
      */
     public static void main( String[] args ) throws IOException
     {
         if( args.length != 1 )
         {
             usage();
         }
         else
         {
             PDDocument document = null;
             try
             {
                 document = PDDocument.load( new File(args[0]) );
                 PDFTextStripperByArea stripper = new PDFTextStripperByArea();
                 stripper.setSortByPosition( true );
                 float pageHeight = document.getPage(0).getCropBox().getHeight();
                 // Flip y: the PDF rectangle has y = 0 at the bottom of the page,
                 // but the region must be given with y = 0 at the top.
                 Rectangle2D rect = new Rectangle2D.Float( 69.75f, pageHeight - 376.62f,
                         153.45f - 69.75f, 376.62f - 351.17f );
                 stripper.addRegion( "class1", rect );
                 PDPage firstPage = document.getPage(0);
                 stripper.extractRegions( firstPage );
                 System.out.println( "Text in the area:" + rect );
                 System.out.println( stripper.getTextForRegion( "class1" ) );
             }
             finally
             {
                 if( document != null )
                 {
                     document.close();
                 }
             }
         }
     }

     /**
      * This will print the usage for this document.
      */
     private static void usage()
     {
         System.err.println( "Usage: java " + ExtractTextByArea.class.getName()
                 + " <input-pdf>" );
     }

}


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

