Re: associating text with a PDActionURI?

Tilman Hausherr Sat, 09 Jul 2016 06:05:15 -0700

Am 07.07.2016 um 21:49 schrieb Allison, Timothy B.:

Hmm, because it's you, I'll try it myself :-)

Thank you, Tilman!

You can't really know for sure with the classic text extraction, but you could 
use the extractTextByArea example with the rect coordinates.

Based on your example, though, I think this should work. If I cache the rectangle 
coordinates (Rectangle2D) before processing the page and then test for whether the 
rectangle contains each TextPosition in writeString(String text, 
List<TextPosition> textPositions), this might work?...have to implement to test 
this idea...
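
The containment test described above could be sketched roughly as follows. This is only an illustration that runs without PDFBox: the Glyph class is a hypothetical stand-in for the values one would read from a TextPosition (the xDirAdj and yDirAdj values printed further down in this thread), and the third glyph is made up to show a miss.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

final class LinkTextMatcher
{
    private LinkTextMatcher()
    {
    }

    /** Hypothetical stand-in for the character and x/y values of a TextPosition. */
    static final class Glyph
    {
        final String unicode;
        final float x;
        final float y;

        Glyph(String unicode, float x, float y)
        {
            this.unicode = unicode;
            this.x = x;
            this.y = y;
        }
    }

    /** Concatenates the glyphs whose (x, y) point lies inside the cached rectangle. */
    static String textInside(Rectangle2D rect, List<Glyph> glyphs)
    {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : glyphs)
        {
            if (rect.contains(g.x, g.y))
            {
                sb.append(g.unicode);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args)
    {
        // Link rectangle after the flip (top of the page is y = 0), values from this thread.
        Rectangle2D rect = new Rectangle2D.Float(69.75f, 415.38f, 153.45f - 69.75f, 440.83f - 415.38f);
        List<Glyph> glyphs = new ArrayList<Glyph>();
        glyphs.add(new Glyph("T", 72.024f, 425.93f));
        glyphs.add(new Glyph("h", 77.40048f, 425.93f));
        glyphs.add(new Glyph("X", 72.0f, 500.0f)); // made-up glyph outside the link
        System.out.println(textInside(rect, glyphs));
    }
}
```

In a real stripper subclass the same check would presumably sit inside writeString(), looping over the textPositions list instead of the Glyph list.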

I don't know if the list contains the spaces at that time... Sometimes spaces
are there as a space glyph (like in your text), sometimes the space is created
in text extraction with heuristics.

The key from your example is to subtract the rectangle's y values from the 
height of the page.

PDAnnotationLink's rectangle is:
  (lowerleftx) 69.75 : (lly)440.83 : (upperrightx)153.45 : (upry)415.38


Yes, i.e. after the flip.
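
As a rough sketch of that flip, using only java.awt.geom: the helper name flip() is hypothetical, and the page height of 792 is an assumption inferred from the numbers in this thread (792 - 376.62 = 415.38), not something read from the file. It does not account for rotated pages or for a crop box that does not start at the origin.

```java
import java.awt.geom.Rectangle2D;

final class RectFlip
{
    private RectFlip()
    {
    }

    /**
     * Converts a rectangle given in PDF coordinates (y = 0 at the bottom of the page)
     * into a rectangle with y = 0 at the top of the page.
     */
    static Rectangle2D.Float flip(float llx, float lly, float urx, float ury, float pageHeight)
    {
        float width = urx - llx;
        float height = ury - lly;
        // The upper edge in PDF coordinates becomes the top (minimum y) after the flip.
        return new Rectangle2D.Float(llx, pageHeight - ury, width, height);
    }

    public static void main(String[] args)
    {
        // Annotation rectangle from this thread; 792 is an assumed page height.
        Rectangle2D.Float r = flip(69.75f, 351.17f, 153.45f, 376.62f, 792f);
        System.out.println("y range after flip: " + r.getMinY() + " to " + r.getMaxY());
    }
}
```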


TEXT: This is a hyperlink
x: 72.024 xDirAdj: 72.024 y: 425.93 yDirAdj: 425.93 height: 5.52 heightDir: 5.52
x: 77.40048 xDirAdj: 77.40048 y: 425.93 yDirAdj: 425.93 height: 5.52 heightDir: 5.52
....
So the "This is a hyperlink" text goes from 425.93 -> 431.45 on the y axis, 
which now fits within the rectangle's y range of 415.38 to 440.83.


yDirAdj has the flip, while text.getTextMatrix() doesn't.

The height you are using is the one that appears in red. It is not really the
height of the glyphs; it is a heuristic value that is derived from several
other values, and is used to decide whether other glyphs are on the same line
or not. The real height is the one in cyan. But since this height is usually
smaller than the largest height (it is usually the height of small glyphs like
a, e, u), it should work, so it is a yes. (Uhm, you might want to check whether
it works for rotated pages and for pages with a cropbox.)

I suggest you try a few files from the digitalcorpora site... it is easy to
find such files, you just get the annotations of a page, and then look for
their class.

Looking at the code in DrawPrintTextLocations, I believe that the flip done in
writeString() is not needed, i.e. I'm first creating a flipped rectangle, and
then flipping it again about 30 lines later. I'll look at that in a quiet
moment.


Tilman


Does this seem right, or did I happen to get this right on this one doc?


-----Original Message-----
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Thursday, July 7, 2016 12:55 PM
To: users@pdfbox.apache.org
Subject: Re: associating text with a PDActionURI?

here's code that works - for some reason, I can't take the rectangle as it is, 
I have to flip the coordinates. I wonder if this is documented.
The coordinates in the PDF are PDF coordinates (bottom is y = 0), but the 
coordinates I had to use are flipped (top is y = 0).

Tilman

package org.apache.pdfbox.examples.util;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.text.PDFTextStripperByArea;

import java.awt.geom.Rectangle2D;
import java.io.File;
import java.io.IOException;

/**
   * This is an example on how to extract text from a specific area on the PDF document.
   *
   * @author Ben Litchfield
   */
public final class ExtractTextByArea
{
      private ExtractTextByArea()
      {
          //utility class and should not be constructed.
      }


      /**
       * This will print the document's text in a certain area.
       *
       * @param args The command line arguments.
       *
       * @throws IOException If there is an error parsing the document.
       */
      public static void main( String[] args ) throws IOException
      {
          if( args.length != 1 )
          {
              usage();
          }
          else
          {
              PDDocument document = null;
              try
              {
                  document = PDDocument.load( new File(args[0]) );
                  PDFTextStripperByArea stripper = new PDFTextStripperByArea();
                  stripper.setSortByPosition( true );
                  float pageHeight = document.getPage(0).getCropBox().getHeight();
                  Rectangle2D rect = new Rectangle2D.Float( 69.75f,
                          pageHeight - 376.62f, 153.45f - 69.75f, 376.62f - 351.17f);
/////////////////////////////////////////////////
                  stripper.addRegion( "class1", rect );
                  PDPage firstPage = document.getPage(0);
                  stripper.extractRegions( firstPage );
                  System.out.println( "Text in the area:" + rect );
                  System.out.println( stripper.getTextForRegion( "class1" ) );
              }
              finally
              {
                  if( document != null )
                  {
                      document.close();
                  }
              }
          }
      }

      /**
       * This will print the usage for this document.
       */
      private static void usage()
      {
          System.err.println( "Usage: java " + ExtractTextByArea.class.getName() + " <input-pdf>" );
      }

}


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: users-h...@pdfbox.apache.org

