On 07.07.2016 at 21:49, Allison, Timothy B. wrote:
Hmm, because it's you, I'll try it myself :-)
Thank you, Tilman!

You can't really know for sure with the classic text extraction, but you could 
use the extractTextByArea example with the rect coordinates.
Based on your example, though, I think this should work. If I cache the rectangle 
coordinates (Rectangle2D) before processing the page and then test for whether the 
rectangle contains each TextPosition in writeString(String text, 
List<TextPosition> textPositions), this might work?...have to implement to test 
this idea...
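
A minimal sketch of that idea, using only the JDK so it runs without PDFBox. The Glyph class is a hypothetical stand-in for the three TextPosition values used here (the character, xDirAdj and yDirAdj), and textInside() is an illustrative helper, not a PDFBox method. In a real stripper the same loop would presumably sit inside an overridden writeString(). The rectangle is assumed to be flipped already, so its y values are measured from the top of the page.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

final class LinkTextFilter {
    /** Hypothetical stand-in for the TextPosition values used below. */
    static final class Glyph {
        final String unicode;
        final float xDirAdj;
        final float yDirAdj;

        Glyph(String unicode, float xDirAdj, float yDirAdj) {
            this.unicode = unicode;
            this.xDirAdj = xDirAdj;
            this.yDirAdj = yDirAdj;
        }
    }

    /** Concatenates the glyphs whose (xDirAdj, yDirAdj) point lies inside the cached rectangle. */
    static String textInside(Rectangle2D linkRect, List<Glyph> glyphs) {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : glyphs) {
            if (linkRect.contains(g.xDirAdj, g.yDirAdj)) {
                sb.append(g.unicode);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Link rectangle with y measured from the top: x 69.75..153.45, y 415.38..440.83
        Rectangle2D linkRect = new Rectangle2D.Float(
                69.75f, 415.38f, 153.45f - 69.75f, 440.83f - 415.38f);

        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph("T", 72.024f, 425.93f));
        glyphs.add(new Glyph("h", 77.40048f, 425.93f));
        glyphs.add(new Glyph("x", 72.024f, 500.0f)); // made-up glyph on a different line

        System.out.println(textInside(linkRect, glyphs)); // prints "Th"
    }
}
```

Testing a single point per glyph keeps the check simple, but it can miss glyphs that straddle the rectangle's edge; testing the glyph's box instead would be the stricter variant.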

I don't know if the list contains the spaces at that time... Sometimes spaces are there as a space glyph (like in your text), sometimes the space is created in text extraction with heuristics.


The key from your example is to subtract the rectangle's y values from the 
height of the page.

PDAnnotationLink's rectangle is:
  (lowerleftx) 69.75 : (lly)440.83 : (upperrightx)153.45 : (upry)415.38

Yes, i.e. after the flip.


TEXT: This is a hyperlink
x: 72.024 xDirAdj: 72.024 y: 425.93 yDirAdj: 425.93 height: 5.52 heightDir: 5.52
x: 77.40048 xDirAdj: 77.40048 y: 425.93 yDirAdj: 425.93 height: 5.52 heightDir: 5.52
....
So the "This is a hyperlink" text goes from 425.93 -> 430.45 on the y axis, 
which now fits within the rectangle's y range of 415.38 to 440.83.
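
The arithmetic can be checked with plain java.awt.geom, without PDFBox. This is only a sketch: flip() is an illustrative helper, the unflipped y values 351.17 and 376.62 are taken from the ExtractTextByArea code quoted further down, and the page height of 792 is an assumption inferred from 376.62 + 415.38.

```java
import java.awt.geom.Rectangle2D;

final class LinkRectFlip {
    /**
     * Converts a rectangle given in PDF coordinates (y = 0 at the bottom)
     * into one with y = 0 at the top of a page of the given height.
     */
    static Rectangle2D.Float flip(float llx, float lly, float urx, float ury,
                                  float pageHeight) {
        return new Rectangle2D.Float(llx, pageHeight - ury, urx - llx, ury - lly);
    }

    public static void main(String[] args) {
        float pageHeight = 792f; // assumption, see above
        Rectangle2D.Float flipped =
                flip(69.75f, 351.17f, 153.45f, 376.62f, pageHeight);

        // The same rectangle left in PDF coordinates, for comparison
        Rectangle2D.Float unflipped = new Rectangle2D.Float(
                69.75f, 351.17f, 153.45f - 69.75f, 376.62f - 351.17f);

        // First glyph of "This is a hyperlink": xDirAdj 72.024, yDirAdj 425.93
        System.out.println(flipped.contains(72.024, 425.93));   // prints "true"
        System.out.println(unflipped.contains(72.024, 425.93)); // prints "false"
    }
}
```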

yDirAdj has the flip, while text.getTextMatrix() doesn't.

The height you are using is the one that appears in red. It is not really the height of the glyphs; it is a heuristic value derived from several other values, and it is used to decide whether other glyphs are on the same line. The real height is the one in cyan. But since the height you are using is usually smaller than the largest glyph height (it is usually about the height of small glyphs like a, e, u), it should work, so it is a yes. (Uhm, you might want to check whether it works for rotated pages and for pages with a cropbox.)

I suggest you try a few files from the digitalcorpora site... it is easy to find such files, you just get the annotations of a page, and then look for their class.

Looking at the code in DrawPrintTextLocations, I believe that the flip done in writeString() is not needed, i.e. I'm first creating a flipped rectangle and then flipping it again about 30 lines later. I'll look at that in a quiet moment.

Tilman


Does this seem right, or did I happen to get this right on this one doc?


-----Original Message-----
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Thursday, July 7, 2016 12:55 PM
To: users@pdfbox.apache.org
Subject: Re: associating text with a PDActionURI?

Here's code that works. For some reason, I can't take the rectangle as it is; 
I have to flip the coordinates. I wonder if this is documented.
The coordinates in the PDF are PDF coordinates (bottom is y = 0), but the 
coordinates I had to use have the top at y = 0.

Tilman

package org.apache.pdfbox.examples.util;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.text.PDFTextStripperByArea;

import java.awt.geom.Rectangle2D;
import java.io.File;
import java.io.IOException;

/**
   * This is an example on how to extract text from a specific area on the PDF
   * document.
   *
   * @author Ben Litchfield
   */
public final class ExtractTextByArea
{
      private ExtractTextByArea()
      {
          //utility class and should not be constructed.
      }


      /**
       * This will print the documents text in a certain area.
       *
       * @param args The command line arguments.
       *
       * @throws IOException If there is an error parsing the document.
       */
      public static void main( String[] args ) throws IOException
      {
          if( args.length != 1 )
          {
              usage();
          }
          else
          {
              PDDocument document = null;
              try
              {
                  document = PDDocument.load( new File(args[0]) );
                  PDFTextStripperByArea stripper = new PDFTextStripperByArea();
                  stripper.setSortByPosition( true );
                  float pageHeight = document.getPage(0).getCropBox().getHeight();
                  Rectangle2D rect = new Rectangle2D.Float( 69.75f,
                          pageHeight - 376.62f, 153.45f - 69.75f, 376.62f - 351.17f);
/////////////////////////////////////////////////
                  stripper.addRegion( "class1", rect );
                  PDPage firstPage = document.getPage(0);
                  stripper.extractRegions( firstPage );
                  System.out.println( "Text in the area:" + rect );
                  System.out.println( stripper.getTextForRegion( "class1" ) );
              }
              finally
              {
                  if( document != null )
                  {
                      document.close();
                  }
              }
          }
      }

      /**
       * This will print the usage for this document.
       */
      private static void usage()
      {
          System.err.println( "Usage: java " +
                  ExtractTextByArea.class.getName() + " <input-pdf>" );
      }

}


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: users-h...@pdfbox.apache.org

