On 07.07.2016 at 21:49, Allison, Timothy B. wrote:
Hmm, because it's you, I'll try it myself :-)
Thank you, Tilman!
You can't really know for sure with the classic text extraction, but you could
use the extractTextByArea example with the rect coordinates.
Based on your example, though, I think this should work. If I cache the rectangle
coordinates (Rectangle2D) before processing the page and then test for whether the
rectangle contains each TextPosition in writeString(String text,
List<TextPosition> textPositions), this might work?...have to implement to test
this idea...
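
A minimal sketch of that containment test in plain Java, not tied to PDFBox. Glyph is a hypothetical stand-in for TextPosition (in a real stripper the x/y would presumably come from the xDirAdj/yDirAdj values), and the rectangle and glyph coordinates are the ones quoted elsewhere in this thread, already in top-origin coordinates. The third glyph is made up to show a miss.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

final class LinkTextFilter
{
    /** Hypothetical stand-in for TextPosition: one glyph and its top-origin position. */
    static final class Glyph
    {
        final String unicode;
        final double x;
        final double y;

        Glyph(String unicode, double x, double y)
        {
            this.unicode = unicode;
            this.x = x;
            this.y = y;
        }
    }

    /** Concatenates the glyphs whose (x, y) point lies inside the cached rectangle. */
    static String textInside(Rectangle2D rect, List<Glyph> glyphs)
    {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : glyphs)
        {
            if (rect.contains(g.x, g.y))
            {
                sb.append(g.unicode);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args)
    {
        // Link rectangle after the flip: x, y, width, height.
        Rectangle2D linkRect = new Rectangle2D.Double(69.75, 415.38, 153.45 - 69.75, 440.83 - 415.38);

        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph("T", 72.024, 425.93));
        glyphs.add(new Glyph("h", 77.40048, 425.93));
        glyphs.add(new Glyph("X", 200.0, 425.93)); // made-up glyph to the right of the link

        System.out.println(textInside(linkRect, glyphs)); // prints "Th"
    }
}
```

Testing a single point per glyph is the simplest choice; a stricter variant could require the whole glyph box to fit inside the rectangle.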
I don't know if the list contains the spaces at that time... Sometimes
spaces are there as a space glyph (like in your text), sometimes the
space is created in text extraction with heuristics.
The key from your example is to subtract the rectangle's y values from the
height of the page.
PDAnnotationLink's rectangle is:
(lowerleftx) 69.75 : (lly)440.83 : (upperrightx)153.45 : (upry)415.38
Yes, i.e. after the flip.
TEXT: This is a hyperlink
x: 72.024 xDirAdj: 72.024 y: 425.93 yDirAdj: 425.93 height: 5.52 heightDir: 5.52
x: 77.40048 xDirAdj: 77.40048 y: 425.93 yDirAdj: 425.93 height: 5.52 heightDir:
5.52
....
So the "This is a hyperlink" text goes from 425.93 -> 431.45 (y + height) on the y axis,
which now fits within the rectangle's y range of 415.38 to 440.83.
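
The flip and the range check can be written out as a small sketch. The corner values 351.17 and 376.62 are the unflipped y values used in the ExtractTextByArea code in this thread. The page height of 792 is an assumption: it is not stated anywhere in the thread, but it is the value for which 792 - 376.62 = 415.38 and 792 - 351.17 = 440.83 match the flipped numbers above. The helper names are made up for illustration.

```java
import java.awt.geom.Rectangle2D;

final class FlipLinkRect
{
    /**
     * Converts a rectangle given by its PDF corners (y = 0 at the bottom)
     * into a Rectangle2D in top-origin coordinates (y = 0 at the top).
     */
    static Rectangle2D flip(double llx, double lly, double urx, double ury, double pageHeight)
    {
        return new Rectangle2D.Double(llx, pageHeight - ury, urx - llx, ury - lly);
    }

    /** True if the vertical span [top, bottom] lies within the rectangle's y range. */
    static boolean containsSpan(Rectangle2D rect, double top, double bottom)
    {
        return rect.getMinY() <= top && bottom <= rect.getMaxY();
    }

    public static void main(String[] args)
    {
        double pageHeight = 792.0; // assumption, see the note above

        Rectangle2D flipped = flip(69.75, 351.17, 153.45, 376.62, pageHeight);

        // Text span as computed in the mail: y and y + height.
        double y = 425.93;
        double height = 5.52;

        System.out.println("y range: " + flipped.getMinY() + " .. " + flipped.getMaxY());
        System.out.println("fits: " + containsSpan(flipped, y, y + height)); // fits: true
    }
}
```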
yDirAdj has the flip, while text.getTextMatrix() doesn't.
The height you are using is the one that appears in red. It is not
really the height of the glyphs, it is a heuristic value that is derived
from several other values, and is used to decide whether other glyphs
are on the same line or not. The real height is the one in cyan. But
since this height is usually smaller than the largest height (it is
usually the height of small glyphs like a, e, u), it should work, so the
answer is yes.
(Uhm, you might want to check whether it works for rotated pages and for
pages with a cropbox).
I suggest you try a few files from the digitalcorpora site... it is easy
to find such files, you just get the annotations of a page, and then
look for their class.
Looking at the code in DrawPrintTextLocations, I believe that the flip
done in writeString() is not needed, i.e. I'm first creating a
flipped rectangle, and then flip it again about 30 lines later. I'll
look at that in a quiet moment.
Tilman
Does this seem right, or did I happen to get this right on this one doc?
-----Original Message-----
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Thursday, July 7, 2016 12:55 PM
To: users@pdfbox.apache.org
Subject: Re: associating text with a PDActionURI?
Here's code that works. For some reason, I can't take the rectangle as it is;
I have to flip the coordinates. I wonder if this is documented.
The coordinates in the PDF are PDF coordinates (bottom is y = 0), but the
coordinates I had to use have y = 0 at the top.
Tilman
package org.apache.pdfbox.examples.util;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.text.PDFTextStripperByArea;

import java.awt.geom.Rectangle2D;
import java.io.File;
import java.io.IOException;

/**
 * This is an example on how to extract text from a specific area on the PDF document.
 *
 * @author Ben Litchfield
 */
public final class ExtractTextByArea
{
    private ExtractTextByArea()
    {
        //utility class and should not be constructed.
    }

    /**
     * This will print the documents text in a certain area.
     *
     * @param args The command line arguments.
     *
     * @throws IOException If there is an error parsing the document.
     */
    public static void main( String[] args ) throws IOException
    {
        if( args.length != 1 )
        {
            usage();
        }
        else
        {
            PDDocument document = null;
            try
            {
                document = PDDocument.load( new File(args[0]) );
                PDFTextStripperByArea stripper = new PDFTextStripperByArea();
                stripper.setSortByPosition( true );
                float pageHeight = document.getPage(0).getCropBox().getHeight();
                Rectangle2D rect = new Rectangle2D.Float( 69.75f, pageHeight - 376.62f,
                        153.45f - 69.75f, 376.62f - 351.17f);
                /////////////////////////////////////////////////
                stripper.addRegion( "class1", rect );
                PDPage firstPage = document.getPage(0);
                stripper.extractRegions( firstPage );
                System.out.println( "Text in the area:" + rect );
                System.out.println( stripper.getTextForRegion( "class1" ) );
            }
            finally
            {
                if( document != null )
                {
                    document.close();
                }
            }
        }
    }

    /**
     * This will print the usage for this document.
     */
    private static void usage()
    {
        System.err.println( "Usage: java " + ExtractTextByArea.class.getName() + " <input-pdf>" );
    }
}
---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: users-h...@pdfbox.apache.org