On 06.11.2018 at 22:30, jorgeeflorez wrote:
Thanks a lot Tilman for your help.
It seems to me that text extraction from a page could be improved
(I used PDFBox 2.0.11). The idea, I think, is that one could just
invoke a method and get the text of the page, just as you would get
it if you selected the text on the page in Adobe Reader.
Looking at the code of LegacyPDFStreamEngine, an ancestor of
PDFTextStripper, I found the comment "THIS CODE IS DELIBERATELY
INCORRECT" in several places (I don't know if this affects what I am
trying to do). Anyway, I made a subclass of PDFStreamEngine and tried
to get the text of the page (I am not familiar with the PDF
specification, operators, fonts and all that stuff). I just took some
code from the examples that I think I understood, and added a couple
of lines.
I could extract the text of the file I used to test, regardless of the
page rotation. I also used the PDF file from PDFBOX-4368 and it seems
the text was extracted correctly. For a third file, the text was
extracted but without spaces between words (I guess spaces were not
stored in the PDF).
I attached the test files and the class I created. I know it doesn't
cover all the cases, but maybe it can be helpful.
By the way, text extraction was part of a bigger problem. I needed
the text of the page, and I also needed to group the text into words
and store the coordinates (x, y, width, height) of each word. I could
do the grouping part (more or less), but the first part was giving me
trouble :)
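For readers following along, here is a minimal sketch of what such a word grouping step might look like. It is not the class from the attachment and it is not PDFBox code: WordGrouper, Word and maxGap are made-up names, and it assumes the glyph boxes are already in reading order on a single horizontal line.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox.
final class WordGrouper
{
    /** A word and its bounding box (x, y, width, height). */
    static final class Word
    {
        final String text;
        final Rectangle2D.Double box;

        Word(String text, Rectangle2D.Double box)
        {
            this.text = text;
            this.box = box;
        }
    }

    /**
     * chars[i] is drawn inside boxes[i]. A new word starts at a whitespace
     * glyph, or when the horizontal gap to the previous glyph exceeds maxGap
     * (an assumed tuning parameter).
     */
    static List<Word> group(String[] chars, Rectangle2D.Double[] boxes, double maxGap)
    {
        List<Word> words = new ArrayList<>();
        StringBuilder sb = new StringBuilder();
        Rectangle2D.Double current = null;
        for (int i = 0; i < chars.length; i++)
        {
            boolean isSpace = chars[i].trim().isEmpty();
            boolean gap = current != null
                    && boxes[i].getMinX() - current.getMaxX() > maxGap;
            if ((isSpace || gap) && current != null)
            {
                // close the word collected so far
                words.add(new Word(sb.toString(), current));
                sb.setLength(0);
                current = null;
            }
            if (isSpace)
            {
                continue; // whitespace glyphs only separate words
            }
            sb.append(chars[i]);
            if (current == null)
            {
                current = new Rectangle2D.Double(
                        boxes[i].x, boxes[i].y, boxes[i].width, boxes[i].height);
            }
            else
            {
                current.add(boxes[i]); // grow the box to the union of both
            }
        }
        if (current != null)
        {
            words.add(new Word(sb.toString(), current));
        }
        return words;
    }
}
```

A fixed maxGap is the simplest possible choice; a threshold relative to the font size or to the width of a space would probably hold up better across documents.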
I don't know what is incorrect there except the height. I think John
wanted to do something about it but it didn't happen.
Your attachment didn't get through. Please upload it somewhere.
Yes, most PDFs don't have spaces. The PDFTextStripper class uses
heuristics to make them up. If you are working on your own algorithm,
then use a test like TestTextStripper.java and maybe some or all of the
files that are part of the test. You can then compare your extraction
with the current code, or just keep it to retest your own code as your
algorithm evolves.
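A minimal sketch of such a comparison, assuming both extractions are already available as strings. ExtractionDiff and firstDifference are made-up names; this is not how TestTextStripper.java itself is written.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for retesting an extraction against a stored result.
final class ExtractionDiff
{
    /**
     * Returns the 1-based number of the first line that differs between the
     * two extractions, or -1 if they are identical line by line.
     * Line endings (\n, \r\n, \r) are treated as equivalent.
     */
    static int firstDifference(String expected, String actual)
    {
        List<String> e = Arrays.asList(expected.split("\\R", -1));
        List<String> a = Arrays.asList(actual.split("\\R", -1));
        int common = Math.min(e.size(), a.size());
        for (int i = 0; i < common; i++)
        {
            if (!e.get(i).equals(a.get(i)))
            {
                return i + 1;
            }
        }
        // same prefix: either identical, or one text has extra lines
        return e.size() == a.size() ? -1 : common + 1;
    }
}
```

The expected string could be the output of the current PDFTextStripper or a result saved from an earlier run of your own code.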
Btw, here's some updated code. The previous version had several bugs:
it didn't work on multiple pages, and it didn't work on pages with a
/Rotate entry.
Tilman
import java.io.File;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.pdfbox.cos.COSArray;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.PDPageContentStream.AppendMode;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;
import org.apache.pdfbox.util.Matrix;

public class ExtractAngledText
{
    /**
     * This will print the document's data.
     *
     * @param args The command line arguments.
     *
     * @throws IOException If there is an error parsing the document.
     */
    public static void main(String[] args) throws IOException
    {
        if (args.length != 1)
        {
            usage();
        }
        else
        {
            try (PDDocument doc = PDDocument.load(new File(args[0])))
            {
                for (int p = 1; p <= doc.getNumberOfPages(); ++p)
                {
                    System.out.printf("Page: %3d\n", p);
                    System.out.println("----------");

                    // first pass: collect every text angle used on this page
                    AngleCollector angleCollector = new AngleCollector(); // alternatively, reset angles
                    angleCollector.setStartPage(p);
                    angleCollector.setEndPage(p);
                    angleCollector.getText(doc);
                    System.out.println("Collected angles: " + angleCollector.getAngles());
                    System.out.println();

                    // remember /Rotate and clear it; it is restored below
                    PDPage page = doc.getPage(p - 1);
                    int rotation = page.getRotation();
                    page.setRotation(0);

                    // second pass: one filtered extraction per angle
                    PDFTextStripper filteredTextStripper = new FilteredTextStripper();
                    for (int angle : angleCollector.getAngles())
                    {
                        filteredTextStripper.setStartPage(p);
                        filteredTextStripper.setEndPage(p);
                        System.out.printf("Angle: %3d\n", angle);
                        System.out.println("----------");
                        String text;
                        if (angle == 0)
                        {
                            text = filteredTextStripper.getText(doc);
                        }
                        else
                        {
                            // prepend a transformation that rotates by -angle
                            try (PDPageContentStream cs = new PDPageContentStream(
                                    doc, page, AppendMode.PREPEND, false))
                            {
                                cs.transform(Matrix.getRotateInstance(-Math.toRadians(angle), 0, 0));
                            }
                            text = filteredTextStripper.getText(doc);
                            // remove transformation (the prepended stream is the first entry)
                            COSArray contents = (COSArray)
                                    page.getCOSObject().getItem(COSName.CONTENTS);
                            contents.remove(0);
                        }
                        System.out.println(text);
                    }
                    page.setRotation(rotation);
                }
            }
        }
    }

    /**
     * This will print the usage for this document.
     */
    private static void usage()
    {
        System.err.println("Usage: java " +
                ExtractAngledText.class.getName() + " <input-pdf>");
    }
}
class AngleCollector extends PDFTextStripper
{
    private final Set<Integer> angles = new HashSet<>();

    public Set<Integer> getAngles()
    {
        return angles;
    }

    /**
     * Instantiate a new AngleCollector.
     *
     * @throws IOException If there is an error loading the properties.
     */
    AngleCollector() throws IOException
    {
    }

    @Override
    protected void processTextPosition(TextPosition text)
    {
        // record the angle of this glyph instead of collecting its text
        Matrix m = text.getTextMatrix();
        int angle = (int) Math.round(
                Math.toDegrees(Math.atan2(m.getShearY(), m.getScaleY())));
        angle = (angle + 360) % 360; // normalize to 0..359
        angles.add(angle);
    }
}
class FilteredTextStripper extends PDFTextStripper
{
    FilteredTextStripper() throws IOException
    {
    }

    @Override
    protected void processTextPosition(TextPosition text)
    {
        // keep only glyphs whose angle rounds to 0 degrees; all others are skipped
        Matrix m = text.getTextMatrix();
        int angle = (int) Math.round(
                Math.toDegrees(Math.atan2(m.getShearY(), m.getScaleY())));
        if (angle == 0)
        {
            super.processTextPosition(text);
        }
    }
}
Thanks.
Best Regards.
Jorge Eduardo Flórez
I've been thinking about similar strategies for the same problem for
some time but never worked on it.
So yes, we could try all 4 rotations and then see which extraction
makes the most sense.
Another idea that I just came up with: take the
DrawPrintTextLocations.java example from the source code download,
then find this line
    AffineTransform at = text.getTextMatrix().createAffineTransform();
below that, add this line:
    System.out.println("Angle: " +
            Math.toDegrees(Math.atan2(at.getShearY(), at.getScaleY())));
Then look at the output....
This gets the rotation angle, which will hopefully be one of 0, 90,
180, 270.
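To see why atan2(shearY, scaleY) gives the angle, here is a minimal standalone sketch that uses only java.awt.geom.AffineTransform and needs no PDF. TextAngle and degrees are made-up names, and the uniform scale of 12 just stands in for a font size.

```java
import java.awt.geom.AffineTransform;

// Hypothetical helper showing the angle computation in isolation.
final class TextAngle
{
    /** Rotation of the transform in whole degrees, normalized to 0..359. */
    static int degrees(AffineTransform at)
    {
        // for a rotation by t, shearY = sin(t) and scaleY = cos(t);
        // a uniform scale multiplies both, so atan2 is unaffected
        int angle = (int) Math.round(
                Math.toDegrees(Math.atan2(at.getShearY(), at.getScaleY())));
        return (angle + 360) % 360;
    }

    public static void main(String[] args)
    {
        for (int d : new int[] {0, 30, 90, 180, -90})
        {
            AffineTransform at = AffineTransform.getRotateInstance(Math.toRadians(d));
            at.scale(12, 12); // e.g. a 12 pt font
            System.out.println(d + " -> " + degrees(at));
        }
    }
}
```

Negative angles come out of atan2 as values below zero, hence the final normalization step.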
Now run text extraction by preparing each page with
page.setRotation(page.getRotation()-angle);
However this won't work with fine rotations, e.g. the file from
PDFBOX-4368.
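A minimal sketch of that rotation arithmetic, including a check for the "fine rotation" case. RotationFix, isRightAngle and correctedRotation are made-up names, and the sketch assumes the page rotation itself is already a multiple of 90.

```java
// Hypothetical helper, not part of PDFBox.
final class RotationFix
{
    /** True if the text angle can be undone by changing the page rotation. */
    static boolean isRightAngle(int textAngle)
    {
        return textAngle % 90 == 0;
    }

    /**
     * Value that could be passed to page.setRotation(...), normalized to
     * 0, 90, 180 or 270.
     */
    static int correctedRotation(int pageRotation, int textAngle)
    {
        if (!isRightAngle(textAngle))
        {
            // fine rotations (e.g. 30 degrees) need a different approach
            throw new IllegalArgumentException("not a multiple of 90: " + textAngle);
        }
        return (((pageRotation - textAngle) % 360) + 360) % 360;
    }
}
```

For example, a page with rotation 0 and text at 90 degrees would get rotation 270.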
That would need something different, e.g. collecting all rotations,
and then somehow run a filtered extract for each one.
Tilman