On 06.11.2018 at 22:30, jorgeeflorez wrote:
Thanks a lot Tilman for your help.

It seems to me that text extraction from a page could be improved (I used PDFBox 2.0.11). The idea, I think, is that one could just invoke a method and get the text of the page, just as you would get it by selecting the text on the page in Adobe Reader.

Looking at the code of LegacyPDFStreamEngine, an ancestor of PDFTextStripper, I found the expression "THIS CODE IS DELIBERATELY INCORRECT" on several occasions (I don't know if this affects what I am trying to do). Anyway, I made a subclass of PDFStreamEngine and tried to get the text of the page (I am not familiar with the PDF specification, operators, fonts and all that stuff). I just took some code from the examples that I think I understood and added a couple of lines.

I could extract the text of the file I used to test, regardless of the page rotation. I also used the PDF file from PDFBOX-4368 and it seems the text was extracted correctly. In a third file I used, it extracted the text, but without spaces between words (I guess spaces were not stored in the PDF).

I attached the test files and the class I created. I know it doesn't cover all the cases, but maybe it can be helpful.

By the way, text extraction was part of a bigger problem. I needed the text of the page, and I also needed to group the text into words and store the coordinates (x, y, width, height) of each word. I could do the grouping part (more or less), but the first part was giving me trouble :)


I don't know what is incorrect there except the height. I think John wanted to do something about it but it didn't happen.

Your attachment didn't get through. Please upload it somewhere.

Yes, most PDFs don't have spaces. The PDFTextStripper class uses heuristics to make them up. If you are working on your own algorithm, then use a test like TestTextStripper.java and maybe some or all of the files that are part of the test. You can then compare your extraction with the current code, or just keep it to retest your own code as your algorithm evolves.
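
To illustrate what such a heuristic can look like, here is a minimal sketch in plain Java. It is not the algorithm PDFTextStripper uses. The Glyph class, the joinWithSpaces method and the 0.3 gap factor are all made up for this example: a space is inserted whenever the horizontal gap to the previous glyph is larger than a fraction of that glyph's width.

```java
import java.util.Arrays;
import java.util.List;

class SpaceHeuristicSketch
{
    // Hypothetical stand-in for one glyph on a text line: its string,
    // the x coordinate of its left edge, and its width.
    static final class Glyph
    {
        final String text;
        final float x;
        final float width;

        Glyph(String text, float x, float width)
        {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    // Joins glyphs that are already sorted by x. A space is inserted when the
    // gap to the previous glyph exceeds gapFactor times the previous width.
    static String joinWithSpaces(List<Glyph> line, float gapFactor)
    {
        StringBuilder sb = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : line)
        {
            if (prev != null)
            {
                float gap = g.x - (prev.x + prev.width);
                if (gap > gapFactor * prev.width)
                {
                    sb.append(' ');
                }
            }
            sb.append(g.text);
            prev = g;
        }
        return sb.toString();
    }

    // Sample line with one visible gap between "i" and "a".
    static List<Glyph> demoLine()
    {
        return Arrays.asList(
                new Glyph("H", 0f, 5f),
                new Glyph("i", 5f, 2f),
                new Glyph("a", 10f, 5f),
                new Glyph("l", 15f, 2f),
                new Glyph("l", 17f, 2f));
    }

    public static void main(String[] args)
    {
        System.out.println(joinWithSpaces(demoLine(), 0.3f)); // prints "Hi all"
    }
}
```

A real implementation would also have to deal with line breaks, font-dependent space widths and overlapping glyphs, which is why comparing against the files of an existing test is useful.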

Btw here's some updated code. The last code had several bugs: it didn't work on multiple pages and didn't work on pages with a /Rotate entry.

Tilman


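import java.io.File;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.pdfbox.cos.COSArray;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.PDPageContentStream.AppendMode;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;
import org.apache.pdfbox.util.Matrix;

public class ExtractAngledText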
{
    /**
     * This will print the document's text, page by page and grouped by text angle.
     *
     * @param args The command line arguments.
     *
     * @throws IOException If there is an error parsing the document.
     */
    public static void main(String[] args) throws IOException
    {
        if (args.length != 1)
        {
            usage();
        }
        else
        {
            try (PDDocument doc = PDDocument.load(new File(args[0])))
            {
                for (int p = 1; p <= doc.getNumberOfPages(); ++p)
                {
                    System.out.printf("Page: %3d\n", p);
                    System.out.println("----------");

                    AngleCollector angleCollector = new AngleCollector(); // alternatively, reset angles
                    angleCollector.setStartPage(p);
                    angleCollector.setEndPage(p);
                    angleCollector.getText(doc);
                    System.out.println("Collected angles: " + angleCollector.getAngles());
                    System.out.println();

                    PDPage page = doc.getPage(p - 1);
                    int rotation = page.getRotation();
                    page.setRotation(0);
                    PDFTextStripper filteredTextStripper = new FilteredTextStripper();
                    for (int angle : angleCollector.getAngles())
                    {
                        filteredTextStripper.setStartPage(p);
                        filteredTextStripper.setEndPage(p);

                        System.out.printf("Angle: %3d\n", angle);
                        System.out.println("----------");
                        String text;
                        if (angle == 0)
                        {
                            text = filteredTextStripper.getText(doc);
                        }
                        else
                        {
                            // prepend a transformation
                            try (PDPageContentStream cs = new PDPageContentStream(doc, page, AppendMode.PREPEND, false))
                            {
                                cs.transform(Matrix.getRotateInstance(-Math.toRadians(angle), 0, 0));
                            }

                            text = filteredTextStripper.getText(doc);

                            // remove transformation
                            COSArray contents = (COSArray) page.getCOSObject().getItem(COSName.CONTENTS);
                            contents.remove(0);
                        }
                        System.out.println(text);
                    }
                    page.setRotation(rotation);
                }
            }
        }
    }

    /**
     * This will print the usage for this program.
     */
    private static void usage()
    {
        System.err.println("Usage: java " + ExtractAngledText.class.getName() + " <input-pdf>");
    }
}

class AngleCollector extends PDFTextStripper
{
    Set<Integer> angles = new HashSet<>();

    public Set<Integer> getAngles()
    {
        return angles;
    }

    /**
     * Instantiate a new AngleCollector object.
     *
     * @throws IOException If there is an error loading the properties.
     */
    AngleCollector() throws IOException
    {
    }

    @Override
    protected void processTextPosition(TextPosition text)
    {
        Matrix m = text.getTextMatrix();
        int angle = (int) Math.round(Math.toDegrees(Math.atan2(m.getShearY(), m.getScaleY())));
        angle = (angle + 360) % 360;
        angles.add(angle);
    }
}

class FilteredTextStripper extends PDFTextStripper
{
    FilteredTextStripper() throws IOException
    {
    }

    @Override
    protected void processTextPosition(TextPosition text)
    {
        Matrix m = text.getTextMatrix();
        int angle = (int) Math.round(Math.toDegrees(Math.atan2(m.getShearY(), m.getScaleY())));
        if (angle == 0)
        {
            super.processTextPosition(text);
        }
    }
}
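
The angle calculation used in both strippers can be tried without a PDF. The following is only a sketch: it uses java.awt.geom.AffineTransform in place of the PDFBox Matrix (both offer getShearY() and getScaleY()), and the class and method names are made up for this example.

```java
import java.awt.geom.AffineTransform;

class TextAngleSketch
{
    // Rotation angle of a transform in whole degrees, normalized to 0..359.
    // Same formula as in AngleCollector above, applied to an AffineTransform.
    static int angleOf(AffineTransform at)
    {
        int angle = (int) Math.round(Math.toDegrees(Math.atan2(at.getShearY(), at.getScaleY())));
        return (angle + 360) % 360;
    }

    public static void main(String[] args)
    {
        // Build pure rotations and check that the same angle comes back.
        for (int degrees : new int[] { 0, 30, 90, 180, 270 })
        {
            AffineTransform at = AffineTransform.getRotateInstance(Math.toRadians(degrees));
            System.out.println(degrees + " -> " + angleOf(at));
        }
    }
}
```

Note that FilteredTextStripper skips the normalization step because it only compares against 0, which is not affected by it.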

Thanks.
Best Regards.
Jorge Eduardo Flórez



    I've been thinking about similar strategies for the same problem for
    some time but never worked on it.
    So yes, we could try all 4 rotations and then see which extract makes
    more sense.
    Another idea that I just came up with: take the
    DrawPrintTextLocations.java example from the source code download, then
    find this line
        AffineTransform at = text.getTextMatrix().createAffineTransform();
    below that, add this line:
        System.out.println("Angle: " + Math.toDegrees(Math.atan2(at.getShearY(), at.getScaleY())));
    Then look at the output....
    This gets the rotation angle, which will hopefully be one of 0, 90, 180,
    270.
    Now run text extraction by preparing each page with
        page.setRotation(page.getRotation()-angle);
    However this won't work with fine rotations, e.g. the file from
    PDFBOX-4368.
    That would need something different, e.g. collecting all rotations, and
    then somehow running a filtered extract for each one.
    Tilman


