Re: Extracting page "correctly"

Tilman Hausherr Wed, 07 Nov 2018 10:34:25 -0800

On 06.11.2018 at 22:30, jorgeeflorez wrote:

Thanks a lot Tilman for your help.
It seems to me that text extraction from a page could be improved (I used PDFBox 2.0.11). The idea, I think, is that one could just invoke a method and get the text of the page, just as you would get it if you selected the text from the page using Adobe Reader.
Looking at the code of LegacyPDFStreamEngine, ancestor of PDFTextStripper, I found the expression "THIS CODE IS DELIBERATELY INCORRECT" on several occasions (I don't know if this affects what I am trying to do). Anyway, I made a subclass of PDFStreamEngine and tried to get the text of the page (I am not familiar with the PDF specification, operators, fonts and all that stuff). I just took some code from the examples that I think I understood, and added a couple of lines.
I could extract the text of the file I used to test, regardless of the page rotation. I also used the PDF file from PDFBOX-4368 and it seems it got the text correctly. In a third file I used, it took the text, but with no spaces between words (I guess spaces were not stored in the PDF).
I attached the test files and the class I created. I know it doesn't cover all the cases, but maybe it can be helpful.
By the way, text extraction was part of a bigger problem. I needed the text of the page, and I also needed to group the text into words and store the coordinates (x, y, width, height) of each word. The grouping part I could do (more or less), but the first part was giving me trouble :)

I don't know what is incorrect there except the height. I think John wanted to do something about it but it didn't happen.


Your attachment didn't get through. Please upload it somewhere.

Yes, most PDFs don't have spaces. The PDFTextStripper class uses heuristics to make them up. If you are working on your own algorithm then use a test like TestTextStripper.java and maybe some or all of the files that are part of the test. You can then compare your extraction with the current code, or just keep it to retest your own code as your algorithm evolves.
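
To illustrate what "making up" spaces can mean, here is a minimal sketch of a gap-based heuristic. It is not the algorithm PDFTextStripper actually uses. The Glyph and SpaceGuesser classes and the GAP_FACTOR threshold are hypothetical names and values chosen for this example; in real code the positions and widths would come from the TextPosition objects of one text line.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the per-glyph data a text stripper sees.
final class Glyph
{
    final String text;
    final float x;          // left edge of the glyph
    final float width;      // advance width of the glyph
    final float spaceWidth; // width of a space character in the glyph's font

    Glyph(String text, float x, float width, float spaceWidth)
    {
        this.text = text;
        this.x = x;
        this.width = width;
        this.spaceWidth = spaceWidth;
    }
}

final class SpaceGuesser
{
    // Assumed threshold: a gap wider than half a space counts as a word break.
    static final float GAP_FACTOR = 0.5f;

    // Joins glyphs that are already sorted left to right on one line,
    // inserting a space wherever the horizontal gap is large enough.
    static String join(List<Glyph> line)
    {
        StringBuilder sb = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : line)
        {
            if (prev != null)
            {
                float gap = g.x - (prev.x + prev.width);
                if (gap > GAP_FACTOR * prev.spaceWidth)
                {
                    sb.append(' ');
                }
            }
            sb.append(g.text);
            prev = g;
        }
        return sb.toString();
    }

    public static void main(String[] args)
    {
        List<Glyph> line = new ArrayList<>();
        line.add(new Glyph("H", 0f, 7f, 3f));
        line.add(new Glyph("i", 7f, 3f, 3f));
        line.add(new Glyph("y", 13f, 6f, 3f)); // 3 units after "i" ends
        line.add(new Glyph("o", 19f, 6f, 3f));
        line.add(new Glyph("u", 25f, 6f, 3f));
        System.out.println(join(line)); // prints "Hi you"
    }
}
```

A fixed threshold like this is easy to test against the reference files, but it will misjudge tightly kerned or widely tracked text, which is why a regression test over many files is worth having.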

Btw here's some updated code. The last code had several bugs: it didn't work on multiple pages and didn't work on pages with a /Rotate entry.


Tilman


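import java.io.File;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.pdfbox.cos.COSArray;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.PDPageContentStream.AppendMode;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;
import org.apache.pdfbox.util.Matrix;

public class ExtractAngledText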
{
    /**
     * This will print the documents data.
     *
     * @param args The command line arguments.
     *
     * @throws IOException If there is an error parsing the document.
     */
    public static void main(String[] args) throws IOException
    {
        if (args.length != 1)
        {
            usage();
        }
        else
        {
            try (PDDocument doc = PDDocument.load(new File(args[0])))
            {
                for (int p = 1; p <= doc.getNumberOfPages(); ++p)
                {
                    System.out.printf("Page: %3d\n", p);
                    System.out.println("----------");

                    AngleCollector angleCollector = new AngleCollector(); // alternatively, reset angles

                    angleCollector.setStartPage(p);
                    angleCollector.setEndPage(p);
                    angleCollector.getText(doc);

                    System.out.println("Collected angles: " + angleCollector.getAngles());

                    System.out.println();

                    PDPage page = doc.getPage(p - 1);
                    int rotation = page.getRotation();
                    page.setRotation(0);

                    PDFTextStripper filteredTextStripper = new FilteredTextStripper();

                    for (int angle : angleCollector.getAngles())
                    {
                        filteredTextStripper.setStartPage(p);
                        filteredTextStripper.setEndPage(p);

                        System.out.printf("Angle: %3d\n", angle);
                        System.out.println("----------");
                        String text;
                        if (angle == 0)
                        {
                            text = filteredTextStripper.getText(doc);
                        }
                        else
                        {
                            // prepend a content stream that rotates the page by -angle,
                            // so that text at this angle becomes horizontal
                            try (PDPageContentStream cs = new PDPageContentStream(doc, page, AppendMode.PREPEND, false))
                            {
                                cs.transform(Matrix.getRotateInstance(-Math.toRadians(angle), 0, 0));
                            }

                            text = filteredTextStripper.getText(doc);

                            // remove the prepended transformation again
                            COSArray contents = (COSArray) page.getCOSObject().getItem(COSName.CONTENTS);
                            contents.remove(0);
                        }
                        System.out.println(text);
                    }
                    page.setRotation(rotation);
                }
            }
        }
    }

    /**
     * This will print the usage for this document.
     */
    private static void usage()
    {

        System.err.println("Usage: java " + ExtractAngledText.class.getName() + " <input-pdf>");

    }
}

class AngleCollector extends PDFTextStripper
{
    Set<Integer> angles = new HashSet<>();

    public Set<Integer> getAngles()
    {
        return angles;
    }

    /**
     * Instantiate a new PDFTextStripper object.
     *
     * @throws IOException If there is an error loading the properties.
     */
    AngleCollector() throws IOException
    {
    }

    @Override
    protected void processTextPosition(TextPosition text)
    {
        Matrix m = text.getTextMatrix();

        int angle = (int) Math.round(Math.toDegrees(Math.atan2(m.getShearY(), m.getScaleY())));

        angle = (angle + 360) % 360;
        angles.add(angle);
    }
}

class FilteredTextStripper extends PDFTextStripper
{
    FilteredTextStripper() throws IOException
    {
    }

    @Override
    protected void processTextPosition(TextPosition text)
    {
        Matrix m = text.getTextMatrix();

        int angle = (int) Math.round(Math.toDegrees(Math.atan2(m.getShearY(), m.getScaleY())));

        if (angle == 0)
        {
            super.processTextPosition(text);
        }
    }
}
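
The angle computation shared by both strippers above can be tried in isolation. The following is a minimal sketch that uses java.awt.geom.AffineTransform as a stand-in for the PDFBox text matrix (an assumption made so that it runs without PDFBox on the classpath). The class name TextAngle and the method angleOf are hypothetical and exist only for this illustration.

```java
import java.awt.geom.AffineTransform;

final class TextAngle
{
    // Returns the rotation of the transform in whole degrees, normalized to 0..359.
    // For a pure rotation, shearY is sin(theta) and scaleY is cos(theta).
    static int angleOf(AffineTransform at)
    {
        int angle = (int) Math.round(Math.toDegrees(Math.atan2(at.getShearY(), at.getScaleY())));
        return (angle + 360) % 360;
    }

    public static void main(String[] args)
    {
        for (int deg : new int[] { 0, 30, 90, 180, -90 })
        {
            AffineTransform at = AffineTransform.getRotateInstance(Math.toRadians(deg));
            System.out.println(deg + " -> " + angleOf(at));
        }
    }
}
```

Rounding to whole degrees keeps the collected set of angles small. Text rotated by fractions of a degree would be merged into the nearest integer angle, which may or may not be what you want.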


Thanks.
Best Regards.
Jorge Eduardo Flórez



    I've been thinking about similar strategies for the same problem for
    some time but never worked on it.
    So yes, we could try all 4 rotations and then see which extract makes
    more sense.
    Another idea that I just came up with: take the
    DrawPrintTextLocations.java example from the source code download, then
    find this line
    AffineTransform at = text.getTextMatrix().createAffineTransform();
    below that, add this line:
    System.out.println("Angle: " +
            Math.toDegrees(Math.atan2(at.getShearY(), at.getScaleY())));
    Then look at the output....
    This gets the rotation angle, which will hopefully be one of 0, 90, 180,
    270.
    Now run text extraction by preparing each page with
    page.setRotation(page.getRotation()-angle);
    However this won't work with fine rotations, e.g. the file from
    PDFBOX-4368.
    That would need something different, e.g. collecting all rotations, and
    then somehow running a filtered extract for each one.
    Tilman


