Re: Garbage in PDF files

Tilman Hausherr Tue, 21 Dec 2021 07:06:32 -0800

On 21.12.2021 at 15:51, Peter Kronenberg wrote:

Running the attached PDF file through TIKA, I get a lot of garbage in the output (see txt file). Far more than can be explained by the unmapped characters. Where is this coming from?

After the character is found to be unmapped, PDFBox tries a backup strategy, which is obviously not successful here:

        // when there is no Unicode mapping available, Acrobat simply coerces the character code
        // into Unicode, so we do the same. Subclasses of PDFStreamEngine don't necessarily want
        // this, which is why we leave it until this point in PDFTextStreamEngine.
        if (unicode == null)
        {
            if (font instanceof PDSimpleFont)
            {
                char c = (char) code;
                unicode = new String(new char[] { c });
            }
            else
            {
                // Acrobat doesn't seem to coerce composite font's character codes, instead it
                // skips them. See the "allah2.pdf" TestTextStripper file.
                return;
            }
        }
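
To make the effect of the simple-font branch concrete, here is a minimal standalone sketch. It does not use PDFBox; the class name CoercionDemo and the sample codes are made up for illustration. It only repeats the cast from the snippet above, so a code that has nothing to do with Unicode (as is common with subset fonts that use a custom encoding) ends up as whatever character happens to sit at that code point.

```java
public class CoercionDemo {
    // Same operation as the simple-font fallback quoted above: the raw
    // character code is cast to a UTF-16 code unit without any lookup.
    static String coerce(int code) {
        char c = (char) code;
        return new String(new char[] { c });
    }

    public static void main(String[] args) {
        // Hypothetical character codes, chosen only for illustration.
        int[] codes = { 0x41, 0x03, 0x8F };
        for (int code : codes) {
            String s = coerce(code);
            System.out.printf("code 0x%02X -> U+%04X%n", code, (int) s.charAt(0));
        }
    }
}
```

Code 0x41 happens to look right ("A") only if the font really uses that code for "A"; 0x03 and 0x8F become control characters, which show up as garbage in the extracted text.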

Adobe Reader also produces garbage.

I can't comment on whether it is "far more than expected"; that would require counting and making comparisons.
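
A rough way to do such a count, as a sketch only: the class name GarbageCounter and the definition of "suspicious" (control characters other than tab, newline and carriage return, plus U+FFFD) are my assumptions, not anything PDFBox or Tika provides. The resulting number could then be compared against the number of unmapped characters that are reported.

```java
public class GarbageCounter {
    // Counts characters that rarely occur in ordinary extracted text.
    // The criteria are an assumption made for this sketch.
    static long countSuspicious(String text) {
        return text.chars()
                .filter(ch -> ch == 0xFFFD
                        || (Character.isISOControl(ch)
                            && ch != '\n' && ch != '\r' && ch != '\t'))
                .count();
    }

    public static void main(String[] args) {
        // Hypothetical extracted text containing one control character.
        String sample = "ab\u0003c\n";
        System.out.println(countSuspicious(sample) + " suspicious of "
                + sample.length() + " characters");
    }
}
```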

If I take the PDF and flatten it by ‘printing’ to a PDF file, the garbage goes away

Printing is probably converting it all to raster graphics or vector graphics.


Tilman
