On 21.12.2021 at 17:16, Peter Kronenberg wrote:
So are all the ‘garbage’ characters I’m seeing simply due to unmapped
characters?
Yes
Is there any solution to make it look prettier?
No
Or is ‘flattening’ (if that’s the correct word), the best solution?
Yes, or use OCR, or use a dictionary to discard the garbage. IIRC, Tika has
an option to run OCR as well and compare which result is better.
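
To illustrate the dictionary idea, here is a minimal sketch in plain Java. It assumes you supply your own word list; the class GarbageFilter, its methods, and the threshold are hypothetical names made up for this example and are not part of Tika or PDFBox. It scores a piece of extracted text by the fraction of tokens that are known words, so that lines scoring below a threshold can be dropped.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Tika or PDFBox.
final class GarbageFilter {
    private GarbageFilter() {}

    /** Returns the fraction of whitespace-separated tokens found in the dictionary. */
    static double knownWordRatio(String text, Set<String> dictionary) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return 0.0; // nothing recognizable in an empty string
        }
        String[] tokens = trimmed.split("\\s+");
        int known = 0;
        for (String token : tokens) {
            // Strip leading and trailing non-letters (punctuation, control characters).
            String word = token.replaceAll("^[^\\p{L}]+|[^\\p{L}]+$", "")
                               .toLowerCase(Locale.ROOT);
            if (!word.isEmpty() && dictionary.contains(word)) {
                known++;
            }
        }
        return (double) known / tokens.length;
    }

    /** Treats text as garbage when too few of its tokens are dictionary words. */
    static boolean looksLikeGarbage(String text, Set<String> dictionary, double minRatio) {
        return knownWordRatio(text, dictionary) < minRatio;
    }
}
```

A reasonable threshold depends on the language and the word list, so it would need tuning on real output.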
Tilman
*From:* Tilman Hausherr <[email protected]>
*Sent:* Tuesday, December 21, 2021 10:06 AM
*To:* [email protected]
*Subject:* Re: Garbage in PDF files
On 21.12.2021 at 15:51, Peter Kronenberg wrote:
Running the attached PDF file through TIKA, I get a lot of garbage
in the output (see txt file). Far more than can be explained by
the unmapped characters. Where is this coming from?
Once a character is found to be unmapped, PDFBox tries a fallback
strategy, which is evidently unsuccessful here:
    // when there is no Unicode mapping available, Acrobat simply coerces the character code
    // into Unicode, so we do the same. Subclasses of PDFStreamEngine don't necessarily want
    // this, which is why we leave it until this point in PDFTextStreamEngine.
    if (unicode == null)
    {
        if (font instanceof PDSimpleFont)
        {
            char c = (char) code;
            unicode = new String(new char[] { c });
        }
        else
        {
            // Acrobat doesn't seem to coerce composite font's character codes, instead it
            // skips them. See the "allah2.pdf" TestTextStripper file.
            return;
        }
    }
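
The effect of that fallback can be reproduced outside PDFBox. The following is a stand-alone sketch, not PDFBox code: the class CoercionDemo and its methods are hypothetical, and a boolean stands in for the `font instanceof PDSimpleFont` check. It shows that casting a raw character code to a char only yields sensible text when the code happens to coincide with the intended Unicode value; low codes turn into control characters, which show up as garbage in the output.

```java
// Hypothetical stand-alone illustration; not PDFBox code.
final class CoercionDemo {
    private CoercionDemo() {}

    /**
     * Mirrors the fallback quoted above: for a simple font the raw code is
     * cast to a char; for a composite font the character is skipped (null).
     */
    static String fallback(int code, boolean simpleFont) {
        if (simpleFont) {
            return String.valueOf((char) code);
        }
        return null; // composite font: skip the character
    }

    /** True if the text contains at least one ISO control character. */
    static boolean hasControlChars(String s) {
        return s.chars().anyMatch(c -> Character.isISOControl(c));
    }
}
```

For example, `CoercionDemo.fallback(0x41, true)` gives "A", which is only correct if the font really draws an A for code 0x41, while `CoercionDemo.fallback(0x02, true)` gives a control character.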
Adobe Reader also produces garbage.
I can't say whether it is "far more than expected"; that would
require counting and comparing.
If I take the PDF and flatten it by ‘printing’ to a PDF file, the
garbage goes away
Printing probably converts everything to raster or vector
graphics.
Tilman