And with TIKA-2846 (thanks to Tilman), you will now be able to see how
many unmapped chars there were per page.  If there's more than one
page, you'll get a parallel array of ints.  These were the results on
your doc:

0: pdf:unmappedUnicodeCharsPerPage : 3242
0: pdf:charsPerPage : 3242

Note: you'll either have to retrieve the Tika Metadata object after
the parse or use the RecursiveParserWrapper (-J in tika-app, or the
/rmeta endpoint in tika-server).  These stats won't show up in the
xhtml because they are calculated after the first bit of content has
been written.
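
For example, here's a minimal sketch of pulling those stats out of the
Metadata object after a parse (the class name and the file handling are
just placeholders, and you'll need a build that includes TIKA-2846):

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class UnmappedStats {
    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        Metadata metadata = new Metadata();
        try (InputStream is = Files.newInputStream(Paths.get(args[0]))) {
            // -1 turns off the content handler's write limit
            parser.parse(is, new BodyContentHandler(-1), metadata,
                    new ParseContext());
        }
        // Parallel arrays: one entry per page
        String[] unmapped = metadata.getValues("pdf:unmappedUnicodeCharsPerPage");
        String[] total = metadata.getValues("pdf:charsPerPage");
        for (int i = 0; i < total.length; i++) {
            System.out.println("page " + i + ": " + unmapped[i]
                    + " unmapped of " + total[i] + " chars");
        }
    }
}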

On Tue, Apr 2, 2019 at 4:52 AM Giovanni De Stefano (zxxz)
<[email protected]> wrote:
>
> Hello Tim, Peter,
>
> Thank you for your replies.
>
> It seems indeed that the only solution is to include Tesseract in my 
> processing pipeline.
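>
> For future readers, this is roughly what that looks like with Tika's
> PDFParserConfig (a minimal sketch: OCR_ONLY is one of several strategies,
> Tesseract must be installed and on the PATH, and the class name is just a
> placeholder):
>
> import java.io.InputStream;
> import java.nio.file.Files;
> import java.nio.file.Paths;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.pdf.PDFParserConfig;
> import org.apache.tika.sax.BodyContentHandler;
>
> public class OcrPipeline {
>     public static void main(String[] args) throws Exception {
>         // OCR the rendered pages instead of extracting the broken text
>         PDFParserConfig pdfConfig = new PDFParserConfig();
>         pdfConfig.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.OCR_ONLY);
>
>         ParseContext context = new ParseContext();
>         context.set(PDFParserConfig.class, pdfConfig);
>
>         AutoDetectParser parser = new AutoDetectParser();
>         BodyContentHandler handler = new BodyContentHandler(-1);
>         try (InputStream is = Files.newInputStream(Paths.get(args[0]))) {
>             parser.parse(is, handler, new Metadata(), context);
>         }
>         System.out.println(handler.toString());
>     }
> }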
>
> I don’t know if it might be useful to future readers, but I noticed that
> *all* PDFs created with PDF24 are subject to this behavior.
>
> I guess this might fall into the “obfuscation” approach some software
> adopts :-(
>
> Cheers,
>
> Giovanni
> On 2 Apr 2019, 04:48 +0200, Peter Murray-Rust <[email protected]> wrote:
>
> I agree with Tim's analysis.
>
> Many "legacy" fonts (including unfortunately some of those used by LaTeX)
> are not mapped onto Unicode. There are two indications (codepoints and
> names) which can often be used to create a partial mapping. I spent a *lot*
> of time doing this manually. For example:
>
> >>>
> WARN No Unicode mapping for .notdef (89) in font null
> WARN No Unicode mapping for 90 (90) in font null
> <<<
> The first field is the name, the second the codepoint. In your example the
> font (probably) uses codepoints consistently within that particular font,
> e.g. 89 is consistently the same character and different from 90. The names
> *may* differentiate characters. Here is my (hand-edited) entry for CMSY
> (used by LaTeX for symbols):
>
> <codePoint unicode="U+00B1" name=".notdef" note="PLUS-MINUS SIGN"/>
>
> But this will only work for this particular font.
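>
> As a rough sketch, a hand-built table like that can be applied per font
> while decoding (the codepoint value below is illustrative, not a real
> CMSY mapping, and the class and method names are made up):
>
> import java.util.HashMap;
> import java.util.Map;
>
> public class CmsyRemap {
>     // Raw codepoint -> Unicode, built from hand-edited entries like
>     // the one above
>     private static final Map<Integer, String> TABLE = new HashMap<>();
>     static {
>         TABLE.put(6, "\u00B1"); // .notdef -> PLUS-MINUS SIGN
>     }
>
>     public static String remap(int rawCodePoint) {
>         // Keep unmapped glyphs visible as U+FFFD rather than dropping them
>         return TABLE.getOrDefault(rawCodePoint, "\uFFFD");
>     }
> }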
>
> If you are only dealing with anglophone alphanumeric text from a single
> source/font you can probably work out a table. You are welcome to use mine
> (mainly from scientific / technical publishing). Beyond that, OCR/Tesseract
> may help (I use it a lot). However, maths and non-ISO-LATIN text are
> problematic. For example, distinguishing between the many types of
> dash/minus/underline depends on having a system trained on these. Relative
> heights and sizes are a major problem.
>
> In general, typesetters and their software are only concerned with the
> visual display and frequently use illiteracies (e.g. "=" + backspace + "/"
> for "not-equals"). Anyone having work typeset in PDF should insist that a
> Unicode font is used. Better still, avoid PDF.
>
>
>
> --
> Peter Murray-Rust
> Reader Emeritus in Molecular Informatics
> Unilever Centre, Dept. Of Chemistry
> University of Cambridge
> CB2 1EW, UK
> +44-1223-763069
