a parallel array -> parallel arrays
-j -> -J (tika-app commandline options)
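A minimal sketch of reading the two parallel arrays, assuming the -J output is a JSON list of metadata objects in which a single-page document yields a plain string and a multi-page document yields a list. The function names and the 0.5 threshold are hypothetical, chosen only for illustration; only the two pdf:* keys come from this thread.

```python
import json


def unmapped_ratio_per_page(metadata):
    """Return the fraction of unmapped characters for each page."""

    def as_ints(value):
        # Assumption: one page gives a scalar, several pages give a list.
        if value is None:
            return []
        if not isinstance(value, list):
            value = [value]
        return [int(v) for v in value]

    unmapped = as_ints(metadata.get("pdf:unmappedUnicodeCharsPerPage"))
    total = as_ints(metadata.get("pdf:charsPerPage"))
    if len(unmapped) != len(total):
        raise ValueError("expected parallel arrays of equal length")
    # Guard against empty pages to avoid dividing by zero.
    return [u / t if t else 0.0 for u, t in zip(unmapped, total)]


def pages_needing_ocr(metadata, threshold=0.5):
    """Zero-based indices of pages whose unmapped ratio exceeds the (hypothetical) threshold."""
    ratios = unmapped_ratio_per_page(metadata)
    return [i for i, r in enumerate(ratios) if r > threshold]


# Sample shaped like the values reported in this thread (one page, all unmapped).
sample = json.loads(
    '[{"pdf:unmappedUnicodeCharsPerPage": "3242", "pdf:charsPerPage": "3242"}]'
)
print(unmapped_ratio_per_page(sample[0]))  # [1.0]
print(pages_needing_ocr(sample[0]))        # [0]
```

Pages flagged this way are candidates for an OCR pass; the right threshold depends on the documents at hand.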
On Thu, Apr 4, 2019 at 7:06 AM Tim Allison <[email protected]> wrote:
>
> And with TIKA-2846 (thanks to Tilman), you will now be able to see how
> many unmapped chars there were per page. If there's more than one
> page, you'll get a parallel array of ints. These were the results on
> your doc:
>
> 0: pdf:unmappedUnicodeCharsPerPage : 3242
> 0: pdf:charsPerPage : 3242
>
> Note, you'll either have to retrieve the Tika Metadata object after
> the parse or use the RecursiveParserWrapper (-j /rmeta). These stats
> won't show up in the xhtml because they are calculated after the first
> bit of content has been written.
>
> On Tue, Apr 2, 2019 at 4:52 AM Giovanni De Stefano (zxxz)
> <[email protected]> wrote:
> >
> > Hello Tim, Peter,
> >
> > Thank you for your replies.
> >
> > It seems indeed that the only solution is to include Tesseract in my
> > processing pipeline.
> >
> > I don’t know if it might be useful to future readers, but I noticed that
> > *all* PDFs created with PDF24 are subject to this behavior.
> >
> > I guess this might fall into the “obfuscation” approach some software adopts
> > :-(
> >
> > Cheers,
> >
> > Giovanni
> > On 2 Apr 2019, 04:48 +0200, Peter Murray-Rust <[email protected]>, wrote:
> >
> > I agree with Tim's analysis.
> >
> > Many "legacy" fonts (including unfortunately some of those used by LaTeX)
> > are not mapped onto Unicode. There are two indications (codepoints and
> > names) which can often be used to create a partial mapping. I spent a *lot*
> > of time doing this manually. For example:
> >
> > WARN No Unicode mapping for .notdef (89) in font null
> >
> > WARN No Unicode mapping for 90 (90) in font null
> > <<<
> > The first field is the name, the second the codepoint. In your example the
> > font (probably) uses codepoints consistently within that particular font,
> > e.g. 89 is consistently the same character and different from 90. The names
> > *may* differentiate characters. Here is my (hand-edited) entry for CMSY
> > (used by LaTeX for symbols):
> >
> > <codePoint unicode="U+00B1" name=".notdef" note="PLUS-MINUS SIGN"/>
> >
> > But this will only work for this particular font.
> >
> > If you are only dealing with anglophone alphanumerics from a single
> > source/font you can probably work out a table. You are welcome to use mine
> > (mainly from scientific / technical publishing). Beyond that OCR/Tesseract
> > may help. (I use it a lot.) However maths and non-ISO-LATIN is problematic.
> > For example, distinguishing between the many types of dash/minus/underline
> > depends on having a system trained on these. Relative heights and sizes are a
> > major problem.
> >
> > In general, typesetters and their software are only concerned with the
> > visual display and frequently use illiteracies (e.g. "=" + backspace + "/"
> > for "not-equals"). Anyone having work typeset in PDF should insist that a
> > Unicode font is used. Better still, avoid PDF.
> >
> >
> > --
> > Peter Murray-Rust
> > Reader Emeritus in Molecular Informatics
> > Unilever Centre, Dept. Of Chemistry
> > University of Cambridge
> > CB2 1EW, UK
> > +44-1223-763069
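
As a starting point for the kind of hand-built table Peter describes, the WARN lines quoted above can be collected into (font, codepoint, name) triples. This is only a sketch: the regular expression is inferred from the two sample lines in this thread, and the function name collect_unmapped is hypothetical.

```python
import re
from collections import Counter

# Pattern inferred from the two sample lines only (an assumption):
#   WARN No Unicode mapping for <name> (<codepoint>) in font <font>
WARN_RE = re.compile(
    r"No Unicode mapping for (?P<name>\S+) \((?P<code>\d+)\) in font (?P<font>.+?)\s*$"
)


def collect_unmapped(log_lines):
    """Count each (font, codepoint, name) triple seen in the warnings."""
    counts = Counter()
    for line in log_lines:
        match = WARN_RE.search(line)
        if match:
            key = (match["font"], int(match["code"]), match["name"])
            counts[key] += 1
    return counts


log = [
    "WARN No Unicode mapping for .notdef (89) in font null",
    "WARN No Unicode mapping for 90 (90) in font null",
]
for (font, code, name), n in sorted(collect_unmapped(log).items()):
    print(font, code, name, n)
# null 89 .notdef 1
# null 90 90 1
```

Because codepoints are only consistent within one font, the font is kept as part of the key; the Unicode value for each triple still has to be filled in by hand.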
