Thanks a lot!

I am now watching https://issues.apache.org/jira/browse/TIKA-2749.

In the tests I ran, performance dropped far too much when I forced every 
PDF to be OCRed as an image instead of just extracting the text :-(

A hybrid solution would indeed be a much better approach in my case.
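
For future readers, here is a minimal sketch of the per-page decision I have in mind. It is not Tika code: it assumes the two per-page stats (pdf:charsPerPage and pdf:unmappedUnicodeCharsPerPage) have already been read from the Tika Metadata object and converted to lists of ints. The function name, the 0.5 threshold, and the choice to OCR pages with no extracted characters are my own assumptions and would need tuning on real documents.

```python
def pages_needing_ocr(chars_per_page, unmapped_per_page, threshold=0.5):
    """Return the 0-based indices of pages that look like OCR candidates.

    chars_per_page and unmapped_per_page are parallel lists of ints, one
    entry per page. The threshold is an arbitrary starting point
    (assumption), not a value recommended by Tika.
    """
    if len(chars_per_page) != len(unmapped_per_page):
        raise ValueError("per-page lists must have the same length")

    candidates = []
    for page, (total, unmapped) in enumerate(zip(chars_per_page, unmapped_per_page)):
        if total == 0:
            # Assumption: a page with no extracted characters is probably
            # a scanned image, so send it to OCR as well.
            candidates.append(page)
        elif unmapped / total >= threshold:
            # Most characters have no Unicode mapping, so the extracted
            # text is unlikely to be usable.
            candidates.append(page)
    return candidates


# Single-page document with the stats Tim reported for my file:
# every one of the 3242 characters is unmapped.
print(pages_needing_ocr([3242], [3242]))
```

Only the pages returned here would go through Tesseract; the rest would keep the normal (fast) text extraction.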


Giovanni
On 5 Apr 2019, 20:12 +0200, Tim Allison <[email protected]>, wrote:
> > Also, does anybody know when 1.21 is due? :-)
>
> Both POI and PDFBox are about to make releases. I'd be willing to run
> a release of Tika once those are out (two or so weeks)...
>
> Fellow devs, What do you think of 1.21 shortly after POI and PDFBox
> are released?
>
> > Do you think that would be a decent strategy?
> Yep, exactly. I _may_ have time to implement a "first steps" version of
> https://issues.apache.org/jira/browse/TIKA-2749 before the
> release...so maybe you won't have to make changes on your side.
>
> On Thu, Apr 4, 2019 at 5:06 PM Giovanni De Stefano
> <[email protected]> wrote:
> >
> > I could use the number of unmapped Unicode chars in a page to decide 
> > whether a PDF should be parsed “normally” or OCRed.
> >
> > Do you think that would be a decent strategy?
> >
> > Also, does anybody know when 1.21 is due? :-)
> >
> >
> > Giovanni
> > On 4 Apr 2019, 13:06 +0200, Tim Allison <[email protected]>, wrote:
> >
> > And with TIKA-2846 (thanks to Tilman), you will now be able to see how
> > many unmapped chars there were per page. If there's more than one
> > page, you'll get a parallel array of ints. These were the results on
> > your doc:
> >
> > 0: pdf:unmappedUnicodeCharsPerPage : 3242
> > 0: pdf:charsPerPage : 3242
> >
> > Note, you'll either have to retrieve the Tika Metadata object after
> > the parse or use the RecursiveParserWrapper (-j /rmeta). These stats
> > won't show up in the xhtml because they are calculated after the first
> > bit of content has been written.
> >
> > On Tue, Apr 2, 2019 at 4:52 AM Giovanni De Stefano (zxxz)
> > <[email protected]> wrote:
> >
> >
> > Hello Tim, Peter,
> >
> > Thank you for your replies.
> >
> > It seems indeed that the only solution is to include Tesseract in my 
> > processing pipeline.
> >
> > I don’t know if it might be useful to future readers, but I noticed that 
> > *all* PDFs created with PDF24 are subject to this behavior.
> >
> > I guess this might fall into the “obfuscation” approach some software adopts 
> > :-(
> >
> > Cheers,
> >
> > Giovanni
> > On 2 Apr 2019, 04:48 +0200, Peter Murray-Rust <[email protected]>, wrote:
> >
> > I agree with Tim's analysis.
> >
> > Many "legacy" fonts (including unfortunately some of those used by LaTeX)
> > are not mapped onto Unicode. There are two indications (codepoints and
> > names) which can often be used to create a partial mapping. I spent a *lot*
> > of time doing this manually. For example:
> >
> >
> > WARN No Unicode mapping for .notdef (89) in font null
> >
> > WARN No Unicode mapping for 90 (90) in font null
> > <<<
> > The first field is the name, the second the codepoint. In your example the
> > font (probably) uses codepoints consistently within that particular font,
> > e.g. 89 is consistently the same character and different from 90. The names
> > *may* differentiate characters. Here is my (handedited) entry for CMSY
> > (used by LaTeX for symbols):
> >
> > <codePoint unicode="U+00B1" name=".notdef" note="PLUS-MINUS SIGN"/>
> >
> > But this will only work for this particular font.
> >
> > If you are only dealing with anglophone alphanumerics from a single
> > source/font you can probably work out a table. You are welcome to use mine
> > (mainly from scientific / technical publishing). Beyond that OCR/Tesseract
> > may help. (I use it a lot). However maths and non-ISO-LATIN is problematic.
> > For example distinguishing between the many types of dash/minus/underline
> > depends on having a system trained on these. Relative heights and size are a
> > major problem.
> >
> > In general, typesetters and their software are only concerned with the
> > visual display and frequently use illiteracies (e.g. "=" + backspace + "/"
> > for "not-equals"). Anyone having work typeset in PDF should insist that a
> > Unicode font is used. Better still, avoid PDF.
> >
> >
> >
> > --
> > Peter Murray-Rust
> > Reader Emeritus in Molecular Informatics
> > Unilever Centre, Dept. Of Chemistry
> > University of Cambridge
> > CB2 1EW, UK
> > +44-1223-763069
