So I guess another issue is that we can't run OCR on documents that contain multiple languages…
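[Editor's note on the multi-language point: tesseract itself can be asked to try several languages at once by joining them with '+', and tika-server 2.x lets you override the OCR language per request via the X-Tika-OCRLanguage header (and force OCR with X-Tika-PDFOcrStrategy). The sketch below is illustrative, not a verified recipe: it assumes a tika-server running on localhost:9998 and that the eng and ara traineddata files are installed for tesseract; the URL and file name are placeholders.]

```python
# Sketch only: assumes tika-server 2.x on localhost:9998 with
# tesseract's eng and ara traineddata installed. Not a verified recipe.
import urllib.request

TIKA_URL = "http://localhost:9998/rmeta/text"

def build_ocr_request(pdf_bytes, languages=("eng", "ara")):
    """Build a PUT request asking tika-server to OCR with multiple
    tesseract languages joined by '+'."""
    headers = {
        # Per-request OCR language override understood by tika-server
        "X-Tika-OCRLanguage": "+".join(languages),
        # Force OCR rather than relying on the (garbled) text layer
        "X-Tika-PDFOcrStrategy": "ocr_only",
        "Content-Type": "application/pdf",
    }
    return urllib.request.Request(TIKA_URL, data=pdf_bytes,
                                  headers=headers, method="PUT")

# Only works against a running server:
# req = build_ocr_request(open("mixed.pdf", "rb").read())
# print(urllib.request.urlopen(req).read().decode("utf-8"))
```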
On Thu, 26 Jan 2023 at 22:58 Tim Allison <talli...@apache.org> wrote:

> I sent offline the text extracted by tesseract when told the language
> is "ara". The English is completely garbled. I can't evaluate the
> quality of the Arabic.
>
> On Thu, Jan 26, 2023 at 3:53 PM Tim Allison <talli...@apache.org> wrote:
> >
> > Ha. Cool. I was going to recommend that.
> >
> > This file does trigger OCR on my local dev environment. If you use
> > the /rmeta endpoint on tika-server, you'll see something like:
> > X-TIKA:Parsed-By-Full-Set : org.apache.tika.parser.DefaultParser
> > X-TIKA:Parsed-By-Full-Set : org.apache.tika.parser.pdf.PDFParser
> > X-TIKA:Parsed-By-Full-Set : org.apache.tika.parser.ocr.TesseractOCRParser
> >
> > There are two areas of bad news: 1) Arabic is not loaded by default
> > in the tika-full docker container. 2) We don't have a good way of
> > doing language detection to tell tesseract which language to apply
> > by default.
> >
> > On Thu, Jan 26, 2023 at 3:28 PM שי ברק <shai...@gmail.com> wrote:
> > >
> > > I use the full docker image of Tika 2.6.
> > > How can I check whether I have it or not, and where am I supposed
> > > to see the outcome of the OCR?
> > >
> > > On Thu, 26 Jan 2023 at 22:25 Tim Allison <talli...@apache.org> wrote:
> > >>
> > >> If tesseract is installed on your system and callable as 'tesseract',
> > >> and if you don't make any modifications via tika-config.xml, tesseract
> > >> will be applied to images automatically and to pages of PDFs that have
> > >> a) only a few characters (<10?) or b) more than a handful of
> > >> unmapped unicode characters.
> > >>
> > >> On Thu, Jan 26, 2023 at 3:17 PM שי ברק <shai...@gmail.com> wrote:
> > >> >
> > >> > Does Tika support OCR on PDF? Is there an endpoint or header for this?
> > >> >
> > >> > On Thu, 26 Jan 2023 at 21:54 Tim Allison <talli...@apache.org> wrote:
> > >> >>
> > >> >> Sorry, one more thing.
> > >> >>
> > >> >> If you use tika-eval's metadata filter, that will tell you that the
> > >> >> out-of-vocabulary statistic (an indicator of "garbage") would likely
> > >> >> be quite high for this file.
> > >> >>
> > >> >> On Thu, Jan 26, 2023 at 2:51 PM Tim Allison <talli...@apache.org> wrote:
> > >> >> >
> > >> >> > A user dm'd me with an example file that contained English and
> > >> >> > Arabic. The Arabic that was extracted was gibberish/mojibake. I
> > >> >> > wanted to archive my response on our user list.
> > >> >> >
> > >> >> > * Extracting text from PDFs is a challenge.
> > >> >> > * For troubleshooting, see:
> > >> >> > https://cwiki.apache.org/confluence/display/TIKA/Troubleshooting+Tika#TroubleshootingTika-PDFTextProblems
> > >> >> > * Text extracted by other tools is also gibberish: Foxit,
> > >> >> > pdftotext, and Mac's Preview.
> > >> >> > * PDFBox logs warnings about missing unicode mappings.
> > >> >> > * Tika reports the number of missing unicode mappings per page.
> > >> >> > The point of this is that integrators might choose to run OCR
> > >> >> > on pages with high counts of missing unicode mappings. From the
> > >> >> > metadata: "pdf:charsPerPage":["1224","662"]
> > >> >> > "pdf:unmappedUnicodeCharsPerPage":["620","249"]
> > >> >> >
> > >> >> > Finally, if you want a medium dive on some of the things that
> > >> >> > can go wrong with text extraction in PDFs:
> > >> >> > https://irsg.bcs.org/informer/wp-content/uploads/OverviewOfTextExtractionFromPDFs.pdf
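[Editor's note: the "integrators might choose to run OCR on pages with high counts of missing unicode mappings" idea from the thread can be sketched in a few lines. The metadata keys are the ones Tika reports above; the 10% threshold is an arbitrary illustration, not a Tika default.]

```python
# Sketch: flag pages whose share of unmapped unicode chars exceeds a
# threshold, using the per-page counts Tika puts in the metadata.
# The 0.1 cutoff is illustrative only, not a Tika default.
def pages_needing_ocr(metadata, threshold=0.1):
    chars = [int(c) for c in metadata.get("pdf:charsPerPage", [])]
    unmapped = [int(u) for u in
                metadata.get("pdf:unmappedUnicodeCharsPerPage", [])]
    flagged = []
    for page, (total, bad) in enumerate(zip(chars, unmapped), start=1):
        # A page with no extractable text at all is also an OCR candidate
        if total == 0 or bad / total > threshold:
            flagged.append(page)
    return flagged

# Using the counts from the example file in this thread:
meta = {"pdf:charsPerPage": ["1224", "662"],
        "pdf:unmappedUnicodeCharsPerPage": ["620", "249"]}
print(pages_needing_ocr(meta))  # -> [1, 2]: both pages exceed 10% unmapped
```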