Re: Corrupted Arabic text in a PDF

Tim Allison Thu, 26 Jan 2023 12:53:45 -0800

Ha. Cool.  I was going to recommend that.

This file does trigger OCR on my local dev environment.  If you use
the /rmeta endpoint on tika-server, you'll see something like:
X-TIKA:Parsed-By-Full-Set : org.apache.tika.parser.DefaultParser
X-TIKA:Parsed-By-Full-Set : org.apache.tika.parser.pdf.PDFParser
X-TIKA:Parsed-By-Full-Set : org.apache.tika.parser.ocr.TesseractOCRParser
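
If you'd rather check this from a script, here is a minimal Python sketch. It assumes tika-server is listening on http://localhost:9998 (adjust for your setup) and that /rmeta returns a JSON list of metadata objects. The function names are made up for illustration. If OCR ran, the recognized text should be part of the extracted content, which /rmeta reports under the X-TIKA:content key.

```python
import json
import urllib.request

OCR_PARSER = "org.apache.tika.parser.ocr.TesseractOCRParser"
PARSED_BY_KEY = "X-TIKA:Parsed-By-Full-Set"


def ocr_was_applied(rmeta_records):
    """Return True if any metadata record lists the Tesseract OCR parser."""
    for record in rmeta_records:
        parsers = record.get(PARSED_BY_KEY, [])
        # A single-valued key may come back as a plain string.
        if isinstance(parsers, str):
            parsers = [parsers]
        if OCR_PARSER in parsers:
            return True
    return False


def fetch_rmeta(pdf_path, server="http://localhost:9998"):
    """PUT a file to /rmeta and return the parsed JSON list.

    The server URL is an assumption; change it to match your deployment.
    """
    with open(pdf_path, "rb") as f:
        data = f.read()
    request = urllib.request.Request(
        server + "/rmeta",
        data=data,
        method="PUT",
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a server running, ocr_was_applied(fetch_rmeta("your-file.pdf")) tells you whether tesseract was invoked for that file.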


There are two pieces of bad news: 1) Arabic is not loaded by default in
the tika-full docker container. 2) We don't have a good way of doing
language detection to tell tesseract which language to apply by
default.

‪On Thu, Jan 26, 2023 at 3:28 PM ‫שי ברק‬‎ <[email protected]> wrote:‬
>
> I use the full docker image of Tika 2.6.
> How can I check whether I have it, and where am I supposed to see the
> outcome of the OCR?
>
> On Thu, 26 Jan 2023 at 22:25 Tim Allison <[email protected]> wrote:
>>
>> If tesseract is installed on your system and callable as 'tesseract'
>> and if you don't make any modifications via tika-config.xml, tesseract
>> will be applied automatically to images and to pages of PDFs that
>> a) have only a few characters (<10?) or b) have more than a handful of
>> unmapped unicode characters.
>>
>> ‪On Thu, Jan 26, 2023 at 3:17 PM ‫שי ברק‬‎ <[email protected]> wrote:‬
>> >
>> > Does Tika support OCR on PDFs? Is there an endpoint or header for this?
>> >
>> > On Thu, 26 Jan 2023 at 21:54 Tim Allison <[email protected]> wrote:
>> >>
>> >> Sorry, one more thing.
>> >>
>> >> If you use tika-eval's metadata filter, it will likely show that the
>> >> out-of-vocabulary statistic (an indicator of "garbage") is quite
>> >> high for this file.
>> >>
>> >> On Thu, Jan 26, 2023 at 2:51 PM Tim Allison <[email protected]> wrote:
>> >> >
>> >> > A user dm'd me with an example file that contained English and Arabic.
>> >> > The Arabic that was extracted was gibberish/mojibake.  I wanted to
>> >> > archive my response on our user list.
>> >> >
>> >> > * Extracting text from PDFs is a challenge.
>> >> > * For troubleshooting, see:
>> >> > https://cwiki.apache.org/confluence/display/TIKA/Troubleshooting+Tika#TroubleshootingTika-PDFTextProblems
>> >> > * Text extracted by other tools is also gibberish: Foxit, pdftotext
>> >> > and Mac's Preview
>> >> > * PDFBox logs warnings about missing unicode mappings
>> >> > * Tika reports that there are a bunch of unicode mappings missing per
>> >> > page.  The point of this is that integrators might choose to run OCR
>> >> > on pages with high counts of missing unicode mappings. From the
>> >> > metadata: "pdf:charsPerPage":["1224","662"]
>> >> > "pdf:unmappedUnicodeCharsPerPage":["620","249"]
>> >> >
>> >> > Finally, if you want a medium dive on some of the things that can go
>> >> > wrong with text extraction in PDFs:
>> >> > https://irsg.bcs.org/informer/wp-content/uploads/OverviewOfTextExtractionFromPDFs.pdf
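
For integrators who want to act on the per-page counts quoted above, here is a rough Python sketch of one way to pick pages for OCR. The function name and both thresholds (min_chars, max_unmapped_ratio) are assumptions chosen for illustration, and using a ratio is my own simplification. This is not Tika's internal rule.

```python
def pages_needing_ocr(chars_per_page, unmapped_per_page,
                      min_chars=10, max_unmapped_ratio=0.1):
    """Return zero-based indexes of pages that look like OCR candidates.

    Inputs are the string lists found under pdf:charsPerPage and
    pdf:unmappedUnicodeCharsPerPage. Both thresholds are hypothetical.
    """
    flagged = []
    pairs = zip(chars_per_page, unmapped_per_page)
    for index, (chars, unmapped) in enumerate(pairs):
        chars, unmapped = int(chars), int(unmapped)
        # Very little extractable text suggests a scanned page.
        too_little_text = chars < min_chars
        # A large share of unmapped characters suggests garbage text.
        too_many_unmapped = chars > 0 and unmapped / chars > max_unmapped_ratio
        if too_little_text or too_many_unmapped:
            flagged.append(index)
    return flagged
```

For the file in this thread, pages_needing_ocr(["1224", "662"], ["620", "249"]) flags both pages, since roughly half of the characters on the first page and over a third on the second have no unicode mapping.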
