Re: Corrupted Arabic text in a PDF

שי ברק Thu, 26 Jan 2023 12:28:30 -0800

I use the full Docker image of Tika 2.6.
How can I check whether Tesseract is installed, and where am I supposed to
see the output of the OCR?


On Thu, 26 Jan 2023 at 22:25 Tim Allison <talli...@apache.org> wrote:

> If tesseract is installed on your system and callable as 'tesseract'
> and if you don't make any modifications via tika-config.xml, tesseract
> will be applied to images automatically and to pages of PDFs that
> a) have only a few characters (<10?) or b) have more than a handful of
> unmapped unicode characters.
>
> ‪On Thu, Jan 26, 2023 at 3:17 PM ‫שי ברק‬‎ <shai...@gmail.com> wrote:‬
> >
> > Does Tika support OCR on pdf, is there an endpoint or header for this?
> >
> > On Thu, 26 Jan 2023 at 21:54 Tim Allison <talli...@apache.org> wrote:
> >>
> >> Sorry, one more thing.
> >>
> >> If you use tika-eval's metadata filter, that will tell you that the
> >> out of vocabulary statistic (an indicator of "garbage") would likely
> >> be quite high for this file.
> >>
> >> On Thu, Jan 26, 2023 at 2:51 PM Tim Allison <talli...@apache.org> wrote:
> >> >
> >> > A user dm'd me with an example file that contained English and Arabic.
> >> > The Arabic that was extracted was gibberish/mojibake.  I wanted to
> >> > archive my response on our user list.
> >> >
> >> > * Extracting text from PDFs is a challenge.
> >> > * For troubleshooting, see:
> >> >   https://cwiki.apache.org/confluence/display/TIKA/Troubleshooting+Tika#TroubleshootingTika-PDFTextProblems
> >> > * Text extracted by other tools is also gibberish: Foxit, pdftotext
> >> > and Mac's Preview
> >> > * PDFBox logs warnings about missing unicode mappings
> >> > * Tika reports that there are a bunch of unicode mappings missing per
> >> > page.  The point of this is that integrators might choose to run OCR
> >> > on pages with high counts of missing unicode mappings. From the
> >> > metadata: "pdf:charsPerPage":["1224","662"]
> >> > "pdf:unmappedUnicodeCharsPerPage":["620","249"]
> >> >
> >> > Finally, if you want a medium dive on some of the things that can go
> >> > wrong with text extraction in PDFs:
> >> >   https://irsg.bcs.org/informer/wp-content/uploads/OverviewOfTextExtractionFromPDFs.pdf
>
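
The per-page check mentioned in the quoted messages (few characters on a page, or many unmapped unicode characters) could be sketched roughly as follows for integrators who want to pick pages for OCR themselves. This is only an illustration and not Tika's own logic: the function name pages_to_ocr and both thresholds (min_chars, max_unmapped_ratio) are assumptions made up for this example. Only the two metadata keys and the sample values come from the thread.

```python
from typing import Dict, List, Union

MetadataValue = Union[str, List[str]]


def _as_ints(value: MetadataValue) -> List[int]:
    """Accept either a single string or a list of strings and return ints."""
    if isinstance(value, str):
        value = [value]
    return [int(v) for v in value]


def pages_to_ocr(
    metadata: Dict[str, MetadataValue],
    min_chars: int = 10,              # assumed threshold, not taken from Tika
    max_unmapped_ratio: float = 0.1,  # assumed threshold, not taken from Tika
) -> List[int]:
    """Return 1-based page numbers that look like candidates for OCR."""
    chars = _as_ints(metadata.get("pdf:charsPerPage", []))
    unmapped = _as_ints(metadata.get("pdf:unmappedUnicodeCharsPerPage", []))
    flagged = []
    for page, (total, bad) in enumerate(zip(chars, unmapped), start=1):
        too_little_text = total < min_chars
        too_many_unmapped = total > 0 and bad / total > max_unmapped_ratio
        if too_little_text or too_many_unmapped:
            flagged.append(page)
    return flagged


# Values reported for the file discussed in this thread.
metadata = {
    "pdf:charsPerPage": ["1224", "662"],
    "pdf:unmappedUnicodeCharsPerPage": ["620", "249"],
}
print(pages_to_ocr(metadata))  # [1, 2]
```

With the values from this file, roughly half of the characters on page 1 and over a third on page 2 are unmapped, so both pages are flagged under the assumed 10% cutoff. The cutoff would need tuning on real documents.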
