Does Tika support OCR on PDFs? Is there an endpoint or header for this?

On Thu, 26 Jan 2023 at 21:54 Tim Allison <[email protected]> wrote:
> Sorry, one more thing.
>
> If you use tika-eval's metadata filter, that will tell you that the
> out of vocabulary statistic (an indicator of "garbage") would likely
> be quite high for this file.
>
> On Thu, Jan 26, 2023 at 2:51 PM Tim Allison <[email protected]> wrote:
> >
> > A user dm'd me with an example file that contained English and Arabic.
> > The Arabic that was extracted was gibberish/mojibake. I wanted to
> > archive my response on our user list.
> >
> > * Extracting text from PDFs is a challenge.
> > * For troubleshooting, see:
> >   https://cwiki.apache.org/confluence/display/TIKA/Troubleshooting+Tika#TroubleshootingTika-PDFTextProblems
> > * Text extracted by other tools is also gibberish: Foxit, pdftotext
> >   and Mac's Preview.
> > * PDFBox logs warnings about missing unicode mappings.
> > * Tika reports that there are a bunch of unicode mappings missing per
> >   page. The point of this is that integrators might choose to run OCR
> >   on pages with high counts of missing unicode mappings. From the
> >   metadata:
> >   "pdf:charsPerPage":["1224","662"]
> >   "pdf:unmappedUnicodeCharsPerPage":["620","249"]
> >
> > Finally, if you want a medium dive on some of the things that can go
> > wrong with text extraction in PDFs:
> >   https://irsg.bcs.org/informer/wp-content/uploads/OverviewOfTextExtractionFromPDFs.pdf
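
To make the last point of the quoted message concrete, here is a minimal sketch of the kind of OCR fallback an integrator might build on those per-page counts, using the Tika Java API. The 10% threshold, the class name, and the choice of OCR_ONLY are illustrative assumptions, not anything stated in the thread; the second pass assumes a Tika version whose PDFParserConfig exposes an OCR strategy and a working Tesseract install, since Tika delegates OCR to Tesseract.

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;

public class OcrFallbackSketch {

    // Hypothetical threshold: re-run a file through OCR if any page came
    // back with more than 10% of its characters lacking a unicode mapping.
    private static final double UNMAPPED_RATIO_THRESHOLD = 0.10;

    public static void main(String[] args) throws Exception {
        Path pdf = Paths.get(args[0]);

        // First pass: plain text extraction, no OCR.
        AutoDetectParser parser = new AutoDetectParser();
        Metadata metadata = new Metadata();
        try (InputStream is = Files.newInputStream(pdf)) {
            parser.parse(is, new BodyContentHandler(-1), metadata, new ParseContext());
        }

        // Per-page counts reported by the PDF parser, as in the metadata above.
        String[] chars = metadata.getValues("pdf:charsPerPage");
        String[] unmapped = metadata.getValues("pdf:unmappedUnicodeCharsPerPage");

        boolean needsOcr = false;
        for (int i = 0; i < chars.length && i < unmapped.length; i++) {
            double total = Double.parseDouble(chars[i]);
            double missing = Double.parseDouble(unmapped[i]);
            double ratio = total > 0 ? missing / total : 0.0;
            System.out.printf("page %d: %d chars, %d unmapped (%.0f%%)%n",
                    i + 1, (int) total, (int) missing, ratio * 100);
            if (ratio > UNMAPPED_RATIO_THRESHOLD) {
                needsOcr = true;
            }
        }

        if (needsOcr) {
            // Second pass: ask the PDF parser to OCR the pages instead of
            // extracting the (unreliable) embedded text. Requires Tesseract.
            PDFParserConfig pdfConfig = new PDFParserConfig();
            pdfConfig.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.OCR_ONLY);
            ParseContext context = new ParseContext();
            context.set(PDFParserConfig.class, pdfConfig);

            BodyContentHandler handler = new BodyContentHandler(-1);
            try (InputStream is = Files.newInputStream(pdf)) {
                parser.parse(is, handler, new Metadata(), context);
            }
            System.out.println(handler.toString());
        }
    }
}

On the "endpoint or header" part of the question: when going through tika-server rather than the Java API, per-request PDF parser settings can be passed as X-Tika-PDF* headers (e.g. X-Tika-PDFOcrStrategy); the TikaServer page on the wiki is the place to confirm the exact header names and values for your version.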
