If tesseract is installed on your system and callable as 'tesseract', and if you don't make any modifications via tika-config.xml, tesseract will be applied automatically to images and to PDF pages that have a) only a few characters (<10?) or b) more than a handful of unmapped unicode characters.
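If you'd rather control this from code than rely on the defaults, here is a rough, untested sketch of forcing OCR on every PDF page via PDFParserConfig. It assumes Tika 2.x with the PDF and OCR parser modules on the classpath, a working tesseract install, and a hypothetical file name "example.pdf"; the same ocrStrategy switch can be set declaratively in tika-config.xml instead.

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;

public class ForceOcrOnPdf {
    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no write limit
        Metadata metadata = new Metadata();

        // Run tesseract on every PDF page in addition to regular text extraction.
        // OCR_STRATEGY.AUTO is the strategy that, as I understand it, matches the
        // character-count behavior described above.
        PDFParserConfig pdfConfig = new PDFParserConfig();
        pdfConfig.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.OCR_AND_TEXT_EXTRACTION);

        ParseContext context = new ParseContext();
        context.set(PDFParserConfig.class, pdfConfig);

        try (InputStream is = Files.newInputStream(Paths.get("example.pdf"))) {
            parser.parse(is, handler, metadata, context);
        }
        System.out.println(handler.toString());
    }
}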
On Thu, Jan 26, 2023 at 3:17 PM שי ברק <shai...@gmail.com> wrote:
>
> Does Tika support OCR on pdf, is there an endpoint or header for this?
>
> On Thu, 26 Jan 2023 at 21:54 Tim Allison <talli...@apache.org> wrote:
>>
>> Sorry, one more thing.
>>
>> If you use tika-eval's metadata filter, that will tell you that the
>> out of vocabulary statistic (an indicator of "garbage") would likely
>> be quite high for this file.
>>
>> On Thu, Jan 26, 2023 at 2:51 PM Tim Allison <talli...@apache.org> wrote:
>> >
>> > A user dm'd me with an example file that contained English and Arabic.
>> > The Arabic that was extracted was gibberish/mojibake. I wanted to
>> > archive my response on our user list.
>> >
>> > * Extracting text from PDFs is a challenge.
>> > * For troubleshooting, see:
>> > https://cwiki.apache.org/confluence/display/TIKA/Troubleshooting+Tika#TroubleshootingTika-PDFTextProblems
>> > * Text extracted by other tools is also gibberish: Foxit, pdftotext
>> > and Mac's Preview
>> > * PDFBox logs warnings about missing unicode mappings
>> > * Tika reports that there are a bunch of unicode mappings missing per
>> > page. The point of this is that integrators might choose to run OCR
>> > on pages with high counts of missing unicode mappings. From the
>> > metadata: "pdf:charsPerPage":["1224","662"]
>> > "pdf:unmappedUnicodeCharsPerPage":["620","249"]
>> >
>> > Finally, if you want a medium dive on some of the things that can go
>> > wrong with text extraction in PDFs:
>> > https://irsg.bcs.org/informer/wp-content/uploads/OverviewOfTextExtractionFromPDFs.pdf
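For integrators who want to do the per-page check themselves, here is a minimal sketch of the kind of heuristic described in the quoted message, using the pdf:charsPerPage and pdf:unmappedUnicodeCharsPerPage values from a first, non-OCR parse. The thresholds are illustrative guesses, not Tika's internal defaults.

import org.apache.tika.metadata.Metadata;

public class NeedsOcrCheck {
    // Returns true if any page looks like it needs OCR: almost no extracted
    // characters, or a large share of characters with no unicode mapping.
    static boolean needsOcr(Metadata metadata) {
        String[] chars = metadata.getValues("pdf:charsPerPage");
        String[] unmapped = metadata.getValues("pdf:unmappedUnicodeCharsPerPage");
        for (int i = 0; i < chars.length && i < unmapped.length; i++) {
            int c = Integer.parseInt(chars[i]);
            int u = Integer.parseInt(unmapped[i]);
            // Thresholds are guesses for illustration, not Tika's internal defaults.
            if (c < 10 || (c > 0 && (double) u / c > 0.10)) {
                return true;
            }
        }
        return false;
    }
}

If the check fires, you could re-parse the file with ocrStrategy set to OCR_ONLY or OCR_AND_TEXT_EXTRACTION as in the earlier sketch.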