If tesseract is installed on your system and callable as 'tesseract', and if you don't make any modifications via tika-config.xml, tesseract will be applied automatically to images and to PDF pages that have a) only a few characters (<10?) or b) more than a handful of unmapped unicode characters.
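If you'd rather control this from code than rely on the defaults, here is a rough, untested sketch of forcing OCR on every PDF page via PDFParserConfig. It assumes Tika 2.x with the PDF and OCR parser modules on the classpath, a working tesseract install, and a hypothetical file name "example.pdf"; the same ocrStrategy switch can be set declaratively in tika-config.xml instead.

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;

public class ForceOcrOnPdf {
    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no write limit
        Metadata metadata = new Metadata();

        // Run tesseract on every PDF page in addition to regular text extraction.
        // OCR_STRATEGY.AUTO is the strategy that, as I understand it, matches the
        // character-count behavior described above.
        PDFParserConfig pdfConfig = new PDFParserConfig();
        pdfConfig.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.OCR_AND_TEXT_EXTRACTION);

        ParseContext context = new ParseContext();
        context.set(PDFParserConfig.class, pdfConfig);

        try (InputStream is = Files.newInputStream(Paths.get("example.pdf"))) {
            parser.parse(is, handler, metadata, context);
        }
        System.out.println(handler.toString());
    }
}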
On Thu, Jan 26, 2023 at 3:17 PM שי ברק <shai...@gmail.com> wrote:
>
> Does Tika support OCR on pdf, is there an endpoint or header for this?
>
> On Thu, 26 Jan 2023 at 21:54 Tim Allison <talli...@apache.org> wrote:
>>
>> Sorry, one more thing.
>>
>> If you use tika-eval's metadata filter, that will tell you that the
>> out of vocabulary statistic (an indicator of "garbage") would likely
>> be quite high for this file.
>>
>> On Thu, Jan 26, 2023 at 2:51 PM Tim Allison <talli...@apache.org> wrote:
>> >
>> > A user dm'd me with an example file that contained English and Arabic.
>> > The Arabic that was extracted was gibberish/mojibake. I wanted to
>> > archive my response on our user list.
>> >
>> > * Extracting text from PDFs is a challenge.
>> > * For troubleshooting, see:
>> > https://cwiki.apache.org/confluence/display/TIKA/Troubleshooting+Tika#TroubleshootingTika-PDFTextProblems
>> > * Text extracted by other tools is also gibberish: Foxit, pdftotext
>> > and Mac's Preview
>> > * PDFBox logs warnings about missing unicode mappings
>> > * Tika reports that there are a bunch of unicode mappings missing per
>> > page. The point of this is that integrators might choose to run OCR
>> > on pages with high counts of missing unicode mappings. From the
>> > metadata: "pdf:charsPerPage":["1224","662"]
>> > "pdf:unmappedUnicodeCharsPerPage":["620","249"]
>> >
>> > Finally, if you want a medium dive on some of the things that can go
>> > wrong with text extraction in PDFs:
>> > https://irsg.bcs.org/informer/wp-content/uploads/OverviewOfTextExtractionFromPDFs.pdf
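For integrators who want to do the per-page check themselves, here is a minimal sketch of the kind of heuristic described in the quoted message, using the pdf:charsPerPage and pdf:unmappedUnicodeCharsPerPage values from a first, non-OCR parse. The thresholds are illustrative guesses, not Tika's internal defaults.

import org.apache.tika.metadata.Metadata;

public class NeedsOcrCheck {
    // Returns true if any page looks like it needs OCR: almost no extracted
    // characters, or a large share of characters with no unicode mapping.
    static boolean needsOcr(Metadata metadata) {
        String[] chars = metadata.getValues("pdf:charsPerPage");
        String[] unmapped = metadata.getValues("pdf:unmappedUnicodeCharsPerPage");
        for (int i = 0; i < chars.length && i < unmapped.length; i++) {
            int c = Integer.parseInt(chars[i]);
            int u = Integer.parseInt(unmapped[i]);
            // Thresholds are guesses for illustration, not Tika's internal defaults.
            if (c < 10 || (c > 0 && (double) u / c > 0.10)) {
                return true;
            }
        }
        return false;
    }
}

If the check fires, you could re-parse the file with ocrStrategy set to OCR_ONLY or OCR_AND_TEXT_EXTRACTION as in the earlier sketch.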