Does Tika support OCR on PDFs? Is there an endpoint or header for this?

On Thu, 26 Jan 2023 at 21:54 Tim Allison <[email protected]> wrote:
> Sorry, one more thing.
>
> If you use tika-eval's metadata filter, that will tell you that the
> out of vocabulary statistic (an indicator of "garbage") would likely
> be quite high for this file.
>
> On Thu, Jan 26, 2023 at 2:51 PM Tim Allison <[email protected]> wrote:
> >
> > A user dm'd me with an example file that contained English and Arabic.
> > The Arabic that was extracted was gibberish/mojibake. I wanted to
> > archive my response on our user list.
> >
> > * Extracting text from PDFs is a challenge.
> > * For troubleshooting, see:
> >   https://cwiki.apache.org/confluence/display/TIKA/Troubleshooting+Tika#TroubleshootingTika-PDFTextProblems
> > * Text extracted by other tools is also gibberish: Foxit, pdftotext
> >   and Mac's Preview.
> > * PDFBox logs warnings about missing unicode mappings.
> > * Tika reports that there are a bunch of unicode mappings missing per
> >   page. The point of this is that integrators might choose to run OCR
> >   on pages with high counts of missing unicode mappings. From the
> >   metadata:
> >   "pdf:charsPerPage":["1224","662"]
> >   "pdf:unmappedUnicodeCharsPerPage":["620","249"]
> >
> > Finally, if you want a medium dive on some of the things that can go
> > wrong with text extraction in PDFs:
> >   https://irsg.bcs.org/informer/wp-content/uploads/OverviewOfTextExtractionFromPDFs.pdf
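
To make the last point of the quoted message concrete, here is a minimal sketch of the kind of OCR fallback an integrator might build on those per-page counts, using the Tika Java API. The 10% threshold, the class name, and the choice of OCR_ONLY are illustrative assumptions, not anything stated in the thread; the second pass assumes a Tika version whose PDFParserConfig exposes an OCR strategy and a working Tesseract install, since Tika delegates OCR to Tesseract.

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;

public class OcrFallbackSketch {

    // Hypothetical threshold: re-run a file through OCR if any page came
    // back with more than 10% of its characters lacking a unicode mapping.
    private static final double UNMAPPED_RATIO_THRESHOLD = 0.10;

    public static void main(String[] args) throws Exception {
        Path pdf = Paths.get(args[0]);

        // First pass: plain text extraction, no OCR.
        AutoDetectParser parser = new AutoDetectParser();
        Metadata metadata = new Metadata();
        try (InputStream is = Files.newInputStream(pdf)) {
            parser.parse(is, new BodyContentHandler(-1), metadata, new ParseContext());
        }

        // Per-page counts reported by the PDF parser, as in the metadata above.
        String[] chars = metadata.getValues("pdf:charsPerPage");
        String[] unmapped = metadata.getValues("pdf:unmappedUnicodeCharsPerPage");

        boolean needsOcr = false;
        for (int i = 0; i < chars.length && i < unmapped.length; i++) {
            double total = Double.parseDouble(chars[i]);
            double missing = Double.parseDouble(unmapped[i]);
            double ratio = total > 0 ? missing / total : 0.0;
            System.out.printf("page %d: %d chars, %d unmapped (%.0f%%)%n",
                    i + 1, (int) total, (int) missing, ratio * 100);
            if (ratio > UNMAPPED_RATIO_THRESHOLD) {
                needsOcr = true;
            }
        }

        if (needsOcr) {
            // Second pass: ask the PDF parser to OCR the pages instead of
            // extracting the (unreliable) embedded text. Requires Tesseract.
            PDFParserConfig pdfConfig = new PDFParserConfig();
            pdfConfig.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.OCR_ONLY);
            ParseContext context = new ParseContext();
            context.set(PDFParserConfig.class, pdfConfig);

            BodyContentHandler handler = new BodyContentHandler(-1);
            try (InputStream is = Files.newInputStream(pdf)) {
                parser.parse(is, handler, new Metadata(), context);
            }
            System.out.println(handler.toString());
        }
    }
}

On the "endpoint or header" part of the question: when going through tika-server rather than the Java API, per-request PDF parser settings can be passed as X-Tika-PDF* headers (e.g. X-Tika-PDFOcrStrategy); the TikaServer page on the wiki is the place to confirm the exact header names and values for your version.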
