A user DM'd me an example file containing English and Arabic text. The Arabic that was extracted was gibberish/mojibake. I wanted to archive my response on our user list.
* Extracting text from PDFs is a challenge.

* For troubleshooting, see: https://cwiki.apache.org/confluence/display/TIKA/Troubleshooting+Tika#TroubleshootingTika-PDFTextProblems

* Text extracted by other tools (Foxit, pdftotext, and Mac's Preview) is gibberish as well.

* PDFBox logs warnings about missing Unicode mappings.

* Tika reports the count of missing Unicode mappings per page. The point of this is that integrators might choose to run OCR on pages with high counts of missing Unicode mappings (see the sketch after this list). From the metadata of the example file:

  "pdf:charsPerPage":["1224","662"]
  "pdf:unmappedUnicodeCharsPerPage":["620","249"]

* Finally, if you want a medium-depth dive into some of the things that can go wrong with text extraction from PDFs: https://irsg.bcs.org/informer/wp-content/uploads/OverviewOfTextExtractionFromPDFs.pdf
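As a rough sketch of how an integrator might act on those per-page counts via the Tika Java API: the two metadata keys are the ones shown above, both multi-valued with one entry per page. The 10% threshold is an arbitrary illustration for this example, not a recommendation; tune it for your own corpus.

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class UnmappedUnicodeCheck {
    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        Metadata metadata = new Metadata();
        // -1 disables the default write limit on extracted text
        try (InputStream is = Files.newInputStream(Paths.get(args[0]))) {
            parser.parse(is, new BodyContentHandler(-1), metadata);
        }
        // Multi-valued metadata: one entry per page, in page order
        String[] chars = metadata.getValues("pdf:charsPerPage");
        String[] unmapped = metadata.getValues("pdf:unmappedUnicodeCharsPerPage");
        for (int i = 0; i < chars.length; i++) {
            int total = Integer.parseInt(chars[i]);
            int missing = Integer.parseInt(unmapped[i]);
            double ratio = total == 0 ? 0.0 : (double) missing / total;
            // 0.1 is an arbitrary illustrative threshold
            if (ratio > 0.1) {
                System.out.printf("page %d: %d/%d chars unmapped (%.0f%%) -- consider OCR%n",
                        i + 1, missing, total, ratio * 100);
            }
        }
    }
}

On the example file, both pages would flag (620/1224 and 249/662 unmapped), so an integrator could route the whole document to OCR rather than trusting the extracted text.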