A user DM'd me an example file that contained English and Arabic; the
Arabic text that was extracted was gibberish/mojibake. I wanted to
archive my response on our user list.

* Extracting text from PDFs is a challenge.
* For troubleshooting, see:
https://cwiki.apache.org/confluence/display/TIKA/Troubleshooting+Tika#TroubleshootingTika-PDFTextProblems
* Text extracted by other tools (Foxit, pdftotext and Mac's Preview)
is also gibberish.
* PDFBox logs warnings about missing Unicode mappings.
* Tika reports how many Unicode mappings are missing per page. The
point of this is that integrators might choose to run OCR on pages
with high counts of missing Unicode mappings. From the metadata:
  "pdf:charsPerPage":["1224","662"]
  "pdf:unmappedUnicodeCharsPerPage":["620","249"]
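As a rough illustration of that last point, here is a sketch (not part
of Tika's API) of how an integrator might use those two per-page
metadata arrays to flag pages for OCR. The function name and the 10%
threshold are made up for the example; the metadata shape matches the
values above.

```python
# Hypothetical helper: flag pages whose ratio of unmapped Unicode
# chars to total chars exceeds a threshold, using the per-page
# arrays Tika puts in the metadata.

def pages_needing_ocr(metadata, threshold=0.1):
    chars = [int(c) for c in metadata["pdf:charsPerPage"]]
    unmapped = [int(u) for u in metadata["pdf:unmappedUnicodeCharsPerPage"]]
    flagged = []
    for page, (total, missing) in enumerate(zip(chars, unmapped), start=1):
        # Guard against empty pages; a page with zero extracted chars
        # may also be an OCR candidate, but that's a separate decision.
        ratio = missing / total if total else 0.0
        if ratio > threshold:
            flagged.append(page)
    return flagged

meta = {
    "pdf:charsPerPage": ["1224", "662"],
    "pdf:unmappedUnicodeCharsPerPage": ["620", "249"],
}
print(pages_needing_ocr(meta))  # both pages are well over 10% unmapped
```

For the example file, roughly half the characters on page 1 and over a
third on page 2 had no Unicode mapping, so any reasonable threshold
would send both pages to OCR.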

Finally, if you want a medium dive on some of the things that can go
wrong with text extraction in PDFs:
https://irsg.bcs.org/informer/wp-content/uploads/OverviewOfTextExtractionFromPDFs.pdf
