On Tue, 26 Jul 2016, Oliver Steinau wrote:
I'm having problems extracting text from a small (43 KB) PDF file using
tika-1.13 -- I get a bunch of warnings like
WARN No Unicode mapping for C0104 (38) in font FDLICI+PSOwstswiss
WARN No Unicode mapping for C0097 (31) in font FDLICI+PSOwstswiss
Can you try with the ExtractText tool from Apache PDFBox?
http://pdfbox.apache.org/2.0/commandline.html#extracttext
If that works fine, then it's a Tika bug and we'll need to look into it.
If that fails with the same problem, then you'd need to report a bug to
PDFBox and attach the problematic PDF file to the JIRA issue. (Tika would
then get the fix in the next release.)
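
If it does come to filing a report, a summary of which fonts trigger the
warning may be handy to include. Below is a minimal Java sketch, not part of
Tika or PDFBox, that counts such warnings per font. The class name
UnicodeWarningSummary, the method countByFont, and the regular expression are
all assumptions based only on the two warning lines quoted above, so other
log layouts may need a different pattern.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: not part of Tika or PDFBox.
public class UnicodeWarningSummary {

    // Assumed layout, taken from the quoted warnings:
    // "No Unicode mapping for <glyph> (<code>) in font <fontName>"
    private static final Pattern WARNING = Pattern.compile(
            "No Unicode mapping for (\\S+) \\((\\d+)\\) in font (\\S+)");

    /** Counts matching warning lines per font name, in first-seen order. */
    public static Map<String, Integer> countByFont(List<String> logLines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : logLines) {
            Matcher m = WARNING.matcher(line);
            if (m.find()) {
                // group(3) is the font name; lines that do not match are skipped
                counts.merge(m.group(3), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "WARN No Unicode mapping for C0104 (38) in font FDLICI+PSOwstswiss",
                "WARN No Unicode mapping for C0097 (31) in font FDLICI+PSOwstswiss");
        // Print one "font: count" line per affected font
        countByFont(sample).forEach(
                (font, count) -> System.out.println(font + ": " + count));
    }
}
```

Running it over the captured log from both Tika and ExtractText would also
make it easy to see whether the two tools complain about the same fonts.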
Nick