PDFBox gives (roughly) the same warnings, and also returns garbage
(albeit different garbage).
I'll try to report a bug to PDFBox.
Oliver
On 26.07.2016 17:02, Nick Burch wrote:
On Tue, 26 Jul 2016, Oliver Steinau wrote:
I'm having problems extracting text from a small (43 KB) PDF file
using tika-1.13 -- I get a bunch of warnings like
WARN No Unicode mapping for C0104 (38) in font FDLICI+PSOwstswiss
WARN No Unicode mapping for C0097 (31) in font FDLICI+PSOwstswiss
Can you try with the ExtractText tool from Apache PDFBox?
http://pdfbox.apache.org/2.0/commandline.html#extracttext
If that works fine, then it's a Tika bug and we'll need to look into
it. If that fails with the same problem, then you'd need to report a
bug to PDFBox and attach a problematic PDF file to the JIRA issue. (Tika
would then get the fix in its next release.)
Nick
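
For anyone repeating the check suggested above, here is a minimal sketch of running PDFBox's ExtractText tool from a script and capturing its output so the warnings can be pasted into a bug report. It assumes the PDFBox 2.0 command-line form `java -jar <pdfbox-app jar> ExtractText <inputfile> [output-text-file]` from the page linked above. The jar name, the file names, and both helper functions are placeholders made up for illustration, not part of PDFBox or Tika.

```python
import subprocess


def build_extract_text_command(app_jar, pdf_file, text_file):
    """Build the argument list for PDFBox's ExtractText command-line tool.

    app_jar is the path to a pdfbox-app jar (placeholder name, adjust to
    whatever version you downloaded).
    """
    return ["java", "-jar", str(app_jar), "ExtractText",
            str(pdf_file), str(text_file)]


def run_extract_text(app_jar, pdf_file, text_file):
    """Run ExtractText and return everything it printed.

    stdout and stderr are concatenated because it is not assumed here
    which stream the "No Unicode mapping" warnings are written to.
    """
    result = subprocess.run(
        build_extract_text_command(app_jar, pdf_file, text_file),
        capture_output=True,
        text=True,
        check=False,  # a failing run is still worth inspecting
    )
    return result.stdout + result.stderr


if __name__ == "__main__":
    # Hypothetical file names; replace with your own jar and PDF.
    log = run_extract_text("pdfbox-app-2.0.x.jar", "problem.pdf", "problem.txt")
    print(log)
```

If the text written to the output file is garbage here as well, the problem is in PDFBox rather than Tika, and the captured log can go into the PDFBox report along with the PDF.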