PDFBox gives roughly the same warnings and also returns garbage, though different garbage from what Tika produces.

I'll try to report a bug.

Oliver


On 26.07.2016 17:02, Nick Burch wrote:
On Tue, 26 Jul 2016, Oliver Steinau wrote:
I'm having problems extracting text from a small (43 KB) PDF file using tika-1.13 -- I get a bunch of warnings like

WARN  No Unicode mapping for C0104 (38) in font FDLICI+PSOwstswiss
WARN  No Unicode mapping for C0097 (31) in font FDLICI+PSOwstswiss

Can you try with the ExtractText tool from Apache PDFBox? http://pdfbox.apache.org/2.0/commandline.html#extracttext

If that works fine, then it's a Tika bug and we'll need to look into it. If it fails with the same problem, then you'd need to report a bug to PDFBox and attach a problematic PDF file to the JIRA issue. (Tika would then get the fix in its next release.)
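
For the bug report, it may help to summarise which codes are unmapped in which font rather than pasting the full log. Below is a minimal Python sketch that assumes the warnings keep the format quoted above, with what looks like a glyph name followed by a character code in parentheses. The function name `summarize_unmapped` and the regular expression are illustrative and are not part of Tika or PDFBox.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   WARN  No Unicode mapping for C0104 (38) in font FDLICI+PSOwstswiss
# The pattern is an assumption based on the two warnings quoted in this thread.
WARNING_RE = re.compile(r"No Unicode mapping for (\S+) \((\d+)\) in font (\S+)")


def summarize_unmapped(log_text):
    """Group (name, code) pairs from the warnings by font name."""
    by_font = defaultdict(set)
    for line in log_text.splitlines():
        match = WARNING_RE.search(line)
        if match:
            name, code, font = match.groups()
            by_font[font].add((name, int(code)))
    # Sort the pairs so the summary is stable between runs.
    return {font: sorted(pairs) for font, pairs in by_font.items()}


if __name__ == "__main__":
    sample = (
        "WARN  No Unicode mapping for C0104 (38) in font FDLICI+PSOwstswiss\n"
        "WARN  No Unicode mapping for C0097 (31) in font FDLICI+PSOwstswiss\n"
    )
    for font, pairs in summarize_unmapped(sample).items():
        print(font, pairs)
```

Duplicate warnings collapse into one entry per font, so the summary stays short even when the same code is reported many times.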

Nick

