Problem with text extraction from existing PDF

Bob Swanson Wed, 13 Jun 2018 14:59:04 -0700

Hello All,

I am having a problem with text extraction
that may have been reported before. One of
the people that I support created
a PDF file from a LibreOffice document,
and then misplaced the original document. I
believed that I could use PDFBox to extract the
text from the PDF, and at least provide
that information to the user.


When I ran the text extractor from
the "app" jar on their PDF file, I got many
messages of the following type:

...

Jun 13, 2018 5:38:43 PM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 7 (7) in font EXIRGE+Ubuntu
Jun 13, 2018 5:38:43 PM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 8 (8) in font EXIRGE+Ubuntu
Jun 13, 2018 5:38:43 PM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 1 (1) in font JTPICY+AndaleMono
Jun 13, 2018 5:38:43 PM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode

...

The resulting "txt" file contains only raw binary values,
except for text in one of the "standard" fonts. I ran
the debugger on the PDF file and saw that several fonts were
embedded, and thus used low numbers for encoding (1, 2, 3, etc.).
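
A minimal sketch of the suspected failure mode, as a
toy Java model rather than actual PDFBox code. It
assumes an embedded font whose glyphs are simply
numbered 1, 2, 3, ... and that a code with no Unicode
mapping is passed through as its raw value. The class
name, the decode helper, and the sample codes are all
hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model (not PDFBox code) of decoding character codes from an embedded font. */
final class ToUnicodeSketch {

    /**
     * Translates each character code through the given code-to-Unicode table.
     * A code with no entry is appended as its raw value (an assumed fallback),
     * which ends up as an unreadable control character in the output.
     */
    static String decode(int[] codes, Map<Integer, String> toUnicode) {
        StringBuilder out = new StringBuilder();
        for (int code : codes) {
            String mapped = toUnicode.get(code);
            if (mapped != null) {
                out.append(mapped);
            } else {
                out.append((char) code);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical content: glyphs numbered in order of first use.
        int[] codes = {1, 2, 3, 3, 4};

        // With a code-to-Unicode table, the text is recoverable.
        Map<Integer, String> table = new HashMap<>();
        table.put(1, "H");
        table.put(2, "e");
        table.put(3, "l");
        table.put(4, "o");
        System.out.println(decode(codes, table)); // prints Hello

        // Without one, only the low code values survive.
        String raw = decode(codes, new HashMap<>());
        for (char c : raw.toCharArray()) {
            System.out.print((int) c + " ");
        }
        System.out.println();
    }
}
```

If this model is right, the glyph shapes in the
embedded font are enough to draw the page, but nothing
in the file says which characters they stand for.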

When viewed, the PDF file looks good, but nothing can
be copied or pasted from the display (again, text in a
standard font seems OK).

The original file was of a sensitive nature, so I
re-created the problem with a simpler file.

Running on Ubuntu 16.04.
LibreOffice was used to "print" to
the cups-pdf "printer" (which seems to be
part of the problem).

Text extraction was attempted with
pdfbox-app-2.0.9.jar (and older versions).

PDF file is at:

http://swansongrp.com/misc/mytest3.pdf

Thanks as always for your help.

Bob Swanson
