On 16.05.2016 at 14:26, Jason Lewis wrote:
Hi,
I'm having a problem using PDFBox to extract text from PDFs.
I have an application that prints to a PDF printer device in Windows.
The PDF printer device is actually cups-pdf on a Linux server.
Under Windows 7 I had the same problem extracting text from PDFs that
were generated in this way: they seemed unreadable by PDFBox. Eventually
I solved this by turning off "Enable advanced printing features" in
the Windows printer driver settings. After that PDFBox was able to
extract the text perfectly.
In Windows 10, however, you can't turn this option off. From what I gather,
Windows 10 uses "type 4" printer drivers, and the "Enable advanced
printing features" option is ticked but greyed out, so you can't untick it.
I have a test PDF that PDFBox can read fine, but if I print that PDF in
Windows to the CUPS PDF printer device, the resulting PDF is mangled in
some way that prevents PDFBox from extracting its text.
Why would you do that? You already have a PDF. Or was it just to
explain, i.e. you're really printing from that application of yours,
with the same problem, but you don't want to show that output because it
is confidential?
Is there something I can do to make PDFBox be able to understand the
mangled PDF?
No....
I've also noticed that I can't select text in the broken PDF. Maybe this
Windows driver somehow outlines all the text so it's no longer text but
vectors?
I had a look at the "printed" PDF with PDFDebugger. It contains the text as
one huge image, not as text.
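If you want to check other files for the same symptom without opening each one in PDFDebugger, a crude scan of the raw bytes may be enough. The sketch below is not PDFBox code: ImageOnlyCheck and its methods are hypothetical names made up for illustration. It only counts font dictionaries and image XObjects in the uncompressed parts of the file, so it will miss anything stored in compressed object streams (PDF 1.5 and later), and a scanned page with a hidden OCR text layer would still report fonts. Treat the result as a hint, not proof.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of PDFBox): heuristic raw-byte scan of a PDF.
final class ImageOnlyCheck {
    // Font dictionaries carry "/Type /Font"; \b keeps /FontDescriptor from matching.
    private static final Pattern FONT = Pattern.compile("/Type\\s*/Font\\b");
    // Image XObjects carry "/Subtype /Image".
    private static final Pattern IMAGE = Pattern.compile("/Subtype\\s*/Image\\b");

    private static int count(Pattern pattern, byte[] pdf) {
        // ISO-8859-1 maps every byte to exactly one char, so binary data is safe.
        Matcher m = pattern.matcher(new String(pdf, StandardCharsets.ISO_8859_1));
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    static int countFonts(byte[] pdf) {
        return count(FONT, pdf);
    }

    static int countImages(byte[] pdf) {
        return count(IMAGE, pdf);
    }

    // No fonts but at least one image: probably nothing for ExtractText to find.
    static boolean looksImageOnly(byte[] pdf) {
        return countFonts(pdf) == 0 && countImages(pdf) > 0;
    }

    public static void main(String[] args) throws IOException {
        byte[] pdf = Files.readAllBytes(Paths.get(args[0]));
        System.out.println("fonts=" + countFonts(pdf)
                + " images=" + countImages(pdf)
                + " imageOnly=" + looksImageOnly(pdf));
    }
}
```

Run it as "java ImageOnlyCheck test-pdf-broken.pdf". If it reports no fonts and at least one image, text extraction has nothing to work with, and only OCR outside of PDFBox could recover the text.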
To get at the greyed-out setting, try this:
http://techspeeder.com/2014/03/06/how-to-fix-printer-properties-that-are-grayed-out/
http://www.networksteve.com/forum/topic.php/Administrator_cannot_change_printer_properties_on_%22Advanced%22_tab/?TopicId=57069&Posts=4
Tilman
I'm using PDFBox like this:
java -jar pdfbox-app-2.0.1.jar ExtractText -encoding UTF-8 -console -startPage 1 -endPage 1 test-pdf-broken.pdf
Link to working PDF:
https://www.dropbox.com/s/glcmhl7nkg8w45f/test-pdf-works.pdf?dl=0
Link to broken PDF:
https://www.dropbox.com/s/uriq36brougr4z1/test-pdf-broken.pdf?dl=0
Any suggestions on how I might fix this?
Thanks,
Jason