RE: PDFParser generates gibberish

Allison, Timothy B. Thu, 30 Jun 2016 05:16:57 -0700

If you run the PDFBox app's ExtractText on the files, do you get the same 
output?  If so, it might make sense to ask the PDFBox project for help.

For example, using http://apache.cs.utah.edu/pdfbox/2.0.2/pdfbox-app-2.0.2.jar:

java -jar pdfbox-app-2.0.2.jar ExtractText thispdf.pdf
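
To compare that output against what Tika returns for the same file, something 
like the following might help. It is only a rough sketch: it assumes 
tika-server is listening on its default port (9998) on localhost, that the 
/tika endpoint is sent the raw PDF via PUT with "Accept: text/plain", and that 
the response is UTF-8. The function names are made up for illustration.

```python
import urllib.request

# Assumption: tika-server is running locally on its default port, 9998.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(pdf_bytes, url=TIKA_URL):
    """Build a PUT request asking tika-server for plain-text output."""
    return urllib.request.Request(
        url,
        data=pdf_bytes,
        method="PUT",
        headers={"Accept": "text/plain", "Content-Type": "application/pdf"},
    )


def extract_with_tika(pdf_path, url=TIKA_URL):
    """Send the PDF at pdf_path to tika-server and return the extracted text."""
    with open(pdf_path, "rb") as f:
        request = build_tika_request(f.read(), url)
    with urllib.request.urlopen(request) as response:
        # Assumption: the server answers in UTF-8; undecodable bytes are replaced.
        return response.read().decode("utf-8", errors="replace")
```

Calling extract_with_tika("thispdf.pdf") and diffing the result against the 
ExtractText output should show whether the stray characters already appear at 
the PDFBox level or only after Tika's handling.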

From: Allison A. [mailto:[email protected]]
Sent: Thursday, June 30, 2016 12:37 AM
To: [email protected]
Subject: Re: PDFParser generates gibberish

I am running tika-server-1.13 to extract text from a PDF file. Sometimes I 
get gibberish characters between words; they seem to be added in the spacing 
between words or at the end of the file.

For two-column PDF files this is quite serious: far too much gibberish is added.

How can I get rid of this? Any suggestions are welcome.

Allison
