If you run the PDFBox app’s ExtractText on the files, do you get the same output? If so, it might make sense to ask the PDFBox project for help, since Tika relies on PDFBox for PDF text extraction.
e.g.:

http://apache.cs.utah.edu/pdfbox/2.0.2/pdfbox-app-2.0.2.jar

java -jar pdfbox-app-2.0.2.jar ExtractText thispdf.pdf

From: Allison A. [mailto:[email protected]]
Sent: Thursday, June 30, 2016 12:37 AM
To: [email protected]
Subject: Re: PDFPaser generates gibberish

I am running Tika-server-1.13 to extract text from a PDF file. Sometimes I get gibberish characters between words; they seem to be added to the spacing between words or at the end of the file. For two-column PDF files this is quite serious, with far too much gibberish added. How can I get rid of this? Any suggestions are welcome.

Allison
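To check whether Tika and ExtractText really give the same output, as asked at the top of this thread, something like the following rough Python sketch might help. It assumes you have already saved both outputs as text; the function names (`normalize`, `suspicious_chars`, `compare_outputs`) are made up for illustration and are not part of Tika or PDFBox. What counts as "gibberish" here (control, private-use and unassigned characters, plus U+FFFD) is also only a guess at what you are seeing.

```python
import difflib
import unicodedata


def normalize(text):
    """Collapse runs of whitespace so pure layout differences are ignored."""
    return " ".join(text.split())


def suspicious_chars(text):
    """Return characters that often show up as gibberish (an assumption):
    control, private-use, or unassigned code points, plus U+FFFD."""
    bad = []
    for ch in text:
        if ch in "\n\r\t":
            continue  # ordinary line breaks and tabs are fine
        if ch == "\ufffd" or unicodedata.category(ch) in {"Cc", "Co", "Cn"}:
            bad.append(ch)
    return bad


def compare_outputs(tika_text, pdfbox_text):
    """Compare text extracted by Tika with text from PDFBox ExtractText."""
    a = normalize(tika_text)
    b = normalize(pdfbox_text)
    similarity = difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()
    return {
        "identical": a == b,
        "similarity": similarity,
        "tika_suspicious": len(suspicious_chars(tika_text)),
        "pdfbox_suspicious": len(suspicious_chars(pdfbox_text)),
    }
```

If both outputs show the same suspicious characters, the problem is most likely in PDFBox itself (or in the PDF), and the PDFBox list is the right place to ask. If only the Tika output has them, it is worth following up here.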
