PDFs don't necessarily store space characters.  In some (many?) cases, the 
extraction code has to calculate character widths and positions on the page 
to decide whether or not to insert a space.  If something goes wrong with 
those coordinate calculations, you can get extra or missing spaces.
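
To make that concrete, here is a toy sketch of gap-based space insertion.  It 
is not PDFBox's actual algorithm; the Glyph class, the join method, the sample 
coordinates and the threshold values are all made up for illustration.  It 
only shows how a small kerning gap between syllables can be read as a space 
when the threshold is too tight.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: NOT the PDFBox implementation.
final class GapSpacingSketch {

    /** A positioned character: the glyph, its left x coordinate, and its advance width. */
    static final class Glyph {
        final char ch;
        final double x;
        final double width;

        Glyph(char ch, double x, double width) {
            this.ch = ch;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Rebuilds text from positioned glyphs, inserting a space whenever the
     * horizontal gap between consecutive glyphs exceeds gapThreshold.
     */
    static String join(List<Glyph> glyphs, double gapThreshold) {
        StringBuilder out = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null) {
                double gap = g.x - (prev.x + prev.width);
                if (gap > gapThreshold) {
                    out.append(' ');
                }
            }
            out.append(g.ch);
            prev = g;
        }
        return out.toString();
    }

    /** Made-up coordinates: "tab" and "lish" separated by a 1.5-unit kerning gap. */
    static List<Glyph> sample() {
        return Arrays.asList(
                new Glyph('t', 0.0, 5.0),
                new Glyph('a', 5.0, 5.0),
                new Glyph('b', 10.0, 5.0),
                new Glyph('l', 16.5, 5.0),
                new Glyph('i', 21.5, 5.0),
                new Glyph('s', 26.5, 5.0),
                new Glyph('h', 31.5, 5.0));
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = sample();
        System.out.println(join(glyphs, 1.0)); // threshold too tight: "tab lish"
        System.out.println(join(glyphs, 2.0)); // looser threshold: "tablish"
    }
}
```

With the tighter threshold the 1.5-unit gap is treated as a word break, which 
is the same kind of symptom as "es tab lish ment" in the report below.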

You could experiment with setting enableAutoSpace to false via 
PDFParserConfig, but I doubt that would fix the problem.

What happens if you run PDFBox's standalone app [1] directly?

java -jar pdfbox-app...jar ExtractText file.pdf

If you get the same spacing, please open an issue on PDFBox's issue 
tracker.


[1] http://mirror.reverse.net/pub/apache/pdfbox/2.0.1/pdfbox-app-2.0.1.jar

-----Original Message-----
From: Augusto Ribeiro Silva [mailto:[email protected]] 
Sent: Tuesday, May 31, 2016 7:36 AM
To: [email protected]
Subject: Weird spacing in words 

Hi all,

I am using the Tika Java library to read the content of some PDFs, and it 
seems to insert some weird (hyphen-like) spacing. For example:
The es tab lish ment of an in te grated Part ner Re la tion ship Man age ment 
(PRM) sys tem can po ten tially ad dress sev eral as pets

When I extract text from the same PDF using the pdftotext command-line 
utility, it extracts the text correctly:
The establishment of an integrated Partner Relationship Management (PRM) system 
can potentially address several aspects 

Does anybody have any idea why Tika behaves this way, or any tips for fixing 
it?

Best regards, 
Augusto
