Softhyphens / white space

Dirk Högemann Fri, 10 Feb 2012 02:37:19 -0800

Hello,

I use pdfbox 1.6.0 to extract text from PDFs, which often works fine.


Unfortunately, it seems to insert a space character wherever there is a
soft hyphen in the content of the PDF.
As a result, the extracted text is sometimes very fragmented. For example,
the word Medizin is extracted as Me di zin.
I also tried setting the new option "parser.setEnableAutoSpace(false);",
but this had no effect on the output.

Does anyone have a suggestion for how to extract the content of PDFs
containing soft hyphens without fragmenting it?

As I use the output of pdfbox for searching with Apache Solr, my search
results sometimes get very strange...
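
One workaround that might be worth trying is to clean the extracted string
before it goes to Solr. This is only a minimal sketch and rests on an
assumption that has not been verified: that the extracted text still
contains the soft-hyphen character (U+00AD) directly in front of the
inserted space. If the soft hyphen is already gone from the output, this
approach cannot tell an inserted space from a real one. The class and
method names are hypothetical, and no pdfbox API is used here.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of pdfbox.
public class SoftHyphenCleaner {

    // A soft hyphen (U+00AD) plus any whitespace that directly follows it.
    // ASSUMPTION: the extractor keeps U+00AD in its output.
    private static final Pattern SOFT_HYPHEN_AND_SPACE =
            Pattern.compile("\u00AD\\s*");

    // Removes each soft hyphen together with the whitespace after it,
    // so that the fragments of a word are joined again.
    public static String stripSoftHyphens(String text) {
        return SOFT_HYPHEN_AND_SPACE.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // Simulated extractor output for the word "Medizin".
        String extracted = "Me\u00AD di\u00AD zin";
        System.out.println(stripSoftHyphens(extracted)); // prints "Medizin"
    }
}
```

Note that this also removes a line break that follows a soft hyphen, which
is usually what one wants for indexing, but it changes the layout of the
text.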

Best regards
Dirk
