Hello,

I use pdfbox 1.6.0 to extract text from PDFs, which usually works fine.

Unfortunately, it seems to insert a space character wherever there is a soft hyphen in the content of the PDF, so the extracted text is sometimes very fragmented. For example, the word "Medizin" is extracted as "Me di zin". I also tried setting the new option "parser.setEnableAutoSpace(false);", but this had no effect on the output.

Does anyone have a suggestion for how to extract the content of a PDF containing soft hyphens without fragmenting it? As I use the output of pdfbox for searching with Apache Solr, my search results sometimes get very strange...

Best regards
Dirk

