Re: Softhyphens / white space

Hesham G. Fri, 10 Feb 2012 06:47:35 -0800

Dirk,

Did you try PDFTextStripper.setAverageCharTolerance(float)?
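
If adjusting the tolerance does not help, cleaning the text after
extraction and before indexing might be another option. The following is
only a minimal sketch in plain Java, not a PDFBox feature. It assumes the
extracted text still contains the soft hyphen character (U+00AD) right
before the inserted space, which may or may not match what pdfbox 1.6.0
actually emits. The class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox.
class SoftHyphenCleaner {

    // Assumption: a soft hyphen (U+00AD) is still present in the extracted
    // text, optionally followed by whitespace added during extraction.
    private static final Pattern SOFT_HYPHEN = Pattern.compile("\u00AD\\s*");

    // Removes each soft hyphen together with any whitespace directly after it.
    static String clean(String extracted) {
        return SOFT_HYPHEN.matcher(extracted).replaceAll("");
    }

    public static void main(String[] args) {
        String extracted = "Me\u00AD di\u00AD zin ist wichtig";
        System.out.println(clean(extracted)); // prints "Medizin ist wichtig"
    }
}
```

If the soft hyphens are already gone from the output and only the spaces
remain, this approach cannot tell them apart from real word boundaries, so
it is worth inspecting the raw extracted string first.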

Best regards,
Hesham


---------------------------------------------
Included message:

Hello,

I use pdfbox 1.6.0 to extract text from PDFs, which often works fine.

Unfortunately, it seems to insert a space character wherever there is a
soft hyphen in the content of the PDF, so the extracted text is sometimes
very fragmented. For example, the word "Medizin" is extracted as
"Me di zin".
I also tried setting the new option "parser.setEnableAutoSpace(false);",
but this had no effect on the output.

Does anyone have a suggestion for how to extract the content of PDFs
containing soft hyphens without fragmenting it?

As I use the output of pdfbox for searching with Apache Solr, my search
results sometimes get very strange...

Best regards
Dirk
