Hello there,

>
> There is another notice ... A phrase "A Worldly" in the same line in the PDF 
> was extracted also as "AWorldly" without space !!
> You can check it in this file :
> http://www.4shared.com/file/186430363/628fea7f/Enter-sample2.html
>

The phrase "A Worldly" occurs in the title section of the article and
is painted using a boldface font.

To my knowledge, PDFBox is not very sophisticated here: it uses the
same word separation detection algorithm for all fonts, whether
normal, italic, or boldface. However, as this issue demonstrates, it
might be justified to tweak some threshold values in a font-dependent
manner.

In the meantime, to work around this particular problem, you might
simply insert a space wherever you find two consecutive uppercase
letters that are painted in a boldface font.
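
A minimal sketch of that workaround in plain Java follows. It assumes
you have already collected, for each extracted character, whether its
font is boldface (for example from the font information available
during extraction). The class BoldCapsSpacer, the Glyph type, and the
method names are hypothetical and are not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.List;

final class BoldCapsSpacer {

    // Hypothetical holder: one extracted character and whether its font is bold.
    static final class Glyph {
        final char ch;
        final boolean bold;

        Glyph(char ch, boolean bold) {
            this.ch = ch;
            this.bold = bold;
        }
    }

    // Convenience helper for examples: treat every character of text as
    // having the same boldness.
    static List<Glyph> glyphs(String text, boolean bold) {
        List<Glyph> result = new ArrayList<>();
        for (char c : text.toCharArray()) {
            result.add(new Glyph(c, bold));
        }
        return result;
    }

    // Insert a space between two consecutive uppercase letters when both
    // are painted in a bold font; everything else is copied unchanged.
    static String insertSpaces(List<Glyph> glyphs) {
        StringBuilder out = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            boolean bothBoldUpper = prev != null
                    && prev.bold && g.bold
                    && Character.isUpperCase(prev.ch)
                    && Character.isUpperCase(g.ch);
            if (bothBoldUpper) {
                out.append(' ');
            }
            out.append(g.ch);
            prev = g;
        }
        return out.toString();
    }
}
```

Note that this heuristic would also split boldface acronyms (a bold
"PDF" would become "P D F"), so it is probably only suitable as a
stopgap for titles like the one in your sample.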


VR
