Hi All,

 I tried extracting text from a sample PDF with the class 
org.apache.pdfbox.ExtractText, run from the command line using PDFBox 1.4.0.
The extracted text contains concatenated words such as "ofgovernance" and 
"ProgressiveAlliance", which do not appear joined in the actual PDF.
It seems that PDFBox is, in a few cases, joining the word at the end of a 
line with the word at the start of the next line.

Please find the sample PDF attached to this mail.

Could someone please let me know whether this is a known bug and how to solve it?

<snip>
[root@vm-ps3152 lib]# java -cp 
.:pdfbox-1.4.0.jar:commons-logging-1.0.4.jar:fontbox-1.4.0.jar 
org.apache.pdfbox.ExtractText -console /tmp/chapter-04.pdf | more
Feb 23, 2011 3:55:04 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: EI
173
Women, Children and Development
4.1 One of the six basic principles ofgovernance laid down in the United 
ProgressiveAlliance government's National CommonMinimum Programme (NCMP) is "to 
empowe
r
women politically, educationally, economically
and legally. In the light of this, it is necessary
to assess how women and children actuallyfared in the process of development 
during theTenth Plan and what correctives need to be
applied.
</snip>
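
In case it helps anyone check their own output for the same problem, here is a rough sketch of how some of the joined words could be flagged automatically. It does not use PDFBox at all; it only scans already-extracted text for tokens where a lowercase letter is immediately followed by an uppercase one. The class name JoinedWordFinder and the method findJoinedWords are hypothetical names made up for this illustration, and the heuristic is an assumption on my part: it catches "ProgressiveAlliance" or "theTenth", but not all-lowercase joins such as "ofgovernance".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of PDFBox): flags tokens in extracted text
// that look like two words glued together.
public class JoinedWordFinder {

    // One alphabetic token containing a lowercase letter directly followed
    // by an uppercase letter, e.g. "ProgressiveAlliance".
    private static final Pattern JOINED =
            Pattern.compile("\\b[A-Za-z]*[a-z][A-Z][A-Za-z]*\\b");

    // Returns every suspicious token, in the order it appears in the text.
    public static List<String> findJoinedWords(String text) {
        List<String> hits = new ArrayList<String>();
        Matcher m = JOINED.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // Fragment taken from the extracted output quoted above.
        String sample = "principles ofgovernance laid down in the United "
                + "ProgressiveAlliance National CommonMinimum Programme "
                + "during theTenth Plan";
        // Prints: [ProgressiveAlliance, CommonMinimum, theTenth]
        System.out.println(findJoinedWords(sample));
    }
}
```

All-caps abbreviations such as "NCMP" and ordinary capitalised words such as "United" are not reported, since they contain no lowercase-to-uppercase transition.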


Thanks in advance
-Pravin
