Daniel,

On this issue I share the concerns of Andrew and Rekha, and despite
having followed the discussion in this thread I do not understand the
reason for the behaviour exhibited by version 1.0.

I have been using 'PDFTextStripper.processTextPosition(TextPosition
text)' to determine the positions of keywords in documents, and version
1.0 fails to find some keywords for which version 0.8 succeeded
(although 0.8 had its own particular quirks and was not infallible in
this respect). In most of these cases version 1.0 fails because it
splits off the final character of a word and processes it as a
separate 'TextPosition' object. For example, I have a document that
contains the phrase "Sum Assured" which in 1.0 is processed as two
separate 'TextPosition' objects, the first containing the string "Sum
Assure" and the second containing the string "d"!

I find this behaviour very odd since if I examine the document using
'org.apache.pdfbox.PDFDebugger' then in the content stream this phrase
appears as a single simple text object, thus:

BT
/F2 8 Tf
1 0 0 1 260 617 Tm
(Sum Assured)Tj
ET

and if I display the contents of the document as text using
'System.out.println(new PDFTextStripper().getText(document))' then
"Assured" is correctly printed as one distinct word with no space placed
between "Assure" and "d".

So why does 'PDFTextStripper.processTextPosition(TextPosition text)'
seem to arbitrarily split "Assured" into two parts? Why does it not
process "Sum Assured" as a single 'TextPosition' object, just as it
appears in the content stream?
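
In the meantime, one way to make the keyword search independent of how
the text is divided up is to accumulate the fragments and search the
joined text, remembering which fragment supplied each character. The
following is only a minimal sketch of that idea: 'KeywordLocator' and
'Fragment' are hypothetical names of my own, not PDFBox classes, and
'Fragment' merely stands in for the string and x/y origin that a
'TextPosition' reports.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch: find a keyword even when it arrives in several fragments,
 * for example "Sum Assure" followed by "d".
 * KeywordLocator and Fragment are hypothetical names, not PDFBox classes.
 */
public class KeywordLocator {

    /** Stand-in for the text and x/y origin reported by one TextPosition. */
    public static final class Fragment {
        public final String text;
        public final float x;
        public final float y;

        public Fragment(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // All fragment text seen so far, joined in arrival order.
    private final StringBuilder buffer = new StringBuilder();

    // owner.get(i) is the fragment that supplied character i of the buffer.
    private final List<Fragment> owner = new ArrayList<Fragment>();

    /** Call once per fragment, in the order in which they are reported. */
    public void add(Fragment fragment) {
        for (int i = 0; i < fragment.text.length(); i++) {
            buffer.append(fragment.text.charAt(i));
            owner.add(fragment);
        }
    }

    /** Returns the fragment in which the keyword starts, or null if absent. */
    public Fragment find(String keyword) {
        if (keyword.isEmpty()) {
            return null;
        }
        int index = buffer.indexOf(keyword);
        return index < 0 ? null : owner.get(index);
    }
}
```

In real use I would expect 'add' to be called from an overridden
'processTextPosition', building each 'Fragment' from the 'TextPosition'
it receives. The sketch assumes the fragments arrive in reading order
and adds no separators between them, so it would need extending for
documents where that does not hold.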

Thanks for your help.

Terry.


-----Original Message-----
From: Daniel Wilson [mailto:[email protected]] 
Sent: 01 March 2010 21:28
To: [email protected]; [email protected]
Subject: Re: Fwd: PDFTextStripper.processTextPosition

Andrew, if you & Rekha have similar problems perhaps public discussion
here would result in a good solution.  Villu is following this
discussion closely and did some of the related coding, I believe.

Daniel
