Daniel,

On this issue I share the concerns of Andrew and Rekha, and despite having followed the discussion in this thread I do not understand the reason for the behaviour exhibited by version 1.0.
I have been using 'PDFTextStripper.processTextPosition(TextPosition text)' to determine the positions of keywords in documents, and version 1.0 fails to find some keywords that version 0.8 found (although 0.8 had its own quirks and was not infallible in this respect). Very often the reason version 1.0 fails is that it splits off the final character of a word and processes it as a separate 'TextPosition' object.

For example, I have a document that contains the phrase "Sum Assured", which 1.0 processes as two separate 'TextPosition' objects, the first containing the string "Sum Assure" and the second containing the string "d"! I find this behaviour very odd, because if I examine the document using 'org.apache.pdfbox.PDFDebugger' the phrase appears in the content stream as a single simple text object, thus:

    BT /F2 8 Tf 1 0 0 1 260 617 Tm (Sum Assured)Tj ET

and if I display the contents of the document as text using 'System.out.println(new PDFTextStripper().getText(document))' then "Assured" is correctly printed as one distinct word with no space between "Assure" and "d".

So why does 'PDFTextStripper.processTextPosition(TextPosition text)' seem to split "Assured" arbitrarily into two parts? Why does it not process "Sum Assured" as a single 'TextPosition' object, just as it appears in the content stream?

Thanks for your help.

Terry.

-----Original Message-----
From: Daniel Wilson [mailto:[email protected]]
Sent: 01 March 2010 21:28
To: [email protected]; [email protected]
Subject: Re: Fwd: PDFTextStripper.processTextPosition

Andrew, if you & Rekha have similar problems perhaps public discussion here would result in a good solution. Villu is following this discussion closely and did some of the related coding, I believe.

Daniel
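For the keyword search described in Terry's message, one possible workaround is to stop matching against each 'TextPosition' on its own and instead match against the concatenated text of consecutive fragments, then map the hit back to the fragment in which it starts. The sketch below only illustrates that idea and does not explain why 1.0 splits the string. 'KeywordLocator' and 'Fragment' are hypothetical names, not PDFBox classes; in real use each 'Fragment' would presumably be filled from the string and coordinates of a 'TextPosition' received in 'processTextPosition'. It also assumes the fragments arrive in reading order and belong to the same line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class KeywordLocator {

    /** Hypothetical stand-in for the text and position carried by one TextPosition. */
    public static final class Fragment {
        public final String text;
        public final float x;
        public final float y;

        public Fragment(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Returns the fragment in which the first occurrence of the keyword starts,
     * or null if the keyword does not occur in the joined fragment text.
     * Assumption: fragments are in reading order and are joined without separators.
     */
    public static Fragment locate(List<Fragment> fragments, String keyword) {
        StringBuilder joined = new StringBuilder();
        List<Integer> starts = new ArrayList<>();
        for (Fragment f : fragments) {
            starts.add(joined.length()); // offset of this fragment in the joined text
            joined.append(f.text);
        }

        int hit = joined.indexOf(keyword);
        if (hit < 0) {
            return null;
        }

        // The owner is the last fragment whose start offset is not past the hit.
        Fragment owner = null;
        for (int i = 0; i < fragments.size(); i++) {
            if (starts.get(i) <= hit) {
                owner = fragments.get(i);
            } else {
                break;
            }
        }
        return owner;
    }

    public static void main(String[] args) {
        // The split reported in the message: "Sum Assure" followed by "d".
        // The x coordinate of the second fragment is made up for illustration.
        List<Fragment> fragments = Arrays.asList(
                new Fragment("Sum Assure", 260f, 617f),
                new Fragment("d", 300f, 617f));

        Fragment hit = locate(fragments, "Sum Assured");
        System.out.println(hit == null
                ? "not found"
                : "found, starting at x=" + hit.x + ", y=" + hit.y);
    }
}
```

With this approach the keyword is found whether the phrase arrives as one fragment or as several, at the cost of buffering the fragments of a line before searching.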

