Thanks Micha for the explanation. I will try looking for the words preceding the text I want to extract. I appreciate your assistance and will let you know if I am successful.
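
For reference, here is a minimal sketch of the "look for the words preceding the text" idea. It assumes the page text has already been extracted into a String (for example with PDFBox's PDFTextStripper.getText) and that the wanted value sits on the same line as a known label. The class name LabelExtractor, the method extractAfter, and the sample text and label are all made up for illustration; they are not part of PDFBox.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox: works on text that has
// already been extracted from the PDF.
final class LabelExtractor {

    /**
     * Returns the rest of the line following the first occurrence of
     * the label, trimmed. Returns an empty Optional if the label does
     * not occur. If the label is found but nothing follows it on that
     * line, the result is an Optional containing an empty string.
     */
    static Optional<String> extractAfter(String text, String label) {
        // Pattern.quote matches the label literally, [ \t]* skips any
        // spaces or tabs after it, and the group captures everything
        // up to the end of the line.
        Pattern p = Pattern.compile(Pattern.quote(label) + "[ \\t]*([^\\r\\n]*)");
        Matcher m = p.matcher(text);
        if (m.find()) {
            return Optional.of(m.group(1).trim());
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up sample standing in for text extracted from a PDF.
        String extracted = "Name: Example\nInvoice No:  12345\nTotal: 99.00\n";
        System.out.println(extractAfter(extracted, "Invoice No:").orElse("(not found)"));
    }
}
```

Since different converters may drop or add spaces, the label may need to be matched more loosely than this in practice. If the value always appears in the same place on the page, PDFBox's PDFTextStripperByArea may be a better fit than searching for a label.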
On Tue, Feb 4, 2014 at 8:20 AM, Michael Kuß <[email protected]> wrote:

> Hi Karen,
>
> First, the PDF format is not designed for getting text back out. It is not
> an editable format like plain text or Word; it is focused on displaying
> content.
> Text in a PDF file is like a cloud of points scattered over white space.
> You have to put the characters (if available) in the correct order and
> insert spaces where needed. PDFBox does this to some extent.
> But what you see as text, e.g. in Acrobat Reader, is not necessarily "text";
> it can also be a graphic.
>
> So, to your problem. Different PDF converters handle the positioning of
> text during conversion in different ways.
> Some produce just a graphic that represents the printed result of, e.g.,
> a Word document as a PDF file.
> Some produce a PDF with text included. This text may come with or without
> spaces, and it may or may not be correctly positioned.
> Converters mostly try to produce an accurate representation from a layout
> point of view. The focus is not on getting content back out of the PDF
> file; PDF is not designed for that.
> If you use two different PDF converters, the text extracted with PDFBox
> may differ.
> So if you must extract text from a PDF file at a specific position, you
> have to take more intelligent steps:
> parse for known words, or extend the framework to parse just a specific
> position.
>
> To get an idea of how the PDF format came about, have a look here:
> http://en.wikipedia.org/wiki/Pdf
>
> Hope this helps somehow.
>
> Kind regards,
> Micha

