Micha, Thank you, thank you, thank you! It is finally working. I am very grateful.
On Tue, Feb 4, 2014 at 8:27 AM, Karen Lindholm <[email protected]> wrote:

> Thanks Micha for the explanation. I will try looking for the words
> preceding the text I want to extract. I appreciate your assistance and will
> let you know if I am successful.
>
> On Tue, Feb 4, 2014 at 8:20 AM, Michael Kuß <[email protected]> wrote:
>
>> Hi Karen,
>>
>> First, the PDF format is not designed for getting text back out. It is
>> not an editable format like plain text or Word; it is focused on
>> displaying the content.
>> Text in a PDF file is like a cloud of points scattered over a white
>> space. You have to put the characters (if available) in the correct order
>> and insert spaces where needed. pdfbox does this to some extent.
>> But what you see as text in, e.g., Acrobat Reader is not necessarily
>> "text"; it can also be a graphic.
>>
>> So, to your problem. Different PDF converters handle the positioning of
>> text during a PDF conversion in different ways.
>> Some produce just a graphic that represents the printed result of,
>> e.g., a Word document as a PDF file.
>> Some produce a PDF with text included. This text may come with or
>> without spaces, and it may or may not be correctly positioned.
>> The converters mostly try to produce an accurate representation from a
>> layout point of view. The focus is not on getting content back out of the
>> PDF file. PDF is not designed to do this.
>> If you use two different PDF converters, the text extracted with pdfbox
>> may differ.
>> Thus, if you must extract text from a PDF file with specific positioning,
>> you have to take more intelligent steps:
>> parse for known words, or extend the framework to parse just a specific
>> position.
>>
>> To get a clue about how the PDF format came about, have a look here:
>> http://en.wikipedia.org/wiki/Pdf
>>
>> Hope this helps somehow.
>>
>> Kind regards,
>> Micha
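
For anyone finding this thread later: the "parse for known words" idea from the quoted mail could look roughly like the sketch below. It is only an illustration, not the code that ended up working here. It assumes the text has already been pulled out of the PDF into a String (for example with pdfbox's PDFTextStripper.getText), and the class name MarkerExtractor, the method extractAfter and the sample labels are all made up for this example.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of pdfbox: finds the text that follows a
// known label in already-extracted text.
public class MarkerExtractor {

    /**
     * Returns the remainder of the line after the first occurrence of
     * marker, or an empty Optional if the marker is missing or nothing
     * follows it on that line.
     */
    public static Optional<String> extractAfter(String text, String marker) {
        // Pattern.quote makes characters such as '.' or '(' match literally.
        // "[ \\t]*" skips spaces and tabs after the marker;
        // "([^\\r\\n]*)" captures everything up to the end of the line.
        Pattern pattern = Pattern.compile(Pattern.quote(marker) + "[ \\t]*([^\\r\\n]*)");
        Matcher matcher = pattern.matcher(text);
        if (!matcher.find()) {
            return Optional.empty();
        }
        String value = matcher.group(1).trim();
        return value.isEmpty() ? Optional.empty() : Optional.of(value);
    }

    public static void main(String[] args) {
        // Stand-in for text extracted from a PDF (invented sample data).
        String extracted = "Customer: ACME Corp\nInvoice Number: 12345\nTotal: 99.00\n";
        System.out.println(extractAfter(extracted, "Invoice Number:").orElse("(not found)"));
    }
}
```

Because the extracted text may come with or without spaces depending on the converter, the marker has to be written exactly as it appears in the extracted output, so it is worth printing the raw text once before choosing it. If no reliable label precedes the value, extracting by position (pdfbox ships PDFTextStripperByArea for that) may be the better route.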

