I have tried that and agree it gives pretty good results. With some empirical rules I should be able to go quite a long way. Thanks for your help.
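As a rough illustration of the "empirical rules" idea: since the -html output wraps each paragraph of text in a <p> tag, the paragraphs can be pulled out and classified with simple heuristics. The sketch below uses only the Java standard library. The class HtmlParagraphs, its two methods, and the word-count rule are hypothetical and not part of PDFBox, and the regex approach assumes simple, non-nested <p> markup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of PDFBox): splits HTML text output into paragraphs. */
final class HtmlParagraphs {

    // Matches one <p>...</p> element; DOTALL lets a paragraph span several lines.
    private static final Pattern PARAGRAPH =
            Pattern.compile("<p\\b[^>]*>(.*?)</p>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Matches any remaining inline tag such as <b> or <i>.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    /** Returns the plain text of every non-empty <p> element, in document order. */
    static List<String> extractParagraphs(String html) {
        List<String> paragraphs = new ArrayList<>();
        Matcher m = PARAGRAPH.matcher(html);
        while (m.find()) {
            // Drop inline markup, then decode a few common entities (&amp; last).
            String text = TAG.matcher(m.group(1)).replaceAll("");
            text = text.replace("&lt;", "<").replace("&gt;", ">")
                       .replace("&quot;", "\"").replace("&amp;", "&");
            // Collapse runs of whitespace left over from line wrapping.
            text = text.replaceAll("\\s+", " ").trim();
            if (!text.isEmpty()) {
                paragraphs.add(text);
            }
        }
        return paragraphs;
    }

    /** Example empirical rule (an assumption): short paragraphs are box candidates. */
    static boolean looksLikeSidebar(String paragraph, int maxWords) {
        return paragraph.split(" ").length <= maxWords;
    }

    public static void main(String[] args) {
        String html = "<html><body><p>Main body text that runs on for a while.</p>\n"
                    + "<p><b>Box:</b> short note</p></body></html>";
        for (String p : extractParagraphs(html)) {
            System.out.println((looksLikeSidebar(p, 4) ? "[box?] " : "[body] ") + p);
        }
    }
}
```

The word-count threshold is only a placeholder; any real rule would have to be tuned against the actual documents.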

On Wed, Jun 12, 2013 at 9:23 PM, Maruan Sahyoun <[email protected]> wrote:

> Hi Stuart,
>
> give ExtractText a try using
>
>     ExtractText -nonSeq -html
>
> and inspect the result. It does a fairly good job for the sample PDF. The
> reason why I'm suggesting the -html option is that paragraphs of text are
> written out within a <p> tag. You can build on
> org.apache.pdfbox.util.PDFText2HTML and use that as a starting point for
> enhancements if needed.
>
> As there is no structure information within the PDF, it cannot be used to
> improve the text extraction. The boxes you see around the text are
> graphics; they are not represented as, e.g., articles. Of course you could
> try to use the drawing commands as a hint, but that's a lot of effort.
> Maybe the functionality available is already sufficient for you.
>
> BR
> Maruan Sahyoun
>
> On 12.06.2013 at 22:04, Stuart Coleman <[email protected]> wrote:
>
>> Hi,
>>
>> Thanks for the quick response. I have uploaded one of the pages at
>>
>> https://www.dropbox.com/s/7cqlul61pk53gd1/testpage.pdf
>>
>> Any pointers on how I could extend things would be great.
>>
>> Thanks,
>> Stuart
>>
>> On 12 Jun 2013, at 20:52, Maruan Sahyoun wrote:
>>
>>> Hi Stuart,
>>>
>>> from the screenshot it's not clear how the PDF is laid out. In general
>>> there are some structures, like article threads, which PDFBox supports
>>> for text extraction. PDFBox is also able to handle bookmarks,
>>> annotations …, although some of this information is not taken into
>>> account when using the standard ExtractText functionality. But it's
>>> possible to extend the existing functions. With the PDF as a sample it
>>> would be easier to understand which PDF feature is used for the box and
>>> to give you some additional hints. As the mailing list doesn't allow
>>> PDF attachments, please upload a sample to a public location if
>>> possible.
>>>
>>> BR
>>> Maruan Sahyoun
>>>
>>> On 12.06.2013 at 21:35, Stuart Coleman <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have a PDF file which I am trying to extract text from.
>>>> Unfortunately the document is non-sequential and has various boxes
>>>> with supplementary content. When I open the file in Acrobat Reader,
>>>> Reader seems to be able to distinguish these features and can surround
>>>> them with a blue bounding box. I would like to be able to extract text
>>>> by area from within these bounding boxes. Is PDFBox capable of
>>>> detecting these features as well?
>>>>
>>>> I have attached a screenshot showing the style of box I am referring
>>>> to (top right-hand corner).
>>>>
>>>> Thanks
>>>> Stuart
>>>>
>>>> <Screen Shot 2013-06-12 at 20.17.31.png>

