I have tried that and agree it gives pretty good results. With some empirical 
rules I should be able to go quite a long way. Thanks for your help.
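
As an illustration of what such empirical rules might look like, here is a minimal sketch in plain Java. It assumes the HTML written by `ExtractText -nonSeq -html` has already been read into a string. The class `ParagraphRules`, its method names and the word-count threshold are hypothetical and not part of PDFBox; the idea is only to pull out the `<p>` blocks and flag short ones as candidate box text. HTML entities are left undecoded.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ParagraphRules {
    // Matches one <p>...</p> block (but not <pre>); DOTALL lets a paragraph span lines.
    private static final Pattern PARAGRAPH = Pattern.compile(
            "<p(?:\\s[^>]*)?>(.*?)</p>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the text of every non-empty <p> element, tags removed, whitespace collapsed. */
    public static List<String> paragraphs(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = PARAGRAPH.matcher(html);
        while (m.find()) {
            String text = m.group(1)
                    .replaceAll("<[^>]+>", " ")   // drop nested tags such as <b>
                    .replaceAll("\\s+", " ")      // collapse line breaks and runs of spaces
                    .trim();
            if (!text.isEmpty()) {
                result.add(text);
            }
        }
        return result;
    }

    /** Hypothetical rule: a paragraph with fewer than maxWords words may be box/sidebar text. */
    public static boolean looksLikeSidebar(String paragraph, int maxWords) {
        return paragraph.split(" ").length < maxWords;
    }
}
```

Calling `ParagraphRules.paragraphs(html)` and then filtering with `looksLikeSidebar(p, 12)` would separate short blocks from body text; the threshold of 12 is an arbitrary starting point that would need tuning against the real document.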

On Wed, Jun 12, 2013 at 9:23 PM, Maruan Sahyoun <[email protected]>
wrote:

> Hi Stuart,
> Give ExtractText a try using
> ExtractText -nonSeq -html
> and inspect the result. It does a fairly good job on the sample PDF. The
> reason I'm suggesting the -html option is that paragraphs of text are
> written out within a <p> tag. You can build on
> org.apache.pdfbox.util.PDFText2HTML and use that as a starting point for
> enhancements if needed.
> As there is no structure information within the PDF, none can be used to
> improve the text extraction. The boxes you see around the text are plain
> graphics; they are not represented as, e.g., articles. Of course you could
> try to use the drawing commands as a hint, but that's a lot of effort. Maybe
> the functionality already available is sufficient for you.
> BR
> Maruan Sahyoun
> On 12.06.2013 at 22:04, Stuart Coleman <[email protected]> wrote:
>> Hi,
>> 
>> Thanks for the quick response. I have uploaded one of the pages at 
>> 
>> https://www.dropbox.com/s/7cqlul61pk53gd1/testpage.pdf
>> 
>> Any pointers on how I could extend things would be great.
>> 
>> Thanks,
>> Stuart
>> 
>> On 12 Jun 2013, at 20:52, Maruan Sahyoun wrote:
>> 
>>> Hi Stuart,
>>> 
>>> from the screenshot it's not clear how the PDF is laid out. In general
>>> there are some structures, like article threads, which PDFBox supports for
>>> text extraction. PDFBox is also able to handle bookmarks, annotations, etc.,
>>> although some of this information is not taken into account when using
>>> the standard ExtractText functionality. It is possible to extend the
>>> existing functions, though. With the PDF as a sample it would be easier to
>>> understand which PDF feature is used for the box and to give you some
>>> additional hints. As the mailing list doesn't allow PDF attachments,
>>> please upload a sample to a public location if possible.
>>> 
>>> BR
>>> Maruan Sahyoun
>>> 
>>> On 12.06.2013 at 21:35, Stuart Coleman <[email protected]> wrote:
>>> 
>>>> Hi,
>>>> 
>>>> I have a PDF file which I am trying to extract text from. Unfortunately
>>>> the document is non-sequential and has various boxes with supplementary
>>>> content. When I open the file in Acrobat Reader, Reader seems to be able
>>>> to distinguish these features and can surround them with a blue bounding
>>>> box. I would like to be able to extract text by area from within these
>>>> bounding boxes. Is PDFBox capable of detecting these features as well?
>>>> 
>>>> I have attached a screenshot showing the style of box I am referring to
>>>> (top right-hand corner).
>>>> 
>>>> Thanks
>>>> Stuart
>>>> 
>>>> <Screen Shot 2013-06-12 at 20.17.31.png>
>>> 
>> 
