Hi,

Thanks for the quick response. I have uploaded one of the pages at 

https://www.dropbox.com/s/7cqlul61pk53gd1/testpage.pdf

Any pointers on how I could extend the text extraction would be great.

Thanks,
Stuart

On 12 Jun 2013, at 20:52, Maruan Sahyoun wrote:

> Hi Stuart,
> 
> From the screenshot it's not clear how the PDF is laid out. In general there 
> are some structures like article threads which PDFBox supports for text 
> extraction. PDFBox can also handle bookmarks, annotations, and so on, although 
> some of this information is not taken into account when using the standard 
> ExtractText functionality. However, it's possible to extend the existing functions. 
> With the PDF as a sample it would be easier to understand which PDF features 
> are used for the box and give you some additional hints. As the mailing list 
> doesn't allow for PDF attachments please upload a sample at a public location 
> if possible.
> 
> BR
> Maruan Sahyoun
> 
> On 12 Jun 2013, at 21:35, Stuart Coleman wrote:
> 
>> Hi,
>> 
>> I have a PDF file which I am trying to extract text from. Unfortunately the 
>> document is non-sequential and has various boxes with supplementary content. 
>> When I open the file in Acrobat Reader, Reader seems to be able to 
>> distinguish these features and can surround them with a blue bounding box. I 
>> would like to be able to extract text by area from within these bounding 
>> boxes. Is PDFBox also capable of detecting these features?
>> 
>> I have attached a screenshot showing the style of box I am referring to (top 
>> right-hand corner).
>> 
>> Thanks
>> Stuart
>> 
>> <Screen Shot 2013-06-12 at 20.17.31.png>
> 
