You may want to get in contact with Peter Murray-Rust
(http://www.ch.cam.ac.uk/person/pm286) at the University of Cambridge.
He seems to have been working on molecular informatics involving
extraction of information from PDFs, and has probably faced many of your issues.
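
If plain text is all you can get out of the PDF, here is a rough sketch of
the "organize the plain text" step you describe. It is not tied to any PDF
API. It assumes (and these are only assumptions) that paragraphs are
separated by blank lines and that headings sit on their own line with a
dotted clause number such as "4.2.1 Title". The names HEADING and build_tree
are made up for illustration.

```python
import re

# Assumption: a heading is a single line like "4.2.1 Some title".
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\s+(\S.*)$")


def build_tree(text):
    """Turn extracted plain text into a nested section tree (hypothetical helper)."""
    root = {"number": "", "title": "", "paragraphs": [], "children": []}
    stack = [(0, root)]  # (depth, node); the root sits at depth 0
    # Assumption: blank lines separate paragraphs and headings.
    for block in re.split(r"\n\s*\n", text.strip()):
        block = block.strip()
        if not block:
            continue
        # Only single-line blocks are considered as heading candidates.
        match = HEADING.match(block) if "\n" not in block else None
        if match:
            number, title = match.group(1), match.group(2)
            depth = number.count(".") + 1  # "4.2.1" -> depth 3
            node = {"number": number, "title": title,
                    "paragraphs": [], "children": []}
            # Climb back up until the top of the stack is a shallower section.
            while stack[-1][0] >= depth:
                stack.pop()
            stack[-1][1]["children"].append(node)
            stack.append((depth, node))
        else:
            # Re-join lines that were wrapped during extraction.
            stack[-1][1]["paragraphs"].append(" ".join(block.split()))
    return root
```

A one-line paragraph that happens to start with a number would be mistaken
for a heading, so real documents will need extra checks (font size or
position, if your extraction step can report them).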
—Ken Bowen

On Oct 29, 2014, at 10:13 AM, João Cardoso 
<[email protected]> wrote:

> Hi,
> 
> I'm a researcher at INESC-ID and I'm currently working on an application
> that intends to parse ISO standards (stored in PDF files) and store their
> text into a database. This implies building some sort of tree with all the
> sections and subsections and so on...
> 
> Well I'm aware that PDF files don't reflect text structure so I was aiming
> for a different approach. Just being able to have the text split into
> paragraphs would already be a massive help. An amazing help would be to have
> a way to differentiate between text styles so as to sort normal text from
> headings and all that.
> 
> Well I've managed to extract plain text with your API. And with a lot of
> effort it would be possible to organize that plain text and provide it with
> some structure.
> 
> However, I was wondering if your API does not provide an easier way to do
> this. Maybe using some sort of object iteration within a page?
> 
> Thanks for the help.
> 
> Best regards,
> 
>  *João M. F. Cardoso*
> MSc in Telecommunications and Informatics Engineering, INESC-ID
> m:(+351) 916190940 | e:[email protected] | a: Skype:
> joao.m.f.cardoso
