Hi João,

I'm happy to share source code for some work I've done on extracting tables from PDF documents. It may be a starting point for you, since it looks for graphic boxes drawn around text to identify table headings.
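To illustrate the box-based idea only, here is a minimal Python sketch; it is not the source code mentioned above. `Box`, `TextRun`, and `find_boxed_text` are hypothetical names, and the sketch assumes the rectangles and text positions have already been pulled out of the page by whatever PDF library is in use, with y growing upward as in PDF user space.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in page coordinates (hypothetical type)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, x: float, y: float) -> bool:
        return self.x0 <= x <= self.x1 and self.y0 <= y <= self.y1


@dataclass(frozen=True)
class TextRun:
    """A piece of extracted text and the position of its origin (hypothetical type)."""
    text: str
    x: float
    y: float


def find_boxed_text(boxes, runs):
    """Map each box to the joined text of the runs whose origin lies inside it.

    Boxes that contain no text are skipped.
    """
    result = {}
    for box in boxes:
        inside = [r for r in runs if box.contains(r.x, r.y)]
        if inside:
            # Reading order: top to bottom, then left to right
            # (assumes y grows upward).
            inside.sort(key=lambda r: (-r.y, r.x))
            result[box] = " ".join(r.text for r in inside)
    return result


# Made-up sample data: two drawn boxes and four text runs.
boxes = [Box(50, 700, 150, 720), Box(150, 700, 250, 720)]
runs = [
    TextRun("Part", 55, 705),
    TextRun("number", 80, 705),
    TextRun("Quantity", 155, 705),
    TextRun("A-100", 55, 685),  # below the boxes, so treated as body text
]
headings = find_boxed_text(boxes, runs)
```

Text that falls inside a drawn box is treated as a candidate table heading; everything else is left for the normal text pass. A real document would also need tolerance for text whose origin sits slightly outside its box.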
Frank

On Thu, Oct 30, 2014 at 6:27 AM, Ken Bowen <[email protected]> wrote:

> You may want to get in contact with Peter Murray-Rust
> (http://www.ch.cam.ac.uk/person/pm286) at the University of Cambridge. He
> seems to have been working on molecular informatics involving extraction of
> information from PDFs, and has probably faced many of your issues.
> -- Ken Bowen
>
> On Oct 29, 2014, at 10:13 AM, João Cardoso <[email protected]> wrote:
>
> > Hi,
> >
> > I'm a researcher at INESC-ID and I'm currently working on an application
> > that intends to parse ISO standards (stored in PDF files) and store their
> > text in a database. This implies building some sort of tree with all the
> > sections and subsections and so on...
> >
> > Well, I'm aware that PDF files don't reflect text structure, so I was
> > aiming for a different approach. Just being able to have the text split
> > into paragraphs would already be a massive help. An amazing help would be
> > to have a way to distinguish between text styles so as to sort normal
> > text from headings and all that.
> >
> > Well, I've managed to extract plain text with your API. And with a lot of
> > effort it would be possible to organize that plain text and provide it
> > with some structure.
> >
> > However, I was wondering whether your API provides an easier way to do
> > this. Maybe using some sort of object iteration within a page?
> >
> > Thanks for the help.
> >
> > Best regards,
> >
> > *João M. F. Cardoso*
> > MSc in Telecommunications and Informatics Engineering, INESC-ID
> > m: (+351) 916190940 | e: [email protected] | a: Skype: joao.m.f.cardoso

