On Wed, Oct 29, 2014 at 4:13 PM, João Cardoso < [email protected]> wrote:
> Hi,
>
> I'm a researcher at INESC-ID and I'm currently working on an application
> that intends to parse ISO standards (stored in PDF files) and store their
> text into a database. This implies building some sort of tree with all the
> sections and subsections and so on...
>
> Well I'm aware that PDF files don't reflect text structure so I was aiming
> for a different approach. Just being able to have the text split into
> paragraphs would already be a massive help. An amazing help would be to have
> a way to differ between text styles so as to sort normal text from headings
> and all that.
>

We do a lot of this paragraph-splitting with scientific technical documents,
and I expect that the ISO standards are similar. Unfortunately you have to use
heuristics, and these can be:

* whitespace between paragraphs. Some documents have "double spacing" but
  others don't, unfortunately.
* font size and font weight/blackness changes. Often a paragraph has a section
  heading, which is usually (but not always) a single "line".
* ragged ends of lines. If the text is right-justified then a ragged end often
  signals end-of-para (but not the reverse: a line that reaches the margin can
  still be the last line of a paragraph).
* numbered sections. If you are really lucky the sections will be 1, 1.1, 1.2,
  1.2.1, 2, etc.
* underlines and other path-based graphics. Horizontal lines (paths, not
  characters) will sometimes be used.

If you are going to do a lot of this with a single source of documents, it may
be worth investing in creating some of these heuristics. But it will still be
work, unfortunately.

We are gradually building up this sort of approach in PDF2SVG and SVG2XML
(http://bitbucket.org/petermr), based on PDFBOX, but it's alpha at best....

P.

--
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
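
The heuristics listed above (vertical whitespace, font changes, ragged line
ends, numbered sections) can be sketched roughly as follows. This is a minimal
illustration in Python and is not the PDF2SVG/SVG2XML code: the Line record,
its field names and the two threshold factors are all assumptions, and it
presumes that some text extractor has already produced lines with coordinates
and font information. Path-based rules such as underlines are not covered.

```python
import re
from dataclasses import dataclass
from statistics import median


@dataclass
class Line:
    """One extracted text line (hypothetical record, not a real library type)."""
    text: str
    y: float            # vertical position, increasing down the page
    x_end: float        # right edge of the last glyph
    font_size: float
    bold: bool = False


# "1 Scope", "1.2 Terms", "1.2.1 General": a crude pattern that will also
# catch some body lines that merely start with a small number.
SECTION_RE = re.compile(r"^\d{1,2}(\.\d{1,2})*\s+\S")


def is_heading(line, body_size):
    """Guess a heading from weight, size or a leading section number."""
    return (
        line.bold
        or line.font_size > body_size * 1.1
        or bool(SECTION_RE.match(line.text))
    )


def split_paragraphs(lines, gap_factor=1.5, ragged_factor=0.9):
    """Group lines into ("heading" | "para", text) blocks using heuristics."""
    if not lines:
        return []

    body_size = median(l.font_size for l in lines)   # assumed body font size
    right_margin = max(l.x_end for l in lines)       # assumed justified text
    gaps = [b.y - a.y for a, b in zip(lines, lines[1:])]
    typical_gap = median(gaps) if gaps else 0.0

    blocks = []
    current = [lines[0]]
    for prev, line in zip(lines, lines[1:]):
        break_here = (
            line.y - prev.y > typical_gap * gap_factor   # extra whitespace
            or is_heading(line, body_size)               # heading starts
            or is_heading(prev, body_size)               # heading just ended
            or prev.x_end < right_margin * ragged_factor # ragged line end
        )
        if break_here:
            blocks.append(current)
            current = [line]
        else:
            current.append(line)
    blocks.append(current)

    return [
        ("heading" if is_heading(b[0], body_size) else "para",
         " ".join(l.text for l in b))
        for b in blocks
    ]
```

The thresholds would need tuning for each document source, and every rule here
can misfire on its own, so in practice it helps to check the output against a
few hand-marked pages before trusting it on a whole collection.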

