On Wed, Oct 29, 2014 at 4:13 PM, João Cardoso <
[email protected]> wrote:

> Hi,
>
> I'm a researcher at INESC-ID and I'm currently working on an application
> that intends to parse ISO standards (stored in PDF files) and store their
> text into a database. This implies building some sort of tree with all the
> sections and subsections and so on...
>
> Well I'm aware that PDF files don't reflect text structure so I was aiming
> for a different approach. Just being able to have the text split into
> paragraphs would already be a massive help. An amazing help would be to have
> a way to distinguish between text styles so as to sort normal text from headings
> and all that.
>
>
We do a lot of this paragraph-splitting with scientific and technical documents
and I expect that the ISO standards are similar. Unfortunately you have to
use heuristics, and these can be based on:

* whitespace between paragraphs. Some documents have "double spacing" but
others don't, unfortunately.
* font size and font weight/blackness changes. Often a paragraph has a
section heading which is usually (but not always) a single "line".
* ragged ends of lines. If the text is justified to the right margin then a
line that stops short often signals end-of-para (but a full-width line does
not mean the paragraph continues).
* numbered sections. If you are really lucky the sections will be 1, 1.1,
1.2, 1.2.1, 2, etc.
* underlines and other path-based graphics. Horizontal lines (paths, not
characters) will sometimes be used.
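
As a rough illustration of how the whitespace, font-size and ragged-end
heuristics above might be combined, here is a minimal Python sketch. It
assumes the text lines have already been extracted with their positions; the
`Line` record, its fields and the thresholds are all hypothetical and would
need tuning for each source of documents. It is not taken from PDF2SVG or
SVG2XML.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Line:
    """One extracted text line (hypothetical record, not a PDFBox type)."""
    text: str
    top: float        # distance of the line from the top of the page
    right: float      # x-coordinate where the line's text ends
    font_size: float


def split_paragraphs(lines, gap_factor=1.5, ragged_tolerance=0.9):
    """Group lines (reading order, single column) into paragraphs.

    A new paragraph starts when the vertical gap is unusually large, the
    font size changes, or the previous line stopped short of the right margin.
    """
    if not lines:
        return []
    gaps = [b.top - a.top for a, b in zip(lines, lines[1:])]
    typical_gap = median(gaps) if gaps else 0.0
    # Assumption: the widest line reaches the right margin.
    right_margin = max(line.right for line in lines)

    paragraphs = [[lines[0]]]
    for prev, cur in zip(lines, lines[1:]):
        big_gap = typical_gap > 0 and (cur.top - prev.top) > gap_factor * typical_gap
        font_change = abs(cur.font_size - prev.font_size) > 0.5
        ragged_end = prev.right < ragged_tolerance * right_margin
        if big_gap or font_change or ragged_end:
            paragraphs.append([cur])
        else:
            paragraphs[-1].append(cur)
    return paragraphs
```

The ragged-end test only makes sense for justified text, so it would have to
be switched off for documents set ragged-right.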

If you are going to do a lot of this with a single source of documents, it
may be worth investing in creating some of these heuristics. But it will
still be work, unfortunately.
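
For the numbered-section case, one possible starting point is a regular
expression that picks out lines beginning with a dotted number and uses the
number of components as the depth in the tree. This is only a sketch in
Python; the pattern, the `parse_heading` and `build_tree` names and the
nested-dict layout are assumptions for illustration, and real ISO documents
will probably need extra rules (annexes, numbered list items, lines from the
table of contents).

```python
import re

# A dotted section number (1, 1.2, 1.2.1) followed by a title that starts
# with a capital letter. The capital-letter rule is a guess to skip lines
# such as "3 samples were taken".
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\s+([A-Z].*)$")


def parse_heading(line):
    """Return (number_tuple, title) if the line looks like a numbered heading, else None."""
    match = HEADING.match(line.strip())
    if match is None:
        return None
    number = tuple(int(part) for part in match.group(1).split("."))
    return number, match.group(2)


def build_tree(lines):
    """Nest sections by the depth of their number; other lines become body text."""
    root = {"number": (), "title": "", "text": [], "children": []}
    stack = [root]
    for line in lines:
        heading = parse_heading(line)
        if heading is None:
            # Body text belongs to the most recently opened section.
            stack[-1]["text"].append(line)
            continue
        number, title = heading
        node = {"number": number, "title": title, "text": [], "children": []}
        # Pop back to the nearest section shallower than the new one.
        while len(stack[-1]["number"]) >= len(number):
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
    return root
```

Depth is taken from the number alone, so a document that skips a level (1
followed directly by 1.1.1) still nests, just under the nearest shallower
section.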

We are gradually building up this sort of approach in
http://bitbucket.org/petermr PDF2SVG and SVG2XML, based on PDFBox, but it's
alpha at best...

P.




-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
