FWIW, the Linux tool pdftotext does a very good job of preserving the spacing, at least. It's one of the best things I've found for this, if all you need is correctly spaced text.
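In case it helps, here is a rough Python sketch of how I would go from that spaced text to rows. It assumes pdftotext is on the PATH and is run with its -layout option, that the first non-empty line is a header of single-word column names, and that a line with an empty first column is the continuation of a wrapped cell. The function names are made up for illustration, and it does not handle page breaks or repeated headers.

```python
import re
import subprocess


def pdf_to_layout_text(pdf_path):
    """Return the text of a PDF with its physical layout preserved.

    Assumes the pdftotext command-line tool is installed.
    """
    # "-layout" keeps the original spacing; "-" writes the text to stdout.
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def column_starts(header_line):
    """Character offsets at which each header word begins."""
    return [m.start() for m in re.finditer(r"\S+", header_line)]


def split_by_offsets(line, starts):
    """Cut one line into cells at the header's column offsets."""
    # The first column always starts at 0 so nothing to its left is lost.
    bounds = [0] + starts[1:] + [None]
    return [line[a:b].strip() for a, b in zip(bounds, bounds[1:])]


def parse_table(layout_text):
    """Turn layout-preserved text into rows, merging wrapped cells.

    Heuristic (an assumption, not a general rule): a line whose first
    column is empty continues the row above it.
    """
    lines = [ln for ln in layout_text.splitlines() if ln.strip()]
    if not lines:
        return []
    starts = column_starts(lines[0])
    rows = []
    for line in lines[1:]:
        cells = split_by_offsets(line, starts)
        if cells[0] == "" and rows:
            # Continuation line: append each fragment to the cell above.
            rows[-1] = [
                (old + " " + new).strip() for old, new in zip(rows[-1], cells)
            ]
        else:
            rows.append(cells)
    return rows
```

Something like parse_table(pdf_to_layout_text("report.pdf")) would then give one list per table row, where "report.pdf" is just a placeholder name. It only works as long as the columns really are aligned on the header offsets, so treat it as a starting point.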
But I have not tried Tabula, thanks for mentioning that!

On Tue, Feb 4, 2014 at 4:29 AM, Peter Murray-Rust <[email protected]> wrote:

> On Tue, Feb 4, 2014 at 9:03 AM, Johnny Bekkestad <[email protected]> wrote:
>
> > Hi, I have a big problem trying to read a "table" within a pdf.
> > There is a problem when the content of a cell wraps over multiple rows,
> > I am not able to associate the correct text with the correct value.
> > This becomes extra hard when there is also a page break.
> >
> > Here is an example
> >
> > ID  Title                         Name
> > 1   Text 1                        Name 1
> > 2   A very very long text 2       Name 2
> > 3   A very very very long text 3  This is also a very long name
> > 4   Short text 4                  Another very long name
> >
> > I am trying to get these as text and it is quite hard to associate the
> > correct values with the columns
> >
> > Anyone had this problem too?
>
> Yes - everyone.
>
> The problem is that PDF has no concept of "table". We have to guess it's a
> table because it has some "lines" and aligned text. (The lines are probably
> "paths" - a more primitive approach). The characters may be in any order.
> We have to deduce that your cell content consists of single sentences and
> not two independent items (e.g. by the lack of full stops, the lowercase
> second line and, in desperate cases, that an NLP parser can make sense of
> it).
>
> There is no standard way of doing this. TabulaPDF (which uses PDFBox) -
> http://tabula.nerdpower.org/ - is among the most advanced open source
> projects. I do some of this myself in https://bitbucket.org/petermr/ami2.
>
> We hope to pool our software and experiences so we don't all have to
> reinvent algorithms and heuristics.
>
> It's mindbogglingly tedious to do this.
>
> > /Johnny
>
> --
> Peter Murray-Rust
> Reader in Molecular Informatics
> Unilever Centre, Dep. of Chemistry
> University of Cambridge
> CB2 1EW, UK
> +44-1223-763069

