Re: PDF extraction

Frank van der Hulst Mon, 02 Feb 2015 14:01:22 -0800

I agree with everything Peter has said.

My 'solutions' work for the tables I wanted to extract, but won't work for
others.


I think it's common for a table to have at least one column which is present
in every row. I use that to break the table up into rows... when the X
position goes back to the left of this column it is (probably) a new row.
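
A minimal sketch of that row-breaking heuristic, not the actual classes
discussed in this thread. It assumes the text has already been extracted as
fragments in reading order, each with the x coordinate of its left edge.
Fragment, RowSplitter and anchorRightEdge are hypothetical names, not part
of PDFBox.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; neither class is part of PDFBox.
class RowSplitter {

    /** A piece of text and the x coordinate of its left edge. */
    static final class Fragment {
        final float x;
        final String text;

        Fragment(float x, String text) {
            this.x = x;
            this.text = text;
        }
    }

    /**
     * Groups fragments (given in reading order) into rows. A new row starts
     * when x moves backwards and lands at or left of anchorRightEdge, the
     * right edge of the column assumed to be present in every row.
     */
    static List<List<String>> splitRows(List<Fragment> fragments, float anchorRightEdge) {
        List<List<String>> rows = new ArrayList<>();
        List<String> current = null;
        float previousX = Float.NEGATIVE_INFINITY;
        for (Fragment f : fragments) {
            boolean backToAnchor = f.x < previousX && f.x <= anchorRightEdge;
            if (current == null || backToAnchor) {
                current = new ArrayList<>();
                rows.add(current);
            }
            current.add(f.text);
            previousX = f.x;
        }
        return rows;
    }
}
```

Text that wraps inside a later column moves x backwards too, but not as far
as the anchor column, so it stays in the current row. Like the heuristic
itself, this is only "probably" right.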

I have also used the header row of the table to identify the column limits.
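
One way the header-row idea could look, as a rough sketch under a strong
assumption: columns are left-aligned, so each column runs from the left edge
of its header cell to the left edge of the next one. ColumnLimits and its
methods are hypothetical names, not part of PDFBox.

```java
import java.util.Arrays;

// Hypothetical illustration; not part of PDFBox.
class ColumnLimits {
    private final float[] leftEdges; // left edges of the header cells, ascending

    ColumnLimits(float[] headerLeftEdges) {
        if (headerLeftEdges.length == 0) {
            throw new IllegalArgumentException("need at least one header cell");
        }
        leftEdges = headerLeftEdges.clone();
        Arrays.sort(leftEdges);
    }

    /** Index of the last column whose left edge is at or before x (0 if none). */
    int columnOf(float x) {
        int column = 0;
        for (int i = 1; i < leftEdges.length; i++) {
            if (x >= leftEdges[i]) {
                column = i;
            }
        }
        return column;
    }

    /** Places the fragments of one row into cells; fragments sharing a column are joined with a space. */
    String[] toCells(float[] xs, String[] texts) {
        String[] cells = new String[leftEdges.length];
        Arrays.fill(cells, "");
        for (int i = 0; i < xs.length; i++) {
            int c = columnOf(xs[i]);
            cells[c] = cells[c].isEmpty() ? texts[i] : cells[c] + " " + texts[i];
        }
        return cells;
    }
}
```

Right-aligned or centred columns (typical for amounts) would need the right
edge or the midpoint of each header cell instead, so treat this as a starting
point only.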

Incidentally, it's also a good idea to strip out the page headers & footers
before trying to parse a table.
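
A small sketch of stripping headers and footers by position, assuming they
sit in fixed bands at the top and bottom of every page and that y is measured
downwards from the top of the page. HeaderFooterFilter, Line and the band
heights are hypothetical, not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not part of PDFBox.
class HeaderFooterFilter {

    /** A line of text and its distance from the top of the page. */
    static final class Line {
        final float y;
        final String text;

        Line(float y, String text) {
            this.y = y;
            this.text = text;
        }
    }

    /** Keeps only the lines that fall outside the header band and the footer band. */
    static List<Line> stripMargins(List<Line> lines, float pageHeight,
                                   float headerHeight, float footerHeight) {
        List<Line> body = new ArrayList<>();
        for (Line line : lines) {
            boolean inHeader = line.y < headerHeight;
            boolean inFooter = line.y > pageHeight - footerHeight;
            if (!inHeader && !inFooter) {
                body.add(line);
            }
        }
        return body;
    }
}
```

The band heights have to be chosen per document family. Nothing here detects
them automatically.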

I'm happy for my work to be included in PDFBox as a starting point for
other people trying to extract tables.

Frank

On Tue, Feb 3, 2015 at 8:48 AM, Peter Murray-Rust <[email protected]> wrote:

> I agree with all those who emphasise that there is no deterministic
> algorithm. I also agree that Tabula is likely to be the best place to start
> and am working with them.
>
> The first question is:
>
> "How do you know where the tables are?"
>
> In some cases you can look for the Anglophone word "Table", and a regex of
> something like:
> - "Tab(le)?\s*((\d+)|([IVXL]+)) "
> or you can look for
>  - grid lines
> or you can look for whitespace patterns:
>
> Is    this
> a     table
>
> or is that just fortuitous?
>
> and some tables use zebra stripes.
>
> I suspect at least 100 person years (and probably much more) have been
> spent on trying to extract tables. If we take the heuristic approach then
> it's worth pooling our efforts and trying to share code. I'm sharing mine
> on:
> https://bitbucket.org/petermr/svg2xml/wiki/Home (which is built on PDFBox
> and https://bitbucket.org/petermr/pdf2svg/wiki/Home).
>
> Other people have built systems that use adaptive methods to decide where
> the whitespace is.
>
> I'd recommend splitting the PDF2Character part (I use SVG for the modelling
> syntax) and characters2tables as it means we can use more character
> extractors and combine them with the table synthesizers.
>
> P.
>
>
>
>
>
> On Mon, Feb 2, 2015 at 6:56 PM, Frank van der Hulst <[email protected]> wrote:
>
> > I have written a couple of Java classes that extract tabular data to
> > arrays of Strings.
> >
> > One works where the location of each column is fixed. The other figures
> > out the locations of columns from the table headers and outline drawing.
> >
> > The usual story applies... hardly any documentation, and they only work
> > for limited cases. I've sent the code to Lorena... I'd be grateful if you
> > could improve the documentation.
> >
> > NB: I'll be out of reach of my computer (and therefore my source code)
> > for the next few days, but will probably still be able to answer emails.
> >
> > Frank
> >
> >
> > On Tue, Feb 3, 2015 at 7:07 AM, Tilman Hausherr <[email protected]>
> > wrote:
> >
> > > Hi Lorena,
> > >
> > > There is no concept of table in a PDF, except in a tagged PDF.
> > >
> > > A table is just lines and text. In no specific order. It could also be
> > > an image of a table.
> > >
> > > You can succeed in this only if you know the structure of the PDF in
> > > advance, e.g. when it all comes from the same client.
> > >
> > >
> > > https://stackoverflow.com/questions/23495372/extract-table-data-from-pdf
> > > https://stackoverflow.com/questions/17591426/extract-table-from-a-pdf
> > > https://stackoverflow.com/questions/17217194/extracting-table-contents-from-a-collection-of-pdf-files
> > > https://stackoverflow.com/questions/3424588/programmatically-extract-pdf-tables
> > >
> > > Tilman
> > >
> > >
> > > On 02.02.2015 at 16:29, Lorena Leishman wrote:
> > >
> > >> Hi,
> > >> I have a PDF that has information displayed on tables. Example:
> > >>
> > >> Company Name:   Barnes & Noble   Bank Of America   Macy's
> > >> Account #:      123xxxxx         345xxxx           679xxxx
> > >> Status:         Open             Closed            Open
> > >> Balance:        $23.             $0.00             $100
> > >>
> > >> Is there a way with PDFbox to extract a specific value(s) from the
> > >> table?
> > >> Example: Bank Of America  and $0.00
> > >> And also is there a way to cut the whole table and paste it into a
> > >> different PDF?
> > >> Please let me know, Thanks!
> > >> Lorena
> > >>
> > >
> > >
> > >
> > >
> >
>
>
>
> --
> Peter Murray-Rust
> Reader in Molecular Informatics
> Unilever Centre, Dep. Of Chemistry
> University of Cambridge
> CB2 1EW, UK
> +44-1223-763069
>
