I agree with everything Peter has said. My 'solutions' work for the tables I wanted to extract, but won't work for others.
I think it's common for a table to have at least one column which is present in every row. I use that to break the table up into rows... when the X position goes back to the left of this column it is (probably) a new row. I have also used the header row of the table to identify the column limits.

Incidentally, it's also a good idea to strip out the page headers & footers before trying to parse a table.

I'm happy for my work to be included in PDFBox as a starting point for other people trying to extract tables.

Frank

On Tue, Feb 3, 2015 at 8:48 AM, Peter Murray-Rust <[email protected]> wrote:

> I agree with all those who emphasise that there is no deterministic
> algorithm. I also agree that Tabula is likely to be the best place to start
> and am working with them.
>
> The first question is:
>
> "How do you know where the tables are?"
>
> In some cases you can look for the Anglophone word "Table", and a regex of
> something like:
> - "Tab(le)?\s*((\d+)|([IVXL]+)) "
> or you can look for
> - grid lines
> or you can look for whitespace patterns:
>
> Is    this
> a     table
>
> or just fortuitous.
>
> and some tables use zebra stripes.
>
> I suspect at least 100 person years (and probably much more) have been
> spent on trying to extract tables. If we take the heuristic approach then
> it's worth pooling our efforts and trying to share code. I'm sharing mine
> on:
> https://bitbucket.org/petermr/svg2xml/wiki/Home (which is built on PDFBox
> and https://bitbucket.org/petermr/pdf2svg/wiki/Home).
>
> Other people have built systems that use adaptive methods to decide where
> the whitespace is.
>
> I'd recommend splitting the PDF2Character part (I use SVG for the modelling
> syntax) and characters2tables as it means we can use more character
> extractors and combine them with the table synthesizers.
>
> P.
>
> On Mon, Feb 2, 2015 at 6:56 PM, Frank van der Hulst <[email protected]> wrote:
>
> > I have written a couple of Java classes that extract tabular data to arrays
> > of Strings.
> >
> > One works where the location of each column is fixed. The other figures out
> > the locations of columns from the table headers and outline drawing.
> >
> > The usual story applies... hardly any documentation, and they only work for
> > limited cases. I've sent the code to Lorena... I'd be grateful if you could
> > improve the documentation.
> >
> > NB: I'll be out of reach of my computer (and therefore my source code) for
> > the next few days, but will probably still be able to answer emails.
> >
> > Frank
> >
> > On Tue, Feb 3, 2015 at 7:07 AM, Tilman Hausherr <[email protected]> wrote:
> >
> > > Hi Lorena,
> > >
> > > There is no concept of table in a PDF, except in a tagged PDF.
> > >
> > > A table is just lines and text. In no specific order. It could also be an
> > > image of a table.
> > >
> > > You can succeed in this only if you know the structure of the PDF in
> > > advance, e.g. when it all comes from the same client.
> > >
> > > https://stackoverflow.com/questions/23495372/extract-table-data-from-pdf
> > > https://stackoverflow.com/questions/17591426/extract-table-from-a-pdf
> > > https://stackoverflow.com/questions/17217194/extracting-table-contents-from-a-collection-of-pdf-files
> > > https://stackoverflow.com/questions/3424588/programmatically-extract-pdf-tables
> > >
> > > Tilman
> > >
> > > On 02.02.2015 at 16:29, Lorena Leishman wrote:
> > >
> > >> Hi,
> > >> I have a PDF that has information displayed on tables. Example:
> > >> Company Name:  Barnes & Noble   Bank Of America   Macy's
> > >> Account #:     123xxxxx         345xxxx           679xxxx
> > >> Status:        Open             Closed            Open
> > >> Balance:       $23.             $0.00             $100
> > >> Is there a way with PDFbox to extract a specific value(s) from the table?
> > >> Example: Bank Of America and $0.00
> > >> And also is there a way to cut the whole table and paste it into a
> > >> different PDF?
> > >> Please let me know, Thanks!
> > >> Lorena
>
> --
> Peter Murray-Rust
> Reader in Molecular Informatics
> Unilever Centre, Dep. Of Chemistry
> University of Cambridge
> CB2 1EW, UK
> +44-1223-763069
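
For anyone who wants to try the row-splitting heuristic described at the top of this message, here is a minimal sketch. It is not the code sent to Lorena (that is not included in this thread), and every name in it (TableRowSplitter, Fragment, split, columnStarts) is hypothetical. It assumes the text fragments have already been extracted in reading order together with their X positions, that the column start positions have been read off the header row, and that the first column is the one present in every row.

```java
import java.util.ArrayList;
import java.util.List;

public class TableRowSplitter {

    /** Hypothetical holder: a piece of text and the X position of its left edge. */
    public static final class Fragment {
        final String text;
        final float x;

        public Fragment(String text, float x) {
            this.text = text;
            this.x = x;
        }
    }

    /**
     * Splits fragments (assumed to be in reading order) into rows of cells.
     * columnStarts holds the left edge of each column, e.g. taken from the
     * header row. Assumption: column 0 is present in every row, so a new row
     * (probably) starts when X returns to column 0 after moving right of it.
     */
    public static List<String[]> split(List<Fragment> fragments, float[] columnStarts) {
        List<String[]> rows = new ArrayList<>();
        // Anything left of the second column's start counts as the first column.
        float firstColumnLimit = columnStarts.length > 1 ? columnStarts[1] : Float.MAX_VALUE;
        String[] current = null;
        boolean movedRight = false;

        for (Fragment f : fragments) {
            boolean inFirstColumn = f.x < firstColumnLimit;
            if (current == null || (inFirstColumn && movedRight)) {
                current = new String[columnStarts.length];
                rows.add(current);
                movedRight = false;
            }
            if (!inFirstColumn) {
                movedRight = true;
            }
            int col = columnIndex(f.x, columnStarts);
            // Several fragments in the same cell are joined with a space.
            current[col] = (current[col] == null) ? f.text : current[col] + " " + f.text;
        }
        return rows;
    }

    /** Returns the index of the last column whose start is at or left of x. */
    static int columnIndex(float x, float[] columnStarts) {
        int col = 0;
        for (int i = 0; i < columnStarts.length; i++) {
            if (x >= columnStarts[i]) {
                col = i;
            }
        }
        return col;
    }
}
```

Calling split with fragments such as "Bank Of America" at X=10, "Closed" at X=100 and "$0.00" at X=200, with column starts {10, 100, 200}, gives one String[] per row. Like the original classes, this only works for limited cases: a first-column cell that wraps onto a second line after the other columns have been read will be treated as a new row.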

