I agree with all those who emphasise that there is no deterministic algorithm. I also agree that Tabula is likely to be the best place to start, and I am working with them.

The first question is: "How do you know where the tables are?" In some cases you can look for the Anglophone word "Table", with a regex something like:

- "Tab(le)?\s*((\d+)|([IVXL]+)) "

or you can look for

- grid lines

or you can look for whitespace patterns, where the question is whether the alignment is a table or just fortuitous. And some tables use zebra stripes.

I suspect at least 100 person-years (and probably much more) have been spent on trying to extract tables. If we take the heuristic approach then it's worth pooling our efforts and trying to share code. I'm sharing mine at:

https://bitbucket.org/petermr/svg2xml/wiki/Home

(which is built on PDFBox and https://bitbucket.org/petermr/pdf2svg/wiki/Home).

Other people have built systems that use adaptive methods to decide where the whitespace is.

I'd recommend splitting the PDF2Character part (I use SVG for the modelling syntax) from characters2tables, as it means we can use more character extractors and combine them with the table synthesizers.

P.

On Mon, Feb 2, 2015 at 6:56 PM, Frank van der Hulst <[email protected]> wrote:

> I have written a couple of Java classes that extract tabular data to arrays
> of Strings.
>
> One works where the location of each column is fixed. The other figures out
> the locations of columns from the table headers and outline drawing.
>
> The usual story applies... hardly any documentation, and they only work for
> limited cases. I've sent the code to Lorena... I'd be grateful if you could
> improve the documentation.
>
> NB: I'll be out of reach of my computer (and therefore my source code) for
> the next few days, but will probably still be able to answer emails.
>
> Frank
>
>
> On Tue, Feb 3, 2015 at 7:07 AM, Tilman Hausherr <[email protected]> wrote:
>
> > Hi Lorena,
> >
> > There is no concept of table in a PDF, except in a tagged PDF.
> >
> > A table is just lines and text. In no specific order. It could also be an
> > image of a table.
> >
> > You can succeed in this only if you know the structure of the PDF in
> > advance, e.g. when it all comes from the same client.
> >
> > https://stackoverflow.com/questions/23495372/extract-table-data-from-pdf
> > https://stackoverflow.com/questions/17591426/extract-table-from-a-pdf
> > https://stackoverflow.com/questions/17217194/extracting-table-contents-from-a-collection-of-pdf-files
> > https://stackoverflow.com/questions/3424588/programmatically-extract-pdf-tables
> >
> > Tilman
> >
> >
> > On 02.02.2015 at 16:29, Lorena Leishman wrote:
> >
> >> Hi,
> >> I have a PDF that has information displayed on tables. Example:
> >>
> >> Company Name:  Barnes & Noble  Bank Of America  Macy's
> >> Account #:     123xxxxx        345xxxx          679xxxx
> >> Status:        Open            Closed           Open
> >> Balance:       $23.            $0.00            $100
> >>
> >> Is there a way with PDFbox to extract a specific value(s) from the table?
> >> Example: Bank Of America and $0.00
> >> And also is there a way to cut the whole table and paste it into a
> >> different PDF?
> >> Please let me know, Thanks!
> >> Lorena

-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
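
PS: as an illustration of the caption heuristic mentioned at the top of this message, here is a minimal Java sketch. It assumes the text has already been extracted from the PDF by some other tool and only scans that text for "Table N"-style labels. The class and method names (TableCaptionFinder, findLabels) are hypothetical and are not part of PDFBox, Tabula or svg2xml. The pattern is a variation on the regex in the message: it also accepts "Tab." and uses a word boundary so that phrases such as "Table Information" are not mistaken for "Table I". Finding a label only tells you a table is probably nearby; it says nothing about where the cells are.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: finds table labels in text already extracted from a PDF. */
class TableCaptionFinder {

    // "Tab", "Tab." or "Table", optional whitespace, then either an Arabic
    // number or a Roman numeral made of the characters I, V, X, L.
    // The trailing \b rejects cases like "Table Information".
    private static final Pattern CAPTION =
            Pattern.compile("\\bTab(?:le|\\.)?\\s*(\\d+|[IVXL]+)\\b");

    /** Returns the label part ("1", "IV", ...) of every caption-like match, in order. */
    static List<String> findLabels(String text) {
        List<String> labels = new ArrayList<>();
        Matcher m = CAPTION.matcher(text);
        while (m.find()) {
            labels.add(m.group(1));
        }
        return labels;
    }
}
```

The match is deliberately case-sensitive and Anglophone-only, in line with the caveat above. Captions in other languages, or tables with no caption at all, would need the grid-line or whitespace heuristics instead.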

