Yeah, I'm having some luck... it's not elegant, but it's working. What I'm doing is looking for the table header text and finding its starting and ending X positions; then, because I know roughly how wide my table is, I extract all the subsequent rows that fall within this X range.
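A minimal sketch of that approach in plain Java, for illustration only. It makes no PDFBox calls: the `Chunk` class is a hypothetical stand-in for text runs whose X positions have already been collected (for example from PDFBox's `TextPosition` objects), and `extractTable`, `tableWidth`, and `maxRows` are made-up names. It also assumes the table's left edge lines up with the start of the header text, which may not hold for every document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class TableByHeaderSketch {

    /** Hypothetical text run with its horizontal extent on the page. */
    static final class Chunk {
        final String text;
        final float xStart;
        final float xEnd;

        Chunk(String text, float xStart, float xEnd) {
            this.text = text;
            this.xStart = xStart;
            this.xEnd = xEnd;
        }
    }

    /**
     * Finds the first chunk matching the header regex, then collects the text of
     * up to maxRows following rows, keeping only chunks that lie entirely inside
     * [header.xStart, header.xStart + tableWidth]. Rows with nothing in range are skipped.
     */
    static List<String> extractTable(List<List<Chunk>> rows, Pattern header,
                                     float tableWidth, int maxRows) {
        for (int r = 0; r < rows.size(); r++) {
            for (Chunk h : rows.get(r)) {
                if (!header.matcher(h.text).find()) {
                    continue;
                }
                // Assumption: the table starts at the header's left edge.
                float left = h.xStart;
                float right = left + tableWidth;
                List<String> out = new ArrayList<>();
                int last = Math.min(rows.size(), r + 1 + maxRows);
                for (int i = r + 1; i < last; i++) {
                    StringBuilder sb = new StringBuilder();
                    for (Chunk c : rows.get(i)) {
                        if (c.xStart >= left && c.xEnd <= right) {
                            if (sb.length() > 0) {
                                sb.append(' ');
                            }
                            sb.append(c.text);
                        }
                    }
                    if (sb.length() > 0) {
                        out.add(sb.toString());
                    }
                }
                return out;
            }
        }
        return new ArrayList<>(); // header not found
    }

    public static void main(String[] args) {
        // Two side-by-side tables; the third row belongs only to the right-hand one.
        List<List<Chunk>> rows = new ArrayList<>();
        rows.add(Arrays.asList(new Chunk("Table 1", 10f, 50f), new Chunk("Table2", 200f, 240f)));
        rows.add(Arrays.asList(new Chunk("1 2 3 4", 10f, 60f), new Chunk("1 2 3 4", 200f, 250f)));
        rows.add(Arrays.asList(new Chunk("5 6 7 8", 200f, 250f)));

        // \s* tolerates any number of spaces between "Table" and "2".
        System.out.println(extractTable(rows, Pattern.compile("Table\\s*2"), 100f, 5));
    }
}
```

With real PDF coordinates the range check would probably need a small tolerance, since text positions rarely line up exactly.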
It's got lots of issues that are not ideal.

- Sometimes the table header (something that makes a unique string to look for) is two or three rows... I can't handle this. Btw, I use regex for the header text because you can't be certain of how many spaces will be in the string.
- It would be nice if it could figure out how wide the table is... it has the boundary/rectangle info... but I don't know how to get this info, so I am telling it how wide the table is.
- I have to tell it what the max number of table rows is... because again I don't know how to get the boundary/rectangle info, which knows where the table ends.

Other than this... it's working.

-Dave

P.S. The newer iText has a context-aware parsing strategy... but it costs thousands of $... too rich for me.

On Thu, Dec 30, 2010 at 8:37 AM, Kevin Brown <[email protected]> wrote:

> Any luck with this? I couldn't figure a way to do this with PDFBox, or
> anything else.
>
> The best tool I've ever seen is something called BCL Jade which allows you
> to extract zones by selecting them. It's non programmable and not supported
> or sold any more!
>
> On Tue, Dec 28, 2010 at 11:26 PM, David Hoffer <[email protected]> wrote:
>
> > Hi I'm new to PDFBox and need to do PDF text extraction but the standard
> > PDFTextStripper behavior isn't what I need. The problem with
> > PDFTextStripper is that it left aligns all the output so you have no way
> > of knowing where in the horizontal position the text came from.
> >
> > I have to extract text from (small) tables within the document and I need
> > to know which table the data came from. A simple example might be:
> >
> > Table 1          Table2
> > 1 2 3 4          1 2 3 4
> >                  5 6 7 8
> >
> > PDFTextStripper can output all 3 rows of this document and rows 1 & 2 are
> > fine but it will left align row 3 so there is no way of knowing that it
> > was part of Table 2 and not Table 1.
> >
> > What I can't show is that there is table formatting (rectangles) around
> > all the tables.
> >
> > How can I use PDFBox to extract the data keeping it context aware?
> > Ideally getting each table (I know the text in the top) and then
> > extracting text like PDFTextStripper does would be great.
> >
> > What's the best way to do this?
> >
> > -Dave

