There is no built-in functionality in pdfbox for retrieving tabular data, because PDF documents (usually) contain no table mark-up. Instead, a table is usually represented as absolutely positioned text, with lines drawn around that text to form the borders of the table.
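To make this concrete: if the positions of the border lines were already known, the cell rectangles could be derived from them. The following is only a minimal sketch in plain Java under that assumption. The class `TableGrid`, its methods, and the `tolerance` parameter are hypothetical names and not part of pdfbox. It also assumes a regular grid without merged cells, and it collapses nearly identical coordinates, which is one way to cope with several thin lines drawn next to each other to look like one thick line.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of pdfbox.
class TableGrid {

    // Sorts the coordinates and drops every value that lies within
    // `tolerance` of the previously kept value, so that a cluster of
    // near-duplicate lines is represented by its smallest coordinate.
    static List<Double> mergeClose(double[] coords, double tolerance) {
        double[] sorted = coords.clone();
        Arrays.sort(sorted);
        List<Double> merged = new ArrayList<>();
        for (double c : sorted) {
            if (merged.isEmpty() || c - merged.get(merged.size() - 1) > tolerance) {
                merged.add(c);
            }
        }
        return merged;
    }

    // Builds one rectangle per cell, row by row, from the x positions of
    // the vertical lines and the y positions of the horizontal lines.
    // Assumption: every line spans the whole table (no merged cells).
    static List<Rectangle2D> cells(double[] verticalXs, double[] horizontalYs,
                                   double tolerance) {
        List<Double> xs = mergeClose(verticalXs, tolerance);
        List<Double> ys = mergeClose(horizontalYs, tolerance);
        List<Rectangle2D> result = new ArrayList<>();
        for (int row = 0; row + 1 < ys.size(); row++) {
            for (int col = 0; col + 1 < xs.size(); col++) {
                double x = xs.get(col);
                double y = ys.get(row);
                result.add(new Rectangle2D.Double(
                        x, y, xs.get(col + 1) - x, ys.get(row + 1) - y));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up coordinates: the left border is drawn as three thin lines.
        double[] xs = {50, 50.4, 50.8, 200, 350};
        double[] ys = {100, 130, 160};
        for (Rectangle2D cell : cells(xs, ys, 1.0)) {
            System.out.println(cell);
        }
    }
}
```

Each resulting rectangle could then be registered with `PDFTextStripperByArea.addRegion(String, Rectangle2D)` and read back with `getTextForRegion(String)` after calling `extractRegions(PDPage)`, provided the rectangles are expressed in the coordinate system the stripper expects.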
It is possible to find all the lines that form a table, but exactly how that works depends heavily on the document in question. For instance, some documents draw three overlapping thin lines instead of one thick line. See the answer to my recent question about finding lines in a document for how to use PDF operators to find them. While it is certainly possible with pdfbox, I haven't managed to do it yet, so I cannot give more detailed information.

Another (somewhat complex) option is:

1. Remove all text on a page.
2. Render the page to a PNG.
3. Find horizontal and vertical lines in the image using a line detection algorithm such as the Hough transform.
4. Find the intersections of the detected lines. They form a grid whose cells you can read with PDFTextStripperByArea.

BR,
Ilija

On Wed, Jan 11, 2012 at 3:42 PM, Kevin Brown wrote:
> I have not been able to do this. I am not sure it is possible with pdfbox.
> Have you had any luck? If you have, please post.
>
> Kevin
>
> 2012/1/10 金永梁
>
>> Hi all,
>>
>> I have a requirement to extract table data from PDF files. The data
>> needs to keep its structure, for example by storing it in XML format.
>>
>> How can I accomplish this?
>>
>> The main difficulty for me is that I don't know where a table begins
>> and ends. How can I judge that? According to the lines?
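For step 3 of the rendering approach, a full Hough transform may not be necessary when the table borders are strictly horizontal and vertical. The sketch below swaps in a simpler technique, a run-length scan: it reports every pixel row (or column) that contains an unbroken run of dark pixels of at least a given length. It assumes the page has already been rendered to a `BufferedImage` with the text removed; the rendering itself is not shown. The class `LineScan`, its method names, the darkness threshold of 128, and the `minLength` parameter are all hypothetical choices for illustration.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of pdfbox.
class LineScan {

    // Treats a pixel as dark when its average RGB value is below 128
    // (an arbitrary threshold chosen for this sketch).
    static boolean isDark(BufferedImage img, int x, int y) {
        int rgb = img.getRGB(x, y);
        int r = (rgb >> 16) & 0xFF;
        int g = (rgb >> 8) & 0xFF;
        int b = rgb & 0xFF;
        return (r + g + b) / 3 < 128;
    }

    // Returns the y index of every pixel row that contains a horizontal
    // run of at least minLength consecutive dark pixels.
    static List<Integer> horizontalLineRows(BufferedImage img, int minLength) {
        List<Integer> rows = new ArrayList<>();
        for (int y = 0; y < img.getHeight(); y++) {
            int run = 0;
            for (int x = 0; x < img.getWidth(); x++) {
                run = isDark(img, x, y) ? run + 1 : 0;
                if (run >= minLength) {
                    rows.add(y);
                    break;
                }
            }
        }
        return rows;
    }

    // Returns the x index of every pixel column that contains a vertical
    // run of at least minLength consecutive dark pixels.
    static List<Integer> verticalLineColumns(BufferedImage img, int minLength) {
        List<Integer> columns = new ArrayList<>();
        for (int x = 0; x < img.getWidth(); x++) {
            int run = 0;
            for (int y = 0; y < img.getHeight(); y++) {
                run = isDark(img, x, y) ? run + 1 : 0;
                if (run >= minLength) {
                    columns.add(x);
                    break;
                }
            }
        }
        return columns;
    }

    public static void main(String[] args) {
        // Synthetic stand-in for a rendered page: a white image with the
        // outline of a single table cell drawn in black.
        BufferedImage img = new BufferedImage(200, 100, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = img.createGraphics();
        g.setColor(Color.WHITE);
        g.fillRect(0, 0, 200, 100);
        g.setColor(Color.BLACK);
        g.fillRect(10, 20, 180, 1);   // top border
        g.fillRect(10, 80, 180, 1);   // bottom border
        g.fillRect(10, 20, 1, 61);    // left border
        g.fillRect(189, 20, 1, 61);   // right border
        g.dispose();
        System.out.println("rows: " + horizontalLineRows(img, 50));
        System.out.println("columns: " + verticalLineColumns(img, 50));
    }
}
```

A border that is several pixels thick shows up as several adjacent rows or columns, so the detected positions would still need to be merged before they are turned into a grid. Lines that are slightly rotated or broken by anti-aliasing are where a real Hough transform becomes the better fit.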

