Re: How to use pdfbox extract table datas in pdf ?

Ilija Pavlic Wed, 11 Jan 2012 06:52:27 -0800

There is no built-in functionality in pdfbox for retrieving tabular
data, because PDF documents (usually) contain no table mark-up.
Instead, a table is usually represented as absolutely positioned text,
with lines drawn around that text to form the borders of the table.

It is possible to find all the lines that form a table, but exactly how
depends heavily on the document in question. For instance, some
documents use three overlapping lines instead of a single thick line.
The answer to my recent question about finding lines in a document
explains how to use PDF operators for that. While it is certainly
possible with pdfbox, I haven't managed to do it yet, so I cannot give
more detailed information.
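
To illustrate the overlapping-lines problem mentioned above, here is a
minimal sketch of one way to collapse near-duplicate line positions into
a single value. It assumes you have already collected the positions of
the lines somehow (for example the y coordinates of all horizontal
lines). The class name LineMerger, the method merge and the tolerance
value are my own illustrative choices, not part of pdfbox.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not a pdfbox class.
public class LineMerger {

    /**
     * Collapses coordinates that lie close together into one averaged value.
     * A value joins the current cluster when it is within `tolerance` of the
     * previous sorted value, so long chains of close values end up in one
     * cluster.
     */
    public static List<Double> merge(double[] positions, double tolerance) {
        double[] sorted = positions.clone();
        Arrays.sort(sorted);

        List<Double> merged = new ArrayList<>();
        int i = 0;
        while (i < sorted.length) {
            double sum = sorted[i];
            int count = 1;
            int j = i + 1;
            // Extend the cluster while the next value stays close to the previous one.
            while (j < sorted.length && sorted[j] - sorted[j - 1] <= tolerance) {
                sum += sorted[j];
                count++;
                j++;
            }
            merged.add(sum / count); // one representative position per cluster
            i = j;
        }
        return merged;
    }
}
```

With a tolerance of about one unit, three lines drawn at 100, 100.5 and
101 would be treated as a single line. A suitable tolerance probably
depends on the document.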

Another (somewhat more complex) option is:
1. Remove all text from the page.
2. Render the page to a PNG.
3. Find the horizontal and vertical lines in the image using a line
detection algorithm such as the Hough transform.
4. Find the intersections of the detected lines. They form a grid, and
you can read the text of each grid cell with PDFTextStripperByArea.
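
The last step can be sketched roughly as follows. This is only an
illustration under simplifying assumptions: the table is a full regular
grid (no merged cells), and the x positions of the vertical lines and
the y positions of the horizontal lines are already known. The class
name TableGrid and the method cells are hypothetical names of mine.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not a pdfbox class.
public class TableGrid {

    /**
     * Builds one rectangle per grid cell from the x positions of the vertical
     * lines and the y positions of the horizontal lines. Cells are returned
     * row by row, in order of increasing y and then increasing x.
     * Assumes a regular grid without merged cells.
     */
    public static List<Rectangle2D> cells(double[] xs, double[] ys) {
        double[] x = xs.clone();
        double[] y = ys.clone();
        Arrays.sort(x);
        Arrays.sort(y);

        List<Rectangle2D> cells = new ArrayList<>();
        for (int row = 0; row + 1 < y.length; row++) {
            for (int col = 0; col + 1 < x.length; col++) {
                // Each cell spans from one line to the next in both directions.
                cells.add(new Rectangle2D.Double(
                        x[col], y[row],
                        x[col + 1] - x[col], y[row + 1] - y[row]));
            }
        }
        return cells;
    }
}
```

Each rectangle could then be registered with
PDFTextStripperByArea.addRegion(name, rect) and, after a call to
extractRegions(page), read back with getTextForRegion(name). Coordinates
found in a rendered image will most likely need to be scaled back to
page units first, and it is worth checking which coordinate origin the
regions expect.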

BR,
Ilija.

On Wed, Jan 11, 2012 at 3:42 PM, Kevin Brown <[email protected]> wrote:
> I have not been able to do this. I am not sure it is possible with pdfbox.
> Have you had any luck? If you have, please post.
>
> Kevin
>
> 2012/1/10 金永梁 <[email protected]>
>
>> Hi all,
>>
>> I have a requirement to extract table data from PDF files. The data
>> needs to keep its structure, for example by storing it in XML format.
>>
>> How can I accomplish this?
>>
>> The main difficulty for me is that I don't know where a table begins and
>> ends. How can I determine that? By the lines?
