Re: How to logically read text from a PDF table?

Manuel Aristarán Tue, 18 Jul 2017 08:34:55 -0700

Hi Dane,

As you might know, there's no such thing as a table in a PDF file. The only
way to extract one is to try to reconstruct the tabular arrangement from
the characters' positions, ruling lines, and so on. I'm one of the
maintainers of Tabula [1], a tool based on PDFBox that implements
a number of algorithms to attempt that. We have a GUI tool [2] and a Java
library [3]. Both are open source (MIT license).
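
To make "reconstructing the tabular arrangement from the characters'
positions" concrete, here is a minimal sketch in plain Java. It is not
Tabula's algorithm or API, only a simple stand-in heuristic: group
characters into rows by similar y coordinate, then split each row into
cells wherever the horizontal gap is large. The TableSketch and Glyph
classes and the rowTolerance and columnGap parameters are hypothetical
names made up for this illustration. It assumes you already have one
positioned entry per character (for example, collected from a PDFBox text
stripper) and that y grows downward.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: a naive position-based heuristic, not Tabula's code.
class TableSketch {

    // Stand-in for one positioned character extracted from a page.
    static final class Glyph {
        final String text;
        final double x, y, width;

        Glyph(String text, double x, double y, double width) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
        }
    }

    // Returns rows of cell strings, top to bottom and left to right.
    static List<List<String>> reconstruct(List<Glyph> glyphs,
                                          double rowTolerance,
                                          double columnGap) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.y));

        // Step 1: glyphs whose y is within rowTolerance of the first glyph
        // of the current row are treated as being on the same row.
        List<List<Glyph>> rows = new ArrayList<>();
        for (Glyph g : sorted) {
            List<Glyph> last = rows.isEmpty() ? null : rows.get(rows.size() - 1);
            if (last != null && Math.abs(g.y - last.get(0).y) <= rowTolerance) {
                last.add(g);
            } else {
                List<Glyph> row = new ArrayList<>();
                row.add(g);
                rows.add(row);
            }
        }

        // Step 2: within a row, start a new cell whenever the gap between
        // the previous glyph's right edge and the next glyph exceeds columnGap.
        List<List<String>> table = new ArrayList<>();
        for (List<Glyph> row : rows) {
            row.sort(Comparator.comparingDouble((Glyph g) -> g.x));
            List<String> cells = new ArrayList<>();
            StringBuilder cell = new StringBuilder();
            double prevRight = 0;
            for (Glyph g : row) {
                if (cell.length() > 0 && g.x - prevRight > columnGap) {
                    cells.add(cell.toString());
                    cell.setLength(0);
                }
                cell.append(g.text);
                prevRight = g.x + g.width;
            }
            cells.add(cell.toString());
            table.add(cells);
        }
        return table;
    }
}
```

Calling TableSketch.reconstruct(glyphs, 2.0, 10.0) with thresholds in page
units would return one list of cell strings per detected row. A heuristic
this simple breaks down on merged cells, wrapped text, and ragged columns,
which is why real extractors combine several signals, ruling lines among
them.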


Best,

[1] http://tabula.technology
[2] https://github.com/tabulapdf/tabula
[3] https://github.com/tabulapdf/tabula-java

--
Manuel Aristarán
jazzido.com



On Tue, Jul 18, 2017 at 9:28 AM, Dane Bezuidenhout <
[email protected]> wrote:

> The examples available are clear on constructing a table, but there is
> little info on reading a table. I've investigated a few solutions to this,
> but feel that they are "hacky" in that they rely on establishing column and
> row regions to read text from.
>
> Surely there is a canonical way to traverse the PDDocument table elements
> and access table cells with reference to row and columns?
>
> Any advice would be appreciated.
>
>
> Dane Bezuidenhout
> SprintHive <https://sprinthive.com/>
>
> M: +27 82 562 7850
>
>
> vCard <http://www.sprinthive.com/files/dane.vcf>
>
