Also take a look at Scrapy and the work that Hyperion
Grey is doing with Splash and Avatar/HH.

Cheers,
Chris

—
Chris Mattmann
[email protected]

-----Original Message-----
From: Ken Krugler <[email protected]>
Reply-To: <[email protected]>
Date: Thursday, November 12, 2015 at 10:58 AM
To: <[email protected]>
Subject: RE: Extraction table from HTML document in Tika

>There's no (semi)automated method.
>For simple tables you could create a custom ContentHandler that triggers
>off the appropriate HTML tags.
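
A rough sketch of that kind of handler, for illustration only. The class
name TableHandler and the extract() helper are made up for this example,
and the driver uses the JDK's own SAX parser on well-formed XHTML as a
stand-in for Tika. It assumes simple tables with no nesting.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects each <table> as a list of rows of cell text.
// DefaultHandler implements org.xml.sax.ContentHandler.
class TableHandler extends DefaultHandler {
    final List<List<List<String>>> tables = new ArrayList<>();
    private List<List<String>> table;   // table currently being read, or null
    private List<String> row;           // row currently being read, or null
    private StringBuilder cell;         // cell currently being read, or null

    // Use the local name when the parser is namespace-aware, else the qName.
    private static String name(String local, String qName) {
        String n = (local == null || local.isEmpty()) ? qName : local;
        return n.toLowerCase(Locale.ROOT);
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        String tag = name(local, qName);
        if (tag.equals("table")) {
            table = new ArrayList<>();
        } else if (tag.equals("tr") && table != null) {
            row = new ArrayList<>();
        } else if ((tag.equals("td") || tag.equals("th")) && row != null) {
            cell = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Only keep text that appears inside a table cell.
        if (cell != null) {
            cell.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        String tag = name(local, qName);
        if ((tag.equals("td") || tag.equals("th")) && cell != null) {
            row.add(cell.toString().trim());
            cell = null;
        } else if (tag.equals("tr") && row != null) {
            table.add(row);
            row = null;
        } else if (tag.equals("table") && table != null) {
            tables.add(table);
            table = null;
        }
    }

    // Stand-in driver (not Tika): parses well-formed XHTML with the JDK SAX parser.
    static List<List<List<String>>> extract(String xhtml) throws Exception {
        TableHandler handler = new TableHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.tables;
    }
}
```

With Tika you would presumably pass an instance of the handler to
Parser.parse(stream, handler, metadata, context) instead of calling
extract(). Whether the table tags reach the handler depends on the HTML
parser's element mapping, so check that against your Tika version.
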
>
>But a general purpose extractor is a serious technical challenge.
>
>Companies like Factual have invested heavily in being able to find &
>extract this type of structured content from web pages.
>
>There are some open source projects out there that could help; I just
>haven't looked at them recently.
>
>http://blog.import.io/post/get-data-from-html-tables-automatically is an
>example of a commercial solution.
>
>-- Ken
>
>
>________________________________________
>From: Sznajder ForMailingList
> Sent: November 12, 2015 6:49:23am PST
> To: [email protected]
> Subject: Extraction table from HTML document in Tika
> 
>
>Hi
>
>
>Is there a way to extract tables from a HTML document using Tika?
>
>thanks!
>
>
>Benjamin
>
>--------------------------
>Ken Krugler
>+1 530-210-6378
>http://www.scaleunlimited.com
>custom big data solutions & training
>Hadoop, Cascading, Cassandra & Solr
>

