Also take a look at Scrapy and the work that Hyperion Grey is doing with Splash and Avatar/HH.
Cheers,
Chris

Chris Mattmann
[email protected]

-----Original Message-----
From: Ken Krugler <[email protected]>
Reply-To: <[email protected]>
Date: Thursday, November 12, 2015 at 10:58 AM
To: <[email protected]>
Subject: RE: Extraction table from HTML document in Tika

>There's no (semi)automated method.
>For simple tables you could create a custom ContentHandler that triggers
>off appropriate HTML tags.
>
>But a general purpose extractor is a serious technical challenge.
>
>Companies like Factual have invested heavily in being able to find &
>extract this type of structured content from web pages.
>
>There are some open source projects out there which could help, I just
>haven't looked recently.
>
>http://blog.import.io/post/get-data-from-html-tables-automatically is an
>example of a commercial solution.
>
>-- Ken
>
>________________________________________
>From: Sznajder ForMailingList
>Sent: November 12, 2015 6:49:23am PST
>To: [email protected]
>Subject: Extraction table from HTML document in Tika
>
>Hi
>
>Is there a way to extract tables from a HTML document using Tika?
>
>thanks!
>
>Benjamin
>
>--------------------------
>Ken Krugler
>+1 530-210-6378
>http://www.scaleunlimited.com
>custom big data solutions & training
>Hadoop, Cascading, Cassandra & Solr
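
For reference, the custom ContentHandler approach mentioned in the quoted reply could look roughly like the sketch below. It is only an illustration, not code from this thread: the class name `TableCollector` and its `parse` helper are made up, it uses the JDK's own SAX parser on well-formed XHTML as a stand-in for Tika, and it assumes simple, non-nested tables.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: collects each table as a list of rows of cell text. */
class TableCollector extends DefaultHandler {
    final List<List<List<String>>> tables = new ArrayList<>();
    private List<List<String>> table; // table currently being read, or null
    private List<String> row;         // row currently being read, or null
    private StringBuilder cell;       // cell currently being read, or null

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        String tag = name(local, qName);
        if (tag.equals("table")) {
            table = new ArrayList<>();
        } else if (tag.equals("tr") && table != null) {
            row = new ArrayList<>();
        } else if ((tag.equals("td") || tag.equals("th")) && row != null) {
            cell = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Only keep text that appears inside a table cell.
        if (cell != null) {
            cell.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        String tag = name(local, qName);
        if ((tag.equals("td") || tag.equals("th")) && cell != null) {
            row.add(cell.toString().trim());
            cell = null;
        } else if (tag.equals("tr") && row != null) {
            table.add(row);
            row = null;
        } else if (tag.equals("table") && table != null) {
            tables.add(table);
            table = null;
        }
    }

    // Works whether or not the parser is namespace-aware.
    private static String name(String local, String qName) {
        String n = (local == null || local.isEmpty()) ? qName : local;
        return n.toLowerCase(Locale.ROOT);
    }

    /** Stand-in driver: parses well-formed XHTML with the JDK SAX parser. */
    static List<List<List<String>>> parse(String xhtml) throws Exception {
        TableCollector handler = new TableCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.tables;
    }
}
```

With Tika, the same handler would presumably be passed as the ContentHandler argument of `Parser.parse(stream, handler, metadata, context)` instead of going through the `parse` helper. Whether the table tags actually reach the handler depends on how the HTML parser maps elements, so that part should be verified against the Tika version in use.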
