There's no (semi)automated way to do this in Tika.

For simple tables you could create a custom ContentHandler that triggers off 
the appropriate HTML tags.
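
As a rough sketch of that idea (not a tested Tika recipe), the handler below collects cell text row by row from the tr, td and th SAX events. The class name TableCollector and its parse() helper are made up for illustration, and the demo driver uses the JDK's own SAX parser on well-formed XHTML rather than Tika. It ignores nested tables, rowspan and colspan.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the text of each table cell, grouped by row.
class TableCollector extends DefaultHandler {
    private final List<List<String>> rows = new ArrayList<>();
    private List<String> currentRow;    // non-null while inside a <tr>
    private StringBuilder currentCell;  // non-null while inside a <td> or <th>

    List<List<String>> getRows() {
        return rows;
    }

    // Prefer the namespace-local name; fall back to the qualified name.
    private static String tagName(String localName, String qName) {
        String n = (localName == null || localName.isEmpty()) ? qName : localName;
        return n.toLowerCase(Locale.ROOT);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        String tag = tagName(localName, qName);
        if (tag.equals("tr")) {
            currentRow = new ArrayList<>();
        } else if ((tag.equals("td") || tag.equals("th")) && currentRow != null) {
            currentCell = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text outside a cell is ignored.
        if (currentCell != null) {
            currentCell.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String tag = tagName(localName, qName);
        if ((tag.equals("td") || tag.equals("th")) && currentCell != null) {
            currentRow.add(currentCell.toString().trim());
            currentCell = null;
        } else if (tag.equals("tr") && currentRow != null) {
            rows.add(currentRow);
            currentRow = null;
        }
    }

    // Demo driver (an assumption for this sketch): JDK SAX parser, well-formed XHTML only.
    static List<List<String>> parse(String xhtml) throws Exception {
        TableCollector handler = new TableCollector();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getRows();
    }
}
```

With Tika, the same handler could be passed as the ContentHandler argument to Parser.parse(...) instead of going through the demo driver. That assumes your Tika version's HTML parser passes the table elements through to the handler, which is worth checking first.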

But a general purpose extractor is a serious technical challenge.

Companies like Factual have invested heavily in being able to find & extract 
this type of structured content from web pages.

There are some open source projects out there that could help; I just haven't 
looked recently.

http://blog.import.io/post/get-data-from-html-tables-automatically is an 
example of a commercial solution.

-- Ken

> From: Sznajder ForMailingList
> Sent: November 12, 2015 6:49:23am PST
> To: [email protected]
> Subject: Extraction table from HTML document in Tika
> 
> Hi
> 
> Is there a way to extract tables from a HTML document using Tika?
> thanks!
> 
> Benjamin



--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr




