There's no (semi)automated method. For simple tables you could create a custom ContentHandler that triggers off the appropriate HTML tags.
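
As a rough illustration of the custom ContentHandler idea, here is a minimal sketch built only on the JDK's SAX classes. The class name TableHandler and the helper parseXhtml are made up for this example and are not part of Tika. It assumes simple, non-nested tables, and it assumes the table/tr/th/td events actually reach the handler, which with Tika may depend on how the HTML parser's element mapping is configured.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: collects cell text of simple (non-nested) tables from SAX events. */
class TableHandler extends DefaultHandler {
    // tables -> rows -> cell strings
    private final List<List<List<String>>> tables = new ArrayList<>();
    private List<List<String>> currentTable;
    private List<String> currentRow;
    private StringBuilder currentCell;

    // Namespace-aware sources fill in localName; others only provide qName.
    private static String name(String localName, String qName) {
        String n = (localName == null || localName.isEmpty()) ? qName : localName;
        return n.toLowerCase(Locale.ROOT);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        switch (name(localName, qName)) {
            case "table": currentTable = new ArrayList<>(); break;
            case "tr": if (currentTable != null) currentRow = new ArrayList<>(); break;
            case "td":
            case "th": if (currentRow != null) currentCell = new StringBuilder(); break;
            default: break;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Only keep text that appears inside an open cell.
        if (currentCell != null) currentCell.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        switch (name(localName, qName)) {
            case "td":
            case "th":
                if (currentCell != null) {
                    currentRow.add(currentCell.toString().trim());
                    currentCell = null;
                }
                break;
            case "tr":
                if (currentRow != null) { currentTable.add(currentRow); currentRow = null; }
                break;
            case "table":
                if (currentTable != null) { tables.add(currentTable); currentTable = null; }
                break;
            default: break;
        }
    }

    List<List<List<String>>> getTables() { return tables; }

    /** Demo helper (not Tika): drives the handler with the JDK SAX parser on well-formed XHTML. */
    static List<List<List<String>>> parseXhtml(String xhtml) {
        try {
            TableHandler handler = new TableHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.getTables();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }
}
```

With Tika, an instance of such a handler would be passed as the ContentHandler argument of the parser's parse(stream, handler, metadata, context) call instead of going through the JDK SAX parser; afterwards getTables() holds whatever tables were seen. Nested tables, rowspan/colspan, and layout-only tables are not handled here.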
But a general purpose extractor is a serious technical challenge. Companies like Factual have invested heavily in being able to find & extract this type of structured content from web pages.

There are some open source projects out there which could help, I just haven't looked recently.

http://blog.import.io/post/get-data-from-html-tables-automatically is an example of a commercial solution.

-- Ken

> From: Sznajder ForMailingList
> Sent: November 12, 2015 6:49:23am PST
> To: [email protected]
> Subject: Extraction table from HTML document in Tika
>
> Hi
>
> Is there a way to extract tables from a HTML document using Tika?
> thanks!
>
> Benjamin

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
