Hello! Nutch has no built-in mechanism to extract microdata from HTML, but 
there is a patch for Apache Tika, TIKA-980, that provides one as a content 
handler. You can embed it in another content handler or use Tika's 
TeeContentHandler in Nutch's parse-tika plugin. The downside is that you have 
to transform the output data structure into a Writable in the plugin; 
otherwise you cannot store it as metadata and run on Hadoop.

https://issues.apache.org/jira/browse/TIKA-980
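
To give a rough idea of the shape of such a handler, here is a minimal sketch 
using only the JDK's SAX classes. It is not the TIKA-980 patch, and the class 
name MicrodataSketch and its extract() method are made up for illustration. It 
assumes the SAX events reaching the handler still carry the itemprop 
attributes (Tika's HTML mapping may need to be configured to let them 
through), and it ignores nested items. The flat name-to-values map is chosen 
because it should be easy to copy into a Writable such as Nutch's parse 
metadata, one add per value.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical collector: gathers itemprop values from (X)HTML SAX events.
class MicrodataSketch extends DefaultHandler {
    // itemprop name -> all values seen for it, in document order
    private final Map<String, List<String>> props = new LinkedHashMap<>();
    private final StringBuilder text = new StringBuilder();
    private String current; // property whose text is being captured, or null
    private int depth;      // child elements open inside the captured element

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if (current != null) { // inside a captured element: only track nesting
            depth++;
            return;
        }
        String prop = atts.getValue("itemprop");
        if (prop == null) {
            return;
        }
        String content = atts.getValue("content");
        if (content != null) { // e.g. <meta itemprop="price" content="9.99"/>
            add(prop, content);
            return;
        }
        current = prop;        // otherwise the value is the element's text
        depth = 0;
        text.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if (current == null) {
            return;
        }
        if (depth > 0) {       // closing a child of the captured element
            depth--;
            return;
        }
        add(current, text.toString().trim());
        current = null;
    }

    private void add(String name, String value) {
        props.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    // Stand-in for the real parser: feeds well-formed XHTML to the handler.
    static Map<String, List<String>> extract(String xhtml) throws Exception {
        MicrodataSketch handler = new MicrodataSketch();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.props;
    }

    public static void main(String[] args) throws Exception {
        String page = "<div itemscope=\"itemscope\" itemtype=\"http://schema.org/Product\">"
                + "<span itemprop=\"name\">Example <b>Widget</b></span>"
                + "<meta itemprop=\"price\" content=\"9.99\"/>"
                + "</div>";
        System.out.println(extract(page));
    }
}
```

In parse-tika, a handler like this would presumably be passed to Tika's 
TeeContentHandler alongside the plugin's existing handler instead of being 
driven by its own parser as in extract() above, and the collected map would 
then be copied into the parse metadata after parsing.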

Markus

 
 
-----Original message-----
> From:Manish Verma <[email protected]>
> Sent: Thursday 17th March 2016 19:18
> To: [email protected]
> Subject: Extract Microdata
> 
> Hi,
> 
> I need to crawl URLs, extract microdata, and save it to Solr. Does Nutch 
> support extraction of schema.org microdata?
> 
> Thanks
> 
> 
> 
