Hello! Nutch has no built-in mechanism to extract microdata from HTML. However, there is a patch for Apache Tika, TIKA-980, that provides this as a content handler. You can embed it into another content handler or use Tika's TeeContentHandler in Nutch's parse-tika plugin. The downside is that you have to transform the output data structure to a Writable in the plugin, otherwise you cannot store it as metadata and run on Hadoop.
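To give a rough idea of what such a content handler does, here is a minimal sketch using only the JDK's SAX classes. It is not the TIKA-980 code: ItempropHandler and extract are hypothetical names, the input is assumed to be well-formed XHTML, and nested items are not handled. In parse-tika, a handler like this would be passed to Tika's TeeContentHandler alongside the existing handler so both receive the same SAX events.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for the TIKA-980 handler: collects itemprop values
// from SAX events. Simplification: nested itemprops are treated as plain text.
class ItempropHandler extends DefaultHandler {
    private final Map<String, List<String>> props = new LinkedHashMap<>();
    private final StringBuilder text = new StringBuilder();
    private String current; // itemprop whose text is being collected, or null
    private int depth;      // child elements open inside the current itemprop

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if (current != null) { depth++; return; }
        String prop = atts.getValue("itemprop");
        if (prop == null) return;
        String content = atts.getValue("content");
        if (content != null) {
            add(prop, content); // e.g. <meta itemprop="price" content="9.99"/>
        } else {
            current = prop;
            depth = 0;
            text.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if (current == null) return;
        if (depth > 0) { depth--; return; }
        add(current, text.toString().trim());
        current = null;
    }

    private void add(String name, String value) {
        props.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    // Parses well-formed XHTML and returns itemprop name -> values.
    static Map<String, List<String>> extract(String xhtml) {
        try {
            ItempropHandler handler = new ItempropHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.props;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<div itemscope=\"\" itemtype=\"http://schema.org/Product\">"
                + "<span itemprop=\"name\">Widget</span>"
                + "<meta itemprop=\"price\" content=\"9.99\"/></div>";
        System.out.println(extract(xhtml)); // prints {name=[Widget], price=[9.99]}
    }
}
```

The result here is a plain map of strings on purpose: string name/value pairs are the easiest shape to copy into the parse metadata, which sidesteps writing a custom Writable for a richer structure.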
https://issues.apache.org/jira/browse/TIKA-980

Markus

-----Original message-----
> From: Manish Verma <[email protected]>
> Sent: Thursday 17th March 2016 19:18
> To: [email protected]
> Subject: Extract Microdata
>
> Hi,
>
> I need to crawl on Urls and extract micro data and save to solr. Does Nutch
> support extraction of schema org micro data.
>
> Thanks

