It seems it does not support HTML5 tags; in the given patch, the assert 
statements are failing because of that.
Thanks

> On Mar 18, 2016, at 3:16 AM, Markus Jelsma <[email protected]> wrote:
> 
> Hello! Nutch doesn't have a mechanism to extract microdata from HTML. But 
> there is a patch for Apache Tika that comes as a content handler, TIKA-980. 
> You can embed it into another content handler or use Tika's TeeContentHandler 
> in Nutch's parse-tika plugin. The downside is that you have to transform the 
> output data structure to a Writable in the plugin; otherwise you cannot store 
> it as metadata and run on Hadoop.
> 
> https://issues.apache.org/jira/browse/TIKA-980
> 
> Markus
> 
> 
> 
> -----Original message-----
>> From:Manish Verma <[email protected]>
>> Sent: Thursday 17th March 2016 19:18
>> To: [email protected]
>> Subject: Extract Microdata
>> 
>> Hi,
>> 
>> I need to crawl URLs, extract microdata, and save it to Solr. Does Nutch 
>> support extraction of schema.org microdata?
>> 
>> Thanks
>> 
>> 
>> 
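
For readers of this thread: the content-handler idea Markus describes can be 
sketched roughly as below. This is not the TIKA-980 patch and uses neither Tika 
nor Nutch, only the SAX API that ships with the JDK. The class name 
ItemPropHandler and the extract helper are made up for illustration. It assumes 
well-formed XHTML input, only looks at itemprop attributes, and flattens nested 
properties into text, so treat it as a starting point rather than a full 
microdata parser.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects itemprop name -> values from SAX events.
class ItemPropHandler extends DefaultHandler {
    private final Map<String, List<String>> props = new LinkedHashMap<>();
    private final StringBuilder text = new StringBuilder();
    private String current; // itemprop whose text is being collected, or null
    private int depth;      // child elements currently open inside that itemprop

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if (current != null) {
            depth++; // nested element: keep collecting its text
            return;
        }
        String prop = atts.getValue("itemprop");
        if (prop == null) {
            return;
        }
        String content = atts.getValue("content");
        if (content != null) {
            add(prop, content); // e.g. <meta itemprop="..." content="..."/>
        } else {
            current = prop;
            depth = 0;
            text.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if (current == null) {
            return;
        }
        if (depth == 0) {
            add(current, text.toString().trim()); // the itemprop element closed
            current = null;
        } else {
            depth--;
        }
    }

    private void add(String name, String value) {
        props.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    Map<String, List<String>> getProps() {
        return props;
    }

    // Convenience helper (also hypothetical): parse an XHTML string.
    static Map<String, List<String>> extract(String xhtml) throws Exception {
        ItemPropHandler handler = new ItemPropHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getProps();
    }
}
```

In a Nutch plugin, a handler like this would receive the SAX events alongside 
the regular one, and the resulting map would still have to be converted to a 
Writable before it can be stored as metadata, as noted above.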
