Hello - with TIKA-1193 committed, you can tell TagSoup to allow HTML5 tags. 
This works in Nutch 1.11 as well; we are using it in conjunction with our own 
parser plugin and it works very well. TIKA-980 also depends on this. It is a 
bit of a workaround where TagSoup is concerned, but it fits. Because of 
TIKA-1193, our parser can process HTML5 in its ContentHandler.
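
To give a rough idea of what such a ContentHandler can look like, here is a 
minimal sketch. It is not the TIKA-980 patch and not our plugin: the class 
names (MicrodataSketch, ItempropHandler) are made up for illustration, and the 
JDK's own SAX parser on well-formed XHTML stands in for the TagSoup-backed 
event stream. It only collects the text of elements carrying an itemprop 
attribute, including HTML5 elements such as <time>.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class MicrodataSketch {

    /** Hypothetical handler: records "itemprop=text" pairs from SAX events. */
    static class ItempropHandler extends DefaultHandler {
        final List<String> properties = new ArrayList<>();
        private final StringBuilder text = new StringBuilder();
        private String currentProp; // itemprop currently being read, or null
        private int depth;          // element nesting inside that property

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes atts) {
            if (currentProp != null) {
                // Simplification: nested elements only contribute their text.
                depth++;
                return;
            }
            String prop = atts.getValue("itemprop");
            if (prop != null) {
                currentProp = prop;
                depth = 1;
                text.setLength(0);
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (currentProp != null) {
                text.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (currentProp != null && --depth == 0) {
                properties.add(currentProp + "=" + text.toString().trim());
                currentProp = null;
            }
        }
    }

    /** Parses well-formed XHTML and returns the collected properties. */
    static List<String> extract(String xhtml) throws Exception {
        ItempropHandler handler = new ItempropHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.properties;
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body>"
                + "<article itemscope=\"itemscope\" itemtype=\"http://schema.org/Article\">"
                + "<header><h1 itemprop=\"headline\">Hello</h1></header>"
                + "<time itemprop=\"datePublished\">2016-03-30</time>"
                + "</article></body></html>";
        System.out.println(extract(page));
    }
}
```

In a real plugin the same handler would receive its events from Tika instead 
of the JDK parser, and the collected values would still have to be converted 
into something Nutch can store.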

Try it out  :)
Markus

 
 
-----Original message-----
> From:Manish Verma <[email protected]>
> Sent: Wednesday 30th March 2016 18:57
> To: [email protected]
> Subject: Re: Extract Microdata
> 
> 
> It seems it does not support HTML5 tags; in the given patch the assert 
> statements are failing because of that.
> Thanks
> 
> > On Mar 18, 2016, at 3:16 AM, Markus Jelsma <[email protected]> 
> > wrote:
> > 
> > Hello! Nutch doesn't have a mechanism to extract microdata from HTML. But 
> > there is a patch for Apache Tika that comes as a content handler, TIKA-980. 
> > You can embed it into another content handler or use Tika's 
> > TeeContentHandler in Nutch's parse-tika plugin. The downside is that you 
> > have to transform the output data structure to a Writable in the plugin, 
> > otherwise you cannot store it as metadata and run on Hadoop.
> > 
> > https://issues.apache.org/jira/browse/TIKA-980
> > 
> > Markus
> > 
> > 
> > 
> > -----Original message-----
> >> From:Manish Verma <[email protected]>
> >> Sent: Thursday 17th March 2016 19:18
> >> To: [email protected]
> >> Subject: Extract Microdata
> >> 
> >> Hi,
> >> 
> >> I need to crawl URLs, extract microdata, and save it to Solr. Does 
> >> Nutch support extraction of schema.org microdata?
> >> 
> >> Thanks
> >> 
> >> 
> >> 
> 
> 
