So Nutch 1.11 already has the TIKA-1193 patch? What exactly do I have to do to crawl microdata?
> On Mar 30, 2016, at 1:33 PM, Markus Jelsma <[email protected]> wrote:
>
> Hello - with TIKA-1193 committed, you can tell TagSoup to allow HTML5 tags.
> This works in Nutch 1.11 as well, we are using it in conjunction with our own
> parser plugin and works very well. TIKA-980 also depends on this. It is a
> bit of a work-around concerning TagSoup but it fits. Our parser can, because
> of TIKA-1193 process HTML5 in its ContentHandler.
>
> Try it out :)
> Markus
>
>
>
> -----Original message-----
>> From:Manish Verma <[email protected]>
>> Sent: Wednesday 30th March 2016 18:57
>> To: [email protected]
>> Subject: Re: Extract Microdata
>>
>>
>> seems it does not support HTML5 tags,in given patch the assert statements
>> are failing because of that.
>> Thanks
>>
>>> On Mar 18, 2016, at 3:16 AM, Markus Jelsma <[email protected]>
>>> wrote:
>>>
>>> Hello! Nutch doesn't have a mechanism to extract microdata from HTML. But
>>> there is a patch for Apache Tika that comes as a content handler, TIKA-980.
>>> You can embed it into another content handler or use Tika's
>>> TeeContentHandler in Nutch' parse-tika plugin. Downside is that you have to
>>> transform the output data structure to a Writable in the plugin, otherwise
>>> you cannot store it as metadata and run on Hadoop.
>>>
>>> https://issues.apache.org/jira/browse/TIKA-980
>>>
>>> Markus
>>>
>>>
>>>
>>> -----Original message-----
>>>> From:Manish Verma <[email protected]>
>>>> Sent: Thursday 17th March 2016 19:18
>>>> To: [email protected]
>>>> Subject: Extract Microdata
>>>>
>>>> Hi,
>>>>
>>>> I need to crawl on Urls and extract micro data and save to solr. Does
>>>> Nutch support extraction of schema org micro data.
>>>>
>>>> Thanks
>>>>
>>>>
>>>>
>>
>>
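
For reference, the content-handler approach described in the quoted replies might look roughly like the sketch below. This is not the TIKA-980 code: ItemPropCollector is a hypothetical, simplified SAX handler that only records the text of elements carrying an itemprop attribute. It ignores nested items and the content/href values of meta and link elements. To keep it runnable with the JDK alone it is driven by the built-in SAX parser on well-formed XHTML. Inside Nutch's parse-tika plugin a handler of this kind would presumably be passed next to the existing one through Tika's TeeContentHandler, and its output would still have to be converted to a Writable before it can be stored as metadata.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler (NOT the TIKA-980 patch): collects "itemprop=text"
// pairs from the SAX events of an (X)HTML document.
public class ItemPropCollector extends DefaultHandler {
    private final List<String> properties = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private String currentProp; // itemprop of the element being read, or null
    private int depth;          // child-element nesting inside that element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (currentProp != null) {
            // Already inside an itemprop element: track nesting only
            // (nested itemprops are ignored in this simplified sketch).
            depth++;
            return;
        }
        String prop = atts.getValue("itemprop");
        if (prop != null) {
            currentProp = prop;
            depth = 0;
            text.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (currentProp != null) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (currentProp == null) {
            return;
        }
        if (depth > 0) {
            depth--;
            return;
        }
        // Closing tag of the itemprop element itself: record the pair.
        properties.add(currentProp + "=" + text.toString().trim());
        currentProp = null;
    }

    public List<String> getProperties() {
        return properties;
    }

    // Assumption for this sketch: the input is well-formed XHTML, so the
    // JDK SAX parser can stand in for Tika's TagSoup-based HTML parser.
    public static List<String> extract(String xhtml) throws Exception {
        ItemPropCollector collector = new ItemPropCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), collector);
        return collector.getProperties();
    }

    public static void main(String[] args) throws Exception {
        String page = "<div itemscope=\"itemscope\" itemtype=\"http://schema.org/Product\">"
                + "<span itemprop=\"name\">Widget</span>"
                + "<span itemprop=\"price\">9.99</span></div>";
        System.out.println(extract(page)); // prints [name=Widget, price=9.99]
    }
}
```

The handler keeps its results in a plain List<String> only for illustration. A real plugin would more likely build a structure per itemscope and then map it onto Nutch's metadata.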

