Re: Extract Microdata

Manish Verma Wed, 30 Mar 2016 23:18:08 -0700

So does Nutch 1.11 include the TIKA-1193 patch?
What exactly do I have to do to crawl microdata?


> On Mar 30, 2016, at 1:33 PM, Markus Jelsma <[email protected]> wrote:
> 
> Hello - with TIKA-1193 committed, you can tell TagSoup to allow HTML5 tags.
> This works in Nutch 1.11 as well; we are using it in conjunction with our own
> parser plugin and it works very well. TIKA-980 also depends on this. It is a
> bit of a workaround concerning TagSoup, but it fits. Because of TIKA-1193, our
> parser can process HTML5 in its ContentHandler.
> 
> Try it out :)
> Markus
> 
> 
> 
> -----Original message-----
>> From:Manish Verma <[email protected]>
>> Sent: Wednesday 30th March 2016 18:57
>> To: [email protected]
>> Subject: Re: Extract Microdata
>> 
>> 
>> It seems it does not support HTML5 tags; the assert statements in the given
>> patch are failing because of that.
>> Thanks
>> 
>>> On Mar 18, 2016, at 3:16 AM, Markus Jelsma <[email protected]> 
>>> wrote:
>>> 
>>> Hello! Nutch doesn't have a mechanism to extract microdata from HTML. But 
>>> there is a patch for Apache Tika that comes as a content handler, TIKA-980. 
>>> You can embed it into another content handler or use Tika's 
>>> TeeContentHandler in Nutch's parse-tika plugin. The downside is that you 
>>> have to transform the output data structure to a Writable in the plugin, 
>>> otherwise you cannot store it as metadata and run it on Hadoop.
>>> 
>>> https://issues.apache.org/jira/browse/TIKA-980
>>> 
>>> Markus
>>> 
>>> 
>>> 
>>> -----Original message-----
>>>> From:Manish Verma <[email protected]>
>>>> Sent: Thursday 17th March 2016 19:18
>>>> To: [email protected]
>>>> Subject: Extract Microdata
>>>> 
>>>> Hi,
>>>> 
>>>> I need to crawl URLs, extract microdata, and save it to Solr. Does 
>>>> Nutch support extraction of schema.org microdata?
>>>> 
>>>> Thanks
>>>> 
>>>> 
>>>> 
>> 
>>
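
The "tee" idea from the quoted thread (one SAX event stream fanned out to the normal parse handler and to a microdata handler) can be sketched roughly as follows. This is only an illustration using the JDK's own SAX classes, not the TIKA-980 patch and not Nutch's parse-tika code: MicrodataSketch, TeeHandler, TextHandler and ItempropHandler are hypothetical names, TeeHandler merely stands in for Tika's TeeContentHandler, and the input is assumed to be well-formed XHTML (hence itemscope=""), since the JDK parser does not do the cleanup TagSoup does. It only records the text of elements that carry an itemprop attribute and ignores nested items.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class MicrodataSketch {

    /** Hypothetical stand-in for a tee handler: forwards each event to every target. */
    static class TeeHandler extends DefaultHandler {
        private final List<DefaultHandler> targets;
        TeeHandler(DefaultHandler... targets) { this.targets = Arrays.asList(targets); }

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts)
                throws SAXException {
            for (DefaultHandler t : targets) t.startElement(uri, local, qName, atts);
        }
        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            for (DefaultHandler t : targets) t.characters(ch, start, length);
        }
        @Override
        public void endElement(String uri, String local, String qName) throws SAXException {
            for (DefaultHandler t : targets) t.endElement(uri, local, qName);
        }
    }

    /** Stand-in for the regular parse handler: collects all text content. */
    static class TextHandler extends DefaultHandler {
        final StringBuilder text = new StringBuilder();
        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    /** Hypothetical microdata handler: maps each itemprop name to its element text. */
    static class ItempropHandler extends DefaultHandler {
        final Map<String, String> props = new LinkedHashMap<>();
        private final StringBuilder buf = new StringBuilder();
        private String current; // itemprop currently being collected, or null
        private int depth;      // child-element nesting inside that itemprop element

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            if (current != null) { depth++; return; }
            String prop = atts.getValue("itemprop");
            if (prop != null) { current = prop; depth = 0; buf.setLength(0); }
        }
        @Override
        public void characters(char[] ch, int start, int length) {
            if (current != null) buf.append(ch, start, length);
        }
        @Override
        public void endElement(String uri, String local, String qName) {
            if (current == null) return;
            if (depth > 0) { depth--; return; }
            props.put(current, buf.toString().trim());
            current = null;
        }
    }

    /** Parses a well-formed XHTML fragment and feeds the events to the given handler. */
    static void parse(String xhtml, DefaultHandler handler) throws Exception {
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<div itemscope=\"\" itemtype=\"http://schema.org/Product\">"
                + "<span itemprop=\"name\">Widget</span> costs "
                + "<span itemprop=\"price\">9.99</span></div>";
        TextHandler body = new TextHandler();
        ItempropHandler micro = new ItempropHandler();
        parse(xhtml, new TeeHandler(body, micro));
        System.out.println(micro.props);
        System.out.println(body.text);
    }
}
```

In an actual Nutch plugin, the collected map would still have to be converted to a Hadoop Writable before it can be stored as parse metadata, as noted in the thread; that step is not shown here.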
