RE: Excluding Div After Link Discovery From Content

Markus Jelsma Tue, 15 Dec 2015 11:45:11 -0800

Hello - you could hack into htmlparser so that the div is excluded only when 
extracting text, and links are still found in the full content. It is probably 
easier to use the boilerpipe patch, which already has this ugly hack in it. 
More elegant would be to use the TeeContentHandler and give it Tika's 
BoilerpipeContentHandler and Tika's LinkContentHandler, so that a single parse 
feeds both. That is also much more efficient.
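
To make the tee idea concrete, here is a rough JDK-only sketch. It does not use 
Tika's classes: Tee, LinkCollector and TextCollector are hypothetical stand-ins 
for TeeContentHandler, LinkContentHandler and a text handler, the input is 
assumed to be well-formed XHTML, and the div is assumed to carry id="footer". 
One SAX parse feeds both handlers, so links in the footer are still collected 
while the footer text is left out.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class FooterTeeSketch {

    /** Collects every href, including those inside the excluded div. */
    static class LinkCollector extends DefaultHandler {
        final List<String> links = new ArrayList<>();

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            String href = atts.getValue("href");
            if ("a".equals(qName) && href != null) {
                links.add(href);
            }
        }
    }

    /** Collects text, skipping everything under <div id="footer"> (assumed id). */
    static class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();
        private int skipDepth = 0; // > 0 while inside the excluded div

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            if (skipDepth > 0) {
                skipDepth++; // nested element inside the footer
            } else if ("div".equals(qName) && "footer".equals(atts.getValue("id"))) {
                skipDepth = 1; // entering the footer
            }
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            if (skipDepth > 0) {
                skipDepth--;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (skipDepth == 0) {
                text.append(ch, start, length);
            }
        }
    }

    /** Forwards each SAX event to all delegates, so one parse serves both. */
    static class Tee extends DefaultHandler {
        private final DefaultHandler[] delegates;

        Tee(DefaultHandler... delegates) {
            this.delegates = delegates;
        }

        @Override
        public void startElement(String u, String l, String q, Attributes a) throws SAXException {
            for (DefaultHandler d : delegates) d.startElement(u, l, q, a);
        }

        @Override
        public void endElement(String u, String l, String q) throws SAXException {
            for (DefaultHandler d : delegates) d.endElement(u, l, q);
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            for (DefaultHandler d : delegates) d.characters(ch, start, length);
        }
    }

    /** Result of one parse: all links, plus text without the footer. */
    static class Result {
        final List<String> links;
        final String text;

        Result(List<String> links, String text) {
            this.links = links;
            this.text = text;
        }
    }

    static Result parse(String xhtml) throws Exception {
        LinkCollector links = new LinkCollector();
        TextCollector text = new TextCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), new Tee(links, text));
        return new Result(links.links, text.text.toString().trim());
    }

    public static void main(String[] args) throws Exception {
        Result r = parse("<html><body><div id=\"content\"><p>Main article text.</p></div>"
                + "<div id=\"footer\"><a href=\"/about\">About</a></div></body></html>");
        System.out.println(r.links);
        System.out.println(r.text);
    }
}
```

In Nutch itself the two delegates would presumably be Tika's own handlers 
rather than these hand-written ones; the sketch only shows why a tee avoids 
parsing the page twice.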


One of my colleagues recently uploaded a patch for trunk, but I am unsure if it 
uses LinkContentHandler. 

Markus
 
-----Original message-----
> From:Manish Verma <[email protected]>
> Sent: Friday 11th December 2015 20:59
> To: [email protected]
> Subject: Excluding Div After Link Discovery From Content
> 
> Hi,
> 
> I am using Nutch 1.10, and our requirement is to not index the footer div in 
> content. I applied the solution provided in the link below; it worked, and it 
> removes the footer div from content before parsing.
> But we also want to discover links present in the footer div, so basically we 
> don’t want to index the footer in content but do want to crawl links present 
> in the footer section.
> 
> https://issues.apache.org/jira/secure/attachment/12467198/nutch-585-jostens-excludeDIVs.patch
> 
> Please suggest
> 
> Thanks
