Hello - you could hack HtmlParser so that it extracts the text content in one step (with the div excluded) and finds the links in a separate step that does not extract content. It is probably easier to use the boilerpipe patch, which already has this ugly hack in it. A more elegant solution would be to use Tika's TeeContentHandler and give it Tika's BoilerpipeContentHandler and Tika's LinkContentHandler. That approach is also much more efficient.
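To illustrate the tee idea, here is a minimal sketch using only the JDK's SAX classes, not Tika itself. The class names (FooterTeeSketch, Tee, TextCollector, LinkCollector) and the id "footer" are made up for the example. Tee stands in for the role TeeContentHandler plays, LinkCollector for LinkContentHandler, and TextCollector for the text-extracting handler. Note that it skips the div by id, which is an assumption matching the question and is not how boilerpipe decides what to drop. It also assumes well-formed XHTML input.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: one parse, two handlers fed from the same SAX events.
class FooterTeeSketch {

    /** Forwards every SAX event to all delegates. */
    static class Tee extends DefaultHandler {
        private final DefaultHandler[] delegates;

        Tee(DefaultHandler... delegates) {
            this.delegates = delegates;
        }

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts)
                throws SAXException {
            for (DefaultHandler d : delegates) d.startElement(uri, local, qName, atts);
        }

        @Override
        public void endElement(String uri, String local, String qName) throws SAXException {
            for (DefaultHandler d : delegates) d.endElement(uri, local, qName);
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            for (DefaultHandler d : delegates) d.characters(ch, start, length);
        }
    }

    /** Collects every href, including those inside the footer div. */
    static class LinkCollector extends DefaultHandler {
        final List<String> links = new ArrayList<>();

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            if ("a".equals(qName) && atts.getValue("href") != null) {
                links.add(atts.getValue("href"));
            }
        }
    }

    /** Collects text, skipping everything inside <div id="footer">. */
    static class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();
        private int skipDepth = 0; // > 0 while inside the excluded div

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            if (skipDepth > 0) {
                skipDepth++; // nested element inside the footer
            } else if ("div".equals(qName) && "footer".equals(atts.getValue("id"))) {
                skipDepth = 1; // entering the footer
            }
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            if (skipDepth > 0) skipDepth--;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (skipDepth == 0) text.append(ch, start, length);
        }
    }

    /** Parses the input once and feeds all handlers through the tee. */
    static void parse(String xhtml, DefaultHandler... handlers) throws Exception {
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), new Tee(handlers));
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><p>Main text</p><div id=\"footer\">Footer text "
                + "<a href=\"http://example.com/about\">About</a></div></body></html>";
        TextCollector text = new TextCollector();
        LinkCollector links = new LinkCollector();
        parse(page, text, links);
        System.out.println(text.text);   // text without the footer
        System.out.println(links.links); // links, footer included
    }
}
```

The point of the design is that the page is parsed once, and each handler decides for itself what to keep. In Nutch you would pass the real Tika handlers to TeeContentHandler instead of these stand-ins.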
One of my colleagues recently uploaded a patch for trunk, but I am unsure if it uses LinkContentHandler.

Markus

-----Original message-----
> From: Manish Verma <[email protected]>
> Sent: Friday 11th December 2015 20:59
> To: [email protected]
> Subject: Excluding Div After Link Discovery From Content
>
> Hi,
>
> I am using Nutch 1.10, and our requirement is to not index the footer div in
> content. I applied the solution provided in the link below; it worked, and it
> removes the footer div from content before parsing.
> But we also want to discover links present in the footer div, so basically we
> don’t want to index the footer in content but want to crawl links present in
> the footer section.
>
> https://issues.apache.org/jira/secure/attachment/12467198/nutch-585-jostens-excludeDIVs.patch
>
> Please suggest
>
> Thanks

