Hi - I remember something odd about headings inside anchors; if I remember
correctly, that is not valid HTML. Anyway, if you are using parse-html, try
parse-tika. If that doesn't work, you may want to hack the parse-tika plugin to
use the LinkContentHandler that Tika offers, in conjunction with
TeeContentHandler. If that doesn't work either, you can work around the
problem, but that involves some crazy stuff I don't have lying around right now.
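
To illustrate the tee idea, here is a minimal sketch that uses only the JDK's
SAX API, not Tika and not the parse-tika plugin. The classes LinkTeeSketch,
LinkCollector, TextCollector and Tee are hypothetical stand-ins: in the plugin,
Tika's LinkContentHandler and TeeContentHandler would play the roles of
LinkCollector and Tee. It also assumes well-formed XHTML input, which real
pages often are not.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical JDK-only stand-in for the Tika handlers named above.
class LinkTeeSketch {

    /** Records the href of every <a>, regardless of what the anchor wraps. */
    static class LinkCollector extends DefaultHandler {
        final List<String> links = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("a".equalsIgnoreCase(qName)) {
                String href = atts.getValue("href");
                if (href != null) {
                    links.add(href);
                }
            }
        }
    }

    /** Records all character data, standing in for the normal text handler. */
    static class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    /** Forwards element and character events to every delegate, in order. */
    static class Tee extends DefaultHandler {
        private final DefaultHandler[] delegates;

        Tee(DefaultHandler... delegates) {
            this.delegates = delegates;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts)
                throws SAXException {
            for (DefaultHandler d : delegates) d.startElement(uri, localName, qName, atts);
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            for (DefaultHandler d : delegates) d.endElement(uri, localName, qName);
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            for (DefaultHandler d : delegates) d.characters(ch, start, length);
        }
    }

    /** Parses the input once and feeds the same events to the given handler. */
    private static void run(String xhtml, DefaultHandler handler) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }

    /** Returns the hrefs found, while a text handler sees the same events. */
    static List<String> extractLinks(String xhtml) {
        LinkCollector links = new LinkCollector();
        run(xhtml, new Tee(links, new TextCollector()));
        return links.links;
    }

    public static void main(String[] args) {
        // An anchor wrapping an H2, as in the pages described in this thread.
        String page = "<html><body><a href=\"news/item.html\"><h2>Headline</h2></a></body></html>";
        System.out.println(extractLinks(page)); // prints [news/item.html]
    }
}
```

The point of the tee is that the link handler sees the raw start-element
events itself, so it does not depend on how the text-extraction side treats
the heading inside the anchor.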

M.

-----Original message-----
> From:bbarani <[email protected]>
> Sent: Tuesday 17th November 2015 19:51
> To: [email protected]
> Subject: Re: Nutch doesnt crawl relative links that doesn't start with 
> leading /
> 
> The issue seems to be with H2 tag inside anchor tag.
> 
> Once I remove the H2 tag, nutch crawls that URL without any issues. Any idea
> how to fix this issue? 
> 
> Note: I don't have rights to remove H2 tags in all the pages that nutch is
> crawling. 
>     
> Thanks
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Nutch-doesnt-crawl-relative-links-that-doesn-t-start-with-leading-tp4239303p4240650.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
> 
