See 
https://issues.apache.org/jira/browse/NUTCH-961
https://issues.apache.org/jira/browse/NUTCH-1233
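
For illustration, here is a minimal sketch of the tee idea mentioned in the message quoted below: one SAX event stream is forwarded to two handlers, one collecting text and one collecting links. It uses only the JDK's SAX classes. The classes TeeLinkSketch, Tee, LinkCollector and TextCollector are hypothetical stand-ins written for this example; they are not Nutch code and not Tika's org.apache.tika.sax.TeeContentHandler or LinkContentHandler. It also assumes well-formed XHTML input, which real pages often are not.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class TeeLinkSketch {

    /** Hypothetical link collector: records {href, anchor text} for each <a href=...>. */
    static class LinkCollector extends DefaultHandler {
        final List<String[]> links = new ArrayList<>();
        private String href;          // href of the currently open anchor, or null
        private StringBuilder text;   // text seen inside the currently open anchor

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            if ("a".equals(qName) && atts.getValue("href") != null) {
                href = atts.getValue("href");
                text = new StringBuilder();
            }
            // Elements nested in the anchor (such as <h2>) do not close the link.
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (href != null) {
                text.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            if ("a".equals(qName) && href != null) {
                links.add(new String[] {href, text.toString().trim()});
                href = null;
            }
        }
    }

    /** Hypothetical text collector: accumulates all character data. */
    static class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    /** Hypothetical tee: forwards every event it receives to each delegate in order. */
    static class Tee extends DefaultHandler {
        private final DefaultHandler[] delegates;

        Tee(DefaultHandler... delegates) {
            this.delegates = delegates;
        }

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts)
                throws SAXException {
            for (DefaultHandler d : delegates) d.startElement(uri, local, qName, atts);
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            for (DefaultHandler d : delegates) d.characters(ch, start, length);
        }

        @Override
        public void endElement(String uri, String local, String qName) throws SAXException {
            for (DefaultHandler d : delegates) d.endElement(uri, local, qName);
        }
    }

    /** Parses well-formed XHTML once and returns the links seen by the link collector. */
    static List<String[]> extractLinks(String xhtml) throws Exception {
        LinkCollector links = new LinkCollector();
        TextCollector body = new TextCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), new Tee(body, links));
        return links.links;
    }

    public static void main(String[] args) throws Exception {
        // A relative link without a leading slash, wrapping a heading.
        String page = "<html><body><a href=\"page.html\"><h2>Title</h2></a></body></html>";
        for (String[] link : extractLinks(page)) {
            System.out.println(link[0] + " -> " + link[1]);
        }
    }
}
```

The point of the tee is that link extraction runs alongside the normal content handler on the same parse, so an anchor that wraps a heading is still reported with its href.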


 
 
-----Original message-----
> From:Markus Jelsma <[email protected]>
> Sent: Tuesday 17th November 2015 20:37
> To: [email protected]
> Subject: RE: Nutch doesnt crawl relative links that doesn't start with 
> leading /
> 
> Hi - I remember something weird with headings inside anchors, and it is not 
> valid markup, if I remember correctly. Anyway, if you are using parse-html, 
> try parse-tika. If that doesn't work, you may want to hack the parse-tika 
> plugin to use the LinkContentHandler that Tika offers in conjunction with 
> TeeContentHandler. If that doesn't work either, you can work around the 
> problem, but that involves some crazy stuff I don't have lying around right now.
> 
> M.
> 
>  
>  
> -----Original message-----
> > From:bbarani <[email protected]>
> > Sent: Tuesday 17th November 2015 19:51
> > To: [email protected]
> > Subject: Re: Nutch doesnt crawl relative links that doesn't start with 
> > leading /
> > 
> > The issue seems to be with the H2 tag inside the anchor tag.
> > 
> > Once I remove the H2 tag, Nutch crawls that URL without any issues. Any idea
> > how to fix this? 
> > 
> > Note: I don't have rights to remove the H2 tags from all the pages that Nutch
> > is crawling. 
> >     
> > Thanks
> > 
> > 
> > 
> > --
> > View this message in context: 
> > http://lucene.472066.n3.nabble.com/Nutch-doesnt-crawl-relative-links-that-doesn-t-start-with-leading-tp4239303p4240650.html
> > Sent from the Nutch - User mailing list archive at Nabble.com.
> > 
> 
