See https://issues.apache.org/jira/browse/NUTCH-961 https://issues.apache.org/jira/browse/NUTCH-1233
-----Original message-----
> From:Markus Jelsma <[email protected]>
> Sent: Tuesday 17th November 2015 20:37
> To: [email protected]
> Subject: RE: Nutch doesnt crawl relative links that doesn't start with leading /
>
> Hi - i remember something weird with headings in anchors, and it is not valid
> if i remember correctly. Anyway, if you are using parse-html, try parse-tika.
> If that doesn't work, you may want to hack the parse-tika plugin to use the
> LinkContentHandler that Tika offers in conjunction with TeeContentHandler. If
> that doesn't work, you can work around the problem but that involves some
> crazy stuff i don't have laying around right now.
>
> M.
>
> -----Original message-----
> > From:bbarani <[email protected]>
> > Sent: Tuesday 17th November 2015 19:51
> > To: [email protected]
> > Subject: Re: Nutch doesnt crawl relative links that doesn't start with
> > leading /
> >
> > The issue seems to be with H2 tag inside anchor tag.
> >
> > Once I remove the H2 tag, nutch crawls that URL without any issues. Any idea
> > how to fix this issue?
> >
> > Note: I don't have rights to remove H2 tags in all the pages that nutch is
> > crawling.
> >
> > Thanks
> >
> > --
> > View this message in context:
> > http://lucene.472066.n3.nabble.com/Nutch-doesnt-crawl-relative-links-that-doesn-t-start-with-leading-tp4239303p4240650.html
> > Sent from the Nutch - User mailing list archive at Nabble.com.
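
For anyone reading this thread later: below is a rough sketch of the idea behind a link-collecting SAX handler, which is the general approach the LinkContentHandler suggestion above relies on. It is not Tika or Nutch code. The class name LinkCollector and its collect() helper are made up for illustration, it uses only the JDK SAX parser, and it assumes the input is well-formed XHTML (real-world HTML would need a tolerant parser such as the one Tika wraps). The point it illustrates is that the href is recorded when the anchor opens, so a heading nested inside the anchor only contributes to the anchor text and does not cause the link to be dropped.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration class; not part of Tika or Nutch.
public class LinkCollector extends DefaultHandler {
    // Each entry is {href, anchor text}.
    private final List<String[]> links = new ArrayList<>();
    // href of the <a> element currently open, or null when outside an anchor.
    private String currentHref;
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Only <a> elements with an href start a link; nested elements such as <h2> are ignored here.
        if ("a".equalsIgnoreCase(qName) && atts.getValue("href") != null) {
            currentHref = atts.getValue("href");
            text.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text inside the anchor, including text of nested elements, becomes the anchor text.
        if (currentHref != null) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("a".equalsIgnoreCase(qName) && currentHref != null) {
            links.add(new String[] {currentHref, text.toString().trim()});
            currentHref = null;
        }
    }

    public List<String[]> getLinks() {
        return links;
    }

    // Parses a well-formed XHTML string and returns the collected links.
    public static List<String[]> collect(String xhtml) throws Exception {
        LinkCollector handler = new LinkCollector();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getLinks();
    }

    public static void main(String[] args) throws Exception {
        // A relative link without a leading slash, wrapping an <h2>, as in the reported case.
        String page = "<html><body><a href=\"docs/page.html\"><h2>Heading link</h2></a></body></html>";
        for (String[] link : collect(page)) {
            System.out.println(link[0] + " -> " + link[1]);
        }
    }
}
```

In a patched parse-tika plugin, the equivalent step would presumably be to pass Tika's own LinkContentHandler alongside the existing handler through a TeeContentHandler and read the links from it after parsing. That wiring is an assumption on my part and is not shown or tested here.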

