Hi,

Nutch will probably follow the link and fetch test.html, resolved against the page's base URL.
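
As a rough illustration of that resolution step (a Python sketch using the standard library, not Nutch code; the base URL is made up because the original message does not give one): the href is resolved against the page URL, and everything after '#' is a fragment, which a client does not send to the server.

```python
from urllib.parse import urljoin, urldefrag

# Hypothetical page URL: an assumption for illustration only.
BASE_URL = "http://example.com/shop/index.html"
# The href quoted in the original question.
HREF = "test.html#!/int/id/8999898?make=Apple&model=Apple6s"


def resolve_and_strip(base, href):
    """Resolve href against base, then split off the fragment."""
    absolute = urljoin(base, href)
    url, fragment = urldefrag(absolute)
    return url, fragment


url, fragment = resolve_and_strip(BASE_URL, HREF)
print("URL without fragment:", url)
print("fragment:", fragment)
```

Note that the '?make=...' part comes after the '#', so it belongs to the fragment and is not a query string.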
By default, Nutch ignores the '#' and everything after it: that part is normally a page anchor, which must be removed to avoid duplicate content.

Have a look at
  https://wiki.apache.org/nutch/AdvancedAjaxInteraction
and
  https://issues.apache.org/jira/browse/NUTCH-1323 (urlnormalizer-ajax)
which may solve your problem.

Cheers,
Sebastian

On 11/10/2015 02:52 AM, bbarani wrote:
> Hi,
>
> We have relative URLs similar to the one below in our HTML page, but Nutch
> is not crawling these URLs. Any idea why Nutch doesn't crawl these types of
> relative URLs?
>
> href="test.html#!/int/id/8999898?make=Apple&model=Apple6s"
>
> Do I need to make any changes in the nutch-conf or regex files to crawl
> these URLs?
>
> Thanks,
> Barani

