Re: Nutch doesnt crawl relative links that doesn't start with leading /

Sebastian Nagel Tue, 10 Nov 2015 07:31:47 -0800

Hi,

Nutch will probably follow the link and fetch test.html,
resolved against the base URL of the page.

By default, the '#' and everything after it are ignored:
that part is normally a page anchor, which has to be removed
to avoid duplicate content.
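
To illustrate the effect on the link from your mail, here is a rough
sketch in Python (not Nutch code). The base URL and the helper name
resolve_and_strip are assumptions made up for the example; only the
href is taken from your message.

```python
from urllib.parse import urljoin, urldefrag


def resolve_and_strip(base_url, href):
    """Resolve href against base_url, then drop the '#' and everything after it."""
    # Relative links without a leading '/' are resolved against the
    # directory of the page they were found on.
    absolute = urljoin(base_url, href)
    # The '?' comes after the '#', so it belongs to the fragment,
    # not to the query string, and is removed together with it.
    stripped, fragment = urldefrag(absolute)
    return absolute, stripped, fragment


# Hypothetical base URL (assumption, not from the original mail).
base_url = "http://example.com/products/index.html"
href = "test.html#!/int/id/8999898?make=Apple&model=Apple6s"

absolute, stripped, fragment = resolve_and_strip(base_url, href)
print(absolute)
print(stripped)
print(fragment)
```

Under these assumptions every link of the form test.html#!... collapses
to the same URL for test.html, so it looks as if the individual links
are never crawled.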

If you need the part after the '#' to be kept, have a look at
  https://wiki.apache.org/nutch/AdvancedAjaxInteraction
and
  https://issues.apache.org/jira/browse/NUTCH-1323
  (urlnormalizer-ajax)
which may solve your problem.

Cheers,
Sebastian

On 11/10/2015 02:52 AM, bbarani wrote:
> Hi,
> 
> We have relative URLs similar to the one below in our HTML page, but Nutch
> is not crawling these URLs. Any idea why Nutch doesn't crawl this type of
> relative URL?
> 
> href="test.html#!/int/id/8999898?make=Apple&model=Apple6s"
> 
> Do I need to make any changes in the nutch-conf or regex files to crawl
> these urls?
> 
> Thanks,
> Barani
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Nutch-doesnt-crawl-relative-links-that-doesn-t-start-with-leading-tp4239303.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
