Re: How To Stop Crawling Pages With "Page Redirect Loop"

Sebastian Nagel Wed, 16 Dec 2015 06:33:02 -0800

Hi,

there is no need for Nutch to detect redirect loops:


(A) By default (http.redirect.max == 0), Nutch only records the redirect
target and fetches it in the next round. The redirect back to the original
URL, found in that next round, is not fetched again because that URL has
already been fetched.

(B) With http.redirect.max > 0, redirects are followed immediately, without
any check for loops or duplicates (many pages redirecting to the same
target). This option requires careful use anyway, and you would hardly set
the property to a high number - a value of 3 redirects should be sufficient
in most cases.
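
To make the difference between (A) and (B) concrete, here is a toy model in Python. It is only a sketch, not Nutch code: the redirect table, the URLs, and the function names are made up for illustration, and it ignores refetch intervals.

```python
# Hypothetical redirect table: /a redirects to /b, and /b redirects back to /a.
REDIRECTS = {
    "http://example.com/a": "http://example.com/b",
    "http://example.com/b": "http://example.com/a",
}


def crawl_deferred(seed, redirects):
    """Case (A): record redirect targets and fetch them in a later round."""
    fetched = []        # URLs fetched so far, in fetch order
    queue = [seed]      # URLs scheduled for the current round
    while queue:
        next_round = []
        for url in queue:
            if url in fetched:
                continue  # already fetched, so the loop ends here
            fetched.append(url)
            target = redirects.get(url)
            if target is not None:
                next_round.append(target)  # recorded, not followed now
        queue = next_round
    return fetched


def fetch_immediate(url, redirects, max_redirects):
    """Case (B): follow redirects right away, bounded only by max_redirects."""
    chain = [url]
    for _ in range(max_redirects):
        target = redirects.get(chain[-1])
        if target is None:
            break
        chain.append(target)  # no loop or duplicate check
    return chain
```

In this model, crawl_deferred touches each URL of the loop once and then stops, while fetch_immediate with a limit of 3 bounces between the two URLs until the limit is used up. That is why a small limit is enough protection in case (B).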

Btw., the urlfilter-regex rule
> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
is meant to break loops of recursively nested directories, not HTTP
redirect loops.
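
A quick way to see what this rule catches is to try the pattern on a few URLs. The Python sketch below assumes the rule is applied as a search anywhere in the URL; the example.com URLs and the function name is_skipped are made up for illustration.

```python
import re

# Pattern from the urlfilter-regex rule, without the leading "-"
# (the "-" only marks the rule as an exclusion).
NESTED_SEGMENT = re.compile(r".*(/[^/]+)/[^/]+\1/[^/]+\1/")


def is_skipped(url):
    """Return True if the URL contains a path segment that repeats three
    times with one other segment in between, e.g. /a/b/a/b/a/."""
    return NESTED_SEGMENT.search(url) is not None


for url in (
    "http://example.com/a/b/a/b/a/",            # nested directory loop
    "http://example.com/a/b/a/",                # segment repeats only twice
    "https://support.apple.com/zh-mo/HT6323",   # URL from the question
):
    print(url, is_skipped(url))
```

Only the first URL is skipped. The URL from the question has no repeating path segment, so this rule has no effect on it.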

Best,
Sebastian

<property>
  <name>http.redirect.max</name>
  <value>0</value>
  <description>The maximum number of redirects the fetcher will follow when
  trying to fetch a page. If set to negative or 0, fetcher won't immediately
  follow redirected URLs, instead it will record them for later fetching.
  </description>
</property>


2015-12-16 3:26 GMT+01:00 Manish Verma <[email protected]>:

> Hi ,
>
> I am using Nutch 1.10, and while crawling it discovers a few URLs that are
> redirect loops; when I open these URLs, the browser says it is a redirect
> loop. Is Nutch not identifying this loop?
>
> E.g. redirect loop page "https://support.apple.com/zh-mo/HT6323"
>
> I have not changed the regex:
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
>
> Thanks
>
>
>
