Re: nutch redirection issue

Sebastian Nagel Thu, 11 Jul 2013 13:14:36 -0700

> If I remember correctly, there used to be a setting that would have Nutch
> follow the redirect instead of storing it as a new url, but I can't seem to
> find it at the moment.

The property is:

<property>
  <name>http.redirect.max</name>
  <value>0</value>
  <description>The maximum number of redirects the fetcher will follow when
  trying to fetch a page. If set to negative or 0, fetcher won't immediately
  follow redirected URLs, instead it will record them for later fetching.
  </description>
</property>
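
As a rough sketch, an override could go in conf/nutch-site.xml like this.
The value 3 is only an illustrative assumption, not a recommended setting:

<configuration>
  <property>
    <name>http.redirect.max</name>
    <!-- illustrative value: follow up to 3 redirects during the same fetch -->
    <value>3</value>
  </property>
</configuration>

With a positive value the fetcher follows the redirect right away instead of
recording the target for a later fetch round.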

> Have you done another crawl?  By default, Nutch puts the redirect into the
> database as a new url to be crawled.  So you will find the content under
> the location of the redirect.

Sometimes you'll find the content of the redirect target indexed under the
source URL. In general, if the source (e.g. www.asdf.net) is clearly simpler
than the target (e.g. www.asdf.net/page/index.asp?page=main), the source is
given precedence. For details, see URLUtil.chooseRepr().

On 07/11/2013 01:21 PM, Bai Shen wrote:
> Have you done another crawl?  By default, Nutch puts the redirect into the
> database as a new url to be crawled.  So you will find the content under
> the location of the redirect.
> 
> If I remember correctly, there used to be a setting that would have Nutch
> follow the redirect instead of storing it as a new url, but I can't seem to
> find it at the moment.
> 
> 
> On Thu, Jul 11, 2013 at 5:48 AM, devang pandey
> <[email protected]> wrote:
> 
>> Hello,
>>
>> I am a bit new to Nutch. I am crawling a URL which redirects to another
>> URL. When analysing my crawl results, I get the content of the first URL
>> along with the status code "temp redirected to (second url name)". My
>> question is: why am I not getting the content and details of the second
>> URL? Please help.
>>
>
