Hi,

> I meant #1 and used if http.redirect.max == 3.

In this case you definitely have to apply the fix for
NUTCH-2124 / NUTCH-1939 and rebuild your 1.9 package.
Or use 1.10, where NUTCH-1939 is fixed and has not yet
reappeared as NUTCH-2124 :)
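
For illustration, one way to apply it (the patch file name and the
directory are placeholders; take the patch from the NUTCH-2124 issue):

  cd apache-nutch-1.9              # your source checkout
  patch -p0 < NUTCH-2124.patch     # or -p1, depending on the patch format
  ant runtime                      # rebuilds runtime/local with the fix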

Alternatively, use http.redirect.max == 0 and crawl
a sufficient number of rounds.
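
For example (a sketch; the seed dir, crawl dir, Solr URL and number
of rounds are placeholders, see the usage printed by bin/crawl):

  <!-- nutch-site.xml: keep the default, the redirect target is
       fetched in a later round instead -->
  <property>
    <name>http.redirect.max</name>
    <value>0</value>
  </property>

  # two rounds are enough for a single redirect hop
  bin/crawl urls/ crawl/ http://localhost:8983/solr/ 2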

Cheers,
Sebastian


On 11/06/2015 05:09 AM, [email protected] wrote:
> Hi Sebastian,
> 
> I meant #1 and used if http.redirect.max == 3.
> 
> Thanks,
> Arkadi
> 
>> -----Original Message-----
>> From: Sebastian Nagel [mailto:[email protected]]
>> Sent: Tuesday, 3 November 2015 6:13 PM
>> To: [email protected]
>> Subject: Re: Bug: redirected URLs lost on indexing stage?
>>
>> Hi Arkadi,
>>
>>> Example: use http://www.atnf.csiro.au/observers/ as seed and set
>>> depth to 1. It will be redirected to
>>> http://www.atnf.csiro.au/observers/index.html, fetched and parsed
>>> successfully and then lost. If you set depth to 2, it will get indexed.
>>
>> Just to be sure we use the same terminology: what does "depth" mean?
>> 1. number of rounds: the number of generate-fetch-update cycles when
>>    running Nutch, see the command-line help of bin/crawl
>> 2. the value of the property http.redirect.max
>> 3. the value of the property scoring.depth.max (used by the plugin
>>    scoring-depth)
>>
>> If it's about #1 and if http.redirect.max == 0 (the default):
>> you need at least two rounds to index a redirected page.
>> During the first round the redirect is fetched and the redirect target is
>> recorded. The second round will fetch, parse and index the redirect target.
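>>
>> For example (commands as in Nutch 1.x, the crawldb path is a
>> placeholder):
>>
>>   # after round 1: the source is marked as a redirect and the
>>   # target is recorded, but still unfetched
>>   bin/nutch readdb crawl/crawldb -url http://www.atnf.csiro.au/observers/
>>   # after round 2: the target has been fetched and can be indexed
>>   bin/nutch readdb crawl/crawldb -url http://www.atnf.csiro.au/observers/index.html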
>>
>> If http.redirect.max is set to a value > 0, the fetcher will follow redirects
>> immediately in the current round. But there are some drawbacks, and that's
>> why this isn't the default (see the sketch after this list):
>> - no deduplication if multiple pages are redirected
>>   to the same target, e.g., an error page.
>>   This means you'll spend extra network bandwidth
>>   fetching the same content multiple times;
>>   Nutch will keep only one instance of the page anyway.
>> - by setting http.redirect.max to a high value you
>>   may get caught in redirect loops (round trips)
>> - if http.redirect.max is too low, longer redirect
>>   chains are cut off: Nutch will not follow the
>>   remaining redirects.
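>>
>> For illustration, a minimal nutch-site.xml sketch (the value 2 is
>> just an example, not a recommendation):
>>
>>   <property>
>>     <name>http.redirect.max</name>
>>     <value>2</value>
>>     <!-- follow up to 2 redirect hops within the same round;
>>          longer chains are cut off, higher values risk loops -->
>>   </property>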
>>
>> Cheers,
>> Sebastian
>>
>>
>> On 11/03/2015 01:21 AM, [email protected] wrote:
>>> Hi Sebastian,
>>>
>>> Thank you for the very quick and detailed response. I've checked again
>>> and found that redirected URLs get lost if they were injected in the
>>> last iteration.
>>>
>>> Example: use http://www.atnf.csiro.au/observers/ as seed and set depth
>>> to 1. It will be redirected to http://www.atnf.csiro.au/observers/index.html,
>>> fetched and parsed successfully and then lost. If you set depth to 2,
>>> it will get indexed.
>>>
>>> If you use http://www.atnf.csiro.au/observers/index.html as seed, it will
>>> be fetched, parsed and indexed successfully even if you set depth to 1.
>>>
>>> Regards,
>>> Arkadi
>>>
>>>> -----Original Message-----
>>>> From: Sebastian Nagel [mailto:[email protected]]
>>>> Sent: Thursday, 29 October 2015 7:23 AM
>>>> To: [email protected]
>>>> Subject: Re: Bug: redirected URLs lost on indexing stage?
>>>>
>>>> Hi Arkadi,
>>>>
>>>>> In my experience, Nutch follows redirects OK (after NUTCH-2124
>>>>> applied),
>>>>
>>>> Yes, 1.9 is affected by NUTCH-2124 / NUTCH-1939 if http.redirect.max > 0
>>>>
>>>>
>>>>> fetches target content, parses and saves it, but loses it at the
>>>>> indexing stage.
>>>>
>>>> Can you give a concrete example?
>>>>
>>>> While testing NUTCH-2124, I've verified that redirect targets get indexed.
>>>>
>>>>
>>>>> Therefore, when this condition is checked
>>>>>
>>>>> if (fetchDatum == null || dbDatum == null
>>>>>     || parseText == null || parseData == null) {
>>>>>   return;                     // only have inlinks
>>>>> }
>>>>>
>>>>> both sets get ignored because each one is incomplete.
>>>>
>>>> This code snippet is correct: a redirect is pretty much the same as a
>>>> link, the crawler follows it. Ok, there are many differences, but the
>>>> central point is the same: a link itself does not get indexed, only
>>>> the link target. And that's true for redirects, too. There are always
>>>> at least 2 URLs:
>>>> - the source of the redirect
>>>> - and the target of the redirection
>>>> Only the latter gets indexed, after it has been fetched and provided
>>>> it is not a redirect itself.
>>>>
>>>> The source has no parseText and parseData, and that's why it cannot
>>>> be indexed.
>>>>
>>>> If the target does not make it into the index:
>>>> - first, check whether it passes the URL filters and is not changed
>>>>   by the normalizers
>>>> - was it successfully fetched and parsed?
>>>> - is it perhaps excluded by robots=noindex?
>>>>
>>>> You should check the CrawlDb and the segments for this URL.
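>>>>
>>>> For example (paths and the segment name are placeholders, commands
>>>> as in Nutch 1.x):
>>>>
>>>>   # does the URL survive filters and normalizers?
>>>>   echo "http://www.atnf.csiro.au/observers/index.html" | \
>>>>     bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
>>>>   # what does the CrawlDb know about the URL?
>>>>   bin/nutch readdb crawl/crawldb -url http://www.atnf.csiro.au/observers/index.html
>>>>   # was it fetched and parsed in the segment?
>>>>   bin/nutch readseg -get crawl/segments/20151029071500 \
>>>>     http://www.atnf.csiro.au/observers/index.html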
>>>>
>>>> If you could provide a concrete example, I'm happy to have a detailed
>>>> look at it.
>>>>
>>>> Cheers,
>>>> Sebastian
>>>>
>>>>
>>>> On 10/28/2015 08:57 AM, [email protected] wrote:
>>>>> Hi,
>>>>>
>>>>> I am using Nutch 1.9 with the NUTCH-2124 patch applied. I've put a
>>>>> question mark in the subject because I work with a Nutch modification
>>>>> called Arch (see http://www.atnf.csiro.au/computing/software/arch/).
>>>>> This is why I am only 99% sure that the same bug would occur in the
>>>>> original Nutch 1.9.
>>>>>
>>>>> In my experience, Nutch follows redirects OK (after NUTCH-2124 is
>>>>> applied), fetches target content, parses and saves it, but loses it
>>>>> at the indexing stage. This happens because the db datum is mapped
>>>>> with the original URL as the key, but the fetch and parse data and
>>>>> the parse text are mapped with the final URL in IndexerMapReduce.
>>>>> Therefore, when this condition is checked
>>>>>
>>>>> if (fetchDatum == null || dbDatum == null
>>>>>     || parseText == null || parseData == null) {
>>>>>   return;                     // only have inlinks
>>>>> }
>>>>>
>>>>> both sets get ignored because each one is incomplete.
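>>>>>
>>>>> To illustrate, a toy sketch (NOT the Nutch code) of the reduce-side
>>>>> grouping: the redirect splits the four records across two keys, so
>>>>> the completeness check fails for both of them.
>>>>>
>>>>>   import java.util.Arrays;
>>>>>   import java.util.HashMap;
>>>>>   import java.util.HashSet;
>>>>>   import java.util.List;
>>>>>   import java.util.Map;
>>>>>   import java.util.Set;
>>>>>
>>>>>   public class RedirectKeySplit {
>>>>>     public static void main(String[] args) {
>>>>>       List<String> required =
>>>>>           Arrays.asList("dbDatum", "fetchDatum", "parseData", "parseText");
>>>>>       Map<String, Set<String>> byKey = new HashMap<>();
>>>>>       // the db datum is keyed by the original (seed) URL ...
>>>>>       byKey.computeIfAbsent("http://www.atnf.csiro.au/observers/",
>>>>>           k -> new HashSet<>()).add("dbDatum");
>>>>>       // ... but the fetch and parse records by the redirect target
>>>>>       byKey.computeIfAbsent("http://www.atnf.csiro.au/observers/index.html",
>>>>>           k -> new HashSet<>())
>>>>>           .addAll(Arrays.asList("fetchDatum", "parseData", "parseText"));
>>>>>       for (Map.Entry<String, Set<String>> e : byKey.entrySet()) {
>>>>>         // mirrors the null check above: prints complete=false twice
>>>>>         System.out.println(e.getKey() + " complete="
>>>>>             + e.getValue().containsAll(required));
>>>>>       }
>>>>>     }
>>>>>   }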
>>>>>
>>>>> I am going to fix this for Arch, but can't offer a patch for Nutch,
>>>>> sorry. This is because I am not completely sure that this is a bug
>>>>> in Nutch (see above), and also because what works for Arch may not
>>>>> work for Nutch: they differ in their use of the crawl db.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Arkadi
>>>>>
>>>>>
>>>>>
>>>
> 
