Re: Bug: redirected URLs lost on indexing stage?

Sebastian Nagel Wed, 28 Oct 2015 13:23:56 -0700

Hi Arkadi,

> In my experience, Nutch follows redirects OK (after NUTCH-2124 applied),


Yes, 1.9 is affected by NUTCH-2124 / NUTCH-1939 if http.redirect.max > 0


> fetches target content, parses and saves it, but loses on the indexing stage.

Can you give a concrete example?

While testing NUTCH-2124, I've verified that redirect targets get indexed.


> Therefore, when this condition is checked
>
> if (fetchDatum == null || dbDatum == null
>     || parseText == null || parseData == null) {
>   return;                                     // only have inlinks
> }
>
> both sets get ignored because each one is incomplete.

This code snippet is correct. A redirect is pretty much the
same as a link: the crawler follows it. OK, there are many
differences, but the central point is that a link itself does
not get indexed, only the link target. The same holds for
redirects. There are always at least two URLs:
- the source of the redirect
- the target of the redirection
Only the latter gets indexed, once it has been fetched
and provided it is not a redirect itself.

The source has no parseText and no parseData, and that's
why it cannot be indexed.
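
To make the point concrete, here is a minimal sketch of how that null
check treats the two URLs of a redirect. It is a simplified model and
not the actual IndexerMapReduce code: the class RedirectIndexingSketch,
the Parts holder, the example.org URLs and the status strings are all
hypothetical and chosen only for illustration. It also assumes that the
target already has its own CrawlDb entry.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RedirectIndexingSketch {

    /** Hypothetical stand-in for the values collected per URL key. */
    static final class Parts {
        String dbDatum;     // entry from the CrawlDb
        String fetchDatum;  // fetch status from the segment
        String parseText;   // extracted plain text
        String parseData;   // parse metadata
    }

    /** Same condition as the quoted check: all four parts must be present. */
    static boolean isIndexable(Parts p) {
        return p != null && p.fetchDatum != null && p.dbDatum != null
                && p.parseText != null && p.parseData != null;
    }

    /** Builds one group per URL: the redirect source and its target. */
    static Map<String, Parts> sampleGroups() {
        Map<String, Parts> groups = new LinkedHashMap<>();

        // Redirect source: status information only, no parse output.
        Parts source = new Parts();
        source.dbDatum = "db_redir_perm";       // illustrative value
        source.fetchDatum = "fetch_redir_perm"; // illustrative value
        groups.put("http://example.org/old", source);

        // Redirect target: fetched and parsed under its own URL.
        Parts target = new Parts();
        target.dbDatum = "db_fetched";          // illustrative value
        target.fetchDatum = "fetch_success";    // illustrative value
        target.parseText = "page text";
        target.parseData = "title, outlinks, metadata";
        groups.put("http://example.org/new", target);

        return groups;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Parts> e : sampleGroups().entrySet()) {
            String result = isIndexable(e.getValue()) ? "indexed" : "skipped";
            System.out.println(e.getKey() + " -> " + result);
        }
    }
}
```

In this model the source is skipped because its parse parts are
missing, while the target passes the check. If the target's CrawlDb
entry were missing as well, it would be skipped too, so that is worth
verifying in a real crawl.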

If the target does not make it into the index, check:
- does it pass the URL filters, and is it left unchanged by the normalizers?
- was it successfully fetched and parsed?
- is it excluded by robots=noindex?

You should check the CrawlDb and the segments for this URL.

If you can provide a concrete example, I'm happy to take
a detailed look at it.

Cheers,
Sebastian


On 10/28/2015 08:57 AM, [email protected] wrote:
> Hi,
> 
> I am using Nutch 1.9 with NUTCH-2124 patch applied. I've put a question mark 
> in the subject because I work with Nutch modification called Arch (see 
> http://www.atnf.csiro.au/computing/software/arch/). This is why I am only 99% 
> sure that the same bug would occur in the original Nutch 1.9.
> 
> In my experience, Nutch follows redirects OK (after NUTCH-2124 applied), 
> fetches target content, parses and saves it, but loses on the indexing stage. 
> This happens because the db datum is being mapped with the original URL as 
> the key, but the fetch and parse data and parse text are being mapped with 
> the final URL in IndexerMapReduce. Therefore, when this condition is checked
> 
> if (fetchDatum == null || dbDatum == null
>     || parseText == null || parseData == null) {
>   return;                                     // only have inlinks
> }
> 
> both sets get ignored because each one is incomplete.
> 
> I am going to fix this for Arch, but can't offer a patch for Nutch, sorry. 
> This is because I am not completely sure that this is a bug in Nutch (see 
> above) and also because what will work for Arch may not work for Nutch. They 
> are different in the use of crawl db.
> 
> Regards,
> 
> Arkadi
> 
> 
>
