Hi Sebastian,

I meant #1, and used http.redirect.max == 3.
Thanks, Arkadi

> -----Original Message-----
> From: Sebastian Nagel [mailto:[email protected]]
> Sent: Tuesday, 3 November 2015 6:13 PM
> To: [email protected]
> Subject: Re: Bug: redirected URLs lost on indexing stage?
>
> Hi Arkadi,
>
> > Example: use http://www.atnf.csiro.au/observers/ as seed and set
> > depth to 1. It will be redirected to
> > http://www.atnf.csiro.au/observers/index.html, fetched and parsed
> > successfully and then lost. If you set depth to 2, it will get indexed.
>
> Just to be sure we use the same terminology: what does "depth" mean?
> 1. number of rounds: the number of generate-fetch-update cycles when
>    running Nutch, see the command-line help of bin/crawl
> 2. value of the property http.redirect.max
> 3. value of the property scoring.depth.max (used by the plugin scoring-depth)
>
> If it's about #1 and http.redirect.max == 0 (the default):
> you need at least two rounds to index a redirected page.
> During the first round the redirect is fetched and the redirect target is
> recorded. The second round will fetch, parse and index the redirect target.
>
> If http.redirect.max is set to a value > 0, the fetcher will follow
> redirects immediately in the current round. But there are some drawbacks,
> and that's why this isn't the default:
> - no deduplication if multiple pages are redirected to the same target,
>   e.g., an error page. This means you'll spend extra network bandwidth
>   to fetch the same content multiple times. Nutch will keep only one
>   instance of the page anyway.
> - by setting http.redirect.max to a high value you may get lost in
>   round-trip redirects
> - if http.redirect.max is too low, longer redirect chains are cut off.
>   Nutch will not follow these redirects.
>
> Cheers,
> Sebastian
>
>
> On 11/03/2015 01:21 AM, [email protected] wrote:
> > Hi Sebastian,
> >
> > Thank you for the very quick and detailed response. I've checked again
> > and found that redirected URLs get lost if they were injected in the
> > last iteration.
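For readers following along: the property Sebastian refers to is configured in conf/nutch-site.xml (it defaults to 0 in nutch-default.xml). A minimal fragment might look like the following; the description text here is paraphrased, not quoted from Nutch's shipped defaults:

```xml
<!-- conf/nutch-site.xml -->
<property>
  <name>http.redirect.max</name>
  <!-- 0 (the default) defers redirect targets to the next round;
       a value > 0 lets the fetcher follow up to that many redirects
       immediately within the current round. -->
  <value>3</value>
  <description>Maximum number of redirects the fetcher will follow
  when fetching a page.</description>
</property>
```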
> >
> > Example: use http://www.atnf.csiro.au/observers/ as seed and set depth
> > to 1. It will be redirected to
> > http://www.atnf.csiro.au/observers/index.html, fetched and parsed
> > successfully and then lost. If you set depth to 2, it will get indexed.
> >
> > If you use http://www.atnf.csiro.au/observers/index.html as seed, it
> > will be fetched, parsed and indexed successfully even if you set depth
> > to 1.
> >
> > Regards,
> > Arkadi
> >
> >> -----Original Message-----
> >> From: Sebastian Nagel [mailto:[email protected]]
> >> Sent: Thursday, 29 October 2015 7:23 AM
> >> To: [email protected]
> >> Subject: Re: Bug: redirected URLs lost on indexing stage?
> >>
> >> Hi Arkadi,
> >>
> >>> In my experience, Nutch follows redirects OK (after NUTCH-2124 is
> >>> applied),
> >>
> >> Yes, 1.9 is affected by NUTCH-2124 / NUTCH-1939 if http.redirect.max > 0
> >>
> >>> fetches target content, parses and saves it, but loses it on the
> >>> indexing stage.
> >>
> >> Can you give a concrete example?
> >>
> >> While testing NUTCH-2124, I've verified that redirect targets get
> >> indexed.
> >>
> >>> Therefore, when this condition is checked
> >>>
> >>> if (fetchDatum == null || dbDatum == null || parseText == null
> >>>     || parseData == null) {
> >>>   return; // only have inlinks
> >>> }
> >>>
> >>> both sets get ignored because each one is incomplete.
> >>
> >> This code snippet is correct: a redirect is pretty much the same as a
> >> link, in that the crawler follows it. OK, there are many differences,
> >> but the central point is that a link itself does not get indexed, only
> >> the link target. And that's the same for redirects. There are always
> >> at least 2 URLs:
> >> - the source of the redirect
> >> - and the target of the redirection
> >> Only the latter gets indexed, after it has been fetched and if it is
> >> not a redirect itself.
> >>
> >> The source has no parseText and parseData, and that's why it cannot
> >> be indexed.
> >>
> >> If the target does not make it into the index:
> >> - first, check whether it passes the URL filters and is not changed
> >>   by the normalizers
> >> - was it successfully fetched and parsed?
> >> - is it not excluded by robots=noindex?
> >>
> >> You should check the CrawlDb and the segments for this URL.
> >>
> >> If you could provide a concrete example, I'm happy to have a detailed
> >> look at it.
> >>
> >> Cheers,
> >> Sebastian
> >>
> >>
> >> On 10/28/2015 08:57 AM, [email protected] wrote:
> >>> Hi,
> >>>
> >>> I am using Nutch 1.9 with the NUTCH-2124 patch applied. I've put a
> >>> question mark in the subject because I work with a Nutch modification
> >>> called Arch (see http://www.atnf.csiro.au/computing/software/arch/).
> >>> This is why I am only 99% sure that the same bug would occur in the
> >>> original Nutch 1.9.
> >>>
> >>> In my experience, Nutch follows redirects OK (after NUTCH-2124 is
> >>> applied), fetches target content, parses and saves it, but loses it
> >>> on the indexing stage. This happens because the db datum is mapped
> >>> with the original URL as the key, but the fetch and parse data and
> >>> the parse text are mapped with the final URL in IndexerMapReduce.
> >>> Therefore, when this condition is checked
> >>>
> >>> if (fetchDatum == null || dbDatum == null || parseText == null
> >>>     || parseData == null) {
> >>>   return; // only have inlinks
> >>> }
> >>>
> >>> both sets get ignored because each one is incomplete.
> >>>
> >>> I am going to fix this for Arch, but can't offer a patch for Nutch,
> >>> sorry. This is because I am not completely sure that this is a bug
> >>> in Nutch (see above), and also because what works for Arch may not
> >>> work for Nutch. They differ in their use of the crawl db.
> >>>
> >>> Regards,
> >>>
> >>> Arkadi
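The key mismatch Arkadi describes can be illustrated with a small standalone simulation. This is not Nutch code: the `Records` class and its string fields are invented stand-ins for the CrawlDb datum, fetch datum, parse text, and parse data, and only model how records grouped under two different URL keys can each fail the reducer's completeness check.

```java
import java.util.HashMap;
import java.util.Map;

// Toy simulation (hypothetical, not Nutch's actual types) of the join in
// IndexerMapReduce: per-URL records are grouped by key, and a key is only
// indexed if all four record kinds are present.
public class RedirectKeyMismatch {

    // One URL key's record set; any null field makes the set incomplete,
    // mirroring the null check quoted in the thread.
    static class Records {
        String dbDatum, fetchDatum, parseText, parseData;
        boolean complete() {
            return dbDatum != null && fetchDatum != null
                && parseText != null && parseData != null;
        }
    }

    public static void main(String[] args) {
        Map<String, Records> byUrl = new HashMap<>();

        String original = "http://www.atnf.csiro.au/observers/";
        String target   = "http://www.atnf.csiro.au/observers/index.html";

        // The CrawlDb datum is emitted under the original (seed) URL ...
        byUrl.computeIfAbsent(original, k -> new Records()).dbDatum = "db";

        // ... but the fetch and parse output land under the redirect target.
        Records t = byUrl.computeIfAbsent(target, k -> new Records());
        t.fetchDatum = "fetch";
        t.parseText  = "text";
        t.parseData  = "data";

        // Neither key holds a complete set, so a reducer applying the
        // quoted null check would skip both; prints complete=false twice.
        for (Map.Entry<String, Records> e : byUrl.entrySet()) {
            System.out.println(e.getKey() + " complete=" + e.getValue().complete());
        }
    }
}
```

The simulation shows why a single extra round fixes it under the default settings: in the next cycle the target URL gains its own CrawlDb entry, so all four records finally share one key.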

