Hi Sebastian,

I meant #1, and used http.redirect.max == 3.
Thanks, Arkadi

> -----Original Message-----
> From: Sebastian Nagel [mailto:[email protected]]
> Sent: Tuesday, 3 November 2015 6:13 PM
> To: [email protected]
> Subject: Re: Bug: redirected URLs lost on indexing stage?
>
> Hi Arkadi,
>
> > Example: use http://www.atnf.csiro.au/observers/ as seed and set
> > depth to 1. It will be redirected to
> > http://www.atnf.csiro.au/observers/index.html, fetched and parsed
> > successfully and then lost. If you set depth to 2, it will get indexed.
>
> Just to be sure we use the same terminology: what does "depth" mean?
> 1. number of rounds: the number of generate-fetch-update cycles when
>    running Nutch, see the command-line help of bin/crawl
> 2. value of the property http.redirect.max
> 3. value of the property scoring.depth.max (used by the plugin scoring-depth)
>
> If it's about #1 and http.redirect.max == 0 (the default):
> you need at least two rounds to index a redirected page.
> During the first round the redirect is fetched and the redirect target is
> recorded. The second round will fetch, parse and index the redirect target.
>
> If http.redirect.max is set to a value > 0, the fetcher will follow
> redirects immediately in the current round. But there are some drawbacks,
> and that's why this isn't the default:
> - no deduplication if multiple pages are redirected to the same target,
>   e.g., an error page. This means you'll spend extra network bandwidth
>   to fetch the same content multiple times. Nutch will keep only one
>   instance of the page anyway.
> - by setting http.redirect.max to a high value you may get lost in
>   round-trip redirects
> - if http.redirect.max is too low, longer redirect chains are cut off.
>   Nutch will not follow these redirects.
>
> Cheers,
> Sebastian
>
>
> On 11/03/2015 01:21 AM, [email protected] wrote:
> > Hi Sebastian,
> >
> > Thank you for the very quick and detailed response. I've checked again
> > and found that redirected URLs get lost if they were injected in the
> > last iteration.
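For readers following along: the property Sebastian refers to is configured in conf/nutch-site.xml (it defaults to 0 in nutch-default.xml). A minimal fragment might look like the following; the description text here is paraphrased, not quoted from Nutch's shipped defaults:

```xml
<!-- conf/nutch-site.xml -->
<property>
  <name>http.redirect.max</name>
  <!-- 0 (the default) defers redirect targets to the next round;
       a value > 0 lets the fetcher follow up to that many redirects
       immediately within the current round. -->
  <value>3</value>
  <description>Maximum number of redirects the fetcher will follow
  when fetching a page.</description>
</property>
```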
> >
> > Example: use http://www.atnf.csiro.au/observers/ as seed and set depth
> > to 1. It will be redirected to
> > http://www.atnf.csiro.au/observers/index.html, fetched and parsed
> > successfully and then lost. If you set depth to 2, it will get indexed.
> >
> > If you use http://www.atnf.csiro.au/observers/index.html as seed, it
> > will be fetched, parsed and indexed successfully even if you set depth
> > to 1.
> >
> > Regards,
> > Arkadi
> >
> >> -----Original Message-----
> >> From: Sebastian Nagel [mailto:[email protected]]
> >> Sent: Thursday, 29 October 2015 7:23 AM
> >> To: [email protected]
> >> Subject: Re: Bug: redirected URLs lost on indexing stage?
> >>
> >> Hi Arkadi,
> >>
> >>> In my experience, Nutch follows redirects OK (after NUTCH-2124 is
> >>> applied),
> >>
> >> Yes, 1.9 is affected by NUTCH-2124 / NUTCH-1939 if http.redirect.max > 0
> >>
> >>> fetches target content, parses and saves it, but loses it on the
> >>> indexing stage.
> >>
> >> Can you give a concrete example?
> >>
> >> While testing NUTCH-2124, I've verified that redirect targets get
> >> indexed.
> >>
> >>> Therefore, when this condition is checked
> >>>
> >>> if (fetchDatum == null || dbDatum == null || parseText == null
> >>>     || parseData == null) {
> >>>   return; // only have inlinks
> >>> }
> >>>
> >>> both sets get ignored because each one is incomplete.
> >>
> >> This code snippet is correct: a redirect is pretty much the same as a
> >> link, in that the crawler follows it. OK, there are many differences,
> >> but the central point is that a link itself does not get indexed, only
> >> the link target. And that's the same for redirects. There are always
> >> at least 2 URLs:
> >> - the source of the redirect
> >> - and the target of the redirection
> >> Only the latter gets indexed, after it has been fetched and if it is
> >> not a redirect itself.
> >>
> >> The source has no parseText and parseData, and that's why it cannot
> >> be indexed.
> >>
> >> If the target does not make it into the index:
> >> - first, check whether it passes the URL filters and is not changed
> >>   by the normalizers
> >> - was it successfully fetched and parsed?
> >> - is it not excluded by robots=noindex?
> >>
> >> You should check the CrawlDb and the segments for this URL.
> >>
> >> If you could provide a concrete example, I'm happy to have a detailed
> >> look at it.
> >>
> >> Cheers,
> >> Sebastian
> >>
> >>
> >> On 10/28/2015 08:57 AM, [email protected] wrote:
> >>> Hi,
> >>>
> >>> I am using Nutch 1.9 with the NUTCH-2124 patch applied. I've put a
> >>> question mark in the subject because I work with a Nutch modification
> >>> called Arch (see http://www.atnf.csiro.au/computing/software/arch/).
> >>> This is why I am only 99% sure that the same bug would occur in the
> >>> original Nutch 1.9.
> >>>
> >>> In my experience, Nutch follows redirects OK (after NUTCH-2124 is
> >>> applied), fetches target content, parses and saves it, but loses it
> >>> on the indexing stage. This happens because the db datum is mapped
> >>> with the original URL as the key, but the fetch and parse data and
> >>> the parse text are mapped with the final URL in IndexerMapReduce.
> >>> Therefore, when this condition is checked
> >>>
> >>> if (fetchDatum == null || dbDatum == null || parseText == null
> >>>     || parseData == null) {
> >>>   return; // only have inlinks
> >>> }
> >>>
> >>> both sets get ignored because each one is incomplete.
> >>>
> >>> I am going to fix this for Arch, but can't offer a patch for Nutch,
> >>> sorry. This is because I am not completely sure that this is a bug
> >>> in Nutch (see above), and also because what works for Arch may not
> >>> work for Nutch. They differ in their use of the crawl db.
> >>>
> >>> Regards,
> >>>
> >>> Arkadi
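The key mismatch Arkadi describes can be illustrated with a small standalone simulation. This is not Nutch code: the `Records` class and its string fields are invented stand-ins for the CrawlDb datum, fetch datum, parse text, and parse data, and only model how records grouped under two different URL keys can each fail the reducer's completeness check.

```java
import java.util.HashMap;
import java.util.Map;

// Toy simulation (hypothetical, not Nutch's actual types) of the join in
// IndexerMapReduce: per-URL records are grouped by key, and a key is only
// indexed if all four record kinds are present.
public class RedirectKeyMismatch {

    // One URL key's record set; any null field makes the set incomplete,
    // mirroring the null check quoted in the thread.
    static class Records {
        String dbDatum, fetchDatum, parseText, parseData;
        boolean complete() {
            return dbDatum != null && fetchDatum != null
                && parseText != null && parseData != null;
        }
    }

    public static void main(String[] args) {
        Map<String, Records> byUrl = new HashMap<>();

        String original = "http://www.atnf.csiro.au/observers/";
        String target   = "http://www.atnf.csiro.au/observers/index.html";

        // The CrawlDb datum is emitted under the original (seed) URL ...
        byUrl.computeIfAbsent(original, k -> new Records()).dbDatum = "db";

        // ... but the fetch and parse output land under the redirect target.
        Records t = byUrl.computeIfAbsent(target, k -> new Records());
        t.fetchDatum = "fetch";
        t.parseText  = "text";
        t.parseData  = "data";

        // Neither key holds a complete set, so a reducer applying the
        // quoted null check would skip both; prints complete=false twice.
        for (Map.Entry<String, Records> e : byUrl.entrySet()) {
            System.out.println(e.getKey() + " complete=" + e.getValue().complete());
        }
    }
}
```

The simulation shows why a single extra round fixes it under the default settings: in the next cycle the target URL gains its own CrawlDb entry, so all four records finally share one key.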

