Hi,

> I meant #1 and used http.redirect.max == 3.

In this case you definitely have to apply the fix for NUTCH-2124 / NUTCH-1939 and rebuild your 1.9 package. Or use 1.10, where NUTCH-1939 is fixed and has not yet reappeared as NUTCH-2124 :)

Alternatively, use http.redirect.max == 0 and crawl a sufficient number of rounds.

Cheers,
Sebastian

On 11/06/2015 05:09 AM, [email protected] wrote:
> Hi Sebastian,
>
> I meant #1 and used http.redirect.max == 3.
>
> Thanks,
> Arkadi
>
>> -----Original Message-----
>> From: Sebastian Nagel [mailto:[email protected]]
>> Sent: Tuesday, 3 November 2015 6:13 PM
>> To: [email protected]
>> Subject: Re: Bug: redirected URLs lost on indexing stage?
>>
>> Hi Arkadi,
>>
>>> Example: use http://www.atnf.csiro.au/observers/ as seed and set
>>> depth to 1. It will be redirected to
>>> http://www.atnf.csiro.au/observers/index.html, fetched and parsed
>>> successfully and then lost. If you set depth to 2, it will get indexed.
>>
>> Just to be sure we use the same terminology: what does "depth" mean?
>> 1. the number of rounds, i.e. generate-fetch-update cycles, when
>>    running Nutch (see the command-line help of bin/crawl)
>> 2. the value of the property http.redirect.max
>> 3. the value of the property scoring.depth.max (used by the plugin
>>    scoring-depth)
>>
>> If it's about #1 and http.redirect.max == 0 (the default): you need at
>> least two rounds to index a redirected page. During the first round the
>> redirect is fetched and the redirect target is recorded. The second
>> round will fetch, parse and index the redirect target.
>>
>> If http.redirect.max is set to a value > 0, the fetcher will follow
>> redirects immediately in the current round. But there are some
>> drawbacks, and that's why this isn't the default:
>> - no deduplication if multiple pages are redirected to the same target,
>>   e.g., an error page. This means you'll spend extra network bandwidth
>>   to fetch the same content multiple times, while Nutch will keep only
>>   one instance of the page anyway.
>> - by setting http.redirect.max to a high value you may get lost in
>>   round-trip redirects.
>> - if http.redirect.max is too low, longer redirect chains are cut off
>>   and Nutch will not follow those redirects.
>>
>> Cheers,
>> Sebastian
>>
>>
>> On 11/03/2015 01:21 AM, [email protected] wrote:
>>> Hi Sebastian,
>>>
>>> Thank you for the very quick and detailed response. I've checked again
>>> and found that redirected URLs get lost if they had been injected in
>>> the last iteration.
>>>
>>> Example: use http://www.atnf.csiro.au/observers/ as seed and set depth
>>> to 1. It will be redirected to
>>> http://www.atnf.csiro.au/observers/index.html, fetched and parsed
>>> successfully and then lost. If you set depth to 2, it will get
>>> indexed.
>>>
>>> If you use http://www.atnf.csiro.au/observers/index.html as seed, it
>>> will be fetched, parsed and indexed successfully even if you set depth
>>> to 1.
>>>
>>> Regards,
>>> Arkadi
>>>
>>>> -----Original Message-----
>>>> From: Sebastian Nagel [mailto:[email protected]]
>>>> Sent: Thursday, 29 October 2015 7:23 AM
>>>> To: [email protected]
>>>> Subject: Re: Bug: redirected URLs lost on indexing stage?
>>>>
>>>> Hi Arkadi,
>>>>
>>>>> In my experience, Nutch follows redirects OK (after NUTCH-2124
>>>>> applied),
>>>>
>>>> Yes, 1.9 is affected by NUTCH-2124 / NUTCH-1939 if
>>>> http.redirect.max > 0.
>>>>
>>>>> fetches target content, parses and saves it, but loses it on the
>>>>> indexing stage.
>>>>
>>>> Can you give a concrete example? While testing NUTCH-2124, I've
>>>> verified that redirect targets get indexed.
>>>>
>>>>> Therefore, when this condition is checked
>>>>>
>>>>> if (fetchDatum == null || dbDatum == null || parseText == null
>>>>>     || parseData == null) {
>>>>>   return; // only have inlinks
>>>>> }
>>>>>
>>>>> both sets get ignored because each one is incomplete.
>>>>
>>>> This code snippet is correct: a redirect is pretty much the same as a
>>>> link, the crawler follows it.
>>>> Ok, there are many differences, but the central point: a link does
>>>> not get indexed, only the link target does. And that's the same for
>>>> redirects. There are always at least two URLs:
>>>> - the source of the redirect
>>>> - the target of the redirection
>>>> Only the latter gets indexed, after it has been fetched and provided
>>>> it is not a redirect itself.
>>>>
>>>> The source has no parseText and parseData, and that's why it cannot
>>>> be indexed.
>>>>
>>>> If the target does not make it into the index:
>>>> - first, check whether it passes the URL filters and is not changed
>>>>   by the normalizers
>>>> - was it successfully fetched and parsed?
>>>> - is it not excluded by robots=noindex?
>>>>
>>>> You should check the CrawlDb and the segments for this URL.
>>>>
>>>> If you can provide a concrete example, I'm happy to have a detailed
>>>> look at it.
>>>>
>>>> Cheers,
>>>> Sebastian
>>>>
>>>>
>>>> On 10/28/2015 08:57 AM, [email protected] wrote:
>>>>> Hi,
>>>>>
>>>>> I am using Nutch 1.9 with the NUTCH-2124 patch applied. I've put a
>>>>> question mark in the subject because I work with a Nutch
>>>>> modification called Arch (see
>>>>> http://www.atnf.csiro.au/computing/software/arch/). This is why I am
>>>>> only 99% sure that the same bug would occur in the original
>>>>> Nutch 1.9.
>>>>>
>>>>> In my experience, Nutch follows redirects OK (after NUTCH-2124 is
>>>>> applied), fetches target content, parses and saves it, but loses it
>>>>> on the indexing stage. This happens because the db datum is mapped
>>>>> with the original URL as the key, but the fetch and parse data and
>>>>> the parse text are mapped with the final URL in IndexerMapReduce.
>>>>> Therefore, when this condition is checked
>>>>>
>>>>> if (fetchDatum == null || dbDatum == null || parseText == null
>>>>>     || parseData == null) {
>>>>>   return; // only have inlinks
>>>>> }
>>>>>
>>>>> both sets get ignored because each one is incomplete.
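The key mismatch described in the message above can be sketched outside Hadoop as a plain map from URL to the four parts the reducer expects. This is a minimal illustration only: the `RedirectKeyMismatch` and `Values` names and their fields are invented for the sketch and are not Nutch classes, but the guard mirrors the condition quoted in the thread.

```java
import java.util.HashMap;
import java.util.Map;

public class RedirectKeyMismatch {

    // Stand-in for the per-URL parts collected for indexing; the class
    // and field names are invented for this sketch, not Nutch code.
    static class Values {
        String dbDatum;    // CrawlDb entry
        String fetchDatum; // fetch status
        String parseText;  // extracted text
        String parseData;  // parse metadata
    }

    // Mirrors the guard quoted in the thread: a URL is only indexed
    // when all four parts are present under the same key.
    static boolean indexable(Values v) {
        return v.fetchDatum != null && v.dbDatum != null
                && v.parseText != null && v.parseData != null;
    }

    public static void main(String[] args) {
        Map<String, Values> byUrl = new HashMap<>();

        // The db datum is keyed by the *original* (redirecting) URL ...
        Values source = new Values();
        source.dbDatum = "db_redirect_entry";
        byUrl.put("http://www.atnf.csiro.au/observers/", source);

        // ... while the fetch/parse output is keyed by the *final* URL.
        Values target = new Values();
        target.fetchDatum = "fetch_success";
        target.parseText = "some parsed text";
        target.parseData = "parse metadata";
        byUrl.put("http://www.atnf.csiro.au/observers/index.html", target);

        // Neither entry has all four parts, so nothing gets indexed.
        byUrl.forEach((url, v) ->
                System.out.println(url + " indexable=" + indexable(v)));
    }
}
```

Under this keying, the CrawlDb datum sits under the source URL while the fetch/parse output sits under the target URL, so each record is incomplete and both fail the guard.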
>>>>>
>>>>> I am going to fix this for Arch, but can't offer a patch for Nutch,
>>>>> sorry. This is because I am not completely sure that this is a bug
>>>>> in Nutch (see above), and also because what works for Arch may not
>>>>> work for Nutch: they differ in their use of the CrawlDb.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Arkadi
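For reference, the property discussed throughout this thread is set in conf/nutch-site.xml. A sketch of the http.redirect.max == 0 setup Sebastian recommends (the property name and default are from the thread; the description text here is paraphrased, not copied from nutch-default.xml):

```xml
<!-- conf/nutch-site.xml: leave redirect following to later rounds
     (the default), instead of following them inside the fetcher -->
<property>
  <name>http.redirect.max</name>
  <value>0</value>
  <description>Maximum number of redirects the fetcher follows when
  fetching a page. With 0, redirected URLs are not followed immediately;
  the redirect target is recorded and fetched in a later round, so crawl
  enough rounds for redirect targets to be fetched and indexed.
  </description>
</property>
```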

