Hi Arkadi,
> In my experience, Nutch follows redirects OK (after NUTCH-2124 applied),
Yes, 1.9 is affected by NUTCH-2124 / NUTCH-1939 if http.redirect.max > 0
> fetches target content, parses and saves it, but loses on the indexing stage.
Can you give a concrete example?
While testing NUTCH-2124, I've verified that redirect targets get indexed.
> Therefore, when this condition is checked
>
> if (fetchDatum == null || dbDatum == null || parseText == null || parseData == null) {
> return; // only have inlinks
> }
>
> both sets get ignored because each one is incomplete.
This code snippet is correct. A redirect is pretty much the
same as a link: the crawler follows it. OK, there are many
differences, but the central point is: a link itself does not
get indexed, only the link target. And that's the same
for redirects. There are always at least 2 URLs:
- the source of the redirect
- and the target of the redirection
Only the latter gets indexed, after it has been fetched
and provided it is not a redirect itself.
The source has no parseText and parseData, and that's
why it cannot be indexed.
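To illustrate the point, here is a minimal sketch in plain Java. It is not
the actual IndexerMapReduce code: the class and field names are made up,
plain Strings stand in for the real datum and parse objects, and it assumes
the situation described above, where the target URL has its own CrawlDb
entry once it has been fetched. Only the null check is taken from the
quoted snippet.

```java
// Hypothetical stand-in for the indexing step; all names are illustrative only.
final class RedirectIndexingSketch {

    // Per-URL records as the reducer would see them, grouped by URL key.
    // Nullable Strings replace the real datum/parse objects (assumption).
    static final class Records {
        final String dbDatum;     // CrawlDb entry for this URL
        final String fetchDatum;  // fetch status from the segment
        final String parseText;   // extracted text; absent for a redirect source
        final String parseData;   // parse metadata; absent for a redirect source

        Records(String dbDatum, String fetchDatum, String parseText, String parseData) {
            this.dbDatum = dbDatum;
            this.fetchDatum = fetchDatum;
            this.parseText = parseText;
            this.parseData = parseData;
        }
    }

    // Same condition as in the quoted snippet: if any part is missing,
    // the URL is skipped and nothing is indexed for it.
    static boolean isIndexable(Records r) {
        return !(r.fetchDatum == null || r.dbDatum == null
                || r.parseText == null || r.parseData == null);
    }

    public static void main(String[] args) {
        // Redirect source: fetched, but there is no content to parse.
        Records source = new Records("db entry", "redirected", null, null);
        // Redirect target: fetched and parsed like any other page.
        Records target = new Records("db entry", "fetched", "page text", "page metadata");

        System.out.println("source indexable: " + isIndexable(source));
        System.out.println("target indexable: " + isIndexable(target));
    }
}
```

In this sketch the source is skipped only because it has nothing to index,
while the target passes the check as long as all four parts are present
under its own URL.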
If the target does not make it into the index:
- first, check whether it passes the URL filters and is not changed by the normalizers
- was it successfully fetched and parsed?
- is it not excluded by robots=noindex?
You should check the CrawlDb and the segments for this URL.
If you could provide a concrete example, I'm happy to have
a detailed look at it.
Cheers,
Sebastian
On 10/28/2015 08:57 AM, [email protected] wrote:
> Hi,
>
> I am using Nutch 1.9 with NUTCH-2124 patch applied. I've put a question mark
> in the subject because I work with Nutch modification called Arch (see
> http://www.atnf.csiro.au/computing/software/arch/). This is why I am only 99%
> sure that the same bug would occur in the original Nutch 1.9.
>
> In my experience, Nutch follows redirects OK (after NUTCH-2124 applied),
> fetches target content, parses and saves it, but loses on the indexing stage.
> This happens because the db datum is being mapped with the original URL as
> the key, but the fetch and parse data and parse text are being mapped with
> the final URL in IndexerMapReduce. Therefore, when this condition is checked
>
> if (fetchDatum == null || dbDatum == null || parseText == null || parseData == null) {
> return; // only have inlinks
> }
>
> both sets get ignored because each one is incomplete.
>
> I am going to fix this for Arch, but can't offer a patch for Nutch, sorry.
> This is because I am not completely sure that this is a bug in Nutch (see
> above) and also because what will work for Arch may not work for Nutch. They
> are different in the use of crawl db.
>
> Regards,
>
> Arkadi