Hi Pierre,

Can you supply some additional information:

1. What is the status of that url now? If it is, say, still un-fetched
after the first round, it will be considered again in the 2nd round and so
on. There may be something about that url which causes an exception and
makes Nutch re-try it in every subsequent round.
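
If you are on the 2.x codebase, something like this should dump the stored
row for that url, including its status, fetchTime and retry count (the url
here is just a placeholder, substitute your own):

bin/nutch readdb -url http://www.example.com/somepage.html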

2. I guess you have not modified the fetch interval for urls. It is
typically set to 30 days, but if a user changes it to, say, 4 seconds, the
url becomes eligible to be fetched again in the very next round.
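
For reference, that interval comes from db.fetch.interval.default in
nutch-site.xml; the stock value is 30 days expressed in seconds, roughly
like this (a sketch of the default, worth double-checking against your own
config):

<property>
  <name>db.fetch.interval.default</name>
  <value>2592000</value>
  <description>Default number of seconds between re-fetches of a page
  (30 days).</description>
</property>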

3. Did you observe any exceptions in any of the logs? Please share those.
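
Assuming the default log location, a quick scan such as the following
should surface anything relevant:

grep -i exception logs/hadoop.log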

Thanks,
Tejas

On Sat, Oct 13, 2012 at 10:07 AM, Pierre Nogues <[email protected]> wrote:

>
> Hello,
>
> I'm using Nutch 2.1 with MySQL, and when I do a simple "bin/nutch crawl
> seed/ -depth 5 -topN 10000", I noticed Nutch fetches the same URL 3 or 4
> times during the crawl. Why?
>
> I just configured Nutch to crawl one website locally (a restriction in
> regex-urlfilter); everything else looks ok in MySQL.
>
> nutch-site.xml: http://pastebin.com/Mx9s5Kfz
