Hi Tejas,
So all URLs are affected by the problem: they are all fetched 3 or 4 times during the crawl. I did
not edit any fetch interval and I did not see any exception.
I ran another test; before the test I deleted all the records from the webpage
table.
I ran: "bin/nutch crawl seed/ -depth 5 -topN 10000" with the seed URL
http://serphacker.com/crawltest/
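(If it helps, I can also run the individual steps by hand; as far as I understand, the one-shot crawl command is roughly equivalent to one inject followed by 5 rounds of generate/fetch/parse/updatedb, with -all used here instead of an explicit batch id:
  bin/nutch inject seed/
  bin/nutch generate -topN 10000
  bin/nutch fetch -all
  bin/nutch parse -all
  bin/nutch updatedb
The exact arguments may differ slightly between 2.x versions.)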
The Apache logs of the remote server: http://pastebin.com/tkMPmpuK
The hadoop.log: http://pastebin.com/xRCuKQ5g
The id and status columns of the webpage table at the end of the crawl:
http://pastebin.com/ZVUC5As5
The nutch-site.xml: http://pastebin.com/WD5Cyyin
The regex URL filter: +https?://.*serphacker\.com/crawltest/
nutch-default.xml was not edited.
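In other words, a minimal regex-urlfilter.txt for this test would look something like this (the skip rule is one of the stock defaults, and the final "-." replaces the stock "+." so nothing else is accepted):
  # skip file:, ftp: and mailto: urls (stock rule)
  -^(file|ftp|mailto):
  # accept only the test site
  +https?://.*serphacker\.com/crawltest/
  # reject everything else
  -.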
On 13/10/2012 20:50, Tejas Patil wrote:
Hi Pierre,
Can you supply some additional information:
1. What is the status of that URL now? If, say, it is un-fetched in the first
round, then it will be considered again in the 2nd round and so on. Maybe there
is something about that URL which causes an exception, so Nutch retries it
in all subsequent rounds.
2. I guess you have not modified the fetch interval for URLs. Typically it is
set to 30 days, but if a user changes it to, say, 4 seconds, that URL becomes
eligible to be fetched again in the very next round. (A quick way to check both
points is sketched after point 3 below.)
3. Did you observe any exceptions in any of the logs? Please share those.
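For (1), the stored status can be checked with the readdb tool, and for (2) the interval is controlled by db.fetch.interval.default (default 2592000 seconds = 30 days, from nutch-default.xml). Roughly:
  bin/nutch readdb -url http://serphacker.com/crawltest/
and in nutch-site.xml:
  <property>
    <name>db.fetch.interval.default</name>
    <value>2592000</value> <!-- 30 days, in seconds -->
  </property>
(The exact readdb options may vary slightly across 2.x versions.)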
Thanks,
Tejas
On Sat, Oct 13, 2012 at 10:07 AM, Pierre Nogues <[email protected]> wrote:
Hello,
I'm using Nutch 2.1 with MySQL, and when I do a simple "bin/nutch crawl
seed/ -depth 5 -topN 10000", I notice that Nutch fetches the same URL 3 or 4
times during the crawl. Why?
I just configured Nutch to crawl a single website locally (restriction in
regex-urlfilter); everything else looks OK in MySQL.
nutch-site.xml: http://pastebin.com/Mx9s5Kfz
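The duplicate fetches are easy to spot by counting the "fetching" lines in logs/hadoop.log, assuming the fetcher logs one such line per fetch attempt:
  grep -o 'fetching http[^ ]*' logs/hadoop.log | sort | uniq -c | sort -rn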