Hi Tejas,
So all URLs are affected by the problem: they are all fetched 3 or 4 times during the crawl. I did
not edit any fetch interval and I did not see any exception.
I ran another test; before the test I deleted all the records from the webpage
table.
I ran: "bin/nutch crawl seed/ -depth 5 -topN 10000" with the seed URL
http://serphacker.com/crawltest/
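(If it helps, I can also run the individual steps by hand; as far as I understand, the one-shot crawl command is roughly equivalent to one inject followed by 5 rounds of generate/fetch/parse/updatedb, with -all used here instead of an explicit batch id:
  bin/nutch inject seed/
  bin/nutch generate -topN 10000
  bin/nutch fetch -all
  bin/nutch parse -all
  bin/nutch updatedb
The exact arguments may differ slightly between 2.x versions.)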
The Apache logs of the remote server: http://pastebin.com/tkMPmpuK
The hadoop.log: http://pastebin.com/xRCuKQ5g
The id and status columns of the webpage table at the end of the crawl:
http://pastebin.com/ZVUC5As5
The nutch-site.xml: http://pastebin.com/WD5Cyyin
The regex URL filter: +https?://.*serphacker\.com/crawltest/
nutch-default.xml was not edited.
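In other words, a minimal regex-urlfilter.txt for this test would look something like this (the skip rule is one of the stock defaults, and the final "-." replaces the stock "+." so nothing else is accepted):
  # skip file:, ftp: and mailto: urls (stock rule)
  -^(file|ftp|mailto):
  # accept only the test site
  +https?://.*serphacker\.com/crawltest/
  # reject everything else
  -.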
On 13/10/2012 20:50, Tejas Patil wrote:
Hi Pierre,
Can you supply some additional information:
1. What is the status of that URL now? If, say, it is un-fetched in the first
round, then it will be considered again in the 2nd round and so on. Maybe there
is something about that URL which causes an exception, so Nutch retries it
in all subsequent rounds.
2. I guess you have not modified the fetch interval for URLs. Typically it is
set to 30 days, but if a user changes it to, say, 4 seconds, that URL becomes
eligible to be fetched again in the very next round. (A quick way to check both
points is sketched after point 3 below.)
3. Did you observe any exceptions in any of the logs? Please share those.
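For (1), the stored status can be checked with the readdb tool, and for (2) the interval is controlled by db.fetch.interval.default (default 2592000 seconds = 30 days, from nutch-default.xml). Roughly:
  bin/nutch readdb -url http://serphacker.com/crawltest/
and in nutch-site.xml:
  <property>
    <name>db.fetch.interval.default</name>
    <value>2592000</value> <!-- 30 days, in seconds -->
  </property>
(The exact readdb options may vary slightly across 2.x versions.)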
Thanks,
Tejas
On Sat, Oct 13, 2012 at 10:07 AM, Pierre Nogues <[email protected]> wrote:
Hello,
I'm using Nutch 2.1 with MySQL, and when I do a simple "bin/nutch crawl
seed/ -depth 5 -topN 10000", I notice that Nutch fetches the same URL 3 or 4
times during the crawl. Why?
I just configured Nutch to crawl a single website locally (restriction in
regex-urlfilter); everything else looks OK in MySQL.
nutch-site.xml: http://pastebin.com/Mx9s5Kfz
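The duplicate fetches are easy to spot by counting the "fetching" lines in logs/hadoop.log, assuming the fetcher logs one such line per fetch attempt:
  grep -o 'fetching http[^ ]*' logs/hadoop.log | sort | uniq -c | sort -rn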