Re: same page fetched several times in one crawl

2012-10-16 Thread Sebastian Nagel
> Ok, I did the steps manually and it worked. So the problem did come from the crawl command.

It's not the crawl command alone. It worked for me. Can you try with a minimal nutch-site.xml?

> Is it planned to have a script which already handles this generate-fetch-parse-updatedb loop with some tweak…
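For reference, a minimal nutch-site.xml of the kind suggested here would contain little more than the one property Nutch refuses to run without; the agent value shown is a placeholder:

```
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <!-- http.agent.name is the only property Nutch requires you to set;
       everything else falls back to the defaults in nutch-default.xml. -->
  <property>
    <name>http.agent.name</name>
    <value>MyTestCrawler</value>
  </property>
</configuration>
```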

Re: same page fetched several times in one crawl

2012-10-15 Thread Pierre
Ok, I did the steps manually and it worked. So the problem did come from the crawl command. I did set fetch.store.content = false because I'm only interested in backlink crawling. So you are telling me that there is no way to run Nutch in an automatic way? If I want to do a crawl of a small par…
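The manual steps Pierre refers to can be sketched as the usual Nutch 2.x cycle, one round per depth level. This is a sketch, not the exact commands from the thread; in particular, passing -all instead of the explicit batch id printed by generate is an assumption made to keep the loop short:

```
#!/bin/bash
# Inject the seed list once, then repeat the
# generate -> fetch -> parse -> updatedb cycle per depth level.
bin/nutch inject seed/

for depth in 1 2 3 4 5; do
  echo "--- round $depth ---"
  bin/nutch generate -topN 1000   # select URLs due for fetching into a batch
  bin/nutch fetch -all            # fetch the generated batch
  bin/nutch parse -all            # parse the fetched content
  bin/nutch updatedb              # merge results back into the webpage table
done
```

Running the steps separately like this also makes it easy to inspect the webpage table between rounds, which is how a URL fetched more than once would show up.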

Re: same page fetched several times in one crawl

2012-10-15 Thread Sebastian Nagel
Hi Pierre, I tried almost the same with just the default settings (only the http-agent is set in nutch-site.xml: it's not Googlebot :-O). All went ok, no documents were crawled twice. I don't know what exactly went wrong and didn't find a definitive hint in your logs. Some suggestions: - the craw…

Re: same page fetched several times in one crawl

2012-10-15 Thread Pierre
Hi Tejas, So all URLs are affected by the problem; they are all fetched 3 or 4 times during the crawl. I did not edit any fetch interval and I didn't see an exception. I did another test: before the test I deleted all the records from the webpage table. I ran: "bin/nutch crawl seed/ -depth 5 -top…

Re: same page fetched several times in one crawl

2012-10-13 Thread Tejas Patil
Hi Pierre, Can you supply some additional information: 1. What is the status of that URL now? If, say, it is un-fetched in the first round, then it will be considered again in the 2nd round, and so on. Maybe there is something about that URL which causes an exception, so it is re-tried by Nutch in all s…
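One way to answer the first question is to dump the stored record for the URL and look at its status field. The commands below are a sketch based on the Nutch 2.x readdb tool; the exact option names (-url, -dump) and the dump file layout are assumptions, so check bin/nutch readdb's usage output on your install:

```
# Dump the stored record for a single URL, including its status
# (e.g. unfetched / fetched / gone) and next fetch time.
bin/nutch readdb -url http://www.example.com/

# Or dump the whole webpage table and search it, if the
# single-URL option is unavailable in your version:
bin/nutch readdb -dump webpage_dump
grep -A5 "example.com" webpage_dump/part-*
```

A URL that still reads as unfetched after a round is exactly the case described above: the generator will happily select it again in the next round.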

same page fetched several times in one crawl

2012-10-13 Thread Pierre Nogues
Hello, I'm using Nutch 2.1 with MySQL, and when I do a simple "bin/nutch crawl seed/ -depth 5 -topN 1", I notice Nutch fetches the same URL 3 or 4 times during the crawl. Why? I just configured Nutch to crawl a single website (restriction in regex-urlfilter); everything else looks ok on mys…
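The regex-urlfilter restriction mentioned here is normally done in conf/regex-urlfilter.txt, where rules are applied top to bottom and the first match wins. A sketch of a single-site restriction (the domain is a placeholder, not Pierre's actual site):

```
# Accept only pages under the one site being crawled.
+^http://www\.example\.com/

# Reject everything else (replaces the default catch-all "+." rule).
-.
```

Note that the final catch-all must be changed from the default accept (+.) to a reject (-.), otherwise outlinks to other hosts are still admitted.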