> Ok, I did the step manually and it worked. So the problem did come
> from the crawl command.

It's not the crawl command alone. It worked for me. Can you try with a
minimal nutch-site.xml?
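Something along these lines should be enough to start with (untested on
my side; the agent name "MyTestCrawler" is just a placeholder, put your
own there):

  # write a minimal conf/nutch-site.xml with only the mandatory agent name set;
  # everything else is inherited from nutch-default.xml
  cat > conf/nutch-site.xml <<'EOF'
  <?xml version="1.0"?>
  <configuration>
    <property>
      <name>http.agent.name</name>
      <value>MyTestCrawler</value>
    </property>
  </configuration>
  EOF

If the duplicate fetches disappear with that, add your own properties
back one by one to find the one that triggers them.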
> Is it planned to have a script that already handles this
> generate-fetch-parse-updatedb loop with some tweaks like maximum depth
> of the crawl, maximum time of the crawl?

Have a look at the patches of NUTCH-1087; there is also a patch for 2.x
(but see Julien's comment: "needs testing"). If you could test it and
share your experience, that would help us a lot. Of course, the script
has an argument to limit the crawl cycles (equivalent to -depth). For a
maximum crawl time, see the property fetcher.timelimit.mins (as a rough
equivalent). A sketch of such a loop is at the end of this mail.

2012/10/16 Pierre <[email protected]>:
> Ok, I did the step manually and it worked. So the problem did come
> from the crawl command.
>
> I did set fetch.store.content = false because I'm only interested in
> backlink crawling.
>
> So you are telling me that there is no way to run nutch in an
> automatic way? If I want to crawl a small part of the web, am I
> supposed to repeat the steps manually or write a script that loops
> over generate/fetch/parse/updatedb? It doesn't sound good...
>
> Is it planned to have a script that already handles this
> generate-fetch-parse-updatedb loop with some tweaks like maximum depth
> of the crawl, maximum time of the crawl?
>
>
> On 15/10/2012 22:11, Sebastian Nagel wrote:
>>
>> Hi Pierre,
>>
>> I tried almost the same with just the default settings
>> (only the http-agent is set in nutch-site.xml: it's not Googlebot :-O).
>> All went ok, no documents were crawled twice.
>> I don't know what exactly went wrong
>> and didn't find a definitive hint in your logs. Some suggestions:
>>
>> - the crawl command is deprecated, see
>>   https://issues.apache.org/jira/browse/NUTCH-1087
>>
>> - you should try to perform the steps
>>     inject
>>     generate
>>     fetch
>>     parse
>>     updatedb
>>   "by hand". This gives you more insight into what is going on.
>>   Repeat the steps generate, fetch, parse, updatedb as many times as
>>   needed.
>>   There are many tutorials out there on how to crawl step-by-step, e.g.
>>   http://sujitpal.blogspot.de/2012/01/exploring-nutch-gora-with-cassandra.html
>>   Finally, of course, but (sorry) it's rather short:
>>   http://wiki.apache.org/nutch/Nutch2Tutorial
>>
>> - set fetcher.parse = false and fetcher.store.content = true
>>
>> Good luck,
>>
>> Sebastian
>>
>>
>> On 10/15/2012 02:27 PM, Pierre wrote:
>>>
>>> Hi Tejas,
>>>
>>> So all urls are concerned by the problem; they are all fetched 3 or 4
>>> times during the crawl. I did not edit any fetch interval and I didn't
>>> see any exception.
>>>
>>> I did another test; before the test I deleted all the records from
>>> the webpage table.
>>>
>>> I ran "bin/nutch crawl seed/ -depth 5 -topN 10000" with seed url
>>> http://serphacker.com/crawltest/
>>>
>>> The apache logs of the remote server: http://pastebin.com/tkMPmpuK
>>> The hadoop.log: http://pastebin.com/xRCuKQ5g
>>> The id,status of the webpage table at the end of the crawl:
>>> http://pastebin.com/ZVUC5As5
>>> The nutch-site.xml: http://pastebin.com/WD5Cyyin
>>> The regex url filter: +https?://.*serphacker\.com/crawltest/
>>> nutch-default.xml not edited
>>>
>>>
>>> On 13/10/2012 20:50, Tejas Patil wrote:
>>>>
>>>> Hi Pierre,
>>>>
>>>> Can you supply some additional information:
>>>>
>>>> 1. What is the status of that url now? If, say, it is un-fetched in
>>>> the first round, then it will be considered again in the 2nd round
>>>> and so on. There might be something about that url which causes some
>>>> exception and is thus re-tried by nutch in all subsequent rounds.
>>>>
>>>> 2. I guess you have not modified the fetch interval for urls.
>>>> Typically it's set to 30 days, but if changed to, say, 4 secs by the
>>>> user, it will cause that url to be eligible for fetching in the very
>>>> next round.
>>>>
>>>> 3. Did you observe any exceptions in any logs? Please share those.
>>>>
>>>> Thanks,
>>>> Tejas
>>>>
>>>> On Sat, Oct 13, 2012 at 10:07 AM, Pierre Nogues <[email protected]>
>>>> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I'm using nutch 2.1 with mysql, and when I do a simple "bin/nutch
>>>>> crawl seed/ -depth 5 -topN 10000" I notice nutch fetches the same
>>>>> URL 3 or 4 times during the crawl. Why?
>>>>>
>>>>> I just configured nutch to crawl only one website (restriction in
>>>>> regex-urlfilter); everything else looks ok in mysql.
>>>>>
>>>>> nutch-site.xml: http://pastebin.com/Mx9s5Kfz
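P.S.: until the script from NUTCH-1087 is committed, a hand-rolled loop
could look roughly like the sketch below (untested, for 2.x; check
"bin/nutch <command>" for the exact options of your version, and note
that the assumption that generate exits non-zero when nothing is left
to fetch may not hold for your setup):

  #!/bin/bash
  # rough sketch of the generate/fetch/parse/updatedb cycle for Nutch 2.x
  DEPTH=5       # number of crawl cycles, the rough equivalent of -depth 5
  TOPN=10000    # max URLs generated per cycle, the equivalent of -topN 10000

  bin/nutch inject seed/          # inject the seed list once

  for ((i=1; i<=DEPTH; i++)); do
    echo "=== crawl cycle $i of $DEPTH ==="
    # assumption: a non-zero exit code means nothing was generated, so stop
    bin/nutch generate -topN $TOPN || break
    bin/nutch fetch -all          # fetch the generated batch
    bin/nutch parse -all          # parse what was fetched
    bin/nutch updatedb            # update the webpage table with new links
  done

For a wall-clock limit, fetcher.timelimit.mins in nutch-site.xml cuts
off a single fetch step after the given number of minutes; it does not
limit the whole loop.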

