Hi Luca,

> I'm using Nutch 2.1 on Linux and I'm having similar problem of
> http://goo.gl/nrDLV, my Nutch is fetching same pages at each round.

Um... I failed to reproduce Pierre's problem with
- a simpler configuration
- HBase as back-end (Pierre and Luca both use mysql)

> Then I ran "bin/nutch crawl urls -threads 1"
>
> first.htm was fetched 5 times
> second.htm was fetched 4 times
> third.htm was fetched 3 times

But after the 5th cycle the crawler stopped?

> I tried doing each step separately (inject, generate, ...) with the same
> results.

For Pierre this has worked... Any suggestions?

> Also the whole process take about 2 minutes, am I missing something about
> some delay config or is this normal?

Well, Nutch (resp. Hadoop) is designed to process large amounts of data.
Job management has some overhead (and some artificial sleeps):
5 cycles * 4 jobs (generate/fetch/parse/update) = 20 jobs.
6s per job seems roughly ok, though it could be slightly faster.

Sebastian

On 10/18/2012 05:55 PM, Luca Vasarelli wrote:
> Hello,
>
> I'm using Nutch 2.1 on Linux and I'm having similar problem of
> http://goo.gl/nrDLV, my Nutch is fetching same pages at each round.
>
> I've built a simple localhost site, with 3 pages linked each other:
> first.htm -> second.htm -> third.htm
>
> I did these steps:
>
> - downloaded nutch 2.1 (source) & untarred to ${TEMP_NUTCH}
> - edited ${TEMP_NUTCH}/ivy/ivy.xml uncommenting the line about mysql
>   backend (thanks to [1])
> - edited ${TEMP_NUTCH}/conf/gora.properties removing default sql
>   configuration and adding mysql properties (thanks to [1])
> - ran "ant runtime" from ${TEMP_NUTCH}
> - moved ${TEMP_NUTCH}/runtime/local/ to /opt/${NUTCH_HOME}
> - edited ${NUTCH_HOME}/conf/nutch-site.xml adding http.agent.name,
>   http.robots.agents and changing db.ignore.external.links to true
>   and fetcher.server.delay to 0.0
> - created ${NUTCH_HOME}/urls/seed.txt with
>   "http://localhost/test/first.htm" inside this file
> - created db table as [1]
>
> Then I ran "bin/nutch crawl urls -threads 1"
>
> first.htm was fetched 5 times
> second.htm was fetched 4 times
> third.htm was fetched 3 times
>
> I tried doing each step separately (inject, generate, ...) with the same
> results.
>
> Also the whole process take about 2 minutes, am I missing something about
> some delay config or is this normal?
>
> Some extra info:
>
> - HTML of the pages: http://pastebin.com/dyDPJeZs
> - Hadoop log: http://pastebin.com/rwQQPnkE
> - nutch-site.xml: http://pastebin.com/0WArkvh5
> - Wireshark log: http://pastebin.com/g4Bg17Ls
> - MySQL table: http://pastebin.com/gD2SvGsy
>
> [1] http://nlp.solutions.asia/?p=180
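
The per-job estimate in the reply (5 cycles, 4 jobs each, about 2 minutes in total) can be checked with a quick calculation. This is only a rough sketch: it assumes "about 2 minutes" means exactly 120 seconds and that every job takes the same time, neither of which is measured in the thread. The variable names are made up for illustration.

```python
# Rough per-job timing for the crawl described in this thread.
# Assumption: "about 2 minutes" is taken as exactly 120 seconds.
TOTAL_SECONDS = 2 * 60
CYCLES = 5
# The four jobs run in each cycle, as listed in the reply.
JOBS_PER_CYCLE = ("generate", "fetch", "parse", "update")

total_jobs = CYCLES * len(JOBS_PER_CYCLE)
# Assumption: the total time is spread evenly over all jobs.
seconds_per_job = TOTAL_SECONDS / total_jobs

print(f"{total_jobs} jobs, about {seconds_per_job:.0f}s per job")
```

So a 2-minute run for a 3-page site is mostly fixed per-job overhead, not fetch time.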

