Off topic. We are talking about an issue with the SQL backend in GORA, not the performance of Nutch.

Julien

On 18 October 2012 20:28, Stefan Scheffler <[email protected]> wrote:

> Hi,
> The reason Nutch is so slow is that all of the steps use Hadoop
> jobs, which take a long time to start. There is also a hardcoded
> 3 second delay somewhere in the Hadoop core, which makes sense in
> distributed systems but not on single machines.
>
> Regards
> Stefan
>
> On 18.10.2012 17:55, Luca Vasarelli wrote:
>
>> Hello,
>>
>> I'm using Nutch 2.1 on Linux and I'm having a problem similar to
>> http://goo.gl/nrDLV: my Nutch is fetching the same pages at each round.
>>
>> I've built a simple localhost site, with 3 pages linked to each other:
>> first.htm -> second.htm -> third.htm
>>
>> I did these steps:
>>
>> - downloaded nutch 2.1 (source) & untarred to ${TEMP_NUTCH}
>> - edited ${TEMP_NUTCH}/ivy/ivy.xml uncommenting the line about mysql
>> backend (thanks to [1])
>> - edited ${TEMP_NUTCH}/conf/gora.properties removing default sql
>> configuration and adding mysql properties (thanks to [1])
>> - ran "ant runtime" from ${TEMP_NUTCH}
>> - moved ${TEMP_NUTCH}/runtime/local/ to /opt/${NUTCH_HOME}
>> - edited ${NUTCH_HOME}/conf/nutch-site.xml adding http.agent.name,
>> http.robots.agents and changing db.ignore.external.links to true and
>> fetcher.server.delay to 0.0
>> - created ${NUTCH_HOME}/urls/seed.txt with
>> "http://localhost/test/first.htm" inside this file
>> - created db table as in [1]
>>
>> Then I ran "bin/nutch crawl urls -threads 1"
>>
>> first.htm was fetched 5 times
>> second.htm was fetched 4 times
>> third.htm was fetched 3 times
>>
>> I tried doing each step separately (inject, generate, ...) with the same
>> results.
>>
>> Also the whole process takes about 2 minutes. Am I missing something about
>> some delay config or is this normal?
>>
>> Some extra info:
>>
>> - HTML of the pages: http://pastebin.com/dyDPJeZs
>> - Hadoop log: http://pastebin.com/rwQQPnkE
>> - nutch-site.xml: http://pastebin.com/0WArkvh5
>> - Wireshark log: http://pastebin.com/g4Bg17Ls
>> - MySQL table: http://pastebin.com/gD2SvGsy
>>
>> [1] http://nlp.solutions.asia/?p=180
>>

-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

