Hello,

After a few days of testing Nutch on Amazon EMR (1 master and 2 slaves) I had to give up. It was extremely slow (an average fetch rate of 8 URLs/sec across both slaves), and with the map-reduce overhead on top, the whole setup didn't satisfy me at all. I moved the Nutch crawl database and segments to a single EC2 instance and it now runs quite fast, peaking at 35 fetched pages/sec with an average of 25/sec. I know Nutch is designed to run on Hadoop and I regret that it didn't work out in my case.
Anyway, I would like to know whether I'm alone with this approach or whether everybody sets up Nutch with Hadoop. If some of you run Nutch on a single instance, maybe you can share some best practices, e.g. do you use the crawl script, or run generate/fetch/updatedb continuously, perhaps via cron jobs? Btw, I can see retry 0, retry 1, retry 2 and so on in the crawldb stats - what exactly do they mean?

Regards,
Tomasz

Here are my current crawldb stats:

TOTAL urls: 16347942
retry 0: 16012503
retry 1: 134346
retry 2: 106037
retry 3: 95056
min score: 0.0
avg score: 0.04090025
max score: 331.052
status 1 (db_unfetched): 14045806
status 2 (db_fetched): 1769382
status 3 (db_gone): 160768
status 4 (db_redir_temp): 68104
status 5 (db_redir_perm): 151944
status 6 (db_notmodified): 151938
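To be concrete about what I mean by running generate/fetch/updatedb continuously, here is a rough sketch of one pass of the Nutch 1.x command-line cycle (the crawl directory layout and the -topN value are just example placeholders, not my actual configuration):

```shell
#!/bin/sh
# One pass of the Nutch 1.x crawl cycle.
# CRAWLDB/SEGMENTS paths and -topN are example values only.
CRAWLDB=crawl/crawldb
SEGMENTS=crawl/segments

# Select the next batch of URLs to fetch from the crawldb
bin/nutch generate $CRAWLDB $SEGMENTS -topN 50000

# generate creates a new timestamped segment directory; pick the latest one
SEGMENT=$SEGMENTS/`ls $SEGMENTS | tail -1`

# Fetch the generated batch, then parse the fetched content
bin/nutch fetch $SEGMENT
bin/nutch parse $SEGMENT

# Fold the fetch results back into the crawldb
bin/nutch updatedb $CRAWLDB $SEGMENT
```

A cron entry could then rerun such a script on a schedule, but I'd be interested in how others actually drive it.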

