Hi - see inline.
Markus

-----Original message-----
> From:Tomasz <[email protected]>
> Sent: Wednesday 24th February 2016 11:54
> To: [email protected]
> Subject: Nutch single instance
>
> Hello,
>
> After a few days testing Nutch with Amazon EMR (1 master and 2 slaves) I
> had to give up. It was extremely slow (avg. fetching speed at 8 urls/sec
> counting those 2 slaves) and along with map-reduce overhead the whole
> solution hasn't satisfied me at all. I moved Nutch crawl databases and
> segments to single EC2 instance and it works pretty fast now reaching 35
> fetched pages/sec with an avg. 25/sec. I know that Nutch is designed to
> work with Hadoop environment and regret it didn't work in my case.

Setting up Nutch the correct way is a delicate matter and takes quite some
trial and error. In general, more machines are faster, but in some cases one
fast beast can easily outperform a few less powerful machines.

>
> Anyway I would like to know if I'm alone with the approach and everybody
> set up Nutch with Hadoop. If no and some of you runs Nutch in a single
> instance maybe you can share with some best practices e.g. do you use crawl
> script or generate/fetch/update continuously perhaps using some cron jobs?

Well, in both cases you need some script(s) to run the jobs. We have a lot of
complicated scripts that get stuff from everywhere. We have integrated Nutch
in our Sitesearch platform, so it has to be coupled to a lot of different
systems. We still rely on bash scripts, but Python is probably easier if the
scripts are complicated. Ideally, in a distributed environment, you use
Apache Oozie to run the crawls.

>
> Btw. I can see retry 0, retry 1, retry 2 and so on in crawldb stats - what
> exactly does it mean?

The retry counts track transient errors, e.g. connection timeouts and
connection resets, but also 5xx errors, which are usually transient. Such
records are eligible for recrawl 24 hours later. By default, after retry 3,
the record goes from db_unfetched to db_gone.

>
> Regards,
> Tomasz
>
> Here are my current crawldb stats:
> TOTAL urls: 16347942
> retry 0: 16012503
> retry 1: 134346
> retry 2: 106037
> retry 3: 95056
> min score: 0.0
> avg score: 0.04090025
> max score: 331.052
> status 1 (db_unfetched): 14045806
> status 2 (db_fetched): 1769382
> status 3 (db_gone): 160768
> status 4 (db_redir_temp): 68104
> status 5 (db_redir_perm): 151944
> status 6 (db_notmodified): 151938
>
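
A footnote on the stats quoted above: the retry lines and the status lines are two independent breakdowns of the same records, so each should add up to TOTAL urls. Below is a minimal Python sketch that checks this for the quoted numbers. The function name parse_stats and the regular expressions are my own illustration, not part of Nutch, and they assume the stats text has exactly the layout shown in the message.

```python
import re

# crawldb stats copied from the quoted message (score lines left out)
STATS = """\
> TOTAL urls: 16347942
> retry 0: 16012503
> retry 1: 134346
> retry 2: 106037
> retry 3: 95056
> status 1 (db_unfetched): 14045806
> status 2 (db_fetched): 1769382
> status 3 (db_gone): 160768
> status 4 (db_redir_temp): 68104
> status 5 (db_redir_perm): 151944
> status 6 (db_notmodified): 151938
"""


def parse_stats(text):
    """Hypothetical helper: return (total, retries, statuses) from stats text."""
    total = None
    retries = {}   # retry counter -> number of records
    statuses = {}  # status name   -> number of records
    for raw in text.splitlines():
        line = raw.lstrip("> ").strip()  # drop email quote markers
        m = re.match(r"TOTAL urls:\s*(\d+)$", line)
        if m:
            total = int(m.group(1))
            continue
        m = re.match(r"retry (\d+):\s*(\d+)$", line)
        if m:
            retries[int(m.group(1))] = int(m.group(2))
            continue
        m = re.match(r"status \d+ \((\w+)\):\s*(\d+)$", line)
        if m:
            statuses[m.group(1)] = int(m.group(2))
    return total, retries, statuses


total, retries, statuses = parse_stats(STATS)

# Records that have hit at least one transient error so far.
retried = sum(count for retry, count in retries.items() if retry > 0)

if __name__ == "__main__":
    print("retry lines sum to TOTAL: ", sum(retries.values()) == total)
    print("status lines sum to TOTAL:", sum(statuses.values()) == total)
    print("records with retry > 0:   ", retried)
    for name, count in statuses.items():
        print(f"{name:>16}: {100.0 * count / total:6.2f}%")
```

For the numbers in this thread both sums match TOTAL urls, and 335439 records carry a retry count above 0. A check like this could run from cron after each update step to watch how the retry and db_gone counts develop, but treat that as a suggestion rather than something we do exactly this way.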

