Oh, I forgot the following: enable Hadoop's Snappy compression on input and output files. It reduced our storage requirements to 10% of the original file size; apparently Nutch's data structures compress easily. It also greatly reduces I/O, thus speeding up all load times, and the CPU usage is negligible compared to the I/O wait.

Markus
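For anyone who wants to try this, below is a minimal sketch of how such compression settings could be passed to the Nutch 1.x jobs. The mapreduce.* property names are standard Hadoop settings, but the shared options variable, the paths and the -topN value are illustrative assumptions rather than details from this thread; the same properties can also be set once in nutch-site.xml or mapred-site.xml, and the Hadoop native libraries with Snappy support must be available on the machine.

  #!/bin/bash
  # Sketch: Snappy block compression for map output and job output files.
  # CrawlDb, LinkDb and segment data are Hadoop sequence/map files, so the
  # jobs in the crawl cycle pick these settings up.
  COMPRESS_OPTS="-D mapreduce.map.output.compress=true \
    -D mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
    -D mapreduce.output.fileoutputformat.compress=true \
    -D mapreduce.output.fileoutputformat.compress.type=BLOCK \
    -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec"

  # Pass the same options to every job in the cycle, e.g.:
  bin/nutch generate $COMPRESS_OPTS crawl/crawldb crawl/segments -topN 50000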
-----Original message-----
> From: Tomasz <[email protected]>
> Sent: Wednesday 24th February 2016 15:46
> To: [email protected]
> Subject: Re: Nutch single instance
>
> Markus, thanks for sharing. Changing the topic a bit: a few messages
> earlier I asked about storing only the links between pages, without the
> content. With your great help I now run Nutch with fetcher.store.content =
> false and fetcher.parse = true, and omit the parse step in the
> generate/fetch/update cycle. What's more, I remove parse_text from the
> segments directory after each cycle to save space, but the space used by
> the segments is still growing rapidly and I wonder if I really need all
> that data. Let me summarise my case: I crawl only to get the connections
> between pages (inverted links with anchors) and I don't need the content.
> I run the generate/fetch/update cycle continuously (I've set a time limit
> so the fetcher runs at most 90 min). Is there a way I can save more
> storage space? Thanks.
>
> Tomasz
>
> 2016-02-24 12:09 GMT+01:00 Markus Jelsma <[email protected]>:
>
> > Hi - see inline.
> > Markus
> >
> > -----Original message-----
> > > From: Tomasz <[email protected]>
> > > Sent: Wednesday 24th February 2016 11:54
> > > To: [email protected]
> > > Subject: Nutch single instance
> > >
> > > Hello,
> > >
> > > After a few days of testing Nutch on Amazon EMR (1 master and 2
> > > slaves) I had to give up. It was extremely slow (an avg. fetching
> > > speed of 8 urls/sec counting both slaves) and, together with the
> > > map-reduce overhead, the whole solution didn't satisfy me at all. I
> > > moved the Nutch crawl databases and segments to a single EC2 instance
> > > and it works pretty fast now, reaching 35 fetched pages/sec with an
> > > avg. of 25/sec. I know that Nutch is designed to work in a Hadoop
> > > environment and regret it didn't work out in my case.
> >
> > Setting up Nutch the correct way is a delicate matter and takes quite
> > some trial and error. In general more machines are faster, but in some
> > cases one fast beast can easily outperform a few less powerful machines.
> >
> > >
> > > Anyway, I would like to know if I'm alone with this approach and
> > > everybody sets up Nutch with Hadoop. If not, and some of you run
> > > Nutch on a single instance, maybe you can share some best practices,
> > > e.g. do you use the crawl script or run generate/fetch/update
> > > continuously, perhaps using some cron jobs?
> >
> > Well, in both cases you need some script(s) to run the jobs. We have a
> > lot of complicated scripts that get stuff from everywhere. We have
> > integrated Nutch into our Sitesearch platform, so it has to be coupled
> > to a lot of different systems. We still rely on bash scripts, but
> > Python is probably easier if the scripts get complicated. Ideally, in a
> > distributed environment, you use Apache Oozie to run the crawls.
> >
> > >
> > > Btw. I can see retry 0, retry 1, retry 2 and so on in the crawldb
> > > stats - what exactly do they mean?
> >
> > These are transient errors, e.g. connection timeouts and connection
> > resets, but also 5xx errors, which tend to be transient as well. They
> > are eligible for recrawl 24 hours later. By default, after retry 3, the
> > record goes from db_unfetched to db_gone.
> >
> > > Regards,
> > > Tomasz
> > >
> > > Here are my current crawldb stats:
> > > TOTAL urls:                  16347942
> > > retry 0:                     16012503
> > > retry 1:                     134346
> > > retry 2:                     106037
> > > retry 3:                     95056
> > > min score:                   0.0
> > > avg score:                   0.04090025
> > > max score:                   331.052
> > > status 1 (db_unfetched):     14045806
> > > status 2 (db_fetched):       1769382
> > > status 3 (db_gone):          160768
> > > status 4 (db_redir_temp):    68104
> > > status 5 (db_redir_perm):    151944
> > > status 6 (db_notmodified):   151938
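For reference, a stats listing like the one above comes from the CrawlDb reader; assuming the default crawl/crawldb location, it can be reproduced with:

  # Print the CrawlDb summary (TOTAL urls, retry buckets, status counts).
  bin/nutch readdb crawl/crawldb -stats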
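On Tomasz's setup (links only, no stored content, continuous cycling), a minimal sketch of such a loop follows. It assumes Nutch 1.x in local mode and the default crawl/ layout; the -topN value, the paths and the clean-up step are assumptions for illustration, not recommendations taken from the thread.

  #!/bin/bash
  # Sketch: continuous generate/fetch/update for a link-graph-only crawl.
  # fetcher.parse=true parses while fetching (no separate parse step) and
  # fetcher.store.content=false keeps raw page content out of the segments.
  CRAWLDB=crawl/crawldb
  LINKDB=crawl/linkdb
  SEGMENTS=crawl/segments

  while true; do
    bin/nutch generate "$CRAWLDB" "$SEGMENTS" -topN 50000
    SEGMENT=$(ls -d "$SEGMENTS"/* | sort | tail -1)   # newest segment

    bin/nutch fetch \
      -D fetcher.parse=true \
      -D fetcher.store.content=false \
      -D fetcher.timelimit.mins=90 \
      "$SEGMENT"

    bin/nutch updatedb "$CRAWLDB" "$SEGMENT"
    bin/nutch invertlinks "$LINKDB" "$SEGMENT"

    # parse_text is only needed for content indexing; a link-only crawl can
    # drop it to save space, since invertlinks reads parse_data.
    rm -rf "$SEGMENT"/parse_text
  done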

