Hi Julien,

I've been winding down my Nutch servers these last few weeks in preparation for moving away. I would, however, be very interested in stepping up to provide some stats come the new year. I don't know how long you think this should run, but I am very keen to participate when I can, from early January.
Best
Lewis

On Mon, Dec 17, 2012 at 9:59 PM, Markus Jelsma <[email protected]> wrote:

> Hi,
>
> Interesting indeed. Apart from our customers we operate a cluster of a few
> high octane machines for research purposes that crawls the entire internet
> as much as it physically can. We run a modified Nutch 1.x and some custom
> jobs that analyze the crawled data and allow us to crawl the internet more
> efficiently. The cluster is far too small to quickly read through all the
> data. We only have 80GB of RAM and 80 CPU cores so it takes a while to read
> the ~760GB crawldb containing about 5.7 billion records, it takes about 40
> minutes. Compiling a webgraph and calculating the page rank takes about 32
> hours. Fetching and parsing is less intensive, at peak efficiency we can
> process over 700 pages per second including the reduce phase time and job
> set up and clean up.
>
> We can provide all the information you would like to have.
>
> Cheers.
>
> -----Original message-----
> > From: Julien Nioche <[email protected]>
> > Sent: Mon 17-Dec-2012 22:00
> > To: [email protected]; [email protected]
> > Cc: Lisa Green <[email protected]>
> > Subject: Comparing Nutch and Common Crawl
> >
> > Hi,
> >
> > I was chatting with the people from the Common Crawl project
> > (www.commoncrawl.org) this afternoon and we thought it would be
> > interesting to have some sort of comparison between the space / memory /
> > CPU requirements of their crawler and what it would take to process a
> > similar amount with Nutch 1.x and 2.x. The aim is not so much to prove
> > that one system is superior to the other (they both have their pluses
> > and minuses) but to get a better picture of the situation.
> >
> > One way to do this would be to gather stats from Nutch users operating
> > large crawls. Alternatively one could push the content of the CC dataset
> > into e.g. Nutch 2 on Hbase to see how much space it would take and how
> > the crawl would fare on that. I am pretty sure that would reveal all
> > sorts of interesting issues and would be a good thing to do to test the
> > Nutch + Gora stack.
> >
> > Would anyone be interested in sharing their stats? Anyone with spare
> > time and machine to populate a crawldb with the CC dataset and get some
> > stats?
> >
> > Thanks
> >
> > Julien
> >
> > --
> > Open Source Solutions for Text Engineering
> >
> > http://digitalpebble.blogspot.com/
> > http://www.digitalpebble.com
> > http://twitter.com/digitalpebble

--
*Lewis*
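
As a rough sanity check on the figures Markus quotes (a ~760GB crawldb, about 5.7 billion records, read in about 40 minutes), the implied scan rates can be worked out in a few lines of Python. This is only a back-of-the-envelope sketch: the function name `crawldb_scan_rates` is made up for illustration, and it assumes 1 GB = 10^9 bytes and that the 40 minutes covers one full pass over the crawldb.

```python
def crawldb_scan_rates(records, size_gb, minutes):
    """Derive rough throughput figures from a full crawldb read.

    Assumes decimal gigabytes (1 GB = 10**9 bytes) and that `minutes`
    is the wall-clock time for a single pass over all records.
    """
    seconds = minutes * 60
    return {
        # records read per second of wall-clock time, cluster-wide
        "records_per_sec": records / seconds,
        # megabytes read per second, cluster-wide
        "mb_per_sec": size_gb * 1000 / seconds,
        # average on-disk size of one crawldb record
        "bytes_per_record": size_gb * 1e9 / records,
    }


# Figures taken from the message above; all are approximate.
rates = crawldb_scan_rates(records=5.7e9, size_gb=760, minutes=40)
for name, value in rates.items():
    print(f"{name}: {value:,.1f}")
```

Under those assumptions this comes to roughly 2.4 million records per second, about 317 MB/s across the whole cluster, and about 133 bytes per crawldb record. Per-node or per-core figures would need hardware details that are not given in the thread.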

