RE: Comparing Nutch and Common Crawl

Markus Jelsma Mon, 17 Dec 2012 13:53:47 -0800

Hi,

Interesting indeed. Apart from our work for customers, we operate a cluster of 
a few high-octane machines for research purposes that crawls as much of the 
internet as it physically can. We run a modified Nutch 1.x and some custom jobs 
that analyze the crawled data and allow us to crawl the internet more 
efficiently. The cluster is far too small to read through all the data quickly. 
We only have 80GB of RAM and 80 CPU cores, so reading the ~760GB crawldb, which 
contains about 5.7 billion records, takes about 40 minutes. Compiling a 
webgraph and calculating the page rank takes about 32 hours. Fetching and 
parsing are less intensive: at peak efficiency we can process over 700 pages 
per second, including the reduce phase and job setup and cleanup time.
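
For anyone who wants to turn these figures into comparable rates, here is a 
rough back-of-the-envelope sketch. It assumes decimal gigabytes (1GB = 10^9 
bytes) and takes the rounded numbers above at face value. The constant and 
function names are made up for illustration and are not part of Nutch.

```python
# Figures quoted in the message above (approximate).
CRAWLDB_BYTES = 760e9          # ~760GB crawldb, decimal gigabytes assumed
CRAWLDB_RECORDS = 5.7e9        # about 5.7 billion records
READ_SECONDS = 40 * 60         # a full read takes about 40 minutes
FETCH_PAGES_PER_SECOND = 700   # peak fetch and parse rate

SECONDS_PER_DAY = 24 * 60 * 60


def scan_rates(total_bytes, records, seconds):
    """Derive simple throughput numbers for one full pass over the crawldb."""
    return {
        "mb_per_second": total_bytes / seconds / 1e6,
        "records_per_second": records / seconds,
        "bytes_per_record": total_bytes / records,
    }


def pages_per_day(pages_per_second):
    """Pages processed in 24 hours if the peak rate were sustained."""
    return pages_per_second * SECONDS_PER_DAY


rates = scan_rates(CRAWLDB_BYTES, CRAWLDB_RECORDS, READ_SECONDS)
daily_pages = pages_per_day(FETCH_PAGES_PER_SECOND)
```

With these inputs the scan works out to roughly 317 MB/s and about 2.4 million 
records per second across the cluster, at around 133 bytes per crawldb record. 
The peak fetch rate would correspond to about 60 million pages per day if it 
could be sustained, which the "at peak efficiency" qualifier suggests it 
cannot.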


We can provide all the information you would like to have.

Cheers. 
 
-----Original message-----
> From:Julien Nioche <[email protected]>
> Sent: Mon 17-Dec-2012 22:00
> To: [email protected]; [email protected]
> Cc: Lisa Green <[email protected]>
> Subject: Comparing Nutch and Common Crawl
> 
> Hi,
> 
> I was chatting with the people from the Common Crawl project 
> (www.commoncrawl.org) this afternoon and we 
> thought it would be interesting to have some sort of comparison between the 
> space / memory / CPU requirements of their crawler and what it would take to 
> process a similar amount with Nutch 1.x and 2.x. The aim is not so much to 
> prove that one system is superior to the other (they both have their pluses 
> and minuses) but to get a better picture of the situation.
> 
> One way to do this would be to gather stats from Nutch users operating large 
> crawls. Alternatively one could push the content of the CC dataset into e.g. 
> Nutch 2 on HBase to see how much space it would take and how the crawl would 
> fare on that. I am pretty sure that would reveal all sorts of interesting 
> issues and would be a good thing to do to test the Nutch + Gora stack.
> 
> Would anyone be interested in sharing their stats? Does anyone have spare 
> time and a machine to populate a crawldb with the CC dataset and get some 
> stats?
> 
> Thanks
> 
> Julien
>  
> -- 
> Open Source Solutions for Text Engineering
> 
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
> 
>
