I'm reading the paper now, thanks. It states that 100-node clusters were used. Is it typical in the field to use 100-node clusters at the 1TB scale? We were expecting to use ~10 nodes.
I'm still pretty new to cluster computing, so I'm just not sure how people have set these up.

-Matt Cheah

From: Matei Zaharia <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Wednesday, December 4, 2013 10:53 AM
To: "[email protected]" <[email protected]>
Cc: Mingyu Kim <[email protected]>
Subject: Re: Benchmark numbers for terabytes of data

Yes, check out the Shark paper for example: https://amplab.cs.berkeley.edu/publication/shark-sql-and-rich-analytics-at-scale/

The numbers on that benchmark are for Shark.

Matei

On Dec 3, 2013, at 3:50 PM, Matt Cheah <[email protected]> wrote:

Hi everyone,

I notice the AMPLab benchmark page provides some numbers on GBs of data: https://amplab.cs.berkeley.edu/benchmark/

I was wondering whether similar benchmark numbers exist for even larger data sets, in the terabytes if possible. Also, are there any for raw Spark, i.e. no Shark?

Thanks,

-Matt Cheah
