These were EC2 clusters, so the machines were smaller than modern machines. You can definitely have 1 TB datasets on 10 nodes too. If you’re curious about hardware configuration, take a look at http://spark.incubator.apache.org/docs/latest/hardware-provisioning.html.
Also, regarding Spark vs. Shark: raw Spark code is usually faster than Shark, but we don’t have as many recent benchmarks on large datasets. Some of the code running in the Shark paper is Spark-based though (specifically the machine learning algorithms).

Matei

On Dec 4, 2013, at 11:06 AM, Matt Cheah <[email protected]> wrote:

> I'm reading the paper now, thanks. It states 100-node clusters were used. Is
> this typical in the field to have 100 node clusters for the 1TB scale? We
> were expecting to be using ~10 nodes.
>
> I'm still pretty new to cluster computing, so just not sure how people have
> set these up.
>
> -Matt Cheah
>
> From: Matei Zaharia <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Wednesday, December 4, 2013 10:53 AM
> To: "[email protected]" <[email protected]>
> Cc: Mingyu Kim <[email protected]>
> Subject: Re: Benchmark numbers for terabytes of data
>
> Yes, check out the Shark paper for example:
> https://amplab.cs.berkeley.edu/publication/shark-sql-and-rich-analytics-at-scale/
>
> The numbers on that benchmark are for Shark.
>
> Matei
>
> On Dec 3, 2013, at 3:50 PM, Matt Cheah <[email protected]> wrote:
>
>> Hi everyone,
>>
>> I notice the benchmark page for AMPLab provides some numbers on Gbs of data:
>> https://amplab.cs.berkeley.edu/benchmark/ I was wondering if similar
>> benchmark numbers existed for even larger data sets, in the terabytes if
>> possible.
>>
>> Also, are there any for just raw spark, i.e. No shark?
>>
>> Thanks,
>>
>> -Matt Chetah
