Matt, we've done 1TB linear models in 2-3 minutes on 40-node clusters
(30GB/node, just enough to hold all partitions in memory simultaneously).
You can do it with fewer nodes if you're willing to slow things down.
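For context, the sizing above is just aggregate-memory arithmetic; a back-of-envelope sketch (treating 1TB as ~1000GB and ignoring caching/serialization overhead, which are my assumptions, not figures from this thread):

```python
# Back-of-envelope check: does the dataset fit in aggregate cluster memory?
nodes = 40
mem_per_node_gb = 30           # usable memory per node, in GB
dataset_gb = 1000              # ~1TB of data (assumed 1TB ≈ 1000GB)

total_mem_gb = nodes * mem_per_node_gb   # aggregate cluster memory
fits = total_mem_gb >= dataset_gb        # can all partitions be cached at once?
print(total_mem_gb, fits)
```

With fewer nodes the aggregate memory drops below the dataset size, partitions spill to disk, and the job still runs, just slower.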

Some of our TB benchmark numbers are available in my Spark Summit slides.
Sorry I'm on a plane now but you should be able to find the slides fairly
easily.

Re your other comment: monolithic 100-node analytic clusters are not
unusual, but they're not yet common outside of large companies. My educated
guess is that they sit in the top 5th percentile among companies with less
than $500MM in revenue, with a selection bias toward Silicon Valley
companies.

Sent while mobile. Pls excuse typos etc.
On Dec 4, 2013 11:06 AM, "Matt Cheah" <[email protected]> wrote:

>  I'm reading the paper now, thanks. It states 100-node clusters were
> used. Is this typical in the field to have 100 node clusters for the 1TB
> scale? We were expecting to be using ~10 nodes.
>
>  I'm still pretty new to cluster computing, so just not sure how people
> have set these up.
>
>  -Matt Cheah
>
>   From: Matei Zaharia <[email protected]>
> Reply-To: "[email protected]" <
> [email protected]>
> Date: Wednesday, December 4, 2013 10:53 AM
> To: "[email protected]" <[email protected]>
> Cc: Mingyu Kim <[email protected]>
> Subject: Re: Benchmark numbers for terabytes of data
>
>   Yes, check out the Shark paper for example:
> https://amplab.cs.berkeley.edu/publication/shark-sql-and-rich-analytics-at-scale/
>
>  The numbers on that benchmark are for Shark.
>
>  Matei
>
>  On Dec 3, 2013, at 3:50 PM, Matt Cheah <[email protected]> wrote:
>
>  Hi everyone,
>
>  I notice the AMPLab benchmark page provides some numbers on GBs of
> data: https://amplab.cs.berkeley.edu/benchmark/ I was wondering whether
> similar benchmark numbers exist for even larger data sets, in the
> terabytes if possible.
>
>  Also, are there any for just raw Spark, i.e., no Shark?
>
>  Thanks,
>
>  -Matt Cheah
>
>
>
