Hi Chen, you're right that an apples-to-apples comparison is hard, since we can't assume that every Shark query has an equivalent "Spark only" implementation.
That said, I think the results of your comparisons could still be a valuable reference. There are scenarios where perhaps someone wants to consider the trade-offs between implementing some ETL operation with Shark or with only Spark. Some sense of performance/cost difference would be helpful in making that decision.

--
Christopher T. Nguyen
Co-founder & CEO, Adatao <http://adatao.com>
linkedin.com/in/ctnguyen

On Wed, Jan 29, 2014 at 11:10 PM, Chen Jin <[email protected]> wrote:
> Hi Christopher,
>
> Thanks a lot for taking time to explain some details under Shark's
> hood. It is probably very hard to make an apple-to-apple comparison
> between Shark and Spark since they might be suitable for different
> types of tasks. From what you have explained, is it OK to think Shark
> is better off for SQL-like tasks, while Spark is more for iterative
> machine learning algorithms?
>
> Cheers,
>
> -chen
>
> On Wed, Jan 29, 2014 at 8:59 PM, Christopher Nguyen <[email protected]> wrote:
> > Chen, interesting comparisons you're trying to make. It would be great to
> > share this somewhere when you're done.
> >
> > Some suggestions of non-obvious things to consider:
> >
> > In general there are any number of differences between Shark and some
> > "equivalent" Spark implementation of the same query.
> >
> > Shark isn't necessarily what we may think of as "let's see which lines of
> > code accomplish the same thing in Spark". Its current implementation is
> > based on Hive which has its own query planning, optimization, and execution.
> > Shark's code has some of its own tricks. You can use "EXPLAIN" to see
> > Shark's execution plan, and compare to your Spark approach.
> >
> > Further Shark has its own memory storage format, e.g., typed-column-oriented
> > RDD[TablePartition], that can make it more memory-efficient, and help
> > execute many column aggregation queries a lot faster than the row-oriented
> > RDD[Array[String]] you may be using.
> >
> > In short, Shark does a number of things that are smarter and more optimized
> > for SQL queries than a straightforward Spark RDD implementation of the same.
> > --
> > Christopher T. Nguyen
> > Co-founder & CEO, Adatao
> > linkedin.com/in/ctnguyen
> >
> > On Wed, Jan 29, 2014 at 8:10 PM, Chen Jin <[email protected]> wrote:
> >>
> >> Hi All,
> >>
> >> https://amplab.cs.berkeley.edu/benchmark/ has given a nice benchmark
> >> report. I am trying to reproduce the same set of queries in the
> >> spark-shell so that we can understand more about shark and spark and
> >> their performance on EC2.
> >>
> >> As for the Aggregation Query when X=8, Shark-disk takes 210 seconds
> >> and Shark-mem takes 111 seconds. However, when I materialize the
> >> results to the disk, spark-shell takes more than 5 minutes
> >> (reduceByKey is used in the shell for aggregation). Further, if I
> >> cache uservisits RDD, since the dataset is way too big, the
> >> performance deteriorates quite a lot.
> >>
> >> Can anybody shed some light on why there is a more than 2x difference
> >> between shark-disk and spark-shell-disk and how to cache data in spark
> >> correctly such that we can achieve comparable performance as
> >> shark-mem?
> >>
> >> Thank you very much,
> >>
> >> -chen
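
To make the storage-format point in the thread above concrete, here is a minimal sketch in plain Python. It is not Shark or Spark code. It contrasts a row-oriented layout of string fields (in the spirit of RDD[Array[String]]) with a typed, column-oriented layout (in the spirit of RDD[TablePartition]) for a prefix-grouped sum like the aggregation query with X=8. The column names sourceIP and adRevenue, the sample values, and the function names are assumptions made up for illustration. Only the uservisits table name and X=8 come from the thread.

```python
from array import array

# Hypothetical sample of uservisits rows as (sourceIP, adRevenue) strings,
# the way a row-oriented collection of string arrays would hold them.
rows = [
    ["158.112.27.3", "0.50"],
    ["158.112.90.1", "1.25"],
    ["17.45.201.9", "2.00"],
    ["158.112.27.3", "0.75"],
    ["17.45.200.4", "0.25"],
]


def aggregate_rows(rows, prefix_len):
    """Row-oriented: each row is visited whole and the revenue string is parsed on every pass."""
    totals = {}
    for row in rows:
        key = row[0][:prefix_len]
        totals[key] = totals.get(key, 0.0) + float(row[1])
    return totals


def to_columns(rows):
    """Convert once into columns: a list of keys and a packed array of doubles."""
    source_ip = [row[0] for row in rows]
    ad_revenue = array("d", (float(row[1]) for row in rows))
    return source_ip, ad_revenue


def aggregate_columns(source_ip, ad_revenue, prefix_len):
    """Column-oriented: only the two needed columns are read, and revenue is already numeric."""
    totals = {}
    for ip, revenue in zip(source_ip, ad_revenue):
        key = ip[:prefix_len]
        totals[key] = totals.get(key, 0.0) + revenue
    return totals


source_ip, ad_revenue = to_columns(rows)
row_result = aggregate_rows(rows, 8)
column_result = aggregate_columns(source_ip, ad_revenue, 8)
print(row_result)
print(column_result)
```

Both functions return the same totals. The difference is that the columnar version pays the string-to-number parsing cost once at load time and keeps the numeric column in a compact typed buffer, which is roughly the kind of saving the thread attributes to Shark's in-memory format. A real comparison would of course need to be measured on the actual dataset.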
