Re: Shark vs Impala

2014-06-23 Thread Toby Douglass
On Mon, Jun 23, 2014 at 8:50 AM, Aaron Davidson wrote:
> Note that regarding a "long load time", data format means a whole lot in
> terms of query performance. If you load all your data into compressed,
> columnar Parquet files on local hardware, Spark SQL would also perform far,
> far better tha

Re: Shark vs Impala

2014-06-23 Thread Toby Douglass
On Sun, Jun 22, 2014 at 5:53 PM, Debasish Das wrote:
> 600s for Spark vs 5s for Redshift...The numbers look much different from
> the amplab benchmark...
>
> https://amplab.cs.berkeley.edu/benchmark/
>
> Is it like SSDs or something that's helping redshift or the whole data is
> in memory when yo

Re: Shark vs Impala

2014-06-23 Thread Aaron Davidson
Note that regarding a "long load time", data format means a whole lot in terms of query performance. If you load all your data into compressed, columnar Parquet files on local hardware, Spark SQL would also perform far, far better than it would reading from gzipped S3 files. You must also be carefu
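A rough illustration of the data-format point above, as a toy sketch rather than real Parquet: it stores the same rows once as a gzipped row-oriented CSV and once as separately compressed columns, then counts how many decompressed bytes a single-column aggregate has to touch. The table, the column names, and the helper functions are all hypothetical, and the layout only imitates the idea of columnar storage; it says nothing about the actual Spark SQL, Impala, or S3 numbers discussed in this thread.

```python
import csv
import gzip
import io
import zlib

# Hypothetical sample table: 1,000 rows of (user_id, country, revenue).
rows = [(i, "US" if i % 2 == 0 else "DE", i * 3) for i in range(1000)]


def write_row_gzip(rows):
    """Row-oriented layout: one gzipped CSV blob holding every column."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return gzip.compress(buf.getvalue().encode("utf-8"))


def write_columnar(rows):
    """Toy columnar layout: each column compressed into its own blob."""
    names = ("user_id", "country", "revenue")
    columns = zip(*rows)
    return {
        name: zlib.compress("\n".join(str(v) for v in col).encode("utf-8"))
        for name, col in zip(names, columns)
    }


def sum_revenue_row(blob):
    """Decompress and parse every column just to reach 'revenue'."""
    raw = gzip.decompress(blob)
    reader = csv.reader(io.StringIO(raw.decode("utf-8")))
    total = sum(int(record[2]) for record in reader)
    return total, len(raw)


def sum_revenue_columnar(blobs):
    """Decompress only the 'revenue' column; other blobs stay untouched."""
    raw = zlib.decompress(blobs["revenue"])
    total = sum(int(v) for v in raw.decode("utf-8").split("\n"))
    return total, len(raw)


row_total, row_bytes = sum_revenue_row(write_row_gzip(rows))
col_total, col_bytes = sum_revenue_columnar(write_columnar(rows))

# Both layouts give the same answer; the columnar one reads fewer bytes.
print("totals match:", row_total == col_total)
print("bytes decompressed (row layout):", row_bytes)
print("bytes decompressed (columnar layout):", col_bytes)
```

The gap grows with the number of columns a query does not need, which is one reason a format conversion can change benchmark results independently of the engine.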

Re: Shark vs Impala

2014-06-22 Thread Matei Zaharia
In this benchmark, the problem wasn’t that Shark could not run without enough memory; Shark spills some of the data to disk and can run just fine. The issue was that the in-memory form of the RDDs was larger than the cluster’s memory, although the raw Parquet / ORC files did fit in memory, so Cl

Re: Shark vs Impala

2014-06-22 Thread Debasish Das
600s for Spark vs 5s for Redshift... The numbers look very different from the AMPLab benchmark: https://amplab.cs.berkeley.edu/benchmark/ Is it something like SSDs that's helping Redshift, or is the whole dataset in memory when you run the query? Could you publish the query? Also after spark-s

Re: Shark vs Impala

2014-06-22 Thread Toby Douglass
I've just benchmarked Spark and Impala. Same data (in S3), same query, same cluster. Impala has a long load time, since it cannot load directly from S3. I have to create a Hive table on S3, then insert from that into an Impala table. This takes a long time; Spark took about 600s for the query, Imp

Re: Shark vs Impala

2014-06-22 Thread Bertrand Dechoux
For the second question, I would say it is mainly because the projects do not have the same aim. Impala does have a "cost-based optimizer and predicate propagation capability", which is natural because it interprets pseudo-SQL queries. In the realm of relational databases, it is often not a good idea

Shark vs Impala

2014-06-22 Thread Flavio Pompermaier
Hi folks, I was looking at the benchmark provided by Cloudera at http://blog.cloudera.com/blog/2014/05/new-sql-choices-in-the-apache-hadoop-ecosystem-why-impala-continues-to-lead/ . Is it true that Shark cannot execute some queries if you don't have enough memory? And is it true/reliable that Impala