Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-02-03 Thread Nirav Patel
Awesome! I just read design docs. That is EXACTLY what I was talking about! Looking forward to it! Thanks On Wed, Feb 3, 2016 at 7:40 AM, Koert Kuipers wrote: > yeah there was some discussion about adding them to RDD, but it would > break a lot. so Dataset was born. > > yes

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-02-03 Thread Michael Armbrust
On Wed, Feb 3, 2016 at 1:42 PM, Nirav Patel wrote: > Awesome! I just read design docs. That is EXACTLY what I was talking > about! Looking forward to it! > Great :) Most of the API is there in 1.6. For the next release I would like to unify DataFrame <-> Dataset and do

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-02-02 Thread Jerry Lam
I think spark dataframe supports more than just SQL. It is more like pandas dataframe.( I rarely use the SQL feature. ) There are a lot of novelties in dataframe so I think it is quite optimize for many tasks. The in-memory data structure is very memory efficient. I just change a very slow RDD

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-02-02 Thread Michael Armbrust
> > A principal difference between RDDs and DataFrames/Datasets is that the > latter have a schema associated to them. This means that they support only > certain types (primitives, case classes and more) and that they are > uniform, whereas RDDs can contain any serializable object and must not >

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-02-02 Thread Nirav Patel
Sure, having a common distributed query and compute engine for all kind of data source is alluring concept to market and advertise and to attract potential customers (non engineers, analyst, data scientist). But it's nothing new!..but darn old school. it's taking bits and pieces from existing sql

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-02-02 Thread Jakob Odersky
To address one specific question: > Docs says it usues sun.misc.unsafe to convert physical rdd structure into byte array at some point for optimized GC and memory. My question is why is it only applicable to SQL/Dataframe and not RDD? RDD has types too! A principal difference between RDDs and

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-02-02 Thread Jerry Lam
Hi Michael, Is there a section in the spark documentation demonstrate how to serialize arbitrary objects in Dataframe? The last time I did was using some User Defined Type (copy from VectorUDT). Best Regards, Jerry On Tue, Feb 2, 2016 at 8:46 PM, Michael Armbrust

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-02-02 Thread Nirav Patel
Hi Jerry, Yes I read that benchmark. And doesn't help in most cases. I'll give you example of one of our application. It's a memory hogger by nature since it works on groupByKey and performs combinatorics on Iterator. So it maintain few structures inside task. It works on mapreduce with half the

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-02-02 Thread Koert Kuipers
with respect to joins, unfortunately not all implementations are available. for example i would like to use joins where one side is streaming (and the other cached). this seems to be available for DataFrame but not for RDD. On Wed, Feb 3, 2016 at 12:19 AM, Nirav Patel

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-02-02 Thread Nirav Patel
so latest optimizations done on spark 1.4 and 1.5 releases are mostly from project Tungsten. Docs says it usues sun.misc.unsafe to convert physical rdd structure into byte array at some point for optimized GC and memory. My question is why is it only applicable to SQL/Dataframe and not RDD? RDD

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-02-02 Thread Koert Kuipers
Dataset will have access to some of the catalyst/tungsten optimizations while also giving you scala and types. However that is currently experimental and not yet as efficient as it could be. On Feb 2, 2016 7:50 PM, "Nirav Patel" wrote: > Sure, having a common distributed

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-02-02 Thread Nirav Patel
I dont understand why one thinks RDD of case object doesn't have types(schema) ? If spark can convert RDD to DataFrame which means it understood the schema. SO then from that point why one has to use SQL features to do further processing? If all spark need for optimizations is schema then what

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-02-02 Thread Jerry Lam
Hi Nirav, I'm sure you read this? https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html There is a benchmark in the article to show that dataframe "can" outperform RDD implementation by 2 times. Of course, benchmarks can be "made". But from the

Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-01-25 Thread Nirav Patel
Hi, Perhaps I should write a blog about this that why spark is focusing more on writing easier spark jobs and hiding underlaying performance optimization details from a seasoned spark users. It's one thing to provide such abstract framework that does optimization for you so you don't have to

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-01-25 Thread Mark Hamstra
What do you think is preventing you from optimizing your own RDD-level transformations and actions? AFAIK, nothing that has been added in Catalyst precludes you from doing that. The fact of the matter is, though, that there is less type and semantic information available to Spark from the raw

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-01-25 Thread Nirav Patel
I haven't gone through much details of spark catalyst optimizer and tungston project but we have been advised by databricks support to use DataFrame to resolve issues with OOM error that we are getting during Join and GroupBy operations. We use spark 1.3.1 and looks like it can not perform