RE: Shuffle intermediate results not being cached

2016-12-27 Thread assaf.mendelson
I understand the actual dataframe is different, but the underlying partitions are not (hence the importance of Mark's response). The code you suggested would not work, as allDF and x would have different schemas (x is the original and allDF becomes the grouped). I can do something like this:

Re: What is mainly different from a UDT and a spark internal type that ExpressionEncoder recognized?

2016-12-27 Thread dragonly
Thanks for your reply! Here's my *understanding*: basic types that ScalaReflection understands are encoded into the Tungsten binary format, while UDTs are encoded into GenericInternalRow, which stores the JVM objects in an Array[Any] under the hood, and thus lose the memory-footprint efficiency and

Re: unsubscribe

2016-12-27 Thread Minikek
Once you are in, there is no way out… :-)

> On Dec 27, 2016, at 7:37 PM, Kyle Kelley wrote:
>
> You are now in position 238 for unsubscription. If you wish for your
> subscription to occur immediately, please email
> dev-unsubscr...@spark.apache.org
>
> Best wishes.

Re: What is mainly different from a UDT and a spark internal type that ExpressionEncoder recognized?

2016-12-27 Thread Michael Armbrust
An encoder uses reflection to generate expressions that can extract data out of an object (by calling methods on the object) and encode its contents directly into the
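
The idea described in this message (inspect a class once, derive accessors for its fields, then write the field values straight into a compact buffer) can be illustrated with a toy analogue in Python. This is only a sketch and not Spark's implementation: the `Point` class, the `FORMATS` type mapping, the `make_encoder` helper, and the fixed-width binary layout are all assumptions made up for the illustration.

```python
import struct
from dataclasses import dataclass, fields
from operator import attrgetter


@dataclass
class Point:  # hypothetical example class, not from the thread
    x: int
    y: float


# Toy mapping from Python field types to struct format codes (an assumption).
FORMATS = {int: "q", float: "d"}


def make_encoder(cls):
    """Inspect cls once via reflection and build a reusable encode function."""
    # One accessor per field, playing the role of the generated "expressions".
    getters = [attrgetter(f.name) for f in fields(cls)]
    # Fixed little-endian layout derived from the declared field types.
    layout = "<" + "".join(FORMATS[f.type] for f in fields(cls))

    def encode(obj):
        # Pull each value out of the object and pack it into bytes.
        return struct.pack(layout, *(g(obj) for g in getters))

    return encode, layout


encode, layout = make_encoder(Point)
data = encode(Point(3, 1.5))
decoded = struct.unpack(layout, data)
```

The reflection step runs once per class, so the per-object work is limited to calling the accessors and packing the values.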

RE: Shuffle intermediate results not being cached

2016-12-27 Thread Liang-Chi Hsieh
Hi, in every iteration the data you run the aggregation on is different. As I showed in my previous reply:
1st iteration: aggregation(x1 union x2)
2nd iteration: aggregation(x3 union (x1 union x2))
3rd iteration: aggregation(x4 union (x3 union (x1 union x2)))
In the 1st you run aggregation on the data of
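
The iteration pattern listed in this message can be sketched in plain Python to show why each pass sees a different, growing input. This is a minimal illustration under stated assumptions: the batch contents are made up, `aggregate` is a hypothetical stand-in for a group-by count, and the incremental variant is shown only for comparison (it works for mergeable aggregations such as counts and is not something the thread prescribes).

```python
from collections import Counter


def aggregate(rows):
    """Count occurrences of each key (stand-in for a group-by count)."""
    return Counter(rows)


# x1, x2, x3, x4 with made-up contents
batches = [["a", "b"], ["a", "c"], ["b", "b"], ["c"]]

# Pattern from the message: every iteration aggregates the whole union again.
rows_scanned_full = 0
union = list(batches[0])
for batch in batches[1:]:
    union = batch + union  # x_n union (everything seen so far)
    rows_scanned_full += len(union)
    full_result = aggregate(union)

# For comparison: aggregate only the new batch and merge it into a running result.
rows_scanned_incremental = len(batches[0])
running = aggregate(batches[0])
for batch in batches[1:]:
    rows_scanned_incremental += len(batch)
    running.update(aggregate(batch))
```

Both variants end with the same counts, but the first one rescans all earlier batches in every iteration, because its input is never the same dataset twice.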

What is mainly different from a UDT and a spark internal type that ExpressionEncoder recognized?

2016-12-27 Thread dragonly
I've recently been reading the source code of the Spark SQL project, and found some interesting Databricks blog posts about the Tungsten project. I've roughly read through the encoder and unsafe representation parts of the Tungsten project (I haven't read the algorithm part, such as the cache-friendly hashmap