My bad, thanks.
On Fri, Jan 22, 2016 at 4:34 PM, Reynold Xin wrote:
The original email was asking about data partitioning (Hive style) for
files, not in-memory caching.

On Thursday, January 21, 2016, Takeshi Yamamuro wrote:

You mean RDD#partitions are possibly split into multiple Spark task
partitions?
If so, is the optimization below wrong?
Without opt.:
== Physical Plan ==
TungstenAggregate(key=[col0#159], functions=[(sum(col1#160),mode=Final,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)], output=[...])
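(For context, a minimal Scala sketch of the kind of query that yields a plan like the one above; the column names col0/col1/col2 are taken from the plan, while the table name "t" is hypothetical:)

import org.apache.spark.sql.functions.{avg, sum}

// Hypothetical table with columns col0, col1, col2.
val df = sqlContext.table("t")
val agg = df.groupBy("col0").agg(sum("col1"), avg("col2"))
// Prints the physical plan; an Exchange node before the final
// TungstenAggregate indicates a shuffle.
agg.explain()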
It is not necessary if you are using bucketing, available in Spark 2.0. For
partitioning, it is still necessary, because we do not assume each partition
is small, and as a result there is no guarantee that all the records for a
partition end up in a single Spark task partition.

On Thu, Jan 21, 2016, the original poster wrote:

Hi,
If I had a df and I wrote it out via partitionBy("id"), then presumably,
when I load the df back in and do a groupBy("id"), a shuffle shouldn't be
necessary, right? Effectively, we could load the dataframe with a hash
partitioner already set, since each task can simply read all the folders
where id=<some value>.
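(A minimal Scala sketch of the scenario in question; the path is made up, and explain() is just one way to check whether the shuffle happens:)

df.write.partitionBy("id").parquet("/tmp/df_by_id")

val loaded = sqlContext.read.parquet("/tmp/df_by_id")
// Per the reply above, the plan still contains an Exchange (hash shuffle)
// on id: the reader does not derive a hash partitioner from the id=<value>
// directory layout.
loaded.groupBy("id").count().explain()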