You can just repartition by id if the final objective is to have all
data for the same key in the same partition.
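For intuition: hash partitioning gives you this guarantee because the partition is chosen purely from the key, so equal keys always map to the same partition. A minimal plain-Scala sketch of the idea (not Spark itself; Spark's HashPartitioner works along these lines, using a non-negative modulus of the key's hashCode):

```scala
// Sketch: choose a partition from the key alone, so records with the
// same id always land in the same partition.
object PartitionSketch {
  def partitionFor(key: Any, numPartitions: Int): Int = {
    val mod = key.hashCode % numPartitions
    if (mod < 0) mod + numPartitions else mod // keep the index non-negative
  }

  def main(args: Array[String]): Unit = {
    val data = Seq((1, "abc"), (2, "ab"), (1, "bdf"), (2, "cd"))
    val byPartition = data.groupBy { case (id, _) => partitionFor(id, 4) }
    byPartition.foreach { case (p, recs) => println(s"partition $p -> $recs") }
  }
}
```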
Regards
Sab
On Wed, Jan 6, 2016 at 11:02 AM, Gavin Yue wrote:
I found that in 1.6 a DataFrame can do repartition.
Do I still need to do an orderBy first, or is repartitioning enough?
On Tue, Jan 5, 2016 at 9:25 PM, Gavin Yue wrote:
I tried Ted's solution and it works, but I keep hitting the JVM
out-of-memory problem.
And grouping by the key causes a lot of data shuffling.
So I am trying to order the data by ID first and save it as Parquet. Is
there a way to make sure that the data is partitioned so that each ID's data is
in the same partition?
This would also be possible with an Aggregator in Spark 1.6:
https://docs.cloud.databricks.com/docs/spark/1.6/index.html#examples/Dataset%20Aggregator.html
On Tue, Jan 5, 2016 at 2:59 PM, Ted Yu wrote:
Something like the following:
val zeroValue = collection.mutable.Set[String]()
val aggregated = data.aggregateByKey(zeroValue)(
  (set, v) => set += v,
  (setOne, setTwo) => setOne ++= setTwo)
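For readers unfamiliar with aggregateByKey: the first function folds each value into the running set within a partition, and the second merges the per-partition results per key. The same semantics can be sketched on plain Scala collections, no Spark needed (the two-partition split below is an illustrative assumption):

```scala
import scala.collection.mutable

// Simulate aggregateByKey: fold values into a per-key set within each
// "partition" (seqOp), then merge the per-partition maps key by key (combOp).
object AggregateByKeySketch {
  def aggregate(partitions: Seq[Seq[(Int, String)]]): Map[Int, Set[String]] = {
    // seqOp: add each value to the running set for its key
    val perPartition = partitions.map { part =>
      part.foldLeft(Map.empty[Int, mutable.Set[String]]) { case (acc, (k, v)) =>
        acc.updated(k, acc.getOrElse(k, mutable.Set.empty[String]) += v)
      }
    }
    // combOp: merge the per-partition sets for each key
    val merged = perPartition.reduce { (a, b) =>
      b.foldLeft(a) { case (acc, (k, set)) =>
        acc.updated(k, acc.getOrElse(k, mutable.Set.empty[String]) ++= set)
      }
    }
    merged.map { case (k, set) => k -> set.toSet }
  }

  def main(args: Array[String]): Unit =
    println(aggregate(Seq(Seq((1, "abc"), (2, "ab")), Seq((1, "bdf"), (2, "cd")))))
}
```

Note that because the values are collected into sets, duplicate names per id would be dropped; the real aggregateByKey call above has the same behavior.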
On Tue, Jan 5, 2016 at 2:46 PM, Gavin Yue wrote:
Hey,
For example, a table df with two columns:
id name
1 abc
1 bdf
2 ab
2 cd
I want to group by id and concatenate the names into an array of strings,
like this:
id names
1 [abc, bdf]
2 [ab, cd]
How could I achieve this with DataFrames? I am stuck at df.groupBy("id").???
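For reference, the DataFrame-side answer is groupBy plus a collect_list aggregate, i.e. something like df.groupBy("id").agg(collect_list("name")) (in Spark 1.6, collect_list lives in org.apache.spark.sql.functions and, as far as I recall, requires a HiveContext; that is an assumption to verify against your setup). Since that needs a running Spark context, here is the same transformation sketched on plain Scala collections:

```scala
// Plain-Scala sketch of the desired result (no Spark): group (id, name)
// pairs by id and collect the names into a list per id.
object GroupConcatSketch {
  def groupNames(rows: Seq[(Int, String)]): Map[Int, Seq[String]] =
    rows.groupBy(_._1).map { case (id, pairs) => id -> pairs.map(_._2) }

  def main(args: Array[String]): Unit =
    // prints something like Map(1 -> List(abc, bdf), 2 -> List(ab, cd))
    println(groupNames(Seq((1, "abc"), (1, "bdf"), (2, "ab"), (2, "cd"))))
}
```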
Thanks