I found that in 1.6 a DataFrame can do repartition by column. Do I still need to do the orderBy first, or is the repartition alone enough?
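For concreteness, here is a minimal sketch of what I mean, assuming Spark 1.6's column-based `repartition` and `sortWithinPartitions` (the `df` and `id` names come from the example further down the thread):

```scala
import org.apache.spark.sql.functions.col

// Hash-partition the rows so that all rows with the same id land in the
// same partition; no global orderBy is needed just to co-locate ids.
// sortWithinPartitions then orders rows inside each partition without
// triggering a second shuffle.
val partitioned = df.repartition(col("id"))
  .sortWithinPartitions(col("id"))

partitioned.write.parquet("/path/to/output")
```

One caveat, as far as I know: plain Parquet output does not record the partitioner, so a later job reading these files back will still plan a shuffle when it groups by id.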
On Tue, Jan 5, 2016 at 9:25 PM, Gavin Yue <yue.yuany...@gmail.com> wrote:

> I tried Ted's solution and it works, but I keep hitting the JVM
> out-of-memory problem, and grouping by the key causes a lot of data
> shuffling.
>
> So I am trying to order the data by ID first and save it as Parquet. Is
> there a way to make sure the data is partitioned so that each ID's data
> is in one partition, so there would be no shuffling in the future?
>
> Thanks.
>
> On Tue, Jan 5, 2016 at 3:19 PM, Michael Armbrust <mich...@databricks.com> wrote:
>
>> This would also be possible with an Aggregator in Spark 1.6:
>>
>> https://docs.cloud.databricks.com/docs/spark/1.6/index.html#examples/Dataset%20Aggregator.html
>>
>> On Tue, Jan 5, 2016 at 2:59 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>
>>> Something like the following:
>>>
>>> val zeroValue = collection.mutable.Set[String]()
>>>
>>> val aggregated = data.aggregateByKey(zeroValue)((set, v) => set += v,
>>>   (setOne, setTwo) => setOne ++= setTwo)
>>>
>>> On Tue, Jan 5, 2016 at 2:46 PM, Gavin Yue <yue.yuany...@gmail.com> wrote:
>>>
>>>> Hey,
>>>>
>>>> For example, a table df with two columns:
>>>>
>>>> id  name
>>>> 1   abc
>>>> 1   bdf
>>>> 2   ab
>>>> 2   cd
>>>>
>>>> I want to group by the id and concat the strings into an array of
>>>> strings, like this:
>>>>
>>>> id
>>>> 1   [abc, bdf]
>>>> 2   [ab, cd]
>>>>
>>>> How could I achieve this in a DataFrame? I am stuck on df.groupBy("id").???
>>>>
>>>> Thanks
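In case it helps anyone landing on this thread later: the original groupBy question can also be expressed directly on the DataFrame with `collect_list` (a sketch assuming Spark 1.6; in the 1.x line this function, as far as I know, requires a HiveContext):

```scala
import org.apache.spark.sql.functions.collect_list

// Group rows by id and gather each group's name values into an array column.
val grouped = df.groupBy("id")
  .agg(collect_list("name").as("names"))

grouped.show()
// e.g.
// id  names
// 1   [abc, bdf]
// 2   [ab, cd]
```

This avoids dropping down to the RDD-level aggregateByKey, though like any groupBy it still shuffles the data by id.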