I tried Ted's solution and it works, but I keep hitting JVM out-of-memory errors, and grouping by key causes a lot of data shuffling.
So I am trying to sort the data by ID first and save it as Parquet. Is there a way to make sure the data is partitioned so that each ID's data sits in a single partition, so that no shuffling is needed in the future?

Thanks.

On Tue, Jan 5, 2016 at 3:19 PM, Michael Armbrust <[email protected]> wrote:

> This would also be possible with an Aggregator in Spark 1.6:
>
> https://docs.cloud.databricks.com/docs/spark/1.6/index.html#examples/Dataset%20Aggregator.html
>
> On Tue, Jan 5, 2016 at 2:59 PM, Ted Yu <[email protected]> wrote:
>
>> Something like the following:
>>
>> val zeroValue = collection.mutable.Set[String]()
>>
>> val aggregated = data.aggregateByKey(zeroValue)((set, v) => set += v,
>>   (setOne, setTwo) => setOne ++= setTwo)
>>
>> On Tue, Jan 5, 2016 at 2:46 PM, Gavin Yue <[email protected]> wrote:
>>
>>> Hey,
>>>
>>> For example, a table df with two columns
>>> id name
>>> 1 abc
>>> 1 bdf
>>> 2 ab
>>> 2 cd
>>>
>>> I want to group by the id and concat the string into array of string.
>>> like this
>>>
>>> id
>>> 1 [abc,bdf]
>>> 2 [ab, cd]
>>>
>>> How could I achieve this in dataframe? I am stuck on df.groupBy("id"). ???
>>>
>>> Thanks
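
To make the question concrete, here is a minimal plain-Scala sketch (no Spark involved) of the layout I have in mind, using the sample rows from the original message. The names `PartitionById`, `partitionOf`, `partitionRows` and `collectNames` are hypothetical and exist only for this illustration. The sketch assumes rows are routed by a non-negative hash of the id modulo the number of partitions. It only shows that, under such a layout, each id can be collected inside its own partition without consulting any other one. It says nothing about whether Spark keeps that layout after a Parquet round trip, which is what I am asking.

```scala
// Plain-Scala illustration (no Spark): hash-partition rows by id, then
// collect the names per id inside each partition independently.
object PartitionById {
  // Hypothetical helper: pick a partition from the id's hash code,
  // normalised to a non-negative value.
  def partitionOf(id: Int, numPartitions: Int): Int = {
    val m = id.hashCode % numPartitions
    if (m < 0) m + numPartitions else m
  }

  // Route every row to the partition chosen by its id, so all rows
  // sharing an id end up together.
  def partitionRows(rows: Seq[(Int, String)],
                    numPartitions: Int): Map[Int, Seq[(Int, String)]] =
    rows.groupBy { case (id, _) => partitionOf(id, numPartitions) }

  // Group names by id within one partition; no other partition is read.
  def collectNames(partition: Seq[(Int, String)]): Map[Int, Seq[String]] =
    partition.groupBy(_._1).map { case (id, kvs) => id -> kvs.map(_._2) }

  def main(args: Array[String]): Unit = {
    // Sample rows taken from the original question.
    val rows = Seq((1, "abc"), (1, "bdf"), (2, "ab"), (2, "cd"))
    val parts = partitionRows(rows, numPartitions = 2)
    parts.toSeq.sortBy(_._1).foreach { case (p, rs) =>
      println(s"partition $p -> ${collectNames(rs)}")
    }
  }
}
```

Because every id lives in exactly one partition here, the union of the per-partition results is already the final answer and no merge step across partitions is needed.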
