Re: How to concat few rows into a new column in dataframe

Gavin Yue Tue, 05 Jan 2016 21:26:08 -0800

I tried Ted's solution and it works, but I keep hitting JVM out-of-memory
errors, and grouping by key causes a lot of data shuffling.


So I am trying to order the data by ID first and save it as Parquet.  Is
there a way to make sure the data is partitioned so that each ID's data is
in one partition, so that no shuffling is needed in the future?

Thanks.


On Tue, Jan 5, 2016 at 3:19 PM, Michael Armbrust <[email protected]>
wrote:

> This would also be possible with an Aggregator in Spark 1.6:
>
> https://docs.cloud.databricks.com/docs/spark/1.6/index.html#examples/Dataset%20Aggregator.html
>
> On Tue, Jan 5, 2016 at 2:59 PM, Ted Yu <[email protected]> wrote:
>
>> Something like the following:
>>
>> val zeroValue = collection.mutable.Set[String]()
>>
>> val aggregated = data.aggregateByKey(zeroValue)((set, v) => set += v,
>> (setOne, setTwo) => setOne ++= setTwo)
>>
>> On Tue, Jan 5, 2016 at 2:46 PM, Gavin Yue <[email protected]> wrote:
>>
>>> Hey,
>>>
>>> For example, a table df with two columns
>>> id  name
>>> 1   abc
>>> 1   bdf
>>> 2   ab
>>> 2   cd
>>>
>>> I want to group by the id and concat the strings into an array of
>>> strings, like this:
>>>
>>> id
>>> 1 [abc,bdf]
>>> 2 [ab, cd]
>>>
>>> How could I achieve this with a DataFrame?  I am stuck on df.groupBy("id"). ???
>>>
>>> Thanks
>>>
>>>
>>
>
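
For readers following the thread, the per-key aggregation quoted above can be
sketched on plain Scala collections, with no Spark dependency. This is only an
illustration under stated assumptions: `ConcatRowsSketch`, `rows` and
`aggregateByKeyLocal` are hypothetical names, the data is the sample from the
original question, and a Vector is used instead of the mutable Set in Ted's
snippet so that duplicates and input order are kept. It mimics only the
per-partition fold (the first function passed to aggregateByKey), not the
cross-partition merge.

```scala
// Plain-Scala sketch (no Spark) of the per-key aggregation discussed in the thread.
// All names here are hypothetical and used only for illustration.
object ConcatRowsSketch {
  // Sample data from the original question: (id, name) pairs.
  val rows: Seq[(Int, String)] =
    Seq((1, "abc"), (1, "bdf"), (2, "ab"), (2, "cd"))

  // Fold each value into the accumulator for its key, preserving input order.
  // This corresponds to the per-partition step of aggregateByKey on one partition.
  def aggregateByKeyLocal(data: Seq[(Int, String)]): Map[Int, Vector[String]] =
    data.foldLeft(Map.empty[Int, Vector[String]]) { case (acc, (id, name)) =>
      acc.updated(id, acc.getOrElse(id, Vector.empty[String]) :+ name)
    }

  def main(args: Array[String]): Unit = {
    val grouped = aggregateByKeyLocal(rows)
    // Print one line per id, smallest id first.
    grouped.toSeq.sortBy(_._1).foreach { case (id, names) =>
      println(s"$id ${names.mkString("[", ", ", "]")}")
    }
  }
}
```

Running `ConcatRowsSketch.main` prints one line per id in the shape asked for
in the original question. On a real cluster the same fold would run once per
partition, and the partial results for each key would still have to be merged
across partitions, which is where the shuffle comes from.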
