I found that in 1.6 a DataFrame can do repartition.

Do I still need to do an orderBy first, or is repartitioning alone enough?
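
To make the question concrete, here is what I am comparing (a rough,
untested sketch; it assumes a DataFrame df with an id column and a
sqlContext already in scope):

import sqlContext.implicits._

// repartition alone: hash-partitions by id, so all rows with the
// same id land in the same partition (no ordering guarantee)
val byId = df.repartition($"id")

// orderBy before repartition is wasted work: the repartition shuffle
// does not preserve the global sort; sortWithinPartitions (new in 1.6)
// can restore a per-partition order afterwards instead
val byIdSorted = df.repartition($"id").sortWithinPartitions($"id")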

On Tue, Jan 5, 2016 at 9:25 PM, Gavin Yue <yue.yuany...@gmail.com> wrote:

> I tried Ted's solution and it works, but I keep hitting JVM out-of-memory
> errors, and grouping by key causes a lot of data shuffling.
>
> So I am trying to order the data by ID first and save it as Parquet. Is
> there a way to make sure the data is partitioned so that each ID's data
> is in one partition, and there would be no shuffling in the future?
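>
> Roughly what I have in mind (a sketch, untested; the output path is made
> up). As far as I know, Spark does not keep partitioner information when
> Parquet data is read back, so a later join or group may still shuffle:
>
> import sqlContext.implicits._
>
> // put each id's rows into a single partition, sort within each
> // partition, then save as Parquet
> df.repartition($"id")
>   .sortWithinPartitions($"id")
>   .write.parquet("/tmp/data_by_id")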
>
> Thanks.
>
>
> On Tue, Jan 5, 2016 at 3:19 PM, Michael Armbrust <mich...@databricks.com>
> wrote:
>
>> This would also be possible with an Aggregator in Spark 1.6:
>>
>> https://docs.cloud.databricks.com/docs/spark/1.6/index.html#examples/Dataset%20Aggregator.html
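>>
>> A rough, untested sketch of such an Aggregator for this use case (the
>> Record case class and collectNames are illustrative names, not taken
>> from the linked example):
>>
>> import org.apache.spark.sql.expressions.Aggregator
>> import sqlContext.implicits._  // encoders; assumes sqlContext in scope
>>
>> case class Record(id: Int, name: String)
>>
>> // collects every name seen for a key into a Seq[String]
>> val collectNames = new Aggregator[Record, Seq[String], Seq[String]]
>>     with Serializable {
>>   def zero: Seq[String] = Seq.empty
>>   def reduce(b: Seq[String], a: Record): Seq[String] = b :+ a.name
>>   def merge(b1: Seq[String], b2: Seq[String]): Seq[String] = b1 ++ b2
>>   def finish(b: Seq[String]): Seq[String] = b
>> }.toColumn
>>
>> val ds = df.as[Record]
>> val result = ds.groupBy(_.id).agg(collectNames)  // Dataset[(Int, Seq[String])]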
>>
>> On Tue, Jan 5, 2016 at 2:59 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>
>>> Something like the following:
>>>
>>> val zeroValue = collection.mutable.Set[String]()
>>>
>>> val aggregated = data.aggregateByKey(zeroValue)(
>>>   (set, v) => set += v,                  // add each value to the per-partition set
>>>   (setOne, setTwo) => setOne ++= setTwo) // merge the per-partition sets
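>>>
>>> For example, given (untested; mutable-set ordering may vary):
>>>
>>> val data = sc.parallelize(Seq((1, "abc"), (1, "bdf"), (2, "ab"), (2, "cd")))
>>>
>>> aggregated.collect() would return roughly:
>>> Array((1, Set(abc, bdf)), (2, Set(ab, cd)))
>>>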
>>>
>>> On Tue, Jan 5, 2016 at 2:46 PM, Gavin Yue <yue.yuany...@gmail.com>
>>> wrote:
>>>
>>>> Hey,
>>>>
>>>> For example, a table df with two columns:
>>>> id  name
>>>> 1   abc
>>>> 1   bdf
>>>> 2   ab
>>>> 2   cd
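>>>>
>>>> (For reference, the sample table could be built like this; a sketch,
>>>> untested:)
>>>>
>>>> import sqlContext.implicits._
>>>> val df = sc.parallelize(Seq(
>>>>   (1, "abc"), (1, "bdf"), (2, "ab"), (2, "cd"))).toDF("id", "name")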
>>>>
>>>> I want to group by id and concatenate the strings into an array of
>>>> strings, like this:
>>>>
>>>> id  names
>>>> 1   [abc, bdf]
>>>> 2   [ab, cd]
>>>>
>>>> How could I achieve this with a DataFrame? I'm stuck at df.groupBy("id"). ???
>>>>
>>>> Thanks
>>>>
>>>>
>>>
>>
>
