Thanks, I will look into Aggregator as well.

On Sun, Feb 14, 2016 at 12:31 AM, Michael Armbrust <mich...@databricks.com>
wrote:

> Instead of grouping with a lambda function, you can do it with a column
> expression to avoid materializing an unnecessary tuple:
>
> df.groupBy($"_1")
>
> Regarding the mapValues, you can do something similar using an Aggregator
> <https://docs.cloud.databricks.com/docs/spark/1.6/index.html#examples/Dataset%20Aggregator.html>,
> but I agree that we should consider something lighter weight like the
> mapValues you propose.
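>
> For concreteness, a minimal sketch of that Aggregator approach (assuming
> the Spark 1.6-style Aggregator[-I, B, O] contract and a concrete
> Dataset[(String, Int)]; SumValues is just an illustrative name) could be:
>
> import org.apache.spark.sql.expressions.Aggregator
>
> // sums the Int values of a (String, Int) dataset within each group
> object SumValues extends Aggregator[(String, Int), Int, Int] {
>   def zero: Int = 0                                    // identity element
>   def reduce(b: Int, a: (String, Int)): Int = b + a._2 // fold one record in
>   def merge(b1: Int, b2: Int): Int = b1 + b2           // combine partial sums
>   def finish(reduction: Int): Int = reduction          // final result
> }
>
> // usage, with the SQLContext implicits in scope for the Int encoders:
> // ds.groupBy($"_1").agg(SumValues.toColumn)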
>
> On Sat, Feb 13, 2016 at 1:35 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> I have a Dataset[(K, V)].
>> I would like to group by K and then reduce V using a function (V, V) => V.
>> How do I do this?
>>
>> I would expect something like:
>> val ds: Dataset[(K, V)] = ...
>> ds.groupBy(_._1).mapValues(_._2).reduce(f)
>> or, better:
>> ds.grouped.reduce(f)  // grouped only works on Dataset[(_, _)]; I don't care about the Java API
>>
>> But there is no mapValues or grouped. ds.groupBy(_._1) gives me a
>> GroupedDataset[K, (K, V)], which is inconvenient. I could carry the key
>> through the reduce operation, but that seems ugly and inefficient.
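>>
>> For reference, carrying the key through the reduce would look roughly
>> like this (a sketch; K, V and f stand for the concrete key type, value
>> type and reduce function above):
>>
>> // groupBy(_._1) keys the groups by K but keeps (K, V) as the value,
>> // so the reduce function has to thread the key along redundantly
>> val reduced: Dataset[(K, (K, V))] =
>>   ds.groupBy(_._1).reduce((a, b) => (a._1, f(a._2, b._2)))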
>>
>> Any thoughts?
>>
>>
>>
>
