Thanks, I will look into Aggregator as well.

On Sun, Feb 14, 2016 at 12:31 AM, Michael Armbrust <mich...@databricks.com> wrote:
> Instead of grouping with a lambda function, you can do it with a column
> expression to avoid materializing an unnecessary tuple:
>
> df.groupBy($"_1")
>
> Regarding the mapValues, you can do something similar using an Aggregator
> <https://docs.cloud.databricks.com/docs/spark/1.6/index.html#examples/Dataset%20Aggregator.html>,
> but I agree that we should consider something lighter weight like the
> mapValues you propose.
>
> On Sat, Feb 13, 2016 at 1:35 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> I have a Dataset[(K, V)]. I would like to group by K and then reduce V
>> using a function (V, V) => V. How do I do this?
>>
>> I would expect something like:
>>
>> val ds: Dataset[(K, V)] = ...
>> ds.groupBy(_._1).mapValues(_._2).reduce(f)
>>
>> or better:
>>
>> ds.grouped.reduce(f) // grouped only works on Dataset[(_, _)], and I
>> don't care about the Java API
>>
>> But there is no mapValues or grouped. ds.groupBy(_._1) gives me a
>> GroupedDataset[K, (K, V)], which is inconvenient. I could carry the key
>> through the reduce operation, but that seems ugly and inefficient.
>>
>> Any thoughts?
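
For readers following along, here is a minimal sketch of the Aggregator approach Michael links to, written against the Spark 1.6 API. It is specialized to Dataset[(String, Int)] for concreteness, and SumByKey is an illustrative name, not part of Spark:

    import org.apache.spark.sql.{Dataset, TypedColumn}
    import org.apache.spark.sql.expressions.Aggregator

    // Reduces the Int values of a Dataset[(String, Int)] per key with f.
    // The (Boolean, Int) buffer tracks whether a value has been seen yet,
    // so no zero element of the value type is required.
    object SumByKey extends Aggregator[(String, Int), (Boolean, Int), Int] {
      val f: (Int, Int) => Int = _ + _
      def zero: (Boolean, Int) = (false, 0)
      def reduce(b: (Boolean, Int), a: (String, Int)): (Boolean, Int) =
        if (b._1) (true, f(b._2, a._2)) else (true, a._2)
      def merge(b1: (Boolean, Int), b2: (Boolean, Int)): (Boolean, Int) =
        if (!b1._1) b2
        else if (!b2._1) b1
        else (true, f(b1._2, b2._2))
      def finish(b: (Boolean, Int)): Int = b._2
    }

    // Usage, with sqlContext.implicits._ in scope for the tuple encoders:
    //   val ds: Dataset[(String, Int)] = ...
    //   val reduced: Dataset[(String, Int)] =
    //     ds.groupBy(_._1).agg(SumByKey.toColumn)

The (Boolean, Int) buffer stands in for an Option, whose encoder may not be available in 1.6: the flag records whether the buffer holds a real value yet, which is what lets a reduce (which has no zero of V) be expressed on top of an API that requires one.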