Two more ways:
*Using the Typed Dataset API with Rows*
Caveat: The docs about flatMapGroups do warn "This function does not
support partial aggregation, and as a result requires shuffling all the
data in the Dataset. If an application intends to perform an aggregation
over each key, it is best to
For the curious, I played around with a UDAF for this (shown below). On the
downside, it assembles a Map of all possible values of the column that'll
need to be stored in memory somewhere.
I suspect some kind of sorted groupByKey + cogroup could stream values
through, though might not support part