Re: Calculate mode separately for multiple columns in row

2017-05-01 Thread Everett Anderson

Two more ways:

*Using the Typed Dataset API with Rows*

Caveat: The docs about flatMapGroups do warn "This function does not support partial aggregation, and as a result requires shuffling all the data in the Dataset. If an application intends to perform an aggregation over each key, it is best to
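To make the caveat concrete, here is a minimal plain-Python sketch of the group-then-process shape that flatMapGroups implies: every row for a key is collected first, and only then is the mode of each column computed. This is not the code from the message and it does not use Spark. The names `modes_by_key`, `mode_per_column`, and `most_frequent` are hypothetical, and the tie-breaking rule (smallest string form wins) is an assumption made only to keep the result deterministic.

```python
from collections import Counter, defaultdict


def most_frequent(counts):
    """Pick the value with the highest count.

    Ties are broken by the smallest string form (an assumption,
    chosen only so the result is deterministic).
    """
    return min(counts, key=lambda value: (-counts[value], str(value)))


def mode_per_column(rows):
    """Return the mode of each column for a non-empty list of equal-length tuples."""
    return tuple(most_frequent(Counter(column)) for column in zip(*rows))


def modes_by_key(records):
    """Group (key, row) pairs by key, then compute per-column modes per group.

    All rows of a group are held in memory before any counting starts,
    which mirrors the "no partial aggregation" caveat quoted above.
    """
    groups = defaultdict(list)
    for key, row in records:
        groups[key].append(row)
    return {key: mode_per_column(rows) for key, rows in groups.items()}


records = [
    ("a", (1, "x")),
    ("a", (1, "y")),
    ("a", (2, "y")),
    ("b", (3, "z")),
]
result = modes_by_key(records)
```

Because nothing is combined before the grouping step, the cost of this shape grows with the size of the largest group rather than with the number of distinct values.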

Re: Calculate mode separately for multiple columns in row

2017-04-27 Thread Everett Anderson

For the curious, I played around with a UDAF for this (shown below). On the downside, it assembles a Map of all possible values of the column that'll need to be stored in memory somewhere. I suspect some kind of sorted groupByKey + cogroup could stream values through, though might not support part
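The UDAF itself is not included in this excerpt. As a rough illustration of the idea described, and not the author's code, the sketch below keeps one count map per column and exposes the usual update, merge, and evaluate steps of an aggregator. It is plain Python rather than Spark; the class name `ModeAggregator`, its method names, and the tie-breaking rule are all assumptions.

```python
from collections import Counter


class ModeAggregator:
    """Hypothetical per-column mode aggregator with a UDAF-like shape."""

    def __init__(self, num_columns):
        # One value -> count map per column. Memory grows with the
        # number of distinct values seen in each column.
        self.counts = [Counter() for _ in range(num_columns)]

    def update(self, row):
        """Fold a single row into the running counts."""
        for counter, value in zip(self.counts, row):
            counter[value] += 1

    def merge(self, other):
        """Combine counts from another partial aggregate."""
        for mine, theirs in zip(self.counts, other.counts):
            mine.update(theirs)

    def evaluate(self):
        """Return the mode of each column, or None for an empty column.

        Ties are broken by the smallest string form (an assumption).
        """
        return tuple(
            min(c, key=lambda value: (-c[value], str(value))) if c else None
            for c in self.counts
        )


# Two partial aggregates, as if built on separate partitions.
left = ModeAggregator(2)
left.update((1, "x"))
left.update((2, "y"))

right = ModeAggregator(2)
right.update((2, "y"))
right.update((2, "x"))
right.update((1, "y"))

left.merge(right)
modes = left.evaluate()
```

The merge step is what allows partial aggregation, and the count maps are the memory cost mentioned above: each one holds an entry for every distinct value in its column.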