Hi, I've used functions.monotonically_increasing_id() for assigning a unique ID to all rows, but I'd like to assign a unique ID to each group of rows with the same key.
The two ways I can think of to do this are:

Option 1: Create a separate group ID table and join back
- Create a new data frame with the distinct values of the keys.
- Add an ID column to it via monotonically_increasing_id.
- Join this table back with the original to add the group ID. In the best case, the distinct-keys table will be small enough for a broadcast join.

Option 2: Add ID column / groupByKey / flatMapGroups
- Add an ID column with monotonically_increasing_id.
- groupByKey on the key.
- flatMapGroups, applying the first ID seen in the iterator to the other rows in the group.

Option 2 is a little annoying if you're dealing with Dataset[Row], since you have to do a lot of work to pull the fields out of the old Row objects and build new ones.

Is there a better way?

Also, generally: while assigning a unique ID to all rows seems like a commonly needed operation, there are comments in RDD.zipWithUniqueId as well as monotonically_increasing_id suggesting these may not be especially reliable in various cases. Do people hit those much?

Thanks!

- Everett
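For what it's worth, here's a minimal sketch of Option 1 in Spark/Scala. The column names ("group_id") and the helper function are just illustrative, not from any Spark API; it assumes the key lives in a single column:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{broadcast, monotonically_increasing_id}

    // Option 1 sketch: tag each distinct key with an ID, then join back.
    def withGroupId(df: DataFrame, keyCol: String): DataFrame = {
      // One row per distinct key, each with a unique (but not contiguous) ID.
      val groupIds = df.select(keyCol).distinct()
        .withColumn("group_id", monotonically_increasing_id())
      // Hint a broadcast join, since the distinct-keys table is usually small.
      df.join(broadcast(groupIds), Seq(keyCol))
    }

Note the IDs from monotonically_increasing_id are unique but not dense (they encode the partition ID in the high bits), so you get distinct group IDs, not 0..N-1.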