Hello! I'm working on an implementation of a Reservoir Sampling aggregation function. In my project I need to group by one column and collect a list of at most X values of another column. If a group has more than X values in that column, the X values should be sampled uniformly at random.
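To make the requirement concrete, the per-group buffer can be sketched roughly as below. This is a stand-alone illustration only: it uses no Spark API and is not the code from the gist. The class name `Reservoir`, the methods `add` and `merge`, and the choice of `SplittableRandom` (so that a seed can be passed in) are all assumptions made for the example. The update step is the classic "Algorithm R"; the combine step draws from the two partial samples in proportion to how many values each side has seen, and assumes both sides use the same capacity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SplittableRandom;

/** Hypothetical stand-alone buffer for illustration; not a Spark class. */
final class Reservoir<T> {
    private final int capacity;   // X: maximum number of values kept per group
    private final List<T> items;  // current sample, at most `capacity` values
    private long seen;            // how many values this buffer has observed

    Reservoir(int capacity) {
        this.capacity = capacity;
        this.items = new ArrayList<>(capacity);
    }

    long seen() { return seen; }

    List<T> items() { return new ArrayList<>(items); }

    /** Update step (what a UDAF would do in reduce): Algorithm R. */
    void add(T value, SplittableRandom rnd) {
        seen++;
        if (items.size() < capacity) {
            items.add(value);                 // still filling the reservoir
        } else {
            long j = rnd.nextLong(seen);      // uniform in [0, seen)
            if (j < capacity) {
                items.set((int) j, value);    // keep with probability capacity / seen
            }
        }
    }

    /** Combine step (what a UDAF would do in merge); assumes equal capacities. */
    static <T> Reservoir<T> merge(Reservoir<T> a, Reservoir<T> b, SplittableRandom rnd) {
        Reservoir<T> out = new Reservoir<>(a.capacity);
        List<T> left = new ArrayList<>(a.items);
        List<T> right = new ArrayList<>(b.items);
        long remA = a.seen;                   // values on side A not yet accounted for
        long remB = b.seen;                   // values on side B not yet accounted for
        out.seen = a.seen + b.seen;
        long target = Math.min((long) out.capacity, out.seen);
        while (out.items.size() < target) {
            // Choose a side in proportion to its remaining population.
            boolean fromA = rnd.nextLong(remA + remB) < remA;
            List<T> src = fromA ? left : right;
            // Move one random value out of that side's partial sample.
            out.items.add(src.remove(rnd.nextInt(src.size())));
            if (fromA) { remA--; } else { remB--; }
        }
        return out;
    }
}
```

With a fixed seed, for example `new SplittableRandom(42L)`, repeating the same sequence of `add` and `merge` calls gives the same sample, which is convenient in tests. Whether that is enough when Spark retries a task or changes the merge order is part of what I'm asking.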
I implemented a version that works on my tests: https://gist.github.com/SemyonSinchenko/baeb19e7a74f90271dbb58bded5ad0c8 But I'm not sure whether it is safe to use ThreadLocalRandom inside reduce and merge. For example, what happens if Spark needs to recompute failed tasks? Is it mandatory for UDAFs to be deterministic? Should I provide an additional flag, such as isDeterministic? And what is the general approach to writing randomized user-defined aggregation functions?
Thanks for any comments or advice!
