Hello!

I'm working on implementation of Reservoir Sampling aggregation
function. In the project I'm working on I need to do grouping by one
column and collect a list of at most X values of another column. If
there are more than X values of another column for the group, sampling
should be done uniformly.

I implemented a version that works on my tests:
https://gist.github.com/SemyonSinchenko/baeb19e7a74f90271dbb58bded5ad0c8

But I'm not sure, is it safe to use ThreadLocalRandom inside reduce and
merge. Like what happens, if Spark needs to re-compute failed tasks? Is
it mandatory for UDAFs to be deterministic? Should I provide any
additional flag, like isDeterministic, etc.?

And what should be a general approach to creating a random-based user
defined aggregation functions?

Thanks for any comment / advice!

---------------------------------------------------------------------
To unsubscribe e-mail: [email protected]

Reply via email to