I have a large dataset composed of scores for several thousand segments,
and the timestamps at which time those scores occurred. I'd like to apply
some techniques like reservoir sampling[1], where for every segment I
process records in order of their timestamps, generate a sample, and then
at intervals compute the quantiles in the sample. Ideally I'd like to write
a pyspark udf to do the sampling/quantizing procedure.

It seems like something I should be doing via rdd.map, but it's not really
clear how I can enforce a function to process records in order within a
partition. Any pointers?


[1] https://en.wikipedia.org/wiki/Reservoir_sampling

Reply via email to