I have a large dataset composed of scores for several thousand segments, along with the timestamps at which those scores occurred. I'd like to apply techniques like reservoir sampling: for every segment, process records in order of their timestamps, maintain a sample, and at intervals compute quantiles over that sample. Ideally I'd like to write a PySpark UDF to do the sampling/quantile procedure.
It seems like something I should be doing via rdd.map, but it's not clear how to guarantee that a function processes records in timestamp order within a partition. Any pointers? Thanks, Patrick  https://en.wikipedia.org/wiki/Reservoir_sampling
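For reference, here is a minimal sketch of the per-segment logic I have in mind, written as plain Python rather than a UDF. It assumes records for one segment arrive already sorted by timestamp; the function name `reservoir_sample` and the use of Algorithm R are just my illustration, not existing code:

```python
import random
from statistics import quantiles

def reservoir_sample(records, k, seed=None):
    """Uniform random sample of up to k items from a single pass over
    an iterable (Algorithm R), using O(k) memory."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append(item)
        else:
            # Keep the i-th record with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

# At intervals, quantiles could be computed over the current reservoir,
# e.g. quartiles of a sample of scores:
sample = reservoir_sample(range(1000), 50, seed=42)
quartiles = quantiles(sample, n=4)
```

The open question is how to get each segment's records to this function in timestamp order inside Spark.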