Hi, I have billions, potentially tens of billions, of observations. Each observation is a decimal number. I need to compute percentiles 1, 25, 50, 75, and 95 over these observations using Scala Spark. I can use either the RDD or the Dataset API, whichever works better.
What I can do in terms of performance optimisation:

- I can round the decimal observations to long.
- I can even round each observation to the nearest 5, for example: 2.6 can be rounded to 5, or 11.3123123 to 10, to reduce the number of unique observation values (if that helps on the math side).
- I'm fine with some approximation approach, losing some precision in exchange for faster percentile results (how would I measure the error, by the way?).

What can I try? Thanks!
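To make the rounding idea concrete: after rounding to the nearest 5 there are very few distinct values, so I could reduce everything to a value → count histogram and read exact percentiles of the rounded data off the cumulative counts. Here is a plain-Scala sketch of that math on a local collection (all names are mine); my understanding is that in Spark this would be something like `rdd.map(roundTo(5)).countByValue()`, or a groupBy/count in the Dataset API, with the small histogram then walked on the driver:

```scala
object RoundedPercentiles {
  // Round a double to the nearest multiple of `step` (e.g. step = 5).
  def roundTo(step: Long)(x: Double): Long =
    math.round(x / step) * step

  // Exact percentiles of the rounded data, computed from a value -> count
  // histogram. `ps` are percentile ranks in [0, 100]; hist must be non-empty.
  def percentiles(hist: Map[Long, Long], ps: Seq[Double]): Seq[Long] = {
    val sorted = hist.toVector.sortBy(_._1)
    val total  = sorted.map(_._2).sum
    // Cumulative counts aligned with the sorted distinct values.
    val cum = sorted.map(_._2).scanLeft(0L)(_ + _).tail
    ps.map { p =>
      // Smallest value whose cumulative count reaches the target rank.
      val target = math.max(1L, math.ceil(p / 100.0 * total).toLong)
      sorted(cum.indexWhere(_ >= target))._1
    }
  }

  def main(args: Array[String]): Unit = {
    val obs  = Seq(2.6, 11.3123123, 4.9, 7.2, 13.8, 1.1)
    val hist = obs.groupBy(roundTo(5)).map { case (k, v) => k -> v.size.toLong }
    println(percentiles(hist, Seq(1.0, 25.0, 50.0, 75.0, 95.0)))
  }
}
```

The appeal is that the shuffle only moves (value, count) pairs, one per distinct rounded value, instead of sorting billions of rows.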
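On the error question: the error I think I'd want to measure is rank error rather than value error. As far as I know, Spark's built-in `df.stat.approxQuantile(col, probabilities, relativeError)` guarantees exactly that kind of bound: for a target fraction p and relative error ε, the returned value's true rank is within (p ± ε) · N. Here is a plain-Scala sketch of how I'd check this empirically against a sorted sample (names are mine):

```scala
object RankError {
  // Empirical rank error of an approximate percentile answer:
  // |trueRank(approx) / n - p|, where p is the target fraction in [0, 1].
  def rankError(sortedSample: Array[Double], approx: Double, p: Double): Double = {
    val n = sortedSample.length
    // True rank = number of observations <= the approximate answer.
    val rank = sortedSample.count(_ <= approx)
    math.abs(rank.toDouble / n - p)
  }

  def main(args: Array[String]): Unit = {
    val sample = (1 to 100).map(_.toDouble).toArray // already sorted
    // Suppose an approximate method returned 48.0 for the median (p = 0.5):
    println(rankError(sample, 48.0, 0.5)) // off by 2% of ranks
  }
}
```

So I could run the approximate method on the full data, compute exact quantiles on a manageable random sample, and report the rank error per percentile.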