Hi, I have billions, potentially tens of billions of observations. Each
observation is a decimal number.
I need to calculate the 1st, 25th, 50th, 75th, and 95th percentiles of these
observations using Scala Spark. I can use either the RDD or the Dataset API,
whichever works better.
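For concreteness, the kind of Dataset-API call I have in mind is Spark's built-in `approxQuantile` (a Greenwald-Khanna-style sketch, where the `relativeError` argument bounds the rank error). A minimal sketch, assuming `spark-sql` is on the classpath and using toy data in place of the real observations:

```scala
import org.apache.spark.sql.SparkSession

object QuantileSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("quantiles")
      .getOrCreate()
    import spark.implicits._

    // Toy stand-in for the real observations.
    val ds = (1 to 100000).map(_ * 0.1).toDF("value")

    // relativeError = 0.001 means each returned value's rank is within
    // 0.1% of N of the true rank for that percentile.
    val q = ds.stat.approxQuantile(
      "value",
      Array(0.01, 0.25, 0.50, 0.75, 0.95),
      0.001)
    println(q.mkString(", "))

    spark.stop()
  }
}
```

Is this the right tool at this scale, or is something else (e.g. `percentile_approx` in SQL, or a histogram over rounded longs) a better fit?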

What I can do in terms of performance optimisation:
- I can round the decimal observations to longs
- I can even round each observation to the nearest 5, for example: 2.6 can be
rounded to 5, or 11.3123123 can be rounded to 10, to reduce the number of
unique observation values (if that helps on the maths side)
- I'm fine with an approximate approach, losing some precision (how do I
measure the error, by the way?) in exchange for faster percentile results.
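On measuring the error of the rounding idea: rounding to the nearest multiple of 5 is monotone, so it preserves order, and each value moves by at most 2.5; therefore every percentile of the rounded data is within 2.5 of the exact one. A self-contained plain-Scala sketch (no Spark; synthetic data, nearest-rank percentiles, names are mine) that demonstrates this by comparing exact and rounded percentiles:

```scala
object RoundedPercentiles {
  // Round a value to the nearest multiple of `step` (e.g. step = 5).
  def roundToStep(x: Double, step: Long): Long =
    math.round(x / step) * step

  // Exact percentile by the nearest-rank method on a sorted array.
  def percentile(sorted: Array[Double], p: Double): Double = {
    val idx = math.ceil(p / 100.0 * sorted.length).toInt - 1
    sorted(math.max(idx, 0))
  }

  def main(args: Array[String]): Unit = {
    val raw = (1 to 1000).map(_ * 0.37).toArray       // synthetic observations
    val exactSorted   = raw.sorted
    val roundedSorted = raw.map(x => roundToStep(x, 5).toDouble).sorted

    // Error of the rounding approximation at each target percentile:
    // bounded by half the rounding step (2.5 here).
    for (p <- Seq(1.0, 25.0, 50.0, 75.0, 95.0)) {
      val exact  = percentile(exactSorted, p)
      val approx = percentile(roundedSorted, p)
      println(f"p$p%5.1f exact=$exact%8.2f rounded=$approx%8.2f absErr=${math.abs(exact - approx)}%5.2f")
    }
  }
}
```

Is half the rounding step the right way to think about the error, or does Spark's approximate machinery report its own error bound?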


What can I try?
Thanks!
