If you need higher precision, you may have to write a custom UDAF. In my case, I ended up storing the data as a key-value ordered list of histograms.
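To make the histogram idea concrete, here is a minimal sketch in plain Python, not the actual UDAF. It assumes each partial aggregate is a list of (value, count) pairs sorted by value, that partials are merged by summing counts, and that the percentile uses the nearest-rank definition. The names `build_histogram`, `merge_histograms` and `percentile` are hypothetical, and the built-in Hive or Spark percentile may interpolate between values rather than use nearest rank.

```python
import math
from collections import Counter
from typing import Iterable, List, Tuple

# Assumed representation: (value, count) pairs sorted by value.
Histogram = List[Tuple[float, int]]


def build_histogram(values: Iterable[float]) -> Histogram:
    """Count occurrences of each distinct value and sort by value."""
    return sorted(Counter(values).items())


def merge_histograms(a: Histogram, b: Histogram) -> Histogram:
    """Combine two partial histograms by summing counts per value."""
    merged = Counter(dict(a))
    merged.update(dict(b))  # Counter.update adds counts, it does not overwrite
    return sorted(merged.items())


def percentile(hist: Histogram, p: float) -> float:
    """Nearest-rank percentile: the smallest value whose cumulative
    count reaches ceil(p * total)."""
    if not hist:
        raise ValueError("empty histogram")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    total = sum(count for _, count in hist)
    rank = max(1, math.ceil(p * total))
    running = 0
    for value, count in hist:
        running += count
        if running >= rank:
            return value
    return hist[-1][0]
```

In a UDAF the three functions would roughly correspond to the update, merge and evaluate steps. The result is exact because no values are binned, but memory grows with the number of distinct values, so this only pays off when the column has limited cardinality.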

Thanks
Muthu

On Mon, Nov 11, 2019, 20:46 Patrick McCarthy <pmccar...@dstillery.com.invalid> wrote:

> Depending on your tolerance for error you could also use
> percentile_approx().
>
> On Mon, Nov 11, 2019 at 10:14 AM Jerry Vinokurov <grapesmo...@gmail.com>
> wrote:
>
>> Do you mean that you are trying to compute the percent rank of some data?
>> You can use the SparkSQL percent_rank function for that, but I don't think
>> that's going to give you any improvement over calling the percentRank
>> function on the data frame. Are you currently using a user-defined function
>> for this task? Because I bet that's what's slowing you down.
>>
>> On Mon, Nov 11, 2019 at 9:46 AM Tzahi File <tzahi.f...@ironsrc.com>
>> wrote:
>>
>>> Hi,
>>>
>>> Currently, I'm using a huge Hive cluster (m5.24xl * 40 workers) to run a
>>> percentile function. I'm trying to improve this job by moving it to run
>>> with Spark SQL.
>>>
>>> Any suggestions on how to use a percentile function in Spark?
>>>
>>> Thanks,
>>> --
>>> Tzahi File
>>> Data Engineer
>>
>> --
>> Jerry Vinokurov
>
> --
> Patrick McCarthy
> Senior Data Scientist, Machine Learning Engineering
> Dstillery