Currently, I'm using the percentile_approx function with Hive. I'm looking for a better way to run this function, or another way to get the same result with Spark, that is faster and doesn't require gigantic instances.
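For reference, here is a minimal sketch of how the same aggregation might be expressed in Spark SQL. The table name `events` and the columns `revenue` and `country` are hypothetical placeholders, not taken from the actual job, and the helper names are made up for illustration. The third argument to percentile_approx is the accuracy: a larger value gives a smaller relative error (1.0/accuracy) at the cost of more memory, so lowering it is one knob to try if the error tolerance allows.

```python
def build_percentile_query(table, value_col, group_cols, percentiles, accuracy=10000):
    """Build a Spark SQL query that computes approximate percentiles.

    All table and column names passed in are caller-supplied; nothing here
    is specific to the job discussed in this thread.
    """
    pct_list = ", ".join(str(p) for p in percentiles)
    group_by = ", ".join(group_cols)
    select_prefix = f"{group_by}, " if group_cols else ""
    query = (
        f"SELECT {select_prefix}"
        f"percentile_approx({value_col}, array({pct_list}), {accuracy}) AS pcts "
        f"FROM {table}"
    )
    if group_cols:
        query += f" GROUP BY {group_by}"
    return query


def run(spark, query):
    # `spark` is assumed to be an existing SparkSession; it is only needed
    # when the query is actually executed on a cluster.
    return spark.sql(query)


# Hypothetical example: median and 95th percentile of revenue per country.
query = build_percentile_query(
    "events", "revenue", ["country"], [0.5, 0.95], accuracy=1000
)
print(query)
```

Passing the resulting string to `run(spark, query)` would return a DataFrame with one array column holding the requested percentiles per group.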
I'm trying to optimize this job by changing the Spark configuration. If you have any ideas on how to approach this (instance type, number of instances, number of executors, etc.), that would be great.

On Mon, Nov 11, 2019 at 5:16 PM Patrick McCarthy <pmccar...@dstillery.com> wrote:

> Depending on your tolerance for error you could also use
> percentile_approx().
>
> On Mon, Nov 11, 2019 at 10:14 AM Jerry Vinokurov <grapesmo...@gmail.com>
> wrote:
>
>> Do you mean that you are trying to compute the percent rank of some data?
>> You can use the SparkSQL percent_rank function for that, but I don't think
>> that's going to give you any improvement over calling the percentRank
>> function on the data frame. Are you currently using a user-defined function
>> for this task? Because I bet that's what's slowing you down.
>>
>> On Mon, Nov 11, 2019 at 9:46 AM Tzahi File <tzahi.f...@ironsrc.com>
>> wrote:
>>
>>> Hi,
>>>
>>> Currently, I'm using hive huge cluster(m5.24xl * 40 workers) to run a
>>> percentile function. I'm trying to improve this job by moving it to run
>>> with spark SQL.
>>>
>>> Any suggestions on how to use a percentile function in Spark?
>>>
>>>
>>> Thanks,
>>> --
>>> Tzahi File
>>> Data Engineer
>>> ironSource
>>
>>
>> --
>> http://www.google.com/profiles/grapesmoker
>
>
> --
> Patrick McCarthy
> Senior Data Scientist, Machine Learning Engineering
> Dstillery

--
Tzahi File
Data Engineer
ironSource