From the source code, it looks like this function was added to pyspark in
Spark 3.3, which is up for release soon. It already exists in SQL, so in the
meantime you can still call it from Python via `spark.sql(...)`.
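
For what it's worth, `F.expr` can usually reach SQL functions that don't have a
Python wrapper yet, so something like this untested sketch (column names taken
from your schema below) may let you skip the temp view:

import pyspark.sql.functions as F

# Single aggregation, so only one shuffle; min_by is resolved by the SQL
# parser rather than the Python API, so this should work on 3.2.x.
cheapest_sellers_df = (
    df.groupBy('productId')
      .agg(F.min('price').alias('price'),
           F.expr('min_by(sellerId, price)').alias('sellerId'))
)

That should give the same result as your spark.sql version, just without
creating the temporary view.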

On Mon, Feb 21, 2022 at 4:01 AM David Diebold <davidjdieb...@gmail.com>
wrote:

> Hello all,
>
> I'm trying to use the spark.sql min_by aggregation function with pyspark.
> I'm relying on this distribution of Spark: spark-3.2.1-bin-hadoop3.2
>
> I have a dataframe made of these columns:
> - productId : int
> - sellerId : int
> - price : double
>
> For each product, I want to get the seller who sells the product for the
> cheapest price.
>
> A naive approach would be to do this, but I would expect two shuffles:
>
> import pyspark.sql.functions as F
>
> cheapest_prices_df = df.groupby('productId').agg(F.min('price').alias('price'))
> cheapest_sellers_df = df.join(cheapest_prices_df, on=['productId', 'price'])
>
> I would have loved to do this instead:
>
> import pyspark.sql.functions as F
>
> cheapest_sellers_df = df.groupby('productId').agg(
>     F.min('price'), F.min_by('sellerId', 'price'))
>
> Unfortunately min_by does not seem to be available in pyspark sql functions,
> whereas I can see it in the docs:
> https://spark.apache.org/docs/latest/api/sql/index.html
>
> I have managed to use min_by with this approach, but it looks slow (maybe
> because of the temp table creation?):
>
> df.createOrReplaceTempView("table")
> cheapest_sellers_df = spark.sql(
>     "select productId, min_by(sellerId, price) sellerId, min(price) price "
>     "from table group by productId")
>
> Is there a way I can rely on min_by directly in groupby?
> Is there some code missing in the pyspark wrapper to make min_by visible
> somehow?
>
> Thank you in advance for your help.
>
> Cheers
> David
>
