Re: [PySpark] Getting the best row from each group

Oliver Ruebenacker Mon, 19 Dec 2022 11:36:27 -0800

Thank you, that is an interesting idea. Instead of finding the maximum
population, we are finding the maximum (population, city name) tuple.


On Mon, Dec 19, 2022 at 2:10 PM Bjørn Jørgensen <[email protected]>
wrote:

> We have pandas API on spark
> <https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_ps.html>
> which is very good.
>
> from pyspark import pandas as ps
>
> You can use pdf = df.pandas_api()
> Where df is your pyspark dataframe.
>
>
> [image: image.png]
>
> Does this help you?
>
> df.groupby(['Country'])[['Population', 'City']].max()
>
> man. 19. des. 2022 kl. 18:22 skrev Patrick Tucci <[email protected]
> >:
>
>> Window functions don't work like traditional GROUP BYs. They allow you to
>> partition data and pull any relevant column, whether it's used in the
>> partition or not.
>>
>> I'm not sure what the syntax is for PySpark, but the standard SQL would
>> be something like this:
>>
>> WITH InputData AS
>> (
>>   SELECT 'USA' Country, 'New York' City, 9000000 Population
>>   UNION
>>   SELECT 'USA' Country, 'Miami', 6200000 Population
>>   UNION
>>   SELECT 'Ukraine' Country, 'Kyiv', 3000000 Population
>>   UNION
>>   SELECT 'Ukraine' Country, 'Kharkiv', 1400000 Population
>> )
>>
>>  SELECT *, ROW_NUMBER() OVER(PARTITION BY Country ORDER BY Population
>> DESC) PopulationRank
>>  FROM InputData;
>>
>> Results would be something like this:
>>
>> Country    City       Population     PopulationRank
>> Ukraine    Kyiv       3000000        1
>> Ukraine    Kharkiv    1400000        2
>> USA        New York   9000000        1
>> USA        Miami      6200000        2
>>
>> Which you could further filter in another CTE or subquery where
>> PopulationRank = 1.
>>
>> As I mentioned, I'm not sure how this translates into PySpark, but that's
>> the general concept in SQL.
>>
>> On Mon, Dec 19, 2022 at 12:01 PM Oliver Ruebenacker <
>> [email protected]> wrote:
>>
>>> If we only wanted to know the biggest population, max function would
>>> suffice. The problem is I also want the name of the city with the biggest
>>> population.
>>>
>>> On Mon, Dec 19, 2022 at 11:58 AM Sean Owen <[email protected]> wrote:
>>>
>>>> As Mich says, isn't this just max by population partitioned by country
>>>> in a window function?
>>>>
>>>> On Mon, Dec 19, 2022, 9:45 AM Oliver Ruebenacker <
>>>> [email protected]> wrote:
>>>>
>>>>>
>>>>>      Hello,
>>>>>
>>>>>   Thank you for the response!
>>>>>
>>>>>   I can think of two ways to get the largest city by country, but both
>>>>> seem to be inefficient:
>>>>>
>>>>>   (1) I could group by country, sort each group by population, add the
>>>>> row number within each group, and then retain only cities with a row 
>>>>> number
>>>>> equal to 1. But it seems wasteful to sort everything when I only want the
>>>>> largest of each country
>>>>>
>>>>>   (2) I could group by country, get the maximum city population for
>>>>> each country, join that with the original data frame, and then retain only
>>>>> cities with population equal to the maximum population in the country. But
>>>>> that seems also expensive because I need to join.
>>>>>
>>>>>   Am I missing something?
>>>>>
>>>>>   Thanks!
>>>>>
>>>>>      Best, Oliver
>>>>>
>>>>> On Mon, Dec 19, 2022 at 10:59 AM Mich Talebzadeh <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> In spark you can use windowing function
>>>>>> <https://sparkbyexamples.com/spark/spark-sql-window-functions/>s to
>>>>>> achieve this
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>>
>>>>>>    view my Linkedin profile
>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>
>>>>>>
>>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>>
>>>>>>
>>>>>>
>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>> for any loss, damage or destruction of data or any other property which 
>>>>>> may
>>>>>> arise from relying on this email's technical content is explicitly
>>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>>> arising from such loss, damage or destruction.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, 19 Dec 2022 at 15:28, Oliver Ruebenacker <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>>
>>>>>>>      Hello,
>>>>>>>
>>>>>>>   How can I retain from each group only the row for which one value
>>>>>>> is the maximum of the group? For example, imagine a DataFrame containing
>>>>>>> all major cities in the world, with three columns: (1) City name (2)
>>>>>>> Country (3) population. How would I get a DataFrame that only contains 
>>>>>>> the
>>>>>>> largest city in each country? Thanks!
>>>>>>>
>>>>>>>      Best, Oliver
>>>>>>>
>>>>>>> --
>>>>>>> Oliver Ruebenacker, Ph.D. (he)
>>>>>>> Senior Software Engineer, Knowledge Portal Network
>>>>>>> <http://kp4cd.org/>, Flannick Lab <http://www.flannicklab.org/>, Broad
>>>>>>> Institute <http://www.broadinstitute.org/>
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Oliver Ruebenacker, Ph.D. (he)
>>>>> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>,
>>>>> Flannick Lab <http://www.flannicklab.org/>, Broad Institute
>>>>> <http://www.broadinstitute.org/>
>>>>>
>>>>
>>>
>>> --
>>> Oliver Ruebenacker, Ph.D. (he)
>>> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>, 
>>> Flannick
>>> Lab <http://www.flannicklab.org/>, Broad Institute
>>> <http://www.broadinstitute.org/>
>>>
>>
>
> --
> Bjørn Jørgensen
> Vestre Aspehaug 4, 6010 Ålesund
> Norge
>
> +47 480 94 297
>


-- 
Oliver Ruebenacker, Ph.D. (he)
Senior Software Engineer, Knowledge Portal Network
<http://kp4cd.org/>, Flannick
Lab <http://www.flannicklab.org/>, Broad Institute
<http://www.broadinstitute.org/>

Re: [PySpark] Getting the best row from each group

Reply via email to