Thank you, that is an interesting idea. Instead of finding the maximum population, we are finding the maximum (population, city name) tuple.
On Mon, Dec 19, 2022 at 2:10 PM Bjørn Jørgensen <bjornjorgen...@gmail.com> wrote: > We have pandas API on spark > <https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_ps.html> > which is very good. > > from pyspark import pandas as ps > > You can use pdf = df.pandas_api() > Where df is your pyspark dataframe. > > > [image: image.png] > > Does this help you? > > df.groupby(['Country'])[['Population', 'City']].max() > > man. 19. des. 2022 kl. 18:22 skrev Patrick Tucci <patrick.tu...@gmail.com > >: > >> Window functions don't work like traditional GROUP BYs. They allow you to >> partition data and pull any relevant column, whether it's used in the >> partition or not. >> >> I'm not sure what the syntax is for PySpark, but the standard SQL would >> be something like this: >> >> WITH InputData AS >> ( >> SELECT 'USA' Country, 'New York' City, 9000000 Population >> UNION >> SELECT 'USA' Country, 'Miami', 6200000 Population >> UNION >> SELECT 'Ukraine' Country, 'Kyiv', 3000000 Population >> UNION >> SELECT 'Ukraine' Country, 'Kharkiv', 1400000 Population >> ) >> >> SELECT *, ROW_NUMBER() OVER(PARTITION BY Country ORDER BY Population >> DESC) PopulationRank >> FROM InputData; >> >> Results would be something like this: >> >> Country City Population PopulationRank >> Ukraine Kyiv 3000000 1 >> Ukraine Kharkiv 1400000 2 >> USA New York 9000000 1 >> USA Miami 6200000 2 >> >> Which you could further filter in another CTE or subquery where >> PopulationRank = 1. >> >> As I mentioned, I'm not sure how this translates into PySpark, but that's >> the general concept in SQL. >> >> On Mon, Dec 19, 2022 at 12:01 PM Oliver Ruebenacker < >> oliv...@broadinstitute.org> wrote: >> >>> If we only wanted to know the biggest population, max function would >>> suffice. The problem is I also want the name of the city with the biggest >>> population. >>> >>> On Mon, Dec 19, 2022 at 11:58 AM Sean Owen <sro...@gmail.com> wrote: >>> >>>> As Mich says, isn't this just max by population partitioned by country >>>> in a window function? >>>> >>>> On Mon, Dec 19, 2022, 9:45 AM Oliver Ruebenacker < >>>> oliv...@broadinstitute.org> wrote: >>>> >>>>> >>>>> Hello, >>>>> >>>>> Thank you for the response! >>>>> >>>>> I can think of two ways to get the largest city by country, but both >>>>> seem to be inefficient: >>>>> >>>>> (1) I could group by country, sort each group by population, add the >>>>> row number within each group, and then retain only cities with a row >>>>> number >>>>> equal to 1. But it seems wasteful to sort everything when I only want the >>>>> largest of each country >>>>> >>>>> (2) I could group by country, get the maximum city population for >>>>> each country, join that with the original data frame, and then retain only >>>>> cities with population equal to the maximum population in the country. But >>>>> that seems also expensive because I need to join. >>>>> >>>>> Am I missing something? >>>>> >>>>> Thanks! >>>>> >>>>> Best, Oliver >>>>> >>>>> On Mon, Dec 19, 2022 at 10:59 AM Mich Talebzadeh < >>>>> mich.talebza...@gmail.com> wrote: >>>>> >>>>>> In spark you can use windowing function >>>>>> <https://sparkbyexamples.com/spark/spark-sql-window-functions/>s to >>>>>> achieve this >>>>>> >>>>>> HTH >>>>>> >>>>>> >>>>>> view my Linkedin profile >>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> >>>>>> >>>>>> >>>>>> https://en.everybodywiki.com/Mich_Talebzadeh >>>>>> >>>>>> >>>>>> >>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility >>>>>> for any loss, damage or destruction of data or any other property which >>>>>> may >>>>>> arise from relying on this email's technical content is explicitly >>>>>> disclaimed. The author will in no case be liable for any monetary damages >>>>>> arising from such loss, damage or destruction. >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> On Mon, 19 Dec 2022 at 15:28, Oliver Ruebenacker < >>>>>> oliv...@broadinstitute.org> wrote: >>>>>> >>>>>>> >>>>>>> Hello, >>>>>>> >>>>>>> How can I retain from each group only the row for which one value >>>>>>> is the maximum of the group? For example, imagine a DataFrame containing >>>>>>> all major cities in the world, with three columns: (1) City name (2) >>>>>>> Country (3) population. How would I get a DataFrame that only contains >>>>>>> the >>>>>>> largest city in each country? Thanks! >>>>>>> >>>>>>> Best, Oliver >>>>>>> >>>>>>> -- >>>>>>> Oliver Ruebenacker, Ph.D. (he) >>>>>>> Senior Software Engineer, Knowledge Portal Network >>>>>>> <http://kp4cd.org/>, Flannick Lab <http://www.flannicklab.org/>, Broad >>>>>>> Institute <http://www.broadinstitute.org/> >>>>>>> >>>>>> >>>>> >>>>> -- >>>>> Oliver Ruebenacker, Ph.D. (he) >>>>> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>, >>>>> Flannick Lab <http://www.flannicklab.org/>, Broad Institute >>>>> <http://www.broadinstitute.org/> >>>>> >>>> >>> >>> -- >>> Oliver Ruebenacker, Ph.D. (he) >>> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>, >>> Flannick >>> Lab <http://www.flannicklab.org/>, Broad Institute >>> <http://www.broadinstitute.org/> >>> >> > > -- > Bjørn Jørgensen > Vestre Aspehaug 4, 6010 Ålesund > Norge > > +47 480 94 297 > -- Oliver Ruebenacker, Ph.D. (he) Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>, Flannick Lab <http://www.flannicklab.org/>, Broad Institute <http://www.broadinstitute.org/>