Re: adding a column to a groupBy (dataframe)

2019-06-07 Thread Marcelo Valle
Hi Bruno, that's really interesting... So, to use explode, I would have to do a group by on countries and a collect_list on the cities, then explode the cities, right? Am I understanding the idea correctly? I think this could produce the results I want. But what would be the behaviour under the hood?
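
A minimal sketch of that pattern (Scala, Spark 2.x dataframe API; the sample rows and column names are made up for illustration):

    import org.apache.spark.sql.functions.{collect_list, explode, monotonically_increasing_id}
    import spark.implicits._  // assumes an existing SparkSession named `spark`, e.g. in spark-shell

    // Made-up denormalized input: one row per (country, city)
    val denormalized = Seq(
      ("UK", "London"), ("UK", "Leeds"), ("FR", "Paris")
    ).toDF("COUNTRY", "CITY")

    // 1. Group by country, collect the cities, tag each country with a generated id
    val grouped = denormalized
      .groupBy($"COUNTRY")
      .agg(collect_list($"CITY").as("CITIES"))
      .withColumn("COUNTRY_ID", monotonically_increasing_id())

    // 2. Explode the collected cities back into one row per city,
    //    each row now carrying the generated COUNTRY_ID
    val cities = grouped.select($"COUNTRY_ID", explode($"CITIES").as("CITY"))
    // `grouped` (dropping CITIES) is the parent table, `cities` the child table

As for the behaviour under the hood: the groupBy/collect_list step shuffles the rows by country and materializes each country's full city list inside a single row, so it works well as long as no single country has a huge number of cities.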

Re: adding a column to a groupBy (dataframe)

2019-06-06 Thread Bruno Nassivet
Hi Marcelo, Maybe spark.sql.functions.explode gives what you need? // Bruno > On 6 Jun 2019, at 16:02, Marcelo Valle wrote: > > Generating the city id (child) is easy, monotonically increasing id worked > for me. > > The problem is the country (parent) which has to be in both
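
A tiny illustration of what explode does (Scala; assumes a running SparkSession named `spark`, as in spark-shell):

    import org.apache.spark.sql.functions.explode
    import spark.implicits._

    // explode turns one row holding an array column into one output row per element
    val withArrays = Seq(("UK", Seq("London", "Leeds"))).toDF("COUNTRY", "CITIES")
    val flattened  = withArrays.select($"COUNTRY", explode($"CITIES").as("CITY"))
    // flattened contains the rows (UK, London) and (UK, Leeds)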

Re: adding a column to a groupBy (dataframe)

2019-06-06 Thread Marcelo Valle
> -- > *From:* Marcelo Valle > *Sent:* Thursday, June 6, 2019 16:02 > *To:* Magnus Nilsson > *Cc:* user @spark > *Subject:* Re: adding a column to a groupBy (dataframe) > > Generating the city id (child) is easy, monotonically increasing id worked

Re: adding a column to a groupBy (dataframe)

2019-06-06 Thread Marcelo Valle
Generating the city id (child) is easy, monotonically increasing id worked for me. The problem is the country (parent) which has to be in both countries and cities data frames. On Thu, 6 Jun 2019 at 14:57, Magnus Nilsson wrote: > Well, you could do a repartition on cityname/nrOfCities and
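
One way to get the generated parent id into both frames is to build the parent table first and then join it back on the natural key. A rough sketch of that idea (Scala; the sample rows are made up, and `spark` is assumed to be an existing SparkSession):

    import org.apache.spark.sql.functions.monotonically_increasing_id
    import spark.implicits._

    // Made-up denormalized input
    val denormalizedCities = Seq(
      ("UK", "London", "The Big Smoke"),
      ("UK", "Leeds",  "Leodis"),
      ("FR", "Paris",  "City of Light")
    ).toDF("COUNTRY", "CITY", "CITY_NICKNAME")

    // Parent table: one row per country with a generated surrogate id
    val countries = denormalizedCities
      .select($"COUNTRY").distinct()
      .withColumn("COUNTRY_ID", monotonically_increasing_id())

    // Child table: join the id back by country name, so the same id is in both outputs
    val cities = denormalizedCities
      .join(countries, Seq("COUNTRY"))
      .select($"COUNTRY_ID", $"CITY", $"CITY_NICKNAME")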

Re: adding a column to a groupBy (dataframe)

2019-06-06 Thread Magnus Nilsson
Well, you could do a repartition on cityname/nrOfCities and use the spark_partition_id function or the RDD mapPartitionsWithIndex method to add a city id column. Then just split the dataframe into two subsets. Be careful of hash collisions on the repartition key though, or more than one city
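
A rough sketch of that approach (Scala; the column names and sample rows are illustrative only, and `spark` is assumed to be an existing SparkSession):

    import org.apache.spark.sql.functions.spark_partition_id
    import spark.implicits._

    val denormalizedCities = Seq(
      ("UK", "London"), ("UK", "Leeds"), ("FR", "Paris")
    ).toDF("COUNTRY", "CITY")

    // Hash-repartition by country, then use the partition index as the country id.
    // Caveat from the message above: hash partitioning can put more than one
    // country into the same partition, so two countries may end up sharing an id.
    val withIds = denormalizedCities
      .repartition($"COUNTRY")
      .withColumn("COUNTRY_ID", spark_partition_id())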

Re: adding a column to a groupBy (dataframe)

2019-06-06 Thread Marcelo Valle
Akshay, First of all, thanks for the answer. I *am* using monotonically increasing id, but that's not my problem. My problem is that I want to output 2 tables from 1 data frame: 1 parent table with an ID for the group by, and 1 child table with the parent id but without the group by. I was able to solve this
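
Once every row carries the generated parent id, the two tables are just two projections of the same frame. A hedged sketch (the `enriched` frame, its ids and the output paths are hypothetical):

    import spark.implicits._  // assumes an existing SparkSession named `spark`

    // `enriched` stands for a frame where each denormalized row already carries
    // its generated COUNTRY_ID (e.g. the output of the join sketch above)
    val enriched = Seq(
      (0L, "UK", "London", "The Big Smoke"),
      (0L, "UK", "Leeds",  "Leodis"),
      (1L, "FR", "Paris",  "City of Light")
    ).toDF("COUNTRY_ID", "COUNTRY", "CITY", "CITY_NICKNAME")

    // Parent table: distinct country rows; child table: city rows with the parent id
    val parentOut = enriched.select($"COUNTRY_ID", $"COUNTRY").distinct()
    val childOut  = enriched.select($"COUNTRY_ID", $"CITY", $"CITY_NICKNAME")

    // Hypothetical output locations
    parentOut.write.mode("overwrite").parquet("/tmp/countries")
    childOut.write.mode("overwrite").parquet("/tmp/cities")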

Re: adding a column to a groupBy (dataframe)

2019-06-06 Thread Akshay Bhardwaj
Additionally, there is a "uuid" function available as well, if that helps your use case. Akshay Bhardwaj +91-97111-33849 On Thu, Jun 6, 2019 at 3:18 PM Akshay Bhardwaj <akshay.bhardwaj1...@gmail.com> wrote: > Hi Marcelo, > > If you are using Spark 2.3+ and the dataset API/SparkSQL, you can use this >
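
A small sketch using the SQL built-in uuid() through expr (Scala; sample data made up, `spark` assumed to be an existing SparkSession):

    import org.apache.spark.sql.functions.expr
    import spark.implicits._

    // uuid() returns a random UUID string per row; unlike monotonically_increasing_id
    // the generated ids are strings rather than longs
    val countries = Seq("UK", "FR").toDF("COUNTRY")
      .withColumn("COUNTRY_ID", expr("uuid()"))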

Re: adding a column to a groupBy (dataframe)

2019-06-06 Thread Akshay Bhardwaj
Hi Marcelo, If you are using Spark 2.3+ and the dataset API/SparkSQL, you can use the inbuilt function "monotonically_increasing_id" in Spark. A little tweaking using Spark SQL inbuilt functions can enable you to achieve this without having to write code or define RDDs with map/reduce functions.
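
A minimal example of that function (Scala; sample data made up, `spark` assumed to be an existing SparkSession):

    import org.apache.spark.sql.functions.monotonically_increasing_id
    import spark.implicits._

    // monotonically_increasing_id() gives a 64-bit id that is unique across rows
    // (the partition id is encoded in the upper bits) but not consecutive
    val withId = Seq("UK", "FR").toDF("COUNTRY")
      .withColumn("COUNTRY_ID", monotonically_increasing_id())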

adding a column to a groupBy (dataframe)

2019-05-29 Thread Marcelo Valle
Hi all, I am new to Spark and I am trying to write an application using dataframes that normalizes data. So I have a dataframe `denormalized_cities` with 3 columns: COUNTRY, CITY, CITY_NICKNAME. Here is what I want to do: 1. Map by country, then for each country generate a new ID and write
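
For context, a made-up stand-in for the input and the intended outputs (Scala; the sample rows are purely illustrative, and the replies above sketch several ways to generate the COUNTRY_ID):

    import spark.implicits._  // assumes an existing SparkSession named `spark`

    // Stand-in for the denormalized_cities frame described above
    val denormalized_cities = Seq(
      ("UK", "London", "The Big Smoke"),
      ("UK", "Leeds",  "Leodis"),
      ("FR", "Paris",  "City of Light")
    ).toDF("COUNTRY", "CITY", "CITY_NICKNAME")

    // Intended outputs:
    //   countries: COUNTRY_ID, COUNTRY             -- one row per country, with a new id
    //   cities:    COUNTRY_ID, CITY, CITY_NICKNAME -- every city row tagged with its country's id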