I think this will be slow because you have to do a group by then do a join (my 
table has 7 million records). I am looking for something like reduceByKey(), 
e.g.

rdd.reduceByKey( (a, b) => if (a.timeStamp > b.timeStamp) a else b )

Does it have similar thing in DataFrame? Of course I can convert a DataFrame to 
RDD and then invoke the recudeByKey

Ningjun

From: ayan guha [mailto:guha.a...@gmail.com]
Sent: Thursday, April 30, 2015 3:41 AM
To: Wang, Ningjun (LNG-NPV)
Cc: user@spark.apache.org
Subject: RE: HOw can I merge multiple DataFrame and remove duplicated key


1. Do a group by and get Max. In your example select id, Max(DT) from t group 
by id. Name this j.
2. Join t,j on id and DT=mxdt

This is how we used to query RDBMS before window functions show up.
As I understand from SQL, group by allow you to do sum(), average(), max(), 
mn(). But how do I select the entire row in the group with maximum column 
timeStamp? For example

id1,  value1, 2015-01-01
id1, value2,  2015-01-02
id2,  value3, 2015-01-01
id2, value4,  2015-01-02

I want to return
id1, value2,  2015-01-02
id2, value4,  2015-01-02

I can use reduceByKey() in RDD but how to do it using DataFrame? Can you give 
an example code snipet?

Thanks
Ningjun

From: ayan guha [mailto:guha.a...@gmail.com<mailto:guha.a...@gmail.com>]
Sent: Wednesday, April 29, 2015 5:54 PM
To: Wang, Ningjun (LNG-NPV)
Cc: user@spark.apache.org<mailto:user@spark.apache.org>
Subject: Re: HOw can I merge multiple DataFrame and remove duplicated key


Its no different, you would use group by and aggregate function to do so.
On 30 Apr 2015 02:15, "Wang, Ningjun (LNG-NPV)" 
<ningjun.w...@lexisnexis.com<mailto:ningjun.w...@lexisnexis.com>> wrote:
I have multiple DataFrame objects each stored in a parquet file.  The DataFrame 
just contains 3 columns (id,  value,  timeStamp). I need to union all the 
DataFrame objects together but for duplicated id only keep the record with the 
latest timestamp. How can I  do that?

I can do this for RDDs by sc.union() to union all the RDDs and then do a 
reduceByKey() to remove duplicated id by keeping only the one with latest 
timeStamp field. But how do I do it for DataFrame?


Ningjun

Reply via email to