RE: HOw can I merge multiple DataFrame and remove duplicated key

ayan guha Thu, 30 Apr 2015 00:41:44 -0700

1. Do a group by and get Max. In your example select id, Max(DT) from t
group by id. Name this j.
2. Join t,j on id and DT=mxdt


This is how we used to query RDBMS before window functions show up.

As I understand from SQL, group by allow you to do sum(), average(), max(),
mn(). But how do I select the entire row in the group with maximum column
timeStamp? For example



id1,  value1, 2015-01-01

id1, value2,  2015-01-02

id2,  value3, 2015-01-01

id2, value4,  2015-01-02



I want to return

id1, value2,  2015-01-02

id2, value4,  2015-01-02



I can use reduceByKey() in RDD but how to do it using DataFrame? Can you
give an example code snipet?



Thanks

Ningjun



*From:* ayan guha [mailto:guha.a...@gmail.com]
*Sent:* Wednesday, April 29, 2015 5:54 PM
*To:* Wang, Ningjun (LNG-NPV)
*Cc:* user@spark.apache.org
*Subject:* Re: HOw can I merge multiple DataFrame and remove duplicated key



Its no different, you would use group by and aggregate function to do so.

On 30 Apr 2015 02:15, "Wang, Ningjun (LNG-NPV)" <ningjun.w...@lexisnexis.com>
wrote:

I have multiple DataFrame objects each stored in a parquet file.  The
DataFrame just contains 3 columns (id,  value,  timeStamp). I need to union
all the DataFrame objects together but for duplicated id only keep the
record with the latest timestamp. How can I  do that?



I can do this for RDDs by sc.union() to union all the RDDs and then do a
reduceByKey() to remove duplicated id by keeping only the one with latest
timeStamp field. But how do I do it for DataFrame?





Ningjun

RE: HOw can I merge multiple DataFrame and remove duplicated key

Reply via email to