I think this will be slow because you have to do a group by then do a join (my table has 7 million records). I am looking for something like reduceByKey(), e.g.
rdd.reduceByKey( (a, b) => if (a.timeStamp > b.timeStamp) a else b ) Does it have similar thing in DataFrame? Of course I can convert a DataFrame to RDD and then invoke the recudeByKey Ningjun From: ayan guha [mailto:guha.a...@gmail.com] Sent: Thursday, April 30, 2015 3:41 AM To: Wang, Ningjun (LNG-NPV) Cc: user@spark.apache.org Subject: RE: HOw can I merge multiple DataFrame and remove duplicated key 1. Do a group by and get Max. In your example select id, Max(DT) from t group by id. Name this j. 2. Join t,j on id and DT=mxdt This is how we used to query RDBMS before window functions show up. As I understand from SQL, group by allow you to do sum(), average(), max(), mn(). But how do I select the entire row in the group with maximum column timeStamp? For example id1, value1, 2015-01-01 id1, value2, 2015-01-02 id2, value3, 2015-01-01 id2, value4, 2015-01-02 I want to return id1, value2, 2015-01-02 id2, value4, 2015-01-02 I can use reduceByKey() in RDD but how to do it using DataFrame? Can you give an example code snipet? Thanks Ningjun From: ayan guha [mailto:guha.a...@gmail.com<mailto:guha.a...@gmail.com>] Sent: Wednesday, April 29, 2015 5:54 PM To: Wang, Ningjun (LNG-NPV) Cc: user@spark.apache.org<mailto:user@spark.apache.org> Subject: Re: HOw can I merge multiple DataFrame and remove duplicated key Its no different, you would use group by and aggregate function to do so. On 30 Apr 2015 02:15, "Wang, Ningjun (LNG-NPV)" <ningjun.w...@lexisnexis.com<mailto:ningjun.w...@lexisnexis.com>> wrote: I have multiple DataFrame objects each stored in a parquet file. The DataFrame just contains 3 columns (id, value, timeStamp). I need to union all the DataFrame objects together but for duplicated id only keep the record with the latest timestamp. How can I do that? I can do this for RDDs by sc.union() to union all the RDDs and then do a reduceByKey() to remove duplicated id by keeping only the one with latest timeStamp field. But how do I do it for DataFrame? Ningjun