1. Do a group by and get Max. In your example select id, Max(DT) from t group by id. Name this j. 2. Join t,j on id and DT=mxdt
This is how we used to query RDBMS before window functions show up. As I understand from SQL, group by allow you to do sum(), average(), max(), mn(). But how do I select the entire row in the group with maximum column timeStamp? For example id1, value1, 2015-01-01 id1, value2, 2015-01-02 id2, value3, 2015-01-01 id2, value4, 2015-01-02 I want to return id1, value2, 2015-01-02 id2, value4, 2015-01-02 I can use reduceByKey() in RDD but how to do it using DataFrame? Can you give an example code snipet? Thanks Ningjun *From:* ayan guha [mailto:guha.a...@gmail.com] *Sent:* Wednesday, April 29, 2015 5:54 PM *To:* Wang, Ningjun (LNG-NPV) *Cc:* user@spark.apache.org *Subject:* Re: HOw can I merge multiple DataFrame and remove duplicated key Its no different, you would use group by and aggregate function to do so. On 30 Apr 2015 02:15, "Wang, Ningjun (LNG-NPV)" <ningjun.w...@lexisnexis.com> wrote: I have multiple DataFrame objects each stored in a parquet file. The DataFrame just contains 3 columns (id, value, timeStamp). I need to union all the DataFrame objects together but for duplicated id only keep the record with the latest timestamp. How can I do that? I can do this for RDDs by sc.union() to union all the RDDs and then do a reduceByKey() to remove duplicated id by keeping only the one with latest timeStamp field. But how do I do it for DataFrame? Ningjun