As I understand from SQL, GROUP BY lets you compute sum(), avg(), max(), and min(). But how do I select the entire row in each group that has the maximum value in the timeStamp column? For example:

id1, value1, 2015-01-01
id1, value2, 2015-01-02
id2, value3, 2015-01-01
id2, value4, 2015-01-02

I want to return

id1, value2, 2015-01-02
id2, value4, 2015-01-02

I can use reduceByKey() on an RDD, but how do I do this with a DataFrame? Can you give an example code snippet?

Thanks
Ningjun

From: ayan guha [mailto:[email protected]]
Sent: Wednesday, April 29, 2015 5:54 PM
To: Wang, Ningjun (LNG-NPV)
Cc: [email protected]
Subject: Re: HOw can I merge multiple DataFrame and remove duplicated key

It's no different; you would use group by and an aggregate function to do so.

On 30 Apr 2015 02:15, "Wang, Ningjun (LNG-NPV)" <[email protected]> wrote:

I have multiple DataFrame objects, each stored in a parquet file. Each DataFrame contains just 3 columns (id, value, timeStamp). I need to union all the DataFrame objects together, but for a duplicated id keep only the record with the latest timeStamp. How can I do that?

I can do this for RDDs by using sc.union() to union all the RDDs and then reduceByKey() to remove duplicated ids, keeping only the record with the latest timeStamp field. But how do I do it for DataFrame?

Ningjun
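
For reference, the union-then-reduce logic described above (union everything, then for each id keep the whole row with the latest timeStamp) can be sketched in plain Python. This is only an illustration of the intended semantics, not Spark code: the function name latest_per_id and the split of the sample rows into part_a and part_b are assumptions made for the example, and a DataFrame version would have to express the same per-id "keep the newest row" rule.

```python
from datetime import date
from itertools import chain

# Hypothetical inputs: the four sample rows (id, value, timeStamp) from the
# message, split across two "tables" purely for illustration.
part_a = [("id1", "value1", date(2015, 1, 1)), ("id2", "value3", date(2015, 1, 1))]
part_b = [("id1", "value2", date(2015, 1, 2)), ("id2", "value4", date(2015, 1, 2))]


def latest_per_id(*tables):
    """Union all tables, then keep the full row with the newest timeStamp per id."""
    latest = {}
    for row in chain.from_iterable(tables):  # the "union" step
        key, _, ts = row
        # The "reduce by key" step: replace the stored row only if this one is newer.
        if key not in latest or ts > latest[key][2]:
            latest[key] = row
    return sorted(latest.values())


result = latest_per_id(part_a, part_b)
for row in result:
    print(row)
```

With the sample rows, this keeps the (id1, value2, 2015-01-02) and (id2, value4, 2015-01-02) rows, matching the desired output listed in the question.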
