It's no different for DataFrames: union them, then use a group by on id with an aggregate function (max of timeStamp) to keep the latest record.
On 30 Apr 2015 02:15, "Wang, Ningjun (LNG-NPV)" <[email protected]>
wrote:

> I have multiple DataFrame objects, each stored in a parquet file. Each
> DataFrame contains just 3 columns (id, value, timeStamp). I need to union
> all the DataFrame objects together, but for duplicated ids keep only the
> record with the latest timeStamp. How can I do that?
>
>
>
> I can do this for RDDs by sc.union() to union all the RDDs and then do a
> reduceByKey() to remove duplicated id by keeping only the one with latest
> timeStamp field. But how do I do it for DataFrame?
>
>
>
>
>
> Ningjun
>
>
>
