Jorn, Thanks for the response. My downstream database is Kudu.
1. Yes, as you suggested, I have been using a central caching mechanism that caches the RDD results and compares them against the next batch, keeping records with the latest timestamps and ignoring the older ones. However, handling this myself is neither easy nor efficient.

2. My main objective is to update each record with the latest timestamp. If I define the timestamp as the primary key, then every write becomes a plain insert, since the timestamp is almost always unique (in my case it has nanosecond granularity). I am looking for functionality within Spark to achieve this. I have been reading about windowing and watermarking, but I am doubtful because they seem to be used only for aggregations, and I am not sure they apply to this scenario.

Any suggestions are appreciated.

Thanks,
Ravi

--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
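P.S. For clarity, here is the "latest timestamp wins" rule I am trying to express, sketched in plain Python (not Spark code; the names `take_latest` and `latest_per_key` are just illustrative). In Spark the same reduction could be written as `rdd.map(lambda r: (r.key, r)).reduceByKey(take_latest)` instead of the hand-rolled cache comparison:

```python
# Sketch (plain Python, not Spark) of deduplicating records per key,
# keeping only the record with the latest timestamp.

def take_latest(a, b):
    """Of two records for the same key, keep the one with the later timestamp.

    Records are (key, timestamp_ns, value) tuples; index 1 is the timestamp.
    """
    return a if a[1] >= b[1] else b

def latest_per_key(records):
    """Reduce an iterable of (key, timestamp_ns, value) tuples to a dict
    mapping each key to its most recent record."""
    out = {}
    for rec in records:
        key = rec[0]
        out[key] = rec if key not in out else take_latest(out[key], rec)
    return out

# Example batch: "row1" arrives twice; the later timestamp should win.
batch = [
    ("row1", 100, "old"),
    ("row2", 150, "a"),
    ("row1", 200, "new"),
]
result = latest_per_key(batch)
# result["row1"] is ("row1", 200, "new")
```

The same idea in the DataFrame API would be a window partitioned by key, ordered by timestamp descending, keeping only the first row per partition.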