Jorn,

Thanks for the response. My downstream database is Kudu.

1. Yes. As you suggested, I have been using a central caching mechanism
that caches the RDD results and compares them with the next batch, keeping
the records with the latest timestamps and dropping the older ones. But
handling this myself is neither easy nor efficient; a rough sketch of my
current approach is below.
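
This is roughly what I am doing today, much simplified. The key and
timestamp types and all the names here are placeholders for my actual
schema, and sc is an existing SparkContext:

import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// cached state: the latest timestamp seen so far for every key
var latestSeen: RDD[(String, Long)] = sc.emptyRDD[(String, Long)]

def processBatch(batch: RDD[(String, Long)]): RDD[(String, Long)] = {
  // latest timestamp per key within this batch
  val newest = batch.reduceByKey(_ max _)

  // drop anything not strictly newer than what the cache already holds
  val fresh = newest
    .leftOuterJoin(latestSeen)
    .filter { case (_, (ts, cached)) => cached.forall(ts > _) }
    .mapValues { case (ts, _) => ts }

  // merge the fresh records into the cache for the next batch
  val merged = latestSeen.union(fresh).reduceByKey(_ max _)
    .persist(StorageLevel.MEMORY_AND_DISK)
  merged.count()       // materialize before dropping the old cache
  latestSeen.unpersist()
  latestSeen = merged  // lineage still grows; needs periodic checkpointing
  fresh
}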

2. My main objective is to update each record with its latest timestamp. If
I define the timestamp as the primary key, then all I will ever be doing is
a plain insert, because the timestamp will almost always be unique (in my
case it has nanosecond granularity). What I actually want is to key on the
record id and upsert the row carrying the newest timestamp; a sketch of
that is below.
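
Something like this is what I am after. The Kudu master address, table
name, and column names are all placeholders; recordKey would be the Kudu
primary key, eventTime an ordinary column, and payload stands in for the
rest of the row:

import org.apache.kudu.spark.kudu.KuduContext
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.getOrCreate()
val kudu = new KuduContext("kudu-master:7051", spark.sparkContext)

def upsertLatest(batch: DataFrame): Unit = {
  // one row per key: the struct trick keeps the row with the max eventTime
  val latest = batch
    .groupBy("recordKey")
    .agg(max(struct(col("eventTime"), col("payload"))).as("m"))
    .select(col("recordKey"), col("m.eventTime"), col("m.payload"))

  // Kudu replaces the row when the primary key already exists
  kudu.upsertRows(latest, "impala::default.my_table")
}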

I am looking for functionality within Spark itself to achieve this. I have
been reading about windowing and watermarking, but I am doubtful because
the examples I find use them only for aggregations, and I am not sure they
apply to this scenario. Any suggestions are appreciated; the kind of
pattern I keep running into is sketched below.
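
For reference, this is the sort of example I keep seeing (Structured
Streaming; the Kafka source and every column name here are placeholders
for my pipeline). The watermark bounds the state, but the output is a
per-window aggregate rather than the full latest row per key, which is
why I doubt it fits my upsert case:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.getOrCreate()

// placeholder source; my real input is different
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(key AS STRING) AS recordKey", "timestamp AS eventTime")

// windowed aggregation with a watermark: data arriving more than
// 10 minutes late is dropped, but the result is an aggregate per
// window, not the full record
val latestPerKey = events
  .withWatermark("eventTime", "10 minutes")
  .groupBy(window(col("eventTime"), "5 minutes"), col("recordKey"))
  .agg(max("eventTime").as("latestTime"))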


Thanks,
Ravi




