What DB do you have? 

You have some options, such as:
1) use a key-value store (they can be read very efficiently) to check whether
a newer record for the same key has already been processed - if yes, ignore
the message; if no, write it to the database (see the first sketch below)
2) redesign the key to include the timestamp, store every version, and pick
the latest one when querying the database (see the second sketch below)
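
A minimal sketch of option 1 in Scala, assuming a Redis instance reachable at
"redis-host" and a hypothetical Record(key, ts, value) shape - none of these
names come from this thread, and the plain get/set pair below is not atomic
under concurrent writers (a Lua script or WATCH/MULTI would be needed for
strict correctness):

  import redis.clients.jedis.Jedis

  case class Record(key: String, ts: Long, value: String)

  // Run once per partition so each executor reuses a single connection.
  def upsertIfNewer(records: Iterator[Record]): Unit = {
    val jedis = new Jedis("redis-host", 6379)
    try {
      records.foreach { r =>
        val seenTs = Option(jedis.get(r.key)).map(_.toLong)
        if (seenTs.forall(_ < r.ts)) {
          jedis.set(r.key, r.ts.toString) // remember the newest timestamp seen
          // writeToDownstreamDb(r)       // hypothetical downstream update
        }
        // else: an equal or newer version was already processed, so drop r
      }
    } finally jedis.close()
  }

  // stream.foreachRDD(_.foreachPartition(upsertIfNewer))

For option 2 the idea is to make the timestamp part of the stored key, keep
every version, and let reads select the newest - again with illustrative
names only:

  val versionedKey = s"${r.key}:${r.ts}"
  // e.g. in Cassandra: PRIMARY KEY (key, ts) WITH CLUSTERING ORDER BY (ts DESC),
  // so SELECT ... WHERE key = ? LIMIT 1 returns the latest version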

> On 11. May 2018, at 23:25, ravidspark <ravi.pegas...@gmail.com> wrote:
> 
> Hi All,
> 
> I am using Spark 2.2.0 and I have the below use case:
> 
> *Reading from Kafka using Spark Streaming and updating (not just inserting)
> the records into a downstream database*
> 
> I understand that Spark reads messages from Kafka in the order of partition
> offsets, not in the order of the timestamps stored with the messages. So
> suppose there are two messages in Kafka with the same key: the one with the
> latest timestamp sits at the smaller offset, and the one with the oldest
> timestamp sits at a later offset. Since Spark reads offsets from smallest to
> largest, the latest timestamp is processed first and the oldest timestamp
> after it, resulting in out-of-order ingestion into the DB.
> 
> If both these messages fell into the same RDD, then applying a reduce
> function we could ignore the message with the older timestamp and keep the
> latest one. But I am not quite sure how to handle these messages when they
> fall into different RDDs in the stream. One approach I tried is to hit the
> DB, retrieve the timestamp stored for that key, compare, and ignore the
> message if it is older. But this is not efficient when handling millions of
> messages, since a DB round trip per message is expensive.
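
A minimal per-batch version of that reduce in Scala, assuming the same
hypothetical Record(key, ts, value) shape as in the sketch above; note that
it only deduplicates within one micro-batch, which is why a cross-batch check
such as option 1 is still needed:

  stream
    .map(r => (r.key, r))
    .reduceByKey((a, b) => if (a.ts >= b.ts) a else b) // newest timestamp wins
    .foreachRDD(_.values.foreachPartition(upsertIfNewer))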
> 
> Is there a better way of solving this problem?

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
