What DB do you have? You have a couple of options:

1) Use a key-value store (they can be accessed very efficiently) to check whether a newer record for the same key has already been processed: if yes, drop the incoming value; if no, write it to the database.

2) Redesign the key to include the timestamp, and pick out the latest entry per key when querying the database.
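For option 1, a minimal sketch of what that could look like in Spark Streaming. This is only illustrative: it assumes Redis via the Jedis client as the key-value store, a DStream of (key, (eventTimestamp, payload)) pairs already parsed from Kafka, and a hypothetical upsertIntoDb helper standing in for your actual DB write:

    import org.apache.spark.streaming.dstream.DStream
    import redis.clients.jedis.Jedis

    // Placeholder for your actual downstream-database upsert.
    def upsertIntoDb(key: String, ts: Long, value: String): Unit = {
      // e.g. INSERT ... ON CONFLICT UPDATE via JDBC
    }

    // stream: (key, (eventTimestamp, payload)) pairs parsed from Kafka
    def dedupeAndUpsert(stream: DStream[(String, (Long, String))]): Unit = {
      stream
        // Within each batch, keep only the latest record per key.
        .reduceByKey((a, b) => if (a._1 >= b._1) a else b)
        .foreachRDD { rdd =>
          rdd.foreachPartition { records =>
            // One Redis connection per partition, not per record.
            val jedis = new Jedis("localhost", 6379)
            try {
              records.foreach { case (key, (ts, value)) =>
                val lastSeen =
                  Option(jedis.get(s"ts:$key")).map(_.toLong).getOrElse(Long.MinValue)
                if (ts > lastSeen) {
                  jedis.set(s"ts:$key", ts.toString)
                  upsertIntoDb(key, ts, value)
                }
                // else: a record older than what we already wrote; ignore it.
              }
            } finally {
              jedis.close()
            }
          }
        }
    }

Note the get/set pair above is not atomic, so if two tasks could process the same key concurrently you would want something like a Redis Lua script or WATCH/MULTI around the check-and-set. Option 2 avoids the extra store entirely: make the timestamp part of the primary key and select the row with the max timestamp per key at query time.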
> On 11. May 2018, at 23:25, ravidspark <ravi.pegas...@gmail.com> wrote:
>
> Hi All,
>
> I am using Spark 2.2.0 and I have the following use case:
>
> *Reading from Kafka with Spark Streaming and updating (not just inserting)
> the records in a downstream database*
>
> I understand that Spark does not read messages from Kafka in timestamp
> order as they are stored in the Kafka partitions; it reads each partition
> in offset order. So suppose there are two messages in Kafka with the same
> key, where the one with the latest timestamp sits at the smaller offset and
> the one with the oldest timestamp sits at a later offset. Since Spark reads
> offsets from smallest to largest, the latest timestamp is processed first
> and the oldest one afterwards, resulting in out-of-order ingestion into
> the DB.
>
> If both of these messages fall into the same RDD, we can apply a reduce
> function to ignore the message with the older timestamp and process the
> latest one. But I am not quite sure how to handle the case where these
> messages fall into different RDDs of the stream. One approach I tried is
> to hit the DB, retrieve the timestamp stored for that key, compare, and
> ignore the record if it is older. But this is not efficient when handling
> millions of messages, since DB lookups are expensive.
>
> Is there a better way to solve this problem?
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org