Hi All,

Floating this again. Any suggestions?
Akshay Bhardwaj
+91-97111-33849

On Tue, Apr 30, 2019 at 7:30 PM Akshay Bhardwaj <[email protected]> wrote:

> Hi Experts,
>
> I am using Spark Structured Streaming to read messages from Kafka, with a
> producer that works with an at-least-once guarantee. This streaming job is
> running on a YARN cluster with Hadoop 2.7 and Spark 2.3.
>
> What is the most reliable strategy for avoiding duplicate data within the
> stream in the scenarios of fail-over or job restarts/re-submits, and for
> guaranteeing an exactly-once, non-duplicate stream?
>
> 1. One of the strategies I have read other people using is to maintain
>    an external KV store for a unique key/checksum of each incoming message,
>    and write to a 2nd Kafka topic only if the checksum is not present in the
>    KV store. A sketch of this idea follows below.
>    - My doubt with this approach is how to ensure a safe write to both the
>      2nd topic and the KV store for the checksum, in the case of unexpected
>      failures. How does that guarantee exactly-once with restarts?
>
> Any suggestions are highly appreciated.
>
> Akshay Bhardwaj
> +91-97111-33849
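
For concreteness, here is a minimal Scala sketch of the KV-store strategy from
point 1, using Spark 2.3's ForeachWriter sink. The detail that helps with
restarts is real: Spark passes open() a (partitionId, epochId) pair, and
returning false skips data that an earlier attempt already committed. KvStore
and Kafka2, and all of their methods (contains, put, epochCommitted,
markEpochCommitted, send), are hypothetical stand-ins for whatever store and
producer client would actually be used.

    import org.apache.spark.sql.ForeachWriter

    class DedupWriter extends ForeachWriter[(String, String)] {
      private var partId: Long = _
      private var epoch: Long = _

      override def open(partitionId: Long, epochId: Long): Boolean = {
        partId = partitionId
        epoch = epochId
        // Skip this partition/epoch entirely if a previous attempt already
        // committed it -- this makes restarts safe at micro-batch granularity.
        !KvStore.epochCommitted(partId, epoch)
      }

      override def process(record: (String, String)): Unit = {
        val (checksum, payload) = record
        if (!KvStore.contains(checksum)) {
          Kafka2.send("second-topic", checksum, payload) // publish first...
          KvStore.put(checksum)                          // ...then record the checksum
        }
      }

      override def close(errorOrNull: Throwable): Unit = {
        // Mark the epoch committed only if the batch finished cleanly.
        if (errorOrNull == null) KvStore.markEpochCommitted(partId, epoch)
      }
    }

It would be attached with stream.as[(String, String)].writeStream.foreach(new
DedupWriter).option("checkpointLocation", "hdfs:///checkpoints/dedup").start().
Note the ordering in process(): publishing before recording the checksum means
a crash between the two calls can only re-emit that record (a duplicate in the
2nd topic, never a loss). Truly atomic writes across two independent systems
would need something like Kafka transactions, which Spark 2.3's sinks do not
use, so downstream consumers should still be idempotent.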

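Separately from the KV-store idea, and named plainly as a different technique:
Spark 2.3 also has built-in streaming deduplication (dropDuplicates combined
with a watermark), which covers the within-stream duplicate case because the
seen-keys state lives in the checkpoint and so survives restarts/re-submits.
A minimal sketch, where the broker address, topic names, and checkpoint path
are placeholders:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("dedup-stream").getOrCreate()

    val input = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092") // placeholder
      .option("subscribe", "input-topic")                // placeholder
      .load()
      .selectExpr("CAST(key AS STRING) AS msgKey",
                  "CAST(value AS STRING) AS payload",
                  "timestamp")

    // Built-in streaming deduplication: the keys seen within the watermark
    // are kept in the checkpointed state store, so dedup survives restarts.
    // Duplicates are caught only if the retried record carries the same key
    // and timestamp (true for producer retries that preserve CreateTime).
    val deduped = input
      .withWatermark("timestamp", "1 hour")
      .dropDuplicates("msgKey", "timestamp")

    deduped.selectExpr("msgKey AS key", "payload AS value")
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")          // placeholder
      .option("topic", "second-topic")                            // placeholder
      .option("checkpointLocation", "hdfs:///checkpoints/dedup")  // reliable storage
      .start()

Caveat: the Kafka sink itself is at-least-once, so a failure between writing a
micro-batch and committing the checkpoint can still re-publish that batch;
consumers of the 2nd topic must tolerate or re-deduplicate those records.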