Hi All,

Floating this again. Any suggestions?
Akshay Bhardwaj
+91-97111-33849

On Tue, Apr 30, 2019 at 7:30 PM Akshay Bhardwaj <[email protected]> wrote:

> Hi Experts,
>
> I am using Spark Structured Streaming to read messages from Kafka, with a
> producer that works with an at-least-once guarantee. This streaming job is
> running on a YARN cluster with Hadoop 2.7 and Spark 2.3.
>
> What is the most reliable strategy for avoiding duplicate data within the
> stream in the scenarios of fail-over or job restarts/re-submits, and for
> guaranteeing an exactly-once, non-duplicate stream?
>
> 1. One of the strategies I have read other people using is to maintain
>    an external KV store for a unique key/checksum of each incoming message,
>    and write to a 2nd Kafka topic only if the checksum is not present in the
>    KV store. A sketch of this idea follows below.
>    - My doubt with this approach is how to ensure a safe write to both the
>      2nd topic and the KV store for the checksum, in the case of unexpected
>      failures. How does that guarantee exactly-once with restarts?
>
> Any suggestions are highly appreciated.
>
> Akshay Bhardwaj
> +91-97111-33849
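
For concreteness, here is a minimal Scala sketch of the KV-store strategy from
point 1, using Spark 2.3's ForeachWriter sink. The detail that helps with
restarts is real: Spark passes open() a (partitionId, epochId) pair, and
returning false skips data that an earlier attempt already committed. KvStore
and Kafka2, and all of their methods (contains, put, epochCommitted,
markEpochCommitted, send), are hypothetical stand-ins for whatever store and
producer client would actually be used.

    import org.apache.spark.sql.ForeachWriter

    class DedupWriter extends ForeachWriter[(String, String)] {
      private var partId: Long = _
      private var epoch: Long = _

      override def open(partitionId: Long, epochId: Long): Boolean = {
        partId = partitionId
        epoch = epochId
        // Skip this partition/epoch entirely if a previous attempt already
        // committed it -- this makes restarts safe at micro-batch granularity.
        !KvStore.epochCommitted(partId, epoch)
      }

      override def process(record: (String, String)): Unit = {
        val (checksum, payload) = record
        if (!KvStore.contains(checksum)) {
          Kafka2.send("second-topic", checksum, payload) // publish first...
          KvStore.put(checksum)                          // ...then record the checksum
        }
      }

      override def close(errorOrNull: Throwable): Unit = {
        // Mark the epoch committed only if the batch finished cleanly.
        if (errorOrNull == null) KvStore.markEpochCommitted(partId, epoch)
      }
    }

It would be attached with stream.as[(String, String)].writeStream.foreach(new
DedupWriter).option("checkpointLocation", "hdfs:///checkpoints/dedup").start().
Note the ordering in process(): publishing before recording the checksum means
a crash between the two calls can only re-emit that record (a duplicate in the
2nd topic, never a loss). Truly atomic writes across two independent systems
would need something like Kafka transactions, which Spark 2.3's sinks do not
use, so downstream consumers should still be idempotent.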

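Separately from the KV-store idea, and named plainly as a different technique:
Spark 2.3 also has built-in streaming deduplication (dropDuplicates combined
with a watermark), which covers the within-stream duplicate case because the
seen-keys state lives in the checkpoint and so survives restarts/re-submits.
A minimal sketch, where the broker address, topic names, and checkpoint path
are placeholders:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("dedup-stream").getOrCreate()

    val input = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092") // placeholder
      .option("subscribe", "input-topic")                // placeholder
      .load()
      .selectExpr("CAST(key AS STRING) AS msgKey",
                  "CAST(value AS STRING) AS payload",
                  "timestamp")

    // Built-in streaming deduplication: the keys seen within the watermark
    // are kept in the checkpointed state store, so dedup survives restarts.
    // Duplicates are caught only if the retried record carries the same key
    // and timestamp (true for producer retries that preserve CreateTime).
    val deduped = input
      .withWatermark("timestamp", "1 hour")
      .dropDuplicates("msgKey", "timestamp")

    deduped.selectExpr("msgKey AS key", "payload AS value")
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")          // placeholder
      .option("topic", "second-topic")                            // placeholder
      .option("checkpointLocation", "hdfs:///checkpoints/dedup")  // reliable storage
      .start()

Caveat: the Kafka sink itself is at-least-once, so a failure between writing a
micro-batch and committing the checkpoint can still re-publish that batch;
consumers of the 2nd topic must tolerate or re-deduplicate those records.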