Hi Anastasios,
Thanks for this.
I have a few doubts about this approach. The dropDuplicates operation will
keep all of the seen-key state across triggers.
1. Where is this data stored?
- IN_MEMORY state means the data is not persisted across job resubmits.
- Persisting state on disk (e.g. HDFS) has proved unreliable, as I
have encountered corrupted state files that cause errors on job restarts.
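
For reference, I checkpoint the streaming state to HDFS, roughly like this
(an untested sketch; broker, topic and path are placeholders, and it assumes
the deduplicated frame already carries the value column the Kafka sink
expects):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.StreamingQuery

// deduped: the stream after withWatermark + dropDuplicates
def startDedupQuery(deduped: DataFrame): StreamingQuery =
  deduped.writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")          // placeholder
    .option("topic", "deduped-topic")                          // placeholder
    // Dedup state is snapshotted under this directory between triggers;
    // this is where I have seen the corrupted files on restart.
    .option("checkpointLocation", "hdfs:///checkpoints/dedup") // placeholder
    .start()
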
Akshay Bhardwaj
+91-97111-33849
On Wed, May 1, 2019 at 3:20 PM Anastasios Zouzias <[email protected]> wrote:
> Hi,
>
> Have you checked the docs, i.e.,
> https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#streaming-deduplication
>
> You can generate a uuid column in your streaming DataFrame and drop
> duplicate messages with a single line of code.
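>
> For example (an untested sketch, per the streaming-deduplication section
> of the guide; it assumes each message already carries a unique "uuid"
> field and an event-time column "eventTime" to bound the state):
>
> import org.apache.spark.sql.DataFrame
>
> // parsed: the DataFrame after reading from Kafka and parsing the value
> def dedup(parsed: DataFrame): DataFrame =
>   parsed
>     .withWatermark("eventTime", "1 hour")  // bounds how long state is kept
>     .dropDuplicates("uuid", "eventTime")   // the one-liner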
>
> Best,
> Anastasios
>
> On Wed, May 1, 2019 at 11:15 AM Akshay Bhardwaj <
> [email protected]> wrote:
>
>> Hi All,
>>
>> Floating this again. Any suggestions?
>>
>>
>> Akshay Bhardwaj
>> +91-97111-33849
>>
>>
>> On Tue, Apr 30, 2019 at 7:30 PM Akshay Bhardwaj <
>> [email protected]> wrote:
>>
>>> Hi Experts,
>>>
>>> I am using Spark Structured Streaming to read messages from Kafka, with a
>>> producer that provides an at-least-once guarantee. The streaming job is
>>> running on a YARN cluster with Hadoop 2.7 and Spark 2.3.
>>>
>>> What is the most reliable strategy for avoiding duplicate data within the
>>> stream in scenarios of failover or job restarts/resubmits, and for
>>> guaranteeing an exactly-once, duplicate-free stream?
>>>
>>>
>>> 1. One of the strategies I have read about is to maintain an external
>>> KV store with a unique key/checksum for each incoming message, and to
>>> write to a 2nd Kafka topic only if the checksum is not already present
>>> in the KV store (see the sketch below).
>>> - My doubt with this approach is how to ensure a safe write to both
>>> the 2nd topic and the KV store holding the checksum in the case of
>>> unexpected failures. How does that guarantee exactly-once across restarts?
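>>>
>>> For concreteness, here is roughly what I had in mind, sketched with
>>> Redis as the KV store and a plain Kafka producer inside a ForeachWriter
>>> (Spark 2.3 has no foreachBatch; hosts, topics and column names are
>>> placeholders, and this is untested):
>>>
>>> import java.util.Properties
>>> import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
>>> import org.apache.spark.sql.{ForeachWriter, Row}
>>> import redis.clients.jedis.Jedis
>>>
>>> class DedupWriter extends ForeachWriter[Row] {
>>>   @transient private var jedis: Jedis = _
>>>   @transient private var producer: KafkaProducer[String, String] = _
>>>
>>>   override def open(partitionId: Long, version: Long): Boolean = {
>>>     jedis = new Jedis("redis-host", 6379) // placeholder
>>>     val props = new Properties()
>>>     props.put("bootstrap.servers", "broker:9092") // placeholder
>>>     props.put("key.serializer",
>>>       "org.apache.kafka.common.serialization.StringSerializer")
>>>     props.put("value.serializer",
>>>       "org.apache.kafka.common.serialization.StringSerializer")
>>>     producer = new KafkaProducer[String, String](props)
>>>     true
>>>   }
>>>
>>>   override def process(row: Row): Unit = {
>>>     val checksum = row.getAs[String]("checksum")
>>>     // SETNX is atomic: only the first writer of a key gets 1 back.
>>>     if (jedis.setnx(checksum, "1") == 1L) {
>>>       producer.send(
>>>         new ProducerRecord("second-topic", checksum, row.getAs[String]("value")))
>>>     }
>>>     // A crash between setnx and send loses the message: exactly the
>>>     // atomicity gap I am asking about above.
>>>   }
>>>
>>>   override def close(errorOrNull: Throwable): Unit = {
>>>     if (producer != null) producer.close()
>>>     if (jedis != null) jedis.close()
>>>   }
>>> }
>>>
>>> // usage: parsed.writeStream.foreach(new DedupWriter).start()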
>>>
>>> Any suggestions are highly appreciated.
>>>
>>>
>>> Akshay Bhardwaj
>>> +91-97111-33849
>>>
>>
>
> --
> -- Anastasios Zouzias
> <[email protected]>
>