Off the top of my head... (Each may have its own issues)

If you add a uniqueId to every record upstream, you can use a
BloomFilter to check, approximately, whether you've seen a row before.
The problem I can see with that approach is how to repopulate the Bloom
filter on restarts.
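
A rough sketch of the Bloom filter idea, in plain Python rather than Spark
so that it runs on its own. Everything here is an assumption made for
illustration: the BloomFilter class, the count_words_once helper, the sizes,
and the (unique_id, line) record shape are not from any particular library.
Note that a Bloom filter can return false positives, so a brand-new row may
occasionally be skipped as if it were a replay.

```python
import hashlib


class BloomFilter:
    """Tiny illustrative Bloom filter: false positives possible, no false negatives."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Derive the bit positions from two halves of one SHA-256 digest
        # (double hashing); forcing h2 odd keeps the step non-zero.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


def count_words_once(records, seen, counts):
    """Update word counts, skipping records whose unique_id was probably seen.

    records: iterable of (unique_id, line) pairs (hypothetical record shape).
    """
    for unique_id, line in records:
        if seen.might_contain(unique_id):
            continue  # probably a replay; a false positive would drop a new row
        seen.add(unique_id)
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts
```

The restart problem shows up directly here: the bits array lives only in
memory, so it would have to be checkpointed or rebuilt from the source
before any replayed records are processed.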

If you are certain that data won't be reprocessed after a certain time,
e.g. the same data can only show up again within the last 2 hours and
never later than that, then you can also keep the uniqueIds themselves in
state and age them out after that time.
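
A minimal sketch of the age-out idea, again in plain Python so it is
self-contained. The SeenIds class and its method names are hypothetical, and
timestamps are passed in explicitly (in seconds) to keep the behaviour easy
to follow. It assumes duplicates can only arrive within ttl_seconds of the
first sighting.

```python
class SeenIds:
    """Remember unique ids for a limited time, then forget them."""

    def __init__(self, ttl_seconds):
        self.ttl_seconds = ttl_seconds
        self.first_seen = {}  # unique_id -> timestamp of first sighting

    def is_duplicate(self, unique_id, now):
        """Return True if unique_id was seen within the last ttl_seconds."""
        self._expire(now)
        if unique_id in self.first_seen:
            return True
        self.first_seen[unique_id] = now
        return False

    def _expire(self, now):
        # Drop every id whose first sighting is at least ttl_seconds old.
        # This scans all ids on each call, which is fine for a sketch but
        # would need a time-ordered structure at real volumes.
        cutoff = now - self.ttl_seconds
        expired = [uid for uid, ts in self.first_seen.items() if ts <= cutoff]
        for uid in expired:
            del self.first_seen[uid]
```

With ttl_seconds=7200, an id first seen at time 0 is reported as a duplicate
at time 3600 and is forgotten by time 7200. In a mapWithState job the ids
would presumably live in the keyed state itself, and the timeout setting on
StateSpec may be one way to do the aging, though that depends on the setup.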


Best,
Burak

On Tue, Jan 24, 2017 at 9:53 PM, shyla deshpande <deshpandesh...@gmail.com>
wrote:

> Please share your thoughts.....
>
> On Tue, Jan 24, 2017 at 4:01 PM, shyla deshpande <deshpandesh...@gmail.com
> > wrote:
>
>>
>>
>> On Tue, Jan 24, 2017 at 9:44 AM, shyla deshpande <
>> deshpandesh...@gmail.com> wrote:
>>
>>> My streaming application stores a lot of aggregations using mapWithState.
>>>
>>> I want to know what are all the possible ways I can make it idempotent.
>>>
>>> Please share your views.
>>>
>>> Thanks
>>>
>>> On Mon, Jan 23, 2017 at 5:41 PM, shyla deshpande <
>>> deshpandesh...@gmail.com> wrote:
>>>
>>>> In a Wordcount application that stores the count of all the words
>>>> input so far using mapWithState, how do I make sure my counts are not
>>>> messed up if I happen to read a line more than once?
>>>>
>>>> Appreciate your response.
>>>>
>>>> Thanks
>>>>
>>>
>>>
>>
>
