I didn't know anything about the Hive Sink; I'll check the JIRA about it, thanks.
The pipeline is Flume-Kafka-SparkStreaming-XXX

So I guess I should deal with it in Spark Streaming, right? I suppose
it would be easy to do with a UUID interceptor, or is there an easier
way?
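
Something like this is what I had in mind for the Spark Streaming side
(just a sketch, untested; it assumes the interceptor's id travels as
the Kafka message key, which depends on how the Flume-to-Kafka producer
is configured, and the quorum, topic and paths are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object StreamingDedup {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(
          new SparkConf().setAppName("flume-kafka-dedup"), Seconds(10))
        ssc.checkpoint("hdfs:///tmp/dedup-checkpoint") // updateStateByKey needs this

        // (uuid, payload) pairs, assuming the UUID rides as the Kafka key.
        val events = KafkaUtils.createStream(
          ssc, "zk1:2181", "dedup-group", Map("events" -> 1))

        // Remember which UUIDs were already emitted and pass a payload
        // through only on the first sighting of its key. The state never
        // expires in this naive version, so it grows without bound.
        val withState = events.updateStateByKey[(Boolean, Seq[String])] {
          (payloads: Seq[String], state: Option[(Boolean, Seq[String])]) =>
            val alreadySeen = state.exists(_._1)
            val emitNow = if (alreadySeen) Seq.empty else payloads.take(1)
            Some((true, emitNow))
        }

        val deduped = withState.flatMap { case (_, (_, emit)) => emit }
        deduped.print() // replace with the real "XXX" stage of the pipeline

        ssc.start()
        ssc.awaitTermination()
      }
    }

If keeping a flag per UUID is too heavy, deduplicating inside a window
and relying on an idempotent final store might be enough.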

2014-12-03 22:56 GMT+01:00 Roshan Naik <[email protected]>:
> Use the UUID interceptor at the source closest to data origination; it
> will help identify duplicate events after they are delivered.
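>
> For reference, attaching the stock interceptor looks roughly like this
> (a sketch; the agent and source names are placeholders):
>
>   a1.sources.r1.interceptors = i1
>   a1.sources.r1.interceptors.i1.type = org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder
>   a1.sources.r1.interceptors.i1.headerName = id
>   a1.sources.r1.interceptors.i1.preserveExisting = true
>
> Every event then carries a stable "id" header from ingest onward, and
> downstream consumers can key their dedup on it.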
>
> If it satisfies your use case, the upcoming Hive Sink will mitigate the
> problem a little (since it uses transactions to write to the destination).
>
> -roshan
>
>
> On Wed, Dec 3, 2014 at 8:44 AM, Joey Echeverria <[email protected]> wrote:
>>
>> There's nothing built into Flume to deal with duplicates; it only
>> provides at-least-once delivery semantics.
>>
>> You'll have to handle it in your data processing applications or add
>> an ETL step that removes duplicates before making the data available
>> for other queries.
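>>
>> As a rough illustration of such an ETL pass (untested, and assuming
>> each record lands in HDFS as a "uuid,payload" line, with the UUID put
>> there by Flume's UUID interceptor; the paths are placeholders):
>>
>>   import org.apache.spark.{SparkConf, SparkContext}
>>
>>   object BatchDedup {
>>     def main(args: Array[String]): Unit = {
>>       val sc = new SparkContext(new SparkConf().setAppName("batch-dedup"))
>>
>>       // Parse "uuid,payload" lines into key/value pairs.
>>       val records = sc.textFile("hdfs:///landing/events/*")
>>         .map(_.split(",", 2))
>>         .collect { case Array(uuid, payload) => (uuid, payload) }
>>
>>       // Keep one payload per UUID; which copy wins doesn't matter since
>>       // duplicates are byte-identical resends of the same event.
>>       records.reduceByKey((first, _) => first)
>>         .values
>>         .saveAsTextFile("hdfs:///clean/events")
>>
>>       sc.stop()
>>     }
>>   }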
>>
>> -Joey
>>
>> On Wed, Dec 3, 2014 at 5:46 AM, Guillermo Ortiz <[email protected]>
>> wrote:
>> > Hi,
>> >
>> > I would like to know if there's an easy way to deal with data
>> > duplication when an agent crashes and resends the same data.
>> >
>> > Is there any mechanism in Flume to deal with this?
>>
>>
>>
>> --
>> Joey Echeverria