Re: Spark streaming to kafka exactly once

2017-03-23 Thread Maurin Lenglart
Ok,
Thanks for your answers

On 3/22/17, 1:34 PM, "Cody Koeninger"  wrote:

If you're talking about reading the same message multiple times in a
failure situation, see

https://github.com/koeninger/kafka-exactly-once

If you're talking about producing the same message multiple times in a
failure situation, keep an eye on


https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging

If you're talking about producers just misbehaving and sending
different copies of what is essentially the same message from a domain
perspective, you have to dedupe that with your own logic.

On Wed, Mar 22, 2017 at 2:52 PM, Matt Deaver  wrote:
> You have to handle de-duplication upstream or downstream. It might
> technically be possible to handle this in Spark but you'll probably have a
> better time handling duplicates in the service that reads from Kafka.
>
> On Wed, Mar 22, 2017 at 1:49 PM, Maurin Lenglart 
> wrote:
>>
>> Hi,
>> we are trying to build a Spark Streaming solution that subscribes to and
>> pushes to Kafka.
>>
>> But we are running into the problem of duplicate events.
>>
>> Right now, I am doing a “foreachRDD”, looping over the messages of each
>> partition, and sending those messages to Kafka.
>>
>>
>>
>> Is there any good way of solving that issue?
>>
>>
>>
>> thanks
>
>
>
>
> --
> Regards,
>
> Matt
> Data Engineer
> https://www.linkedin.com/in/mdeaver
> http://mattdeav.pythonanywhere.com/
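
The approach in the kafka-exactly-once repo linked above is to make the read
side exactly-once by committing the Kafka offsets in the same database
transaction as the results, so a replayed batch is detected and skipped. A
minimal Scala sketch of that idea (the JDBC URL, table names, and schema here
are illustrative, not the repo's actual code); it assumes a direct Kafka
stream, with offsets read before any transformation:

    import java.sql.DriverManager
    import org.apache.spark.TaskContext
    import org.apache.spark.streaming.kafka010.HasOffsetRanges

    // Called from stream.foreachRDD: each partition commits its results
    // together with its Kafka offset range, atomically.
    def saveWithOffsets(rdd: org.apache.spark.rdd.RDD[(String, String)]): Unit = {
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd.foreachPartition { iter =>
        val osr = offsetRanges(TaskContext.get.partitionId)
        val conn = DriverManager.getConnection("jdbc:postgresql://db/app")
        try {
          conn.setAutoCommit(false)
          // If this range was already recorded, the batch is a replay: skip it.
          val check = conn.prepareStatement(
            "SELECT 1 FROM kafka_offsets WHERE topic = ? AND part = ? AND until_off = ?")
          check.setString(1, osr.topic)
          check.setInt(2, osr.partition)
          check.setLong(3, osr.untilOffset)
          if (!check.executeQuery().next()) {
            val ins = conn.prepareStatement("INSERT INTO results (k, v) VALUES (?, ?)")
            iter.foreach { case (k, v) =>
              ins.setString(1, k); ins.setString(2, v); ins.executeUpdate()
            }
            val off = conn.prepareStatement(
              "INSERT INTO kafka_offsets (topic, part, until_off) VALUES (?, ?, ?)")
            off.setString(1, osr.topic)
            off.setInt(2, osr.partition)
            off.setLong(3, osr.untilOffset)
            off.executeUpdate()
          }
          conn.commit() // results and offsets land together, or not at all
        } finally conn.close()
      }
    }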




Re: Spark streaming to kafka exactly once

2017-03-22 Thread Cody Koeninger
If you're talking about reading the same message multiple times in a
failure situation, see

https://github.com/koeninger/kafka-exactly-once

If you're talking about producing the same message multiple times in a
failure situation, keep an eye on

https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging

If you're talking about producers just misbehaving and sending
different copies of what is essentially the same message from a domain
perspective, you have to dedupe that with your own logic.

On Wed, Mar 22, 2017 at 2:52 PM, Matt Deaver  wrote:
> You have to handle de-duplication upstream or downstream. It might
> technically be possible to handle this in Spark but you'll probably have a
> better time handling duplicates in the service that reads from Kafka.
>
> On Wed, Mar 22, 2017 at 1:49 PM, Maurin Lenglart 
> wrote:
>>
>> Hi,
>> we are trying to build a Spark Streaming solution that subscribes to and
>> pushes to Kafka.
>>
>> But we are running into the problem of duplicate events.
>>
>> Right now, I am doing a “foreachRDD”, looping over the messages of each
>> partition, and sending those messages to Kafka.
>>
>>
>>
>> Is there any good way of solving that issue?
>>
>>
>>
>> thanks
>
>
>
>
> --
> Regards,
>
> Matt
> Data Engineer
> https://www.linkedin.com/in/mdeaver
> http://mattdeav.pythonanywhere.com/
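
For the producer side, KIP-98 above eventually shipped in Kafka 0.11 as the
idempotent and transactional producer. A minimal sketch of that API, assuming
the kafka-clients 0.11+ library; the broker address, topic, and
transactional.id are placeholders:

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
    import org.apache.kafka.common.KafkaException
    import org.apache.kafka.common.errors.ProducerFencedException

    val props = new Properties()
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092")
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true")
    // Must stay stable across restarts so the broker can fence zombie producers.
    props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "my-app-tx")

    val producer = new KafkaProducer[String, String](props)
    producer.initTransactions()
    try {
      producer.beginTransaction()
      producer.send(new ProducerRecord("out-topic", "key", "value"))
      producer.commitTransaction()
    } catch {
      case _: ProducerFencedException => producer.close() // another instance owns this id
      case e: KafkaException => producer.abortTransaction(); throw e
    }
    producer.close()

Note that consumers only get the exactly-once behavior if they read with
isolation.level=read_committed; with the default read_uncommitted they will
still see records from aborted transactions.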




Re: Spark streaming to kafka exactly once

2017-03-22 Thread Matt Deaver
You have to handle de-duplication upstream or downstream. It might
technically be possible to handle this in Spark but you'll probably have a
better time handling duplicates in the service that reads from Kafka.

On Wed, Mar 22, 2017 at 1:49 PM, Maurin Lenglart 
wrote:

> Hi,
> we are trying to build a Spark Streaming solution that subscribes to and
> pushes to Kafka.
>
> But we are running into the problem of duplicate events.
>
> Right now, I am doing a “foreachRDD”, looping over the messages of each
> partition, and sending those messages to Kafka.
>
>
>
> Is there any good way of solving that issue?
>
>
>
> thanks
>



-- 
Regards,

Matt
Data Engineer
https://www.linkedin.com/in/mdeaver
http://mattdeav.pythonanywhere.com/
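
One concrete way to do the downstream deduplication suggested here: give each
event a stable ID at the source and make the consuming service's writes
idempotent on that ID, so a replayed message becomes a no-op. A sketch
assuming a hypothetical events table keyed by event_id (PostgreSQL syntax;
all names here are illustrative):

    import java.sql.DriverManager

    // The PRIMARY KEY on event_id makes duplicate deliveries harmless:
    // the second insert conflicts and is dropped instead of creating a row.
    def writeEvent(eventId: String, payload: String): Unit = {
      val conn = DriverManager.getConnection("jdbc:postgresql://db/app")
      try {
        val stmt = conn.prepareStatement(
          "INSERT INTO events (event_id, payload) VALUES (?, ?) " +
          "ON CONFLICT (event_id) DO NOTHING")
        stmt.setString(1, eventId)
        stmt.setString(2, payload)
        stmt.executeUpdate()
      } finally conn.close()
    }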


Spark streaming to kafka exactly once

2017-03-22 Thread Maurin Lenglart
Hi,
we are trying to build a Spark Streaming solution that subscribes to and pushes
to Kafka.
But we are running into the problem of duplicate events.
Right now, I am doing a “foreachRDD”, looping over the messages of each
partition, and sending those messages to Kafka.

Is there any good way of solving that issue?

thanks
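
For context, the foreachRDD pattern described above typically looks like the
sketch below: a producer per partition, sending each message and closing
(which flushes) before the task ends. This gives at-least-once delivery,
which is exactly why duplicates appear when a failed task is retried. Broker
and topic names are placeholders:

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
    import org.apache.spark.streaming.dstream.DStream

    def pushToKafka(stream: DStream[String]): Unit = {
      stream.foreachRDD { rdd =>
        rdd.foreachPartition { messages =>
          val props = new Properties()
          props.put("bootstrap.servers", "broker:9092")
          props.put("key.serializer",
            "org.apache.kafka.common.serialization.StringSerializer")
          props.put("value.serializer",
            "org.apache.kafka.common.serialization.StringSerializer")
          val producer = new KafkaProducer[String, String](props)
          try {
            messages.foreach { m =>
              // A retried task re-sends this whole partition: at-least-once.
              producer.send(new ProducerRecord[String, String]("out-topic", m))
            }
          } finally producer.close() // flushes buffered records
        }
      }
    }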


Re: Spark Streaming to Kafka

2015-05-19 Thread twinkle sachdeva
Thanks Saisai.

On Wed, May 20, 2015 at 11:23 AM, Saisai Shao 
wrote:

> I think this is the PR you could refer to:
> https://github.com/apache/spark/pull/2994
>
> 2015-05-20 13:41 GMT+08:00 twinkle sachdeva :
>
>> Hi,
>>
>> As Spark Streaming is nicely integrated with consuming messages from
>> Kafka, I thought of asking the forum: is there any implementation
>> available for pushing data to Kafka from Spark Streaming too?
>>
>> Any link(s) will be helpful.
>>
>> Thanks and Regards,
>> Twinkle
>>
>
>


Re: Spark Streaming to Kafka

2015-05-19 Thread Saisai Shao
I think this is the PR you could refer to:
https://github.com/apache/spark/pull/2994

2015-05-20 13:41 GMT+08:00 twinkle sachdeva :

> Hi,
>
> As Spark Streaming is nicely integrated with consuming messages from Kafka,
> I thought of asking the forum: is there any implementation available for
> pushing data to Kafka from Spark Streaming too?
>
> Any link(s) will be helpful.
>
> Thanks and Regards,
> Twinkle
>
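
Whatever the fate of that PR, a common pattern for pushing to Kafka from
Spark Streaming is a lazily created, per-executor producer, so each JVM
reuses one connection across batches instead of opening one per partition per
batch. A sketch of that idea (holder name and broker address are
illustrative):

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    // Scala objects are initialized once per JVM, i.e. once per executor,
    // so this holder gives every task on the executor the same producer.
    object ProducerHolder {
      private var producer: KafkaProducer[String, String] = _
      def get: KafkaProducer[String, String] = synchronized {
        if (producer == null) {
          val props = new Properties()
          props.put("bootstrap.servers", "broker:9092")
          props.put("key.serializer",
            "org.apache.kafka.common.serialization.StringSerializer")
          props.put("value.serializer",
            "org.apache.kafka.common.serialization.StringSerializer")
          producer = new KafkaProducer[String, String](props)
          sys.addShutdownHook(producer.close()) // flush on executor shutdown
        }
        producer
      }
    }

    // Usage inside the streaming job:
    // stream.foreachRDD { rdd =>
    //   rdd.foreachPartition { msgs =>
    //     msgs.foreach(m => ProducerHolder.get.send(new ProducerRecord("out-topic", m)))
    //   }
    // }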


Spark Streaming to Kafka

2015-05-19 Thread twinkle sachdeva
Hi,

As Spark Streaming is nicely integrated with consuming messages from Kafka, I
thought of asking the forum: is there any implementation available for pushing
data to Kafka from Spark Streaming too?

Any link(s) will be helpful.

Thanks and Regards,
Twinkle