Re: Data Loss - Spark streaming

2014-12-16 Thread Gerard Maas
Hi Jeniba,

The second part of this meetup recording has a very good answer to your
question. TD explains the current behavior and the ongoing work in Spark
Streaming to fix HA.
https://www.youtube.com/watch?v=jcJq3ZalXD8


-kr, Gerard.
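
[Editorial note: the HA work referenced here landed as driver checkpoint recovery plus
the receiver write-ahead log around Spark 1.2. Below is a minimal sketch of enabling
both, assuming Spark 1.2+ and the spark-streaming-flume integration; the checkpoint
directory, Flume host, and port are placeholders, and the processing step is a stand-in.]

  import org.apache.spark.SparkConf
  import org.apache.spark.storage.StorageLevel
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.flume.FlumeUtils

  // Placeholder: substitute your own reliable checkpoint location.
  val checkpointDir = "hdfs:///checkpoints/flume-ingest"

  def createContext(): StreamingContext = {
    val conf = new SparkConf()
      .setAppName("FlumeIngest")
      // Write-ahead log for received blocks (Spark 1.2+); without it, blocks
      // that exist only in executor memory can be lost on executor failure.
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")
    val ssc = new StreamingContext(conf, Seconds(300))   // 5-minute batch interval
    ssc.checkpoint(checkpointDir)
    // Placeholder Flume sink address.
    val events = FlumeUtils.createStream(ssc, "flume-host", 4141,
      StorageLevel.MEMORY_AND_DISK_SER_2)
    events.count().print()                               // stand-in for real processing
    ssc
  }

  // On driver restart, rebuild the context from the checkpoint instead of starting fresh.
  val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
  ssc.start()
  ssc.awaitTermination()

With the write-ahead log enabled, received blocks are persisted under the checkpoint
directory before they are acknowledged, so data buffered between batches should survive
a receiver or executor failure.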

On Tue, Dec 16, 2014 at 11:32 AM, Jeniba Johnson 
jeniba.john...@lntinfotech.com wrote:

 Hi,

 I need a clarification. While running the streaming examples, suppose the
 batch interval is set to 5 minutes, and the data collected from the input
 source (Flume) is processed for those 5 minutes.
 What happens to the data that keeps flowing in continuously from the input
 source to Spark Streaming during that time? Will that data be stored
 somewhere, or will it be lost?
 If it can be lost, what is the solution to capture every record without
 any loss in Spark Streaming?

 Awaiting your kind reply.


 Regards,
 Jeniba Johnson


 



Re: Data Loss - Spark streaming

2014-12-16 Thread Ryan Williams
TD's portion seems to start at 27:24: http://youtu.be/jcJq3ZalXD8?t=27m24s

On Tue Dec 16 2014 at 7:13:43 AM Gerard Maas gerard.m...@gmail.com wrote:

 Hi Jeniba,

 The second part of this meetup recording has a very good answer to your
 question. TD explains the current behavior and the ongoing work in Spark
 Streaming to fix HA.
 https://www.youtube.com/watch?v=jcJq3ZalXD8


 -kr, Gerard.





Re: Data loss - Spark streaming and network receiver

2014-08-18 Thread Tobias Pfeiffer
Hi Wei,

On Tue, Aug 19, 2014 at 10:18 AM, Wei Liu wei@stellarloyalty.com
wrote:

 Since our application cannot tolerate losing customer data, I am wondering
 what the best way is for us to address this issue.
 1) We are thinking of writing application-specific logic to address the
 data loss. To us, the problem seems to be that the Kinesis receivers
 advance their checkpoint before we know for sure that the data has been
 replicated. For example, we could keep our own checkpoint of the Kinesis
 sequence number for data that has already been processed by Spark
 Streaming. When the Kinesis receiver is restarted due to a worker failure,
 we would restart it from the checkpoint we tracked.


This sounds to me pretty much like the way Kafka does it. I am not saying
that the stock KafkaReceiver does what you want (it may or may not), but it
should be possible to update the offset (which corresponds to your sequence
number) in ZooKeeper only after the data has been replicated successfully. I
guess replacing Kinesis with Kafka is not an option for you, but you might
consider pulling the Kinesis data into Kafka before processing it with Spark?

Tobias
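
[Editorial note: as an illustration of the commit-only-after-replication idea, here is a
rough Scala sketch that writes an offset to the conventional high-level-consumer path in
ZooKeeper; a receiver would call this only once Spark has confirmed the block is stored.
The helper name and the path handling are assumptions for the sketch, not code from any
existing receiver.]

  import org.apache.zookeeper.{CreateMode, ZooDefs, ZooKeeper}

  // Hypothetical helper: persist the last safely-stored offset at the path the
  // high-level consumer reads on startup: /consumers/<group>/offsets/<topic>/<partition>
  def commitOffset(zk: ZooKeeper, group: String, topic: String,
                   partition: Int, offset: Long): Unit = {
    val path = s"/consumers/$group/offsets/$topic/$partition"
    val data = offset.toString.getBytes("UTF-8")
    if (zk.exists(path, false) == null) {
      // Real code would also create any missing parent znodes first.
      zk.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT)
    } else {
      zk.setData(path, data, -1)  // version -1 = unconditional overwrite
    }
  }

The same shape would work for Kinesis by swapping the offset for a sequence number and
ZooKeeper for whatever store the application checkpoints to.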


RE: Data loss - Spark streaming and network receiver

2014-08-18 Thread Shao, Saisai
I think Spark Streaming currently lacks an acknowledgement mechanism for when data has
been stored and replicated in the BlockManager, so data can potentially be lost even if
it is first pulled into Kafka. Say the data is sitting only in the BlockGenerator, not
yet in the BlockManager, while Kafka has already committed the consumer offset; if the
node fails at that point, then from Kafka's point of view this data has been fed into
Spark Streaming, but it has not actually been processed. That part of the data may never
be processed again, unless you re-read the whole partition.

To solve this potential data loss problem, Spark Streaming needs to offer a data
acknowledgement mechanism, so that a custom Receiver can use the acknowledgement to do
checkpointing or recovery, as in Storm.

Besides, driver failure is another story that needs to be considered carefully. So
currently it is hard to guarantee no data loss in Spark Streaming; it still needs
improvement in a few places ☺.

Thanks
Jerry
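
[Editorial note: for what it's worth, here is a rough Scala sketch of the receiver-side
pattern under discussion: push a whole block with the multi-record store(...), which
returns only after the data has been handed to the BlockManager, and commit upstream
only after that. fetchBatch and commitUpstream are hypothetical stand-ins for the real
source client (Kafka SimpleConsumer, Kinesis, etc.).]

  import scala.collection.mutable.ArrayBuffer
  import org.apache.spark.storage.StorageLevel
  import org.apache.spark.streaming.receiver.Receiver

  // Sketch of a receiver that only acknowledges upstream once a block has been
  // handed to the BlockManager.
  class AckingReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

    def onStart(): Unit = {
      new Thread("acking-receiver") {
        override def run(): Unit = receive()
      }.start()
    }

    def onStop(): Unit = { /* close connections to the source here */ }

    private def receive(): Unit = {
      while (!isStopped()) {
        val (records, lastOffset) = fetchBatch()
        if (records.nonEmpty) {
          val block = new ArrayBuffer[String]()
          block ++= records
          // The multi-record store(...) returns only after the block has been
          // stored, unlike store(single-record) which buffers in the BlockGenerator.
          store(block)
          commitUpstream(lastOffset)  // commit the offset/sequence number only now
        }
      }
    }

    // Placeholders so the sketch is self-contained; real implementations
    // would talk to the actual source.
    private def fetchBatch(): (Seq[String], Long) = (Seq.empty, 0L)
    private def commitUpstream(offset: Long): Unit = ()
  }

This narrows the window described above, but it still expresses only "stored", not
"replicated", which is exactly the acknowledgement that is missing.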




Re: Data loss - Spark streaming and network receiver

2014-08-18 Thread Dibyendu Bhattacharya
Dear All,

Recently I have written a Spark Kafka consumer to solve this problem. We have
also seen issues with KafkaUtils, which uses the high-level Kafka consumer and
gives the consumer code no handle on offset management.

The code below solves this problem; it is being tested in our Spark cluster
and has been working fine so far.

https://github.com/dibbhatt/kafka-spark-consumer

This is a low-level Kafka consumer built on the Kafka SimpleConsumer API.

Please have a look at it and let me know your opinion. It was written to
eliminate data loss by committing the offset only after the data is written
to the BlockManager. The existing high-level KafkaUtils also has no way to
control the data flow, and it gives an Out Of Memory error if there is too
much backlog in Kafka; this consumer solves that problem as well. The code
was adapted from the earlier Storm Kafka consumer code and has a lot of other
features, such as recovery from Kafka node failures and ZK failures, recovery
from offset errors, etc.

Regards,
Dibyendu


On Tue, Aug 19, 2014 at 9:49 AM, Shao, Saisai saisai.s...@intel.com wrote:

  I think Currently Spark Streaming lack a data acknowledging mechanism
 when data is stored and replicated in BlockManager, so potentially data
 will be lost even pulled into Kafka, say if data is stored just in
 BlockGenerator not BM, while in the meantime Kafka itself commit the
 consumer offset, also at this point node is failed, from Kafka’s point this
 part of data is feed into Spark Streaming but actually this data is not yet
 processed, so potentially this part of data will never be processed again,
 unless you read the whole partition again.



 To solve this potential data loss problem, Spark Streaming needs to offer
 a data acknowledging mechanism, so custom Receiver can use this
 acknowledgement to do checkpoint or recovery, like Storm.



 Besides, driver failure is another story need to be carefully considered.
 So currently it is hard to make sure no data loss in Spark Streaming, still
 need to improve at some points J.



 Thanks

 Jerry



 *From:* Tobias Pfeiffer [mailto:t...@preferred.jp]
 *Sent:* Tuesday, August 19, 2014 10:47 AM
 *To:* Wei Liu
 *Cc:* user
 *Subject:* Re: Data loss - Spark streaming and network receiver



 Hi Wei,



 On Tue, Aug 19, 2014 at 10:18 AM, Wei Liu wei@stellarloyalty.com
 wrote:

 Since our application cannot tolerate losing customer data, I am wondering
 what is the best way for us to address this issue.

 1) We are thinking writing application specific logic to address the data
 loss. To us, the problem seems to be caused by that Kinesis receivers
 advanced their checkpoint before we know for sure the data is replicated.
 For example, we can do another checkpoint ourselves to remember the kinesis
 sequence number for data that has been processed by spark streaming. When
 Kinesis receiver is restarted due to worker failures, we restarted it from
 the checkpoint we tracked.



 This sounds pretty much to me like the way Kafka does it. So, I am not
 saying that the stock KafkaReceiver does what you want (it may or may not),
 but it should be possible to update the offset (corresponds to sequence
 number) in Zookeeper only after data has been replicated successfully. I
 guess replace Kinesis by Kafka is not in option for you, but you may
 consider pulling Kinesis data into Kafka before processing with Spark?



 Tobias





Re: Data loss - Spark streaming and network receiver

2014-08-18 Thread Wei Liu
Thank you all for responding to my question. I am pleasantly surprised by
how many prompt responses I got; it shows the strength of the Spark
community.

Kafka is still an option for us; I will check out the link provided by
Dibyendu.

Meanwhile, if someone out there has already figured this out with Kinesis,
please keep the suggestions coming. Thanks.

Thanks,
Wei

