Re: KafkaUtils explicit acks
Do you actually need Spark Streaming per se for your use case? If you're just trying to read data out of Kafka into HBase, would something like this non-streaming RDD work for you?

https://github.com/koeninger/spark-1/tree/kafkaRdd/external/kafka/src/main/scala/org/apache/spark/rdd/kafka

Note that if you're trying to get exactly-once semantics out of Kafka, you need either idempotent writes or a transactional relationship between the place you're storing data and the place you're storing offsets. Using normal batch RDDs instead of streaming makes the second approach pretty trivial, actually.

On Tue, Dec 16, 2014 at 6:34 AM, Mukesh Jha wrote:
> I agree that this is not a trivial task, as in this approach the Kafka
> acks will be done by the Spark tasks; that means a pluggable means to ack
> your input data source, i.e. changes in core.
>
> From my limited experience with Kafka + Spark, what I've seen is that if
> Spark tasks take longer than the batch interval, the next batch waits for
> the previous one to finish, so I was wondering if offset management can be
> done by Spark too.
>
> I'm just trying to figure out if this seems to be a worthwhile addition to
> have?
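The transactional option from the reply above (keeping results and offsets in the same store) can be sketched as follows. This is a minimal illustration in plain Python, not Spark or Kafka code: an in-memory SQLite database stands in for the real data store, and the table names, `make_store`, `committed_offset`, and `store_batch` are all hypothetical. Results and the next offset are written in one transaction, so a replayed batch is rejected rather than stored twice.

```python
import sqlite3


def make_store():
    """Create an in-memory SQLite database standing in for the real store (assumption)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE results (topic TEXT, partition_id INTEGER, "
        "kafka_offset INTEGER, value TEXT)"
    )
    conn.execute(
        "CREATE TABLE offsets (topic TEXT, partition_id INTEGER, "
        "next_offset INTEGER, PRIMARY KEY (topic, partition_id))"
    )
    return conn


def committed_offset(conn, topic, partition_id):
    """Return the next offset to read for this topic/partition (0 if none stored)."""
    row = conn.execute(
        "SELECT next_offset FROM offsets WHERE topic = ? AND partition_id = ?",
        (topic, partition_id),
    ).fetchone()
    return row[0] if row else 0


def store_batch(conn, topic, partition_id, from_offset, messages):
    """Store a batch and advance the offset atomically.

    Returns False without writing anything if the batch does not start at the
    committed offset, which is what happens when a batch is replayed.
    """
    with conn:  # commits on success, rolls back if an exception is raised
        if committed_offset(conn, topic, partition_id) != from_offset:
            return False
        conn.executemany(
            "INSERT INTO results VALUES (?, ?, ?, ?)",
            [(topic, partition_id, from_offset + i, m) for i, m in enumerate(messages)],
        )
        conn.execute(
            "INSERT OR REPLACE INTO offsets VALUES (?, ?, ?)",
            (topic, partition_id, from_offset + len(messages)),
        )
    return True
```

With a batch job the caller would read `committed_offset` first, fetch messages starting there, and call `store_batch`; a crash before the commit leaves both tables unchanged, so the same range is simply read again.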
Re: KafkaUtils explicit acks
I agree that this is not a trivial task, as in this approach the Kafka acks will be done by the Spark tasks; that means a pluggable means to ack your input data source, i.e. changes in core.

From my limited experience with Kafka + Spark, what I've seen is that if Spark tasks take longer than the batch interval, the next batch waits for the previous one to finish, so I was wondering if offset management can be done by Spark too.

I'm just trying to figure out if this seems to be a worthwhile addition to have?

On Mon, Dec 15, 2014 at 11:39 AM, Shao, Saisai wrote:
> Hi,
>
> It is not trivial work to acknowledge the offsets when an RDD is fully
> processed. From my understanding, only modifying KafkaUtils is not enough
> to meet your requirement; you need to add metadata management for each
> block/RDD, track it on both the executor and driver side, and take care of
> many other things :).
>
> Thanks
> Jerry

--
Thanks & Regards,
Mukesh Jha
RE: KafkaUtils explicit acks
Hi,

It is not trivial work to acknowledge the offsets when an RDD is fully processed. From my understanding, only modifying KafkaUtils is not enough to meet your requirement; you need to add metadata management for each block/RDD, track it on both the executor and driver side, and take care of many other things :).

Thanks
Jerry

From: mukh@gmail.com [mailto:mukh@gmail.com] On Behalf Of Mukesh Jha
Sent: Monday, December 15, 2014 1:31 PM
To: Tathagata Das
Cc: francois.garil...@typesafe.com; user@spark.apache.org
Subject: Re: KafkaUtils explicit acks

Thanks TD & Francois for the explanation & documentation. I'm curious if we have any performance benchmarks with & without the WAL for spark-streaming-kafka.

Also, in spark-streaming-kafka (as Kafka provides a way to acknowledge logs), on top of the WAL, can we modify KafkaUtils to acknowledge the offsets only when the RDDs are fully processed and are getting evicted from Spark memory? That way we can be 100 percent sure that all the records are getting processed in the system.

I was wondering if it would be good to have the Kafka offset information of each batch as part of the RDD's metadata and commit the offsets once the RDD's lineage is complete.

--
Thanks & Regards,
Mukesh Jha
Re: KafkaUtils explicit acks
Thanks TD & Francois for the explanation & documentation. I'm curious if we have any performance benchmarks with & without the WAL for spark-streaming-kafka.

Also, in spark-streaming-kafka (as Kafka provides a way to acknowledge logs), on top of the WAL, can we modify KafkaUtils to acknowledge the offsets only when the RDDs are fully processed and are getting evicted from Spark memory? That way we can be 100 percent sure that all the records are getting processed in the system.

I was wondering if it would be good to have the Kafka offset information of each batch as part of the RDD's metadata and commit the offsets once the RDD's lineage is complete.

On Thu, Dec 11, 2014 at 6:26 PM, Tathagata Das wrote:
> I am updating the docs right now. Here is a staged copy that you can
> have a sneak peek of. This will be part of Spark 1.2.
>
> http://people.apache.org/~tdas/spark-1.2-temp/streaming-programming-guide.html
>
> The updated fault-tolerance section tries to simplify the explanation
> of when and what data can be lost, and how to prevent that using the
> new experimental feature of write ahead logs.
> Any feedback will be much appreciated.
>
> TD

--
Thanks & Regards,
Mukesh Jha
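The idea of carrying each batch's Kafka offsets as metadata and committing them only once processing has finished could look roughly like this. It is a sketch in plain Python under stated assumptions: `OffsetRange`, `OffsetTracker`, and `run_batch` are hypothetical names, not part of KafkaUtils or any Spark API, and batches are assumed to complete in order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OffsetRange:
    """Offsets covered by one batch for one topic/partition (hypothetical type)."""
    topic: str
    partition: int
    from_offset: int   # inclusive
    until_offset: int  # exclusive


class OffsetTracker:
    """Hypothetical driver-side bookkeeping: batch id -> offset ranges."""

    def __init__(self):
        self.pending = {}    # batches registered but not yet fully processed
        self.committed = {}  # (topic, partition) -> next offset to read

    def register(self, batch_id, ranges):
        self.pending[batch_id] = list(ranges)

    def complete(self, batch_id):
        # Advance committed offsets only after the whole batch has finished.
        for r in self.pending.pop(batch_id):
            self.committed[(r.topic, r.partition)] = r.until_offset


def run_batch(tracker, batch_id, ranges, process):
    """Process a batch; offsets stay uncommitted if `process` raises."""
    tracker.register(batch_id, ranges)
    process(ranges)
    tracker.complete(batch_id)
```

If `process` fails, the batch stays in `pending` and the committed offsets do not move, so the same ranges can be read again. That gives at-least-once delivery, not exactly-once: records from a partially processed batch may be seen twice.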
Re: KafkaUtils explicit acks
I am updating the docs right now. Here is a staged copy that you can have a sneak peek of. This will be part of Spark 1.2.

http://people.apache.org/~tdas/spark-1.2-temp/streaming-programming-guide.html

The updated fault-tolerance section tries to simplify the explanation of when and what data can be lost, and how to prevent that using the new experimental feature of write ahead logs. Any feedback will be much appreciated.

TD

On Wed, Dec 10, 2014 at 2:42 AM, francois.garil...@typesafe.com wrote:
> [sorry for the botched half-message]
>
> Hi Mukesh,
>
> There's been some great work on Spark Streaming reliability lately.
> https://www.youtube.com/watch?v=jcJq3ZalXD8
> Look at the links from:
> https://issues.apache.org/jira/browse/SPARK-3129
>
> I'm not aware of any doc yet (did I miss something?) but you can look at
> the ReliableKafkaReceiver's test suite:
>
> external/kafka/src/test/scala/org/apache/spark/streaming/kafka/ReliableKafkaStreamSuite.scala
>
> --
> FG
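As a rough illustration of the write ahead log idea mentioned above: received data is written to durable storage before it is treated as safely received, so a restarted receiver can rebuild its buffer from the log. This is not Spark's implementation. `WriteAheadLog`, `Receiver`, and `crash_and_recover` are hypothetical, and a local JSON-lines file is an assumption standing in for reliable storage.

```python
import json
import os
import tempfile


class WriteAheadLog:
    """Append-only log of received records, one JSON object per line."""

    def __init__(self, path):
        self.path = path

    def append(self, record):
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())  # force the record to disk before returning

    def replay(self):
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]


class Receiver:
    """Keeps records in memory, but only after they have been logged."""

    def __init__(self, wal):
        self.wal = wal
        self.buffer = wal.replay()  # recover anything logged before a crash

    def receive(self, offset, value):
        record = {"offset": offset, "value": value}
        self.wal.append(record)     # durable first
        self.buffer.append(record)  # then visible in memory
        return offset               # only now is it safe to acknowledge


def crash_and_recover(values):
    """Receive values, drop the receiver, and return what a new one recovers."""
    with tempfile.TemporaryDirectory() as tmp:
        wal = WriteAheadLog(os.path.join(tmp, "receiver.log"))
        first = Receiver(wal)
        for offset, value in enumerate(values):
            first.receive(offset, value)
        del first  # simulate losing the in-memory buffer
        return [r["value"] for r in Receiver(wal).buffer]
```

The extra synchronous write on every record is the cost being asked about elsewhere in the thread when benchmarks with and without the WAL are requested.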
Re: KafkaUtils explicit acks
[sorry for the botched half-message]

Hi Mukesh,

There's been some great work on Spark Streaming reliability lately.
https://www.youtube.com/watch?v=jcJq3ZalXD8
Look at the links from:
https://issues.apache.org/jira/browse/SPARK-3129

I'm not aware of any doc yet (did I miss something?) but you can look at the ReliableKafkaReceiver's test suite:

external/kafka/src/test/scala/org/apache/spark/streaming/kafka/ReliableKafkaStreamSuite.scala

--
FG

On Wed, Dec 10, 2014 at 11:17 AM, Mukesh Jha wrote:
> Hello Guys,
>
> Any insights on this?
> If I'm not clear enough, my question is: how can I use a Kafka consumer and
> not lose any data in case of failures with spark-streaming?
>
> --
> Thanks & Regards,
> Mukesh Jha
Re: KafkaUtils explicit acks
Hi Mukesh,

There's been some great work on Spark Streaming reliability lately

I'm not aware of any doc yet (did I miss something?) but you can look at the ReliableKafkaReceiver's test suite:

--
FG

On Wed, Dec 10, 2014 at 11:17 AM, Mukesh Jha wrote:
> Hello Guys,
>
> Any insights on this?
> If I'm not clear enough, my question is: how can I use a Kafka consumer and
> not lose any data in case of failures with spark-streaming?
>
> --
> Thanks & Regards,
> Mukesh Jha
Re: KafkaUtils explicit acks
Hello Guys,

Any insights on this?
If I'm not clear enough, my question is: how can I use a Kafka consumer and not lose any data in case of failures with spark-streaming?

On Tue, Dec 9, 2014 at 2:53 PM, Mukesh Jha wrote:
> Hello Experts,
>
> I'm working on a Spark app which reads data from Kafka & persists it in
> HBase.
>
> The Spark documentation states (see [1] below) that in case of a worker
> failure we can lose some data. If not, how can I make my Kafka stream more
> reliable? I have seen there is a simple consumer [2], but I'm not sure if
> it has been used/tested extensively.
>
> I was wondering if there is a way to explicitly acknowledge the Kafka
> offsets once they are replicated in the memory of other worker nodes (if
> it's not already done) to tackle this issue.
>
> Any help is appreciated in advance.
>
> 1. *Using any input source that receives data through a network* - For
>    network-based data sources like *Kafka* and Flume, the received input
>    data is replicated in memory between nodes of the cluster (default
>    replication factor is 2). So if a worker node fails, then the system can
>    recompute the lost data from the left over copy of the input data.
>    However, if the *worker node where a network receiver was running fails,
>    then a tiny bit of data may be lost*, that is, the data received by the
>    system but not yet replicated to other node(s). The receiver will be
>    started on a different node and it will continue to receive data.
> 2. https://github.com/dibbhatt/kafka-spark-consumer
>
> Thanks,
>
> *Mukesh Jha*

--
Thanks & Regards,
*Mukesh Jha*
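For the Kafka-to-HBase case in the question above, one way to make replayed data harmless is the idempotent-write option raised elsewhere in this thread. A minimal sketch, assuming a plain Python dict as a stand-in for an HBase table and hypothetical helpers `row_key` and `put_all`: the row key is derived from topic, partition, and offset, so writing the same message twice overwrites the same row.

```python
def row_key(topic, partition, offset):
    """Build a deterministic key; zero-padding keeps offsets in sorted order."""
    return f"{topic}/{partition}/{offset:020d}"


def put_all(table, topic, partition, from_offset, messages):
    """Write messages keyed by their Kafka coordinates.

    `table` is any dict-like store (an assumption standing in for HBase).
    Replaying the same range rewrites the same keys with the same values.
    """
    for i, message in enumerate(messages):
        table[row_key(topic, partition, from_offset + i)] = message
```

This only works when the stored value depends solely on the message itself; aggregates such as counters are not idempotent and would need the transactional approach instead.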