Hi Mukesh,
There's been some great work on Spark Streaming reliability lately. I'm not
aware of any doc yet (did I miss something?), but you can look at the
ReliableKafkaReceiver's test suite in the Spark codebase. A rough sketch of
how to enable it is at the end of this mail.

-- FG

On Wed, Dec 10, 2014 at 11:17 AM, Mukesh Jha <me.mukesh....@gmail.com> wrote:

> Hello Guys,
>
> Any insights on this?
>
> If I'm not clear enough, my question is: how can I use a Kafka consumer
> with spark-streaming and not lose any data in case of failures?
>
> On Tue, Dec 9, 2014 at 2:53 PM, Mukesh Jha <me.mukesh....@gmail.com> wrote:
>
>> Hello Experts,
>>
>> I'm working on a Spark app which reads data from Kafka & persists it in
>> HBase.
>>
>> The Spark documentation states the following [1]: in case of a worker
>> failure we can lose some data. If so, how can I make my Kafka stream more
>> reliable? I have seen that there is a simple consumer [2], but I'm not
>> sure whether it has been used/tested extensively.
>>
>> I was wondering if there is a way to explicitly acknowledge the Kafka
>> offsets once they are replicated in the memory of other worker nodes (if
>> it's not already done) to tackle this issue.
>>
>> Any help is appreciated in advance.
>>
>> 1. *Using any input source that receives data through a network* - For
>> network-based data sources like Kafka and Flume, the received input data
>> is replicated in memory between nodes of the cluster (default replication
>> factor is 2). So if a worker node fails, then the system can recompute
>> the lost data from the leftover copy of the input data. However, if the
>> worker node where a network receiver was running fails, then a tiny bit
>> of data may be lost, that is, the data received by the system but not yet
>> replicated to other node(s). The receiver will be started on a different
>> node and it will continue to receive data.
>> 2. https://github.com/dibbhatt/kafka-spark-consumer
>>
>> Txz,
>>
>> Mukesh Jha <me.mukesh....@gmail.com>
>>
> --
> Thanks & Regards,
> Mukesh Jha <me.mukesh....@gmail.com>
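
P.S. A minimal sketch, assuming Spark 1.2's receiver-based API (it needs the
spark-streaming-kafka artifact on the classpath): when the write-ahead log is
enabled via spark.streaming.receiver.writeAheadLog.enable, KafkaUtils.createStream
runs the ReliableKafkaReceiver, which commits offsets to ZooKeeper only after
the received blocks are stored. The topic, group, ZooKeeper quorum, and paths
below are placeholders.

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object ReliableKafkaSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("kafka-to-hbase")
      // Opt in to the receiver write-ahead log (new in Spark 1.2); with it
      // enabled, KafkaUtils.createStream uses the ReliableKafkaReceiver
      // instead of the plain KafkaReceiver.
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")

    val ssc = new StreamingContext(conf, Seconds(10))
    // The WAL is written under the checkpoint directory, so it must live on
    // a fault-tolerant filesystem such as HDFS.
    ssc.checkpoint("hdfs:///tmp/spark-checkpoints")

    val stream = KafkaUtils.createStream(
      ssc,
      "zkhost1:2181,zkhost2:2181", // ZooKeeper quorum (placeholder)
      "my-consumer-group",         // consumer group id (placeholder)
      Map("my-topic" -> 1),        // topic -> number of receiver threads
      // With the WAL on, in-memory replication is redundant, so a
      // non-replicated storage level (no _2 suffix) is sufficient.
      StorageLevel.MEMORY_AND_DISK_SER)

    stream.foreachRDD { rdd =>
      // write the batch to HBase here
    }

    ssc.start()
    ssc.awaitTermination()
  }
}

Note that this gives at-least-once delivery: after a receiver failure the
logged blocks may be replayed, so the HBase writes should be idempotent (e.g.
deterministic row keys) to avoid duplicates.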