Correct me if I am wrong: do you mean that the Kafka log was not processed due to a slow topology (in turn, a slow bolt), and the log was then deleted because the retention period ran out? That is still configurable and can be fine-tuned. I mean, we can adjust the retention interval and/or look into why the bolt is taking so long to process such a log.
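For reference, per-topic retention can be adjusted without touching the broker-wide default. A sketch using the stock Kafka CLI (the topic name `events` and the ZooKeeper address are illustrative, not from this thread):

```shell
# Give one topic a 7-day retention (retention.ms is in milliseconds);
# point --zookeeper at your own ensemble. Topic name "events" is an example.
bin/kafka-configs.sh --zookeeper localhost:2181 \
  --alter --entity-type topics --entity-name events \
  --add-config retention.ms=604800000

# Verify the per-topic override took effect
bin/kafka-configs.sh --zookeeper localhost:2181 \
  --describe --entity-type topics --entity-name events
```

Note that retention is a lower bound on how long a slow topology has to catch up; raising it buys time but does not remove the failure mode.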
I am more interested in what happens if some process (say the KafkaSpout) dies due to a random error. What will happen to any messages on the wire? How will it be handled on the Kafka cluster side and on the topology bolt side that reads from the Kafka spout?

On Wed, Jan 20, 2016 at 5:22 AM, John Yost <[email protected]> wrote:

> The only data loss I've seen is where a topology with KafkaSpout gets so
> far behind that the Kafka log segment for a given partition is rotated. In
> such a scenario, you'll see an OffsetOutOfRangeException.
>
> --John
>
> On Tue, Jan 19, 2016 at 5:21 PM, Milind Vaidya <[email protected]> wrote:
>
>> Yes. In a sunny-day scenario there is no data loss. But we are trying to
>> list some cases where there will be data loss, or at least we want to
>> consider different scenarios in which one or more components fail, and see
>> how the Kafka-Storm setup reacts and whether there is any data loss.
>>
>> We had some scenarios like you mentioned, where the maxOffsetBehind
>> setting led to problems due to slow downstream operations. But we are not
>> worried about the Kafka retention period either; that is a configuration
>> issue. What we are looking at is some thread accidentally dying, say the
>> kafka-spout, or some Kafka host containing all partitions for a topic
>> going down, etc.
>>
>> On Sat, Jan 16, 2016 at 5:32 AM, Abhishek Agarwal <[email protected]>
>> wrote:
>>
>>> The kafka spout doesn't have a data loss scenario unless you have
>>> modified the maxOffsetBehind setting (Long.MAX_VALUE by default),
>>> provided acks/fails are being done properly. Data could still be lost
>>> due to retention kicking in on the Kafka side: the topology will keep
>>> retrying a timed-out message, but Kafka is not going to keep it forever.
>>>
>>> On Fri, Jan 15, 2016 at 12:21 AM, Milind Vaidya <[email protected]>
>>> wrote:
>>>
>>>> Hi
>>>>
>>>> I have been using a Kafka-Storm setup for more than a year, running
>>>> almost 10 different topologies.
>>>>
>>>> The flow is something like this:
>>>>
>>>> Producer --> Kafka Cluster --> Storm cluster --> MongoDB.
>>>>
>>>> ZooKeeper keeps the metadata.
>>>>
>>>> So far the approach has been a little ad hoc, and we want it to be more
>>>> disciplined. We are trying to achieve no data loss and automated
>>>> failure handling.
>>>>
>>>> What are the failure scenarios in the case of a Storm cluster? Failure
>>>> as in data loss. We will try to cover them once we know them.
>>>
>>> --
>>> Regards,
>>> Abhishek Agarwal
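On the "spout dies with messages in flight" question: the usual answer is that the spout only commits an offset to ZooKeeper after the downstream bolts ack, so a restarted spout resumes from the last committed offset and replays anything that was in flight (at-least-once, with possible duplicates). A minimal sketch of that behavior; this models the idea, it is not the actual storm-kafka code, and all names here are hypothetical:

```python
# Hypothetical model of KafkaSpout-style at-least-once delivery.
# Offsets are committed (here, to a plain dict standing in for ZooKeeper)
# only after an ack; in-flight tuples at crash time are replayed.

class SpoutSim:
    def __init__(self, log, committed_store, partition="topic-0"):
        self.log = log                      # the Kafka partition log (a list)
        self.store = committed_store        # stands in for ZooKeeper offsets
        self.partition = partition
        self.next = self.store.get(partition, 0)  # resume from committed offset
        self.pending = set()                # emitted but not yet acked

    def next_tuple(self):
        """Emit the next message, tracking its offset as in-flight."""
        if self.next < len(self.log):
            offset = self.next
            self.next += 1
            self.pending.add(offset)
            return offset, self.log[offset]
        return None

    def ack(self, offset):
        """A bolt finished the tuple; advance the committed offset past
        the contiguous acked prefix."""
        self.pending.discard(offset)
        committed = self.store.get(self.partition, 0)
        while committed < self.next and committed not in self.pending:
            committed += 1
        self.store[self.partition] = committed


zk = {}
spout = SpoutSim(["m0", "m1", "m2"], zk)
spout.next_tuple()          # emits (0, "m0")
spout.next_tuple()          # emits (1, "m1"), still in flight
spout.ack(0)                # only offset 0 gets committed

# The spout process now dies with offset 1 un-acked. A fresh instance
# resumes from the committed offset, so "m1" is replayed, not lost.
spout2 = SpoutSim(["m0", "m1", "m2"], zk)
print(spout2.next_tuple())  # (1, 'm1')
```

The takeaway is that a crashed spout by itself does not lose data; duplicates are the cost, which is why bolts in this setup should be idempotent.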

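And on the scenario John describes, the genuine loss case is retention deleting segments the consumer has not read yet: the broker's earliest available offset moves past the committed offset, and the fetch fails. A toy model of that interaction (illustrative names, not the real Kafka client API):

```python
# Hypothetical model of retention outrunning a slow consumer, producing
# the OffsetOutOfRangeException John mentions. Not the real Kafka API.

class OffsetOutOfRangeError(Exception):
    pass

class PartitionLog:
    def __init__(self):
        self.messages = {}      # offset -> message
        self.earliest = 0       # first offset still retained
        self.next_offset = 0    # where the next append lands

    def append(self, msg):
        self.messages[self.next_offset] = msg
        self.next_offset += 1

    def expire_before(self, offset):
        """Retention kicking in: segments before `offset` are deleted."""
        for o in range(self.earliest, offset):
            self.messages.pop(o, None)
        self.earliest = offset

    def fetch(self, offset):
        if offset < self.earliest or offset >= self.next_offset:
            raise OffsetOutOfRangeError(
                "requested %d, valid range is [%d, %d)"
                % (offset, self.earliest, self.next_offset))
        return self.messages[offset]


log = PartitionLog()
for i in range(5):
    log.append("m%d" % i)

consumer_offset = 1        # a slow topology is still back at offset 1
log.expire_before(3)       # retention rotates away offsets 0-2

try:
    log.fetch(consumer_offset)
except OffsetOutOfRangeError as e:
    print("data loss:", e)  # offsets 1-2 are gone for good
```

Once this happens the only choices are to reset to the earliest or latest retained offset; the skipped messages are unrecoverable, which is exactly the case worth alerting on.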