Re: Achieving Consistency and Durability

2014-10-24 Thread Kyle Banker
Looks great, Gwen. I've added a few comments to the ticket.



Re: Achieving Consistency and Durability

2014-10-20 Thread Gwen Shapira
Hi Kyle,

I added new documentation, which will hopefully help. Please take a look here:
https://issues.apache.org/jira/browse/KAFKA-1555

I've heard rumors that you are very, very good at documenting, so I'm
looking forward to your comments.

Note that I'm completely ignoring the acks>1 case since we are about
to remove it.

Gwen



Re: Achieving Consistency and Durability

2014-10-15 Thread Kyle Banker
Thanks very much for these clarifications, Gwen.

I'd recommend modifying the following phrase describing "acks=-1":

"This option provides the best durability, we guarantee that no messages
will be lost as long as at least one in sync replica remains."

The "as long as at least one in sync replica remains" clause is a huge
caveat. It should be noted that "acks=-1" provides no actual durability
guarantee unless min.isr is also used to require a majority of replicas.

In addition, I was curious if you might comment on my other recent posting
"Consistency and Availability on Node Failures" and possibly add this
scenario to the docs. With acks=-1 and min.isr=2 and a 3-replica topic in a
12-node Kafka cluster, there's a relatively high probability that losing 2
nodes from this cluster will result in an inability to write to the cluster.
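The back-of-the-envelope behind that claim can be checked directly (the uniformly random replica placement and the partition count used for the second figure are illustrative assumptions, not anything Kafka guarantees):

```python
from math import comb

# Probability that one specific 3-replica partition loses >= 2 of its
# replicas when exactly 2 of 12 brokers fail (replicas on distinct brokers):
# both failed brokers must land inside that partition's 3-broker replica set.
brokers, replicas = 12, 3
p_one_partition = comb(replicas, 2) / comb(brokers, 2)
print(p_one_partition)  # ~0.045 (3/66)

# With many partitions whose replica sets are placed roughly independently,
# the chance that *some* partition drops below min.isr=2 (and so rejects
# writes) grows quickly. 50 partitions is an assumed, illustrative count,
# and treating partitions as independent is an approximation.
partitions = 50
p_any = 1 - (1 - p_one_partition) ** partitions
print(round(p_any, 2))  # ~0.9 under these assumptions
```

So for a single partition the risk is small, but across a realistically partitioned topic, losing 2 of 12 brokers is indeed quite likely to make some partition unwritable.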



Re: Achieving Consistency and Durability

2014-10-14 Thread Gwen Shapira
acks=2 *will* throw an exception when there's only one node in the ISR.

The problem with acks=2 is that if you have 3 replicas and you got acks
from 2 of them, the one replica which did not get the message can
still be in the ISR and get elected as leader, leading to the loss of the
message. If you specify acks=3, you can't tolerate the failure of a
single replica. Not amazing either.

To make things even worse, when specifying the number of acks you
want, you don't always know how many replicas the topic should have,
so it's difficult to pick the correct number.

acks=-1 solves that problem (since messages need to get acked by
all in-sync replicas), but introduces the new problem of not getting an
exception if the ISR shrinks to 1 replica.

That's why the min.isr configuration was added.

I hope this clarifies things :)
I'm planning to add this to the docs in a day or two, so let me know
if there are any additional explanations or scenarios you think we
need to include.

Gwen
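The acks=2 failure mode described above can be sketched as a toy simulation (plain Python, no Kafka involved; the replica names and the single-partition model are purely illustrative):

```python
# Toy model of the acks=2 scenario: 3 replicas, the producer is acked by
# the leader plus one follower, but a third replica that never got the
# message is still in the ISR and is therefore eligible to become leader.

def acked_write_survives(replicas_with_message, isr, new_leader):
    """An already-acked message survives only if the new leader has it."""
    assert new_leader in isr, "leaders are elected from the ISR"
    return new_leader in replicas_with_message

# acks=2: the message reached r1 (leader) and r2; r3 lagged but stayed in ISR.
replicas_with_message = {"r1", "r2"}
isr = {"r1", "r2", "r3"}

print(acked_write_survives(replicas_with_message, isr, new_leader="r2"))  # True
print(acked_write_survives(replicas_with_message, isr, new_leader="r3"))  # False: acked message lost
```

The second case is exactly the hole min.isr closes: with min.isr=2 and acks=-1, a write is only acked once enough in-sync replicas have it for any electable leader to have it too.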



Re: Achieving Consistency and Durability

2014-10-14 Thread Scott Reynolds
A question about 0.8.1.1 and acks. I was under the impression that setting
acks to 2 will not throw an exception when there is only one node in the ISR.
Am I incorrect? Hence the need for min_isr.



Achieving Consistency and Durability

2014-10-14 Thread Kyle Banker
It's quite difficult to infer from the docs the exact techniques required
to ensure consistency and durability in Kafka. I propose that we add a doc
section detailing these techniques. I would be happy to help with this.

The basic question is this: assuming that I can afford to temporarily halt
production to Kafka, how do I ensure that no message written to Kafka is
ever lost under typical failure scenarios (i.e., the loss of a single
broker)?

Here's my understanding of this for Kafka v0.8.1.1:

1. Create a topic with a replication factor of 3.
2. Use a sync producer and set acks to 2. (Setting acks to -1 may
successfully write even in a case where the data is written only to a
single node).

Even with these two precautions, there's always the possibility of an
"unclean leader election." Can data loss still occur in this scenario? Is
it possible to achieve this level of durability on v0.8.1.1?

In Kafka v0.8.2, in addition to the above:

3. Ensure that the triple-replicated topic also disallows unclean leader
election (https://issues.apache.org/jira/browse/KAFKA-1028).

4. Set the min.isr value of the producer to 2 and acks to -1 (
https://issues.apache.org/jira/browse/KAFKA-1555). The producer will then
throw an exception if data can't be written to 2 out of 3 nodes.

In addition to producer configuration and usage, there are also monitoring
and operations considerations for achieving durability and consistency. As
those are rather nuanced, it'd probably be easiest to just start iterating
on a document to flesh those out.

If anyone has any advice on how to better specify this, or how to get
started on improving the docs, I'm happy to help out.
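A minimal sketch of the producer side of steps 1-4 using the third-party kafka-python client (the broker address, topic name, and client library are all assumptions for illustration; note that min.insync.replicas and unclean.leader.election.enable are topic/broker-level configs set at topic creation, not producer settings, so this is a configuration sketch rather than a runnable recipe):

```python
# Sketch only -- assumes a reachable broker at localhost:9092 and a topic
# "events" already created with replication factor 3 and, per
# KAFKA-1555/KAFKA-1028, topic-level configs min.insync.replicas=2 and
# unclean.leader.election.enable=false.
from kafka import KafkaProducer
from kafka.errors import KafkaError

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    acks="all",   # equivalent to acks=-1: wait on the full in-sync replica set
    retries=0,    # surface failures to the caller instead of masking them
)

try:
    # .get() blocks until the broker acks or errors, like a sync producer
    producer.send("events", b"payload").get(timeout=10)
except KafkaError as exc:
    # Raised when, e.g., fewer than min.insync.replicas replicas are in sync
    print("write was not durably acknowledged:", exc)
```

With this combination, a successful send means the message reached at least 2 of the 3 replicas, and disallowing unclean leader election prevents a lagging replica from becoming leader and discarding it.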