Re: kafka + autoscaling groups fuckery

2016-07-05 Thread Gian Merlino
I think it'd be possible to avoid special-casing replacements, but it might
be a bad idea network-traffic-wise, especially for rolling upgrades.

My experience running Kafka on AWS is that rebalancing with multi-day
retention periods can take a really long time, and can torch the cluster if
you rebalance too much at once. Rebalancing the entire cluster at once was
a recipe for having the cluster DoS itself. So we chose to special-case
upgrades/replacements just to avoid having too much rebalancing traffic.
What we did was store all the Kafka data, including the broker id, on EBS
volumes. Replacements mean attaching an old EBS volume to a new instance –
that way the id and data are reused, and no rebalance traffic is needed
(just some catch-up). New additions mean creating a new EBS volume, new
broker id, and new instance, and balancing partitions onto it.
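
For illustration, here is a rough Go sketch of that replacement path, assuming the
AWS SDK for Go; the volume/instance ids, mount point, and the meta.properties
location are placeholders (0.8.x brokers don't write meta.properties, so there
you'd keep your own marker file on the volume):

package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"strings"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	svc := ec2.New(session.Must(session.NewSession()))

	// Re-attach the dead broker's data volume to this replacement instance.
	// In practice the firstboot script would discover these ids from tags.
	_, err := svc.AttachVolume(&ec2.AttachVolumeInput{
		VolumeId:   aws.String("vol-0123456789abcdef0"), // placeholder
		InstanceId: aws.String("i-0123456789abcdef0"),   // placeholder
		Device:     aws.String("/dev/xvdf"),
	})
	if err != nil {
		log.Fatalf("attach failed: %v", err)
	}

	// After the volume is mounted under the Kafka log dir (mount step omitted),
	// recover the old broker id so the new instance comes up as the same broker
	// and only needs to catch up, not rebalance.
	f, err := os.Open("/mnt/kafka-data/meta.properties") // placeholder path
	if err != nil {
		log.Fatalf("open meta.properties: %v", err)
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if strings.HasPrefix(line, "broker.id=") {
			fmt.Println("reusing broker.id", strings.TrimPrefix(line, "broker.id="))
		}
	}
}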

If you have short retention periods then this is probably less of an issue.

(NB: this experience is from Kafka 0.8.x, not sure if something has
improved in Kafka since then)

On Sun, Jul 3, 2016 at 11:35 AM, Charity Majors  wrote:

> It would be cool to know if Netflix runs kafka in ASGs ... I can't find any
> mention of it online.  (https://github.com/Netflix/suro/wiki/FAQ sorta
> implies that maybe they do, but it's not clear, and also old.)
>
> I've seen other people talking about running kafka in ASGs, e.g.
> http://blog.jimjh.com/building-elastic-clusters.html, but they all rely on
> reusing broker IDs.  Which certainly makes it easier, but imho is the wrong
> way to do this for all the reasons I listed before.
>
> On Sun, Jul 3, 2016 at 11:29 AM, Charity Majors  wrote:
>
> > Great talks, but not relevant to either of my problems -- the golang
> > client not rebalancing the consumer offset topic, or autoscaling group
> > behavior (which is I think is probably just a consequence of the first).
> >
> > Thanks though, there's good stuff in here.
> >
> > On Sun, Jul 3, 2016 at 10:23 AM, James Cheng 
> wrote:
> >
> >> Charity,
> >>
> >> I'm not sure about the specific problem you are having, but about Kafka
> >> on AWS, Netflix did a talk at a meetup about their Kafka installation on
> >> AWS. There might be some useful information in there. There is a video
> >> stream as well as slides, and maybe you can get in touch with the
> speakers.
> >> Look in the comment section for links to the slides and video.
> >>
> >> Kafka at Netflix
> >>
> >>
> http://www.meetup.com//http-kafka-apache-org/events/220355031/?showDescription=true
> >>
> >> There's also a talk about running Kafka on Mesos, which might be
> relevant.
> >>
> >> Kafka on Mesos
> >>
> >>
> http://www.meetup.com//http-kafka-apache-org/events/222537743/?showDescription=true
> >>
> >> -James
> >>
> >> Sent from my iPhone
> >>
> >> > On Jul 2, 2016, at 5:15 PM, Charity Majors  wrote:
> >> >
> >> > Gwen, thanks for the response.
> >> >
> >> > 1.1 Your life may be a bit simpler if you have a way of starting a new
> >> >
> >> >> broker with the same ID as the old one - this means it will
> >> >> automatically pick up the old replicas and you won't need to
> >> >> rebalance. Makes life slightly easier in some cases.
> >> >
> >> > Yeah, this is definitely doable, I just don't *want* to do it.  I
> really
> >> > want all of these to share the same code path: 1) rolling all nodes in
> >> an
> >> > ASG to pick up a new AMI, 2) hardware failure / unintentional node
> >> > termination, 3) resizing the ASG and rebalancing the data across
> nodes.
> >> >
> >> > Everything but the first one means generating new node IDs, so I would
> >> > rather just do that across the board.  It's the solution that really
> >> fits
> >> > the ASG model best, so I'm reluctant to give up on it.
> >> >
> >> >
> >> >> 1.2 Careful not too rebalance too many partitions at once - you only
> >> >> have so much bandwidth and currently Kafka will not throttle
> >> >> rebalancing traffic.
> >> >
> >> > Nod, got it.  This is def something I plan to work on hardening once I
> >> have
> >> > the basic nut of things working (or if I've had to give up on it and
> >> accept
> >> > a lesser solution).
> >> >
> >> >
> >> >> 2. I think your rebalance script is not rebalancing the offsets
> topic?
> >> >> It still has a replica on broker 1002. You have two good replicas, so
> >> >> you are no where near disaster, but make sure you get this working
> >> >> too.
> >> >
> >> > Yes, this is another problem I am working on in parallel.  The Shopify
> >> > sarama library  uses the
> >> > __consumer_offsets topic, but it does *not* let you rebalance or
> resize
> >> the
> >> > topic when consumers connect, disconnect, or restart.
> >> >
> >> > "Note that Sarama's Consumer implementation does not currently support
> >> > automatic consumer-group rebalancing and offset tracking"
> >> >
> >> > I'm working on trying to get the 

Re: kafka + autoscaling groups fuckery

2016-07-03 Thread Charity Majors
It would be cool to know if Netflix runs kafka in ASGs ... I can't find any
mention of it online.  (https://github.com/Netflix/suro/wiki/FAQ sorta
implies that maybe they do, but it's not clear, and also old.)

I've seen other people talking about running kafka in ASGs, e.g.
http://blog.jimjh.com/building-elastic-clusters.html, but they all rely on
reusing broker IDs.  Which certainly makes it easier, but imho is the wrong
way to do this for all the reasons I listed before.

On Sun, Jul 3, 2016 at 11:29 AM, Charity Majors  wrote:

> Great talks, but not relevant to either of my problems -- the golang
> client not rebalancing the consumer offset topic, or autoscaling group
> behavior (which is I think is probably just a consequence of the first).
>
> Thanks though, there's good stuff in here.
>
> On Sun, Jul 3, 2016 at 10:23 AM, James Cheng  wrote:
>
>> Charity,
>>
>> I'm not sure about the specific problem you are having, but about Kafka
>> on AWS, Netflix did a talk at a meetup about their Kafka installation on
>> AWS. There might be some useful information in there. There is a video
>> stream as well as slides, and maybe you can get in touch with the speakers.
>> Look in the comment section for links to the slides and video.
>>
>> Kafka at Netflix
>>
>> http://www.meetup.com//http-kafka-apache-org/events/220355031/?showDescription=true
>>
>> There's also a talk about running Kafka on Mesos, which might be relevant.
>>
>> Kafka on Mesos
>>
>> http://www.meetup.com//http-kafka-apache-org/events/222537743/?showDescription=true
>>
>> -James
>>
>> Sent from my iPhone
>>
>> > On Jul 2, 2016, at 5:15 PM, Charity Majors  wrote:
>> >
>> > Gwen, thanks for the response.
>> >
>> > 1.1 Your life may be a bit simpler if you have a way of starting a new
>> >
>> >> broker with the same ID as the old one - this means it will
>> >> automatically pick up the old replicas and you won't need to
>> >> rebalance. Makes life slightly easier in some cases.
>> >
>> > Yeah, this is definitely doable, I just don't *want* to do it.  I really
>> > want all of these to share the same code path: 1) rolling all nodes in
>> an
>> > ASG to pick up a new AMI, 2) hardware failure / unintentional node
>> > termination, 3) resizing the ASG and rebalancing the data across nodes.
>> >
>> > Everything but the first one means generating new node IDs, so I would
>> > rather just do that across the board.  It's the solution that really
>> fits
>> > the ASG model best, so I'm reluctant to give up on it.
>> >
>> >
>> >> 1.2 Careful not too rebalance too many partitions at once - you only
>> >> have so much bandwidth and currently Kafka will not throttle
>> >> rebalancing traffic.
>> >
>> > Nod, got it.  This is def something I plan to work on hardening once I
>> have
>> > the basic nut of things working (or if I've had to give up on it and
>> accept
>> > a lesser solution).
>> >
>> >
>> >> 2. I think your rebalance script is not rebalancing the offsets topic?
>> >> It still has a replica on broker 1002. You have two good replicas, so
>> >> you are no where near disaster, but make sure you get this working
>> >> too.
>> >
>> > Yes, this is another problem I am working on in parallel.  The Shopify
>> > sarama library  uses the
>> > __consumer_offsets topic, but it does *not* let you rebalance or resize
>> the
>> > topic when consumers connect, disconnect, or restart.
>> >
>> > "Note that Sarama's Consumer implementation does not currently support
>> > automatic consumer-group rebalancing and offset tracking"
>> >
>> > I'm working on trying to get the sarama-cluster to do something here.  I
>> > think these problems are likely related, I'm not sure wtf you are
>> > *supposed* to do to rebalance this god damn topic.  It also seems like
>> we
>> > aren't using a consumer group which sarama-cluster depends on to
>> rebalance
>> > a topic.  I'm still pretty confused by the 0.9 "consumer group" stuff.
>> >
>> > Seriously considering downgrading to the latest 0.8 release, because
>> > there's a massive gap in documentation for the new stuff in 0.9 (like
>> > consumer groups) and we don't really need any of the new features.
>> >
>> > A common work-around is to configure the consumer to handle "offset
>> >> out of range" exception by jumping to the last offset available in the
>> >> log. This is the behavior of the Java client, and it would have saved
>> >> your consumer here. Go client looks very low level, so I don't know
>> >> how easy it is to do that.
>> >
>> > Erf, this seems like it would almost guarantee data loss.  :(  Will
>> check
>> > it out tho.
>> >
>> > If I were you, I'd retest your ASG scripts without the auto leader
>> >> election - since your own scripts can / should handle that.
>> >
>> > Okay, this is straightforward enough.  Will try it.  And will keep
>> tryingn
>> > to figure out how to balance the __consumer_offsets topic, since I

Re: kafka + autoscaling groups fuckery

2016-07-03 Thread Charity Majors
Great talks, but not relevant to either of my problems -- the golang client
not rebalancing the consumer offset topic, or autoscaling group behavior
(which I think is probably just a consequence of the first).

Thanks though, there's good stuff in here.

On Sun, Jul 3, 2016 at 10:23 AM, James Cheng  wrote:

> Charity,
>
> I'm not sure about the specific problem you are having, but about Kafka on
> AWS, Netflix did a talk at a meetup about their Kafka installation on AWS.
> There might be some useful information in there. There is a video stream as
> well as slides, and maybe you can get in touch with the speakers. Look in
> the comment section for links to the slides and video.
>
> Kafka at Netflix
>
> http://www.meetup.com//http-kafka-apache-org/events/220355031/?showDescription=true
>
> There's also a talk about running Kafka on Mesos, which might be relevant.
>
> Kafka on Mesos
>
> http://www.meetup.com//http-kafka-apache-org/events/222537743/?showDescription=true
>
> -James
>
> Sent from my iPhone
>
> > On Jul 2, 2016, at 5:15 PM, Charity Majors  wrote:
> >
> > Gwen, thanks for the response.
> >
> > 1.1 Your life may be a bit simpler if you have a way of starting a new
> >
> >> broker with the same ID as the old one - this means it will
> >> automatically pick up the old replicas and you won't need to
> >> rebalance. Makes life slightly easier in some cases.
> >
> > Yeah, this is definitely doable, I just don't *want* to do it.  I really
> > want all of these to share the same code path: 1) rolling all nodes in an
> > ASG to pick up a new AMI, 2) hardware failure / unintentional node
> > termination, 3) resizing the ASG and rebalancing the data across nodes.
> >
> > Everything but the first one means generating new node IDs, so I would
> > rather just do that across the board.  It's the solution that really fits
> > the ASG model best, so I'm reluctant to give up on it.
> >
> >
> >> 1.2 Careful not too rebalance too many partitions at once - you only
> >> have so much bandwidth and currently Kafka will not throttle
> >> rebalancing traffic.
> >
> > Nod, got it.  This is def something I plan to work on hardening once I
> have
> > the basic nut of things working (or if I've had to give up on it and
> accept
> > a lesser solution).
> >
> >
> >> 2. I think your rebalance script is not rebalancing the offsets topic?
> >> It still has a replica on broker 1002. You have two good replicas, so
> >> you are no where near disaster, but make sure you get this working
> >> too.
> >
> > Yes, this is another problem I am working on in parallel.  The Shopify
> > sarama library  uses the
> > __consumer_offsets topic, but it does *not* let you rebalance or resize
> the
> > topic when consumers connect, disconnect, or restart.
> >
> > "Note that Sarama's Consumer implementation does not currently support
> > automatic consumer-group rebalancing and offset tracking"
> >
> > I'm working on trying to get the sarama-cluster to do something here.  I
> > think these problems are likely related, I'm not sure wtf you are
> > *supposed* to do to rebalance this god damn topic.  It also seems like we
> > aren't using a consumer group which sarama-cluster depends on to
> rebalance
> > a topic.  I'm still pretty confused by the 0.9 "consumer group" stuff.
> >
> > Seriously considering downgrading to the latest 0.8 release, because
> > there's a massive gap in documentation for the new stuff in 0.9 (like
> > consumer groups) and we don't really need any of the new features.
> >
> > A common work-around is to configure the consumer to handle "offset
> >> out of range" exception by jumping to the last offset available in the
> >> log. This is the behavior of the Java client, and it would have saved
> >> your consumer here. Go client looks very low level, so I don't know
> >> how easy it is to do that.
> >
> > Erf, this seems like it would almost guarantee data loss.  :(  Will check
> > it out tho.
> >
> > If I were you, I'd retest your ASG scripts without the auto leader
> >> election - since your own scripts can / should handle that.
> >
> > Okay, this is straightforward enough.  Will try it.  And will keep
> tryingn
> > to figure out how to balance the __consumer_offsets topic, since I
> > increasingly think that's the key to this giant mess.
> >
> > If anyone has any advice there, massively appreciated.
> >
> > Thanks,
> >
> > charity.
>


Re: kafka + autoscaling groups fuckery

2016-07-02 Thread Charity Majors
Gwen, thanks for the response.

1.1 Your life may be a bit simpler if you have a way of starting a new

> broker with the same ID as the old one - this means it will
> automatically pick up the old replicas and you won't need to
> rebalance. Makes life slightly easier in some cases.
>

Yeah, this is definitely doable, I just don't *want* to do it.  I really
want all of these to share the same code path: 1) rolling all nodes in an
ASG to pick up a new AMI, 2) hardware failure / unintentional node
termination, 3) resizing the ASG and rebalancing the data across nodes.

Everything but the first one means generating new node IDs, so I would
rather just do that across the board.  It's the solution that really fits
the ASG model best, so I'm reluctant to give up on it.


> 1.2 Careful not too rebalance too many partitions at once - you only
> have so much bandwidth and currently Kafka will not throttle
> rebalancing traffic.
>

Nod, got it.  This is def something I plan to work on hardening once I have
the basic nut of things working (or if I've had to give up on it and accept
a lesser solution).


> 2. I think your rebalance script is not rebalancing the offsets topic?
> It still has a replica on broker 1002. You have two good replicas, so
> you are no where near disaster, but make sure you get this working
> too.
>

Yes, this is another problem I am working on in parallel.  The Shopify
sarama library  uses the
__consumer_offsets topic, but it does *not* let you rebalance or resize the
topic when consumers connect, disconnect, or restart.

"Note that Sarama's Consumer implementation does not currently support
automatic consumer-group rebalancing and offset tracking"

I'm working on trying to get the sarama-cluster to do something here.  I
think these problems are likely related, I'm not sure wtf you are
*supposed* to do to rebalance this god damn topic.  It also seems like we
aren't using a consumer group which sarama-cluster depends on to rebalance
a topic.  I'm still pretty confused by the 0.9 "consumer group" stuff.
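
For what it's worth, a rough sketch of what group-based consumption looks like
with bsm/sarama-cluster against the 0.9 group coordinator; the group handles
spreading partitions across consumer processes and committing offsets to
__consumer_offsets (broker list, group id, and topic below are placeholders):

package main

import (
	"log"

	cluster "github.com/bsm/sarama-cluster"
)

func main() {
	config := cluster.NewConfig()
	config.Consumer.Return.Errors = true
	config.Group.Return.Notifications = true

	consumer, err := cluster.NewConsumer(
		[]string{"broker-1:9092"}, // placeholder broker list
		"retriever",               // placeholder consumer group id
		[]string{"hound-events"},  // placeholder topic
		config,
	)
	if err != nil {
		log.Fatal(err)
	}
	defer consumer.Close()

	// Notifications fire every time the group rebalances (a consumer joins,
	// leaves, or partition ownership otherwise changes).
	go func() {
		for note := range consumer.Notifications() {
			log.Printf("rebalanced, now own: %v", note.Current)
		}
	}()
	go func() {
		for err := range consumer.Errors() {
			log.Printf("consumer error: %v", err)
		}
	}()

	for msg := range consumer.Messages() {
		// ... handle msg ...
		consumer.MarkOffset(msg, "") // commit to __consumer_offsets
	}
}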

Seriously considering downgrading to the latest 0.8 release, because
there's a massive gap in documentation for the new stuff in 0.9 (like
consumer groups) and we don't really need any of the new features.

A common work-around is to configure the consumer to handle "offset
> out of range" exception by jumping to the last offset available in the
> log. This is the behavior of the Java client, and it would have saved
> your consumer here. Go client looks very low level, so I don't know
> how easy it is to do that.
>

Erf, this seems like it would almost guarantee data loss.  :(  Will check
it out tho.

If I were you, I'd retest your ASG scripts without the auto leader
> election - since your own scripts can / should handle that.
>

Okay, this is straightforward enough.  Will try it.  And will keep trying
to figure out how to balance the __consumer_offsets topic, since I
increasingly think that's the key to this giant mess.

If anyone has any advice there, massively appreciated.

Thanks,

charity.


Re: kafka + autoscaling groups fuckery

2016-06-28 Thread Gwen Shapira
Charity,

1. Nothing you do seems crazy to me. Kafka should be able to work with
auto-scaling and we should be able to fix the issues you are running
into.

There are few things you should be careful about when using the method
you described though:
1.1 Your life may be a bit simpler if you have a way of starting a new
broker with the same ID as the old one - this means it will
automatically pick up the old replicas and you won't need to
rebalance. Makes life slightly easier in some cases.
1.2 Careful not to rebalance too many partitions at once - you only
have so much bandwidth and currently Kafka will not throttle
rebalancing traffic.

2. I think your rebalance script is not rebalancing the offsets topic?
It still has a replica on broker 1002. You have two good replicas, so
you are nowhere near disaster, but make sure you get this working
too.
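
(For reference, the file that kafka-reassign-partitions.sh takes via
--reassignment-json-file is plain JSON, so the balancing script just needs to
include __consumer_offsets in what it generates. A rough sketch, with
placeholder broker ids and the 50-partition default for the offsets topic:)

package main

import (
	"encoding/json"
	"fmt"
)

type partitionAssignment struct {
	Topic     string `json:"topic"`
	Partition int    `json:"partition"`
	Replicas  []int  `json:"replicas"`
}

type reassignment struct {
	Version    int                   `json:"version"`
	Partitions []partitionAssignment `json:"partitions"`
}

func main() {
	live := []int{1001, 1003, 1004} // surviving broker ids (placeholders)
	const offsetsPartitions = 50    // offsets.topic.num.partitions default

	r := reassignment{Version: 1}
	for p := 0; p < offsetsPartitions; p++ {
		// Spread three replicas per partition round-robin across live brokers.
		r.Partitions = append(r.Partitions, partitionAssignment{
			Topic:     "__consumer_offsets",
			Partition: p,
			Replicas: []int{
				live[p%len(live)],
				live[(p+1)%len(live)],
				live[(p+2)%len(live)],
			},
		})
	}

	out, _ := json.MarshalIndent(r, "", "  ")
	fmt.Println(string(out))
}

Feed the output to kafka-reassign-partitions.sh --zookeeper ...
--reassignment-json-file ... --execute, same as for the data topics.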

3. From the logs, it looks like something a bit more complex happened...
You started with brokers 1001, 1002 and 1003; around 00:08 or so they
are all gone. Then 1005, 1006, 1007 and 1008 show up, but they
still have no replicas. Then you get 1001, 1003, 1004 and 1005? and
then it moves to 1001, 1003, 1004 and 1009? I'm not sure I managed to
piece this together correctly (we need better log-extraction tools for
sure...), but it looks like we had plenty of opportunities for things
to go wrong :)

4. We have a known race condition where two leader elections in close
proximity can cause a consumer to accidentally get ahead.
One culprit is "auto.leader.rebalance.enable=true" - this can trigger
leader election at bad timing (see:
https://issues.apache.org/jira/browse/KAFKA-3670) which can lead to
the loss of the last offset after consumers saw it. In cases where the
admin controls rebalances, we often turn it off. You can trigger
rebalances based on your knowledge without Kafka automatically doing
extra rebalancing.
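
(For what it's worth, a sketch of what triggering it yourself could look like:
writing the same znode the stock kafka-preferred-replica-election.sh tool
writes. This assumes the samuel/go-zookeeper client, no ZK chroot, and
placeholder ZK address/topic/partition values:)

package main

import (
	"log"
	"time"

	"github.com/samuel/go-zookeeper/zk"
)

func main() {
	conn, _, err := zk.Connect([]string{"zk-1:2181"}, 10*time.Second) // placeholder
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// The controller watches this path, runs a preferred-leader election for
	// the listed partitions, and deletes the znode when it is done.
	data := []byte(`{"version":1,"partitions":[{"topic":"__consumer_offsets","partition":5}]}`)
	if _, err := conn.Create("/admin/preferred_replica_election", data, 0, zk.WorldACL(zk.PermAll)); err != nil {
		log.Fatalf("could not request election (one may already be in progress): %v", err)
	}
}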

Another culprit can be "unclean.leader.election.enable", but you don't
have that :)

A common work-around is to configure the consumer to handle "offset
out of range" exception by jumping to the last offset available in the
log. This is the behavior of the Java client, and it would have saved
your consumer here. Go client looks very low level, so I don't know
how easy it is to do that.
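
For illustration, a rough sarama sketch of that fallback: if ConsumePartition
rejects the stored offset with the same "offset is outside the range" error the
retriever logged, restart from the newest available offset (broker list, topic,
and the stored offset here are placeholders):

package main

import (
	"log"

	"github.com/Shopify/sarama"
)

// consumeFrom tries the stored offset first and falls back to the newest
// offset if the broker rejects it. Note the trade-off raised in the replies:
// if the stored offset fell behind the retained range you silently skip data;
// if it is ahead of the log end (this thread's case) you effectively rewind
// to the end of the log.
func consumeFrom(c sarama.Consumer, topic string, partition int32, stored int64) (sarama.PartitionConsumer, error) {
	pc, err := c.ConsumePartition(topic, partition, stored)
	if err == sarama.ErrOffsetOutOfRange {
		log.Printf("offset %d out of range for %s/%d, restarting at newest", stored, topic, partition)
		return c.ConsumePartition(topic, partition, sarama.OffsetNewest)
	}
	return pc, err
}

func main() {
	consumer, err := sarama.NewConsumer([]string{"broker-1:9092"}, sarama.NewConfig())
	if err != nil {
		log.Fatal(err)
	}
	defer consumer.Close()

	pc, err := consumeFrom(consumer, "hound-events", 5, 12040353)
	if err != nil {
		log.Fatal(err)
	}
	defer pc.Close()

	for msg := range pc.Messages() {
		log.Printf("partition %d offset %d", msg.Partition, msg.Offset)
	}
}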

If I were you, I'd retest your ASG scripts without the auto leader
election - since your own scripts can / should handle that.

Hope this helps,

Gwen

On Tue, Jun 28, 2016 at 3:49 PM, Charity Majors  wrote:
> Hi there,
>
> I just finished implementing kafka + autoscaling groups in a way that made
> sense to me.  I have a _lot_ of experience with ASGs and various storage
> types but I'm a kafka noob (about 4-5 months of using in development and
> staging and pre-launch production).
>
> It seems to be working fine from the Kafka POV but causing troubling side
> effects elsewhere that I don't understand.  I don't know enough about Kafka
> to know if my implementation is just fundamentally flawed for some reason,
> or if so how and why.
>
> My process is basically this:
>
> - Terminate a node, or increment the size of the ASG by one.  (I'm not doing
> any graceful shutdowns because I don't want to rely on graceful shutdowns,
> and I'm not attempting to act upon more than one node at a time.  Planning
> on doing a ZK lock or something later to enforce one process at a time, if I
> can work the major kinks out.)
>
> - Firstboot script, which runs on all hosts from rc.init.  (We run ASGs for
> *everything.)  It infers things like the chef role, environment, cluster
> name, etc, registers DNS, bootstraps and runs chef-client, etc.  For storage
> nodes, it formats and mounts a PIOPS volume under the right mount point, or
> just remounts the volume if it already contains data.  Etc.
>
> - Run a balancing script from firstboot on kafka nodes.  It checks to see
> how many brokers there are and what their ids are, and checks for any
> underbalanced partitions with less than 3 ISRs.  Then we generate a new
> assignment file for rebalancing partitions, and execute it.  We watch on the
> host for all the partitions to finish rebalancing, then complete.
>
> - So far so good.  I have repeatedly killed kafka nodes and had them come
> up, rebalance the cluster, and everything on the kafka side looks healthy.
> All the partitions have the correct number of ISRs, etc.
>
> But after doing this, we have repeatedly gotten into a state where consumers
> that are pulling off the kafka partitions enter a weird state where their
> last known offset is *ahead* of the last known offset for that partition,
> and we can't recover from it.
>
> An example.  Last night I terminated ... I think it was broker 1002 or 1005,
> and it came back up as broker 1009.  It rebalanced on boot, everything
> looked good from the kafka side.  This morning we noticed that the storage
> node that maps to partition 5 has been broken for like 22 hours, it thinks
> the next offset is too far ahead / out 

Re: kafka + autoscaling groups fuckery

2016-06-28 Thread Charity Majors
Reasons.

Investigated it thoroughly, believe me.  Some of the limitations that
Kinesis uses to protect itself are non-starters for us.

Forgot to mention, we are using 0.9.0.1-0.



On Tue, Jun 28, 2016 at 3:56 PM, Pradeep Gollakota 
wrote:

> Just out of curiosity, if you guys are in AWS for everything, why not use
> Kinesis?
>
> On Tue, Jun 28, 2016 at 3:49 PM, Charity Majors  wrote:
>
> > Hi there,
> >
> > I just finished implementing kafka + autoscaling groups in a way that
> made
> > sense to me.  I have a _lot_ of experience with ASGs and various storage
> > types but I'm a kafka noob (about 4-5 months of using in development and
> > staging and pre-launch production).
> >
> > It seems to be working fine from the Kafka POV but causing troubling side
> > effects elsewhere that I don't understand.  I don't know enough about
> Kafka
> > to know if my implementation is just fundamentally flawed for some
> reason,
> > or if so how and why.
> >
> > My process is basically this:
> >
> > - *Terminate a node*, or increment the size of the ASG by one.  (I'm not
> > doing any graceful shutdowns because I don't want to rely on graceful
> > shutdowns, and I'm not attempting to act upon more than one node at a
> > time.  Planning on doing a ZK lock or something later to enforce one
> > process at a time, if I can work the major kinks out.)
> >
> > - *Firstboot script,* which runs on all hosts from rc.init.  (We run ASGs
> > for *everything.)  It infers things like the chef role, environment,
> > cluster name, etc, registers DNS, bootstraps and runs chef-client, etc.
> > For storage nodes, it formats and mounts a PIOPS volume under the right
> > mount point, or just remounts the volume if it already contains data.
> Etc.
> >
> > - *Run a balancing script from firstboot* on kafka nodes.  It checks to
> > see how many brokers there are and what their ids are, and checks for any
> > underbalanced partitions with less than 3 ISRs.  Then we generate a new
> > assignment file for rebalancing partitions, and execute it.  We watch on
> > the host for all the partitions to finish rebalancing, then complete.
> >
> > *- So far so good*.  I have repeatedly killed kafka nodes and had them
> > come up, rebalance the cluster, and everything on the kafka side looks
> > healthy.  All the partitions have the correct number of ISRs, etc.
> >
> > But after doing this, we have repeatedly gotten into a state where
> > consumers that are pulling off the kafka partitions enter a weird state
> > where their last known offset is *ahead* of the last known offset for
> that
> > partition, and we can't recover from it.
> >
> > *An example.*  Last night I terminated ... I think it was broker 1002 or
> > 1005, and it came back up as broker 1009.  It rebalanced on boot,
> > everything looked good from the kafka side.  This morning we noticed that
> > the storage node that maps to partition 5 has been broken for like 22
> > hours, it thinks the next offset is too far ahead / out of bounds so
> > stopped consuming.  This happened shortly after broker 1009 came online
> and
> > the consumer caught up.
> >
> > From the storage node log:
> >
> > time="2016-06-28T21:51:48.286035635Z" level=info msg="Serving at
> > 0.0.0.0:8089..."
> > time="2016-06-28T21:51:48.293946529Z" level=error msg="Error creating
> > consumer" error="kafka server: The requested offset is outside the range
> of
> > offsets maintained by the server for the given topic/partition."
> > time="2016-06-28T21:51:48.294532365Z" level=error msg="Failed to start
> > services: kafka server: The requested offset is outside the range of
> > offsets maintained by the server for the given topic/partition."
> > time="2016-06-28T21:51:48.29461156Z" level=info msg="Shutting down..."
> >
> > From the mysql mapping of partitions to storage nodes/statuses:
> >
> > PRODUCTION ubuntu@retriever-112c6d8d:/srv/hound/retriever/log$
> > hound-kennel
> >
> > Listing by default. Use -action  > setstate, addslot, removeslot, removenode> for other actions
> >
> > Part  Status  Last Updated                    Hostname
> > 0     live    2016-06-28 22:29:10 +0000 UTC   retriever-772045ec
> > 1     live    2016-06-28 22:29:29 +0000 UTC   retriever-75e0e4f2
> > 2     live    2016-06-28 22:29:25 +0000 UTC   retriever-78804480
> > 3     live    2016-06-28 22:30:01 +0000 UTC   retriever-c0da5f85
> > 4     live    2016-06-28 22:29:42 +0000 UTC   retriever-122c6d8e
> > 5             2016-06-28 21:53:48 +0000 UTC
> >
> >
> > PRODUCTION ubuntu@retriever-112c6d8d:/srv/hound/retriever/log$
> > hound-kennel -partition 5 -action nextoffset
> >
> > Next offset for partition 5: 12040353
> >
> >
> > Interestingly, the primary for partition 5 is 1004, and its follower is
> > the new node 1009.  (Partition 2 has 1009 as its leader and 1004 as its
> > follower, and seems just fine.)

Re: kafka + autoscaling groups fuckery

2016-06-28 Thread Pradeep Gollakota
Just out of curiosity, if you guys are in AWS for everything, why not use
Kinesis?

On Tue, Jun 28, 2016 at 3:49 PM, Charity Majors  wrote:

> Hi there,
>
> I just finished implementing kafka + autoscaling groups in a way that made
> sense to me.  I have a _lot_ of experience with ASGs and various storage
> types but I'm a kafka noob (about 4-5 months of using in development and
> staging and pre-launch production).
>
> It seems to be working fine from the Kafka POV but causing troubling side
> effects elsewhere that I don't understand.  I don't know enough about Kafka
> to know if my implementation is just fundamentally flawed for some reason,
> or if so how and why.
>
> My process is basically this:
>
> - *Terminate a node*, or increment the size of the ASG by one.  (I'm not
> doing any graceful shutdowns because I don't want to rely on graceful
> shutdowns, and I'm not attempting to act upon more than one node at a
> time.  Planning on doing a ZK lock or something later to enforce one
> process at a time, if I can work the major kinks out.)
>
> - *Firstboot script,* which runs on all hosts from rc.init.  (We run ASGs
> for *everything.)  It infers things like the chef role, environment,
> cluster name, etc, registers DNS, bootstraps and runs chef-client, etc.
> For storage nodes, it formats and mounts a PIOPS volume under the right
> mount point, or just remounts the volume if it already contains data.  Etc.
>
> - *Run a balancing script from firstboot* on kafka nodes.  It checks to
> see how many brokers there are and what their ids are, and checks for any
> underbalanced partitions with less than 3 ISRs.  Then we generate a new
> assignment file for rebalancing partitions, and execute it.  We watch on
> the host for all the partitions to finish rebalancing, then complete.
>
> *- So far so good*.  I have repeatedly killed kafka nodes and had them
> come up, rebalance the cluster, and everything on the kafka side looks
> healthy.  All the partitions have the correct number of ISRs, etc.
>
> But after doing this, we have repeatedly gotten into a state where
> consumers that are pulling off the kafka partitions enter a weird state
> where their last known offset is *ahead* of the last known offset for that
> partition, and we can't recover from it.
>
> *An example.*  Last night I terminated ... I think it was broker 1002 or
> 1005, and it came back up as broker 1009.  It rebalanced on boot,
> everything looked good from the kafka side.  This morning we noticed that
> the storage node that maps to partition 5 has been broken for like 22
> hours, it thinks the next offset is too far ahead / out of bounds so
> stopped consuming.  This happened shortly after broker 1009 came online and
> the consumer caught up.
>
> From the storage node log:
>
> time="2016-06-28T21:51:48.286035635Z" level=info msg="Serving at
> 0.0.0.0:8089..."
> time="2016-06-28T21:51:48.293946529Z" level=error msg="Error creating
> consumer" error="kafka server: The requested offset is outside the range of
> offsets maintained by the server for the given topic/partition."
> time="2016-06-28T21:51:48.294532365Z" level=error msg="Failed to start
> services: kafka server: The requested offset is outside the range of
> offsets maintained by the server for the given topic/partition."
> time="2016-06-28T21:51:48.29461156Z" level=info msg="Shutting down..."
>
> From the mysql mapping of partitions to storage nodes/statuses:
>
> PRODUCTION ubuntu@retriever-112c6d8d:/srv/hound/retriever/log$
> hound-kennel
>
> Listing by default. Use -action  setstate, addslot, removeslot, removenode> for other actions
>
> Part  Status  Last Updated                    Hostname
> 0     live    2016-06-28 22:29:10 +0000 UTC   retriever-772045ec
> 1     live    2016-06-28 22:29:29 +0000 UTC   retriever-75e0e4f2
> 2     live    2016-06-28 22:29:25 +0000 UTC   retriever-78804480
> 3     live    2016-06-28 22:30:01 +0000 UTC   retriever-c0da5f85
> 4     live    2016-06-28 22:29:42 +0000 UTC   retriever-122c6d8e
> 5             2016-06-28 21:53:48 +0000 UTC
>
>
> PRODUCTION ubuntu@retriever-112c6d8d:/srv/hound/retriever/log$
> hound-kennel -partition 5 -action nextoffset
>
> Next offset for partition 5: 12040353
>
>
> Interestingly, the primary for partition 5 is 1004, and its follower is
> the new node 1009.  (Partition 2 has 1009 as its leader and 1004 as its
> follower, and seems just fine.)
>
> I've attached all the kafka logs for the broker 1009 node since it
> launched yesterday.
>
> I guess my main question is: *Is there something I am fundamentally
> missing about the kafka model that makes it not play well with
> autoscaling?*  I see a couple of other people on the internet talking
> about using ASGs with kafka, but always in the context of maintaining a
> list of broker ids and reusing them.
>
> *I don't want to do that.  I want the path