I could go either way.  I think that if ZK is up while the brokers are still 
coming back, Kafka's going to go crazy trying to figure out which broker is 
the leader of what, but maybe I'm not thinking the problem through clearly.

   That does raise the issue, though: it seems like it'd be good to have 
something written down somewhere that says how one should do a whole-cluster 
shutdown if one has to, and how one should recover from a surprise 
whole-cluster shutdown (e.g., someone hits the emergency-power-off button) 
should one happen.  It seems like in any long-lived Kafka cluster that's going 
to happen *eventually*; even if the answer is "you're doomed, start over", at 
least that's documented and can be worked into the plan.

   If one has an emergency power-off, or needs to shut it all down and bring it 
all back up, what *should* be the order of operations, then?
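
   (If Daniel's right that ZK should come up first, I'd guess at something 
like the following, written out as the stock scripts -- and I want to stress 
this is a guess, not anything I've found documented; it also assumes 
controlled.shutdown.enable=true on the brokers:

       # planned whole-cluster shutdown
       bin/kafka-server-stop.sh        # on each broker in turn, letting
                                       # controlled shutdown finish
       bin/zkServer.sh stop            # then each ZK node, last

       # bring-up, in reverse
       bin/zkServer.sh start           # ZK first, wait for quorum
       # then each broker:
       bin/kafka-server-start.sh -daemon config/server.properties

   Confirmation or correction of that is exactly the sort of thing I'd love 
to see written down.)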

   In my particular situation, I think that what would have happened is that 
once we brought the rebuilt hosts back into the cluster, they'd have recreated 
the relevant partitions -- with no data, of course -- and negotiated who's the 
leader, and other than the data loss we might have been OK.

   I'm not *as* clear, though, on what happens to the offsets for those 
partitions in that scenario.  Would Kafka fish up the last offset for those 
partitions from ZK and start there?  Or would the offset for those partitions 
be reset to zero?
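
   (For what it's worth, if the clients are using ZK-based offset storage, 
the committed offsets live under /consumers in Zookeeper, so you can at least 
see what a client thinks its position is.  From zookeeper-client:

       get /consumers/my-group/offsets/_schemas/0

   The "my-group" there is a made-up group name, just to show the znode 
layout.)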

   If it's reset to 0, I could see much client wackiness, as clients say things 
like "message 1000? pah! I don't need that, my offset is 635213513516!", 
leading to desperation moves like changing group IDs or poking Zookeeper in the 
eye from zookeeper-client.
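
   (The "poking Zookeeper in the eye" I have in mind is literally rewriting 
the committed-offset znode by hand, with every consumer in the group stopped 
first -- same made-up group name as above:

       set /consumers/my-group/offsets/_schemas/0 0

   Ugly, and I'm not at all sure it's sanctioned, but that's the sort of 
desperation move I mean.)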

   What's supposed to happen there?

   Finally, one thing that *is* clear is that messing around with the topics 
while things are in this sort of deranged state leads to tears.  We tried to 
do some things like deleting _schemas before the broken hosts were repaved and 
brought back online, and at that point nothing I could do seemed to restore 
_schemas to a functioning state.
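
   (For concreteness: a topic delete is really just a marker znode getting 
created under /admin/delete_topics -- hence the key mentioned further down -- 
and the usual way to ask for one is the admin tool, roughly:

       bin/kafka-topics.sh --zookeeper zk1:2181 --delete --topic _schemas

   with zk1:2181 standing in for the real ZK connect string.)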

   The deletion didn't seem to happen.

   The partition data in ZK ended up being completely missing.

   None of the brokers seemed to want to forget about the metadata for that 
topic because they'd all decided they weren't the leader.  Attempts at getting 
them to redo leader election didn't seem to make a difference.
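
   (The only knob I know of for forcing an election is the 
preferred-replica-election tool, pointed at just the sick partition -- 
something along these lines, with a placeholder ZK host:

       echo '{"partitions": [{"topic": "_schemas", "partition": 0}]}' \
           > /tmp/schemas.json
       bin/kafka-preferred-replica-election.sh --zookeeper zk1:2181 \
           --path-to-json-file /tmp/schemas.json

   If there's a better lever to pull there, I'd love to hear about it.)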

   Restarting the brokers (doing a rolling restart, with 10 minutes in 
between in case things needed to do replication -- which they shouldn't have, 
since we'd cut off the inbound data feeds!) just ended up with the fun of 
https://issues.apache.org/jira/browse/KAFKA-1554.

   Stopping all the brokers at once, deleting the /admin/delete_topics/_schemas 
and /brokers/topics/_schemas keys from ZK, deleting any 10485760-byte index 
files just in case, and deleting the directories for _schemas-0 everywhere, 
then starting everything again, seems to have resulted in a completely 
unstable, unusable cluster, with the same error from KAFKA-1554, but with 
index files that aren't the usual 10485760-byte junk size.
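
   (In zookeeper-client terms, the ZK surgery above amounted to roughly:

       zookeeper-client -server zk1:2181
       rmr /admin/delete_topics/_schemas
       rmr /brokers/topics/_schemas

   with zk1:2181 as a placeholder again, and rmr because 
/brokers/topics/_schemas has per-partition state nested under it.)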

   I figure we'll pave it and start over, but I think it'd be useful (not 
just to me) to have a better idea of the failure states here and how to 
recover from them.

        -Steve

On Fri, Aug 07, 2015 at 08:36:28PM +0000, Daniel Compton wrote:
> I would have thought you'd want ZK up before Kafka started, but I don't
> have any strong data to back that up.
> On Sat, 8 Aug 2015 at 7:59 AM Steve Miller <st...@idrathernotsay.com> wrote:
> 
> >    So... we had an extensive recabling exercise, during which we had to
> > shut down and derack and rerack a whole Kafka cluster.  Then when we
> > brought it back up, we discovered the hard way that two hosts had their
> > "rebuild on reboot" flag set in Cobbler.
> >
> >    Everything on those hosts is gone as a result, of course.  And a total
> > of four partitions had their primary and their replica on the two hosts
> > that were nuked.
> >
> >    This isn't the end of the world, in some sense: it's annoying, but
> > that's why we did this now before we brought the cluster into "real"
> > production rather than being in a pre-production state.  The data is all
> > transient anyway (well, except for _schemas, of course, which in accordance
> > to Murphy's law was one of the topics affected, but we have that mirrored
> > elsewhere).
> >
> >    Still, if there's an obvious way to recover from this, I couldn't find
> > it googling around for a while.
> >
> >    What's the recommended approach here?  Do we need to delete these
> > topics and start over?  Do we need to delete *everything* and start over?
> >
> >    (Also, other than "don't do that!" what's the recommended way to deal
> > with the situation where you need to take a whole cluster down all at
> > once?  Any order of operations related to how you shut down all the Kafka
> > nodes, especially WRT how you shut down Zookeeper?  We deliberately brought
> > Kafka up first *without* ZK, then brought up ZK, so that the brokers
> > wouldn't go nuts with leader election and the like, which seemed to make
> > sense, FWIW.)
> >
> >         -Steve
> >
> -- 
> --
> Daniel
