Re: How to handle broker disk failure

Koert Kuipers Wed, 21 Jan 2015 07:52:02 -0800

same situation with us. we run jbod and actually don't replace the failed
data disks at all. we simply keep boxes running until the number of working
drives falls below some threshold. so our procedure with kafka would be:
1) ideally the kafka server simply survives a failed disk, keeps going, and
fixes itself using the data disks that are left.
2) if the kafka server does not survive a failed drive, can we start it back
up with one less data disk and have it fix itself?
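
to make option 2 concrete: "starting with one less data disk" would roughly
mean restarting the broker with the failed mount left out of log.dirs (the
comma-separated list of data directories in server.properties). a minimal
python sketch of how that line could be rebuilt is below. the function names
and the write-probe approach are my own assumptions for illustration, not
anything kafka ships, and whether the broker then re-replicates the missing
partitions is exactly the open question.

```python
import tempfile


def usable_log_dirs(log_dirs):
    """Return the directories from log_dirs that exist and accept a test write.

    Hypothetical helper: the probe is a simple create-and-delete of a temp file.
    """
    usable = []
    for directory in log_dirs:
        try:
            # A missing, dead, or read-only mount raises OSError here.
            with tempfile.NamedTemporaryFile(dir=directory):
                pass
            usable.append(directory)
        except OSError:
            continue
    return usable


def log_dirs_property(log_dirs):
    """Format the surviving directories as a server.properties line."""
    return "log.dirs=" + ",".join(usable_log_dirs(log_dirs))
```

calling log_dirs_property with the currently configured directories gives the
line to put in server.properties before the restart; directories that fail the
probe are simply dropped.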



On Wed, Jan 21, 2015 at 6:11 AM, svante karlsson <s...@csi.se> wrote:

> Is it possible to continue to serve topics from the remaining disks while
> waiting for a replacement disk, or will the broker exit/stop working? (We
> would like to be able to replace disks in a relaxed manner, since our
> datacenter is colocated and we don't have permanent staff there; there
> is simply not enough to do to justify 24h staffing.)
>
> If we trigger a rebalance during the downtime, the under-replicated
> topics/partitions will hopefully be moved somewhere else? What happens
> then when we add the broker again, now with a new empty disk? Will all
> over-replicated partitions be removed from the reinserted broker, and
> finally, should/must we trigger a rebalance?
>
> /svante
>
> 2015-01-21 2:56 GMT+01:00 Jun Rao <j...@confluent.io>:
>
> > Actually, you don't need to reassign partitions in this case. You just
> > need to replace the bad disk and restart the broker. It will copy the
> > missing data over automatically.
> >
> > Thanks,
> >
> > Jun
> >
> > On Tue, Jan 20, 2015 at 1:02 AM, svante karlsson <s...@csi.se> wrote:
> >
> > > I'm trying to figure out the best way to handle a disk failure in a
> > > live environment.
> > >
> > > The obvious (and naive) solution is to decommission the broker and let
> > > other brokers take over and create new followers. Then replace the
> > > disk, clean the remaining log directories, and add the broker again.
> > >
> > > The disadvantage with this approach is of course the network overhead
> > > and the time it takes to reassign partitions.
> > >
> > > Is there a better way?
> > >
> > > As a sub question, is it possible to continue running a broker with a
> > > failed drive and still serve the remaining partitions?
> > >
> > > thanks,
> > > svante
> > >
> >
>
