On Mon, Aug 20, 2012 at 9:00 PM, MORITA Kazutaka <[email protected]> wrote:
> At Thu, 9 Aug 2012 16:43:38 +0800,
> Yunkai Zhang wrote:
>>
>> From: Yunkai Zhang <[email protected]>
>>
>> V2:
>> - fix a typo
>> - when an object is updated, delete its old version
>> - reset cluster recovery state in finish_recovery()
>>
>> Yunkai Zhang (11):
>>   sheep: enable variable-length join_message in response of join event
>>   sheep: share joining nodes with newly added sheep
>>   sheep: delay to process recovery caused by LEAVE event just like JOIN event
>>   sheep: don't cleanup working directory when sheep joined back
>>   sheep: read objects only from live nodes
>>   sheep: write objects only on live nodes
>>   sheep: mark dirty object that belongs to the leaving nodes
>>   sheep: send dirty object list to each sheep when cluster do recovery
>>   sheep: do recovery with dirty object list
>>   collie: update 'collie cluster recover info' commands
>>   collie: update doc about 'collie cluster recover disable'
>>
>>  collie/cluster.c          |  46 ++++++++---
>>  include/internal_proto.h  |  32 ++++++--
>>  include/sheep.h           |  23 ++++++
>>  man/collie.8              |   2 +-
>>  sheep/cluster.h           |  29 +------
>>  sheep/cluster/accord.c    |   2 +-
>>  sheep/cluster/corosync.c  |   9 ++-
>>  sheep/cluster/local.c     |   2 +-
>>  sheep/cluster/zookeeper.c |   2 +-
>>  sheep/farm/trunk.c        |   2 +-
>>  sheep/gateway.c           |  39 ++++++++-
>>  sheep/group.c             | 202 +++++++++++++++++++++++++++++++++++++++-----
>>  sheep/object_list_cache.c | 182 +++++++++++++++++++++++++++++++++++++++--
>>  sheep/ops.c               |  85 ++++++++++++++++---
>>  sheep/recovery.c          | 133 +++++++++++++++++++++++++++---
>>  sheep/sheep_priv.h        |  57 ++++++++++++-
>>  16 files changed, 743 insertions(+), 104 deletions(-)
>
> I've looked into this series, and IMHO the change is too complex.
>
> With this series, when recovery is disabled and there are left nodes,
> sheep can succeed in a write operation even if the data is not fully
> replicated. But, if we allow it, it is difficult to prevent VMs from
> reading old data.

Actually, this series puts a lot of effort into exactly that problem.
We want to upgrade sheepdog without impacting the online VMs, so we need
to allow all VMs to keep doing write operations while recovery is
disabled (this is important for a big cluster: we can't assume users will
stop their work during that time). We also assume that this window is
short and that we upgrade sheepdog as quickly as possible (< 5 minutes).
This patch series is implemented based on those assumptions.

It may look difficult, but the algorithm is clear, just three steps (from
the description in the 9th patch's commit log; a rough sketch of the
screening in step 1 follows the list):

1) If a sheep joins back to the cluster, some objects may have been
   deleted after this sheep left, and such objects still stay in its
   working directory. After recovery starts, this sheep sends its object
   list to the other sheep, so after fetching all object lists from the
   cluster, each sheep should screen these deleted objects out.

2) A sheep that left and joined back should drop its old-version objects
   and recover the new ones from the other sheep.

3) Objects that have been updated should not be recovered from a sheep
   that joined back.
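To make step 1 concrete, here is a simplified sketch of the screening.
The helper names (oid_in_list, screen_out_deleted) and the flat, sorted
oid arrays are made up for illustration only; they are not the actual
interfaces in sheep/object_list_cache.c:

/*
 * Sketch of step 1: drop deleted objects from the object lists fetched
 * from the cluster.  Illustrative only, not the real sheepdog code.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

/* binary search: returns true if 'oid' appears in the sorted array 'list' */
static bool oid_in_list(uint64_t oid, const uint64_t *list, size_t nr)
{
	size_t lo = 0, hi = nr;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (list[mid] == oid)
			return true;
		if (list[mid] < oid)
			lo = mid + 1;
		else
			hi = mid;
	}
	return false;
}

/*
 * Drop every oid in 'fetched' (the object lists collected from all sheep,
 * merged into one array) that appears in 'deleted' (objects removed from
 * the cluster while the rejoined sheep was away).  Compacts 'fetched' in
 * place and returns its new length, so recovery only ever sees objects
 * that still exist in the current cluster view.
 */
static size_t screen_out_deleted(uint64_t *fetched, size_t nr_fetched,
				 const uint64_t *deleted, size_t nr_deleted)
{
	size_t i, kept = 0;

	for (i = 0; i < nr_fetched; i++)
		if (!oid_in_list(fetched[i], deleted, nr_deleted))
			fetched[kept++] = fetched[i];

	return kept;
}

Steps 2 and 3 then decide, for each remaining object, whether the local
copy is stale; how that comparison is done depends on the bookkeeping in
the actual patches, so it is not shown here.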
> I'd suggest allowing epoch increment even when recovery is
> disabled. If recovery work recovers only rw->prio_oids and delays the
> recovery of rw->oids, I think we can get a similar benefit in a much
> simpler way:
> http://www.mail-archive.com/[email protected]/msg05439.html

In fact, I have considered this method, but it faces nearly the same
problem: after a sheep joins back, it still needs to know which objects
are dirty, and it still has to do the cleanup work (because old-version
objects stay in its working directory). So this method does not seem to
save those steps, and it will do extra recovery work.

>
> Thanks,
>
> Kazutaka

--
Yunkai Zhang
Work at Taobao
--
sheepdog mailing list
[email protected]
http://lists.wpkg.org/mailman/listinfo/sheepdog