On Tue, Jul 10, 2012 at 9:04 AM, Liu Yuan <[email protected]> wrote:
> On 07/09/2012 08:43 PM, Yunkai Zhang wrote:
>> This patch:
>>
>> <982d5ab> corosync: fix cluster hang by cluster requests blocking confchg
>>
>> will cause new problem, 99% crash was caused by this patch in my
>> testing when start/stop sheep concurrently.
>>
>
> No, why are you so sure of this when you don't catch exact lines of
> code? I have to remind you that op=0x0 case exists long before the
> 982d5ab patch. Actually 982d5ab patch does reduce the chances of
> segfault. I got the 982d5ab patch when I have been debugging op=0x0 bug,
> where I found there might be multiple bugs inside.
>
>> Splitting cluster event into notify/cfgchg event lists and giving
>> priority to process cfgchg event will break the event order which is
>> the most important thing in distributed system.
>>
>> For example:
>>
>> One sheep send a notify message to the cluster, but at the same time,
>> there are sheeps joining/leaving into the cluster, then the notify
>> message was pushed back by these join/leave events, so the notify
>> handler could not be executed opportunely, it will cause some
>> variables not to be initialized correctly.
>>
>
> Please specify what is the problem. 'not executed opportunely' speaks
> nothing.
>
> For now, total order of confchg is kept and priority of handle confchg
> doesn't break things unless We find any real proofs.
>
> Again, please come to any conclusion before you get the real evidence.
>
> Thanks,
> Yuan
>
>> In my testing, if we start/stop sheep concurrently, this segment fault
>> will nearly be caused:
>> Program terminated with signal 11, Segmentation fault.
>> #0  0x00000000004118b4 in has_process_main (op=0x0) at ops.c:981
>> 981     return !!op->process_main;
>> (gdb) where
>> #0  0x00000000004118b4 in has_process_main (op=0x0) at ops.c:981
>> #1  0x00000000004057e7 in prepare_cluster_msg (req=0xb03ca0,
>> sizep=0x7fff129c3640) at group.c:275
>> #2  0x000000000040585c in cluster_op_done (work=0xb03d60) at group.c:290
>> #3  0x000000000040ebaf in bs_thread_request_done (fd=12, events=1,
>>
>>
>> In fact, I have completed zookeeper patch which will also split event
>> list into cfgchg/notify list, but it has the similar problems.
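To make the ordering concern above concrete, here is a minimal sketch. It is not sheepdog code: the names (`ev_list`, `drain_single`, `drain_split`) and the `seq` field are hypothetical, and it assumes the worst case where every event is already queued before any handler runs. It only shows that draining a confchg list first preserves the order among confchg events while moving an earlier notify event behind them. Whether that reordering is what leads to op=0x0 is not established by this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event kinds; illustrative only, not sheepdog's types. */
enum ev_type { EV_NOTIFY, EV_CONFCHG };

struct event {
	enum ev_type type;
	int seq;	/* assumed: order in which the driver delivered it */
};

#define MAX_EVENTS 16

/* Tiny fixed-size FIFO, enough for the illustration. */
struct ev_list {
	struct event ev[MAX_EVENTS];
	size_t head, tail;
};

static void ev_push(struct ev_list *l, struct event e)
{
	assert(l->tail < MAX_EVENTS);
	l->ev[l->tail++] = e;
}

static int ev_empty(const struct ev_list *l)
{
	return l->head == l->tail;
}

static struct event ev_pop(struct ev_list *l)
{
	assert(!ev_empty(l));
	return l->ev[l->head++];
}

/* One list: events are handled exactly in delivery order. */
static size_t drain_single(const struct event *in, size_t n, int *out)
{
	struct ev_list q = { .head = 0, .tail = 0 };
	size_t i, k = 0;

	for (i = 0; i < n; i++)
		ev_push(&q, in[i]);
	while (!ev_empty(&q))
		out[k++] = ev_pop(&q).seq;
	return k;
}

/* Two lists, confchg drained first: notify events fall behind. */
static size_t drain_split(const struct event *in, size_t n, int *out)
{
	struct ev_list confchg = { .head = 0, .tail = 0 };
	struct ev_list notify = { .head = 0, .tail = 0 };
	size_t i, k = 0;

	for (i = 0; i < n; i++)
		ev_push(in[i].type == EV_CONFCHG ? &confchg : &notify, in[i]);
	while (!ev_empty(&confchg))
		out[k++] = ev_pop(&confchg).seq;
	while (!ev_empty(&notify))
		out[k++] = ev_pop(&notify).seq;
	return k;
}
```

With a notify event delivered first and two confchg events after it, `drain_single` handles them in delivery order, while `drain_split` handles both confchg events before the notify. The relative order of the two confchg events is the same in both cases.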
Let me dig more.

--
Yunkai Zhang
Work at Taobao
--
sheepdog mailing list
[email protected]
http://lists.wpkg.org/mailman/listinfo/sheepdog
