On Mon, Jul 9, 2012 at 5:49 PM, Liu Yuan <[email protected]> wrote:
> On 07/09/2012 05:45 PM, Liu Yuan wrote:
>> On 07/09/2012 09:58 AM, Liu Yuan wrote:
>>> Got an weird segfault,
>>>
>>> (gdb) where
>>> #0 0x0000000000411936 in do_process_work (work=0xd13c70) at ops.c:992
>>> #1 0x000000000040ed05 in worker_routine (arg=0xd12a20) at work.c:171
>>> #2 0x00007f43f992c971 in start_thread (arg=<value optimized out>) at
>>> pthread_create.c:304
>>> #3 0x00007f43f8eeef3d in clone () at
>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
>>> #4 0x0000000000000000 in ?? ()
>>>
>>> sheep.log:
>>> ...
>>> Jul 09 09:47:23 [main] client_handler(764) connection seems to be dead
>>> Jul 09 09:47:23 [main] clear_client(703) refcnt:0, fd:14, ::1:43328
>>> Jul 09 09:47:23 [main] destroy_client(672) connection from: ::1:43328
>>> Jul 09 09:47:23 [main] cdrv_cpg_deliver(448) 5
>>> Jul 09 09:47:23 [main] sd_notify_handler(851) size: 96, from: IPv4
>>> ip:127.0.0.1 port:7000
>>> Jul 09 09:47:23 [main] client_tx_handler(663) connection from: 13, ::1:43330
>>> Jul 09 09:47:23 [main] client_handler(764) connection seems to be dead
>>> Jul 09 09:47:23 [main] clear_client(703) refcnt:0, fd:13, ::1:43330
>>> Jul 09 09:47:23 [main] destroy_client(672) connection from: ::1:43330
>>> Jul 09 09:47:23 [main] listen_handler(819) accepted a new connection: 13
>>> Jul 09 09:47:23 [main] listen_handler(819) accepted a new connection: 14
>>> Jul 09 09:47:23 [block] do_process_work(990) 80, 0 , 32579 <--- XXX
>>> Jul 09 09:47:23 [main] client_rx_handler(577) connection from: 14, ::1:43337
>>> Jul 09 09:47:23 [main] queue_request(323) 2
>>> Jul 09 09:47:23 [main] crash_handler(408) sheep pid 5326 exited
>>> unexpectedly.
>>>
>>> Thanks,
>>> Yuan
>>>
>>
>> These segmentation fault is suspected to be caused by
>>
>> * <cc458b9> 2012-07-06 sheep: free all requests when connection is dead
>> * <7ce7048> 2012-07-06 sheep: simplify client_decref() and move it into
>> free_request() and add a helper function
>>
>> set.
>>
>
> And other patch set too, I tries
>
> <982d5ab> corosync: fix cluster hang by cluster requests blocking confchg
>
> it works well.

This patch:

<982d5ab> corosync: fix cluster hang by cluster requests blocking confchg

introduces a new problem: in my testing, 99% of the crashes seen when
starting/stopping sheep concurrently were caused by it.

Splitting cluster events into separate notify and cfgchg event lists, and
giving priority to cfgchg events, breaks the event order, which is the most
important thing in a distributed system.

For example: one sheep sends a notify message to the cluster while, at the
same time, other sheep are joining or leaving the cluster. The notify message
is then pushed back behind these join/leave events, so the notify handler is
not executed in time, and some variables are not initialized correctly.

In my testing, starting/stopping sheep concurrently almost always triggers
this segmentation fault:

Program terminated with signal 11, Segmentation fault.
#0 0x00000000004118b4 in has_process_main (op=0x0) at ops.c:981
981 return !!op->process_main;
(gdb) where
#0 0x00000000004118b4 in has_process_main (op=0x0) at ops.c:981
#1 0x00000000004057e7 in prepare_cluster_msg (req=0xb03ca0, sizep=0x7fff129c3640) at group.c:275
#2 0x000000000040585c in cluster_op_done (work=0xb03d60) at group.c:290
#3 0x000000000040ebaf in bs_thread_request_done (fd=12, events=1,

In fact, I have completed a zookeeper patch which also splits the event list
into cfgchg/notify lists, but it has similar problems.

>
> Thanks,
> Yuan
>
>
> --
> sheepdog mailing list
> [email protected]
> http://lists.wpkg.org/mailman/listinfo/sheepdog

--
Yunkai Zhang
Work at Taobao
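
To make the ordering problem concrete, here is a minimal sketch of how draining two per-type lists with cfgchg priority reorders events compared with a single FIFO. It is not sheepdog code: `enum ev_type`, `struct event`, `drain_single()`, `drain_split()` and `MAX_EVENTS` are hypothetical names made up for illustration, and the batch-style drain is a simplification of how events really arrive over time.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event kinds; these are not sheepdog's real types. */
enum ev_type { EV_NOTIFY, EV_CONFCHG };

struct event {
	enum ev_type type;
	int seq;	/* order in which the cluster driver delivered it */
};

#define MAX_EVENTS 16

/* Single FIFO: events are handled in exactly the delivery order. */
size_t drain_single(const struct event *in, size_t n, int *out_seq)
{
	for (size_t i = 0; i < n; i++)
		out_seq[i] = in[i].seq;
	return n;
}

/*
 * Split lists with confchg priority (assumed model of the patch's effect):
 * every pending confchg event is handled before any pending notify event,
 * so a notify delivered first can end up handled last.
 */
size_t drain_split(const struct event *in, size_t n, int *out_seq)
{
	struct event confchg[MAX_EVENTS], notify[MAX_EVENTS];
	size_t nc = 0, nn = 0, k = 0;

	assert(n <= MAX_EVENTS);

	/* Sort incoming events into the two per-type lists. */
	for (size_t i = 0; i < n; i++) {
		if (in[i].type == EV_CONFCHG)
			confchg[nc++] = in[i];
		else
			notify[nn++] = in[i];
	}

	/* Drain the confchg list first, then the notify list. */
	for (size_t i = 0; i < nc; i++)
		out_seq[k++] = confchg[i].seq;
	for (size_t i = 0; i < nn; i++)
		out_seq[k++] = notify[i].seq;
	return k;
}
```

With a delivery order of notify(1), confchg(2), confchg(3), notify(4), `drain_single()` handles the events as 1, 2, 3, 4, while `drain_split()` handles them as 2, 3, 1, 4: the first notify runs after both membership changes, which is the kind of reordering described above.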
