From: Yunkai Zhang <[email protected]>

In old code, corosync driver could not process blocked event
correctly.

For example:
Suppose there was two requests: R1, R2.
1) In queue_cluster_request(), R1 sent a BLOCK event(B1) to cluster,
   R1 was added to sys->pending_list.
2) When B1 was received, cdrv_cpg_deliver() was executed, and sd_block_handler()
   would be called in __dispatch_corosycn_one(). sd_block_handler() will get R1
   from sys->pending_list(but not delete R1 from it), cluster_op_running was set
   TRUE, and then queue_work().
3) Before cluster_op_done() of R1 was executed, R2 was coming and sent a BLOCK
   event(B2) to cluster in queue_cluster_request().
4) Now, cluster_op_done() of R1 was called, R1 sent an UNBLOCK event(U1) to
   cluster, and cluster_op_running was set FALSE.
5) And then, B2 was received, cdrv_cpg_deliver()->__dispatch_corosycn_one()
   would be called. Because cluster_op_running was FALSE again, so
   sd_block_handler() would be executed again, as R1 was also at the head of
   sys->pending_list, then R1 would be queue_work() again, ..., this bug will
   lead to so many segment fault.

Accord has the same problem, I will fix it in next patch. But zookeeper dirver
works well at this situation.

Signed-off-by: Yunkai Zhang <[email protected]>
---
 sheep/cluster/corosync.c |   14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/sheep/cluster/corosync.c b/sheep/cluster/corosync.c
index bd955bb..7810a2e 100644
--- a/sheep/cluster/corosync.c
+++ b/sheep/cluster/corosync.c
@@ -38,6 +38,7 @@ static struct cpg_node cpg_nodes[SD_MAX_NODES];
 static size_t nr_cpg_nodes;
 static int self_elect;
 static int join_finished;
+static int corosync_notify_blocked;
 
 /* event types which are dispatched in corosync_dispatch() */
 enum corosync_event_type {
@@ -342,7 +343,8 @@ static int __corosync_dispatch_one(struct corosync_event 
*cevent)
                sd_leave_handler(&cevent->sender.ent, entries, nr_cpg_nodes);
                break;
        case COROSYNC_EVENT_TYPE_BLOCK:
-               sd_block_handler(&cevent->sender.ent);
+               if (sd_block_handler(&cevent->sender.ent))
+                       corosync_notify_blocked = 1;
 
                /* block other messages until the unblock message comes */
                return 0;
@@ -368,6 +370,14 @@ static void __corosync_dispatch(void)
                        cevent = list_first_entry(&corosync_notify_list,
                                                  typeof(*cevent), list);
 
+               /*
+                * When there is unfinished blocked event, only cfgchg events
+                * could continue to be processed(as we have given priotiry to
+                * process cfgchg events now).
+                */
+               if (!event_is_confchg(cevent->type) && corosync_notify_blocked)
+                       return;
+
                /* update join status */
                if (!join_finished) {
                        switch (cevent->type) {
@@ -687,6 +697,8 @@ static void corosync_unblock(void *msg, size_t msg_len)
 {
        send_message(COROSYNC_MSG_TYPE_UNBLOCK, 0, &this_node, NULL, 0,
                     msg, msg_len);
+
+       corosync_notify_blocked = 0;
 }
 
 static int corosync_notify(void *msg, size_t msg_len)
-- 
1.7.10.4

-- 
sheepdog mailing list
[email protected]
http://lists.wpkg.org/mailman/listinfo/sheepdog

Reply via email to