** Branch linked: lp:~andreserl/ubuntu/natty/keepalived/lp619712

-- 
You received this bug notification because you are a member of Ubuntu
High Availability Team, which is subscribed to keepalived in Ubuntu.
https://bugs.launchpad.net/bugs/619712

Title:
  keepalived vrrp race condition and fix (versions 1.1.17 and 1.2.0 but
  perhaps all?)

Status in Release Notes for Ubuntu:
  Invalid
Status in “keepalived” package in Ubuntu:
  Fix Committed
Status in “keepalived” source package in Maverick:
  Won't Fix

Bug description:
  Binary package hint: keepalived

  Hello,

  About a keepalived vrrp race condition and fix (versions 1.1.17 and 1.2.0)
  and probably all other versions.

  Impact: possible complete/partial datacenter disruptions due to gratious
  arp/multicast floods


  The situation to reproduce the race condition

  The race condition got triggered in our situation with 1 cluster of 4
  machines, sharing  3 (vlan) interfaces.

  Reason for doing doing so, not really importand, using vrrp and redundant
  cluster (2x2) for changing the default gateway, in case of a lease line
  failure.

  
  The most basic setup, and easyest way to trigger is a subset of above
  -3 machines, with keepalived
  -2 interfaces/machine, defined in keepalived as vrrp_instance
  -1 vrrp_sync_group, with those interfaces included

  some key settings:

  machine1:
        initial state: state BACKUP
        int 1, prio 250
        int 2, prio 250
  machine2:
        initial state: state BACKUP
        int 1, prio 200
        int 2, prio 200
  machine3:
        initial state: state MASTER
        int 1, prio 150
        int 2, prio 150


  Machine3 is set to master, just to trigger the race condition, but in
  initial state BACKUP, a connection failure (without a link down) will do the
  same, which is the situation we encountered.

  
  WARNING: below may/will cause network disruption, so be prepared.

  The way to trigger, is having started keepalived on machine 1 & 2, where
  machine 1 gently becomes a master as meant/configured.

  Now, start machine3, and here it happens, as soon machine sends a single
  vrrp message (as it want's to be master), machine 1 & 2 get's crazy(haywire), 
and
  come in a race condition which will never stop, until you stop one of
  machine 1 or 2. Stopping machine 3 will definitly NOT stop the race.

  
  A sample of the relevant logging of the race:

  machine 1:
  Aug 16 13:22:27 saaplsprd01 Keepalived_vrrp: VRRP_Instance(AMV610) Received 
lower prio advert, forcing new election
  Aug 16 13:22:27 saaplsprd01 Keepalived_vrrp: VRRP_Instance(AMV910) Received 
lower prio advert, forcing new election
  Aug 16 13:22:27 saaplsprd01 Keepalived_vrrp: VRRP_Instance(AMV610) Received 
lower prio advert, forcing new election
  Aug 16 13:22:27 saaplsprd01 Keepalived_vrrp: VRRP_Instance(AMV910) Received 
lower prio advert, forcing new election
  Aug 16 13:22:27 saaplsprd01 Keepalived_vrrp: VRRP_Instance(AMV610) Received 
lower prio advert, forcing new election
  Aug 16 13:22:27 saaplsprd01 Keepalived_vrrp: VRRP_Instance(AMV910) Received 
lower prio advert, forcing new election
  Aug 16 13:22:27 saaplsprd01 Keepalived_vrrp: VRRP_Instance(AMV610) Received 
lower prio advert, forcing new election
  Aug 16 13:22:27 saaplsprd01 Keepalived_vrrp: VRRP_Instance(AMV910) Received 
lower prio advert, forcing new election
  Aug 16 13:22:27 saaplsprd01 Keepalived_vrrp: VRRP_Instance(AMV610) Received 
lower prio advert, forcing new election
  Aug 16 13:22:27 saaplsprd01 Keepalived_vrrp: VRRP_Instance(AMV910) Received 
lower prio advert, forcing new election
  Aug 16 13:22:27 saaplsprd01 Keepalived_vrrp: VRRP_Instance(AMV610) Received 
lower prio advert, forcing new election
  Aug 16 13:22:27 saaplsprd01 Keepalived_vrrp: VRRP_Instance(AMV910) Received 
lower prio advert, forcing new election
  Aug 16 13:22:27 saaplsprd01 Keepalived_vrrp: VRRP_Instance(AMV610) Received 
lower prio advert, forcing new election
  Aug 16 13:22:27 saaplsprd01 Keepalived_vrrp: VRRP_Instance(AMV910) Received 
lower prio advert, forcing new election
  Aug 16 13:22:27 saaplsprd01 Keepalived_vrrp: VRRP_Instance(AMV610) Received 
lower prio advert, forcing new election
  Aug 16 13:22:27 saaplsprd01 Keepalived_vrrp: VRRP_Instance(AMV910) Received 
lower prio advert, forcing new election
  Aug 16 13:22:27 saaplsprd01 Keepalived_vrrp: VRRP_Instance(AMV610) Received 
lower prio advert, forcing new election
  Aug 16 13:22:27 saaplsprd01 Keepalived_vrrp: VRRP_Instance(AMV910) Received 
lower prio advert, forcing new election
  Aug 16 13:22:27 saaplsprd01 Keepalived_vrrp: VRRP_Instance(AMV610) Received 
lower prio advert, forcing new election
  Aug 16 13:22:27 saaplsprd01 Keepalived_vrrp: VRRP_Instance(AMV910) Received 
lower prio advert, forcing new election
  Aug 16 13:22:27 saaplsprd01 Keepalived_vrrp: VRRP_Instance(AMV610) Received 
lower prio advert, forcing new election
  Aug 16 13:22:27 saaplsprd01 Keepalived_vrrp: VRRP_Instance(AMV910) Received 
lower prio advert, forcing new election

  
  machine 2:
  Aug 16 13:22:27 saaplsprd02 Keepalived_vrrp: VRRP_Instance(AMV610) forcing a 
new MASTER election
  Aug 16 13:22:27 saaplsprd02 Keepalived_vrrp: VRRP_Group(GROUP1) Transition to 
MASTER state
  Aug 16 13:22:27 saaplsprd02 Keepalived_vrrp: VRRP_Instance(AMV910) forcing a 
new MASTER election
  Aug 16 13:22:27 saaplsprd02 Keepalived_vrrp: VRRP_Group(GROUP1) Transition to 
MASTER state
  Aug 16 13:22:27 saaplsprd02 Keepalived_vrrp: VRRP_Instance(AMV910) forcing a 
new MASTER election
  Aug 16 13:22:27 saaplsprd02 Keepalived_vrrp: VRRP_Group(GROUP1) Transition to 
MASTER state
  Aug 16 13:22:27 saaplsprd02 Keepalived_vrrp: VRRP_Instance(AMV610) forcing a 
new MASTER election
  Aug 16 13:22:27 saaplsprd02 Keepalived_vrrp: VRRP_Group(GROUP1) Transition to 
MASTER state
  Aug 16 13:22:27 saaplsprd02 Keepalived_vrrp: VRRP_Instance(AMV910) forcing a 
new MASTER election
  Aug 16 13:22:27 saaplsprd02 Keepalived_vrrp: VRRP_Group(GROUP1) Transition to 
MASTER state
  Aug 16 13:22:27 saaplsprd02 Keepalived_vrrp: VRRP_Instance(AMV610) forcing a 
new MASTER election
  Aug 16 13:22:27 saaplsprd02 Keepalived_vrrp: VRRP_Group(GROUP1) Transition to 
MASTER state
  Aug 16 13:22:27 saaplsprd02 Keepalived_vrrp: VRRP_Instance(AMV910) forcing a 
new MASTER election
  Aug 16 13:22:27 saaplsprd02 Keepalived_vrrp: VRRP_Group(GROUP1) Transition to 
MASTER state
  Aug 16 13:22:27 saaplsprd02 Keepalived_vrrp: VRRP_Instance(AMV610) forcing a 
new MASTER election
  Aug 16 13:22:27 saaplsprd02 Keepalived_vrrp: VRRP_Group(GROUP1) Transition to 
MASTER state
  Aug 16 13:22:27 saaplsprd02 Keepalived_vrrp: VRRP_Instance(AMV910) forcing a 
new MASTER election
  Aug 16 13:22:27 saaplsprd02 Keepalived_vrrp: VRRP_Group(GROUP1) Transition to 
MASTER state
  Aug 16 13:22:27 saaplsprd02 Keepalived_vrrp: VRRP_Instance(AMV610) forcing a 
new MASTER election
  Aug 16 13:22:27 saaplsprd02 Keepalived_vrrp: VRRP_Group(GROUP1) Transition to 
MASTER state
  Aug 16 13:22:27 saaplsprd02 Keepalived_vrrp: VRRP_Instance(AMV910) forcing a 
new MASTER election
  Aug 16 13:22:27 saaplsprd02 Keepalived_vrrp: VRRP_Group(GROUP1) Transition to 
MASTER state


  After some review of the keepalived code, and analyzed traces, and
  researching the situation, i came to a fix of the issue.  Be aware i've
  never looked at the keepalived code before, so perhaps, there may be a more
  ultimate fix.

  This patch prevents sending another vrrp message in the
  vrrp_sync.c: vrrp_sync_master_election() when already in the isync->wantstate 
is in a
  VRRP_STATE_GOTO_MASTER  state.

  old code in vrrp_sync.c:
  void
  vrrp_sync_master_election(vrrp_rt * vrrp)
  {
          vrrp_rt *isync;
          vrrp_sgroup *vgroup = vrrp->sync;
          list l = vgroup->index_list;
          element e;

          if (vrrp->wantstate != VRRP_STATE_GOTO_MASTER)
                  return;
          if (GROUP_STATE(vgroup) == VRRP_STATE_FAULT)
                  return;

          log_message(LOG_INFO, "VRRP_Group(%s) Transition to MASTER state",
                 GROUP_NAME(vgroup));

          /* Perform sync index */
          for (e = LIST_HEAD(l); e; ELEMENT_NEXT(e)) {
                  isync = ELEMENT_DATA(e);
                  if (isync != vrrp) {
                          /* Force a new protocol master election */
                          isync->wantstate = VRRP_STATE_GOTO_MASTER;
                          log_message(LOG_INFO,
                                 "VRRP_Instance(%s) forcing a new MASTER 
election",
                                 isync->iname);
                          vrrp_send_adv(isync, isync->effective_priority);
                  }
          }
  }

  
  patched code in vrrp_sync.c:
  void
  vrrp_sync_master_election(vrrp_rt * vrrp)
  {
          vrrp_rt *isync;
          vrrp_sgroup *vgroup = vrrp->sync;
          list l = vgroup->index_list;
          element e;

          if (vrrp->wantstate != VRRP_STATE_GOTO_MASTER)
                  return;
          if (GROUP_STATE(vgroup) == VRRP_STATE_FAULT)
                  return;

          log_message(LOG_INFO, "VRRP_Group(%s) Transition to MASTER state",
                 GROUP_NAME(vgroup));

          /* Perform sync index */
          for (e = LIST_HEAD(l); e; ELEMENT_NEXT(e)) {
                  isync = ELEMENT_DATA(e);
                  if (isync != vrrp) {
                          if ( isync->wantstate != VRRP_STATE_GOTO_MASTER ) {   
  // Fix ArFi 2010/08/16 to prevent VRRP loops with 3 or more members and 2 or 
more shared interfaces in 1 group 
                          /* Force a new protocol master election */
                          isync->wantstate = VRRP_STATE_GOTO_MASTER;
                          log_message(LOG_INFO,
                                 "VRRP_Instance(%s) forcing a new MASTER 
election",
                                 isync->iname);
                          vrrp_send_adv(isync, isync->effective_priority);
                          }
                  }
          }
  }

  
  diff keepalived-1.1.17:

  # diff  -u vrrp_sync.c.orig_ vrrp_sync.c
  --- vrrp_sync.c.orig_ 2010-08-17 14:12:30.944634215 +0200
  +++ vrrp_sync.c       2010-08-17 14:15:05.964634624 +0200
  @@ -157,12 +157,14 @@
        for (e = LIST_HEAD(l); e; ELEMENT_NEXT(e)) {
                isync = ELEMENT_DATA(e);
                if (isync != vrrp) {
  +                     if ( isync->wantstate != VRRP_STATE_GOTO_MASTER ) {     
// Fix ArFi 2010/08/16 to prevent VRRP loops with 3 or more members and 2 or 
more shared interfaces in 1 group 
                        /* Force a new protocol master election */
                        isync->wantstate = VRRP_STATE_GOTO_MASTER;
                        log_message(LOG_INFO,
                               "VRRP_Instance(%s) forcing a new MASTER 
election",
                               isync->iname);
                        vrrp_send_adv(isync, isync->effective_priority);
  +                     }
                }
        }
   }

  
  diff keepalived-1.2.0:

  # diff -u vrrp_sync.c.orig_ vrrp_sync.c
  --- vrrp_sync.c.orig_ 2010-08-18 08:17:52.766910325 +0200
  +++ vrrp_sync.c       2010-08-18 08:20:23.216968929 +0200
  @@ -155,12 +155,14 @@
        for (e = LIST_HEAD(l); e; ELEMENT_NEXT(e)) {
                isync = ELEMENT_DATA(e);
                if (isync != vrrp) {
  +                     if ( isync->wantstate != VRRP_STATE_GOTO_MASTER ) {     
// Fix ArFi 2010/08/16 to prevent VRRP loops with 3 or more members and 2 or 
more shared interfaces in 1 group
                        /* Force a new protocol master election */
                        isync->wantstate = VRRP_STATE_GOTO_MASTER;
                        log_message(LOG_INFO,
                               "VRRP_Instance(%s) forcing a new MASTER 
election",
                               isync->iname);
                        vrrp_send_adv(isync, isync->effective_priority);
  +                     }
                }
        }
   }




  This solves our issue, and we did some basic testing with good results, but 
it may be wise an
  experienced keepalived developer should have a look at it too.

  
  PS: marked this as a security vulnerability as it _may_ bring a 
system/datacenter
  down, by just a single (vrrp) packet.
  Posted this bugreport as well on the keepalived-de...@lists.sourceforge.net 
ML.

  
  Regards,

  Arjan Filius
  iafil...@xs4all.nl

_______________________________________________
Mailing list: https://launchpad.net/~ubuntu-ha
Post to     : ubuntu-ha@lists.launchpad.net
Unsubscribe : https://launchpad.net/~ubuntu-ha
More help   : https://help.launchpad.net/ListHelp

Reply via email to