Hi!

While working with the bonding driver in mode 4 (LACP), I have several times
run into a situation where a link aggregation port stopped forwarding packets
after some time of normal operation. Recreating the aggregation group on the
switch didn't help in those situations. The only way out was to restart my
application.

I started investigating the source code of the bonding driver and discovered
that the rx_machine() function doesn't follow the IEEE Std 802.1AX-2008
standard.

It looks like the following part of the rx_machine() code implements
the recordPDU function described in section "5.4.9 Functions" of the
standard.

                bool match = port->actor.system_priority ==
                        lacp->partner.port_params.system_priority &&
                        is_same_ether_addr(&agg->actor.system,
                        &lacp->partner.port_params.system) &&
                        port->actor.port_priority ==
                        lacp->partner.port_params.port_priority &&
                        port->actor.port_number ==
                        lacp->partner.port_params.port_number;
                        
                ...

                /* If LACP partner params match this port actor params */
                if (match == true && ACTOR_STATE(port, AGGREGATION) ==
                                PARTNER_STATE(port,     AGGREGATION))
                        PARTNER_STATE_SET(port, SYNCHRONIZATION);
                else if (!PARTNER_STATE(port, AGGREGATION) && ACTOR_STATE(port,
                                AGGREGATION))
                        PARTNER_STATE_SET(port, SYNCHRONIZATION);
                else
                        PARTNER_STATE_CLR(port, SYNCHRONIZATION);

Problem #1:
According to the recordPDU function, the "Partner_Key" parameter carried in the
received PDU should be compared to Actor_Oper_Port_Key.
But the bonding driver doesn't do this. It only compares system_priority,
system, port_priority and port_number when evaluating the match variable.
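
To illustrate what I mean, here is a minimal standalone sketch of a parameter
comparison that includes the key. The struct and function names
(sketch_port_params, sketch_params_match) are hypothetical stand-ins, not the
driver's real types; only the list of compared fields comes from the
description above.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the driver's port parameter structure. */
struct sketch_port_params {
	uint16_t system_priority;
	uint8_t  system[6];	/* system MAC address */
	uint16_t key;
	uint16_t port_priority;
	uint16_t port_number;
};

/*
 * Return true when the partner parameters carried in a received PDU
 * equal the local actor's parameters, with the key included in the
 * comparison.
 */
static bool
sketch_params_match(const struct sketch_port_params *actor,
		const struct sketch_port_params *pdu_partner)
{
	return actor->system_priority == pdu_partner->system_priority &&
		memcmp(actor->system, pdu_partner->system,
			sizeof(actor->system)) == 0 &&
		actor->key == pdu_partner->key &&
		actor->port_priority == pdu_partner->port_priority &&
		actor->port_number == pdu_partner->port_number;
}
```

With this comparison, a PDU whose partner key differs from the local key no
longer counts as a match, even if the other four fields are equal.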

Problem #2:
Also, the standard indicates that:
"Partner_Oper_Port_State.Synchronization is set to TRUE if all of these
parameters match, Actor_State.Synchronization in the received PDU is set to
TRUE, and LACP will actively maintain the link in the aggregation."

But the bonding driver doesn't check that Actor_State.Synchronization in the 
received PDU is set to TRUE.

Problem #3:
Also, the standard indicates that:
"Partner_Oper_Port_State.Synchronization is also set to TRUE if the value of
Actor_State.Aggregation in the received PDU is set to FALSE (i.e., indicates an
Individual link), Actor_State.Synchronization in the received PDU is set to
TRUE, and LACP will actively maintain the link."

The bonding driver only partly follows that rule: it doesn't check
that Actor_State.Synchronization in the received PDU is set to TRUE.
It also checks ACTOR_STATE(port, AGGREGATION), although the standard
doesn't say anything about this.
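
As a rough illustration, the two quoted rules can be written as one small
decision function. This is only a sketch under assumptions: the function name
and the SKETCH_* constants are hypothetical, params_match stands for "all of
these parameters match", and lacp_maintained stands for "LACP will actively
maintain the link", which is passed in rather than derived here. The bit
values are the Aggregation and Synchronization positions in the LACPDU state
octet.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions in the LACPDU state octet. */
#define SKETCH_STATE_AGGREGATION	0x04
#define SKETCH_STATE_SYNCHRONIZATION	0x08

/*
 * Decide whether the partner should be recorded as in sync.
 * pdu_actor_state is the Actor_State octet of the received PDU.
 */
static bool
sketch_partner_sync(bool params_match, uint8_t pdu_actor_state,
		bool lacp_maintained)
{
	bool pdu_aggr = (pdu_actor_state & SKETCH_STATE_AGGREGATION) != 0;
	bool pdu_sync = (pdu_actor_state & SKETCH_STATE_SYNCHRONIZATION) != 0;

	/* Both rules require the sender's sync bit and a maintained link. */
	if (!pdu_sync || !lacp_maintained)
		return false;

	/* Rule 1: all compared parameters match. */
	if (params_match)
		return true;

	/* Rule 2: the sender reports an Individual link. */
	return !pdu_aggr;
}
```

In this form, a PDU with the Synchronization bit cleared never sets the
partner sync flag, which is the check missing in Problems #2 and #3.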

My proposal is to replace the partner state sync flag evaluation block with the
following one in order to follow the standard more strictly:

                /* If LACP partner params match this port actor params */
                if ((match == true &&
                                lacp->partner.port_params.key == port->actor.key &&
                                ACTOR_STATE(port, AGGREGATION) ==
                                PARTNER_STATE(port, AGGREGATION) &&
                                STATE_FLAG(lacp->actor.state, SYNCHRONIZATION) == true) ||
                        (STATE_FLAG(lacp->actor.state, AGGREGATION) == false &&
                                STATE_FLAG(lacp->actor.state, SYNCHRONIZATION) == true))
                        PARTNER_STATE_SET(port, SYNCHRONIZATION);
                else
                        PARTNER_STATE_CLR(port, SYNCHRONIZATION);

...

#define STATE_FLAG(_p, _f) (!!CHECK_FLAGS(_p, STATE_ ## _f))
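
The double negation in STATE_FLAG matters because the proposed code compares
the result with true. A small standalone sketch of the difference follows. The
CHECK_FLAGS and STATE_SYNCHRONIZATION definitions here are my assumptions
about how those names are defined (a plain bit mask test and the
Synchronization bit of the state octet), and the two helper functions are
hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed definitions, for illustration only. */
#define STATE_SYNCHRONIZATION 0x08
#define CHECK_FLAGS(_variable, _f) ((_variable) & (_f))
#define STATE_FLAG(_p, _f) (!!CHECK_FLAGS(_p, STATE_ ## _f))

/* Raw mask test: yields 0x08 when the bit is set, which is not equal to 1. */
static bool
sketch_raw_equals_true(uint8_t state)
{
	return CHECK_FLAGS(state, STATE_SYNCHRONIZATION) == true;
}

/* Normalized test: !! maps any non-zero value to 1 before comparing. */
static bool
sketch_normalized_equals_true(uint8_t state)
{
	return STATE_FLAG(state, SYNCHRONIZATION) == true;
}
```

Under these assumptions, the raw variant reports false for a state with the
Synchronization bit set, while the normalized variant reports true.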


I am not sure yet whether the described problems are causing the driver to get
stuck in a kind of deadlock in my application, but I think they might be the
source of my problem.

Could someone take a look at my suggestions and help
me find out why my LACP bonding port doesn't work correctly?

Thank you.

--
Alex Kiselev
