On 05/01/2018 23:30, Scott Long wrote:

On Jan 5, 2018, at 11:20 AM, Eugene Grosbein <eu...@grosbein.net> wrote:

CC'ng scottl@ as author of the change in question.

06.01.2018 0:39, Matt Joras wrote:

For what it's worth, this was the conclusion I came to, and at Isilon
we've made the same change being discussed here. For the case of
drivers that end up using a queue index for the flowid, you end up
with pathological behavior on the lagg; the flowid ends up getting
right shifted by (default) 16. So in the case of e.g. two bxe(4)
interfaces with 4 queues, you always end up choosing the interface in
the lagg at index 0.
Then why does if_lagg shifts 16 bits by default? Is seems senseless.
This was introduced with r260070 by scottl:

At the time, we were using cxgbe interfaces which inserted a reasonable RSS
hash into the flowid field.  The shift was done to expose different bits to the
index/modulo scheme used by the LACP module vs the cxgbe module.  In
hindsight, I should not have set a default value of 16, I should have left it at
zero so that default behavior would be preserved.

Multi-queue NIC drivers and multi-port lagg tend to use the same lower
bits of the flowid as each other, resulting in a poor distribution of
packets among queues in certain cases.  Work around this by adding a
set of sysctls for controlling a bit-shift on the flowid when doing
multi-port aggrigation in lagg and lacp.  By default, lagg/lacp will
now use bits 16 and higher instead of 0 and higher.

Reviewed by:    max
Obtained from:  Netflix
MFC after:      3 days
This commit message does not point to an example of NIC driver that would set
non-zero bits 16 and higher for flowid so that shift result would be non-zero.
Do we really have such a driver?

Yes.

Anyway, this lagg's default seems to be very driver-centric.

For example, Intel driver family also do not use such high bits for flowid
just like mentioned bxe(4).

scottl@moe:~/svn/head/sys/dev % grep -r iri_flowid *
bnxt/bnxt_txrx.c:               ri->iri_flowid = le32toh(rcp->rss_hash);
bnxt/bnxt_txrx.c:               ri->iri_flowid = le32toh(tpas->low.rss_hash);
e1000/em_txrx.c:        ri->iri_flowid = le32toh(rxd->wb.lower.hi_dword.rss);
e1000/igb_txrx.c:       ri->iri_flowid =
ixgbe/ix_txrx.c:        ri->iri_flowid = le32toh(rxd->wb.lower.hi_dword.rss);

The number of drivers that set m_pkhhdr.flowid directly to an RSS hash looks
to be:

cxgb
cxgbe
mlx4
mlx5
qlnx
qlxgbe
qlxge
vmxnet3

Maybe the hardware doesn’t do a great job with generating a useful RSS hash,
but that’s tangential to this conversation.

We should change flowid_shift default to 0 for if_lagg(4), shouldn't we?

In the short term, yes.  What I see is that it’s too expensive to do a hash 
calculation
on every TX packet in LACP (for anything resembling line rate), and flowid is 
unreliable
when a connection is initiated without an RX packet triggering it.  I don’t 
understand
the consequences of the TX initiation problem well enough to offer a solution.  
For the
problem of flowid being used inconsistently by drivers (i.e. not populating all 
32 bits
or using a weak RSS), that’s really a driver problem.

What I’d recommend is that the LACP code check for M_HASHTYPE_NONE and
M_HASHTYPE_OPAQUE and calculate a new hash if either are set (effectively
ignoring the flowid).  It’s then up to the drivers to set the correct hash type 
that
corresponds with what they’re putting into the flowid field.  An opaque type 
would
mean that the value is useful to the driver but should not be considered useful
anywhere else.  You’ll get more correct and less surprising behavior from that, 
at
the penalty of every TX packet requiring a hash calculation for hardware/drivers
that are crummy.

Mixing the hash methods causes problems with out of order packets even just for the first packet, and using a hash which is not what's configured by lagghash is confusing at best so that could be fixed to say "flowid" if that's whats going to happen or perhaps update it to the hash type that flowid represents if that's possible.

LACP already checks for M_HASHTYPE_NONE if use_flowid is enabled and manually calculates a hash, which is what leads the the first packet port selection inconsistency.

It's not clear what all the implications of the first packet port inconsistency is, it will likely be dependent a large number of factors including network hardware, layout and dest host + config., but when we've seen this in the 3 and 4 packet of a stream it leads to the destination sending RST, dropping the stream on the floor for 2% of all streams on our test box with 2 x ixgbe interfaces.

Questions:

1. Is it possible to determine the hash method used by the HW and use
   that for all first packets?
2. Is it possible to significantly improve the performance the CPU hashing?
3. Is it possible to tell from the mbuf that its part of a connection
   instigated from the current host?

    Regards
    Steve
_______________________________________________
svn-src-head@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/svn-src-head
To unsubscribe, send any mail to "svn-src-head-unsubscr...@freebsd.org"

Reply via email to