On Mon, Feb 07, 2022 at 02:39:38PM +1000, David Gwynne wrote:
> On Fri, Feb 04, 2022 at 09:29:56PM +1000, David Gwynne wrote:
> > On Fri, Jan 28, 2022 at 05:26:01PM +0100, Alexander Bluhm wrote:
> > > On Wed, Jan 26, 2022 at 11:05:51AM +0100, Claudio Jeker wrote:
> > > > On Wed, Jan 26, 2022 at 01:29:42AM +0100, Alexander Bluhm wrote:
> > > > > Hi,
> > > > >
> > > > > There were some problems with ix(4) and ixl(4) hardware checksumming
> > > > > for the output path on strict alignment architectures.
> > > > >
> > > > > I have merged jan@'s diffs and added some sanity checks and
> > > > > workarounds.
> > > > >
> > > > > - If the first mbuf is not aligned or not contiguous, use m_copydata()
> > > > >   to extract the IP, IPv6, or TCP header.
> > > > > - If the header is in the first mbuf, use m_data for the fast path.
> > > > > - Add a netstat counter for invalid header chains.  This makes
> > > > >   us aware when hardware checksumming fails.
> > > > > - Add a netstat counter for header copies.  This indicates that
> > > > >   better storage allocation in the network stack is possible.
> > > > >   It also allows us to recognize alignment problems on non-strict
> > > > >   architectures.
> > > > > - There is no risk of crashes on sparc64.
> > > > >
> > > > > Does this approach make sense?
> > > >
> > > > I think it is overly complicated and too much data is copied around.
> > > > First of all there is no need to extract ipproto.  The code can just
> > > > use the M_TCP_CSUM_OUT and M_UDP_CSUM_OUT flags (they are not set
> > > > for other protos).  Because of this only ip_hlen needs to be
> > > > accessed, and this can be done with m_getptr().
> > >
> > > A solution with m_getptr() is where we started.  It has already
> > > been rejected.  The problem is that there are 6 ways to implement
> > > this feature.  Every option has its drawbacks and was rejected.
> > >
> > > Options are:
> > > 1. m_getptr() and access the struct.  Alignment cannot be guaranteed.
> > > 2. m_getptr() and access the byte or word.  Header fields should be
> > >    accessed through structs.
> > > 3. Always m_copydata.  Too much overhead.
> > > 4. Always use m_data.  Kernel may crash or use invalid data.
> > > 5. Combination of m_data and m_copydata.  Too complex.
> > > 6. Track the fields in the mbuf header.  Too fragile, and memory
> > >    overhead.
> > >
> > > In my measurements checksum offloading gave us a 10% performance
> > > boost, so I want this feature.  Other drivers also have it.
> > >
> > > Could we get another idea, or find a consensus on which option to use?
> >
> > after staring at this for a few hours my conclusion is option 1 actually
> > is the right approach, but the diff for ixl has a bug.
> >
> > > > In the IP6 case even more can be skipped since ip_hlen is static for
> > > > IPv6.
> > > >
> > > > In ixl(4) the tcp header length also needs to be extracted.  Again
> > > > the code can be simplified, because HW checksumming is only enabled
> > > > if ip_hlen == 5, and so the offset of the th_off field is static
> > > > (for both IPv4 and IPv6).  Again m_getptr can be used to just access
> > > > the byte with th_off.
> > > >
> > > > Longterm in_proto_cksum_out() should probably help provide the
> > > > th_off field.  I think enforcing ip_hlen == 5 for UDP and TCP is
> > > > fine; who needs IP options on UDP and TCP?
> > >
> > > Other diffs have been rejected as they make too many assumptions about
> > > how the stack works.
> >
> > my opinion is we can make these assumptions.  the CSUM_OUT flags are
> > only set in very specific places where the stack itself is constructing
> > or checking properly aligned headers, and then the stack maintains
> > this alignment until it reaches the driver.  this is where the first
> > of the bugs in the ixl diff comes in.
> >
> > in the diff ixl_tx_setup_offload() is called after the dma mapping
> > occurs.  this is implemented in this code:
> >
> > static inline int
> > ixl_load_mbuf(bus_dma_tag_t dmat, bus_dmamap_t map, struct mbuf *m)
> > {
> >         int error;
> >
> >         error = bus_dmamap_load_mbuf(dmat, map, m,
> >             BUS_DMA_STREAMING | BUS_DMA_NOWAIT);
> >         if (error != EFBIG)
> >                 return (error);
> >
> >         error = m_defrag(m, M_DONTWAIT);
> >         if (error != 0)
> >                 return (error);
> >
> >         return (bus_dmamap_load_mbuf(dmat, map, m,
> >             BUS_DMA_STREAMING | BUS_DMA_NOWAIT));
> > }
> >
> > the problem is that when we get a heavily fragmented mbuf we call
> > m_defrag, which basically allocates a cluster and copies all the
> > data from the fragments into it.  however, m_defrag does not maintain
> > the alignment of the payload, meaning the ethernet header gets to start
> > on a 4 byte boundary cos that's what we get from the cluster pool,
> > and the IP and TCP payloads end up with a 2 byte misalignment.
> >
> > my proposed solution to this is to set up the offload bits before
> > calling ixl_load_mbuf.  i'm confident this is the bug that deraadt@ hit.
> >
> > because of when the CSUM_OUT flags are set, i think m_getptr is fine for
> > pulling the mbuf apart.  if it's not i want the code to blow up so it
> > gets fixed.  that's why the code in my diff below lacks defenses.
> >
> > the other bug is that the uint64_t holding the offload flags isn't
> > reset between being called for different mbufs.  my solution to that is
> > to have ixl_tx_setup_offload() return the flags so the value in
> > ixl_start is overwritten every time.
> >
> > lastly, ixl shouldn't (doesn't need to?) spelunk in the packet if
> > the CSUM_OUT flags aren't set.  ixl_tx_setup_offload() returns 0 early
> > if none of the flags are set now.
> >
> > i'll have a look at ix and the ixl rx bits tomorrow.
>
> ix(4) also looked at the packet after a possible m_defrag.  this works on
> amd64 with ipv4 and ipv6, and with and without vlans, and it seems to
> behave as expected.  we'll try a sparc64 soon.
> > it's a lot easier to read after it's applied.
new diff. this one has actual csum+vlans working without leaky routes
fixing things for me without me noticing. it's also been tested on
sparc64.

Index: if_ix.c
===================================================================
RCS file: /cvs/src/sys/dev/pci/if_ix.c,v
retrieving revision 1.181
diff -u -p -r1.181 if_ix.c
--- if_ix.c	27 Jan 2022 18:28:44 -0000	1.181
+++ if_ix.c	7 Feb 2022 08:44:32 -0000
@@ -156,7 +156,8 @@ int	ixgbe_encap(struct tx_ring *, struct
 int	ixgbe_dma_malloc(struct ix_softc *, bus_size_t,
	    struct ixgbe_dma_alloc *, int);
 void	ixgbe_dma_free(struct ix_softc *, struct ixgbe_dma_alloc *);
-int	ixgbe_tx_ctx_setup(struct tx_ring *, struct mbuf *, uint32_t *,
+static int
+	ixgbe_tx_ctx_setup(struct tx_ring *, struct mbuf *, uint32_t *,
	    uint32_t *);
 int	ixgbe_tso_setup(struct tx_ring *, struct mbuf *, uint32_t *,
	    uint32_t *);
@@ -1391,11 +1392,6 @@ ixgbe_encap(struct tx_ring *txr, struct
 	cmd_type_len = (IXGBE_ADVTXD_DTYP_DATA |
 	    IXGBE_ADVTXD_DCMD_IFCS | IXGBE_ADVTXD_DCMD_DEXT);
 
-#if NVLAN > 0
-	if (m_head->m_flags & M_VLANTAG)
-		cmd_type_len |= IXGBE_ADVTXD_DCMD_VLE;
-#endif
-
 	/*
 	 * Important to capture the first descriptor
 	 * used because it will contain the index of
@@ -1406,6 +1402,14 @@ ixgbe_encap(struct tx_ring *txr, struct
 	map = txbuf->map;
 
 	/*
+	 * Set the appropriate offload context
+	 * this will becomes the first descriptor.
+	 */
+	ntxc = ixgbe_tx_ctx_setup(txr, m_head, &cmd_type_len, &olinfo_status);
+	if (ntxc == -1)
+		goto xmit_fail;
+
+	/*
 	 * Map the packet for DMA.
 	 */
 	switch (bus_dmamap_load_mbuf(txr->txdma.dma_tag, map,
@@ -1422,14 +1426,6 @@ ixgbe_encap(struct tx_ring *txr, struct
 		return (0);
 	}
 
-	/*
-	 * Set the appropriate offload context
-	 * this will becomes the first descriptor.
-	 */
-	ntxc = ixgbe_tx_ctx_setup(txr, m_head, &cmd_type_len, &olinfo_status);
-	if (ntxc == -1)
-		goto xmit_fail;
-
 	i = txr->next_avail_desc + ntxc;
 	if (i >= sc->num_tx_desc)
 		i -= sc->num_tx_desc;
@@ -1880,6 +1876,7 @@ ixgbe_setup_interface(struct ix_softc *s
 #endif
 
 	ifp->if_capabilities |= IFCAP_CSUM_TCPv4 | IFCAP_CSUM_UDPv4;
+	ifp->if_capabilities |= IFCAP_CSUM_TCPv6 | IFCAP_CSUM_UDPv6;
 
 	/*
 	 * Specify the media types supported by this sc and register
@@ -2426,138 +2423,106 @@ ixgbe_free_transmit_buffers(struct tx_ri
  *
  **********************************************************************/
 
-int
-ixgbe_tx_ctx_setup(struct tx_ring *txr, struct mbuf *mp,
-    uint32_t *cmd_type_len, uint32_t *olinfo_status)
+static inline void
+ixgbe_csum_offload(struct mbuf *mp,
+    uint32_t *vlan_macip_lens, uint32_t *type_tucmd_mlhl)
 {
-	struct ixgbe_adv_tx_context_desc *TXD;
-	struct ixgbe_tx_buf *tx_buffer;
-#if NVLAN > 0
-	struct ether_vlan_header *eh;
-#else
-	struct ether_header *eh;
-#endif
-	struct ip *ip;
-#ifdef notyet
-	struct ip6_hdr *ip6;
-#endif
+	struct ether_header *eh = mtod(mp, struct ether_header *);
 	struct mbuf *m;
-	int ipoff;
-	uint32_t vlan_macip_lens = 0, type_tucmd_mlhl = 0;
-	int ehdrlen, ip_hlen = 0;
-	uint16_t etype;
-	uint8_t ipproto = 0;
-	int offload = TRUE;
-	int ctxd = txr->next_avail_desc;
-#if NVLAN > 0
-	uint16_t vtag = 0;
-#endif
+	int hoff;
+	uint32_t iphlen;
+	uint8_t ipproto;
 
-#if notyet
-	/* First check if TSO is to be used */
-	if (mp->m_pkthdr.csum_flags & CSUM_TSO)
-		return (ixgbe_tso_setup(txr, mp, cmd_type_len, olinfo_status));
-#endif
+	*vlan_macip_lens |= (sizeof(*eh) << IXGBE_ADVTXD_MACLEN_SHIFT);
 
-	if ((mp->m_pkthdr.csum_flags & (M_TCP_CSUM_OUT | M_UDP_CSUM_OUT)) == 0)
-		offload = FALSE;
+	switch (ntohs(eh->ether_type)) {
+	case ETHERTYPE_IP: {
+		struct ip *ip;
 
-	/* Indicate the whole packet as payload when not doing TSO */
-	*olinfo_status |= mp->m_pkthdr.len << IXGBE_ADVTXD_PAYLEN_SHIFT;
-
-	/* Now ready a context descriptor */
-	TXD = (struct ixgbe_adv_tx_context_desc *)&txr->tx_base[ctxd];
-	tx_buffer = &txr->tx_buffers[ctxd];
+		m = m_getptr(mp, sizeof(*eh), &hoff);
+		KASSERT(m != NULL && m->m_len - hoff >= sizeof(*ip));
+		ip = (struct ip *)(mtod(m, caddr_t) + hoff);
 
-	/*
-	 * In advanced descriptors the vlan tag must
-	 * be placed into the descriptor itself. Hence
-	 * we need to make one even if not doing offloads.
-	 */
-#if NVLAN > 0
-	if (mp->m_flags & M_VLANTAG) {
-		vtag = mp->m_pkthdr.ether_vtag;
-		vlan_macip_lens |= (vtag << IXGBE_ADVTXD_VLAN_SHIFT);
-	} else
-#endif
-	if (offload == FALSE)
-		return (0);	/* No need for CTX */
+		iphlen = ip->ip_hl << 2;
+		ipproto = ip->ip_p;
 
-	/*
-	 * Determine where frame payload starts.
-	 * Jump over vlan headers if already present,
-	 * helpful for QinQ too.
-	 */
-	if (mp->m_len < sizeof(struct ether_header))
-		return (-1);
-#if NVLAN > 0
-	eh = mtod(mp, struct ether_vlan_header *);
-	if (eh->evl_encap_proto == htons(ETHERTYPE_VLAN)) {
-		if (mp->m_len < sizeof(struct ether_vlan_header))
-			return (-1);
-		etype = ntohs(eh->evl_proto);
-		ehdrlen = ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN;
-	} else {
-		etype = ntohs(eh->evl_encap_proto);
-		ehdrlen = ETHER_HDR_LEN;
+		*type_tucmd_mlhl |= IXGBE_ADVTXD_TUCMD_IPV4;
+		break;
 	}
-#else
-	eh = mtod(mp, struct ether_header *);
-	etype = ntohs(eh->ether_type);
-	ehdrlen = ETHER_HDR_LEN;
-#endif
 
-	/* Set the ether header length */
-	vlan_macip_lens |= ehdrlen << IXGBE_ADVTXD_MACLEN_SHIFT;
+#ifdef INET6
+	case ETHERTYPE_IPV6: {
+		struct ip6_hdr *ip6;
+
+		m = m_getptr(mp, sizeof(*eh), &hoff);
+		KASSERT(m != NULL && m->m_len - hoff >= sizeof(*ip6));
+		ip6 = (struct ip6_hdr *)(mtod(m, caddr_t) + hoff);
 
-	switch (etype) {
-	case ETHERTYPE_IP:
-		if (mp->m_pkthdr.len < ehdrlen + sizeof(*ip))
-			return (-1);
-		m = m_getptr(mp, ehdrlen, &ipoff);
-		KASSERT(m != NULL && m->m_len - ipoff >= sizeof(*ip));
-		ip = (struct ip *)(m->m_data + ipoff);
-		ip_hlen = ip->ip_hl << 2;
-		ipproto = ip->ip_p;
-		type_tucmd_mlhl |= IXGBE_ADVTXD_TUCMD_IPV4;
-		break;
-#ifdef notyet
-	case ETHERTYPE_IPV6:
-		if (mp->m_pkthdr.len < ehdrlen + sizeof(*ip6))
-			return (-1);
-		m = m_getptr(mp, ehdrlen, &ipoff);
-		KASSERT(m != NULL && m->m_len - ipoff >= sizeof(*ip6));
-		ip6 = (struct ip6 *)(m->m_data + ipoff);
-		ip_hlen = sizeof(*ip6);
-		/* XXX-BZ this will go badly in case of ext hdrs. */
+		iphlen = sizeof(*ip6);
 		ipproto = ip6->ip6_nxt;
-		type_tucmd_mlhl |= IXGBE_ADVTXD_TUCMD_IPV6;
+
+		*type_tucmd_mlhl |= IXGBE_ADVTXD_TUCMD_IPV6;
 		break;
+	}
 #endif
+
 	default:
-		offload = FALSE;
-		break;
+		panic("CSUM_OUT set for non-IP packet");
+		/* NOTREACHED */
 	}
 
-	vlan_macip_lens |= ip_hlen;
-	type_tucmd_mlhl |= IXGBE_ADVTXD_DCMD_DEXT | IXGBE_ADVTXD_DTYP_CTXT;
+	*vlan_macip_lens |= iphlen;
 
 	switch (ipproto) {
 	case IPPROTO_TCP:
-		if (mp->m_pkthdr.csum_flags & M_TCP_CSUM_OUT)
-			type_tucmd_mlhl |= IXGBE_ADVTXD_TUCMD_L4T_TCP;
+		KASSERT(ISSET(mp->m_pkthdr.csum_flags, M_TCP_CSUM_OUT));
+		*type_tucmd_mlhl |= IXGBE_ADVTXD_TUCMD_L4T_TCP;
 		break;
 	case IPPROTO_UDP:
-		if (mp->m_pkthdr.csum_flags & M_UDP_CSUM_OUT)
-			type_tucmd_mlhl |= IXGBE_ADVTXD_TUCMD_L4T_UDP;
+		KASSERT(ISSET(mp->m_pkthdr.csum_flags, M_UDP_CSUM_OUT));
+		*type_tucmd_mlhl |= IXGBE_ADVTXD_TUCMD_L4T_UDP;
 		break;
 	default:
-		offload = FALSE;
-		break;
+		panic("CSUM_OUT set for wrong protocol");
+		/* NOTREACHED */
+	}
+}
+
+static int
+ixgbe_tx_ctx_setup(struct tx_ring *txr, struct mbuf *mp,
+    uint32_t *cmd_type_len, uint32_t *olinfo_status)
+{
+	struct ixgbe_adv_tx_context_desc *TXD;
+	struct ixgbe_tx_buf *tx_buffer;
+	uint32_t vlan_macip_lens = 0, type_tucmd_mlhl = 0;
+	int ctxd = txr->next_avail_desc;
+	int offload = 0;
+
+	/* Indicate the whole packet as payload when not doing TSO */
+	*olinfo_status |= mp->m_pkthdr.len << IXGBE_ADVTXD_PAYLEN_SHIFT;
+
+#if NVLAN > 0
+	if (ISSET(mp->m_flags, M_VLANTAG)) {
+		uint32_t vtag = mp->m_pkthdr.ether_vtag;
+		vlan_macip_lens |= (vtag << IXGBE_ADVTXD_VLAN_SHIFT);
+		*cmd_type_len |= IXGBE_ADVTXD_DCMD_VLE;
+		offload |= 1;
 	}
+#endif
 
-	if (offload) /* For the TX descriptor setup */
+	if (ISSET(mp->m_pkthdr.csum_flags, M_TCP_CSUM_OUT|M_UDP_CSUM_OUT)) {
+		ixgbe_csum_offload(mp, &vlan_macip_lens, &type_tucmd_mlhl);
 		*olinfo_status |= IXGBE_TXD_POPTS_TXSM << 8;
+		offload |= 1;
+	}
+
+	if (!offload)
+		return (0);
+
+	TXD = (struct ixgbe_adv_tx_context_desc *)&txr->tx_base[ctxd];
+	tx_buffer = &txr->tx_buffers[ctxd];
+
+	type_tucmd_mlhl |= IXGBE_ADVTXD_DCMD_DEXT | IXGBE_ADVTXD_DTYP_CTXT;
 
 	/* Now copy bits into descriptor */
 	TXD->vlan_macip_lens = htole32(vlan_macip_lens);
Index: ixgbe.h
===================================================================
RCS file: /cvs/src/sys/dev/pci/ixgbe.h,v
retrieving revision 1.32
diff -u -p -r1.32 ixgbe.h
--- ixgbe.h	18 Jul 2020 07:18:22 -0000	1.32
+++ ixgbe.h	7 Feb 2022 08:44:32 -0000
@@ -65,6 +65,7 @@
 #include <netinet/in.h>
 #include <netinet/if_ether.h>
 #include <netinet/ip.h>
+#include <netinet/ip6.h>
 
 #if NBPFILTER > 0
 #include <net/bpf.h>