On Fri, Feb 04, 2022 at 09:29:56PM +1000, David Gwynne wrote:
> On Fri, Jan 28, 2022 at 05:26:01PM +0100, Alexander Bluhm wrote:
> > On Wed, Jan 26, 2022 at 11:05:51AM +0100, Claudio Jeker wrote:
> > > On Wed, Jan 26, 2022 at 01:29:42AM +0100, Alexander Bluhm wrote:
> > > > Hi,
> > > > 
> > > > There were some problems with ix(4) and ixl(4) hardware checksumming
> > > > for the output path on strict alignment architectures.
> > > > 
> > > > I have merged jan@'s diffs and added some sanity checks and
> > > > workarounds.
> > > > 
> > > > - If the first mbuf is not aligned or not contiguous, use m_copydata()
> > > >   to extract the IP, IPv6, TCP header.
> > > > - If the header is in the first mbuf, use m_data for the fast path.
> > > > - Add netstat counter for invalid header chains.  This makes
> > > >   us aware when hardware checksumming fails.
> > > > - Add netstat counter for header copies.  This indicates that
> > > >   better storage allocation in the network stack is possible.
> > > >   It also allows us to recognize alignment problems on non-strict
> > > >   architectures.
> > > > - There is no risk of crashes on sparc64.
> > > > 
> > > > Does this approach make sense?
> > > 
> > > I think it is overly complicated and too much data is copied around.
> > > First of all there is no need to extract ipproto.
> > > The code can just use the M_TCP_CSUM_OUT and M_UDP_CSUM_OUT flags (they
> > > are not set for other protos).
> > > Because of this only the ip_hlen needs to be accessed, and this can be
> > > done with m_getptr().
> > 
> > A solution with m_getptr() is where we started.  It has already
> > been rejected.  The problem is that there are 6 ways to implement
> > this feature.  Every option has its drawbacks and was rejected.
> > 
> > Options are:
> > 1. m_getptr() and access the struct.  Alignment cannot be guaranteed.
> > 2. m_getptr() and access the byte or word.  Header fields should be
> >    accessed by structs.
> > 3. Always m_copydata.  Too much overhead.
> > 4. Always use m_data.  Kernel may crash or use invalid data.
> > 5. Combination of m_data and m_copydata.  Too complex.
> > 6. Track the fields in mbuf header.  Too fragile and memory overhead.
> > 
> > In my measurements checksum offloading gave us 10% performance boost
> > so I want this feature.  Other drivers also have it.
> > 
> > Could we get another idea or find a consensus which option to use?
> 
> after staring at this for a few hours my conclusion is option 1 actually
> is the right approach, but the diff for ixl has a bug.
> 
> > > In the IP6 case even more can be skipped since ip_hlen is static for IPv6.
> > > 
> > > In ixl(4) also the tcp header length needs to be extracted. Again the code
> > > can be simplified because HW checksumming is only enabled if ip_hlen == 5
> > > and so the offset of the th_off field is static (for both IPv4 and IPv6).
> > > Again m_getptr can be used to just access the byte with th_off.
> > > 
> > > Longterm in_proto_cksum_out() should probably help provide the th_off
> > > field. I think enforcing ip_hlen == 5 for UDP and TCP is fine, who needs
> > > IP options on UDP and TCP?
> > 
> > Other diffs have been rejected as they make too many assumptions about
> > the stack works.
> 
> my opinion is we can make these assumptions. the CSUM_OUT flags are
> only set in very specific places where the stack itself is constructing
> or checking properly aligned headers, and then the stack maintains
> this alignment until it reaches the driver. this is where the first
> of the bugs in the ixl diff comes in.
> 
> in the diff ixl_tx_setup_offload() is called after the dma mapping
> occurs. this is implemented in this code:
> 
> static inline int
> ixl_load_mbuf(bus_dma_tag_t dmat, bus_dmamap_t map, struct mbuf *m)
> {
>       int error;
> 
>       error = bus_dmamap_load_mbuf(dmat, map, m,
>           BUS_DMA_STREAMING | BUS_DMA_NOWAIT);
>       if (error != EFBIG)
>               return (error);
> 
>       error = m_defrag(m, M_DONTWAIT);
>       if (error != 0)
>               return (error);
> 
>       return (bus_dmamap_load_mbuf(dmat, map, m,
>           BUS_DMA_STREAMING | BUS_DMA_NOWAIT));
> }
> 
> the problem is that when we get a heavily fragmented mbuf we call
> m_defrag, which basically allocates a cluster and copies all the
> data from the fragments into it. however, m_defrag does not maintain
> the alignment of payload, meaning the ethernet header gets to start
> on a 4 byte boundary cos that's what we get from the cluster pool,
> and the IP and TCP payloads end up with a 2 byte misalignmnet.
> 
> my proposed solution to this is to set up the offload bits before
> calling ixl_load_mbuf. i'm confident this is the bug that deraadt@ hit.
> 
> because of when the CSUM_OUT flags are set, i think m_getptr is fine for
> pulling the mbuf apart. if it's not i want the code to blow up so it
> gets fixed. that's why the code in my diff below lacks defenses.
> 
> the other bug is that the uint64_t holding the offload flags isn't
> reset between being called for different mbufs. my solution to that is
> to have ixl_tx_setup_offload() return the flags so the value in
> ixl_start is overwritten every time.
> 
> lastly, ixl shouldn't (doesn't need to?) spelunk in the packet if
> the CSUM_OUT flags aren't set. ixl_tx_setup_offload() returns 0 early
> if none of the flags are set now.
> 
> i'll have a look at ix and the ixl rx bits tomorrow.

ix(4) also looked at the packet after a possible m_defrag. this works on
amd64 with ipv4 and ipv6, with and without vlans, and it seems to
behave as expected. we'll try a sparc64 soon.

it's a lot easier to read after it's applied.

Index: if_ix.c
===================================================================
RCS file: /cvs/src/sys/dev/pci/if_ix.c,v
retrieving revision 1.181
diff -u -p -r1.181 if_ix.c
--- if_ix.c     27 Jan 2022 18:28:44 -0000      1.181
+++ if_ix.c     6 Feb 2022 12:32:21 -0000
@@ -156,7 +156,8 @@ int ixgbe_encap(struct tx_ring *, struct
 int    ixgbe_dma_malloc(struct ix_softc *, bus_size_t,
                    struct ixgbe_dma_alloc *, int);
 void   ixgbe_dma_free(struct ix_softc *, struct ixgbe_dma_alloc *);
-int    ixgbe_tx_ctx_setup(struct tx_ring *, struct mbuf *, uint32_t *,
+static int
+       ixgbe_tx_ctx_setup(struct tx_ring *, struct mbuf *, uint32_t *,
            uint32_t *);
 int    ixgbe_tso_setup(struct tx_ring *, struct mbuf *, uint32_t *,
            uint32_t *);
@@ -1406,6 +1407,14 @@ ixgbe_encap(struct tx_ring *txr, struct 
        map = txbuf->map;
 
        /*
+        * Set the appropriate offload context
+        * this will becomes the first descriptor.
+        */
+       ntxc = ixgbe_tx_ctx_setup(txr, m_head, &cmd_type_len, &olinfo_status);
+       if (ntxc == -1)
+               goto xmit_fail;
+
+       /*
         * Map the packet for DMA.
         */
        switch (bus_dmamap_load_mbuf(txr->txdma.dma_tag, map,
@@ -1422,14 +1431,6 @@ ixgbe_encap(struct tx_ring *txr, struct 
                return (0);
        }
 
-       /*
-        * Set the appropriate offload context
-        * this will becomes the first descriptor.
-        */
-       ntxc = ixgbe_tx_ctx_setup(txr, m_head, &cmd_type_len, &olinfo_status);
-       if (ntxc == -1)
-               goto xmit_fail;
-
        i = txr->next_avail_desc + ntxc;
        if (i >= sc->num_tx_desc)
                i -= sc->num_tx_desc;
@@ -1880,6 +1881,7 @@ ixgbe_setup_interface(struct ix_softc *s
 #endif
 
        ifp->if_capabilities |= IFCAP_CSUM_TCPv4 | IFCAP_CSUM_UDPv4;
+       ifp->if_capabilities |= IFCAP_CSUM_TCPv6 | IFCAP_CSUM_UDPv6;
 
        /*
         * Specify the media types supported by this sc and register
@@ -2426,32 +2428,85 @@ ixgbe_free_transmit_buffers(struct tx_ri
  *
  **********************************************************************/
 
-int
+static void
+ixgbe_csum_offload(struct mbuf *mp,
+    uint32_t *vlan_macip_lens_p, uint32_t *type_tucmd_mlhl_p)
+{
+       struct ether_header *eh = mtod(mp, struct ether_header *);
+       struct mbuf *m;
+       int hoff;
+       uint32_t iphlen;
+       uint32_t vlan_macip_lens = 0;
+       uint32_t type_tucmd_mlhl = 0;
+       uint8_t ipproto;
+
+       vlan_macip_lens |= (sizeof(*eh) << IXGBE_ADVTXD_MACLEN_SHIFT);
+
+       switch (ntohs(eh->ether_type)) {
+       case ETHERTYPE_IP: {
+               struct ip *ip;
+
+               m = m_getptr(mp, sizeof(*eh), &hoff);
+               KASSERT(m != NULL && m->m_len - hoff >= sizeof(*ip));
+               ip = (struct ip *)(mtod(m, caddr_t) + hoff);
+
+               iphlen = ip->ip_hl << 2;
+               ipproto = ip->ip_p;
+
+               type_tucmd_mlhl |= IXGBE_ADVTXD_TUCMD_IPV4;
+               break;
+       }
+
+#ifdef INET6
+       case ETHERTYPE_IPV6: {
+               struct ip6_hdr *ip6;
+
+               m = m_getptr(mp, sizeof(*eh), &hoff);
+               KASSERT(m != NULL && m->m_len - hoff >= sizeof(*ip6));
+               ip6 = (struct ip6_hdr *)(mtod(m, caddr_t) + hoff);
+
+               iphlen = sizeof(*ip6);
+               ipproto = ip6->ip6_nxt;
+
+               type_tucmd_mlhl |= IXGBE_ADVTXD_TUCMD_IPV6;
+               break;
+       }
+#endif
+
+       default:
+               panic("CSUM_OUT set for non-IP packet");
+               /* NOTREACHED */
+       }
+
+       vlan_macip_lens |= iphlen;
+
+       switch (ipproto) {
+       case IPPROTO_TCP:
+               KASSERT(ISSET(mp->m_pkthdr.csum_flags, M_TCP_CSUM_OUT));
+               type_tucmd_mlhl |= IXGBE_ADVTXD_TUCMD_L4T_TCP;
+               break;
+       case IPPROTO_UDP:
+               KASSERT(ISSET(mp->m_pkthdr.csum_flags, M_UDP_CSUM_OUT));
+               type_tucmd_mlhl |= IXGBE_ADVTXD_TUCMD_L4T_UDP;
+               break;
+       default:
+               panic("CSUM_OUT set for wrong protocol");
+               /* NOTREACHED */
+       }
+
+       *vlan_macip_lens_p = vlan_macip_lens;
+       *type_tucmd_mlhl_p = type_tucmd_mlhl;
+}
+
+static int
 ixgbe_tx_ctx_setup(struct tx_ring *txr, struct mbuf *mp,
     uint32_t *cmd_type_len, uint32_t *olinfo_status)
 {
        struct ixgbe_adv_tx_context_desc *TXD;
        struct ixgbe_tx_buf *tx_buffer;
-#if NVLAN > 0
-       struct ether_vlan_header *eh;
-#else
-       struct ether_header *eh;
-#endif
-       struct ip *ip;
-#ifdef notyet
-       struct ip6_hdr *ip6;
-#endif
-       struct mbuf *m;
-       int     ipoff;
        uint32_t vlan_macip_lens = 0, type_tucmd_mlhl = 0;
-       int     ehdrlen, ip_hlen = 0;
-       uint16_t etype;
-       uint8_t ipproto = 0;
-       int     offload = TRUE;
        int     ctxd = txr->next_avail_desc;
-#if NVLAN > 0
-       uint16_t vtag = 0;
-#endif
+       int     offload = 0;
 
 #if notyet
        /* First check if TSO is to be used */
@@ -2459,112 +2514,35 @@ ixgbe_tx_ctx_setup(struct tx_ring *txr, 
                return (ixgbe_tso_setup(txr, mp, cmd_type_len, olinfo_status));
 #endif
 
-       if ((mp->m_pkthdr.csum_flags & (M_TCP_CSUM_OUT | M_UDP_CSUM_OUT)) == 0)
-               offload = FALSE;
-
        /* Indicate the whole packet as payload when not doing TSO */
        *olinfo_status |= mp->m_pkthdr.len << IXGBE_ADVTXD_PAYLEN_SHIFT;
 
-       /* Now ready a context descriptor */
-       TXD = (struct ixgbe_adv_tx_context_desc *) &txr->tx_base[ctxd];
-       tx_buffer = &txr->tx_buffers[ctxd];
-
-       /*
-        * In advanced descriptors the vlan tag must
-        * be placed into the descriptor itself. Hence
-        * we need to make one even if not doing offloads.
-        */
 #if NVLAN > 0
-       if (mp->m_flags & M_VLANTAG) {
-               vtag = mp->m_pkthdr.ether_vtag;
+       if (ISSET(mp->m_flags, M_VLANTAG)) {
+               uint32_t vtag = mp->m_pkthdr.ether_vtag;
                vlan_macip_lens |= (vtag << IXGBE_ADVTXD_VLAN_SHIFT);
-       } else
-#endif
-       if (offload == FALSE)
-               return (0);     /* No need for CTX */
-
-       /*
-        * Determine where frame payload starts.
-        * Jump over vlan headers if already present,
-        * helpful for QinQ too.
-        */
-       if (mp->m_len < sizeof(struct ether_header))
-               return (-1);
-#if NVLAN > 0
-       eh = mtod(mp, struct ether_vlan_header *);
-       if (eh->evl_encap_proto == htons(ETHERTYPE_VLAN)) {
-               if (mp->m_len < sizeof(struct ether_vlan_header))
-                       return (-1);
-               etype = ntohs(eh->evl_proto);
-               ehdrlen = ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN;
-       } else {
-               etype = ntohs(eh->evl_encap_proto);
-               ehdrlen = ETHER_HDR_LEN;
+               offload |= 1;
        }
-#else
-       eh = mtod(mp, struct ether_header *);
-       etype = ntohs(eh->ether_type);
-       ehdrlen = ETHER_HDR_LEN;
 #endif
 
-       /* Set the ether header length */
-       vlan_macip_lens |= ehdrlen << IXGBE_ADVTXD_MACLEN_SHIFT;
-
-       switch (etype) {
-       case ETHERTYPE_IP:
-               if (mp->m_pkthdr.len < ehdrlen + sizeof(*ip))
-                       return (-1);
-               m = m_getptr(mp, ehdrlen, &ipoff);
-               KASSERT(m != NULL && m->m_len - ipoff >= sizeof(*ip));
-               ip = (struct ip *)(m->m_data + ipoff);
-               ip_hlen = ip->ip_hl << 2;
-               ipproto = ip->ip_p;
-               type_tucmd_mlhl |= IXGBE_ADVTXD_TUCMD_IPV4;
-               break;
-#ifdef notyet
-       case ETHERTYPE_IPV6:
-               if (mp->m_pkthdr.len < ehdrlen + sizeof(*ip6))
-                       return (-1);
-               m = m_getptr(mp, ehdrlen, &ipoff);
-               KASSERT(m != NULL && m->m_len - ipoff >= sizeof(*ip6));
-               ip6 = (struct ip6 *)(m->m_data + ipoff);
-               ip_hlen = sizeof(*ip6);
-               /* XXX-BZ this will go badly in case of ext hdrs. */
-               ipproto = ip6->ip6_nxt;
-               type_tucmd_mlhl |= IXGBE_ADVTXD_TUCMD_IPV6;
-               break;
-#endif
-       default:
-               offload = FALSE;
-               break;
+       if (ISSET(mp->m_pkthdr.csum_flags, M_TCP_CSUM_OUT|M_UDP_CSUM_OUT)) {
+               ixgbe_csum_offload(mp, &vlan_macip_lens, &type_tucmd_mlhl);
+               *olinfo_status |= IXGBE_TXD_POPTS_TXSM << 8;
+               offload |= 1;
        }
 
-       vlan_macip_lens |= ip_hlen;
-       type_tucmd_mlhl |= IXGBE_ADVTXD_DCMD_DEXT | IXGBE_ADVTXD_DTYP_CTXT;
-
-       switch (ipproto) {
-       case IPPROTO_TCP:
-               if (mp->m_pkthdr.csum_flags & M_TCP_CSUM_OUT)
-                       type_tucmd_mlhl |= IXGBE_ADVTXD_TUCMD_L4T_TCP;
-               break;
-       case IPPROTO_UDP:
-               if (mp->m_pkthdr.csum_flags & M_UDP_CSUM_OUT)
-                       type_tucmd_mlhl |= IXGBE_ADVTXD_TUCMD_L4T_UDP;
-               break;
-       default:
-               offload = FALSE;
-               break;
-       }
+       if (!offload)
+               return (0);
 
-       if (offload) /* For the TX descriptor setup */
-               *olinfo_status |= IXGBE_TXD_POPTS_TXSM << 8;
+       type_tucmd_mlhl |= IXGBE_ADVTXD_DCMD_DEXT | IXGBE_ADVTXD_DTYP_CTXT;
 
-       /* Now copy bits into descriptor */
-       TXD->vlan_macip_lens = htole32(vlan_macip_lens);
-       TXD->type_tucmd_mlhl = htole32(type_tucmd_mlhl);
+       TXD = (struct ixgbe_adv_tx_context_desc *)&txr->tx_base[ctxd];
+       htolem32(&TXD->vlan_macip_lens, vlan_macip_lens);
        TXD->seqnum_seed = htole32(0);
+       htolem32(&TXD->type_tucmd_mlhl, type_tucmd_mlhl);
        TXD->mss_l4len_idx = htole32(0);
 
+       tx_buffer = &txr->tx_buffers[ctxd];
        tx_buffer->m_head = NULL;
        tx_buffer->eop_index = -1;
 
Index: ixgbe.h
===================================================================
RCS file: /cvs/src/sys/dev/pci/ixgbe.h,v
retrieving revision 1.32
diff -u -p -r1.32 ixgbe.h
--- ixgbe.h     18 Jul 2020 07:18:22 -0000      1.32
+++ ixgbe.h     6 Feb 2022 12:32:21 -0000
@@ -65,6 +65,7 @@
 #include <netinet/in.h>
 #include <netinet/if_ether.h>
 #include <netinet/ip.h>
+#include <netinet/ip6.h>
 
 #if NBPFILTER > 0
 #include <net/bpf.h>

