On Mon, Feb 07, 2022 at 02:39:38PM +1000, David Gwynne wrote:
> On Fri, Feb 04, 2022 at 09:29:56PM +1000, David Gwynne wrote:
> > On Fri, Jan 28, 2022 at 05:26:01PM +0100, Alexander Bluhm wrote:
> > > On Wed, Jan 26, 2022 at 11:05:51AM +0100, Claudio Jeker wrote:
> > > > On Wed, Jan 26, 2022 at 01:29:42AM +0100, Alexander Bluhm wrote:
> > > > > Hi,
> > > > >
> > > > > There were some problems with ix(4) and ixl(4) hardware checksumming
> > > > > for the output path on strict alignment architectures.
> > > > >
> > > > > I have merged jan@'s diffs and added some sanity checks and
> > > > > workarounds.
> > > > >
> > > > > - If the first mbuf is not aligned or not contiguous, use m_copydata()
> > > > >   to extract the IP, IPv6, or TCP header.
> > > > > - If the header is in the first mbuf, use m_data for the fast path.
> > > > > - Add a netstat counter for invalid header chains.  This makes
> > > > >   us aware when hardware checksumming fails.
> > > > > - Add a netstat counter for header copies.  This indicates that
> > > > >   better storage allocation in the network stack is possible.
> > > > >   It also allows us to recognize alignment problems on non-strict
> > > > >   architectures.
> > > > > - There is no risk of crashes on sparc64.
> > > > >
> > > > > Does this approach make sense?
> > > >
> > > > I think it is overly complicated and too much data is copied around.
> > > > First of all there is no need to extract ipproto.  The code can just
> > > > use the M_TCP_CSUM_OUT and M_UDP_CSUM_OUT flags (they are not set
> > > > for other protos).  Because of this only ip_hlen needs to be
> > > > accessed, and this can be done with m_getptr().
> > >
> > > A solution with m_getptr() is where we started.  It has already
> > > been rejected.  The problem is that there are 6 ways to implement
> > > this feature.  Every option has its drawbacks and was rejected.
> > >
> > > Options are:
> > > 1. m_getptr() and access the struct.  Alignment cannot be guaranteed.
> > > 2. m_getptr() and access the byte or word.  Header fields should be
> > >    accessed through structs.
> > > 3. Always m_copydata.  Too much overhead.
> > > 4. Always use m_data.  Kernel may crash or use invalid data.
> > > 5. Combination of m_data and m_copydata.  Too complex.
> > > 6. Track the fields in the mbuf header.  Too fragile, and memory
> > >    overhead.
> > >
> > > In my measurements checksum offloading gave us a 10% performance
> > > boost, so I want this feature.  Other drivers also have it.
> > >
> > > Could we get another idea, or find a consensus on which option to use?
> >
> > after staring at this for a few hours my conclusion is option 1 actually
> > is the right approach, but the diff for ixl has a bug.
> >
> > > > In the IP6 case even more can be skipped since ip_hlen is static for
> > > > IPv6.
> > > >
> > > > In ixl(4) the tcp header length also needs to be extracted.  Again
> > > > the code can be simplified, because HW checksumming is only enabled
> > > > if ip_hlen == 5, and so the offset of the th_off field is static
> > > > (for both IPv4 and IPv6).  Again m_getptr can be used to just access
> > > > the byte with th_off.
> > > >
> > > > Longterm in_proto_cksum_out() should probably help provide the
> > > > th_off field.  I think enforcing ip_hlen == 5 for UDP and TCP is
> > > > fine; who needs IP options on UDP and TCP?
> > >
> > > Other diffs have been rejected as they make too many assumptions about
> > > how the stack works.
> >
> > my opinion is we can make these assumptions.  the CSUM_OUT flags are
> > only set in very specific places where the stack itself is constructing
> > or checking properly aligned headers, and then the stack maintains
> > this alignment until it reaches the driver.  this is where the first
> > of the bugs in the ixl diff comes in.
> >
> > in the diff ixl_tx_setup_offload() is called after the dma mapping
> > occurs.  this is implemented in this code:
> >
> > static inline int
> > ixl_load_mbuf(bus_dma_tag_t dmat, bus_dmamap_t map, struct mbuf *m)
> > {
> >         int error;
> >
> >         error = bus_dmamap_load_mbuf(dmat, map, m,
> >             BUS_DMA_STREAMING | BUS_DMA_NOWAIT);
> >         if (error != EFBIG)
> >                 return (error);
> >
> >         error = m_defrag(m, M_DONTWAIT);
> >         if (error != 0)
> >                 return (error);
> >
> >         return (bus_dmamap_load_mbuf(dmat, map, m,
> >             BUS_DMA_STREAMING | BUS_DMA_NOWAIT));
> > }
> >
> > the problem is that when we get a heavily fragmented mbuf we call
> > m_defrag, which basically allocates a cluster and copies all the
> > data from the fragments into it.  however, m_defrag does not maintain
> > the alignment of the payload, meaning the ethernet header gets to start
> > on a 4 byte boundary cos that's what we get from the cluster pool,
> > and the IP and TCP payloads end up with a 2 byte misalignment.
> >
> > my proposed solution to this is to set up the offload bits before
> > calling ixl_load_mbuf.  i'm confident this is the bug that deraadt@ hit.
> >
> > because of when the CSUM_OUT flags are set, i think m_getptr is fine for
> > pulling the mbuf apart.  if it's not i want the code to blow up so it
> > gets fixed.  that's why the code in my diff below lacks defenses.
> >
> > the other bug is that the uint64_t holding the offload flags isn't
> > reset between being called for different mbufs.  my solution to that is
> > to have ixl_tx_setup_offload() return the flags so the value in
> > ixl_start is overwritten every time.
> >
> > lastly, ixl shouldn't (doesn't need to?) spelunk in the packet if
> > the CSUM_OUT flags aren't set.  ixl_tx_setup_offload() returns 0 early
> > if none of the flags are set now.
> >
> > i'll have a look at ix and the ixl rx bits tomorrow.
>
> ix(4) also looked at the packet after a possible m_defrag.  this works on
> amd64 with ipv4 and ipv6, and with and without vlans, and it seems to
> behave as expected.  we'll try a sparc64 soon.
> > it's a lot easier to read after it's applied.
new diff. this one has actual csum+vlans working without leaky routes
fixing things for me without me noticing. it's also been tested on
sparc64.

Index: if_ix.c
===================================================================
RCS file: /cvs/src/sys/dev/pci/if_ix.c,v
retrieving revision 1.181
diff -u -p -r1.181 if_ix.c
--- if_ix.c	27 Jan 2022 18:28:44 -0000	1.181
+++ if_ix.c	7 Feb 2022 08:44:32 -0000
@@ -156,7 +156,8 @@ int	ixgbe_encap(struct tx_ring *, struct
 int	ixgbe_dma_malloc(struct ix_softc *, bus_size_t,
	    struct ixgbe_dma_alloc *, int);
 void	ixgbe_dma_free(struct ix_softc *, struct ixgbe_dma_alloc *);
-int	ixgbe_tx_ctx_setup(struct tx_ring *, struct mbuf *, uint32_t *,
+static int
+	ixgbe_tx_ctx_setup(struct tx_ring *, struct mbuf *, uint32_t *,
	    uint32_t *);
 int	ixgbe_tso_setup(struct tx_ring *, struct mbuf *, uint32_t *,
	    uint32_t *);
@@ -1391,11 +1392,6 @@ ixgbe_encap(struct tx_ring *txr, struct
 	cmd_type_len = (IXGBE_ADVTXD_DTYP_DATA |
 	    IXGBE_ADVTXD_DCMD_IFCS | IXGBE_ADVTXD_DCMD_DEXT);
 
-#if NVLAN > 0
-	if (m_head->m_flags & M_VLANTAG)
-		cmd_type_len |= IXGBE_ADVTXD_DCMD_VLE;
-#endif
-
 	/*
 	 * Important to capture the first descriptor
 	 * used because it will contain the index of
@@ -1406,6 +1402,14 @@ ixgbe_encap(struct tx_ring *txr, struct
 	map = txbuf->map;
 
 	/*
+	 * Set the appropriate offload context
+	 * this will becomes the first descriptor.
+	 */
+	ntxc = ixgbe_tx_ctx_setup(txr, m_head, &cmd_type_len, &olinfo_status);
+	if (ntxc == -1)
+		goto xmit_fail;
+
+	/*
 	 * Map the packet for DMA.
 	 */
 	switch (bus_dmamap_load_mbuf(txr->txdma.dma_tag, map,
@@ -1422,14 +1426,6 @@ ixgbe_encap(struct tx_ring *txr, struct
 		return (0);
 	}
 
-	/*
-	 * Set the appropriate offload context
-	 * this will becomes the first descriptor.
-	 */
-	ntxc = ixgbe_tx_ctx_setup(txr, m_head, &cmd_type_len, &olinfo_status);
-	if (ntxc == -1)
-		goto xmit_fail;
-
 	i = txr->next_avail_desc + ntxc;
 	if (i >= sc->num_tx_desc)
 		i -= sc->num_tx_desc;
@@ -1880,6 +1876,7 @@ ixgbe_setup_interface(struct ix_softc *s
 #endif
 
 	ifp->if_capabilities |= IFCAP_CSUM_TCPv4 | IFCAP_CSUM_UDPv4;
+	ifp->if_capabilities |= IFCAP_CSUM_TCPv6 | IFCAP_CSUM_UDPv6;
 
 	/*
 	 * Specify the media types supported by this sc and register
@@ -2426,138 +2423,106 @@ ixgbe_free_transmit_buffers(struct tx_ri
  *
  **********************************************************************/
 
-int
-ixgbe_tx_ctx_setup(struct tx_ring *txr, struct mbuf *mp,
-    uint32_t *cmd_type_len, uint32_t *olinfo_status)
+static inline void
+ixgbe_csum_offload(struct mbuf *mp,
+    uint32_t *vlan_macip_lens, uint32_t *type_tucmd_mlhl)
 {
-	struct ixgbe_adv_tx_context_desc *TXD;
-	struct ixgbe_tx_buf *tx_buffer;
-#if NVLAN > 0
-	struct ether_vlan_header *eh;
-#else
-	struct ether_header *eh;
-#endif
-	struct ip *ip;
-#ifdef notyet
-	struct ip6_hdr *ip6;
-#endif
+	struct ether_header *eh = mtod(mp, struct ether_header *);
 	struct mbuf *m;
-	int ipoff;
-	uint32_t vlan_macip_lens = 0, type_tucmd_mlhl = 0;
-	int ehdrlen, ip_hlen = 0;
-	uint16_t etype;
-	uint8_t ipproto = 0;
-	int offload = TRUE;
-	int ctxd = txr->next_avail_desc;
-#if NVLAN > 0
-	uint16_t vtag = 0;
-#endif
+	int hoff;
+	uint32_t iphlen;
+	uint8_t ipproto;
 
-#if notyet
-	/* First check if TSO is to be used */
-	if (mp->m_pkthdr.csum_flags & CSUM_TSO)
-		return (ixgbe_tso_setup(txr, mp, cmd_type_len, olinfo_status));
-#endif
+	*vlan_macip_lens |= (sizeof(*eh) << IXGBE_ADVTXD_MACLEN_SHIFT);
 
-	if ((mp->m_pkthdr.csum_flags & (M_TCP_CSUM_OUT | M_UDP_CSUM_OUT)) == 0)
-		offload = FALSE;
+	switch (ntohs(eh->ether_type)) {
+	case ETHERTYPE_IP: {
+		struct ip *ip;
 
-	/* Indicate the whole packet as payload when not doing TSO */
-	*olinfo_status |= mp->m_pkthdr.len << IXGBE_ADVTXD_PAYLEN_SHIFT;
-
-	/* Now ready a context descriptor */
-	TXD = (struct ixgbe_adv_tx_context_desc *)&txr->tx_base[ctxd];
-	tx_buffer = &txr->tx_buffers[ctxd];
+		m = m_getptr(mp, sizeof(*eh), &hoff);
+		KASSERT(m != NULL && m->m_len - hoff >= sizeof(*ip));
+		ip = (struct ip *)(mtod(m, caddr_t) + hoff);
 
-	/*
-	 * In advanced descriptors the vlan tag must
-	 * be placed into the descriptor itself. Hence
-	 * we need to make one even if not doing offloads.
-	 */
-#if NVLAN > 0
-	if (mp->m_flags & M_VLANTAG) {
-		vtag = mp->m_pkthdr.ether_vtag;
-		vlan_macip_lens |= (vtag << IXGBE_ADVTXD_VLAN_SHIFT);
-	} else
-#endif
-	if (offload == FALSE)
-		return (0);	/* No need for CTX */
+		iphlen = ip->ip_hl << 2;
+		ipproto = ip->ip_p;
 
-	/*
-	 * Determine where frame payload starts.
-	 * Jump over vlan headers if already present,
-	 * helpful for QinQ too.
-	 */
-	if (mp->m_len < sizeof(struct ether_header))
-		return (-1);
-#if NVLAN > 0
-	eh = mtod(mp, struct ether_vlan_header *);
-	if (eh->evl_encap_proto == htons(ETHERTYPE_VLAN)) {
-		if (mp->m_len < sizeof(struct ether_vlan_header))
-			return (-1);
-		etype = ntohs(eh->evl_proto);
-		ehdrlen = ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN;
-	} else {
-		etype = ntohs(eh->evl_encap_proto);
-		ehdrlen = ETHER_HDR_LEN;
+		*type_tucmd_mlhl |= IXGBE_ADVTXD_TUCMD_IPV4;
+		break;
 	}
-#else
-	eh = mtod(mp, struct ether_header *);
-	etype = ntohs(eh->ether_type);
-	ehdrlen = ETHER_HDR_LEN;
-#endif
 
-	/* Set the ether header length */
-	vlan_macip_lens |= ehdrlen << IXGBE_ADVTXD_MACLEN_SHIFT;
+#ifdef INET6
+	case ETHERTYPE_IPV6: {
+		struct ip6_hdr *ip6;
+
+		m = m_getptr(mp, sizeof(*eh), &hoff);
+		KASSERT(m != NULL && m->m_len - hoff >= sizeof(*ip6));
+		ip6 = (struct ip6_hdr *)(mtod(m, caddr_t) + hoff);
 
-	switch (etype) {
-	case ETHERTYPE_IP:
-		if (mp->m_pkthdr.len < ehdrlen + sizeof(*ip))
-			return (-1);
-		m = m_getptr(mp, ehdrlen, &ipoff);
-		KASSERT(m != NULL && m->m_len - ipoff >= sizeof(*ip));
-		ip = (struct ip *)(m->m_data + ipoff);
-		ip_hlen = ip->ip_hl << 2;
-		ipproto = ip->ip_p;
-		type_tucmd_mlhl |= IXGBE_ADVTXD_TUCMD_IPV4;
-		break;
-#ifdef notyet
-	case ETHERTYPE_IPV6:
-		if (mp->m_pkthdr.len < ehdrlen + sizeof(*ip6))
-			return (-1);
-		m = m_getptr(mp, ehdrlen, &ipoff);
-		KASSERT(m != NULL && m->m_len - ipoff >= sizeof(*ip6));
-		ip6 = (struct ip6 *)(m->m_data + ipoff);
-		ip_hlen = sizeof(*ip6);
-		/* XXX-BZ this will go badly in case of ext hdrs. */
+		iphlen = sizeof(*ip6);
 		ipproto = ip6->ip6_nxt;
-		type_tucmd_mlhl |= IXGBE_ADVTXD_TUCMD_IPV6;
+
+		*type_tucmd_mlhl |= IXGBE_ADVTXD_TUCMD_IPV6;
 		break;
+	}
 #endif
+
 	default:
-		offload = FALSE;
-		break;
+		panic("CSUM_OUT set for non-IP packet");
+		/* NOTREACHED */
 	}
 
-	vlan_macip_lens |= ip_hlen;
-	type_tucmd_mlhl |= IXGBE_ADVTXD_DCMD_DEXT | IXGBE_ADVTXD_DTYP_CTXT;
+	*vlan_macip_lens |= iphlen;
 
 	switch (ipproto) {
 	case IPPROTO_TCP:
-		if (mp->m_pkthdr.csum_flags & M_TCP_CSUM_OUT)
-			type_tucmd_mlhl |= IXGBE_ADVTXD_TUCMD_L4T_TCP;
+		KASSERT(ISSET(mp->m_pkthdr.csum_flags, M_TCP_CSUM_OUT));
+		*type_tucmd_mlhl |= IXGBE_ADVTXD_TUCMD_L4T_TCP;
 		break;
 	case IPPROTO_UDP:
-		if (mp->m_pkthdr.csum_flags & M_UDP_CSUM_OUT)
-			type_tucmd_mlhl |= IXGBE_ADVTXD_TUCMD_L4T_UDP;
+		KASSERT(ISSET(mp->m_pkthdr.csum_flags, M_UDP_CSUM_OUT));
+		*type_tucmd_mlhl |= IXGBE_ADVTXD_TUCMD_L4T_UDP;
 		break;
 	default:
-		offload = FALSE;
-		break;
+		panic("CSUM_OUT set for wrong protocol");
+		/* NOTREACHED */
+	}
+}
+
+static int
+ixgbe_tx_ctx_setup(struct tx_ring *txr, struct mbuf *mp,
+    uint32_t *cmd_type_len, uint32_t *olinfo_status)
+{
+	struct ixgbe_adv_tx_context_desc *TXD;
+	struct ixgbe_tx_buf *tx_buffer;
+	uint32_t vlan_macip_lens = 0, type_tucmd_mlhl = 0;
+	int ctxd = txr->next_avail_desc;
+	int offload = 0;
+
+	/* Indicate the whole packet as payload when not doing TSO */
+	*olinfo_status |= mp->m_pkthdr.len << IXGBE_ADVTXD_PAYLEN_SHIFT;
+
+#if NVLAN > 0
+	if (ISSET(mp->m_flags, M_VLANTAG)) {
+		uint32_t vtag = mp->m_pkthdr.ether_vtag;
+		vlan_macip_lens |= (vtag << IXGBE_ADVTXD_VLAN_SHIFT);
+		*cmd_type_len |= IXGBE_ADVTXD_DCMD_VLE;
+		offload |= 1;
 	}
+#endif
 
-	if (offload) /* For the TX descriptor setup */
+	if (ISSET(mp->m_pkthdr.csum_flags, M_TCP_CSUM_OUT|M_UDP_CSUM_OUT)) {
+		ixgbe_csum_offload(mp, &vlan_macip_lens, &type_tucmd_mlhl);
 		*olinfo_status |= IXGBE_TXD_POPTS_TXSM << 8;
+		offload |= 1;
+	}
+
+	if (!offload)
+		return (0);
+
+	TXD = (struct ixgbe_adv_tx_context_desc *)&txr->tx_base[ctxd];
+	tx_buffer = &txr->tx_buffers[ctxd];
+
+	type_tucmd_mlhl |= IXGBE_ADVTXD_DCMD_DEXT | IXGBE_ADVTXD_DTYP_CTXT;
 
 	/* Now copy bits into descriptor */
 	TXD->vlan_macip_lens = htole32(vlan_macip_lens);
Index: ixgbe.h
===================================================================
RCS file: /cvs/src/sys/dev/pci/ixgbe.h,v
retrieving revision 1.32
diff -u -p -r1.32 ixgbe.h
--- ixgbe.h	18 Jul 2020 07:18:22 -0000	1.32
+++ ixgbe.h	7 Feb 2022 08:44:32 -0000
@@ -65,6 +65,7 @@
 #include <netinet/in.h>
 #include <netinet/if_ether.h>
 #include <netinet/ip.h>
+#include <netinet/ip6.h>
 
 #if NBPFILTER > 0
 #include <net/bpf.h>