Still looking for ok's...
On Sat, Feb 11, 2012 at 01:27 +0100, Mike Belopuhov wrote:
> Hi,
>
> As it became evident, ix is driven by Low Latency Interrupts
> on 82599 to do all sorts of processing instead of the regular
> Rx/Tx queue interrupts. LLI is an additional facility that
> is meant for out-of-band signalling of certain conditions.
> Therefore it has its own interrupt moderation settings
> independent of the Tx/Rx queue interrupts. There are a bunch
> of conditions that cause LLI but only one of them is turned
> on by default: signal when the rx ring shrinks below 64
> descriptors. Certainly, MCLGETI causes loads of these
> interrupts because most of the time we're running with less
> than 64 rx descriptors. On top of that this works as a
> positive feedback: you get more LLIs, you lose ticks, ring
> shrinks, you get more LLIs and so on.
>
> Previously we had LLIs completely unmoderated: 30-50k intr/s
> under high load is something we have seen. At that
> point Claudio and I figured out that it was the LLIs that
> were killing us. To mitigate that Claudio has found a way to
> moderate them. Unfortunately the longest inter-interrupt
> interval Intel allows us to set up for LLI is 60 us. That
> works out to 16 666 intr/s and this is precisely the number
> you'd see with an 82599. According to the (surprisingly
> incorrectly documented) interrupt throttling parameters used in
> the driver, a regular interrupt would only happen once every 500 us.
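
For reference, the rates quoted above follow directly from the
inter-interrupt interval. A minimal sketch of that arithmetic; the
helper name is made up for illustration and is not part of the driver:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not driver code: upper bound on the interrupt
 * rate for a given minimum inter-interrupt interval in microseconds.
 * Integer division truncates, so 60 us yields 16666, not 16667.
 */
static uint32_t
intr_per_sec(uint32_t interval_us)
{
	assert(interval_us > 0);
	return (1000000 / interval_us);
}
```

With the 60 us LLI ceiling this gives 16666 intr/s, and a 500 us
interval gives 2000 intr/s.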
>
> So in my previous attempt at fixing ix performance I made
> rxeof do more work, staying longer in the interrupt context
> and processing more than 4 rx descriptors per interrupt:
> http://marc.info/?l=openbsd-tech&m=132222989805096&w=2
> Obviously, I was working around LLIs.
>
> To disable LLI firing in the event of the Rx ring depletion
> one needs to write 0 to the threshold part of the SRRCTL
> register. In the diff below I suggest starting with a
> zeroed value of the register and building it up from there.
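
To illustrate why starting from zero matters, here is a rough sketch
comparing the old read-modify-write with the new approach. The EX_*
masks are assumptions written from memory for illustration only (check
them against the real IXGBE_SRRCTL_* definitions), and the function
names are hypothetical:

```c
#include <stdint.h>

/* Assumed SRRCTL field layout, for illustration only. */
#define EX_SRRCTL_BSIZEPKT_MASK	0x0000007fU	/* packet buffer size */
#define EX_SRRCTL_BSIZEHDR_MASK	0x00003f00U	/* header buffer size */
#define EX_SRRCTL_RDMTS_MASK	0x01c00000U	/* rx desc min threshold */

/* Old behaviour: only the buffer size fields of the current value are cleared. */
static uint32_t
srrctl_read_modify_write(uint32_t cur, uint32_t bufsz)
{
	cur &= ~EX_SRRCTL_BSIZEHDR_MASK;
	cur &= ~EX_SRRCTL_BSIZEPKT_MASK;
	cur |= bufsz;
	return (cur);		/* any threshold bits in cur survive */
}

/* New behaviour: start from zero, so the threshold field is 0. */
static uint32_t
srrctl_from_zero(uint32_t bufsz)
{
	return (bufsz);
}
```

Whatever threshold the register held before survives the first variant
and keeps the ring depletion LLI armed; the second variant leaves the
threshold field at zero.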
>
> I have also changed how we set up the inter-interrupt
> interval to calculate an appropriate register value based
> on the desired number of interrupts per second. Tests
> have shown that 8000 intr/s (or 125 us intervals) is good
> enough throughput- and latency-wise. On a selected platform
> (HP DL160 G6 with Xeon E5504, 2 GHz) 82599 has shown the
> following results:
>
> Interrupt rate, | Throughput for 64 byte | Average latency,
>     intr/s      |     packets, kpps      |        ms
> ----------------+------------------------+-----------------
>      8000       |          535           |       0.1
>      4000       |          550           |       0.3
>
> As you can see OpenBSD performs very well within the 125 us ..
> 250 us range, though going from 8000 to 4000 intr/s buys just
> a 3% increase in throughput while tripling the average latency.
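
The register value calculation mentioned above can be tried out in
isolation. This sketch mirrors the two expressions from the diff; the
function names are illustrative only, and the IXGBE_EITR_LLI_MOD and
IXGBE_EITR_CNT_WDIS flag bits that the diff ORs in on 82599 are left
out:

```c
#include <assert.h>
#include <stdint.h>

/* Interval field for 82598, same expression as in the diff. */
static uint32_t
itr_interval_82598(uint32_t ints_per_sec)
{
	assert(ints_per_sec > 0);
	return ((8000000 / ints_per_sec) & 0xff8);
}

/* Interval field for 82599 and later, same expression as in the diff. */
static uint32_t
itr_interval_82599(uint32_t ints_per_sec)
{
	assert(ints_per_sec > 0);
	return ((4000000 / ints_per_sec) & 0xff8);
}
```

For 8000 intr/s this yields 1000 (0x3e8) on 82598 and 496 (0x1f0) on
82599; the latter is 500 with the low three bits cleared by the 0xff8
mask.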
>
> Similar results were obtained from Core2 duo and Xeon E56xx
> series CPUs.
>
> [Please note that test kernels included the well-known
> icmp error copy fix to achieve better performance]
>
> While it's possible to tune this a bit more, I suggest
> the simple diff below for inclusion into 5.1.
>
> OK?
>
> Index: if_ix.c
> ===================================================================
> RCS file: /cvs/openbsd/src/sys/dev/pci/if_ix.c,v
> retrieving revision 1.60
> diff -u -p -r1.60 if_ix.c
> --- if_ix.c 20 Jan 2012 14:48:49 -0000 1.60
> +++ if_ix.c 10 Feb 2012 20:52:17 -0000
> @@ -634,8 +634,8 @@ ixgbe_init(void *arg)
> struct ix_softc *sc = (struct ix_softc *)arg;
> struct ifnet *ifp = &sc->arpcom.ac_if;
> struct rx_ring *rxr = sc->rx_rings;
> - uint32_t k, txdctl, rxdctl, rxctrl, mhadd, gpie;
> - int i, s, err, llimode = 0;
> + uint32_t k, txdctl, rxdctl, rxctrl, mhadd, gpie, itr;
> + int i, s, err;
>
> INIT_DEBUGOUT("ixgbe_init: begin");
>
> @@ -703,7 +703,6 @@ ixgbe_init(void *arg)
> * interrupts hitting the card when the ring is getting full.
> */
> gpie |= 0xf << IXGBE_GPIE_LLI_DELAY_SHIFT;
> - llimode = IXGBE_EITR_LLI_MOD;
> }
>
> if (sc->msix > 1) {
> @@ -807,9 +806,14 @@ ixgbe_init(void *arg)
> }
> }
>
> - /* Set moderation on the Link interrupt */
> - IXGBE_WRITE_REG(&sc->hw, IXGBE_EITR(sc->linkvec),
> - IXGBE_LINK_ITR | llimode);
> + /* Setup interrupt moderation */
> + if (sc->hw.mac.type == ixgbe_mac_82598EB)
> + itr = (8000000 / IXGBE_INTS_PER_SEC) & 0xff8;
> + else {
> + itr = (4000000 / IXGBE_INTS_PER_SEC) & 0xff8;
> + itr |= IXGBE_EITR_LLI_MOD | IXGBE_EITR_CNT_WDIS;
> + }
> + IXGBE_WRITE_REG(&sc->hw, IXGBE_EITR(0), itr);
>
> /* Config/Enable Link */
> ixgbe_config_link(sc);
> @@ -2832,10 +2836,7 @@ ixgbe_initialize_receive_units(struct ix
> sc->num_rx_desc * sizeof(union ixgbe_adv_rx_desc));
>
> /* Set up the SRRCTL register */
> - srrctl = IXGBE_READ_REG(&sc->hw, IXGBE_SRRCTL(i));
> - srrctl &= ~IXGBE_SRRCTL_BSIZEHDR_MASK;
> - srrctl &= ~IXGBE_SRRCTL_BSIZEPKT_MASK;
> - srrctl |= bufsz;
> + srrctl = bufsz;
> if (rxr->hdr_split) {
> /* Use a standard mbuf for the header */
> srrctl |= ((IXGBE_RX_HDR <<
> Index: if_ix.h
> ===================================================================
> RCS file: /cvs/openbsd/src/sys/dev/pci/if_ix.h,v
> retrieving revision 1.13
> diff -u -p -r1.13 if_ix.h
> --- if_ix.h 15 Jun 2011 00:03:00 -0000 1.13
> +++ if_ix.h 10 Feb 2012 20:50:59 -0000
> @@ -119,12 +119,9 @@
> #define IXGBE_QUEUE_HUNG 2
>
> /*
> - * Interrupt Moderation parameters
> + * Interrupt Moderation parameters
> */
> -#define IXGBE_LOW_LATENCY 128
> -#define IXGBE_AVE_LATENCY 400
> -#define IXGBE_BULK_LATENCY 1200
> -#define IXGBE_LINK_ITR 2000
> +#define IXGBE_INTS_PER_SEC 8000
>
> /* Used for auto RX queue configuration */
> extern int mp_ncpus;