Re: vr(4) TX interrupt reduction

2013-01-17 Thread Darren Tucker
On Wed, Jan 16, 2013 at 01:38:01PM +1100, Darren Tucker wrote:
 On Mon, Jan 14, 2013 at 10:10:55PM +1100, Darren Tucker wrote:
  On my ALIX, it increase the IP routing throughput from 80Mbit/s to
  85Mbit/s while reducing the interrupt CPU usage from 99% to 80%.
 
 It turns out that due to an error on my part, most of this improvment
 was due to one of the test kernels being complied without POOL_DEBUG
 while the baseline one was.  Sigh.  I think there's still some gains
 to be had, though.

OK, here's another diff which does seem to help.  It seems that there's
two different bits in the TX descriptor that control interrupts.  Quoting
from the VT6105M spec:

TDES1 bit 23: Interrupt Control. 0: No interrupt when Transmit OK, 1:
Interrupt when Transmit OK

TDES3 bit 0: Interrupt Control. 0: issue interrupt for this packet.  1:
no interrupt generated

Why two bits with apparently similar functionality?  Beats me, but only
the second actually seems to do anything on my VT6105M here.   CPU usage
for routing 85 mbit/s drops CPU usage from about 75% to about 50%
(famous last words).  Your milage may vary and so on.

Index: if_vr.c
===
RCS file: /cvs/src/sys/dev/pci/if_vr.c,v
retrieving revision 1.124
diff -u -p -r1.124 if_vr.c
--- if_vr.c 16 Jan 2013 06:15:50 -  1.124
+++ if_vr.c 17 Jan 2013 04:15:05 -
@@ -724,7 +724,7 @@ vr_list_tx_init(struct vr_softc *sc)
cd = sc-vr_cdata;
ld = sc-vr_ldata;
 
-   cd-vr_tx_cnt = 0;
+   cd-vr_tx_cnt = cd-vr_tx_pkts = 0;
 
for (i = 0; i  VR_TX_LIST_CNT; i++) {
cd-vr_tx_chain[i].vr_ptr = ld-vr_tx_list[i];
@@ -1292,7 +1292,15 @@ vr_encap(struct vr_softc *sc, struct vr_
}
 
/* Set EOP on the last descriptor */
-   f-vr_ctl |= htole32(VR_TXCTL_LASTFRAG | VR_TXCTL_FINT);
+   f-vr_ctl |= htole32(VR_TXCTL_LASTFRAG);
+
+   /* set the disable-interrupt bit except on every Nth packet */
+   if (sc-vr_cdata.vr_tx_pkts++ % 16)
+   f-vr_next |= htole32(VR_TXNEXT_INTDISABLE);
+   else {
+   sc-vr_cdata.vr_tx_pkts = 0;
+   f-vr_ctl |= htole32(VR_TXCTL_FINT);
+   }
 
return (0);
 }
@@ -1581,6 +1589,11 @@ vr_watchdog(struct ifnet *ifp)
struct vr_softc *sc;
 
sc = ifp-if_softc;
+
+   /* Reclaim first as we don't request an interrupt for every packet. */
+   vr_txeof(sc);
+   if (sc-vr_cdata.vr_tx_cnt == 0);
+   return;
 
ifp-if_oerrors++;
printf(%s: watchdog timeout\n, sc-sc_dev.dv_xname);
Index: if_vrreg.h
===
RCS file: /cvs/src/sys/dev/pci/if_vrreg.h,v
retrieving revision 1.33
diff -u -p -r1.33 if_vrreg.h
--- if_vrreg.h  16 Jan 2013 05:25:57 -  1.33
+++ if_vrreg.h  17 Jan 2013 04:15:05 -
@@ -429,6 +429,9 @@ struct vr_desc {
 #define VR_TXCTL_LASTFRAG  0x0040
 #define VR_TXCTL_FINT  0x0080
 
+/* TDES3 aka vr_next */
+#define VR_TXNEXT_INTDISABLE   0x0001
+
 #define VR_MAXFRAGS8
 #define VR_RX_LIST_CNT 128
 #define VR_TX_LIST_CNT 128
@@ -467,6 +470,7 @@ struct vr_chain_data {
struct vr_chain *vr_tx_cons;
struct vr_chain *vr_tx_prod;
int vr_tx_cnt;
+   int vr_tx_pkts;
 };
 
 struct vr_mii_frame {
-- 
Darren Tucker (dtucker at zip.com.au)
GPG key 8FF4FA69 / D9A3 86E9 7EEE AF4B B2D4  37C9 C982 80C7 8FF4 FA69
Good judgement comes with experience. Unfortunately, the experience
usually comes from bad judgement.



Re: vr(4) TX interrupt reduction

2013-01-17 Thread Darren Tucker
On Thu, Jan 17, 2013 at 09:34:32PM +1100, Darren Tucker wrote:
 OK, here's another diff which does seem to help.  It seems that there's
 two different bits in the TX descriptor that control interrupts.  Quoting
 from the VT6105M spec:

Thanks to Mark Patruck for noticing that the previous patch didn't
actually help, due to a bug I introduced in a last minute obviously
correct clean up.

Index: dev/pci/if_vr.c
===
RCS file: /cvs/src/sys/dev/pci/if_vr.c,v
retrieving revision 1.124
diff -u -p -r1.124 if_vr.c
--- dev/pci/if_vr.c 16 Jan 2013 06:15:50 -  1.124
+++ dev/pci/if_vr.c 17 Jan 2013 21:54:19 -
@@ -724,7 +724,7 @@ vr_list_tx_init(struct vr_softc *sc)
cd = sc-vr_cdata;
ld = sc-vr_ldata;
 
-   cd-vr_tx_cnt = 0;
+   cd-vr_tx_cnt = cd-vr_tx_pkts = 0;
 
for (i = 0; i  VR_TX_LIST_CNT; i++) {
cd-vr_tx_chain[i].vr_ptr = ld-vr_tx_list[i];
@@ -1292,7 +1292,15 @@ vr_encap(struct vr_softc *sc, struct vr_
}
 
/* Set EOP on the last descriptor */
-   f-vr_ctl |= htole32(VR_TXCTL_LASTFRAG | VR_TXCTL_FINT);
+   f-vr_ctl |= htole32(VR_TXCTL_LASTFRAG);
+
+   /* set the disable-interrupt bit except on every Nth packet */
+   if (sc-vr_cdata.vr_tx_pkts++  8)
+   f-vr_next |= htole32(VR_TXNEXT_INTDISABLE);
+   else {
+   sc-vr_cdata.vr_tx_pkts = 0;
+   f-vr_ctl |= htole32(VR_TXCTL_FINT);
+   }
 
return (0);
 }
@@ -1581,6 +1589,11 @@ vr_watchdog(struct ifnet *ifp)
struct vr_softc *sc;
 
sc = ifp-if_softc;
+
+   /* Reclaim first as we don't request an interrupt for every packet. */
+   vr_txeof(sc);
+   if (sc-vr_cdata.vr_tx_cnt == 0);
+   return;
 
ifp-if_oerrors++;
printf(%s: watchdog timeout\n, sc-sc_dev.dv_xname);

-- 
Darren Tucker (dtucker at zip.com.au)
GPG key 8FF4FA69 / D9A3 86E9 7EEE AF4B B2D4  37C9 C982 80C7 8FF4 FA69
Good judgement comes with experience. Unfortunately, the experience
usually comes from bad judgement.



Re: vr(4) TX interrupt reduction

2013-01-17 Thread Darren Tucker
On Fri, Jan 18, 2013 at 09:00:25AM +1100, Darren Tucker wrote:
 Thanks to Mark Patruck for noticing that the previous patch didn't
 actually help, due to a bug I introduced in a last minute obviously
 correct clean up.

The turd polishing continues unabated.  This adds the interrupt disable
bit to descriptors where there's more than one mbuf in a packet.

Index: dev/pci/if_vr.c
===
RCS file: /cvs/src/sys/dev/pci/if_vr.c,v
retrieving revision 1.125
diff -u -p -r1.125 if_vr.c
--- dev/pci/if_vr.c 17 Jan 2013 21:49:48 -  1.125
+++ dev/pci/if_vr.c 18 Jan 2013 02:04:55 -
@@ -724,7 +724,7 @@ vr_list_tx_init(struct vr_softc *sc)
cd = sc-vr_cdata;
ld = sc-vr_ldata;
 
-   cd-vr_tx_cnt = 0;
+   cd-vr_tx_cnt = cd-vr_tx_pkts = 0;
 
for (i = 0; i  VR_TX_LIST_CNT; i++) {
cd-vr_tx_chain[i].vr_ptr = ld-vr_tx_list[i];
@@ -1198,7 +1198,7 @@ vr_encap(struct vr_softc *sc, struct vr_
struct vr_chain *c = *cp;
struct vr_desc  *f = NULL;
struct mbuf *m_new = NULL;
-   u_int32_t   vr_ctl = 0, vr_status = 0;
+   u_int32_t   vr_ctl = 0, vr_status = 0, intdisable = 0;
bus_dmamap_ttxmap;
int i, runt = 0;
 
@@ -1259,6 +1259,15 @@ vr_encap(struct vr_softc *sc, struct vr_
}
 #endif
 
+   /*
+* We only want TX completion interrupts on every Nth packet.
+* We need to set VR_TXNEXT_INTDISABLE on every descriptor except
+* for the last discriptor of every Nth packet, where we set
+* VR_TXCTL_FINT.
+*/
+   if (sc-vr_cdata.vr_tx_pkts++ % 8 != 0)
+   intdisable = VR_TXNEXT_INTDISABLE;
+
if (m_new != NULL) {
m_freem(m_head);
 
@@ -1276,7 +1285,7 @@ vr_encap(struct vr_softc *sc, struct vr_
f-vr_ctl |= htole32(VR_TXCTL_FIRSTFRAG);
f-vr_status = htole32(vr_status);
f-vr_data = htole32(txmap-dm_segs[i].ds_addr);
-   f-vr_next = htole32(c-vr_nextdesc-vr_paddr);
+   f-vr_next = htole32(c-vr_nextdesc-vr_paddr | intdisable);
sc-vr_cdata.vr_tx_cnt++;
}
 
@@ -1288,12 +1297,15 @@ vr_encap(struct vr_softc *sc, struct vr_
VR_TXCTL_TLINK | vr_ctl);
f-vr_status = htole32(vr_status);
f-vr_data = 
htole32(sc-sc_zeromap.vrm_map-dm_segs[0].ds_addr);
-   f-vr_next = htole32(c-vr_nextdesc-vr_paddr);
+   f-vr_next = htole32(c-vr_nextdesc-vr_paddr | intdisable);
sc-vr_cdata.vr_tx_cnt++;
}
 
/* Set EOP on the last descriptor */
-   f-vr_ctl |= htole32(VR_TXCTL_LASTFRAG | VR_TXCTL_FINT);
+   f-vr_ctl |= htole32(VR_TXCTL_LASTFRAG);
+
+   if (!intdisable)
+   f-vr_ctl |= htole32(VR_TXCTL_FINT);
 
return (0);
 }
@@ -1582,6 +1594,15 @@ vr_watchdog(struct ifnet *ifp)
struct vr_softc *sc;
 
sc = ifp-if_softc;
+
+   /*
+* Because we're only asking for completion interrupts only every
+* few packets, occasionally the watchdog will fire when we have
+* some TX descriptors to reclaim, so check for that first.
+*/
+   vr_txeof(sc);
+   if (sc-vr_cdata.vr_tx_cnt == 0);
+   return;
 
ifp-if_oerrors++;
printf(%s: watchdog timeout\n, sc-sc_dev.dv_xname);
Index: dev/pci/if_vrreg.h
===
RCS file: /cvs/src/sys/dev/pci/if_vrreg.h,v
retrieving revision 1.33
diff -u -p -r1.33 if_vrreg.h
--- dev/pci/if_vrreg.h  16 Jan 2013 05:25:57 -  1.33
+++ dev/pci/if_vrreg.h  18 Jan 2013 02:04:55 -
@@ -429,6 +429,9 @@ struct vr_desc {
 #define VR_TXCTL_LASTFRAG  0x0040
 #define VR_TXCTL_FINT  0x0080
 
+/* TDES3 aka vr_next */
+#define VR_TXNEXT_INTDISABLE   0x0001
+
 #define VR_MAXFRAGS8
 #define VR_RX_LIST_CNT 128
 #define VR_TX_LIST_CNT 128
@@ -467,6 +470,7 @@ struct vr_chain_data {
struct vr_chain *vr_tx_cons;
struct vr_chain *vr_tx_prod;
int vr_tx_cnt;
+   int vr_tx_pkts;
 };
 
 struct vr_mii_frame {

-- 
Darren Tucker (dtucker at zip.com.au)
GPG key 8FF4FA69 / D9A3 86E9 7EEE AF4B B2D4  37C9 C982 80C7 8FF4 FA69
Good judgement comes with experience. Unfortunately, the experience
usually comes from bad judgement.



Re: vr(4) TX interrupt reduction

2013-01-17 Thread Chris Cappuccio
Darren Tucker [dtuc...@zip.com.au] wrote:
 On Fri, Jan 18, 2013 at 09:00:25AM +1100, Darren Tucker wrote:
  Thanks to Mark Patruck for noticing that the previous patch didn't
  actually help, due to a bug I introduced in a last minute obviously
  correct clean up.
 
 The turd polishing continues unabated.  This adds the interrupt disable
 bit to descriptors where there's more than one mbuf in a packet.
 

Just in case anyone else cares about low-end performance, look closely. This 
is a managed TX interrupt moderation. As Darren shows, it produces significant
benefit for the CPUs often paired with these low-end chips. Any other tulip-like
chip, or other fast E chip that has a per-packet interrupt disable bit
should be looked at.



Re: vr(4) TX interrupt reduction

2013-01-17 Thread Darren Tucker
On Fri, Jan 18, 2013 at 01:09:50PM +1100, Darren Tucker wrote:
 The turd polishing continues unabated.

and continues to continue.  This adds a quirk to frob the
interrupt-disable bit for VT6105M only.

After checking all of the spec sheets that I can find, I found the TX
interrupt disable bit (bit 0 in TDES3) in the specs for VT6105M,
VT6102 and VT8235.

It's not in the spec for VT6105 or VT6105LOM, however in those the
specs for the data pointer only include the top 30 bits, and bit 0
is unspecified.  It's not in VT86C100A03 at all.

What does all this mean?  beats me, but I'm concerned that setting the
low bit in an address might confuse chips not expecting it, so this
enables it only for chips that have been tested.  For the others, it'll
still do FINT interrupt reduction, which is similar to what we already
do (and more or less what FreeBSD does now).

Routing TCP through an ALIX (iperf at 85Mbit/s), this reduced CPU usage
from ~70% to ~50% and reduced the total interrupt load from 21k/sec
to 13k/sec.  Locally generating 64 byte UDP packets (the worst case,
basically) increases the output rate by about 50%.

ok?

Index: dev/pci/if_vr.c
===
RCS file: /cvs/src/sys/dev/pci/if_vr.c,v
retrieving revision 1.125
diff -u -p -r1.125 if_vr.c
--- dev/pci/if_vr.c 17 Jan 2013 21:49:48 -  1.125
+++ dev/pci/if_vr.c 18 Jan 2013 05:14:18 -
@@ -163,6 +163,7 @@ int vr_alloc_mbuf(struct vr_softc *, str
 #defineVR_Q_CSUM   (11)
 #defineVR_Q_CAM(12)
 #defineVR_Q_HWTAG  (13)
+#defineVR_Q_INTDISABLE (14)
 
 struct vr_type {
pci_vendor_id_t vr_vid;
@@ -178,7 +179,7 @@ struct vr_type {
{ PCI_VENDOR_VIATECH, PCI_PRODUCT_VIATECH_VT6105,
0 },
{ PCI_VENDOR_VIATECH, PCI_PRODUCT_VIATECH_VT6105M,
-   VR_Q_CSUM | VR_Q_CAM | VR_Q_HWTAG },
+   VR_Q_CSUM | VR_Q_CAM | VR_Q_HWTAG | VR_Q_INTDISABLE },
{ PCI_VENDOR_DELTA, PCI_PRODUCT_DELTA_RHINEII,
VR_Q_NEEDALIGN },
{ PCI_VENDOR_ADDTRON, PCI_PRODUCT_ADDTRON_RHINEII,
@@ -724,7 +725,7 @@ vr_list_tx_init(struct vr_softc *sc)
cd = sc-vr_cdata;
ld = sc-vr_ldata;
 
-   cd-vr_tx_cnt = 0;
+   cd-vr_tx_cnt = cd-vr_tx_pkts = 0;
 
for (i = 0; i  VR_TX_LIST_CNT; i++) {
cd-vr_tx_chain[i].vr_ptr = ld-vr_tx_list[i];
@@ -1198,7 +1199,7 @@ vr_encap(struct vr_softc *sc, struct vr_
struct vr_chain *c = *cp;
struct vr_desc  *f = NULL;
struct mbuf *m_new = NULL;
-   u_int32_t   vr_ctl = 0, vr_status = 0;
+   u_int32_t   vr_ctl = 0, vr_status = 0, intdisable = 0;
bus_dmamap_ttxmap;
int i, runt = 0;
 
@@ -1259,6 +1260,18 @@ vr_encap(struct vr_softc *sc, struct vr_
}
 #endif
 
+   /*
+* We only want TX completion interrupts on every Nth packet.
+* We need to set VR_TXNEXT_INTDISABLE on every descriptor except
+* for the last discriptor of every Nth packet, where we set
+* VR_TXCTL_FINT.  The former is in the specs for only some chips.
+* present: VT6102 VT6105M VT8235M
+* not present: VT86C100 6105LOM
+*/
+   if (++sc-vr_cdata.vr_tx_pkts % VR_TX_INTR_THRESH != 0 
+   sc-vr_quirks  VR_Q_INTDISABLE)
+   intdisable = VR_TXNEXT_INTDISABLE;
+
if (m_new != NULL) {
m_freem(m_head);
 
@@ -1276,7 +1289,7 @@ vr_encap(struct vr_softc *sc, struct vr_
f-vr_ctl |= htole32(VR_TXCTL_FIRSTFRAG);
f-vr_status = htole32(vr_status);
f-vr_data = htole32(txmap-dm_segs[i].ds_addr);
-   f-vr_next = htole32(c-vr_nextdesc-vr_paddr);
+   f-vr_next = htole32(c-vr_nextdesc-vr_paddr | intdisable);
sc-vr_cdata.vr_tx_cnt++;
}
 
@@ -1288,12 +1301,15 @@ vr_encap(struct vr_softc *sc, struct vr_
VR_TXCTL_TLINK | vr_ctl);
f-vr_status = htole32(vr_status);
f-vr_data = 
htole32(sc-sc_zeromap.vrm_map-dm_segs[0].ds_addr);
-   f-vr_next = htole32(c-vr_nextdesc-vr_paddr);
+   f-vr_next = htole32(c-vr_nextdesc-vr_paddr | intdisable);
sc-vr_cdata.vr_tx_cnt++;
}
 
/* Set EOP on the last descriptor */
-   f-vr_ctl |= htole32(VR_TXCTL_LASTFRAG | VR_TXCTL_FINT);
+   f-vr_ctl |= htole32(VR_TXCTL_LASTFRAG);
+
+   if (sc-vr_cdata.vr_tx_pkts % VR_TX_INTR_THRESH == 0)
+   f-vr_ctl |= htole32(VR_TXCTL_FINT);
 
return (0);
 }
@@ -1582,6 +1598,15 @@ vr_watchdog(struct ifnet *ifp)
struct vr_softc *sc;
 
sc = ifp-if_softc;
+
+   /*
+* Since we're only asking for completion interrupts only every
+* few packets, occasionally the watchdog will fire when we 

Re: vr(4) TX interrupt reduction

2013-01-15 Thread Darren Tucker
On Mon, Jan 14, 2013 at 10:10:55PM +1100, Darren Tucker wrote:
 On my ALIX, it increase the IP routing throughput from 80Mbit/s to
 85Mbit/s while reducing the interrupt CPU usage from 99% to 80%.

It turns out that due to an error on my part, most of this improvment
was due to one of the test kernels being complied without POOL_DEBUG
while the baseline one was.  Sigh.  I think there's still some gains
to be had, though.

-- 
Darren Tucker (dtucker at zip.com.au)
GPG key 8FF4FA69 / D9A3 86E9 7EEE AF4B B2D4  37C9 C982 80C7 8FF4 FA69
Good judgement comes with experience. Unfortunately, the experience
usually comes from bad judgement.



vr(4) TX interrupt reduction

2013-01-14 Thread Darren Tucker
Hi all.

This patch below reduces the number of interrupts on the transmit side
of vr(4).  Currently we set the TX competion interrupt bit on each
outbound packet.  This patch changes it to only set the interrupt bit on
the last packet enqueued.  (One thing of note is that the interrupt bit
is set on the first of potentially several TX descriptors for a packet
whereas currently it's the last.  I'm not sure if this matters, but if
it does I can change it).

On my ALIX, it increase the IP routing throughput from 80Mbit/s to
85Mbit/s while reducing the interrupt CPU usage from 99% to 80%.

Testing on any VIA Rhine chips would be appreciated (especially ones
that are not 6105M like my ALIX).

(Help/suggestions from jsing@ chris@ mikeb@ dlg@...)

Thanks.

Index: dev/pci/if_vr.c
===
RCS file: /cvs/src/sys/dev/pci/if_vr.c,v
retrieving revision 1.121
diff -u -p -r1.121 if_vr.c
--- dev/pci/if_vr.c 1 Dec 2012 09:55:03 -   1.121
+++ dev/pci/if_vr.c 14 Jan 2013 10:50:58 -
@@ -1254,8 +1254,8 @@ vr_encap(struct vr_softc *sc, struct vr_
sc-vr_cdata.vr_tx_cnt++;
}
 
-   /* Set EOP on the last desciptor */
-   f-vr_ctl |= htole32(VR_TXCTL_LASTFRAG | VR_TXCTL_FINT);
+   /* Set EOP on the last descriptor */
+   f-vr_ctl |= htole32(VR_TXCTL_LASTFRAG);
 
return (0);
 }
@@ -1265,6 +1265,13 @@ vr_encap(struct vr_softc *sc, struct vr_
  * to the mbuf data regions directly in the transmit lists. We also save a
  * copy of the pointers since the transmit list fragment pointers are
  * physical addresses.
+ *
+ * Notes:
+ * We only set ownership bit on first descriptor for a each packet.
+ * We'll only want an interrupt for a given packet if either vr_encap fails
+ * (ie the descriptor ring is full) or if the interface input queue is
+ * empty, and we need to set the interrupt bit before setting the owner bit
+ * to turn it over to the hardware.
  */
 
 void
@@ -1272,7 +1279,7 @@ vr_start(struct ifnet *ifp)
 {
struct vr_softc *sc;
struct mbuf *m_head;
-   struct vr_chain *cur_tx, *head_tx;
+   struct vr_chain *cur_tx, *head_tx, *prev_tx = NULL;
 
sc = ifp-if_softc;
 
@@ -1298,9 +1305,9 @@ vr_start(struct ifnet *ifp)
IF_PREPEND(ifp-if_snd, m_head);
break;
}
-
-   /* Only set ownership bit on first descriptor */
-   head_tx-vr_ptr-vr_status |= htole32(VR_TXSTAT_OWN);
+   /* successfully enqueued new packet, we're done with previous */
+   if (prev_tx != NULL)
+   prev_tx-vr_ptr-vr_status |= htole32(VR_TXSTAT_OWN);
 
 #if NBPFILTER  0
/*
@@ -1312,7 +1319,13 @@ vr_start(struct ifnet *ifp)
BPF_DIRECTION_OUT);
 #endif
cur_tx = cur_tx-vr_nextdesc;
+   prev_tx = head_tx;
+   }
+   if (prev_tx != NULL) {
+   prev_tx-vr_ptr-vr_ctl |= htole32(VR_TXCTL_FINT);
+   prev_tx-vr_ptr-vr_status |= htole32(VR_TXSTAT_OWN);
}
+
if (sc-vr_cdata.vr_tx_cnt != 0) {
sc-vr_cdata.vr_tx_prod = cur_tx;
 
-- 
Darren Tucker (dtucker at zip.com.au)
GPG key 8FF4FA69 / D9A3 86E9 7EEE AF4B B2D4  37C9 C982 80C7 8FF4 FA69
Good judgement comes with experience. Unfortunately, the experience
usually comes from bad judgement.



Re: vr(4) TX interrupt reduction

2013-01-14 Thread Michał Markowski
2013/1/14 Darren Tucker dtuc...@zip.com.au:
 Testing on any VIA Rhine chips would be appreciated (especially ones
 that are not 6105M like my ALIX).

Hi, nothing conclusive on VIA VT6107 (dmesg: vr0 at pci0 dev 10
function 0 VIA RhineII-2 rev 0x8d: irq 15, address
00:16:35:06:af:eb).

current kernel: 9835.56 int/s, 84.0 Mbps (100 s avg)
with your patch: 9857.85 int/s, 84.0 Mbps

No problems observed, though.

-- 
Michał Markowski



Re: vr(4) TX interrupt reduction

2013-01-14 Thread Chris Cappuccio
Micha?? Markowski [markows...@gmail.com] wrote:
 2013/1/14 Darren Tucker dtuc...@zip.com.au:
  Testing on any VIA Rhine chips would be appreciated (especially ones
  that are not 6105M like my ALIX).
 
 Hi, nothing conclusive on VIA VT6107 (dmesg: vr0 at pci0 dev 10
 function 0 VIA RhineII-2 rev 0x8d: irq 15, address
 00:16:35:06:af:eb).
 
 current kernel: 9835.56 int/s, 84.0 Mbps (100 s avg)
 with your patch: 9857.85 int/s, 84.0 Mbps
 
 No problems observed, though.
 

This will only affect TX direction interrupts. Can you try and generate a stream
of UDP traffic at full rate with a program like iperf to test just TX?



Re: vr(4) TX interrupt reduction

2013-01-14 Thread Stuart Henderson
On 2013/01/14 14:03, Chris Cappuccio wrote:
 Micha?? Markowski [markows...@gmail.com] wrote:
  2013/1/14 Darren Tucker dtuc...@zip.com.au:
   Testing on any VIA Rhine chips would be appreciated (especially ones
   that are not 6105M like my ALIX).
  
  Hi, nothing conclusive on VIA VT6107 (dmesg: vr0 at pci0 dev 10
  function 0 VIA RhineII-2 rev 0x8d: irq 15, address
  00:16:35:06:af:eb).
  
  current kernel: 9835.56 int/s, 84.0 Mbps (100 s avg)
  with your patch: 9857.85 int/s, 84.0 Mbps
  
  No problems observed, though.
  
 
 This will only affect TX direction interrupts. Can you try and generate a 
 stream
 of UDP traffic at full rate with a program like iperf to test just TX?
 

source: tcpbench -u -B packetsize somehost
sink: tcpbench -s -u



Re: vr(4) TX interrupt reduction

2013-01-14 Thread Michał Markowski
2013/1/14 Chris Cappuccio ch...@nmedia.net:
 This will only affect TX direction interrupts. Can you try and generate a 
 stream
 of UDP traffic at full rate with a program like iperf to test just TX?

Those numbers are from `iperf -c a -t 100 -i 10` on vr machine.
Iperf man page says: user must establish both a server (to discard
traffic) and a client (to generate traffic) so yes, it's TX.

When I try UDP test with -b higher than 87m:

$ iperf -c a -u -b 100m -t 100 -i 10

Client connecting to a, UDP port 5001
Sending 1470 byte datagrams
UDP buffer size: 9.00 KByte (default)

[  3] local 192.168.1.192 port 25889 connected with 192.168.1.50 port 5001
zsh: segmentation fault (core dumped)  iperf -c a -u -b 100m -t 100 -i 10
$ gdb /usr/local/bin/iperf iperf.core
GNU gdb 6.3
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type show copying to see the conditions.
There is absolutely no warranty for GDB.  Type show warranty for details.
This GDB was configured as i386-unknown-openbsd5.2...(no debugging
symbols found)

Core was generated by `iperf'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /usr/lib/libpthread.so.17.0...done.
Loaded symbols for /usr/lib/libpthread.so.17.0
Reading symbols from /usr/lib/libstdc++.so.55.0...done.
Loaded symbols for /usr/lib/libstdc++.so.55.0
Reading symbols from /usr/lib/libm.so.7.1...done.
Loaded symbols for /usr/lib/libm.so.7.1
Symbols already loaded for /usr/lib/libpthread.so.17.0
Reading symbols from /usr/lib/libc.so.66.1...done.
Loaded symbols for /usr/lib/libc.so.66.1
Reading symbols from /usr/libexec/ld.so...done.
Loaded symbols for /usr/libexec/ld.so
#0  strlen (str=0x2 Address 0x2 out of bounds) at
/usr/src/lib/libc/string/strlen.c:43
43  for (s = str; *s; ++s)
(gdb)


-- 
Michał Markowski



Re: vr(4) TX interrupt reduction

2013-01-14 Thread Chris Cappuccio
Micha?? Markowski [markows...@gmail.com] wrote:
 2013/1/14 Chris Cappuccio ch...@nmedia.net:
  This will only affect TX direction interrupts. Can you try and generate a 
  stream
  of UDP traffic at full rate with a program like iperf to test just TX?
 
 Those numbers are from `iperf -c a -t 100 -i 10` on vr machine.
 Iperf man page says: user must establish both a server (to discard
 traffic) and a client (to generate traffic) so yes, it's TX.
 

Well your numbers clearly show almost no difference. But TCP might not
be the best way to test due to the regular ack reply.

 When I try UDP test with -b higher than 87m:
 

Clearly, the current ports iperf is buggy. Stuart has the ticket on
tcpbench, in, uhh, udpbench mode.



Re: vr(4) TX interrupt reduction

2013-01-14 Thread Darren Tucker
On Mon, Jan 14, 2013 at 10:10:55PM +1100, Darren Tucker wrote:
 This patch below reduces the number of interrupts on the transmit side
 of vr(4).  Currently we set the TX competion interrupt bit on each
 outbound packet.  This patch changes it to only set the interrupt bit on
 the last packet enqueued.  (One thing of note is that the interrupt bit
 is set on the first of potentially several TX descriptors for a packet
 whereas currently it's the last.  I'm not sure if this matters, but if
 it does I can change it).

Here's an updated diff where, when a single packet spans multiple TX
descriptors, it'll only sent the interrupt request bit on the final
discriptor (which matches the current behaviour for that case).

Index: dev/pci/if_vr.c
===
RCS file: /cvs/src/sys/dev/pci/if_vr.c,v
retrieving revision 1.121
diff -u -p -r1.121 if_vr.c
--- dev/pci/if_vr.c 1 Dec 2012 09:55:03 -   1.121
+++ dev/pci/if_vr.c 14 Jan 2013 23:23:57 -
@@ -1254,8 +1254,8 @@ vr_encap(struct vr_softc *sc, struct vr_
sc-vr_cdata.vr_tx_cnt++;
}
 
-   /* Set EOP on the last desciptor */
-   f-vr_ctl |= htole32(VR_TXCTL_LASTFRAG | VR_TXCTL_FINT);
+   /* Set EOP on the last descriptor */
+   f-vr_ctl |= htole32(VR_TXCTL_LASTFRAG);
 
return (0);
 }
@@ -1265,6 +1265,13 @@ vr_encap(struct vr_softc *sc, struct vr_
  * to the mbuf data regions directly in the transmit lists. We also save a
  * copy of the pointers since the transmit list fragment pointers are
  * physical addresses.
+ *
+ * Notes:
+ * We only set ownership bit on first descriptor for a each packet.
+ * We'll only want an interrupt for a given packet if either vr_encap fails
+ * (ie the descriptor ring is full) or if the interface input queue is
+ * empty, and we need to set the interrupt bit before setting the owner bit
+ * to turn it over to the hardware.
  */
 
 void
@@ -1273,6 +1280,7 @@ vr_start(struct ifnet *ifp)
struct vr_softc *sc;
struct mbuf *m_head;
struct vr_chain *cur_tx, *head_tx;
+   struct vr_chain *prev_head_tx = NULL, *prev_tail_tx = NULL;
 
sc = ifp-if_softc;
 
@@ -1298,9 +1306,10 @@ vr_start(struct ifnet *ifp)
IF_PREPEND(ifp-if_snd, m_head);
break;
}
-
-   /* Only set ownership bit on first descriptor */
-   head_tx-vr_ptr-vr_status |= htole32(VR_TXSTAT_OWN);
+   /* successfully enqueued new packet, we're done with previous */
+   if (prev_head_tx != NULL)
+   prev_head_tx-vr_ptr-vr_status |=
+   htole32(VR_TXSTAT_OWN);
 
 #if NBPFILTER  0
/*
@@ -1311,9 +1320,22 @@ vr_start(struct ifnet *ifp)
bpf_mtap_ether(ifp-if_bpf, head_tx-vr_mbuf,
BPF_DIRECTION_OUT);
 #endif
+   prev_head_tx = head_tx;
+   prev_tail_tx = cur_tx;
cur_tx = cur_tx-vr_nextdesc;
}
if (sc-vr_cdata.vr_tx_cnt != 0) {
+   /*
+* If we enqueued any packets, we want to set FINT on the last
+* descriptor for the last packet, but set TXOWN on the first
+* descriptor of the last packet.
+*/
+   if (prev_tail_tx != NULL)
+   prev_tail_tx-vr_ptr-vr_ctl |= htole32(VR_TXCTL_FINT);
+   if (prev_head_tx != NULL)
+   prev_head_tx-vr_ptr-vr_status |=
+   htole32(VR_TXSTAT_OWN);
+
sc-vr_cdata.vr_tx_prod = cur_tx;
 
bus_dmamap_sync(sc-sc_dmat, sc-sc_listmap.vrm_map, 0,

-- 
Darren Tucker (dtucker at zip.com.au)
GPG key 8FF4FA69 / D9A3 86E9 7EEE AF4B B2D4  37C9 C982 80C7 8FF4 FA69
Good judgement comes with experience. Unfortunately, the experience
usually comes from bad judgement.



Re: vr(4) TX interrupt reduction

2013-01-14 Thread Michał Markowski
2013/1/15 Chris Cappuccio ch...@nmedia.net:
 Well your numbers clearly show almost no difference. But TCP might not
 be the best way to test due to the regular ack reply.

Ok, -current kernel:

$ int1=`vmstat -i | awk '$1 ~ /vr0/ {print $2}'`; tcpbench -u -t 100
-r 1 r; int2=`vmstat -i | awk '$1 ~ /vr0/ {print $2}'`; let intps
= (int2 - int1) / 100; echo int/s: $intps
Elapsed:   1 Mbps:  87.668 Peak Mbps:  87.668 Tx PPS:7444
Elapsed:   2 Mbps:  87.217 Peak Mbps:  87.668 Tx PPS:7406
Elapsed:   3 Mbps:  87.210 Peak Mbps:  87.668 Tx PPS:7405
Elapsed:   4 Mbps:  87.212 Peak Mbps:  87.668 Tx PPS:7405
Elapsed:   5 Mbps:  87.210 Peak Mbps:  87.668 Tx PPS:7405
Elapsed:   6 Mbps:  87.210 Peak Mbps:  87.668 Tx PPS:7405
Elapsed:   7 Mbps:  87.215 Peak Mbps:  87.668 Tx PPS:7406
Elapsed:   8 Mbps:  87.208 Peak Mbps:  87.668 Tx PPS:7405
Elapsed:   89995 Mbps:  87.207 Peak Mbps:  87.668 Tx PPS:7405
Elapsed:  15 Mbps:  87.212 Peak Mbps:  87.668 Tx PPS:7405
int/s: 7472

Patched (http://marc.info/?l=openbsd-techm=135820613320957q=raw) kernel:

$ int1=`vmstat -i | awk '$1 ~ /vr0/ {print $2}'`; tcpbench -u -t 100
-r 1 r; int2=`vmstat -i | awk '$1 ~ /vr0/ {print $2}'`; let intps
= (int2 - int1) / 100; echo int/s: $intps
Elapsed:   1 Mbps:  87.638 Peak Mbps:  87.638 Tx PPS:7442
Elapsed:   2 Mbps:  87.198 Peak Mbps:  87.638 Tx PPS:7404
Elapsed:   3 Mbps:  87.193 Peak Mbps:  87.638 Tx PPS:7404
Elapsed:   4 Mbps:  87.192 Peak Mbps:  87.638 Tx PPS:7404
Elapsed:   5 Mbps:  87.192 Peak Mbps:  87.638 Tx PPS:7404
Elapsed:   6 Mbps:  87.199 Peak Mbps:  87.638 Tx PPS:7404
Elapsed:   7 Mbps:  87.194 Peak Mbps:  87.638 Tx PPS:7404
Elapsed:   8 Mbps:  87.193 Peak Mbps:  87.638 Tx PPS:7404
Elapsed:   9 Mbps:  87.198 Peak Mbps:  87.638 Tx PPS:7404
Elapsed:  10 Mbps:  87.190 Peak Mbps:  87.638 Tx PPS:7404
int/s: 7471

-- 
Michał Markowski