Re: get rid of em_realign()

2015-08-14 Thread David Gwynne

 On 14 Aug 2015, at 02:44, Theo de Raadt dera...@cvs.openbsd.org wrote:
 
 Regarding other strict-alignment architectures, em(4) is probably one
 of the more popular gigabit ethernet options for those architectures
 that have PCI slots.  I don't think any of these machines are severely
 memory starved, but memory might be limited to something like 256MB of
 physical memory.
 
 I have a few of these setups.  I'll keep an eye on it after it is commited.
 

ok by me too.

generally, the memory used by our rx rings compared to other OSes is 
hilariously tiny. we have some wiggle room.



Re: get rid of em_realign()

2015-08-13 Thread Martin Pieuchot
On 11/08/15(Tue) 16:53, Mark Kettenis wrote:
 Like ix(4), em(4) hardware doesn't provide an easy/efficient way to
 guarantee alignment of protocol headers for received mbufs.  The
 current code makes an attempt to m_adj() the mbuf if the maximum
 hardware frame size is smaller than the cluster size.  But ever since
 we changed our drivers to always accept jumbo frames if the hardware
 is capable of doing so, this codepath is effectively dead on
 jumbo-capable hardware.  So on jumbo-capable hardware we'll always end
 up patching up the mbuf after the fact, which effectively means we'll
 do a bcopy() of the entire packet on strict alignment architectures.
 
 The diff below avoids this by taking an approach similar to ix(4).
 This means that on strict alignment architectures, we'll use a larger
 cluster size such that after m_adj()usting the mbuf there are at least
 2048 bytes for the hardware to play with.  This means we'll waste
 (quite) a bit of memory on those architectures.  But I think it makes
 more sense than copying around all the data in the rx path, especially
 with the direction we've taken with processing received packets in a
 kernel thread that may be running on a different CPU than the
 interrupt handler.  As a bonus, it simplifies the code quite a bit.

I like code simplification and the fact that it uses an approach similar
to ix(4).

 If we ever manage to fix the *8 pool diff, we can reduce the amount of
 wasted memory, by adding a suitably sized mbuf pool on
 strict-alignment architectures.  At least I think that would be
 preferable over the dedicated driver-specific pool that dlg@ proposed
 for ix(4).

This makes sense to me.

 That is why I specified the size of the cluster as
 
   #define EM_MCLBYTES (EM_RXBUFFER_2048 + ETHER_ALIGN)
 
 Unfortunately that means that systat mbuf output is a bit confusing:
 
 IFACE LIVELOCKS  SIZE ALIVE   LWM   HWM   CWM
 System0   25624   4
  409614  12
 
 ...
 
 em0  20501410   25614
 
 
 See how the mbufs used by em0 are reported as 4096 under System, but
 2050 under em0?  We could fix that by looking up the right pool, and
 using its item size in the rxr ioctl if people think this is a problem.

How many sparc64 come with em(4)?  Can we assume that the amount of
wasted memory on such system is acceptable?  What about other strict-
alignment architectures?

 Index: if_em.c
 ===
 RCS file: /cvs/src/sys/dev/pci/if_em.c,v
 retrieving revision 1.299
 diff -u -p -r1.299 if_em.c
 --- if_em.c   24 Jun 2015 09:40:54 -  1.299
 +++ if_em.c   11 Aug 2015 14:20:10 -
 @@ -229,11 +229,6 @@ void em_disable_aspm(struct em_softc *);
  void em_txeof(struct em_softc *);
  int  em_allocate_receive_structures(struct em_softc *);
  int  em_allocate_transmit_structures(struct em_softc *);
 -#ifdef __STRICT_ALIGNMENT
 -void em_realign(struct em_softc *, struct mbuf *, u_int16_t *);
 -#else
 -#define em_realign(a, b, c) /* a, b, c */
 -#endif
  int  em_rxfill(struct em_softc *);
  void em_rxeof(struct em_softc *);
  void em_receive_checksum(struct em_softc *, struct em_rx_desc *,
 @@ -706,7 +701,7 @@ em_ioctl(struct ifnet *ifp, u_long comma
  
   case SIOCGIFRXR:
   error = if_rxr_ioctl((struct if_rxrinfo *)ifr-ifr_data,
 - NULL, MCLBYTES, sc-rx_ring);
 + NULL, EM_MCLBYTES, sc-rx_ring);
   break;
  
   default:
 @@ -2518,14 +2513,15 @@ em_get_buf(struct em_softc *sc, int i)
   return (ENOBUFS);
   }
  
 - m = MCLGETI(NULL, M_DONTWAIT, NULL, MCLBYTES);
 + m = MCLGETI(NULL, M_DONTWAIT, NULL, EM_MCLBYTES);
   if (!m) {
   sc-mbuf_cluster_failed++;
   return (ENOBUFS);
   }
 - m-m_len = m-m_pkthdr.len = MCLBYTES;
 - if (sc-hw.max_frame_size = (MCLBYTES - ETHER_ALIGN))
 - m_adj(m, ETHER_ALIGN);
 + m-m_len = m-m_pkthdr.len = EM_MCLBYTES;
 +#ifdef __STRICT_ALIGNMENT
 + m_adj(m, ETHER_ALIGN);
 +#endif
  
   error = bus_dmamap_load_mbuf(sc-rxtag, pkt-map, m, BUS_DMA_NOWAIT);
   if (error) {
 @@ -2574,8 +2570,8 @@ em_allocate_receive_structures(struct em
  
   rx_buffer = sc-rx_buffer_area;
   for (i = 0; i  sc-num_rx_desc; i++, rx_buffer++) {
 - error = bus_dmamap_create(sc-rxtag, MCLBYTES, 1,
 - MCLBYTES, 0, BUS_DMA_NOWAIT, rx_buffer-map);
 + error = bus_dmamap_create(sc-rxtag, EM_MCLBYTES, 1,
 + EM_MCLBYTES, 0, BUS_DMA_NOWAIT, rx_buffer-map);
   if (error != 0) {
   printf(%s: em_allocate_receive_structures: 
   bus_dmamap_create failed; error %u\n,
 @@ -2771,53 +2767,6 @@ em_free_receive_structures(struct em_sof
   }
  }
  
 -#ifdef __STRICT_ALIGNMENT
 -void
 -em_realign(struct em_softc *sc, 

Re: get rid of em_realign()

2015-08-13 Thread Ted Unangst
Martin Pieuchot wrote:
 
 How many sparc64 come with em(4)?  Can we assume that the amount of
 wasted memory on such system is acceptable?  What about other strict-
 alignment architectures?

just(? mostly?) t5120. mine has 32gb in it. it is, or could be, a popular
openbsd machine. it's also new enough it's almost certainly got memory to
spare, even if you only have 8gb or so. (it does have 4 em, btw.)



Re: get rid of em_realign()

2015-08-13 Thread Mark Kettenis
 From: Ted Unangst t...@tedunangst.com
 Date: Thu, 13 Aug 2015 10:04:55 -0400
 
 Martin Pieuchot wrote:
  
  How many sparc64 come with em(4)?  Can we assume that the amount of
  wasted memory on such system is acceptable?  What about other strict-
  alignment architectures?
 
 just(? mostly?) t5120. mine has 32gb in it. it is, or could be, a popular
 openbsd machine. it's also new enough it's almost certainly got memory to
 spare, even if you only have 8gb or so. (it does have 4 em, btw.)

The oldest sparc64 machine to ship with onboard em(4) was the t2k,
which really isn't such a memory starved machine either.

Regarding other strict-alignment architectures, em(4) is probably one
of the more popular gigabit ethernet options for those architectures
that have PCI slots.  I don't think any of these machines are severely
memory starved, but memory might be limited to something like 256MB of
physical memory.

Since we don't fully populate the em(4) rx rings, the memory
consumption of the mbufs on the rings isn't going to be very large on
interfaces that don't receive a lot of traffic.  But I don't have a
good idea how likely it is for a lot of received mbufs to be queued up
in the network stack.



Re: get rid of em_realign()

2015-08-13 Thread Martin Pieuchot
On 13/08/15(Thu) 16:43, Mark Kettenis wrote:
  From: Ted Unangst t...@tedunangst.com
  Date: Thu, 13 Aug 2015 10:04:55 -0400
  
  Martin Pieuchot wrote:
   
   How many sparc64 come with em(4)?  Can we assume that the amount of
   wasted memory on such system is acceptable?  What about other strict-
   alignment architectures?
  
  just(? mostly?) t5120. mine has 32gb in it. it is, or could be, a popular
  openbsd machine. it's also new enough it's almost certainly got memory to
  spare, even if you only have 8gb or so. (it does have 4 em, btw.)
 
 The oldest sparc64 machine to ship with onboard em(4) was the t2k,
 which really isn't such a memory starved machine either.
 
 Regarding other strict-alignment architectures, em(4) is probably one
 of the more popular gigabit ethernet options for those architectures
 that have PCI slots.  I don't think any of these machines are severely
 memory starved, but memory might be limited to something like 256MB of
 physical memory.
 
 Since we don't fully populate the em(4) rx rings, the memory
 consumption of the mbufs on the rings isn't going to be very large on
 interfaces that don't receive a lot of traffic.  But I don't have a
 good idea how likely it is for a lot of received mbufs to be queued up
 in the network stack.

Then I'd say the best way to learn is to commit your diff :)



Re: get rid of em_realign()

2015-08-13 Thread Theo de Raadt
 Regarding other strict-alignment architectures, em(4) is probably one
 of the more popular gigabit ethernet options for those architectures
 that have PCI slots.  I don't think any of these machines are severely
 memory starved, but memory might be limited to something like 256MB of
 physical memory.

I have a few of these setups.  I'll keep an eye on it after it is commited.



get rid of em_realign()

2015-08-11 Thread Mark Kettenis
Like ix(4), em(4) hardware doesn't provide an easy/efficient way to
guarantee alignment of protocol headers for received mbufs.  The
current code makes an attempt to m_adj() the mbuf if the maximum
hardware frame size is smaller than the cluster size.  But ever since
we changed our drivers to always accept jumbo frames if the hardware
is capable of doing so, this codepath is effectively dead on
jumbo-capable hardware.  So on jumbo-capable hardware we'll always end
up patching up the mbuf after the fact, which effectively means we'll
do a bcopy() of the entire packet on strict alignment architectures.

The diff below avoids this by taking an approach similar to ix(4).
This means that on strict alignment architectures, we'll use a larger
cluster size such that after m_adj()usting the mbuf there are at least
2048 bytes for the hardware to play with.  This means we'll waste
(quite) a bit of memory on those architectures.  But I think it makes
more sense than copying around all the data in the rx path, especially
with the direction we've taken with processing received packets in a
kernel thread that may be running on a different CPU than the
interrupt handler.  As a bonus, it simplifies the code quite a bit.

If we ever manage to fix the *8 pool diff, we can reduce the amount of
wasted memory, by adding a suitably sized mbuf pool on
strict-alignment architectures.  At least I think that would be
preferable over the dedicated driver-specific pool that dlg@ proposed
for ix(4).  That is why I specified the size of the cluster as

  #define EM_MCLBYTES   (EM_RXBUFFER_2048 + ETHER_ALIGN)

Unfortunately that means that systat mbuf output is a bit confusing:

IFACE LIVELOCKS  SIZE ALIVE   LWM   HWM   CWM
System0   25624   4
 409614  12

...

em0  20501410   25614


See how the mbufs used by em0 are reported as 4096 under System, but
2050 under em0?  We could fix that by looking up the right pool, and
using its item size in the rxr ioctl if people think this is a problem.

ok?


Index: if_em.c
===
RCS file: /cvs/src/sys/dev/pci/if_em.c,v
retrieving revision 1.299
diff -u -p -r1.299 if_em.c
--- if_em.c 24 Jun 2015 09:40:54 -  1.299
+++ if_em.c 11 Aug 2015 14:20:10 -
@@ -229,11 +229,6 @@ void em_disable_aspm(struct em_softc *);
 void em_txeof(struct em_softc *);
 int  em_allocate_receive_structures(struct em_softc *);
 int  em_allocate_transmit_structures(struct em_softc *);
-#ifdef __STRICT_ALIGNMENT
-void em_realign(struct em_softc *, struct mbuf *, u_int16_t *);
-#else
-#define em_realign(a, b, c) /* a, b, c */
-#endif
 int  em_rxfill(struct em_softc *);
 void em_rxeof(struct em_softc *);
 void em_receive_checksum(struct em_softc *, struct em_rx_desc *,
@@ -706,7 +701,7 @@ em_ioctl(struct ifnet *ifp, u_long comma
 
case SIOCGIFRXR:
error = if_rxr_ioctl((struct if_rxrinfo *)ifr-ifr_data,
-   NULL, MCLBYTES, sc-rx_ring);
+   NULL, EM_MCLBYTES, sc-rx_ring);
break;
 
default:
@@ -2518,14 +2513,15 @@ em_get_buf(struct em_softc *sc, int i)
return (ENOBUFS);
}
 
-   m = MCLGETI(NULL, M_DONTWAIT, NULL, MCLBYTES);
+   m = MCLGETI(NULL, M_DONTWAIT, NULL, EM_MCLBYTES);
if (!m) {
sc-mbuf_cluster_failed++;
return (ENOBUFS);
}
-   m-m_len = m-m_pkthdr.len = MCLBYTES;
-   if (sc-hw.max_frame_size = (MCLBYTES - ETHER_ALIGN))
-   m_adj(m, ETHER_ALIGN);
+   m-m_len = m-m_pkthdr.len = EM_MCLBYTES;
+#ifdef __STRICT_ALIGNMENT
+   m_adj(m, ETHER_ALIGN);
+#endif
 
error = bus_dmamap_load_mbuf(sc-rxtag, pkt-map, m, BUS_DMA_NOWAIT);
if (error) {
@@ -2574,8 +2570,8 @@ em_allocate_receive_structures(struct em
 
rx_buffer = sc-rx_buffer_area;
for (i = 0; i  sc-num_rx_desc; i++, rx_buffer++) {
-   error = bus_dmamap_create(sc-rxtag, MCLBYTES, 1,
-   MCLBYTES, 0, BUS_DMA_NOWAIT, rx_buffer-map);
+   error = bus_dmamap_create(sc-rxtag, EM_MCLBYTES, 1,
+   EM_MCLBYTES, 0, BUS_DMA_NOWAIT, rx_buffer-map);
if (error != 0) {
printf(%s: em_allocate_receive_structures: 
bus_dmamap_create failed; error %u\n,
@@ -2771,53 +2767,6 @@ em_free_receive_structures(struct em_sof
}
 }
 
-#ifdef __STRICT_ALIGNMENT
-void
-em_realign(struct em_softc *sc, struct mbuf *m, u_int16_t *prev_len_adj)
-{
-   unsigned char tmp_align_buf[ETHER_ALIGN];
-   int tmp_align_buf_len = 0;
-
-   /*
-* The Ethernet payload is not 32-bit aligned when
-* Jumbo packets are enabled, so on architectures with
-* strict alignment we need to shift the entire packet
-* ETHER_ALIGN bytes. Ugh.
-