On Wed, Jun 22, 2016 at 01:58:25PM +0200, Simon Mages wrote:
> On a system where you use the maximum socket buffer size of 256 kbyte you
> can run out of memory after less than 9k open sockets.
> 
> My patch adds a new uvm_constraint for the mbufs with a bigger memory area.
> I chose this area after reading the comments in
> sys/arch/amd64/include/pmap.h.
> This patch further changes the maximum socket buffer size from 256k to 1GB,
> as described in RFC 1323, section 2.3.
> 
> I tested this diff with the ix, em and urndis drivers. I know that this
> diff only works for amd64 right now, but I wanted to send it as a proposal
> for what could be done. Maybe somebody has a different solution for this
> problem or can tell me why this is a bad idea.

hey simon,

first, some background.

the 4G watermark is less about limiting the amount of memory used
by the network stack and more about making the memory addressable
by as many devices, including network cards, as possible. we support
older chips that only deal with 32 bit addresses (and one or two
stupid ones with an inability to address over 1G), so we took the
conservative option and made the memory generally usable without
developers having to think about it much.

you could argue that you should be able to give big addresses to
modern cards, but that falls down if you are forwarding packets
between a modern card and an old one, cos the old card will want
to dma the packet the modern card rxed, but it needs it below the
4g line. even if you don't have an old card, in today's hotplug
world you might plug an old device in. either way, the future of
an mbuf is very hard for the kernel to predict.

secondly, allocating more than 4g at a time to socket buffers is
generally a waste of memory. in practice you should scale the amount
of memory available to sockets according to the size of the tcp
windows you need to saturate the bandwidth available to the box,
ie, the bandwidth-delay product. this means that if you want to
sustain a gigabit of traffic with a 300ms round trip time for
packets, you'd "only" need ~37.5 megabytes of buffers. to sustain
40 gigabit you'd need 1.5 gigabytes, which is still below 4G.
allowing more memory than that for buffers would likely just induce
latency.
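
to spell out the sums, that figure is just the bandwidth-delay
product. a standalone userland sketch of the arithmetic (not kernel
code, just the numbers above):

/*
 * back-of-the-envelope bandwidth-delay product: bytes of socket
 * buffer needed to keep a link busy at a given round trip time.
 */
#include <stdio.h>

int
main(void)
{
	const double rtt = 0.3;			/* 300ms round trip */
	const double gbits[] = { 1.0, 40.0 };	/* link speeds in Gb/s */
	size_t i;

	for (i = 0; i < sizeof(gbits) / sizeof(gbits[0]); i++) {
		double bytes = gbits[i] * 1e9 / 8.0 * rtt;
		printf("%2.0f Gb/s x 300ms rtt = %.1f MB of buffer\n",
		    gbits[i], bytes / 1e6);
	}

	return (0);
}

that prints 37.5 MB for the gigabit case and 1500 MB for the 40
gigabit case.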

the above means that if you want to sustain a single 40G tcp
connection to that host you'd need to be able to place 1.5G on the
socket buffer, which is above the 1G limit you mention. however,
if you want to sustain 2 connections, you ideally want to share
the 1.5G fairly between both sockets; they should get 750M each.

fair sharing of buffers between sockets may already be in place
in openbsd. when i reworked the pool subsystem i set it up so
things sleeping on memory are woken up in order.

it occurs to me that perhaps we should limit mbufs by the bytes
they can use rather than the number of them. that would also work
well if we moved to per cpu caches for mbufs and clusters, cos the
number of active mbufs in the system becomes hard to limit accurately
if we want cpus to run independently.
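
to illustrate the idea, and only as a sketch with made-up names
(none of this exists in the tree), byte accounting could be little
more than a counter that clusters of every size debit:

/*
 * sketch only: cluster allocations debit a byte counter instead of
 * a per-pool item count, so a 64k cluster costs 32 times what a 2k
 * cluster does. a real version would need a mutex or per-cpu
 * counters rather than bare globals.
 */
static long mbuf_bytes_max = 256L * 1024 * 1024; /* cap on cluster memory */
static long mbuf_bytes_used;

int
mbuf_bytes_reserve(long size)
{
	if (mbuf_bytes_used + size > mbuf_bytes_max)
		return (0);		/* over the cap, fail the allocation */
	mbuf_bytes_used += size;
	return (1);
}

void
mbuf_bytes_release(long size)
{
	mbuf_bytes_used -= size;
}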

if you want something to work on in this area, could you look at
letting sockets use the "jumbo" clusters instead of assuming
everything has to be in 2k clusters? i started on this with the
diff below, but it broke ospfd and i never got back to it.

if you get it working, it would be interesting to test creating
even bigger cluster pools, eg, a 1M or 4M mbuf cluster.
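
for what it's worth, by bigger pools i just mean extra entries at
the top of the cluster size list. purely as a sketch (the smaller
sizes below are from memory, check the real table before copying
anything):

/*
 * illustrative only: the real list is mclsizes[] in uipc_mbuf.c.
 * the point is just appending 1M and 4M pools, and bumping
 * MAXMCLBYTES to match, so MCLGETI has bigger clusters to hand out.
 */
static const unsigned int example_mclsizes[] = {
	2 * 1024,		/* MCLBYTES, must stay first */
	4 * 1024,
	8 * 1024,
	9 * 1024,		/* jumbo frames */
	16 * 1024,
	64 * 1024,		/* current biggest */
	1024 * 1024,		/* experimental 1M clusters */
	4 * 1024 * 1024,	/* experimental 4M clusters */
};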

cheers,
dlg

Index: uipc_socket.c
===================================================================
RCS file: /cvs/src/sys/kern/uipc_socket.c,v
retrieving revision 1.135
diff -u -p -r1.135 uipc_socket.c
--- uipc_socket.c       11 Dec 2014 19:21:57 -0000      1.135
+++ uipc_socket.c       22 Dec 2014 01:11:03 -0000
@@ -493,15 +493,18 @@ restart:
                                        mlen = MLEN;
                                }
                                if (resid >= MINCLSIZE && space >= MCLBYTES) {
-                                       MCLGET(m, M_NOWAIT);
+                                       MCLGETI(m, M_NOWAIT, NULL, lmin(resid,
+                                           lmin(space, MAXMCLBYTES)));
                                        if ((m->m_flags & M_EXT) == 0)
                                                goto nopages;
                                        if (atomic && top == 0) {
-                                               len = lmin(MCLBYTES - max_hdr,
-                                                   resid);
+                                               len = lmin(resid,
+                                                   m->m_ext.ext_size -
+                                                   max_hdr);
                                                m->m_data += max_hdr;
                                        } else
-                                               len = lmin(MCLBYTES, resid);
+                                               len = lmin(resid,
+                                                   m->m_ext.ext_size);
                                        space -= len;
                                } else {
 nopages:
