On Wed, Jun 22, 2016 at 01:58:25PM +0200, Simon Mages wrote:
> On a system where you use the maximum socket buffer size of 256 kbyte,
> you can run out of memory after less than 9k open sockets.
>
> My patch adds a new uvm_constraint for the mbufs with a bigger memory
> area. I chose this area after reading the comments in
> sys/arch/amd64/include/pmap.h.
>
> This patch further changes the maximum socket buffer size from 256k to
> 1GB, as described in RFC 1323 S2.3.
>
> I tested this diff with the ix, em and urndis drivers. I know that this
> diff only works for amd64 right now, but I wanted to send it as a
> proposal for what could be done. Maybe somebody has a different
> solution for this problem, or can tell me why this is a bad idea.
hey simon,

first, some background. the 4G watermark is less about limiting the amount of memory used by the network stack and more about making the memory addressable by as many devices, including network cards, as possible. we support older chips that can only deal with 32 bit addresses (and one or two stupid ones with an inability to address over 1G), so we took the conservative option and made the memory generally usable without developers having to think about it much.

you could argue that you should be able to give big addresses to modern cards, but that falls down if you are forwarding packets between a modern and an old card, cos the old card will want to dma the packet the modern card rxed, but it needs it below the 4g line. even if you dont have an old card, in todays hotplug world you might plug an old device in. either way, the future of an mbuf is very hard for the kernel to predict.

secondly, allocating more than 4g at a time to socket buffers is generally a waste of memory. in practice you should scale the amount of memory available to sockets according to the size of the tcp windows you need to saturate the bandwidth available to the box. this means if you want to sustain a gigabit of traffic with a 300ms round trip time for packets, you'd "only" need ~37.5 megabytes of buffers. to sustain 40 gigabit you'd need 1.5 gigabytes, which is still below 4G. allowing more use of memory for buffers would likely just induce latency.

the above means that if you want to sustain a single 40G tcp connection to that host, you'd need to be able to place 1.5G on the socket buffer, which is above the 1G you mention above. however, if you want to sustain 2 connections, you ideally want to fairly share the 1.5G between both sockets, so they should get 750M each. fairly sharing buffers between sockets may already be in place in openbsd: when i reworked the pools subsystem i set it up so things sleeping on memory are woken up in order.
it occurs to me that perhaps we should limit mbufs by the bytes they can use rather than the number of them. that would also work well if we moved to per cpu caches for mbufs and clusters, cos the number of active mbufs in the system becomes hard to limit accurately if we want cpus to run independently.

if you want something to work on in this area, could you look at letting sockets use the "jumbo" clusters instead of assuming everything has to be in 2k clusters? i started on this with the diff below, but it broke ospfd and i never got back to it. if you get it working, it would be interesting to test creating even bigger cluster pools, eg, a 1M or 4M mbuf cluster.

cheers,
dlg

Index: uipc_socket.c
===================================================================
RCS file: /cvs/src/sys/kern/uipc_socket.c,v
retrieving revision 1.135
diff -u -p -r1.135 uipc_socket.c
--- uipc_socket.c	11 Dec 2014 19:21:57 -0000	1.135
+++ uipc_socket.c	22 Dec 2014 01:11:03 -0000
@@ -493,15 +493,18 @@ restart:
 				mlen = MLEN;
 			}
 			if (resid >= MINCLSIZE && space >= MCLBYTES) {
-				MCLGET(m, M_NOWAIT);
+				MCLGETI(m, M_NOWAIT, NULL, lmin(resid,
+				    lmin(space, MAXMCLBYTES)));
 				if ((m->m_flags & M_EXT) == 0)
 					goto nopages;
 				if (atomic && top == 0) {
-					len = lmin(MCLBYTES - max_hdr,
-					    resid);
+					len = lmin(resid,
+					    m->m_ext.ext_size -
+					    max_hdr);
 					m->m_data += max_hdr;
 				} else
-					len = lmin(MCLBYTES, resid);
+					len = lmin(resid,
+					    m->m_ext.ext_size);
 				space -= len;
 			} else {
 nopages: