Re: [PATCH] let the mbufs use more then 4gb of memory

2016-08-12 Thread Mark Kettenis
> Date: Fri, 12 Aug 2016 14:26:34 +0200
> From: Claudio Jeker 
> 
> On Fri, Aug 12, 2016 at 04:38:45PM +1000, David Gwynne wrote:
> > 
> > > On 1 Aug 2016, at 21:07, Simon Mages  wrote:
> > > 
> > > I sent this message to dlg@ directly to discuss my modification of his
> > > diff to make the bigger mbuf clusters work. i got no response so far,
> > > that's why i decided to post it on tech@ directly. Maybe this way i get
> > > some feedback faster :)
> > 
> > hey simon,
> > 
> > i was travelling when you sent your mail to me and then it fell out of my 
> > head. sorry about that.
> > 
> > if this is working correctly then i would like to put it in the tree. from 
> > the light testing i have done, it is working correctly. would anyone object?
> > 
> > some performance measurement would also be interesting :)
> > 
> 
> I would prefer we take the diff I started at n2k16. I need to dig it out
> though.

I think the subject of the thread has become misleading.  At least the
diff I think David and Simon are talking about is about using the
larger mbuf pools for socket buffers and no longer about using memory
>4G for them.

David, Simon, best to start all over again, and repost the diff with a
proper subject and explanation.  You shouldn't be forcing other
developers to read through several pages of private conversations.



Re: [PATCH] let the mbufs use more then 4gb of memory

2016-08-12 Thread Claudio Jeker
On Fri, Aug 12, 2016 at 04:38:45PM +1000, David Gwynne wrote:
> 
> > On 1 Aug 2016, at 21:07, Simon Mages  wrote:
> > 
> > I sent this message to dlg@ directly to discuss my modification of his
> > diff to make the bigger mbuf clusters work. i got no response so far,
> > that's why i decided to post it on tech@ directly. Maybe this way i get
> > some feedback faster :)
> 
> hey simon,
> 
> i was travelling when you sent your mail to me and then it fell out of my 
> head. sorry about that.
> 
> if this is working correctly then i would like to put it in the tree. from 
> the light testing i have done, it is working correctly. would anyone object?
> 
> some performance measurement would also be interesting :)
> 

I would prefer we take the diff I started at n2k16. I need to dig it out
though.

-- 
:wq Claudio



Re: [PATCH] let the mbufs use more then 4gb of memory

2016-08-12 Thread Mark Kettenis
> From: David Gwynne 
> Date: Fri, 12 Aug 2016 16:38:45 +1000
> 
> > On 1 Aug 2016, at 21:07, Simon Mages  wrote:
> > 
> > I sent this message to dlg@ directly to discuss my modification of his
> > diff to make the bigger mbuf clusters work. i got no response so far,
> > that's why i decided to post it on tech@ directly. Maybe this way i get
> > some feedback faster :)
> 
> hey simon,
> 
> i was travelling when you sent your mail to me and then it fell out
> of my head. sorry about that.
> 
> if this is working correctly then i would like to put it in the tree. from 
> the light testing i have done, it is working correctly. would anyone object?
> 
> some performance measurement would also be interesting :)

Hmm, during debugging I've relied on the fact that only drivers
allocate the larger mbuf clusters for their rx rings.

Anyway, shouldn't the diff be using ulmin()?


> dlg
> 
> > 
> > BR
> > Simon
> > 
> > ### Original Mail:
> > 
> > -- Forwarded message --
> > From: Simon Mages 
> > Date: Fri, 22 Jul 2016 13:24:24 +0200
> > Subject: Re: [PATCH] let the mbufs use more then 4gb of memory
> > To: David Gwynne 
> > 
> > Hi,
> > 
> > I think i found the problem with your diff regarding the bigger mbuf
> > clusters.
> > 
> > You choose a buffer size based on space and resid, but what happens when
> > resid is larger than space and space is, for example, 2050? The cluster
> > chosen then has size 4096, but that is too large for the socket buffer. In
> > the past this was never a problem because you only allocated external
> > clusters of size MCLBYTES, and only when space was larger than MCLBYTES.
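
A tiny standalone illustration of that corner case (plain user-space C with
made-up numbers, not the kernel code): MCLGETI() hands back a cluster from the
first pool at least as large as the request, so asking for min(resid, space) =
2050 bytes yields a 4096-byte cluster, and len still has to be clamped by
space, which is what the diff below does.

#include <stdio.h>

#define MAXMCLBYTES     (64 * 1024)     /* largest cluster pool */

static long
lmin(long a, long b)
{
        return a < b ? a : b;
}

int
main(void)
{
        long space = 2050;      /* free space left in the socket buffer */
        long resid = 8192;      /* bytes the caller still wants to send */
        long ext_size = 4096;   /* pool size picked for a 2050 byte request */

        long ask = lmin(resid, lmin(space, MAXMCLBYTES));
        long len = lmin(lmin(resid, space), ext_size);

        printf("asked for %ld, got a %ld byte cluster, but copy only %ld\n",
            ask, ext_size, len);
        return 0;
}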
> > 
> > diff:
> > Index: kern/uipc_socket.c
> > ===
> > RCS file: /cvs/src/sys/kern/uipc_socket.c,v
> > retrieving revision 1.152
> > diff -u -p -u -p -r1.152 uipc_socket.c
> > --- kern/uipc_socket.c  13 Jun 2016 21:24:43 -  1.152
> > +++ kern/uipc_socket.c  22 Jul 2016 10:56:02 -
> > @@ -496,15 +496,18 @@ restart:
> > mlen = MLEN;
> > }
> > if (resid >= MINCLSIZE && space >= MCLBYTES) {
> > -   MCLGET(m, M_NOWAIT);
> > +   MCLGETI(m, M_NOWAIT, NULL, lmin(resid,
> > +   lmin(space, MAXMCLBYTES)));
> > if ((m->m_flags & M_EXT) == 0)
> > goto nopages;
> > if (atomic && top == 0) {
> > -   len = ulmin(MCLBYTES - max_hdr,
> > -   resid);
> > +   len = lmin(lmin(resid, space),
> > +   m->m_ext.ext_size -
> > +   max_hdr);
> > m->m_data += max_hdr;
> > } else
> > -   len = ulmin(MCLBYTES, resid);
> > +   len = lmin(lmin(resid, space),
> > +   m->m_ext.ext_size);
> > space -= len;
> > } else {
> > nopages:
> > 
> > I'm using this diff now for a while on my notebook and everything works as
> > expected. But i had no time to really test it or test the performance. That
> > will be my next step.
> > 
> > I reproduced the unix socket problem you mentioned with the following
> > little program:
> > 
> > #include <sys/types.h>
> > #include <sys/socket.h>
> > #include <sys/stat.h>
> > #include <sys/uio.h>
> > #include <sys/wait.h>
> > #include <fcntl.h>
> > #include <unistd.h>
> > 
> > #include <err.h>
> > #include <stdlib.h>
> > #include <string.h>
> > 
> > #define FILE "/tmp/afile"
> > 
> > int senddesc(int fd, int so);
> > int recvdesc(int so);
> > 
> > int
> > main(void)
> > {
> > struct stat sb;
> > int sockpair[2];
> > pid_t pid = 0;
> > int status;
> > int newfile;
> > 
> > if (unlink(FILE) < 0)
> > warn("unlink: %s", FILE);
> > 
> > int file = open(FILE, O_RDWR|O_CREAT|O_TRUNC, 0644);
> > 
> > if (socketpair(AF_UNIX, SOCK_STREAM|SOCK_NONBLOCK, 0, sockpair) < 0)
> > err(1, "socketpair");
> > 
> > if ((pid = fork())) {
> > senddesc(file, sockpair[0]);
> > if (waitpid(pid, &status, 0) < 0)
> > err(1, "waitpid");
> > } else {
> > newfile = recvdesc(sockpair[1]);
> > if (fstat(newfile, &sb) < 0)
> > err(1, "fstat");
> > }
> > 
> > return 0;
> > }
> > 
> > int
> > senddesc(int fd, int so)
> > {
> > struct msghdr msg;
> > struct cmsghdr *cmsg;
> > union {
> > struct 

Re: [PATCH] let the mbufs use more then 4gb of memory

2016-08-12 Thread Tinker

On 2016-06-23 05:42, Theo de Raadt wrote:

> > secondly, allocating more than 4g at a time to socket buffers is
> > generally a waste of memory.
> 
> and there is one further problem.
> 
> Eventually, this subsystem will starve the system.  Other subsystems
> which also need large amounts of memory, then have to scramble.  There
> have to be backpressure mechanisms in each subsystem to force out
> memory.
> 
> There is no such mechanism in socket buffers.
> 
> The mechanisms in the remaining parts of the kernel have always proven
> to be weak, as in, they don't interact as nicely as we want, to create
> space.  There has been much work to make them work better.
> 
> However in socket buffers, there is no such mechanism.  What are
> you going to do.  Throw data away?  You can't do that.  Therefore,
> you are holding the remaining system components hostage, and your
> diff creates deadlock.
> 
> You probably tested your diff under ideal conditions with gobs of
> memory...


The backpressure mechanism to free up [disk IO] buffer cache content is 
really effective though, so 90 is a mostly suitable bufcachepercent 
sysctl setting, right?




Re: [PATCH] let the mbufs use more then 4gb of memory

2016-08-12 Thread David Gwynne

> On 1 Aug 2016, at 21:07, Simon Mages  wrote:
> 
> I sent this message to dlg@ directly to discuss my modification of his
> diff to make the bigger mbuf clusters work. i got no response so far,
> that's why i decided to post it on tech@ directly. Maybe this way i get
> some feedback faster :)

hey simon,

i was travelling when you sent your mail to me and then it fell out of my head. 
sorry about that.

if this is working correctly then i would like to put it in the tree. from the 
light testing i have done, it is working correctly. would anyone object?

some performance measurement would also be interesting :)

dlg

> 
> BR
> Simon
> 
> ### Original Mail:
> 
> -- Forwarded message --
> From: Simon Mages 
> Date: Fri, 22 Jul 2016 13:24:24 +0200
> Subject: Re: [PATCH] let the mbufs use more then 4gb of memory
> To: David Gwynne 
> 
> Hi,
> 
> I think i found the problem with your diff regarding the bigger mbuf clusters.
> 
> You choose a buffer size based on space and resid, but what happens when
> resid is larger than space and space is, for example, 2050? The cluster
> chosen then has size 4096, but that is too large for the socket buffer. In
> the past this was never a problem because you only allocated external
> clusters of size MCLBYTES, and only when space was larger than MCLBYTES.
> 
> diff:
> Index: kern/uipc_socket.c
> ===
> RCS file: /cvs/src/sys/kern/uipc_socket.c,v
> retrieving revision 1.152
> diff -u -p -u -p -r1.152 uipc_socket.c
> --- kern/uipc_socket.c13 Jun 2016 21:24:43 -  1.152
> +++ kern/uipc_socket.c22 Jul 2016 10:56:02 -
> @@ -496,15 +496,18 @@ restart:
>   mlen = MLEN;
>   }
>   if (resid >= MINCLSIZE && space >= MCLBYTES) {
> - MCLGET(m, M_NOWAIT);
> + MCLGETI(m, M_NOWAIT, NULL, lmin(resid,
> + lmin(space, MAXMCLBYTES)));
>   if ((m->m_flags & M_EXT) == 0)
>   goto nopages;
>   if (atomic && top == 0) {
> - len = ulmin(MCLBYTES - max_hdr,
> - resid);
> + len = lmin(lmin(resid, space),
> + m->m_ext.ext_size -
> + max_hdr);
>   m->m_data += max_hdr;
>   } else
> - len = ulmin(MCLBYTES, resid);
> + len = lmin(lmin(resid, space),
> + m->m_ext.ext_size);
>   space -= len;
>   } else {
> nopages:
> 
> I'm using this diff now for a while on my notebook and everything works as
> expected. But i had no time to really test it or test the performance. That
> will be my next step.
> 
> I reproduced the unix socket problem you mentioned with the following little
> program:
> 
> #include <sys/types.h>
> #include <sys/socket.h>
> #include <sys/stat.h>
> #include <sys/uio.h>
> #include <sys/wait.h>
> #include <fcntl.h>
> #include <unistd.h>
> 
> #include <err.h>
> #include <stdlib.h>
> #include <string.h>
> 
> #define FILE "/tmp/afile"
> 
> int senddesc(int fd, int so);
> int recvdesc(int so);
> 
> int
> main(void)
> {
>   struct stat sb;
>   int sockpair[2];
>   pid_t pid = 0;
>   int status;
>   int newfile;
> 
>   if (unlink(FILE) < 0)
>   warn("unlink: %s", FILE);
> 
>   int file = open(FILE, O_RDWR|O_CREAT|O_TRUNC, 0644);
> 
>   if (socketpair(AF_UNIX, SOCK_STREAM|SOCK_NONBLOCK, 0, sockpair) < 0)
>   err(1, "socketpair");
> 
>   if ((pid = fork())) {
>   senddesc(file, sockpair[0]);
>   if (waitpid(pid, &status, 0) < 0)
>   err(1, "waitpid");
>   } else {
>   newfile = recvdesc(sockpair[1]);
>   if (fstat(newfile, &sb) < 0)
>   err(1, "fstat");
>   }
> 
>   return 0;
> }
> 
> int
> senddesc(int fd, int so)
> {
>   struct msghdr msg;
>   struct cmsghdr *cmsg;
>   union {
>   struct cmsghdr  hdr;
>   unsigned char   buf[CMSG_SPACE(sizeof(int))];
>   } cmsgbuf;
> 
>   char *cbuf = calloc(6392, sizeof(char));
>   memset(cbuf, 'K', 6392);
>   struct iovec iov = {
>   .iov_base = cbuf,
>   .iov_len = 6392,
>   };
> 
>   memset(&msg, 0, sizeof(struct msghdr));
>   msg.msg_iov = &iov;
>   msg.msg_iovlen = 1;
>   msg.msg_control = &cmsgbuf.buf;
>   

Re: [PATCH] let the mbufs use more then 4gb of memory

2016-06-29 Thread Claudio Jeker
On Thu, Jun 23, 2016 at 02:41:53PM +0200, Mark Kettenis wrote:
> > Date: Thu, 23 Jun 2016 13:09:28 +0200
> > From: Alexander Bluhm 
> > 
> > On Wed, Jun 22, 2016 at 10:54:27PM +1000, David Gwynne wrote:
> > > secondly, allocating more than 4g at a time to socket buffers is
> > > generally a waste of memory. in practice you should scale the amount
> > > of memory available to sockets according to the size of the tcp
> > > windows you need to saturate the bandwidth available to the box.
> > 
> > Currently OpenBSD limits the socket buffer size to 256k.
> > #define SB_MAX  (256*1024)  /* default for max chars in sockbuf */
> > 
> > For downloading large files from the internet this is not sufficient
> > anymore.  After customer complaints we have increased the limit to
> > 1MB.  This still does not give maximum throughput, but granting
> > more could easily result in running out of mbufs.  16MB would be
> > sufficient.
> > 
> > Besides single connections with high throughput we also have
> > a lot of long running connections, say some 10000.  Each connection
> > over a relay needs two sockets and four socket buffers.  With a 1MB
> > limit and 10000 connections the theoretical maximum is 40GB.
> > 
> > It is hard to figure out which connections need socket buffer space
> > in advance.  tcp_update_{snd,rcv}space() adjusts it dynamically, and
> > sbchecklowmem() has a first come first serve policy.  Another
> > challenge is that the peers on both sides of the relay can decide
> > whether they fill our buffers.
> > 
> > Besides finding a smarter algorithm to distribute the socket
> > buffer space, increasing the number of mbufs could be a solution.
> > Our server machines mostly relay connection data, so it seems
> > seductive to use much more mbuf memory to speed up TCP connections.
> > Without 64 bit DMA most memory of the machine is unused.
> > 
> > Also, a modern BIOS maps only 2GB in the low region.  All DMA devices
> > must share it.  Putting mbufs high should reduce the pressure.
> > 
> > Of course there are problems with network adaptors that support
> > less DMA space and with hotplug configurations.  For a general
> > solution we can implement bounce buffers, disable the feature on
> > such machines or have a knob.
> 
> We really don't want to implement bounce-buffers.  Adding IOMMU
> support is probably a better approach as it also brings some security
> benefits.  Not all amd64 hardware supports an IOMMU.  And hardware
> that does support it doesn't always have it enabled.  But for modern
> hardware an iommu is pretty much standard, except for the absolute
> low-end.  But those low-end machines tend to have only 2GB of memory
> anyway.

Another option is to use m_defrag() to move the mbuf down from high memory
when it is needed. I think this is much simpler to implement, and the devices
that need it can be identified fairly easily. This only solves the TX side;
on the RX side the bouncing would need to be done in the socket buffers (it
would make sense to use large mbuf clusters in the socket buffers and copy
the data over).
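
A rough sketch of the TX half of that idea; mbuf_above_dma_range() is a
hypothetical stand-in for whatever check would really decide this, while
m_defrag() and M_DONTWAIT are the existing interfaces. m_defrag() copies the
chain into freshly allocated storage, which with the current cluster pools
lands below 4G:

#include <sys/param.h>
#include <sys/mbuf.h>

int     mbuf_above_dma_range(struct mbuf *);    /* hypothetical helper */

int
tx_bounce_if_needed(struct mbuf *m)
{
        /* nothing to do if the packet is already DMA-reachable */
        if (!mbuf_above_dma_range(m))
                return (0);

        /* compact the chain into new storage the device can reach */
        return (m_defrag(m, M_DONTWAIT));
}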

-- 
:wq Claudio



Re: [PATCH] let the mbufs use more then 4gb of memory

2016-06-25 Thread Stefan Fritsch
On Thursday 23 June 2016 14:41:53, Mark Kettenis wrote:
> We really don't want to implement bounce-buffers.  Adding IOMMU
> support is probably a better approach as it also brings some
> security benefits.  Not all amd64 hardware supports an IOMMU.  And
> hardware that does support it doesn't always have it enabled.  But
> for modern hardware an iommu is pretty much standard, except for
> the absolute low-end.  But those low-end machines tend to have only
> 2GB of memory anyway.

On amd64, modern would mean Skylake or newer. At least until Haswell
(not sure about Broadwell), Intel considered VT-d to be a high-end
feature and many desktop CPUs don't have it enabled. It is easy to
find systems with >=16 GB RAM but without an IOMMU.

Stefan



Re: [PATCH] let the mbufs use more then 4gb of memory

2016-06-23 Thread Chris Cappuccio
Mark Kettenis [mark.kette...@xs4all.nl] wrote:
> 
> We really don't want to implement bounce-buffers.  Adding IOMMU
> support is probably a better approach as it also brings some security
> benefits.  Not all amd64 hardware supports an IOMMU.  And hardware
> that does support it doesn't always have it enabled.  But for modern
> hardware an iommu is pretty much standard, except for the absolute
> low-end.  But those low-end machines tend to have only 2GB of memory
> anyway.

Is the sparc64 iommu code port usable for this purpose?

http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/sys/arch/amd64/amd64/Attic/sg_dma.c



Re: [PATCH] let the mbufs use more then 4gb of memory

2016-06-23 Thread Mark Kettenis
> Date: Thu, 23 Jun 2016 13:09:28 +0200
> From: Alexander Bluhm 
> 
> On Wed, Jun 22, 2016 at 10:54:27PM +1000, David Gwynne wrote:
> > secondly, allocating more than 4g at a time to socket buffers is
> > generally a waste of memory. in practice you should scale the amount
> > of memory available to sockets according to the size of the tcp
> > windows you need to saturate the bandwidth available to the box.
> 
> Currently OpenBSD limits the socket buffer size to 256k.
> #define SB_MAX  (256*1024)  /* default for max chars in sockbuf */
> 
> For downloading large files from the internet this is not sufficient
> anymore.  After customer complaints we have increased the limit to
> 1MB.  This still does not give maximum throughput, but granting
> more could easily result in running out of mbufs.  16MB would be
> sufficient.
> 
> Besides single connections with high throughput we also have
> a lot of long running connections, say some 10000.  Each connection
> over a relay needs two sockets and four socket buffers.  With a 1MB
> limit and 10000 connections the theoretical maximum is 40GB.
> 
> It is hard to figure out which connections need socket buffer space
> in advance.  tcp_update_{snd,rcv}space() adjusts it dynamically, and
> sbchecklowmem() has a first come first serve policy.  Another
> challenge is that the peers on both sides of the relay can decide
> whether they fill our buffers.
> 
> Besides finding a smarter algorithm to distribute the socket
> buffer space, increasing the number of mbufs could be a solution.
> Our server machines mostly relay connection data, so it seems
> seductive to use much more mbuf memory to speed up TCP connections.
> Without 64 bit DMA most memory of the machine is unused.
> 
> Also, a modern BIOS maps only 2GB in the low region.  All DMA devices
> must share it.  Putting mbufs high should reduce the pressure.
> 
> Of course there are problems with network adaptors that support
> less DMA space and with hotplug configurations.  For a general
> solution we can implement bounce buffers, disable the feature on
> such machines or have a knob.

We really don't want to implement bounce-buffers.  Adding IOMMU
support is probably a better approach as it also brings some security
benefits.  Not all amd64 hardware supports an IOMMU.  And hardware
that does support it doesn't always have it enabled.  But for modern
hardware an iommu is pretty much standard, except for the absolute
low-end.  But those low-end machines tend to have only 2GB of memory
anyway.



Re: [PATCH] let the mbufs use more then 4gb of memory

2016-06-23 Thread Alexander Bluhm
On Wed, Jun 22, 2016 at 10:54:27PM +1000, David Gwynne wrote:
> secondly, allocating more than 4g at a time to socket buffers is
> generally a waste of memory. in practice you should scale the amount
> of memory available to sockets according to the size of the tcp
> windows you need to saturate the bandwidth available to the box.

Currently OpenBSD limits the socket buffer size to 256k.
#define SB_MAX  (256*1024)  /* default for max chars in sockbuf */

For downloading large files from the internet this is not sufficient
anymore.  After customer complaints we have increased the limit to
1MB.  This still does not give maximum throughput, but granting
more could easily result in running out of mbufs.  16MB would be
sufficient.

Besides single connections with high throughput we also have
a lot of long running connections, say some 10000.  Each connection
over a relay needs two sockets and four socket buffers.  With a 1MB
limit and 10000 connections the theoretical maximum is 40GB.

It is hard to figure out which connections need socket buffer space
in advance.  tcp_update_{snd,rcv}space() adjusts it dynamically, and
sbchecklowmem() has a first come first serve policy.  Another
challenge is that the peers on both sides of the relay can decide
whether they fill our buffers.

Besides finding a smarter algorithm to distribute the socket
buffer space, increasing the number of mbufs could be a solution.
Our server machines mostly relay connection data, so it seems
seductive to use much more mbuf memory to speed up TCP connections.
Without 64 bit DMA most memory of the machine is unused.

Also, a modern BIOS maps only 2GB in the low region.  All DMA devices
must share it.  Putting mbufs high should reduce the pressure.

Of course there are problems with network adaptors that support
less DMA space and with hotplug configurations.  For a general
solution we can implement bounce buffers, disable the feature on
such machines or have a knob.

bluhm



Re: [PATCH] let the mbufs use more then 4gb of memory

2016-06-22 Thread Theo de Raadt
> secondly, allocating more than 4g at a time to socket buffers is
> generally a waste of memory.

and there is one further problem.

Eventually, this subsystem will starve the system.  Other subsystems
which also need large amounts of memory, then have to scramble.  There
have to be backpressure mechanisms in each subsystem to force out
memory.

There is no such mechanism in socket buffers.

The mechanisms in the remaining parts of the kernel have always proven
to be weak, as in, they don't interact as nicely as we want, to create
space.  There has been much work to make them work better.

However in socket buffers, there is no such mechanism.  What are
you going to do.  Throw data away?  You can't do that.  Therefore,
you are holding the remaining system components hostage, and your
diff creates deadlock.

You probably tested your diff under ideal conditions with gobs of
memory...

 



Re: [PATCH] let the mbufs use more then 4gb of memory

2016-06-22 Thread Claudio Jeker
On Wed, Jun 22, 2016 at 01:58:25PM +0200, Simon Mages wrote:
> On a system where you use the maximum socket buffer size of 256kbyte you
> can run out of memory after less than 9k open sockets.
> 
> My patch adds a new uvm_constraint for the mbufs with a bigger memory area.
> I chose this area after reading the comments in
> sys/arch/amd64/include/pmap.h.
> This patch further changes the maximum socket buffer size from 256k to 1GB,
> as described in RFC 1323 S2.3.

You read that RFC wrong. I see no reason to increase the socketbuffer size
to such a huge value. A change like this is currently not acceptable.
 
> I tested this diff with the ix, em and urndis driver. I know that this
> diff only works for amd64 right now, but i wanted to send this diff as a
> proposal for what could be done. Maybe somebody has a different solution
> for this problem or can tell me why this is a bad idea.
> 

Are you sure that all drivers are able to handle memory with physical
addresses that are more than 32 bits long? I doubt this. I think a lot more
is needed than this diff to make this work even just for amd64.
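
A made-up illustration of why that is doubtful: plenty of older devices use
descriptor formats whose buffer address field is only 32 bits wide, so a
cluster above the 4G line simply cannot be described to them. The struct below
is invented for illustration, not any real driver's layout:

#include <stdint.h>

/* illustrative only; field names and layout are invented */
struct rxdesc {
        uint32_t        rd_addr;        /* physical buffer address, 32 bits: tops out at 4G */
        uint16_t        rd_len;         /* buffer length */
        uint16_t        rd_flags;       /* ownership and status bits */
};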

> 
> Index: arch/amd64/amd64/bus_dma.c
> ===
> RCS file: /openbsd/src/sys/arch/amd64/amd64/bus_dma.c,v
> retrieving revision 1.49
> diff -u -p -u -p -r1.49 bus_dma.c
> --- arch/amd64/amd64/bus_dma.c17 Dec 2015 17:16:04 -  1.49
> +++ arch/amd64/amd64/bus_dma.c22 Jun 2016 11:33:17 -
> @@ -584,7 +584,7 @@ _bus_dmamap_load_buffer(bus_dma_tag_t t,
>*/
>   pmap_extract(pmap, vaddr, (paddr_t *)&curaddr);
> 
> - if (curaddr > dma_constraint.ucr_high)
> + if (curaddr > mbuf_constraint.ucr_high)
>   panic("Non dma-reachable buffer at curaddr %#lx(raw)",
>   curaddr);
> 
> Index: arch/amd64/amd64/machdep.c
> ===
> RCS file: /openbsd/src/sys/arch/amd64/amd64/machdep.c,v
> retrieving revision 1.221
> diff -u -p -u -p -r1.221 machdep.c
> --- arch/amd64/amd64/machdep.c21 May 2016 00:56:43 -  1.221
> +++ arch/amd64/amd64/machdep.c22 Jun 2016 11:33:17 -
> @@ -202,9 +202,11 @@ struct vm_map *phys_map = NULL;
>  /* UVM constraint ranges. */
>  struct uvm_constraint_range  isa_constraint = { 0x0, 0x00ffffffUL };
>  struct uvm_constraint_range  dma_constraint = { 0x0, 0xffffffffUL };
> +struct uvm_constraint_range  mbuf_constraint = { 0x0, 0xfUL };
>  struct uvm_constraint_range *uvm_md_constraints[] = {
>  &isa_constraint,
>  &dma_constraint,
> +&mbuf_constraint,
>  NULL,
>  };
> 
> Index: kern/uipc_mbuf.c
> ===
> RCS file: /openbsd/src/sys/kern/uipc_mbuf.c,v
> retrieving revision 1.226
> diff -u -p -u -p -r1.226 uipc_mbuf.c
> --- kern/uipc_mbuf.c  13 Jun 2016 21:24:43 -  1.226
> +++ kern/uipc_mbuf.c  22 Jun 2016 11:33:18 -
> @@ -153,7 +153,7 @@ mbinit(void)
> 
>   pool_init(&mbpool, MSIZE, 0, 0, 0, "mbufpl", NULL);
>   pool_setipl(&mbpool, IPL_NET);
> - pool_set_constraints(&mbpool, &kp_dma_contig);
> + pool_set_constraints(&mbpool, &kp_mbuf_contig);
>   pool_setlowat(&mbpool, mblowat);
> 
>   pool_init(&mtagpool, PACKET_TAG_MAXSIZE + sizeof(struct m_tag),
> @@ -166,7 +166,7 @@ mbinit(void)
>   pool_init(&mclpools[i], mclsizes[i], 0, 0, 0,
>   mclnames[i], NULL);
>   pool_setipl(&mclpools[i], IPL_NET);
> - pool_set_constraints(&mclpools[i], &kp_dma_contig);
> + pool_set_constraints(&mclpools[i], &kp_mbuf_contig);
>   pool_setlowat(&mclpools[i], mcllowat);
>   }
> 
> Index: sys/socketvar.h
> ===
> RCS file: /openbsd/src/sys/sys/socketvar.h,v
> retrieving revision 1.60
> diff -u -p -u -p -r1.60 socketvar.h
> --- sys/socketvar.h   25 Feb 2016 07:39:09 -  1.60
> +++ sys/socketvar.h   22 Jun 2016 11:33:18 -
> @@ -112,7 +112,7 @@ struct socket {
>   short   sb_flags;   /* flags, see below */
>   u_short sb_timeo;   /* timeout for read/write */
>   } so_rcv, so_snd;
> -#define  SB_MAX  (256*1024)       /* default for max chars in sockbuf */
> +#define  SB_MAX  (1024*1024*1024) /* default for max chars in sockbuf */
>  #define  SB_LOCK 0x01             /* lock on data queue */
>  #define  SB_WANT 0x02             /* someone is waiting to lock */
>  #define  SB_WAIT 0x04             /* someone is waiting for data/space */
> Index: uvm/uvm_extern.h
> ===
> RCS file: /openbsd/src/sys/uvm/uvm_extern.h,v
> retrieving revision 1.139
> diff -u -p -u -p -r1.139 uvm_extern.h
> --- uvm/uvm_extern.h  5 Jun 2016 08:35:57 -   1.139
> +++ uvm/uvm_extern.h  22 Jun 2016 11:33:18 -
> @@ -234,6 +234,7 @@ extern struct uvmexp uvmexp;
>  /* Constraint 

Re: [PATCH] let the mbufs use more then 4gb of memory

2016-06-22 Thread David Gwynne
On Wed, Jun 22, 2016 at 01:58:25PM +0200, Simon Mages wrote:
> On a system where you use the maximum socket buffer size of 256kbyte you
> can run out of memory after less than 9k open sockets.
> 
> My patch adds a new uvm_constraint for the mbufs with a bigger memory area.
> I chose this area after reading the comments in
> sys/arch/amd64/include/pmap.h.
> This patch further changes the maximum socket buffer size from 256k to 1GB,
> as described in RFC 1323 S2.3.
> 
> I tested this diff with the ix, em and urndis driver. I know that this
> diff only works for amd64 right now, but i wanted to send this diff as a
> proposal for what could be done. Maybe somebody has a different solution
> for this problem or can tell me why this is a bad idea.

hey simon,

first, some background.

the 4G watermark is less about limiting the amount of memory used
by the network stack and more about making the memory addressable
by as many devices, including network cards, as possible. we support
older chips that only deal with 32 bit addresses (and one or two
stupid ones with an inability to address over 1G), so we took the
conservative option and made the memory generally usable without
developers having to think about it much.

you could argue that you should be able to give big addresses
to modern cards, but that falls down if you are forwarding packets
between a modern and an old card, cos the old card will want to dma
the packet the modern card rxed, but it needs it below the 4g line.
even if you don't have an old card, in today's hotplug world you might
plug an old device in. either way, the future of an mbuf is very
hard for the kernel to predict.

secondly, allocating more than 4g at a time to socket buffers is
generally a waste of memory. in practice you should scale the amount
of memory available to sockets according to the size of the tcp
windows you need to saturate the bandwidth available to the box.
this means if you want to sustain a gigabit of traffic with a 300ms
round trip time for packets, you'd "only" need ~37.5 megabytes of
buffers. to sustain 40 gigabit you'd need 1.5 gigabytes, which is
still below 4G. allowing more use of memory for buffers would likely
induce latency.
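
the same back-of-the-envelope arithmetic, written out as a throwaway C
snippet with the numbers from the paragraph above (buffer ~= bandwidth *
RTT / 8):

#include <stdio.h>

int
main(void)
{
        double rtt = 0.3;                       /* 300 ms round trip time */
        double rates[] = { 1e9, 40e9 };         /* 1 Gb/s and 40 Gb/s */

        for (int i = 0; i < 2; i++) {
                /* bandwidth-delay product, in bytes */
                double bytes = rates[i] * rtt / 8;
                printf("%5.0f Gb/s * %3.0f ms -> %6.1f MB of socket buffer\n",
                    rates[i] / 1e9, rtt * 1000, bytes / 1e6);
        }
        return 0;
}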

the above means that if you want to sustain a single 40G tcp
connection to that host you'd need to be able to place 1.5G on the
socket buffer, which is above the 1G you mention above. however,
if you want to sustain 2 connections, you ideally want to fairly
share the 1.5G between both sockets. they should get 750M each.

fairly sharing buffers between the sockets may already be in place
in openbsd. when i reworked the pools subsystem i set it up so
things sleeping on memory were woken up in order.

it occurs to me that perhaps we should limit mbufs by the bytes
they can use rather than the number of them. that would also work
well if we moved to per cpu caches for mbufs and clusters, cos the
number of active mbufs in the system becomes hard to limit accurately
if we want cpus to run independently.
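
a minimal sketch of what byte-based accounting could look like; the names are
hypothetical and locking/per-cpu caching is deliberately left out, the point
is only that the budget is charged in bytes rather than in cluster counts:

#include <stddef.h>

/* hypothetical globals; a real version would need atomics or a mutex */
static size_t mcl_bytes_inuse;
static size_t mcl_bytes_limit = 64UL * 1024 * 1024;

/* charge a cluster of 'size' bytes; returns 0 to make the caller back off */
int
mcl_charge(size_t size)
{
        if (mcl_bytes_inuse + size > mcl_bytes_limit)
                return (0);
        mcl_bytes_inuse += size;
        return (1);
}

void
mcl_uncharge(size_t size)
{
        mcl_bytes_inuse -= size;
}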

if you want something to work on in this area, could you look at
letting sockets use the "jumbo" clusters instead of assuming
everything has to be in 2k clusters? i started on this with the
diff below, but it broke ospfd and i never got back to it.

if you get it working, it would be interesting to test creating even
bigger cluster pools, eg, a 1M or 4M mbuf cluster.

cheers,
dlg

Index: uipc_socket.c
===
RCS file: /cvs/src/sys/kern/uipc_socket.c,v
retrieving revision 1.135
diff -u -p -r1.135 uipc_socket.c
--- uipc_socket.c   11 Dec 2014 19:21:57 -  1.135
+++ uipc_socket.c   22 Dec 2014 01:11:03 -
@@ -493,15 +493,18 @@ restart:
mlen = MLEN;
}
if (resid >= MINCLSIZE && space >= MCLBYTES) {
-   MCLGET(m, M_NOWAIT);
+   MCLGETI(m, M_NOWAIT, NULL, lmin(resid,
+   lmin(space, MAXMCLBYTES)));
if ((m->m_flags & M_EXT) == 0)
goto nopages;
if (atomic && top == 0) {
-   len = lmin(MCLBYTES - max_hdr,
-   resid);
+   len = lmin(resid,
+   m->m_ext.ext_size -
+   max_hdr);
m->m_data += max_hdr;
} else
-   len = lmin(MCLBYTES, resid);
+   len = lmin(resid,
+   m->m_ext.ext_size);