On Tue, Jan 24, 2017 at 02:55:50AM +0100, Alexander Bluhm wrote:
> On Wed, Dec 14, 2016 at 03:52:32PM +1000, David Gwynne wrote:
> > > Wouldn't it make sense to use atomic operations to keep track of the
> > > amount of memory that was allocated?
> >
> > the mtx may be slow, but it is a slow path i'd like to keep simple.
>
> Is allocating and freeing mbuf clusters in the slow path? I would
> expect that it has to be done for every packet. An atomic operation
> would not solve the slowdown in the fast path, perhaps a multi-cpu
> counter?
no, this is only done at the level of pool page allocation. the
pool pages are what get cut up into mbufs and mbuf clusters.
the only difference between mbufs and clusters at the pool level
is their size, and therefore which pool they come from.
on my firewall the interesting numbers are:
$ vmstat -m | grep -e ^mbuf -e ^mcl -e ^Name
Name     Size      Requests   Fail InUse     Pgreq     Pgrel Npage Hiwat Minpg Maxpg Idle
mbufpl    256 1502227362815 357415  2547  22176623  22176415   208   953     1     8    1
mcl2k    2048 1446266146051      0   709 104851993 104851840   153  1712     0     8   46
mcl4k    4096    7438867115      0     0    519649    519648     1    35     0     8    1
mcl8k    8192    3388545041      0     0    583882    583881     1    61     0     8    1
mcl9k    9216   15293011574      0    18    507792    507790     2    31     0     8    0
mbufs get used 1502227362815 / 22176623, or 67739 times, per page
allocation. 2k clusters get used 1446266146051 / 104851993, or 13793
times, per page allocation.
> > > Long run I suppose we want to drop nmbclust and let users tune the
> > > total amount of memory available for clusters and set the initial
> > > amount to a percentage of physical memory?
> >
> > i do want to move to specifying memory in terms of bytes instead of 2k
> > clusters.
>
> That might be better. But leave it as it is for now. Percentage
> of physical memory is bad as we have architectures that cannot do
> DMA everywhere.
>
> > how much should be available by default is a complicated issue.
>
> Yes. Maybe we have to reconsider the default. We had it per cluster
> size before; now the limit is for all clusters.
percentage of memory isn't a good idea. the memory available to the
stack should be sized by how much tcp you want to do, which is
determined by the bandwidth and latency you need to support. that
holds regardless of whether you have 4G or 512G of RAM.
> > >> void
> > >> mbcpuinit()
> > >> {
> > >> + int i;
> > >> +
> > >> mbstat = counters_alloc_ncpus(mbstat, MBSTAT_COUNT, M_DEVBUF);
> > >> +
> > >> + pool_cache_init(&mbpool);
> > >> + pool_cache_init(&mtagpool);
> > >> +
> > >> + for (i = 0; i < nitems(mclsizes); i++)
> > >> + pool_cache_init(&mclpools[i]);
> > >> }
>
> Is the pool cache related to the global mbuf limit?
no, but it is possible because of it.
the idea of the per cpu caches is to let cpus operate completely
independently most of the time. having an item limit on a pool means
the cpus need to coordinate on every pool put and get so they can
check and update that limit.
> Apart from the problem that I don't know whether the mutex kills
> performance, the diff looks good.
the tests i've done and Simon Mages has done show a slight benefit.
i'd expect to see that grow as we use pools more concurrently.