> On 13 Aug 2016, at 5:48 AM, Mark Kettenis <[email protected]> wrote:
> 
>> From: Martin Pieuchot <[email protected]>
>> Date: Fri, 12 Aug 2016 20:44:04 +0200
>> 
>> On 08/11/16 06:43, David Gwynne wrote:
>>> ive been tinkering with per cpu memory in the kernel
>>> [...]
>> 
>> I'd like to have more people comment on this because we need an MP-safe
>> way to handle counters in the network stack.
> 
> I had a first look.  Need to have a bit more time to digest this.
> 
>>> im still debating whether the API should do protection against
>>> interrupts on the local cpu by handling spls for you. at the moment
>>> it is up to the caller to manually splfoo() and splx(), but im half
>>> convinced that cpumem_enter and _leave should do that on behalf of
>>> the caller.
>> 
>> Let's see how it looks like when we have more usages of per cpu memory.
> 
> The per-CPU counters API is likely to need it though.  I'm less sure
> about the per-CPU memory API itself.

there are a couple of arguments against it.

one big reason i didn't put it in the api itself at the time was the lack of an MI 
splraise(). i've since fixed that on everything except arm as far as i can tell. 
arm may be ok, but it's hard to read and my zaurus is dusty. it's worth figuring 
out though because the pool parts of the diff use splraise instead of a mutex per 
cpu.

i'd also have to store the ipl somewhere inside struct cpumem, but be clever 
about avoiding more indirection on the way to the per cpu memory. so far the 
caller already has a pretty good idea of what level it is or should be at, and 
generally keeps state on the stack (unlike mutexes).

> 
>>> this diff also includes two uses of the percpu code. one is to
>>> provide per cpu caches of pool items, and the other is per cpu
>>> counters for mbuf statistics.
>> 
>> I'd like to discuss pool cache later.
> 
> The pool cache sits on top of the per-CPU memory API, doesn't it?  So it
> doesn't need to go in simultaneously.

yes.

> 
>>> ive added a wrapper around percpu memory for counters. basically
>>> you ask it for N counters, where each counter is a uint64_t.
>>> counters_enter will give you a per cpu reference to these N counters
>>> which you can increment and decrement as you wish. internally the
>>> api will version the per cpu counters so a reader can know when
>>> theyre consistent, which is important on 32bit archs (where 64bit
>>> ops arent necessarily atomic), or where you want several counters
>>> to be consistent with each other (like packet and byte counters).
>>> counters_read is provided to use the above machinery for a consistent
>>> read.
>> 
>> Comment below.
>> 
>>> secondly, there is a boot strapping problem with per cpu data
>>> structures, which is very apparent with the mbuf layer. the problem
>>> is we dont know how many cpus are in the system until we're halfway
>>> through attaching device drivers in the system. however, if we want
>>> to use percpu data structures during attach we need to know how
>>> many cpus we have. mbufs are allocated during attach, so we need
>>> to know how many cpus we have before attach.
>> 
>> What about using ncpusfound?
> 
> On sparc64 that gets called very early, before pmap_bootstrap(),
> because we need to allocate some per-CPU memory ;).
> 
> On other systems we typically calculate it on the fly as we're
> attaching CPUs.  That may not be early enough.

i tried ncpusfound and MAXCPUS. ncpusfound is generally unavailable before 
cpu_configure, and i'm trying to avoid wasting memory with MAXCPUS. using global 
memory for the boot cpu and doing a "realloc" after cpu_configure seems to work 
well.

the second diff i mailed out has the realloc continue to use the global as the 
cpumem for cpu 0 and allocate more memory for cpus 1 and up. this means we only 
waste a pointer's worth of memory for percpu memory that needs to work straight 
after boot.

on UP we will waste nothing because we don't use struct cpumem, and we avoid 
needing to allocate anything further at runtime.

> 
>>> +uint64_t *
>>> +counters_enter(struct counters_ref *ref, struct cpumem *cm)
>>> +{
>>> +   ref->gen = cpumem_enter(cm);
>>> +   (*ref->gen)++; /* make the generation number odd */
>>> +   return (ref->gen + 1);
>>> +}
>>> +
>>> +void
>>> +counters_leave(struct counters_ref *ref, struct cpumem *cm)
>>> +{
>>> +   membar_producer();
>>> +   (*ref->gen)++; /* make the generation number even again */
>>> +   cpumem_leave(cm, ref->gen);
>>> +}
>> 
>> So every counter++ will now look like a critical section?  Do we really
>> need a memory barrier for every +1?  What can happen if a CPU is reading
>> a value that is currently being updated?  Can't we live with that?
> 
> Not for 64-bit counters on 32-bit machines.  You could do something
> clever and read the upper 32 bits, then read the lower 32 bits and
> then read the upper 32-bits again and start over of you get different
> value the 2nd time around.  But you still need to guarantee that
> whoever updates the counter writes the upper 32 bits and lower 32 bits
> in a particular order, and that they become visible to other CPUs in
> that order.  And for that you need memory bars.

yeah.

the generation number is used in sort of the same way as reading the top 32 
bits of a 64 bit counter twice to look for changes, but it applies to a whole 
series of updates. that's on the read side anyway. on the write side the extra 
loads and stores for the generation number are trivial compared to the cost of 
waiting for a main memory access. if it's in cache it's cheap.

on archs where we care about speed (amd64), membar_producer and 
membar_consumer happen to be just compiler barriers; there are no extra cpu 
instructions to flush caches. even on archs that do have explicit opcodes for 
memory barriers, they're cheap if the cacheline is uncontended.

dlg
