> From: David Gwynne <[email protected]>
> Date: Sun, 14 Aug 2016 16:26:45 +1000
> 
> > On 13 Aug 2016, at 5:48 AM, Mark Kettenis <[email protected]> wrote:
> > 
> >> From: Martin Pieuchot <[email protected]>
> >> Date: Fri, 12 Aug 2016 20:44:04 +0200
> >> 
> >> On 08/11/16 06:43, David Gwynne wrote:
> >>> ive been tinkering with per cpu memory in the kernel
> >>> [...]
> >> 
> >> I'd like to have more people comment on this because we need an MP-safe
> >> way to handle counters in the network stack.
> > 
> > I had a first look.  Need to have a bit more time to digest this.
> > 
> >>> im still debating whether the API should do protection against
> >>> interrupts on the local cpu by handling spls for you. at the moment
> >>> it is up to the caller to manually splfoo() and splx(), but im half
> >>> convinced that cpumem_enter and _leave should do that on behalf of
> >>> the caller.
> >> 
> >> Let's see how it looks like when we have more usages of per cpu memory.
> > 
> > The per-CPU counters API is likely to need it though.  I'm less sure
> > about the per-CPU memory API itself.
> 
> there are a couple of arguments against it.
> 
> one big reason i didnt put it in the api itself at the time was the
> lack of MI splraise(). ive since fixed that on everything except arm
> as far as i can tell. arm may be ok, but its hard to read and my
> zaurus is dusty. it's worth figuring out though cos the pool parts
> of the diff use splraise instead of a mutex per cpu.
> 
> id also have to store the ipl somewhere inside struct cpumem, but be
> clever about avoiding more indirection on the way to the per cpu
> memory. so far the caller already has a pretty good idea of what
> level it is or should be at, and generally keeps state on the stack
> (unlike mutexes).

And I suppose in most cases when updating the counters you'll already
be at the desired spl.

So I'm with mpi@ here.  Let's stick to what you have now and revise
after we have used it in a couple of places.

> >>> ive added a wrapper around percpu memory for counters. basically
> >>> you ask it for N counters, where each counter is a uint64_t.
> >>> counters_enter will give you a per cpu reference to these N counters
> >>> which you can increment and decrement as you wish. internally the
> >>> api will version the per cpu counters so a reader can know when
> >>> theyre consistent, which is important on 32bit archs (where 64bit
> >>> ops arent necessarily atomic), or where you want several counters
> >>> to be consistent with each other (like packet and byte counters).
> >>> counters_read is provided to use the above machinery for a consistent
> >>> read.
> >> 
> >> Comment below.
> >> 
> >>> secondly, there is a boot strapping problem with per cpu data
> >>> structures, which is very apparent with the mbuf layer. the problem
> >>> is we dont know how many cpus are in the system until we're halfway
> >>> through attaching device drivers in the system. however, if we want
> >>> to use percpu data structures during attach we need to know how
> >>> many cpus we have. mbufs are allocated during attach, so we need
> >>> to know how many cpus we have before attach.
> >> 
> >> What about using ncpusfound?
> > 
> > On sparc64 that gets called very early, before pmap_bootstrap(),
> > because we need to allocate some per-CPU memory ;).
> > 
> > On other systems we typically calculate it on the fly as we're
> > attaching CPUs.  That may not be early enough.
> 
> i tried ncpusfound and MAXCPUS. ncpusfound is generally unavailable
> before cpu_configure, and im trying to avoid wasting memory with
> MAXCPUs. using global memory for the boot cpu and doing a "realloc"
> after cpu_configure seems to work well.
> 
> the second diff i mailed out has realloc continue to use the global
> as the cpumem for cpu 0 and allocate more memory for 1 and up. this
> means we only waste a pointer of memory for percpu memory that needs
> to work straight after boot.

I think that's perfectly acceptable.

However, looking at the code, xxx_realloc() might be a bit misleading
as a name.  One could naively think that these interfaces could be used
to change the size of the per-CPU memory area.  But that's not the case,
is it?

It also seems that you currently rely on each subsystem that uses
per-CPU memory to call the appropriate xxx_realloc function itself.
It might be feasible to do this automatically in percpu_init() by
using ELF section magic to construct a list of CPUMEM_BOOT_MEMORY()
declarations.  But perhaps it is best to file that away as a future
improvement for now.

> on UP we will waste nothing cos we dont use struct cpumem, and avoid
> needing to allocate anything further at runtime.
> 
> > 
> >>> +uint64_t *
> >>> +counters_enter(struct counters_ref *ref, struct cpumem *cm)
> >>> +{
> >>> + ref->gen = cpumem_enter(cm);
> >>> + (*ref->gen)++; /* make the generation number odd */
> >>> + return (ref->gen + 1);
> >>> +}
> >>> +
> >>> +void
> >>> +counters_leave(struct counters_ref *ref, struct cpumem *cm)
> >>> +{
> >>> + membar_producer();
> >>> + (*ref->gen)++; /* make the generation number even again */
> >>> + cpumem_leave(cm, ref->gen);
> >>> +}
> >> 
> >> So every counter++ will now look like a critical section?  Do we really
> >> need a memory barrier for every +1?  What can happen if a CPU is reading
> >> a value that is currently being updated?  Can't we live with that?
> > 
> > Not for 64-bit counters on 32-bit machines.  You could do something
> > clever and read the upper 32 bits, then read the lower 32 bits and
> > then read the upper 32 bits again and start over if you get a different
> > value the 2nd time around.  But you still need to guarantee that
> > whoever updates the counter writes the upper 32 bits and lower 32 bits
> > in a particular order, and that they become visible to other CPUs in
> > that order.  And for that you need memory barriers.
> 
> yeah.
> 
> the generation number is used in sort of the same way as reading the
> top 32 bits of a 64 bit counter twice to look for changes, but it
> applies to a whole series of updates. on the read side anyway. on
> the write side the extra loads and stores for the generation number
> are trivial in comparison to cost of waiting for a main memory
> access. if its in cache its cheap.
> 
> on archs that we care about speed on (amd64), membar_producer and
> membar_consumer happen to just be compiler barriers, there's no
> extra cpu instructions to flush caches. even on archs that do have
> explicit opcodes for memory barriers theyre cheap if the cacheline
> is uncontended.

Almost certainly true for the architectures we care about.

On a side note, I suppose that on architectures that make use of the
exclusivity of cache lines to implement memory barriers in an
efficient way, actual atomic instructions would be fairly efficient as
well.  It's just x86 that is a bit challenged here with the bus
locking semantics ;).  And there memory barriers are indeed very
cheap.  Just make sure you put your counters in normal cacheable
memory!
