Re: include/linux/pcounter.h

2008-02-26 Thread Ingo Molnar

* David Miller [EMAIL PROTECTED] wrote:

 From: Andrew Morton [EMAIL PROTECTED]
 Date: Sat, 16 Feb 2008 11:26:18 -0800
 
  On Sat, 16 Feb 2008 13:03:54 +0100 Eric Dumazet [EMAIL PROTECTED] wrote:
  
   Yes, per connection basis. Some workloads want to open/close more 
   than 1000 sockets per second.
  
  ie: slowpath
 
 Definitely not slow path in the networking.
 
 Connection rates are definitely as, or more, important than packet 
 rates for certain workloads.

but the main and fundamental question still remains unanswered (more 
than 3 weeks after Andrew asked that question): why was this piece of 
general infrastructure merged via net.git and not submitted to lkml 
ever? The code touching -mm does _not_ count as review.

Now that there was review of it and there is clearly controversy, the 
code should be reverted/undone and resubmitted after all review 
observations have been addressed. Just sitting around and ignoring 
objections, hoping for the code to hit v2.6.25 is rather un-nice ...

Ingo
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: include/linux/pcounter.h

2008-02-16 Thread Eric Dumazet
.
That is ... once in a while...

Now when we allocate a new socket, code to increment the socket count is :

c03a74a8 tcp_pcounter_add:
c03a74a8:   b8 90 26 5f c0  mov$0xc05f2690,%eax
c03a74ad:   64 8b 0d 10 f1 5e c0mov%fs:0xc05ef110,%ecx
c03a74b4:   01 14 01add%edx,(%ecx,%eax,1)
c03a74b7:   c3  ret

That is 4 instructions. I could be two in the future, thanks to current work 
on fs/gs based percpu variables.


Current percpu_counters implementation is more expensive :

c021467b __percpu_counter_add:
c021467b:   55  push   %ebp
c021467c:   57  push   %edi
c021467d:   89 c7   mov%eax,%edi
c021467f:   56  push   %esi
c0214680:   53  push   %ebx
c0214681:   83 ec 04sub$0x4,%esp
c0214684:   8b 40 14mov0x14(%eax),%eax
c0214687:   64 8b 1d 08 f0 5e c0mov%fs:0xc05ef008,%ebx
c021468e:   8b 6c 24 18 mov0x18(%esp),%ebp
c0214692:   f7 d0   not%eax
c0214694:   8b 1c 98mov(%eax,%ebx,4),%ebx
c0214697:   89 1c 24mov%ebx,(%esp)
c021469a:   8b 03   mov(%ebx),%eax
c021469c:   89 c3   mov%eax,%ebx
c021469e:   89 c6   mov%eax,%esi
c02146a0:   c1 fe 1fsar$0x1f,%esi
c02146a3:   89 e8   mov%ebp,%eax
c02146a5:   01 d3   add%edx,%ebx
c02146a7:   11 ce   adc%ecx,%esi
c02146a9:   99  cltd
c02146aa:   39 d6   cmp%edx,%esi
c02146ac:   7f 15   jg c02146c3 
__percpu_counter_add+0x48
c02146ae:   7c 04   jl c02146b4 
__percpu_counter_add+0x39

c02146b0:   39 eb   cmp%ebp,%ebx
c02146b2:   73 0f   jaec02146c3 
__percpu_counter_add+0x48

c02146b4:   f7 dd   neg%ebp
c02146b6:   89 e8   mov%ebp,%eax
c02146b8:   99  cltd
c02146b9:   39 d6   cmp%edx,%esi
c02146bb:   7f 20   jg c02146dd 
__percpu_counter_add+0x62
c02146bd:   7c 04   jl c02146c3 
__percpu_counter_add+0x48

c02146bf:   39 eb   cmp%ebp,%ebx
c02146c1:   77 1a   ja c02146dd 
__percpu_counter_add+0x62

c02146c3:   89 f8   mov%edi,%eax
c02146c5:   e8 04 cc 1f 00  call   c04112ce _spin_lock
c02146ca:   01 5f 04add%ebx,0x4(%edi)
c02146cd:   11 77 08adc%esi,0x8(%edi)
c02146d0:   8b 04 24mov(%esp),%eax
c02146d3:   c7 00 00 00 00 00   movl   $0x0,(%eax)
c02146d9:   fe 07   incb   (%edi)
c02146db:   eb 05   jmpc02146e2 
__percpu_counter_add+0x67

c02146dd:   8b 04 24mov(%esp),%eax
c02146e0:   89 18   mov%ebx,(%eax)
c02146e2:   58  pop%eax
c02146e3:   5b  pop%ebx
c02146e4:   5e  pop%esi
c02146e5:   5f  pop%edi
c02146e6:   5d  pop%ebp
c02146e7:   c3  ret


Once it is better, just make pcounter vanish.

It is even clearly stated at the top of include/linux/pcounter.h

/*
 * Using a dynamic percpu 'int' variable has a cost :
 * 1) Extra dereference
 * Current per_cpu_ptr() implementation uses an array per 'percpu variable'.
 * 2) memory cost of NR_CPUS*(32+sizeof(void *)) instead of 
num_possible_cpus()*4
 *
 * This pcounter implementation is an abstraction to be able to use
 * either a static or a dynamic per cpu variable.
 * One dynamic per cpu variable gets a fast  cheap implementation, we can
 * change pcounter implementation too.
 */


We all agree.

Thank you
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: include/linux/pcounter.h

2008-02-16 Thread Andrew Morton
   mov%edi,%eax
 c02146c5:   e8 04 cc 1f 00  call   c04112ce _spin_lock
 c02146ca:   01 5f 04add%ebx,0x4(%edi)
 c02146cd:   11 77 08adc%esi,0x8(%edi)
 c02146d0:   8b 04 24mov(%esp),%eax
 c02146d3:   c7 00 00 00 00 00   movl   $0x0,(%eax)
 c02146d9:   fe 07   incb   (%edi)
 c02146db:   eb 05   jmpc02146e2 
 __percpu_counter_add+0x67
 c02146dd:   8b 04 24mov(%esp),%eax
 c02146e0:   89 18   mov%ebx,(%eax)
 c02146e2:   58  pop%eax
 c02146e3:   5b  pop%ebx
 c02146e4:   5e  pop%esi
 c02146e5:   5f  pop%edi
 c02146e6:   5d  pop%ebp
 c02146e7:   c3  ret
 
 
 Once it is better, just make pcounter vanish.

Some of the stuff in there is from the __percpu_disguise() thing which we
probably can live without.

But I'd be surprised if benchmarking reveals that the pcounter code is
justifiable in its present networking application or indeed in any future
ones.

 It is even clearly stated at the top of include/linux/pcounter.h
 
 /*
   * Using a dynamic percpu 'int' variable has a cost :
   * 1) Extra dereference
   * Current per_cpu_ptr() implementation uses an array per 'percpu variable'.
   * 2) memory cost of NR_CPUS*(32+sizeof(void *)) instead of 
 num_possible_cpus()*4
   *
   * This pcounter implementation is an abstraction to be able to use
   * either a static or a dynamic per cpu variable.
   * One dynamic per cpu variable gets a fast  cheap implementation, we can
   * change pcounter implementation too.
   */
 
 
 We all agree.
 

No we don't.  That comment is afaict wrong about the memory consumption and
the abstraction *isn't useful*.

Why do we want some abstraction which makes alloc_percpu() storage and
DEFINE_PERCPU storage look the same?  What use is there in that?  One is
per-object storage and one is singleton storage - they're quite different
things and they are used in quite different situations and they are
basically never interchangeable.  Yet we add this pretend-they're-the-same
wrapper around them which costs us an indirect function call on the
fastpath.
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: include/linux/pcounter.h

2008-02-16 Thread Eric Dumazet
  push   %edi
c021467d:   89 c7   mov%eax,%edi
c021467f:   56  push   %esi
c0214680:   53  push   %ebx
c0214681:   83 ec 04sub$0x4,%esp
c0214684:   8b 40 14mov0x14(%eax),%eax
c0214687:   64 8b 1d 08 f0 5e c0mov%fs:0xc05ef008,%ebx
c021468e:   8b 6c 24 18 mov0x18(%esp),%ebp
c0214692:   f7 d0   not%eax
c0214694:   8b 1c 98mov(%eax,%ebx,4),%ebx
c0214697:   89 1c 24mov%ebx,(%esp)
c021469a:   8b 03   mov(%ebx),%eax
c021469c:   89 c3   mov%eax,%ebx
c021469e:   89 c6   mov%eax,%esi
c02146a0:   c1 fe 1fsar$0x1f,%esi
c02146a3:   89 e8   mov%ebp,%eax
c02146a5:   01 d3   add%edx,%ebx
c02146a7:   11 ce   adc%ecx,%esi
c02146a9:   99  cltd
c02146aa:   39 d6   cmp%edx,%esi
c02146ac:   7f 15   jg c02146c3 
__percpu_counter_add+0x48
c02146ae:   7c 04   jl c02146b4 


One of the above two branches is taken ((FBC_BATCH-1)/FBC_BATCH)ths of the
time.



__percpu_counter_add+0x39
c02146b0:   39 eb   cmp%ebp,%ebx
c02146b2:   73 0f   jaec02146c3 
__percpu_counter_add+0x48

c02146b4:   f7 dd   neg%ebp
c02146b6:   89 e8   mov%ebp,%eax
c02146b8:   99  cltd
c02146b9:   39 d6   cmp%edx,%esi
c02146bb:   7f 20   jg c02146dd 
__percpu_counter_add+0x62
c02146bd:   7c 04   jl c02146c3 
__percpu_counter_add+0x48

c02146bf:   39 eb   cmp%ebp,%ebx
c02146c1:   77 1a   ja c02146dd 
__percpu_counter_add+0x62

c02146c3:   89 f8   mov%edi,%eax
c02146c5:   e8 04 cc 1f 00  call   c04112ce _spin_lock
c02146ca:   01 5f 04add%ebx,0x4(%edi)
c02146cd:   11 77 08adc%esi,0x8(%edi)
c02146d0:   8b 04 24mov(%esp),%eax
c02146d3:   c7 00 00 00 00 00   movl   $0x0,(%eax)
c02146d9:   fe 07   incb   (%edi)
c02146db:   eb 05   jmpc02146e2 
__percpu_counter_add+0x67

c02146dd:   8b 04 24mov(%esp),%eax
c02146e0:   89 18   mov%ebx,(%eax)
c02146e2:   58  pop%eax
c02146e3:   5b  pop%ebx
c02146e4:   5e  pop%esi
c02146e5:   5f  pop%edi
c02146e6:   5d  pop%ebp
c02146e7:   c3  ret


Once it is better, just make pcounter vanish.


Some of the stuff in there is from the __percpu_disguise() thing which we
probably can live without.

But I'd be surprised if benchmarking reveals that the pcounter code is
justifiable in its present networking application or indeed in any future
ones.


I have no benchmarks, but real workloads where it matters, and where userland 
eats icache/dcache all the time.





It is even clearly stated at the top of include/linux/pcounter.h

/*
  * Using a dynamic percpu 'int' variable has a cost :
  * 1) Extra dereference
  * Current per_cpu_ptr() implementation uses an array per 'percpu variable'.
  * 2) memory cost of NR_CPUS*(32+sizeof(void *)) instead of 
num_possible_cpus()*4
  *
  * This pcounter implementation is an abstraction to be able to use
  * either a static or a dynamic per cpu variable.
  * One dynamic per cpu variable gets a fast  cheap implementation, we can
  * change pcounter implementation too.
  */


We all agree.



No we don't.  That comment is afaict wrong about the memory consumption and
the abstraction *isn't useful*.


Fact is that we need percpu 32bits counters, and we need to have pointers to 
them.

Current percpu_counters cannot cope that.



Why do we want some abstraction which makes alloc_percpu() storage and
DEFINE_PERCPU storage look the same?  What use is there in that?  One is
per-object storage and one is singleton storage - they're quite different
things and they are used in quite different situations and they are
basically never interchangeable.  Yet we add this pretend-they're-the-same
wrapper around them which costs us an indirect function call on the
fastpath.


I believe all 'pcounter' are in fact statically allocated 'one per struct 
proto' to track inuse count. (search for DEFINE_PROTO_INUSE() uses)


# find net include|xargs grep -n DEFINE_PROTO_INUSE
net/dccp/ipv6.c:1105:DEFINE_PROTO_INUSE(dccp_v6)
net/dccp/ipv4.c:920:DEFINE_PROTO_INUSE(dccp_v4)
net/ipv6/udp.c:1001:DEFINE_PROTO_INUSE(udpv6)
net/ipv6/udplite.c:43

Re: include/linux/pcounter.h

2008-02-16 Thread Arjan van de Ven
On Sat, 16 Feb 2008 11:26:18 -0800
Andrew Morton [EMAIL PROTECTED] wrote:

  indirect functions calls are everywhere in kernel, network, fs,
  everywhere.
 
 That doesn't make them fast.

just to emphasize this: an indirect function call is at least as expensive as 
an atomic operation on 
x86 cpus (if not more)

-- 
If you want to reach me at my work email, use [EMAIL PROTECTED]
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: include/linux/pcounter.h

2008-02-16 Thread David Miller
From: Andrew Morton [EMAIL PROTECTED]
Date: Sat, 16 Feb 2008 11:26:18 -0800

 On Sat, 16 Feb 2008 13:03:54 +0100 Eric Dumazet [EMAIL PROTECTED] wrote:
 
  Yes, per connection basis. Some workloads want to open/close more than 1000 
  sockets per second.
 
 ie: slowpath

Definitely not slow path in the networking.

Connection rates are definitely as, or more, important than packet
rates for certain workloads.
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: include/linux/pcounter.h

2008-02-15 Thread Andrew Morton

- First up, why was this added at all?  We have percpu_counter.h which
  has several years development invested in it.  afaict it would suit the
  present applications of pcounters.

  If some deficiency in percpu_counters has been identified, is it
  possible to correct that deficiency rather than implementing a whole new
  counter type?  That would be much better.

- The comments in pcounter.h appear to indicate that there is a
  performance advantage (and we infer that the advantage is when the
  statically-allocated flavour of pcounters is used).  When compared with
  percpu_counters the number of data-reference indirections is the same as
  with percpu_counters, so no advantage there.

  And, bizarrely, because of a quite inappropriate abstraction toy, both
  flavours of pcounters add an indirect function call which I believe is
  significantly more expensive than a plain old pointer indirection.

  So it's quite possible that DEFINE_PCOUNTER-style counters consume more
  memory and are slower than percpu_counters.  They will surely be much
  slower on the read side.  More below.

  If we really want to put some helper wrappers around
  DEFINE_PER_CPU(s32) then I'd suggest that we should do that as a
  standalone thing and not attempt to graft the same interface onto two
  quite different types of storage (DEFINE_PER_CPU and alloc_per_cpu)

- The comment 2) in pcounter.h (which overflows 80 cols and probably
  wasn't passed through checkpatch) indicates that some other
  implementation (presumably plain old DEFINE_PER_CPU) will use
  NR_CPUS*(32+sizeof(void *)) bytes of storage.  But DEFINE_PCOUNTER will
  use as much memory as DEFINE_PER_CPU(s32) and both pcounter_alloc()-style
  pcounters and percpu_counters use
  num_possible_cpus()*sizeof(s32)+epsilon.

- The CONFIG_SMP=n stubs in pcounter.h are cheesy and are vulnerable to
  several well-known compilation risks which I always forget.  Should be
  converted to regular static inlines.

- the comment in lib/pcounter.c needlessly exceeds 80 cols.

- pcounter_dyn_add() will spew a
  use-of-smp_processor_id()-in-preemptible-code warning if used in places
  where one could reasonably use it.  The interface could do with a bit of
  a rethink.  Or at least some justification and documentation.

- pcounter_getval() will be disastrously inefficient if
  num_possible_cpus() is much greater than num_online_cpus().  It should
  use for_each_online_cpu() (as does percpu_counter), and implement a CPU
  hotplug notifier (as does percpu_counter).

  It will remain grossly inefficient at high CPU counts, unlike
  percpu_counters, which solved this problem by doing a batched spill into
  a central counter at add/sub time.

  The danger here is that someone will actually use this interface in new
  code.  Six months later (when it's too late to fix it) the big-NUMA guys
  come up and say whaa, when our user does this it disabled interrupts
  for N milliseconds.

- pcounter_getval() can return incorrect negative numbers.  This can
  cause caller malfunctions in very rare situations because callers just
  don't expect the things which they're counting to go negative.

  We experienced this during the evolution of percpu_counter.  See
  percpu_counter_sum_positive() and friends.

- pcounter_alloc() should return -ENOMEM on allocation error, not 1.

- pcounter_free() perhaps shouldn't test for (self-per_cpu_values !=
  NULL), because callers shouldn't be calling it if pcounter_alloc() failed
  (arguable).



afaict the whole implementation can and should be removed and replaced with
percpu_counters.  I don't think there's much point in its ability to manage
DEFINE_PER_CPU counters: pcounter_getval() remains grossly inefficient (and
can return negative values) and quite a bit of new code will need to be put
in place to address that.

But perhaps there are plans to evolve it into something further in the
future, I don't know.  But I would suggest that the people who have worked
upon percpu_counters (principally Gautham, Peter Z, clameter and me) be
involved in that work.

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


include/linux/pcounter.h

2008-02-04 Thread Andrew Morton

e7d0362dd41e760f340c1b500646cc92522bd9d5 should have been folded into
de4d1db369785c29d68915edfee0cb70e8199f4c prior to merging.  We now and for
ever have a window of breakage which screws up git bisection.  Which I
just hit.  Which is the only reason I discovered the file's existence.


Please do not merge pieces of generic kernel infrastructure while keeping
it all secret on the netdev list.  Ever.  That code has a number of deficiencies
which I and probably others would have noted had it been offered to us
for review.
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: include/linux/pcounter.h

2008-02-04 Thread David Miller
From: Andrew Morton [EMAIL PROTECTED]
Date: Mon, 4 Feb 2008 01:44:02 -0800

 Please do not merge pieces of generic kernel infrastructure while
 keeping it all secret on the netdev list.  Ever.

It was so damn secret that it sat in your -mm tree for months.

Don't be rediculious Andrew.
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: include/linux/pcounter.h

2008-02-04 Thread Andrew Morton
On Mon, 04 Feb 2008 16:20:35 -0800 (PST)
David Miller [EMAIL PROTECTED] wrote:

 From: Andrew Morton [EMAIL PROTECTED]
 Date: Mon, 4 Feb 2008 01:44:02 -0800
 
  Please do not merge pieces of generic kernel infrastructure while
  keeping it all secret on the netdev list.  Ever.
 
 It was so damn secret that it sat in your -mm tree for months.

I never noticed it and I doubt if anyone else did.  How was I (or anyone
else) to have known?

 Don't be rediculious Andrew.

There is nothing ridiculous about requiring that new generic kernel
infrastructure patches be appropriately submitted and reviewed.

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html