Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-11-12 Thread David Miller
From: Eric Dumazet <[EMAIL PROTECTED]>
Date: Mon, 12 Nov 2007 21:18:17 +0100

> Christoph Lameter a écrit :
> > On Mon, 12 Nov 2007, Eric Dumazet wrote:
> >> For example, I do think using a per cpu memory storage on net_device 
> >> refcnt &
> >> last_rx could give us some speedups.
> > 
> > Note that there was a new patchset posted (titled cpu alloc v1) that 
> > provides on demand extension of the cpu areas.
> > 
> > See http://marc.info/?l=linux-kernel&m=119438261304093&w=2
> 
> Thank you Christoph. I was traveling last week so I missed that.
> 
> This new patchset looks very interesting, you did a fantastic job !

Yes I like it too.  It's in my backlog of things to test on
sparc64.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-11-12 Thread David Miller
From: Eric Dumazet <[EMAIL PROTECTED]>
Date: Mon, 12 Nov 2007 21:14:47 +0100

> I don't think this is a problem. CPU counts and RAM size are related, even
> if Moore didn't predict it.
> 
> Nobody wants to ship/build a 4096-cpu machine with 256 MB of RAM inside.
> Or call it a GPU and don't expect it to run Linux :)
> 
> 99.9% of Linux machines running on earth have fewer than 8 cpus and fewer
> than 1000 ethernet/network devices.
> 
> As we increase the number of cpus on a machine, the limiting factor is
> that cpus have to continually exchange, over the memory bus, the highly
> contended cache lines that contain refcounters or stats.

I totally agree with everything Eric is saying here.


Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-11-12 Thread David Miller
From: Herbert Xu <[EMAIL PROTECTED]>
Date: Mon, 12 Nov 2007 18:52:35 +0800

> David Miller <[EMAIL PROTECTED]> wrote:
> > 
> > Each IP compression tunnel instance does an alloc_percpu().
> 
> Actually all IPComp tunnels share one set of objects which are
> allocated per-cpu.  So only the first tunnel would do that.
> 
> In fact that was precisely the reason why per-cpu is used in
> IPComp as otherwise we can just allocate normal memory.

Hmmm... indeed.  Thanks for clearing this up.


Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-11-12 Thread Eric Dumazet

Christoph Lameter a écrit :
> On Mon, 12 Nov 2007, Eric Dumazet wrote:
>> For example, I do think using a per cpu memory storage on net_device refcnt &
>> last_rx could give us some speedups.
>
> Note that there was a new patchset posted (titled cpu alloc v1) that
> provides on demand extension of the cpu areas.
>
> See http://marc.info/?l=linux-kernel&m=119438261304093&w=2

Thank you Christoph. I was traveling last week so I missed that.

This new patchset looks very interesting, you did a fantastic job !



Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-11-12 Thread Eric Dumazet

Luck, Tony a écrit :
>>> Ahh so the need to be able to expand per cpu memory storage on demand
>>> is not as critical as we thought.
>>
>> Yes, but still desirable for future optimizations.
>>
>> For example, I do think using a per cpu memory storage on net_device refcnt &
>> last_rx could give us some speedups.
>
> We do want to keep a very tight handle on bloat in per-cpu
> allocations.  By definition the total allocation is multiplied
> by the number of cpus.  Only ia64 has outrageous numbers of
> cpus in a single system image today ... but the trend in
> multi-core chips looks to have a Moore's law arc to it, so
> everyone is going to be looking at lots of cpus before long.

I don't think this is a problem. CPU counts and RAM size are related, even
if Moore didn't predict it.

Nobody wants to ship/build a 4096-cpu machine with 256 MB of RAM inside.
Or call it a GPU and don't expect it to run Linux :)

99.9% of Linux machines running on earth have fewer than 8 cpus and fewer
than 1000 ethernet/network devices.

As we increase the number of cpus on a machine, the limiting factor is
that cpus have to continually exchange, over the memory bus, the highly
contended cache lines that contain refcounters or stats.




RE: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-11-12 Thread Luck, Tony
> > Ahh so the need to be able to expand per cpu memory storage on demand 
> > is not as critical as we thought.
> > 
>
> Yes, but still desirable for future optimizations.
>
> For example, I do think using a per cpu memory storage on net_device refcnt & 
> last_rx could give us some speedups.

We do want to keep a very tight handle on bloat in per-cpu
allocations.  By definition the total allocation is multiplied
by the number of cpus.  Only ia64 has outrageous numbers of
cpus in a single system image today ... but the trend in
multi-core chips looks to have a Moore's law arc to it, so
everyone is going to be looking at lots of cpus before long.

-Tony


Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-11-12 Thread Christoph Lameter
On Mon, 12 Nov 2007, Eric Dumazet wrote:

> Christoph Lameter a écrit :
> > On Mon, 12 Nov 2007, Herbert Xu wrote:
> > 
> > > David Miller <[EMAIL PROTECTED]> wrote:
> > > > Each IP compression tunnel instance does an alloc_percpu().
> > > Actually all IPComp tunnels share one set of objects which are
> > > allocated per-cpu.  So only the first tunnel would do that.
> > 
> > Ahh so the need to be able to expand per cpu memory storage on demand is not
> > as critical as we thought.
> > 
> 
> Yes, but still desirable for future optimizations.
> 
> For example, I do think using a per cpu memory storage on net_device refcnt &
> last_rx could give us some speedups.

Note that there was a new patchset posted (titled cpu alloc v1) that 
provides on demand extension of the cpu areas.

See http://marc.info/?l=linux-kernel&m=119438261304093&w=2


Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-11-12 Thread Eric Dumazet

Christoph Lameter a écrit :
> On Mon, 12 Nov 2007, Herbert Xu wrote:
>> David Miller <[EMAIL PROTECTED]> wrote:
>>> Each IP compression tunnel instance does an alloc_percpu().
>>
>> Actually all IPComp tunnels share one set of objects which are
>> allocated per-cpu.  So only the first tunnel would do that.
>
> Ahh so the need to be able to expand per cpu memory storage on demand
> is not as critical as we thought.

Yes, but still desirable for future optimizations.

For example, I do think using a per cpu memory storage on net_device refcnt &
last_rx could give us some speedups.






Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-11-12 Thread Christoph Lameter
On Mon, 12 Nov 2007, Herbert Xu wrote:

> David Miller <[EMAIL PROTECTED]> wrote:
> > 
> > Each IP compression tunnel instance does an alloc_percpu().
> 
> Actually all IPComp tunnels share one set of objects which are
> allocated per-cpu.  So only the first tunnel would do that.

Ahh so the need to be able to expand per cpu memory storage on demand 
is not as critical as we thought.



Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-11-12 Thread Herbert Xu
David Miller <[EMAIL PROTECTED]> wrote:
> 
> Each IP compression tunnel instance does an alloc_percpu().

Actually all IPComp tunnels share one set of objects which are
allocated per-cpu.  So only the first tunnel would do that.

In fact that was precisely the reason why per-cpu is used in
IPComp as otherwise we can just allocate normal memory.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-11-02 Thread Christoph Lameter
On Fri, 2 Nov 2007, Peter Zijlstra wrote:

> On Fri, 2007-11-02 at 07:35 -0700, Christoph Lameter wrote:
> 
> > Well I wonder if I should introduce it not as a replacement but as an 
> > alternative to allocpercpu? We can then gradually switch over. The 
> > existing API does not allow the specification of gfp_masks or alignments.
> 
> I've thought about suggesting that very thing. However, I think we need
> to have a clear view of where we're going with that so that we don't end
> up with two per cpu allocators because some users could not be converted
> over or some such.

At least my tests so far show that it can be a full replacement, but 
then I have only tested on x86_64 and IA64. It's likely much easier to go
for the full replacement rather than in steps.

If we want dynamically sized virtually mapped per cpu areas then we may 
have issues on 32 bit platforms and with !MMU. So I would think that a 
fallback to a statically sized version may be needed. On the other hand
!MMU and 32 bit do not support a large number of processors. So we may be 
able to get away on 32 bit with a small virtual memory area.



Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-11-02 Thread Peter Zijlstra
On Fri, 2007-11-02 at 07:35 -0700, Christoph Lameter wrote:

> Well I wonder if I should introduce it not as a replacement but as an 
> alternative to allocpercpu? We can then gradually switch over. The 
> existing API does not allow the specification of gfp_masks or alignments.

I've thought about suggesting that very thing. However, I think we need
to have a clear view of where we're going with that so that we don't end
up with two per cpu allocators because some users could not be converted
over or some such.






Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-11-02 Thread Christoph Lameter
On Fri, 2 Nov 2007, Peter Zijlstra wrote:

> On Thu, 2007-11-01 at 15:58 -0700, David Miller wrote:
> 
> > Since you're the one who wants to change the semantics and guarantees
> > of this interface, perhaps it might help if you did some greps around
> > the tree to see how alloc_percpu() is actually used.  That's what
> > I did when I started running into trouble with your patches.
> 
> This fancy new BDI stuff also lives off percpu_counter/alloc_percpu().

Yes, there are numerous uses. I can even increase page allocator 
performance and reduce its memory footprint by using it here.

> That means that for example each NFS mount also consumes a number of
> words - not quite sure from the top of my head how many, might be in the
> order of 24 bytes or something.
> 
> I once before started looking at this, because the current
> alloc_percpu() can have some false sharing - not that I have machines
> that are overly bothered by that. I like the idea of a strict percpu
> region, however do be aware of the users.

Well I wonder if I should introduce it not as a replacement but as an 
alternative to allocpercpu? We can then gradually switch over. The 
existing API does not allow the specification of gfp_masks or alignments.



Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-11-02 Thread Peter Zijlstra
On Thu, 2007-11-01 at 15:58 -0700, David Miller wrote:

> Since you're the one who wants to change the semantics and guarantees
> of this interface, perhaps it might help if you did some greps around
> the tree to see how alloc_percpu() is actually used.  That's what
> I did when I started running into trouble with your patches.

This fancy new BDI stuff also lives off percpu_counter/alloc_percpu().

That means that for example each NFS mount also consumes a number of
words - not quite sure from the top of my head how many, might be in the
order of 24 bytes or something.

I once before started looking at this, because the current
alloc_percpu() can have some false sharing - not that I have machines
that are overly bothered by that. I like the idea of a strict percpu
region, however do be aware of the users.




Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-11-01 Thread David Miller
From: Christoph Lameter <[EMAIL PROTECTED]>
Date: Thu, 1 Nov 2007 18:06:17 -0700 (PDT)

> A reasonable implementation for 64 bit is likely going to depend on 
> reserving some virtual memory space for the per cpu mappings so that they 
> can be dynamically grown up to what the reserved virtual space allows.
> 
> F.e. If we reserve 256G of virtual space and support a maximum of 16k cpus 
> then there is a limit on the per cpu space available of 16MB.

Now that I understand your implementation better, yes this
sounds just fine.


Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-11-01 Thread Christoph Lameter
Hmmm... On x86_64 we could take 8 terabyte virtual space (bit order 43)

With the worst case scenario of 16k of cpus (bit order 16) we are looking 
at 43-16 = 27 ~ 128MB per cpu. Each percpu can at max be mapped by 64 pmd 
entries. 4k support is actually max for projected hw. So we'd get 
to 512M. 

On IA64 we could take half of the vmemmap area which is 45 bits. So 
we could get up to 512MB (with 16k pages, 64k pages can get us even 
further) assuming we can at some point run 16 processors per node (4k is 
the current max which would put the limit on the per cpu area >1GB).

Lets say you have a system with 64 cpus and an area of 128M of per cpu 
storage. Then we are using 8GB of total memory for per cpu storage. The 
128M allows us to store f.e.  16 M of word size counters.

With SLAB and the current allocpercpu you would need the following for 
16M counters:

16M*32*64 (minimum alloc size of SLAB is 32 byte and we alloc via 
kmalloc) for the data.

16M*64*8 for the pointer arrays. 16M allocpercpu areas for 64 processors 
and a pointer size of 8 bytes.

So you would need to use 40G in current systems. The new scheme 
would only need 8GB for the same amount of counters.

So I think it's unreasonable to assume that systems currently exist that 
can use more than 128M of allocpercpu space (assuming 64 cpus).

---
 include/asm-x86/pgtable_64.h |    4 ++++
 1 file changed, 4 insertions(+)

Index: linux-2.6/include/asm-x86/pgtable_64.h
===================================================================
--- linux-2.6.orig/include/asm-x86/pgtable_64.h	2007-11-01 18:15:52.282577904 -0700
+++ linux-2.6/include/asm-x86/pgtable_64.h	2007-11-01 18:18:02.886979040 -0700
@@ -138,10 +138,14 @@ static inline pte_t ptep_get_and_clear_f
 #define VMALLOC_START	_AC(0xc200, UL)
 #define VMALLOC_END	_AC(0xe1ff, UL)
 #define VMEMMAP_START	_AC(0xe200, UL)
+#define PERCPU_START	_AC(0xf200, UL)
+#define PERCPU_END	_AC(0xfa00, UL)
 #define MODULES_VADDR	_AC(0x8800, UL)
 #define MODULES_END	_AC(0xfff0, UL)
 #define MODULES_LEN	(MODULES_END - MODULES_VADDR)
 
+#define PERCPU_MIN_SHIFT	PMD_SHIFT
+#define PERCPU_BITS	43
+
 #define _PAGE_BIT_PRESENT  0
 #define _PAGE_BIT_RW   1
 #define _PAGE_BIT_USER 2



Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-11-01 Thread Christoph Lameter
On Thu, 1 Nov 2007, David Miller wrote:

> You cannot put limits on the amount of alloc_percpu() memory available
> to clients, please let's proceed with that basic understanding in
> mind.  We're wasting a ton of time discussing this fundamental issue.

There is no point in making absolute demands like "no limits". There are 
always limits to everything. 

A new implementation avoids the need to allocate per cpu arrays and also 
avoids the 32 bytes per object times cpus that are mostly wasted for small 
allocations today. So it's going to potentially allow more per cpu objects
than are available today.

A reasonable implementation for 64 bit is likely going to depend on 
reserving some virtual memory space for the per cpu mappings so that they 
can be dynamically grown up to what the reserved virtual space allows.

F.e. If we reserve 256G of virtual space and support a maximum of 16k cpus 
then there is a limit on the per cpu space available of 16MB.


Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-11-01 Thread Christoph Lameter
On Fri, 2 Nov 2007, Eric Dumazet wrote:

> > Na. Some reasonable upper limit needs to be set. If we set that to say
> > 32Megabytes and do the virtual mapping then we can just populate the first
> > 2M and only allocate the remainder if we need it. Then we need to rely on
> > Mel's defrag stuff to defrag memory if we need it.
> 
> If a 2MB page is not available, could we revert to using 4KB pages? (like
> vmalloc stuff), paying an extra runtime overhead of course.

Sure. It's going to be like vmemmap. There will be limits imposed though 
by the amount of virtual space available. Basically the dynamic per cpu 
area can be at maximum

available_virtual_space / NR_CPUS



Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-11-01 Thread Eric Dumazet

Christoph Lameter a écrit :
> On Thu, 1 Nov 2007, David Miller wrote:
> 
>> From: Christoph Lameter <[EMAIL PROTECTED]>
>> Date: Thu, 1 Nov 2007 15:15:39 -0700 (PDT)
>> 
>>> After boot is complete we allow the reduction of the size of the per cpu 
>>> areas. Lets say we only need 128k per cpu. Then the remaining pages will
>>> be returned to the page allocator.
>> 
>> You don't know how much you will need.  I exhausted the limit on
>> sparc64 very late in the boot process when the last few userland
>> services were starting up.
> 
> Well you would be able to specify how much will remain. If not it will 
> just keep the 2M reserve around.
> 
>> And if I subsequently bring up 100,000 IP tunnels, it will exhaust the
>> per-cpu allocation area.
> 
> Each tunnel needs 4 bytes per cpu?

well, if we move last_rx to a percpu var, we need 8 bytes of percpu space per 
net_device :)

>> You have to make it fully dynamic, there is no way around it.
> 
> Na. Some reasonable upper limit needs to be set. If we set that to say 
> 32Megabytes and do the virtual mapping then we can just populate the first 
> 2M and only allocate the remainder if we need it. Then we need to rely on 
> Mel's defrag stuff to defrag memory if we need it.

If a 2MB page is not available, could we revert to using 4KB pages? (like 
vmalloc stuff), paying an extra runtime overhead of course.





Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-11-01 Thread David Miller
From: Christoph Lameter <[EMAIL PROTECTED]>
Date: Thu, 1 Nov 2007 15:48:00 -0700 (PDT)

> On Thu, 1 Nov 2007, David Miller wrote:
> 
> > From: Christoph Lameter <[EMAIL PROTECTED]>
> > Date: Thu, 1 Nov 2007 15:15:39 -0700 (PDT)
> > 
> > > After boot is complete we allow the reduction of the size of the per cpu 
> > > areas . Lets say we only need 128k per cpu. Then the remaining pages will
> > > be returned to the page allocator.
> > 
> > You don't know how much you will need.  I exhausted the limit on
> > sparc64 very late in the boot process when the last few userland
> > services were starting up.
> 
> Well you would be able to specify how much will remain. If not it will 
> just keep the 2M reserve around.
> 
> > And if I subsequently bring up 100,000 IP tunnels, it will exhaust the
> > per-cpu allocation area.
> 
> Each tunnel needs 4 bytes per cpu?

Each IP compression tunnel instance does an alloc_percpu().

Since you're the one who wants to change the semantics and guarantees
of this interface, perhaps it might help if you did some greps around
the tree to see how alloc_percpu() is actually used.  That's what
I did when I started running into trouble with your patches.

You cannot put limits on the amount of alloc_percpu() memory available
to clients, please let's proceed with that basic understanding in
mind.  We're wasting a ton of time discussing this fundamental issue.


Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-11-01 Thread Christoph Lameter
On Thu, 1 Nov 2007, David Miller wrote:

> From: Christoph Lameter <[EMAIL PROTECTED]>
> Date: Thu, 1 Nov 2007 15:15:39 -0700 (PDT)
> 
> > After boot is complete we allow the reduction of the size of the per cpu 
> > areas . Lets say we only need 128k per cpu. Then the remaining pages will
> > be returned to the page allocator.
> 
> You don't know how much you will need.  I exhausted the limit on
> sparc64 very late in the boot process when the last few userland
> services were starting up.

Well you would be able to specify how much will remain. If not it will 
just keep the 2M reserve around.

> And if I subsequently bring up 100,000 IP tunnels, it will exhaust the
> per-cpu allocation area.

Each tunnel needs 4 bytes per cpu?

> You have to make it fully dynamic, there is no way around it.

Na. Some reasonable upper limit needs to be set. If we set that to say 
32Megabytes and do the virtual mapping then we can just populate the first 
2M and only allocate the remainder if we need it. Then we need to rely on 
Mel's defrag stuff to defrag memory if we need it.



Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-11-01 Thread David Miller
From: Christoph Lameter <[EMAIL PROTECTED]>
Date: Thu, 1 Nov 2007 15:15:39 -0700 (PDT)

> After boot is complete we allow the reduction of the size of the per cpu 
> areas . Lets say we only need 128k per cpu. Then the remaining pages will
> be returned to the page allocator.

You don't know how much you will need.  I exhausted the limit on
sparc64 very late in the boot process when the last few userland
services were starting up.

And if I subsequently bring up 100,000 IP tunnels, it will exhaust the
per-cpu allocation area.

You have to make it fully dynamic, there is no way around it.


Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-11-01 Thread Christoph Lameter
On Thu, 1 Nov 2007, David Miller wrote:

> From: Christoph Lameter <[EMAIL PROTECTED]>
> Date: Thu, 1 Nov 2007 15:11:41 -0700 (PDT)
> 
> > On Thu, 1 Nov 2007, David Miller wrote:
> > 
> > > The remaining issue with accessing per-cpu areas at multiple virtual
> > > addresses is D-cache aliasing.
> > 
> > But that is not an issue for physically mapped caches.
> 
> Right but I'd like to use this on sparc64 which has L1 D-cache
> aliasing on some chips :-)

Hmmm... re my message I just sent. Then we have to return the memory with 
the virtual address not with the physical address on sparc. May result in 
zones with holes though.



Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-11-01 Thread Christoph Lameter
On Thu, 1 Nov 2007, David Miller wrote:

> From: Christoph Lameter <[EMAIL PROTECTED]>
> Date: Thu, 1 Nov 2007 06:03:44 -0700 (PDT)
> 
> > In order to make it truly dynamic we would have to virtually map the
> > area.  vmap? But that reduces performance.
> 
> But it would still be faster than the double-indirection we do now,
> right?

I think I have an idea how to do this. It's a bit x86_64 specific but here 
it goes.

We define a virtual area of NR_CPUS * 2M areas that are each mapped by a
PMD. That means we have a fixed virtual address for each cpus per cpu 
area. 

First cpu is at PER_CPU_START
Second cpu is at PER_CPU_START + 2M

So the per cpu area for cpu n is easily calculated using

PER_CPU_START + (cpu << 21)

without any lookups.

On bootup we allocate the 2M pages.

After boot is complete we allow the reduction of the size of the per cpu 
areas. Let's say we only need 128k per cpu. Then the remaining pages will
be returned to the page allocator.

We create some sysfs thingy where one can see the current reserves of per 
cpu storage. If one wants to reduce memory then one can write something to 
it to return the remainder of the memory.



Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-11-01 Thread David Miller
From: Christoph Lameter <[EMAIL PROTECTED]>
Date: Thu, 1 Nov 2007 15:11:41 -0700 (PDT)

> On Thu, 1 Nov 2007, David Miller wrote:
> 
> > The remaining issue with accessing per-cpu areas at multiple virtual
> > addresses is D-cache aliasing.
> 
> But that is not an issue for physically mapped caches.

Right but I'd like to use this on sparc64 which has L1 D-cache
aliasing on some chips :-)


Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-11-01 Thread Christoph Lameter
On Thu, 1 Nov 2007, David Miller wrote:

> The remaining issue with accessing per-cpu areas at multiple virtual
> addresses is D-cache aliasing.

But that is not an issue for physically mapped caches.



Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-11-01 Thread David Miller
From: Christoph Lameter <[EMAIL PROTECTED]>
Date: Thu, 1 Nov 2007 06:03:44 -0700 (PDT)

> In order to make it truly dynamic we would have to virtually map the
> area.  vmap? But that reduces performance.

But it would still be faster than the double-indirection we do now,
right?


Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-11-01 Thread David Miller
From: Christoph Lameter <[EMAIL PROTECTED]>
Date: Thu, 1 Nov 2007 05:57:12 -0700 (PDT)

> That is basically what IA64 is doing but it is not usable because you would 
> have addresses that mean different things on different cpus. List heads
> for example require back pointers. If you put a listhead into such a per 
> cpu area then you may corrupt another cpu's per cpu area.

Indeed, but as I pointed out in another mail it actually works if you
set some rules:

1) List insert and delete is only allowed on local CPU lists.

2) List traversal is allowed on remote CPU lists.

I bet we could get all of the per-cpu users to abide by this
rule if we wanted to.

The remaining issue with accessing per-cpu areas at multiple virtual
addresses is D-cache aliasing.


Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-11-01 Thread David Miller
From: Christoph Lameter <[EMAIL PROTECTED]>
Date: Thu, 1 Nov 2007 06:01:14 -0700 (PDT)

> On Thu, 1 Nov 2007, David Miller wrote:
> 
> > IA64 seems to use it universally for every __get_cpu_var()
> > access, so maybe it works out somehow :-)))
> 
> IA64 does not do that. It adds the local cpu offset
> 
> #define __get_cpu_var(var) (*RELOC_HIDE(&per_cpu__##var, 
> __ia64_per_cpu_var(local_per_cpu_offset)))
> #define __raw_get_cpu_var(var) (*RELOC_HIDE(&per_cpu__##var, 
> __ia64_per_cpu_var(local_per_cpu_offset)))

Oh I see, it's the offset itself which is accessed at the fixed
virtual address slot.


Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-11-01 Thread Christoph Lameter
On Thu, 1 Nov 2007, David Miller wrote:

> > This hunk helped the sparc64 looping OOPS I was getting, but cpus hang
> > in some other fashion soon afterwards.
> 
> And if I bump PER_CPU_ALLOC_SIZE up to 128K it seems to mostly work.

Good

> You'll definitely need to make this work dynamically somehow.

Obviously. Any ideas how?

I can probably calculate the size based on the number of online nodes when 
the per cpu areas are set up. But the setup is done before we even parse 
command line arguments. That would still mean a fixed size after bootup.

In order to make it truly dynamic we would have to virtually map the area. 
vmap? But that reduces performance.



Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-11-01 Thread Christoph Lameter
On Thu, 1 Nov 2007, David Miller wrote:

> IA64 seems to use it universally for every __get_cpu_var()
> access, so maybe it works out somehow :-)))

IA64 does not do that. It adds the local cpu offset

#define __get_cpu_var(var) (*RELOC_HIDE(&per_cpu__##var, 
__ia64_per_cpu_var(local_per_cpu_offset)))
#define __raw_get_cpu_var(var) (*RELOC_HIDE(&per_cpu__##var, 
__ia64_per_cpu_var(local_per_cpu_offset)))




Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-11-01 Thread Christoph Lameter
On Thu, 1 Nov 2007, Eric Dumazet wrote:

> I think this question already came in the past and Linus already answered it,
> but I again ask it. What about VM games with modern cpus (64 bits arches)
> 
> Say we reserve on x86_64 a really huge (2^32 bytes) area, and change VM layout
> so that each cpu maps its own per_cpu area on this area, so that the local
> per_cpu data sits in the same virtual address on each cpu. Then we dont need a
> segment prefix nor adding a 'per_cpu offset'. No need to write special asm
> functions to read/write/increment a per_cpu data and gcc could use normal
> rules for optimizations.
> 
> We only would need adding "per_cpu offset" to get data for a given cpu.

That is basically what IA64 is doing but it is not usable because you would 
have addresses that mean different things on different cpus. List heads
for example require back pointers. If you put a listhead into such a per 
cpu area then you may corrupt another cpu's per cpu area.
 


Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-11-01 Thread David Miller
From: David Miller <[EMAIL PROTECTED]>
Date: Thu, 01 Nov 2007 00:01:18 -0700 (PDT)

> From: Christoph Lameter <[EMAIL PROTECTED]>
> Date: Wed, 31 Oct 2007 21:16:59 -0700 (PDT)
> 
> > Index: linux-2.6/mm/allocpercpu.c
> > ===
> > --- linux-2.6.orig/mm/allocpercpu.c 2007-10-31 20:53:16.565486654 -0700
> > +++ linux-2.6/mm/allocpercpu.c  2007-10-31 21:00:27.553486484 -0700
>  ...
> > @@ -37,7 +42,7 @@ enum unit_type { FREE, END, USED };
> >  
> >  static u8 cpu_alloc_map[UNITS_PER_CPU] = { 1, };
> >  static DEFINE_SPINLOCK(cpu_alloc_map_lock);
> > -static DEFINE_PER_CPU(int, cpu_area)[UNITS_PER_CPU];
> > +static DEFINE_PER_CPU(unsigned long long, cpu_area)[UNITS_PER_CPU];
> >  
> > #define CPU_DATA_OFFSET ((unsigned long)&per_cpu__cpu_area)
> >  
> 
> This hunk helped the sparc64 looping OOPS I was getting, but cpus hang
> in some other fashion soon afterwards.

And if I bump PER_CPU_ALLOC_SIZE up to 128K it seems to mostly work.

You'll definitely need to make this work dynamically somehow.


Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-11-01 Thread David Miller
From: Eric Dumazet <[EMAIL PROTECTED]>
Date: Thu, 01 Nov 2007 08:17:58 +0100

> Say we reserve on x86_64 a really huge (2^32 bytes) area, and change
> VM layout so that each cpu maps its own per_cpu area on this area,
> so that the local per_cpu data sits in the same virtual address on
> each cpu.

This is a mechanism used partially on IA64 already.

I think you have to be very careful, and you can only use this per-cpu
fixed virtual address area in extremely limited cases.

The reason is, I think the address matters, consider list heads, for
example.

So you couldn't do:

list_add(&obj->list, per_cpu_ptr(list_head));

and use that per-cpu fixed virtual address.

IA64 seems to use it universally for every __get_cpu_var()
access, so maybe it works out somehow :-)))

I guess if list modifications by remote cpus are disallowed, it would
work (list traversal works because using the fixed virtual address as
the list head sentinel is OK), but that is an extremely fragile
assumption to base the entire mechanism upon.


Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-11-01 Thread Eric Dumazet

Christoph Lameter a écrit :
> This patch increases the speed of the SLUB fastpath by
> improving the per cpu allocator and makes it usable for SLUB.
> 
> Currently allocpercpu manages arrays of pointers to per cpu objects.
> This means that it has to allocate the arrays and then populate them
> as needed with objects. Although these objects are called per cpu
> objects they cannot be handled in the same way as per cpu objects
> by adding the per cpu offset of the respective cpu.
> 
> The patch here changes that. We create a small memory pool in the
> percpu area and allocate from there if alloc per cpu is called.
> As a result we do not need the per cpu pointer arrays for each
> object. This reduces memory usage and also the cache foot print
> of allocpercpu users. Also the per cpu objects for a single processor
> are tightly packed next to each other decreasing cache footprint
> even further and making it possible to access multiple objects
> in the same cacheline.
> 
> SLUB has the same mechanism implemented. After fixing up the
> allocpercpu stuff we throw the SLUB method out and use the new
> allocpercpu handling. Then we optimize allocpercpu addressing
> by adding a new function
> 
> this_cpu_ptr()
> 
> that allows the determination of the per cpu pointer for the
> current processor in a more efficient way on many platforms.
> 
> This increases the speed of SLUB (and likely other kernel subsystems
> that benefit from the allocpercpu enhancements):
> 
> Size   SLAB    SLUB    SLUB+   SLUB-o  SLUB-a
>    8   96      86      45      44      38      3 *
>   16   84      92      49      48      43      2 *
>   32   84      106     61      59      53      +++
>   64   102     129     82      88      75      ++
>  128   147     226     188     181     176     -
>  256   200     248     207     285     204     =
>  512   300     301     260     209     250     +
> 1024   416     440     398     264     391     ++
> 2048   720     542     530     390     511     +++
> 4096   1254    342     342     336     376     3 *
> 
> alloc/free test
>        SLAB    SLUB    SLUB+   SLUB-o  SLUB-a
>        137-146 151     68-72   68-74   56-58   3 *
> 
> Note: The per cpu optimizations are only half way there because of the screwed
> up way that x86_64 handles its cpu area that causes additional cycles to be
> spent by retrieving a pointer from memory and adding it to the address.
> The i386 code is much less cycle intensive, being able to get to per cpu
> data using a segment prefix, and if we can get that to work on x86_64
> then we may be able to get the cycle count for the fastpath down to 20-30
> cycles.

Really sounds good Christoph, not only for SLUB. So I guess the 32k limit is 
not enough, because many things would use per_cpu if only per_cpu were 
reasonably fast (ie not so many dereferences).

I think this question already came up in the past and Linus already answered 
it, but I ask it again. What about VM games with modern cpus (64 bit arches)?

Say we reserve on x86_64 a really huge (2^32 bytes) area, and change the VM 
layout so that each cpu maps its own per_cpu area onto it, so that the local 
per_cpu data sits at the same virtual address on each cpu. Then we dont need a 
segment prefix nor adding a 'per_cpu offset'. No need to write special asm 
functions to read/write/increment a per_cpu data and gcc could use normal 
rules for optimizations.

We would only need adding "per_cpu offset" to get data for a given cpu.



Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-11-01 Thread David Miller
From: Christoph Lameter <[EMAIL PROTECTED]>
Date: Wed, 31 Oct 2007 21:16:59 -0700 (PDT)

> Index: linux-2.6/mm/allocpercpu.c
> ===
> --- linux-2.6.orig/mm/allocpercpu.c   2007-10-31 20:53:16.565486654 -0700
> +++ linux-2.6/mm/allocpercpu.c2007-10-31 21:00:27.553486484 -0700
 ...
> @@ -37,7 +42,7 @@ enum unit_type { FREE, END, USED };
>  
>  static u8 cpu_alloc_map[UNITS_PER_CPU] = { 1, };
>  static DEFINE_SPINLOCK(cpu_alloc_map_lock);
> -static DEFINE_PER_CPU(int, cpu_area)[UNITS_PER_CPU];
> +static DEFINE_PER_CPU(unsigned long long, cpu_area)[UNITS_PER_CPU];
>  
> #define CPU_DATA_OFFSET ((unsigned long)&per_cpu__cpu_area)
>  

This hunk helped the sparc64 looping OOPS I was getting, but cpus hang
in some other fashion soon afterwards.

I'll try to debug this some more later, I've dumped enough time into
this already :-)


Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-11-01 Thread David Miller
From: Christoph Lameter [EMAIL PROTECTED]
Date: Wed, 31 Oct 2007 21:16:59 -0700 (PDT)

 Index: linux-2.6/mm/allocpercpu.c
 ===
 --- linux-2.6.orig/mm/allocpercpu.c   2007-10-31 20:53:16.565486654 -0700
 +++ linux-2.6/mm/allocpercpu.c2007-10-31 21:00:27.553486484 -0700
 ...
 @@ -37,7 +42,7 @@ enum unit_type { FREE, END, USED };
  
  static u8 cpu_alloc_map[UNITS_PER_CPU] = { 1, };
  static DEFINE_SPINLOCK(cpu_alloc_map_lock);
 -static DEFINE_PER_CPU(int, cpu_area)[UNITS_PER_CPU];
 +static DEFINE_PER_CPU(unsigned long long, cpu_area)[UNITS_PER_CPU];
  
  #define CPU_DATA_OFFSET ((unsigned long)per_cpu__cpu_area)
  

This hunk helped the sparc64 looping OOPS I was getting, but cpus hang
in some other fashion soon afterwards.

I'll try to debug this some more later, I've dumped enough time into
this already :-)
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-11-01 Thread Eric Dumazet

Christoph Lameter a écrit :

This patch increases the speed of the SLUB fastpath by
improving the per cpu allocator and makes it usable for SLUB.

Currently allocpercpu manages arrays of pointer to per cpu objects.
This means that is has to allocate the arrays and then populate them
as needed with objects. Although these objects are called per cpu
objects they cannot be handled in the same way as per cpu objects
by adding the per cpu offset of the respective cpu.

The patch here changes that. We create a small memory pool in the
percpu area and allocate from there if alloc per cpu is called.
As a result we do not need the per cpu pointer arrays for each
object. This reduces memory usage and also the cache foot print
of allocpercpu users. Also the per cpu objects for a single processor
are tightly packed next to each other decreasing cache footprint
even further and making it possible to access multiple objects
in the same cacheline.

SLUB has the same mechanism implemented. After fixing up the
alloccpu stuff we throw the SLUB method out and use the new
allocpercpu handling. Then we optimize allocpercpu addressing
by adding a new function

this_cpu_ptr()

that allows the determination of the per cpu pointer for the
current processor in an more efficient way on many platforms.

This increases the speed of SLUB (and likely other kernel subsystems
that benefit from the allocpercpu enhancements):


   SLABSLUBSLUB+   SLUB-o   SLUB-a
   896  86  45  44  38  3 *
  1684  92  49  48  43  2 *
  3284  106 61  59  53  +++
  64102 129 82  88  75  ++
 128147 226 188 181 176 -
 256200 248 207 285 204 =
 512300 301 260 209 250 +
1024416 440 398 264 391 ++
2048720 542 530 390 511 +++
40961254342 342 336 376 3 *

alloc/free test
  SLABSLUBSLUB+   SLUB-oSLUB-a
  137-146 151 68-72   68-74 56-58   3 *

Note: The per cpu optimization are only half way there because of the screwed
up way that x86_64 handles its cpu area that causes addditional cycles to be
spend by retrieving a pointer from memory and adding it to the address.
The i386 code is much less cycle intensive being able to get to per cpu
data using a segment prefix and if we can get that to work on x86_64
then we may be able to get the cycle count for the fastpath down to 20-30
cycles.



Really sounds good Christoph, not only for SLUB, so I guess the 32k limit is 
not enough because many things will use per_cpu if only per_cpu was reasonably 
fast (ie not so many dereferences)


I think this question already came in the past and Linus already answered it, 
but I again ask it. What about VM games with modern cpus (64 bits arches)


Say we reserve on x86_64 a really huge (2^32 bytes) area, and change VM layout 
so that each cpu maps its own per_cpu area on this area, so that the local 
per_cpu data sits in the same virtual address on each cpu. Then we dont need a 
segment prefix nor adding a 'per_cpu offset'. No need to write special asm 
functions to read/write/increment a per_cpu data and gcc could use normal 
rules for optimizations.


We only would need adding per_cpu offset to get data for a given cpu.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-11-01 Thread David Miller
From: Eric Dumazet [EMAIL PROTECTED]
Date: Thu, 01 Nov 2007 08:17:58 +0100

 Say we reserve on x86_64 a really huge (2^32 bytes) area, and change
 VM layout so that each cpu maps its own per_cpu area on this area,
 so that the local per_cpu data sits in the same virtual address on
 each cpu.

This is a mechanism used partially on IA64 already.

I think you have to be very careful, and you can only use this per-cpu
fixed virtual address area in extremely limited cases.

The reason is, I think the address matters, consider list heads, for
example.

So you couldn't do:

list_add(obj-list, per_cpu_ptr(list_head));

and use that per-cpu fixed virtual address.

IA64 seems to use it universally for every __get_cpu_var()
access, so maybe it works out somehow :-)))

I guess if list modifications by remote cpus are disallowed, it would
work (list traversal works because using the fixed virtual address as
the list head sentinel is OK), but that is an extremely fragile
assumption to base the entire mechanism upon.


Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-11-01 Thread David Miller
From: David Miller <[EMAIL PROTECTED]>
Date: Thu, 01 Nov 2007 00:01:18 -0700 (PDT)

> From: Christoph Lameter <[EMAIL PROTECTED]>
> Date: Wed, 31 Oct 2007 21:16:59 -0700 (PDT)
> 
> > Index: linux-2.6/mm/allocpercpu.c
> > ===================================================================
> > --- linux-2.6.orig/mm/allocpercpu.c	2007-10-31 20:53:16.565486654 -0700
> > +++ linux-2.6/mm/allocpercpu.c	2007-10-31 21:00:27.553486484 -0700
> > ...
> > @@ -37,7 +42,7 @@ enum unit_type { FREE, END, USED };
> >  
> >  static u8 cpu_alloc_map[UNITS_PER_CPU] = { 1, };
> >  static DEFINE_SPINLOCK(cpu_alloc_map_lock);
> > -static DEFINE_PER_CPU(int, cpu_area)[UNITS_PER_CPU];
> > +static DEFINE_PER_CPU(unsigned long long, cpu_area)[UNITS_PER_CPU];
> >  
> >  #define CPU_DATA_OFFSET ((unsigned long)&per_cpu__cpu_area)
> 
> This hunk helped the sparc64 looping OOPS I was getting, but cpus hang
> in some other fashion soon afterwards.

And if I bump PER_CPU_ALLOC_SIZE up to 128K it seems to mostly work.

You'll definitely need to make this work dynamically somehow.


Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-11-01 Thread Christoph Lameter
On Thu, 1 Nov 2007, Eric Dumazet wrote:

> I think this question already came in the past and Linus already answered it,
> but I again ask it. What about VM games with modern cpus (64 bits arches)
> 
> Say we reserve on x86_64 a really huge (2^32 bytes) area, and change VM layout
> so that each cpu maps its own per_cpu area on this area, so that the local
> per_cpu data sits in the same virtual address on each cpu. Then we dont need a
> segment prefix nor adding a 'per_cpu offset'. No need to write special asm
> functions to read/write/increment a per_cpu data and gcc could use normal
> rules for optimizations.
> 
> We only would need adding per_cpu offset to get data for a given cpu.

That is basically what IA64 is doing, but it is not usable because you would 
have addresses that mean different things on different cpus. List heads,
for example, require back pointers. If you put a list head into such a per 
cpu area then you may corrupt another cpu's per cpu area.
 


Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-11-01 Thread Christoph Lameter
On Thu, 1 Nov 2007, David Miller wrote:

> IA64 seems to use it universally for every __get_cpu_var()
> access, so maybe it works out somehow :-)))

IA64 does not do that. It adds the local cpu offset:

#define __get_cpu_var(var) (*RELOC_HIDE(&per_cpu__##var, __ia64_per_cpu_var(local_per_cpu_offset)))
#define __raw_get_cpu_var(var) (*RELOC_HIDE(&per_cpu__##var, __ia64_per_cpu_var(local_per_cpu_offset)))




Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-11-01 Thread Christoph Lameter
On Thu, 1 Nov 2007, David Miller wrote:

> > This hunk helped the sparc64 looping OOPS I was getting, but cpus hang
> > in some other fashion soon afterwards.
> 
> And if I bump PER_CPU_ALLOC_SIZE up to 128K it seems to mostly work.

Good

> You'll definitely need to make this work dynamically somehow.

Obviously. Any ideas how?

I can probably calculate the size based on the number of online nodes when 
the per cpu areas are setup. But the setup is done before we even parse 
command line arguments. That would still mean a fixed size after bootup.

In order to make it truly dynamic we would have to virtually map the area. 
vmap? But that reduces performance.



Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-11-01 Thread David Miller
From: Christoph Lameter <[EMAIL PROTECTED]>
Date: Thu, 1 Nov 2007 06:01:14 -0700 (PDT)

> On Thu, 1 Nov 2007, David Miller wrote:
> 
> > IA64 seems to use it universally for every __get_cpu_var()
> > access, so maybe it works out somehow :-)))
> 
> IA64 does not do that. It adds the local cpu offset
> 
> #define __get_cpu_var(var) (*RELOC_HIDE(&per_cpu__##var, __ia64_per_cpu_var(local_per_cpu_offset)))
> #define __raw_get_cpu_var(var) (*RELOC_HIDE(&per_cpu__##var, __ia64_per_cpu_var(local_per_cpu_offset)))

Oh I see, it's the offset itself which is accessed at the fixed
virtual address slot.


Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-11-01 Thread David Miller
From: Christoph Lameter <[EMAIL PROTECTED]>
Date: Thu, 1 Nov 2007 05:57:12 -0700 (PDT)

> That is basically what IA64 is doing but it not usable because you would 
> have addresses that mean different things on different cpus. List head
> for example require back pointers. If you put a listhead into such a per 
> cpu area then you may corrupt another cpus per cpu area.

Indeed, but as I pointed out in another mail it actually works if you
set some rules:

1) List insert and delete is only allowed on local CPU lists.

2) List traversal is allowed on remote CPU lists.

I bet we could get all of the per-cpu users to abide by this
rule if we wanted to.

The remaining issue with accessing per-cpu areas at multiple virtual
addresses is D-cache aliasing.


Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-11-01 Thread David Miller
From: Christoph Lameter <[EMAIL PROTECTED]>
Date: Thu, 1 Nov 2007 06:03:44 -0700 (PDT)

> In order to make it truly dynamic we would have to virtually map the
> area.  vmap? But that reduces performance.

But it would still be faster than the double-indirection we do now,
right?


Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-11-01 Thread Christoph Lameter
On Thu, 1 Nov 2007, David Miller wrote:

> The remaining issue with accessing per-cpu areas at multiple virtual
> addresses is D-cache aliasing.

But that is not an issue for physically mapped caches.



Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-11-01 Thread David Miller
From: Christoph Lameter <[EMAIL PROTECTED]>
Date: Thu, 1 Nov 2007 15:11:41 -0700 (PDT)

> On Thu, 1 Nov 2007, David Miller wrote:
> 
> > The remaining issue with accessing per-cpu areas at multiple virtual
> > addresses is D-cache aliasing.
> 
> But that is not an issue for physically mapped caches.

Right but I'd like to use this on sparc64 which has L1 D-cache
aliasing on some chips :-)


Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-11-01 Thread Christoph Lameter
On Thu, 1 Nov 2007, David Miller wrote:

> From: Christoph Lameter <[EMAIL PROTECTED]>
> Date: Thu, 1 Nov 2007 06:03:44 -0700 (PDT)
> 
> > In order to make it truly dynamic we would have to virtually map the
> > area.  vmap? But that reduces performance.
> 
> But it would still be faster than the double-indirection we do now,
> right?

I think I have an idea how to do this. It's a bit x86_64-specific, but here 
it goes.

We define a virtual area of NR_CPUS * 2M areas that are each mapped by a
PMD. That means we have a fixed virtual address for each cpus per cpu 
area. 

First cpu is at PER_CPU_START
Second cpu is at PER_CPU_START + 2M

So the per cpu area for cpu n is easily calculated using

	PER_CPU_START + (n << 21)

without any lookups.

On bootup we allocate the 2M pages.

After boot is complete we allow the reduction of the size of the per cpu 
areas. Let's say we only need 128k per cpu. Then the remaining pages will
be returned to the page allocator.

We create some sysfs entry where one can see the current reserves of per 
cpu storage. If one wants to reduce memory then one can write something to 
it to return the remainder of the memory.



Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-11-01 Thread Christoph Lameter
On Thu, 1 Nov 2007, David Miller wrote:

> From: Christoph Lameter <[EMAIL PROTECTED]>
> Date: Thu, 1 Nov 2007 15:11:41 -0700 (PDT)
> 
> > On Thu, 1 Nov 2007, David Miller wrote:
> > 
> > > The remaining issue with accessing per-cpu areas at multiple virtual
> > > addresses is D-cache aliasing.
> > 
> > But that is not an issue for physically mapped caches.
> 
> Right but I'd like to use this on sparc64 which has L1 D-cache
> aliasing on some chips :-)

Hmmm... re the message I just sent: then we have to return the memory by 
virtual address, not by physical address, on sparc. May result in 
zones with holes though.



Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-11-01 Thread David Miller
From: Christoph Lameter <[EMAIL PROTECTED]>
Date: Thu, 1 Nov 2007 15:15:39 -0700 (PDT)

> After boot is complete we allow the reduction of the size of the per cpu 
> areas . Lets say we only need 128k per cpu. Then the remaining pages will
> be returned to the page allocator.

You don't know how much you will need.  I exhausted the limit on
sparc64 very late in the boot process when the last few userland
services were starting up.

And if I subsequently bring up 100,000 IP tunnels, it will exhaust the
per-cpu allocation area.

You have to make it fully dynamic, there is no way around it.


Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-11-01 Thread Christoph Lameter
On Thu, 1 Nov 2007, David Miller wrote:

> From: Christoph Lameter <[EMAIL PROTECTED]>
> Date: Thu, 1 Nov 2007 15:15:39 -0700 (PDT)
> 
> > After boot is complete we allow the reduction of the size of the per cpu 
> > areas . Lets say we only need 128k per cpu. Then the remaining pages will
> > be returned to the page allocator.
> 
> You don't know how much you will need.  I exhausted the limit on
> sparc64 very late in the boot process when the last few userland
> services were starting up.

Well you would be able to specify how much will remain. If not it will 
just keep the 2M reserve around.

> And if I subsequently bring up 100,000 IP tunnels, it will exhaust the
> per-cpu allocation area.

Each tunnel needs 4 bytes per cpu?

> You have to make it fully dynamic, there is no way around it.

Na. Some reasonable upper limit needs to be set. If we set that to, say, 
32 megabytes and do the virtual mapping, then we can just populate the first 
2M and only allocate the remainder if we need it. Then we need to rely on 
Mel's defrag stuff to defragment memory if we need it.



Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-11-01 Thread David Miller
From: Christoph Lameter <[EMAIL PROTECTED]>
Date: Thu, 1 Nov 2007 15:48:00 -0700 (PDT)

> On Thu, 1 Nov 2007, David Miller wrote:
> 
> > From: Christoph Lameter <[EMAIL PROTECTED]>
> > Date: Thu, 1 Nov 2007 15:15:39 -0700 (PDT)
> > 
> > > After boot is complete we allow the reduction of the size of the per cpu 
> > > areas . Lets say we only need 128k per cpu. Then the remaining pages will
> > > be returned to the page allocator.
> > 
> > You don't know how much you will need.  I exhausted the limit on
> > sparc64 very late in the boot process when the last few userland
> > services were starting up.
> 
> Well you would be able to specify how much will remain. If not it will 
> just keep the 2M reserve around.
> 
> > And if I subsequently bring up 100,000 IP tunnels, it will exhaust the
> > per-cpu allocation area.
> 
> Each tunnel needs 4 bytes per cpu?

Each IP compression tunnel instance does an alloc_percpu().

Since you're the one who wants to change the semantics and guarantees
of this interface, perhaps it might help if you did some greps around
the tree to see how alloc_percpu() is actually used.  That's what
I did when I started running into trouble with your patches.

You cannot put limits on the amount of alloc_percpu() memory available
to clients; please let's proceed with that basic understanding in
mind.  We're wasting a ton of time discussing this fundamental issue.


Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-11-01 Thread Eric Dumazet

Christoph Lameter wrote:
> On Thu, 1 Nov 2007, David Miller wrote:
> 
> > From: Christoph Lameter <[EMAIL PROTECTED]>
> > Date: Thu, 1 Nov 2007 15:15:39 -0700 (PDT)
> > 
> > > After boot is complete we allow the reduction of the size of the per cpu 
> > > areas . Lets say we only need 128k per cpu. Then the remaining pages will
> > > be returned to the page allocator.
> > 
> > You don't know how much you will need.  I exhausted the limit on
> > sparc64 very late in the boot process when the last few userland
> > services were starting up.
> 
> Well you would be able to specify how much will remain. If not it will 
> just keep the 2M reserve around.
> 
> > And if I subsequently bring up 100,000 IP tunnels, it will exhaust the
> > per-cpu allocation area.
> 
> Each tunnel needs 4 bytes per cpu?

Well, if we move last_rx to a percpu var, we need 8 bytes of percpu space per 
net_device :)

> > You have to make it fully dynamic, there is no way around it.
> 
> Na. Some reasonable upper limit needs to be set. If we set that to say 
> 32Megabytes and do the virtual mapping then we can just populate the first 
> 2M and only allocate the remainder if we need it. Then we need to rely on 
> Mel's defrag stuff though defrag memory if we need it.

If a 2MB page is not available, could we revert to using 4KB pages (like the 
vmalloc stuff), paying an extra runtime overhead of course?





Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-11-01 Thread Christoph Lameter
On Thu, 1 Nov 2007, David Miller wrote:

> You cannot put limits on the amount of alloc_percpu() memory available
> to clients, please let's proceed with that basic understanding in
> mind.  We're wasting a ton of time discussing this fundamental issue.

There is no point in making absolute demands like no limits. There are 
always limits to everything. 

A new implementation avoids the need to allocate per cpu arrays and also 
avoids the 32 bytes per object times cpus that are mostly wasted for small 
allocations today. So it's going to potentially allow more per cpu objects
than are available today.

A reasonable implementation for 64 bit is likely going to depend on 
reserving some virtual memory space for the per cpu mappings so that they 
can be dynamically grown up to what the reserved virtual space allows.

F.e. If we reserve 256G of virtual space and support a maximum of 16k cpus 
then there is a limit on the per cpu space available of 16MB.


Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-11-01 Thread Christoph Lameter
On Fri, 2 Nov 2007, Eric Dumazet wrote:

> > Na. Some reasonable upper limit needs to be set. If we set that to say
> > 32Megabytes and do the virtual mapping then we can just populate the first
> > 2M and only allocate the remainder if we need it. Then we need to rely on
> > Mel's defrag stuff though defrag memory if we need it.
> 
> If a 2MB page is not available, could we revert to using 4KB pages (like
> the vmalloc stuff), paying an extra runtime overhead of course?

Sure. It's going to be like vmemmap. There will be limits imposed, though, 
by the amount of virtual space available. Basically the dynamic per cpu 
area can be at maximum

available_virtual_space / NR_CPUS



Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-11-01 Thread Christoph Lameter
Hmmm... On x86_64 we could take 8 terabytes of virtual space (bit order 43).

With the worst-case scenario of 16k cpus (bit order 16) we are looking 
at 43 - 16 = 27, i.e. ~128MB per cpu. Each per cpu area can at max be mapped 
by 64 pmd entries. 4k cpus is actually the max for projected hw, so we'd get 
to 512M.

On IA64 we could take half of the vmemmap area, which is 45 bits. So 
we could get up to 512MB (with 16k pages; 64k pages can get us even 
further), assuming we can at some point run 16 processors per node (4k is 
the current max, which would put the limit on the per cpu area at 1GB).

Let's say you have a system with 64 cpus and an area of 128M of per cpu 
storage. Then we are using 8GB of total memory for per cpu storage. The 
128M allows us to store e.g. 16M word-size counters.

With SLAB and the current allocpercpu you would need the following for 
16M counters:

16M * 32 * 64 (the minimum alloc size of SLAB is 32 bytes and we alloc via 
kmalloc) for the data.

16M * 64 * 8 for the pointer arrays: 16M allocpercpu areas for 64 processors 
and a pointer size of 8 bytes.

So you would need 40GB on current systems. The new scheme 
would only need 8GB for the same amount of counters.

So I think it's unreasonable to assume that systems currently exist that 
can use more than 128M of allocpercpu space (assuming 64 cpus).

---
 include/asm-x86/pgtable_64.h |    4 ++++
 1 file changed, 4 insertions(+)

Index: linux-2.6/include/asm-x86/pgtable_64.h
===================================================================
--- linux-2.6.orig/include/asm-x86/pgtable_64.h	2007-11-01 18:15:52.282577904 -0700
+++ linux-2.6/include/asm-x86/pgtable_64.h	2007-11-01 18:18:02.886979040 -0700
@@ -138,10 +138,14 @@ static inline pte_t ptep_get_and_clear_f
 #define VMALLOC_START	_AC(0xffffc20000000000, UL)
 #define VMALLOC_END	_AC(0xffffe1ffffffffff, UL)
 #define VMEMMAP_START	_AC(0xffffe20000000000, UL)
+#define PERCPU_START	_AC(0xfffff20000000000, UL)
+#define PERCPU_END	_AC(0xfffffa0000000000, UL)
 #define MODULES_VADDR	_AC(0xffffffff88000000, UL)
 #define MODULES_END	_AC(0xfffffffffff00000, UL)
 #define MODULES_LEN	(MODULES_END - MODULES_VADDR)
 
+#define PERCPU_MIN_SHIFT	PMD_SHIFT
+#define PERCPU_BITS	43

 #define _PAGE_BIT_PRESENT	0
 #define _PAGE_BIT_RW	1
 #define _PAGE_BIT_USER	2



Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-11-01 Thread David Miller
From: Christoph Lameter <[EMAIL PROTECTED]>
Date: Thu, 1 Nov 2007 18:06:17 -0700 (PDT)

> A reasonable implementation for 64 bit is likely going to depend on 
> reserving some virtual memory space for the per cpu mappings so that they 
> can be dynamically grown up to what the reserved virtual space allows.
> 
> F.e. If we reserve 256G of virtual space and support a maximum of 16k cpus 
> then there is a limit on the per cpu space available of 16MB.

Now that I understand your implementation better, yes this
sounds just fine.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-10-31 Thread David Miller
From: Christoph Lameter <[EMAIL PROTECTED]>
Date: Wed, 31 Oct 2007 21:16:59 -0700 (PDT)

>  /*
>   * Maximum allowed per cpu data per cpu
>   */
> +#ifdef CONFIG_NUMA
> +#define PER_CPU_ALLOC_SIZE (32768 + MAX_NUMNODES * 512)
> +#else
>  #define PER_CPU_ALLOC_SIZE 32768
> +#endif
> +

Christoph, as Rusty found out years ago when he first wrote this code,
you cannot put hard limits on the alloc_percpu() allocations.

They can be done by anyone, any module, and since there was no limit
before you cannot reasonably add one now.

As just one of many examples, several networking devices use
alloc_percpu() for each instance they bring up.  This alone can
request arbitrary amounts of per-cpu data.

Therefore, you'll need to do your optimization without imposing any
size limits.


Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-10-31 Thread David Miller
From: Christoph Lameter <[EMAIL PROTECTED]>
Date: Wed, 31 Oct 2007 18:21:02 -0700 (PDT)

> On Wed, 31 Oct 2007, David Miller wrote:
> 
> > From: Christoph Lameter <[EMAIL PROTECTED]>
> > Date: Wed, 31 Oct 2007 18:12:11 -0700 (PDT)
> > 
> > > On Wed, 31 Oct 2007, David Miller wrote:
> > > 
> > > > All I can do now is bisect and then try to figure out what about the
> > > > guilty change might cause the problem.
> > > 
> > > Reverting the 7th patch should avoid using the sparc register that caches 
> > > the per cpu area offset? (I thought so, does it?)
> > 
> > Yes, that's right, %g5 holds the local cpu's per-cpu offset.
> 
> And if I add the address of a percpu variable then I get to the variable 
> for this cpu right?

Right.

I bisected the crash down to:

[PATCH] newallocpercpu


Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-10-31 Thread Christoph Lameter
Hmmm... Got this to run on an ia64 big iron. One problem is the sizing of 
the pool. Somehow this needs to be dynamic.

Apply this fix on top of the others.

---
 include/asm-ia64/page.h   |2 +-
 include/asm-ia64/percpu.h |9 ++---
 mm/allocpercpu.c  |   12 ++--
 3 files changed, 17 insertions(+), 6 deletions(-)

Index: linux-2.6/mm/allocpercpu.c
===================================================================
--- linux-2.6.orig/mm/allocpercpu.c	2007-10-31 20:53:16.565486654 -0700
+++ linux-2.6/mm/allocpercpu.c	2007-10-31 21:00:27.553486484 -0700
@@ -28,7 +28,12 @@
 /*
  * Maximum allowed per cpu data per cpu
  */
+#ifdef CONFIG_NUMA
+#define PER_CPU_ALLOC_SIZE (32768 + MAX_NUMNODES * 512)
+#else
 #define PER_CPU_ALLOC_SIZE 32768
+#endif
+
 
 #define UNIT_SIZE sizeof(unsigned long long)
 #define UNITS_PER_CPU (PER_CPU_ALLOC_SIZE / UNIT_SIZE)
@@ -37,7 +42,7 @@ enum unit_type { FREE, END, USED };
 
 static u8 cpu_alloc_map[UNITS_PER_CPU] = { 1, };
 static DEFINE_SPINLOCK(cpu_alloc_map_lock);
-static DEFINE_PER_CPU(int, cpu_area)[UNITS_PER_CPU];
+static DEFINE_PER_CPU(unsigned long long, cpu_area)[UNITS_PER_CPU];
 
 #define CPU_DATA_OFFSET ((unsigned long)&per_cpu__cpu_area)
 
@@ -97,8 +102,11 @@ static void *cpu_alloc(unsigned long siz
 	while (start < UNITS_PER_CPU &&
 			cpu_alloc_map[start] != FREE)
 		start++;
-	if (start == UNITS_PER_CPU)
+	if (start == UNITS_PER_CPU) {
+		spin_unlock(&cpu_alloc_map_lock);
+		printk(KERN_CRIT "Dynamic per cpu memory exhausted\n");
 		return NULL;
+	}
 
 	end = start + 1;
 	while (end < UNITS_PER_CPU && end - start < units &&
Index: linux-2.6/include/asm-ia64/page.h
===================================================================
--- linux-2.6.orig/include/asm-ia64/page.h	2007-10-31 20:53:16.573486483 -0700
+++ linux-2.6/include/asm-ia64/page.h	2007-10-31 20:56:19.372870091 -0700
@@ -44,7 +44,7 @@
 #define PAGE_MASK		(~(PAGE_SIZE - 1))
 #define PAGE_ALIGN(addr)	(((addr) + PAGE_SIZE - 1) & PAGE_MASK)
 
-#define PERCPU_PAGE_SHIFT	16	/* log2() of max. size of per-CPU area */
+#define PERCPU_PAGE_SHIFT	20	/* log2() of max. size of per-CPU area */
 #define PERCPU_PAGE_SIZE	(__IA64_UL_CONST(1) << PERCPU_PAGE_SHIFT)
 
 
Index: linux-2.6/include/asm-ia64/percpu.h
===================================================================
--- linux-2.6.orig/include/asm-ia64/percpu.h	2007-10-31 20:53:30.424553062 -0700
+++ linux-2.6/include/asm-ia64/percpu.h	2007-10-31 20:53:36.248486656 -0700
@@ -40,6 +40,12 @@
 #endif
 
 /*
+ * This will make per cpu access to the local area use the virtually mapped
+ * areas.
+ */
+#define this_cpu_offset()  0
+
+/*
  * Pretty much a literal copy of asm-generic/percpu.h, except that percpu_modcopy() is an
  * external routine, to avoid include-hell.
  */
@@ -51,8 +57,6 @@ extern unsigned long __per_cpu_offset[NR
 /* Equal to __per_cpu_offset[smp_processor_id()], but faster to access: */
 DECLARE_PER_CPU(unsigned long, local_per_cpu_offset);
 
-#define this_cpu_offset() __ia64_per_cpu_var(local_per_cpu_offset)
-
 #define per_cpu(var, cpu)	(*RELOC_HIDE(&per_cpu__##var, __per_cpu_offset[cpu]))
 #define __get_cpu_var(var)	(*RELOC_HIDE(&per_cpu__##var, __ia64_per_cpu_var(local_per_cpu_offset)))
 #define __raw_get_cpu_var(var)	(*RELOC_HIDE(&per_cpu__##var, __ia64_per_cpu_var(local_per_cpu_offset)))
@@ -67,7 +71,6 @@ extern void *per_cpu_init(void);
 #define __get_cpu_var(var) per_cpu__##var
 #define __raw_get_cpu_var(var) per_cpu__##var
 #define per_cpu_init() (__phys_per_cpu_start)
-#define this_cpu_offset()  0
 
 #endif /* SMP */
 


Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-10-31 Thread Christoph Lameter
On Wed, 31 Oct 2007, David Miller wrote:

> From: Christoph Lameter <[EMAIL PROTECTED]>
> Date: Wed, 31 Oct 2007 18:12:11 -0700 (PDT)
> 
> > On Wed, 31 Oct 2007, David Miller wrote:
> > 
> > > All I can do now is bisect and then try to figure out what about the
> > > guilty change might cause the problem.
> > 
> > Reverting the 7th patch should avoid using the sparc register that caches 
> > the per cpu area offset? (I thought so, does it?)
> 
> Yes, that's right, %g5 holds the local cpu's per-cpu offset.

And if I add the address of a percpu variable then I get to the variable 
for this cpu right?



Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-10-31 Thread David Miller
From: Christoph Lameter <[EMAIL PROTECTED]>
Date: Wed, 31 Oct 2007 18:12:11 -0700 (PDT)

> On Wed, 31 Oct 2007, David Miller wrote:
> 
> > All I can do now is bisect and then try to figure out what about the
> > guilty change might cause the problem.
> 
> Reverting the 7th patch should avoid using the sparc register that caches 
> the per cpu area offset? (I thought so, does it?)

Yes, that's right, %g5 holds the local cpu's per-cpu offset.


Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-10-31 Thread Christoph Lameter
On Wed, 31 Oct 2007, David Miller wrote:

> It crashes when SSHD starts, the serial console GETTY hasn't
> started up yet so I can't even log in to run those commands
> Christoph.

Hmmm... Bad.

> All I can do now is bisect and then try to figure out what about the
> guilty change might cause the problem.

Reverting the 7th patch should avoid using the sparc register that caches 
the per cpu area offset? (I thought so, does it?)


Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-10-31 Thread David Miller
From: Christoph Lameter <[EMAIL PROTECTED]>
Date: Wed, 31 Oct 2007 18:01:34 -0700 (PDT)

> On Wed, 31 Oct 2007, David Miller wrote:
> 
> > Without DEBUG_VM I get a loop of crashes shortly after SSHD
> > is started, I'll try to track it down.
> 
> Check how much per cpu memory is in use by
> 
> cat /proc/vmstat
> 
> currently we have a 32k limit there.

It crashes when SSHD starts, the serial console GETTY hasn't
started up yet so I can't even log in to run those commands
Christoph.

All I can do now is bisect and then try to figure out what about the
guilty change might cause the problem.

This is on a 64-cpu sparc64 box, and fast cmpxchg local is not set, so
maybe it's one of the locking changes.


Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-10-31 Thread Christoph Lameter
On Wed, 31 Oct 2007, David Miller wrote:

> Without DEBUG_VM I get a loop of crashes shortly after SSHD
> is started, I'll try to track it down.

Check how much per cpu memory is in use by

cat /proc/vmstat

currently we have a 32k limit there.
 


Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-10-31 Thread David Miller
From: Christoph Lameter <[EMAIL PROTECTED]>
Date: Wed, 31 Oct 2007 17:53:23 -0700 (PDT)

> > This patch fixes build failures with DEBUG_VM disabled.
> 
> Well there is more there. Last minute mods sigh. With DEBUG_VM you likely 
> need this patch.

Without DEBUG_VM I get a loop of crashes shortly after SSHD
is started, I'll try to track it down.


Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-10-31 Thread Christoph Lameter
> This patch fixes build failures with DEBUG_VM disabled.

Well there is more there. Last minute mods sigh. With DEBUG_VM you likely 
need this patch.


---
 include/linux/percpu.h |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: linux-2.6/include/linux/percpu.h
===
--- linux-2.6.orig/include/linux/percpu.h   2007-10-31 17:48:38.020499686 -0700
+++ linux-2.6/include/linux/percpu.h    2007-10-31 17:51:01.423372247 -0700
@@ -36,7 +36,7 @@
 #ifdef CONFIG_DEBUG_VM
 #define __percpu_disguise(pdata) ((void *)~(unsigned long)(pdata))
 #else
-#define __percpu_disguide(pdata) ((void *)(pdata))
+#define __percpu_disguise(pdata) ((void *)(pdata))
 #endif
 
 /* 
@@ -53,7 +53,7 @@
 
 #define this_cpu_ptr(ptr)				\
 ({							\
-	void *p = ptr;					\
+	void *p = __percpu_disguise(ptr);		\
 	(__typeof__(ptr))(p + this_cpu_offset());	\
 })
 


Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-10-31 Thread David Miller
From: Christoph Lameter <[EMAIL PROTECTED]>
Date: Wed, 31 Oct 2007 17:31:12 -0700 (PDT)

> Others may have the same issue.
> 
> git pull git://git.kernel.org/pub/scm/linux/kernel/git/christoph/slab.git 
> allocpercpu
> 
> should get you the whole thing.

This patch fixes build failures with DEBUG_VM disabled.

diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index 4b167c0..d414703 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -36,7 +36,7 @@
 #ifdef CONFIG_DEBUG_VM
 #define __percpu_disguise(pdata) ((void *)~(unsigned long)(pdata))
 #else
-#define __percpu_disguide(pdata) ((void *)(pdata))
+#define __percpu_disguise(pdata) ((void *)(pdata))
 #endif
 
 /* 


Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-10-31 Thread Christoph Lameter
On Wed, 31 Oct 2007, David Miller wrote:

> > git pull git://git.kernel.org/pub/scm/linux/kernel/git/christoph/slab.git 
> > performance
> > 
> > and then you should be able to apply these patches.
> 
> Thanks a lot Chrisoph.

Others may have the same issue.

git pull git://git.kernel.org/pub/scm/linux/kernel/git/christoph/slab.git 
allocpercpu

should get you the whole thing.



Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-10-31 Thread David Miller
From: Christoph Lameter <[EMAIL PROTECTED]>
Date: Wed, 31 Oct 2007 17:26:16 -0700 (PDT)

> Do 
> 
> git pull git://git.kernel.org/pub/scm/linux/kernel/git/christoph/slab.git 
> performance
> 
> and then you should be able to apply these patches.

Thanks a lot Christoph.


Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-10-31 Thread Christoph Lameter
On Wed, 31 Oct 2007, David Miller wrote:

> 
> Are these patches against -mm or mainline?
> 
> I get a lot of rejects starting with patch 6 against
> mainline and I really wanted to test them out on sparc64.

Hmmm... They are against the current slab performance head (which is in mm 
but it has not been released yet ;-).

Do 

git pull git://git.kernel.org/pub/scm/linux/kernel/git/christoph/slab.git 
performance

and then you should be able to apply these patches.



Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-10-31 Thread David Miller

Are these patches against -mm or mainline?

I get a lot of rejects starting with patch 6 against
mainline and I really wanted to test them out on sparc64.

Thanks.


[patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-10-31 Thread Christoph Lameter
This patch increases the speed of the SLUB fastpath by
improving the per cpu allocator and makes it usable for SLUB.

Currently allocpercpu manages arrays of pointers to per cpu objects.
This means that it has to allocate the arrays and then populate them
as needed with objects. Although these objects are called per cpu
objects they cannot be handled in the same way as per cpu objects
by adding the per cpu offset of the respective cpu.

The patch here changes that. We create a small memory pool in the
percpu area and allocate from there if alloc per cpu is called.
As a result we do not need the per cpu pointer arrays for each
object. This reduces memory usage and also the cache footprint
of allocpercpu users. Also the per cpu objects for a single processor
are tightly packed next to each other decreasing cache footprint
even further and making it possible to access multiple objects
in the same cacheline.

SLUB has the same mechanism implemented. After fixing up the
allocpercpu stuff we throw the SLUB method out and use the new
allocpercpu handling. Then we optimize allocpercpu addressing
by adding a new function

this_cpu_ptr()

that allows the determination of the per cpu pointer for the
current processor in a more efficient way on many platforms.

This increases the speed of SLUB (and likely other kernel subsystems
that benefit from the allocpercpu enhancements):


Size    SLAB    SLUB    SLUB+   SLUB-o  SLUB-a
   8      96      86      45      44      38    3 *
  16      84      92      49      48      43    2 *
  32      84     106      61      59      53    +++
  64     102     129      82      88      75    ++
 128     147     226     188     181     176    -
 256     200     248     207     285     204    =
 512     300     301     260     209     250    +
1024     416     440     398     264     391    ++
2048     720     542     530     390     511    +++
4096    1254     342     342     336     376    3 *

alloc/free test
        SLAB     SLUB   SLUB+   SLUB-o  SLUB-a
        137-146  151    68-72   68-74   56-58   3 *

Note: The per cpu optimizations are only halfway there because of the screwed
up way that x86_64 handles its cpu area, which causes additional cycles to be
spent retrieving a pointer from memory and adding it to the address.
The i386 code is much less cycle intensive, being able to get to per cpu
data using a segment prefix, and if we can get that to work on x86_64
then we may be able to get the cycle count for the fastpath down to 20-30
cycles.


