[SLUB 0/2] SLUB: The unqueued slab allocator V6

2007-03-31 Thread Christoph Lameter
[PATCH] SLUB The unqueued slab allocator v6

Note that the definition of the return type of ksize() currently
differs between mm and Linus' tree. This patch conforms to mm.
It also needs sprint_symbol() support from mm.

V5->V6:

- Straighten out various coding issues, among other things to make the hot
  path clearer in slab_alloc and slab_free. This adds more gotos, sigh.

- Detailed alloc / free tracking including pid, cpu and time of alloc / free
  if SLAB_STORE_USER is enabled or slub_debug=U is specified on boot
  (a sketch of such a tracking record follows this list).

- sysfs support via /sys/slab. Drop /proc/slubinfo support.
  Include a slabinfo tool that produces output similar to what
  /proc/slabinfo does. The tool needs to be made more sophisticated
  to allow control of various slub options at runtime. Currently it
  reports total slab sizes, slab fragmentation and slab effectiveness
  (actual object use vs. slab space use).

- Runtime debug option changes per slab via /sys/slab/<slabcache>.
  All slab debug options can be configured via sysfs provided that
  no objects have been allocated yet.

- Deal with i386 use of slab page structs. Main patch disables
  slub for i386 (CONFIG_ARCH_USES_SLAB_PAGE_STRUCT). Then a special
  patch removes the page sized slabs and removes that setting.
  See the caveats in that patch for further details.
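
To give a concrete idea of what gets recorded for SLAB_STORE_USER /
slub_debug=U, here is a minimal userspace sketch of such a per-object
tracking record. The struct and helper below are invented for illustration
and are not the actual SLUB data structures:

#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Illustrative track record: one kept for the last alloc, one for the last free. */
struct track {
	void *caller;        /* address of the allocation / free call site */
	int pid;             /* task that touched the object */
	int cpu;             /* CPU it ran on */
	unsigned long when;  /* timestamp (seconds here, jiffies in a kernel) */
};

static void set_track(struct track *t, void *caller, int cpu)
{
	t->caller = caller;
	t->pid = getpid();
	t->cpu = cpu;
	t->when = (unsigned long)time(NULL);
}

int main(void)
{
	struct track alloc_track;

	/* Pretend an object was just allocated on CPU 0. */
	set_track(&alloc_track, __builtin_return_address(0), 0);
	printf("allocated by pid %d on cpu %d at %lu, caller %p\n",
	       alloc_track.pid, alloc_track.cpu, alloc_track.when,
	       alloc_track.caller);
	return 0;
}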

V4->V5:

- Single object slabs only for slabs > slub_max_order; otherwise generate
  sufficient objects to avoid frequent use of the page allocator (see the
  sketch after this list). This is necessary to compensate for fragmentation
  caused by frequent uses of the page allocator. We exempt slabs of PAGE_SIZE
  from this rule since multi object slabs require the use of fields that are
  in use on i386 and x86_64. See the quicklist patchset for a way to fix
  that issue and a patch to get rid of the PAGE_SIZE special casing.

- Drop pass through to page allocator due to page allocator fragmenting
  memory. The buffering through large order allocations is done in SLUB.
  Infrequent larger order allocations cause less fragmentation
  than frequent small order allocations.

- We need to update object sizes when merging slabs otherwise kzalloc
  will not initialize the full object (this caused the failure on
  various platforms).

- Padding checks before redzone checks so that we get messages about
  the corruption of the whole slab and not just about a single object.
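
As a rough sketch of the "sufficient objects per slab" rule mentioned in the
first item of this list: pick the smallest order that holds some minimum
number of objects, capped by slub_max_order, and fall back to a
single-object slab for anything larger. The constants and helpers are
invented for illustration and are not the actual SLUB calculation:

#include <stdio.h>

#define PAGE_SIZE      4096UL
#define SLUB_MAX_ORDER 2     /* illustrative cap, like slub_max_order */
#define MIN_OBJECTS    4     /* illustrative minimum objects per slab */

static unsigned long objects_per_slab(unsigned int order, unsigned long size)
{
	return (PAGE_SIZE << order) / size;
}

/* Smallest order whose compound page can hold one object of this size. */
static unsigned int order_for_one(unsigned long size)
{
	unsigned int order = 0;

	while ((PAGE_SIZE << order) < size)
		order++;
	return order;
}

static unsigned int calculate_order(unsigned long size)
{
	unsigned int order;

	/* Try to fit at least MIN_OBJECTS objects without exceeding the cap. */
	for (order = order_for_one(size); order <= SLUB_MAX_ORDER; order++)
		if (objects_per_slab(order, size) >= MIN_OBJECTS)
			return order;

	/* Too large for the cap: fall back to a single-object slab. */
	return order_for_one(size);
}

int main(void)
{
	unsigned long sizes[] = { 64, 700, 4400, 9000 };

	for (int i = 0; i < 4; i++) {
		unsigned int order = calculate_order(sizes[i]);
		printf("size %5lu -> order %u, %lu objects per slab\n",
		       sizes[i], order, objects_per_slab(order, sizes[i]));
	}
	return 0;
}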

V3->V4
- Rename /proc/slabinfo to /proc/slubinfo. We have a different format after
  all.
- More bug fixes and stabilization of diagnostic functions. This seems
  to be finally something that works wherever we test it.
- Serialize kmem_cache_create and kmem_cache_destroy via slub_lock (Adrian's
  idea)
- Add two new modifications (separate patches) to guarantee
  a minimum number of objects per slab and to pass through large
  allocations.

V2->V3
- Debugging and diagnostic support. This is runtime enabled and not compile
  time enabled. Runtime debugging can be controlled via kernel boot options
  on an individual slab cache basis or globally.
- Slab Trace support (For individual slab caches).
- Resiliency support: if basic sanity checks are enabled (e.g. via the F
  boot option) then SLUB will do its best to perform diagnostics and
  then continue (i.e. mark corrupted objects as used).
- Fix up numerous issues including the clash of SLUB's use of page
  flags with the i386 arch use for pmds and pgds (which are managed
  as slab caches, sigh).
- Dynamic per CPU array sizing.
- Explain SLUB slabcache flags

V1->V2
- Fix up various issues. Tested on i386 UP, X86_64 SMP, ia64 NUMA.
- Provide NUMA support by splitting partial lists per node.
- Better Slab cache merge support (now at around 50% of slabs)
- List slab cache aliases if slab caches are merged.
- Updated descriptions of /proc/slabinfo output

This is a new slab allocator which was motivated by the complexity of the
existing code in mm/slab.c. It attempts to address a variety of concerns
with the existing implementation.

A. Management of object queues

   A particular concern was the complex management of the numerous object
   queues in SLAB. SLUB has no such queues. Instead we dedicate a slab for
   each allocating CPU and use objects from a slab directly instead of
   queueing them up.
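
   As a purely illustrative userspace sketch of that idea (names and layout
   are invented here and do not match the kernel's struct page or
   kmem_cache), allocation simply pops the next free object off the current
   CPU's slab freelist, with the next-free pointer stored inside the free
   object itself:

#include <stdio.h>
#include <stdlib.h>

struct toy_slab {
	void *freelist;          /* first free object, or NULL when exhausted */
	unsigned int inuse;      /* objects handed out from this slab */
};

struct toy_cache {
	size_t size;             /* object size; free pointer lives in the object */
	struct toy_slab *cpu_slab[1];   /* one active slab per CPU (one CPU here) */
};

/* Pop one object from the per-CPU slab; no intermediate object queue. */
static void *toy_alloc(struct toy_cache *c, int cpu)
{
	struct toy_slab *s = c->cpu_slab[cpu];
	void *object = s->freelist;

	if (!object)
		return NULL;                  /* real code would grab a new slab */
	s->freelist = *(void **)object;       /* next-free pointer is in the object */
	s->inuse++;
	return object;
}

/* Push the object back onto its slab's freelist. */
static void toy_free(struct toy_cache *c, int cpu, void *object)
{
	struct toy_slab *s = c->cpu_slab[cpu];

	*(void **)object = s->freelist;
	s->freelist = object;
	s->inuse--;
}

int main(void)
{
	struct toy_cache c = { .size = 64 };
	struct toy_slab slab = { .freelist = NULL, .inuse = 0 };
	char *page = malloc(4096);

	/* Thread a freelist through the 64 objects of a 4k "slab". */
	for (int i = 63; i >= 0; i--) {
		void *obj = page + i * 64;
		*(void **)obj = slab.freelist;
		slab.freelist = obj;
	}
	c.cpu_slab[0] = &slab;

	void *a = toy_alloc(&c, 0);
	void *b = toy_alloc(&c, 0);
	printf("allocated %p and %p, inuse=%u\n", a, b, slab.inuse);
	toy_free(&c, 0, a);
	toy_free(&c, 0, b);
	printf("after free, inuse=%u\n", slab.inuse);
	free(page);
	return 0;
}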

B. Storage overhead of object queues

   SLAB object queues exist per node, per CPU. The alien cache queue even
   has a queue array that contains a queue for each processor on each
   node. For very large systems the number of queues and the number of
   objects that may be caught in those queues grows exponentially. On our
   systems with 1k nodes / processors we have several gigabytes just tied up
   for storing references to objects for those queues. This does not include
   the objects that could be on those queues. One fears that the whole
   memory of the machine could one day be consumed by those queues.

C. SLAB meta data overhead

   SLAB has overhead at the beginning of each slab. This means that data
   cannot be naturally aligned at the beginning of a slab block.


Re: [SLUB 0/3] SLUB: The unqueued slab allocator V5

2007-03-10 Thread Christoph Lameter
On Sat, 10 Mar 2007, Andrew Morton wrote:

> Is this safe to think about applying yet?

It's safe. By default kernels will be built with SLAB. SLUB becomes only a
selectable alternative. It should not become the primary slab allocator until
we know that it is really superior overall and have thoroughly tested it in
a variety of workloads.

> We lost the leak detector feature.

There will be numerous small things that will have to be addressed. There
is also some minor work to be done for tracking callers better.
 
> It might be nice to create synonyms for PageActive, PageReferenced and
> PageError, to make things clearer in the slub core.   At the expense of
> making things less clear globally.  Am unsure.

I have been back and forth on doing that. They are somewhat similar
in what they mean for SLUB. But creating synonyms may be confusing to
those checking how page flags are being used.



Re: [SLUB 0/3] SLUB: The unqueued slab allocator V5

2007-03-10 Thread Andrew Morton
Is this safe to think about applying yet?

We lost the leak detector feature.

It might be nice to create synonyms for PageActive, PageReferenced and
PageError, to make things clearer in the slub core.   At the expense of
making things less clear globally.  Am unsure.


[SLUB 0/3] SLUB: The unqueued slab allocator V5

2007-03-10 Thread Christoph Lameter
[PATCH] SLUB The unqueued slab allocator v5

V4->V5:

- Single object slabs only for slabs > slub_max_order otherwise generate
  sufficient objects to avoid frequent use of the page allocator. This is
  necessary to compensate for fragmentation caused by frequent uses of
  the page allocator. We exempt slabs of PAGE_SIZE from this rule since
  multi object slabs require uses of fields that are in use on i386 and
  x86_64. See the quicklist patchset for a way to fix that issue
  and a patch to get rid of the PAGE_SIZE special casing.

- Drop pass through to page allocator due to page allocator fragmenting
  memory. The buffering through large order allocations is done in SLUB.
  Infrequent larger order allocations cause less fragmentation
  than frequent small order allocations.

- We need to update object sizes when merging slabs otherwise kzalloc
  will not initialize the full object (this caused the failure on
  various platforms).

- Padding checks before redzone checks so that we get messages about
  the corruption of the whole slab and not just about a single object.

Note that SLUB will warn on zero sized allocations. SLAB just allocates
some memory. So some traces from the usb subsystem etc should be expected.

Note that the definition of the return type of ksize() currently
differs between mm and Linus' tree. The patch conforms to mm.

V3->V4
- Rename /proc/slabinfo to /proc/slubinfo. We have a different format after
  all.
- More bug fixes and stabilization of diagnostic functions. This seems
  to be finally something that works wherever we test it.
- Serialize kmem_cache_create and kmem_cache_destroy via slub_lock (Adrian's
  idea)
- Add two new modifications (separate patches) to guarantee
  a minimum number of objects per slab and to pass through large
  allocations.

V2->V3
- Debugging and diagnostic support. This is runtime enabled and not compile
  time enabled. Runtime debugging can be controlled via kernel boot options
  on an individual slab cache basis or globally.
- Slab Trace support (For individual slab caches).
- Resiliency support: if basic sanity checks are enabled (e.g. via the F
  boot option) then SLUB will do its best to perform diagnostics and
  then continue (i.e. mark corrupted objects as used).
- Fix up numerous issues including the clash of SLUB's use of page
  flags with the i386 arch use for pmds and pgds (which are managed
  as slab caches, sigh).
- Dynamic per CPU array sizing.
- Explain SLUB slabcache flags

V1->V2
- Fix up various issues. Tested on i386 UP, X86_64 SMP, ia64 NUMA.
- Provide NUMA support by splitting partial lists per node.
- Better Slab cache merge support (now at around 50% of slabs)
- List slab cache aliases if slab caches are merged.
- Updated descriptions of /proc/slabinfo output

This is a new slab allocator which was motivated by the complexity of the
existing code in mm/slab.c. It attempts to address a variety of concerns
with the existing implementation.

A. Management of object queues

   A particular concern was the complex management of the numerous object
   queues in SLAB. SLUB has no such queues. Instead we dedicate a slab for
   each allocating CPU and use objects from a slab directly instead of
   queueing them up.

B. Storage overhead of object queues

   SLAB object queues exist per node, per CPU. The alien cache queue even
   has a queue array that contains a queue for each processor on each
   node. For very large systems the number of queues and the number of
   objects that may be caught in those queues grows exponentially. On our
   systems with 1k nodes / processors we have several gigabytes just tied up
   for storing references to objects for those queues. This does not include
   the objects that could be on those queues. One fears that the whole
   memory of the machine could one day be consumed by those queues.

C. SLAB meta data overhead

   SLAB has overhead at the beginning of each slab. This means that data
   cannot be naturally aligned at the beginning of a slab block. SLUB keeps
   all meta data in the corresponding page_struct. Objects can be naturally
   aligned in the slab. F.e. a 128 byte object will be aligned at 128 byte
   boundaries and can fit tightly into a 4k page with no bytes left over.
   SLAB cannot do this.
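
   A quick back-of-the-envelope check of that claim; the 64 byte on-slab
   header used below is only an assumed figure for illustration, not SLAB's
   exact overhead:

#include <stdio.h>

#define PAGE_SIZE 4096u

int main(void)
{
	unsigned int object = 128;
	unsigned int header = 64;   /* hypothetical on-slab management header */

	/* Metadata kept off the slab page: objects tile the page exactly. */
	printf("off-slab metadata: %u objects, %u bytes left over\n",
	       PAGE_SIZE / object, PAGE_SIZE % object);
	/* An on-slab header shifts the first object off its natural boundary. */
	printf("on-slab header:    %u objects, first object at offset %u\n",
	       (PAGE_SIZE - header) / object, header);
	return 0;
}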

D. SLAB has a complex cache reaper

   SLUB does not need a cache reaper for UP systems. On SMP systems
   the per CPU slab may be pushed back into partial list but that
   operation is simple and does not require an iteration over a list
   of objects. SLAB expires per CPU, shared and alien object queues
   during cache reaping which may cause strange hold offs.

E. SLAB has complex NUMA policy layer support

   SLUB pushes NUMA policy handling into the page allocator. This means that
   allocation is coarser (SLUB does interleave on a page level) but that
   situation was also present before 2.6.13. SLAB's application of
   policies to individual slab objects allocated in SLAB is
   certainly a performance concern due


Re: [SLUB 0/3] SLUB: The unqueued slab allocator V4

2007-03-09 Thread Christoph Lameter
On Fri, 9 Mar 2007, Mel Gorman wrote:

> The results without slub_debug were not good except for IA64. x86_64 and ppc64
> both blew up for a variety of reasons. The IA64 results were

Yuck that is the dst issue that Adrian is also looking at. Likely an issue 
with slab merging and RCU frees.
 
> KernBench Comparison
>
>                   2.6.21-rc2-mm2-clean   2.6.21-rc2-mm2-slub    %diff
> User   CPU time                1084.64               1032.93    4.77%
> System CPU time                  73.38                 63.14   13.95%
> Total  CPU time                1158.02               1096.07    5.35%
> Elapsed   time                  307.00                285.62    6.96%

Wow! The first indication that we are on the right track with this.

> AIM9 Comparison
>  2 page_test    2097119.26   3398259.27   1301140.01   62.04%  System Allocations & Pages/second

Wow! Must have all stayed within slab boundaries.

>  8 link_test      64776.04      7488.13    -57287.91  -88.44%  Link/Unlink Pairs/second

Crap. Maybe we straddled a slab boundary here?


Re: [SLUB 0/3] SLUB: The unqueued slab allocator V4

2007-03-09 Thread Christoph Lameter
On Fri, 9 Mar 2007, Mel Gorman wrote:

> I'm not sure what you mean by per-order queues. The buddy allocator already
> has per-order lists.

Somehow they do not seem to work right. SLAB (and now SLUB too) can avoid 
(or defer) fragmentation by keeping its own queues.


Re: [SLUB 0/3] SLUB: The unqueued slab allocator V4

2007-03-09 Thread Mel Gorman



> Note that I am amazed that the kernbench even worked.

The results without slub_debug were not good except for IA64. x86_64 and
ppc64 both blew up for a variety of reasons. The IA64 results were


KernBench Comparison

                  2.6.21-rc2-mm2-clean   2.6.21-rc2-mm2-slub    %diff
User   CPU time               1084.64               1032.93    4.77%
System CPU time                 73.38                 63.14   13.95%
Total  CPU time               1158.02               1096.07    5.35%
Elapsed   time                 307.00                285.62    6.96%

AIM9 Comparison
---------------
                 2.6.21-rc2-mm2-clean   2.6.21-rc2-mm2-slub
 1 creat-clo        425460.75    438809.64     13348.89    3.14%  File Creations and Closes/second
 2 page_test       2097119.26   3398259.27   1301140.01   62.04%  System Allocations & Pages/second
 3 brk_test        7008395.33   6728755.72   -279639.61   -3.99%  System Memory Allocations/second
 4 jmp_test       12226295.31  12254966.21     28670.90    0.23%  Non-local gotos/second
 5 signal_test     1271126.28   1235510.96    -35615.32   -2.80%  Signal Traps/second
 6 exec_test           395.54       381.18       -14.36   -3.63%  Program Loads/second
 7 fork_test         13218.23     13211.41        -6.82   -0.05%  Task Creations/second
 8 link_test         64776.04      7488.13    -57287.91  -88.44%  Link/Unlink Pairs/second

An example console log from x86_64 is below. It's not particularly clear why
it went blamo and I haven't had a chance all day to kick it around for a
bit due to a variety of other hilarity floating around.


Linux version 2.6.21-rc2-mm2-autokern1 ([EMAIL PROTECTED]) (gcc version 4.1.1 
20060525 (Red Hat 4.1.1-1)) #1 SMP Thu Mar 8 12:13:27 CST 2007
Command line: ro root=/dev/VolGroup00/LogVol00 rhgb console=tty0 
console=ttyS1,19200 selinux=no autobench_args: root=30726124 ABAT:1173378546 
loglevel=8
BIOS-provided physical RAM map:
 BIOS-e820:  - 0009d400 (usable)
 BIOS-e820: 0009d400 - 000a (reserved)
 BIOS-e820: 000e - 0010 (reserved)
 BIOS-e820: 0010 - 3ffcddc0 (usable)
 BIOS-e820: 3ffcddc0 - 3ffd (ACPI data)
 BIOS-e820: 3ffd - 4000 (reserved)
 BIOS-e820: fec0 - 0001 (reserved)
Entering add_active_range(0, 0, 157) 0 entries of 3200 used
Entering add_active_range(0, 256, 262093) 1 entries of 3200 used
end_pfn_map = 1048576
DMI 2.3 present.
ACPI: RSDP 000FDFC0, 0014 (r0 IBM   )
ACPI: RSDT 3FFCFF80, 0034 (r1 IBMSERBLADE 1000 IBM  45444F43)
ACPI: FACP 3FFCFEC0, 0084 (r2 IBMSERBLADE 1000 IBM  45444F43)
ACPI: DSDT 3FFCDDC0, 1EA6 (r1 IBMSERBLADE 1000 INTL  2002025)
ACPI: FACS 3FFCFCC0, 0040
ACPI: APIC 3FFCFE00, 009C (r1 IBMSERBLADE 1000 IBM  45444F43)
ACPI: SRAT 3FFCFD40, 0098 (r1 IBMSERBLADE 1000 IBM  45444F43)
ACPI: HPET 3FFCFD00, 0038 (r1 IBMSERBLADE 1000 IBM  45444F43)
SRAT: PXM 0 -> APIC 0 -> Node 0
SRAT: PXM 0 -> APIC 1 -> Node 0
SRAT: PXM 1 -> APIC 2 -> Node 1
SRAT: PXM 1 -> APIC 3 -> Node 1
SRAT: Node 0 PXM 0 0-4000
Entering add_active_range(0, 0, 157) 0 entries of 3200 used
Entering add_active_range(0, 256, 262093) 1 entries of 3200 used
NUMA: Using 63 for the hash shift.
Bootmem setup node 0 -3ffcd000
Node 0 memmap at 0x81003efcd000 size 16773952 first pfn 0x81003efcd000
sizeof(struct page) = 64
Zone PFN ranges:
  DMA 0 -> 4096
  DMA324096 ->  1048576
  Normal1048576 ->  1048576
Movable zone start PFN for each node
early_node_map[2] active PFN ranges
0:0 ->  157
0:  256 ->   262093
On node 0 totalpages: 261994
  DMA zone: 64 pages used for memmap
  DMA zone: 2017 pages reserved
  DMA zone: 1916 pages, LIFO batch:0
  DMA32 zone: 4031 pages used for memmap
  DMA32 zone: 253966 pages, LIFO batch:31
  Normal zone: 0 pages used for memmap
  Movable zone: 0 pages used for memmap
ACPI: PM-Timer IO Port: 0x2208
ACPI: Local APIC address 0xfee0
ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
Processor #0 (Bootup-CPU)
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled)
Processor #1
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x02] enabled)
Processor #2
ACPI: LAPIC (acpi_id[0x03] lapic_id[0x03] enabled)
Processor #3
ACPI: LAPIC_NMI (acpi_id[0x00] dfl dfl lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x01] dfl dfl lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x02] dfl dfl lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x03] dfl dfl lint[0x1])
ACPI: IOAPIC (id[0x0e] address[0xfec0] gsi_base[0])
IOAPIC[0]: apic_id 14, address 0xfec0, GSI 0-23
ACPI: IOAPIC (id[0x0d] address[0xfec1] gsi_base[24])
IOAPIC[1]: apic_id 13, address 

Re: [SLUB 0/3] SLUB: The unqueued slab allocator V4

2007-03-09 Thread Mel Gorman

On Thu, 8 Mar 2007, Christoph Lameter wrote:


> Note that I am amazed that the kernbench even worked. On small machine

How small? The machines I am testing on aren't "big" but they aren't
miserable either.

> I
> seem to be getting into trouble with order 1 allocations.

That in itself is pretty incredible. From what I see, allocations up to 3
generally work unless they are atomic even with the vanilla kernel. That
said, it could be because slab is holding onto the high order pages for
itself.

> SLAB seems to be
> able to avoid the situation by keeping higher order pages on a freelist
> and reduce the alloc/frees of higher order pages that the page allocator
> has to deal with. Maybe we need per order queues in the page allocator?

I'm not sure what you mean by per-order queues. The buddy allocator
already has per-order lists.

> There must be something fundamentally wrong in the page allocator if the
> SLAB queues fix this issue. I was able to fix the issue in V5 by forcing
> SLUB to keep a minimum number of objects around regardless of the fit to
> a page order page. Pass through is deadly since the crappy page allocator
> cannot handle it.
>
> Higher order page allocation failures can be avoided by using kmalloc.
> Yuck! Hopefully your patches fix that fundamental problem.

One way to find out for sure.

--
Mel Gorman
Part-time Phd Student  Linux Technology Center
University of Limerick IBM Dublin Software Lab


Re: [SLUB 0/3] SLUB: The unqueued slab allocator V4

2007-03-09 Thread Mel Gorman

On Thu, 8 Mar 2007, Christoph Lameter wrote:


> On Thu, 8 Mar 2007, Mel Gorman wrote:
>
> > > Note that the 16kb page size has a major
> > > impact on SLUB performance. On IA64 slub will use only 1/4th the locking
> > > overhead as on 4kb platforms.
> >
> > It'll be interesting to see the kernbench tests then with debugging
> > disabled.
>
> You can get a similar effect on 4kb platforms by specifying slub_min_order=2
> on bootup. This means that we have to rely on your patches to allow higher
> order allocs to work reliably though.

It should work out because the way buddy always selects the minimum
page size will tend to cluster the slab allocations together whether they
are reclaimable or not. It's something I can investigate when slub has
stabilised a bit.

However, in general, high order kernel allocations remain a bad idea.
Depending on high order allocations that do not group could potentially
lead to a situation where the movable areas are used more and more by
kernel allocations. I cannot think of a workload that would actually break
everything, but it's a possibility.

> The higher the order of slub the less
> locking overhead. So the better your patches deal with fragmentation the
> more we can reduce locking overhead in slub.

I can certainly kick it around a lot and see what happens. It's best that
slub_min_order=2 remain an optional performance enhancing switch though.


--
Mel Gorman
Part-time Phd Student  Linux Technology Center
University of Limerick IBM Dublin Software Lab




Re: [SLUB 0/3] SLUB: The unqueued slab allocator V4

2007-03-08 Thread Christoph Lameter
Note that I am amazed that the kernbench even worked. On small machine I 
seem to be getting into trouble with order 1 allocations. SLAB seems to be 
able to avoid the situation by keeping higher order pages on a freelist 
and reduce the alloc/frees of higher order pages that the page allocator
has to deal with. Maybe we need per order queues in the page allocator? 

There must be something fundamentally wrong in the page allocator if the 
SLAB queues fix this issue. I was able to fix the issue in V5 by forcing 
SLUB to keep a minimum number of objects around regardless of the fit to
a page order page. Pass through is deadly since the crappy page allocator 
cannot handle it.

Higher order page allocation failures can be avoided by using kmalloc. 
Yuck! Hopefully your patches fix that fundamental problem.



Re: [SLUB 0/3] SLUB: The unqueued slab allocator V4

2007-03-08 Thread Christoph Lameter
On Thu, 8 Mar 2007, Mel Gorman wrote:

> > Note that the 16kb page size has a major 
> > impact on SLUB performance. On IA64 slub will use only 1/4th the locking 
> > overhead as on 4kb platforms.
> It'll be interesting to see the kernbench tests then with debugging
> disabled.

You can get a similar effect on 4kb platforms by specifying slub_min_order=2 on 
bootup.
This means that we have to rely on your patches to allow higher order 
allocs to work reliably though. The higher the order of slub the less 
locking overhead. So the better your patches deal with fragmentation the 
more we can reduce locking overhead in slub.



Re: [SLUB 0/3] SLUB: The unqueued slab allocator V4

2007-03-08 Thread Christoph Lameter
On Thu, 8 Mar 2007, Mel Gorman wrote:

> Brought up 4 CPUs
> Node 0 CPUs: 0-3
> mm/memory.c:111: bad pud c50e4480.

Lower bits must be clear right? Looks like the pud was released
and then reused for a 64 byte cache or so. This is likely a freelist 
pointer that slub put there after allocating the page for the 64 byte 
cache. Then we tried to use the pud.

> migration_cost=0,1000
> *** SLUB: Redzone Inactive check fails in [EMAIL PROTECTED] Slab
> c0756090
> offset=240 flags=50c7 inuse=3 freelist=c50de0f0
>   Bytes b4 c50de0e0:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 
> Object c50de0f0:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 
> Object c50de100:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 
> Object c50de110:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 
> Object c50de120:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 
>Redzone c50de130:  00 00 00 00 00 00 00 00
>  FreePointer c50de138: 

Data overwritten after free or after slab was allocated. So this may be 
the same issue. pud was zapped after it was freed destroying the poison 
of another object in the 64 byte cache.

Hmmm.. Maybe I should put the pad checks before the object checks. 
That way we detect that the whole slab was corrupted and do not flag just 
a single object.
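
For illustration only, a toy userspace version of that check ordering; the
poison values, sizes and helpers are invented here and are not SLUB's
actual debug code:

#include <stdio.h>
#include <string.h>

#define PAGE_SIZE    4096u
#define OBJ_SIZE      120u
#define RED_SIZE        8u    /* redzone appended to each object */
#define POISON_PAD   0x5a     /* fill byte for unused slab padding */
#define RED_INACTIVE 0xbb     /* expected redzone byte for free objects */

/* Scan the unused tail of the slab page for corruption. */
static int padding_clean(const unsigned char *page, unsigned int used)
{
	for (unsigned int i = used; i < PAGE_SIZE; i++)
		if (page[i] != POISON_PAD)
			return 0;
	return 1;
}

static void check_slab(const unsigned char *page, unsigned int nobjects)
{
	unsigned int used = nobjects * (OBJ_SIZE + RED_SIZE);

	/* Padding first: if it is gone, report whole-slab corruption ... */
	if (!padding_clean(page, used)) {
		printf("slab padding overwritten: whole slab corrupted\n");
		return;
	}
	/* ... otherwise point at the individual object whose redzone broke. */
	for (unsigned int i = 0; i < nobjects; i++) {
		const unsigned char *red = page + i * (OBJ_SIZE + RED_SIZE) + OBJ_SIZE;
		for (unsigned int j = 0; j < RED_SIZE; j++)
			if (red[j] != RED_INACTIVE) {
				printf("redzone of object %u damaged\n", i);
				break;
			}
	}
}

int main(void)
{
	static unsigned char page[PAGE_SIZE];
	unsigned int nobjects = 30;   /* leaves some padding at the end */
	unsigned int used = nobjects * (OBJ_SIZE + RED_SIZE);

	memset(page, RED_INACTIVE, used);                  /* intact redzones */
	memset(page + used, POISON_PAD, PAGE_SIZE - used); /* intact padding */
	page[3 * (OBJ_SIZE + RED_SIZE) + OBJ_SIZE] = 0;    /* damage object 3 */
	check_slab(page, nobjects);
	return 0;
}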



Re: [SLUB 0/3] SLUB: The unqueued slab allocator V4

2007-03-08 Thread Mel Gorman
On (08/03/07 08:48), Christoph Lameter didst pronounce:
> On Thu, 8 Mar 2007, Mel Gorman wrote:
> 
> > On x86_64, it completed successfully and looked reliable. There was a 5%
> > performance loss on kernbench and aim9 figures were way down. However, with
> > slub_debug enabled, I would expect that so it's not a fair comparison
> > performance wise. I'll rerun the tests without debug and see what it looks
> > like if you're interested and do not think it's too early to worry about
> > performance instead of clarity. This is what I have for bl6-13 (machine
> > appears on test.kernel.org so additional details are there).
> 
> No its good to start worrying about performance now. There are still some 
> performance issues to be ironed out in particular on NUMA. I am not sure
> f.e. how the reduction of partial lists affect performance.
> 

Ok, I've sent off a bunch of tests - two of which are on NUMA (numaq and
x86_64). It'll take them a long time to complete though as there is a
lot of testing going on right now.

> > IA64 (machine not visible on TKO) curiously did not exhibit the same 
> > problems
> > on kernbench for Total CPU time which is very unexpected but you can see the
> > System CPU times. The AIM9 figures were a bit of an upset but again, I blame
> > slub_debug being enabled
> 
> This was a single node box?

Yes, memory looks like this;

Zone PFN ranges:
  DMA  1024 ->   262144
  Normal 262144 ->   262144
Movable zone start PFN for each node
early_node_map[3] active PFN ranges
0: 1024 ->30719
0:32768 ->65413
0:65440 ->65505
On node 0 totalpages: 62405
Node 0 memmap at 0xe1126000 size 3670016 first pfn 0xe1134000
  DMA zone: 220 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 62185 pages, LIFO batch:7
  Normal zone: 0 pages used for memmap
  Movable zone: 0 pages used for memmap

> Note that the 16kb page size has a major 
> impact on SLUB performance. On IA64 slub will use only 1/4th the locking 
> overhead as on 4kb platforms.
> 

It'll be interesting to see the kernbench tests then with debugging
disabled.

> > (as an aside, the succes rates for high-order allocations are lower with 
> > SLUB.
> > Again, I blame slub_debug. I know that enabling SLAB_DEBUG has similar 
> > effects
> > because of red-zoning and the like)
> 
> We have some additional patches here that reduce the max order for some 
> allocs. I believe the task_struct gets to be an order 2 alloc with V4,
> 

Should make a difference for slab fragmentation

> > Now, the bad news. This exploded on ppc64. It started going wrong early in 
> > the
> > boot process and got worse. I haven't looked closely as to why yet as there 
> > is
> > other stuff on my plate but I've included a console log that might be some 
> > use
> > to you. If you think you have a fix for it, feel free to send it on and I'll
> > give it a test.
> 
> Hmmm... Looks like something is zapping an object. Try to rerun with 
> a kernel compiled with CONFIG_SLAB_DEBUG. I would expect similar results.
> 

I've queued up a few tests. One completed as I wrote this and it didn't
explode with SLAB_DEBUG set. Maybe the others will be different. I'll
kick it around for a bit.

It could be a real bug that slab is just not catching.

> > Brought up 4 CPUs
> > Node 0 CPUs: 0-3
> > mm/memory.c:111: bad pud c50e4480.
> > could not vmalloc 20971520 bytes for cache!
> 
> Hmmm... a bad pud? I need to look at how the puds are managed on power.
> 
> > migration_cost=0,1000
> > *** SLUB: Redzone Inactive check fails in [EMAIL PROTECTED] Slab
> 
> An object was overwritten with zeros after it was freed.

> > RTAS daemon started
> > RTAS: event: 1, Type: Platform Error, Severity: 2
> > audit: initializing netlink socket (disabled)
> > audit(1173335571.256:1): initialized
> > Total HugeTLB memory allocated, 0
> > VFS: Disk quotas dquot_6.5.1
> > Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
> > JFS: nTxBlock = 8192, nTxLock = 65536
> > SELinux:  Registering netfilter hooks
> > io scheduler noop registered
> > io scheduler anticipatory registered (default)
> > io scheduler deadline registered
> > io scheduler cfq registered
> > pci_hotplug: PCI Hot Plug PCI Core version: 0.5
> > rpaphp: RPA HOT Plug PCI Controller Driver version: 0.1
> > rpaphp: Slot [:00:02.2](PCI location=U7879.001.DQD0T7T-P1-C4) registered
> > vio_register_driver: driver hvc_console registering
> > [ cut here ]
> > Badness at mm/slub.c:1701
> 
> Someone did a kmalloc(0, ...). Zero sized allocation are not flagged
> by SLAB but SLUB does.
> 

I'll chase up what's happening here. It will be "reproducible" independent
of SLUB by adding a similar check.

> > Call Trace:
> > [C506B730] [C0011188] .show_stack+0x6c/0x1a0 (unreliable)
> > [C506B7D0] [C01EE9F4] .report_bug+0x94/0xe8
> > [C506B860] [C038C85C] .program_check_exception+0x16c/0x5f4
> > 

Re: [SLUB 0/3] SLUB: The unqueued slab allocator V4

2007-03-08 Thread Christoph Lameter
On Thu, 8 Mar 2007, Mel Gorman wrote:

> On x86_64, it completed successfully and looked reliable. There was a 5%
> performance loss on kernbench and aim9 figures were way down. However, with
> slub_debug enabled, I would expect that so it's not a fair comparison
> performance wise. I'll rerun the tests without debug and see what it looks
> like if you're interested and do not think it's too early to worry about
> performance instead of clarity. This is what I have for bl6-13 (machine
> appears on test.kernel.org so additional details are there).

No its good to start worrying about performance now. There are still some 
performance issues to be ironed out in particular on NUMA. I am not sure
f.e. how the reduction of partial lists affect performance.

> IA64 (machine not visible on TKO) curiously did not exhibit the same problems
> on kernbench for Total CPU time which is very unexpected but you can see the
> System CPU times. The AIM9 figures were a bit of an upset but again, I blame
> slub_debug being enabled

This was a single node box? Note that the 16kb page size has a major 
impact on SLUB performance. On IA64 slub will use only 1/4th the locking 
overhead as on 4kb platforms.

> (as an aside, the succes rates for high-order allocations are lower with SLUB.
> Again, I blame slub_debug. I know that enabling SLAB_DEBUG has similar effects
> because of red-zoning and the like)

We have some additional patches here that reduce the max order for some 
allocs. I believe the task_struct gets to be an order 2 alloc with V4,

> Now, the bad news. This exploded on ppc64. It started going wrong early in the
> boot process and got worse. I haven't looked closely as to why yet as there is
> other stuff on my plate but I've included a console log that might be some use
> to you. If you think you have a fix for it, feel free to send it on and I'll
> give it a test.

Hmmm... Looks like something is zapping an object. Try to rerun with 
a kernel compiled with CONFIG_SLAB_DEBUG. I would expect similar results.

> Brought up 4 CPUs
> Node 0 CPUs: 0-3
> mm/memory.c:111: bad pud c50e4480.
> could not vmalloc 20971520 bytes for cache!

Hmmm... a bad pud? I need to look at how the puds are managed on power.

> migration_cost=0,1000
> *** SLUB: Redzone Inactive check fails in [EMAIL PROTECTED] Slab

An object was overwritten with zeros after it was freed.

> RTAS daemon started
> RTAS: event: 1, Type: Platform Error, Severity: 2
> audit: initializing netlink socket (disabled)
> audit(1173335571.256:1): initialized
> Total HugeTLB memory allocated, 0
> VFS: Disk quotas dquot_6.5.1
> Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
> JFS: nTxBlock = 8192, nTxLock = 65536
> SELinux:  Registering netfilter hooks
> io scheduler noop registered
> io scheduler anticipatory registered (default)
> io scheduler deadline registered
> io scheduler cfq registered
> pci_hotplug: PCI Hot Plug PCI Core version: 0.5
> rpaphp: RPA HOT Plug PCI Controller Driver version: 0.1
> rpaphp: Slot [:00:02.2](PCI location=U7879.001.DQD0T7T-P1-C4) registered
> vio_register_driver: driver hvc_console registering
> [ cut here ]
> Badness at mm/slub.c:1701

Someone did a kmalloc(0, ...). Zero sized allocations are not flagged
by SLAB but SLUB flags them.
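
A trivial userspace illustration of that difference in behaviour;
toy_kmalloc below is of course made up and not the kernel interface:

#include <stdio.h>
#include <stdlib.h>

static void *toy_kmalloc(size_t size, const char *caller)
{
	if (size == 0) {
		/* Complain instead of silently handing back memory. */
		fprintf(stderr, "toy_kmalloc: zero-sized allocation from %s\n",
		        caller);
		return NULL;
	}
	return malloc(size);
}

int main(void)
{
	void *p = toy_kmalloc(0, __func__);   /* triggers the warning */
	void *q = toy_kmalloc(64, __func__);  /* normal allocation */

	free(p);
	free(q);
	return 0;
}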

> Call Trace:
> [C506B730] [C0011188] .show_stack+0x6c/0x1a0 (unreliable)
> [C506B7D0] [C01EE9F4] .report_bug+0x94/0xe8
> [C506B860] [C038C85C] .program_check_exception+0x16c/0x5f4
> [C506B930] [C00046F4] program_check_common+0xf4/0x100
> --- Exception: 700 at .get_slab+0xbc/0x18c
> LR = .__kmalloc+0x28/0x104
> [C506BC20] [C506BCC0] 0xc506bcc0 (unreliable)
> [C506BCD0] [C00CE2EC] .__kmalloc+0x28/0x104
> [C506BD60] [C022E724] .tty_register_driver+0x5c/0x23c
> [C506BE10] [C0477910] .hvsi_init+0x154/0x1b4
> [C506BEC0] [C0451B7C] .init+0x1c4/0x2f8
> [C506BF90] [C00275D0] .kernel_thread+0x4c/0x68
> mm/memory.c:111: bad pud c5762900.
> mm/memory.c:111: bad pud c5762480.
> [ cut here ]
> kernel BUG at mm/mmap.c:1999!

More page table trouble.



Re: [SLUB 0/3] SLUB: The unqueued slab allocator V4

2007-03-08 Thread Mel Gorman

On Tue, 6 Mar 2007, Christoph Lameter wrote:


> [PATCH] SLUB The unqueued slab allocator v4



Hi Christoph,

I shoved these patches through a few tests on x86, x86_64, ia64 and ppc64 
last night to see how they got on. I enabled slub_debug to catch any 
suprises that may be creeping about.


The results are mixed.

On x86_64, it completed successfully and looked reliable. There was a 5% 
performance loss on kernbench and aim9 figures were way down. However, 
with slub_debug enabled, I would expect that so it's not a fair comparison 
performance wise. I'll rerun the tests without debug and see what it looks 
like if you're interested and do not think it's too early to worry about 
performance instead of clarity. This is what I have for bl6-13 (machine 
appears on test.kernel.org so additional details are there).


KernBench Comparison

                      2.6.21-rc2-mm2-clean  2.6.21-rc2-mm2-list-based    %diff

User    CPU time                     84.32                      86.03   -2.03%
System  CPU time                     32.97                      38.21  -15.89%
Total   CPU time                    117.29                     124.24   -5.93%
Elapsed     time                     34.95                      37.31   -6.75%

AIM9 Comparison
---------------
                        2.6.21-rc2-mm2-clean  2.6.21-rc2-mm2-list-based
 1 creat-clo                       160706.55     62918.54   -97788.01  -60.85% File Creations and Closes/second
 2 page_test                       190371.67    204050.99    13679.32    7.19% System Allocations & Pages/second
 3 brk_test                       2320679.89   1923512.75  -397167.14  -17.11% System Memory Allocations/second
 4 jmp_test                      16391869.38  16380353.27   -11516.11   -0.07% Non-local gotos/second
 5 signal_test                     492234.63    235710.71  -256523.92  -52.11% Signal Traps/second
 6 exec_test                          232.26       220.88      -11.38   -4.90% Program Loads/second
 7 fork_test                         4514.25      3609.40     -904.85  -20.04% Task Creations/second
 8 link_test                        53639.76     26925.91   -26713.85  -49.80% Link/Unlink Pairs/second


IA64 (machine not visible on TKO) curiously did not exhibit the same 
problems on kernbench for Total CPU time which is very unexpected but you 
can see the System CPU times. The AIM9 figures were a bit of an upset but 
again, I blame slub_debug being enabled.


KernBench Comparison

                      2.6.21-rc2-mm2-clean  2.6.21-rc2-mm2-list-based    %diff
User    CPU time                   1084.64                    1033.46    4.72%
System  CPU time                     73.38                      84.14  -14.66%
Total   CPU time                   1158.02                     1117.6    3.49%
Elapsed     time                    307.00                     291.29    5.12%

AIM9 Comparison
---------------
                        2.6.21-rc2-mm2-clean  2.6.21-rc2-mm2-list-based
 1 creat-clo                       425460.75    137709.84  -287750.91  -67.63% File Creations and Closes/second
 2 page_test                      2097119.26   2373083.49   275964.23   13.16% System Allocations & Pages/second
 3 brk_test                       7008395.33   3787961.51 -3220433.82  -45.95% System Memory Allocations/second
 4 jmp_test                      12226295.31  12254744.03    28448.72    0.23% Non-local gotos/second
 5 signal_test                    1271126.28    334357.29  -936768.99  -73.70% Signal Traps/second
 6 exec_test                          395.54       349.00      -46.54  -11.77% Program Loads/second
 7 fork_test                        13218.23      8822.93    -4395.30  -33.25% Task Creations/second
 8 link_test                        64776.04      7410.75   -57365.29  -88.56% Link/Unlink Pairs/second

(as an aside, the success rates for high-order allocations are lower with 
SLUB. Again, I blame slub_debug. I know that enabling SLAB_DEBUG has 
similar effects because of red-zoning and the like)


Now, the bad news. This exploded on ppc64. It started going wrong early in 
the boot process and got worse. I haven't looked closely as to why yet as 
there is other stuff on my plate but I've included a console log that 
might be some use to you. If you think you have a fix for it, feel free to 
send it on and I'll give it a test.



Config file read, 1024 bytes
Welcome
Welcome to yaboot version 1.3.12
Enter "help" to get some basic usage information
boot: autobench
Please wait, loading kernel...
   Elf64 kernel loaded...
Loading ramdisk...
ramdisk loaded at 0240, size: 1648 Kbytes
OF stdout device is: /vdevice/[EMAIL PROTECTED]
Hypertas detected, assuming LPAR !
command line: ro console=hvc0 autobench_args: root=/dev/sda6 ABAT:1173335344 loglevel=8 slub_debug 
memory layout at init


Re: [SLUB 0/3] SLUB: The unqueued slab allocator V4

2007-03-08 Thread Mel Gorman
On (08/03/07 08:48), Christoph Lameter didst pronounce:
 On Thu, 8 Mar 2007, Mel Gorman wrote:
 
  On x86_64, it completed successfully and looked reliable. There was a 5%
  performance loss on kernbench and aim9 figures were way down. However, with
  slub_debug enabled, I would expect that so it's not a fair comparison
  performance wise. I'll rerun the tests without debug and see what it looks
  like if you're interested and do not think it's too early to worry about
  performance instead of clarity. This is what I have for bl6-13 (machine
  appears on test.kernel.org so additional details are there).
 
 No its good to start worrying about performance now. There are still some 
 performance issues to be ironed out in particular on NUMA. I am not sure
 f.e. how the reduction of partial lists affect performance.
 

Ok, I've sent off a bunch of tests - two of which are on NUMA (numaq and
x86_64). It'll take them a long time to complete though as there is a
lot of testing going on right now.

  IA64 (machine not visible on TKO) curiously did not exhibit the same 
  problems
  on kernbench for Total CPU time which is very unexpected but you can see the
  System CPU times. The AIM9 figures were a bit of an upset but again, I blame
  slub_debug being enabled
 
 This was a single node box?

Yes, memory looks like this;

Zone PFN ranges:
  DMA          1024 ->   262144
  Normal     262144 ->   262144
Movable zone start PFN for each node
early_node_map[3] active PFN ranges
    0:     1024 ->    30719
    0:    32768 ->    65413
    0:    65440 ->    65505
On node 0 totalpages: 62405
Node 0 memmap at 0xe1126000 size 3670016 first pfn 0xe1134000
  DMA zone: 220 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 62185 pages, LIFO batch:7
  Normal zone: 0 pages used for memmap
  Movable zone: 0 pages used for memmap

 Note that the 16kb page size has a major 
 impact on SLUB performance. On IA64 slub will use only 1/4th the locking 
 overhead as on 4kb platforms.
 

It'll be interesting to see the kernbench tests then with debugging
disabled.

  (as an aside, the succes rates for high-order allocations are lower with 
  SLUB.
  Again, I blame slub_debug. I know that enabling SLAB_DEBUG has similar 
  effects
  because of red-zoning and the like)
 
 We have some additional patches here that reduce the max order for some 
 allocs. I believe the task_struct gets to be an order 2 alloc with V4,
 

Should make a difference for slab fragmentation.

  Now, the bad news. This exploded on ppc64. It started going wrong early in 
  the
  boot process and got worse. I haven't looked closely as to why yet as there 
  is
  other stuff on my plate but I've included a console log that might be some 
  use
  to you. If you think you have a fix for it, feel free to send it on and I'll
  give it a test.
 
 Hmmm... Looks like something is zapping an object. Try to rerun with 
 a kernel compiled with CONFIG_SLAB_DEBUG. I would expect similar results.
 

I've queued up a few tests. One completed as I wrote this and it didn't
explode with SLAB_DEBUG set. Maybe the others will be different. I'll
kick it around for a bit.

It could be a real bug that slab is just not catching.

  Brought up 4 CPUs
  Node 0 CPUs: 0-3
  mm/memory.c:111: bad pud c50e4480.
  could not vmalloc 20971520 bytes for cache!
 
 Hmmm... a bad pud? I need to look at how the puds are managed on power.
 
  migration_cost=0,1000
  *** SLUB: Redzone Inactive check fails in [EMAIL PROTECTED] Slab
 
 An object was overwritten with zeros after it was freed.

  RTAS daemon started
  RTAS: event: 1, Type: Platform Error, Severity: 2
  audit: initializing netlink socket (disabled)
  audit(1173335571.256:1): initialized
  Total HugeTLB memory allocated, 0
  VFS: Disk quotas dquot_6.5.1
  Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
  JFS: nTxBlock = 8192, nTxLock = 65536
  SELinux:  Registering netfilter hooks
  io scheduler noop registered
  io scheduler anticipatory registered (default)
  io scheduler deadline registered
  io scheduler cfq registered
  pci_hotplug: PCI Hot Plug PCI Core version: 0.5
  rpaphp: RPA HOT Plug PCI Controller Driver version: 0.1
  rpaphp: Slot [:00:02.2](PCI location=U7879.001.DQD0T7T-P1-C4) registered
  vio_register_driver: driver hvc_console registering
  [ cut here ]
  Badness at mm/slub.c:1701
 
 Someone did a kmalloc(0, ...). Zero sized allocation are not flagged
 by SLAB but SLUB does.
 

I'll chase up what's happening here. It will be reproducible independent
of SLUB by adding a similar check.

  Call Trace:
  [C506B730] [C0011188] .show_stack+0x6c/0x1a0 (unreliable)
  [C506B7D0] [C01EE9F4] .report_bug+0x94/0xe8
  [C506B860] [C038C85C] .program_check_exception+0x16c/0x5f4
  [C506B930] [C00046F4] program_check_common+0xf4/0x100
  --- Exception: 700 at .get_slab+0xbc/0x18c
  LR = .__kmalloc+0x28/0x104
  

Re: [SLUB 0/3] SLUB: The unqueued slab allocator V4

2007-03-08 Thread Christoph Lameter
On Thu, 8 Mar 2007, Mel Gorman wrote:

 Brought up 4 CPUs
 Node 0 CPUs: 0-3
 mm/memory.c:111: bad pud c50e4480.

Lower bits must be clear right? Looks like the pud was released
and then reused for a 64 byte cache or so. This is likely a freelist 
pointer that slub put there after allocating the page for the 64 byte 
cache. Then we tried to use the pud.
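
To make the suspected mechanism concrete, a simplified userspace model
(not SLUB's code) of an in-object freelist: threading free objects
through their own first word clobbers whatever the previous user, such
as a still-referenced pud page, left there.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define OBJ_SIZE 64
#define NOBJ     8

int main(void)
{
        unsigned char *page = calloc(NOBJ, OBJ_SIZE);
        void **freelist = NULL;
        int i;

        /* Pretend the page previously held page table entries. */
        memset(page, 0xaa, NOBJ * OBJ_SIZE);

        /* Thread all objects onto a freelist: the first word of each
         * free object now points at the next free object. */
        for (i = NOBJ - 1; i >= 0; i--) {
                void **obj = (void **)(page + i * OBJ_SIZE);
                *obj = freelist;
                freelist = obj;
        }

        /* Anyone still reading the old contents now sees pointers
         * (or NULL) where the 0xaa pattern used to be. */
        printf("first word of object 0: %p\n", *(void **)page);

        free(page);
        return 0;
}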

 migration_cost=0,1000
 *** SLUB: Redzone Inactive check fails in [EMAIL PROTECTED] Slab
 c0756090
 offset=240 flags=50c7 inuse=3 freelist=c50de0f0
   Bytes b4 c50de0e0:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 
 Object c50de0f0:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 
 Object c50de100:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 
 Object c50de110:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 
 Object c50de120:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 
Redzone c50de130:  00 00 00 00 00 00 00 00
  FreePointer c50de138: 

Data overwritten after free or after slab was allocated. So this may be 
the same issue. pud was zapped after it was freed destroying the poison 
of another object in the 64 byte cache.

Hmmm.. Maybe I should put the pad checks before the object checks. 
That way we detect that the whole slab was corrupted and do not flag just 
a single object.
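
A hedged sketch of that ordering (slab layout, byte patterns and names
are invented for the illustration): check the slab's trailing padding
first, and only fall back to per-object checks if the padding is intact,
so whole-slab corruption is reported as such.

#include <stdio.h>
#include <string.h>

#define SLAB_SIZE   4096
#define OBJ_SIZE    64
#define NOBJ        60          /* leaves tail padding */
#define PAD_BYTE    0x5a
#define POISON_BYTE 0x6b

static int check_slab(const unsigned char *slab)
{
        size_t used = (size_t)NOBJ * OBJ_SIZE;
        size_t i;

        /* 1) Padding first: damage here means the whole slab was
         *    overwritten, not just one object. */
        for (i = used; i < SLAB_SIZE; i++)
                if (slab[i] != PAD_BYTE) {
                        printf("slab padding corrupted at offset %zu\n", i);
                        return -1;
                }

        /* 2) Only then check the poison of each (free) object. */
        for (i = 0; i < used; i++)
                if (slab[i] != POISON_BYTE) {
                        printf("object %zu corrupted at offset %zu\n",
                               i / OBJ_SIZE, i % OBJ_SIZE);
                        return -1;
                }
        return 0;
}

int main(void)
{
        unsigned char slab[SLAB_SIZE];

        memset(slab, POISON_BYTE, (size_t)NOBJ * OBJ_SIZE);
        memset(slab + (size_t)NOBJ * OBJ_SIZE, PAD_BYTE,
               SLAB_SIZE - (size_t)NOBJ * OBJ_SIZE);

        slab[SLAB_SIZE - 1] = 0;        /* simulate a stray write */
        check_slab(slab);
        return 0;
}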

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [SLUB 0/3] SLUB: The unqueued slab allocator V4

2007-03-08 Thread Christoph Lameter
On Thu, 8 Mar 2007, Mel Gorman wrote:

  Note that the 16kb page size has a major 
  impact on SLUB performance. On IA64 slub will use only 1/4th the locking 
  overhead as on 4kb platforms.
 It'll be interesting to see the kernbench tests then with debugging
 disabled.

You can get a similar effect on 4kb platforms by specifying slub_min_order=2 on 
bootup.
This means that we have to rely on your patches to allow higher order 
allocs to work reliably though. The higher the order of slub the less 
locking overhead. So the better your patches deal with fragmentation the 
more we can reduce locking overhead in slub.
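
Roughly, the list lock is only taken when a whole slab is acquired or
retired, so the per-object locking cost shrinks with the number of
objects per slab. A small sketch of that arithmetic (object size and
orders are assumptions, not measurements):

#include <stdio.h>

int main(void)
{
        const unsigned long page_size = 4096;   /* 4kb platform */
        const unsigned long obj_size  = 256;    /* example object */
        int order;

        for (order = 0; order <= 3; order++) {
                unsigned long slab_bytes = page_size << order;
                unsigned long objects    = slab_bytes / obj_size;

                printf("order %d: %5lu byte slab, %3lu objects"
                       " -> ~1 lock cycle per %lu allocations\n",
                       order, slab_bytes, objects, objects);
        }
        return 0;
}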

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [SLUB 0/3] SLUB: The unqueued slab allocator V4

2007-03-08 Thread Christoph Lameter
Note that I am amazed that the kernbench even worked. On small machines I 
seem to be getting into trouble with order 1 allocations. SLAB seems to be 
able to avoid the situation by keeping higher order pages on a freelist 
and reducing the alloc/frees of higher order pages that the page allocator
has to deal with. Maybe we need per order queues in the page allocator? 

There must be something fundamentally wrong in the page allocator if the 
SLAB queues fix this issue. I was able to fix the issue in V5 by forcing 
SLUB to keep a minimum number of objects around regardless of the fit to
a page order page. Pass through is deadly since the crappy page allocator 
cannot handle it.
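
The fix described can be sketched as choosing the smallest order whose
slab holds at least some minimum number of objects (an illustration of
the idea only, not the actual calculate_order() code; the minimum is an
assumption):

#include <stdio.h>

#define PAGE_BYTES  4096UL
#define MAX_ORDER   4
#define MIN_OBJECTS 8           /* assumed minimum */

static int order_for_min_objects(unsigned long size)
{
        int order;

        for (order = 0; order <= MAX_ORDER; order++)
                if ((PAGE_BYTES << order) / size >= MIN_OBJECTS)
                        return order;
        return MAX_ORDER;       /* fall back to the largest allowed */
}

int main(void)
{
        unsigned long sizes[] = { 64, 512, 1024, 2048 };
        unsigned int i;

        for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
                printf("object size %4lu -> order %d\n",
                       sizes[i], order_for_min_objects(sizes[i]));
        return 0;
}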

Higher order page allocation failures can be avoided by using kmalloc. 
Yuck! Hopefully your patches fix that fundamental problem.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[SLUB 0/3] SLUB: The unqueued slab allocator V4

2007-03-06 Thread Christoph Lameter
[PATCH] SLUB The unqueued slab allocator v4

V3->V4
- Rename /proc/slabinfo to /proc/slubinfo. We have a different format after
  all.
- More bug fixes and stabilization of diagnostic functions. This seems
  to be finally something that works wherever we test it.
- Serialize kmem_cache_create and kmem_cache_destroy via slub_lock (Adrian's
  idea)
- Add two new modifications (separate patches) to guarantee
  a minimum number of objects per slab and to pass through large
  allocations.

Note that SLUB will warn on zero sized allocations. SLAB just allocates
some memory. So some traces from the usb subsystem etc should be expected.
There are very likely also issues remaining in SLUB.

V2->V3
- Debugging and diagnostic support. This is runtime enabled and not compile
  time enabled. Runtime debugging can be controlled via kernel boot options
  on an individual slab cache basis or globally.
- Slab Trace support (For individual slab caches).
- Resiliency support: If basic sanity checks are enabled (via F f.e.)
  (boot option) then SLUB will do the best to perform diagnostics and
  then continue (i.e. mark corrupted objects as used).
- Fix up numerous issues including clash of SLUBs use of page
  flags with i386 arch use for pmd and pgds (which are managed
  as slab caches, sigh).
- Dynamic per CPU array sizing.
- Explain SLUB slabcache flags

V1->V2
- Fix up various issues. Tested on i386 UP, X86_64 SMP, ia64 NUMA.
- Provide NUMA support by splitting partial lists per node.
- Better Slab cache merge support (now at around 50% of slabs)
- List slab cache aliases if slab caches are merged.
- Updated descriptions /proc/slabinfo output

This is a new slab allocator which was motivated by the complexity of the
existing code in mm/slab.c. It attempts to address a variety of concerns
with the existing implementation.

A. Management of object queues

   A particular concern was the complex management of the numerous object
   queues in SLAB. SLUB has no such queues. Instead we dedicate a slab for
   each allocating CPU and use objects from a slab directly instead of
   queueing them up.
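
   As a hedged single-threaded sketch of that idea (not SLUB's
   implementation): each CPU owns an active slab with an in-object
   freelist, and allocation simply pops the next object from that slab;
   only an exhausted slab requires fetching a new one.

   #include <stdio.h>
   #include <stdlib.h>

   #define OBJ_SIZE 64
   #define NOBJ     16

   struct cpu_slab {
           unsigned char *page;    /* backing memory of the slab */
           void **freelist;        /* next free object, or NULL */
   };

   static void slab_init(struct cpu_slab *s)
   {
           int i;

           s->page = calloc(NOBJ, OBJ_SIZE);
           s->freelist = NULL;
           for (i = NOBJ - 1; i >= 0; i--) {
                   void **obj = (void **)(s->page + i * OBJ_SIZE);
                   *obj = s->freelist;
                   s->freelist = obj;
           }
   }

   /* Fast path: take the next object straight off the active slab. */
   static void *slab_alloc(struct cpu_slab *s)
   {
           void **obj = s->freelist;

           if (!obj)
                   return NULL;    /* would fetch a new slab here */
           s->freelist = *obj;
           return obj;
   }

   int main(void)
   {
           struct cpu_slab s;
           int i, n = 0;

           slab_init(&s);
           for (i = 0; i < NOBJ + 2; i++)
                   if (slab_alloc(&s))
                           n++;
           printf("allocated %d objects from one slab\n", n);

           free(s.page);
           return 0;
   }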

B. Storage overhead of object queues

   SLAB Object queues exist per node, per CPU. The alien cache queue even
   has a queue array that contain a queue for each processor on each
   node. For very large systems the number of queues and the number of
   objects that may be caught in those queues grows exponentially. On our
   systems with 1k nodes / processors we have several gigabytes just tied up
   for storing references to objects for those queues. This does not include
   the objects that could be on those queues. One fears that the whole
   memory of the machine could one day be consumed by those queues.
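
   A back-of-the-envelope sketch of why that adds up (queue length and
   pointer width are assumptions for illustration):

   #include <stdio.h>

   int main(void)
   {
           const unsigned long long nodes = 1024, cpus = 1024;
           const unsigned long long queue_entries = 120, ptr_bytes = 8;
           /* One alien queue per (node, cpu) pair for a single cache. */
           unsigned long long bytes = nodes * cpus * queue_entries * ptr_bytes;

           /* Roughly 960 MiB of pointer slots for one cache; with many
            * caches this reaches several gigabytes before counting the
            * objects sitting in those queues. */
           printf("~%llu MiB per cache\n", bytes >> 20);
           return 0;
   }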

C. SLAB meta data overhead

   SLAB has overhead at the beginning of each slab. This means that data
   cannot be naturally aligned at the beginning of a slab block. SLUB keeps
   all meta data in the corresponding page_struct. Objects can be naturally
   aligned in the slab. F.e. a 128 byte object will be aligned at 128 byte
   boundaries and can fit tightly into a 4k page with no bytes left over.
   SLAB cannot do this.
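
   The fit claim in numbers (the size of an in-slab header is an
   assumption used only for contrast):

   #include <stdio.h>

   int main(void)
   {
           const unsigned long page = 4096, obj = 128, header = 64;

           printf("no in-slab metadata: %lu objects, %lu bytes left over\n",
                  page / obj, page % obj);
           printf("with %lu byte header: %lu objects, %lu bytes left over\n",
                  header, (page - header) / obj, (page - header) % obj);
           return 0;
   }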

D. SLAB has a complex cache reaper

   SLUB does not need a cache reaper for UP systems. On SMP systems
   the per CPU slab may be pushed back into partial list but that
   operation is simple and does not require an iteration over a list
   of objects. SLAB expires per CPU, shared and alien object queues
   during cache reaping which may cause strange hold offs.

E. SLAB has complex NUMA policy layer support

   SLUB pushes NUMA policy handling into the page allocator. This means that
   allocation is coarser (SLUB does interleave on a page level) but that
   situation was also present before 2.6.13. SLAB's application of
   policies to individual slab objects allocated in SLAB is
   certainly a performance concern due to the frequent references to
   memory policies which may lead a sequence of objects to come from
   one node after another. SLUB will get a slab full of objects
   from one node and then will switch to the next.

F. Reduction of the size of partial slab lists

   SLAB has per node partial lists. This means that over time a large
   number of partial slabs may accumulate on those lists. These can
   only be reused if allocations occur on specific nodes. SLUB has a global
   pool of partial slabs and will consume slabs from that pool to
   decrease fragmentation.

G. Tunables

   SLAB has sophisticated tuning abilities for each slab cache. One can
   manipulate the queue sizes in detail. However, filling the queues still
   requires the uses of the spin lock to check out slabs. SLUB has a global
   parameter (min_slab_order) for tuning. Increasing the minimum slab
   order can decrease the locking overhead. The bigger the slab order the
   less motions of pages between per CPU and partial lists occur and the
   better SLUB will be scaling.

G. Slab merging

   We often have slab caches with similar parameters. SLUB detects those
   on bootup and merges them into the corresponding general caches. This
   leads to more effective memory use.


Re: [PATCH] SLUB The unqueued slab allocator V3

2007-02-28 Thread David Miller
From: Christoph Lameter <[EMAIL PROTECTED]>
Date: Wed, 28 Feb 2007 17:06:19 -0800 (PST)

> On Wed, 28 Feb 2007, David Miller wrote:
> 
> > Arguably SLAB_HWCACHE_ALIGN and SLAB_MUST_HWCACHE_ALIGN should
> > not be set here, but SLUBs change in semantics in this area
> > could cause similar grief in other areas, an audit is probably
> > in order.
> > 
> > The above example was from sparc64, but x86 does the same thing
> > as probably do other platforms which use SLAB for pagetables.
> 
> Maybe this will address these concerns?
> 
> Index: linux-2.6.21-rc2/mm/slub.c
> ===
> --- linux-2.6.21-rc2.orig/mm/slub.c   2007-02-28 16:54:23.0 -0800
> +++ linux-2.6.21-rc2/mm/slub.c2007-02-28 17:03:54.0 -0800
> @@ -1229,8 +1229,10 @@ static int calculate_order(int size)
>  static unsigned long calculate_alignment(unsigned long flags,
>   unsigned long align)
>  {
> - if (flags & (SLAB_MUST_HWCACHE_ALIGN|SLAB_HWCACHE_ALIGN))
> + if (flags & SLAB_HWCACHE_ALIGN)
>   return L1_CACHE_BYTES;
> + if (flags & SLAB_MUST_HWCACHE_ALIGN)
> + return max(align, (unsigned long)L1_CACHE_BYTES);
>  
>   if (align < ARCH_SLAB_MINALIGN)
>   return ARCH_SLAB_MINALIGN;

It would achieve parity with existing SLAB behavior, sure.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] SLUB The unqueued slab allocator V3

2007-02-28 Thread Christoph Lameter
On Wed, 28 Feb 2007, David Miller wrote:

> Maybe if you managed your individual changes in GIT or similar
> this could be debugged very quickly. :-)

I think once things calm down and the changes become smaller it's going 
to be easier. Likely the case after V4.

> Meanwhile I noticed that your alignment algorithm is different
> than SLAB's.  And I think this is important for the page table
> SLABs that some platforms use.

Ok.
 
> No matter what flags are specified, SLAB gives at least the
> passed in alignment specified in kmem_cache_create().  That
> logic in slab is here:
> 
>   /* 3) caller mandated alignment */
>   if (ralign < align) {
>   ralign = align;
>   }

Hmmm... Right.
 
> Whereas SLUB uses the CPU cacheline size when the MUSTALIGN
> flag is set.  Architectures do things like:
> 
>   pgtable_cache = kmem_cache_create("pgtable_cache",
> PAGE_SIZE, PAGE_SIZE,
> SLAB_HWCACHE_ALIGN |
> SLAB_MUST_HWCACHE_ALIGN,
> zero_ctor,
> NULL);
> 
> to get a PAGE_SIZE aligned slab, SLUB doesn't give the same
> behavior SLAB does in this case.

SLUB only supports this by passing through allocations to the page 
allocator since it does not maintain queues. So the above will cause the 
pgtable_cache to use the caches of the page allocator. The queueing effect 
that you get from SLAB is not present in SLUB since it does not provide 
them. If SLUB is to be used this way then we need to have higher order 
page sizes and allocate chunks from the higher order page for the 
pgtable_cache.

There are other ways of doing it. IA64 f.e. uses a linked list to 
accomplish the same while avoiding SLAB overhead.

> Arguably SLAB_HWCACHE_ALIGN and SLAB_MUST_HWCACHE_ALIGN should
> not be set here, but SLUBs change in semantics in this area
> could cause similar grief in other areas, an audit is probably
> in order.
> 
> The above example was from sparc64, but x86 does the same thing
> as probably do other platforms which use SLAB for pagetables.

Maybe this will address these concerns?

Index: linux-2.6.21-rc2/mm/slub.c
===
--- linux-2.6.21-rc2.orig/mm/slub.c 2007-02-28 16:54:23.0 -0800
+++ linux-2.6.21-rc2/mm/slub.c  2007-02-28 17:03:54.0 -0800
@@ -1229,8 +1229,10 @@ static int calculate_order(int size)
 static unsigned long calculate_alignment(unsigned long flags,
unsigned long align)
 {
-   if (flags & (SLAB_MUST_HWCACHE_ALIGN|SLAB_HWCACHE_ALIGN))
+   if (flags & SLAB_HWCACHE_ALIGN)
return L1_CACHE_BYTES;
+   if (flags & SLAB_MUST_HWCACHE_ALIGN)
+   return max(align, (unsigned long)L1_CACHE_BYTES);
 
if (align < ARCH_SLAB_MINALIGN)
return ARCH_SLAB_MINALIGN;
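
For reference, a userspace rendering of the patched logic above, so the
effect on a caller-mandated alignment can be checked directly (constants
and flag values are assumptions for the sketch; the real function is in
mm/slub.c):

#include <stdio.h>

#define L1_CACHE_BYTES          128UL   /* assumed for the example */
#define ARCH_SLAB_MINALIGN        8UL
#define PAGE_ALIGN_REQ         4096UL

#define SLAB_HWCACHE_ALIGN      0x1UL   /* values made up for the sketch */
#define SLAB_MUST_HWCACHE_ALIGN 0x2UL

static unsigned long max_ul(unsigned long a, unsigned long b)
{
        return a > b ? a : b;
}

/* Mirrors the patched calculate_alignment(): MUST_HWCACHE_ALIGN now
 * honours a caller alignment larger than the cache line. */
static unsigned long calculate_alignment(unsigned long flags,
                                         unsigned long align)
{
        if (flags & SLAB_HWCACHE_ALIGN)
                return L1_CACHE_BYTES;
        if (flags & SLAB_MUST_HWCACHE_ALIGN)
                return max_ul(align, L1_CACHE_BYTES);

        if (align < ARCH_SLAB_MINALIGN)
                return ARCH_SLAB_MINALIGN;
        return align;
}

int main(void)
{
        /* Caller-mandated PAGE_SIZE alignment with MUST_HWCACHE_ALIGN. */
        printf("MUST_HWCACHE_ALIGN, align=4096 -> %lu\n",
               calculate_alignment(SLAB_MUST_HWCACHE_ALIGN, PAGE_ALIGN_REQ));
        /* Ordinary hwcache-aligned cache. */
        printf("HWCACHE_ALIGN, align=0 -> %lu\n",
               calculate_alignment(SLAB_HWCACHE_ALIGN, 0));
        return 0;
}
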
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] SLUB The unqueued slab allocator V3

2007-02-28 Thread David Miller
From: David Miller <[EMAIL PROTECTED]>
Date: Wed, 28 Feb 2007 14:00:22 -0800 (PST)

> V3 doesn't boot successfully on sparc64

False alarm!

This crash was actually due to an unrelated problem in the parport_pc
driver on my machine.

Slub v3 boots up and seems to work fine so far on sparc64.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] SLUB The unqueued slab allocator V3

2007-02-28 Thread David Miller
From: Christoph Lameter <[EMAIL PROTECTED]>
Date: Wed, 28 Feb 2007 11:20:44 -0800 (PST)

> V2->V3
> - Debugging and diagnostic support. This is runtime enabled and not compile
>   time enabled. Runtime debugging can be controlled via kernel boot options
>   on an individual slab cache basis or globally.
> - Slab Trace support (For individual slab caches).
> - Resiliency support: If basic sanity checks are enabled (via F f.e.)
>   (boot option) then SLUB will do the best to perform diagnostics and
>   then continue (i.e. mark corrupted objects as used).
> - Fix up numerous issues including clash of SLUBs use of page
>   flags with i386 arch use for pmd and pgds (which are managed
>   as slab caches, sigh).
> - Dynamic per CPU array sizing.
> - Explain SLUB slabcache flags

V3 doesn't boot successfully on sparc64, sorry I don't have the
ability to track this down at the moment since it resets the
machine right as the video device is initialized and after diffing
V2 to V3 there is way too much stuff changing for me to try and
"bisect" between V2 to V3 to find the guilty sub-change.

Maybe if you managed your individual changes in GIT or similar
this could be debugged very quickly. :-)

Meanwhile I noticed that your alignment algorithm is different
than SLAB's.  And I think this is important for the page table
SLABs that some platforms use.

No matter what flags are specified, SLAB gives at least the
passed in alignment specified in kmem_cache_create().  That
logic in slab is here:

/* 3) caller mandated alignment */
if (ralign < align) {
ralign = align;
}

Whereas SLUB uses the CPU cacheline size when the MUSTALIGN
flag is set.  Architectures do things like:

pgtable_cache = kmem_cache_create("pgtable_cache",
  PAGE_SIZE, PAGE_SIZE,
  SLAB_HWCACHE_ALIGN |
  SLAB_MUST_HWCACHE_ALIGN,
  zero_ctor,
  NULL);

to get a PAGE_SIZE aligned slab, SLUB doesn't give the same
behavior SLAB does in this case.

Arguably SLAB_HWCACHE_ALIGN and SLAB_MUST_HWCACHE_ALIGN should
not be set here, but SLUBs change in semantics in this area
could cause similar grief in other areas, an audit is probably
in order.

The above example was from sparc64, but x86 does the same thing
as probably do other platforms which use SLAB for pagetables.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SLUB: The unqueued Slab allocator

2007-02-25 Thread Jörn Engel
On Sat, 24 February 2007 16:14:48 -0800, Christoph Lameter wrote:
> 
> It eliminates 50% of the slab caches. Thus it reduces the management 
> overhead by half.

How much management overhead is there left with SLUB?  Is it just the
one per-node slab?  Is there runtime overhead as well?

In a slightly different approach, can we possibly get rid of some slab
caches, instead of merging them at boot time?  On my system I have 97
slab caches right now, ignoring the generic kmalloc() ones.  Of those,
28 are completely empty, 23 contain <=10 objects, 23 <=100 and 23
contain >100 objects.

It is fairly obvious to me that the highly populated slab caches are a
big win.  But is it worth it to have slab caches with a single object
inside?  Maybe some of these caches are populated for some systems.
But there could also be candidates for removal among them.
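
(For reference, a rough sketch of how such a breakdown can be produced
from /proc/slabinfo; it assumes the usual "name active_objs num_objs ..."
column order and does not filter out the generic kmalloc caches.)

#include <stdio.h>

int main(void)
{
        FILE *f = fopen("/proc/slabinfo", "r");
        char line[512], name[64];
        unsigned long active, total;
        int empty = 0, upto10 = 0, upto100 = 0, over100 = 0;

        if (!f) {
                perror("/proc/slabinfo");
                return 1;
        }
        while (fgets(line, sizeof(line), f)) {
                /* Header lines fail this parse and are skipped. */
                if (sscanf(line, "%63s %lu %lu", name, &active, &total) != 3)
                        continue;
                if (active == 0)
                        empty++;
                else if (active <= 10)
                        upto10++;
                else if (active <= 100)
                        upto100++;
                else
                        over100++;
        }
        fclose(f);

        printf("empty: %d  <=10: %d  <=100: %d  >100: %d\n",
               empty, upto10, upto100, over100);
        return 0;
}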

# active_objs num_objs name
0 0 dm-crypt_io
0 0 dm_io
0 0 dm_tio
0 0 ext3_xattr
0 0 fat_cache
0 0 fat_inode_cache
0 0 flow_cache
0 0 inet_peer_cache
0 0 ip_conntrack_expect
0 0 ip_mrt_cache
0 0 isofs_inode_cache
0 0 jbd_1k
0 0 jbd_4k
0 0 kiocb
0 0 kioctx
0 0 nfs_inode_cache
0 0 nfs_page
0 0 posix_timers_cache
0 0 request_sock_TCP
0 0 revoke_record
0 0 rpc_inode_cache
0 0 scsi_io_context
0 0 secpath_cache
0 0 skbuff_fclone_cache
0 0 tw_sock_TCP
0 0 udf_inode_cache
0 0 uhci_urb_priv
0 0 xfrm_dst_cache
1 169 dnotify_cache
1 30 arp_cache
1 7 mqueue_inode_cache
2 101 eventpoll_pwq
2 203 fasync_cache
2 254 revoke_table
2 30 eventpoll_epi
2 9 RAW
4 17 ip_conntrack
7 10 biovec-128
7 10 biovec-64
7 20 biovec-16
7 42 file_lock_cache
7 59 biovec-4
7 59 uid_cache
7 8 biovec-256
7 9 bdev_cache
8 127 inotify_event_cache
8 20 rpc_tasks
8 8 rpc_buffers
10 113 ip_fib_alias
10 113 ip_fib_hash
10 12 blkdev_queue
11 203 biovec-1
11 22 blkdev_requests
13 92 inotify_watch_cache
16 169 journal_handle
16 203 tcp_bind_bucket
16 72 journal_head
18 18 UDP
19 19 names_cache
19 28 TCP
22 30 mnt_cache
27 27 sigqueue
27 60 ip_dst_cache
32 32 sgpool-128
32 32 sgpool-32
32 32 sgpool-64
32 36 nfs_read_data
32 45 sgpool-16
32 60 sgpool-8
36 42 nfs_write_data
72 80 cfq_pool
74 127 blkdev_ioc
78 92 cfq_ioc_pool
94 94 pgd
107 113 fs_cache
108 108 mm_struct
108 140 files_cache
123 123 sighand_cache
125 140 UNIX
130 130 signal_cache
147 147 task_struct
154 174 idr_layer_cache
158 404 pid
190 190 sock_inode_cache
260 295 bio
273 273 proc_inode_cache
840 920 skbuff_head_cache
1234 1326 inode_cache
1507 1510 shmem_inode_cache
2871 3051 anon_vma
2910 3360 filp
5161 5292 sysfs_dir_cache
5762 6164 vm_area_struct
12056 19446 radix_tree_node
65776 151272 buffer_head
578304 578304 ext3_inode_cache
677490 677490 dentry_cache

Jörn

-- 
And spam is a useful source of entropy for /dev/random too!
-- Jasmine Strong
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SLUB: The unqueued Slab allocator

2007-02-24 Thread David Miller
From: Christoph Lameter <[EMAIL PROTECTED]>
Date: Sat, 24 Feb 2007 09:32:49 -0800 (PST)

> On Fri, 23 Feb 2007, David Miller wrote:
> 
> > I also agree with Andi in that merging could mess up how object type
> > local lifetimes help reduce fragmentation in object pools.
> 
> If that is a problem for particular object pools then we may be able to 
> except those from the merging.

If it is a problem, it's going to be a problem "in general"
and not for specific SLAB caches.

I think this is really a very unwise idea.  We have enough
fragmentation problems as it is.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SLUB: The unqueued Slab allocator

2007-02-24 Thread Christoph Lameter
On Sat, 24 Feb 2007, Jörn Engel wrote:

> How much of a gain is the merging anyway?  Once you start having
> explicit whitelists or blacklists of pools that can be merged, one can
> start to wonder if the result is worth the effort.

It eliminates 50% of the slab caches. Thus it reduces the management 
overhead by half.



Re: SLUB: The unqueued Slab allocator

2007-02-24 Thread Jörn Engel
On Sat, 24 February 2007 09:32:49 -0800, Christoph Lameter wrote:
> 
> If that is a problem for particular object pools then we may be able to 
> except those from the merging.

How much of a gain is the merging anyway?  Once you start having
explicit whitelists or blacklists of pools that can be merged, one can
start to wonder if the result is worth the effort.

Jörn

-- 
Joern's library part 6:
http://www.gzip.org/zlib/feldspar.html
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SLUB: The unqueued Slab allocator

2007-02-24 Thread Christoph Lameter
On Fri, 23 Feb 2007, David Miller wrote:

> > The general caches already merge lots of users depending on their sizes. 
> > So we already have the situation and we have tools to deal with it.
> 
> But this doesn't happen for things like biovecs, and that will
> make debugging painful.
> 
> If a crash happens because of a corrupted biovec-256 I want to know
> it was a biovec not some anonymous clone of kmalloc256.
> 
> Please provide at a minimum a way to turn the merging off.

Ok. It's currently a compile time option. Will make it possible to specify 
a boot option.
 
> I also agree with Andi in that merging could mess up how object type
> local lifetimes help reduce fragmentation in object pools.

If that is a problem for particular object pools then we may be able to 
except those from the merging.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SLUB: The unqueued Slab allocator

2007-02-24 Thread Christoph Lameter
On Fri, 23 Feb 2007, David Miller wrote:

  The general caches already merge lots of users depending on their sizes. 
  So we already have the situation and we have tools to deal with it.
 
 But this doesn't happen for things like biovecs, and that will
 make debugging painful.
 
 If a crash happens because of a corrupted biovec-256 I want to know
 it was a biovec not some anonymous clone of kmalloc256.
 
 Please provide at a minimum a way to turn the merging off.

Ok. Its currently a compile time option. Will make it possible to specify 
a boot option.
 
 I also agree with Andi in that merging could mess up how object type
 local lifetimes help reduce fragmentation in object pools.

If that is a problem for particular object pools then we may be able to 
except those from the merging.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SLUB: The unqueued Slab allocator

2007-02-24 Thread Jörn Engel
On Sat, 24 February 2007 09:32:49 -0800, Christoph Lameter wrote:
 
 If that is a problem for particular object pools then we may be able to 
 except those from the merging.

How much of a gain is the merging anyway?  Once you start having
explicit whitelists or blacklists of pools that can be merged, one can
start to wonder if the result is worth the effort.

Jörn

-- 
Joern's library part 6:
http://www.gzip.org/zlib/feldspar.html
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SLUB: The unqueued Slab allocator

2007-02-24 Thread Christoph Lameter
On Sat, 24 Feb 2007, Jörn Engel wrote:

 How much of a gain is the merging anyway?  Once you start having
 explicit whitelists or blacklists of pools that can be merged, one can
 start to wonder if the result is worth the effort.

It eliminates 50% of the slab caches. Thus it reduces the management 
overhead by half.



Re: SLUB: The unqueued Slab allocator

2007-02-24 Thread David Miller
From: Christoph Lameter <[EMAIL PROTECTED]>
Date: Sat, 24 Feb 2007 09:32:49 -0800 (PST)

> On Fri, 23 Feb 2007, David Miller wrote:
> 
> > I also agree with Andi in that merging could mess up how object type
> > local lifetimes help reduce fragmentation in object pools.
> 
> If that is a problem for particular object pools then we may be able to 
> except those from the merging.

If it is a problem, it's going to be a problem in general
and not for specific SLAB caches.

I think this is really a very unwise idea.  We have enough
fragmentation problems as it is.


Re: SLUB: The unqueued Slab allocator

2007-02-23 Thread David Miller
From: Christoph Lameter <[EMAIL PROTECTED]>
Date: Fri, 23 Feb 2007 21:47:36 -0800 (PST)

> On Sat, 24 Feb 2007, KAMEZAWA Hiroyuki wrote:
> 
> > From a viewpoint of a crash dump user, this merging will make crash dump
> > investigation very very very difficult.
> 
> The general caches already merge lots of users depending on their sizes. 
> So we already have the situation and we have tools to deal with it.

But this doesn't happen for things like biovecs, and that will
make debugging painful.

If a crash happens because of a corrupted biovec-256 I want to know
it was a biovec not some anonymous clone of kmalloc256.

Please provide at a minimum a way to turn the merging off.

I also agree with Andi in that merging could mess up how object type
local lifetimes help reduce fragmentation in object pools.



Re: SLUB: The unqueued Slab allocator

2007-02-23 Thread Christoph Lameter
On Sat, 24 Feb 2007, KAMEZAWA Hiroyuki wrote:

> From a viewpoint of a crash dump user, this merging will make crash dump
> investigation very very very difficult.

The general caches already merge lots of users depending on their sizes. 
So we already have the situation and we have tools to deal with it.


Re: SLUB: The unqueued Slab allocator

2007-02-23 Thread KAMEZAWA Hiroyuki
On Thu, 22 Feb 2007 10:42:23 -0800 (PST)
Christoph Lameter <[EMAIL PROTECTED]> wrote:

> > > G. Slab merging
> > > 
> > >We often have slab caches with similar parameters. SLUB detects those
> > >on bootup and merges them into the corresponding general caches. This
> > >leads to more effective memory use.
> > 
> > Did you do any tests on what that does to long term memory fragmentation?
> > It is against the "object of same type have similar lifetime and should
> > be clustered together" theory at least.
> 
> I have done no tests in that regard and we would have to assess the impact 
> that the merging has to overall system behavior.
> 
From a viewpoint of a crash dump user, this merging will make crash dump
investigation very very very difficult.
So please avoid this merging if the benefit is not big.

-Kame



Re: SLUB: The unqueued Slab allocator

2007-02-22 Thread Christoph Lameter
On Fri, 23 Feb 2007, Andi Kleen wrote:

> If you don't cache constructed but free objects then there is no cache
> advantage of constructors/destructors and they would be useless.

SLUB caches those objects as long as they are part of a partially 
allocated slab. If all objects in the slab are freed then the whole slab 
will be freed. SLUB does not keep queues of freed slabs.
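
In other words, the free path conceptually looks like this (a sketch with
helper names only approximating the posted patch):

/*
 * Sketch of the idea: a freed object stays constructed on its slab's
 * freelist; only when the slab holds no allocated objects at all is the
 * page handed back to the page allocator.  No queue of empty slabs exists.
 * (The transition of a previously full slab onto the partial list is
 * omitted here.)
 */
static void slab_free_sketch(struct kmem_cache *s, struct page *page,
				void *object)
{
	slab_lock(page);
	set_freepointer(s, object, page->freelist);	/* push onto the slab freelist */
	page->freelist = object;
	if (--page->inuse == 0) {
		remove_partial(s, page);	/* no longer on the partial list */
		slab_unlock(page);
		discard_slab(s, page);		/* return the page itself */
		return;
	}
	slab_unlock(page);
}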



Re: SLUB: The unqueued Slab allocator

2007-02-22 Thread Andi Kleen
On Thu, Feb 22, 2007 at 10:42:23AM -0800, Christoph Lameter wrote:
> On Thu, 22 Feb 2007, Andi Kleen wrote:
> 
> > >SLUB does not need a cache reaper for UP systems.
> > 
> > This means constructors/destructors are becoming worthless? 
> > Can you describe your rationale why you think they don't make
> > sense on UP?
> 
> Cache reaping has nothing to do with constructors and destructors. SLUB 
> fully supports constructors and destructors.

If you don't cache constructed but free objects then there is no cache
advantage of constructors/destructors and they would be useless.

-Andi


Re: SLUB: The unqueued Slab allocator

2007-02-22 Thread Christoph Lameter
On Thu, 22 Feb 2007, Andi Kleen wrote:

> >SLUB does not need a cache reaper for UP systems.
> 
> This means constructors/destructors are becoming worthless? 
> Can you describe your rationale why you think they don't make
> sense on UP?

Cache reaping has nothing to do with constructors and destructors. SLUB 
fully supports constructors and destructors.

> > G. Slab merging
> > 
> >We often have slab caches with similar parameters. SLUB detects those
> >on bootup and merges them into the corresponding general caches. This
> >leads to more effective memory use.
> 
> Did you do any tests on what that does to long term memory fragmentation?
> It is against the "object of same type have similar lifetime and should
> be clustered together" theory at least.

I have done no tests in that regard and we would have to assess the impact 
that the merging has to overall system behavior.



Re: SLUB: The unqueued Slab allocator

2007-02-22 Thread Andi Kleen
Christoph Lameter <[EMAIL PROTECTED]> writes:

> This is a new slab allocator which was motivated by the complexity of the
> existing code in mm/slab.c. It attempts to address a variety of concerns
> with the existing implementation.

Thanks for doing that work. It certainly was long overdue.

> D. SLAB has a complex cache reaper
> 
>SLUB does not need a cache reaper for UP systems.

This means constructors/destructors are becoming worthless? 
Can you describe your rationale why you think they don't make
sense on UP?

> G. Slab merging
> 
>We often have slab caches with similar parameters. SLUB detects those
>on bootup and merges them into the corresponding general caches. This
>leads to more effective memory use.

Did you do any tests on what that does to long term memory fragmentation?
It is against the "object of same type have similar lifetime and should
be clustered together" theory at least.

-Andi


Re: SLUB: The unqueued Slab allocator

2007-02-22 Thread Christoph Lameter
On Thu, 22 Feb 2007, David Miller wrote:

> All of that logic needs to be protected by CONFIG_ZONE_DMA too.

Right. Will fix that in the next release.



Re: SLUB: The unqueued Slab allocator

2007-02-22 Thread Christoph Lameter
On Thu, 22 Feb 2007, Peter Zijlstra wrote:

> On Wed, 2007-02-21 at 23:00 -0800, Christoph Lameter wrote:
> 
> > +/*
> > + * Lock order:
> > + *   1. slab_lock(page)
> > + *   2. slab->list_lock
> > + *
> 
> That seems to contradict this:

This is a trylock. If it fails then we can compensate by allocating
a new slab.
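
The caller side is then simply (a sketch, with names only approximating
the posted code):

/*
 * Sketch: a contended partial slab is skipped by the trylock; if no
 * partial slab can be taken we just grow the cache with a fresh slab
 * instead of spinning on the lock.
 */
static struct page *get_slab_page(struct kmem_cache *s, gfp_t flags, int node)
{
	struct page *page;

	page = get_partial(s, flags, node);	/* trylock based, may return NULL */
	if (!page)
		page = new_slab(s, flags, node);	/* compensate with a new slab */
	return page;
}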


Re: SLUB: The unqueued Slab allocator

2007-02-22 Thread Christoph Lameter
On Thu, 22 Feb 2007, Pekka Enberg wrote:

> On 2/22/07, Christoph Lameter <[EMAIL PROTECTED]> wrote:
> > This is a new slab allocator which was motivated by the complexity of the
> > existing code in mm/slab.c. It attempts to address a variety of concerns
> > with the existing implementation.
> 
> So do you want to add a new allocator or replace slab?

Add. The performance and quality are not comparable to SLAB at this point.

> On 2/22/07, Christoph Lameter <[EMAIL PROTECTED]> wrote:
> > B. Storage overhead of object queues
> 
> Does this make sense for non-NUMA too? If not, can we disable the
> queues for NUMA in current slab?

Given the locking scheme in the current slab you cannot do that. Otherwise
there will be a single lock taken for every operation, limiting performance.

> On 2/22/07, Christoph Lameter <[EMAIL PROTECTED]> wrote:
> > C. SLAB metadata overhead
> 
> Can be done for the current slab code too, no?

The per slab metadata of the SLAB does not fit into the page_struct. 
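
For reference, the per-slab state SLUB needs is small enough to be overlaid
on fields that already exist in struct page. Conceptually (field names are
illustrative, not the exact layout used by the patch):

/*
 * Conceptual view only: what SLUB must track for each slab.  In the patch
 * this state lives in the existing struct page, so no management header
 * is placed inside the slab itself.
 */
struct slub_page_metadata {
	void *freelist;			/* first free object in this slab */
	unsigned int inuse;		/* objects currently allocated */
	unsigned int offset;		/* where the free pointer sits in an object */
	struct kmem_cache *slab;	/* owning cache */
	struct list_head lru;		/* linkage on the partial list */
};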


Re: SLUB: The unqueued Slab allocator

2007-02-22 Thread Pekka Enberg

Hi Christoph,

On 2/22/07, Christoph Lameter <[EMAIL PROTECTED]> wrote:
> This is a new slab allocator which was motivated by the complexity of the
> existing code in mm/slab.c. It attempts to address a variety of concerns
> with the existing implementation.

So do you want to add a new allocator or replace slab?

On 2/22/07, Christoph Lameter <[EMAIL PROTECTED]> wrote:
> B. Storage overhead of object queues

Does this make sense for non-NUMA too? If not, can we disable the
queues for NUMA in current slab?

On 2/22/07, Christoph Lameter <[EMAIL PROTECTED]> wrote:
> C. SLAB metadata overhead

Can be done for the current slab code too, no?

Pekka


Re: SLUB: The unqueued Slab allocator

2007-02-22 Thread David Miller
From: Christoph Lameter <[EMAIL PROTECTED]>
Date: Wed, 21 Feb 2007 23:00:30 -0800 (PST)

> +#ifdef CONFIG_ZONE_DMA
> +static struct kmem_cache *kmalloc_caches_dma[KMALLOC_NR_CACHES];
> +#endif

Therefore.

> +static struct kmem_cache *get_slab(size_t size, gfp_t flags)
> +{
 ...
> + s = kmalloc_caches_dma[index];
> + if (s)
> + return s;
> +
> + /* Dynamically create dma cache */
> + x = kmalloc(sizeof(struct kmem_cache), flags & ~(__GFP_DMA));
> +
> + if (!x)
> + panic("Unable to allocate memory for dma cache\n");
> +
> +#ifdef KMALLOC_EXTRA
> + if (index <= KMALLOC_SHIFT_HIGH - KMALLOC_SHIFT_LOW)
> +#endif
> + realsize = 1 << index;
> +#ifdef KMALLOC_EXTRA
> + else if (index == KMALLOC_EXTRAS)
> + realsize = 96;
> + else
> + realsize = 192;
> +#endif
> +
> + s = create_kmalloc_cache(x, "kmalloc_dma", realsize);
> + kmalloc_caches_dma[index] = s;
> + return s;
> +}

All of that logic needs to be protected by CONFIG_ZONE_DMA too.

I noticed this due to a build failure on sparc64 with this patch.
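
Something along these lines would compile the DMA path out entirely when
ZONE_DMA does not exist (a simplified sketch of the quoted get_slab();
kmalloc_index() and get_dma_cache() are stand-ins, not the patch's code):

static struct kmem_cache *get_slab(size_t size, gfp_t flags)
{
	int index = kmalloc_index(size);

#ifdef CONFIG_ZONE_DMA
	/* Dynamic DMA cache creation, as quoted above */
	if (unlikely(flags & __GFP_DMA))
		return get_dma_cache(index, flags);
#endif
	return kmalloc_caches[index];
}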


Re: SLUB: The unqueued Slab allocator

2007-02-22 Thread Peter Zijlstra
On Wed, 2007-02-21 at 23:00 -0800, Christoph Lameter wrote:

> +/*
> + * Lock order:
> + *   1. slab_lock(page)
> + *   2. slab->list_lock
> + *

That seems to contradict this:

> +/*
> + * Lock page and remove it from the partial list
> + *
> + * Must hold list_lock
> + */
> +static __always_inline int lock_and_del_slab(struct kmem_cache *s,
> + struct page *page)
> +{
> + if (slab_trylock(page)) {
> + list_del(&page->lru);
> + s->nr_partial--;
> + return 1;
> + }
> + return 0;
> +}
> +
> +/*
> + * Get a partial page, lock it and return it.
> + */
> +#ifdef CONFIG_NUMA
> +static struct page *get_partial(struct kmem_cache *s, gfp_t flags, int node)
> +{
> + struct page *page;
> + int searchnode = (node == -1) ? numa_node_id() : node;
> +
> + if (!s->nr_partial)
> + return NULL;
> +
> + spin_lock(&s->list_lock);
> + /*
> +  * Search for slab on the right node
> +  */
> + list_for_each_entry(page, &s->partial, lru)
> + if (likely(page_to_nid(page) == searchnode) &&
> + lock_and_del_slab(s, page))
> + goto out;
> +
> + if (likely(!(flags & __GFP_THISNODE))) {
> + /*
> +  * We can fall back to any other node in order to
> +  * reduce the size of the partial list.
> +  */
> + list_for_each_entry(page, &s->partial, lru)
> + if (likely(lock_and_del_slab(s, page)))
> + goto out;
> + }
> +
> + /* Nothing found */
> + page = NULL;
> +out:
> + spin_unlock(&s->list_lock);
> + return page;
> +}
> +#else
> +static struct page *get_partial(struct kmem_cache *s, gfp_t flags, int node)
> +{
> + struct page *page;
> +
> + /*
> +  * Racy check. If we mistakenly see no partial slabs then we
> +  * just allocate an empty slab.
> +  */
> + if (!s->nr_partial)
> + return NULL;
> +
> + spin_lock(&s->list_lock);
> + list_for_each_entry(page, &s->partial, lru)
> + if (likely(lock_and_del_slab(s, page)))
> + goto out;
> +
> + /* No slab or all slabs busy */
> + page = NULL;
> +out:
> + spin_unlock(&s->list_lock);
> + return page;
> +}
> +#endif



SLUB: The unqueued Slab allocator

2007-02-21 Thread Christoph Lameter
This is a new slab allocator which was motivated by the complexity of the
existing code in mm/slab.c. It attempts to address a variety of concerns 
with the existing implementation.

A. Management of object queues

   A particular concern was the complex management of the numerous object
   queues in SLAB. SLUB has no such queues. Instead we dedicate a slab to
   each allocating CPU and use objects from that slab directly instead of
   taking them from queues.

B. Storage overhead of object queues

   SLAB object queues exist per node and per CPU. The alien cache queue even
   has a queue array that contains a queue for each processor on each
   node. For very large systems the number of queues and the number of 
   objects that may be caught in those queues grows exponentially. On our
   systems with 1k nodes / processors we have several gigabytes just tied up
   for storing references to objects for those queues. This does not include
   the objects that could be on those queues. One fears that the whole
   memory of the machine could one day be consumed by those queues.

C. SLAB metadata overhead

   SLAB has overhead at the beginning of each slab. This means that data
   cannot be naturally aligned at the beginning of a slab block. SLUB keeps
   all metadata in the corresponding page_struct. Objects can be naturally
   aligned in the slab. F.e. a 128 byte object will be aligned at 128 byte
   boundaries and can fit tightly into a 4k page with no bytes left over
   (a worked example follows these points). SLAB cannot do this.

D. SLAB has a complex cache reaper

   SLUB does not need a cache reaper for UP systems. On SMP systems
   the per cpu slab may be pushed back into partial list but that
   operation is simple and does not require an iteration over a list
   of objects. SLAB expires per cpu, shared and alien object queues 
   during cache reaping which may cause strange holdoffs.

E. SLAB has complex NUMA policy layer support

   SLUB pushes NUMA policy handling into the page allocator. This means that
   allocation is coarser (SLUB does interleave on a page level) but that
   situation was also present before 2.6.13. SLAB's application of
   policies to individual slab objects is certainly a performance
   concern due to the frequent references to
   memory policies which may lead a sequence of objects to come from
   one node after another. SLUB will get a slab full of objects
   from one node and then will switch to the next.

F. Reduction of the size of partial slab lists

   SLAB has per node partial lists. This means that over time a large
   number of partial slabs may accumulate on those lists. These can
   only be reused if allocations occur on specific nodes. SLUB has a global
   pool of partial slabs and will consume slabs from that pool to
   decrease fragmentation.

G. Tunables

   SLAB has sophisticated tuning abilities for each slab cache. One can
   manipulate the queue sizes in detail. However, filling the queues still
   requires the use of a spinlock to check out slabs. SLUB has a global
   parameter (min_slab_order) for tuning. Increasing the minimum slab
   order can decrease the locking overhead. The bigger the slab order, the
   fewer movements of pages between the per-CPU and partial lists and the
   better SLUB will scale.

G. Slab merging

   We often have slab caches with similar parameters. SLUB detects those
   on bootup and merges them into the corresponding general caches. This
   leads to more effective memory use.
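
As a worked example of point C above (illustrative numbers only, not output
of this patch):

/*
 * Point C, worked out: with no in-slab management header a 4k page packs
 * 128 byte objects perfectly and the first object keeps its natural
 * alignment.
 */
unsigned int objects_per_page = 4096 / 128;	/* 32 objects, 0 bytes wasted */
unsigned int first_object_offset = 0;		/* naturally 128 byte aligned */
/*
 * With SLAB's management structure at the start of the page the first
 * object cannot sit at offset 0, so alignment padding or lost slots result.
 */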

The patch here is only the core portion. There are various add-ons
that may become ready later when this one has matured a bit. SLUB should
be fine for UP and SMP. No NUMA optimizations have been done so far, so
it works but does not yet scale to high processor and node counts.

To use SLUB: Apply this patch and then select SLUB as the default slab
allocator. The output of /proc/slabinfo will then change. Here is a
sample (this is the UP/SMP format; the NUMA display will show on which nodes
the slabs were allocated):

slubinfo - version: 1.0
# name            objects order objsize slabs/partial/cpu flags
radix_tree_node      5574     0     560           797/0/1 CP
bdev_cache              5     0     768             2/1/1 CSrPa
sysfs_dir_cache      5946     0      80           117/0/1
inode_cache          2690     0     536           386/3/1 CSrP
dentry_cache         7735     0     192           369/1/1 SrP
idr_layer_cache        79     0     536            12/0/1 C
buffer_head          5427     0     112           151/0/1 CSrP
mm_struct              37     1     832             6/5/1 Pa
vm_area_struct       1734     0     168            73/3/1 P
files_cache            37     0     640             8/6/1 Pa
signal_cache           63     0     640            12/4/1 Pa
sighand_cache          63     2    2112            11/4/1 CRPa
task_struct            75     2    1728            11/6/1 P
anon_vma              590     0      24             4/3/1 CRP
kmalloc-192           424     0     192            21/0/1
kmalloc-96           1150     0      96            28/3/1
kmalloc-262144
