[SLUB 0/2] SLUB: The unqueued slab allocator V6

2007-03-31  Christoph Lameter
[PATCH] SLUB The unqueued slab allocator v6

Note that the definition of the return type of ksize() is currently
different between mm and Linus' tree. Patch is conforming to mm.
This patch also needs sprint_symbol() support from mm.

V5->V6:

- Straighten out various coding issues, among other things to make the hot path
  clearer in slab_alloc and slab_free. This adds more gotos. Sigh.

- Detailed alloc / free tracking including pid, cpu and time of alloc / free
  if SLAB_STORE_USER is enabled or slub_debug=U is specified on boot (a
  sketch of the per-object tracking record follows this list).

- sysfs support via /sys/slab. Drop /proc/slubinfo support.
  Include slabinfo tool that produces an output similar to what
  /proc/slabinfo does. Tool needs to be made more sophisticated
  to allow control of various slub options at runtime. Currently
  reports total slab sizes, slab fragmentation and slab effectiveness
  (actual object use vs. slab space use).

- Runtime debug option changes per slab via /sys/slab/<slabcache>.
  All slab debug options can be configured via sysfs provided that
  no objects have been allocated yet.

- Deal with i386 use of slab page structs. Main patch disables
  slub for i386 (CONFIG_ARCH_USES_SLAB_PAGE_STRUCT). Then a special
  patch removes the page sized slabs and removes that setting.
  See the caveats in that patch for further details.
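
As referenced above, the per-object tracking record keeps the caller address,
cpu, pid and a timestamp for the most recent alloc and the most recent free.
A minimal userspace sketch of that idea (the field names and the set_track()
helper here are illustrative, not lifted from the patch):

#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Illustrative per-object tracking record; the allocator keeps one such
 * record for the last allocation and one for the last free. */
struct track {
	unsigned long addr;	/* address of the alloc/free caller */
	int cpu;		/* cpu the operation ran on */
	int pid;		/* task that performed the operation */
	unsigned long when;	/* timestamp (jiffies in the kernel) */
};

static void set_track(struct track *t, unsigned long caller)
{
	t->addr = caller;
	t->cpu = 0;		/* sched_getcpu() where available */
	t->pid = getpid();
	t->when = (unsigned long)time(NULL);
}

int main(void)
{
	struct track alloc_track;

	set_track(&alloc_track, (unsigned long)&main);
	printf("allocated by pid %d on cpu %d at %lu, caller %#lx\n",
	       alloc_track.pid, alloc_track.cpu, alloc_track.when,
	       alloc_track.addr);
	return 0;
}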

V4->V5:

- Single object slabs only for slabs > slub_max_order; otherwise generate
  sufficient objects to avoid frequent use of the page allocator. This is
  necessary to compensate for fragmentation caused by frequent uses of
  the page allocator. We exempt slabs of PAGE_SIZE from this rule since
  multi object slabs require the use of fields that are in use on i386 and
  x86_64. See the quicklist patchset for a way to fix that issue
  and a patch to get rid of the PAGE_SIZE special casing. (A sketch of the
  order selection heuristic follows this list.)

- Drop pass through to page allocator due to page allocator fragmenting
  memory. The buffering through large order allocations is done in SLUB.
  Infrequent larger order allocations cause less fragmentation
  than frequent small order allocations.

- We need to update object sizes when merging slabs, otherwise kzalloc
  will not initialize the full object (this caused failures on
  various platforms).

- Padding checks before redzone checks so that we get messages about
  the corruption of the whole slab and not just about a single object.
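
To make the order selection rule above concrete, here is a small standalone
sketch of the heuristic (PAGE_SIZE, MIN_OBJECTS and the exact search strategy
are illustrative assumptions, not the code from the patch):

#include <stdio.h>

#define PAGE_SIZE	4096UL
#define MIN_OBJECTS	4	/* illustrative lower bound on objects per slab */

/* Pick the smallest order whose slab holds at least MIN_OBJECTS objects;
 * if that would exceed max_order, fall back to a single-object slab of
 * the smallest order that fits the object. */
static int calc_order(unsigned long size, int max_order)
{
	int order;

	for (order = 0; order <= max_order; order++)
		if ((PAGE_SIZE << order) / size >= MIN_OBJECTS)
			return order;

	for (order = 0; (PAGE_SIZE << order) < size; order++)
		;
	return order;
}

int main(void)
{
	printf("order for   256 byte objects: %d\n", calc_order(256, 3));
	printf("order for  8192 byte objects: %d\n", calc_order(8192, 3));
	printf("order for 40000 byte objects: %d\n", calc_order(40000, 3));
	return 0;
}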

V3->V4:
- Rename /proc/slabinfo to /proc/slubinfo. We have a different format after
  all.
- More bug fixes and stabilization of diagnostic functions. This seems
  to be finally something that works wherever we test it.
- Serialize kmem_cache_create and kmem_cache_destroy via slub_lock (Adrian's
  idea).
- Add two new modifications (separate patches) to guarantee
  a minimum number of objects per slab and to pass through large
  allocations.

V2->V3:
- Debugging and diagnostic support. This is runtime enabled and not compile
  time enabled. Runtime debugging can be controlled via kernel boot options
  on an individual slab cache basis or globally.
- Slab Trace support (For individual slab caches).
- Resiliency support: If basic sanity checks are enabled (e.g. via the F
  boot option flag) then SLUB will do its best to perform diagnostics and
  then continue (i.e. mark corrupted objects as used).
- Fix up numerous issues including a clash of SLUB's use of page
  flags with the i386 arch's use for pmds and pgds (which are managed
  as slab caches, sigh).
- Dynamic per CPU array sizing.
- Explain SLUB slabcache flags

V1->V2:
- Fix up various issues. Tested on i386 UP, X86_64 SMP, ia64 NUMA.
- Provide NUMA support by splitting partial lists per node.
- Better Slab cache merge support (now at around 50% of slabs)
- List slab cache aliases if slab caches are merged.
- Updated descriptions of /proc/slabinfo output.

This is a new slab allocator which was motivated by the complexity of the
existing code in mm/slab.c. It attempts to address a variety of concerns
with the existing implementation.

A. Management of object queues

   A particular concern was the complex management of the numerous object
   queues in SLAB. SLUB has no such queues. Instead we dedicate a slab for
   each allocating CPU and use objects from a slab directly instead of
   queueing them up.
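
   A rough userspace model of that design -- one active slab per CPU, with
   allocation popping objects straight off that slab's own free list (the
   names and fields here are illustrative, not the patch's data structures):

#include <stdlib.h>

/* Conceptual model only: each CPU owns one active slab and allocates by
 * popping the next object off that slab's free list. There is no per-CPU
 * or per-node queue of detached objects to manage. */
struct slab {
	void *freelist;		/* first free object; each free object holds
				 * a pointer to the next free object */
	unsigned int inuse;	/* objects handed out from this slab */
};

struct cpu_cache {
	struct slab *active;	/* slab this CPU currently allocates from */
};

static void *slab_alloc(struct cpu_cache *c)
{
	struct slab *s = c->active;
	void *object;

	if (!s || !s->freelist)
		return NULL;	/* slow path: take a partial or new slab */

	object = s->freelist;
	s->freelist = *(void **)object;	/* advance to the next free object */
	s->inuse++;
	return object;
}

int main(void)
{
	enum { OBJ_SIZE = 64, NR_OBJ = 8 };
	char *mem = malloc(OBJ_SIZE * NR_OBJ);
	struct slab s = { .freelist = mem, .inuse = 0 };
	struct cpu_cache c = { .active = &s };
	int i;

	if (!mem)
		return 1;

	/* Thread the objects into a free list; the last one ends it. */
	for (i = 0; i < NR_OBJ - 1; i++)
		*(void **)(mem + i * OBJ_SIZE) = mem + (i + 1) * OBJ_SIZE;
	*(void **)(mem + (NR_OBJ - 1) * OBJ_SIZE) = NULL;

	void *obj = slab_alloc(&c);	/* returns the first object */

	free(mem);
	return obj == mem ? 0 : 1;
}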

B. Storage overhead of object queues

   SLAB object queues exist per node and per CPU. The alien cache queue even
   has a queue array that contains a queue for each processor on each
   node. For very large systems the number of queues and the number of
   objects that may be caught in those queues grows exponentially. On our
   systems with 1k nodes / processors we have several gigabytes just tied up
   for storing references to objects in those queues. This does not include
   the objects that could be on those queues. One fears that the whole
   memory of the machine could one day be consumed by those queues.
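
   As a back-of-the-envelope illustration of that scaling (the cache count,
   queue length and pointer size below are illustrative assumptions, not
   measurements):

#include <stdio.h>

int main(void)
{
	/* All values are illustrative assumptions for a 1k node machine. */
	unsigned long long nodes   = 1024;
	unsigned long long cpus    = 1024;
	unsigned long long caches  = 100;	/* slab caches in the system */
	unsigned long long entries = 8;		/* object slots per alien queue */
	unsigned long long ptr     = 8;		/* bytes per object reference */

	/* Roughly one alien queue per (cache, node, cpu) combination. */
	unsigned long long bytes = caches * nodes * cpus * entries * ptr;

	printf("~%llu GiB of queue space before a single object is queued\n",
	       bytes >> 30);
	return 0;
}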

C. SLAB meta data overhead

   SLAB has overhead at the beginning of each slab. This means that data
   cannot be naturally aligned at the beginning of a slab block. SLUB keeps
   all meta data in the corresponding page struct, so objects can be
   naturally aligned right at the start of a slab.
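
   A small illustration of the packing difference (the 64 byte on-slab
   header is an assumed figure; SLAB's actual management overhead varies
   with configuration):

#include <stdio.h>

int main(void)
{
	unsigned long page   = 4096;
	unsigned long object = 512;	/* object size that could align naturally */
	unsigned long header = 64;	/* assumed on-slab management data */

	/* With management data at the start of the slab the first object
	 * begins at offset 64, so no object sits on a 512 byte boundary
	 * and one object per page is lost. */
	printf("on-slab meta data:  %lu objects per page\n",
	       (page - header) / object);

	/* With all meta data kept in the page struct the whole page holds
	 * objects and every object is naturally aligned. */
	printf("off-slab meta data: %lu objects per page\n",
	       page / object);
	return 0;
}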
