[SLUB 0/2] SLUB: The unqueued slab allocator V6
[PATCH] SLUB The unqueued slab allocator v6

Note that the definition of the return type of ksize() currently differs
between mm and Linus' tree. The patch conforms to mm. This patch also
needs sprint_symbol() support from mm.

V5->V6:
- Straighten out various coding issues, among other things to make the
  hot path clearer in slab_alloc and slab_free. This adds more gotos.
  Sigh.
- Detailed alloc / free tracking, including pid, cpu and time of alloc /
  free, if SLAB_STORE_USER is enabled or slub_debug=U is specified on
  boot (a sketch of such a tracking record follows this changelog).
- sysfs support via /sys/slab. Drop /proc/slubinfo support. Include a
  slabinfo tool that produces output similar to what /proc/slabinfo
  does. The tool needs to be made more sophisticated to allow control of
  various slub options at runtime. Currently it reports total slab
  sizes, slab fragmentation and slab effectiveness (actual object use
  vs. slab space use).
- Runtime debug option changes per slab via /sys/slab/<slabcache>. All
  slab debug options can be configured via sysfs provided that no
  objects have been allocated yet.
- Deal with i386's use of slab page structs. The main patch disables
  slub for i386 (CONFIG_ARCH_USES_SLAB_PAGE_STRUCT). A special patch
  then removes the page sized slabs and removes that setting. See the
  caveats in that patch for further details.

V4->V5:
- Single object slabs only for slabs > slub_max_order; otherwise
  generate sufficient objects to avoid frequent use of the page
  allocator. This is necessary to compensate for fragmentation caused by
  frequent uses of the page allocator. We exempt slabs of PAGE_SIZE from
  this rule since multi object slabs require uses of fields that are in
  use on i386 and x86_64. See the quicklist patchset for a way to fix
  that issue and a patch to get rid of the PAGE_SIZE special casing.
- Drop the pass through to the page allocator, since the page allocator
  fragments memory. The buffering through large order allocations is
  done in SLUB. Infrequent larger order allocations cause less
  fragmentation than frequent small order allocations.
- We need to update object sizes when merging slabs; otherwise kzalloc
  will not initialize the full object (this caused the failure on
  various platforms).
- Padding checks before redzone checks, so that we get messages about
  corruption of the whole slab and not about a single object.

V3->V4:
- Rename /proc/slabinfo to /proc/slubinfo. We have a different format
  after all.
- More bug fixes and stabilization of diagnostic functions. This finally
  seems to be something that works wherever we test it.
- Serialize kmem_cache_create and kmem_cache_destroy via slub_lock
  (Adrian's idea).
- Add two new modifications (separate patches) to guarantee a minimum
  number of objects per slab and to pass through large allocations.

V2->V3:
- Debugging and diagnostic support. This is runtime enabled, not compile
  time enabled. Runtime debugging can be controlled via kernel boot
  options on an individual slab cache basis or globally.
- Slab trace support (for individual slab caches).
- Resiliency support: if basic sanity checks are enabled (f.e. via the F
  boot option) then SLUB will do its best to perform diagnostics and
  then continue (i.e. mark corrupted objects as used).
- Fix up numerous issues, including a clash between SLUB's use of page
  flags and the i386 arch's use for pmds and pgds (which are managed as
  slab caches, sigh).
- Dynamic per CPU array sizing.
- Explain SLUB slabcache flags.

V1->V2:
- Fix up various issues. Tested on i386 UP, x86_64 SMP, ia64 NUMA.
- Provide NUMA support by splitting partial lists per node.
- Better slab cache merge support (now at around 50% of slabs).
- List slab cache aliases if slab caches are merged.
- Updated descriptions of /proc/slabinfo output.
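For the detailed alloc / free tracking above, the per-object metadata is
small. As a rough sketch only (the field names here are illustrative
assumptions, not necessarily the patch's actual layout), with
SLAB_STORE_USER each object would carry two such records, one for the
last allocation and one for the last free:

/*
 * Illustrative tracking record for slub_debug=U / SLAB_STORE_USER.
 * One record for the last alloc, one for the last free.
 */
struct track {
	void *addr;          /* caller address, for sprint_symbol() */
	int cpu;             /* cpu on which the alloc/free happened */
	int pid;             /* pid of the task doing the alloc/free */
	unsigned long when;  /* jiffies at the time of alloc/free */
};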
This is a new slab allocator which was motivated by the complexity of
the existing code in mm/slab.c. It attempts to address a variety of
concerns with the existing implementation.

A. Management of object queues

A particular concern was the complex management of the numerous object
queues in SLAB. SLUB has no such queues. Instead we dedicate a slab to
each allocating CPU and use objects from that slab directly instead of
queueing them up (the fast path is sketched after these points).

B. Storage overhead of object queues

SLAB object queues exist per node, per CPU. The alien cache queue even
has a queue array that contains a queue for each processor on each node.
For very large systems the number of queues and the number of objects
that may be caught in those queues grows exponentially. On our systems
with 1k nodes / processors we have several gigabytes just tied up for
storing references to objects in those queues. This does not include the
objects that could be on those queues. One fears that the whole memory
of the machine could one day be consumed by those queues.

C. SLAB meta data overhead

SLAB has overhead at the beginning of each slab. This means that data
cannot be naturally aligned at the beginning of a slab block.
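The queueless fast path in point A amounts to popping the next free
object off the current CPU slab's freelist. A minimal sketch, with
assumed names and with locking and the slow path (taking a partial or
new slab) omitted:

/* Minimal illustrative type -- not the kernel's real definition. */
struct kmem_cache_cpu {
	void *freelist;      /* first free object in the per CPU slab */
};

/*
 * Queueless allocation as described in A: free objects are chained
 * through their own first word, so no external queue is needed.
 */
static void *slab_alloc_sketch(struct kmem_cache_cpu *c)
{
	void *object = c->freelist;

	if (object)
		c->freelist = *(void **)object;  /* pop one object */
	return object;  /* NULL => caller must take the slow path */
}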
Re: [SLUB 0/3] SLUB: The unqueued slab allocator V5
On Sat, 10 Mar 2007, Andrew Morton wrote:

> Is this safe to think about applying yet?

It's safe. By default kernels will be built with SLAB. SLUB only becomes
a selectable alternative. It should not become the primary slab
allocator until we know that it is really superior overall and have
thoroughly tested it in a variety of workloads.

> We lost the leak detector feature.

There will be numerous small things that will have to be addressed.
There is also some minor work to be done for tracking callers better.

> It might be nice to create synonyms for PageActive, PageReferenced and
> PageError, to make things clearer in the slub core. At the expense of
> making things less clear globally. Am unsure.

I have been back and forth on doing that. They are somewhat similar in
what they mean for SLUB. But creating synonyms may be confusing to those
checking how page flags are being used.
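Such synonyms would just be thin wrappers over the existing flag tests.
What that could look like, purely for illustration (the mapping of SLUB
states to page flags shown here is an assumption, not necessarily what
the patch does):

/*
 * Hypothetical synonyms of the kind discussed above.  Wrapping the
 * generic page flag accessors keeps the slub core readable without
 * consuming new page flag bits.
 */
#define SlabFrozen(page)       PageActive(page)
#define SetSlabFrozen(page)    SetPageActive(page)
#define SlabDebug(page)        PageError(page)
#define SetSlabDebug(page)     SetPageError(page)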
Re: [SLUB 0/3] SLUB: The unqueued slab allocator V5
Is this safe to think about applying yet?

We lost the leak detector feature.

It might be nice to create synonyms for PageActive, PageReferenced and
PageError, to make things clearer in the slub core. At the expense of
making things less clear globally. Am unsure.
[SLUB 0/3] SLUB: The unqueued slab allocator V5
[PATCH] SLUB The unqueued slab allocator v4

V4->V5:
- Single object slabs only for slabs > slub_max_order; otherwise
  generate sufficient objects to avoid frequent use of the page
  allocator. This is necessary to compensate for fragmentation caused by
  frequent uses of the page allocator. We exempt slabs of PAGE_SIZE from
  this rule since multi object slabs require uses of fields that are in
  use on i386 and x86_64. See the quicklist patchset for a way to fix
  that issue and a patch to get rid of the PAGE_SIZE special casing.
- Drop the pass through to the page allocator, since the page allocator
  fragments memory. The buffering through large order allocations is
  done in SLUB. Infrequent larger order allocations cause less
  fragmentation than frequent small order allocations.
- We need to update object sizes when merging slabs; otherwise kzalloc
  will not initialize the full object (this caused the failure on
  various platforms).
- Padding checks before redzone checks, so that we get messages about
  corruption of the whole slab and not about a single object.

Note that SLUB will warn on zero sized allocations. SLAB just allocates
some memory. So some traces from the usb subsystem etc. should be
expected.

Note that the definition of the return type of ksize() currently differs
between mm and Linus' tree. The patch conforms to mm.

V3->V4:
- Rename /proc/slabinfo to /proc/slubinfo. We have a different format
  after all.
- More bug fixes and stabilization of diagnostic functions. This finally
  seems to be something that works wherever we test it.
- Serialize kmem_cache_create and kmem_cache_destroy via slub_lock
  (Adrian's idea).
- Add two new modifications (separate patches) to guarantee a minimum
  number of objects per slab and to pass through large allocations.

V2->V3:
- Debugging and diagnostic support. This is runtime enabled, not compile
  time enabled. Runtime debugging can be controlled via kernel boot
  options on an individual slab cache basis or globally.
- Slab trace support (for individual slab caches).
- Resiliency support: if basic sanity checks are enabled (f.e. via the F
  boot option) then SLUB will do its best to perform diagnostics and
  then continue (i.e. mark corrupted objects as used).
- Fix up numerous issues, including a clash between SLUB's use of page
  flags and the i386 arch's use for pmds and pgds (which are managed as
  slab caches, sigh).
- Dynamic per CPU array sizing.
- Explain SLUB slabcache flags.

V1->V2:
- Fix up various issues. Tested on i386 UP, x86_64 SMP, ia64 NUMA.
- Provide NUMA support by splitting partial lists per node.
- Better slab cache merge support (now at around 50% of slabs).
- List slab cache aliases if slab caches are merged.
- Updated descriptions of /proc/slabinfo output.

This is a new slab allocator which was motivated by the complexity of
the existing code in mm/slab.c. It attempts to address a variety of
concerns with the existing implementation.

A. Management of object queues

A particular concern was the complex management of the numerous object
queues in SLAB. SLUB has no such queues. Instead we dedicate a slab to
each allocating CPU and use objects from that slab directly instead of
queueing them up.

B. Storage overhead of object queues

SLAB object queues exist per node, per CPU. The alien cache queue even
has a queue array that contains a queue for each processor on each node.
For very large systems the number of queues and the number of objects
that may be caught in those queues grows exponentially.
On our systems with 1k nodes / processors we have several gigabytes just
tied up for storing references to objects in those queues. This does not
include the objects that could be on those queues. One fears that the
whole memory of the machine could one day be consumed by those queues.

C. SLAB meta data overhead

SLAB has overhead at the beginning of each slab. This means that data
cannot be naturally aligned at the beginning of a slab block. SLUB keeps
all meta data in the corresponding page struct. Objects can be naturally
aligned in the slab. F.e. a 128 byte object will be aligned at 128 byte
boundaries and can fit tightly into a 4k page with no bytes left over
(see the worked example below). SLAB cannot do this.

D. SLAB has a complex cache reaper

SLUB does not need a cache reaper for UP systems. On SMP systems the per
CPU slab may be pushed back into the partial list, but that operation is
simple and does not require an iteration over a list of objects. SLAB
expires per CPU, shared and alien object queues during cache reaping,
which may cause strange hold offs.

E. SLAB has complex NUMA policy layer support

SLUB pushes NUMA policy handling into the page allocator. This means
that allocation is coarser (SLUB does interleave on a page level) but
that situation was also present before 2.6.13. SLAB's application of
policies to individual slab objects allocated in SLAB is certainly a
performance concern due
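The alignment claim in point C is easy to check with arithmetic: once
all management data lives in the page struct, a power of two object
size divides the page exactly. A small standalone demonstration:

#include <stdio.h>

int main(void)
{
	unsigned int page_size = 4096, object_size = 128;

	/* With meta data off-slab the page divides evenly; SLAB-style
	 * management data at the start of the slab would push the
	 * first object off its natural 128 byte boundary. */
	printf("objects per 4k slab: %u, wasted bytes: %u\n",
	       page_size / object_size, page_size % object_size);
	/* prints: objects per 4k slab: 32, wasted bytes: 0 */
	return 0;
}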
Re: [SLUB 0/3] SLUB: The unqueued slab allocator V4
On Fri, 9 Mar 2007, Mel Gorman wrote:

> The results without slub_debug were not good except for IA64. x86_64
> and ppc64 both blew up for a variety of reasons. The IA64 results were

Yuck, that is the dst issue that Adrian is also looking at. Likely an
issue with slab merging and RCU frees.

> KernBench Comparison
>
>                   2.6.21-rc2-mm2-clean  2.6.21-rc2-mm2-slub    %diff
> User CPU time                  1084.64              1032.93    4.77%
> System CPU time                  73.38                63.14   13.95%
> Total CPU time                 1158.02              1096.07    5.35%
> Elapsed time                    307.00               285.62    6.96%

Wow! The first indication that we are on the right track with this.

> AIM9 Comparison
> 2 page_test  2097119.26  3398259.27  1301140.01   62.04%  System Allocations & Pages/second

Wow! Must have all stayed within slab boundaries.

> 8 link_test    64776.04     7488.13   -57287.91  -88.44%  Link/Unlink Pairs/second

Crap. Maybe we straddled a slab boundary here?
Re: [SLUB 0/3] SLUB: The unqueued slab allocator V4
On Fri, 9 Mar 2007, Mel Gorman wrote:

> I'm not sure what you mean by per-order queues. The buddy allocator
> already has per-order lists.

Somehow they do not seem to work right. SLAB (and now SLUB too) can
avoid (or defer) fragmentation by keeping its own queues.
Re: [SLUB 0/3] SLUB: The unqueued slab allocator V4
On Thu, 8 Mar 2007, Christoph Lameter wrote:

> Note that I am amazed that the kernbench even worked.

The results without slub_debug were not good except for IA64. x86_64
and ppc64 both blew up for a variety of reasons. The IA64 results were

KernBench Comparison
--------------------
                  2.6.21-rc2-mm2-clean  2.6.21-rc2-mm2-slub    %diff
User CPU time                  1084.64              1032.93    4.77%
System CPU time                  73.38                63.14   13.95%
Total CPU time                 1158.02              1096.07    5.35%
Elapsed time                    307.00               285.62    6.96%

AIM9 Comparison
---------------
              2.6.21-rc2-mm2-clean  2.6.21-rc2-mm2-slub
1 creat-clo              425460.75           438809.64    13348.89    3.14% File Creations and Closes/second
2 page_test             2097119.26          3398259.27  1301140.01   62.04% System Allocations & Pages/second
3 brk_test              7008395.33          6728755.72  -279639.61   -3.99% System Memory Allocations/second
4 jmp_test             12226295.31         12254966.21    28670.90    0.23% Non-local gotos/second
5 signal_test           1271126.28          1235510.96   -35615.32   -2.80% Signal Traps/second
6 exec_test                 395.54              381.18      -14.36   -3.63% Program Loads/second
7 fork_test               13218.23            13211.41       -6.82   -0.05% Task Creations/second
8 link_test               64776.04             7488.13   -57287.91  -88.44% Link/Unlink Pairs/second

An example console log from x86_64 is below. It's not particularly clear
why it went blamo and I haven't had a chance all day to kick it around
for a bit due to a variety of other hilarity floating around.

Linux version 2.6.21-rc2-mm2-autokern1 ([EMAIL PROTECTED]) (gcc version
4.1.1 20060525 (Red Hat 4.1.1-1)) #1 SMP Thu Mar 8 12:13:27 CST 2007
Command line: ro root=/dev/VolGroup00/LogVol00 rhgb console=tty0
console=ttyS1,19200 selinux=no autobench_args: root=30726124
ABAT:1173378546 loglevel=8
BIOS-provided physical RAM map:
 BIOS-e820:  - 0009d400 (usable)
 BIOS-e820: 0009d400 - 000a (reserved)
 BIOS-e820: 000e - 0010 (reserved)
 BIOS-e820: 0010 - 3ffcddc0 (usable)
 BIOS-e820: 3ffcddc0 - 3ffd (ACPI data)
 BIOS-e820: 3ffd - 4000 (reserved)
 BIOS-e820: fec0 - 0001 (reserved)
Entering add_active_range(0, 0, 157) 0 entries of 3200 used
Entering add_active_range(0, 256, 262093) 1 entries of 3200 used
end_pfn_map = 1048576
DMI 2.3 present.
ACPI: RSDP 000FDFC0, 0014 (r0 IBM)
ACPI: RSDT 3FFCFF80, 0034 (r1 IBM SERBLADE 1000 IBM 45444F43)
ACPI: FACP 3FFCFEC0, 0084 (r2 IBM SERBLADE 1000 IBM 45444F43)
ACPI: DSDT 3FFCDDC0, 1EA6 (r1 IBM SERBLADE 1000 INTL 2002025)
ACPI: FACS 3FFCFCC0, 0040
ACPI: APIC 3FFCFE00, 009C (r1 IBM SERBLADE 1000 IBM 45444F43)
ACPI: SRAT 3FFCFD40, 0098 (r1 IBM SERBLADE 1000 IBM 45444F43)
ACPI: HPET 3FFCFD00, 0038 (r1 IBM SERBLADE 1000 IBM 45444F43)
SRAT: PXM 0 -> APIC 0 -> Node 0
SRAT: PXM 0 -> APIC 1 -> Node 0
SRAT: PXM 1 -> APIC 2 -> Node 1
SRAT: PXM 1 -> APIC 3 -> Node 1
SRAT: Node 0 PXM 0 0-4000
Entering add_active_range(0, 0, 157) 0 entries of 3200 used
Entering add_active_range(0, 256, 262093) 1 entries of 3200 used
NUMA: Using 63 for the hash shift.
Bootmem setup node 0 -3ffcd000
Node 0 memmap at 0x81003efcd000 size 16773952 first pfn 0x81003efcd000
sizeof(struct page) = 64
Zone PFN ranges:
  DMA             0 ->     4096
  DMA32        4096 ->  1048576
  Normal    1048576 ->  1048576
Movable zone start PFN for each node
early_node_map[2] active PFN ranges
    0:        0 ->      157
    0:      256 ->   262093
On node 0 totalpages: 261994
  DMA zone: 64 pages used for memmap
  DMA zone: 2017 pages reserved
  DMA zone: 1916 pages, LIFO batch:0
  DMA32 zone: 4031 pages used for memmap
  DMA32 zone: 253966 pages, LIFO batch:31
  Normal zone: 0 pages used for memmap
  Movable zone: 0 pages used for memmap
ACPI: PM-Timer IO Port: 0x2208
ACPI: Local APIC address 0xfee0
ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
Processor #0 (Bootup-CPU)
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled)
Processor #1
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x02] enabled)
Processor #2
ACPI: LAPIC (acpi_id[0x03] lapic_id[0x03] enabled)
Processor #3
ACPI: LAPIC_NMI (acpi_id[0x00] dfl dfl lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x01] dfl dfl lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x02] dfl dfl lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x03] dfl dfl lint[0x1])
ACPI: IOAPIC (id[0x0e] address[0xfec0] gsi_base[0])
IOAPIC[0]: apic_id 14, address 0xfec0, GSI 0-23
ACPI: IOAPIC (id[0x0d] address[0xfec1] gsi_base[24])
IOAPIC[1]: apic_id 13, address
Re: [SLUB 0/3] SLUB: The unqueued slab allocator V4
On Thu, 8 Mar 2007, Christoph Lameter wrote:

> Note that I am amazed that the kernbench even worked. On small machines

How small? The machines I am testing on aren't "big" but they aren't
miserable either.

> I seem to be getting into trouble with order 1 allocations.

That in itself is pretty incredible. From what I see, allocations up to
order 3 generally work unless they are atomic, even with the vanilla
kernel. That said, it could be because slab is holding onto the high
order pages for itself.

> SLAB seems to be able to avoid the situation by keeping higher order
> pages on a freelist, reducing the allocs/frees of higher order pages
> that the page allocator has to deal with. Maybe we need per order
> queues in the page allocator?

I'm not sure what you mean by per-order queues. The buddy allocator
already has per-order lists.

> There must be something fundamentally wrong in the page allocator if
> the SLAB queues fix this issue. I was able to fix the issue in V5 by
> forcing SLUB to keep a minimum number of objects around regardless of
> the fit to a given page order. Pass through is deadly since the crappy
> page allocator cannot handle it. Higher order page allocation failures
> can be avoided by using kmalloc. Yuck! Hopefully your patches fix that
> fundamental problem.

One way to find out for sure.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
Re: [SLUB 0/3] SLUB: The unqueued slab allocator V4
On Thu, 8 Mar 2007, Christoph Lameter wrote:

> > > Note that the 16kb page size has a major impact on SLUB
> > > performance. On IA64 slub will use only 1/4th the locking overhead
> > > as on 4kb platforms.
> >
> > It'll be interesting to see the kernbench tests then with debugging
> > disabled.
>
> You can get a similar effect on 4kb platforms by specifying
> slub_min_order=2 on bootup. This means that we have to rely on your
> patches to allow higher order allocs to work reliably though.

It should work out, because the way buddy always selects the minimum
page size will tend to cluster the slab allocations together whether
they are reclaimable or not. It's something I can investigate when slub
has stabilised a bit.

However, in general, high order kernel allocations remain a bad idea.
Depending on high order allocations that do not group could potentially
lead to a situation where the movable areas are used more and more by
kernel allocations. I cannot think of a workload that would actually
break everything, but it's a possibility.

> The higher the order of slub the less locking overhead. So the better
> your patches deal with fragmentation the more we can reduce locking
> overhead in slub.

I can certainly kick it around a lot and see what happens. It's best
that slub_min_order=2 remains an optional performance enhancing switch
though.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
Re: [SLUB 0/3] SLUB: The unqueued slab allocator V4
Note that I am amazed that the kernbench even worked. On small machines
I seem to be getting into trouble with order 1 allocations.

SLAB seems to be able to avoid the situation by keeping higher order
pages on a freelist, reducing the allocs/frees of higher order pages
that the page allocator has to deal with. Maybe we need per order queues
in the page allocator?

There must be something fundamentally wrong in the page allocator if the
SLAB queues fix this issue. I was able to fix the issue in V5 by forcing
SLUB to keep a minimum number of objects around regardless of the fit to
a given page order. Pass through is deadly since the crappy page
allocator cannot handle it. Higher order page allocation failures can be
avoided by using kmalloc. Yuck! Hopefully your patches fix that
fundamental problem.
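The minimum-objects fix mentioned above can be pictured as an order
selection rule: choose the smallest slab order that yields at least a
given number of objects, capped at some maximum order. A sketch under
assumed names (SLUB's real calculation may also weigh wasted space; this
only shows the minimum-objects part):

/*
 * Pick the smallest page order whose slab holds at least
 * min_objects, capped at max_order (e.g. slub_max_order).
 */
static unsigned int slab_order_sketch(unsigned int object_size,
				      unsigned int min_objects,
				      unsigned int max_order)
{
	unsigned int order;

	for (order = 0; order <= max_order; order++) {
		unsigned int slab_size = 4096u << order;

		if (slab_size / object_size >= min_objects)
			return order;
	}
	return max_order;  /* accept fewer objects per slab */
}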
Re: [SLUB 0/3] SLUB: The unqueued slab allocator V4
On Thu, 8 Mar 2007, Mel Gorman wrote:

> > Note that the 16kb page size has a major impact on SLUB performance.
> > On IA64 slub will use only 1/4th the locking overhead as on 4kb
> > platforms.
>
> It'll be interesting to see the kernbench tests then with debugging
> disabled.

You can get a similar effect on 4kb platforms by specifying
slub_min_order=2 on bootup. This means that we have to rely on your
patches to allow higher order allocs to work reliably though.

The higher the order of slub the less locking overhead. So the better
your patches deal with fragmentation the more we can reduce locking
overhead in slub.
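The 1/4th figure follows from the slab lock being taken per slab
transition rather than per object: cost per object scales inversely with
objects per slab. Illustrative numbers only, for an assumed 256 byte
object:

#include <stdio.h>

int main(void)
{
	unsigned int object_size = 256;
	unsigned int orders[2] = { 0, 2 };  /* 4k and 16k slabs */

	for (int i = 0; i < 2; i++) {
		unsigned int slab_size = 4096u << orders[i];

		printf("order %u: %u objects per lock acquisition\n",
		       orders[i], slab_size / object_size);
	}
	/* order 2 holds 4x the objects -> roughly 1/4th the locking
	 * overhead, which is what slub_min_order=2 buys on 4kb pages */
	return 0;
}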
Re: [SLUB 0/3] SLUB: The unqueued slab allocator V4
On Thu, 8 Mar 2007, Mel Gorman wrote:

> Brought up 4 CPUs
> Node 0 CPUs: 0-3
> mm/memory.c:111: bad pud c50e4480.

Lower bits must be clear, right? Looks like the pud was released and
then reused for a 64 byte cache or so. This is likely a freelist pointer
that slub put there after allocating the page for the 64 byte cache.
Then we tried to use the pud.

> migration_cost=0,1000
> *** SLUB: Redzone Inactive check fails in [EMAIL PROTECTED] Slab c0756090
> offset=240 flags=50c7 inuse=3 freelist=c50de0f0
> Bytes b4 c50de0e0:   00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> Object c50de0f0:     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> Object c50de100:     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> Object c50de110:     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> Object c50de120:     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> Redzone c50de130:    00 00 00 00 00 00 00 00
> FreePointer c50de138:

Data was overwritten after the object was freed or after the slab was
allocated. So this may be the same issue: the pud was zapped after it
was freed, destroying the poison of another object in the 64 byte cache.

Hmmm.. Maybe I should put the pad checks before the object checks. That
way we detect that the whole slab was corrupted and do not flag just a
single object.
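Putting the pad checks first would look roughly like this; the padding
byte value and the helper shape are assumptions, not the patch's actual
code:

#include <stddef.h>

#define PAD_BYTE 0x5a  /* assumed slab padding poison value */

/*
 * Check the slab's trailing padding before any per-object redzones.
 * Damage here means the whole slab was overwritten, so it can be
 * reported as slab corruption instead of as one corrupt object.
 */
static int slab_pad_ok(const unsigned char *slab, size_t slab_size,
		       size_t used_bytes)
{
	size_t i;

	for (i = used_bytes; i < slab_size; i++)
		if (slab[i] != PAD_BYTE)
			return 0;  /* whole-slab corruption */
	return 1;  /* only now go on to per-object redzone checks */
}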
Re: [SLUB 0/3] SLUB: The unqueued slab allocator V4
On (08/03/07 08:48), Christoph Lameter didst pronounce:
> On Thu, 8 Mar 2007, Mel Gorman wrote:
>
> > On x86_64, it completed successfully and looked reliable. There was
> > a 5% performance loss on kernbench and aim9 figures were way down.
> > However, with slub_debug enabled, I would expect that, so it's not a
> > fair comparison performance wise. I'll rerun the tests without debug
> > and see what it looks like if you're interested and do not think
> > it's too early to worry about performance instead of clarity. This
> > is what I have for bl6-13 (machine appears on test.kernel.org so
> > additional details are there).
>
> No, it's good to start worrying about performance now. There are still
> some performance issues to be ironed out, in particular on NUMA. I am
> not sure f.e. how the reduction of partial lists affects performance.

Ok, I've sent off a bunch of tests - two of which are on NUMA (numaq and
x86_64). It'll take them a long time to complete though as there is a
lot of testing going on right now.

> > IA64 (machine not visible on TKO) curiously did not exhibit the same
> > problems on kernbench for Total CPU time, which is very unexpected,
> > but you can see the System CPU times. The AIM9 figures were a bit of
> > an upset but again, I blame slub_debug being enabled.
>
> This was a single node box?

Yes, memory looks like this;

Zone PFN ranges:
  DMA          1024 ->   262144
  Normal     262144 ->   262144
Movable zone start PFN for each node
early_node_map[3] active PFN ranges
    0:     1024 ->    30719
    0:    32768 ->    65413
    0:    65440 ->    65505
On node 0 totalpages: 62405
Node 0 memmap at 0xe1126000 size 3670016 first pfn 0xe1134000
  DMA zone: 220 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 62185 pages, LIFO batch:7
  Normal zone: 0 pages used for memmap
  Movable zone: 0 pages used for memmap

> Note that the 16kb page size has a major impact on SLUB performance.
> On IA64 slub will use only 1/4th the locking overhead as on 4kb
> platforms.

It'll be interesting to see the kernbench tests then with debugging
disabled.

> > (as an aside, the success rates for high-order allocations are lower
> > with SLUB. Again, I blame slub_debug. I know that enabling
> > SLAB_DEBUG has similar effects because of red-zoning and the like)
>
> We have some additional patches here that reduce the max order for
> some allocs. I believe the task_struct gets to be an order 2 alloc
> with V4. Should make a difference for slab fragmentation.

> > Now, the bad news. This exploded on ppc64. It started going wrong
> > early in the boot process and got worse. I haven't looked closely as
> > to why yet as there is other stuff on my plate, but I've included a
> > console log that might be some use to you. If you think you have a
> > fix for it, feel free to send it on and I'll give it a test.
>
> Hmmm... Looks like something is zapping an object. Try to rerun with
> a kernel compiled with CONFIG_SLAB_DEBUG. I would expect similar
> results.

I've queued up a few tests. One completed as I wrote this and it didn't
explode with SLAB_DEBUG set. Maybe the others will be different. I'll
kick it around for a bit. It could be a real bug that slab is just not
catching.

> > Brought up 4 CPUs
> > Node 0 CPUs: 0-3
> > mm/memory.c:111: bad pud c50e4480.
> > could not vmalloc 20971520 bytes for cache!
>
> Hmmm... a bad pud? I need to look at how the puds are managed on power.
>
> > migration_cost=0,1000
> > *** SLUB: Redzone Inactive check fails in [EMAIL PROTECTED] Slab
>
> An object was overwritten with zeros after it was freed.

> > RTAS daemon started
> > RTAS: event: 1, Type: Platform Error, Severity: 2
> > audit: initializing netlink socket (disabled)
> > audit(1173335571.256:1): initialized
> > Total HugeTLB memory allocated, 0
> > VFS: Disk quotas dquot_6.5.1
> > Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
> > JFS: nTxBlock = 8192, nTxLock = 65536
> > SELinux: Registering netfilter hooks
> > io scheduler noop registered
> > io scheduler anticipatory registered (default)
> > io scheduler deadline registered
> > io scheduler cfq registered
> > pci_hotplug: PCI Hot Plug PCI Core version: 0.5
> > rpaphp: RPA HOT Plug PCI Controller Driver version: 0.1
> > rpaphp: Slot [:00:02.2](PCI location=U7879.001.DQD0T7T-P1-C4) registered
> > vio_register_driver: driver hvc_console registering
> > [ cut here ]
> > Badness at mm/slub.c:1701
>
> Someone did a kmalloc(0, ...). Zero sized allocations are not flagged
> by SLAB but SLUB does.

I'll chase up what's happening here. It will be "reproducible"
independent of SLUB by adding a similar check.

> > Call Trace:
> > [C506B730] [C0011188] .show_stack+0x6c/0x1a0 (unreliable)
> > [C506B7D0] [C01EE9F4] .report_bug+0x94/0xe8
> > [C506B860] [C038C85C] .program_check_exception+0x16c/0x5f4
Re: [SLUB 0/3] SLUB: The unqueued slab allocator V4
On Thu, 8 Mar 2007, Mel Gorman wrote:

> On x86_64, it completed successfully and looked reliable. There was a
> 5% performance loss on kernbench and aim9 figures were way down.
> However, with slub_debug enabled, I would expect that, so it's not a
> fair comparison performance wise. I'll rerun the tests without debug
> and see what it looks like if you're interested and do not think it's
> too early to worry about performance instead of clarity. This is what
> I have for bl6-13 (machine appears on test.kernel.org so additional
> details are there).

No, it's good to start worrying about performance now. There are still
some performance issues to be ironed out, in particular on NUMA. I am
not sure f.e. how the reduction of partial lists affects performance.

> IA64 (machine not visible on TKO) curiously did not exhibit the same
> problems on kernbench for Total CPU time, which is very unexpected,
> but you can see the System CPU times. The AIM9 figures were a bit of
> an upset but again, I blame slub_debug being enabled.

This was a single node box? Note that the 16kb page size has a major
impact on SLUB performance. On IA64 slub will use only 1/4th the locking
overhead as on 4kb platforms.

> (as an aside, the success rates for high-order allocations are lower
> with SLUB. Again, I blame slub_debug. I know that enabling SLAB_DEBUG
> has similar effects because of red-zoning and the like)

We have some additional patches here that reduce the max order for some
allocs. I believe the task_struct gets to be an order 2 alloc with V4.
Should make a difference for slab fragmentation.

> Now, the bad news. This exploded on ppc64. It started going wrong
> early in the boot process and got worse. I haven't looked closely as
> to why yet as there is other stuff on my plate, but I've included a
> console log that might be some use to you. If you think you have a fix
> for it, feel free to send it on and I'll give it a test.

Hmmm... Looks like something is zapping an object. Try to rerun with a
kernel compiled with CONFIG_SLAB_DEBUG. I would expect similar results.

> Brought up 4 CPUs
> Node 0 CPUs: 0-3
> mm/memory.c:111: bad pud c50e4480.
> could not vmalloc 20971520 bytes for cache!

Hmmm... a bad pud? I need to look at how the puds are managed on power.

> migration_cost=0,1000
> *** SLUB: Redzone Inactive check fails in [EMAIL PROTECTED] Slab

An object was overwritten with zeros after it was freed.

> RTAS daemon started
> RTAS: event: 1, Type: Platform Error, Severity: 2
> audit: initializing netlink socket (disabled)
> audit(1173335571.256:1): initialized
> Total HugeTLB memory allocated, 0
> VFS: Disk quotas dquot_6.5.1
> Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
> JFS: nTxBlock = 8192, nTxLock = 65536
> SELinux: Registering netfilter hooks
> io scheduler noop registered
> io scheduler anticipatory registered (default)
> io scheduler deadline registered
> io scheduler cfq registered
> pci_hotplug: PCI Hot Plug PCI Core version: 0.5
> rpaphp: RPA HOT Plug PCI Controller Driver version: 0.1
> rpaphp: Slot [:00:02.2](PCI location=U7879.001.DQD0T7T-P1-C4) registered
> vio_register_driver: driver hvc_console registering
> [ cut here ]
> Badness at mm/slub.c:1701

Someone did a kmalloc(0, ...). Zero sized allocations are not flagged by
SLAB but SLUB does.

> Call Trace:
> [C506B730] [C0011188] .show_stack+0x6c/0x1a0 (unreliable)
> [C506B7D0] [C01EE9F4] .report_bug+0x94/0xe8
> [C506B860] [C038C85C] .program_check_exception+0x16c/0x5f4
> [C506B930] [C00046F4] program_check_common+0xf4/0x100
> --- Exception: 700 at .get_slab+0xbc/0x18c
> LR = .__kmalloc+0x28/0x104
> [C506BC20] [C506BCC0] 0xc506bcc0 (unreliable)
> [C506BCD0] [C00CE2EC] .__kmalloc+0x28/0x104
> [C506BD60] [C022E724] .tty_register_driver+0x5c/0x23c
> [C506BE10] [C0477910] .hvsi_init+0x154/0x1b4
> [C506BEC0] [C0451B7C] .init+0x1c4/0x2f8
> [C506BF90] [C00275D0] .kernel_thread+0x4c/0x68
> mm/memory.c:111: bad pud c5762900.
> mm/memory.c:111: bad pud c5762480.
> [ cut here ]
> kernel BUG at mm/mmap.c:1999!

More page table trouble.
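On the kmalloc(0) point above: a userspace analogy of the behavioural
difference, with the WARN-style message being an assumption about how
the flagging looks:

#include <stdio.h>
#include <stdlib.h>

/* SLAB quietly satisfies a zero sized allocation from its smallest
 * cache; the SLUB in this thread flags it instead. */
static void *kmalloc_like(size_t size)
{
	if (size == 0) {
		fprintf(stderr, "WARNING: zero sized allocation\n");
		return NULL;  /* SLAB would just allocate some memory */
	}
	return malloc(size);
}

int main(void)
{
	void *p = kmalloc_like(0);  /* triggers the warning */

	free(p);
	return 0;
}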
Re: [SLUB 0/3] SLUB: The unqueued slab allocator V4
On Tue, 6 Mar 2007, Christoph Lameter wrote: [PATCH] SLUB The unqueued slab allocator v4 Hi Christoph, I shoved these patches through a few tests on x86, x86_64, ia64 and ppc64 last night to see how they got on. I enabled slub_debug to catch any suprises that may be creeping about. The results are mixed. On x86_64, it completed successfully and looked reliable. There was a 5% performance loss on kernbench and aim9 figures were way down. However, with slub_debug enabled, I would expect that so it's not a fair comparison performance wise. I'll rerun the tests without debug and see what it looks like if you're interested and do not think it's too early to worry about performance instead of clarity. This is what I have for bl6-13 (machine appears on test.kernel.org so additional details are there). KernBench Comparison 2.6.21-rc2-mm2-clean 2.6.21-rc2-mm2-list-based %diff User CPU time 84.32 86.03 -2.03% System CPU time 32.97 38.21 -15.89% Total CPU time 117.29124.24 -5.93% Elapsedtime 34.95 37.31 -6.75% AIM9 Comparison --- 2.6.21-rc2-mm2-clean 2.6.21-rc2-mm2-list-based 1 creat-clo160706.55 62918.54 -97788.01 -60.85% File Creations and Closes/second 2 page_test190371.67 204050.99 13679.32 7.19% System Allocations & Pages/second 3 brk_test2320679.89 1923512.75 -397167.14 -17.11% System Memory Allocations/second 4 jmp_test 16391869.3816380353.27 -11516.11 -0.07% Non-local gotos/second 5 signal_test 492234.63 235710.71 -256523.92 -52.11% Signal Traps/second 6 exec_test 232.26 220.88 -11.38 -4.90% Program Loads/second 7 fork_test 4514.253609.40-904.85 -20.04% Task Creations/second 8 link_test 53639.76 26925.91 -26713.85 -49.80% Link/Unlink Pairs/second IA64 (machine not visible on TKO) curiously did not exhibit the same problems on kernbench for Total CPU time which is very unexpected but you can see the System CPU times. The AIM9 figures were a bit of an upset but again, I blame slub_debug being enabled KernBench Comparison 2.6.21-rc2-mm2-clean 2.6.21-rc2-mm2-list-based %diff User CPU time1084.64 1033.46 4.72% System CPU time 73.38 84.14 -14.66% Total CPU time1158.021117.6 3.49% Elapsedtime 307.00291.29 5.12% AIM9 Comparison --- 2.6.21-rc2-mm2-clean 2.6.21-rc2-mm2-list-based 1 creat-clo425460.75 137709.84 -287750.91 -67.63% File Creations and Closes/second 2 page_test 2097119.26 2373083.49 275964.23 13.16% System Allocations & Pages/second 3 brk_test7008395.33 3787961.51 -3220433.82 -45.95% System Memory Allocations/second 4 jmp_test 12226295.3112254744.03 28448.72 0.23% Non-local gotos/second 5 signal_test 1271126.28 334357.29 -936768.99 -73.70% Signal Traps/second 6 exec_test 395.54 349.00 -46.54 -11.77% Program Loads/second 7 fork_test 13218.238822.93 -4395.30 -33.25% Task Creations/second 8 link_test 64776.047410.75 -57365.29 -88.56% Link/Unlink Pairs/second (as an aside, the succes rates for high-order allocations are lower with SLUB. Again, I blame slub_debug. I know that enabling SLAB_DEBUG has similar effects because of red-zoning and the like) Now, the bad news. This exploded on ppc64. It started going wrong early in the boot process and got worse. I haven't looked closely as to why yet as there is other stuff on my plate but I've included a console log that might be some use to you. If you think you have a fix for it, feel free to send it on and I'll give it a test. Config file read, 1024 bytes Welcome Welcome to yaboot version 1.3.12 Enter "help" to get some basic usage information boot: autobench Please wait, loading kernel... Elf64 kernel loaded... Loading ramdisk... 
ramdisk loaded at 0240, size: 1648 Kbytes
OF stdout device is: /vdevice/[EMAIL PROTECTED]
Hypertas detected, assuming LPAR !
command line: ro console=hvc0 autobench_args: root=/dev/sda6 ABAT:1173335344 loglevel=8 slub_debug
memory layout at init
Re: [SLUB 0/3] SLUB: The unqueued slab allocator V4
On Thu, 8 Mar 2007, Mel Gorman wrote:

> On x86_64, it completed successfully and looked reliable. There was a 5%
> performance loss on kernbench and aim9 figures were way down. However, with
> slub_debug enabled, I would expect that so it's not a fair comparison
> performance wise. I'll rerun the tests without debug and see what it looks
> like if you're interested and do not think it's too early to worry about
> performance instead of clarity. This is what I have for bl6-13 (machine
> appears on test.kernel.org so additional details are there).

No, it's good to start worrying about performance now. There are still some performance issues to be ironed out, in particular on NUMA. I am not sure, f.e., how the reduction of the partial lists affects performance.

> IA64 (machine not visible on TKO) curiously did not exhibit the same
> problems on kernbench for Total CPU time which is very unexpected but you
> can see the System CPU times. The AIM9 figures were a bit of an upset but
> again, I blame slub_debug being enabled

This was a single-node box? Note that the 16kb page size has a major impact on SLUB performance. On IA64 slub will use only 1/4th of the locking overhead of 4kb platforms.

> (as an aside, the success rates for high-order allocations are lower with
> SLUB. Again, I blame slub_debug. I know that enabling SLAB_DEBUG has
> similar effects because of red-zoning and the like)

We have some additional patches here that reduce the max order for some allocs. I believe the task_struct gets to be an order 2 alloc with V4.

> Now, the bad news. This exploded on ppc64. It started going wrong early in
> the boot process and got worse. I haven't looked closely as to why yet as
> there is other stuff on my plate but I've included a console log that might
> be some use to you. If you think you have a fix for it, feel free to send
> it on and I'll give it a test.

Hmmm... Looks like something is zapping an object. Try to rerun with a kernel compiled with CONFIG_SLAB_DEBUG. I would expect similar results.

> Brought up 4 CPUs
> Node 0 CPUs: 0-3
> mm/memory.c:111: bad pud c50e4480.
> could not vmalloc 20971520 bytes for cache!

Hmmm... a bad pud? I need to look at how the puds are managed on power.

> migration_cost=0,1000
> *** SLUB: Redzone Inactive check fails in [EMAIL PROTECTED] Slab

An object was overwritten with zeros after it was freed.

> RTAS daemon started
> RTAS: event: 1, Type: Platform Error, Severity: 2
> audit: initializing netlink socket (disabled)
> audit(1173335571.256:1): initialized
> Total HugeTLB memory allocated, 0
> VFS: Disk quotas dquot_6.5.1
> Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
> JFS: nTxBlock = 8192, nTxLock = 65536
> SELinux: Registering netfilter hooks
> io scheduler noop registered
> io scheduler anticipatory registered (default)
> io scheduler deadline registered
> io scheduler cfq registered
> pci_hotplug: PCI Hot Plug PCI Core version: 0.5
> rpaphp: RPA HOT Plug PCI Controller Driver version: 0.1
> rpaphp: Slot [:00:02.2](PCI location=U7879.001.DQD0T7T-P1-C4) registered
> vio_register_driver: driver hvc_console registering
> [ cut here ]
> Badness at mm/slub.c:1701

Someone did a kmalloc(0, ...). Zero-sized allocations are not flagged by SLAB, but SLUB does flag them.
> Call Trace:
> [C506B730] [C0011188] .show_stack+0x6c/0x1a0 (unreliable)
> [C506B7D0] [C01EE9F4] .report_bug+0x94/0xe8
> [C506B860] [C038C85C] .program_check_exception+0x16c/0x5f4
> [C506B930] [C00046F4] program_check_common+0xf4/0x100
> --- Exception: 700 at .get_slab+0xbc/0x18c
>     LR = .__kmalloc+0x28/0x104
> [C506BC20] [C506BCC0] 0xc506bcc0 (unreliable)
> [C506BCD0] [C00CE2EC] .__kmalloc+0x28/0x104
> [C506BD60] [C022E724] .tty_register_driver+0x5c/0x23c
> [C506BE10] [C0477910] .hvsi_init+0x154/0x1b4
> [C506BEC0] [C0451B7C] .init+0x1c4/0x2f8
> [C506BF90] [C00275D0] .kernel_thread+0x4c/0x68
> mm/memory.c:111: bad pud c5762900.
> mm/memory.c:111: bad pud c5762480.
> [ cut here ]
> kernel BUG at mm/mmap.c:1999!

More page table trouble.
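(For readers following along, the policy difference behind the "Badness" above can be pictured with a small userspace sketch; the warning text and function name are illustrative, not SLUB's actual code in mm/slub.c.)

#include <stdio.h>
#include <stdlib.h>

/* Sketch only: SLAB silently satisfies a zero-byte request, while SLUB
 * complains so the caller can be fixed. */
static void *sketch_kmalloc(size_t size)
{
        if (size == 0) {
                fprintf(stderr, "Badness: zero-sized allocation\n");
                return NULL;    /* flag the caller instead of allocating */
        }
        return malloc(size);    /* stand-in for the real slab lookup */
}

int main(void)
{
        void *p = sketch_kmalloc(0);    /* triggers the warning path */
        free(sketch_kmalloc(32));       /* normal allocations are unaffected */
        return p == NULL ? 0 : 1;
}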
Re: [SLUB 0/3] SLUB: The unqueued slab allocator V4
On (08/03/07 08:48), Christoph Lameter didst pronounce:
> On Thu, 8 Mar 2007, Mel Gorman wrote:
> > On x86_64, it completed successfully and looked reliable. There was a 5%
> > performance loss on kernbench and aim9 figures were way down. However,
> > with slub_debug enabled, I would expect that so it's not a fair comparison
> > performance wise. I'll rerun the tests without debug and see what it looks
> > like if you're interested and do not think it's too early to worry about
> > performance instead of clarity. This is what I have for bl6-13 (machine
> > appears on test.kernel.org so additional details are there).
>
> No, it's good to start worrying about performance now. There are still some
> performance issues to be ironed out, in particular on NUMA. I am not sure,
> f.e., how the reduction of the partial lists affects performance.

Ok, I've sent off a bunch of tests - two of which are on NUMA (numaq and x86_64). It'll take them a long time to complete though as there is a lot of testing going on right now.

> > IA64 (machine not visible on TKO) curiously did not exhibit the same
> > problems on kernbench for Total CPU time which is very unexpected but you
> > can see the System CPU times. The AIM9 figures were a bit of an upset but
> > again, I blame slub_debug being enabled
>
> This was a single-node box?

Yes, memory looks like this;

Zone PFN ranges:
  DMA          1024 ->   262144
  Normal     262144 ->   262144
Movable zone start PFN for each node
early_node_map[3] active PFN ranges
    0:     1024 ->    30719
    0:    32768 ->    65413
    0:    65440 ->    65505
On node 0 totalpages: 62405
Node 0 memmap at 0xe1126000 size 3670016 first pfn 0xe1134000
  DMA zone: 220 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 62185 pages, LIFO batch:7
  Normal zone: 0 pages used for memmap
  Movable zone: 0 pages used for memmap

> Note that the 16kb page size has a major impact on SLUB performance. On
> IA64 slub will use only 1/4th the locking overhead as on 4kb platforms.

It'll be interesting to see the kernbench tests then with debugging disabled.

> > (as an aside, the success rates for high-order allocations are lower with
> > SLUB. Again, I blame slub_debug. I know that enabling SLAB_DEBUG has
> > similar effects because of red-zoning and the like)
>
> We have some additional patches here that reduce the max order for some
> allocs. I believe the task_struct gets to be an order 2 alloc with V4,

Should make a difference for slab fragmentation.

> > Now, the bad news. This exploded on ppc64. It started going wrong early
> > in the boot process and got worse. I haven't looked closely as to why yet
> > as there is other stuff on my plate but I've included a console log that
> > might be some use to you. If you think you have a fix for it, feel free
> > to send it on and I'll give it a test.
>
> Hmmm... Looks like something is zapping an object. Try to rerun with a
> kernel compiled with CONFIG_SLAB_DEBUG. I would expect similar results.

I've queued up a few tests. One completed as I wrote this and it didn't explode with SLAB_DEBUG set. Maybe the others will be different. I'll kick it around for a bit. It could be a real bug that slab is just not catching.

> > Brought up 4 CPUs
> > Node 0 CPUs: 0-3
> > mm/memory.c:111: bad pud c50e4480.
> > could not vmalloc 20971520 bytes for cache!
>
> Hmmm... a bad pud? I need to look at how the puds are managed on power.
>
> > migration_cost=0,1000
> > *** SLUB: Redzone Inactive check fails in [EMAIL PROTECTED] Slab
>
> An object was overwritten with zeros after it was freed.
> > RTAS daemon started
> > RTAS: event: 1, Type: Platform Error, Severity: 2
> > audit: initializing netlink socket (disabled)
> > audit(1173335571.256:1): initialized
> > Total HugeTLB memory allocated, 0
> > VFS: Disk quotas dquot_6.5.1
> > Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
> > JFS: nTxBlock = 8192, nTxLock = 65536
> > SELinux: Registering netfilter hooks
> > io scheduler noop registered
> > io scheduler anticipatory registered (default)
> > io scheduler deadline registered
> > io scheduler cfq registered
> > pci_hotplug: PCI Hot Plug PCI Core version: 0.5
> > rpaphp: RPA HOT Plug PCI Controller Driver version: 0.1
> > rpaphp: Slot [:00:02.2](PCI location=U7879.001.DQD0T7T-P1-C4) registered
> > vio_register_driver: driver hvc_console registering
> > [ cut here ]
> > Badness at mm/slub.c:1701
>
> Someone did a kmalloc(0, ...). Zero-sized allocations are not flagged by
> SLAB, but SLUB does flag them.

I'll chase up what's happening here. It will be reproducible independent of SLUB by adding a similar check.

> > Call Trace:
> > [C506B730] [C0011188] .show_stack+0x6c/0x1a0 (unreliable)
> > [C506B7D0] [C01EE9F4] .report_bug+0x94/0xe8
> > [C506B860] [C038C85C] .program_check_exception+0x16c/0x5f4
> > [C506B930] [C00046F4] program_check_common+0xf4/0x100
> > --- Exception: 700 at .get_slab+0xbc/0x18c
> >     LR = .__kmalloc+0x28/0x104
Re: [SLUB 0/3] SLUB: The unqueued slab allocator V4
On Thu, 8 Mar 2007, Mel Gorman wrote:

> Brought up 4 CPUs
> Node 0 CPUs: 0-3
> mm/memory.c:111: bad pud c50e4480.

Lower bits must be clear, right? Looks like the pud was released and then reused for a 64-byte cache or so. This is likely a freelist pointer that slub put there after allocating the page for the 64-byte cache. Then we tried to use the pud.

> migration_cost=0,1000
> *** SLUB: Redzone Inactive check fails in [EMAIL PROTECTED]
> Slab c0756090 offset=240 flags=50c7 inuse=3 freelist=c50de0f0
>   Bytes b4 c50de0e0:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>   Object c50de0f0:    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>   Object c50de100:    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>   Object c50de110:    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>   Object c50de120:    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>   Redzone c50de130:   00 00 00 00 00 00 00 00
>   FreePointer c50de138:

Data was overwritten after free or after the slab was allocated. So this may be the same issue: the pud was zapped after it was freed, destroying the poison of another object in the 64-byte cache.

Hmmm.. Maybe I should put the pad checks before the object checks. That way we detect that the whole slab was corrupted and do not flag just a single object.
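(For readers decoding the dump: a freed object is expected to keep its poison bytes and trailing redzone until it is reallocated. A simplified, standalone sketch of that check follows; the byte values and layout are illustrative, not SLUB's actual constants.)

#include <stdio.h>
#include <string.h>

#define POISON_FREE   0x6b   /* illustrative filler for freed object bytes */
#define RED_INACTIVE  0xbb   /* illustrative redzone value while the object is free */

/* Return 0 and report the first bad offset if a free object's poison
 * or its trailing redzone has been overwritten. */
static int check_free_object(const unsigned char *obj, size_t size, size_t redzone)
{
        size_t i;

        for (i = 0; i < size; i++)
                if (obj[i] != POISON_FREE) {
                        printf("poison overwritten at offset %zu (0x%02x)\n", i, obj[i]);
                        return 0;
                }
        for (; i < size + redzone; i++)
                if (obj[i] != RED_INACTIVE) {
                        printf("Redzone Inactive check fails at offset %zu\n", i);
                        return 0;
                }
        return 1;
}

int main(void)
{
        unsigned char obj[64 + 8];

        memset(obj, POISON_FREE, 64);       /* freed 64-byte object */
        memset(obj + 64, RED_INACTIVE, 8);  /* trailing redzone */
        memset(obj, 0, 16);                 /* simulate the zeroing seen in the log */
        return check_free_object(obj, 64, 8) ? 1 : 0;
}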
Re: [SLUB 0/3] SLUB: The unqueued slab allocator V4
On Thu, 8 Mar 2007, Mel Gorman wrote:

> > Note that the 16kb page size has a major impact on SLUB performance. On
> > IA64 slub will use only 1/4th the locking overhead as on 4kb platforms.
>
> It'll be interesting to see the kernbench tests then with debugging
> disabled.

You can get a similar effect on 4kb platforms by specifying slub_min_order=2 on bootup. This means that we have to rely on your patches to allow higher-order allocs to work reliably, though. The higher the order of slub, the less the locking overhead. So the better your patches deal with fragmentation, the more we can reduce locking overhead in slub.
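(To put rough numbers on the locking claim: a slab is handed to a CPU under the list lock and then consumed without it, so the lock is taken roughly once per slab rather than once per object. A small standalone C sketch of the arithmetic, with illustrative sizes:)

#include <stdio.h>

/* Illustrative only: objects per slab for a given page order, i.e. how
 * many allocations are served per acquisition of the list lock. */
int main(void)
{
        unsigned long page_size = 4096;   /* assume 4kb base pages */
        unsigned long obj_size = 64;      /* illustrative object size */
        int order;

        for (order = 0; order <= 2; order++) {
                unsigned long objects = (page_size << order) / obj_size;
                printf("order %d: %lu objects per lock acquisition\n",
                       order, objects);
        }
        return 0;
}

Under these assumptions, slub_min_order=2 quadruples the objects per slab on a 4kb-page machine, which matches the "1/4th the locking overhead" figure quoted for 16kb IA64 pages.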
Re: [SLUB 0/3] SLUB: The unqueued slab allocator V4
Note that I am amazed that the kernbench even worked. On small machines I seem to be getting into trouble with order-1 allocations. SLAB seems to be able to avoid the situation by keeping higher-order pages on a freelist, reducing the allocs/frees of higher-order pages that the page allocator has to deal with.

Maybe we need per-order queues in the page allocator? There must be something fundamentally wrong in the page allocator if the SLAB queues fix this issue.

I was able to fix the issue in V5 by forcing SLUB to keep a minimum number of objects around regardless of the fit to a page-order page. Pass-through is deadly since the crappy page allocator cannot handle it. Higher-order page allocation failures can be avoided by using kmalloc. Yuck! Hopefully your patches fix that fundamental problem.
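(The V5 rule mentioned above can be sketched as below; the function name and the minimum-object threshold are illustrative, not the actual mm/slub.c code.)

#include <stdio.h>

/* Sketch: grow the slab order until a minimum number of objects fits,
 * but never beyond a maximum order.  Very large objects fall back to
 * roughly one object per slab at the maximum order. */
static int order_for_min_objects(unsigned long size, unsigned long min_objects,
                                 int max_order)
{
        unsigned long page_size = 4096;   /* assume 4kb base pages */
        int order;

        for (order = 0; order <= max_order; order++)
                if ((page_size << order) / size >= min_objects)
                        return order;
        return max_order;                 /* nothing fits: take the maximum */
}

int main(void)
{
        printf("%d\n", order_for_min_objects(64, 8, 3));    /* prints 0 */
        printf("%d\n", order_for_min_objects(1024, 8, 3));  /* prints 1 */
        printf("%d\n", order_for_min_objects(4096, 8, 3));  /* prints 3 */
        return 0;
}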
[SLUB 0/3] SLUB: The unqueued slab allocator V4
[PATCH] SLUB The unqueued slab allocator v4

V3->V4
- Rename /proc/slabinfo to /proc/slubinfo. We have a different format after all.
- More bug fixes and stabilization of diagnostic functions. This seems to be finally something that works wherever we test it.
- Serialize kmem_cache_create and kmem_cache_destroy via slub_lock (Adrian's idea)
- Add two new modifications (separate patches) to guarantee a minimum number of objects per slab and to pass through large allocations.

Note that SLUB will warn on zero-sized allocations. SLAB just allocates some memory. So some traces from the usb subsystem etc. should be expected. There are very likely also issues remaining in SLUB.

V2->V3
- Debugging and diagnostic support. This is runtime enabled and not compile time enabled. Runtime debugging can be controlled via kernel boot options on an individual slab cache basis or globally.
- Slab Trace support (for individual slab caches).
- Resiliency support: If basic sanity checks are enabled (via F f.e.) (boot option) then SLUB will do its best to perform diagnostics and then continue (i.e. mark corrupted objects as used).
- Fix up numerous issues including the clash of SLUB's use of page flags with i386 arch use for pmds and pgds (which are managed as slab caches, sigh).
- Dynamic per-CPU array sizing.
- Explain SLUB slabcache flags

V1->V2
- Fix up various issues. Tested on i386 UP, X86_64 SMP, ia64 NUMA.
- Provide NUMA support by splitting partial lists per node.
- Better slab cache merge support (now at around 50% of slabs)
- List slab cache aliases if slab caches are merged.
- Updated descriptions /proc/slabinfo output

This is a new slab allocator which was motivated by the complexity of the existing code in mm/slab.c. It attempts to address a variety of concerns with the existing implementation.

A. Management of object queues

   A particular concern was the complex management of the numerous object queues in SLAB. SLUB has no such queues. Instead we dedicate a slab for each allocating CPU and use objects from a slab directly instead of queueing them up.

B. Storage overhead of object queues

   SLAB object queues exist per node, per CPU. The alien cache queue even has a queue array that contains a queue for each processor on each node. For very large systems the number of queues and the number of objects that may be caught in those queues grows exponentially. On our systems with 1k nodes / processors we have several gigabytes just tied up for storing references to objects for those queues. This does not include the objects that could be on those queues. One fears that the whole memory of the machine could one day be consumed by those queues.

C. SLAB meta data overhead

   SLAB has overhead at the beginning of each slab. This means that data cannot be naturally aligned at the beginning of a slab block. SLUB keeps all meta data in the corresponding page struct. Objects can be naturally aligned in the slab. F.e. a 128 byte object will be aligned at 128 byte boundaries and can fit tightly into a 4k page with no bytes left over. SLAB cannot do this.

D. SLAB has a complex cache reaper

   SLUB does not need a cache reaper for UP systems. On SMP systems the per-CPU slab may be pushed back into the partial list, but that operation is simple and does not require an iteration over a list of objects. SLAB expires per-CPU, shared and alien object queues during cache reaping, which may cause strange hold-offs.

E. SLAB has complex NUMA policy layer support

   SLUB pushes NUMA policy handling into the page allocator.
   This means that allocation is coarser (SLUB does interleave on a page level) but that situation was also present before 2.6.13. SLAB's application of policies to individual slab objects is certainly a performance concern due to the frequent references to memory policies, which may lead a sequence of objects to come from one node after another. SLUB will get a slab full of objects from one node and then will switch to the next.

F. Reduction of the size of partial slab lists

   SLAB has per-node partial lists. This means that over time a large number of partial slabs may accumulate on those lists. These can only be reused if allocations occur on specific nodes. SLUB has a global pool of partial slabs and will consume slabs from that pool to decrease fragmentation.

G. Tunables

   SLAB has sophisticated tuning abilities for each slab cache. One can manipulate the queue sizes in detail. However, filling the queues still requires the use of the spin lock to check out slabs. SLUB has a global parameter (min_slab_order) for tuning. Increasing the minimum slab order can decrease the locking overhead. The bigger the slab order, the fewer movements of pages between per-CPU and partial lists, and the better SLUB will scale.

H. Slab merging

   We often have slab caches with similar parameters. SLUB detects those on bootup and merges them into the corresponding general caches. This leads to more effective memory use.
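(H above can be pictured as a key comparison at cache-creation time. A rough standalone sketch; the struct and rules here are invented for illustration and are much simpler than the real merge logic.)

#include <stdbool.h>
#include <stddef.h>

/* Invented descriptor for the sketch: the fields that must agree
 * before two caches may share slabs. */
struct cache_desc {
        size_t size;              /* object size after padding */
        size_t align;             /* required alignment */
        unsigned long flags;      /* debug and alignment flags */
        void (*ctor)(void *);     /* constructor, if any */
};

/* Caches with constructors keep their identity; otherwise caches that
 * agree on size, alignment and flags can be merged into one. */
static bool mergeable(const struct cache_desc *a, const struct cache_desc *b)
{
        if (a->ctor || b->ctor)
                return false;
        return a->size == b->size && a->align == b->align &&
               a->flags == b->flags;
}

int main(void)
{
        struct cache_desc a = { 192, 8, 0, NULL };   /* e.g. a 192-byte cache */
        struct cache_desc b = { 192, 8, 0, NULL };   /* same shape: merged */

        return mergeable(&a, &b) ? 0 : 1;
}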
Re: [PATCH] SLUB The unqueued slab allocator V3
From: Christoph Lameter <[EMAIL PROTECTED]>
Date: Wed, 28 Feb 2007 17:06:19 -0800 (PST)

> On Wed, 28 Feb 2007, David Miller wrote:
>
> > Arguably SLAB_HWCACHE_ALIGN and SLAB_MUST_HWCACHE_ALIGN should
> > not be set here, but SLUBs change in semantics in this area
> > could cause similar grief in other areas, an audit is probably
> > in order.
> >
> > The above example was from sparc64, but x86 does the same thing
> > as probably do other platforms which use SLAB for pagetables.
>
> Maybe this will address these concerns?
>
> Index: linux-2.6.21-rc2/mm/slub.c
> ===================================================================
> --- linux-2.6.21-rc2.orig/mm/slub.c	2007-02-28 16:54:23.000000000 -0800
> +++ linux-2.6.21-rc2/mm/slub.c	2007-02-28 17:03:54.000000000 -0800
> @@ -1229,8 +1229,10 @@ static int calculate_order(int size)
>  static unsigned long calculate_alignment(unsigned long flags,
>  		unsigned long align)
>  {
> -	if (flags & (SLAB_MUST_HWCACHE_ALIGN|SLAB_HWCACHE_ALIGN))
> +	if (flags & SLAB_HWCACHE_ALIGN)
>  		return L1_CACHE_BYTES;
> +	if (flags & SLAB_MUST_HWCACHE_ALIGN)
> +		return max(align, (unsigned long)L1_CACHE_BYTES);
>
>  	if (align < ARCH_SLAB_MINALIGN)
>  		return ARCH_SLAB_MINALIGN;

It would achieve parity with existing SLAB behavior, sure.
Re: [PATCH] SLUB The unqueued slab allocator V3
On Wed, 28 Feb 2007, David Miller wrote:

> Maybe if you managed your individual changes in GIT or similar
> this could be debugged very quickly. :-)

I think once things calm down and the changes become smaller it's going to be easier. Likely the case after V4.

> Meanwhile I noticed that your alignment algorithm is different
> than SLAB's. And I think this is important for the page table
> SLABs that some platforms use.

Ok.

> No matter what flags are specified, SLAB gives at least the
> passed in alignment specified in kmem_cache_create(). That
> logic in slab is here:
>
> 	/* 3) caller mandated alignment */
> 	if (ralign < align) {
> 		ralign = align;
> 	}

Hmmm... Right.

> Whereas SLUB uses the CPU cacheline size when the MUSTALIGN
> flag is set. Architectures do things like:
>
> 	pgtable_cache = kmem_cache_create("pgtable_cache",
> 					  PAGE_SIZE, PAGE_SIZE,
> 					  SLAB_HWCACHE_ALIGN |
> 					  SLAB_MUST_HWCACHE_ALIGN,
> 					  zero_ctor,
> 					  NULL);
>
> to get a PAGE_SIZE aligned slab, SLUB doesn't give the same
> behavior SLAB does in this case.

SLUB only supports this by passing through allocations to the page allocator since it does not maintain queues. So the above will cause the pgtable_cache to use the caches of the page allocator. The queueing effect that you get from SLAB is not present in SLUB since it does not provide them.

If SLUB is to be used this way then we need to have higher order page sizes and allocate chunks from the higher order page for the pgtable_cache. There are other ways of doing it. IA64 f.e. uses a linked list to accomplish the same, avoiding SLAB overhead.

> Arguably SLAB_HWCACHE_ALIGN and SLAB_MUST_HWCACHE_ALIGN should
> not be set here, but SLUBs change in semantics in this area
> could cause similar grief in other areas, an audit is probably
> in order.
>
> The above example was from sparc64, but x86 does the same thing
> as probably do other platforms which use SLAB for pagetables.

Maybe this will address these concerns?

Index: linux-2.6.21-rc2/mm/slub.c
===================================================================
--- linux-2.6.21-rc2.orig/mm/slub.c	2007-02-28 16:54:23.000000000 -0800
+++ linux-2.6.21-rc2/mm/slub.c	2007-02-28 17:03:54.000000000 -0800
@@ -1229,8 +1229,10 @@ static int calculate_order(int size)
 static unsigned long calculate_alignment(unsigned long flags,
 		unsigned long align)
 {
-	if (flags & (SLAB_MUST_HWCACHE_ALIGN|SLAB_HWCACHE_ALIGN))
+	if (flags & SLAB_HWCACHE_ALIGN)
 		return L1_CACHE_BYTES;
+	if (flags & SLAB_MUST_HWCACHE_ALIGN)
+		return max(align, (unsigned long)L1_CACHE_BYTES);
 
 	if (align < ARCH_SLAB_MINALIGN)
 		return ARCH_SLAB_MINALIGN;
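(For readers who want to see the semantic change in isolation, a compilable paraphrase of the patched function follows. The flag values and the cache line size are stand-ins, and the check order mirrors the patch exactly.)

#include <stdio.h>

#define SLAB_HWCACHE_ALIGN       0x1UL   /* stand-in flag values for the sketch */
#define SLAB_MUST_HWCACHE_ALIGN  0x2UL
#define L1_CACHE_BYTES           64UL    /* illustrative cache line size */
#define ARCH_SLAB_MINALIGN       8UL

/* Paraphrase of calculate_alignment() as patched above. */
static unsigned long calculate_alignment(unsigned long flags, unsigned long align)
{
        if (flags & SLAB_HWCACHE_ALIGN)
                return L1_CACHE_BYTES;
        if (flags & SLAB_MUST_HWCACHE_ALIGN)
                return align > L1_CACHE_BYTES ? align : L1_CACHE_BYTES;
        if (align < ARCH_SLAB_MINALIGN)
                return ARCH_SLAB_MINALIGN;
        return align;
}

int main(void)
{
        /* A caller asking for PAGE_SIZE alignment with MUST_HWCACHE_ALIGN
         * now keeps its 4096-byte alignment; the old code returned
         * L1_CACHE_BYTES (64 here) for this case. */
        printf("%lu\n", calculate_alignment(SLAB_MUST_HWCACHE_ALIGN, 4096));
        return 0;
}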
Re: [PATCH] SLUB The unqueued slab allocator V3
From: David Miller <[EMAIL PROTECTED]>
Date: Wed, 28 Feb 2007 14:00:22 -0800 (PST)

> V3 doesn't boot successfully on sparc64

False alarm! This crash was actually due to an unrelated problem in the parport_pc driver on my machine.

Slub v3 boots up and seems to work fine so far on sparc64.
Re: [PATCH] SLUB The unqueued slab allocator V3
From: Christoph Lameter <[EMAIL PROTECTED]>
Date: Wed, 28 Feb 2007 11:20:44 -0800 (PST)

> V2->V3
> - Debugging and diagnostic support. This is runtime enabled and not compile
>   time enabled. Runtime debugging can be controlled via kernel boot options
>   on an individual slab cache basis or globally.
> - Slab Trace support (For individual slab caches).
> - Resiliency support: If basic sanity checks are enabled (via F f.e.)
>   (boot option) then SLUB will do the best to perform diagnostics and
>   then continue (i.e. mark corrupted objects as used).
> - Fix up numerous issues including clash of SLUBs use of page
>   flags with i386 arch use for pmd and pgds (which are managed
>   as slab caches, sigh).
> - Dynamic per CPU array sizing.
> - Explain SLUB slabcache flags

V3 doesn't boot successfully on sparc64, sorry. I don't have the ability to track this down at the moment since it resets the machine right as the video device is initialized, and after diffing V2 to V3 there is way too much stuff changing for me to try and "bisect" between V2 and V3 to find the guilty sub-change.

Maybe if you managed your individual changes in GIT or similar this could be debugged very quickly. :-)

Meanwhile I noticed that your alignment algorithm is different than SLAB's. And I think this is important for the page table SLABs that some platforms use.

No matter what flags are specified, SLAB gives at least the passed-in alignment specified in kmem_cache_create(). That logic in slab is here:

	/* 3) caller mandated alignment */
	if (ralign < align) {
		ralign = align;
	}

Whereas SLUB uses the CPU cacheline size when the MUSTALIGN flag is set. Architectures do things like:

	pgtable_cache = kmem_cache_create("pgtable_cache",
					  PAGE_SIZE, PAGE_SIZE,
					  SLAB_HWCACHE_ALIGN |
					  SLAB_MUST_HWCACHE_ALIGN,
					  zero_ctor,
					  NULL);

to get a PAGE_SIZE aligned slab. SLUB doesn't give the same behavior SLAB does in this case.

Arguably SLAB_HWCACHE_ALIGN and SLAB_MUST_HWCACHE_ALIGN should not be set here, but SLUB's change in semantics in this area could cause similar grief in other areas; an audit is probably in order.

The above example was from sparc64, but x86 does the same thing, as probably do other platforms which use SLAB for pagetables.
Re: SLUB: The unqueued Slab allocator
On Sat, 24 February 2007 16:14:48 -0800, Christoph Lameter wrote:
>
> It eliminates 50% of the slab caches. Thus it reduces the management
> overhead by half.

How much management overhead is there left with SLUB? Is it just the one per-node slab? Is there runtime overhead as well?

In a slightly different approach, can we possibly get rid of some slab caches, instead of merging them at boot time? On my system I have 97 slab caches right now, ignoring the generic kmalloc() ones. Of those, 28 are completely empty, 23 contain <=10 objects, 23 <=100 and 23 contain >100 objects.

It is fairly obvious to me that the highly populated slab caches are a big win. But is it worth it to have slab caches with a single object inside? Maybe some of these caches are populated for some systems. But there could also be candidates for removal among them.

# active_objs num_objs name
     0      0 dm-crypt_io
     0      0 dm_io
     0      0 dm_tio
     0      0 ext3_xattr
     0      0 fat_cache
     0      0 fat_inode_cache
     0      0 flow_cache
     0      0 inet_peer_cache
     0      0 ip_conntrack_expect
     0      0 ip_mrt_cache
     0      0 isofs_inode_cache
     0      0 jbd_1k
     0      0 jbd_4k
     0      0 kiocb
     0      0 kioctx
     0      0 nfs_inode_cache
     0      0 nfs_page
     0      0 posix_timers_cache
     0      0 request_sock_TCP
     0      0 revoke_record
     0      0 rpc_inode_cache
     0      0 scsi_io_context
     0      0 secpath_cache
     0      0 skbuff_fclone_cache
     0      0 tw_sock_TCP
     0      0 udf_inode_cache
     0      0 uhci_urb_priv
     0      0 xfrm_dst_cache
     1    169 dnotify_cache
     1     30 arp_cache
     1      7 mqueue_inode_cache
     2    101 eventpoll_pwq
     2    203 fasync_cache
     2    254 revoke_table
     2     30 eventpoll_epi
     2      9 RAW
     4     17 ip_conntrack
     7     10 biovec-128
     7     10 biovec-64
     7     20 biovec-16
     7     42 file_lock_cache
     7     59 biovec-4
     7     59 uid_cache
     7      8 biovec-256
     7      9 bdev_cache
     8    127 inotify_event_cache
     8     20 rpc_tasks
     8      8 rpc_buffers
    10    113 ip_fib_alias
    10    113 ip_fib_hash
    10     12 blkdev_queue
    11    203 biovec-1
    11     22 blkdev_requests
    13     92 inotify_watch_cache
    16    169 journal_handle
    16    203 tcp_bind_bucket
    16     72 journal_head
    18     18 UDP
    19     19 names_cache
    19     28 TCP
    22     30 mnt_cache
    27     27 sigqueue
    27     60 ip_dst_cache
    32     32 sgpool-128
    32     32 sgpool-32
    32     32 sgpool-64
    32     36 nfs_read_data
    32     45 sgpool-16
    32     60 sgpool-8
    36     42 nfs_write_data
    72     80 cfq_pool
    74    127 blkdev_ioc
    78     92 cfq_ioc_pool
    94     94 pgd
   107    113 fs_cache
   108    108 mm_struct
   108    140 files_cache
   123    123 sighand_cache
   125    140 UNIX
   130    130 signal_cache
   147    147 task_struct
   154    174 idr_layer_cache
   158    404 pid
   190    190 sock_inode_cache
   260    295 bio
   273    273 proc_inode_cache
   840    920 skbuff_head_cache
  1234   1326 inode_cache
  1507   1510 shmem_inode_cache
  2871   3051 anon_vma
  2910   3360 filp
  5161   5292 sysfs_dir_cache
  5762   6164 vm_area_struct
 12056  19446 radix_tree_node
 65776 151272 buffer_head
578304 578304 ext3_inode_cache
677490 677490 dentry_cache

Jörn

-- 
And spam is a useful source of entropy for /dev/random too!
-- Jasmine Strong
Re: SLUB: The unqueued Slab allocator
From: Christoph Lameter <[EMAIL PROTECTED]>
Date: Sat, 24 Feb 2007 09:32:49 -0800 (PST)

> On Fri, 23 Feb 2007, David Miller wrote:
>
> > I also agree with Andi in that merging could mess up how object type
> > local lifetimes help reduce fragmentation in object pools.
>
> If that is a problem for particular object pools then we may be able to
> except those from the merging.

If it is a problem, it's going to be a problem "in general" and not for specific SLAB caches.

I think this is really a very unwise idea. We have enough fragmentation problems as it is.
Re: SLUB: The unqueued Slab allocator
On Sat, 24 Feb 2007, Jörn Engel wrote:

> How much of a gain is the merging anyway? Once you start having
> explicit whitelists or blacklists of pools that can be merged, one can
> start to wonder if the result is worth the effort.

It eliminates 50% of the slab caches. Thus it reduces the management overhead by half.
Re: SLUB: The unqueued Slab allocator
On Sat, 24 February 2007 09:32:49 -0800, Christoph Lameter wrote:
>
> If that is a problem for particular object pools then we may be able to
> except those from the merging.

How much of a gain is the merging anyway? Once you start having explicit whitelists or blacklists of pools that can be merged, one can start to wonder if the result is worth the effort.

Jörn

-- 
Joern's library part 6: http://www.gzip.org/zlib/feldspar.html
Re: SLUB: The unqueued Slab allocator
On Fri, 23 Feb 2007, David Miller wrote:

> > The general caches already merge lots of users depending on their sizes.
> > So we already have the situation and we have tools to deal with it.
>
> But this doesn't happen for things like biovecs, and that will
> make debugging painful.
>
> If a crash happens because of a corrupted biovec-256 I want to know
> it was a biovec not some anonymous clone of kmalloc256.
>
> Please provide at a minimum a way to turn the merging off.

Ok. It's currently a compile-time option. Will make it possible to specify a boot option.

> I also agree with Andi in that merging could mess up how object type
> local lifetimes help reduce fragmentation in object pools.

If that is a problem for particular object pools then we may be able to except those from the merging.
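(For the record, a kernel-style sketch of what such a boot option could look like. The option name slub_nomerge is an assumption about the eventual interface, not a quote from any patch here.)

#include <linux/init.h>

/* Sketch, assumed interface: a boot flag that makes cache creation
 * skip the merge path entirely. */
static int slub_nomerge;        /* set from the kernel command line */

static int __init setup_slub_nomerge(char *str)
{
        slub_nomerge = 1;
        return 1;               /* parameter consumed */
}
__setup("slub_nomerge", setup_slub_nomerge);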
Re: SLUB: The unqueued Slab allocator
From: Christoph Lameter <[EMAIL PROTECTED]>
Date: Fri, 23 Feb 2007 21:47:36 -0800 (PST)

> On Sat, 24 Feb 2007, KAMEZAWA Hiroyuki wrote:
>
> > From a viewpoint of a crash dump user, this merging will make crash dump
> > investigation very very very difficult.
>
> The general caches already merge lots of users depending on their sizes.
> So we already have the situation and we have tools to deal with it.

But this doesn't happen for things like biovecs, and that will make debugging painful.

If a crash happens because of a corrupted biovec-256 I want to know it was a biovec not some anonymous clone of kmalloc256.

Please provide at a minimum a way to turn the merging off.

I also agree with Andi in that merging could mess up how object type local lifetimes help reduce fragmentation in object pools.
Re: SLUB: The unqueued Slab allocator
On Sat, 24 Feb 2007, KAMEZAWA Hiroyuki wrote:

> From a viewpoint of a crash dump user, this merging will make crash dump
> investigation very very very difficult.

The general caches already merge lots of users depending on their sizes. So we already have the situation and we have tools to deal with it.
Re: SLUB: The unqueued Slab allocator
On Thu, 22 Feb 2007 10:42:23 -0800 (PST)
Christoph Lameter <[EMAIL PROTECTED]> wrote:

> > > G. Slab merging
> > >
> > > We often have slab caches with similar parameters. SLUB detects those
> > > on bootup and merges them into the corresponding general caches. This
> > > leads to more effective memory use.
> >
> > Did you do any tests on what that does to long term memory fragmentation?
> > It is against the "object of same type have similar lifetime and should
> > be clustered together" theory at least.
>
> I have done no tests in that regard and we would have to assess the impact
> that the merging has on overall system behavior.

From a viewpoint of a crash dump user, this merging will make crash dump investigation very very very difficult. So please avoid this merging if the benefit is not big.

-Kame
Re: SLUB: The unqueued Slab allocator
On Fri, 23 Feb 2007, Andi Kleen wrote:

> If you don't cache constructed but free objects then there is no cache
> advantage to constructors/destructors and they would be useless.

SLUB caches those objects as long as they are part of a partially
allocated slab. If all objects in a slab are freed then the whole slab
is freed. SLUB does not keep queues of freed slabs.
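A minimal sketch of the usage under discussion, using the six-argument
kmem_cache_create() of this kernel generation; the object type, cache
name and init function below are made up for illustration:

struct my_obj {
	spinlock_t lock;		/* preserved across alloc/free cycles */
	struct list_head list;
};

/* Runs when a slab page is populated, not on every allocation. */
static void my_obj_ctor(void *obj, struct kmem_cache *cache,
			unsigned long flags)
{
	struct my_obj *o = obj;

	spin_lock_init(&o->lock);
	INIT_LIST_HEAD(&o->list);
}

static struct kmem_cache *my_cache;

static int __init my_cache_init(void)
{
	my_cache = kmem_cache_create("my_obj", sizeof(struct my_obj),
				     0, 0, my_obj_ctor, NULL);
	return my_cache ? 0 : -ENOMEM;
}

Per Christoph's reply, an object freed back to my_cache stays constructed
only while its slab remains partially allocated; once the slab empties,
the whole slab goes back to the page allocator.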
Re: SLUB: The unqueued Slab allocator
On Thu, Feb 22, 2007 at 10:42:23AM -0800, Christoph Lameter wrote:

> On Thu, 22 Feb 2007, Andi Kleen wrote:
>
> > > SLUB does not need a cache reaper for UP systems.
> >
> > This means constructors/destructors are becoming worthless?
> > Can you describe your rationale why you think they don't make
> > sense on UP?
>
> Cache reaping has nothing to do with constructors and destructors. SLUB
> fully supports constructors and destructors.

If you don't cache constructed but free objects then there is no cache
advantage to constructors/destructors and they would be useless.

-Andi
Re: SLUB: The unqueued Slab allocator
On Thu, 22 Feb 2007, Andi Kleen wrote:

> > SLUB does not need a cache reaper for UP systems.
>
> This means constructors/destructors are becoming worthless?
> Can you describe your rationale why you think they don't make
> sense on UP?

Cache reaping has nothing to do with constructors and destructors. SLUB
fully supports constructors and destructors.

> > G. Slab merging
> >
> > We often have slab caches with similar parameters. SLUB detects those
> > on bootup and merges them into the corresponding general caches. This
> > leads to more effective memory use.
>
> Did you do any tests on what that does to long term memory fragmentation?
> It is against the "objects of the same type have similar lifetimes and
> should be clustered together" theory at least.

I have done no tests in that regard and we would have to assess the impact
that the merging has on overall system behavior.
Re: SLUB: The unqueued Slab allocator
Christoph Lameter <[EMAIL PROTECTED]> writes:

> This is a new slab allocator which was motivated by the complexity of the
> existing code in mm/slab.c. It attempts to address a variety of concerns
> with the existing implementation.

Thanks for doing that work. It certainly was long overdue.

> D. SLAB has a complex cache reaper
>
> SLUB does not need a cache reaper for UP systems.

This means constructors/destructors are becoming worthless?
Can you describe your rationale why you think they don't make
sense on UP?

> G. Slab merging
>
> We often have slab caches with similar parameters. SLUB detects those
> on bootup and merges them into the corresponding general caches. This
> leads to more effective memory use.

Did you do any tests on what that does to long term memory fragmentation?
It is against the "objects of the same type have similar lifetimes and
should be clustered together" theory at least.

-Andi
Re: SLUB: The unqueued Slab allocator
On Thu, 22 Feb 2007, David Miller wrote:

> All of that logic needs to be protected by CONFIG_ZONE_DMA too.

Right. Will fix that in the next release.
Re: SLUB: The unqueued Slab allocator
On Thu, 22 Feb 2007, Peter Zijlstra wrote:

> On Wed, 2007-02-21 at 23:00 -0800, Christoph Lameter wrote:
>
> > +/*
> > + * Lock order:
> > + *   1. slab_lock(page)
> > + *   2. slab->list_lock
> > + */
>
> That seems to contradict this:

This is a trylock. If it fails then we can compensate by allocating a
new slab.
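The pattern Christoph describes, sketched in C: get_partial() is the
function quoted in Peter's mail and only ever uses slab_trylock() while
holding list_lock; the wrapper and the new_slab() fallback named here are
an assumption about the surrounding allocation path, not code from the
patch.

/*
 * Trylock with fallback: because get_partial() never blocks on a
 * slab lock while holding list_lock, the inverted lock order cannot
 * deadlock; a busy partial slab is simply skipped.
 */
static struct page *get_partial_or_new(struct kmem_cache *s,
					gfp_t flags, int node)
{
	struct page *page = get_partial(s, flags, node);

	if (page)
		return page;

	/* All partial slabs were locked elsewhere (or none existed):
	 * compensate by allocating a fresh slab. */
	return new_slab(s, flags, node);
}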
Re: SLUB: The unqueued Slab allocator
On Thu, 22 Feb 2007, Pekka Enberg wrote:

> On 2/22/07, Christoph Lameter <[EMAIL PROTECTED]> wrote:
> > This is a new slab allocator which was motivated by the complexity of the
> > existing code in mm/slab.c. It attempts to address a variety of concerns
> > with the existing implementation.
>
> So do you want to add a new allocator or replace slab?

Add. The performance and quality is not comparable to SLAB at this point.

> On 2/22/07, Christoph Lameter <[EMAIL PROTECTED]> wrote:
> > B. Storage overhead of object queues
>
> Does this make sense for non-NUMA too? If not, can we disable the
> queues for NUMA in current slab?

Given the locking scheme in the current slab you cannot do that.
Otherwise a single lock would be taken for every operation, limiting
performance.

> On 2/22/07, Christoph Lameter <[EMAIL PROTECTED]> wrote:
> > C. SLAB metadata overhead
>
> Can be done for the current slab code too, no?

The per-slab metadata of SLAB does not fit into the page struct.
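To picture the last point: the per-slab state SLUB needs is only a few
words, small enough to overlay onto existing struct page fields, while
SLAB's per-slab metadata (free lists, colouring, a bufctl array) is not.
The structure below is purely illustrative, not a type from the patch:

/* Illustrative only: roughly the per-slab state SLUB keeps in the
 * page struct (the patch overlays existing struct page fields rather
 * than defining a new type). */
struct slub_page_state {
	void *freelist;		/* first free object in this slab */
	unsigned int inuse;	/* number of allocated objects */
	unsigned int offset;	/* offset of the free pointer in an object */
};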
Re: SLUB: The unqueued Slab allocator
Hi Christoph,

On 2/22/07, Christoph Lameter <[EMAIL PROTECTED]> wrote:
> This is a new slab allocator which was motivated by the complexity of the
> existing code in mm/slab.c. It attempts to address a variety of concerns
> with the existing implementation.

So do you want to add a new allocator or replace slab?

On 2/22/07, Christoph Lameter <[EMAIL PROTECTED]> wrote:
> B. Storage overhead of object queues

Does this make sense for non-NUMA too? If not, can we disable the
queues for NUMA in current slab?

On 2/22/07, Christoph Lameter <[EMAIL PROTECTED]> wrote:
> C. SLAB metadata overhead

Can be done for the current slab code too, no?

Pekka
Re: SLUB: The unqueued Slab allocator
From: Christoph Lameter <[EMAIL PROTECTED]>
Date: Wed, 21 Feb 2007 23:00:30 -0800 (PST)

> +#ifdef CONFIG_ZONE_DMA
> +static struct kmem_cache *kmalloc_caches_dma[KMALLOC_NR_CACHES];
> +#endif

Therefore.

> +static struct kmem_cache *get_slab(size_t size, gfp_t flags)
> +{
 ...
> +	s = kmalloc_caches_dma[index];
> +	if (s)
> +		return s;
> +
> +	/* Dynamically create dma cache */
> +	x = kmalloc(sizeof(struct kmem_cache), flags & ~(__GFP_DMA));
> +
> +	if (!x)
> +		panic("Unable to allocate memory for dma cache\n");
> +
> +#ifdef KMALLOC_EXTRA
> +	if (index <= KMALLOC_SHIFT_HIGH - KMALLOC_SHIFT_LOW)
> +#endif
> +		realsize = 1 << index;
> +#ifdef KMALLOC_EXTRA
> +	else if (index == KMALLOC_EXTRAS)
> +		realsize = 96;
> +	else
> +		realsize = 192;
> +#endif
> +
> +	s = create_kmalloc_cache(x, "kmalloc_dma", realsize);
> +	kmalloc_caches_dma[index] = s;
> +	return s;
> +}

All of that logic needs to be protected by CONFIG_ZONE_DMA too.

I noticed this due to a build failure on sparc64 with this patch.
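The fix Christoph promises earlier in the thread amounts to compiling the
whole DMA path out when there is no DMA zone, roughly as below. This is a
sketch against the quoted patch: it omits the KMALLOC_EXTRA special
cases, and kmalloc_index() plus the shape of the kmalloc_caches array are
assumptions.

#ifdef CONFIG_ZONE_DMA
static struct kmem_cache *kmalloc_caches_dma[KMALLOC_NR_CACHES];

/* Create the DMA kmalloc cache for this size index on first use. */
static struct kmem_cache *dma_kmalloc_cache(int index, gfp_t flags)
{
	struct kmem_cache *s = kmalloc_caches_dma[index];
	struct kmem_cache *x;

	if (s)
		return s;

	x = kmalloc(sizeof(struct kmem_cache), flags & ~__GFP_DMA);
	if (!x)
		panic("Unable to allocate memory for dma cache\n");

	s = create_kmalloc_cache(x, "kmalloc_dma", 1 << index);
	kmalloc_caches_dma[index] = s;
	return s;
}
#endif

static struct kmem_cache *get_slab(size_t size, gfp_t flags)
{
	int index = kmalloc_index(size);	/* assumed size-to-index helper */

#ifdef CONFIG_ZONE_DMA
	if (unlikely(flags & __GFP_DMA))
		return dma_kmalloc_cache(index, flags);
#endif
	return &kmalloc_caches[index];
}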
Re: SLUB: The unqueued Slab allocator
On Wed, 2007-02-21 at 23:00 -0800, Christoph Lameter wrote:

> +/*
> + * Lock order:
> + *   1. slab_lock(page)
> + *   2. slab->list_lock
> + */

That seems to contradict this:

> +/*
> + * Lock page and remove it from the partial list
> + *
> + * Must hold list_lock
> + */
> +static __always_inline int lock_and_del_slab(struct kmem_cache *s,
> +						struct page *page)
> +{
> +	if (slab_trylock(page)) {
> +		list_del(&page->lru);
> +		s->nr_partial--;
> +		return 1;
> +	}
> +	return 0;
> +}
> +
> +/*
> + * Get a partial page, lock it and return it.
> + */
> +#ifdef CONFIG_NUMA
> +static struct page *get_partial(struct kmem_cache *s, gfp_t flags, int node)
> +{
> +	struct page *page;
> +	int searchnode = (node == -1) ? numa_node_id() : node;
> +
> +	if (!s->nr_partial)
> +		return NULL;
> +
> +	spin_lock(&s->list_lock);
> +	/*
> +	 * Search for slab on the right node
> +	 */
> +	list_for_each_entry(page, &s->partial, lru)
> +		if (likely(page_to_nid(page) == searchnode) &&
> +				lock_and_del_slab(s, page))
> +			goto out;
> +
> +	if (likely(!(flags & __GFP_THISNODE))) {
> +		/*
> +		 * We can fall back to any other node in order to
> +		 * reduce the size of the partial list.
> +		 */
> +		list_for_each_entry(page, &s->partial, lru)
> +			if (likely(lock_and_del_slab(s, page)))
> +				goto out;
> +	}
> +
> +	/* Nothing found */
> +	page = NULL;
> +out:
> +	spin_unlock(&s->list_lock);
> +	return page;
> +}
> +#else
> +static struct page *get_partial(struct kmem_cache *s, gfp_t flags, int node)
> +{
> +	struct page *page;
> +
> +	/*
> +	 * Racy check. If we mistakenly see no partial slabs then we
> +	 * just allocate an empty slab.
> +	 */
> +	if (!s->nr_partial)
> +		return NULL;
> +
> +	spin_lock(&s->list_lock);
> +	list_for_each_entry(page, &s->partial, lru)
> +		if (likely(lock_and_del_slab(s, page)))
> +			goto out;
> +
> +	/* No slab or all slabs busy */
> +	page = NULL;
> +out:
> +	spin_unlock(&s->list_lock);
> +	return page;
> +}
> +#endif
SLUB: The unqueued Slab allocator
This is a new slab allocator which was motivated by the complexity of the
existing code in mm/slab.c. It attempts to address a variety of concerns
with the existing implementation.

A. Management of object queues

A particular concern was the complex management of the numerous object
queues in SLAB. SLUB has no such queues. Instead we dedicate a slab to
each allocating CPU and use objects from that slab directly instead of
queueing them up.

B. Storage overhead of object queues

SLAB object queues exist per node, per CPU. The alien cache queue even
has a queue array that contains a queue for each processor on each node.
For very large systems the number of queues, and the number of objects
that may be caught in those queues, grows exponentially. On our systems
with 1k nodes/processors we have several gigabytes tied up just for
storing references to objects in those queues. This does not include the
objects themselves. One fears that the whole memory of the machine could
one day be consumed by those queues.

C. SLAB metadata overhead

SLAB has overhead at the beginning of each slab. This means that data
cannot be naturally aligned at the beginning of a slab block. SLUB keeps
all metadata in the corresponding page struct. Objects can be naturally
aligned in the slab. F.e. a 128 byte object will be aligned at 128 byte
boundaries and can fit tightly into a 4k page with no bytes left over.
SLAB cannot do this.

D. SLAB has a complex cache reaper

SLUB does not need a cache reaper for UP systems. On SMP systems the per
CPU slab may be pushed back onto the partial list, but that operation is
simple and does not require an iteration over a list of objects. SLAB
expires per CPU, shared and alien object queues during cache reaping,
which may cause strange holdoffs.

E. SLAB has complex NUMA policy layer support

SLUB pushes NUMA policy handling into the page allocator. This means that
allocation is coarser (SLUB does interleave on a page level), but that
situation was also present before 2.6.13. SLAB's application of policies
to individual slab objects is certainly a performance concern, due to the
frequent references to memory policies, and may lead a sequence of
objects to come from one node after another. SLUB will get a slab full of
objects from one node and then switch to the next.

F. Reduction of the size of partial slab lists

SLAB has per node partial lists. This means that over time a large number
of partial slabs may accumulate on those lists. These can only be reused
if allocations occur on the specific nodes. SLUB has a global pool of
partial slabs and will consume slabs from that pool to decrease
fragmentation.

G. Tunables

SLAB has sophisticated tuning abilities for each slab cache. One can
manipulate the queue sizes in detail. However, filling the queues still
requires the use of a spinlock to check out slabs. SLUB has a global
parameter (min_slab_order) for tuning. Increasing the minimum slab order
can decrease the locking overhead: the bigger the slab order, the less
page motion between per CPU and partial lists occurs, and the better
SLUB will scale.

G. Slab merging

We often have slab caches with similar parameters. SLUB detects those on
bootup and merges them into the corresponding general caches. This leads
to more effective memory use.

The patch here is only the core portion. There are various add-ons that
may become ready later when this one has matured a bit. SLUB should be
fine for UP and SMP.
No NUMA optimizations have been done so far, so it works but does not yet
scale to high processor and node counts.

To use SLUB: Apply this patch and then select SLUB as the default slab
allocator. The output of /proc/slabinfo will then change. Here is a
sample (this is the UP/SMP format; the NUMA display will also show on
which nodes the slabs were allocated):

slubinfo - version: 1.0
# name            objects order objsize slabs/partial/cpu flags
radix_tree_node      5574     0     560           797/0/1 CP
bdev_cache              5     0     768             2/1/1 CSrPa
sysfs_dir_cache      5946     0      80           117/0/1
inode_cache          2690     0     536           386/3/1 CSrP
dentry_cache         7735     0     192           369/1/1 SrP
idr_layer_cache        79     0     536            12/0/1 C
buffer_head          5427     0     112           151/0/1 CSrP
mm_struct              37     1     832             6/5/1 Pa
vm_area_struct       1734     0     168            73/3/1 P
files_cache            37     0     640             8/6/1 Pa
signal_cache           63     0     640            12/4/1 Pa
sighand_cache          63     2    2112            11/4/1 CRPa
task_struct            75     2    1728            11/6/1 P
anon_vma              590     0      24             4/3/1 CRP
kmalloc-192           424     0     192            21/0/1
kmalloc-96           1150     0      96            28/3/1
kmalloc-262144
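To make the merging criterion in section G concrete, a compatibility test
might look roughly like this. The kmem_cache field names and the
SLUB_NEVER_MERGE mask are hypothetical, not taken from the patch; the
SLAB_* flags are the standard debug flags:

/*
 * Hypothetical sketch: two caches may share slabs only if everything
 * that affects object layout and debugging matches.
 */
#define SLUB_NEVER_MERGE (SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER)

static int slab_mergeable(struct kmem_cache *a, struct kmem_cache *b)
{
	if (a->objsize != b->objsize || a->align != b->align)
		return 0;
	/* Debugging options change the object layout; never merge them. */
	if ((a->flags | b->flags) & SLUB_NEVER_MERGE)
		return 0;
	/* Constructed objects carry cache-specific state. */
	if (a->ctor || b->ctor || a->dtor || b->dtor)
		return 0;
	return 1;
}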