Re: [PATCH 2/3] slub: Support for array operations
On Fri, Feb 13, 2015 at 09:49:24AM -0600, Christoph Lameter wrote:
> On Fri, 13 Feb 2015, Joonsoo Kim wrote:
> > > +			*p++ = freelist;
> > > +			freelist = get_freepointer(s, freelist);
> > > +			allocated++;
> > > +		}
> >
> > Fetching all objects while holding the node lock could result in enormous
> > lock contention. How about getting the free object pointers without
> > holding the node lock? We could temporarily store all the freelist heads
> > in array p and then fetch each object pointer without holding the node
> > lock.
>
> Could do that, but let's first see if there is really an issue. The other
> cpus sharing the same partial lists presumably have cpu-local objects to
> get through first before they hit this lock.

Okay. Thanks.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/3] slub: Support for array operations
On Fri, 13 Feb 2015, Joonsoo Kim wrote:
> > +			*p++ = freelist;
> > +			freelist = get_freepointer(s, freelist);
> > +			allocated++;
> > +		}
>
> Fetching all objects while holding the node lock could result in enormous
> lock contention. How about getting the free object pointers without
> holding the node lock? We could temporarily store all the freelist heads
> in array p and then fetch each object pointer without holding the node
> lock.

Could do that, but let's first see if there is really an issue. The other
cpus sharing the same partial lists presumably have cpu-local objects to
get through first before they hit this lock.
Re: [PATCH 2/3] slub: Support for array operations
On Tue, Feb 10, 2015 at 01:48:06PM -0600, Christoph Lameter wrote:
> The major portions are there but there is no support yet for
> directly allocating per cpu objects. There could also be more
> sophisticated code to exploit the batch freeing.
>
> Signed-off-by: Christoph Lameter
>
> Index: linux/include/linux/slub_def.h
> ===================================================================
> --- linux.orig/include/linux/slub_def.h
> +++ linux/include/linux/slub_def.h
> @@ -110,4 +110,5 @@ static inline void sysfs_slab_remove(str
>  }
>  #endif
>
> +#define _HAVE_SLAB_ALLOCATOR_ARRAY_OPERATIONS
>  #endif /* _LINUX_SLUB_DEF_H */
> Index: linux/mm/slub.c
> ===================================================================
> --- linux.orig/mm/slub.c
> +++ linux/mm/slub.c
> @@ -1379,13 +1379,9 @@ static void setup_object(struct kmem_cac
>  	s->ctor(object);
>  }
>
> -static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
> +static struct page *__new_slab(struct kmem_cache *s, gfp_t flags, int node)
>  {
>  	struct page *page;
> -	void *start;
> -	void *p;
> -	int order;
> -	int idx;
>
>  	if (unlikely(flags & GFP_SLAB_BUG_MASK)) {
>  		pr_emerg("gfp: %u\n", flags & GFP_SLAB_BUG_MASK);
> @@ -1394,33 +1390,42 @@ static struct page *new_slab(struct kmem
>  	page = allocate_slab(s,
>  		flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
> -	if (!page)
> -		goto out;
> +	if (page) {
> +		inc_slabs_node(s, page_to_nid(page), page->objects);
> +		page->slab_cache = s;
> +		__SetPageSlab(page);
> +		if (page->pfmemalloc)
> +			SetPageSlabPfmemalloc(page);
> +	}
>
> -	order = compound_order(page);
> -	inc_slabs_node(s, page_to_nid(page), page->objects);
> -	page->slab_cache = s;
> -	__SetPageSlab(page);
> -	if (page->pfmemalloc)
> -		SetPageSlabPfmemalloc(page);
> -
> -	start = page_address(page);
> -
> -	if (unlikely(s->flags & SLAB_POISON))
> -		memset(start, POISON_INUSE, PAGE_SIZE << order);
> -
> -	for_each_object_idx(p, idx, s, start, page->objects) {
> -		setup_object(s, page, p);
> -		if (likely(idx < page->objects))
> -			set_freepointer(s, p, p + s->size);
> -		else
> -			set_freepointer(s, p, NULL);
> -	}
> -
> -	page->freelist = start;
> -	page->inuse = page->objects;
> -	page->frozen = 1;
> -out:
> +	return page;
> +}
> +
> +static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
> +{
> +	struct page *page = __new_slab(s, flags, node);
> +
> +	if (page) {
> +		void *p;
> +		int idx;
> +		void *start = page_address(page);
> +
> +		if (unlikely(s->flags & SLAB_POISON))
> +			memset(start, POISON_INUSE,
> +				PAGE_SIZE << compound_order(page));

I'm not sure, but isn't this poisoning also needed for
slab_array_alloc_from_page_allocator()?

> +
> +		for_each_object_idx(p, idx, s, start, page->objects) {
> +			setup_object(s, page, p);
> +			if (likely(idx < page->objects))
> +				set_freepointer(s, p, p + s->size);
> +			else
> +				set_freepointer(s, p, NULL);
> +		}
> +
> +		page->freelist = start;
> +		page->inuse = page->objects;
> +		page->frozen = 1;
> +	}
>  	return page;
>  }
>
> @@ -2516,8 +2521,78 @@ EXPORT_SYMBOL(kmem_cache_alloc_node_trac
>  #endif
>  #endif
>
> +int slab_array_alloc_from_partial(struct kmem_cache *s,
> +			size_t nr, void **p)
> +{
> +	void **end = p + nr;
> +	struct kmem_cache_node *n = get_node(s, numa_mem_id());
> +	int allocated = 0;
> +	unsigned long flags;
> +	struct page *page, *page2;
> +
> +	if (!n->nr_partial)
> +		return 0;
> +
> +	spin_lock_irqsave(&n->list_lock, flags);
> +	list_for_each_entry_safe(page, page2, &n->partial, lru) {
> +		void *freelist;
> +
> +		if (page->objects - page->inuse > end - p)
> +			/* More objects free in page than we want */
> +			break;
> +		list_del(&page->lru);
> +		slab_lock(page);

slab_lock() doesn't protect the freelist if CONFIG_HAVE_CMPXCHG_DOUBLE
is enabled. You should use the cmpxchg_double_slab() machinery. And a
better solution would be to use acquire_slab() rather than
re-implementing freelist detachment.

> +		freelist = page->freelist;
> +		page->inuse = page->objects;
> +		page->freelist = NULL;
> +		slab_unlock(page);
> +		/* Grab all available objects */
> +		while (freelist) {
> +			*p++ = freelist;
> +			freelist = get_freepointer(s, freelist);
> +			allocated++;
> +		}

Fetching all objects while holding the node lock could result in enormous
lock contention. How about getting the free object pointers without
holding the node lock? We could temporarily store all the freelist heads
in array p and then fetch each object pointer without holding the node
lock.
Re: [PATCH 2/3] slub: Support for array operations
On Thu, 12 Feb 2015, Jesper Dangaard Brouer wrote: > Measured on my laptop CPU i7-2620M CPU @ 2.70GHz: > > * 12.775 ns - "clean" spin_lock_unlock > * 21.099 ns - irqsave variant spinlock > * 22.808 ns - "manual" irqsave before spin_lock > * 14.618 ns - "manual" local_irq_disable + spin_lock > > Reproducible via my github repo: > > https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_sample.c > > The clean spin_lock_unlock is 8.324 ns faster than irqsave variant. > The irqsave variant is actually faster than expected, as the measurement > of an isolated local_irq_save_restore were 13.256 ns. I am using spin_lock_irq() in the current version on my system. If the performance of that is a problem then please optimize that function. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/3] slub: Support for array operations
On Wed, 11 Feb 2015 16:06:50 -0600 (CST) Christoph Lameter wrote:
> On Thu, 12 Feb 2015, Jesper Dangaard Brouer wrote:
> > > > This is quite an expensive lock with irqsave. [...]
> > > We can require that interrupts are off when the functions are
> > > called. Then we can avoid the "save" part?
> >
> > Yes, we could also do so with an "_irqoff" variant of the func call,
> > but given we are defining the API we can just require this from the
> > start.
>
> Allright. Let's do that then.

Okay. Some measurements to guide this choice.

Measured on my laptop CPU i7-2620M CPU @ 2.70GHz:

* 12.775 ns - "clean" spin_lock_unlock
* 21.099 ns - irqsave variant spinlock
* 22.808 ns - "manual" irqsave before spin_lock
* 14.618 ns - "manual" local_irq_disable + spin_lock

Reproducible via my github repo:
https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_sample.c

The clean spin_lock_unlock is 8.324 ns faster than the irqsave variant.
The irqsave variant is actually faster than expected, as the measurement
of an isolated local_irq_save_restore was 13.256 ns.

The difference to the "manual" irqsave is only 1.709 ns, which is approx
the cost of an extra function call.

If one can use the non-flags-save version of local_irq_disable, then one
can save 6.481 ns (on this specific CPU and kernel config
3.17.8-200.fc20.x86_64).

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Sr. Network Kernel Developer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer

https://github.com/netoptimizer/prototype-kernel/commit/1471ac60
Re: [PATCH 2/3] slub: Support for array operations
On Thu, 12 Feb 2015, Jesper Dangaard Brouer wrote:
> > > This is quite an expensive lock with irqsave.
> >
> > Yes but we take it for all partial pages.
>
> Sure, that is good, but this might be a contention point. In a micro
> benchmark, this contention should be visible, but in real use-cases the
> given subsystem also needs to spend time using these elements before
> requesting a new batch (as long as NIC cleanup cycles don't get too
> synchronized).

Yes, definitely it will be a contention point. There is no way around it.

> > Yup, the page access is shared but there is one per page. Contention is
> > unlikely.
>
> Yes, contention is unlikely, but every atomic operation is expensive.
> On my system the measured cost is 8 ns, and a lock/unlock does two, thus
> 16 ns. Which we then do per page freelist.

Not sure what we can do about this.

> > We can require that interrupts are off when the functions are called.
> > Then we can avoid the "save" part?
>
> Yes, we could also do so with an "_irqoff" variant of the func call,
> but given we are defining the API we can just require this from the
> start.

Allright. Let's do that then.
Re: [PATCH 2/3] slub: Support for array operations
On Wed, 11 Feb 2015 13:07:24 -0600 (CST) Christoph Lameter wrote:
> On Wed, 11 Feb 2015, Jesper Dangaard Brouer wrote:
> > > +
> > > +	spin_lock_irqsave(&n->list_lock, flags);
> >
> > This is quite an expensive lock with irqsave.
>
> Yes but we take it for all partial pages.

Sure, that is good, but this might be a contention point. In a micro
benchmark, this contention should be visible, but in real use-cases the
given subsystem also needs to spend time using these elements before
requesting a new batch (as long as NIC cleanup cycles don't get too
synchronized).

> > Yet another lock cost.
>
> Yup, the page access is shared but there is one per page. Contention is
> unlikely.

Yes, contention is unlikely, but every atomic operation is expensive.
On my system the measured cost is 8 ns, and a lock/unlock does two, thus
16 ns. Which we then do per page freelist.

> > > +	spin_unlock_irqrestore(&n->list_lock, flags);
> > > +	return allocated;
> >
> > I estimate (on my CPU) the locking cost itself is more than 32 ns, plus
> > the irqsave (which I've also found quite expensive, alone 14 ns). Thus,
> > estimated 46 ns. Single-elem slub fast path cost is 18-19 ns. Thus 3-4
> > elem bulking should be enough to amortize the cost, guess we are still
> > good :-)
>
> We can require that interrupts are off when the functions are called.
> Then we can avoid the "save" part?

Yes, we could also do so with an "_irqoff" variant of the func call,
but given we are defining the API we can just require this from the
start.

I plan to use this in softirq, where I know interrupts are on, but I can
use the less-expensive "non-save" variant local_irq_{disable,enable}.
Measurements show (x86_64 E5-2695):

* 2.860 ns cost for local_irq_{disable,enable}
* 14.840 ns cost for local_irq_save()+local_irq_restore()

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Sr. Network Kernel Developer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH 2/3] slub: Support for array operations
On Wed, 11 Feb 2015, Jesper Dangaard Brouer wrote:
> > +
> > +	spin_lock_irqsave(&n->list_lock, flags);
>
> This is quite an expensive lock with irqsave.

Yes but we take it for all partial pages.

> Yet another lock cost.

Yup, the page access is shared but there is one per page. Contention is
unlikely.

> > +	spin_unlock_irqrestore(&n->list_lock, flags);
> > +	return allocated;
>
> I estimate (on my CPU) the locking cost itself is more than 32 ns, plus
> the irqsave (which I've also found quite expensive, alone 14 ns). Thus,
> estimated 46 ns. Single-elem slub fast path cost is 18-19 ns. Thus 3-4
> elem bulking should be enough to amortize the cost, guess we are still
> good :-)

We can require that interrupts are off when the functions are called.
Then we can avoid the "save" part?
Re: [PATCH 2/3] slub: Support for array operations
On Tue, 10 Feb 2015 13:48:06 -0600 Christoph Lameter wrote:
> The major portions are there but there is no support yet for
> directly allocating per cpu objects. There could also be more
> sophisticated code to exploit the batch freeing.
>
> Signed-off-by: Christoph Lameter
[...]
> Index: linux/mm/slub.c
> ===================================================================
> --- linux.orig/mm/slub.c
> +++ linux/mm/slub.c
[...]
> @@ -2516,8 +2521,78 @@ EXPORT_SYMBOL(kmem_cache_alloc_node_trac
>  #endif
>  #endif
>
> +int slab_array_alloc_from_partial(struct kmem_cache *s,
> +			size_t nr, void **p)
> +{
> +	void **end = p + nr;
> +	struct kmem_cache_node *n = get_node(s, numa_mem_id());
> +	int allocated = 0;
> +	unsigned long flags;
> +	struct page *page, *page2;
> +
> +	if (!n->nr_partial)
> +		return 0;
> +
> +	spin_lock_irqsave(&n->list_lock, flags);

This is quite an expensive lock with irqsave.

> +	list_for_each_entry_safe(page, page2, &n->partial, lru) {
> +		void *freelist;
> +
> +		if (page->objects - page->inuse > end - p)
> +			/* More objects free in page than we want */
> +			break;
> +		list_del(&page->lru);
> +		slab_lock(page);

Yet another lock cost.

> +		freelist = page->freelist;
> +		page->inuse = page->objects;
> +		page->freelist = NULL;
> +		slab_unlock(page);
> +		/* Grab all available objects */
> +		while (freelist) {
> +			*p++ = freelist;
> +			freelist = get_freepointer(s, freelist);
> +			allocated++;
> +		}
> +	}
> +	spin_unlock_irqrestore(&n->list_lock, flags);
> +	return allocated;

I estimate (on my CPU) the locking cost itself is more than 32 ns, plus
the irqsave (which I've also found quite expensive, alone 14 ns). Thus,
estimated 46 ns. Single-elem slub fast path cost is 18-19 ns. Thus 3-4
elem bulking should be enough to amortize the cost, guess we are still
good :-)

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Sr. Network Kernel Developer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer
[PATCH 2/3] slub: Support for array operations
The major portions are there but there is no support yet for
directly allocating per cpu objects. There could also be more
sophisticated code to exploit the batch freeing.

Signed-off-by: Christoph Lameter

Index: linux/include/linux/slub_def.h
===================================================================
--- linux.orig/include/linux/slub_def.h
+++ linux/include/linux/slub_def.h
@@ -110,4 +110,5 @@ static inline void sysfs_slab_remove(str
 }
 #endif

+#define _HAVE_SLAB_ALLOCATOR_ARRAY_OPERATIONS
 #endif /* _LINUX_SLUB_DEF_H */
Index: linux/mm/slub.c
===================================================================
--- linux.orig/mm/slub.c
+++ linux/mm/slub.c
@@ -1379,13 +1379,9 @@ static void setup_object(struct kmem_cac
 	s->ctor(object);
 }

-static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
+static struct page *__new_slab(struct kmem_cache *s, gfp_t flags, int node)
 {
 	struct page *page;
-	void *start;
-	void *p;
-	int order;
-	int idx;

 	if (unlikely(flags & GFP_SLAB_BUG_MASK)) {
 		pr_emerg("gfp: %u\n", flags & GFP_SLAB_BUG_MASK);
@@ -1394,33 +1390,42 @@ static struct page *new_slab(struct kmem
 	page = allocate_slab(s,
 		flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
-	if (!page)
-		goto out;
+	if (page) {
+		inc_slabs_node(s, page_to_nid(page), page->objects);
+		page->slab_cache = s;
+		__SetPageSlab(page);
+		if (page->pfmemalloc)
+			SetPageSlabPfmemalloc(page);
+	}

-	order = compound_order(page);
-	inc_slabs_node(s, page_to_nid(page), page->objects);
-	page->slab_cache = s;
-	__SetPageSlab(page);
-	if (page->pfmemalloc)
-		SetPageSlabPfmemalloc(page);
-
-	start = page_address(page);
-
-	if (unlikely(s->flags & SLAB_POISON))
-		memset(start, POISON_INUSE, PAGE_SIZE << order);
-
-	for_each_object_idx(p, idx, s, start, page->objects) {
-		setup_object(s, page, p);
-		if (likely(idx < page->objects))
-			set_freepointer(s, p, p + s->size);
-		else
-			set_freepointer(s, p, NULL);
-	}
-
-	page->freelist = start;
-	page->inuse = page->objects;
-	page->frozen = 1;
-out:
+	return page;
+}
+
+static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
+{
+	struct page *page = __new_slab(s, flags, node);
+
+	if (page) {
+		void *p;
+		int idx;
+		void *start = page_address(page);
+
+		if (unlikely(s->flags & SLAB_POISON))
+			memset(start, POISON_INUSE,
+				PAGE_SIZE << compound_order(page));
+
+		for_each_object_idx(p, idx, s, start, page->objects) {
+			setup_object(s, page, p);
+			if (likely(idx < page->objects))
+				set_freepointer(s, p, p + s->size);
+			else
+				set_freepointer(s, p, NULL);
+		}
+
+		page->freelist = start;
+		page->inuse = page->objects;
+		page->frozen = 1;
+	}
 	return page;
 }

@@ -2516,8 +2521,78 @@ EXPORT_SYMBOL(kmem_cache_alloc_node_trac
 #endif
 #endif

+int slab_array_alloc_from_partial(struct kmem_cache *s,
+			size_t nr, void **p)
+{
+	void **end = p + nr;
+	struct kmem_cache_node *n = get_node(s, numa_mem_id());
+	int allocated = 0;
+	unsigned long flags;
+	struct page *page, *page2;
+
+	if (!n->nr_partial)
+		return 0;
+
+	spin_lock_irqsave(&n->list_lock, flags);
+	list_for_each_entry_safe(page, page2, &n->partial, lru) {
+		void *freelist;
+
+		if (page->objects - page->inuse > end - p)
+			/* More objects free in page than we want */
+			break;
+		list_del(&page->lru);
+		slab_lock(page);
+		freelist = page->freelist;
+		page->inuse = page->objects;
+		page->freelist = NULL;
+		slab_unlock(page);
+		/* Grab all available objects */
+		while (freelist) {
+			*p++ = freelist;
+			freelist = get_freepointer(s, freelist);
+			allocated++;
+		}
+	}
+	spin_unlock_irqrestore(&n->list_lock, flags);
+	return allocated;
+}
+
+int slab_array_alloc_from_page_allocator(struct kmem_cache *s,
+			gfp_t flags, size_t nr, void **p)
+{
+	void **end = p + nr;
+	int allocated = 0;
+
+	while (end - p >= oo_objects(s->oo)) {
+		struct page *page = __new_slab(s, flags, NUMA_NO_NODE);
+
Re: [PATCH 2/3] slub: Support for array operations
On Tue, 10 Feb 2015 13:48:06 -0600
Christoph Lameter <c...@linux.com> wrote:

> The major portions are there but there is no support yet for
> directly allocating per cpu objects. There could also be more
> sophisticated code to exploit the batch freeing.
>
> Signed-off-by: Christoph Lameter <c...@linux.com>
[...]
> Index: linux/mm/slub.c
> ===================================================================
> --- linux.orig/mm/slub.c
> +++ linux/mm/slub.c
[...]
> @@ -2516,8 +2521,78 @@ EXPORT_SYMBOL(kmem_cache_alloc_node_trac
>  #endif
>  #endif
>
> +int slab_array_alloc_from_partial(struct kmem_cache *s,
> +			size_t nr, void **p)
> +{
> +	void **end = p + nr;
> +	struct kmem_cache_node *n = get_node(s, numa_mem_id());
> +	int allocated = 0;
> +	unsigned long flags;
> +	struct page *page, *page2;
> +
> +	if (!n->nr_partial)
> +		return 0;
> +
> +
> +	spin_lock_irqsave(&n->list_lock, flags);

This is quite an expensive lock with irqsave.

> +	list_for_each_entry_safe(page, page2, &n->partial, lru) {
> +		void *freelist;
> +
> +		if (page->objects - page->inuse > end - p)
> +			/* More objects free in page than we want */
> +			break;
> +		list_del(&page->lru);
> +		slab_lock(page);

Yet another lock cost.

> +		freelist = page->freelist;
> +		page->inuse = page->objects;
> +		page->freelist = NULL;
> +		slab_unlock(page);
> +		/* Grab all available objects */
> +		while (freelist) {
> +			*p++ = freelist;
> +			freelist = get_freepointer(s, freelist);
> +			allocated++;
> +		}
> +	}
> +	spin_unlock_irqrestore(&n->list_lock, flags);
> +	return allocated;

I estimate (on my CPU) the locking cost itself is more than 32ns, plus
the irqsave (which I've also found quite expensive, alone 14ns). Thus,
estimated 46ns. Single elem slub fast path cost is 18-19ns. Thus 3-4
elem bulking should be enough to amortize the cost; guess we are still
good :-)

--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer