Re: [PATCH 2/5] mm, highmem: remove useless pool_lock
Hello, Andrew.

2012/10/31 Andrew Morton a...@linux-foundation.org:

On Mon, 29 Oct 2012 04:12:53 +0900 Joonsoo Kim js1...@gmail.com wrote: The pool_lock protects the page_address_pool from concurrent access. But, access to the page_address_pool is already protected by kmap_lock. So remove it.

Well, there's a set_page_address() call in mm/page_alloc.c which doesn't have lock_kmap(). It doesn't *need* lock_kmap() because it's init-time code and we're running single-threaded there. I hope! But this exception should be double-checked and mentioned in the changelog, please. And it's a reason why we can't add assert_spin_locked(&kmap_lock) to set_page_address(), which is unfortunate.

The set_page_address() in mm/page_alloc.c is invoked only when WANT_PAGE_VIRTUAL is defined, and in that case set_page_address()'s definition is not in highmem.c but in include/linux/mm.h. So we don't need to worry about the set_page_address() call in mm/page_alloc.c.

The irq-disabling in this code is odd. If ARCH_NEEDS_KMAP_HIGH_GET=n, we didn't need irq-safe locking in set_page_address(). I guess we'll need to retain it in page_address() - I expect some callers have IRQs disabled.

As Minchan described, if we don't disable irqs when we take pas->lock, it could deadlock with page_address().
[PATCH v2 3/5] mm, highmem: remove page_address_pool list
We can find free page_address_map instance without the page_address_pool. So remove it. Cc: Mel Gorman m...@csn.ul.ie Cc: Peter Zijlstra a.p.zijls...@chello.nl Signed-off-by: Joonsoo Kim js1...@gmail.com Reviewed-by: Minchan Kim minc...@kernel.org diff --git a/mm/highmem.c b/mm/highmem.c index 017bad1..d98b0a9 100644 --- a/mm/highmem.c +++ b/mm/highmem.c @@ -324,10 +324,7 @@ struct page_address_map { struct list_head list; }; -/* - * page_address_map freelist, allocated from page_address_maps. - */ -static struct list_head page_address_pool; /* freelist */ +static struct page_address_map page_address_maps[LAST_PKMAP]; /* * Hash table bucket @@ -392,12 +389,7 @@ void set_page_address(struct page *page, void *virtual) pas = page_slot(page); if (virtual) { /* Add */ - BUG_ON(list_empty(page_address_pool)); - - pam = list_entry(page_address_pool.next, - struct page_address_map, list); - list_del(pam-list); - + pam = page_address_maps[PKMAP_NR((unsigned long)virtual)]; pam-page = page; pam-virtual = virtual; @@ -410,7 +402,6 @@ void set_page_address(struct page *page, void *virtual) if (pam-page == page) { list_del(pam-list); spin_unlock_irqrestore(pas-lock, flags); - list_add_tail(pam-list, page_address_pool); goto done; } } @@ -420,15 +411,10 @@ done: return; } -static struct page_address_map page_address_maps[LAST_PKMAP]; - void __init page_address_init(void) { int i; - INIT_LIST_HEAD(page_address_pool); - for (i = 0; i ARRAY_SIZE(page_address_maps); i++) - list_add(page_address_maps[i].list, page_address_pool); for (i = 0; i ARRAY_SIZE(page_address_htable); i++) { INIT_LIST_HEAD(page_address_htable[i].lh); spin_lock_init(page_address_htable[i].lock); -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v2 5/5] mm, highmem: get virtual address of the page using PKMAP_ADDR()
In flush_all_zero_pkmaps(), we have the index of the pkmap associated with the page. Using this index, we can simply get the virtual address of the page. So change it.

Cc: Mel Gorman m...@csn.ul.ie
Cc: Peter Zijlstra a.p.zijls...@chello.nl
Signed-off-by: Joonsoo Kim js1...@gmail.com
Reviewed-by: Minchan Kim minc...@kernel.org

diff --git a/mm/highmem.c b/mm/highmem.c
index b365f7b..675ec97 100644
--- a/mm/highmem.c
+++ b/mm/highmem.c
@@ -137,8 +137,7 @@ static unsigned int flush_all_zero_pkmaps(void)
 		 * So no dangers, even with speculative execution.
 		 */
 		page = pte_page(pkmap_page_table[i]);
-		pte_clear(&init_mm, (unsigned long)page_address(page),
-			  &pkmap_page_table[i]);
+		pte_clear(&init_mm, PKMAP_ADDR(i), &pkmap_page_table[i]);
 
 		set_page_address(page, NULL);
 		if (index == PKMAP_INVALID_INDEX)
-- 
1.7.9.5
[PATCH v2 0/5] minor clean-up and optimize highmem related code
This patchset cleans up and optimizes highmem-related code.

Changes from v1:
Rebased on v3.7-rc3.
[4] Instead of returning the index of the last flushed entry, return the first index. And update last_pkmap_nr to this index to optimize further.

Summary of v1:
[1] is just a clean-up and doesn't introduce any functional change.
[2-3] are for clean-up and optimization. These eliminate a useless lock operation and list management.
[4-5] are for optimization related to flush_all_zero_pkmaps().

Joonsoo Kim (5):
  mm, highmem: use PKMAP_NR() to calculate an index of pkmap
  mm, highmem: remove useless pool_lock
  mm, highmem: remove page_address_pool list
  mm, highmem: makes flush_all_zero_pkmaps() return index of first flushed entry
  mm, highmem: get virtual address of the page using PKMAP_ADDR()

 include/linux/highmem.h |    1 +
 mm/highmem.c            |  108 ++-
 2 files changed, 51 insertions(+), 58 deletions(-)

-- 
1.7.9.5
[PATCH v2 1/5] mm, highmem: use PKMAP_NR() to calculate an index of pkmap
To calculate an index of pkmap, using PKMAP_NR() is more understandable and maintainable, so change it.

Cc: Mel Gorman m...@csn.ul.ie
Cc: Peter Zijlstra a.p.zijls...@chello.nl
Signed-off-by: Joonsoo Kim js1...@gmail.com
Reviewed-by: Minchan Kim minc...@kernel.org

diff --git a/mm/highmem.c b/mm/highmem.c
index d517cd1..b3b3d68 100644
--- a/mm/highmem.c
+++ b/mm/highmem.c
@@ -99,7 +99,7 @@ struct page *kmap_to_page(void *vaddr)
 	unsigned long addr = (unsigned long)vaddr;
 
 	if (addr >= PKMAP_ADDR(0) && addr <= PKMAP_ADDR(LAST_PKMAP)) {
-		int i = (addr - PKMAP_ADDR(0)) >> PAGE_SHIFT;
+		int i = PKMAP_NR(addr);
 		return pte_page(pkmap_page_table[i]);
 	}
-- 
1.7.9.5
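For readers less familiar with the pkmap macros: PKMAP_ADDR() and PKMAP_NR() are inverses of each other, which is why the substitution above is a pure readability change. Below is a minimal userspace sketch demonstrating that equivalence; the PAGE_SHIFT, LAST_PKMAP and PKMAP_BASE values are illustrative assumptions, not taken from any particular architecture.

#include <assert.h>

/* Illustrative values only; the real ones are per-architecture. */
#define PAGE_SHIFT	12
#define LAST_PKMAP	512
#define PKMAP_BASE	0xff800000UL

#define PKMAP_ADDR(nr)	(PKMAP_BASE + ((unsigned long)(nr) << PAGE_SHIFT))
#define PKMAP_NR(virt)	(((virt) - PKMAP_BASE) >> PAGE_SHIFT)

int main(void)
{
	for (unsigned long i = 0; i < LAST_PKMAP; i++) {
		unsigned long addr = PKMAP_ADDR(i);

		/* The open-coded expression and PKMAP_NR() compute the
		 * same index, so the patch is behaviourally equivalent. */
		assert(PKMAP_NR(addr) == i);
		assert(((addr - PKMAP_ADDR(0)) >> PAGE_SHIFT) == i);
	}
	return 0;
}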
[PATCH v2 4/5] mm, highmem: makes flush_all_zero_pkmaps() return index of first flushed entry
In current code, after flush_all_zero_pkmaps() is invoked, then re-iterate all pkmaps. It can be optimized if flush_all_zero_pkmaps() return index of first flushed entry. With this index, we can immediately map highmem page to virtual address represented by index. So change return type of flush_all_zero_pkmaps() and return index of first flushed entry. Additionally, update last_pkmap_nr to this index. It is certain that entry which is below this index is occupied by other mapping, therefore updating last_pkmap_nr to this index is reasonable optimization. Cc: Mel Gorman m...@csn.ul.ie Cc: Peter Zijlstra a.p.zijls...@chello.nl Cc: Minchan Kim minc...@kernel.org Signed-off-by: Joonsoo Kim js1...@gmail.com diff --git a/include/linux/highmem.h b/include/linux/highmem.h index ef788b5..97ad208 100644 --- a/include/linux/highmem.h +++ b/include/linux/highmem.h @@ -32,6 +32,7 @@ static inline void invalidate_kernel_vmap_range(void *vaddr, int size) #ifdef CONFIG_HIGHMEM #include asm/highmem.h +#define PKMAP_INVALID_INDEX (LAST_PKMAP) /* declarations for linux/mm/highmem.c */ unsigned int nr_free_highpages(void); diff --git a/mm/highmem.c b/mm/highmem.c index d98b0a9..b365f7b 100644 --- a/mm/highmem.c +++ b/mm/highmem.c @@ -106,10 +106,10 @@ struct page *kmap_to_page(void *vaddr) return virt_to_page(addr); } -static void flush_all_zero_pkmaps(void) +static unsigned int flush_all_zero_pkmaps(void) { int i; - int need_flush = 0; + unsigned int index = PKMAP_INVALID_INDEX; flush_cache_kmaps(); @@ -141,10 +141,13 @@ static void flush_all_zero_pkmaps(void) pkmap_page_table[i]); set_page_address(page, NULL); - need_flush = 1; + if (index == PKMAP_INVALID_INDEX) + index = i; } - if (need_flush) + if (index != PKMAP_INVALID_INDEX) flush_tlb_kernel_range(PKMAP_ADDR(0), PKMAP_ADDR(LAST_PKMAP)); + + return index; } /** @@ -152,14 +155,19 @@ static void flush_all_zero_pkmaps(void) */ void kmap_flush_unused(void) { + unsigned int index; + lock_kmap(); - flush_all_zero_pkmaps(); + index = flush_all_zero_pkmaps(); + if (index != PKMAP_INVALID_INDEX (index last_pkmap_nr)) + last_pkmap_nr = index; unlock_kmap(); } static inline unsigned long map_new_virtual(struct page *page) { unsigned long vaddr; + unsigned int index = PKMAP_INVALID_INDEX; int count; start: @@ -168,40 +176,45 @@ start: for (;;) { last_pkmap_nr = (last_pkmap_nr + 1) LAST_PKMAP_MASK; if (!last_pkmap_nr) { - flush_all_zero_pkmaps(); - count = LAST_PKMAP; + index = flush_all_zero_pkmaps(); + break; } - if (!pkmap_count[last_pkmap_nr]) + if (!pkmap_count[last_pkmap_nr]) { + index = last_pkmap_nr; break; /* Found a usable entry */ - if (--count) - continue; - - /* -* Sleep for somebody else to unmap their entries -*/ - { - DECLARE_WAITQUEUE(wait, current); - - __set_current_state(TASK_UNINTERRUPTIBLE); - add_wait_queue(pkmap_map_wait, wait); - unlock_kmap(); - schedule(); - remove_wait_queue(pkmap_map_wait, wait); - lock_kmap(); - - /* Somebody else might have mapped it while we slept */ - if (page_address(page)) - return (unsigned long)page_address(page); - - /* Re-start */ - goto start; } + if (--count == 0) + break; } - vaddr = PKMAP_ADDR(last_pkmap_nr); + + /* +* Sleep for somebody else to unmap their entries +*/ + if (index == PKMAP_INVALID_INDEX) { + DECLARE_WAITQUEUE(wait, current); + + __set_current_state(TASK_UNINTERRUPTIBLE); + add_wait_queue(pkmap_map_wait, wait); + unlock_kmap(); + schedule(); + remove_wait_queue(pkmap_map_wait, wait); + lock_kmap(); + + /* Somebody else might have mapped it while we slept */ + vaddr = (unsigned 
long)page_address(page); + if (vaddr) + return vaddr; + + /* Re-start */ + goto start; + } + + vaddr = PKMAP_ADDR(index); set_pte_at(init_mm, vaddr, - (pkmap_page_table[last_pkmap_nr]), mk_pte(page, kmap_prot)); + (pkmap_page_table[index
[PATCH v2 2/5] mm, highmem: remove useless pool_lock
The pool_lock protects the page_address_pool from concurrent access. But, access to the page_address_pool is already protected by kmap_lock. So remove it. Cc: Mel Gorman m...@csn.ul.ie Cc: Peter Zijlstra a.p.zijls...@chello.nl Signed-off-by: Joonsoo Kim js1...@gmail.com Reviewed-by: Minchan Kim minc...@kernel.org diff --git a/mm/highmem.c b/mm/highmem.c index b3b3d68..017bad1 100644 --- a/mm/highmem.c +++ b/mm/highmem.c @@ -328,7 +328,6 @@ struct page_address_map { * page_address_map freelist, allocated from page_address_maps. */ static struct list_head page_address_pool; /* freelist */ -static spinlock_t pool_lock; /* protects page_address_pool */ /* * Hash table bucket @@ -395,11 +394,9 @@ void set_page_address(struct page *page, void *virtual) if (virtual) { /* Add */ BUG_ON(list_empty(page_address_pool)); - spin_lock_irqsave(pool_lock, flags); pam = list_entry(page_address_pool.next, struct page_address_map, list); list_del(pam-list); - spin_unlock_irqrestore(pool_lock, flags); pam-page = page; pam-virtual = virtual; @@ -413,9 +410,7 @@ void set_page_address(struct page *page, void *virtual) if (pam-page == page) { list_del(pam-list); spin_unlock_irqrestore(pas-lock, flags); - spin_lock_irqsave(pool_lock, flags); list_add_tail(pam-list, page_address_pool); - spin_unlock_irqrestore(pool_lock, flags); goto done; } } @@ -438,7 +433,6 @@ void __init page_address_init(void) INIT_LIST_HEAD(page_address_htable[i].lh); spin_lock_init(page_address_htable[i].lock); } - spin_lock_init(pool_lock); } #endif /* defined(CONFIG_HIGHMEM) !defined(WANT_PAGE_VIRTUAL) */ -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/5] minor clean-up and optimize highmem related code
Hello, Andrew.

2012/10/29 JoonSoo Kim js1...@gmail.com:

Hi, Minchan. 2012/10/29 Minchan Kim minc...@kernel.org: Hi Joonsoo, On Mon, Oct 29, 2012 at 04:12:51AM +0900, Joonsoo Kim wrote: This patchset cleans up and optimizes highmem-related code. [1] is just a clean-up and doesn't introduce any functional change. [2-3] are for clean-up and optimization. These eliminate a useless lock operation and list management. [4-5] are for optimization related to flush_all_zero_pkmaps().

Joonsoo Kim (5): mm, highmem: use PKMAP_NR() to calculate an index of pkmap mm, highmem: remove useless pool_lock mm, highmem: remove page_address_pool list mm, highmem: makes flush_all_zero_pkmaps() return index of last flushed entry mm, highmem: get virtual address of the page using PKMAP_ADDR()

This patchset looks awesome to me. If you have a plan to respin, please CC Peter.

Thanks for the review. I will wait for more reviews and respin the day after tomorrow. Version 2 will include fixes for your comments.

Could you pick up the second version of this patchset?

[3] is changed to leave one blank line.
[4] is changed to optimize further, according to Minchan's comment. It returns the first index of a flushed entry, instead of the last index.
The others are unchanged.
Re: [PATCH v2 4/5] mm, highmem: makes flush_all_zero_pkmaps() return index of first flushed entry
Hello, Minchan. 2012/11/1 Minchan Kim minc...@kernel.org: On Thu, Nov 01, 2012 at 01:56:36AM +0900, Joonsoo Kim wrote: In current code, after flush_all_zero_pkmaps() is invoked, then re-iterate all pkmaps. It can be optimized if flush_all_zero_pkmaps() return index of first flushed entry. With this index, we can immediately map highmem page to virtual address represented by index. So change return type of flush_all_zero_pkmaps() and return index of first flushed entry. Additionally, update last_pkmap_nr to this index. It is certain that entry which is below this index is occupied by other mapping, therefore updating last_pkmap_nr to this index is reasonable optimization. Cc: Mel Gorman m...@csn.ul.ie Cc: Peter Zijlstra a.p.zijls...@chello.nl Cc: Minchan Kim minc...@kernel.org Signed-off-by: Joonsoo Kim js1...@gmail.com diff --git a/include/linux/highmem.h b/include/linux/highmem.h index ef788b5..97ad208 100644 --- a/include/linux/highmem.h +++ b/include/linux/highmem.h @@ -32,6 +32,7 @@ static inline void invalidate_kernel_vmap_range(void *vaddr, int size) #ifdef CONFIG_HIGHMEM #include asm/highmem.h +#define PKMAP_INVALID_INDEX (LAST_PKMAP) /* declarations for linux/mm/highmem.c */ unsigned int nr_free_highpages(void); diff --git a/mm/highmem.c b/mm/highmem.c index d98b0a9..b365f7b 100644 --- a/mm/highmem.c +++ b/mm/highmem.c @@ -106,10 +106,10 @@ struct page *kmap_to_page(void *vaddr) return virt_to_page(addr); } -static void flush_all_zero_pkmaps(void) +static unsigned int flush_all_zero_pkmaps(void) { int i; - int need_flush = 0; + unsigned int index = PKMAP_INVALID_INDEX; flush_cache_kmaps(); @@ -141,10 +141,13 @@ static void flush_all_zero_pkmaps(void) pkmap_page_table[i]); set_page_address(page, NULL); - need_flush = 1; + if (index == PKMAP_INVALID_INDEX) + index = i; } - if (need_flush) + if (index != PKMAP_INVALID_INDEX) flush_tlb_kernel_range(PKMAP_ADDR(0), PKMAP_ADDR(LAST_PKMAP)); + + return index; } /** @@ -152,14 +155,19 @@ static void flush_all_zero_pkmaps(void) */ void kmap_flush_unused(void) { + unsigned int index; + lock_kmap(); - flush_all_zero_pkmaps(); + index = flush_all_zero_pkmaps(); + if (index != PKMAP_INVALID_INDEX (index last_pkmap_nr)) + last_pkmap_nr = index; I don't know how kmap_flush_unused is really fast path so how my nitpick is effective. Anyway, What problem happens if we do following as? lock() index = flush_all_zero_pkmaps(); if (index != PKMAP_INVALID_INDEX) last_pkmap_nr = index; unlock(); Normally, last_pkmap_nr is increased with searching empty slot in map_new_virtual. So I expect return value of flush_all_zero_pkmaps in kmap_flush_unused normally become either less than last_pkmap_nr or last_pkmap_nr + 1. There is a case that return value of kmap_flush_unused() is larger than last_pkmap_nr. Look at the following example. Assume last_pkmap = 20 and index 1-9, 11-19 is kmapped. 10 is kunmapped. do kmap_flush_unused() = flush index 10 = last_pkmap = 10; do kunmap() with index 17 do kmap_flush_unused() = flush index 17 So, little dirty implementation is needed. Thanks. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
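To make the scenario described above concrete, here is a small userspace simulation (purely illustrative; the pkmap_count semantics are simplified to free / flushable / in-use) showing why kmap_flush_unused() must only move last_pkmap_nr backwards: the second flush returns index 17, which is above the already-free slot at 10, so blindly assigning it would make map_new_virtual() skip that free slot.

#include <stdio.h>

#define LAST_PKMAP		20
#define PKMAP_INVALID_INDEX	LAST_PKMAP

/* 0: slot free, 1: mapped but unused (flushable), 2: in use */
static int pkmap_count[LAST_PKMAP];
static int last_pkmap_nr;

static int flush_all_zero_pkmaps(void)
{
	int i, index = PKMAP_INVALID_INDEX;

	for (i = 0; i < LAST_PKMAP; i++) {
		if (pkmap_count[i] == 1) {
			pkmap_count[i] = 0;		/* unmap it */
			if (index == PKMAP_INVALID_INDEX)
				index = i;		/* first flushed entry */
		}
	}
	return index;
}

static void kmap_flush_unused(void)
{
	int index = flush_all_zero_pkmaps();

	/* Only move the search hint backwards; moving it forwards could
	 * skip over slots that are already free (slot 10 below). */
	if (index != PKMAP_INVALID_INDEX && index < last_pkmap_nr)
		last_pkmap_nr = index;
}

int main(void)
{
	int i;

	last_pkmap_nr = LAST_PKMAP - 1;
	for (i = 1; i < LAST_PKMAP; i++)
		pkmap_count[i] = 2;			/* 1..19 kmapped */
	pkmap_count[10] = 1;				/* 10 already kunmapped */

	kmap_flush_unused();				/* flushes slot 10 */
	printf("last_pkmap_nr = %d\n", last_pkmap_nr);	/* 10 */

	pkmap_count[17] = 1;				/* now kunmap slot 17 */
	kmap_flush_unused();				/* flushes slot 17 */
	printf("last_pkmap_nr = %d\n", last_pkmap_nr);	/* stays 10 */
	return 0;
}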
Re: [PATCH v6 00/29] kmem controller for memcg.
Hello, Glauber.

2012/11/2 Glauber Costa glom...@parallels.com: On 11/02/2012 04:04 AM, Andrew Morton wrote: On Thu, 1 Nov 2012 16:07:16 +0400 Glauber Costa glom...@parallels.com wrote: Hi, This work introduces the kernel memory controller for memcg. Unlike previous submissions, this includes the whole controller, comprised of slab and stack memory.

I'm in the middle of (re)reading all this. Meanwhile I'll push it all out to http://ozlabs.org/~akpm/mmots/ for the crazier testers. One thing: Numbers can be found at https://lkml.org/lkml/2012/9/13/239 You claim in the above that the fork workload is 'slab intensive'. Or at least, you seem to - it's a bit fuzzy. But how slab intensive is it, really? What is extremely slab intensive is networking. The networking guys are very sensitive to slab performance. If this hasn't already been done, could you please determine what impact this has upon networking? I expect Eric Dumazet, Dave Miller and Tom Herbert could suggest testing approaches.

I can test it, but unfortunately I am unlikely to get to prepare a good environment before Barcelona. I know, however, that Greg Thelen was testing netperf in his setup. Greg, do you have any publishable numbers you could share?

Below is my humble opinion. I am worried about the data cache footprint possibly caused by this patchset, especially by the slab implementation. If there are several memcg cgroups, each cgroup has its own kmem_caches. When each group does a slab-intensive job, the data cache can overflow easily and the cache miss rate will be high, so this could significantly degrade system performance. Are there any results about this?

Thanks.
[PATCH for-v3.7 1/2] slub: optimize poorly inlined kmalloc* functions
kmalloc() and kmalloc_node() are always inlined into generic code. However, there is a mistake in the SLUB implementation: in SLUB's kmalloc() and kmalloc_node(), we compare kmalloc_caches[index] with NULL. As this cannot be known at compile time, the comparison is emitted into every piece of generic code invoking kmalloc*. This may decrease system performance, so we should fix it. Below is the result of `size vmlinux`; text size is decreased by roughly 20KB.

Before:
   text    data     bss      dec     hex filename
10044177 1443168 5722112 17209457 1069871 vmlinux
After:
   text    data     bss      dec     hex filename
10022627 1443136 5722112 17187875 1064423 vmlinux

Cc: Christoph Lameter c...@linux.com
Signed-off-by: Joonsoo Kim js1...@gmail.com
---
With Christoph's patchset (common kmalloc caches: '[15/15] Common Kmalloc cache determination'), which is not merged into mainline yet, this issue will be fixed. As that will take some time, I am sending this patch for v3.7. This patch is based on v3.7-rc1.

diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index df448ad..4c75f2b 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -271,9 +271,10 @@ static __always_inline void *kmalloc(size_t size, gfp_t flags)
 			return kmalloc_large(size, flags);
 
 		if (!(flags & SLUB_DMA)) {
-			struct kmem_cache *s = kmalloc_slab(size);
+			int index = kmalloc_index(size);
+			struct kmem_cache *s = kmalloc_caches[index];
 
-			if (!s)
+			if (!index)
 				return ZERO_SIZE_PTR;
 
 			return kmem_cache_alloc_trace(s, flags, size);
@@ -304,9 +305,10 @@ static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
 {
 	if (__builtin_constant_p(size) &&
 		size <= SLUB_MAX_SIZE && !(flags & SLUB_DMA)) {
-			struct kmem_cache *s = kmalloc_slab(size);
+			int index = kmalloc_index(size);
+			struct kmem_cache *s = kmalloc_caches[index];
 
-		if (!s)
+		if (!index)
 			return ZERO_SIZE_PTR;
 
 		return kmem_cache_alloc_node_trace(s, flags, node, size);
-- 
1.7.9.5
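The reason the change helps is that kmalloc_index() is an always-inlined chain of size comparisons, so for a compile-time-constant size the compiler folds both the index and the !index check away, whereas the old kmalloc_slab()-returned pointer could only be tested at run time. The following reduced userspace sketch illustrates that folding; the index values, cache array and function names are simplified assumptions, not the real slub_def.h code.

#include <stdio.h>

/* Reduced stand-in for kmalloc_index(): with a constant size the whole
 * if-chain collapses to a constant at compile time. */
static inline __attribute__((always_inline)) int kmalloc_index(unsigned long size)
{
	if (!size)
		return 0;
	if (size <= 8)
		return 3;
	if (size <= 16)
		return 4;
	if (size <= 32)
		return 5;
	if (size <= 64)
		return 6;
	return -1;		/* "too big" in this toy version */
}

static void *caches[8];		/* stand-in for kmalloc_caches[] */

static inline __attribute__((always_inline)) void *toy_kmalloc(unsigned long size)
{
	int index = kmalloc_index(size);

	if (index < 0)
		return NULL;	/* the kernel would fall back to __kmalloc() */
	/* Testing the foldable index instead of caches[index] lets the
	 * compiler delete this branch for constant sizes - the point of
	 * the patch above. */
	if (!index)
		return NULL;	/* ZERO_SIZE_PTR in the kernel */
	return caches[index];
}

int main(void)
{
	printf("index for 24 bytes: %d\n", kmalloc_index(24));	/* 5 */
	toy_kmalloc(24);
	return 0;
}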
[PATCH for-v3.7 2/2] slub: optimize kmalloc* inlining for GFP_DMA
kmalloc() and kmalloc_node() of the SLUB isn't inlined when @flags = __GFP_DMA. This patch optimize this case, so when @flags = __GFP_DMA, it will be inlined into generic code. Cc: Christoph Lameter c...@linux.com Signed-off-by: Joonsoo Kim js1...@gmail.com diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h index 4c75f2b..4adf50b 100644 --- a/include/linux/slub_def.h +++ b/include/linux/slub_def.h @@ -147,6 +147,7 @@ struct kmem_cache { * 2^x bytes of allocations. */ extern struct kmem_cache *kmalloc_caches[SLUB_PAGE_SHIFT]; +extern struct kmem_cache *kmalloc_dma_caches[SLUB_PAGE_SHIFT]; /* * Sorry that the following has to be that ugly but some versions of GCC @@ -266,19 +267,24 @@ static __always_inline void *kmalloc_large(size_t size, gfp_t flags) static __always_inline void *kmalloc(size_t size, gfp_t flags) { + struct kmem_cache *s; + int index; + if (__builtin_constant_p(size)) { if (size SLUB_MAX_SIZE) return kmalloc_large(size, flags); - if (!(flags SLUB_DMA)) { - int index = kmalloc_index(size); - struct kmem_cache *s = kmalloc_caches[index]; - - if (!index) - return ZERO_SIZE_PTR; + index = kmalloc_index(size); + if (!index) + return ZERO_SIZE_PTR; +#ifdef CONFIG_ZONE_DMA + if (unlikely(flags SLUB_DMA)) { + s = kmalloc_dma_caches[index]; + } else +#endif + s = kmalloc_caches[index]; - return kmem_cache_alloc_trace(s, flags, size); - } + return kmem_cache_alloc_trace(s, flags, size); } return __kmalloc(size, flags); } @@ -303,13 +309,19 @@ kmem_cache_alloc_node_trace(struct kmem_cache *s, static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node) { - if (__builtin_constant_p(size) - size = SLUB_MAX_SIZE !(flags SLUB_DMA)) { - int index = kmalloc_index(size); - struct kmem_cache *s = kmalloc_caches[index]; + struct kmem_cache *s; + int index; + if (__builtin_constant_p(size) size = SLUB_MAX_SIZE) { + index = kmalloc_index(size); if (!index) return ZERO_SIZE_PTR; +#ifdef CONFIG_ZONE_DMA + if (unlikely(flags SLUB_DMA)) { + s = kmalloc_dma_caches[index]; + } else +#endif + s = kmalloc_caches[index]; return kmem_cache_alloc_node_trace(s, flags, node, size); } diff --git a/mm/slub.c b/mm/slub.c index a0d6984..a94533c 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -3222,7 +3222,8 @@ struct kmem_cache *kmalloc_caches[SLUB_PAGE_SHIFT]; EXPORT_SYMBOL(kmalloc_caches); #ifdef CONFIG_ZONE_DMA -static struct kmem_cache *kmalloc_dma_caches[SLUB_PAGE_SHIFT]; +struct kmem_cache *kmalloc_dma_caches[SLUB_PAGE_SHIFT]; +EXPORT_SYMBOL(kmalloc_dma_caches); #endif static int __init setup_slub_min_order(char *str) -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 0/3] workqueue: minor cleanup
This patchset does minor cleanups of the workqueue code. The first patch makes a minor behavior change; however, it is trivial. The others don't make any functional difference. These are based on v3.7-rc1.

Joonsoo Kim (3):
  workqueue: optimize mod_delayed_work_on() when @delay == 0
  workqueue: trivial fix for return statement in work_busy()
  workqueue: remove unused argument of wq_worker_waking_up()

 kernel/sched/core.c      |    2 +-
 kernel/workqueue.c       |   11 +++
 kernel/workqueue_sched.h |    2 +-
 3 files changed, 9 insertions(+), 6 deletions(-)

-- 
1.7.9.5
[PATCH 1/3] workqueue: optimize mod_delayed_work_on() when @delay == 0
After try_to_grab_pending(), __queue_delayed_work() is invoked in mod_delayed_work_on(). When @delay == 0, we can call __queue_work() directly in order to avoid setting a useless timer.

Signed-off-by: Joonsoo Kim js1...@gmail.com

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index d951daa..c57358e 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1477,7 +1477,11 @@ bool mod_delayed_work_on(int cpu, struct workqueue_struct *wq,
 	} while (unlikely(ret == -EAGAIN));
 
 	if (likely(ret >= 0)) {
-		__queue_delayed_work(cpu, wq, dwork, delay);
+		if (!delay)
+			__queue_work(cpu, wq, &dwork->work);
+		else
+			__queue_delayed_work(cpu, wq, dwork, delay);
+
 		local_irq_restore(flags);
 	}
-- 
1.7.9.5
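For context, the caller-visible effect is simply that a zero delay now queues the work immediately instead of arming a timer that expires on the next tick. A minimal, hypothetical module sketch of such a caller is below; the names my_work and my_work_fn are made up for illustration and this is not code from the patchset.

#include <linux/module.h>
#include <linux/workqueue.h>
#include <linux/jiffies.h>

static void my_work_fn(struct work_struct *work)
{
	pr_info("delayed work ran\n");
}

static DECLARE_DELAYED_WORK(my_work, my_work_fn);

static int __init my_init(void)
{
	/* delay == 0: with the patch this takes the __queue_work() path,
	 * so no useless one-shot timer is armed. */
	mod_delayed_work(system_wq, &my_work, 0);

	/* Non-zero delay: the normal __queue_delayed_work() timer path. */
	mod_delayed_work(system_wq, &my_work, msecs_to_jiffies(100));
	return 0;
}

static void __exit my_exit(void)
{
	cancel_delayed_work_sync(&my_work);
}

module_init(my_init);
module_exit(my_exit);
MODULE_LICENSE("GPL");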
[PATCH 3/3] workqueue: remove unused argument of wq_worker_waking_up()
Commit 63d95a91 ('workqueue: use @pool instead of @gcwq or @cpu where applicable') changes an approach to access nr_running. Thus, wq_worker_waking_up() doesn't use @cpu anymore. Remove it and remove comment related to it. Signed-off-by: Joonsoo Kim js1...@gmail.com diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 2d8927f..30a23d0 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -1243,7 +1243,7 @@ static void ttwu_activate(struct rq *rq, struct task_struct *p, int en_flags) /* if a worker is waking up, notify workqueue */ if (p-flags PF_WQ_WORKER) - wq_worker_waking_up(p, cpu_of(rq)); + wq_worker_waking_up(p); } /* diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 27a6dee..daf101c 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -727,7 +727,6 @@ static void wake_up_worker(struct worker_pool *pool) /** * wq_worker_waking_up - a worker is waking up * @task: task waking up - * @cpu: CPU @task is waking up to * * This function is called during try_to_wake_up() when a worker is * being awoken. @@ -735,7 +734,7 @@ static void wake_up_worker(struct worker_pool *pool) * CONTEXT: * spin_lock_irq(rq-lock) */ -void wq_worker_waking_up(struct task_struct *task, unsigned int cpu) +void wq_worker_waking_up(struct task_struct *task) { struct worker *worker = kthread_data(task); diff --git a/kernel/workqueue_sched.h b/kernel/workqueue_sched.h index 2d10fc9..c1b45a5 100644 --- a/kernel/workqueue_sched.h +++ b/kernel/workqueue_sched.h @@ -4,6 +4,6 @@ * Scheduler hooks for concurrency managed workqueue. Only to be * included from sched.c and workqueue.c. */ -void wq_worker_waking_up(struct task_struct *task, unsigned int cpu); +void wq_worker_waking_up(struct task_struct *task); struct task_struct *wq_worker_sleeping(struct task_struct *task, unsigned int cpu); -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 2/3] workqueue: trivial fix for return statement in work_busy()
The return type of work_busy() is unsigned int. There is a return statement returning the boolean value 'false' in work_busy(). It is not a problem, because 'false' is treated as '0'. However, fixing it makes the code more robust.

Signed-off-by: Joonsoo Kim js1...@gmail.com

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index c57358e..27a6dee 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3479,7 +3479,7 @@ unsigned int work_busy(struct work_struct *work)
 	unsigned int ret = 0;
 
 	if (!gcwq)
-		return false;
+		return 0;
 
 	spin_lock_irqsave(&gcwq->lock, flags);
-- 
1.7.9.5
Re: [PATCH 2/2] slab: move kmem_cache_free to common code
Hello, Glauber.

2012/10/23 Glauber Costa glom...@parallels.com: On 10/22/2012 06:45 PM, Christoph Lameter wrote: On Mon, 22 Oct 2012, Glauber Costa wrote:

+ * kmem_cache_free - Deallocate an object
+ * @cachep: The cache the allocation was from.
+ * @objp: The previously allocated object.
+ *
+ * Free an object which was previously allocated from this
+ * cache.
+ */
+void kmem_cache_free(struct kmem_cache *s, void *x)
+{
+	__kmem_cache_free(s, x);
+	trace_kmem_cache_free(_RET_IP_, x);
+}
+EXPORT_SYMBOL(kmem_cache_free);

This results in an additional indirection if tracing is off. Wonder if there is a performance impact?

if tracing is on, you mean? Tracing already incurs overhead, not sure how much a function call would add to the tracing overhead. I would not be concerned with this, but I can measure, if you have any specific workload in mind.

With this patch, kmem_cache_free() invokes __kmem_cache_free(); that is, it adds one more call instruction than before. I think that is what Christoph's comment refers to.

Thanks.
Re: [PATCH 3/3] workqueue: remove unused argument of wq_worker_waking_up()
2012/10/21 Tejun Heo t...@kernel.org: On Sun, Oct 21, 2012 at 01:30:07AM +0900, Joonsoo Kim wrote: Commit 63d95a91 ('workqueue: use @pool instead of @gcwq or @cpu where applicable') changes an approach to access nr_running. Thus, wq_worker_waking_up() doesn't use @cpu anymore. Remove it and remove comment related to it. Signed-off-by: Joonsoo Kim js1...@gmail.com I'm not sure whether I wanna remove or add WARN_ON_ONCE() on it. That part has gone through some changes and seen some bugs. Can we please do the following instead at least for now? if (!(worker-flags WORKER_NOT_RUNNING)) { WARN_ON_ONCE(worker-pool-gcwq-cpu != cpu); atomic_inc(get_pool_nr_running(worker-pool)); } I have no objection to do this for now. Thanks. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH for-v3.7 2/2] slub: optimize kmalloc* inlining for GFP_DMA
2012/10/22 Christoph Lameter c...@linux.com: On Sun, 21 Oct 2012, Joonsoo Kim wrote:

kmalloc() and kmalloc_node() of the SLUB aren't inlined when @flags & __GFP_DMA. This patch optimizes this case, so when @flags & __GFP_DMA, it will be inlined into generic code.

__GFP_DMA is a rarely used flag for kmalloc allocators and so far it was not considered that it is worth to directly support it in the inlining code.

Hmm... but the SLAB already does that optimization for __GFP_DMA. Almost every kmalloc() is invoked with a constant flags value, so I think the overhead from this patch should be negligible. With this patch, the code size of vmlinux is reduced slightly.
Re: [PATCH 2/2] slab: move kmem_cache_free to common code
2012/10/23 Glauber Costa glom...@parallels.com: On 10/23/2012 12:07 PM, Glauber Costa wrote: On 10/23/2012 04:48 AM, JoonSoo Kim wrote: Hello, Glauber. 2012/10/23 Glauber Costa glom...@parallels.com: On 10/22/2012 06:45 PM, Christoph Lameter wrote: On Mon, 22 Oct 2012, Glauber Costa wrote: + * kmem_cache_free - Deallocate an object + * @cachep: The cache the allocation was from. + * @objp: The previously allocated object. + * + * Free an object which was previously allocated from this + * cache. + */ +void kmem_cache_free(struct kmem_cache *s, void *x) +{ +__kmem_cache_free(s, x); +trace_kmem_cache_free(_RET_IP_, x); +} +EXPORT_SYMBOL(kmem_cache_free); + This results in an additional indirection if tracing is off. Wonder if there is a performance impact? if tracing is on, you mean? Tracing already incurs overhead, not sure how much a function call would add to the tracing overhead. I would not be concerned with this, but I can measure, if you have any specific workload in mind. With this patch, kmem_cache_free() invokes __kmem_cache_free(), that is, it add one more call instruction than before. I think that Christoph's comment means above fact. Ah, this. Ok, I got fooled by his mention to tracing. I do agree, but since freeing is ultimately dependent on the allocator layout, I don't see a clean way of doing this without dropping tears of sorrow around. The calls in slub/slab/slob would have to be somehow inlined. Hum... maybe it is possible to do it from include/linux/sl*b_def.h... Let me give it a try and see what I can come up with. Ok. I am attaching a PoC for this for your appreciation. This gets quite ugly, but it's the way I found without including sl{a,u,o}b.c directly - which would be even worse. Hmm... This is important issue for sl[aou]b common allocators. Because there are similar functions like as kmem_cache_alloc, ksize, kfree, ... So it is good time to resolve this issue. As far as I know, now, we have 3 solutions. 1. include/linux/slab.h __always_inline kmem_cache_free() { __kmem_cache_free(); blablabla... } 2. define macro like as Glauber's solution 3. include sl[aou]b.c directly. Is there other good solution? Among them, I prefer solution 3, because future developing cost may be minimum among them. Solution 2 may be error-prone for future developing. Solution 1 may make compile-time longer and larger code. Is my understanding right? Is Solution 3 really ugly? Thanks. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
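For reference, a rough sketch of what "solution 1" above could look like: the wrapper lives in include/linux/slab.h and is forced inline, so callers still emit a single call (to the per-allocator entry point) plus the tracepoint, and the extra indirection disappears. The __kmem_cache_free() name and the slub body shown here are assumptions for illustration only; this is not the code that was eventually merged.

/* include/linux/slab.h -- sketch of "solution 1" */
void __kmem_cache_free(struct kmem_cache *s, void *x);	/* per-allocator */

static __always_inline void kmem_cache_free(struct kmem_cache *s, void *x)
{
	__kmem_cache_free(s, x);
	trace_kmem_cache_free(_RET_IP_, x);
}

/* mm/slub.c -- similarly for slab.c and slob.c */
void __kmem_cache_free(struct kmem_cache *s, void *x)
{
	slab_free(s, virt_to_head_page(x), x, _RET_IP_);
}
EXPORT_SYMBOL(__kmem_cache_free);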
Re: [PATCH for-v3.7 2/2] slub: optimize kmalloc* inlining for GFP_DMA
Hi, Eric.

2012/10/23 Eric Dumazet eric.duma...@gmail.com: On Tue, 2012-10-23 at 11:29 +0900, JoonSoo Kim wrote: 2012/10/22 Christoph Lameter c...@linux.com: On Sun, 21 Oct 2012, Joonsoo Kim wrote: kmalloc() and kmalloc_node() of the SLUB aren't inlined when @flags & __GFP_DMA. This patch optimizes this case, so when @flags & __GFP_DMA, it will be inlined into generic code. __GFP_DMA is a rarely used flag for kmalloc allocators and so far it was not considered that it is worth to directly support it in the inlining code. Hmm... but the SLAB already does that optimization for __GFP_DMA. Almost every kmalloc() is invoked with a constant flags value, so I think the overhead from this patch may be negligible. With this patch, the code size of vmlinux is reduced slightly.

Only because you asked a allyesconfig. GFP_DMA is used for less than 0.1 % of kmalloc() calls, for legacy hardware (from last century).

I'm not using allyesconfig, but localmodconfig on my Ubuntu desktop system. On my system, 700 bytes of vmlinux text are reduced, which means there may be more than 100 callsites with GFP_DMA.

In fact if you want to reduce even more your vmlinux, you could test if (__builtin_constant_p(flags) && (flags & SLUB_DMA)) return kmem_cache_alloc_trace(s, flags, size); to force the call to out of line code.

The reason I mention code size is that I want to say it may be good for performance, although it has just a small impact. I'm not interested in reducing code size :)

Thanks for the comment.
Re: [PATCH v5 05/18] slab/slub: struct memcg_params
Hi, Glauber.

2012/10/19 Glauber Costa glom...@parallels.com: For the kmem slab controller, we need to record some extra information in the kmem_cache structure. Signed-off-by: Glauber Costa glom...@parallels.com Signed-off-by: Suleiman Souhlal sulei...@google.com CC: Christoph Lameter c...@linux.com CC: Pekka Enberg penb...@cs.helsinki.fi CC: Michal Hocko mho...@suse.cz CC: Kamezawa Hiroyuki kamezawa.hir...@jp.fujitsu.com CC: Johannes Weiner han...@cmpxchg.org CC: Tejun Heo t...@kernel.org

--- include/linux/slab.h | 25 + include/linux/slab_def.h | 3 +++ include/linux/slub_def.h | 3 +++ mm/slab.h| 13 + 4 files changed, 44 insertions(+)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 0dd2dfa..e4ea48a 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -177,6 +177,31 @@ unsigned int kmem_cache_size(struct kmem_cache *);
 #define ARCH_SLAB_MINALIGN __alignof__(unsigned long long)
 #endif
 
+#include <linux/workqueue.h>

Why is workqueue.h included at this point? It may be for future use, so would it be better to add it later? Adding it at the proper time makes git blame work better.
Re: [PATCH v5 06/18] consider a memcg parameter in kmem_create_cache
2012/10/19 Glauber Costa glom...@parallels.com: diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 6a1e096..59f6d54 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -339,6 +339,12 @@ struct mem_cgroup { #if defined(CONFIG_MEMCG_KMEM) defined(CONFIG_INET) struct tcp_memcontrol tcp_mem; #endif +#if defined(CONFIG_MEMCG_KMEM) + /* analogous to slab_common's slab_caches list. per-memcg */ + struct list_head memcg_slab_caches; + /* Not a spinlock, we can take a lot of time walking the list */ + struct mutex slab_caches_mutex; +#endif }; It is better to change name of slab_caches_mutex to something representing slab cache mutex of memcg. +int memcg_register_cache(struct mem_cgroup *memcg, struct kmem_cache *s) +{ + size_t size = sizeof(struct memcg_cache_params); + + if (!memcg_kmem_enabled()) + return 0; + + s-memcg_params = kzalloc(size, GFP_KERNEL); + if (!s-memcg_params) + return -ENOMEM; + + if (memcg) + s-memcg_params-memcg = memcg; + return 0; +} Now, I don't read full-patchset and I have not enough knowledge about cgroup. So I have a question. When memcg_kmem_enable, creation kmem_cache of normal user(not included in cgroup) also allocate memcg_params. Is it right behavior? +void memcg_release_cache(struct kmem_cache *s) +{ + kfree(s-memcg_params); +} + /* * We need to verify if the allocation against current-mm-owner's memcg is * possible for the given order. But the page is not allocated yet, so we'll diff --git a/mm/slab.h b/mm/slab.h index 5ee1851..c35ecce 100644 --- a/mm/slab.h +++ b/mm/slab.h @@ -35,12 +35,15 @@ extern struct kmem_cache *kmem_cache; /* Functions provided by the slab allocators */ extern int __kmem_cache_create(struct kmem_cache *, unsigned long flags); +struct mem_cgroup; #ifdef CONFIG_SLUB -struct kmem_cache *__kmem_cache_alias(const char *name, size_t size, - size_t align, unsigned long flags, void (*ctor)(void *)); +struct kmem_cache * +__kmem_cache_alias(struct mem_cgroup *memcg, const char *name, size_t size, + size_t align, unsigned long flags, void (*ctor)(void *)); #else -static inline struct kmem_cache *__kmem_cache_alias(const char *name, size_t size, - size_t align, unsigned long flags, void (*ctor)(void *)) +static inline struct kmem_cache * +__kmem_cache_alias(struct mem_cgroup *memcg, const char *name, size_t size, + size_t align, unsigned long flags, void (*ctor)(void *)) { return NULL; } #endif @@ -98,11 +101,23 @@ static inline bool is_root_cache(struct kmem_cache *s) { return !s-memcg_params || s-memcg_params-is_root_cache; } + +static inline bool cache_match_memcg(struct kmem_cache *cachep, +struct mem_cgroup *memcg) +{ + return (is_root_cache(cachep) !memcg) || + (cachep-memcg_params-memcg == memcg); +} When cachep-memcg_params == NULL and memcg is not NULL, NULL pointer deref may be occurred. Is there no situation like above? 
#else static inline bool is_root_cache(struct kmem_cache *s) { return true; } +static inline bool cache_match_memcg(struct kmem_cache *cachep, +struct mem_cgroup *memcg) +{ + return true; +} #endif #endif diff --git a/mm/slab_common.c b/mm/slab_common.c index bf4b4f1..f97f7b8 100644 --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -18,6 +18,7 @@ #include asm/cacheflush.h #include asm/tlbflush.h #include asm/page.h +#include linux/memcontrol.h #include slab.h @@ -27,7 +28,8 @@ DEFINE_MUTEX(slab_mutex); struct kmem_cache *kmem_cache; #ifdef CONFIG_DEBUG_VM -static int kmem_cache_sanity_check(const char *name, size_t size) +static int kmem_cache_sanity_check(struct mem_cgroup *memcg, const char *name, + size_t size) { struct kmem_cache *s = NULL; @@ -53,7 +55,7 @@ static int kmem_cache_sanity_check(const char *name, size_t size) continue; } - if (!strcmp(s-name, name)) { + if (cache_match_memcg(s, memcg) !strcmp(s-name, name)) { pr_err(%s (%s): Cache name already exists.\n, __func__, name); dump_stack(); @@ -66,7 +68,8 @@ static int kmem_cache_sanity_check(const char *name, size_t size) return 0; } #else -static inline int kmem_cache_sanity_check(const char *name, size_t size) +static inline int kmem_cache_sanity_check(struct mem_cgroup *memcg, + const char *name, size_t size) { return 0; } @@ -97,8 +100,9 @@ static inline int kmem_cache_sanity_check(const char *name, size_t size) * as davem. */ -struct kmem_cache *kmem_cache_create(const
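To spell out the NULL-dereference concern raised above: if cachep->memcg_params is NULL (is_root_cache() returns true) but memcg is non-NULL, the first clause of the helper is false and the second clause dereferences the NULL memcg_params. A purely hypothetical defensive variant, shown only to illustrate the reviewer's point and not proposed as the actual fix, might look like this:

static inline bool cache_match_memcg(struct kmem_cache *cachep,
				     struct mem_cgroup *memcg)
{
	/* Treat a cache without memcg_params as belonging to the root
	 * memcg, so a NULL memcg_params can never be dereferenced. */
	if (!cachep->memcg_params)
		return !memcg;

	return (is_root_cache(cachep) && !memcg) ||
	       (cachep->memcg_params->memcg == memcg);
}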
Re: [PATCH v2 1/2] kmem_cache: include allocators code directly into slab_common
2012/10/24 Glauber Costa glom...@parallels.com: On 10/24/2012 06:29 PM, Christoph Lameter wrote: On Wed, 24 Oct 2012, Glauber Costa wrote: Because of that, we either have to move all the entry points to the mm/slab.h and rely heavily on the pre-processor, or include all .c files in here. Hmm... That is a bit of a radical solution. The global optimizations now possible with the new gcc compiler include the ability to fold functions across different linkable objects. Andi, is that usable for kernel builds? In general, it takes quite a lot of time to take all those optimizations for granted. We still live a lot of time with multiple compiler versions building distros, etc, for quite some time. I would expect the end result for anyone not using such a compiler to be a sudden performance drop when using a new kernel. Not really pleasant. I agree with Glauber's opinion. And patch looks fine to me. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v5 08/18] memcg: infrastructure to match an allocation to the right cache
2012/10/19 Glauber Costa glom...@parallels.com: @@ -2930,9 +2937,188 @@ int memcg_register_cache(struct mem_cgroup *memcg, struct kmem_cache *s) void memcg_release_cache(struct kmem_cache *s) { + struct kmem_cache *root; + int id = memcg_css_id(s-memcg_params-memcg); + + if (s-memcg_params-is_root_cache) + goto out; + + root = s-memcg_params-root_cache; + root-memcg_params-memcg_caches[id] = NULL; + mem_cgroup_put(s-memcg_params-memcg); +out: kfree(s-memcg_params); } memcg_css_id should be called after checking s-memcg_params-is_root_cache. Because when is_root_cache == true, memcg_params has no memcg object. +/* + * This lock protects updaters, not readers. We want readers to be as fast as + * they can, and they will either see NULL or a valid cache value. Our model + * allow them to see NULL, in which case the root memcg will be selected. + * + * We need this lock because multiple allocations to the same cache from a non + * GFP_WAIT area will span more than one worker. Only one of them can create + * the cache. + */ +static DEFINE_MUTEX(memcg_cache_mutex); +static struct kmem_cache *memcg_create_kmem_cache(struct mem_cgroup *memcg, + struct kmem_cache *cachep) +{ + struct kmem_cache *new_cachep; + int idx; + + BUG_ON(!memcg_can_account_kmem(memcg)); + + idx = memcg_css_id(memcg); + + mutex_lock(memcg_cache_mutex); + new_cachep = cachep-memcg_params-memcg_caches[idx]; + if (new_cachep) + goto out; + + new_cachep = kmem_cache_dup(memcg, cachep); + + if (new_cachep == NULL) { + new_cachep = cachep; + goto out; + } + + mem_cgroup_get(memcg); + cachep-memcg_params-memcg_caches[idx] = new_cachep; + wmb(); /* the readers won't lock, make sure everybody sees it */ Is there any rmb() pair? As far as I know, without rmb(), wmb() doesn't guarantee anything. + new_cachep-memcg_params-memcg = memcg; + new_cachep-memcg_params-root_cache = cachep; It may be better these assignment before the statement cachep-memcg_params-memcg_caches[idx] = new_cachep. Otherwise, it may produce race situation. And assigning value to memcg_params-memcg and root_cache is redundant, because it is already done in memcg_register_cache(). +/* + * Return the kmem_cache we're supposed to use for a slab allocation. + * We try to use the current memcg's version of the cache. + * + * If the cache does not exist yet, if we are the first user of it, + * we either create it immediately, if possible, or create it asynchronously + * in a workqueue. + * In the latter case, we will let the current allocation go through with + * the original cache. + * + * Can't be called in interrupt context or from kernel threads. + * This function needs to be called with rcu_read_lock() held. + */ +struct kmem_cache *__memcg_kmem_get_cache(struct kmem_cache *cachep, + gfp_t gfp) +{ + struct mem_cgroup *memcg; + int idx; + + if (cachep-memcg_params cachep-memcg_params-memcg) + return cachep; In __memcg_kmem_get_cache, cachep may be always root cache. So checking cachep-memcg_params-memcg is somewhat strange. Is it right? Thanks. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
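For context on the barrier question above: a wmb() on the writer side only orders the writes; any reader that can observe the published cache pointer without holding memcg_cache_mutex needs a pairing read-side barrier (or at least a data-dependency barrier) so it cannot see the pointer before the fields it points to. The snippet below is a generic kernel-style illustration of the pairing being asked about, not the memcg code itself; use() is a made-up placeholder.

/* Writer (e.g. under memcg_cache_mutex): initialise, then publish. */
new_cachep->memcg_params->memcg = memcg;
new_cachep->memcg_params->root_cache = cachep;
smp_wmb();					/* order init before publish */
cachep->memcg_params->memcg_caches[idx] = new_cachep;

/* Lockless reader: needs the pairing barrier, otherwise the writer's
 * smp_wmb() guarantees nothing for this CPU. */
s = cachep->memcg_params->memcg_caches[idx];
smp_read_barrier_depends();			/* or smp_rmb() */
if (s)
	use(s->memcg_params->memcg);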
[PATCH] workqueue: insert warn_on_once() into wq_worker_waking_up()
Recently, the workqueue code has gone through some changes and we found some bugs related to this. To prevent further bugs in advance, add WARN_ON_ONCE() in wq_worker_waking_up(). When a worker is not in a WORKER_NOT_RUNNING state, it should be bound to its associated cpu and execute on that cpu. Add code for checking this.

Signed-off-by: Joonsoo Kim js1...@gmail.com

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index a1135c6..1a65132 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -739,8 +739,10 @@ void wq_worker_waking_up(struct task_struct *task, unsigned int cpu)
 {
 	struct worker *worker = kthread_data(task);
 
-	if (!(worker->flags & WORKER_NOT_RUNNING))
+	if (!(worker->flags & WORKER_NOT_RUNNING)) {
+		WARN_ON_ONCE(worker->pool->gcwq->cpu != cpu);
 		atomic_inc(get_pool_nr_running(worker->pool));
+	}
 }
 
 /**
-- 
1.7.9.5
Re: [PATCH 0/2] clean-up initialization of deferrable timer
2012/10/19 Joonsoo Kim js1...@gmail.com: This patchset introduces setup_timer_deferrable() macro. Using it makes code simple and understandable. This patchset doesn't make any functional difference. It is just for clean-up. It is based on v3.7-rc1 Joonsoo Kim (2): timer: add setup_timer_deferrable() macro timer: use new setup_timer_deferrable() macro drivers/acpi/apei/ghes.c |5 ++--- drivers/ata/libata-core.c|5 ++--- drivers/net/ethernet/nvidia/forcedeth.c | 13 - drivers/net/ethernet/qlogic/qlge/qlge_main.c |4 +--- drivers/net/vxlan.c |5 ++--- include/linux/timer.h|2 ++ kernel/workqueue.c |6 ++ net/mac80211/agg-rx.c| 12 ++-- net/mac80211/agg-tx.c| 12 ++-- net/sched/cls_flow.c |5 ++--- net/sched/sch_sfq.c |5 ++--- 11 files changed, 31 insertions(+), 43 deletions(-) -- 1.7.9.5 Hello, Thomas. Will you pick this for your tree? Or is there anything wrong with it? Thanks. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 1/2] kmem_cache: include allocators code directly into slab_common
2012/10/25 Christoph Lameter c...@linux.com: On Wed, 24 Oct 2012, Pekka Enberg wrote: So I hate this patch with a passion. We don't have any fastpaths in mm/slab_common.c nor should we. Those should be allocator specific. I have similar thoughts on the issue. Lets keep the fast paths allocator specific until we find a better way to handle this issue. Okay. I see. How about applying LTO not to the whole kernel code, but just to slab_common.o + sl[aou]b.o? I think that it may be possible, isn't it? -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/5] mm, highmem: use PKMAP_NR() to calculate an index of pkmap
To calculate an index of pkmap, using PKMAP_NR() is more understandable and maintainable, so change it.

Cc: Mel Gorman mgor...@suse.de
Signed-off-by: Joonsoo Kim js1...@gmail.com

diff --git a/mm/highmem.c b/mm/highmem.c
index d517cd1..b3b3d68 100644
--- a/mm/highmem.c
+++ b/mm/highmem.c
@@ -99,7 +99,7 @@ struct page *kmap_to_page(void *vaddr)
 	unsigned long addr = (unsigned long)vaddr;
 
 	if (addr >= PKMAP_ADDR(0) && addr <= PKMAP_ADDR(LAST_PKMAP)) {
-		int i = (addr - PKMAP_ADDR(0)) >> PAGE_SHIFT;
+		int i = PKMAP_NR(addr);
 		return pte_page(pkmap_page_table[i]);
 	}
-- 
1.7.9.5
[PATCH 3/5] mm, highmem: remove page_address_pool list
We can find free page_address_map instance without the page_address_pool. So remove it. Signed-off-by: Joonsoo Kim js1...@gmail.com diff --git a/mm/highmem.c b/mm/highmem.c index 017bad1..731cf9a 100644 --- a/mm/highmem.c +++ b/mm/highmem.c @@ -323,11 +323,7 @@ struct page_address_map { void *virtual; struct list_head list; }; - -/* - * page_address_map freelist, allocated from page_address_maps. - */ -static struct list_head page_address_pool; /* freelist */ +static struct page_address_map page_address_maps[LAST_PKMAP]; /* * Hash table bucket @@ -392,12 +388,7 @@ void set_page_address(struct page *page, void *virtual) pas = page_slot(page); if (virtual) { /* Add */ - BUG_ON(list_empty(page_address_pool)); - - pam = list_entry(page_address_pool.next, - struct page_address_map, list); - list_del(pam-list); - + pam = page_address_maps[PKMAP_NR((unsigned long)virtual)]; pam-page = page; pam-virtual = virtual; @@ -410,7 +401,6 @@ void set_page_address(struct page *page, void *virtual) if (pam-page == page) { list_del(pam-list); spin_unlock_irqrestore(pas-lock, flags); - list_add_tail(pam-list, page_address_pool); goto done; } } @@ -420,15 +410,10 @@ done: return; } -static struct page_address_map page_address_maps[LAST_PKMAP]; - void __init page_address_init(void) { int i; - INIT_LIST_HEAD(page_address_pool); - for (i = 0; i ARRAY_SIZE(page_address_maps); i++) - list_add(page_address_maps[i].list, page_address_pool); for (i = 0; i ARRAY_SIZE(page_address_htable); i++) { INIT_LIST_HEAD(page_address_htable[i].lh); spin_lock_init(page_address_htable[i].lock); -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 4/5] mm, highmem: makes flush_all_zero_pkmaps() return index of last flushed entry
In current code, after flush_all_zero_pkmaps() is invoked, then re-iterate all pkmaps. It can be optimized if flush_all_zero_pkmaps() return index of flushed entry. With this index, we can immediately map highmem page to virtual address represented by index. So change return type of flush_all_zero_pkmaps() and return index of last flushed entry. Signed-off-by: Joonsoo Kim js1...@gmail.com diff --git a/include/linux/highmem.h b/include/linux/highmem.h index ef788b5..0683869 100644 --- a/include/linux/highmem.h +++ b/include/linux/highmem.h @@ -32,6 +32,7 @@ static inline void invalidate_kernel_vmap_range(void *vaddr, int size) #ifdef CONFIG_HIGHMEM #include asm/highmem.h +#define PKMAP_INDEX_INVAL (-1) /* declarations for linux/mm/highmem.c */ unsigned int nr_free_highpages(void); diff --git a/mm/highmem.c b/mm/highmem.c index 731cf9a..65beb9a 100644 --- a/mm/highmem.c +++ b/mm/highmem.c @@ -106,10 +106,10 @@ struct page *kmap_to_page(void *vaddr) return virt_to_page(addr); } -static void flush_all_zero_pkmaps(void) +static int flush_all_zero_pkmaps(void) { int i; - int need_flush = 0; + int index = PKMAP_INDEX_INVAL; flush_cache_kmaps(); @@ -141,10 +141,12 @@ static void flush_all_zero_pkmaps(void) pkmap_page_table[i]); set_page_address(page, NULL); - need_flush = 1; + index = i; } - if (need_flush) + if (index != PKMAP_INDEX_INVAL) flush_tlb_kernel_range(PKMAP_ADDR(0), PKMAP_ADDR(LAST_PKMAP)); + + return index; } /** @@ -160,6 +162,7 @@ void kmap_flush_unused(void) static inline unsigned long map_new_virtual(struct page *page) { unsigned long vaddr; + int index = PKMAP_INDEX_INVAL; int count; start: @@ -168,40 +171,45 @@ start: for (;;) { last_pkmap_nr = (last_pkmap_nr + 1) LAST_PKMAP_MASK; if (!last_pkmap_nr) { - flush_all_zero_pkmaps(); - count = LAST_PKMAP; + index = flush_all_zero_pkmaps(); + if (index != PKMAP_INDEX_INVAL) + break; /* Found a usable entry */ } - if (!pkmap_count[last_pkmap_nr]) + if (!pkmap_count[last_pkmap_nr]) { + index = last_pkmap_nr; break; /* Found a usable entry */ - if (--count) - continue; - - /* -* Sleep for somebody else to unmap their entries -*/ - { - DECLARE_WAITQUEUE(wait, current); - - __set_current_state(TASK_UNINTERRUPTIBLE); - add_wait_queue(pkmap_map_wait, wait); - unlock_kmap(); - schedule(); - remove_wait_queue(pkmap_map_wait, wait); - lock_kmap(); - - /* Somebody else might have mapped it while we slept */ - if (page_address(page)) - return (unsigned long)page_address(page); - - /* Re-start */ - goto start; } + if (--count == 0) + break; } - vaddr = PKMAP_ADDR(last_pkmap_nr); + + /* +* Sleep for somebody else to unmap their entries +*/ + if (index == PKMAP_INDEX_INVAL) { + DECLARE_WAITQUEUE(wait, current); + + __set_current_state(TASK_UNINTERRUPTIBLE); + add_wait_queue(pkmap_map_wait, wait); + unlock_kmap(); + schedule(); + remove_wait_queue(pkmap_map_wait, wait); + lock_kmap(); + + /* Somebody else might have mapped it while we slept */ + vaddr = (unsigned long)page_address(page); + if (vaddr) + return vaddr; + + /* Re-start */ + goto start; + } + + vaddr = PKMAP_ADDR(index); set_pte_at(init_mm, vaddr, - (pkmap_page_table[last_pkmap_nr]), mk_pte(page, kmap_prot)); + (pkmap_page_table[index]), mk_pte(page, kmap_prot)); - pkmap_count[last_pkmap_nr] = 1; + pkmap_count[index] = 1; set_page_address(page, (void *)vaddr); return vaddr; -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read 
the FAQ at http://www.tux.org/lkml/
[PATCH 2/5] mm, highmem: remove useless pool_lock
The pool_lock protects the page_address_pool from concurrent access. But, access to the page_address_pool is already protected by kmap_lock. So remove it. Signed-off-by: Joonsoo Kim js1...@gmail.com diff --git a/mm/highmem.c b/mm/highmem.c index b3b3d68..017bad1 100644 --- a/mm/highmem.c +++ b/mm/highmem.c @@ -328,7 +328,6 @@ struct page_address_map { * page_address_map freelist, allocated from page_address_maps. */ static struct list_head page_address_pool; /* freelist */ -static spinlock_t pool_lock; /* protects page_address_pool */ /* * Hash table bucket @@ -395,11 +394,9 @@ void set_page_address(struct page *page, void *virtual) if (virtual) { /* Add */ BUG_ON(list_empty(page_address_pool)); - spin_lock_irqsave(pool_lock, flags); pam = list_entry(page_address_pool.next, struct page_address_map, list); list_del(pam-list); - spin_unlock_irqrestore(pool_lock, flags); pam-page = page; pam-virtual = virtual; @@ -413,9 +410,7 @@ void set_page_address(struct page *page, void *virtual) if (pam-page == page) { list_del(pam-list); spin_unlock_irqrestore(pas-lock, flags); - spin_lock_irqsave(pool_lock, flags); list_add_tail(pam-list, page_address_pool); - spin_unlock_irqrestore(pool_lock, flags); goto done; } } @@ -438,7 +433,6 @@ void __init page_address_init(void) INIT_LIST_HEAD(page_address_htable[i].lh); spin_lock_init(page_address_htable[i].lock); } - spin_lock_init(pool_lock); } #endif /* defined(CONFIG_HIGHMEM) !defined(WANT_PAGE_VIRTUAL) */ -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 0/5] minor clean-up and optimize highmem related code
This patchset cleans up and optimizes highmem related code. [1] is just clean-up and doesn't introduce any functional change. [2-3] are for clean-up and optimization. These eliminate a useless lock operation and list management. [4-5] are for optimization related to flush_all_zero_pkmaps(). Joonsoo Kim (5): mm, highmem: use PKMAP_NR() to calculate an index of pkmap mm, highmem: remove useless pool_lock mm, highmem: remove page_address_pool list mm, highmem: make flush_all_zero_pkmaps() return index of last flushed entry mm, highmem: get virtual address of the page using PKMAP_ADDR() include/linux/highmem.h | 1 + mm/highmem.c | 102 ++++++++++++++++++++++++------------------------ 2 files changed, 45 insertions(+), 58 deletions(-) -- 1.7.9.5
[PATCH 5/5] mm, highmem: get virtual address of the page using PKMAP_ADDR()
In flush_all_zero_pkmaps(), we have the index of the pkmap associated with the page. Using this index, we can simply get the virtual address of the page. So change it. Signed-off-by: Joonsoo Kim js1...@gmail.com diff --git a/mm/highmem.c b/mm/highmem.c index 65beb9a..1417f4f 100644 --- a/mm/highmem.c +++ b/mm/highmem.c @@ -137,8 +137,7 @@ static int flush_all_zero_pkmaps(void) * So no dangers, even with speculative execution. */ page = pte_page(pkmap_page_table[i]); - pte_clear(&init_mm, (unsigned long)page_address(page), - &pkmap_page_table[i]); + pte_clear(&init_mm, PKMAP_ADDR(i), &pkmap_page_table[i]); set_page_address(page, NULL); index = i; -- 1.7.9.5
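For readers unfamiliar with the pkmap macros, here is a minimal user-space sketch (not part of the patch; PKMAP_BASE below is an illustrative value, not the real per-arch constant) of the PKMAP_ADDR()/PKMAP_NR() inverse pair that lets PKMAP_ADDR(i) stand in for page_address(page) here:

#include <stdio.h>

#define PAGE_SHIFT	12
#define PKMAP_BASE	0xff800000UL	/* illustrative value, not the real arch constant */
#define PKMAP_NR(virt)	(((virt) - PKMAP_BASE) >> PAGE_SHIFT)
#define PKMAP_ADDR(nr)	(PKMAP_BASE + ((nr) << PAGE_SHIFT))

int main(void)
{
	unsigned long i = 17;
	unsigned long vaddr = PKMAP_ADDR(i);

	/* the two macros are inverses, so a pkmap index recovers the
	 * virtual address and vice versa */
	printf("index %lu -> vaddr %#lx -> index %lu\n", i, vaddr, PKMAP_NR(vaddr));
	return 0;
}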
Re: [PATCH 4/5] mm, highmem: make flush_all_zero_pkmaps() return index of last flushed entry
2012/10/29 Minchan Kim minc...@kernel.org: On Mon, Oct 29, 2012 at 04:12:55AM +0900, Joonsoo Kim wrote: In current code, after flush_all_zero_pkmaps() is invoked, then re-iterate all pkmaps. It can be optimized if flush_all_zero_pkmaps() return index of flushed entry. With this index, we can immediately map highmem page to virtual address represented by index. So change return type of flush_all_zero_pkmaps() and return index of last flushed entry. Signed-off-by: Joonsoo Kim js1...@gmail.com diff --git a/include/linux/highmem.h b/include/linux/highmem.h index ef788b5..0683869 100644 --- a/include/linux/highmem.h +++ b/include/linux/highmem.h @@ -32,6 +32,7 @@ static inline void invalidate_kernel_vmap_range(void *vaddr, int size) #ifdef CONFIG_HIGHMEM #include asm/highmem.h +#define PKMAP_INDEX_INVAL (-1) How about this? #define PKMAP_INVALID_INDEX (-1) Okay. /* declarations for linux/mm/highmem.c */ unsigned int nr_free_highpages(void); diff --git a/mm/highmem.c b/mm/highmem.c index 731cf9a..65beb9a 100644 --- a/mm/highmem.c +++ b/mm/highmem.c @@ -106,10 +106,10 @@ struct page *kmap_to_page(void *vaddr) return virt_to_page(addr); } -static void flush_all_zero_pkmaps(void) +static int flush_all_zero_pkmaps(void) { int i; - int need_flush = 0; + int index = PKMAP_INDEX_INVAL; flush_cache_kmaps(); @@ -141,10 +141,12 @@ static void flush_all_zero_pkmaps(void) pkmap_page_table[i]); set_page_address(page, NULL); - need_flush = 1; + index = i; How about returning first free index instead of last one? and update last_pkmap_nr to it. Okay. It will be more good. Thanks. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/5] minor clean-up and optimize highmem related code
Hi, Minchan. 2012/10/29 Minchan Kim minc...@kernel.org: Hi Joonsoo, On Mon, Oct 29, 2012 at 04:12:51AM +0900, Joonsoo Kim wrote: This patchset clean-up and optimize highmem related code. [1] is just clean-up and doesn't introduce any functional change. [2-3] are for clean-up and optimization. These eliminate an useless lock opearation and list management. [4-5] is for optimization related to flush_all_zero_pkmaps(). Joonsoo Kim (5): mm, highmem: use PKMAP_NR() to calculate an index of pkmap mm, highmem: remove useless pool_lock mm, highmem: remove page_address_pool list mm, highmem: makes flush_all_zero_pkmaps() return index of last flushed entry mm, highmem: get virtual address of the page using PKMAP_ADDR() This patchset looks awesome to me. If you have a plan to respin, please CCed Peter. Thanks for review. I will wait more review and respin, the day after tomorrow. Version 2 will include fix about your comment. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] Documentation: insert an additional information to android.txt
Without defining ARCH=arm, building perf for Android ARM fails, because it needs architecture specific files. So add the related information to the document. Signed-off-by: Joonsoo Kim js1...@gmail.com Cc: Irina Tirdea irina.tir...@intel.com Cc: David Ahern dsah...@gmail.com Cc: Ingo Molnar mi...@kernel.org Cc: Jiri Olsa jo...@redhat.com Cc: Namhyung Kim namhy...@kernel.org Cc: Paul Mackerras pau...@samba.org Cc: Pekka Enberg penb...@kernel.org Cc: Peter Zijlstra a.p.zijls...@chello.nl Cc: Steven Rostedt rost...@goodmis.org --- It is based on git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git, perf-core-for-mingo branch. diff --git a/tools/perf/Documentation/android.txt b/tools/perf/Documentation/android.txt index a39dbbb..8484c3a 100644 --- a/tools/perf/Documentation/android.txt +++ b/tools/perf/Documentation/android.txt @@ -48,7 +48,10 @@ For x86: II. Compile perf for Android You need to run make with the NDK toolchain and sysroot defined above: - make CROSS_COMPILE=${NDK_TOOLCHAIN} CFLAGS=--sysroot=${NDK_SYSROOT} +For arm: + make ARCH=arm CROSS_COMPILE=${NDK_TOOLCHAIN} CFLAGS=--sysroot=${NDK_SYSROOT} +For x86: + make ARCH=x86 CROSS_COMPILE=${NDK_TOOLCHAIN} CFLAGS=--sysroot=${NDK_SYSROOT} III. Install perf --- -- 1.7.9.5
[PATCH] percpu: change a method freeing a chunk for consistency.
commit 099a19d9 ('allow limited allocation before slab is online') changed chunk allocation from kzalloc to pcpu_mem_alloc, but it missed changing the matching free operation. It may not be a problem for now, but fix it for consistency. Signed-off-by: Joonsoo Kim js1...@gmail.com Cc: Christoph Lameter c...@linux.com diff --git a/mm/percpu.c b/mm/percpu.c index ddc5efb..ec25896 100644 --- a/mm/percpu.c +++ b/mm/percpu.c @@ -631,7 +631,7 @@ static void pcpu_free_chunk(struct pcpu_chunk *chunk) if (!chunk) return; pcpu_mem_free(chunk->map, chunk->map_alloc * sizeof(chunk->map[0])); - kfree(chunk); + pcpu_mem_free(chunk, pcpu_chunk_struct_size); } /* -- 1.7.9.5
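To see why the pairing matters, here is a rough sketch (an assumption about the shape of these helpers in mm/percpu.c of that era, simplified and not copied verbatim): large sizes are served by vmalloc, so once pcpu_chunk_struct_size exceeds a page, kfree() would be the wrong way to release the chunk:

#include <linux/slab.h>
#include <linux/vmalloc.h>

/* simplified sketch of the internal helpers (assumption, not the exact code) */
static void *pcpu_mem_alloc(size_t size)
{
	if (size <= PAGE_SIZE)
		return kzalloc(size, GFP_KERNEL);	/* small sizes: slab */
	return vzalloc(size);				/* large sizes: vmalloc */
}

static void pcpu_mem_free(void *ptr, size_t size)
{
	if (size <= PAGE_SIZE)
		kfree(ptr);		/* must mirror the allocation path */
	else
		vfree(ptr);
}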
Re: [PATCH v5 11/18] sl[au]b: Allocate objects from memcg cache
Hi, Glauber. 2012/10/19 Glauber Costa glom...@parallels.com: We are able to match a cache allocation to a particular memcg. If the task doesn't change groups during the allocation itself - a rare event, this will give us a good picture about who is the first group to touch a cache page. This patch uses the now available infrastructure by calling memcg_kmem_get_cache() before all the cache allocations. Signed-off-by: Glauber Costa glom...@parallels.com CC: Christoph Lameter c...@linux.com CC: Pekka Enberg penb...@cs.helsinki.fi CC: Michal Hocko mho...@suse.cz CC: Kamezawa Hiroyuki kamezawa.hir...@jp.fujitsu.com CC: Johannes Weiner han...@cmpxchg.org CC: Suleiman Souhlal sulei...@google.com CC: Tejun Heo t...@kernel.org --- include/linux/slub_def.h | 15 ++- mm/memcontrol.c | 3 +++ mm/slab.c| 6 +- mm/slub.c| 5 +++-- 4 files changed, 21 insertions(+), 8 deletions(-) diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h index 961e72e..ed330df 100644 --- a/include/linux/slub_def.h +++ b/include/linux/slub_def.h @@ -13,6 +13,8 @@ #include linux/kobject.h #include linux/kmemleak.h +#include linux/memcontrol.h +#include linux/mm.h enum stat_item { ALLOC_FASTPATH, /* Allocation from cpu slab */ @@ -209,14 +211,14 @@ static __always_inline int kmalloc_index(size_t size) * This ought to end up with a global pointer to the right cache * in kmalloc_caches. */ -static __always_inline struct kmem_cache *kmalloc_slab(size_t size) +static __always_inline struct kmem_cache *kmalloc_slab(gfp_t flags, size_t size) { int index = kmalloc_index(size); if (index == 0) return NULL; - return kmalloc_caches[index]; + return memcg_kmem_get_cache(kmalloc_caches[index], flags); } You don't need this, because memcg_kmem_get_cache() is invoked in both slab_alloc() and __cache_alloc_node(). -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v5 13/18] memcg/sl[au]b Track all the memcg children of a kmem_cache.
2012/10/19 Glauber Costa glom...@parallels.com: +void kmem_cache_destroy_memcg_children(struct kmem_cache *s) +{ + struct kmem_cache *c; + int i; + + if (!s-memcg_params) + return; + if (!s-memcg_params-is_root_cache) + return; + + /* +* If the cache is being destroyed, we trust that there is no one else +* requesting objects from it. Even if there are, the sanity checks in +* kmem_cache_destroy should caught this ill-case. +* +* Still, we don't want anyone else freeing memcg_caches under our +* noses, which can happen if a new memcg comes to life. As usual, +* we'll take the set_limit_mutex to protect ourselves against this. +*/ + mutex_lock(set_limit_mutex); + for (i = 0; i memcg_limited_groups_array_size; i++) { + c = s-memcg_params-memcg_caches[i]; + if (c) + kmem_cache_destroy(c); + } + mutex_unlock(set_limit_mutex); +} It may cause NULL deref. Look at the following scenario. 1. some memcg slab caches has remained object. 2. start to destroy memcg. 3. schedule_delayed_work(kmem_cache_destroy_work_func, @delay 60hz) 4. all remained object is freed. 5. start to destroy root cache. 6. kmem_cache_destroy makes 's-memcg_params-memcg_caches[i] NULL!! 7. Start delayed work function. 8. cachep in kmem_cache_destroy_work_func() may be NULL Thanks. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/4] bootmem: remove not implemented function call, bootmem_arch_preferred_node()
There is no implementation of bootmem_arch_preferred_node() and a call to this function would cause a compile error. So, remove it. Signed-off-by: Joonsoo Kim js1...@gmail.com diff --git a/mm/bootmem.c b/mm/bootmem.c index 434be4a..6f62c03e 100644 --- a/mm/bootmem.c +++ b/mm/bootmem.c @@ -589,19 +589,6 @@ static void * __init alloc_arch_preferred_bootmem(bootmem_data_t *bdata, { if (WARN_ON_ONCE(slab_is_available())) return kzalloc(size, GFP_NOWAIT); - -#ifdef CONFIG_HAVE_ARCH_BOOTMEM - { - bootmem_data_t *p_bdata; - - p_bdata = bootmem_arch_preferred_node(bdata, size, align, - goal, limit); - if (p_bdata) - return alloc_bootmem_bdata(p_bdata, size, align, - goal, limit); - } -#endif - return NULL; } static void * __init alloc_bootmem_core(unsigned long size, -- 1.7.9.5
[PATCH 3/4] bootmem: remove alloc_arch_preferred_bootmem()
The name of function is not suitable for now. And removing function and inlining it's code to each call sites makes code more understandable. Additionally, we shouldn't do allocation from bootmem when slab_is_available(), so directly return kmalloc*'s return value. Signed-off-by: Joonsoo Kim js1...@gmail.com diff --git a/mm/bootmem.c b/mm/bootmem.c index 6f62c03e..cd5c5a2 100644 --- a/mm/bootmem.c +++ b/mm/bootmem.c @@ -583,14 +583,6 @@ find_block: return NULL; } -static void * __init alloc_arch_preferred_bootmem(bootmem_data_t *bdata, - unsigned long size, unsigned long align, - unsigned long goal, unsigned long limit) -{ - if (WARN_ON_ONCE(slab_is_available())) - return kzalloc(size, GFP_NOWAIT); -} - static void * __init alloc_bootmem_core(unsigned long size, unsigned long align, unsigned long goal, @@ -599,9 +591,8 @@ static void * __init alloc_bootmem_core(unsigned long size, bootmem_data_t *bdata; void *region; - region = alloc_arch_preferred_bootmem(NULL, size, align, goal, limit); - if (region) - return region; + if (WARN_ON_ONCE(slab_is_available())) + return kzalloc(size, GFP_NOWAIT); list_for_each_entry(bdata, bdata_list, list) { if (goal bdata-node_low_pfn = PFN_DOWN(goal)) @@ -699,11 +690,9 @@ void * __init ___alloc_bootmem_node_nopanic(pg_data_t *pgdat, { void *ptr; + if (WARN_ON_ONCE(slab_is_available())) + return kzalloc(size, GFP_NOWAIT); again: - ptr = alloc_arch_preferred_bootmem(pgdat-bdata, size, - align, goal, limit); - if (ptr) - return ptr; /* do not panic in alloc_bootmem_bdata() */ if (limit goal + size limit) -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 2/4] avr32, kconfig: remove HAVE_ARCH_BOOTMEM
Now, there is no code for CONFIG_HAVE_ARCH_BOOTMEM. So remove it. Cc: Haavard Skinnemoen hskinnem...@gmail.com Cc: Hans-Christian Egtvedt egtv...@samfundet.no Signed-off-by: Joonsoo Kim js1...@gmail.com diff --git a/arch/avr32/Kconfig b/arch/avr32/Kconfig index 06e73bf..c2bbc9a 100644 --- a/arch/avr32/Kconfig +++ b/arch/avr32/Kconfig @@ -193,9 +193,6 @@ source kernel/Kconfig.preempt config QUICKLIST def_bool y -config HAVE_ARCH_BOOTMEM - def_bool n - config ARCH_HAVE_MEMORY_PRESENT def_bool n -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 4/4] bootmem: fix wrong call parameter for free_bootmem()
It is somewhat strange that alloc_bootmem() returns a virtual address while free_bootmem() requires a physical address. Anyway, free_bootmem()'s first parameter should be a physical address. There are some call sites which pass a virtual address to free_bootmem(). So fix them. Cc: Andrew Morton a...@linux-foundation.org Cc: Benjamin Herrenschmidt b...@kernel.crashing.org Signed-off-by: Joonsoo Kim js1...@gmail.com diff --git a/arch/powerpc/platforms/cell/celleb_pci.c b/arch/powerpc/platforms/cell/celleb_pci.c index abc8af4..1735681 100644 --- a/arch/powerpc/platforms/cell/celleb_pci.c +++ b/arch/powerpc/platforms/cell/celleb_pci.c @@ -401,11 +401,11 @@ error: } else { if (config && *config) { size = 256; - free_bootmem((unsigned long)(*config), size); + free_bootmem(__pa(*config), size); } if (res && *res) { size = sizeof(struct celleb_pci_resource); - free_bootmem((unsigned long)(*res), size); + free_bootmem(__pa(*res), size); } } diff --git a/drivers/macintosh/smu.c b/drivers/macintosh/smu.c index 7d5a6b4..1963680 100644 --- a/drivers/macintosh/smu.c +++ b/drivers/macintosh/smu.c @@ -565,7 +565,7 @@ fail_msg_node: fail_db_node: of_node_put(smu->db_node); fail_bootmem: - free_bootmem((unsigned long)smu, sizeof(struct smu_device)); + free_bootmem(__pa(smu), sizeof(struct smu_device)); smu = NULL; fail_np: of_node_put(np); diff --git a/lib/cpumask.c b/lib/cpumask.c index 402a54a..d327b87 100644 --- a/lib/cpumask.c +++ b/lib/cpumask.c @@ -161,6 +161,6 @@ EXPORT_SYMBOL(free_cpumask_var); */ void __init free_bootmem_cpumask_var(cpumask_var_t mask) { - free_bootmem((unsigned long)mask, cpumask_size()); + free_bootmem(__pa(mask), cpumask_size()); } #endif -- 1.7.9.5
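A short hedged sketch (not taken from the patch; my_table_init, its arguments and the early-boot scenario are hypothetical) of the convention the fix enforces: alloc_bootmem() hands back a virtual address, free_bootmem() expects the physical one, so callers convert with __pa():

#include <linux/bootmem.h>
#include <linux/types.h>

/* hypothetical early-boot helper, for illustration only */
static void __init my_table_init(unsigned long size, bool keep)
{
	void *my_table = alloc_bootmem(size);		/* returns a virtual address */

	if (!keep)
		free_bootmem(__pa(my_table), size);	/* takes a physical address */
}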
[PATCH] powerpc: change free_bootmem() to kfree()
commit ea96025a('Don't use alloc_bootmem() in init_IRQ() path') changed alloc_bootmem() to kzalloc(), but missed to change free_bootmem() to kfree(). So correct it. Signed-off-by: Joonsoo Kim js1...@gmail.com diff --git a/arch/powerpc/platforms/82xx/pq2ads-pci-pic.c b/arch/powerpc/platforms/82xx/pq2ads-pci-pic.c index 328d221..74861a7 100644 --- a/arch/powerpc/platforms/82xx/pq2ads-pci-pic.c +++ b/arch/powerpc/platforms/82xx/pq2ads-pci-pic.c @@ -16,7 +16,6 @@ #include linux/spinlock.h #include linux/irq.h #include linux/types.h -#include linux/bootmem.h #include linux/slab.h #include asm/io.h @@ -149,7 +148,7 @@ int __init pq2ads_pci_init_irq(void) priv-regs = of_iomap(np, 0); if (!priv-regs) { printk(KERN_ERR Cannot map PCI PIC registers.\n); - goto out_free_bootmem; + goto out_free_kmalloc; } /* mask all PCI interrupts */ @@ -171,9 +170,8 @@ int __init pq2ads_pci_init_irq(void) out_unmap_regs: iounmap(priv-regs); -out_free_bootmem: - free_bootmem((unsigned long)priv, -sizeof(struct pq2ads_pci_pic)); +out_free_kmalloc: + kfree(priv); of_node_put(np); out_unmap_irq: irq_dispose_mapping(irq); -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] x86/tlb: correct vmflag test for checking VM_HUGETLB
commit 611ae8e3f5204f7480b3b405993b3352cfa16662 ('enable tlb flush range support for x86') changed flush_tlb_mm_range() considerably. After this, we test whether vmflag is equal to VM_HUGETLB, and this test almost always fails, because vmflag usually has other flags set simultaneously. Our intention is to check whether this vma is for hugetlb, so correct the test according to this purpose. Cc: Alex Shi alex@intel.com Cc: Thomas Gleixner t...@linutronix.de Cc: Ingo Molnar mi...@redhat.com Cc: H. Peter Anvin h...@zytor.com Signed-off-by: Joonsoo Kim js1...@gmail.com diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 0777f04..60f926c 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -197,7 +197,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start, } if (end == TLB_FLUSH_ALL || tlb_flushall_shift == -1 - || vmflag == VM_HUGETLB) { + || vmflag & VM_HUGETLB) { local_flush_tlb(); goto flush_all; } -- 1.7.9.5
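A minimal user-space sketch (not kernel code; the flag values are illustrative, chosen only to mimic vm_flags bits) of why the equality test can never trigger once other bits are set, while the bitwise AND behaves as intended:

#include <stdio.h>

#define VM_READ		0x1UL
#define VM_WRITE	0x2UL
#define VM_HUGETLB	0x400000UL	/* illustrative values only */

int main(void)
{
	/* a hugetlb vma normally carries other flags too */
	unsigned long vmflag = VM_READ | VM_WRITE | VM_HUGETLB;

	printf("vmflag == VM_HUGETLB : %d\n", vmflag == VM_HUGETLB);		/* 0, never true */
	printf("vmflag &  VM_HUGETLB : %d\n", (vmflag & VM_HUGETLB) != 0);	/* 1 */
	return 0;
}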
Re: [PATCH v2 4/5] mm, highmem: make flush_all_zero_pkmaps() return index of first flushed entry
2012/11/3 Minchan Kim minc...@kernel.org: Hi Joonsoo, On Sat, Nov 03, 2012 at 04:07:25AM +0900, JoonSoo Kim wrote: Hello, Minchan. 2012/11/1 Minchan Kim minc...@kernel.org: On Thu, Nov 01, 2012 at 01:56:36AM +0900, Joonsoo Kim wrote: In current code, after flush_all_zero_pkmaps() is invoked, then re-iterate all pkmaps. It can be optimized if flush_all_zero_pkmaps() return index of first flushed entry. With this index, we can immediately map highmem page to virtual address represented by index. So change return type of flush_all_zero_pkmaps() and return index of first flushed entry. Additionally, update last_pkmap_nr to this index. It is certain that entry which is below this index is occupied by other mapping, therefore updating last_pkmap_nr to this index is reasonable optimization. Cc: Mel Gorman m...@csn.ul.ie Cc: Peter Zijlstra a.p.zijls...@chello.nl Cc: Minchan Kim minc...@kernel.org Signed-off-by: Joonsoo Kim js1...@gmail.com diff --git a/include/linux/highmem.h b/include/linux/highmem.h index ef788b5..97ad208 100644 --- a/include/linux/highmem.h +++ b/include/linux/highmem.h @@ -32,6 +32,7 @@ static inline void invalidate_kernel_vmap_range(void *vaddr, int size) #ifdef CONFIG_HIGHMEM #include asm/highmem.h +#define PKMAP_INVALID_INDEX (LAST_PKMAP) /* declarations for linux/mm/highmem.c */ unsigned int nr_free_highpages(void); diff --git a/mm/highmem.c b/mm/highmem.c index d98b0a9..b365f7b 100644 --- a/mm/highmem.c +++ b/mm/highmem.c @@ -106,10 +106,10 @@ struct page *kmap_to_page(void *vaddr) return virt_to_page(addr); } -static void flush_all_zero_pkmaps(void) +static unsigned int flush_all_zero_pkmaps(void) { int i; - int need_flush = 0; + unsigned int index = PKMAP_INVALID_INDEX; flush_cache_kmaps(); @@ -141,10 +141,13 @@ static void flush_all_zero_pkmaps(void) pkmap_page_table[i]); set_page_address(page, NULL); - need_flush = 1; + if (index == PKMAP_INVALID_INDEX) + index = i; } - if (need_flush) + if (index != PKMAP_INVALID_INDEX) flush_tlb_kernel_range(PKMAP_ADDR(0), PKMAP_ADDR(LAST_PKMAP)); + + return index; } /** @@ -152,14 +155,19 @@ static void flush_all_zero_pkmaps(void) */ void kmap_flush_unused(void) { + unsigned int index; + lock_kmap(); - flush_all_zero_pkmaps(); + index = flush_all_zero_pkmaps(); + if (index != PKMAP_INVALID_INDEX (index last_pkmap_nr)) + last_pkmap_nr = index; I don't know how kmap_flush_unused is really fast path so how my nitpick is effective. Anyway, What problem happens if we do following as? lock() index = flush_all_zero_pkmaps(); if (index != PKMAP_INVALID_INDEX) last_pkmap_nr = index; unlock(); Normally, last_pkmap_nr is increased with searching empty slot in map_new_virtual. So I expect return value of flush_all_zero_pkmaps in kmap_flush_unused normally become either less than last_pkmap_nr or last_pkmap_nr + 1. There is a case that return value of kmap_flush_unused() is larger than last_pkmap_nr. I see but why it's problem? kmap_flush_unused returns larger value than last_pkmap_nr means that there is no free slot at below the value. So unconditional last_pkmap_nr update is vaild. I think that this is not true. Look at the slightly different example. Assume last_pkmap = 20 and index 1-9, 12-19 is kmapped. 10, 11 is kunmapped. do kmap_flush_unused() = flush index 10,11 = last_pkmap = 10; do kunmap() with index 17 do kmap_flush_unused() = flush index 17 = last_pkmap = 17? In this case, unconditional last_pkmap_nr update skip one kunmapped index. So, conditional update is needed. Look at the following example. 
Assume last_pkmap = 20 and indexes 1-9, 11-19 are kmapped, 10 is kunmapped. do kmap_flush_unused() => flush index 10 => last_pkmap = 10; do kunmap() with index 17 do kmap_flush_unused() => flush index 17 So, a slightly dirty implementation is needed. Thanks. -- Kind Regards, Minchan Kim
Re: [PATCH 4/4] bootmem: fix wrong call parameter for free_bootmem()
Hi, Andrew. 2012/11/13 Andrew Morton a...@linux-foundation.org: On Tue, 13 Nov 2012 01:31:55 +0900 Joonsoo Kim js1...@gmail.com wrote: It is somehow strange that alloc_bootmem return virtual address and free_bootmem require physical address. Anyway, free_bootmem()'s first parameter should be physical address. There are some call sites for free_bootmem() with virtual address. So fix them. Well gee, I wonder how that happened :( How does this look? From: Andrew Morton a...@linux-foundation.org Subject: bootmem-fix-wrong-call-parameter-for-free_bootmem-fix improve free_bootmem() and free_bootmem_pate() documentation Cc: Benjamin Herrenschmidt b...@kernel.crashing.org Cc: FUJITA Tomonori fujita.tomon...@lab.ntt.co.jp Cc: Haavard Skinnemoen hskinnem...@gmail.com Cc: Hans-Christian Egtvedt egtv...@samfundet.no Cc: Johannes Weiner han...@cmpxchg.org Cc: Joonsoo Kim js1...@gmail.com Signed-off-by: Andrew Morton a...@linux-foundation.org --- include/linux/bootmem.h |4 ++-- mm/bootmem.c| 20 ++-- 2 files changed, 12 insertions(+), 12 deletions(-) --- a/mm/bootmem.c~bootmem-fix-wrong-call-parameter-for-free_bootmem-fix +++ a/mm/bootmem.c @@ -147,21 +147,21 @@ unsigned long __init init_bootmem(unsign /* * free_bootmem_late - free bootmem pages directly to page allocator - * @addr: starting address of the range + * @addr: starting physical address of the range * @size: size of the range in bytes * * This is only useful when the bootmem allocator has already been torn * down, but we are still initializing the system. Pages are given directly * to the page allocator, no bootmem metadata is updated because it is gone. */ -void __init free_bootmem_late(unsigned long addr, unsigned long size) +void __init free_bootmem_late(unsigned long physaddr, unsigned long size) { unsigned long cursor, end; - kmemleak_free_part(__va(addr), size); + kmemleak_free_part(__va(physaddr), size); - cursor = PFN_UP(addr); - end = PFN_DOWN(addr + size); + cursor = PFN_UP(physaddr); + end = PFN_DOWN(physaddr + size); for (; cursor end; cursor++) { __free_pages_bootmem(pfn_to_page(cursor), 0); @@ -385,21 +385,21 @@ void __init free_bootmem_node(pg_data_t /** * free_bootmem - mark a page range as usable - * @addr: starting address of the range + * @addr: starting physical address of the range * @size: size of the range in bytes * * Partial pages will be considered reserved and left as they are. * * The range must be contiguous but may span node boundaries. 
*/ -void __init free_bootmem(unsigned long addr, unsigned long size) +void __init free_bootmem(unsigned long physaddr, unsigned long size) { unsigned long start, end; - kmemleak_free_part(__va(addr), size); + kmemleak_free_part(__va(physaddr), size); - start = PFN_UP(addr); - end = PFN_DOWN(addr + size); + start = PFN_UP(physaddr); + end = PFN_DOWN(physaddr + size); mark_bootmem(start, end, 0, 0); } --- a/include/linux/bootmem.h~bootmem-fix-wrong-call-parameter-for-free_bootmem-fix +++ a/include/linux/bootmem.h @@ -51,8 +51,8 @@ extern unsigned long free_all_bootmem(vo extern void free_bootmem_node(pg_data_t *pgdat, unsigned long addr, unsigned long size); -extern void free_bootmem(unsigned long addr, unsigned long size); -extern void free_bootmem_late(unsigned long addr, unsigned long size); +extern void free_bootmem(unsigned long physaddr, unsigned long size); +extern void free_bootmem_late(unsigned long physaddr, unsigned long size); /* * Flags for reserve_bootmem (also if CONFIG_HAVE_ARCH_BOOTMEM_NODE, _ Looks good to me :) -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 4/5] mm, highmem: make flush_all_zero_pkmaps() return index of first flushed entry
2012/11/13 Minchan Kim minc...@kernel.org: On Tue, Nov 13, 2012 at 09:30:57AM +0900, JoonSoo Kim wrote: 2012/11/3 Minchan Kim minc...@kernel.org: Hi Joonsoo, On Sat, Nov 03, 2012 at 04:07:25AM +0900, JoonSoo Kim wrote: Hello, Minchan. 2012/11/1 Minchan Kim minc...@kernel.org: On Thu, Nov 01, 2012 at 01:56:36AM +0900, Joonsoo Kim wrote: In current code, after flush_all_zero_pkmaps() is invoked, then re-iterate all pkmaps. It can be optimized if flush_all_zero_pkmaps() return index of first flushed entry. With this index, we can immediately map highmem page to virtual address represented by index. So change return type of flush_all_zero_pkmaps() and return index of first flushed entry. Additionally, update last_pkmap_nr to this index. It is certain that entry which is below this index is occupied by other mapping, therefore updating last_pkmap_nr to this index is reasonable optimization. Cc: Mel Gorman m...@csn.ul.ie Cc: Peter Zijlstra a.p.zijls...@chello.nl Cc: Minchan Kim minc...@kernel.org Signed-off-by: Joonsoo Kim js1...@gmail.com diff --git a/include/linux/highmem.h b/include/linux/highmem.h index ef788b5..97ad208 100644 --- a/include/linux/highmem.h +++ b/include/linux/highmem.h @@ -32,6 +32,7 @@ static inline void invalidate_kernel_vmap_range(void *vaddr, int size) #ifdef CONFIG_HIGHMEM #include asm/highmem.h +#define PKMAP_INVALID_INDEX (LAST_PKMAP) /* declarations for linux/mm/highmem.c */ unsigned int nr_free_highpages(void); diff --git a/mm/highmem.c b/mm/highmem.c index d98b0a9..b365f7b 100644 --- a/mm/highmem.c +++ b/mm/highmem.c @@ -106,10 +106,10 @@ struct page *kmap_to_page(void *vaddr) return virt_to_page(addr); } -static void flush_all_zero_pkmaps(void) +static unsigned int flush_all_zero_pkmaps(void) { int i; - int need_flush = 0; + unsigned int index = PKMAP_INVALID_INDEX; flush_cache_kmaps(); @@ -141,10 +141,13 @@ static void flush_all_zero_pkmaps(void) pkmap_page_table[i]); set_page_address(page, NULL); - need_flush = 1; + if (index == PKMAP_INVALID_INDEX) + index = i; } - if (need_flush) + if (index != PKMAP_INVALID_INDEX) flush_tlb_kernel_range(PKMAP_ADDR(0), PKMAP_ADDR(LAST_PKMAP)); + + return index; } /** @@ -152,14 +155,19 @@ static void flush_all_zero_pkmaps(void) */ void kmap_flush_unused(void) { + unsigned int index; + lock_kmap(); - flush_all_zero_pkmaps(); + index = flush_all_zero_pkmaps(); + if (index != PKMAP_INVALID_INDEX (index last_pkmap_nr)) + last_pkmap_nr = index; I don't know how kmap_flush_unused is really fast path so how my nitpick is effective. Anyway, What problem happens if we do following as? lock() index = flush_all_zero_pkmaps(); if (index != PKMAP_INVALID_INDEX) last_pkmap_nr = index; unlock(); Normally, last_pkmap_nr is increased with searching empty slot in map_new_virtual. So I expect return value of flush_all_zero_pkmaps in kmap_flush_unused normally become either less than last_pkmap_nr or last_pkmap_nr + 1. There is a case that return value of kmap_flush_unused() is larger than last_pkmap_nr. I see but why it's problem? kmap_flush_unused returns larger value than last_pkmap_nr means that there is no free slot at below the value. So unconditional last_pkmap_nr update is vaild. I think that this is not true. Look at the slightly different example. Assume last_pkmap = 20 and index 1-9, 12-19 is kmapped. 10, 11 is kunmapped. do kmap_flush_unused() = flush index 10,11 = last_pkmap = 10; do kunmap() with index 17 do kmap_flush_unused() = flush index 17 = last_pkmap = 17? 
In this case, unconditional last_pkmap_nr update skips one kunmapped index. So, conditional update is needed. Thanks for pointing that out, Joonsoo. You're right. I misunderstood your flush_all_zero_pkmaps change. With your change, flush_all_zero_pkmaps returns the first *flushed* free slot index. What's the benefit of returning the first flushed free slot index rather than the first free slot index? If flush_all_zero_pkmaps() returned the first free slot index rather than the first flushed free slot, we would need another comparison, something like 'if (pkmap_count[i] == 0)', and another local variable for determining whether a flush occurred or not. I want to minimize this overhead and churning of the code, although they are negligible. I think flush_all_zero_pkmaps should return the first free slot because the customer of flush_all_zero_pkmaps doesn't care whether a slot was just flushed or not. What it wants to know is just whether the slot is free or not. In such a case, we can remove the above check and it makes flush_all_zero_pkmaps more intuitive
Re: [Q] Default SLAB allocator
Hello, Eric. 2012/10/14 Eric Dumazet eric.duma...@gmail.com: SLUB was really bad in the common workload you describe (allocations done by one cpu, freeing done by other cpus), because all kfree() hit the slow path and cpus contend in __slab_free() in the loop guarded by cmpxchg_double_slab(). SLAB has a cache for this, while SLUB directly hits the main struct page to add the freed object to the freelist. Could you elaborate more on how 'netperf RR' makes kernel allocations done by one cpu and freeing done by other cpus, please? I don't have enough background in the network subsystem, so I'm just curious. I played some months ago adding a percpu associative cache to SLUB, then just moved on to another strategy. (Idea for this per cpu cache was to build a temporary free list of objects to batch accesses to struct page) Is this implemented and submitted? If it is, could you tell me the link for the patches? Thanks!
[PATCH 2/2] timer: use new setup_timer_deferrable() macro
Now, we have a handy macro for initializing deferrable timer. Using it makes code clean and easy to understand. Additionally, in some driver codes, use setup_timer() instead of init_timer(). This patch doesn't make any functional difference. Signed-off-by: Joonsoo Kim js1...@gmail.com Cc: Len Brown l...@kernel.org Cc: Jeff Garzik jgar...@pobox.com Cc: Jitendra Kalsaria jitendra.kalsa...@qlogic.com Cc: Ron Mercer ron.mer...@qlogic.com Cc: Tejun Heo t...@kernel.org Cc: Johannes Berg johan...@sipsolutions.net Cc: John W. Linville linvi...@tuxdriver.com Cc: David S. Miller da...@davemloft.net Cc: Jamal Hadi Salim j...@mojatatu.com diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c index 1599566..1707f9a 100644 --- a/drivers/acpi/apei/ghes.c +++ b/drivers/acpi/apei/ghes.c @@ -944,9 +944,8 @@ static int __devinit ghes_probe(struct platform_device *ghes_dev) } switch (generic-notify.type) { case ACPI_HEST_NOTIFY_POLLED: - ghes-timer.function = ghes_poll_func; - ghes-timer.data = (unsigned long)ghes; - init_timer_deferrable(ghes-timer); + setup_timer_deferrable(ghes-timer, ghes_poll_func, + (unsigned long)ghes); ghes_add_timer(ghes); break; case ACPI_HEST_NOTIFY_EXTERNAL: diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c index 3cc7096..a811e6f 100644 --- a/drivers/ata/libata-core.c +++ b/drivers/ata/libata-core.c @@ -5616,9 +5616,8 @@ struct ata_port *ata_port_alloc(struct ata_host *host) INIT_LIST_HEAD(ap-eh_done_q); init_waitqueue_head(ap-eh_wait_q); init_completion(ap-park_req_pending); - init_timer_deferrable(ap-fastdrain_timer); - ap-fastdrain_timer.function = ata_eh_fastdrain_timerfn; - ap-fastdrain_timer.data = (unsigned long)ap; + setup_timer_deferrable(ap-fastdrain_timer, ata_eh_fastdrain_timerfn, + (unsigned long)ap); ap-cbl = ATA_CBL_NONE; diff --git a/drivers/net/ethernet/nvidia/forcedeth.c b/drivers/net/ethernet/nvidia/forcedeth.c index 876bece..987e2cf 100644 --- a/drivers/net/ethernet/nvidia/forcedeth.c +++ b/drivers/net/ethernet/nvidia/forcedeth.c @@ -5548,15 +5548,10 @@ static int __devinit nv_probe(struct pci_dev *pci_dev, const struct pci_device_i spin_lock_init(np-hwstats_lock); SET_NETDEV_DEV(dev, pci_dev-dev); - init_timer(np-oom_kick); - np-oom_kick.data = (unsigned long) dev; - np-oom_kick.function = nv_do_rx_refill;/* timer handler */ - init_timer(np-nic_poll); - np-nic_poll.data = (unsigned long) dev; - np-nic_poll.function = nv_do_nic_poll; /* timer handler */ - init_timer_deferrable(np-stats_poll); - np-stats_poll.data = (unsigned long) dev; - np-stats_poll.function = nv_do_stats_poll; /* timer handler */ + setup_timer(np-oom_kick, nv_do_rx_refill, (unsigned long)dev); + setup_timer(np-nic_poll, nv_do_nic_poll, (unsigned long)dev); + setup_timer_deferrable(np-stats_poll, nv_do_stats_poll, + (unsigned long)dev); err = pci_enable_device(pci_dev); if (err) diff --git a/drivers/net/ethernet/qlogic/qlge/qlge_main.c b/drivers/net/ethernet/qlogic/qlge/qlge_main.c index b262d61..1687883 100644 --- a/drivers/net/ethernet/qlogic/qlge/qlge_main.c +++ b/drivers/net/ethernet/qlogic/qlge/qlge_main.c @@ -4707,9 +4707,7 @@ static int __devinit qlge_probe(struct pci_dev *pdev, /* Start up the timer to trigger EEH if * the bus goes dead */ - init_timer_deferrable(qdev-timer); - qdev-timer.data = (unsigned long)qdev; - qdev-timer.function = ql_timer; + setup_timer_deferrable(qdev-timer, ql_timer, (unsigned long)qdev); qdev-timer.expires = jiffies + (5*HZ); add_timer(qdev-timer); ql_link_off(qdev); diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c index 
607976c..743c7d7 100644 --- a/drivers/net/vxlan.c +++ b/drivers/net/vxlan.c @@ -995,9 +995,8 @@ static void vxlan_setup(struct net_device *dev) spin_lock_init(vxlan-hash_lock); - init_timer_deferrable(vxlan-age_timer); - vxlan-age_timer.function = vxlan_cleanup; - vxlan-age_timer.data = (unsigned long) vxlan; + setup_timer_deferrable(vxlan-age_timer, vxlan_cleanup, + (unsigned long)vxlan); inet_get_local_port_range(low, high); vxlan-port_min = low; diff --git a/kernel/workqueue.c b/kernel/workqueue.c index d951daa..77d6f62 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -3845,10 +3845,8 @@ static int __init init_workqueues(void) INIT_LIST_HEAD(pool-worklist); INIT_LIST_HEAD(pool-idle_list); - init_timer_deferrable
[PATCH 1/2] timer: add setup_timer_deferrable() macro
There are some users of deferrable timers. Because there is no handy initializer, they have to initialize deferrable timers by hand. We can do better with a new setup_timer_deferrable() macro, so add it. The following patch makes some users of init_timer_deferrable() use this handy macro. Signed-off-by: Joonsoo Kim js1...@gmail.com diff --git a/include/linux/timer.h b/include/linux/timer.h index 8c5a197..5950276 100644 --- a/include/linux/timer.h +++ b/include/linux/timer.h @@ -151,6 +151,8 @@ static inline void init_timer_on_stack_key(struct timer_list *timer, #define setup_timer(timer, fn, data) \ __setup_timer((timer), (fn), (data), 0) +#define setup_timer_deferrable(timer, fn, data) \ + __setup_timer((timer), (fn), (data), TIMER_DEFERRABLE) #define setup_timer_on_stack(timer, fn, data) \ __setup_timer_on_stack((timer), (fn), (data), 0) #define setup_deferrable_timer_on_stack(timer, fn, data) \ -- 1.7.9.5
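A brief before/after usage sketch (hypothetical names: priv, dev, my_timer_fn; not taken from the series) of the initialization pattern the macro replaces, mirroring the conversions done in the next patch:

/* before: open-coded deferrable timer setup */
init_timer_deferrable(&priv->poll_timer);
priv->poll_timer.function = my_timer_fn;
priv->poll_timer.data = (unsigned long)dev;

/* after: the new one-line helper */
setup_timer_deferrable(&priv->poll_timer, my_timer_fn, (unsigned long)dev);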
[PATCH 0/2] clean-up initialization of deferrable timer
This patchset introduces setup_timer_deferrable() macro. Using it makes code simple and understandable. This patchset doesn't make any functional difference. It is just for clean-up. It is based on v3.7-rc1 Joonsoo Kim (2): timer: add setup_timer_deferrable() macro timer: use new setup_timer_deferrable() macro drivers/acpi/apei/ghes.c |5 ++--- drivers/ata/libata-core.c|5 ++--- drivers/net/ethernet/nvidia/forcedeth.c | 13 - drivers/net/ethernet/qlogic/qlge/qlge_main.c |4 +--- drivers/net/vxlan.c |5 ++--- include/linux/timer.h|2 ++ kernel/workqueue.c |6 ++ net/mac80211/agg-rx.c| 12 ++-- net/mac80211/agg-tx.c| 12 ++-- net/sched/cls_flow.c |5 ++--- net/sched/sch_sfq.c |5 ++--- 11 files changed, 31 insertions(+), 43 deletions(-) -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Q] Default SLAB allocator
Hello, Eric. Thank you very much for the kind comment about my question. I have one more question related to the network subsystem. Please let me know what I am misunderstanding. 2012/10/14 Eric Dumazet eric.duma...@gmail.com: In latest kernels, skb->head no longer uses kmalloc()/kfree(), so SLAB vs SLUB is less a concern for network loads. In 3.7, (commit 69b08f62e17) we use fragments of order-3 pages to populate skb->head. You mentioned that in the latest kernel skb->head no longer uses kmalloc()/kfree(). But then why is the result of David's netperf RR test on v3.6 differentiated by the choice of allocator? As far as I know, __netdev_alloc_frag may have been introduced in v3.5, so I'm just confused. Does this test use __netdev_alloc_skb with __GFP_WAIT | GFP_DMA? Does a normal network workload use __netdev_alloc_skb with __GFP_WAIT | GFP_DMA? Thanks!
[PATCH 1/3] mm: correct return value of migrate_pages()
migrate_pages() should return the number of pages not migrated or an error code. When unmap_and_move() returns -EAGAIN, the outer loop is re-executed without initialising nr_failed. This makes nr_failed over-counted. So this patch corrects it by initialising nr_failed in the outer loop. Signed-off-by: Joonsoo Kim js1...@gmail.com Cc: Christoph Lameter c...@linux.com diff --git a/mm/migrate.c b/mm/migrate.c index be26d5c..294d52a 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -982,6 +982,7 @@ int migrate_pages(struct list_head *from, for(pass = 0; pass < 10 && retry; pass++) { retry = 0; + nr_failed = 0; list_for_each_entry_safe(page, page2, from, lru) { cond_resched(); -- 1.7.9.5
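A tiny user-space sketch (not kernel code; the two "entries" are simulated) making the over-count concrete: one entry fails permanently on every pass and another needs a retry, so without a per-pass reset the permanent failure is reported twice:

#include <stdio.h>

int main(void)
{
	int pass, retry = 1, nr_failed = 0;

	for (pass = 0; pass < 10 && retry; pass++) {
		retry = 0;
		/* nr_failed = 0;  <- the fix: reset the counter each pass */

		nr_failed++;		/* entry A fails permanently, every pass */
		if (pass < 1)
			retry = 1;	/* entry B returns -EAGAIN on the first pass only */
	}
	printf("reported failures: %d (expected 1)\n", nr_failed);
	return 0;
}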
[PATCH 2/3] mm: fix possible incorrect return value of migrate_pages() syscall
do_migrate_pages() can return the number of pages not migrated. Because the migrate_pages() syscall returns this value directly, the migrate_pages() syscall may return the number of pages not migrated. In the failure case, the migrate_pages() syscall should return an error value, so change err to -EIO. Additionally, correct the comment above do_migrate_pages(). Signed-off-by: Joonsoo Kim js1...@gmail.com Cc: Sasha Levin levinsasha...@gmail.com Cc: Christoph Lameter c...@linux.com diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 1d771e4..f7df271 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -948,7 +948,7 @@ static int migrate_to_node(struct mm_struct *mm, int source, int dest, * Move pages between the two nodesets so as to preserve the physical * layout as much as possible. * - * Returns the number of page that could not be moved. + * Returns error or the number of pages not migrated. */ int do_migrate_pages(struct mm_struct *mm, const nodemask_t *from, const nodemask_t *to, int flags) @@ -1382,6 +1382,8 @@ SYSCALL_DEFINE4(migrate_pages, pid_t, pid, unsigned long, maxnode, err = do_migrate_pages(mm, old, new, capable(CAP_SYS_NICE) ? MPOL_MF_MOVE_ALL : MPOL_MF_MOVE); + if (err > 0) + err = -EIO; mmput(mm); out: -- 1.7.9.5
[PATCH 3/3] mm: fix return value in __alloc_contig_migrate_range()
migrate_pages() can return a positive value in some failure cases, so 'ret > 0 ? 0 : ret' may be wrong. This fixes it and removes one dead statement. Signed-off-by: Joonsoo Kim js1...@gmail.com Cc: Michal Nazarewicz min...@mina86.com Cc: Marek Szyprowski m.szyprow...@samsung.com Cc: Minchan Kim minc...@kernel.org Cc: Christoph Lameter c...@linux.com diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 4403009..02d4519 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -5673,7 +5673,6 @@ static int __alloc_contig_migrate_range(unsigned long start, unsigned long end) } tries = 0; } else if (++tries == 5) { - ret = ret < 0 ? ret : -EBUSY; break; } @@ -5683,7 +5682,7 @@ static int __alloc_contig_migrate_range(unsigned long start, unsigned long end) } putback_lru_pages(&cc.migratepages); - return ret > 0 ? 0 : ret; + return ret <= 0 ? ret : -EBUSY; } /* -- 1.7.9.5
[PATCH 4] mm: fix possible incorrect return value of move_pages() syscall
The move_pages() syscall may return success in the case where do_move_page_to_node_array() returns a positive value, which means migration failed. This patch changes the return value of do_move_page_to_node_array() so that it does not return a positive value. That fixes the problem. Signed-off-by: Joonsoo Kim js1...@gmail.com Cc: Brice Goglin br...@myri.com Cc: Christoph Lameter c...@linux.com Cc: Minchan Kim minc...@kernel.org diff --git a/mm/migrate.c b/mm/migrate.c index 294d52a..adabaf4 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1171,7 +1171,7 @@ set_status: } up_read(&mm->mmap_sem); - return err; + return err > 0 ? -EIO : err; } /* -- 1.7.9.5
Re: [PATCH 1/3] mm: correct return value of migrate_pages()
2012/7/17 Christoph Lameter c...@linux.com: On Tue, 17 Jul 2012, Joonsoo Kim wrote: migrate_pages() should return the number of pages not migrated or an error code. When unmap_and_move() returns -EAGAIN, the outer loop is re-executed without initialising nr_failed. This makes nr_failed over-counted. The intention of the nr_failed was only to give an indication as to how many attempts were made. The failed pages were on a separate queue that seems to have vanished. So this patch corrects it by initialising nr_failed in the outer loop. Well yea it makes sense since retry is initialized there as well. Acked-by: Christoph Lameter c...@linux.com Thanks for the comment. Additionally, I found that migrate_huge_pages() needs the identical fix as migrate_pages(). @@ -1029,6 +1030,7 @@ int migrate_huge_pages(struct list_head *from, for (pass = 0; pass < 10 && retry; pass++) { retry = 0; + nr_failed = 0; list_for_each_entry_safe(page, page2, from, lru) { cond_resched(); When I resend with this, could I include Acked-by: Christoph Lameter c...@linux.com?
Re: [PATCH 1/3] mm: correct return value of migrate_pages()
2012/7/17 Michal Nazarewicz min...@tlen.pl: Acked-by: Michal Nazarewicz min...@mina86.com Thanks. Actually, it makes me wonder if there is any code that uses this information. If not, it would be best in my opinion to make it return zero or a negative error code, but that would have to be checked. I think that, too. I looked at every call site of migrate_pages() and there is no place which really needs the fail count. This return value sometimes makes callers error-prone, so I think changing it is preferable. What do you think, Christoph?
Re: [PATCH 2/3] mm: fix possible incorrect return value of migrate_pages() syscall
2012/7/17 Michal Nazarewicz min...@tlen.pl: Joonsoo Kim js1...@gmail.com writes: do_migrate_pages() can return the number of pages not migrated. Because migrate_pages() syscall return this value directly, migrate_pages() syscall may return the number of pages not migrated. In fail case in migrate_pages() syscall, we should return error value. So change err to -EIO Additionally, Correct comment above do_migrate_pages() Signed-off-by: Joonsoo Kim js1...@gmail.com Cc: Sasha Levin levinsasha...@gmail.com Cc: Christoph Lameter c...@linux.com Acked-by: Michal Nazarewicz min...@mina86.com Thanks. When I resend with changing -EIO to -EBUSY, could I include Acked-by: Michal Nazarewicz min...@mina86.com? -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] mm: fix return value in __alloc_contig_migrate_range()
2012/7/17 Michal Nazarewicz min...@mina86.com: Joonsoo Kim js1...@gmail.com writes: migrate_pages() can return a positive value in some failure cases, so 'ret > 0 ? 0 : ret' may be wrong. This fixes it and removes one dead statement. Signed-off-by: Joonsoo Kim js1...@gmail.com Cc: Michal Nazarewicz min...@mina86.com Cc: Marek Szyprowski m.szyprow...@samsung.com Cc: Minchan Kim minc...@kernel.org Cc: Christoph Lameter c...@linux.com Have you actually encountered this problem? If migrate_pages() fails with a positive value, the code that you are removing kicks in and -EBUSY is assigned to ret (now that I look at it, I think that in the current code the 'return ret > 0 ? 0 : ret;' statement could be reduced to 'return ret;'). Your code seems to be cleaner, but the commit message does not look accurate to me. I haven't encountered this problem yet. If migrate_pages() with offlining false meets a KSM page, then migration fails. In this case, the failed page is removed from the cc.migratepages list and the failed count is returned. So it is possible to exit the loop without ++tries == 5 being tested, and ret ends up over zero. Is there any point which I'm missing? Is there any possible scenario where migrate_pages() returns 0 and cc.migratepages is empty? I'm not an MM expert, so please comment on my humble opinion.
[PATCH 1/4 v2] mm: correct return value of migrate_pages() and migrate_huge_pages()
migrate_pages() should return number of pages not migrated or error code. When unmap_and_move return -EAGAIN, outer loop is re-execution without initialising nr_failed. This makes nr_failed over-counted. So this patch correct it by initialising nr_failed in outer loop. migrate_huge_pages() is identical case as migrate_pages() Signed-off-by: Joonsoo Kim js1...@gmail.com Cc: Christoph Lameter c...@linux.com Acked-by: Christoph Lameter c...@linux.com Acked-by: Michal Nazarewicz min...@mina86.com diff --git a/mm/migrate.c b/mm/migrate.c index be26d5c..f495c58 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -982,6 +982,7 @@ int migrate_pages(struct list_head *from, for(pass = 0; pass 10 retry; pass++) { retry = 0; + nr_failed = 0; list_for_each_entry_safe(page, page2, from, lru) { cond_resched(); @@ -1029,6 +1030,7 @@ int migrate_huge_pages(struct list_head *from, for (pass = 0; pass 10 retry; pass++) { retry = 0; + nr_failed = 0; list_for_each_entry_safe(page, page2, from, lru) { cond_resched(); -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 2/4 v2] mm: fix possible incorrect return value of migrate_pages() syscall
do_migrate_pages() can return the number of pages not migrated. Because migrate_pages() syscall return this value directly, migrate_pages() syscall may return the number of pages not migrated. In fail case in migrate_pages() syscall, we should return error value. So change err to -EBUSY Additionally, Correct comment above do_migrate_pages() Signed-off-by: Joonsoo Kim js1...@gmail.com Cc: Sasha Levin levinsasha...@gmail.com Cc: Christoph Lameter c...@linux.com diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 1d771e4..0732729 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -948,7 +948,7 @@ static int migrate_to_node(struct mm_struct *mm, int source, int dest, * Move pages between the two nodesets so as to preserve the physical * layout as much as possible. * - * Returns the number of page that could not be moved. + * Returns error or the number of pages not migrated. */ int do_migrate_pages(struct mm_struct *mm, const nodemask_t *from, const nodemask_t *to, int flags) @@ -1382,6 +1382,8 @@ SYSCALL_DEFINE4(migrate_pages, pid_t, pid, unsigned long, maxnode, err = do_migrate_pages(mm, old, new, capable(CAP_SYS_NICE) ? MPOL_MF_MOVE_ALL : MPOL_MF_MOVE); + if (err 0) + err = -EBUSY; mmput(mm); out: -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 4/4 v2] mm: fix possible incorrect return value of move_pages() syscall
move_pages() syscall may return success in case that do_move_page_to_node_array return positive value which means migration failed. This patch changes return value of do_move_page_to_node_array for not returning positive value. It can fix the problem. Signed-off-by: Joonsoo Kim js1...@gmail.com Cc: Brice Goglin br...@myri.com Cc: Christoph Lameter c...@linux.com Cc: Minchan Kim minc...@kernel.org diff --git a/mm/migrate.c b/mm/migrate.c index f495c58..eeaf409 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1172,7 +1172,7 @@ set_status: } up_read(mm-mmap_sem); - return err; + return err 0 ? -EBUSY : err; } /* -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 3/4 v2] mm: fix return value in __alloc_contig_migrate_range()
migrate_pages() would return positive value in some failure case, so 'ret 0 ? 0 : ret' may be wrong. This fix it and remove one dead statement. Signed-off-by: Joonsoo Kim js1...@gmail.com Cc: Michal Nazarewicz min...@mina86.com Cc: Marek Szyprowski m.szyprow...@samsung.com Cc: Minchan Kim minc...@kernel.org Cc: Christoph Lameter c...@linux.com Acked-by: Christoph Lameter c...@linux.com diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 4403009..02d4519 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -5673,7 +5673,6 @@ static int __alloc_contig_migrate_range(unsigned long start, unsigned long end) } tries = 0; } else if (++tries == 5) { - ret = ret 0 ? ret : -EBUSY; break; } @@ -5683,7 +5682,7 @@ static int __alloc_contig_migrate_range(unsigned long start, unsigned long end) } putback_lru_pages(cc.migratepages); - return ret 0 ? 0 : ret; + return ret = 0 ? ret : -EBUSY; } /* -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/4 v2] mm: fix possible incorrect return value of migrate_pages() syscall
2012/7/17 Christoph Lameter c...@linux.com:
> On Tue, 17 Jul 2012, Joonsoo Kim wrote:
>
>> @@ -1382,6 +1382,8 @@ SYSCALL_DEFINE4(migrate_pages, pid_t, pid, unsigned long, maxnode,
>>  	err = do_migrate_pages(mm, old, new,
>>  		capable(CAP_SYS_NICE) ? MPOL_MF_MOVE_ALL : MPOL_MF_MOVE);
>> +	if (err > 0)
>> +		err = -EBUSY;
>>
>>  	mmput(mm);
>>  out:
>
> Why not have do_migrate_pages() return EBUSY if we do not need the
> number of failed/retried pages?

There is no serious reason. do_migrate_pages() has two call sites, although the other one does not use the return value, and do_migrate_pages() is commented "Returns the number of page ...". My focus is on fixing the possible error in the migrate_pages() syscall, so I kept returning the number of failed/retried pages. If we really think the number of failed/retried pages is useless, then instead of having do_migrate_pages() return -EBUSY, we can make migrate_pages() return -EBUSY at that point. I think it is better to fix all the related code in one go.
Re: [PATCH 3/4 v2] mm: fix return value in __alloc_contig_migrate_range()
2012/7/17 Michal Nazarewicz min...@mina86.com:
> On Tue, 17 Jul 2012 14:33:34 +0200, Joonsoo Kim js1...@gmail.com wrote:
>> migrate_pages() would return positive value in some failure case,
>> so 'ret > 0 ? 0 : ret' may be wrong. This fix it and remove one
>> dead statement.
>
> How about the following message:
>
> --- >8 ---
> migrate_pages() can return positive value while at the same time emptying
> the list of pages it was called with. Such situation means that it went
> through all the pages on the list some of which failed to be migrated.
>
> If that happens, __alloc_contig_migrate_range()'s loop may finish without
> "++tries == 5" ever being checked. This in turn means that at the end of
> the function, ret may have a positive value, which should be treated as
> an error.
>
> This patch changes __alloc_contig_migrate_range() so that the return
> statement converts positive ret value into -EBUSY error.
> --- >8 ---

It's good. I will resend the patch replacing my comment with yours.
Thanks for the help.
[PATCH 3/4 v3] mm: fix return value in __alloc_contig_migrate_range()
migrate_pages() can return a positive value while at the same time emptying the list of pages it was called with. Such a situation means that it went through all the pages on the list, some of which failed to be migrated.

If that happens, __alloc_contig_migrate_range()'s loop may finish without "++tries == 5" ever being checked. This in turn means that at the end of the function, ret may have a positive value, which should be treated as an error.

This patch changes __alloc_contig_migrate_range() so that the return statement converts a positive ret value into an -EBUSY error.

Signed-off-by: Joonsoo Kim js1...@gmail.com
Cc: Michal Nazarewicz min...@mina86.com
Cc: Marek Szyprowski m.szyprow...@samsung.com
Cc: Minchan Kim minc...@kernel.org
Cc: Christoph Lameter c...@linux.com
Acked-by: Christoph Lameter c...@linux.com
Acked-by: Michal Nazarewicz min...@mina86.com

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4403009..02d4519 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5673,7 +5673,6 @@ static int __alloc_contig_migrate_range(unsigned long start, unsigned long end)
 			}
 			tries = 0;
 		} else if (++tries == 5) {
-			ret = ret < 0 ? ret : -EBUSY;
 			break;
 		}
 
@@ -5683,7 +5682,7 @@ static int __alloc_contig_migrate_range(unsigned long start, unsigned long end)
 	}
 
 	putback_lru_pages(&cc.migratepages);
-	return ret > 0 ? 0 : ret;
+	return ret <= 0 ? ret : -EBUSY;
 }
 
 /*
-- 
1.7.9.5
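The failure mode described in the changelog can be seen in a stripped-down model of the loop: if the stand-in for migrate_pages() empties the list while still reporting a positive number of failures, the branch that used to convert ret on "++tries == 5" is never reached, and a positive value survives to the end of the function. This is a hedged userspace sketch with invented helper names, not the kernel code:

  #include <stdio.h>
  #include <errno.h>

  /* Model of migrate_pages(): migrates everything on the "list" but
   * reports that 'fail' pages failed, i.e. returns a positive value
   * while emptying the list. */
  static int model_migrate(int *list_len, int fail)
  {
  	*list_len = 0;
  	return fail;
  }

  static int model_alloc_contig_migrate_range(void)
  {
  	int list_len = 5, tries = 0, ret = 0;

  	while (list_len > 0) {
  		ret = model_migrate(&list_len, 2);
  		if (ret == 0)
  			tries = 0;
  		else if (++tries == 5)
  			break;	/* never reached: the list is already empty */
  	}
  	/* Old code: "return ret > 0 ? 0 : ret;" turned this into success.
  	 * Fixed code: treat any leftover positive count as -EBUSY. */
  	return ret <= 0 ? ret : -EBUSY;
  }

  int main(void)
  {
  	printf("%d\n", model_alloc_contig_migrate_range());	/* -EBUSY */
  	return 0;
  }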
[PATCH] mm: fix wrong argument of migrate_huge_pages() in soft_offline_huge_page()
Commit a6bc32b899223a877f595ef9ddc1e89ead5072b8 ('mm: compaction: introduce sync-light migration for use by compaction') changed the declarations of migrate_pages() and migrate_huge_pages(), but it missed updating the arguments of the migrate_huge_pages() call in soft_offline_huge_page(). In this case we should call it with MIGRATE_SYNC, so change it.

Additionally, there is a mismatch between the argument types and the function declaration for the migrate_pages() call, so fix this simple case too.

Signed-off-by: Joonsoo Kim js1...@gmail.com
Cc: Christoph Lameter c...@linux.com
Cc: Mel Gorman mgor...@suse.de

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index ab1e714..afde561 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1431,8 +1431,8 @@ static int soft_offline_huge_page(struct page *page, int flags)
 	/* Keep page count to indicate a given hugepage is isolated. */
 	list_add(&hpage->lru, &pagelist);
-	ret = migrate_huge_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL, 0,
-				true);
+	ret = migrate_huge_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL, false,
+				MIGRATE_SYNC);
 	if (ret) {
 		struct page *page1, *page2;
 		list_for_each_entry_safe(page1, page2, &pagelist, lru)
@@ -1561,7 +1561,7 @@ int soft_offline_page(struct page *page, int flags)
 					page_is_file_cache(page));
 		list_add(&page->lru, &pagelist);
 		ret = migrate_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL,
-					0, MIGRATE_SYNC);
+					false, MIGRATE_SYNC);
 		if (ret) {
 			putback_lru_pages(&pagelist);
 			pr_info("soft offline: %#lx: migration failed %d, type %lx\n",
-- 
1.7.9.5
Re: [PATCH v2 0/6] correct load_balance()
On Tue, Mar 26, 2013 at 03:01:34PM +0900, Joonsoo Kim wrote: Commit 88b8dac0 makes load_balance() consider other cpus in its group. But, there are some missing parts for this feature to work properly. This patchset correct these things and make load_balance() robust. Others are related to LBF_ALL_PINNED. This is fallback functionality when all tasks can't be moved as cpu affinity. But, currently, if imbalance is not large enough to task's load, we leave LBF_ALL_PINNED flag and 'redo' is triggered. This is not our intention, so correct it. These are based on v3.9-rc4. Changelog v1-v2: Changes from Peter's suggestion [4/6]: don't include a code to evaluate load value in can_migrate_task() [5/6]: rename load_balance_tmpmask to load_balance_mask [6/6]: not use one more cpumasks, use env's cpus for prevent to re-select Joonsoo Kim (6): sched: change position of resched_cpu() in load_balance() sched: explicitly cpu_idle_type checking in rebalance_domains() sched: don't consider other cpus in our group in case of NEWLY_IDLE sched: move up affinity check to mitigate useless redoing overhead sched: rename load_balance_tmpmask to load_balance_mask sched: prevent to re-select dst-cpu in load_balance() kernel/sched/core.c |4 +-- kernel/sched/fair.c | 67 +++ 2 files changed, 38 insertions(+), 33 deletions(-) Hello, Ingo and Peter. Just ping for this patchset. Please let me know what I have to do for merging this patchset. Thanks. -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 1/4] kprobes: delay blacklist symbol lookup until we actually need it
Hello, Oskar. On Thu, Apr 04, 2013 at 02:51:26PM +0200, Oskar Andero wrote: From: Toby Collett toby.coll...@sonymobile.com The symbol lookup can take a long time and kprobes is initialised very early in boot, so delay symbol lookup until the blacklist is first used. Cc: Masami Hiramatsu masami.hiramatsu...@hitachi.com Cc: David S. Miller da...@davemloft.net Reviewed-by: Radovan Lekanovic radovan.lekano...@sonymobile.com Signed-off-by: Toby Collett toby.coll...@sonymobile.com Signed-off-by: Oskar Andero oskar.and...@sonymobile.com --- kernel/kprobes.c | 98 ++-- 1 file changed, 60 insertions(+), 38 deletions(-) diff --git a/kernel/kprobes.c b/kernel/kprobes.c index e35be53..0a270e5 100644 --- a/kernel/kprobes.c +++ b/kernel/kprobes.c @@ -68,6 +68,7 @@ #endif static int kprobes_initialized; +static int kprobe_blacklist_initialized; static struct hlist_head kprobe_table[KPROBE_TABLE_SIZE]; static struct hlist_head kretprobe_inst_table[KPROBE_TABLE_SIZE]; @@ -102,6 +103,60 @@ static struct kprobe_blackpoint kprobe_blacklist[] = { {NULL}/* Terminator */ }; +/* it can take some time ( 100ms ) to initialise the + * blacklist so we delay this until we actually need it + */ +static void init_kprobe_blacklist(void) +{ + int i; + unsigned long offset = 0, size = 0; + char *modname, namebuf[128]; + const char *symbol_name; + void *addr; + struct kprobe_blackpoint *kb; + + mutex_lock(kprobe_mutex); + if (kprobe_blacklist_initialized) + goto out; + + /* + * Lookup and populate the kprobe_blacklist. + * + * Unlike the kretprobe blacklist, we'll need to determine + * the range of addresses that belong to the said functions, + * since a kprobe need not necessarily be at the beginning + * of a function. + */ + for (kb = kprobe_blacklist; kb-name != NULL; kb++) { + kprobe_lookup_name(kb-name, addr); + if (!addr) + continue; + + kb-start_addr = (unsigned long)addr; + symbol_name = kallsyms_lookup(kb-start_addr, + size, offset, modname, namebuf); + if (!symbol_name) + kb-range = 0; + else + kb-range = size; + } + + if (kretprobe_blacklist_size) { + /* lookup the function address from its name */ + for (i = 0; kretprobe_blacklist[i].name != NULL; i++) { + kprobe_lookup_name(kretprobe_blacklist[i].name, +kretprobe_blacklist[i].addr); + if (!kretprobe_blacklist[i].addr) + printk(kretprobe: lookup failed: %s\n, +kretprobe_blacklist[i].name); + } + } + kprobe_blacklist_initialized = 1; You need smp_wmb() before assigning 'kprobe_blacklist_initialized = 1'. This guarantee that who see kprobe_blacklist_initialized = 1 will get updated data of kprobe_blacklist. Please refer my previous patch once more :) And How about define kprobe_blacklist_initialized as boolean? Thanks. 
+ +out: + mutex_unlock(kprobe_mutex); +} + #ifdef __ARCH_WANT_KPROBES_INSN_SLOT /* * kprobe-ainsn.insn points to the copy of the instruction to be @@ -1331,6 +1386,9 @@ static int __kprobes in_kprobes_functions(unsigned long addr) if (addr = (unsigned long)__kprobes_text_start addr (unsigned long)__kprobes_text_end) return -EINVAL; + + if (unlikely(!kprobe_blacklist_initialized)) + init_kprobe_blacklist(); /* * If there exists a kprobe_blacklist, verify and * fail any probe registration in the prohibited area @@ -1816,6 +1874,8 @@ int __kprobes register_kretprobe(struct kretprobe *rp) void *addr; if (kretprobe_blacklist_size) { + if (unlikely(!kprobe_blacklist_initialized)) + init_kprobe_blacklist(); addr = kprobe_addr(rp-kp); if (IS_ERR(addr)) return PTR_ERR(addr); @@ -2065,11 +2125,6 @@ static struct notifier_block kprobe_module_nb = { static int __init init_kprobes(void) { int i, err = 0; - unsigned long offset = 0, size = 0; - char *modname, namebuf[128]; - const char *symbol_name; - void *addr; - struct kprobe_blackpoint *kb; /* FIXME allocate the probe table, currently defined statically */ /* initialize all list heads */ @@ -2079,39 +2134,6 @@ static int __init init_kprobes(void) raw_spin_lock_init((kretprobe_table_locks[i].lock)); } - /* - * Lookup and populate the kprobe_blacklist. - * - * Unlike the kretprobe blacklist, we'll need to determine - * the
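The point behind the smp_wmb() request above is that the writer must publish the fully initialized blacklist before it sets the "initialized" flag, and a reader needs a matching read barrier (or an acquire load) after seeing the flag. Here is a small userspace model of that pairing using C11 release/acquire atomics in place of the kernel's smp_wmb()/smp_rmb(); all names are invented for illustration, and the mutex serialization from the patch is omitted to keep the focus on the ordering:

  #include <stdatomic.h>
  #include <stdbool.h>
  #include <stdio.h>

  struct blackpoint {
  	unsigned long start_addr;
  	unsigned long range;
  };

  static struct blackpoint blacklist[4];
  static atomic_bool blacklist_initialized;

  static void init_blacklist(void)
  {
  	for (int i = 0; i < 4; i++) {
  		blacklist[i].start_addr = 0x1000 * (i + 1);
  		blacklist[i].range = 64;
  	}
  	/* Release store plays the role of smp_wmb() + plain store: all
  	 * writes above become visible before the flag does. */
  	atomic_store_explicit(&blacklist_initialized, true,
  			      memory_order_release);
  }

  static int in_blacklist(unsigned long addr)
  {
  	/* Acquire load plays the role of load + smp_rmb(): a reader that
  	 * observes the flag also observes the populated array. */
  	if (!atomic_load_explicit(&blacklist_initialized,
  				  memory_order_acquire))
  		init_blacklist();

  	for (int i = 0; i < 4; i++)
  		if (addr >= blacklist[i].start_addr &&
  		    addr < blacklist[i].start_addr + blacklist[i].range)
  			return 1;
  	return 0;
  }

  int main(void)
  {
  	printf("%d %d\n", in_blacklist(0x1010), in_blacklist(0x5000));
  	return 0;
  }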
Re: [RT LATENCY] 249 microsecond latency caused by slub's unfreeze_partials() code.
On Thu, Apr 04, 2013 at 01:53:25PM +, Christoph Lameter wrote: On Thu, 4 Apr 2013, Joonsoo Kim wrote: Pekka alreay applied it. Do we need update? Well I thought the passing of the count via lru.next would be something worthwhile to pick up. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ Hello, Pekka. Here goes a patch implementing Christoph's idea. Instead of updating my previous patch, I re-write this patch on top of your slab/next tree. Thanks. 8--- From e1c18793dd2a9d9cef87b07faf975364b71276d7 Mon Sep 17 00:00:00 2001 From: Joonsoo Kim iamjoonsoo@lge.com Date: Fri, 5 Apr 2013 10:49:36 +0900 Subject: [PATCH] slub: use page-lru.next to calculate nr of acquired object We can pass inuse count via page-lru.next in order to calculate number of acquired objects and it is more beautiful way. This reduces one function argument and makes clean code. Cc: Christoph Lameter c...@linux.com Suggested-by: Christoph Lameter c...@linux.com Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com diff --git a/mm/slub.c b/mm/slub.c index 21b3f00..8a35464 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -1493,11 +1493,12 @@ static inline void remove_partial(struct kmem_cache_node *n, */ static inline void *acquire_slab(struct kmem_cache *s, struct kmem_cache_node *n, struct page *page, - int mode, int *objects) + int mode) { void *freelist; unsigned long counters; struct page new; + unsigned long inuse; /* * Zap the freelist and set the frozen bit. @@ -1507,7 +1508,7 @@ static inline void *acquire_slab(struct kmem_cache *s, freelist = page-freelist; counters = page-counters; new.counters = counters; - *objects = new.objects - new.inuse; + inuse = page-inuse; if (mode) { new.inuse = page-objects; new.freelist = NULL; @@ -1525,6 +1526,7 @@ static inline void *acquire_slab(struct kmem_cache *s, return NULL; remove_partial(n, page); + page-lru.next = (void *)inuse; WARN_ON(!freelist); return freelist; } @@ -1541,7 +1543,6 @@ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n, struct page *page, *page2; void *object = NULL; int available = 0; - int objects; /* * Racy check. If we mistakenly see no partial slabs then we @@ -1559,11 +1560,11 @@ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n, if (!pfmemalloc_match(page, flags)) continue; - t = acquire_slab(s, n, page, object == NULL, objects); + t = acquire_slab(s, n, page, object == NULL); if (!t) break; - available += objects; + available += (page-objects - (unsigned long)page-lru.next); if (!object) { c-page = page; stat(s, ALLOC_FROM_PARTIAL); -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
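The trick in the patch above is to smuggle an integer through an otherwise-unused pointer field (page->lru.next) by casting it to void * and back. A minimal standalone illustration of that round trip follows; the fake_* structure names are made up and this is only a sketch of the idea, not the slub code:

  #include <stdio.h>

  struct fake_list_head {
  	void *next;
  	void *prev;
  };

  struct fake_page {
  	unsigned long objects;	/* capacity of the slab page   */
  	unsigned long inuse;	/* objects currently allocated */
  	struct fake_list_head lru;
  };

  /* Producer side (acquire_slab() in the patch): record how many objects
   * were in use at acquisition time in the now-unused lru.next pointer. */
  static void acquire(struct fake_page *page)
  {
  	unsigned long inuse = page->inuse;

  	page->inuse = page->objects;	/* slab is now fully claimed */
  	page->lru.next = (void *)inuse;	/* stash the old count       */
  }

  /* Consumer side (get_partial_node() in the patch): recover the count
   * and compute how many objects were actually acquired. */
  static unsigned long acquired_objects(struct fake_page *page)
  {
  	return page->objects - (unsigned long)page->lru.next;
  }

  int main(void)
  {
  	struct fake_page page = { .objects = 16, .inuse = 5 };

  	acquire(&page);
  	printf("acquired %lu objects\n", acquired_objects(&page)); /* 11 */
  	return 0;
  }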
Re: [PATCH 4/5] sched: don't consider upper se in sched_slice()
Hello, Preeti. On Thu, Apr 04, 2013 at 12:18:32PM +0530, Preeti U Murthy wrote: Hi Joonsoo, On 04/04/2013 06:12 AM, Joonsoo Kim wrote: Hello, Preeti. So, how about extending a sched_period with rq-nr_running, instead of cfs_rq-nr_running? It is my quick thought and I think that we can ensure to run atleast once in this extending sched_period. Yeah this seems to be correct.This would ensure sched_min_granularity also. So then in the scenarion where there are 2 tgs in a runqueue with 10 tasks each,when we calculate the sched_slice of any task,the __sched_period() would return 4*20 = 80ms. The sched_slice of each of the task would be 80/20 = 4ms. But what about the sched_slice of each task group? How would that be calculated then? Ah... Okay. I will think more deeply about this issue. Let us take the above example and walk through this problem.This would probably help us spot the issues involved with this. And, do we leave a problem if we cannot guaranteed atleast once property? This would depend on the results of the benchmarks with the changes.I am unable to comment on this off the top of my head. Okay. :) Thanks for your kind review!! Thank you Regards Preeti U Murthy -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
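The numbers in this exchange follow from how __sched_period() stretches the latency target once there are more runnable tasks than the latency can hold at minimum granularity. Below is a simplified model of that arithmetic for the scenario discussed above (2 groups x 10 tasks), assuming the 20 ms latency / 4 ms minimum granularity defaults implied in the thread; this is an illustration of the calculation only, not the kernel's fair.c:

  #include <stdio.h>

  static const unsigned long sysctl_sched_latency = 20;		/* ms */
  static const unsigned long sysctl_sched_min_granularity = 4;	/* ms */
  static const unsigned long sched_nr_latency = 5;		/* 20 / 4 */

  /* Simplified __sched_period(): stretch the period once nr_running
   * exceeds sched_nr_latency, so each task keeps min_granularity. */
  static unsigned long sched_period(unsigned long nr_running)
  {
  	if (nr_running > sched_nr_latency)
  		return nr_running * sysctl_sched_min_granularity;
  	return sysctl_sched_latency;
  }

  int main(void)
  {
  	unsigned long nr_tasks = 20;	/* e.g. 2 groups x 10 tasks */
  	unsigned long period = sched_period(nr_tasks);

  	/* With equal weights, each task's share is period / nr_tasks. */
  	printf("period = %lu ms, per-task slice = %lu ms\n",
  	       period, period / nr_tasks);	/* 80 ms, 4 ms */
  	return 0;
  }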
[PATCH 2/3] mm, slub: count freed pages via rcu as this task's reclaimed_slab
Currently, freed pages via rcu is not counted for reclaimed_slab, because it is freed in rcu context, not current task context. But, this free is initiated by this task, so counting this into this task's reclaimed_slab is meaningful to decide whether we continue reclaim, or not. So change code to count these pages for this task's reclaimed_slab. Cc: Christoph Lameter c...@linux-foundation.org Cc: Pekka Enberg penb...@kernel.org Cc: Matt Mackall m...@selenic.com Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com diff --git a/mm/slub.c b/mm/slub.c index 4aec537..16fd2d5 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -1409,8 +1409,6 @@ static void __free_slab(struct kmem_cache *s, struct page *page) memcg_release_pages(s, order); page_mapcount_reset(page); - if (current-reclaim_state) - current-reclaim_state-reclaimed_slab += pages; __free_memcg_kmem_pages(page, order); } @@ -1431,6 +1429,8 @@ static void rcu_free_slab(struct rcu_head *h) static void free_slab(struct kmem_cache *s, struct page *page) { + int pages = 1 compound_order(page); + if (unlikely(s-flags SLAB_DESTROY_BY_RCU)) { struct rcu_head *head; @@ -1450,6 +1450,9 @@ static void free_slab(struct kmem_cache *s, struct page *page) call_rcu(head, rcu_free_slab); } else __free_slab(s, page); + + if (current-reclaim_state) + current-reclaim_state-reclaimed_slab += pages; } static void discard_slab(struct kmem_cache *s, struct page *page) -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 3/3] mm, slab: count freed pages via rcu as this task's reclaimed_slab
Currently, freed pages via rcu is not counted for reclaimed_slab, because it is freed in rcu context, not current task context. But, this free is initiated by this task, so counting this into this task's reclaimed_slab is meaningful to decide whether we continue reclaim, or not. So change code to count these pages for this task's reclaimed_slab. Cc: Christoph Lameter c...@linux-foundation.org Cc: Pekka Enberg penb...@kernel.org Cc: Matt Mackall m...@selenic.com Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com diff --git a/mm/slab.c b/mm/slab.c index 856e4a1..4d94bcb 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -1934,8 +1934,6 @@ static void kmem_freepages(struct kmem_cache *cachep, void *addr) } memcg_release_pages(cachep, cachep-gfporder); - if (current-reclaim_state) - current-reclaim_state-reclaimed_slab += nr_freed; free_memcg_kmem_pages((unsigned long)addr, cachep-gfporder); } @@ -2165,6 +2163,7 @@ static void slab_destroy_debugcheck(struct kmem_cache *cachep, struct slab *slab */ static void slab_destroy(struct kmem_cache *cachep, struct slab *slabp) { + unsigned long nr_freed = (1 cachep-gfporder); void *addr = slabp-s_mem - slabp-colouroff; slab_destroy_debugcheck(cachep, slabp); @@ -2180,6 +2179,9 @@ static void slab_destroy(struct kmem_cache *cachep, struct slab *slabp) if (OFF_SLAB(cachep)) kmem_cache_free(cachep-slabp_cache, slabp); } + + if (current-reclaim_state) + current-reclaim_state-reclaimed_slab += nr_freed; } /** -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/3] mm, vmscan: count accidental reclaimed pages failed to put into lru
In shrink_(in)active_list(), we can fail to put into lru, and these pages are reclaimed accidentally. Currently, these pages are not counted for sc-nr_reclaimed, but with this information, we can stop to reclaim earlier, so can reduce overhead of reclaim. Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com diff --git a/include/linux/gfp.h b/include/linux/gfp.h index 0f615eb..5d60ae0 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -365,7 +365,7 @@ void *alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask); extern void __free_pages(struct page *page, unsigned int order); extern void free_pages(unsigned long addr, unsigned int order); extern void free_hot_cold_page(struct page *page, int cold); -extern void free_hot_cold_page_list(struct list_head *list, int cold); +extern unsigned long free_hot_cold_page_list(struct list_head *list, int cold); extern void __free_memcg_kmem_pages(struct page *page, unsigned int order); extern void free_memcg_kmem_pages(unsigned long addr, unsigned int order); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 8fcced7..a5f3952 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1360,14 +1360,18 @@ out: /* * Free a list of 0-order pages */ -void free_hot_cold_page_list(struct list_head *list, int cold) +unsigned long free_hot_cold_page_list(struct list_head *list, int cold) { + unsigned long nr_reclaimed = 0; struct page *page, *next; list_for_each_entry_safe(page, next, list, lru) { trace_mm_page_free_batched(page, cold); free_hot_cold_page(page, cold); + nr_reclaimed++; } + + return nr_reclaimed; } /* diff --git a/mm/vmscan.c b/mm/vmscan.c index 88c5fed..eff2927 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -915,7 +915,6 @@ static unsigned long shrink_page_list(struct list_head *page_list, */ __clear_page_locked(page); free_it: - nr_reclaimed++; /* * Is there need to periodically free_page_list? 
It would @@ -954,7 +953,7 @@ keep: if (nr_dirty nr_dirty == nr_congested global_reclaim(sc)) zone_set_flag(zone, ZONE_CONGESTED); - free_hot_cold_page_list(free_pages, 1); + nr_reclaimed += free_hot_cold_page_list(free_pages, 1); list_splice(ret_pages, page_list); count_vm_events(PGACTIVATE, pgactivate); @@ -1321,7 +1320,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec, if (nr_taken == 0) return 0; - nr_reclaimed = shrink_page_list(page_list, zone, sc, TTU_UNMAP, + nr_reclaimed += shrink_page_list(page_list, zone, sc, TTU_UNMAP, nr_dirty, nr_writeback, false); spin_lock_irq(zone-lru_lock); @@ -1343,7 +1342,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec, spin_unlock_irq(zone-lru_lock); - free_hot_cold_page_list(page_list, 1); + nr_reclaimed += free_hot_cold_page_list(page_list, 1); /* * If reclaim is isolating dirty pages under writeback, it implies @@ -1438,7 +1437,7 @@ static void move_active_pages_to_lru(struct lruvec *lruvec, __count_vm_events(PGDEACTIVATE, pgmoved); } -static void shrink_active_list(unsigned long nr_to_scan, +static unsigned long shrink_active_list(unsigned long nr_to_scan, struct lruvec *lruvec, struct scan_control *sc, enum lru_list lru) @@ -1534,7 +1533,7 @@ static void shrink_active_list(unsigned long nr_to_scan, __mod_zone_page_state(zone, NR_ISOLATED_ANON + file, -nr_taken); spin_unlock_irq(zone-lru_lock); - free_hot_cold_page_list(l_hold, 1); + return free_hot_cold_page_list(l_hold, 1); } #ifdef CONFIG_SWAP @@ -1617,7 +1616,8 @@ static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan, { if (is_active_lru(lru)) { if (inactive_list_is_low(lruvec, lru)) - shrink_active_list(nr_to_scan, lruvec, sc, lru); + return shrink_active_list(nr_to_scan, lruvec, sc, lru); + return 0; } @@ -1861,8 +1861,8 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc) * rebalance the anon lru active/inactive ratio. */ if (inactive_anon_is_low(lruvec)) - shrink_active_list(SWAP_CLUSTER_MAX, lruvec, - sc, LRU_ACTIVE_ANON); + sc-nr_reclaimed += shrink_active_list(SWAP_CLUSTER_MAX, + lruvec, sc, LRU_ACTIVE_ANON); throttle_vm_writeout(sc-gfp_mask); } @@ -2470,23 +2470,27 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, } #endif -static void age_active_anon(struct zone *zone, struct scan_control
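The core of the change above is simply that the list-freeing helper now tells its caller how many pages it dropped, so the caller can fold that into nr_reclaimed. A trivial standalone sketch of that shape, with an invented singly linked list standing in for the page list:

  #include <stdio.h>
  #include <stdlib.h>

  struct fake_page {
  	struct fake_page *next;
  };

  /* Counterpart of free_hot_cold_page_list() after the patch: free every
   * page on the list and report how many were freed. */
  static unsigned long free_page_list(struct fake_page *list)
  {
  	unsigned long nr_freed = 0;

  	while (list) {
  		struct fake_page *next = list->next;

  		free(list);
  		nr_freed++;
  		list = next;
  	}
  	return nr_freed;
  }

  int main(void)
  {
  	struct fake_page *list = NULL;
  	unsigned long nr_reclaimed = 0;

  	for (int i = 0; i < 3; i++) {
  		struct fake_page *p = malloc(sizeof(*p));

  		p->next = list;
  		list = p;
  	}
  	/* The caller folds the count into its running reclaim total, the
  	 * way shrink_page_list()/shrink_inactive_list() do in the patch. */
  	nr_reclaimed += free_page_list(list);
  	printf("nr_reclaimed = %lu\n", nr_reclaimed);	/* 3 */
  	return 0;
  }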
Re: [PATCH 08/10] mm: vmscan: Have kswapd shrink slab only once per priority
Hello, Mel. Sorry for too late question. On Sun, Mar 17, 2013 at 01:04:14PM +, Mel Gorman wrote: If kswaps fails to make progress but continues to shrink slab then it'll either discard all of slab or consume CPU uselessly scanning shrinkers. This patch causes kswapd to only call the shrinkers once per priority. Signed-off-by: Mel Gorman mgor...@suse.de --- mm/vmscan.c | 28 +--- 1 file changed, 21 insertions(+), 7 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 7d5a932..84375b2 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2661,9 +2661,10 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining, */ static bool kswapd_shrink_zone(struct zone *zone, struct scan_control *sc, -unsigned long lru_pages) +unsigned long lru_pages, +bool shrinking_slab) { - unsigned long nr_slab; + unsigned long nr_slab = 0; struct reclaim_state *reclaim_state = current-reclaim_state; struct shrink_control shrink = { .gfp_mask = sc-gfp_mask, @@ -2673,9 +2674,15 @@ static bool kswapd_shrink_zone(struct zone *zone, sc-nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone)); shrink_zone(zone, sc); - reclaim_state-reclaimed_slab = 0; - nr_slab = shrink_slab(shrink, sc-nr_scanned, lru_pages); - sc-nr_reclaimed += reclaim_state-reclaimed_slab; + /* + * Slabs are shrunk for each zone once per priority or if the zone + * being balanced is otherwise unreclaimable + */ + if (shrinking_slab || !zone_reclaimable(zone)) { + reclaim_state-reclaimed_slab = 0; + nr_slab = shrink_slab(shrink, sc-nr_scanned, lru_pages); + sc-nr_reclaimed += reclaim_state-reclaimed_slab; + } if (nr_slab == 0 !zone_reclaimable(zone)) zone-all_unreclaimable = 1; Why shrink_slab() is called here? I think that outside of zone loop is better place to run shrink_slab(), because shrink_slab() is not directly related to a specific zone. And this is a question not related to this patch. Why nr_slab is used here to decide zone-all_unreclaimable? nr_slab is not directly related whether a specific zone is reclaimable or not, and, moreover, nr_slab is not directly related to number of reclaimed pages. It just say some objects in the system are freed. This question comes from my ignorance, so please enlighten me. Thanks. @@ -2713,6 +2720,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order, int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */ unsigned long nr_soft_reclaimed; unsigned long nr_soft_scanned; + bool shrinking_slab = true; struct scan_control sc = { .gfp_mask = GFP_KERNEL, .priority = DEF_PRIORITY, @@ -2861,7 +2869,8 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order, * already being scanned that that high * watermark would be met at 100% efficiency. */ - if (kswapd_shrink_zone(zone, sc, lru_pages)) + if (kswapd_shrink_zone(zone, sc, + lru_pages, shrinking_slab)) raise_priority = false; nr_to_reclaim += sc.nr_to_reclaim; @@ -2900,6 +2909,9 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order, pfmemalloc_watermark_ok(pgdat)) wake_up(pgdat-pfmemalloc_wait); + /* Only shrink slab once per priority */ + shrinking_slab = false; + /* * Fragmentation may mean that the system cannot be rebalanced * for high-order allocations in all zones. 
If twice the @@ -2925,8 +2937,10 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order, * Raise priority if scanning rate is too low or there was no * progress in reclaiming pages */ - if (raise_priority || !this_reclaimed) + if (raise_priority || !this_reclaimed) { sc.priority--; + shrinking_slab = true; + } } while (sc.priority = 1 !pgdat_balanced(pgdat, order, *classzone_idx)); -- 1.8.1.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majord...@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: a href=mailto:d...@kvack.org; em...@kvack.org /a -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at
Re: [PATCH 08/10] mm: vmscan: Have kswapd shrink slab only once per priority
Hello, Mel. On Tue, Apr 09, 2013 at 12:13:59PM +0100, Mel Gorman wrote: On Tue, Apr 09, 2013 at 03:53:25PM +0900, Joonsoo Kim wrote: Hello, Mel. Sorry for too late question. No need to apologise at all. On Sun, Mar 17, 2013 at 01:04:14PM +, Mel Gorman wrote: If kswaps fails to make progress but continues to shrink slab then it'll either discard all of slab or consume CPU uselessly scanning shrinkers. This patch causes kswapd to only call the shrinkers once per priority. Signed-off-by: Mel Gorman mgor...@suse.de --- mm/vmscan.c | 28 +--- 1 file changed, 21 insertions(+), 7 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 7d5a932..84375b2 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2661,9 +2661,10 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining, */ static bool kswapd_shrink_zone(struct zone *zone, struct scan_control *sc, -unsigned long lru_pages) +unsigned long lru_pages, +bool shrinking_slab) { - unsigned long nr_slab; + unsigned long nr_slab = 0; struct reclaim_state *reclaim_state = current-reclaim_state; struct shrink_control shrink = { .gfp_mask = sc-gfp_mask, @@ -2673,9 +2674,15 @@ static bool kswapd_shrink_zone(struct zone *zone, sc-nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone)); shrink_zone(zone, sc); - reclaim_state-reclaimed_slab = 0; - nr_slab = shrink_slab(shrink, sc-nr_scanned, lru_pages); - sc-nr_reclaimed += reclaim_state-reclaimed_slab; + /* + * Slabs are shrunk for each zone once per priority or if the zone + * being balanced is otherwise unreclaimable + */ + if (shrinking_slab || !zone_reclaimable(zone)) { + reclaim_state-reclaimed_slab = 0; + nr_slab = shrink_slab(shrink, sc-nr_scanned, lru_pages); + sc-nr_reclaimed += reclaim_state-reclaimed_slab; + } if (nr_slab == 0 !zone_reclaimable(zone)) zone-all_unreclaimable = 1; Why shrink_slab() is called here? Preserves existing behaviour. Yes, but, with this patch, existing behaviour is changed, that is, we call shrink_slab() once per priority. For now, there is no reason this function is called here. How about separating it and executing it outside of zone loop? We can do it with another zone loop in order to decide a zone-all_unreclaimble. Below is pseudo code from my quick thought. for each zone shrink_zone() end nr_slab = shrink_slab() if (nr_slab == 0) { for each zone if (!zone_reclaimable) zone-all_unreclaimble = 1 end end } I think that outside of zone loop is better place to run shrink_slab(), because shrink_slab() is not directly related to a specific zone. This is true and has been the case for a long time. The slab shrinkers are not zone aware and it is complicated by the fact that slab usage can indirectly pin memory on other zones. Consider for example a slab object that is an inode entry that is allocated from the Normal zone on a 32-bit machine. Reclaiming may free memory from the Highmem zone. It's less obvious a problem on 64-bit machines but freeing slab objects from a zone like DMA32 can indirectly free memory from the Normal zone or even another node entirely. And this is a question not related to this patch. Why nr_slab is used here to decide zone-all_unreclaimable? Slab is not directly associated with a slab but as reclaiming slab can free memory from unpredictable zones we do not consider a zone to be fully unreclaimable until we cannot shrink slab any more. You may be thinking that this is extremely heavy handed and you're right, it is. 
nr_slab is not directly related whether a specific zone is reclaimable or not, and, moreover, nr_slab is not directly related to number of reclaimed pages. It just say some objects in the system are freed. All true, it's the indirect relation between slab objects and the memory that is freed when slab objects are reclaimed that has to be taken into account. This question comes from my ignorance, so please enlighten me. I hope this clarifies matters. Very helpful :) Thanks. -- Mel Gorman SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majord...@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: a href=mailto:d...@kvack.org; em...@kvack.org /a -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
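Joonsoo's pseudocode earlier in this thread amounts to hoisting the slab shrink out of the per-zone loop so it runs once per priority, and only then using the result to mark zones unreclaimable. A compact sketch of that control flow follows; the function names mirror the discussion but the bodies are stubs, not kernel code:

  #include <stdio.h>
  #include <stdbool.h>

  #define NR_ZONES 3

  static bool all_unreclaimable[NR_ZONES];

  static void shrink_zone(int zone)		{ (void)zone; }
  static unsigned long shrink_slab(void)	{ return 0; }	/* nothing freed */
  static bool zone_reclaimable(int zone)	{ return zone == 0; }

  /* One reclaim priority level: shrink every zone, then shrink slab once,
   * and only use the slab result to decide per-zone unreclaimability. */
  static void balance_one_priority(void)
  {
  	unsigned long nr_slab;

  	for (int zone = 0; zone < NR_ZONES; zone++)
  		shrink_zone(zone);

  	nr_slab = shrink_slab();
  	if (nr_slab == 0) {
  		for (int zone = 0; zone < NR_ZONES; zone++)
  			if (!zone_reclaimable(zone))
  				all_unreclaimable[zone] = true;
  	}
  }

  int main(void)
  {
  	balance_one_priority();
  	for (int zone = 0; zone < NR_ZONES; zone++)
  		printf("zone %d all_unreclaimable=%d\n",
  		       zone, all_unreclaimable[zone]);
  	return 0;
  }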
Re: [PATCH 08/10] mm: vmscan: Have kswapd shrink slab only once per priority
Hello, Dave. On Wed, Apr 10, 2013 at 11:07:34AM +1000, Dave Chinner wrote: On Tue, Apr 09, 2013 at 12:13:59PM +0100, Mel Gorman wrote: On Tue, Apr 09, 2013 at 03:53:25PM +0900, Joonsoo Kim wrote: I think that outside of zone loop is better place to run shrink_slab(), because shrink_slab() is not directly related to a specific zone. This is true and has been the case for a long time. The slab shrinkers are not zone aware and it is complicated by the fact that slab usage can indirectly pin memory on other zones. .. And this is a question not related to this patch. Why nr_slab is used here to decide zone-all_unreclaimable? Slab is not directly associated with a slab but as reclaiming slab can free memory from unpredictable zones we do not consider a zone to be fully unreclaimable until we cannot shrink slab any more. This is something the numa aware shrinkers will greatly help with - instead of being a global shrink it becomes a node-the-zone-belongs-to shrink, and so You may be thinking that this is extremely heavy handed and you're right, it is. ... it is much less heavy handed than the current code... nr_slab is not directly related whether a specific zone is reclaimable or not, and, moreover, nr_slab is not directly related to number of reclaimed pages. It just say some objects in the system are freed. All true, it's the indirect relation between slab objects and the memory that is freed when slab objects are reclaimed that has to be taken into account. Node awareness within the shrinker infrastructure and LRUs make the relationship much more direct ;) Yes, I think so ;) Thanks. Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majord...@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: a href=mailto:d...@kvack.org; em...@kvack.org /a -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/3] mm, slub: count freed pages via rcu as this task's reclaimed_slab
Hello, Christoph.

On Tue, Apr 09, 2013 at 02:28:06PM +0000, Christoph Lameter wrote:
> On Tue, 9 Apr 2013, Joonsoo Kim wrote:
>
>> Currently, freed pages via rcu is not counted for reclaimed_slab, because
>> it is freed in rcu context, not current task context. But, this free is
>> initiated by this task, so counting this into this task's reclaimed_slab
>> is meaningful to decide whether we continue reclaim, or not.
>> So change code to count these pages for this task's reclaimed_slab.
>
> slab->reclaim_state guides the reclaim actions in vmscan.c. With this
> patch slab->reclaim_state could get quite a high value without new pages
> being available for allocation. slab->reclaim_state will only be updated
> when the RCU period ends.

Okay. In addition, there are only a few places that use SLAB_DESTROY_BY_RCU.
I will drop this patch [2/3] and [3/3] for the next spin.

Thanks.
Re: [PATCH 1/3] mm, vmscan: count accidental reclaimed pages failed to put into lru
Hello, Minchan. On Tue, Apr 09, 2013 at 02:55:14PM +0900, Minchan Kim wrote: Hello Joonsoo, On Tue, Apr 09, 2013 at 10:21:16AM +0900, Joonsoo Kim wrote: In shrink_(in)active_list(), we can fail to put into lru, and these pages are reclaimed accidentally. Currently, these pages are not counted for sc-nr_reclaimed, but with this information, we can stop to reclaim earlier, so can reduce overhead of reclaim. Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com Nice catch! But this patch handles very corner case and makes reclaim function's name rather stupid so I'd like to see text size change after we apply this patch. Other nipicks below. Ah... Yes. I can re-work it to add number to sc-nr_reclaimed directly for both cases, shrink_active_list() and age_active_anon(). diff --git a/include/linux/gfp.h b/include/linux/gfp.h index 0f615eb..5d60ae0 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -365,7 +365,7 @@ void *alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask); extern void __free_pages(struct page *page, unsigned int order); extern void free_pages(unsigned long addr, unsigned int order); extern void free_hot_cold_page(struct page *page, int cold); -extern void free_hot_cold_page_list(struct list_head *list, int cold); +extern unsigned long free_hot_cold_page_list(struct list_head *list, int cold); extern void __free_memcg_kmem_pages(struct page *page, unsigned int order); extern void free_memcg_kmem_pages(unsigned long addr, unsigned int order); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 8fcced7..a5f3952 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1360,14 +1360,18 @@ out: /* * Free a list of 0-order pages */ -void free_hot_cold_page_list(struct list_head *list, int cold) +unsigned long free_hot_cold_page_list(struct list_head *list, int cold) { + unsigned long nr_reclaimed = 0; How about nr_free or nr_freed for consistent with function title? Okay. struct page *page, *next; list_for_each_entry_safe(page, next, list, lru) { trace_mm_page_free_batched(page, cold); free_hot_cold_page(page, cold); + nr_reclaimed++; } + + return nr_reclaimed; } /* diff --git a/mm/vmscan.c b/mm/vmscan.c index 88c5fed..eff2927 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -915,7 +915,6 @@ static unsigned long shrink_page_list(struct list_head *page_list, */ __clear_page_locked(page); free_it: - nr_reclaimed++; /* * Is there need to periodically free_page_list? It would @@ -954,7 +953,7 @@ keep: if (nr_dirty nr_dirty == nr_congested global_reclaim(sc)) zone_set_flag(zone, ZONE_CONGESTED); - free_hot_cold_page_list(free_pages, 1); + nr_reclaimed += free_hot_cold_page_list(free_pages, 1); Nice cleanup. list_splice(ret_pages, page_list); count_vm_events(PGACTIVATE, pgactivate); @@ -1321,7 +1320,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec, if (nr_taken == 0) return 0; - nr_reclaimed = shrink_page_list(page_list, zone, sc, TTU_UNMAP, + nr_reclaimed += shrink_page_list(page_list, zone, sc, TTU_UNMAP, nr_dirty, nr_writeback, false); Do you have any reason to change? To me, '=' is more clear to initialize the variable. When I see above, I have to look through above lines to catch where code used the nr_reclaimed. There is no reason, I will change it. spin_lock_irq(zone-lru_lock); @@ -1343,7 +1342,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec, spin_unlock_irq(zone-lru_lock); - free_hot_cold_page_list(page_list, 1); + nr_reclaimed += free_hot_cold_page_list(page_list, 1); How about considering vmstat, too? 
It could be minor but you are considering freed page as reclaim context. (ie, sc-nr_reclaimed) so it would be more appropriate. I don't understand what you mean. Please explain more what you have in mind :) /* * If reclaim is isolating dirty pages under writeback, it implies @@ -1438,7 +1437,7 @@ static void move_active_pages_to_lru(struct lruvec *lruvec, __count_vm_events(PGDEACTIVATE, pgmoved); } -static void shrink_active_list(unsigned long nr_to_scan, +static unsigned long shrink_active_list(unsigned long nr_to_scan, struct lruvec *lruvec, struct scan_control *sc, enum lru_list lru) @@ -1534,7 +1533,7 @@ static void shrink_active_list(unsigned long nr_to_scan, __mod_zone_page_state(zone, NR_ISOLATED_ANON + file, -nr_taken); spin_unlock_irq(zone-lru_lock); - free_hot_cold_page_list(l_hold, 1); + return
Re: [RT LATENCY] 249 microsecond latency caused by slub's unfreeze_partials() code.
On Wed, Apr 10, 2013 at 09:31:10AM +0300, Pekka Enberg wrote: On Mon, Apr 8, 2013 at 3:32 PM, Steven Rostedt rost...@goodmis.org wrote: Index: linux/mm/slub.c === --- linux.orig/mm/slub.c2013-03-28 12:14:26.958358688 -0500 +++ linux/mm/slub.c 2013-04-01 10:23:24.677584499 -0500 @@ -1498,6 +1498,7 @@ static inline void *acquire_slab(struct void *freelist; unsigned long counters; struct page new; + unsigned long objects; /* * Zap the freelist and set the frozen bit. @@ -1507,6 +1508,7 @@ static inline void *acquire_slab(struct freelist = page-freelist; counters = page-counters; new.counters = counters; + objects = page-inuse; if (mode) { new.inuse = page-objects; new.freelist = NULL; @@ -1524,6 +1526,7 @@ static inline void *acquire_slab(struct return NULL; remove_partial(n, page); + page-lru.next = (void *)objects; WARN_ON(!freelist); return freelist; } Good. I like your method which use lru.next in order to hand over number of objects. I hate it ;-) It just seems to be something that's not very robust and can cause hours of debugging in the future. I mean, there's not even a comment explaining what is happening. The lru is a union with other slub partials structs that is not very obvious. If something is out of order, it can easily break, and there's nothing here that points to why. Just pass the damn objects pointer by reference and use that. It's easy to understand, read and is robust. Christoph, Joonsoo, comments? Steven's comment is reasonable to me. If there is no objection from Christoph, let's drop a patch in which I implement Christoph's idea. Thanks. Pekka -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
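Steven's objection is essentially that an output parameter is self-documenting where the lru.next cast is not. Below is a minimal sketch of the pass-by-reference shape he is asking for, using standalone toy types rather than the actual slub code:

  #include <stdio.h>

  struct fake_page {
  	unsigned long objects;
  	unsigned long inuse;
  };

  /* acquire_slab()-like helper: return a token for the acquired freelist
   * and report, via *objects, how many free objects were taken. */
  static void *acquire(struct fake_page *page, unsigned long *objects)
  {
  	*objects = page->objects - page->inuse;
  	page->inuse = page->objects;	/* slab is now fully claimed */
  	return page;			/* stand-in for the freelist */
  }

  int main(void)
  {
  	struct fake_page pages[2] = { { 16, 5 }, { 16, 10 } };
  	unsigned long available = 0;

  	for (int i = 0; i < 2; i++) {
  		unsigned long objects;

  		if (!acquire(&pages[i], &objects))
  			break;
  		available += objects;	/* caller sums acquired objects */
  	}
  	printf("available = %lu\n", available);	/* 11 + 6 = 17 */
  	return 0;
  }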
Re: [PATCH 2/3] mm, slub: count freed pages via rcu as this task's reclaimed_slab
2013/4/10 Christoph Lameter c...@linux.com: On Wed, 10 Apr 2013, Joonsoo Kim wrote: Hello, Christoph. On Tue, Apr 09, 2013 at 02:28:06PM +, Christoph Lameter wrote: On Tue, 9 Apr 2013, Joonsoo Kim wrote: Currently, freed pages via rcu is not counted for reclaimed_slab, because it is freed in rcu context, not current task context. But, this free is initiated by this task, so counting this into this task's reclaimed_slab is meaningful to decide whether we continue reclaim, or not. So change code to count these pages for this task's reclaimed_slab. slab-reclaim_state guides the reclaim actions in vmscan.c. With this patch slab-reclaim_state could get quite a high value without new pages being available for allocation. slab-reclaim_state will only be updated when the RCU period ends. Okay. In addition, there is a little place who use SLAB_DESTROY_BY_RCU. I will drop this patch[2/3] and [3/3] for next spin. What you have discoverd is an issue that we have so far overlooked. Could you add comments to both places explaining the situation? Yes, I can. RCU is used for some inode and the dentry cache. Failing to account for these frees could pose a problem. One solution would be to ensure that we get through an RCU quiescent period in the slabs reclaim. If we can ensure that then your patch may be ok. Hmm... I don't perfectly understand RCU code and it's quiescent period. But, yes, it can be one of possible solutions in my quick thought. Currently, I have no ability to do that, so I skip to think about this. Thanks. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majord...@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: a href=mailto:d...@kvack.org; em...@kvack.org /a -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 4/5] sched: don't consider upper se in sched_slice()
Hello, Preeti. On Fri, Mar 29, 2013 at 12:42:53PM +0530, Preeti U Murthy wrote: Hi Joonsoo, On 03/28/2013 01:28 PM, Joonsoo Kim wrote: Following-up upper se in sched_slice() should not be done, because sched_slice() is used for checking that resched is needed whithin *this* cfs_rq and there is one problem related to this in current implementation. The problem is that if we follow-up upper se in sched_slice(), it is possible that we get a ideal slice which is lower than sysctl_sched_min_granularity. For example, we assume that we have 4 tg which is attached to root tg with same share and each one have 20 runnable tasks on cpu0, respectivly. In this case, __sched_period() return sysctl_sched_min_granularity * 20 and then go into loop. At first iteration, we compute a portion of slice for this task on this cfs_rq, so get a slice, sysctl_sched_min_granularity. Afterward, we enter second iteration and get a slice which is a quarter of sysctl_sched_min_granularity, because there is 4 tgs with same share in that cfs_rq. Ensuring slice larger than min_granularity is important for performance and there is no lower bound about this, except timer tick, we should fix it not to consider upper se when calculating sched_slice. I am assuming the sysctl_sched_latency = 20ms and sysctl_sched_min_granularity = 4ms. In that case: With your patch, the sum of the sched_slice(runnable_task) on each_tg is 40ms = sched_min_granularity * 10, while the parent tg has a sched_slice of sysctl_sched_latency / 4 = (20 / 4) = 5ms. Ideally the children's cpu share must add upto the parent's share. I don't think so. We should schedule out the parent tg if 5ms is over. As we do so, we can fairly distribute time slice to every tg within short term. If we add the children's cpu share upto the parent's, the parent tg may have large time slice, so it cannot be preempted easily. There may be a latency problem if there are many tgs. Thanks. Thank you Regards Preeti U Murthy -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 5/5] sched: limit sched_slice if it is more than sysctl_sched_latency
Hello Preeti. On Fri, Mar 29, 2013 at 05:05:37PM +0530, Preeti U Murthy wrote: Hi Joonsoo On 03/28/2013 01:28 PM, Joonsoo Kim wrote: sched_slice() compute ideal runtime slice. If there are many tasks in cfs_rq, period for this cfs_rq is extended to guarantee that each task has time slice at least, sched_min_granularity. And then each task get a portion of this period for it. If there is a task which have much larger load weight than others, a portion of period can exceed far more than sysctl_sched_latency. Correct. But that does not matter, the length of the scheduling latency period is determined by the return value of ___sched_period(), not the value of sysctl_sched_latency. You would not extend the period,if you wanted all tasks to have a slice within the sysctl_sched_latency, right? So since the value of the length of the scheduling latency period, is dynamic depending on the number of the processes running, the sysctl_sched_latency which is the default latency period length is not mesed with, but is only used as a base to determine the actual scheduling period. For exampple, you can simply imagine that one task with nice -20 and 9 tasks with nice 0 on one cfs_rq. In this case, load weight sum for this cfs_rq is 88761 + 9 * 1024, 97977. So a portion of slice for the task with nice -20 is sysctl_sched_min_granularity * 10 * (88761 / 97977), that is, approximately, sysctl_sched_min_granularity * 9. This aspect can be much larger if there is more tasks with nice 0. Yeah so the __sched_period says that within 40ms, all tasks need to be scheduled ateast once, and the highest priority task gets nearly 36ms of it, while the rest is distributed among the others. So we should limit this possible weird situation. Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index e232421..6ceffbc 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -645,6 +645,9 @@ static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se) } slice = calc_delta_mine(slice, se-load.weight, load); + if (unlikely(slice sysctl_sched_latency)) + slice = sysctl_sched_latency; Then in this case the highest priority thread would get 20ms(sysctl_sched_latency), and the rest would get sysctl_sched_min_granularity * 10 * (1024/97977) which would be 0.4ms. Then all tasks would get scheduled ateast once within 20ms + (0.4*9) ms = 23.7ms, while your scheduling latency period was extended to 40ms,just so that each of these tasks don't have their sched_slices shrunk due to large number of tasks. I don't know I understand your question correctly. I will do my best to answer your comment. :) With this patch, I just limit maximum slice at one time. Scheduling is controlled through the vruntime. So, in this case, the task with nice -20 will be scheduled twice. 20 + (0.4 * 9) + 20 = 43.9 ms And after 43.9 ms, this process is repeated. So I can tell you that scheduling period is preserved as before. If we give a long period to a task at one go, it can cause a latency problem. So IMHO, limiting this is meaningful. Thanks. 
+ return slice; } Regards Preeti U Murthy -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
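The numbers in this exchange come straight from the load-weight table quoted above: nice -20 has weight 88761 and nice 0 has weight 1024, so with one nice -20 task and nine nice 0 tasks the heavy task's share of the stretched 40 ms period is 40 * 88761 / 97977, which the patch then caps at sysctl_sched_latency. A small calculation sketch of that arithmetic (plain floating point, not the kernel's fixed-point calc_delta_mine()):

  #include <stdio.h>

  int main(void)
  {
  	/* Scenario from the thread: one nice -20 task and nine nice 0
  	 * tasks on one cfs_rq; 4 ms minimum granularity, 20 ms latency. */
  	double w_heavy = 88761.0, w_normal = 1024.0;
  	double weight_sum = w_heavy + 9.0 * w_normal;	/* 97977 */
  	double period = 4.0 * 10;			/* 40 ms  */

  	double slice_heavy = period * w_heavy / weight_sum;
  	double slice_normal = period * w_normal / weight_sum;

  	printf("uncapped heavy slice: %.1f ms\n", slice_heavy);	/* ~36.2 */
  	printf("normal slice: %.2f ms\n", slice_normal);	/* ~0.42 */

  	/* With the patch, one stint of the heavy task is capped at the
  	 * 20 ms latency; vruntime still keeps the totals fair over time. */
  	double capped = slice_heavy > 20.0 ? 20.0 : slice_heavy;
  	printf("capped heavy slice: %.1f ms\n", capped);	/* 20.0 */
  	return 0;
  }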
Re: [PATCH 2/5] sched: factor out code to should_we_balance()
Hello, Peter.

On Fri, Mar 29, 2013 at 12:45:14PM +0100, Peter Zijlstra wrote:
> On Thu, 2013-03-28 at 16:58 +0900, Joonsoo Kim wrote:
>> There is not enough reason to place this checking at update_sg_lb_stats(),
>> except saving one iteration for sched_group_cpus. But with this change,
>> we can save two memset cost and can expect better compiler optimization,
>> so clean-up may be more beneficial to us.
>
> It would be good if you'd described what you actually did, there's a lot
> of code movement and now I have to figure out wth you did and why.

Okay. I will do that when I respin.

Thanks.
Re: [PATCH 2/5] sched: factor out code to should_we_balance()
On Fri, Mar 29, 2013 at 12:58:26PM +0100, Peter Zijlstra wrote: On Thu, 2013-03-28 at 16:58 +0900, Joonsoo Kim wrote: +static int should_we_balance(struct lb_env *env) +{ + struct sched_group *sg = env-sd-groups; + int cpu, balance_cpu = -1; + + /* +* In the newly idle case, we will allow all the cpu's +* to do the newly idle load balance. +*/ + if (env-idle == CPU_NEWLY_IDLE) + return 1; + + /* Try to find first idle cpu */ + for_each_cpu_and(cpu, sched_group_cpus(sg), env-cpus) + if (cpumask_test_cpu(cpu, sched_group_mask(sg)) + idle_cpu(cpu)) { + balance_cpu = cpu; + break; + } /me hands you a bucket of curlies too.. use them I'll send you another bucket when this one gets empty! Thanks! (We always put in curlies on multi lines statements -- as opposed to multi statement blocks where they're required) Okay. I will do that when respin. Thanks. + + if (balance_cpu == -1) + balance_cpu = group_balance_cpu(sg); + + /* +* First idle cpu or the first cpu(busiest) in this sched group +* is eligible for doing load balancing at this and above domains. +*/ + if (balance_cpu != env-dst_cpu) + return 0; + + return 1; +} -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] kprobe: initialize kprobe_blacklist when it is used firstly
Currently, initializing kprobe_blacklist is done during the boot process. It takes 230 ms on our Android platform, and this is a significant amount for our use case. We can disable CONFIG_KPROBES for the production kernel, but that is a hassle. The kprobe functionality is not commonly used, so we don't need to pay this cost at all times. With this rationale, change the code to initialize kprobe_blacklist when it is first used. Cc: Ananth N Mavinakayanahalli ana...@in.ibm.com Cc: Anil S Keshavamurthy anil.s.keshavamur...@intel.com Cc: David S. Miller da...@davemloft.net Cc: Masami Hiramatsu masami.hiramatsu...@hitachi.com Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com --- I forgot to add lkml. Sorry for the noise.

diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index e35be53..5e90092 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -101,6 +101,7 @@ static struct kprobe_blackpoint kprobe_blacklist[] = {
 	{"mcount",},	/* mcount can be called from everywhere */
 	{NULL}		/* Terminator */
 };
+static bool kprobe_blacklist_initialized;
 
 #ifdef __ARCH_WANT_KPROBES_INSN_SLOT
 /*
@@ -1324,6 +1325,49 @@ out:
 	return ret;
 }
 
+static void __kprobes init_kprobe_blacklist(void)
+{
+	unsigned long offset = 0, size = 0;
+	char *modname, namebuf[128];
+	const char *symbol_name;
+	void *addr;
+	struct kprobe_blackpoint *kb;
+
+	mutex_lock(&kprobe_mutex);
+	if (kprobe_blacklist_initialized)
+		goto out;
+
+	/*
+	 * Lookup and populate the kprobe_blacklist.
+	 *
+	 * Unlike the kretprobe blacklist, we'll need to determine
+	 * the range of addresses that belong to the said functions,
+	 * since a kprobe need not necessarily be at the beginning
+	 * of a function.
+	 */
+	for (kb = kprobe_blacklist; kb->name != NULL; kb++) {
+		kprobe_lookup_name(kb->name, addr);
+		if (!addr)
+			continue;
+
+		kb->start_addr = (unsigned long)addr;
+		symbol_name = kallsyms_lookup(kb->start_addr,
+				&size, &offset, &modname, namebuf);
+		if (!symbol_name)
+			kb->range = 0;
+		else
+			kb->range = size;
+	}
+
+	/*
+	 * This guarantees that anyone who sees the initialized flag
+	 * also sees the updated kprobe_blacklist data.
+	 */
+	smp_wmb();
+	kprobe_blacklist_initialized = true;
+
+out:
+	mutex_unlock(&kprobe_mutex);
+}
+
 static int __kprobes in_kprobes_functions(unsigned long addr)
 {
 	struct kprobe_blackpoint *kb;
@@ -1331,6 +1375,7 @@ static int __kprobes in_kprobes_functions(unsigned long addr)
 	if (addr >= (unsigned long)__kprobes_text_start &&
 	    addr < (unsigned long)__kprobes_text_end)
 		return -EINVAL;
+
 	/*
 	 * If there exists a kprobe_blacklist, verify and
 	 * fail any probe registration in the prohibited area
@@ -1476,6 +1521,9 @@ int __kprobes register_kprobe(struct kprobe *p)
 	struct module *probed_mod;
 	kprobe_opcode_t *addr;
 
+	if (unlikely(!kprobe_blacklist_initialized))
+		init_kprobe_blacklist();
+
 	/* Adjust probe address from symbol */
 	addr = kprobe_addr(p);
 	if (IS_ERR(addr))
@@ -2065,11 +2113,6 @@ static struct notifier_block kprobe_module_nb = {
 static int __init init_kprobes(void)
 {
 	int i, err = 0;
-	unsigned long offset = 0, size = 0;
-	char *modname, namebuf[128];
-	const char *symbol_name;
-	void *addr;
-	struct kprobe_blackpoint *kb;
 
 	/* FIXME allocate the probe table, currently defined statically */
 	/* initialize all list heads */
@@ -2079,28 +2122,6 @@ static int __init init_kprobes(void)
 		raw_spin_lock_init(&(kretprobe_table_locks[i].lock));
 	}
 
-	/*
-	 * Lookup and populate the kprobe_blacklist.
-	 *
-	 * Unlike the kretprobe blacklist, we'll need to determine
-	 * the range of addresses that belong to the said functions,
-	 * since a kprobe need not necessarily be at the beginning
-	 * of a function.
-	 */
-	for (kb = kprobe_blacklist; kb->name != NULL; kb++) {
-		kprobe_lookup_name(kb->name, addr);
-		if (!addr)
-			continue;
-
-		kb->start_addr = (unsigned long)addr;
-		symbol_name = kallsyms_lookup(kb->start_addr,
-				&size, &offset, &modname, namebuf);
-		if (!symbol_name)
-			kb->range = 0;
-		else
-			kb->range = size;
-	}
-
 	if (kretprobe_blacklist_size) {
 		/* lookup the function address from its name */
 		for (i = 0; kretprobe_blacklist[i].name != NULL; i++) {
--
1.7.9.5
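For readers less familiar with the pattern, the patch is essentially a once-only lazy initialization guarded by a flag and a mutex. A minimal sketch of that pattern follows; this is my own illustration, not code from the patch, and do_expensive_init() is a hypothetical stand-in for the blacklist population loop:

#include <linux/kernel.h>
#include <linux/mutex.h>

static DEFINE_MUTEX(init_mutex);	/* plays the role of kprobe_mutex */
static bool initialized;		/* plays the role of kprobe_blacklist_initialized */

static void lazy_init(void)
{
	mutex_lock(&init_mutex);
	if (!initialized) {
		do_expensive_init();	/* hypothetical: populate the data */
		/*
		 * Publish the data before the flag, so a caller that sees
		 * initialized == true on the fast path also sees the
		 * populated data (the lock-free reader side needs a matching
		 * read barrier or dependency for this ordering to hold).
		 */
		smp_wmb();
		initialized = true;
	}
	mutex_unlock(&init_mutex);
}

static void caller(void)
{
	if (unlikely(!initialized))	/* fast path: plain flag test */
		lazy_init();
	/* ... use the initialized data ... */
}

This mirrors the register_kprobe()/init_kprobe_blacklist() split above: the common path only pays for a flag test once initialization has happened.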
Re: [RFC PATCH 5/6] ARM, mm: change meaning of max_low_pfn to maximum pfn for nobootmem
Hello, Russell. On Mon, Mar 25, 2013 at 09:48:16AM +0000, Russell King - ARM Linux wrote: On Mon, Mar 25, 2013 at 01:11:13PM +0900, Joonsoo Kim wrote: nobootmem uses max_low_pfn for computing the boundary in free_all_bootmem(), so we need a proper value for max_low_pfn. But there is some difficulty related to max_low_pfn. max_low_pfn is used with two meanings in various architectures. One is the number of pages in lowmem and the other is the maximum lowmem pfn. Currently, ARM uses it as the number of pages in lowmem. You can get more information at the links below. http://lwn.net/Articles/543408/ http://lwn.net/Articles/543424/ As I investigated, more architectures use max_low_pfn as the maximum pfn than the other way, so changing the meaning of max_low_pfn to the maximum pfn is the preferable solution to me. This patch changes max_low_pfn to be the maximum lowmem pfn on ARM. In addition, min_low_pfn and max_pfn are assigned according to this criterion. There is no real user of max_low_pfn except block/blk-settings.c, and blk-settings.c assumes that max_low_pfn is the maximum lowmem pfn, so this patch should not harm anything. This will need some very rigorous testing before it can go into mainline. There has been no attention from others to this patchset. :) Do you have any plan to test this? The best I can do is to run this patchset on my platform. Would you guide me on how to get this patchset into mainline? Thanks.
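To make the two conventions concrete, here is a small illustration with made-up numbers (my own example, not from the patch), assuming a board whose lowmem starts at pfn 0x80000 (a non-zero PHYS_PFN_OFFSET, as is common on ARM) and spans 512MB with 4KB pages:

/* Illustration only; the numbers are hypothetical. */
#define EXAMPLE_PHYS_PFN_OFFSET	0x80000UL			/* first lowmem pfn */
#define EXAMPLE_LOWMEM_PAGES	((512UL << 20) >> 12)		/* 0x20000 pages */

/* "number of pages in lowmem" reading (ARM before this patch): */
unsigned long max_low_pfn_as_count = EXAMPLE_LOWMEM_PAGES;				/* 0x20000 */

/* "maximum lowmem pfn" reading (what this patch switches ARM to): */
unsigned long max_low_pfn_as_limit = EXAMPLE_PHYS_PFN_OFFSET + EXAMPLE_LOWMEM_PAGES;	/* 0xa0000 */

Since free_all_bootmem() in the nobootmem code treats max_low_pfn as an upper boundary pfn, only the second reading gives it a usable value on a board with a non-zero PHYS_PFN_OFFSET, which is what motivates the change.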
Re: [RT LATENCY] 249 microsecond latency caused by slub's unfreeze_partials() code.
Hello, Christoph. On Mon, Apr 01, 2013 at 03:33:23PM +0000, Christoph Lameter wrote: Subject: slub: Fix object counts in acquire_slab V2 It seems that we were overallocating objects from the slab queues since get_partial_node() assumed that page->inuse was undisturbed by acquire_slab(). Save the # of objects in page->lru.next in acquire_slab() and pass it to get_partial_node() that way. I have a vague memory that Joonsoo also ran into this issue awhile back. Yes. I sent a patch for this two months ago. :) Signed-off-by: Christoph Lameter c...@linux.com

Index: linux/mm/slub.c
===================================================================
--- linux.orig/mm/slub.c	2013-03-28 12:14:26.958358688 -0500
+++ linux/mm/slub.c	2013-04-01 10:23:24.677584499 -0500
@@ -1498,6 +1498,7 @@ static inline void *acquire_slab(struct
 	void *freelist;
 	unsigned long counters;
 	struct page new;
+	unsigned long objects;
 
 	/*
 	 * Zap the freelist and set the frozen bit.
@@ -1507,6 +1508,7 @@ static inline void *acquire_slab(struct
 	freelist = page->freelist;
 	counters = page->counters;
 	new.counters = counters;
+	objects = page->inuse;
 	if (mode) {
 		new.inuse = page->objects;
 		new.freelist = NULL;
@@ -1524,6 +1526,7 @@ static inline void *acquire_slab(struct
 		return NULL;
 
 	remove_partial(n, page);
+	page->lru.next = (void *)objects;
 	WARN_ON(!freelist);
 	return freelist;
 }

Good. I like your method which uses lru.next in order to hand over the number of objects.

@@ -1565,7 +1568,7 @@ static void *get_partial_node(struct kme
 			c->page = page;
 			stat(s, ALLOC_FROM_PARTIAL);
 			object = t;
-			available = page->objects - page->inuse;
+			available = page->objects - (unsigned long)page->lru.next;
 		} else {
 			available = put_cpu_partial(s, page, 0);
 			stat(s, CPU_PARTIAL_NODE);

We need one more fix for correctness. When available is assigned by put_cpu_partial(), it doesn't count the cpu slab's objects. Please reference my old patch. https://lkml.org/lkml/2013/1/21/64 Thanks.
Re: [RT LATENCY] 249 microsecond latency caused by slub's unfreeze_partials() code.
Hello, Christoph. On Mon, Apr 01, 2013 at 03:32:43PM +0000, Christoph Lameter wrote: On Thu, 28 Mar 2013, Paul Gortmaker wrote:

Index: linux/init/Kconfig
===================================================================
--- linux.orig/init/Kconfig	2013-03-28 12:14:26.958358688 -0500
+++ linux/init/Kconfig	2013-03-28 12:19:46.275866639 -0500
@@ -1514,6 +1514,14 @@ config SLOB
 
 endchoice
 
+config SLUB_CPU_PARTIAL
+	depends on SLUB
+	bool "SLUB per cpu partial cache"
+	help
+	  Per cpu partial caches accelerate freeing of objects at the
+	  price of more indeterminism in the latency of the free.
+	  Typically one would choose no for a realtime system.
+

Is batch a better description than accelerate? Something like: It's not a batching but a cache that is going to be mainly used for new allocations on the same processor. Per cpu partial caches allow batch freeing of objects to maximize throughput. However, this can increase the length of time spent holding key locks, which can increase latency spikes with respect to responsiveness. Select yes unless you are tuning for a realtime oriented system. Also, I believe this will cause a behaviour change for people who just run make oldconfig, since there is no default line. Meaning that it used to be unconditionally on, but now I think it will be off by default if people just mindlessly hold down the Enter key. Ok. For RT, we'll want default N if RT_FULL (RT_BASE?) but for mainline, I expect you'll want default Y in order to be consistent with previous behaviour? I was not sure exactly how to handle that one yet for realtime. So I need two different patches? I've not built/booted yet, but I'll follow up if I see anything else in doing that. Here is an updated patch. I will also send an updated fixup patch. Subject: slub: Make cpu partial slab support configurable V2 cpu partial support can introduce a level of indeterminism that is not wanted in certain contexts (like a realtime kernel). Make it configurable. Signed-off-by: Christoph Lameter c...@linux.com

Index: linux/include/linux/slub_def.h
===================================================================
--- linux.orig/include/linux/slub_def.h	2013-04-01 10:27:05.908964674 -0500
+++ linux/include/linux/slub_def.h	2013-04-01 10:27:19.905178531 -0500
@@ -47,7 +47,9 @@ struct kmem_cache_cpu {
 	void **freelist;	/* Pointer to next available object */
 	unsigned long tid;	/* Globally unique transaction id */
 	struct page *page;	/* The slab from which we are allocating */
+#ifdef CONFIG_SLUB_CPU_PARTIAL
 	struct page *partial;	/* Partially allocated frozen slabs */
+#endif
 #ifdef CONFIG_SLUB_STATS
 	unsigned stat[NR_SLUB_STAT_ITEMS];
 #endif
@@ -84,7 +86,9 @@ struct kmem_cache {
 	int size;		/* The size of an object including meta data */
 	int object_size;	/* The size of an object without meta data */
 	int offset;		/* Free pointer offset. */
+#ifdef CONFIG_SLUB_CPU_PARTIAL
 	int cpu_partial;	/* Number of per cpu partial objects to keep around */
+#endif
 	struct kmem_cache_order_objects oo;
 
 	/* Allocation and freeing of slabs */

When !CONFIG_SLUB_CPU_PARTIAL, should we really remove these fields? Without removing them, we can make the code simpler and more maintainable.
Index: linux/mm/slub.c
===================================================================
--- linux.orig/mm/slub.c	2013-04-01 10:27:05.908964674 -0500
+++ linux/mm/slub.c	2013-04-01 10:27:19.905178531 -0500
@@ -1531,7 +1531,9 @@ static inline void *acquire_slab(struct
 	return freelist;
 }
 
+#ifdef CONFIG_SLUB_CPU_PARTIAL
 static int put_cpu_partial(struct kmem_cache *s, struct page *page, int drain);
+#endif
 static inline bool pfmemalloc_match(struct page *page, gfp_t gfpflags);
 
 /*
@@ -1570,10 +1572,20 @@ static void *get_partial_node(struct kme
 			object = t;
 			available = page->objects - (unsigned long)page->lru.next;
 		} else {
+#ifdef CONFIG_SLUB_CPU_PARTIAL
 			available = put_cpu_partial(s, page, 0);
 			stat(s, CPU_PARTIAL_NODE);
+#else
+			BUG();
+#endif
 		}
-		if (kmem_cache_debug(s) || available > s->cpu_partial / 2)
+		if (kmem_cache_debug(s) ||
+#ifdef CONFIG_SLUB_CPU_PARTIAL
+			available > s->cpu_partial / 2
+#else
+			available > 0
+#endif
+			)
 			break;
 	}

How about introducing a wrapper function, cpu_partial_enabled(), like kmem_cache_debug()?

int cpu_partial_enabled(s)
{
	return kmem_cache_debug(s) || blablabla
}

As you already know, when
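Spelling that suggestion out a bit, such a wrapper might look roughly like the following. This is only a sketch of the idea (my own guess at the shape, assuming the cpu_partial field stays in struct kmem_cache even for !CONFIG_SLUB_CPU_PARTIAL, as argued above), not a posted patch:

static inline bool cpu_partial_enabled(struct kmem_cache *s)
{
#ifdef CONFIG_SLUB_CPU_PARTIAL
	/* debug caches bypass the per cpu partial lists anyway */
	return !kmem_cache_debug(s);
#else
	return false;
#endif
}

The call site in get_partial_node() would then collapse to:

	if (!cpu_partial_enabled(s) || available > s->cpu_partial / 2)
		break;

which keeps the #ifdef noise in one place instead of spreading it across the loop body.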
Re: [PATCH 5/5] sched: limit sched_slice if it is more than sysctl_sched_latency
Hello, Preeti. On Mon, Apr 01, 2013 at 12:15:50PM +0530, Preeti U Murthy wrote: Hi Joonsoo, On 04/01/2013 10:39 AM, Joonsoo Kim wrote: Hello Preeti. So we should limit this possible weird situation. Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e232421..6ceffbc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -645,6 +645,9 @@ static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	}
 	slice = calc_delta_mine(slice, se->load.weight, load);
+
+	if (unlikely(slice > sysctl_sched_latency))
+		slice = sysctl_sched_latency;

Then in this case the highest priority thread would get 20ms (sysctl_sched_latency), and the rest would get sysctl_sched_min_granularity * 10 * (1024/97977), which would be 0.4ms. Then all tasks would get scheduled at least once within 20ms + (0.4*9)ms = 23.7ms, while your scheduling latency period was extended to 40ms, just so that each of these tasks doesn't have its sched_slice shrunk due to the large number of tasks. I am not sure I understand your question correctly, but I will do my best to answer your comment. :) With this patch, I just limit the maximum slice at one time. Scheduling is controlled through the vruntime. So, in this case, the task with nice -20 will be scheduled twice: 20 + (0.4 * 9) + 20 = 43.9 ms, and after 43.9 ms this process is repeated. So I can tell you that the scheduling period is preserved as before. If we give a task a long slice in one go, it can cause a latency problem. So IMHO, limiting this is meaningful. Thank you very much for the explanation. Just one question. What is the reason behind you choosing sysctl_sched_latency as the upper bound here? sysctl_sched_latency is the sched_slice when there is one task, so I think it is proper as an upper bound. Thanks. Regards Preeti U Murthy
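The arithmetic above can be checked with a tiny standalone calculation (my own sketch, using the default 4ms sysctl_sched_min_granularity, 20ms sysctl_sched_latency, and the standard weight table where nice 0 is 1024 and nice -20 is 88761):

#include <stdio.h>

int main(void)
{
	const double min_gran_ms = 4.0, latency_ms = 20.0;
	const double w_nice0 = 1024.0, w_nice_m20 = 88761.0;
	const int nr_running = 10;		/* one nice -20 task plus nine nice 0 tasks */

	/* nr_running exceeds sched_nr_latency, so the period is stretched */
	double period_ms = min_gran_ms * nr_running;		/* 40 ms */
	double total_w = w_nice_m20 + 9 * w_nice0;		/* 97977 */

	double slice_m20 = period_ms * w_nice_m20 / total_w;	/* ~36.2 ms uncapped */
	double slice_0   = period_ms * w_nice0 / total_w;	/* ~0.42 ms */

	if (slice_m20 > latency_ms)				/* the proposed cap */
		slice_m20 = latency_ms;

	/* one full round: the nice -20 task runs twice, the others once */
	printf("capped slice %.1f ms, nice-0 slice %.2f ms, round %.1f ms\n",
	       slice_m20, slice_0, slice_m20 + 9 * slice_0 + slice_m20);
	return 0;
}

This reproduces the roughly 0.4 ms per nice-0 task and the roughly 44 ms round described above.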
Re: [PATCH 4/5] sched: don't consider upper se in sched_slice()
Hello, Preeti. On Mon, Apr 01, 2013 at 12:36:52PM +0530, Preeti U Murthy wrote: Hi Joonsoo, On 04/01/2013 09:38 AM, Joonsoo Kim wrote: Hello, Preeti. Ideally the children's cpu share must add up to the parent's share. I don't think so. We should schedule out the parent tg if 5ms is over. As we do so, we can fairly distribute time slices to every tg within a short term. If we add the children's cpu share up to the parent's, the parent tg may have a large time slice, so it cannot be preempted easily. There may be a latency problem if there are many tgs. In the case where the #running < sched_nr_latency, the children's sched_slices add up to the parent's. Take a rq with two tgs, each with 3 tasks. Each of these tasks has a sched_slice of [(sysctl_sched_latency / 3) / 2] as of the present implementation. The sum of the above sched_slices of all tasks of a tg will lead to the sched_slice of its parent: sysctl_sched_latency / 2. This breaks when the nr_running on each tg > sched_nr_latency. However I don't know if this is a good thing or a bad thing. Ah.. Now I get your point. Yes, you are right and it may be a good thing. With that property, all tasks in the system can be scheduled at least once in sysctl_sched_latency. sysctl_sched_latency is a system-wide configuration, so my patch may be wrong. With my patch, all tasks in the system cannot be scheduled at least once in sysctl_sched_latency. Instead, it schedules all tasks in a cfs_rq at least once in sysctl_sched_latency if there are no other tgs. I think that it is a real problem that sysctl_sched_min_granularity is not guaranteed for each task. Instead of this patch, how about considering a lower bound? if (slice < sysctl_sched_min_granularity) slice = sysctl_sched_min_granularity; Thanks. Regards Preeti U Murthy
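A quick numeric check of that add-up property (my own sketch, assuming the default 20ms sysctl_sched_latency, two equally weighted task groups, and three nice-0 tasks per group, so nr_running stays below sched_nr_latency and the period is not stretched):

#include <stdio.h>

int main(void)
{
	const double latency_ms = 20.0;

	/* each tg entity carries half of the rq weight, so half the period */
	double tg_slice = latency_ms / 2.0;		/* 10 ms */

	/* inside a tg, three equal tasks split that group's share */
	double task_slice = (latency_ms / 3.0) / 2.0;	/* ~3.33 ms */

	printf("3 * task_slice = %.2f ms, tg_slice = %.2f ms\n",
	       3.0 * task_slice, tg_slice);		/* both print 10.00 */
	return 0;
}

So the three children's slices do sum to the parent's slice in this regime.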
Re: [PATCH 4/5] sched: don't consider upper se in sched_slice()
Hello, Preeti. On Tue, Apr 02, 2013 at 10:25:23AM +0530, Preeti U Murthy wrote: Hi Joonsoo, On 04/02/2013 07:55 AM, Joonsoo Kim wrote: Hello, Preeti. On Mon, Apr 01, 2013 at 12:36:52PM +0530, Preeti U Murthy wrote: Hi Joonsoo, On 04/01/2013 09:38 AM, Joonsoo Kim wrote: Hello, Preeti. Ideally the children's cpu share must add up to the parent's share. I don't think so. We should schedule out the parent tg if 5ms is over. As we do so, we can fairly distribute time slices to every tg within a short term. If we add the children's cpu share up to the parent's, the parent tg may have a large time slice, so it cannot be preempted easily. There may be a latency problem if there are many tgs. In the case where the #running < sched_nr_latency, the children's sched_slices add up to the parent's. A rq with two tgs, each with 3 tasks. Each of these tasks has a sched_slice of [(sysctl_sched_latency / 3) / 2] as of the present implementation. The sum of the above sched_slices of all tasks of a tg will lead to the sched_slice of its parent: sysctl_sched_latency / 2. This breaks when the nr_running on each tg > sched_nr_latency. However I don't know if this is a good thing or a bad thing. Ah.. Now I get your point. Yes, you are right and it may be a good thing. With that property, all tasks in the system can be scheduled at least once in sysctl_sched_latency. sysctl_sched_latency is a system-wide configuration, so my patch may be wrong. With my patch, all tasks in the system cannot be scheduled at least once in sysctl_sched_latency. Instead, it schedules all tasks in a cfs_rq at least once in sysctl_sched_latency if there are no other tgs. Exactly. You have got all the above points right. I think that it is a real problem that sysctl_sched_min_granularity is not guaranteed for each task. Instead of this patch, how about considering a lower bound? if (slice < sysctl_sched_min_granularity) slice = sysctl_sched_min_granularity; Consider the below scenario. A runqueue has two task groups, each with 10 tasks. With the current implementation, each of these tasks gets a sched_slice of 2ms. Hence in a matter of (10*2) + (10*2) = 40 ms, all tasks (all tasks of both the task groups) will get the chance to run. But what is the scheduling period in this scenario? Is it 40ms (extended sysctl_sched_latency), which is the scheduling period for each of the runqueues with 10 tasks in it? Or is it 80ms, which is the total of the scheduling periods of each of the runqueues with 10 tasks? Either way all tasks seem to get scheduled at least once within the scheduling period. So we appear to be safe, although the sched_slice < sched_min_granularity. With your above lower bound of sysctl_sched_min_granularity, each task of each tg gets 4ms as its sched_slice. So in a matter of (10*4) + (10*4) = 80ms, all tasks get to run. With the above question being put forth here as well, we don't appear to be safe if the scheduling_period is considered to be 40ms; otherwise it appears fine to me, because it ensures the sched_slice is at least sched_min_granularity no matter what. So, you mean that we should guarantee to schedule each task at least once in sysctl_sched_latency? But this is not guaranteed in the current code, which is why I made this patch. Please refer to the patch description. Thanks.
Thank you Regards Preeti U Murthy
Re: [PATCH 4/5] sched: don't consider upper se in sched_slice()
Hello, Mike. On Tue, Apr 02, 2013 at 04:35:26AM +0200, Mike Galbraith wrote: On Tue, 2013-04-02 at 11:25 +0900, Joonsoo Kim wrote: Hello, Preeti. On Mon, Apr 01, 2013 at 12:36:52PM +0530, Preeti U Murthy wrote: Hi Joonsoo, On 04/01/2013 09:38 AM, Joonsoo Kim wrote: Hello, Preeti. Ideally the children's cpu share must add up to the parent's share. I don't think so. We should schedule out the parent tg if 5ms is over. As we do so, we can fairly distribute time slices to every tg within a short term. If we add the children's cpu share up to the parent's, the parent tg may have a large time slice, so it cannot be preempted easily. There may be a latency problem if there are many tgs. In the case where the #running < sched_nr_latency, the children's sched_slices add up to the parent's. A rq with two tgs, each with 3 tasks. Each of these tasks has a sched_slice of [(sysctl_sched_latency / 3) / 2] as of the present implementation. The sum of the above sched_slices of all tasks of a tg will lead to the sched_slice of its parent: sysctl_sched_latency / 2. This breaks when the nr_running on each tg > sched_nr_latency. However I don't know if this is a good thing or a bad thing. Ah.. Now I get your point. Yes, you are right and it may be a good thing. With that property, all tasks in the system can be scheduled at least once in sysctl_sched_latency. sysctl_sched_latency is a system-wide configuration, so my patch may be wrong. With my patch, all tasks in the system cannot be scheduled at least once in sysctl_sched_latency. Instead, it schedules all tasks in a cfs_rq at least once in sysctl_sched_latency if there are no other tgs. I think that it is a real problem that sysctl_sched_min_granularity is not guaranteed for each task. Instead of this patch, how about considering a lower bound? if (slice < sysctl_sched_min_granularity) slice = sysctl_sched_min_granularity; How many SCHED_IDLE or +nice tasks will fit in that? It is more related to how many running tasks are in the cfs_rq and how many tgs are in the system. If we have two tgs which have more than sched_nr_latency tasks, all these tasks fit this condition in the current implementation. Thanks. -Mike
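To put numbers behind Mike's question (my own sketch, using the standard weight table where nice 0 is 1024 and SCHED_IDLE is 3, and the default 4ms sysctl_sched_min_granularity):

#include <stdio.h>

int main(void)
{
	const double min_gran_ms = 4.0;
	const double w_nice0 = 1024.0, w_idle = 3.0;
	const int nr_idle = 10;

	/* one nice-0 task sharing a cfs_rq with ten SCHED_IDLE tasks */
	double period_ms = min_gran_ms * (nr_idle + 1);			/* 44 ms */
	double idle_slice = period_ms * w_idle / (w_nice0 + nr_idle * w_idle);

	printf("weighted SCHED_IDLE slice: %.3f ms\n", idle_slice);	/* ~0.125 ms */
	printf("floored slice: %.1f ms (about %.0fx its weighted share)\n",
	       min_gran_ms, min_gran_ms / idle_slice);
	return 0;
}

A flat sysctl_sched_min_granularity floor thus hands very low weight tasks many times their weighted share, which is presumably what the question is getting at.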
Re: [PATCH 2/5] sched: factor out code to should_we_balance()
Hello, Peter. On Tue, Apr 02, 2013 at 10:10:06AM +0200, Peter Zijlstra wrote: On Thu, 2013-03-28 at 16:58 +0900, Joonsoo Kim wrote: Now, checking whether this cpu is appropriate to balance is embedded in update_sg_lb_stats(), and this check has no direct relationship to that function. There is not enough reason to place this check in update_sg_lb_stats(), except to save one iteration over sched_group_cpus. It's only one iteration if there are only 2 groups, but there can be more than 2; take any desktop Intel i7, it will have 4-8 cores, each with HT; thus the CPU domain will have 4-8 groups. And note that local_group is always the first group of a domain, so we'd stop the balance at the first group and avoid touching the other 3-7, avoiding touching cachelines on 6-14 cpus. So this short-cut does make sense.. it's not pretty, granted, but killing it doesn't seem right. It seems that there is some misunderstanding about this patch. In this patch, we don't iterate over all groups. Instead, we iterate over the cpus of the local sched_group only. So there is no penalty as you mentioned. In summary, the net effect is below.
* For cpus which are not appropriate to balance: reduced function calls, reduced memset.
* For cpus which should balance: one extra iteration over the cpus of the local sched_group in should_we_balance(), and fewer branches, so better optimization is expected.
Thanks.
Re: [PATCH 4/5] sched: don't consider upper se in sched_slice()
Hello, Preeti. On Tue, Apr 02, 2013 at 11:02:43PM +0530, Preeti U Murthy wrote: Hi Joonsoo, I think that it is a real problem that sysctl_sched_min_granularity is not guaranteed for each task. Instead of this patch, how about considering a lower bound? if (slice < sysctl_sched_min_granularity) slice = sysctl_sched_min_granularity; Consider the below scenario. A runqueue has two task groups, each with 10 tasks. With the current implementation, each of these tasks gets a sched_slice of 2ms. Hence in a matter of (10*2) + (10*2) = 40 ms, all tasks (all tasks of both the task groups) will get the chance to run. But what is the scheduling period in this scenario? Is it 40ms (extended sysctl_sched_latency), which is the scheduling period for each of the runqueues with 10 tasks in it? Or is it 80ms, which is the total of the scheduling periods of each of the runqueues with 10 tasks? Either way all tasks seem to get scheduled at least once within the scheduling period. So we appear to be safe, although the sched_slice < sched_min_granularity. With your above lower bound of sysctl_sched_min_granularity, each task of each tg gets 4ms as its sched_slice. So in a matter of (10*4) + (10*4) = 80ms, all tasks get to run. With the above question being put forth here as well, we don't appear to be safe if the scheduling_period is considered to be 40ms; otherwise it appears fine to me, because it ensures the sched_slice is at least sched_min_granularity no matter what. So, you mean that we should guarantee to schedule each task at least once in sysctl_sched_latency? I would rather say all tasks get to run at least once in a sched_period, which could be just sysctl_sched_latency or more, depending on the number of tasks, in the current implementation. But this is not guaranteed in the current code, which is why I made this patch. Please refer to the patch description. Ok, take the example of a runqueue with 2 task groups, each with 10 tasks. Same as your previous example. Can you explain how your patch ensures that all 20 tasks get to run at least once in a sched_period? My patch doesn't ensure that :) I just wanted to point out, through my patch description, a problem of the current implementation, which can't ensure that every task runs at least once in a sched_period. So, how about extending the sched_period with rq->nr_running, instead of cfs_rq->nr_running? It is just my quick thought, and I think that we can then ensure every task runs at least once in this extended sched_period. And do we have a problem if we cannot guarantee the at-least-once property? Thanks. Regards Preeti U Murthy
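For illustration, stretching the period by the rq-wide task count rather than the per-cfs_rq count could look roughly like this. This is only a sketch based on the usual shape of __sched_period() in that era; passing rq_of(cfs_rq)->nr_running is my reading of the suggestion, not posted code:

/* Sketch: stretch the period by the number of runnable tasks on the whole rq. */
static u64 __sched_period(unsigned long rq_nr_running)
{
	u64 period = sysctl_sched_latency;
	unsigned long nr_latency = sched_nr_latency;

	if (unlikely(rq_nr_running > nr_latency)) {
		period = sysctl_sched_min_granularity;
		period *= rq_nr_running;	/* rq-wide count, not cfs_rq->nr_running */
	}

	return period;
}

/* callers would pass rq_of(cfs_rq)->nr_running instead of cfs_rq->nr_running */

With that, the period grows with every runnable task on the cpu, so the "every task runs at least once per period" property can hold across task groups rather than only within a single cfs_rq.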
Re: [PATCH 2/5] sched: factor out code to should_we_balance()
Hello, Peter. On Tue, Apr 02, 2013 at 12:29:42PM +0200, Peter Zijlstra wrote: On Tue, 2013-04-02 at 12:00 +0200, Peter Zijlstra wrote: On Tue, 2013-04-02 at 18:50 +0900, Joonsoo Kim wrote: It seems that there is some misunderstanding about this patch. In this patch, we don't iterate over all groups. Instead, we iterate over the cpus of the local sched_group only. So there is no penalty as you mentioned. OK, I'll go stare at it again.. Ah, I see, you're doing should_we_balance() _before_ find_busiest_group() and instead you're doing another for_each_cpu() in there. I'd write the thing like:

static bool should_we_balance(struct lb_env *env)
{
	struct sched_group *sg = env->sd->groups;
	struct cpumask *sg_cpus, *sg_mask;
	int cpu, balance_cpu = -1;

	if (env->idle == CPU_NEWLY_IDLE)
		return true;

	sg_cpus = sched_group_cpus(sg);
	sg_mask = sched_group_mask(sg);

	for_each_cpu_and(cpu, sg_cpus, env->cpus) {
		if (!cpumask_test_cpu(cpu, sg_mask))
			continue;

		if (!idle_cpu(cpu))
			continue;

		balance_cpu = cpu;
		break;
	}

	if (balance_cpu == -1)
		balance_cpu = group_balance_cpu(sg);

	return balance_cpu == env->dst_cpu;
}

Okay. It looks nice. I also considered doing the group_balance_cpu() first to avoid having to do the idle_cpu() scan, but that's a slight behavioural change afaict. As a quick thought, we could avoid it in the below way.

	balance_cpu = group_balance_cpu(sg);
	if (idle_cpu(balance_cpu))
		return balance_cpu == env->dst_cpu;
	else
		/* do the idle_cpu() scan loop */

Is it your idea? If not, please let me know your idea. Thanks.
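For context, the point of the helper is that load_balance() can consult it once, up front, instead of burying the check inside update_sg_lb_stats(). A rough sketch of the call site follows (my own wording of the idea; the exact signature and the name of the out-parameter are assumptions based on the fair.c code of that time, so treat the details accordingly):

static int load_balance(int this_cpu, struct rq *this_rq,
			struct sched_domain *sd, enum cpu_idle_type idle,
			int *balance)
{
	struct lb_env env = {
		.sd		= sd,
		.dst_cpu	= this_cpu,
		.dst_rq		= this_rq,
		.idle		= idle,
		/* .cpus and the remaining fields set up exactly as today */
	};

	/* bail out early if this cpu is not the one that should balance sd */
	if (!should_we_balance(&env)) {
		*balance = 0;
		return 0;
	}

	/* ... the existing find_busiest_group()/find_busiest_queue() path ... */
	return 0;
}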
Re: [RT LATENCY] 249 microsecond latency caused by slub's unfreeze_partials() code.
Hello, Christoph. On Tue, Apr 02, 2013 at 07:25:20PM +0000, Christoph Lameter wrote: On Tue, 2 Apr 2013, Joonsoo Kim wrote: We need one more fix for correctness. When available is assigned by put_cpu_partial(), it doesn't count the cpu slab's objects. Please reference my old patch. https://lkml.org/lkml/2013/1/21/64 Could you update your patch and submit it again? Pekka already applied it. Do we need an update? Thanks.