Re: [PATCH 2/5] mm, highmem: remove useless pool_lock

2012-10-31 Thread JoonSoo Kim
Hello, Andrew.

2012/10/31 Andrew Morton a...@linux-foundation.org:
 On Mon, 29 Oct 2012 04:12:53 +0900
 Joonsoo Kim js1...@gmail.com wrote:

 The pool_lock protects the page_address_pool from concurrent access.
 But, access to the page_address_pool is already protected by kmap_lock.
 So remove it.

 Well, there's a set_page_address() call in mm/page_alloc.c which
 doesn't have lock_kmap().  it doesn't *need* lock_kmap() because it's
 init-time code and we're running single-threaded there.  I hope!

 But this exception should be double-checked and mentioned in the
 changelog, please.  And it's a reason why we can't add
 assert_spin_locked(kmap_lock) to set_page_address(), which is
 unfortunate.

set_page_address() in mm/page_alloc.c is invoked only when
WANT_PAGE_VIRTUAL is defined.
And in that case, set_page_address()'s definition is not in highmem.c,
but in include/linux/mm.h.
So we don't need to worry about the set_page_address() call in mm/page_alloc.c.

 The irq-disabling in this code is odd.  If ARCH_NEEDS_KMAP_HIGH_GET=n,
 we didn't need irq-safe locking in set_page_address().  I guess we'll
 need to retain it in page_address() - I expect some callers have IRQs
 disabled.

As Minchan described, if we don't disable irqs when we take pas->lock,
it could deadlock with page_address().
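
To illustrate the scenario being described (a rough sketch of the interleaving, not code
from the patch; it assumes page_address() may run in interrupt context on the same CPU):

	/*
	 *  set_page_address(page, virtual)       interrupt on the same CPU
	 *    spin_lock(&pas->lock)   // not irq-safe
	 *                                         page_address(other_page)
	 *                                           spin_lock(&pas->lock)  // same hash bucket
	 *                                           // spins forever -> deadlock
	 */

Disabling interrupts around the lock (spin_lock_irqsave) prevents the interrupt from
coming in while the lock is held.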


[PATCH v2 3/5] mm, highmem: remove page_address_pool list

2012-10-31 Thread Joonsoo Kim
We can find a free page_address_map instance without the page_address_pool.
So remove it.

Cc: Mel Gorman m...@csn.ul.ie
Cc: Peter Zijlstra a.p.zijls...@chello.nl
Signed-off-by: Joonsoo Kim js1...@gmail.com
Reviewed-by: Minchan Kim minc...@kernel.org

diff --git a/mm/highmem.c b/mm/highmem.c
index 017bad1..d98b0a9 100644
--- a/mm/highmem.c
+++ b/mm/highmem.c
@@ -324,10 +324,7 @@ struct page_address_map {
struct list_head list;
 };
 
-/*
- * page_address_map freelist, allocated from page_address_maps.
- */
-static struct list_head page_address_pool; /* freelist */
+static struct page_address_map page_address_maps[LAST_PKMAP];
 
 /*
  * Hash table bucket
@@ -392,12 +389,7 @@ void set_page_address(struct page *page, void *virtual)
 
pas = page_slot(page);
if (virtual) {  /* Add */
-   BUG_ON(list_empty(&page_address_pool));
-
-   pam = list_entry(page_address_pool.next,
-   struct page_address_map, list);
-   list_del(&pam->list);
-
+   pam = &page_address_maps[PKMAP_NR((unsigned long)virtual)];
pam->page = page;
pam->virtual = virtual;

@@ -410,7 +402,6 @@ void set_page_address(struct page *page, void *virtual)
if (pam->page == page) {
list_del(&pam->list);
spin_unlock_irqrestore(&pas->lock, flags);
-   list_add_tail(&pam->list, &page_address_pool);
goto done;
}
}
@@ -420,15 +411,10 @@ done:
return;
 }
 
-static struct page_address_map page_address_maps[LAST_PKMAP];
-
 void __init page_address_init(void)
 {
int i;
 
-   INIT_LIST_HEAD(&page_address_pool);
-   for (i = 0; i < ARRAY_SIZE(page_address_maps); i++)
-   list_add(&page_address_maps[i].list, &page_address_pool);
for (i = 0; i < ARRAY_SIZE(page_address_htable); i++) {
INIT_LIST_HEAD(&page_address_htable[i].lh);
spin_lock_init(&page_address_htable[i].lock);
-- 
1.7.9.5



[PATCH v2 5/5] mm, highmem: get virtual address of the page using PKMAP_ADDR()

2012-10-31 Thread Joonsoo Kim
In flush_all_zero_pkmaps(), we have the index of the pkmap associated with the page.
Using this index, we can simply get the virtual address of the page.
So change it.

Cc: Mel Gorman m...@csn.ul.ie
Cc: Peter Zijlstra a.p.zijls...@chello.nl
Signed-off-by: Joonsoo Kim js1...@gmail.com
Reviewed-by: Minchan Kim minc...@kernel.org

diff --git a/mm/highmem.c b/mm/highmem.c
index b365f7b..675ec97 100644
--- a/mm/highmem.c
+++ b/mm/highmem.c
@@ -137,8 +137,7 @@ static unsigned int flush_all_zero_pkmaps(void)
 * So no dangers, even with speculative execution.
 */
page = pte_page(pkmap_page_table[i]);
-   pte_clear(&init_mm, (unsigned long)page_address(page),
- &pkmap_page_table[i]);
+   pte_clear(&init_mm, PKMAP_ADDR(i), &pkmap_page_table[i]);
 
set_page_address(page, NULL);
if (index == PKMAP_INVALID_INDEX)
-- 
1.7.9.5



[PATCH v2 0/5] minor clean-up and optimize highmem related code

2012-10-31 Thread Joonsoo Kim
This patchset cleans up and optimizes highmem-related code.

Changes from v1
Rebase on v3.7-rc3
[4] Instead of returning the index of the last flushed entry, return the first index.
And update last_pkmap_nr to this index to optimize further.

Summary for v1
[1] is just clean-up and doesn't introduce any functional change.
[2-3] are for clean-up and optimization.
These eliminate a useless lock operation and list management.
[4-5] are for optimization related to flush_all_zero_pkmaps().

Joonsoo Kim (5):
  mm, highmem: use PKMAP_NR() to calculate an index of pkmap
  mm, highmem: remove useless pool_lock
  mm, highmem: remove page_address_pool list
  mm, highmem: makes flush_all_zero_pkmaps() return index of first
flushed entry
  mm, highmem: get virtual address of the page using PKMAP_ADDR()

 include/linux/highmem.h |1 +
 mm/highmem.c|  108 ++-
 2 files changed, 51 insertions(+), 58 deletions(-)

-- 
1.7.9.5



[PATCH v2 1/5] mm, highmem: use PKMAP_NR() to calculate an index of pkmap

2012-10-31 Thread Joonsoo Kim
To calculate the index of a pkmap entry, using PKMAP_NR() is more understandable
and maintainable, so change it.
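
For reference, the pkmap index/address macros assumed here are typically defined like
this (e.g. on x86; shown only for illustration, the exact definition is arch-specific):

	#define PKMAP_NR(virt)   (((virt) - PKMAP_BASE) >> PAGE_SHIFT)
	#define PKMAP_ADDR(nr)   (PKMAP_BASE + ((nr) << PAGE_SHIFT))

so the open-coded '(addr - PKMAP_ADDR(0)) >> PAGE_SHIFT' below computes the same index
as PKMAP_NR(addr).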

Cc: Mel Gorman m...@csn.ul.ie
Cc: Peter Zijlstra a.p.zijls...@chello.nl
Signed-off-by: Joonsoo Kim js1...@gmail.com
Reviewed-by: Minchan Kim minc...@kernel.org

diff --git a/mm/highmem.c b/mm/highmem.c
index d517cd1..b3b3d68 100644
--- a/mm/highmem.c
+++ b/mm/highmem.c
@@ -99,7 +99,7 @@ struct page *kmap_to_page(void *vaddr)
unsigned long addr = (unsigned long)vaddr;
 
if (addr >= PKMAP_ADDR(0) && addr <= PKMAP_ADDR(LAST_PKMAP)) {
-   int i = (addr - PKMAP_ADDR(0)) >> PAGE_SHIFT;
+   int i = PKMAP_NR(addr);
return pte_page(pkmap_page_table[i]);
}
 
-- 
1.7.9.5



[PATCH v2 4/5] mm, highmem: makes flush_all_zero_pkmaps() return index of first flushed entry

2012-10-31 Thread Joonsoo Kim
In the current code, after flush_all_zero_pkmaps() is invoked,
we re-iterate over all pkmaps. This can be avoided if flush_all_zero_pkmaps()
returns the index of the first flushed entry: with this index,
we can immediately map a highmem page to the virtual address represented by it.
So change the return type of flush_all_zero_pkmaps()
and return the index of the first flushed entry.

Additionally, update last_pkmap_nr to this index.
Every entry below this index is certainly occupied by another mapping,
therefore updating last_pkmap_nr to this index is a reasonable optimization.
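
Condensed, the two places that consume the new return value look roughly like this
(a sketch, not the literal hunks below):

	/* map_new_virtual(): wrap-around case */
	index = flush_all_zero_pkmaps();
	if (index != PKMAP_INVALID_INDEX)
		vaddr = PKMAP_ADDR(index);	/* map into the freed slot, no re-scan */

	/* kmap_flush_unused() */
	index = flush_all_zero_pkmaps();
	if (index != PKMAP_INVALID_INDEX && index < last_pkmap_nr)
		last_pkmap_nr = index;		/* entries below index are all in use */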

Cc: Mel Gorman m...@csn.ul.ie
Cc: Peter Zijlstra a.p.zijls...@chello.nl
Cc: Minchan Kim minc...@kernel.org
Signed-off-by: Joonsoo Kim js1...@gmail.com

diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index ef788b5..97ad208 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -32,6 +32,7 @@ static inline void invalidate_kernel_vmap_range(void *vaddr, int size)
 
 #ifdef CONFIG_HIGHMEM
 #include <asm/highmem.h>
+#define PKMAP_INVALID_INDEX (LAST_PKMAP)
 
 /* declarations for linux/mm/highmem.c */
 unsigned int nr_free_highpages(void);
diff --git a/mm/highmem.c b/mm/highmem.c
index d98b0a9..b365f7b 100644
--- a/mm/highmem.c
+++ b/mm/highmem.c
@@ -106,10 +106,10 @@ struct page *kmap_to_page(void *vaddr)
return virt_to_page(addr);
 }
 
-static void flush_all_zero_pkmaps(void)
+static unsigned int flush_all_zero_pkmaps(void)
 {
int i;
-   int need_flush = 0;
+   unsigned int index = PKMAP_INVALID_INDEX;
 
flush_cache_kmaps();
 
@@ -141,10 +141,13 @@ static void flush_all_zero_pkmaps(void)
  pkmap_page_table[i]);
 
set_page_address(page, NULL);
-   need_flush = 1;
+   if (index == PKMAP_INVALID_INDEX)
+   index = i;
}
-   if (need_flush)
+   if (index != PKMAP_INVALID_INDEX)
flush_tlb_kernel_range(PKMAP_ADDR(0), PKMAP_ADDR(LAST_PKMAP));
+
+   return index;
 }
 
 /**
@@ -152,14 +155,19 @@ static void flush_all_zero_pkmaps(void)
  */
 void kmap_flush_unused(void)
 {
+   unsigned int index;
+
lock_kmap();
-   flush_all_zero_pkmaps();
+   index = flush_all_zero_pkmaps();
+   if (index != PKMAP_INVALID_INDEX && (index < last_pkmap_nr))
+   last_pkmap_nr = index;
unlock_kmap();
 }
 
 static inline unsigned long map_new_virtual(struct page *page)
 {
unsigned long vaddr;
+   unsigned int index = PKMAP_INVALID_INDEX;
int count;
 
 start:
@@ -168,40 +176,45 @@ start:
for (;;) {
last_pkmap_nr = (last_pkmap_nr + 1) & LAST_PKMAP_MASK;
if (!last_pkmap_nr) {
-   flush_all_zero_pkmaps();
-   count = LAST_PKMAP;
+   index = flush_all_zero_pkmaps();
+   break;
}
-   if (!pkmap_count[last_pkmap_nr])
+   if (!pkmap_count[last_pkmap_nr]) {
+   index = last_pkmap_nr;
break;  /* Found a usable entry */
-   if (--count)
-   continue;
-
-   /*
-* Sleep for somebody else to unmap their entries
-*/
-   {
-   DECLARE_WAITQUEUE(wait, current);
-
-   __set_current_state(TASK_UNINTERRUPTIBLE);
-   add_wait_queue(&pkmap_map_wait, &wait);
-   unlock_kmap();
-   schedule();
-   remove_wait_queue(&pkmap_map_wait, &wait);
-   lock_kmap();
-
-   /* Somebody else might have mapped it while we slept */
-   if (page_address(page))
-   return (unsigned long)page_address(page);
-
-   /* Re-start */
-   goto start;
}
+   if (--count == 0)
+   break;
}
-   vaddr = PKMAP_ADDR(last_pkmap_nr);
+
+   /*
+* Sleep for somebody else to unmap their entries
+*/
+   if (index == PKMAP_INVALID_INDEX) {
+   DECLARE_WAITQUEUE(wait, current);
+
+   __set_current_state(TASK_UNINTERRUPTIBLE);
+   add_wait_queue(&pkmap_map_wait, &wait);
+   unlock_kmap();
+   schedule();
+   remove_wait_queue(&pkmap_map_wait, &wait);
+   lock_kmap();
+
+   /* Somebody else might have mapped it while we slept */
+   vaddr = (unsigned long)page_address(page);
+   if (vaddr)
+   return vaddr;
+
+   /* Re-start */
+   goto start;
+   }
+
+   vaddr = PKMAP_ADDR(index);
set_pte_at(&init_mm, vaddr,
-  &(pkmap_page_table[last_pkmap_nr]), mk_pte(page, kmap_prot));
+  &(pkmap_page_table[index

[PATCH v2 2/5] mm, highmem: remove useless pool_lock

2012-10-31 Thread Joonsoo Kim
The pool_lock protects the page_address_pool from concurrent access.
But, access to the page_address_pool is already protected by kmap_lock.
So remove it.

Cc: Mel Gorman m...@csn.ul.ie
Cc: Peter Zijlstra a.p.zijls...@chello.nl
Signed-off-by: Joonsoo Kim js1...@gmail.com
Reviewed-by: Minchan Kim minc...@kernel.org

diff --git a/mm/highmem.c b/mm/highmem.c
index b3b3d68..017bad1 100644
--- a/mm/highmem.c
+++ b/mm/highmem.c
@@ -328,7 +328,6 @@ struct page_address_map {
  * page_address_map freelist, allocated from page_address_maps.
  */
 static struct list_head page_address_pool; /* freelist */
-static spinlock_t pool_lock;   /* protects page_address_pool */
 
 /*
  * Hash table bucket
@@ -395,11 +394,9 @@ void set_page_address(struct page *page, void *virtual)
if (virtual) {  /* Add */
BUG_ON(list_empty(&page_address_pool));
 
-   spin_lock_irqsave(&pool_lock, flags);
pam = list_entry(page_address_pool.next,
struct page_address_map, list);
list_del(&pam->list);
-   spin_unlock_irqrestore(&pool_lock, flags);
 
pam->page = page;
pam->virtual = virtual;
@@ -413,9 +410,7 @@ void set_page_address(struct page *page, void *virtual)
if (pam->page == page) {
list_del(&pam->list);
spin_unlock_irqrestore(&pas->lock, flags);
-   spin_lock_irqsave(&pool_lock, flags);
list_add_tail(&pam->list, &page_address_pool);
-   spin_unlock_irqrestore(&pool_lock, flags);
goto done;
}
}
@@ -438,7 +433,6 @@ void __init page_address_init(void)
INIT_LIST_HEAD(&page_address_htable[i].lh);
spin_lock_init(&page_address_htable[i].lock);
}
-   spin_lock_init(&pool_lock);
 }
 
 #endif /* defined(CONFIG_HIGHMEM)  !defined(WANT_PAGE_VIRTUAL) */
-- 
1.7.9.5



Re: [PATCH 0/5] minor clean-up and optimize highmem related code

2012-10-31 Thread JoonSoo Kim
Hello, Andrew.

2012/10/29 JoonSoo Kim js1...@gmail.com:
 Hi, Minchan.

 2012/10/29 Minchan Kim minc...@kernel.org:
 Hi Joonsoo,

 On Mon, Oct 29, 2012 at 04:12:51AM +0900, Joonsoo Kim wrote:
 This patchset clean-up and optimize highmem related code.

 [1] is just clean-up and doesn't introduce any functional change.
 [2-3] are for clean-up and optimization.
 These eliminate an useless lock opearation and list management.
 [4-5] is for optimization related to flush_all_zero_pkmaps().

 Joonsoo Kim (5):
   mm, highmem: use PKMAP_NR() to calculate an index of pkmap
   mm, highmem: remove useless pool_lock
   mm, highmem: remove page_address_pool list
   mm, highmem: makes flush_all_zero_pkmaps() return index of last
 flushed entry
   mm, highmem: get virtual address of the page using PKMAP_ADDR()

 This patchset looks awesome to me.
 If you have a plan to respin, please CCed Peter.

 Thanks for review.
 I will wait more review and respin, the day after tomorrow.
 Version 2 will include fix about your comment.

Could you pick up the second version of this patchset?

[3] is changed to leave one blank line.
[4] is changed to optimize further, according to Minchan's comment:
it returns the first index of a flushed entry instead of the last index.
The others are unchanged.


Re: [PATCH v2 4/5] mm, highmem: makes flush_all_zero_pkmaps() return index of first flushed entry

2012-11-02 Thread JoonSoo Kim
Hello, Minchan.

2012/11/1 Minchan Kim minc...@kernel.org:
 On Thu, Nov 01, 2012 at 01:56:36AM +0900, Joonsoo Kim wrote:
 In current code, after flush_all_zero_pkmaps() is invoked,
 then re-iterate all pkmaps. It can be optimized if flush_all_zero_pkmaps()
 return index of first flushed entry. With this index,
 we can immediately map highmem page to virtual address represented by index.
 So change return type of flush_all_zero_pkmaps()
 and return index of first flushed entry.

 Additionally, update last_pkmap_nr to this index.
 It is certain that entry which is below this index is occupied by other 
 mapping,
 therefore updating last_pkmap_nr to this index is reasonable optimization.

 Cc: Mel Gorman m...@csn.ul.ie
 Cc: Peter Zijlstra a.p.zijls...@chello.nl
 Cc: Minchan Kim minc...@kernel.org
 Signed-off-by: Joonsoo Kim js1...@gmail.com

 diff --git a/include/linux/highmem.h b/include/linux/highmem.h
 index ef788b5..97ad208 100644
 --- a/include/linux/highmem.h
 +++ b/include/linux/highmem.h
 @@ -32,6 +32,7 @@ static inline void invalidate_kernel_vmap_range(void 
 *vaddr, int size)

  #ifdef CONFIG_HIGHMEM
  #include asm/highmem.h
 +#define PKMAP_INVALID_INDEX (LAST_PKMAP)

  /* declarations for linux/mm/highmem.c */
  unsigned int nr_free_highpages(void);
 diff --git a/mm/highmem.c b/mm/highmem.c
 index d98b0a9..b365f7b 100644
 --- a/mm/highmem.c
 +++ b/mm/highmem.c
 @@ -106,10 +106,10 @@ struct page *kmap_to_page(void *vaddr)
   return virt_to_page(addr);
  }

 -static void flush_all_zero_pkmaps(void)
 +static unsigned int flush_all_zero_pkmaps(void)
  {
   int i;
 - int need_flush = 0;
 + unsigned int index = PKMAP_INVALID_INDEX;

   flush_cache_kmaps();

 @@ -141,10 +141,13 @@ static void flush_all_zero_pkmaps(void)
 pkmap_page_table[i]);

   set_page_address(page, NULL);
 - need_flush = 1;
 + if (index == PKMAP_INVALID_INDEX)
 + index = i;
   }
 - if (need_flush)
 + if (index != PKMAP_INVALID_INDEX)
   flush_tlb_kernel_range(PKMAP_ADDR(0), PKMAP_ADDR(LAST_PKMAP));
 +
 + return index;
  }

  /**
 @@ -152,14 +155,19 @@ static void flush_all_zero_pkmaps(void)
   */
  void kmap_flush_unused(void)
  {
 + unsigned int index;
 +
   lock_kmap();
 - flush_all_zero_pkmaps();
 + index = flush_all_zero_pkmaps();
 + if (index != PKMAP_INVALID_INDEX && (index < last_pkmap_nr))
 + last_pkmap_nr = index;

 I don't know how kmap_flush_unused is really fast path so how my nitpick
 is effective. Anyway,
 What problem happens if we do following as?

 lock()
 index = flush_all_zero_pkmaps();
 if (index != PKMAP_INVALID_INDEX)
 last_pkmap_nr = index;
 unlock();

 Normally, last_pkmap_nr is increased with searching empty slot in
 map_new_virtual. So I expect return value of flush_all_zero_pkmaps
 in kmap_flush_unused normally become either less than last_pkmap_nr
 or last_pkmap_nr + 1.

There is a case where the return value of flush_all_zero_pkmaps() called from
kmap_flush_unused() is larger than last_pkmap_nr.
Look at the following example.

Assume last_pkmap_nr = 20 and indexes 1-9 and 11-19 are kmapped; 10 is kunmapped.

do kmap_flush_unused() => flush index 10 => last_pkmap_nr = 10
do kunmap() with index 17
do kmap_flush_unused() => flush index 17

So, a slightly dirty implementation is needed.
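
In other words, the guard that is already in this patch's kmap_flush_unused() covers
that case:

	index = flush_all_zero_pkmaps();
	if (index != PKMAP_INVALID_INDEX && (index < last_pkmap_nr))
		last_pkmap_nr = index;

Without the 'index < last_pkmap_nr' check, the second flush in the example above would
move last_pkmap_nr forward from 10 to 17, and map_new_virtual() would not see the
still-free slot 10 again until the scan wraps around.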

Thanks.


Re: [PATCH v6 00/29] kmem controller for memcg.

2012-11-02 Thread JoonSoo Kim
Hello, Glauber.

2012/11/2 Glauber Costa glom...@parallels.com:
 On 11/02/2012 04:04 AM, Andrew Morton wrote:
 On Thu,  1 Nov 2012 16:07:16 +0400
 Glauber Costa glom...@parallels.com wrote:

 Hi,

 This work introduces the kernel memory controller for memcg. Unlike previous
 submissions, this includes the whole controller, comprised of slab and stack
 memory.

 I'm in the middle of (re)reading all this.  Meanwhile I'll push it all
 out to http://ozlabs.org/~akpm/mmots/ for the crazier testers.

 One thing:

 Numbers can be found at https://lkml.org/lkml/2012/9/13/239

 You claim in the above that the fork worload is 'slab intensive.  Or
 at least, you seem to - it's a bit fuzzy.

 But how slab intensive is it, really?

 What is extremely slab intensive is networking.  The networking guys
 are very sensitive to slab performance.  If this hasn't already been
 done, could you please determine what impact this has upon networking?
 I expect Eric Dumazet, Dave Miller and Tom Herbert could suggest
 testing approaches.


 I can test it, but unfortunately I am unlikely to get to prepare a good
 environment before Barcelona.

 I know, however, that Greg Thelen was testing netperf in his setup.
 Greg, do you have any publishable numbers you could share?

Below is my humble opinion.
I am worried about the data cache footprint possibly caused by
this patchset, especially the slab implementation.
If there are several memcg cgroups, each cgroup has its own kmem_caches.
When each group does slab-intensive work heavily, the data cache may overflow easily,
and the cache miss rate will be high, therefore this could hurt system
performance significantly.
Are there any results for this?

Thanks.


[PATCH for-v3.7 1/2] slub: optimize poorly inlined kmalloc* functions

2012-10-20 Thread Joonsoo Kim
kmalloc() and kmalloc_node() are always inlined into generic code.
However, there is a mistake in the SLUB implementation.

In kmalloc() and kmalloc_node() of the SLUB,
we try to compare kmalloc_caches[index] with NULL.
As that cannot be known at compile time,
this comparison is inserted into the generic code invoking kmalloc*.
This may decrease system performance, so we should fix it.
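
Roughly, the difference in what the compiler can fold away at a constant-size call site
is the following (a simplified sketch of the hunks below):

	/* before: s is a value loaded at run time, so the NULL test is emitted */
	struct kmem_cache *s = kmalloc_slab(size);
	if (!s)
		return ZERO_SIZE_PTR;

	/* after: index is a compile-time constant, so the test folds away */
	int index = kmalloc_index(size);
	struct kmem_cache *s = kmalloc_caches[index];
	if (!index)
		return ZERO_SIZE_PTR;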

Below is the result of 'size vmlinux'.
The text size is decreased by roughly 20KB.

Before:
    text     data     bss      dec      hex     filename
10044177  1443168  5722112  17209457  1069871  vmlinux
After:
    text     data     bss      dec      hex     filename
10022627  1443136  5722112  17187875  1064423  vmlinux

Cc: Christoph Lameter c...@linux.com
Signed-off-by: Joonsoo Kim js1...@gmail.com
---
With Christoph's patchset (common kmalloc caches:
'[15/15] Common Kmalloc cache determination'), which is not merged into mainline
yet, this issue will be fixed.
As that will take some time, I am sending this patch for v3.7.

This patch is based on v3.7-rc1

diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index df448ad..4c75f2b 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -271,9 +271,10 @@ static __always_inline void *kmalloc(size_t size, gfp_t flags)
return kmalloc_large(size, flags);
 
if (!(flags & SLUB_DMA)) {
-   struct kmem_cache *s = kmalloc_slab(size);
+   int index = kmalloc_index(size);
+   struct kmem_cache *s = kmalloc_caches[index];
 
-   if (!s)
+   if (!index)
return ZERO_SIZE_PTR;
 
return kmem_cache_alloc_trace(s, flags, size);
@@ -304,9 +305,10 @@ static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
 {
if (__builtin_constant_p(size) &&
size <= SLUB_MAX_SIZE && !(flags & SLUB_DMA)) {
-   struct kmem_cache *s = kmalloc_slab(size);
+   int index = kmalloc_index(size);
+   struct kmem_cache *s = kmalloc_caches[index];
 
-   if (!s)
+   if (!index)
return ZERO_SIZE_PTR;
 
return kmem_cache_alloc_node_trace(s, flags, node, size);
-- 
1.7.9.5



[PATCH for-v3.7 2/2] slub: optimize kmalloc* inlining for GFP_DMA

2012-10-20 Thread Joonsoo Kim
kmalloc() and kmalloc_node() of the SLUB aren't inlined when @flags = __GFP_DMA.
This patch optimizes this case,
so when @flags = __GFP_DMA, they will be inlined into generic code.
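
For example, a constant-size call site like the following (hypothetical caller, just to
show the affected case) can now resolve its cache inline instead of always falling back
to the out-of-line __kmalloc() path:

	void *buf = kmalloc(128, GFP_KERNEL | GFP_DMA);	/* picks kmalloc_dma_caches[index]
							   at compile time when CONFIG_ZONE_DMA=y */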

Cc: Christoph Lameter c...@linux.com
Signed-off-by: Joonsoo Kim js1...@gmail.com

diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index 4c75f2b..4adf50b 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -147,6 +147,7 @@ struct kmem_cache {
  * 2^x bytes of allocations.
  */
 extern struct kmem_cache *kmalloc_caches[SLUB_PAGE_SHIFT];
+extern struct kmem_cache *kmalloc_dma_caches[SLUB_PAGE_SHIFT];
 
 /*
  * Sorry that the following has to be that ugly but some versions of GCC
@@ -266,19 +267,24 @@ static __always_inline void *kmalloc_large(size_t size, gfp_t flags)
 
 static __always_inline void *kmalloc(size_t size, gfp_t flags)
 {
+   struct kmem_cache *s;
+   int index;
+
if (__builtin_constant_p(size)) {
if (size > SLUB_MAX_SIZE)
return kmalloc_large(size, flags);
 
-   if (!(flags & SLUB_DMA)) {
-   int index = kmalloc_index(size);
-   struct kmem_cache *s = kmalloc_caches[index];
-
-   if (!index)
-   return ZERO_SIZE_PTR;
+   index = kmalloc_index(size);
+   if (!index)
+   return ZERO_SIZE_PTR;
+#ifdef CONFIG_ZONE_DMA
+   if (unlikely(flags & SLUB_DMA)) {
+   s = kmalloc_dma_caches[index];
+   } else
+#endif
+   s = kmalloc_caches[index];
 
-   return kmem_cache_alloc_trace(s, flags, size);
-   }
+   return kmem_cache_alloc_trace(s, flags, size);
}
return __kmalloc(size, flags);
 }
@@ -303,13 +309,19 @@ kmem_cache_alloc_node_trace(struct kmem_cache *s,
 
 static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
 {
-   if (__builtin_constant_p(size) &&
-   size <= SLUB_MAX_SIZE && !(flags & SLUB_DMA)) {
-   int index = kmalloc_index(size);
-   struct kmem_cache *s = kmalloc_caches[index];
+   struct kmem_cache *s;
+   int index;
 
+   if (__builtin_constant_p(size) && size <= SLUB_MAX_SIZE) {
+   index = kmalloc_index(size);
if (!index)
return ZERO_SIZE_PTR;
+#ifdef CONFIG_ZONE_DMA
+   if (unlikely(flags & SLUB_DMA)) {
+   s = kmalloc_dma_caches[index];
+   } else
+#endif
+   s = kmalloc_caches[index];
 
return kmem_cache_alloc_node_trace(s, flags, node, size);
}
diff --git a/mm/slub.c b/mm/slub.c
index a0d6984..a94533c 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3222,7 +3222,8 @@ struct kmem_cache *kmalloc_caches[SLUB_PAGE_SHIFT];
 EXPORT_SYMBOL(kmalloc_caches);
 
 #ifdef CONFIG_ZONE_DMA
-static struct kmem_cache *kmalloc_dma_caches[SLUB_PAGE_SHIFT];
+struct kmem_cache *kmalloc_dma_caches[SLUB_PAGE_SHIFT];
+EXPORT_SYMBOL(kmalloc_dma_caches);
 #endif
 
 static int __init setup_slub_min_order(char *str)
-- 
1.7.9.5



[PATCH 0/3] workqueue: minor cleanup

2012-10-20 Thread Joonsoo Kim
This patchset do minor cleanup for workqueue code.

The first patch makes a minor behavior change, but it is trivial.
The others don't make any functional difference.
These are based on v3.7-rc1.

Joonsoo Kim (3):
  workqueue: optimize mod_delayed_work_on() when @delay == 0
  workqueue: trivial fix for return statement in work_busy()
  workqueue: remove unused argument of wq_worker_waking_up()

 kernel/sched/core.c  |2 +-
 kernel/workqueue.c   |   11 +++
 kernel/workqueue_sched.h |2 +-
 3 files changed, 9 insertions(+), 6 deletions(-)

-- 
1.7.9.5



[PATCH 1/3] workqueue: optimize mod_delayed_work_on() when @delay == 0

2012-10-20 Thread Joonsoo Kim
After try_to_grab_pending(), __queue_delayed_work() is invoked
in mod_delayed_work_on(). When @delay == 0, we can call __queue_work()
directly in order to avoid setting a useless timer.
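
For instance, a caller doing an immediate (re)queue (a hypothetical call, just to show
the affected case):

	mod_delayed_work_on(cpu, wq, &dwork, 0);	/* with this patch: queues the work
							   directly instead of arming a
							   0-jiffy timer first */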

Signed-off-by: Joonsoo Kim js1...@gmail.com

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index d951daa..c57358e 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1477,7 +1477,11 @@ bool mod_delayed_work_on(int cpu, struct workqueue_struct *wq,
} while (unlikely(ret == -EAGAIN));
 
if (likely(ret >= 0)) {
-   __queue_delayed_work(cpu, wq, dwork, delay);
+   if (!delay)
+   __queue_work(cpu, wq, &dwork->work);
+   else
+   __queue_delayed_work(cpu, wq, dwork, delay);
+
local_irq_restore(flags);
}
 
-- 
1.7.9.5



[PATCH 3/3] workqueue: remove unused argument of wq_worker_waking_up()

2012-10-20 Thread Joonsoo Kim
Commit 63d95a91 ('workqueue: use @pool instead of @gcwq or @cpu where
applicable') changed the approach to accessing nr_running.
Thus, wq_worker_waking_up() doesn't use @cpu anymore.
Remove it and the comment related to it.

Signed-off-by: Joonsoo Kim js1...@gmail.com

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2d8927f..30a23d0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1243,7 +1243,7 @@ static void ttwu_activate(struct rq *rq, struct task_struct *p, int en_flags)
 
/* if a worker is waking up, notify workqueue */
if (p->flags & PF_WQ_WORKER)
-   wq_worker_waking_up(p, cpu_of(rq));
+   wq_worker_waking_up(p);
 }
 
 /*
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 27a6dee..daf101c 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -727,7 +727,6 @@ static void wake_up_worker(struct worker_pool *pool)
 /**
  * wq_worker_waking_up - a worker is waking up
  * @task: task waking up
- * @cpu: CPU @task is waking up to
  *
  * This function is called during try_to_wake_up() when a worker is
  * being awoken.
@@ -735,7 +734,7 @@ static void wake_up_worker(struct worker_pool *pool)
  * CONTEXT:
 * spin_lock_irq(rq->lock)
  */
-void wq_worker_waking_up(struct task_struct *task, unsigned int cpu)
+void wq_worker_waking_up(struct task_struct *task)
 {
struct worker *worker = kthread_data(task);
 
diff --git a/kernel/workqueue_sched.h b/kernel/workqueue_sched.h
index 2d10fc9..c1b45a5 100644
--- a/kernel/workqueue_sched.h
+++ b/kernel/workqueue_sched.h
@@ -4,6 +4,6 @@
  * Scheduler hooks for concurrency managed workqueue.  Only to be
  * included from sched.c and workqueue.c.
  */
-void wq_worker_waking_up(struct task_struct *task, unsigned int cpu);
+void wq_worker_waking_up(struct task_struct *task);
 struct task_struct *wq_worker_sleeping(struct task_struct *task,
   unsigned int cpu);
-- 
1.7.9.5



[PATCH 2/3] workqueue: trivial fix for return statement in work_busy()

2012-10-20 Thread Joonsoo Kim
The return type of work_busy() is unsigned int.
There is a return statement returning the boolean value 'false' in work_busy().
It is not a problem, because 'false' is treated as '0'.
However, fixing it makes the code more robust.

Signed-off-by: Joonsoo Kim js1...@gmail.com

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index c57358e..27a6dee 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3479,7 +3479,7 @@ unsigned int work_busy(struct work_struct *work)
unsigned int ret = 0;
 
if (!gcwq)
-   return false;
+   return 0;
 
spin_lock_irqsave(&gcwq->lock, flags);
 
-- 
1.7.9.5



Re: [PATCH 2/2] slab: move kmem_cache_free to common code

2012-10-22 Thread JoonSoo Kim
Hello, Glauber.

2012/10/23 Glauber Costa glom...@parallels.com:
 On 10/22/2012 06:45 PM, Christoph Lameter wrote:
 On Mon, 22 Oct 2012, Glauber Costa wrote:

 + * kmem_cache_free - Deallocate an object
 + * @cachep: The cache the allocation was from.
 + * @objp: The previously allocated object.
 + *
 + * Free an object which was previously allocated from this
 + * cache.
 + */
 +void kmem_cache_free(struct kmem_cache *s, void *x)
 +{
 +__kmem_cache_free(s, x);
 +trace_kmem_cache_free(_RET_IP_, x);
 +}
 +EXPORT_SYMBOL(kmem_cache_free);
 +

 This results in an additional indirection if tracing is off. Wonder if
 there is a performance impact?

 if tracing is on, you mean?

 Tracing already incurs overhead, not sure how much a function call would
 add to the tracing overhead.

 I would not be concerned with this, but I can measure, if you have any
 specific workload in mind.

With this patch, kmem_cache_free() invokes __kmem_cache_free(),
that is, it adds one more call instruction than before.

I think that is what Christoph's comment refers to.

Thanks.


Re: [PATCH 3/3] workqueue: remove unused argument of wq_worker_waking_up()

2012-10-22 Thread JoonSoo Kim
2012/10/21 Tejun Heo t...@kernel.org:
 On Sun, Oct 21, 2012 at 01:30:07AM +0900, Joonsoo Kim wrote:
 Commit 63d95a91 ('workqueue: use @pool instead of @gcwq or @cpu where
 applicable') changes an approach to access nr_running.
 Thus, wq_worker_waking_up() doesn't use @cpu anymore.
 Remove it and remove comment related to it.

 Signed-off-by: Joonsoo Kim js1...@gmail.com

 I'm not sure whether I wanna remove or add WARN_ON_ONCE() on it.  That
 part has gone through some changes and seen some bugs.  Can we please
 do the following instead at least for now?

 if (!(worker->flags & WORKER_NOT_RUNNING)) {
 WARN_ON_ONCE(worker->pool->gcwq->cpu != cpu);
 atomic_inc(get_pool_nr_running(worker->pool));
 }


I have no objection to doing this for now.
Thanks.


Re: [PATCH for-v3.7 2/2] slub: optimize kmalloc* inlining for GFP_DMA

2012-10-22 Thread JoonSoo Kim
2012/10/22 Christoph Lameter c...@linux.com:
 On Sun, 21 Oct 2012, Joonsoo Kim wrote:

 kmalloc() and kmalloc_node() of the SLUB isn't inlined when @flags = 
 __GFP_DMA.
 This patch optimize this case,
 so when @flags = __GFP_DMA, it will be inlined into generic code.

 __GFP_DMA is a rarely used flag for kmalloc allocators and so far it was
 not considered that it is worth to directly support it in the inlining
 code.



Hmm... but the SLAB allocator already does that optimization for __GFP_DMA.
Almost every kmalloc() is invoked with a constant flags value,
so I think the overhead from this patch should be negligible.
With this patch, the code size of vmlinux is reduced slightly.


Re: [PATCH 2/2] slab: move kmem_cache_free to common code

2012-10-23 Thread JoonSoo Kim
2012/10/23 Glauber Costa glom...@parallels.com:
 On 10/23/2012 12:07 PM, Glauber Costa wrote:
 On 10/23/2012 04:48 AM, JoonSoo Kim wrote:
 Hello, Glauber.

 2012/10/23 Glauber Costa glom...@parallels.com:
 On 10/22/2012 06:45 PM, Christoph Lameter wrote:
 On Mon, 22 Oct 2012, Glauber Costa wrote:

 + * kmem_cache_free - Deallocate an object
 + * @cachep: The cache the allocation was from.
 + * @objp: The previously allocated object.
 + *
 + * Free an object which was previously allocated from this
 + * cache.
 + */
 +void kmem_cache_free(struct kmem_cache *s, void *x)
 +{
 +__kmem_cache_free(s, x);
 +trace_kmem_cache_free(_RET_IP_, x);
 +}
 +EXPORT_SYMBOL(kmem_cache_free);
 +

 This results in an additional indirection if tracing is off. Wonder if
 there is a performance impact?

 if tracing is on, you mean?

 Tracing already incurs overhead, not sure how much a function call would
 add to the tracing overhead.

 I would not be concerned with this, but I can measure, if you have any
 specific workload in mind.

 With this patch, kmem_cache_free() invokes __kmem_cache_free(),
 that is, it add one more call instruction than before.

 I think that Christoph's comment means above fact.

 Ah, this. Ok, I got fooled by his mention to tracing.

 I do agree, but since freeing is ultimately dependent on the allocator
 layout, I don't see a clean way of doing this without dropping tears of
 sorrow around. The calls in slub/slab/slob would have to be somehow
 inlined. Hum... maybe it is possible to do it from
 include/linux/sl*b_def.h...

 Let me give it a try and see what I can come up with.


 Ok.

 I am attaching a PoC for this for your appreciation. This gets quite
 ugly, but it's the way I found without including sl{a,u,o}b.c directly -
 which would be even worse.

Hmm...
This is an important issue for the sl[aou]b common allocators,
because there are similar functions such as kmem_cache_alloc, ksize, kfree, ...
So it is a good time to resolve this issue.

As far as I know, we now have 3 solutions.

1. include/linux/slab.h (see the sketch after this list)
__always_inline kmem_cache_free()
{
__kmem_cache_free();
blablabla...
}

2. define a macro, as in Glauber's solution
3. include sl[aou]b.c directly.
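
A slightly more concrete sketch of solution 1 (a hypothetical header-only wrapper, using
the names from this thread):

/* include/linux/slab.h */
static __always_inline void kmem_cache_free(struct kmem_cache *s, void *x)
{
	__kmem_cache_free(s, x);		/* allocator-specific fast path */
	trace_kmem_cache_free(_RET_IP_, x);	/* common tracepoint */
}

i.e. the wrapper from the patch, but marked __always_inline so the extra call disappears.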

Is there another good solution?
Among them, I prefer solution 3, because its future development cost may
be the lowest among them.

Solution 2 may be error-prone for future development.
Solution 1 may make compile time longer and the code larger.

Is my understanding right?
Is solution 3 really that ugly?

Thanks.


Re: [PATCH for-v3.7 2/2] slub: optimize kmalloc* inlining for GFP_DMA

2012-10-23 Thread JoonSoo Kim
Hi, Eric.

2012/10/23 Eric Dumazet eric.duma...@gmail.com:
 On Tue, 2012-10-23 at 11:29 +0900, JoonSoo Kim wrote:
 2012/10/22 Christoph Lameter c...@linux.com:
  On Sun, 21 Oct 2012, Joonsoo Kim wrote:
 
  kmalloc() and kmalloc_node() of the SLUB isn't inlined when @flags = 
  __GFP_DMA.
  This patch optimize this case,
  so when @flags = __GFP_DMA, it will be inlined into generic code.
 
  __GFP_DMA is a rarely used flag for kmalloc allocators and so far it was
  not considered that it is worth to directly support it in the inlining
  code.
 
 

 Hmm... but, the SLAB already did that optimization for __GFP_DMA.
 Almost every kmalloc() is invoked with constant flags value,
 so I think that overhead from this patch may be negligible.
 With this patch, code size of vmlinux is reduced slightly.

 Only because you asked a allyesconfig

 GFP_DMA is used for less than 0.1 % of kmalloc() calls, for legacy
 hardware (from last century)

I'm not using allyesconfig,
but localmodconfig on my Ubuntu desktop system.
On my system, 700 bytes of vmlinux text are saved,
which means there may be more than 100 call sites with GFP_DMA.

 In fact if you want to reduce even more your vmlinux, you could test

 if (__builtin_constant_p(flags) && (flags & SLUB_DMA))
 return kmem_cache_alloc_trace(s, flags, size);

 to force the call to out of line code.

The reason I mention code size is that I want to say it may
be good for performance,
although the impact is just small.
I'm not interested in reducing code size :)

Thanks for comment.


Re: [PATCH v5 05/18] slab/slub: struct memcg_params

2012-10-23 Thread JoonSoo Kim
Hi, Glauber.

2012/10/19 Glauber Costa glom...@parallels.com:
 For the kmem slab controller, we need to record some extra
 information in the kmem_cache structure.

 Signed-off-by: Glauber Costa glom...@parallels.com
 Signed-off-by: Suleiman Souhlal sulei...@google.com
 CC: Christoph Lameter c...@linux.com
 CC: Pekka Enberg penb...@cs.helsinki.fi
 CC: Michal Hocko mho...@suse.cz
 CC: Kamezawa Hiroyuki kamezawa.hir...@jp.fujitsu.com
 CC: Johannes Weiner han...@cmpxchg.org
 CC: Tejun Heo t...@kernel.org
 ---
  include/linux/slab.h | 25 +
  include/linux/slab_def.h |  3 +++
  include/linux/slub_def.h |  3 +++
  mm/slab.h| 13 +
  4 files changed, 44 insertions(+)

 diff --git a/include/linux/slab.h b/include/linux/slab.h
 index 0dd2dfa..e4ea48a 100644
 --- a/include/linux/slab.h
 +++ b/include/linux/slab.h
 @@ -177,6 +177,31 @@ unsigned int kmem_cache_size(struct kmem_cache *);
  #define ARCH_SLAB_MINALIGN __alignof__(unsigned long long)
  #endif

 +#include <linux/workqueue.h>

Why is workqueue.h included at this point?
It may be for future use, so would it be better to add it later?
Adding it at the proper time makes git blame work better.


Re: [PATCH v5 06/18] consider a memcg parameter in kmem_create_cache

2012-10-23 Thread JoonSoo Kim
2012/10/19 Glauber Costa glom...@parallels.com:

 diff --git a/mm/memcontrol.c b/mm/memcontrol.c
 index 6a1e096..59f6d54 100644
 --- a/mm/memcontrol.c
 +++ b/mm/memcontrol.c
 @@ -339,6 +339,12 @@ struct mem_cgroup {
  #if defined(CONFIG_MEMCG_KMEM)  defined(CONFIG_INET)
 struct tcp_memcontrol tcp_mem;
  #endif
 +#if defined(CONFIG_MEMCG_KMEM)
 +   /* analogous to slab_common's slab_caches list. per-memcg */
 +   struct list_head memcg_slab_caches;
 +   /* Not a spinlock, we can take a lot of time walking the list */
 +   struct mutex slab_caches_mutex;
 +#endif
  };

It would be better to change the name of slab_caches_mutex to something
representing the slab cache mutex of a memcg.

 +int memcg_register_cache(struct mem_cgroup *memcg, struct kmem_cache *s)
 +{
 +   size_t size = sizeof(struct memcg_cache_params);
 +
 +   if (!memcg_kmem_enabled())
 +   return 0;
 +
 +   s->memcg_params = kzalloc(size, GFP_KERNEL);
 +   if (!s->memcg_params)
 +   return -ENOMEM;
 +
 +   if (memcg)
 +   s->memcg_params->memcg = memcg;
 +   return 0;
 +}

I haven't read the full patchset yet and don't have enough knowledge about cgroups,
so I have a question.
When memcg_kmem is enabled, creating a kmem_cache for a normal user (not
included in a cgroup) also allocates memcg_params.
Is that the intended behavior?

 +void memcg_release_cache(struct kmem_cache *s)
 +{
 +   kfree(s-memcg_params);
 +}
 +
  /*
   * We need to verify if the allocation against current-mm-owner's memcg is
   * possible for the given order. But the page is not allocated yet, so we'll
 diff --git a/mm/slab.h b/mm/slab.h
 index 5ee1851..c35ecce 100644
 --- a/mm/slab.h
 +++ b/mm/slab.h
 @@ -35,12 +35,15 @@ extern struct kmem_cache *kmem_cache;
  /* Functions provided by the slab allocators */
  extern int __kmem_cache_create(struct kmem_cache *, unsigned long flags);

 +struct mem_cgroup;
  #ifdef CONFIG_SLUB
 -struct kmem_cache *__kmem_cache_alias(const char *name, size_t size,
 -   size_t align, unsigned long flags, void (*ctor)(void *));
 +struct kmem_cache *
 +__kmem_cache_alias(struct mem_cgroup *memcg, const char *name, size_t size,
 +  size_t align, unsigned long flags, void (*ctor)(void *));
  #else
 -static inline struct kmem_cache *__kmem_cache_alias(const char *name, size_t 
 size,
 -   size_t align, unsigned long flags, void (*ctor)(void *))
 +static inline struct kmem_cache *
 +__kmem_cache_alias(struct mem_cgroup *memcg, const char *name, size_t size,
 +  size_t align, unsigned long flags, void (*ctor)(void *))
  { return NULL; }
  #endif

 @@ -98,11 +101,23 @@ static inline bool is_root_cache(struct kmem_cache *s)
  {
 return !s->memcg_params || s->memcg_params->is_root_cache;
  }
 +
 +static inline bool cache_match_memcg(struct kmem_cache *cachep,
 +struct mem_cgroup *memcg)
 +{
 +   return (is_root_cache(cachep) && !memcg) ||
 +   (cachep->memcg_params->memcg == memcg);
 +}

When cachep->memcg_params == NULL and memcg is not NULL, a NULL pointer
dereference may occur.
Is there no situation like that?
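
For illustration only, a defensively-checked variant of the helper (a sketch of the
concern, not a proposed fix):

static inline bool cache_match_memcg(struct kmem_cache *cachep,
				     struct mem_cgroup *memcg)
{
	return (is_root_cache(cachep) && !memcg) ||
	       (cachep->memcg_params && cachep->memcg_params->memcg == memcg);
}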

  #else
  static inline bool is_root_cache(struct kmem_cache *s)
  {
 return true;
  }

 +static inline bool cache_match_memcg(struct kmem_cache *cachep,
 +struct mem_cgroup *memcg)
 +{
 +   return true;
 +}
  #endif
  #endif
 diff --git a/mm/slab_common.c b/mm/slab_common.c
 index bf4b4f1..f97f7b8 100644
 --- a/mm/slab_common.c
 +++ b/mm/slab_common.c
 @@ -18,6 +18,7 @@
  #include asm/cacheflush.h
  #include asm/tlbflush.h
  #include asm/page.h
 +#include linux/memcontrol.h

  #include slab.h

 @@ -27,7 +28,8 @@ DEFINE_MUTEX(slab_mutex);
  struct kmem_cache *kmem_cache;

  #ifdef CONFIG_DEBUG_VM
 -static int kmem_cache_sanity_check(const char *name, size_t size)
 +static int kmem_cache_sanity_check(struct mem_cgroup *memcg, const char 
 *name,
 +  size_t size)
  {
 struct kmem_cache *s = NULL;

 @@ -53,7 +55,7 @@ static int kmem_cache_sanity_check(const char *name, size_t 
 size)
 continue;
 }

 -   if (!strcmp(s-name, name)) {
 +   if (cache_match_memcg(s, memcg)  !strcmp(s-name, name)) {
 pr_err(%s (%s): Cache name already exists.\n,
__func__, name);
 dump_stack();
 @@ -66,7 +68,8 @@ static int kmem_cache_sanity_check(const char *name, size_t 
 size)
 return 0;
  }
  #else
 -static inline int kmem_cache_sanity_check(const char *name, size_t size)
 +static inline int kmem_cache_sanity_check(struct mem_cgroup *memcg,
 + const char *name, size_t size)
  {
 return 0;
  }
 @@ -97,8 +100,9 @@ static inline int kmem_cache_sanity_check(const char 
 *name, size_t size)
   * as davem.
   */

 -struct kmem_cache *kmem_cache_create(const 

Re: [PATCH v2 1/2] kmem_cache: include allocators code directly into slab_common

2012-10-24 Thread JoonSoo Kim
2012/10/24 Glauber Costa glom...@parallels.com:
 On 10/24/2012 06:29 PM, Christoph Lameter wrote:
 On Wed, 24 Oct 2012, Glauber Costa wrote:

 Because of that, we either have to move all the entry points to the
 mm/slab.h and rely heavily on the pre-processor, or include all .c files
 in here.

 Hmm... That is a bit of a radical solution. The global optimizations now
 possible with the new gcc compiler include the ability to fold functions
 across different linkable objects. Andi, is that usable for kernel builds?


 In general, it takes quite a lot of time to take all those optimizations
 for granted. We still live a lot of time with multiple compiler versions
 building distros, etc, for quite some time.

 I would expect the end result for anyone not using such a compiler to be
 a sudden performance drop when using a new kernel. Not really pleasant.

I agree with Glauber's opinion, and the patch looks fine to me.


Re: [PATCH v5 08/18] memcg: infrastructure to match an allocation to the right cache

2012-10-24 Thread JoonSoo Kim
2012/10/19 Glauber Costa glom...@parallels.com:
 @@ -2930,9 +2937,188 @@ int memcg_register_cache(struct mem_cgroup *memcg, 
 struct kmem_cache *s)

  void memcg_release_cache(struct kmem_cache *s)
  {
 +   struct kmem_cache *root;
 +   int id = memcg_css_id(s->memcg_params->memcg);
 +
 +   if (s->memcg_params->is_root_cache)
 +   goto out;
 +
 +   root = s->memcg_params->root_cache;
 +   root->memcg_params->memcg_caches[id] = NULL;
 +   mem_cgroup_put(s->memcg_params->memcg);
 +out:
 kfree(s->memcg_params);
  }

memcg_css_id() should be called after checking s->memcg_params->is_root_cache,
because when is_root_cache == true, memcg_params has no memcg object.
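
i.e. something along these lines (a sketch of the suggested ordering, based on the hunk
above):

void memcg_release_cache(struct kmem_cache *s)
{
	struct kmem_cache *root;
	int id;

	if (s->memcg_params->is_root_cache)
		goto out;

	/* only non-root caches have a valid memcg to look up */
	id = memcg_css_id(s->memcg_params->memcg);
	root = s->memcg_params->root_cache;
	root->memcg_params->memcg_caches[id] = NULL;
	mem_cgroup_put(s->memcg_params->memcg);
out:
	kfree(s->memcg_params);
}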


 +/*
 + * This lock protects updaters, not readers. We want readers to be as fast as
 + * they can, and they will either see NULL or a valid cache value. Our model
 + * allow them to see NULL, in which case the root memcg will be selected.
 + *
 + * We need this lock because multiple allocations to the same cache from a 
 non
 + * GFP_WAIT area will span more than one worker. Only one of them can create
 + * the cache.
 + */
 +static DEFINE_MUTEX(memcg_cache_mutex);
 +static struct kmem_cache *memcg_create_kmem_cache(struct mem_cgroup *memcg,
 + struct kmem_cache *cachep)
 +{
 +   struct kmem_cache *new_cachep;
 +   int idx;
 +
 +   BUG_ON(!memcg_can_account_kmem(memcg));
 +
 +   idx = memcg_css_id(memcg);
 +
 +   mutex_lock(&memcg_cache_mutex);
 +   new_cachep = cachep->memcg_params->memcg_caches[idx];
 +   if (new_cachep)
 +   goto out;
 +
 +   new_cachep = kmem_cache_dup(memcg, cachep);
 +
 +   if (new_cachep == NULL) {
 +   new_cachep = cachep;
 +   goto out;
 +   }
 +
 +   mem_cgroup_get(memcg);
 +   cachep->memcg_params->memcg_caches[idx] = new_cachep;
 +   wmb(); /* the readers won't lock, make sure everybody sees it */

Is there a matching rmb() pair?
As far as I know, without an rmb() on the reader side, a wmb() doesn't guarantee anything.
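
The usual publish/consume pairing being asked about looks roughly like this (a sketch;
the reader side is not part of this hunk):

	/* writer (this function) */
	new_cachep->memcg_params->memcg = memcg;	/* initialize first ... */
	smp_wmb();					/* ... then publish */
	cachep->memcg_params->memcg_caches[idx] = new_cachep;

	/* reader (whoever looks the per-memcg cache up) */
	s = cachep->memcg_params->memcg_caches[idx];
	smp_rmb();					/* pairs with the smp_wmb() above */
	/* ... now the fields of s set before the smp_wmb() are visible ... */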

 +   new_cachep->memcg_params->memcg = memcg;
 +   new_cachep->memcg_params->root_cache = cachep;

It may be better to do these assignments before the statement
cachep->memcg_params->memcg_caches[idx] = new_cachep;
otherwise, it may produce a race.

And assigning values to memcg_params->memcg and root_cache is redundant,
because that is already done in memcg_register_cache().

 +/*
 + * Return the kmem_cache we're supposed to use for a slab allocation.
 + * We try to use the current memcg's version of the cache.
 + *
 + * If the cache does not exist yet, if we are the first user of it,
 + * we either create it immediately, if possible, or create it asynchronously
 + * in a workqueue.
 + * In the latter case, we will let the current allocation go through with
 + * the original cache.
 + *
 + * Can't be called in interrupt context or from kernel threads.
 + * This function needs to be called with rcu_read_lock() held.
 + */
 +struct kmem_cache *__memcg_kmem_get_cache(struct kmem_cache *cachep,
 + gfp_t gfp)
 +{
 +   struct mem_cgroup *memcg;
 +   int idx;
 +
 +   if (cachep->memcg_params && cachep->memcg_params->memcg)
 +   return cachep;

In __memcg_kmem_get_cache(), cachep may always be a root cache,
so checking cachep->memcg_params->memcg seems somewhat strange.
Is that right?


Thanks.


[PATCH] workqueue: insert warn_on_once() into wq_worker_waking_up()

2012-10-26 Thread Joonsoo Kim
Recently, the workqueue code has gone through some changes and we found some bugs
related to this. To prevent further bugs in advance, add WARN_ON_ONCE() in
wq_worker_waking_up(). When a worker is not in WORKER_NOT_RUNNING state,
it should be bound to its associated cpu and executed on that cpu.
Add code for checking this.

Signed-off-by: Joonsoo Kim js1...@gmail.com

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index a1135c6..1a65132 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -739,8 +739,10 @@ void wq_worker_waking_up(struct task_struct *task, unsigned int cpu)
 {
struct worker *worker = kthread_data(task);
 
-   if (!(worker->flags & WORKER_NOT_RUNNING))
+   if (!(worker->flags & WORKER_NOT_RUNNING)) {
+   WARN_ON_ONCE(worker->pool->gcwq->cpu != cpu);
atomic_inc(get_pool_nr_running(worker->pool));
+   }
 }
 
 /**
-- 
1.7.9.5



Re: [PATCH 0/2] clean-up initialization of deferrable timer

2012-10-26 Thread JoonSoo Kim
2012/10/19 Joonsoo Kim js1...@gmail.com:
 This patchset introduces setup_timer_deferrable() macro.
 Using it makes code simple and understandable.

 This patchset doesn't make any functional difference.
 It is just for clean-up.

 It is based on v3.7-rc1

 Joonsoo Kim (2):
   timer: add setup_timer_deferrable() macro
   timer: use new setup_timer_deferrable() macro

  drivers/acpi/apei/ghes.c |5 ++---
  drivers/ata/libata-core.c|5 ++---
  drivers/net/ethernet/nvidia/forcedeth.c  |   13 -
  drivers/net/ethernet/qlogic/qlge/qlge_main.c |4 +---
  drivers/net/vxlan.c  |5 ++---
  include/linux/timer.h|2 ++
  kernel/workqueue.c   |6 ++
  net/mac80211/agg-rx.c|   12 ++--
  net/mac80211/agg-tx.c|   12 ++--
  net/sched/cls_flow.c |5 ++---
  net/sched/sch_sfq.c  |5 ++---
  11 files changed, 31 insertions(+), 43 deletions(-)

 --
 1.7.9.5

Hello, Thomas.
Will you pick this up for your tree?
Or is there anything wrong with it?

Thanks.


Re: [PATCH v2 1/2] kmem_cache: include allocators code directly into slab_common

2012-10-26 Thread JoonSoo Kim
2012/10/25 Christoph Lameter c...@linux.com:
 On Wed, 24 Oct 2012, Pekka Enberg wrote:

 So I hate this patch with a passion. We don't have any fastpaths in
 mm/slab_common.c nor should we. Those should be allocator specific.

 I have similar thoughts on the issue. Lets keep the fast paths allocator
 specific until we find a better way to handle this issue.

Okay, I see.
How about applying LTO not to the whole kernel code, but just to
slab_common.o + sl[aou]b.o?
I think that may be possible, no?


[PATCH 1/5] mm, highmem: use PKMAP_NR() to calculate an index of pkmap

2012-10-28 Thread Joonsoo Kim
To calculate the index of a pkmap entry, using PKMAP_NR() is more understandable
and maintainable, so change it.

Cc: Mel Gorman mgor...@suse.de
Signed-off-by: Joonsoo Kim js1...@gmail.com

diff --git a/mm/highmem.c b/mm/highmem.c
index d517cd1..b3b3d68 100644
--- a/mm/highmem.c
+++ b/mm/highmem.c
@@ -99,7 +99,7 @@ struct page *kmap_to_page(void *vaddr)
unsigned long addr = (unsigned long)vaddr;
 
if (addr >= PKMAP_ADDR(0) && addr <= PKMAP_ADDR(LAST_PKMAP)) {
-   int i = (addr - PKMAP_ADDR(0)) >> PAGE_SHIFT;
+   int i = PKMAP_NR(addr);
return pte_page(pkmap_page_table[i]);
}
 
-- 
1.7.9.5



[PATCH 3/5] mm, highmem: remove page_address_pool list

2012-10-28 Thread Joonsoo Kim
We can find a free page_address_map instance without the page_address_pool.
So remove it.

Signed-off-by: Joonsoo Kim js1...@gmail.com

diff --git a/mm/highmem.c b/mm/highmem.c
index 017bad1..731cf9a 100644
--- a/mm/highmem.c
+++ b/mm/highmem.c
@@ -323,11 +323,7 @@ struct page_address_map {
void *virtual;
struct list_head list;
 };
-
-/*
- * page_address_map freelist, allocated from page_address_maps.
- */
-static struct list_head page_address_pool; /* freelist */
+static struct page_address_map page_address_maps[LAST_PKMAP];
 
 /*
  * Hash table bucket
@@ -392,12 +388,7 @@ void set_page_address(struct page *page, void *virtual)
 
pas = page_slot(page);
if (virtual) {  /* Add */
-   BUG_ON(list_empty(&page_address_pool));
-
-   pam = list_entry(page_address_pool.next,
-   struct page_address_map, list);
-   list_del(&pam->list);
-
+   pam = &page_address_maps[PKMAP_NR((unsigned long)virtual)];
pam->page = page;
pam->virtual = virtual;

@@ -410,7 +401,6 @@ void set_page_address(struct page *page, void *virtual)
if (pam->page == page) {
list_del(&pam->list);
spin_unlock_irqrestore(&pas->lock, flags);
-   list_add_tail(&pam->list, &page_address_pool);
goto done;
}
}
@@ -420,15 +410,10 @@ done:
return;
 }
 
-static struct page_address_map page_address_maps[LAST_PKMAP];
-
 void __init page_address_init(void)
 {
int i;
 
-   INIT_LIST_HEAD(&page_address_pool);
-   for (i = 0; i < ARRAY_SIZE(page_address_maps); i++)
-   list_add(&page_address_maps[i].list, &page_address_pool);
for (i = 0; i < ARRAY_SIZE(page_address_htable); i++) {
INIT_LIST_HEAD(&page_address_htable[i].lh);
spin_lock_init(&page_address_htable[i].lock);
-- 
1.7.9.5



[PATCH 4/5] mm, highmem: makes flush_all_zero_pkmaps() return index of last flushed entry

2012-10-28 Thread Joonsoo Kim
In the current code, after flush_all_zero_pkmaps() is invoked,
we re-iterate over all pkmaps. This can be avoided if flush_all_zero_pkmaps()
returns the index of a flushed entry: with this index,
we can immediately map a highmem page to the virtual address represented by it.
So change the return type of flush_all_zero_pkmaps()
and return the index of the last flushed entry.

Signed-off-by: Joonsoo Kim js1...@gmail.com

diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index ef788b5..0683869 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -32,6 +32,7 @@ static inline void invalidate_kernel_vmap_range(void *vaddr, 
int size)
 
 #ifdef CONFIG_HIGHMEM
 #include <asm/highmem.h>
+#define PKMAP_INDEX_INVAL (-1)
 
 /* declarations for linux/mm/highmem.c */
 unsigned int nr_free_highpages(void);
diff --git a/mm/highmem.c b/mm/highmem.c
index 731cf9a..65beb9a 100644
--- a/mm/highmem.c
+++ b/mm/highmem.c
@@ -106,10 +106,10 @@ struct page *kmap_to_page(void *vaddr)
return virt_to_page(addr);
 }
 
-static void flush_all_zero_pkmaps(void)
+static int flush_all_zero_pkmaps(void)
 {
int i;
-   int need_flush = 0;
+   int index = PKMAP_INDEX_INVAL;
 
flush_cache_kmaps();
 
@@ -141,10 +141,12 @@ static void flush_all_zero_pkmaps(void)
  pkmap_page_table[i]);
 
set_page_address(page, NULL);
-   need_flush = 1;
+   index = i;
}
-   if (need_flush)
+   if (index != PKMAP_INDEX_INVAL)
flush_tlb_kernel_range(PKMAP_ADDR(0), PKMAP_ADDR(LAST_PKMAP));
+
+   return index;
 }
 
 /**
@@ -160,6 +162,7 @@ void kmap_flush_unused(void)
 static inline unsigned long map_new_virtual(struct page *page)
 {
unsigned long vaddr;
+   int index = PKMAP_INDEX_INVAL;
int count;
 
 start:
@@ -168,40 +171,45 @@ start:
for (;;) {
last_pkmap_nr = (last_pkmap_nr + 1) & LAST_PKMAP_MASK;
if (!last_pkmap_nr) {
-   flush_all_zero_pkmaps();
-   count = LAST_PKMAP;
+   index = flush_all_zero_pkmaps();
+   if (index != PKMAP_INDEX_INVAL)
+   break; /* Found a usable entry */
}
-   if (!pkmap_count[last_pkmap_nr])
+   if (!pkmap_count[last_pkmap_nr]) {
+   index = last_pkmap_nr;
break;  /* Found a usable entry */
-   if (--count)
-   continue;
-
-   /*
-* Sleep for somebody else to unmap their entries
-*/
-   {
-   DECLARE_WAITQUEUE(wait, current);
-
-   __set_current_state(TASK_UNINTERRUPTIBLE);
-   add_wait_queue(&pkmap_map_wait, &wait);
-   unlock_kmap();
-   schedule();
-   remove_wait_queue(&pkmap_map_wait, &wait);
-   lock_kmap();
-
-   /* Somebody else might have mapped it while we slept */
-   if (page_address(page))
-   return (unsigned long)page_address(page);
-
-   /* Re-start */
-   goto start;
}
+   if (--count == 0)
+   break;
}
-   vaddr = PKMAP_ADDR(last_pkmap_nr);
+
+   /*
+* Sleep for somebody else to unmap their entries
+*/
+   if (index == PKMAP_INDEX_INVAL) {
+   DECLARE_WAITQUEUE(wait, current);
+
+   __set_current_state(TASK_UNINTERRUPTIBLE);
+   add_wait_queue(&pkmap_map_wait, &wait);
+   unlock_kmap();
+   schedule();
+   remove_wait_queue(&pkmap_map_wait, &wait);
+   lock_kmap();
+
+   /* Somebody else might have mapped it while we slept */
+   vaddr = (unsigned long)page_address(page);
+   if (vaddr)
+   return vaddr;
+
+   /* Re-start */
+   goto start;
+   }
+
+   vaddr = PKMAP_ADDR(index);
set_pte_at(&init_mm, vaddr,
-  &(pkmap_page_table[last_pkmap_nr]), mk_pte(page, kmap_prot));
+  &(pkmap_page_table[index]), mk_pte(page, kmap_prot));
 
-   pkmap_count[last_pkmap_nr] = 1;
+   pkmap_count[index] = 1;
set_page_address(page, (void *)vaddr);
 
return vaddr;
-- 
1.7.9.5
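
The shape of the optimization can be modelled outside the kernel: a scan
that frees unused slots reports one freed index through a sentinel, so the
caller can reuse that slot without rescanning. The sketch below keeps only
the pkmap_count scan; the real page-table, TLB-flush and wait-queue handling
is omitted.

#include <stdio.h>

#define LAST_PKMAP          512
#define PKMAP_INDEX_INVAL   (-1)

static int pkmap_count[LAST_PKMAP];

static int flush_all_zero_pkmaps(void)
{
    int i, index = PKMAP_INDEX_INVAL;

    for (i = 0; i < LAST_PKMAP; i++) {
        if (pkmap_count[i] != 1)    /* 1 means mapped but no users */
            continue;
        pkmap_count[i] = 0;         /* "flush" the entry */
        index = i;                  /* remember a reusable slot */
    }
    return index;
}

int main(void)
{
    int index;

    pkmap_count[42] = 1;            /* one unused mapping */
    index = flush_all_zero_pkmaps();
    if (index != PKMAP_INDEX_INVAL)
        printf("reuse slot %d without rescanning\n", index);
    return 0;
}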



[PATCH 2/5] mm, highmem: remove useless pool_lock

2012-10-28 Thread Joonsoo Kim
The pool_lock protects the page_address_pool from concurrent access.
But, access to the page_address_pool is already protected by kmap_lock.
So remove it.

Signed-off-by: Joonsoo Kim js1...@gmail.com

diff --git a/mm/highmem.c b/mm/highmem.c
index b3b3d68..017bad1 100644
--- a/mm/highmem.c
+++ b/mm/highmem.c
@@ -328,7 +328,6 @@ struct page_address_map {
  * page_address_map freelist, allocated from page_address_maps.
  */
 static struct list_head page_address_pool; /* freelist */
-static spinlock_t pool_lock;   /* protects page_address_pool */
 
 /*
  * Hash table bucket
@@ -395,11 +394,9 @@ void set_page_address(struct page *page, void *virtual)
if (virtual) {  /* Add */
BUG_ON(list_empty(page_address_pool));
 
-   spin_lock_irqsave(&pool_lock, flags);
pam = list_entry(page_address_pool.next,
struct page_address_map, list);
list_del(&pam->list);
-   spin_unlock_irqrestore(&pool_lock, flags);

pam->page = page;
pam->virtual = virtual;
@@ -413,9 +410,7 @@ void set_page_address(struct page *page, void *virtual)
if (pam->page == page) {
list_del(&pam->list);
spin_unlock_irqrestore(&pas->lock, flags);
-   spin_lock_irqsave(&pool_lock, flags);
list_add_tail(&pam->list, &page_address_pool);
-   spin_unlock_irqrestore(&pool_lock, flags);
goto done;
}
}
@@ -438,7 +433,6 @@ void __init page_address_init(void)
INIT_LIST_HEAD(&page_address_htable[i].lh);
spin_lock_init(&page_address_htable[i].lock);
}
-   spin_lock_init(&pool_lock);
 }
 
 #endif /* defined(CONFIG_HIGHMEM) && !defined(WANT_PAGE_VIRTUAL) */
-- 
1.7.9.5



[PATCH 0/5] minor clean-up and optimize highmem related code

2012-10-28 Thread Joonsoo Kim
This patchset clean-up and optimize highmem related code.

[1] is just clean-up and doesn't introduce any functional change.
[2-3] are for clean-up and optimization.
These eliminate a useless lock operation and list management.
[4-5] are for optimization related to flush_all_zero_pkmaps().

Joonsoo Kim (5):
  mm, highmem: use PKMAP_NR() to calculate an index of pkmap
  mm, highmem: remove useless pool_lock
  mm, highmem: remove page_address_pool list
  mm, highmem: makes flush_all_zero_pkmaps() return index of last
flushed entry
  mm, highmem: get virtual address of the page using PKMAP_ADDR()

 include/linux/highmem.h |1 +
 mm/highmem.c|  102 ---
 2 files changed, 45 insertions(+), 58 deletions(-)

-- 
1.7.9.5



[PATCH 5/5] mm, highmem: get virtual address of the page using PKMAP_ADDR()

2012-10-28 Thread Joonsoo Kim
In flush_all_zero_pkmaps(), we have the index of the pkmap associated with the page.
Using this index, we can simply get the virtual address of the page.
So change it.

Signed-off-by: Joonsoo Kim js1...@gmail.com

diff --git a/mm/highmem.c b/mm/highmem.c
index 65beb9a..1417f4f 100644
--- a/mm/highmem.c
+++ b/mm/highmem.c
@@ -137,8 +137,7 @@ static int flush_all_zero_pkmaps(void)
 * So no dangers, even with speculative execution.
 */
page = pte_page(pkmap_page_table[i]);
-   pte_clear(&init_mm, (unsigned long)page_address(page),
- &pkmap_page_table[i]);
+   pte_clear(&init_mm, PKMAP_ADDR(i), &pkmap_page_table[i]);
 
set_page_address(page, NULL);
index = i;
-- 
1.7.9.5



Re: [PATCH 4/5] mm, highmem: makes flush_all_zero_pkmaps() return index of last flushed entry

2012-10-29 Thread JoonSoo Kim
2012/10/29 Minchan Kim minc...@kernel.org:
 On Mon, Oct 29, 2012 at 04:12:55AM +0900, Joonsoo Kim wrote:
 In current code, after flush_all_zero_pkmaps() is invoked,
 then re-iterate all pkmaps. It can be optimized if flush_all_zero_pkmaps()
 return index of flushed entry. With this index,
 we can immediately map highmem page to virtual address represented by index.
 So change return type of flush_all_zero_pkmaps()
 and return index of last flushed entry.

 Signed-off-by: Joonsoo Kim js1...@gmail.com

 diff --git a/include/linux/highmem.h b/include/linux/highmem.h
 index ef788b5..0683869 100644
 --- a/include/linux/highmem.h
 +++ b/include/linux/highmem.h
 @@ -32,6 +32,7 @@ static inline void invalidate_kernel_vmap_range(void 
 *vaddr, int size)

  #ifdef CONFIG_HIGHMEM
  #include asm/highmem.h
 +#define PKMAP_INDEX_INVAL (-1)

 How about this?

 #define PKMAP_INVALID_INDEX (-1)

Okay.


  /* declarations for linux/mm/highmem.c */
  unsigned int nr_free_highpages(void);
 diff --git a/mm/highmem.c b/mm/highmem.c
 index 731cf9a..65beb9a 100644
 --- a/mm/highmem.c
 +++ b/mm/highmem.c
 @@ -106,10 +106,10 @@ struct page *kmap_to_page(void *vaddr)
   return virt_to_page(addr);
  }

 -static void flush_all_zero_pkmaps(void)
 +static int flush_all_zero_pkmaps(void)
  {
   int i;
 - int need_flush = 0;
 + int index = PKMAP_INDEX_INVAL;

   flush_cache_kmaps();

 @@ -141,10 +141,12 @@ static void flush_all_zero_pkmaps(void)
 pkmap_page_table[i]);

   set_page_address(page, NULL);
 - need_flush = 1;
 + index = i;

 How about returning first free index instead of last one?
 and update last_pkmap_nr to it.

Okay. It will be more good.

Thanks.


Re: [PATCH 0/5] minor clean-up and optimize highmem related code

2012-10-29 Thread JoonSoo Kim
Hi, Minchan.

2012/10/29 Minchan Kim minc...@kernel.org:
 Hi Joonsoo,

 On Mon, Oct 29, 2012 at 04:12:51AM +0900, Joonsoo Kim wrote:
 This patchset clean-up and optimize highmem related code.

 [1] is just clean-up and doesn't introduce any functional change.
 [2-3] are for clean-up and optimization.
 These eliminate an useless lock opearation and list management.
 [4-5] is for optimization related to flush_all_zero_pkmaps().

 Joonsoo Kim (5):
   mm, highmem: use PKMAP_NR() to calculate an index of pkmap
   mm, highmem: remove useless pool_lock
   mm, highmem: remove page_address_pool list
   mm, highmem: makes flush_all_zero_pkmaps() return index of last
 flushed entry
   mm, highmem: get virtual address of the page using PKMAP_ADDR()

 This patchset looks awesome to me.
 If you have a plan to respin, please CCed Peter.

Thanks for review.
I will wait more review and respin, the day after tomorrow.
Version 2 will include fix about your comment.


[PATCH] Documentation: insert an additional information to android.txt

2012-10-29 Thread Joonsoo Kim
Without defining ARCH=arm, building perf for Android ARM will fail,
because it needs architecture-specific files.
So add the related information to the document.

Signed-off-by: Joonsoo Kim js1...@gmail.com

Cc: Irina Tirdea irina.tir...@intel.com
Cc: David Ahern dsah...@gmail.com
Cc: Ingo Molnar mi...@kernel.org
Cc: Jiri Olsa jo...@redhat.com
Cc: Namhyung Kim namhy...@kernel.org
Cc: Paul Mackerras pau...@samba.org
Cc: Pekka Enberg penb...@kernel.org
Cc: Peter Zijlstra a.p.zijls...@chello.nl
Cc: Steven Rostedt rost...@goodmis.org

---
It is based on git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git,
perf-core-for-mingo branch.

diff --git a/tools/perf/Documentation/android.txt 
b/tools/perf/Documentation/android.txt
index a39dbbb..8484c3a 100644
--- a/tools/perf/Documentation/android.txt
+++ b/tools/perf/Documentation/android.txt
@@ -48,7 +48,10 @@ For x86:
 II. Compile perf for Android
 
 You need to run make with the NDK toolchain and sysroot defined above:
-  make CROSS_COMPILE=${NDK_TOOLCHAIN} CFLAGS=--sysroot=${NDK_SYSROOT}
+For arm:
+  make ARCH=arm CROSS_COMPILE=${NDK_TOOLCHAIN} 
CFLAGS=--sysroot=${NDK_SYSROOT}
+For x86:
+  make ARCH=x86 CROSS_COMPILE=${NDK_TOOLCHAIN} 
CFLAGS=--sysroot=${NDK_SYSROOT}
 
 III. Install perf
 ---
-- 
1.7.9.5



[PATCH] percpu: change a method freeing a chunk for consistency.

2012-10-29 Thread Joonsoo Kim
commit 099a19d9 ('allow limited allocation before slab is online') changed the
method of allocating a chunk from kzalloc() to pcpu_mem_alloc(),
but it missed changing the matching free operation.
It may not be a problem for now, but fix it for consistency.

Signed-off-by: Joonsoo Kim js1...@gmail.com
Cc: Christoph Lameter c...@linux.com

diff --git a/mm/percpu.c b/mm/percpu.c
index ddc5efb..ec25896 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -631,7 +631,7 @@ static void pcpu_free_chunk(struct pcpu_chunk *chunk)
if (!chunk)
return;
pcpu_mem_free(chunk->map, chunk->map_alloc * sizeof(chunk->map[0]));
-   kfree(chunk);
+   pcpu_mem_free(chunk, pcpu_chunk_struct_size);
 }
 
 /*
-- 
1.7.9.5
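
The consistency argument can be illustrated in user space; the *_stub
routines below are stand-ins for pcpu_mem_alloc()/pcpu_mem_free(), not the
real percpu allocator.

#include <stdio.h>
#include <stdlib.h>

static void *pcpu_mem_alloc_stub(size_t size)
{
    /* may later grow bookkeeping that a bare free() would not undo */
    return calloc(1, size);
}

static void pcpu_mem_free_stub(void *ptr, size_t size)
{
    (void)size;     /* size is kept so the API mirrors the allocation */
    free(ptr);
}

int main(void)
{
    size_t chunk_size = 128;
    void *chunk = pcpu_mem_alloc_stub(chunk_size);

    /* Matched pair: whatever allocated the chunk also frees it. */
    pcpu_mem_free_stub(chunk, chunk_size);
    printf("alloc/free pair matched\n");
    return 0;
}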



Re: [PATCH v5 11/18] sl[au]b: Allocate objects from memcg cache

2012-10-29 Thread JoonSoo Kim
Hi, Glauber.

2012/10/19 Glauber Costa glom...@parallels.com:
 We are able to match a cache allocation to a particular memcg.  If the
 task doesn't change groups during the allocation itself - a rare event,
 this will give us a good picture about who is the first group to touch a
 cache page.

 This patch uses the now available infrastructure by calling
 memcg_kmem_get_cache() before all the cache allocations.

 Signed-off-by: Glauber Costa glom...@parallels.com
 CC: Christoph Lameter c...@linux.com
 CC: Pekka Enberg penb...@cs.helsinki.fi
 CC: Michal Hocko mho...@suse.cz
 CC: Kamezawa Hiroyuki kamezawa.hir...@jp.fujitsu.com
 CC: Johannes Weiner han...@cmpxchg.org
 CC: Suleiman Souhlal sulei...@google.com
 CC: Tejun Heo t...@kernel.org
 ---
  include/linux/slub_def.h | 15 ++-
  mm/memcontrol.c  |  3 +++
  mm/slab.c|  6 +-
  mm/slub.c|  5 +++--
  4 files changed, 21 insertions(+), 8 deletions(-)

 diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
 index 961e72e..ed330df 100644
 --- a/include/linux/slub_def.h
 +++ b/include/linux/slub_def.h
 @@ -13,6 +13,8 @@
  #include linux/kobject.h

  #include linux/kmemleak.h
 +#include linux/memcontrol.h
 +#include linux/mm.h

  enum stat_item {
 ALLOC_FASTPATH, /* Allocation from cpu slab */
 @@ -209,14 +211,14 @@ static __always_inline int kmalloc_index(size_t size)
   * This ought to end up with a global pointer to the right cache
   * in kmalloc_caches.
   */
 -static __always_inline struct kmem_cache *kmalloc_slab(size_t size)
 +static __always_inline struct kmem_cache *kmalloc_slab(gfp_t flags, size_t 
 size)
  {
 int index = kmalloc_index(size);

 if (index == 0)
 return NULL;

 -   return kmalloc_caches[index];
 +   return memcg_kmem_get_cache(kmalloc_caches[index], flags);
  }

You don't need this,
because memcg_kmem_get_cache() is invoked in both slab_alloc() and
__cache_alloc_node().


Re: [PATCH v5 13/18] memcg/sl[au]b Track all the memcg children of a kmem_cache.

2012-10-29 Thread JoonSoo Kim
2012/10/19 Glauber Costa glom...@parallels.com:
 +void kmem_cache_destroy_memcg_children(struct kmem_cache *s)
 +{
 +   struct kmem_cache *c;
 +   int i;
 +
 +   if (!s-memcg_params)
 +   return;
 +   if (!s-memcg_params-is_root_cache)
 +   return;
 +
 +   /*
 +* If the cache is being destroyed, we trust that there is no one else
 +* requesting objects from it. Even if there are, the sanity checks in
 +* kmem_cache_destroy should caught this ill-case.
 +*
 +* Still, we don't want anyone else freeing memcg_caches under our
 +* noses, which can happen if a new memcg comes to life. As usual,
 +* we'll take the set_limit_mutex to protect ourselves against this.
 +*/
 +   mutex_lock(set_limit_mutex);
 +   for (i = 0; i  memcg_limited_groups_array_size; i++) {
 +   c = s-memcg_params-memcg_caches[i];
 +   if (c)
 +   kmem_cache_destroy(c);
 +   }
 +   mutex_unlock(set_limit_mutex);
 +}

It may cause a NULL dereference.
Look at the following scenario.

1. some memcg slab cache still has remaining objects.
2. start to destroy the memcg.
3. schedule_delayed_work(kmem_cache_destroy_work_func, @delay 60hz)
4. all remaining objects are freed.
5. start to destroy the root cache.
6. kmem_cache_destroy() makes s->memcg_params->memcg_caches[i] NULL!!
7. the delayed work function starts.
8. cachep in kmem_cache_destroy_work_func() may be NULL

Thanks.
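
A minimal user-space model of that race might look like the sketch below;
the names and the single-slot array are invented, and the point is only that
the delayed work has to tolerate a slot that was already cleared in step 6.

#include <stdio.h>

struct kmem_cache_stub {
    const char *name;
};

static struct kmem_cache_stub *memcg_caches[1];

static void kmem_cache_destroy_work_func(int idx)
{
    struct kmem_cache_stub *cachep = memcg_caches[idx];

    if (!cachep) {      /* step 6 already ran: nothing left to destroy */
        printf("cache %d already destroyed, skipping\n", idx);
        return;
    }
    printf("destroying %s\n", cachep->name);
}

int main(void)
{
    memcg_caches[0] = NULL;          /* root-cache destroy cleared the slot */
    kmem_cache_destroy_work_func(0); /* delayed work fires afterwards */
    return 0;
}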


[PATCH 1/4] bootmem: remove not implemented function call, bootmem_arch_preferred_node()

2012-11-12 Thread Joonsoo Kim
There is no implementation of bootmem_arch_preferred_node(), and
a call to this function causes a compile error.
So remove it.

Signed-off-by: Joonsoo Kim js1...@gmail.com

diff --git a/mm/bootmem.c b/mm/bootmem.c
index 434be4a..6f62c03e 100644
--- a/mm/bootmem.c
+++ b/mm/bootmem.c
@@ -589,19 +589,6 @@ static void * __init 
alloc_arch_preferred_bootmem(bootmem_data_t *bdata,
 {
if (WARN_ON_ONCE(slab_is_available()))
return kzalloc(size, GFP_NOWAIT);
-
-#ifdef CONFIG_HAVE_ARCH_BOOTMEM
-   {
-   bootmem_data_t *p_bdata;
-
-   p_bdata = bootmem_arch_preferred_node(bdata, size, align,
-   goal, limit);
-   if (p_bdata)
-   return alloc_bootmem_bdata(p_bdata, size, align,
-   goal, limit);
-   }
-#endif
-   return NULL;
 }
 
 static void * __init alloc_bootmem_core(unsigned long size,
-- 
1.7.9.5



[PATCH 3/4] bootmem: remove alloc_arch_preferred_bootmem()

2012-11-12 Thread Joonsoo Kim
The function's name is no longer suitable.
Removing the function and inlining its code into each call site
makes the code more understandable.

Additionally, we shouldn't allocate from bootmem
when slab_is_available(), so directly return kmalloc*'s return value.

Signed-off-by: Joonsoo Kim js1...@gmail.com

diff --git a/mm/bootmem.c b/mm/bootmem.c
index 6f62c03e..cd5c5a2 100644
--- a/mm/bootmem.c
+++ b/mm/bootmem.c
@@ -583,14 +583,6 @@ find_block:
return NULL;
 }
 
-static void * __init alloc_arch_preferred_bootmem(bootmem_data_t *bdata,
-   unsigned long size, unsigned long align,
-   unsigned long goal, unsigned long limit)
-{
-   if (WARN_ON_ONCE(slab_is_available()))
-   return kzalloc(size, GFP_NOWAIT);
-}
-
 static void * __init alloc_bootmem_core(unsigned long size,
unsigned long align,
unsigned long goal,
@@ -599,9 +591,8 @@ static void * __init alloc_bootmem_core(unsigned long size,
bootmem_data_t *bdata;
void *region;
 
-   region = alloc_arch_preferred_bootmem(NULL, size, align, goal, limit);
-   if (region)
-   return region;
+   if (WARN_ON_ONCE(slab_is_available()))
+   return kzalloc(size, GFP_NOWAIT);
 
list_for_each_entry(bdata, &bdata_list, list) {
if (goal && bdata->node_low_pfn <= PFN_DOWN(goal))
@@ -699,11 +690,9 @@ void * __init ___alloc_bootmem_node_nopanic(pg_data_t 
*pgdat,
 {
void *ptr;
 
+   if (WARN_ON_ONCE(slab_is_available()))
+   return kzalloc(size, GFP_NOWAIT);
 again:
-   ptr = alloc_arch_preferred_bootmem(pgdat-bdata, size,
-  align, goal, limit);
-   if (ptr)
-   return ptr;
 
/* do not panic in alloc_bootmem_bdata() */
if (limit && goal + size > limit)
-- 
1.7.9.5



[PATCH 2/4] avr32, kconfig: remove HAVE_ARCH_BOOTMEM

2012-11-12 Thread Joonsoo Kim
Now, there is no code for CONFIG_HAVE_ARCH_BOOTMEM.
So remove it.

Cc: Haavard Skinnemoen hskinnem...@gmail.com
Cc: Hans-Christian Egtvedt egtv...@samfundet.no
Signed-off-by: Joonsoo Kim js1...@gmail.com

diff --git a/arch/avr32/Kconfig b/arch/avr32/Kconfig
index 06e73bf..c2bbc9a 100644
--- a/arch/avr32/Kconfig
+++ b/arch/avr32/Kconfig
@@ -193,9 +193,6 @@ source kernel/Kconfig.preempt
 config QUICKLIST
def_bool y
 
-config HAVE_ARCH_BOOTMEM
-   def_bool n
-
 config ARCH_HAVE_MEMORY_PRESENT
def_bool n
 
-- 
1.7.9.5



[PATCH 4/4] bootmem: fix wrong call parameter for free_bootmem()

2012-11-12 Thread Joonsoo Kim
It is somewhat strange that alloc_bootmem() returns a virtual address
while free_bootmem() requires a physical address.
Anyway, free_bootmem()'s first parameter should be a physical address.

There are some call sites passing a virtual address to free_bootmem().
So fix them.

Cc: Andrew Morton a...@linux-foundation.org
Cc: Benjamin Herrenschmidt b...@kernel.crashing.org
Signed-off-by: Joonsoo Kim js1...@gmail.com

diff --git a/arch/powerpc/platforms/cell/celleb_pci.c 
b/arch/powerpc/platforms/cell/celleb_pci.c
index abc8af4..1735681 100644
--- a/arch/powerpc/platforms/cell/celleb_pci.c
+++ b/arch/powerpc/platforms/cell/celleb_pci.c
@@ -401,11 +401,11 @@ error:
} else {
if (config && *config) {
size = 256;
-   free_bootmem((unsigned long)(*config), size);
+   free_bootmem(__pa(*config), size);
}
if (res && *res) {
size = sizeof(struct celleb_pci_resource);
-   free_bootmem((unsigned long)(*res), size);
+   free_bootmem(__pa(*res), size);
}
}
 
diff --git a/drivers/macintosh/smu.c b/drivers/macintosh/smu.c
index 7d5a6b4..1963680 100644
--- a/drivers/macintosh/smu.c
+++ b/drivers/macintosh/smu.c
@@ -565,7 +565,7 @@ fail_msg_node:
 fail_db_node:
of_node_put(smu->db_node);
 fail_bootmem:
-   free_bootmem((unsigned long)smu, sizeof(struct smu_device));
+   free_bootmem(__pa(smu), sizeof(struct smu_device));
smu = NULL;
 fail_np:
of_node_put(np);
diff --git a/lib/cpumask.c b/lib/cpumask.c
index 402a54a..d327b87 100644
--- a/lib/cpumask.c
+++ b/lib/cpumask.c
@@ -161,6 +161,6 @@ EXPORT_SYMBOL(free_cpumask_var);
  */
 void __init free_bootmem_cpumask_var(cpumask_var_t mask)
 {
-   free_bootmem((unsigned long)mask, cpumask_size());
+   free_bootmem(__pa(mask), cpumask_size());
 }
 #endif
-- 
1.7.9.5
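
The virtual/physical mismatch can be modelled in user space as below; the
PAGE_OFFSET value and the fake_free_bootmem() helper are assumptions made
for illustration only.

#include <assert.h>
#include <stdio.h>

#define PAGE_OFFSET 0xc0000000UL    /* assumed linear-map offset */

static unsigned long pa_of(const void *va)      /* models __pa() */
{
    return (unsigned long)va - PAGE_OFFSET;
}

static void *va_of(unsigned long pa)            /* models __va() */
{
    return (void *)(pa + PAGE_OFFSET);
}

/* free_bootmem()-style API: the first parameter is a physical address. */
static void fake_free_bootmem(unsigned long physaddr, unsigned long size)
{
    printf("freeing %lu bytes at phys %#lx\n", size, physaddr);
}

int main(void)
{
    void *ptr = va_of(0x100000);    /* as if returned by alloc_bootmem() */

    /* Wrong: passing the virtual address as if it were physical. */
    /* fake_free_bootmem((unsigned long)ptr, 256); */

    /* Right: convert back to a physical address first. */
    fake_free_bootmem(pa_of(ptr), 256);
    assert(va_of(pa_of(ptr)) == ptr);
    return 0;
}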



[PATCH] powerpc: change free_bootmem() to kfree()

2012-11-12 Thread Joonsoo Kim
commit ea96025a ('Don't use alloc_bootmem() in init_IRQ() path')
changed alloc_bootmem() to kzalloc(),
but missed changing free_bootmem() to kfree().
So correct it.

Signed-off-by: Joonsoo Kim js1...@gmail.com

diff --git a/arch/powerpc/platforms/82xx/pq2ads-pci-pic.c 
b/arch/powerpc/platforms/82xx/pq2ads-pci-pic.c
index 328d221..74861a7 100644
--- a/arch/powerpc/platforms/82xx/pq2ads-pci-pic.c
+++ b/arch/powerpc/platforms/82xx/pq2ads-pci-pic.c
@@ -16,7 +16,6 @@
 #include linux/spinlock.h
 #include linux/irq.h
 #include linux/types.h
-#include linux/bootmem.h
 #include linux/slab.h
 
 #include asm/io.h
@@ -149,7 +148,7 @@ int __init pq2ads_pci_init_irq(void)
priv->regs = of_iomap(np, 0);
if (!priv->regs) {
printk(KERN_ERR "Cannot map PCI PIC registers.\n");
-   goto out_free_bootmem;
+   goto out_free_kmalloc;
}
 
/* mask all PCI interrupts */
@@ -171,9 +170,8 @@ int __init pq2ads_pci_init_irq(void)
 
 out_unmap_regs:
iounmap(priv->regs);
-out_free_bootmem:
-   free_bootmem((unsigned long)priv,
-sizeof(struct pq2ads_pci_pic));
+out_free_kmalloc:
+   kfree(priv);
of_node_put(np);
 out_unmap_irq:
irq_dispose_mapping(irq);
-- 
1.7.9.5



[PATCH] x86/tlb: correct vmflag test for checking VM_HUGETLB

2012-11-12 Thread Joonsoo Kim
commit 611ae8e3f5204f7480b3b405993b3352cfa16662 ('enable tlb flush range
support for x86') changed flush_tlb_mm_range() considerably. After this,
we test whether vmflag equals VM_HUGETLB, and that test almost always fails,
because vmflag usually has other flags set simultaneously.
Our intention is to check whether this vma is for hugetlb, so correct the test
according to this purpose.

Cc: Alex Shi alex@intel.com
Cc: Thomas Gleixner t...@linutronix.de
Cc: Ingo Molnar mi...@redhat.com
Cc: H. Peter Anvin h...@zytor.com
Signed-off-by: Joonsoo Kim js1...@gmail.com

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 0777f04..60f926c 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -197,7 +197,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long 
start,
}
 
if (end == TLB_FLUSH_ALL || tlb_flushall_shift == -1
-   || vmflag == VM_HUGETLB) {
+   || vmflag & VM_HUGETLB) {
local_flush_tlb();
goto flush_all;
}
-- 
1.7.9.5
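
The difference between the equality test and the bitmask test is easy to
demonstrate in isolation; the flag values below are made-up examples, not
the real VM_* definitions.

#include <stdio.h>

#define VM_READ     0x0001UL
#define VM_WRITE    0x0002UL
#define VM_HUGETLB  0x0040UL

int main(void)
{
    unsigned long vmflag = VM_READ | VM_WRITE | VM_HUGETLB;

    /* '==' only matches when VM_HUGETLB is the sole bit set. */
    printf("vmflag == VM_HUGETLB : %d\n", vmflag == VM_HUGETLB);
    /* '&' matches whenever the bit is present among others. */
    printf("vmflag &  VM_HUGETLB : %d\n", (vmflag & VM_HUGETLB) != 0);
    return 0;
}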



Re: [PATCH v2 4/5] mm, highmem: makes flush_all_zero_pkmaps() return index of first flushed entry

2012-11-12 Thread JoonSoo Kim
2012/11/3 Minchan Kim minc...@kernel.org:
 Hi Joonsoo,

 On Sat, Nov 03, 2012 at 04:07:25AM +0900, JoonSoo Kim wrote:
 Hello, Minchan.

 2012/11/1 Minchan Kim minc...@kernel.org:
  On Thu, Nov 01, 2012 at 01:56:36AM +0900, Joonsoo Kim wrote:
  In current code, after flush_all_zero_pkmaps() is invoked,
  then re-iterate all pkmaps. It can be optimized if flush_all_zero_pkmaps()
  return index of first flushed entry. With this index,
  we can immediately map highmem page to virtual address represented by 
  index.
  So change return type of flush_all_zero_pkmaps()
  and return index of first flushed entry.
 
  Additionally, update last_pkmap_nr to this index.
  It is certain that entry which is below this index is occupied by other 
  mapping,
  therefore updating last_pkmap_nr to this index is reasonable optimization.
 
  Cc: Mel Gorman m...@csn.ul.ie
  Cc: Peter Zijlstra a.p.zijls...@chello.nl
  Cc: Minchan Kim minc...@kernel.org
  Signed-off-by: Joonsoo Kim js1...@gmail.com
 
  diff --git a/include/linux/highmem.h b/include/linux/highmem.h
  index ef788b5..97ad208 100644
  --- a/include/linux/highmem.h
  +++ b/include/linux/highmem.h
  @@ -32,6 +32,7 @@ static inline void invalidate_kernel_vmap_range(void 
  *vaddr, int size)
 
   #ifdef CONFIG_HIGHMEM
   #include asm/highmem.h
  +#define PKMAP_INVALID_INDEX (LAST_PKMAP)
 
   /* declarations for linux/mm/highmem.c */
   unsigned int nr_free_highpages(void);
  diff --git a/mm/highmem.c b/mm/highmem.c
  index d98b0a9..b365f7b 100644
  --- a/mm/highmem.c
  +++ b/mm/highmem.c
  @@ -106,10 +106,10 @@ struct page *kmap_to_page(void *vaddr)
return virt_to_page(addr);
   }
 
  -static void flush_all_zero_pkmaps(void)
  +static unsigned int flush_all_zero_pkmaps(void)
   {
int i;
  - int need_flush = 0;
  + unsigned int index = PKMAP_INVALID_INDEX;
 
flush_cache_kmaps();
 
  @@ -141,10 +141,13 @@ static void flush_all_zero_pkmaps(void)
  pkmap_page_table[i]);
 
set_page_address(page, NULL);
  - need_flush = 1;
  + if (index == PKMAP_INVALID_INDEX)
  + index = i;
}
  - if (need_flush)
  + if (index != PKMAP_INVALID_INDEX)
flush_tlb_kernel_range(PKMAP_ADDR(0), 
  PKMAP_ADDR(LAST_PKMAP));
  +
  + return index;
   }
 
   /**
  @@ -152,14 +155,19 @@ static void flush_all_zero_pkmaps(void)
*/
   void kmap_flush_unused(void)
   {
  + unsigned int index;
  +
lock_kmap();
  - flush_all_zero_pkmaps();
  + index = flush_all_zero_pkmaps();
  + if (index != PKMAP_INVALID_INDEX  (index  last_pkmap_nr))
  + last_pkmap_nr = index;
 
  I don't know how kmap_flush_unused is really fast path so how my nitpick
  is effective. Anyway,
  What problem happens if we do following as?
 
  lock()
  index = flush_all_zero_pkmaps();
  if (index != PKMAP_INVALID_INDEX)
  last_pkmap_nr = index;
  unlock();
 
  Normally, last_pkmap_nr is increased with searching empty slot in
  map_new_virtual. So I expect return value of flush_all_zero_pkmaps
  in kmap_flush_unused normally become either less than last_pkmap_nr
  or last_pkmap_nr + 1.

 There is a case that return value of kmap_flush_unused() is larger
 than last_pkmap_nr.

 I see but why it's problem? kmap_flush_unused returns larger value than
 last_pkmap_nr means that there is no free slot at below the value.
 So unconditional last_pkmap_nr update is vaild.

I think that this is not true.
Look at the slightly different example.

Assume last_pkmap = 20 and indexes 1-9, 12-19 are kmapped. 10, 11 are kunmapped.

do kmap_flush_unused() => flush index 10,11 => last_pkmap = 10;
do kunmap() with index 17
do kmap_flush_unused() => flush index 17 => last_pkmap = 17?

In this case, an unconditional last_pkmap_nr update skips one kunmapped index.
So a conditional update is needed.

 Look at the following example.

 Assume last_pkmap = 20 and index 1-9, 11-19 is kmapped. 10 is kunmapped.

 do kmap_flush_unused() => flush index 10 => last_pkmap = 10;
 do kunmap() with index 17
 do kmap_flush_unused() => flush index 17

 So, little dirty implementation is needed.

 Thanks.

 --
 Kind Regards,
 Minchan Kim


Re: [PATCH 4/4] bootmem: fix wrong call parameter for free_bootmem()

2012-11-12 Thread JoonSoo Kim
Hi, Andrew.

2012/11/13 Andrew Morton a...@linux-foundation.org:
 On Tue, 13 Nov 2012 01:31:55 +0900
 Joonsoo Kim js1...@gmail.com wrote:

 It is somehow strange that alloc_bootmem return virtual address
 and free_bootmem require physical address.
 Anyway, free_bootmem()'s first parameter should be physical address.

 There are some call sites for free_bootmem() with virtual address.
 So fix them.

 Well gee, I wonder how that happened :(


 How does this look?

 From: Andrew Morton a...@linux-foundation.org
 Subject: bootmem-fix-wrong-call-parameter-for-free_bootmem-fix

 improve free_bootmem() and free_bootmem_pate() documentation

 Cc: Benjamin Herrenschmidt b...@kernel.crashing.org
 Cc: FUJITA Tomonori fujita.tomon...@lab.ntt.co.jp
 Cc: Haavard Skinnemoen hskinnem...@gmail.com
 Cc: Hans-Christian Egtvedt egtv...@samfundet.no
 Cc: Johannes Weiner han...@cmpxchg.org
 Cc: Joonsoo Kim js1...@gmail.com
 Signed-off-by: Andrew Morton a...@linux-foundation.org
 ---

  include/linux/bootmem.h |4 ++--
  mm/bootmem.c|   20 ++--
  2 files changed, 12 insertions(+), 12 deletions(-)

 --- a/mm/bootmem.c~bootmem-fix-wrong-call-parameter-for-free_bootmem-fix
 +++ a/mm/bootmem.c
 @@ -147,21 +147,21 @@ unsigned long __init init_bootmem(unsign

  /*
   * free_bootmem_late - free bootmem pages directly to page allocator
 - * @addr: starting address of the range
 + * @addr: starting physical address of the range
   * @size: size of the range in bytes
   *
   * This is only useful when the bootmem allocator has already been torn
   * down, but we are still initializing the system.  Pages are given directly
   * to the page allocator, no bootmem metadata is updated because it is gone.
   */
 -void __init free_bootmem_late(unsigned long addr, unsigned long size)
 +void __init free_bootmem_late(unsigned long physaddr, unsigned long size)
  {
 unsigned long cursor, end;

 -   kmemleak_free_part(__va(addr), size);
 +   kmemleak_free_part(__va(physaddr), size);

 -   cursor = PFN_UP(addr);
 -   end = PFN_DOWN(addr + size);
 +   cursor = PFN_UP(physaddr);
 +   end = PFN_DOWN(physaddr + size);

 for (; cursor  end; cursor++) {
 __free_pages_bootmem(pfn_to_page(cursor), 0);
 @@ -385,21 +385,21 @@ void __init free_bootmem_node(pg_data_t

  /**
   * free_bootmem - mark a page range as usable
 - * @addr: starting address of the range
 + * @addr: starting physical address of the range
   * @size: size of the range in bytes
   *
   * Partial pages will be considered reserved and left as they are.
   *
   * The range must be contiguous but may span node boundaries.
   */
 -void __init free_bootmem(unsigned long addr, unsigned long size)
 +void __init free_bootmem(unsigned long physaddr, unsigned long size)
  {
 unsigned long start, end;

 -   kmemleak_free_part(__va(addr), size);
 +   kmemleak_free_part(__va(physaddr), size);

 -   start = PFN_UP(addr);
 -   end = PFN_DOWN(addr + size);
 +   start = PFN_UP(physaddr);
 +   end = PFN_DOWN(physaddr + size);

 mark_bootmem(start, end, 0, 0);
  }
 --- 
 a/include/linux/bootmem.h~bootmem-fix-wrong-call-parameter-for-free_bootmem-fix
 +++ a/include/linux/bootmem.h
 @@ -51,8 +51,8 @@ extern unsigned long free_all_bootmem(vo
  extern void free_bootmem_node(pg_data_t *pgdat,
   unsigned long addr,
   unsigned long size);
 -extern void free_bootmem(unsigned long addr, unsigned long size);
 -extern void free_bootmem_late(unsigned long addr, unsigned long size);
 +extern void free_bootmem(unsigned long physaddr, unsigned long size);
 +extern void free_bootmem_late(unsigned long physaddr, unsigned long size);

  /*
   * Flags for reserve_bootmem (also if CONFIG_HAVE_ARCH_BOOTMEM_NODE,
 _

Looks good to me :)


Re: [PATCH v2 4/5] mm, highmem: makes flush_all_zero_pkmaps() return index of first flushed entry

2012-11-13 Thread JoonSoo Kim
2012/11/13 Minchan Kim minc...@kernel.org:
 On Tue, Nov 13, 2012 at 09:30:57AM +0900, JoonSoo Kim wrote:
 2012/11/3 Minchan Kim minc...@kernel.org:
  Hi Joonsoo,
 
  On Sat, Nov 03, 2012 at 04:07:25AM +0900, JoonSoo Kim wrote:
  Hello, Minchan.
 
  2012/11/1 Minchan Kim minc...@kernel.org:
   On Thu, Nov 01, 2012 at 01:56:36AM +0900, Joonsoo Kim wrote:
   In current code, after flush_all_zero_pkmaps() is invoked,
   then re-iterate all pkmaps. It can be optimized if 
   flush_all_zero_pkmaps()
   return index of first flushed entry. With this index,
   we can immediately map highmem page to virtual address represented by 
   index.
   So change return type of flush_all_zero_pkmaps()
   and return index of first flushed entry.
  
   Additionally, update last_pkmap_nr to this index.
   It is certain that entry which is below this index is occupied by 
   other mapping,
   therefore updating last_pkmap_nr to this index is reasonable 
   optimization.
  
   Cc: Mel Gorman m...@csn.ul.ie
   Cc: Peter Zijlstra a.p.zijls...@chello.nl
   Cc: Minchan Kim minc...@kernel.org
   Signed-off-by: Joonsoo Kim js1...@gmail.com
  
   diff --git a/include/linux/highmem.h b/include/linux/highmem.h
   index ef788b5..97ad208 100644
   --- a/include/linux/highmem.h
   +++ b/include/linux/highmem.h
   @@ -32,6 +32,7 @@ static inline void invalidate_kernel_vmap_range(void 
   *vaddr, int size)
  
#ifdef CONFIG_HIGHMEM
#include asm/highmem.h
   +#define PKMAP_INVALID_INDEX (LAST_PKMAP)
  
/* declarations for linux/mm/highmem.c */
unsigned int nr_free_highpages(void);
   diff --git a/mm/highmem.c b/mm/highmem.c
   index d98b0a9..b365f7b 100644
   --- a/mm/highmem.c
   +++ b/mm/highmem.c
   @@ -106,10 +106,10 @@ struct page *kmap_to_page(void *vaddr)
 return virt_to_page(addr);
}
  
   -static void flush_all_zero_pkmaps(void)
   +static unsigned int flush_all_zero_pkmaps(void)
{
 int i;
   - int need_flush = 0;
   + unsigned int index = PKMAP_INVALID_INDEX;
  
 flush_cache_kmaps();
  
   @@ -141,10 +141,13 @@ static void flush_all_zero_pkmaps(void)
   pkmap_page_table[i]);
  
 set_page_address(page, NULL);
   - need_flush = 1;
   + if (index == PKMAP_INVALID_INDEX)
   + index = i;
 }
   - if (need_flush)
   + if (index != PKMAP_INVALID_INDEX)
 flush_tlb_kernel_range(PKMAP_ADDR(0), 
   PKMAP_ADDR(LAST_PKMAP));
   +
   + return index;
}
  
/**
   @@ -152,14 +155,19 @@ static void flush_all_zero_pkmaps(void)
 */
void kmap_flush_unused(void)
{
   + unsigned int index;
   +
 lock_kmap();
   - flush_all_zero_pkmaps();
   + index = flush_all_zero_pkmaps();
   + if (index != PKMAP_INVALID_INDEX  (index  last_pkmap_nr))
   + last_pkmap_nr = index;
  
   I don't know how kmap_flush_unused is really fast path so how my nitpick
   is effective. Anyway,
   What problem happens if we do following as?
  
   lock()
   index = flush_all_zero_pkmaps();
   if (index != PKMAP_INVALID_INDEX)
   last_pkmap_nr = index;
   unlock();
  
   Normally, last_pkmap_nr is increased with searching empty slot in
   map_new_virtual. So I expect return value of flush_all_zero_pkmaps
   in kmap_flush_unused normally become either less than last_pkmap_nr
   or last_pkmap_nr + 1.
 
  There is a case that return value of kmap_flush_unused() is larger
  than last_pkmap_nr.
 
  I see but why it's problem? kmap_flush_unused returns larger value than
  last_pkmap_nr means that there is no free slot at below the value.
  So unconditional last_pkmap_nr update is vaild.

 I think that this is not true.
 Look at the slightly different example.

 Assume last_pkmap = 20 and index 1-9, 12-19 is kmapped. 10, 11 is kunmapped.

 do kmap_flush_unused() = flush index 10,11 = last_pkmap = 10;
 do kunmap() with index 17
 do kmap_flush_unused() = flush index 17 = last_pkmap = 17?

 In this case, unconditional last_pkmap_nr update skip one kunmapped index.
 So, conditional update is needed.

 Thanks for pouinting out, Joonsoo.
 You're right. I misunderstood your flush_all_zero_pkmaps change.
 As your change, flush_all_zero_pkmaps returns first *flushed* free slot index.
 What's the benefit returning flushed flushed free slot index rather than free 
 slot index?

If flush_all_zero_pkmaps() returned a free slot index rather than the first
flushed free slot,
we would need another comparison such as 'if (pkmap_count[i] == 0)' and
another local variable for determining whether a flush occurred or not.
I want to minimize this overhead and churning of the code, although
they are negligible.

 I think flush_all_zero_pkmaps should return the first free slot because the
 customer of flush_all_zero_pkmaps doesn't care whether it was just flushed or not.
 What he wants is just free or not. In such a case, we can remove the above check
 and it makes flush_all_zero_pkmaps more intuitive.

Re: [Q] Default SLAB allocator

2012-10-15 Thread JoonSoo Kim
Hello, Eric.

2012/10/14 Eric Dumazet eric.duma...@gmail.com:
 SLUB was really bad in the common workload you describe (allocations
 done by one cpu, freeing done by other cpus), because all kfree() hit
 the slow path and cpus contend in __slab_free() in the loop guarded by
 cmpxchg_double_slab(). SLAB has a cache for this, while SLUB directly
 hit the main struct page to add the freed object to freelist.

Could you elaborate more on how 'netperf RR' makes kernel allocations
done by one cpu and freeing done by other cpus, please?
I don't have enough background in the network subsystem, so I'm just curious.

 I played some months ago adding a percpu associative cache to SLUB, then
 just moved on other strategy.

 (Idea for this per cpu cache was to build a temporary free list of
 objects to batch accesses to struct page)

Is this implemented and submitted?
If it is, could you tell me the link for the patches?

Thanks!


[PATCH 2/2] timer: use new setup_timer_deferrable() macro

2012-10-18 Thread Joonsoo Kim
Now we have a handy macro for initializing a deferrable timer.
Using it makes the code clean and easy to understand.

Additionally, in some driver code, use setup_timer() instead of init_timer().

This patch doesn't make any functional difference.

Signed-off-by: Joonsoo Kim js1...@gmail.com
Cc: Len Brown l...@kernel.org
Cc: Jeff Garzik jgar...@pobox.com
Cc: Jitendra Kalsaria jitendra.kalsa...@qlogic.com
Cc: Ron Mercer ron.mer...@qlogic.com
Cc: Tejun Heo t...@kernel.org
Cc: Johannes Berg johan...@sipsolutions.net
Cc: John W. Linville linvi...@tuxdriver.com
Cc: David S. Miller da...@davemloft.net
Cc: Jamal Hadi Salim j...@mojatatu.com

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 1599566..1707f9a 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -944,9 +944,8 @@ static int __devinit ghes_probe(struct platform_device 
*ghes_dev)
}
switch (generic-notify.type) {
case ACPI_HEST_NOTIFY_POLLED:
-   ghes-timer.function = ghes_poll_func;
-   ghes-timer.data = (unsigned long)ghes;
-   init_timer_deferrable(ghes-timer);
+   setup_timer_deferrable(ghes-timer, ghes_poll_func,
+   (unsigned long)ghes);
ghes_add_timer(ghes);
break;
case ACPI_HEST_NOTIFY_EXTERNAL:
diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index 3cc7096..a811e6f 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -5616,9 +5616,8 @@ struct ata_port *ata_port_alloc(struct ata_host *host)
INIT_LIST_HEAD(ap-eh_done_q);
init_waitqueue_head(ap-eh_wait_q);
init_completion(ap-park_req_pending);
-   init_timer_deferrable(ap-fastdrain_timer);
-   ap-fastdrain_timer.function = ata_eh_fastdrain_timerfn;
-   ap-fastdrain_timer.data = (unsigned long)ap;
+   setup_timer_deferrable(ap-fastdrain_timer, ata_eh_fastdrain_timerfn,
+   (unsigned long)ap);
 
ap-cbl = ATA_CBL_NONE;
 
diff --git a/drivers/net/ethernet/nvidia/forcedeth.c 
b/drivers/net/ethernet/nvidia/forcedeth.c
index 876bece..987e2cf 100644
--- a/drivers/net/ethernet/nvidia/forcedeth.c
+++ b/drivers/net/ethernet/nvidia/forcedeth.c
@@ -5548,15 +5548,10 @@ static int __devinit nv_probe(struct pci_dev *pci_dev, 
const struct pci_device_i
spin_lock_init(np-hwstats_lock);
SET_NETDEV_DEV(dev, pci_dev-dev);
 
-   init_timer(np-oom_kick);
-   np-oom_kick.data = (unsigned long) dev;
-   np-oom_kick.function = nv_do_rx_refill;/* timer handler */
-   init_timer(np-nic_poll);
-   np-nic_poll.data = (unsigned long) dev;
-   np-nic_poll.function = nv_do_nic_poll; /* timer handler */
-   init_timer_deferrable(np-stats_poll);
-   np-stats_poll.data = (unsigned long) dev;
-   np-stats_poll.function = nv_do_stats_poll; /* timer handler */
+   setup_timer(np-oom_kick, nv_do_rx_refill, (unsigned long)dev);
+   setup_timer(np-nic_poll, nv_do_nic_poll, (unsigned long)dev);
+   setup_timer_deferrable(np-stats_poll, nv_do_stats_poll,
+   (unsigned long)dev);
 
err = pci_enable_device(pci_dev);
if (err)
diff --git a/drivers/net/ethernet/qlogic/qlge/qlge_main.c 
b/drivers/net/ethernet/qlogic/qlge/qlge_main.c
index b262d61..1687883 100644
--- a/drivers/net/ethernet/qlogic/qlge/qlge_main.c
+++ b/drivers/net/ethernet/qlogic/qlge/qlge_main.c
@@ -4707,9 +4707,7 @@ static int __devinit qlge_probe(struct pci_dev *pdev,
/* Start up the timer to trigger EEH if
 * the bus goes dead
 */
-   init_timer_deferrable(qdev-timer);
-   qdev-timer.data = (unsigned long)qdev;
-   qdev-timer.function = ql_timer;
+   setup_timer_deferrable(qdev-timer, ql_timer, (unsigned long)qdev);
qdev-timer.expires = jiffies + (5*HZ);
add_timer(qdev-timer);
ql_link_off(qdev);
diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 607976c..743c7d7 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -995,9 +995,8 @@ static void vxlan_setup(struct net_device *dev)
 
spin_lock_init(vxlan-hash_lock);
 
-   init_timer_deferrable(vxlan-age_timer);
-   vxlan-age_timer.function = vxlan_cleanup;
-   vxlan-age_timer.data = (unsigned long) vxlan;
+   setup_timer_deferrable(vxlan-age_timer, vxlan_cleanup,
+   (unsigned long)vxlan);
 
inet_get_local_port_range(low, high);
vxlan-port_min = low;
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index d951daa..77d6f62 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3845,10 +3845,8 @@ static int __init init_workqueues(void)
INIT_LIST_HEAD(pool-worklist);
INIT_LIST_HEAD(pool-idle_list);
 
-   init_timer_deferrable

[PATCH 1/2] timer: add setup_timer_deferrable() macro

2012-10-18 Thread Joonsoo Kim
There are some users of deferrable timers. Because a handy initializer
is lacking, they have to initialize deferrable timers by hand.
We can do better with a new setup_timer_deferrable() macro.
So add it.

The following patch makes some users of init_timer_deferrable() use
this handy macro.

Signed-off-by: Joonsoo Kim js1...@gmail.com

diff --git a/include/linux/timer.h b/include/linux/timer.h
index 8c5a197..5950276 100644
--- a/include/linux/timer.h
+++ b/include/linux/timer.h
@@ -151,6 +151,8 @@ static inline void init_timer_on_stack_key(struct 
timer_list *timer,
 
 #define setup_timer(timer, fn, data)   \
__setup_timer((timer), (fn), (data), 0)
+#define setup_timer_deferrable(timer, fn, data)
\
+   __setup_timer((timer), (fn), (data), TIMER_DEFERRABLE)
 #define setup_timer_on_stack(timer, fn, data)  \
__setup_timer_on_stack((timer), (fn), (data), 0)
 #define setup_deferrable_timer_on_stack(timer, fn, data)   \
-- 
1.7.9.5
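
What the macro buys can be sketched outside the kernel; the fake_* struct
and macros below are simplified stand-ins for the timer API, not the real
definitions.

#include <stdio.h>

struct fake_timer {
    void (*function)(unsigned long);
    unsigned long data;
    unsigned int flags;
};

#define FAKE_TIMER_DEFERRABLE 0x1u

/* One call replaces setting .function/.data and the flags by hand. */
#define fake_setup_timer_flags(t, fn, d, fl)    \
    do {                                        \
        (t)->function = (fn);                   \
        (t)->data = (d);                        \
        (t)->flags = (fl);                      \
    } while (0)

#define fake_setup_timer_deferrable(t, fn, d)   \
    fake_setup_timer_flags((t), (fn), (d), FAKE_TIMER_DEFERRABLE)

static void poll_func(unsigned long data)
{
    printf("timer fired, data=%lu\n", data);
}

int main(void)
{
    struct fake_timer t;

    fake_setup_timer_deferrable(&t, poll_func, 42UL);
    t.function(t.data);     /* pretend the timer expired */
    return 0;
}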



[PATCH 0/2] clean-up initialization of deferrable timer

2012-10-18 Thread Joonsoo Kim
This patchset introduces setup_timer_deferrable() macro.
Using it makes code simple and understandable.

This patchset doesn't make any functional difference.
It is just for clean-up.

It is based on v3.7-rc1

Joonsoo Kim (2):
  timer: add setup_timer_deferrable() macro
  timer: use new setup_timer_deferrable() macro

 drivers/acpi/apei/ghes.c |5 ++---
 drivers/ata/libata-core.c|5 ++---
 drivers/net/ethernet/nvidia/forcedeth.c  |   13 -
 drivers/net/ethernet/qlogic/qlge/qlge_main.c |4 +---
 drivers/net/vxlan.c  |5 ++---
 include/linux/timer.h|2 ++
 kernel/workqueue.c   |6 ++
 net/mac80211/agg-rx.c|   12 ++--
 net/mac80211/agg-tx.c|   12 ++--
 net/sched/cls_flow.c |5 ++---
 net/sched/sch_sfq.c  |5 ++---
 11 files changed, 31 insertions(+), 43 deletions(-)

-- 
1.7.9.5



Re: [Q] Default SLAB allocator

2012-10-18 Thread JoonSoo Kim
Hello, Eric.
Thank you very much for a kind comment about my question.
I have one more question related to network subsystem.
Please let me know what I misunderstand.

2012/10/14 Eric Dumazet eric.duma...@gmail.com:
 In latest kernels, skb->head no longer uses kmalloc()/kfree(), so SLAB vs
 SLUB is less a concern for network loads.

 In 3.7 (commit 69b08f62e17) we use fragments of order-3 pages to
 populate skb->head.

You mentioned that in the latest kernel skb->head no longer uses kmalloc()/kfree().
But why is the result of David's netperf RR test on v3.6 differentiated
by the choice of allocator?
As far as I know, __netdev_alloc_frag() was introduced in v3.5, so
I'm just confused.
Does this test use __netdev_alloc_skb() with __GFP_WAIT | GFP_DMA?

Does normal workload for network use __netdev_alloc_skb with
__GFP_WAIT | GFP_DMA?

Thanks!


[PATCH 1/3] mm: correct return value of migrate_pages()

2012-07-16 Thread Joonsoo Kim
migrate_pages() should return the number of pages not migrated or an error code.
When unmap_and_move() returns -EAGAIN, the outer loop is re-executed without
initialising nr_failed. This makes nr_failed over-counted.

So this patch corrects it by initialising nr_failed in the outer loop.

Signed-off-by: Joonsoo Kim js1...@gmail.com
Cc: Christoph Lameter c...@linux.com

diff --git a/mm/migrate.c b/mm/migrate.c
index be26d5c..294d52a 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -982,6 +982,7 @@ int migrate_pages(struct list_head *from,
 
for(pass = 0; pass < 10 && retry; pass++) {
retry = 0;
+   nr_failed = 0;
 
list_for_each_entry_safe(page, page2, from, lru) {
cond_resched();
-- 
1.7.9.5
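
The over-counting can be reproduced with a small user-space sketch; the pass
counts below are invented, and the commented-out reset marks the line this
patch effectively adds.

#include <stdio.h>

#define PASSES  3
#define ITEMS   4

int main(void)
{
    /* attempts_needed[i]: pass on which item i finally "migrates" */
    int attempts_needed[ITEMS] = { 1, 2, 2, 3 };
    int nr_failed = 0;
    int pass, i;

    for (pass = 1; pass <= PASSES; pass++) {
        /* The fix: reset the counter at the top of every pass. */
        /* nr_failed = 0; */
        for (i = 0; i < ITEMS; i++)
            if (attempts_needed[i] > pass)
                nr_failed++;
    }
    /* Without the reset this prints 4, although every item migrated. */
    printf("nr_failed = %d\n", nr_failed);
    return 0;
}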



[PATCH 2/3] mm: fix possible incorrect return value of migrate_pages() syscall

2012-07-16 Thread Joonsoo Kim
do_migrate_pages() can return the number of pages not migrated.
Because the migrate_pages() syscall returns this value directly,
the migrate_pages() syscall may return the number of pages not migrated.
In the failure case, the migrate_pages() syscall should return an error value.
So change err to -EIO.

Additionally, correct the comment above do_migrate_pages().

Signed-off-by: Joonsoo Kim js1...@gmail.com
Cc: Sasha Levin levinsasha...@gmail.com
Cc: Christoph Lameter c...@linux.com

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 1d771e4..f7df271 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -948,7 +948,7 @@ static int migrate_to_node(struct mm_struct *mm, int 
source, int dest,
  * Move pages between the two nodesets so as to preserve the physical
  * layout as much as possible.
  *
- * Returns the number of page that could not be moved.
+ * Returns error or the number of pages not migrated.
  */
 int do_migrate_pages(struct mm_struct *mm, const nodemask_t *from,
 const nodemask_t *to, int flags)
@@ -1382,6 +1382,8 @@ SYSCALL_DEFINE4(migrate_pages, pid_t, pid, unsigned long, 
maxnode,
 
err = do_migrate_pages(mm, old, new,
capable(CAP_SYS_NICE) ? MPOL_MF_MOVE_ALL : MPOL_MF_MOVE);
+   if (err > 0)
+   err = -EIO;
 
mmput(mm);
 out:
-- 
1.7.9.5



[PATCH 3/3] mm: fix return value in __alloc_contig_migrate_range()

2012-07-16 Thread Joonsoo Kim
migrate_pages() can return a positive value in some failure cases,
so 'ret > 0 ? 0 : ret' may be wrong.
This fixes it and removes one dead statement.

Signed-off-by: Joonsoo Kim js1...@gmail.com
Cc: Michal Nazarewicz min...@mina86.com
Cc: Marek Szyprowski m.szyprow...@samsung.com
Cc: Minchan Kim minc...@kernel.org
Cc: Christoph Lameter c...@linux.com

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4403009..02d4519 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5673,7 +5673,6 @@ static int __alloc_contig_migrate_range(unsigned long 
start, unsigned long end)
}
tries = 0;
} else if (++tries == 5) {
-   ret = ret < 0 ? ret : -EBUSY;
break;
}
 
@@ -5683,7 +5682,7 @@ static int __alloc_contig_migrate_range(unsigned long 
start, unsigned long end)
}
 
putback_lru_pages(&cc.migratepages);
-   return ret > 0 ? 0 : ret;
+   return ret <= 0 ? ret : -EBUSY;
 }
 
 /*
-- 
1.7.9.5
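
The intended mapping of return values can be sketched as below, assuming (as
the changelog says) that a positive ret means some pages were not migrated.

#include <errno.h>
#include <stdio.h>

static int map_migrate_result(int ret)
{
    /* 0 and negative errors pass through; positive counts become -EBUSY. */
    return ret <= 0 ? ret : -EBUSY;
}

int main(void)
{
    printf("%d %d %d\n",
           map_migrate_result(0),     /* success          ->  0     */
           map_migrate_result(3),     /* 3 pages left     -> -EBUSY */
           map_migrate_result(-12));  /* -ENOMEM passes through     */
    return 0;
}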



[PATCH 4] mm: fix possible incorrect return value of move_pages() syscall

2012-07-16 Thread Joonsoo Kim
The move_pages() syscall may return success in the case that
do_move_page_to_node_array() returns a positive value, which means migration failed.
This patch changes the return value of do_move_page_to_node_array()
so that it does not return a positive value. That fixes the problem.

Signed-off-by: Joonsoo Kim js1...@gmail.com
Cc: Brice Goglin br...@myri.com
Cc: Christoph Lameter c...@linux.com
Cc: Minchan Kim minc...@kernel.org

diff --git a/mm/migrate.c b/mm/migrate.c
index 294d52a..adabaf4 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1171,7 +1171,7 @@ set_status:
}
 
up_read(&mm->mmap_sem);
-   return err;
+   return err > 0 ? -EIO : err;
 }
 
 /*
-- 
1.7.9.5



Re: [PATCH 1/3] mm: correct return value of migrate_pages()

2012-07-16 Thread JoonSoo Kim
2012/7/17 Christoph Lameter c...@linux.com:
 On Tue, 17 Jul 2012, Joonsoo Kim wrote:

 migrate_pages() should return number of pages not migrated or error code.
 When unmap_and_move return -EAGAIN, outer loop is re-execution without
 initialising nr_failed. This makes nr_failed over-counted.

 The intention of the nr_failed was only to give an indication as to how
 many attempts were made. The failed pages were on a separate queue that
 seems to have vanished.

 So this patch correct it by initialising nr_failed in outer loop.

 Well yea it makes sense since retry is initialized there as well.

 Acked-by: Christoph Lameter c...@linux.com

Thanks for comment.

Additionally, I find that migrate_huge_pages() needs the identical fix
as migrate_pages().

@@ -1029,6 +1030,7 @@ int migrate_huge_pages(struct list_head *from,

for (pass = 0; pass < 10 && retry; pass++) {
retry = 0;
+   nr_failed = 0;

list_for_each_entry_safe(page, page2, from, lru) {
cond_resched();

When I resend with this, could I include Acked-by: Christoph Lameter
c...@linux.com?


Re: [PATCH 1/3] mm: correct return value of migrate_pages()

2012-07-16 Thread JoonSoo Kim
2012/7/17 Michal Nazarewicz min...@tlen.pl:
 Acked-by: Michal Nazarewicz min...@mina86.com

Thanks.

 Actually, it makes me wonder if there is any code that uses this
 information.  If not, it would be best in my opinion to make it return
 zero or negative error code, but that would have to be checked.

I think that, too.
I looked at every call site of migrate_pages() and there is no place
which really needs the failure count.
This function sometimes makes callers error-prone,
so I think changing the return value is preferable.

How do you think, Christoph?


Re: [PATCH 2/3] mm: fix possible incorrect return value of migrate_pages() syscall

2012-07-16 Thread JoonSoo Kim
2012/7/17 Michal Nazarewicz min...@tlen.pl:
 Joonsoo Kim js1...@gmail.com writes:
 do_migrate_pages() can return the number of pages not migrated.
 Because migrate_pages() syscall return this value directly,
 migrate_pages() syscall may return the number of pages not migrated.
 In fail case in migrate_pages() syscall, we should return error value.
 So change err to -EIO

 Additionally, Correct comment above do_migrate_pages()

 Signed-off-by: Joonsoo Kim js1...@gmail.com
 Cc: Sasha Levin levinsasha...@gmail.com
 Cc: Christoph Lameter c...@linux.com

 Acked-by: Michal Nazarewicz min...@mina86.com

Thanks.

When I resend with changing -EIO to -EBUSY,
could I include Acked-by: Michal Nazarewicz min...@mina86.com?


Re: [PATCH 3/3] mm: fix return value in __alloc_contig_migrate_range()

2012-07-16 Thread JoonSoo Kim
2012/7/17 Michal Nazarewicz min...@mina86.com:
 Joonsoo Kim js1...@gmail.com writes:

 migrate_pages() would return positive value in some failure case,
 so 'ret  0 ? 0 : ret' may be wrong.
 This fix it and remove one dead statement.

 Signed-off-by: Joonsoo Kim js1...@gmail.com
 Cc: Michal Nazarewicz min...@mina86.com
 Cc: Marek Szyprowski m.szyprow...@samsung.com
 Cc: Minchan Kim minc...@kernel.org
 Cc: Christoph Lameter c...@linux.com

 Have you actually encountered this problem?  If migrate_pages() fails
 with a positive value, the code that you are removing kicks in and
 -EBUSY is assigned to ret (now that I look at it, I think that in the
 current code the 'return ret > 0 ? 0 : ret;' statement could be reduced
 to return ret;).  Your code seems to be cleaner, but the commit
 message does not look accurate to me.


I haven't encountered this problem yet.

If migrate_pages() with offlining == false meets a KSM page, then migration fails.
In this case, the failed page is removed from the cc.migratepages list and
the failed count is returned.
So it is possible to exit the loop without '++tries == 5' ever being tested,
and ret is over zero.
Is there any point which I am missing?
Is there any possible scenario where migrate_pages() returns > 0 and
cc.migratepages is empty?

I'm not an expert on MM, so please comment on my humble opinion.


[PATCH 1/4 v2] mm: correct return value of migrate_pages() and migrate_huge_pages()

2012-07-17 Thread Joonsoo Kim
migrate_pages() should return the number of pages not migrated or an error code.
When unmap_and_move() returns -EAGAIN, the outer loop is re-executed without
initialising nr_failed. This makes nr_failed over-counted.

So this patch corrects it by initialising nr_failed in the outer loop.

migrate_huge_pages() is an identical case to migrate_pages().

Signed-off-by: Joonsoo Kim js1...@gmail.com
Cc: Christoph Lameter c...@linux.com
Acked-by: Christoph Lameter c...@linux.com
Acked-by: Michal Nazarewicz min...@mina86.com

diff --git a/mm/migrate.c b/mm/migrate.c
index be26d5c..f495c58 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -982,6 +982,7 @@ int migrate_pages(struct list_head *from,
 
for(pass = 0; pass < 10 && retry; pass++) {
retry = 0;
+   nr_failed = 0;
 
list_for_each_entry_safe(page, page2, from, lru) {
cond_resched();
@@ -1029,6 +1030,7 @@ int migrate_huge_pages(struct list_head *from,
 
for (pass = 0; pass < 10 && retry; pass++) {
retry = 0;
+   nr_failed = 0;
 
list_for_each_entry_safe(page, page2, from, lru) {
cond_resched();
-- 
1.7.9.5



[PATCH 2/4 v2] mm: fix possible incorrect return value of migrate_pages() syscall

2012-07-17 Thread Joonsoo Kim
do_migrate_pages() can return the number of pages not migrated.
Because the migrate_pages() syscall returns this value directly,
the migrate_pages() syscall may return the number of pages not migrated.
In the failure case, the migrate_pages() syscall should return an error value,
so change err to -EBUSY.

Additionally, correct the comment above do_migrate_pages().

Signed-off-by: Joonsoo Kim js1...@gmail.com
Cc: Sasha Levin levinsasha...@gmail.com
Cc: Christoph Lameter c...@linux.com

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 1d771e4..0732729 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -948,7 +948,7 @@ static int migrate_to_node(struct mm_struct *mm, int 
source, int dest,
  * Move pages between the two nodesets so as to preserve the physical
  * layout as much as possible.
  *
- * Returns the number of page that could not be moved.
+ * Returns error or the number of pages not migrated.
  */
 int do_migrate_pages(struct mm_struct *mm, const nodemask_t *from,
 const nodemask_t *to, int flags)
@@ -1382,6 +1382,8 @@ SYSCALL_DEFINE4(migrate_pages, pid_t, pid, unsigned long, 
maxnode,
 
err = do_migrate_pages(mm, old, new,
capable(CAP_SYS_NICE) ? MPOL_MF_MOVE_ALL : MPOL_MF_MOVE);
+   if (err > 0)
+   err = -EBUSY;
 
mmput(mm);
 out:
-- 
1.7.9.5



[PATCH 4/4 v2] mm: fix possible incorrect return value of move_pages() syscall

2012-07-17 Thread Joonsoo Kim
The move_pages() syscall may return success in the case that
do_move_page_to_node_array() returns a positive value, which means migration failed.
This patch changes the return value of do_move_page_to_node_array()
so that it does not return a positive value. It can fix the problem.

Signed-off-by: Joonsoo Kim js1...@gmail.com
Cc: Brice Goglin br...@myri.com
Cc: Christoph Lameter c...@linux.com
Cc: Minchan Kim minc...@kernel.org

diff --git a/mm/migrate.c b/mm/migrate.c
index f495c58..eeaf409 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1172,7 +1172,7 @@ set_status:
}
 
up_read(&mm->mmap_sem);
-   return err;
+   return err > 0 ? -EBUSY : err;
 }
 
 /*
-- 
1.7.9.5



[PATCH 3/4 v2] mm: fix return value in __alloc_contig_migrate_range()

2012-07-17 Thread Joonsoo Kim
migrate_pages() would return a positive value in some failure cases,
so 'ret > 0 ? 0 : ret' may be wrong.
This fixes it and removes one dead statement.

Signed-off-by: Joonsoo Kim js1...@gmail.com
Cc: Michal Nazarewicz min...@mina86.com
Cc: Marek Szyprowski m.szyprow...@samsung.com
Cc: Minchan Kim minc...@kernel.org
Cc: Christoph Lameter c...@linux.com
Acked-by: Christoph Lameter c...@linux.com

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4403009..02d4519 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5673,7 +5673,6 @@ static int __alloc_contig_migrate_range(unsigned long 
start, unsigned long end)
}
tries = 0;
} else if (++tries == 5) {
-   ret = ret < 0 ? ret : -EBUSY;
break;
}
 
@@ -5683,7 +5682,7 @@ static int __alloc_contig_migrate_range(unsigned long 
start, unsigned long end)
}
 
putback_lru_pages(&cc.migratepages);
-   return ret > 0 ? 0 : ret;
+   return ret <= 0 ? ret : -EBUSY;
 }
 
 /*
-- 
1.7.9.5



Re: [PATCH 2/4 v2] mm: fix possible incorrect return value of migrate_pages() syscall

2012-07-17 Thread JoonSoo Kim
2012/7/17 Christoph Lameter c...@linux.com:
 On Tue, 17 Jul 2012, Joonsoo Kim wrote:

 @@ -1382,6 +1382,8 @@ SYSCALL_DEFINE4(migrate_pages, pid_t, pid, unsigned 
 long, maxnode,

   err = do_migrate_pages(mm, old, new,
   capable(CAP_SYS_NICE) ? MPOL_MF_MOVE_ALL : MPOL_MF_MOVE);
 + if (err > 0)
 + err = -EBUSY;

   mmput(mm);
  out:

 Why not have do_migrate_pages() return EBUSY if we do not need the number
 of failed/retried pages?

There is no serious reason.
do_migrate_pages() has two callsites, although the other one doesn't
use the return value.
do_migrate_pages() is commented "Returns the number of page that could not be moved."
And my focus is on fixing a possible error in the migrate_pages() syscall.
So I keep returning the number of failed/retried pages.

If we really think the number of failed/retried pages is useless, then,
instead of having do_migrate_pages() return -EBUSY, we can make migrate_pages()
return -EBUSY. I think it is better to fix all the related code in one go.


Re: [PATCH 3/4 v2] mm: fix return value in __alloc_contig_migrate_range()

2012-07-17 Thread JoonSoo Kim
2012/7/17 Michal Nazarewicz min...@mina86.com:
 On Tue, 17 Jul 2012 14:33:34 +0200, Joonsoo Kim js1...@gmail.com wrote:

 migrate_pages() would return a positive value in some failure cases,
 so 'ret > 0 ? 0 : ret' may be wrong.
 This fixes it and removes one dead statement.


 How about the following message:

 --- 8< ---
 migrate_pages() can return positive value while at the same time emptying
 the list of pages it was called with.  Such situation means that it went
 through all the pages on the list some of which failed to be migrated.

 If that happens, __alloc_contig_migrate_range()'s loop may finish without
 ++tries == 5 never being checked.  This in turn means that at the end
 of the function, ret may have a positive value, which should be treated
 as an error.

 This patch changes __alloc_contig_migrate_range() so that the return
 statement converts positive ret value into -EBUSY error.
 --- 8< ---

It's good.
I will resend the patch with my commit message replaced by yours.
Thanks for the help.


[PATCH 3/4 v3] mm: fix return value in __alloc_contig_migrate_range()

2012-07-17 Thread Joonsoo Kim
migrate_pages() can return positive value while at the same time emptying
the list of pages it was called with.  Such situation means that it went
through all the pages on the list some of which failed to be migrated.

If that happens, __alloc_contig_migrate_range()'s loop may finish without
++tries == 5 never being checked.  This in turn means that at the end
of the function, ret may have a positive value, which should be treated
as an error.

This patch changes __alloc_contig_migrate_range() so that the return
statement converts positive ret value into -EBUSY error.

Signed-off-by: Joonsoo Kim js1...@gmail.com
Cc: Michal Nazarewicz min...@mina86.com
Cc: Marek Szyprowski m.szyprow...@samsung.com
Cc: Minchan Kim minc...@kernel.org
Cc: Christoph Lameter c...@linux.com
Acked-by: Christoph Lameter c...@linux.com
Acked-by: Michal Nazarewicz min...@mina86.com

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4403009..02d4519 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5673,7 +5673,6 @@ static int __alloc_contig_migrate_range(unsigned long 
start, unsigned long end)
}
tries = 0;
} else if (++tries == 5) {
-   ret = ret < 0 ? ret : -EBUSY;
break;
}
 
@@ -5683,7 +5682,7 @@ static int __alloc_contig_migrate_range(unsigned long 
start, unsigned long end)
}
 
putback_lru_pages(&cc.migratepages);
-   return ret > 0 ? 0 : ret;
+   return ret <= 0 ? ret : -EBUSY;
 }
 
 /*
-- 
1.7.9.5



[PATCH] mm: fix wrong argument of migrate_huge_pages() in soft_offline_huge_page()

2012-07-17 Thread Joonsoo Kim
Commit a6bc32b899223a877f595ef9ddc1e89ead5072b8 ('mm: compaction: introduce
sync-light migration for use by compaction') changed the declarations of
migrate_pages() and migrate_huge_pages().
But it missed changing the argument of migrate_huge_pages()
in soft_offline_huge_page(). In this case, we should call it with MIGRATE_SYNC.
So change it.

Additionally, there is a mismatch between the type of an argument and the
function declaration for migrate_pages(). So fix this simple case, too.

Signed-off-by: Joonsoo Kim js1...@gmail.com
Cc: Christoph Lameter c...@linux.com
Cc: Mel Gorman mgor...@suse.de

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index ab1e714..afde561 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1431,8 +1431,8 @@ static int soft_offline_huge_page(struct page *page, int 
flags)
/* Keep page count to indicate a given hugepage is isolated. */
 
list_add(hpage-lru, pagelist);
-   ret = migrate_huge_pages(pagelist, new_page, MPOL_MF_MOVE_ALL, 0,
-   true);
+   ret = migrate_huge_pages(pagelist, new_page, MPOL_MF_MOVE_ALL, false,
+   MIGRATE_SYNC);
if (ret) {
struct page *page1, *page2;
list_for_each_entry_safe(page1, page2, pagelist, lru)
@@ -1561,7 +1561,7 @@ int soft_offline_page(struct page *page, int flags)
page_is_file_cache(page));
list_add(page-lru, pagelist);
ret = migrate_pages(pagelist, new_page, MPOL_MF_MOVE_ALL,
-   0, MIGRATE_SYNC);
+   false, MIGRATE_SYNC);
if (ret) {
putback_lru_pages(pagelist);
pr_info("soft offline: %#lx: migration failed %d, type %lx\n",
-- 
1.7.9.5



Re: [PATCH v2 0/6] correct load_balance()

2013-04-22 Thread Joonsoo Kim
On Tue, Mar 26, 2013 at 03:01:34PM +0900, Joonsoo Kim wrote:
 Commit 88b8dac0 makes load_balance() consider other cpus in its group.
 But, there are some missing parts for this feature to work properly.
 This patchset corrects these things and makes load_balance() robust.
 
 Others are related to LBF_ALL_PINNED. This is fallback functionality
 for when all tasks can't be moved due to cpu affinity. But, currently,
 if the imbalance is not large enough for the task's load, we leave the
 LBF_ALL_PINNED flag set and a 'redo' is triggered. This is not our intention, so correct it.
 
 These are based on v3.9-rc4.
 
 Changelog
 v1->v2: Changes from Peter's suggestion
  [4/6]: don't include a code to evaluate load value in can_migrate_task()
  [5/6]: rename load_balance_tmpmask to load_balance_mask
  [6/6]: not use one more cpumasks, use env's cpus for prevent to re-select
 
 Joonsoo Kim (6):
   sched: change position of resched_cpu() in load_balance()
   sched: explicitly cpu_idle_type checking in rebalance_domains()
   sched: don't consider other cpus in our group in case of NEWLY_IDLE
   sched: move up affinity check to mitigate useless redoing overhead
   sched: rename load_balance_tmpmask to load_balance_mask
   sched: prevent to re-select dst-cpu in load_balance()
 
  kernel/sched/core.c |4 +--
  kernel/sched/fair.c |   67 
 +++
  2 files changed, 38 insertions(+), 33 deletions(-)

Hello, Ingo and Peter.

Just a ping for this patchset.
Please let me know what I have to do to get this patchset merged.

Thanks.

 
 -- 
 1.7.9.5
 


Re: [PATCH v2 1/4] kprobes: delay blacklist symbol lookup until we actually need it

2013-04-04 Thread Joonsoo Kim
Hello, Oskar.

On Thu, Apr 04, 2013 at 02:51:26PM +0200, Oskar Andero wrote:
 From: Toby Collett toby.coll...@sonymobile.com
 
 The symbol lookup can take a long time and kprobes is
 initialised very early in boot, so delay symbol lookup
 until the blacklist is first used.
 
 Cc: Masami Hiramatsu masami.hiramatsu...@hitachi.com
 Cc: David S. Miller da...@davemloft.net
 Reviewed-by: Radovan Lekanovic radovan.lekano...@sonymobile.com
 Signed-off-by: Toby Collett toby.coll...@sonymobile.com
 Signed-off-by: Oskar Andero oskar.and...@sonymobile.com
 ---
  kernel/kprobes.c | 98 
 ++--
  1 file changed, 60 insertions(+), 38 deletions(-)
 
 diff --git a/kernel/kprobes.c b/kernel/kprobes.c
 index e35be53..0a270e5 100644
 --- a/kernel/kprobes.c
 +++ b/kernel/kprobes.c
 @@ -68,6 +68,7 @@
  #endif
  
  static int kprobes_initialized;
 +static int kprobe_blacklist_initialized;
  static struct hlist_head kprobe_table[KPROBE_TABLE_SIZE];
  static struct hlist_head kretprobe_inst_table[KPROBE_TABLE_SIZE];
  
 @@ -102,6 +103,60 @@ static struct kprobe_blackpoint kprobe_blacklist[] = {
   {NULL}/* Terminator */
  };
  
 +/* it can take some time ( > 100ms ) to initialise the
 + * blacklist so we delay this until we actually need it
 + */
 +static void init_kprobe_blacklist(void)
 +{
 + int i;
 + unsigned long offset = 0, size = 0;
 + char *modname, namebuf[128];
 + const char *symbol_name;
 + void *addr;
 + struct kprobe_blackpoint *kb;
 +
 + mutex_lock(kprobe_mutex);
 + if (kprobe_blacklist_initialized)
 + goto out;
 +
 + /*
 +  * Lookup and populate the kprobe_blacklist.
 +  *
 +  * Unlike the kretprobe blacklist, we'll need to determine
 +  * the range of addresses that belong to the said functions,
 +  * since a kprobe need not necessarily be at the beginning
 +  * of a function.
 +  */
 + for (kb = kprobe_blacklist; kb-name != NULL; kb++) {
 + kprobe_lookup_name(kb-name, addr);
 + if (!addr)
 + continue;
 +
 + kb-start_addr = (unsigned long)addr;
 + symbol_name = kallsyms_lookup(kb-start_addr,
 + size, offset, modname, namebuf);
 + if (!symbol_name)
 + kb-range = 0;
 + else
 + kb-range = size;
 + }
 +
 + if (kretprobe_blacklist_size) {
 + /* lookup the function address from its name */
 + for (i = 0; kretprobe_blacklist[i].name != NULL; i++) {
 + kprobe_lookup_name(kretprobe_blacklist[i].name,
 +kretprobe_blacklist[i].addr);
 + if (!kretprobe_blacklist[i].addr)
 + printk("kretprobe: lookup failed: %s\n",
 +kretprobe_blacklist[i].name);
 + }
 + }
 + kprobe_blacklist_initialized = 1;

You need smp_wmb() before assigning 'kprobe_blacklist_initialized = 1'.
This guarantees that whoever sees kprobe_blacklist_initialized == 1 will also
see the updated contents of kprobe_blacklist.

Please refer to my previous patch once more :)

And how about defining kprobe_blacklist_initialized as a boolean?

Thanks.
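
Just to illustrate the ordering I mean, a rough sketch on top of your patch
(untested, only showing the barrier pairing):

	/* writer side: at the end of init_kprobe_blacklist(),
	 * still under kprobe_mutex
	 */
	smp_wmb();	/* publish the populated blacklist before the flag */
	kprobe_blacklist_initialized = 1;

	/* reader side, e.g. in in_kprobes_functions() */
	if (unlikely(!kprobe_blacklist_initialized))
		init_kprobe_blacklist();
	else
		smp_rmb();	/* pairs with the smp_wmb() above */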

 +
 +out:
 + mutex_unlock(kprobe_mutex);
 +}
 +
  #ifdef __ARCH_WANT_KPROBES_INSN_SLOT
  /*
   * kprobe-ainsn.insn points to the copy of the instruction to be
 @@ -1331,6 +1386,9 @@ static int __kprobes in_kprobes_functions(unsigned long 
 addr)
  if (addr >= (unsigned long)__kprobes_text_start &&
  addr < (unsigned long)__kprobes_text_end)
   return -EINVAL;
 +
 + if (unlikely(!kprobe_blacklist_initialized))
 + init_kprobe_blacklist();
   /*
* If there exists a kprobe_blacklist, verify and
* fail any probe registration in the prohibited area
 @@ -1816,6 +1874,8 @@ int __kprobes register_kretprobe(struct kretprobe *rp)
   void *addr;
  
   if (kretprobe_blacklist_size) {
 + if (unlikely(!kprobe_blacklist_initialized))
 + init_kprobe_blacklist();
   addr = kprobe_addr(rp-kp);
   if (IS_ERR(addr))
   return PTR_ERR(addr);
 @@ -2065,11 +2125,6 @@ static struct notifier_block kprobe_module_nb = {
  static int __init init_kprobes(void)
  {
   int i, err = 0;
 - unsigned long offset = 0, size = 0;
 - char *modname, namebuf[128];
 - const char *symbol_name;
 - void *addr;
 - struct kprobe_blackpoint *kb;
  
   /* FIXME allocate the probe table, currently defined statically */
   /* initialize all list heads */
 @@ -2079,39 +2134,6 @@ static int __init init_kprobes(void)
   raw_spin_lock_init((kretprobe_table_locks[i].lock));
   }
  
 - /*
 -  * Lookup and populate the kprobe_blacklist.
 -  *
 -  * Unlike the kretprobe blacklist, we'll need to determine
 -  * the 

Re: [RT LATENCY] 249 microsecond latency caused by slub's unfreeze_partials() code.

2013-04-04 Thread Joonsoo Kim
On Thu, Apr 04, 2013 at 01:53:25PM +, Christoph Lameter wrote:
 On Thu, 4 Apr 2013, Joonsoo Kim wrote:
 
  Pekka alreay applied it.
  Do we need update?
 
 Well I thought the passing of the count via lru.next would be something
 worthwhile to pick up.
 

Hello, Pekka.

Here goes a patch implementing Christoph's idea.
Instead of updating my previous patch, I re-wrote this patch on top of
your slab/next tree.

Thanks.

8<---
From e1c18793dd2a9d9cef87b07faf975364b71276d7 Mon Sep 17 00:00:00 2001
From: Joonsoo Kim iamjoonsoo@lge.com
Date: Fri, 5 Apr 2013 10:49:36 +0900
Subject: [PATCH] slub: use page-lru.next to calculate nr of acquired object

We can pass the inuse count via page->lru.next in order to calculate the number
of acquired objects, and it is a more beautiful way. This reduces one
function argument and makes the code cleaner.

Cc: Christoph Lameter c...@linux.com
Suggested-by: Christoph Lameter c...@linux.com
Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com

diff --git a/mm/slub.c b/mm/slub.c
index 21b3f00..8a35464 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1493,11 +1493,12 @@ static inline void remove_partial(struct 
kmem_cache_node *n,
  */
 static inline void *acquire_slab(struct kmem_cache *s,
struct kmem_cache_node *n, struct page *page,
-   int mode, int *objects)
+   int mode)
 {
void *freelist;
unsigned long counters;
struct page new;
+   unsigned long inuse;
 
/*
 * Zap the freelist and set the frozen bit.
@@ -1507,7 +1508,7 @@ static inline void *acquire_slab(struct kmem_cache *s,
freelist = page->freelist;
counters = page->counters;
new.counters = counters;
-   *objects = new.objects - new.inuse;
+   inuse = page->inuse;
if (mode) {
new.inuse = page->objects;
new.freelist = NULL;
@@ -1525,6 +1526,7 @@ static inline void *acquire_slab(struct kmem_cache *s,
return NULL;
 
remove_partial(n, page);
+   page->lru.next = (void *)inuse;
WARN_ON(!freelist);
return freelist;
 }
@@ -1541,7 +1543,6 @@ static void *get_partial_node(struct kmem_cache *s, 
struct kmem_cache_node *n,
struct page *page, *page2;
void *object = NULL;
int available = 0;
-   int objects;
 
/*
 * Racy check. If we mistakenly see no partial slabs then we
@@ -1559,11 +1560,11 @@ static void *get_partial_node(struct kmem_cache *s, 
struct kmem_cache_node *n,
if (!pfmemalloc_match(page, flags))
continue;
 
-   t = acquire_slab(s, n, page, object == NULL, objects);
+   t = acquire_slab(s, n, page, object == NULL);
if (!t)
break;
 
-   available += objects;
+   available += (page->objects - (unsigned long)page->lru.next);
if (!object) {
c-page = page;
stat(s, ALLOC_FROM_PARTIAL);
-- 
1.7.9.5



Re: [PATCH 4/5] sched: don't consider upper se in sched_slice()

2013-04-04 Thread Joonsoo Kim
Hello, Preeti.

On Thu, Apr 04, 2013 at 12:18:32PM +0530, Preeti U Murthy wrote:
 Hi Joonsoo,
 
 On 04/04/2013 06:12 AM, Joonsoo Kim wrote:
  Hello, Preeti.
 
  
  So, how about extending a sched_period with rq-nr_running, instead of
  cfs_rq-nr_running? It is my quick thought and I think that we can ensure
  to run atleast once in this extending sched_period.
 
 Yeah this seems to be correct.This would ensure sched_min_granularity
 also. So then in the scenarion where there are 2 tgs in a runqueue with
 10 tasks each,when we calculate the sched_slice of any task,the
 __sched_period() would return 4*20 = 80ms.
 
 The sched_slice of each of the task would be 80/20 = 4ms. But what about
 the sched_slice of each task group? How would that be calculated then?

Ah... Okay.
I will think more deeply about this issue.

 
 Let us take the above example and walk through this problem.This would
 probably help us spot the issues involved with this.
 
  And, do we leave a problem if we cannot guaranteed atleast once property?
 
 This would depend on the results of the benchmarks with the changes.I am
 unable to comment on this off the top of my head.

Okay. :)

Thanks for your kind review!!

 
 Thank you
 
 Regards
 Preeti U Murthy
 
 
 
 


[PATCH 2/3] mm, slub: count freed pages via rcu as this task's reclaimed_slab

2013-04-08 Thread Joonsoo Kim
Currently, pages freed via RCU are not counted in reclaimed_slab, because
they are freed in RCU context, not in the current task's context. But this
free is initiated by this task, so counting it in this task's reclaimed_slab
is meaningful for deciding whether we continue to reclaim or not.
So change the code to count these pages in this task's reclaimed_slab.

Cc: Christoph Lameter c...@linux-foundation.org
Cc: Pekka Enberg penb...@kernel.org
Cc: Matt Mackall m...@selenic.com
Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com

diff --git a/mm/slub.c b/mm/slub.c
index 4aec537..16fd2d5 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1409,8 +1409,6 @@ static void __free_slab(struct kmem_cache *s, struct page 
*page)
 
memcg_release_pages(s, order);
page_mapcount_reset(page);
-   if (current-reclaim_state)
-   current-reclaim_state-reclaimed_slab += pages;
__free_memcg_kmem_pages(page, order);
 }
 
@@ -1431,6 +1429,8 @@ static void rcu_free_slab(struct rcu_head *h)
 
 static void free_slab(struct kmem_cache *s, struct page *page)
 {
+   int pages = 1 << compound_order(page);
+
if (unlikely(s->flags & SLAB_DESTROY_BY_RCU)) {
struct rcu_head *head;
 
@@ -1450,6 +1450,9 @@ static void free_slab(struct kmem_cache *s, struct page 
*page)
call_rcu(head, rcu_free_slab);
} else
__free_slab(s, page);
+
+   if (current-reclaim_state)
+   current-reclaim_state-reclaimed_slab += pages;
 }
 
 static void discard_slab(struct kmem_cache *s, struct page *page)
-- 
1.7.9.5



[PATCH 3/3] mm, slab: count freed pages via rcu as this task's reclaimed_slab

2013-04-08 Thread Joonsoo Kim
Currently, pages freed via RCU are not counted in reclaimed_slab, because
they are freed in RCU context, not in the current task's context. But this
free is initiated by this task, so counting it in this task's reclaimed_slab
is meaningful for deciding whether we continue to reclaim or not.
So change the code to count these pages in this task's reclaimed_slab.

Cc: Christoph Lameter c...@linux-foundation.org
Cc: Pekka Enberg penb...@kernel.org
Cc: Matt Mackall m...@selenic.com
Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com

diff --git a/mm/slab.c b/mm/slab.c
index 856e4a1..4d94bcb 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1934,8 +1934,6 @@ static void kmem_freepages(struct kmem_cache *cachep, 
void *addr)
}
 
memcg_release_pages(cachep, cachep-gfporder);
-   if (current-reclaim_state)
-   current-reclaim_state-reclaimed_slab += nr_freed;
free_memcg_kmem_pages((unsigned long)addr, cachep-gfporder);
 }
 
@@ -2165,6 +2163,7 @@ static void slab_destroy_debugcheck(struct kmem_cache 
*cachep, struct slab *slab
  */
 static void slab_destroy(struct kmem_cache *cachep, struct slab *slabp)
 {
+   unsigned long nr_freed = (1 << cachep->gfporder);
void *addr = slabp->s_mem - slabp->colouroff;
 
slab_destroy_debugcheck(cachep, slabp);
@@ -2180,6 +2179,9 @@ static void slab_destroy(struct kmem_cache *cachep, 
struct slab *slabp)
if (OFF_SLAB(cachep))
kmem_cache_free(cachep-slabp_cache, slabp);
}
+
+   if (current-reclaim_state)
+   current-reclaim_state-reclaimed_slab += nr_freed;
 }
 
 /**
-- 
1.7.9.5



[PATCH 1/3] mm, vmscan: count accidental reclaimed pages failed to put into lru

2013-04-08 Thread Joonsoo Kim
In shrink_(in)active_list(), we can fail to put pages back onto the lru, and
these pages are reclaimed accidentally. Currently, these pages are not counted
in sc->nr_reclaimed, but with this information we can stop reclaiming
earlier, and so reduce the overhead of reclaim.

Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 0f615eb..5d60ae0 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -365,7 +365,7 @@ void *alloc_pages_exact_nid(int nid, size_t size, gfp_t 
gfp_mask);
 extern void __free_pages(struct page *page, unsigned int order);
 extern void free_pages(unsigned long addr, unsigned int order);
 extern void free_hot_cold_page(struct page *page, int cold);
-extern void free_hot_cold_page_list(struct list_head *list, int cold);
+extern unsigned long free_hot_cold_page_list(struct list_head *list, int cold);
 
 extern void __free_memcg_kmem_pages(struct page *page, unsigned int order);
 extern void free_memcg_kmem_pages(unsigned long addr, unsigned int order);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8fcced7..a5f3952 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1360,14 +1360,18 @@ out:
 /*
  * Free a list of 0-order pages
  */
-void free_hot_cold_page_list(struct list_head *list, int cold)
+unsigned long free_hot_cold_page_list(struct list_head *list, int cold)
 {
+   unsigned long nr_reclaimed = 0;
struct page *page, *next;
 
list_for_each_entry_safe(page, next, list, lru) {
trace_mm_page_free_batched(page, cold);
free_hot_cold_page(page, cold);
+   nr_reclaimed++;
}
+
+   return nr_reclaimed;
 }
 
 /*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 88c5fed..eff2927 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -915,7 +915,6 @@ static unsigned long shrink_page_list(struct list_head 
*page_list,
 */
__clear_page_locked(page);
 free_it:
-   nr_reclaimed++;
 
/*
 * Is there need to periodically free_page_list? It would
@@ -954,7 +953,7 @@ keep:
if (nr_dirty && nr_dirty == nr_congested && global_reclaim(sc))
zone_set_flag(zone, ZONE_CONGESTED);
 
-   free_hot_cold_page_list(free_pages, 1);
+   nr_reclaimed += free_hot_cold_page_list(free_pages, 1);
 
list_splice(ret_pages, page_list);
count_vm_events(PGACTIVATE, pgactivate);
@@ -1321,7 +1320,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct 
lruvec *lruvec,
if (nr_taken == 0)
return 0;
 
-   nr_reclaimed = shrink_page_list(page_list, zone, sc, TTU_UNMAP,
+   nr_reclaimed += shrink_page_list(page_list, zone, sc, TTU_UNMAP,
nr_dirty, nr_writeback, false);
 
spin_lock_irq(zone-lru_lock);
@@ -1343,7 +1342,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct 
lruvec *lruvec,
 
spin_unlock_irq(zone-lru_lock);
 
-   free_hot_cold_page_list(page_list, 1);
+   nr_reclaimed += free_hot_cold_page_list(page_list, 1);
 
/*
 * If reclaim is isolating dirty pages under writeback, it implies
@@ -1438,7 +1437,7 @@ static void move_active_pages_to_lru(struct lruvec 
*lruvec,
__count_vm_events(PGDEACTIVATE, pgmoved);
 }
 
-static void shrink_active_list(unsigned long nr_to_scan,
+static unsigned long shrink_active_list(unsigned long nr_to_scan,
   struct lruvec *lruvec,
   struct scan_control *sc,
   enum lru_list lru)
@@ -1534,7 +1533,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
__mod_zone_page_state(zone, NR_ISOLATED_ANON + file, -nr_taken);
spin_unlock_irq(zone-lru_lock);
 
-   free_hot_cold_page_list(l_hold, 1);
+   return free_hot_cold_page_list(l_hold, 1);
 }
 
 #ifdef CONFIG_SWAP
@@ -1617,7 +1616,8 @@ static unsigned long shrink_list(enum lru_list lru, 
unsigned long nr_to_scan,
 {
if (is_active_lru(lru)) {
if (inactive_list_is_low(lruvec, lru))
-   shrink_active_list(nr_to_scan, lruvec, sc, lru);
+   return shrink_active_list(nr_to_scan, lruvec, sc, lru);
+
return 0;
}
 
@@ -1861,8 +1861,8 @@ static void shrink_lruvec(struct lruvec *lruvec, struct 
scan_control *sc)
 * rebalance the anon lru active/inactive ratio.
 */
if (inactive_anon_is_low(lruvec))
-   shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
-  sc, LRU_ACTIVE_ANON);
+   sc-nr_reclaimed += shrink_active_list(SWAP_CLUSTER_MAX,
+   lruvec, sc, LRU_ACTIVE_ANON);
 
throttle_vm_writeout(sc-gfp_mask);
 }
@@ -2470,23 +2470,27 @@ unsigned long try_to_free_mem_cgroup_pages(struct 
mem_cgroup *memcg,
 }
 #endif
 
-static void age_active_anon(struct zone *zone, struct scan_control

Re: [PATCH 08/10] mm: vmscan: Have kswapd shrink slab only once per priority

2013-04-09 Thread Joonsoo Kim
Hello, Mel.
Sorry for asking such a late question.

On Sun, Mar 17, 2013 at 01:04:14PM +, Mel Gorman wrote:
 If kswaps fails to make progress but continues to shrink slab then it'll
 either discard all of slab or consume CPU uselessly scanning shrinkers.
 This patch causes kswapd to only call the shrinkers once per priority.
 
 Signed-off-by: Mel Gorman mgor...@suse.de
 ---
  mm/vmscan.c | 28 +---
  1 file changed, 21 insertions(+), 7 deletions(-)
 
 diff --git a/mm/vmscan.c b/mm/vmscan.c
 index 7d5a932..84375b2 100644
 --- a/mm/vmscan.c
 +++ b/mm/vmscan.c
 @@ -2661,9 +2661,10 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int 
 order, long remaining,
   */
  static bool kswapd_shrink_zone(struct zone *zone,
  struct scan_control *sc,
 -unsigned long lru_pages)
 +unsigned long lru_pages,
 +bool shrinking_slab)
  {
 - unsigned long nr_slab;
 + unsigned long nr_slab = 0;
   struct reclaim_state *reclaim_state = current-reclaim_state;
   struct shrink_control shrink = {
   .gfp_mask = sc-gfp_mask,
 @@ -2673,9 +2674,15 @@ static bool kswapd_shrink_zone(struct zone *zone,
   sc-nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone));
   shrink_zone(zone, sc);
  
 - reclaim_state-reclaimed_slab = 0;
 - nr_slab = shrink_slab(shrink, sc-nr_scanned, lru_pages);
 - sc-nr_reclaimed += reclaim_state-reclaimed_slab;
 + /*
 +  * Slabs are shrunk for each zone once per priority or if the zone
 +  * being balanced is otherwise unreclaimable
 +  */
 + if (shrinking_slab || !zone_reclaimable(zone)) {
 + reclaim_state-reclaimed_slab = 0;
 + nr_slab = shrink_slab(shrink, sc-nr_scanned, lru_pages);
 + sc-nr_reclaimed += reclaim_state-reclaimed_slab;
 + }
  
   if (nr_slab == 0 && !zone_reclaimable(zone))
   zone->all_unreclaimable = 1;

Why is shrink_slab() called here?
I think that outside of the zone loop is a better place to run shrink_slab(),
because shrink_slab() is not directly related to a specific zone.

And this is a question not related to this patch.
Why is nr_slab used here to decide zone->all_unreclaimable?
nr_slab is not directly related to whether a specific zone is reclaimable
or not, and, moreover, nr_slab is not directly related to the number of
reclaimed pages. It just says that some objects in the system were freed.

This question comes from my ignorance, so please enlighten me.

Thanks.

 @@ -2713,6 +2720,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, 
 int order,
   int end_zone = 0;   /* Inclusive.  0 = ZONE_DMA */
   unsigned long nr_soft_reclaimed;
   unsigned long nr_soft_scanned;
 + bool shrinking_slab = true;
   struct scan_control sc = {
   .gfp_mask = GFP_KERNEL,
   .priority = DEF_PRIORITY,
 @@ -2861,7 +2869,8 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, 
 int order,
* already being scanned that that high
* watermark would be met at 100% efficiency.
*/
 - if (kswapd_shrink_zone(zone, sc, lru_pages))
 + if (kswapd_shrink_zone(zone, sc,
 + lru_pages, shrinking_slab))
   raise_priority = false;
  
   nr_to_reclaim += sc.nr_to_reclaim;
 @@ -2900,6 +2909,9 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, 
 int order,
   pfmemalloc_watermark_ok(pgdat))
   wake_up(pgdat-pfmemalloc_wait);
  
 + /* Only shrink slab once per priority */
 + shrinking_slab = false;
 +
   /*
* Fragmentation may mean that the system cannot be rebalanced
* for high-order allocations in all zones. If twice the
 @@ -2925,8 +2937,10 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, 
 int order,
* Raise priority if scanning rate is too low or there was no
* progress in reclaiming pages
*/
 - if (raise_priority || !this_reclaimed)
 + if (raise_priority || !this_reclaimed) {
   sc.priority--;
 + shrinking_slab = true;
 + }
   } while (sc.priority >= 1 &&
!pgdat_balanced(pgdat, order, *classzone_idx));
  
 -- 
 1.8.1.4
 

Re: [PATCH 08/10] mm: vmscan: Have kswapd shrink slab only once per priority

2013-04-09 Thread Joonsoo Kim
Hello, Mel.

On Tue, Apr 09, 2013 at 12:13:59PM +0100, Mel Gorman wrote:
 On Tue, Apr 09, 2013 at 03:53:25PM +0900, Joonsoo Kim wrote:
  Hello, Mel.
  Sorry for too late question.
  
 
 No need to apologise at all.
 
  On Sun, Mar 17, 2013 at 01:04:14PM +, Mel Gorman wrote:
   If kswaps fails to make progress but continues to shrink slab then it'll
   either discard all of slab or consume CPU uselessly scanning shrinkers.
   This patch causes kswapd to only call the shrinkers once per priority.
   
   Signed-off-by: Mel Gorman mgor...@suse.de
   ---
mm/vmscan.c | 28 +---
1 file changed, 21 insertions(+), 7 deletions(-)
   
   diff --git a/mm/vmscan.c b/mm/vmscan.c
   index 7d5a932..84375b2 100644
   --- a/mm/vmscan.c
   +++ b/mm/vmscan.c
   @@ -2661,9 +2661,10 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, 
   int order, long remaining,
 */
static bool kswapd_shrink_zone(struct zone *zone,
struct scan_control *sc,
   -unsigned long lru_pages)
   +unsigned long lru_pages,
   +bool shrinking_slab)
{
   - unsigned long nr_slab;
   + unsigned long nr_slab = 0;
 struct reclaim_state *reclaim_state = current-reclaim_state;
 struct shrink_control shrink = {
 .gfp_mask = sc-gfp_mask,
   @@ -2673,9 +2674,15 @@ static bool kswapd_shrink_zone(struct zone *zone,
 sc-nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone));
 shrink_zone(zone, sc);

   - reclaim_state-reclaimed_slab = 0;
   - nr_slab = shrink_slab(shrink, sc-nr_scanned, lru_pages);
   - sc-nr_reclaimed += reclaim_state-reclaimed_slab;
   + /*
   +  * Slabs are shrunk for each zone once per priority or if the zone
   +  * being balanced is otherwise unreclaimable
   +  */
   + if (shrinking_slab || !zone_reclaimable(zone)) {
   + reclaim_state-reclaimed_slab = 0;
   + nr_slab = shrink_slab(shrink, sc-nr_scanned, lru_pages);
   + sc-nr_reclaimed += reclaim_state-reclaimed_slab;
   + }

 if (nr_slab == 0  !zone_reclaimable(zone))
 zone-all_unreclaimable = 1;
  
  Why shrink_slab() is called here?
 
 Preserves existing behaviour.

Yes, but with this patch the existing behaviour is changed, that is, we call
shrink_slab() once per priority. Now there is no reason for this function
to be called here. How about separating it out and executing it outside of
the zone loop?

We can do it with another zone loop in order to decide
zone->all_unreclaimable. Below is pseudo code from my quick thought.


for each zone
	shrink_zone()
end

nr_slab = shrink_slab()

if (nr_slab == 0) {
	for each zone
		if (!zone_reclaimable(zone))
			zone->all_unreclaimable = 1
	end
}

 
  I think that outside of zone loop is better place to run shrink_slab(),
  because shrink_slab() is not directly related to a specific zone.
  
 
 This is true and has been the case for a long time. The slab shrinkers
 are not zone aware and it is complicated by the fact that slab usage can
 indirectly pin memory on other zones. Consider for example a slab object
 that is an inode entry that is allocated from the Normal zone on a
 32-bit machine. Reclaiming may free memory from the Highmem zone.
 
 It's less obvious a problem on 64-bit machines but freeing slab objects
 from a zone like DMA32 can indirectly free memory from the Normal zone or
 even another node entirely.
 
  And this is a question not related to this patch.
  Why nr_slab is used here to decide zone-all_unreclaimable?
 
  Slab is not directly associated with a zone but as reclaiming slab can
 free memory from unpredictable zones we do not consider a zone to be
 fully unreclaimable until we cannot shrink slab any more.
 
 You may be thinking that this is extremely heavy handed and you're
 right, it is.
 
  nr_slab is not directly related whether a specific zone is reclaimable
  or not, and, moreover, nr_slab is not directly related to number of
  reclaimed pages. It just say some objects in the system are freed.
  
 
 All true, it's the indirect relation between slab objects and the memory
 that is freed when slab objects are reclaimed that has to be taken into
 account.
 
  This question comes from my ignorance, so please enlighten me.
  
 
 I hope this clarifies matters.

Very helpful :)

Thanks.

 
 -- 
 Mel Gorman
 SUSE Labs
 


Re: [PATCH 08/10] mm: vmscan: Have kswapd shrink slab only once per priority

2013-04-09 Thread Joonsoo Kim
Hello, Dave.

On Wed, Apr 10, 2013 at 11:07:34AM +1000, Dave Chinner wrote:
 On Tue, Apr 09, 2013 at 12:13:59PM +0100, Mel Gorman wrote:
  On Tue, Apr 09, 2013 at 03:53:25PM +0900, Joonsoo Kim wrote:
  
   I think that outside of zone loop is better place to run shrink_slab(),
   because shrink_slab() is not directly related to a specific zone.
   
  
  This is true and has been the case for a long time. The slab shrinkers
  are not zone aware and it is complicated by the fact that slab usage can
  indirectly pin memory on other zones.
 ..
   And this is a question not related to this patch.
   Why nr_slab is used here to decide zone-all_unreclaimable?
  
  Slab is not directly associated with a slab but as reclaiming slab can
  free memory from unpredictable zones we do not consider a zone to be
  fully unreclaimable until we cannot shrink slab any more.
 
 This is something the numa aware shrinkers will greatly help with -
 instead of being a global shrink it becomes a
 node-the-zone-belongs-to shrink, and so
 
  You may be thinking that this is extremely heavy handed and you're
  right, it is.
 
 ... it is much less heavy handed than the current code...
 
   nr_slab is not directly related whether a specific zone is reclaimable
   or not, and, moreover, nr_slab is not directly related to number of
   reclaimed pages. It just say some objects in the system are freed.
   
  
  All true, it's the indirect relation between slab objects and the memory
  that is freed when slab objects are reclaimed that has to be taken into
  account.
 
 Node awareness within the shrinker infrastructure and LRUs make the
 relationship much more direct ;)

Yes, I think so ;)

Thanks.

 
 Cheers,
 
 Dave.
 -- 
 Dave Chinner
 da...@fromorbit.com
 


Re: [PATCH 2/3] mm, slub: count freed pages via rcu as this task's reclaimed_slab

2013-04-09 Thread Joonsoo Kim
Hello, Christoph.

On Tue, Apr 09, 2013 at 02:28:06PM +, Christoph Lameter wrote:
 On Tue, 9 Apr 2013, Joonsoo Kim wrote:
 
  Currently, freed pages via rcu is not counted for reclaimed_slab, because
  it is freed in rcu context, not current task context. But, this free is
  initiated by this task, so counting this into this task's reclaimed_slab
  is meaningful to decide whether we continue reclaim, or not.
  So change code to count these pages for this task's reclaimed_slab.
 
 slab-reclaim_state guides the reclaim actions in vmscan.c. With this
 patch slab-reclaim_state could get quite a high value without new pages being
 available for allocation. slab-reclaim_state will only be updated
 when the RCU period ends.

Okay.

In addition, there are only a few places which use SLAB_DESTROY_BY_RCU.
I will drop this patch [2/3] and [3/3] for the next spin.

Thanks.

 


Re: [PATCH 1/3] mm, vmscan: count accidental reclaimed pages failed to put into lru

2013-04-09 Thread Joonsoo Kim
Hello, Minchan.

On Tue, Apr 09, 2013 at 02:55:14PM +0900, Minchan Kim wrote:
 Hello Joonsoo,
 
 On Tue, Apr 09, 2013 at 10:21:16AM +0900, Joonsoo Kim wrote:
  In shrink_(in)active_list(), we can fail to put into lru, and these pages
  are reclaimed accidentally. Currently, these pages are not counted
  for sc-nr_reclaimed, but with this information, we can stop to reclaim
  earlier, so can reduce overhead of reclaim.
  
  Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com
 
 Nice catch!
 
 But this patch handles very corner case and makes reclaim function's name
 rather stupid so I'd like to see text size change after we apply this patch.
 Other nipicks below.

Ah... Yes.
I can re-work it to add the number to sc->nr_reclaimed directly for both cases,
shrink_active_list() and age_active_anon().
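
Something like this (untested, just to show the direction), so that the
function can keep returning void:

@@ shrink_active_list()
-	free_hot_cold_page_list(&l_hold, 1);
+	sc->nr_reclaimed += free_hot_cold_page_list(&l_hold, 1);

Then the shrink_lruvec()/age_active_anon() call sites would not need any
change at all, since the accounting happens inside.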

 
  
  diff --git a/include/linux/gfp.h b/include/linux/gfp.h
  index 0f615eb..5d60ae0 100644
  --- a/include/linux/gfp.h
  +++ b/include/linux/gfp.h
  @@ -365,7 +365,7 @@ void *alloc_pages_exact_nid(int nid, size_t size, gfp_t 
  gfp_mask);
   extern void __free_pages(struct page *page, unsigned int order);
   extern void free_pages(unsigned long addr, unsigned int order);
   extern void free_hot_cold_page(struct page *page, int cold);
  -extern void free_hot_cold_page_list(struct list_head *list, int cold);
  +extern unsigned long free_hot_cold_page_list(struct list_head *list, int 
  cold);
   
   extern void __free_memcg_kmem_pages(struct page *page, unsigned int order);
   extern void free_memcg_kmem_pages(unsigned long addr, unsigned int order);
  diff --git a/mm/page_alloc.c b/mm/page_alloc.c
  index 8fcced7..a5f3952 100644
  --- a/mm/page_alloc.c
  +++ b/mm/page_alloc.c
  @@ -1360,14 +1360,18 @@ out:
   /*
* Free a list of 0-order pages
*/
  -void free_hot_cold_page_list(struct list_head *list, int cold)
  +unsigned long free_hot_cold_page_list(struct list_head *list, int cold)
   {
  +   unsigned long nr_reclaimed = 0;
 
 How about nr_free or nr_freed for consistent with function title?

Okay.

 
  struct page *page, *next;
   
  list_for_each_entry_safe(page, next, list, lru) {
  trace_mm_page_free_batched(page, cold);
  free_hot_cold_page(page, cold);
  +   nr_reclaimed++;
  }
  +
  +   return nr_reclaimed;
   }
   
   /*
  diff --git a/mm/vmscan.c b/mm/vmscan.c
  index 88c5fed..eff2927 100644
  --- a/mm/vmscan.c
  +++ b/mm/vmscan.c
  @@ -915,7 +915,6 @@ static unsigned long shrink_page_list(struct list_head 
  *page_list,
   */
  __clear_page_locked(page);
   free_it:
  -   nr_reclaimed++;
   
  /*
   * Is there need to periodically free_page_list? It would
  @@ -954,7 +953,7 @@ keep:
  if (nr_dirty  nr_dirty == nr_congested  global_reclaim(sc))
  zone_set_flag(zone, ZONE_CONGESTED);
   
  -   free_hot_cold_page_list(free_pages, 1);
  +   nr_reclaimed += free_hot_cold_page_list(free_pages, 1);
 
 Nice cleanup.
 
   
  list_splice(ret_pages, page_list);
  count_vm_events(PGACTIVATE, pgactivate);
  @@ -1321,7 +1320,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct 
  lruvec *lruvec,
  if (nr_taken == 0)
  return 0;
   
  -   nr_reclaimed = shrink_page_list(page_list, zone, sc, TTU_UNMAP,
  +   nr_reclaimed += shrink_page_list(page_list, zone, sc, TTU_UNMAP,
  nr_dirty, nr_writeback, false);
 
 Do you have any reason to change?
 To me, '=' is more clear to initialize the variable.
 When I see above, I have to look through above lines to catch where code
 used the nr_reclaimed.
 

There is no reason, I will change it.

   
  spin_lock_irq(zone-lru_lock);
  @@ -1343,7 +1342,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct 
  lruvec *lruvec,
   
  spin_unlock_irq(zone-lru_lock);
   
  -   free_hot_cold_page_list(page_list, 1);
  +   nr_reclaimed += free_hot_cold_page_list(page_list, 1);
 
 How about considering vmstat, too?
 It could be minor but you are considering freed page as
 reclaim context. (ie, sc-nr_reclaimed) so it would be more appropriate.

I don't understand what you mean.
Please explain more what you have in mind :)

 
   
  /*
   * If reclaim is isolating dirty pages under writeback, it implies
  @@ -1438,7 +1437,7 @@ static void move_active_pages_to_lru(struct lruvec 
  *lruvec,
  __count_vm_events(PGDEACTIVATE, pgmoved);
   }
   
  -static void shrink_active_list(unsigned long nr_to_scan,
  +static unsigned long shrink_active_list(unsigned long nr_to_scan,
 struct lruvec *lruvec,
 struct scan_control *sc,
 enum lru_list lru)
  @@ -1534,7 +1533,7 @@ static void shrink_active_list(unsigned long 
  nr_to_scan,
  __mod_zone_page_state(zone, NR_ISOLATED_ANON + file, -nr_taken);
  spin_unlock_irq(zone-lru_lock);
   
  -   free_hot_cold_page_list(l_hold, 1);
  +   return

Re: [RT LATENCY] 249 microsecond latency caused by slub's unfreeze_partials() code.

2013-04-10 Thread Joonsoo Kim
On Wed, Apr 10, 2013 at 09:31:10AM +0300, Pekka Enberg wrote:
 On Mon, Apr 8, 2013 at 3:32 PM, Steven Rostedt rost...@goodmis.org wrote:
   Index: linux/mm/slub.c
   ===
   --- linux.orig/mm/slub.c2013-03-28 12:14:26.958358688 -0500
   +++ linux/mm/slub.c 2013-04-01 10:23:24.677584499 -0500
   @@ -1498,6 +1498,7 @@ static inline void *acquire_slab(struct
   void *freelist;
   unsigned long counters;
   struct page new;
   +   unsigned long objects;
  
   /*
* Zap the freelist and set the frozen bit.
   @@ -1507,6 +1508,7 @@ static inline void *acquire_slab(struct
   freelist = page-freelist;
   counters = page-counters;
   new.counters = counters;
   +   objects = page-inuse;
   if (mode) {
   new.inuse = page-objects;
   new.freelist = NULL;
   @@ -1524,6 +1526,7 @@ static inline void *acquire_slab(struct
   return NULL;
  
   remove_partial(n, page);
   +   page-lru.next = (void *)objects;
   WARN_ON(!freelist);
   return freelist;
}
 
  Good. I like your method which use lru.next in order to hand over
  number of objects.
 
  I hate it ;-)
 
  It just seems to be something that's not very robust and can cause hours
  of debugging in the future. I mean, there's not even a comment
  explaining what is happening. The lru is a union with other slub
  partials structs that is not very obvious. If something is out of order,
  it can easily break, and there's nothing here that points to why.
 
  Just pass the damn objects pointer by reference and use that. It's easy
  to understand, read and is robust.
 
 Christoph, Joonsoo, comments?

Steven's comment is reasonable to me.

If there is no objection from Christoph,
let's drop the patch in which I implemented Christoph's idea.

Thanks.

 
 Pekka


Re: [PATCH 2/3] mm, slub: count freed pages via rcu as this task's reclaimed_slab

2013-04-10 Thread JoonSoo Kim
2013/4/10 Christoph Lameter c...@linux.com:
 On Wed, 10 Apr 2013, Joonsoo Kim wrote:

 Hello, Christoph.

 On Tue, Apr 09, 2013 at 02:28:06PM +, Christoph Lameter wrote:
  On Tue, 9 Apr 2013, Joonsoo Kim wrote:
 
   Currently, freed pages via rcu is not counted for reclaimed_slab, because
   it is freed in rcu context, not current task context. But, this free is
   initiated by this task, so counting this into this task's reclaimed_slab
   is meaningful to decide whether we continue reclaim, or not.
   So change code to count these pages for this task's reclaimed_slab.
 
  slab-reclaim_state guides the reclaim actions in vmscan.c. With this
  patch slab-reclaim_state could get quite a high value without new pages 
  being
  available for allocation. slab-reclaim_state will only be updated
  when the RCU period ends.

 Okay.

 In addition, there is a little place who use SLAB_DESTROY_BY_RCU.
 I will drop this patch[2/3] and [3/3] for next spin.

 What you have discovered is an issue that we have so far overlooked. Could
 you add comments to both places explaining the situation?

Yes, I can.

 RCU is used for
 some inode and the dentry cache. Failing to account for these frees could
 pose a problem. One solution would be to ensure that we get through an RCU
 quiescent period in the slabs reclaim. If we can ensure that then your
 patch may be ok.

Hmm... I don't perfectly understand the RCU code and its quiescent
periods. But, yes, it can be one of the possible solutions, from my quick
thought. Currently, I don't have the ability to do that, so I will skip
thinking about this for now.

Thanks.




Re: [PATCH 4/5] sched: don't consider upper se in sched_slice()

2013-03-31 Thread Joonsoo Kim
Hello, Preeti.

On Fri, Mar 29, 2013 at 12:42:53PM +0530, Preeti U Murthy wrote:
 Hi Joonsoo,
 
 On 03/28/2013 01:28 PM, Joonsoo Kim wrote:
  Following-up upper se in sched_slice() should not be done,
  because sched_slice() is used for checking that resched is needed
  whithin *this* cfs_rq and there is one problem related to this
  in current implementation.
  
  The problem is that if we follow-up upper se in sched_slice(), it is
  possible that we get a ideal slice which is lower than
  sysctl_sched_min_granularity.
  
  For example, we assume that we have 4 tg which is attached to root tg
  with same share and each one have 20 runnable tasks on cpu0, respectivly.
  In this case, __sched_period() return sysctl_sched_min_granularity * 20
  and then go into loop. At first iteration, we compute a portion of slice
  for this task on this cfs_rq, so get a slice, sysctl_sched_min_granularity.
  Afterward, we enter second iteration and get a slice which is a quarter of
  sysctl_sched_min_granularity, because there is 4 tgs with same share
  in that cfs_rq.
  
  Ensuring slice larger than min_granularity is important for performance
  and there is no lower bound about this, except timer tick, we should
  fix it not to consider upper se when calculating sched_slice.
  
 
 I am assuming the sysctl_sched_latency = 20ms and
 sysctl_sched_min_granularity = 4ms.
 In that case:
 
 With your patch, the sum of the sched_slice(runnable_task) on each_tg is
  40ms = sched_min_granularity * 10, while the parent tg has a
 sched_slice of sysctl_sched_latency / 4 = (20 / 4) = 5ms.
 
 Ideally the children's cpu share must add upto the parent's share.


I don't think so.

We should schedule out the parent tg once its 5ms is over. As we do so, we can
fairly distribute the time slice to every tg within a short term. If we add
the children's cpu share up to the parent's, the parent tg may have a
large time slice, so it cannot be preempted easily. There may be a latency
problem if there are many tgs.

Thanks.

 
 Thank you
 
 Regards
 Preeti U Murthy
 


Re: [PATCH 5/5] sched: limit sched_slice if it is more than sysctl_sched_latency

2013-03-31 Thread Joonsoo Kim
Hello Preeti.

On Fri, Mar 29, 2013 at 05:05:37PM +0530, Preeti U Murthy wrote:
 Hi Joonsoo
 
 On 03/28/2013 01:28 PM, Joonsoo Kim wrote:
  sched_slice() computes the ideal runtime slice. If there are many tasks
  in the cfs_rq, the period for this cfs_rq is extended to guarantee that
  each task has a time slice of at least sched_min_granularity. Then each
  task gets a portion of this period. If there is a task which has a much
  larger load weight than the others, its portion of the period can far
  exceed sysctl_sched_latency.
 
 Correct. But that does not matter; the length of the scheduling latency
 period is determined by the return value of __sched_period(), not the
 value of sysctl_sched_latency. You would not extend the period if you
 wanted all tasks to have a slice within sysctl_sched_latency, right?
 
 So since the length of the scheduling latency period is dynamic,
 depending on the number of processes running, sysctl_sched_latency, which
 is the default latency period length, is not messed with, but is only
 used as a base to determine the actual scheduling period.
 
  
  For example, you can simply imagine one task with nice -20 and
  9 tasks with nice 0 on one cfs_rq. In this case, the load weight sum for
  this cfs_rq is 88761 + 9 * 1024 = 97977. So the portion of the slice for
  the task with nice -20 is sysctl_sched_min_granularity * 10 * (88761 / 97977),
  that is, approximately, sysctl_sched_min_granularity * 9. This value can
  be even larger if there are more tasks with nice 0.
 
 Yeah, so __sched_period() says that within 40ms all tasks need to be
 scheduled at least once, and the highest priority task gets nearly 36ms of
 it, while the rest is distributed among the others.
 
  
  So we should limit this possible weird situation.
  
  Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com
  
  diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
  index e232421..6ceffbc 100644
  --- a/kernel/sched/fair.c
  +++ b/kernel/sched/fair.c
  @@ -645,6 +645,9 @@ static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
  	}
  	slice = calc_delta_mine(slice, se->load.weight, load);
  
  +	if (unlikely(slice > sysctl_sched_latency))
  +		slice = sysctl_sched_latency;
 
 Then in this case the highest priority thread would get
 20ms (sysctl_sched_latency), and the rest would get
 sysctl_sched_min_granularity * 10 * (1024/97977), which would be 0.4ms.
 Then all tasks would get scheduled at least once within 20ms + (0.4*9) ms
 = 23.7ms, while your scheduling latency period was extended to 40ms, just
 so that each of these tasks doesn't have its sched_slice shrunk due to the
 large number of tasks.

I am not sure I understand your question correctly.
I will do my best to answer your comment. :)

With this patch, I just limit the maximum slice at one go. Scheduling is
controlled through the vruntime. So, in this case, the task with nice -20
will be scheduled twice.

20 + (0.4 * 9) + 20 = 43.9 ms

And after 43.9 ms, this process is repeated.

So I can tell you that the scheduling period is preserved as before.

If we give a long period to a task at one go, it can cause
a latency problem. So IMHO, limiting this is meaningful.

Thanks.
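As a rough cross-check of the figures in this thread, a minimal standalone
sketch of the arithmetic (assuming sysctl_sched_latency = 20 ms,
sysctl_sched_min_granularity = 4 ms, and the load weights 88761 for nice -20
and 1024 for nice 0 quoted above; small differences from the numbers in the
mail are rounding).

#include <stdio.h>

int main(void)
{
	double latency_ms = 20.0, min_gran_ms = 4.0;
	double w_big = 88761.0, w_small = 1024.0;
	double total = w_big + 9.0 * w_small;		/* 97977 */

	double period = min_gran_ms * 10.0;		/* 40 ms for 10 tasks */
	double big = period * w_big / total;		/* ~36.2 ms, above latency */
	double small = period * w_small / total;	/* ~0.42 ms each */

	double clamped = (big > latency_ms) ? latency_ms : big;
	/* with the clamp, the nice -20 task runs twice per cycle */
	double cycle = clamped + 9.0 * small + clamped;

	printf("unclamped=%.1f ms, clamped=%.1f ms, cycle=%.1f ms\n",
	       big, clamped, cycle);
	return 0;
}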

 
  +
  	return slice;
   }
  
 
 Regards
 Preeti U Murthy
 


Re: [PATCH 2/5] sched: factor out code to should_we_balance()

2013-03-31 Thread Joonsoo Kim
Hello, Peter.

On Fri, Mar 29, 2013 at 12:45:14PM +0100, Peter Zijlstra wrote:
 On Thu, 2013-03-28 at 16:58 +0900, Joonsoo Kim wrote:
  There is not enough reason to place this check in update_sg_lb_stats(),
  except saving one iteration over sched_group_cpus. But with this change,
  we can save two memset costs and can expect better compiler optimization,
  so the clean-up may be more beneficial to us.
 
 It would be good if you'd described what you actually did, there's a
 lot of code movement and now I have to figure out wth you did and why.

Okay. I will do that when respin.

Thanks.



Re: [PATCH 2/5] sched: factor out code to should_we_balance()

2013-03-31 Thread Joonsoo Kim
On Fri, Mar 29, 2013 at 12:58:26PM +0100, Peter Zijlstra wrote:
 On Thu, 2013-03-28 at 16:58 +0900, Joonsoo Kim wrote:
  +static int should_we_balance(struct lb_env *env)
  +{
  +	struct sched_group *sg = env->sd->groups;
  +	int cpu, balance_cpu = -1;
  +
  +	/*
  +	 * In the newly idle case, we will allow all the cpu's
  +	 * to do the newly idle load balance.
  +	 */
  +	if (env->idle == CPU_NEWLY_IDLE)
  +		return 1;
  +
  +	/* Try to find first idle cpu */
  +	for_each_cpu_and(cpu, sched_group_cpus(sg), env->cpus)
  +		if (cpumask_test_cpu(cpu, sched_group_mask(sg)) &&
  +		    idle_cpu(cpu)) {
  +			balance_cpu = cpu;
  +			break;
  +		}
 
 /me hands you a bucket of curlies too.. use them! I'll send you another
 bucket when this one gets empty!

Thanks!

 (We always put in curlies on multi-line statements -- as opposed to
 multi-statement blocks, where they're required)

Okay. I will do that when respin.

Thanks.

 
  +
  +	if (balance_cpu == -1)
  +		balance_cpu = group_balance_cpu(sg);
  +
  +	/*
  +	 * First idle cpu or the first cpu(busiest) in this sched group
  +	 * is eligible for doing load balancing at this and above domains.
  +	 */
  +	if (balance_cpu != env->dst_cpu)
  +		return 0;
  +
  +	return 1;
  +}
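For illustration, the idle-cpu loop quoted above rewritten with the braces
Peter is asking for (a sketch only, using the same names as in the patch):

	/* Try to find first idle cpu */
	for_each_cpu_and(cpu, sched_group_cpus(sg), env->cpus) {
		if (cpumask_test_cpu(cpu, sched_group_mask(sg)) &&
		    idle_cpu(cpu)) {
			balance_cpu = cpu;
			break;
		}
	}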
 
 


[PATCH] kprobe: initialize kprobe_blacklist when it is used firstly

2013-04-01 Thread Joonsoo Kim
Currently, initializing kprobe_blacklist is done during the boot process.
It takes 230 ms on our Android platform, which is a significant amount
for our use case. We can disable CONFIG_KPROBES for the production kernel,
but that is a hassle. This kprobe functionality is not commonly used,
so we don't need to pay this cost at all times. With this rationale,
change the code to initialize kprobe_blacklist when it is first used.

Cc: Ananth N Mavinakayanahalli ana...@in.ibm.com
Cc: Anil S Keshavamurthy anil.s.keshavamur...@intel.com
Cc: David S. Miller da...@davemloft.net
Cc: Masami Hiramatsu masami.hiramatsu...@hitachi.com
Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com
---
I forgot to add lkml.
Sorry for the noise.

diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index e35be53..5e90092 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -101,6 +101,7 @@ static struct kprobe_blackpoint kprobe_blacklist[] = {
	{"mcount",},	/* mcount can be called from everywhere */
{NULL}/* Terminator */
 };
+static bool kprobe_blacklist_initialized;
 
 #ifdef __ARCH_WANT_KPROBES_INSN_SLOT
 /*
@@ -1324,6 +1325,49 @@ out:
return ret;
 }
 
+static void __kprobes init_kprobe_blacklist(void)
+{
+   unsigned long offset = 0, size = 0;
+   char *modname, namebuf[128];
+   const char *symbol_name;
+   void *addr;
+   struct kprobe_blackpoint *kb;
+
+	mutex_lock(&kprobe_mutex);
+   if (kprobe_blacklist_initialized)
+   goto out;
+
+   /*
+* Lookup and populate the kprobe_blacklist.
+*
+* Unlike the kretprobe blacklist, we'll need to determine
+* the range of addresses that belong to the said functions,
+* since a kprobe need not necessarily be at the beginning
+* of a function.
+*/
+	for (kb = kprobe_blacklist; kb->name != NULL; kb++) {
+		kprobe_lookup_name(kb->name, addr);
+		if (!addr)
+			continue;
+
+		kb->start_addr = (unsigned long)addr;
+		symbol_name = kallsyms_lookup(kb->start_addr,
+				&size, &offset, &modname, namebuf);
+		if (!symbol_name)
+			kb->range = 0;
+		else
+			kb->range = size;
+	}
+
+	/*
+	 * This guarantees that anyone who sees kprobe_blacklist_initialized
+	 * set also sees the updated kprobe_blacklist data.
+	 */
+   smp_wmb();
+   kprobe_blacklist_initialized = true;
+
+out:
+	mutex_unlock(&kprobe_mutex);
+}
+
 static int __kprobes in_kprobes_functions(unsigned long addr)
 {
struct kprobe_blackpoint *kb;
@@ -1331,6 +1375,7 @@ static int __kprobes in_kprobes_functions(unsigned long addr)
	if (addr >= (unsigned long)__kprobes_text_start &&
	    addr < (unsigned long)__kprobes_text_end)
		return -EINVAL;
+
/*
 * If there exists a kprobe_blacklist, verify and
 * fail any probe registration in the prohibited area
@@ -1476,6 +1521,9 @@ int __kprobes register_kprobe(struct kprobe *p)
struct module *probed_mod;
kprobe_opcode_t *addr;
 
+   if (unlikely(!kprobe_blacklist_initialized))
+   init_kprobe_blacklist();
+
/* Adjust probe address from symbol */
addr = kprobe_addr(p);
if (IS_ERR(addr))
@@ -2065,11 +2113,6 @@ static struct notifier_block kprobe_module_nb = {
 static int __init init_kprobes(void)
 {
int i, err = 0;
-   unsigned long offset = 0, size = 0;
-   char *modname, namebuf[128];
-   const char *symbol_name;
-   void *addr;
-   struct kprobe_blackpoint *kb;
 
/* FIXME allocate the probe table, currently defined statically */
/* initialize all list heads */
@@ -2079,28 +2122,6 @@ static int __init init_kprobes(void)
		raw_spin_lock_init(&(kretprobe_table_locks[i].lock));
}
 
-   /*
-* Lookup and populate the kprobe_blacklist.
-*
-* Unlike the kretprobe blacklist, we'll need to determine
-* the range of addresses that belong to the said functions,
-* since a kprobe need not necessarily be at the beginning
-* of a function.
-*/
-	for (kb = kprobe_blacklist; kb->name != NULL; kb++) {
-		kprobe_lookup_name(kb->name, addr);
-		if (!addr)
-			continue;
-
-		kb->start_addr = (unsigned long)addr;
-		symbol_name = kallsyms_lookup(kb->start_addr,
-				&size, &offset, &modname, namebuf);
-		if (!symbol_name)
-			kb->range = 0;
-		else
-			kb->range = size;
-	}
-
if (kretprobe_blacklist_size) {
/* lookup the function address from its name */
for (i = 0; kretprobe_blacklist[i].name != NULL; i++) {
-- 
1.7.9.5


Re: [RFC PATCH 5/6] ARM, mm: change meaning of max_low_pfn to maximum pfn for nobootmem

2013-04-01 Thread Joonsoo Kim
Hello, Russell.

On Mon, Mar 25, 2013 at 09:48:16AM +, Russell King - ARM Linux wrote:
 On Mon, Mar 25, 2013 at 01:11:13PM +0900, Joonsoo Kim wrote:
  nobootmem uses max_low_pfn for computing the boundary in free_all_bootmem(),
  so we need to give max_low_pfn a proper value.
  
  But there is some difficulty related to max_low_pfn. max_low_pfn is used
  with two meanings in various architectures. One is the number of pages
  in lowmem and the other is the maximum lowmem pfn. Now, in ARM, it is used
  as the number of pages in lowmem. You can get more information in the links below.
  http://lwn.net/Articles/543408/
  http://lwn.net/Articles/543424/
  
  As far as I investigated, more architectures use max_low_pfn as the maximum
  pfn than the other way around, so changing the meaning of max_low_pfn to the
  maximum pfn is the preferable solution to me. This patch changes max_low_pfn
  to the maximum lowmem pfn on ARM. In addition, min_low_pfn and max_pfn are
  assigned according to this criterion.
  
  There is no real user of max_low_pfn except block/blk-settings.c, and
  blk-settings.c assumes that max_low_pfn is the maximum lowmem pfn,
  so this patch should not harm anything.
 
 This will need some very rigorous testing before it can go into mainline.

This patchset has received no attention from others. :)
Do you have any plan to test this?
The best I can do is run this patchset on my platform.

Would you guide me on how to get this patchset into mainline?

Thanks.
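For readers following the discussion, a hypothetical sketch of the convention
the patch description argues for; the helper name is made up, and
min/max_low/max_high stand for the pfn limits ARM computes during boot
(roughly what find_limits() produces), so treat this as an illustration
rather than the actual patch.

static void __init set_pfn_limits(unsigned long min,
				  unsigned long max_low,
				  unsigned long max_high)
{
	min_low_pfn = min;	/* lowest usable pfn */
	max_low_pfn = max_low;	/* maximum lowmem pfn, as block/blk-settings.c expects */
	max_pfn = max_high;	/* maximum pfn, including highmem */
}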



Re: [RT LATENCY] 249 microsecond latency caused by slub's unfreeze_partials() code.

2013-04-01 Thread Joonsoo Kim
Hello, Christoph.

On Mon, Apr 01, 2013 at 03:33:23PM +, Christoph Lameter wrote:
 Subject: slub: Fix object counts in acquire_slab V2
 
  It seems that we were overallocating objects from the slab queues
  since get_partial_node() assumed that page->inuse was undisturbed by
  acquire_slab(). Save the # of objects in page->lru.next in acquire_slab()
  and pass it to get_partial_node() that way.
 
 I have a vague memory that Joonsoo also ran into this issue awhile back.

Yes. I sent a patch for this two months ago. :)

 
 Signed-off-by: Christoph Lameter c...@linux.com
 
 Index: linux/mm/slub.c
 ===
 --- linux.orig/mm/slub.c  2013-03-28 12:14:26.958358688 -0500
 +++ linux/mm/slub.c   2013-04-01 10:23:24.677584499 -0500
 @@ -1498,6 +1498,7 @@ static inline void *acquire_slab(struct
   void *freelist;
   unsigned long counters;
   struct page new;
 + unsigned long objects;
 
   /*
* Zap the freelist and set the frozen bit.
 @@ -1507,6 +1508,7 @@ static inline void *acquire_slab(struct
  	freelist = page->freelist;
  	counters = page->counters;
  	new.counters = counters;
  +	objects = page->inuse;
  	if (mode) {
  		new.inuse = page->objects;
  		new.freelist = NULL;
 @@ -1524,6 +1526,7 @@ static inline void *acquire_slab(struct
   return NULL;
 
   remove_partial(n, page);
  +	page->lru.next = (void *)objects;
   WARN_ON(!freelist);
   return freelist;
  }

Good. I like your method, which uses lru.next in order to hand over the
number of objects.

  @@ -1565,7 +1568,7 @@ static void *get_partial_node(struct kme
  		c->page = page;
  		stat(s, ALLOC_FROM_PARTIAL);
  		object = t;
  -		available = page->objects - page->inuse;
  +		available = page->objects - (unsigned long)page->lru.next;
  	} else {
  		available = put_cpu_partial(s, page, 0);
  		stat(s, CPU_PARTIAL_NODE);

We need one more fix for correctness.
When 'available' is assigned by put_cpu_partial(), it doesn't count the
cpu slab's objects.
Please refer to my old patch.

https://lkml.org/lkml/2013/1/21/64

Thanks.



Re: [RT LATENCY] 249 microsecond latency caused by slub's unfreeze_partials() code.

2013-04-01 Thread Joonsoo Kim
Hello, Christoph.

On Mon, Apr 01, 2013 at 03:32:43PM +, Christoph Lameter wrote:
 On Thu, 28 Mar 2013, Paul Gortmaker wrote:
 
   Index: linux/init/Kconfig
   ===
   --- linux.orig/init/Kconfig 2013-03-28 12:14:26.958358688 -0500
   +++ linux/init/Kconfig  2013-03-28 12:19:46.275866639 -0500
   @@ -1514,6 +1514,14 @@ config SLOB
  
endchoice
  
   +config SLUB_CPU_PARTIAL
   +   depends on SLUB
    +	bool "SLUB per cpu partial cache"
    +	help
    +	  Per cpu partial caches accelerate freeing of objects at the
    +	  price of more indeterminism in the latency of the free.
    +	  Typically one would choose no for a realtime system.
 
  Is "batch" a better description than "accelerate"?  Something like
 
 It's not batching but a cache that is going to be mainly used for new
 allocations on the same processor.
 
   Per cpu partial caches allow batch freeing of objects to maximize
   throughput.  However, this can increase the length of time spent
   holding key locks, which can increase latency spikes with respect
   to responsiveness.  Select yes unless you are tuning for a realtime
   oriented system.
 
  Also, I believe this will cause a behaviour change for people who
  just run make oldconfig -- since there is no default line.  Meaning
  that it used to be unconditionally on, but now I think it will be off
  by default, if people just mindlessly hold down Enter key.
 
 Ok.
 
  For RT, we'll want default N if RT_FULL (RT_BASE?) but for mainline,
  I expect you'll want default Y in order to be consistent with previous
  behaviour?
 
 I was not sure exactly how to handle that one yet for realtime. So I need
 two different patches?
 
  I've not built/booted yet, but I'll follow up if I see anything else in 
  doing
  that.
 
 Here is an updated patch. I will also send an updated fixup patch.
 
 
 Subject: slub: Make cpu partial slab support configurable V2
 
 cpu partial support can introduce level of indeterminism that is not wanted
 in certain context (like a realtime kernel). Make it configurable.
 
 Signed-off-by: Christoph Lameter c...@linux.com
 
 Index: linux/include/linux/slub_def.h
 ===
 --- linux.orig/include/linux/slub_def.h	2013-04-01 10:27:05.908964674 -0500
 +++ linux/include/linux/slub_def.h2013-04-01 10:27:19.905178531 -0500
 @@ -47,7 +47,9 @@ struct kmem_cache_cpu {
   void **freelist;/* Pointer to next available object */
   unsigned long tid;  /* Globally unique transaction id */
   struct page *page;  /* The slab from which we are allocating */
 +#ifdef CONFIG_SLUB_CPU_PARTIAL
   struct page *partial;   /* Partially allocated frozen slabs */
 +#endif
  #ifdef CONFIG_SLUB_STATS
   unsigned stat[NR_SLUB_STAT_ITEMS];
  #endif
 @@ -84,7 +86,9 @@ struct kmem_cache {
   int size;   /* The size of an object including meta data */
   int object_size;/* The size of an object without meta data */
   int offset; /* Free pointer offset. */
 +#ifdef CONFIG_SLUB_CPU_PARTIAL
 	int cpu_partial;	/* Number of per cpu partial objects to keep around */
 +#endif
   struct kmem_cache_order_objects oo;
 
   /* Allocation and freeing of slabs */

When !CONFIG_SLUB_CPU_PARTIAL, should we really remove these variables?
Without removing them, we can make the code simpler and more maintainable.

 Index: linux/mm/slub.c
 ===
 --- linux.orig/mm/slub.c  2013-04-01 10:27:05.908964674 -0500
 +++ linux/mm/slub.c   2013-04-01 10:27:19.905178531 -0500
 @@ -1531,7 +1531,9 @@ static inline void *acquire_slab(struct
   return freelist;
  }
 
 +#ifdef CONFIG_SLUB_CPU_PARTIAL
   static int put_cpu_partial(struct kmem_cache *s, struct page *page, int drain);
 +#endif
  static inline bool pfmemalloc_match(struct page *page, gfp_t gfpflags);
 
  /*
 @@ -1570,10 +1572,20 @@ static void *get_partial_node(struct kme
   object = t;
  		available = page->objects - (unsigned long)page->lru.next;
   } else {
 +#ifdef CONFIG_SLUB_CPU_PARTIAL
   available = put_cpu_partial(s, page, 0);
   stat(s, CPU_PARTIAL_NODE);
 +#else
 + BUG();
 +#endif
   }
  -	if (kmem_cache_debug(s) || available > s->cpu_partial / 2)
  +	if (kmem_cache_debug(s) ||
  +#ifdef CONFIG_SLUB_CPU_PARTIAL
  +			available > s->cpu_partial / 2
  +#else
  +			available > 0
  +#endif
  +			)
   break;
 
   }

How about introducing a wrapper function, cpu_partial_enabled(),
like kmem_cache_debug()?

int cpu_partial_enabled(s)
{
return kmem_cache_debug(s) || blablabla
}
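A slightly fuller sketch of that wrapper idea, folding in the threshold check
from the quoted hunk; the helper name and placement are only suggestions, not
what was merged.

static inline bool partial_threshold_reached(struct kmem_cache *s,
					     int available)
{
	if (kmem_cache_debug(s))
		return true;
#ifdef CONFIG_SLUB_CPU_PARTIAL
	return available > s->cpu_partial / 2;
#else
	return available > 0;
#endif
}

so that get_partial_node() could simply do:

	if (partial_threshold_reached(s, available))
		break;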

As you already know, when 

Re: [PATCH 5/5] sched: limit sched_slice if it is more than sysctl_sched_latency

2013-04-01 Thread Joonsoo Kim
Hello, Preeti.

On Mon, Apr 01, 2013 at 12:15:50PM +0530, Preeti U Murthy wrote:
 Hi Joonsoo,
 
 On 04/01/2013 10:39 AM, Joonsoo Kim wrote:
  Hello Preeti.
  So we should limit this possible weird situation.
 
  Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com
 
  diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
  index e232421..6ceffbc 100644
  --- a/kernel/sched/fair.c
  +++ b/kernel/sched/fair.c
   @@ -645,6 +645,9 @@ static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
   	}
   	slice = calc_delta_mine(slice, se->load.weight, load);
   
   +	if (unlikely(slice > sysctl_sched_latency))
   +		slice = sysctl_sched_latency;
 
   Then in this case the highest priority thread would get
   20ms (sysctl_sched_latency), and the rest would get
   sysctl_sched_min_granularity * 10 * (1024/97977), which would be 0.4ms.
   Then all tasks would get scheduled at least once within 20ms + (0.4*9) ms
   = 23.7ms, while your scheduling latency period was extended to 40ms, just
   so that each of these tasks doesn't have its sched_slice shrunk due to
   the large number of tasks.
  
  I don't know I understand your question correctly.
  I will do my best to answer your comment. :)
  
  With this patch, I just limit maximum slice at one time. Scheduling is
  controlled through the vruntime. So, in this case, the task with nice -20
  will be scheduled twice.
  
  20 + (0.4 * 9) + 20 = 43.9 ms
  
  And after 43.9 ms, this process is repeated.
  
  So I can tell you that scheduling period is preserved as before.
  
  If we give a long period to a task at one go, it can cause
  a latency problem. So IMHO, limiting this is meaningful.
 
 Thank you very much for the explanation. Just one question. What is the
 reason behind your choosing sysctl_sched_latency as the upper bound here?

sysctl_sched_latency is the sched_slice when there is only one task.
So, I think it is proper as the upper bound.

Thanks.

 Regards
 Preeti U Murthy
 


Re: [PATCH 4/5] sched: don't consider upper se in sched_slice()

2013-04-01 Thread Joonsoo Kim
Hello, Preeti.

On Mon, Apr 01, 2013 at 12:36:52PM +0530, Preeti U Murthy wrote:
 Hi Joonsoo,
 
 On 04/01/2013 09:38 AM, Joonsoo Kim wrote:
  Hello, Preeti.
  
 
 
  Ideally the children's cpu share must add upto the parent's share.
 
  
  I don't think so.
  
  We should schedule out the parent tg if 5ms is over. As we do so, we can
  fairly distribute time slice to every tg within short term. If we add
  the children's cpu share upto the parent's, the parent tg may have
  large time slice, so it cannot be preempted easily. There may be a latency
  problem if there are many tgs.
 
  In the case where the #running < sched_nr_latency, the children's
  sched_slices add up to the parent's.
  
  An rq with two tgs, each with 3 tasks.
  
  Each of these tasks has a sched_slice of
  [(sysctl_sched_latency / 3) / 2] as of the present implementation.
  
  The sum of the above sched_slices of all tasks of a tg will lead to the
  sched_slice of its parent: sysctl_sched_latency / 2
  
  This breaks when the nr_running on each tg > sched_nr_latency. However I
  don't know if this is a good thing or a bad thing.

Ah.. Now I get your point. Yes, you are right and it may be a good thing.
With that property, all tasks in the system can be scheduled at least once
in sysctl_sched_latency. sysctl_sched_latency is a system-wide configuration,
so my patch may be wrong. With my patch, all tasks in the system cannot be
scheduled at least once in sysctl_sched_latency. Instead, it schedules
all tasks in a cfs_rq at least once in sysctl_sched_latency if there are
no other tgs.

I think that it is a real problem that sysctl_sched_min_granularity is not
guaranteed for each task.
Instead of this patch, how about considering a lower bound?

if (slice < sysctl_sched_min_granularity)
	slice = sysctl_sched_min_granularity;

Thanks.
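To make the 'children add up to the parent' property concrete, a minimal
standalone sketch of Preeti's example (assuming sysctl_sched_latency = 20 ms
and two equal-share task groups with 3 tasks each, so each tg's cfs_rq stays
below sched_nr_latency and the period is not extended).

#include <stdio.h>

int main(void)
{
	double latency_ms = 20.0;
	int nr_tgs = 2, tasks_per_tg = 3;

	/* one task's slice: its share within the tg, then the tg's share of the rq */
	double per_task = (latency_ms / tasks_per_tg) / nr_tgs;	/* ~3.33 ms */
	double per_tg = per_task * tasks_per_tg;		/* 10 ms */

	printf("per task %.2f ms, per tg %.2f ms, parent's share %.2f ms\n",
	       per_task, per_tg, latency_ms / nr_tgs);
	return 0;
}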

 
 Regards
 Preeti U Murthy
 


Re: [PATCH 4/5] sched: don't consider upper se in sched_slice()

2013-04-02 Thread Joonsoo Kim
Hello, Preeti.

On Tue, Apr 02, 2013 at 10:25:23AM +0530, Preeti U Murthy wrote:
 Hi Joonsoo,
 
 On 04/02/2013 07:55 AM, Joonsoo Kim wrote:
  Hello, Preeti.
  
  On Mon, Apr 01, 2013 at 12:36:52PM +0530, Preeti U Murthy wrote:
  Hi Joonsoo,
 
  On 04/01/2013 09:38 AM, Joonsoo Kim wrote:
  Hello, Preeti.
 
 
 
  Ideally the children's cpu share must add upto the parent's share.
 
 
  I don't think so.
 
  We should schedule out the parent tg if 5ms is over. As we do so, we can
  fairly distribute time slice to every tg within short term. If we add
  the children's cpu share upto the parent's, the parent tg may have
  large time slice, so it cannot be preempted easily. There may be a latency
  problem if there are many tgs.
 
   In the case where the #running < sched_nr_latency, the children's
  sched_slices add up to the parent's.
 
  A rq with two tgs,each with 3 tasks.
 
  Each of these tasks have a sched slice of
  [(sysctl_sched_latency / 3) / 2] as of the present implementation.
 
  The sum of the above sched_slice of all tasks of a tg will lead to the
  sched_slice of its parent: sysctl_sched_latency / 2
 
   This breaks when the nr_running on each tg > sched_nr_latency. However I
  don't know if this is a good thing or a bad thing.
  
  Ah.. Now I get your point. Yes, you are right and it may be good thing.
  With that property, all tasks in the system can be scheduled at least once
  in sysctl_sched_latency. sysctl_sched_latency is system-wide configuration,
  so my patch may be wrong. With my patch, all tasks in the system cannot be
  scheduled at least once in sysctl_sched_latency. Instead, it schedule
  all tasks in cfs_rq at least once in sysctl_sched_latency if there is
  no other tgs.
 
 Exactly. You have got all the above points right.
 
  
  I think that it is real problem that sysctl_sched_min_granularity is not
  guaranteed for each task.
  Instead of this patch, how about considering low bound?
  
   if (slice < sysctl_sched_min_granularity)
   	slice = sysctl_sched_min_granularity;
 
 Consider the below scenario.
 
 A runqueue has two task groups, each with 10 tasks.
 
 With the current implementation, each of these tasks gets a sched_slice of
 2ms. Hence in a matter of (10*2) + (10*2) = 40 ms, all tasks (all tasks
 of both the task groups) will get the chance to run.
 
 But what is the scheduling period in this scenario? Is it 40ms (the extended
 sysctl_sched_latency), which is the scheduling period for each of the
 runqueues with 10 tasks in it?
 Or is it 80ms, which is the total of the scheduling periods of each of
 the run queues with 10 tasks? Either way all tasks seem to get scheduled
 at least once within the scheduling period. So we appear to be safe,
 although the sched_slice < sched_min_granularity.
 
 With your above lower bound of sysctl_sched_min_granularity, each task
 of each tg gets 4ms as its sched_slice. So in a matter of
 (10*4) + (10*4) = 80ms, all tasks get to run. With the above question
 being put forth here as well, we don't appear to be safe if the
 scheduling period is considered to be 40ms; otherwise it appears fine to
 me, because it ensures the sched_slice is at least sched_min_granularity
 no matter what.

So, you mean that we should guarantee each task is scheduled at least once
in sysctl_sched_latency? But this is not guaranteed in the current code,
which is why I made this patch. Please refer to the patch description.

Thanks.

 
 Thank you
 
 Regards
 Preeti U Murthy
 


Re: [PATCH 4/5] sched: don't consider upper se in sched_slice()

2013-04-02 Thread Joonsoo Kim
Hello, Mike.

On Tue, Apr 02, 2013 at 04:35:26AM +0200, Mike Galbraith wrote:
 On Tue, 2013-04-02 at 11:25 +0900, Joonsoo Kim wrote: 
  Hello, Preeti.
  
  On Mon, Apr 01, 2013 at 12:36:52PM +0530, Preeti U Murthy wrote:
   Hi Joonsoo,
   
   On 04/01/2013 09:38 AM, Joonsoo Kim wrote:
Hello, Preeti.

   
   
Ideally the children's cpu share must add upto the parent's share.
   

I don't think so.

We should schedule out the parent tg if 5ms is over. As we do so, we can
fairly distribute time slice to every tg within short term. If we add
the children's cpu share upto the parent's, the parent tg may have
large time slice, so it cannot be preempted easily. There may be a 
latency
problem if there are many tgs.
   
    In the case where the #running < sched_nr_latency, the children's
   sched_slices add up to the parent's.
   
   A rq with two tgs,each with 3 tasks.
   
   Each of these tasks have a sched slice of
   [(sysctl_sched_latency / 3) / 2] as of the present implementation.
   
   The sum of the above sched_slice of all tasks of a tg will lead to the
   sched_slice of its parent: sysctl_sched_latency / 2
   
    This breaks when the nr_running on each tg > sched_nr_latency. However I
   don't know if this is a good thing or a bad thing.
  
  Ah.. Now I get your point. Yes, you are right and it may be good thing.
  With that property, all tasks in the system can be scheduled at least once
  in sysctl_sched_latency. sysctl_sched_latency is system-wide configuration,
  so my patch may be wrong. With my patch, all tasks in the system cannot be
  scheduled at least once in sysctl_sched_latency. Instead, it schedule
  all tasks in cfs_rq at least once in sysctl_sched_latency if there is
  no other tgs.
  
  I think that it is real problem that sysctl_sched_min_granularity is not
  guaranteed for each task.
  Instead of this patch, how about considering low bound?
  
   if (slice < sysctl_sched_min_granularity)
   	slice = sysctl_sched_min_granularity;
 
 How many SCHED_IDLE or +nice tasks will fit in that?

It is more related to how many tasks are running in the cfs_rq and how many
tgs are in the system. If we have two tgs which each have more than
sched_nr_latency tasks, all these tasks fit in this condition in the current
implementation.

Thanks.

 
 -Mike
 


Re: [PATCH 2/5] sched: factor out code to should_we_balance()

2013-04-02 Thread Joonsoo Kim
Hello, Peter.

On Tue, Apr 02, 2013 at 10:10:06AM +0200, Peter Zijlstra wrote:
 On Thu, 2013-03-28 at 16:58 +0900, Joonsoo Kim wrote:
  Now, checking whether this cpu is appropriate to balance is embedded in
  update_sg_lb_stats(), and this check has no direct relationship to that
  function.
  
  There is not enough reason to place this check in update_sg_lb_stats(),
  except saving one iteration over sched_group_cpus.
 
 It's only one iteration if there are only 2 groups, but there can be more
 than 2; take any desktop Intel i7, it will have 4-8 cores, each with
 HT; thus the CPU domain will have 4-8 groups.
 
 And note that the local group is always the first group of a domain, so
 we'd stop the balance at the first group and avoid touching the other
 3-7, avoiding touching cachelines on 6-14 cpus.
 
 So this short-cut does make sense.. it's not pretty, granted, but
 killing it doesn't seem right.

It seems that there is some misunderstanding about this patch.
In this patch, we don't iterate over all groups. Instead, we iterate over
the cpus of the local sched_group only. So there is no penalty as you
mentioned.

In summary, the net effect is below.

* For cpus which are not appropriate to balance:
fewer function calls,
no memset cost

* For cpus which should balance:
one extra iteration over the cpus of the local sched_group in
should_we_balance(), fewer branches, so better optimization can be expected.
Thanks.

 


Re: [PATCH 4/5] sched: don't consider upper se in sched_slice()

2013-04-03 Thread Joonsoo Kim
Hello, Preeti.

On Tue, Apr 02, 2013 at 11:02:43PM +0530, Preeti U Murthy wrote:
 Hi Joonsoo,
 
 
  I think that it is real problem that sysctl_sched_min_granularity is not
  guaranteed for each task.
  Instead of this patch, how about considering low bound?
 
   if (slice < sysctl_sched_min_granularity)
   	slice = sysctl_sched_min_granularity;
 
  Consider the below scenario.
 
  A runqueue has two task groups,each with 10 tasks.
 
  With the current implementation,each of these tasks get a sched_slice of
  2ms.Hence in a matter of (10*2) + (10*2) = 40 ms, all tasks( all tasks
  of both the task groups) will get the chance to run.
 
  But what is the scheduling period in this scenario? Is it 40ms (extended
  sysctl_sched_latency), which is the scheduling period for each of the
  runqueues with 10 tasks in it?
  Or is it 80ms which is the total of the scheduling periods of each of
  the run queues with 10 tasks.Either way all tasks seem to get scheduled
  atleast once within the scheduling period.So we appear to be safe.
   Although the sched_slice < sched_min_granularity.
 
  With your above lower bound of sysctl_sched_min_granularity, each task
  of each tg gets 4ms as its sched_slice.So in a matter of
  (10*4) + (10*4) = 80ms,all tasks get to run. With the above question
  being put forth here as well, we don't appear to be safe if the
  scheduling_period is considered to be 40ms, otherwise it appears fine to
  me, because it ensures the sched_slice is atleast sched_min_granularity
  no matter what.
  
  So, you mean that we should guarantee to schedule each task atleast once
  in sysctl_sched_latency? 
 
 I would rather say all tasks get to run at least once in a sched_period,
 which could be just sysctl_sched_latency or more, depending on the number
 of tasks, in the current implementation.
 
 But this is not guaranteed in current code,
  this is why I made this patch. Please refer a patch description.
 
 Ok, take the example of a runqueue with 2 task groups, each with 10
 tasks. Same as your previous example. Can you explain how your patch
 ensures that all 20 tasks get to run at least once in a sched_period?

My patch doesn't ensure that. :)
In the patch description I just wanted to point out a problem of the current
implementation, which can't ensure that every task runs at least once in a
sched_period.

So, how about extending the sched_period with rq->nr_running instead of
cfs_rq->nr_running? It is just a quick thought, but I think we can then
ensure that every task runs at least once in this extended sched_period.

And do we still have a problem if we cannot guarantee the at-least-once
property?

Thanks.
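A hedged sketch of that suggestion, modeled on __sched_period() but fed
rq->nr_running instead of cfs_rq->nr_running; the helper name is hypothetical
and this only illustrates the idea, it is not a tested patch.

static u64 sched_period_from_rq(struct rq *rq)
{
	unsigned long nr_running = rq->nr_running;
	u64 period = sysctl_sched_latency;

	/* same shape as __sched_period(), but scaled by the whole rq */
	if (unlikely(nr_running > sched_nr_latency))
		period = (u64)nr_running * sysctl_sched_min_granularity;

	return period;
}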

 
 Regards
 Preeti U Murthy
 


Re: [PATCH 2/5] sched: factor out code to should_we_balance()

2013-04-03 Thread Joonsoo Kim
Hello, Peter.

On Tue, Apr 02, 2013 at 12:29:42PM +0200, Peter Zijlstra wrote:
 On Tue, 2013-04-02 at 12:00 +0200, Peter Zijlstra wrote:
  On Tue, 2013-04-02 at 18:50 +0900, Joonsoo Kim wrote:
   
   It seems that there is some misunderstanding about this patch.
   In this patch, we don't iterate all groups. Instead, we iterate on
   cpus of local sched_group only. So there is no penalty you mentioned.
  
  OK, I'll go stare at it again..
 
 Ah, I see, you're doing should_we_balance() _before_
 find_busiest_group() and instead you're doing another for_each_cpu() in
 there.
 
 I'd write the thing like:
 
 static bool should_we_balance(struct lb_env *env)
 {
 	struct sched_group *sg = env->sd->groups;
 	struct cpumask *sg_cpus, *sg_mask;
 	int cpu, balance_cpu = -1;
 
 	if (env->idle == CPU_NEWLY_IDLE)
 		return true;
 
 	sg_cpus = sched_group_cpus(sg);
 	sg_mask = sched_group_mask(sg);
 
 	for_each_cpu_and(cpu, sg_cpus, env->cpus) {
 		if (!cpumask_test_cpu(cpu, sg_mask))
 			continue;
 
 		if (!idle_cpu(cpu))
 			continue;
 
 		balance_cpu = cpu;
 		break;
 	}
 
 	if (balance_cpu == -1)
 		balance_cpu = group_balance_cpu(sg);
 
 	return balance_cpu == env->dst_cpu;
 }

Okay. It looks nice.

 
 I also considered doing the group_balance_cpu() first to avoid having
 to do the idle_cpu() scan, but that's a slight behavioural change
 afaict.

As a quick thought, we can avoid it in the following way:

balance_cpu = group_balance_cpu(sg);
if (idle_cpu(balance_cpu))
	return balance_cpu == env->dst_cpu;
else
	/* do the idle_cpu() scan loop */

Is that your idea? If not, please let me know what you had in mind.

Thanks.
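Spelling that ordering out as code, based on Peter's version quoted above
(a sketch only; as Peter notes, preferring the group balance cpu over the
first idle cpu is a slight behavioural change):

static bool should_we_balance(struct lb_env *env)
{
	struct sched_group *sg = env->sd->groups;
	int cpu, balance_cpu;

	if (env->idle == CPU_NEWLY_IDLE)
		return true;

	/* check the designated balance cpu first; skip the scan if it is idle */
	balance_cpu = group_balance_cpu(sg);
	if (idle_cpu(balance_cpu))
		return balance_cpu == env->dst_cpu;

	/* otherwise fall back to the first idle cpu in the group, if any */
	for_each_cpu_and(cpu, sched_group_cpus(sg), env->cpus) {
		if (!cpumask_test_cpu(cpu, sched_group_mask(sg)))
			continue;
		if (idle_cpu(cpu)) {
			balance_cpu = cpu;
			break;
		}
	}

	return balance_cpu == env->dst_cpu;
}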

 


Re: [RT LATENCY] 249 microsecond latency caused by slub's unfreeze_partials() code.

2013-04-03 Thread Joonsoo Kim
Hello, Christoph.

On Tue, Apr 02, 2013 at 07:25:20PM +, Christoph Lameter wrote:
 On Tue, 2 Apr 2013, Joonsoo Kim wrote:
 
  We need one more fix for correctness.
  When available is assigned by put_cpu_partial, it doesn't count cpu slab's 
  objects.
  Please reference my old patch.
 
  https://lkml.org/lkml/2013/1/21/64
 
 Could you update your patch and submit it again?

Pekka already applied it.
Do we need an update?

Thanks.

 