Re: [PATCH v2 06/11] memcg: kmem controller infrastructure

2012-08-23 Thread Glauber Costa

>>> Perhaps we're just trying to take a conservative initial implementation
>>> which is consistent with user visible pages.
>>>
>>
>> The way I see it, it is not about being conservative, but rather about my
>> physical safety. It is quite easy and natural to assume that "all
>> modifications to page cgroup are done under lock". So someone modifying
>> this later will likely find out about this exception in a rather
>> unpleasant way. They know where I live, and guns for hire are everywhere.
>>
>> Note that it is not unreasonable to believe that we can modify this
>> later. This can be a way out, for example, for the memcg lifecycle problem.
>>
>> I agree with your analysis and we can ultimately remove it, but if we
>> cannot pinpoint any performance problems to here, maybe consistency
>> wins. Also, the locking operation itself is a bit expensive, but the
>> biggest price is the actual contention. If nobody is contending for the
>> same page_cgroup, the problem - if it exists - shouldn't be that bad.
>> And if we ever have contention, the lock is needed.
> 
> Sounds reasonable. Another reason we might have to eventually revisit
> this lock is the fact that lock_page_cgroup() is not generally irq_safe.
> I assume that slab pages may be freed in softirq and would thus (in an
> upcoming patch series) call __memcg_kmem_free_page.  There are a few
> factors that might make it safe to grab this lock here (and below in
> __memcg_kmem_free_page) from hard/softirq context:
> * the pc lock is a per page bit spinlock.  So we only need to worry
>   about interrupting a task which holds the same page's lock to avoid
>   deadlock.
> * for accounted kernel pages, I am not aware of other code beyond
>   __memcg_kmem_charge_page and __memcg_kmem_free_page which grab pc
>   lock.  So we shouldn't find __memcg_kmem_free_page() called from a
>   context which interrupted a holder of the page's pc lock.
> 

All very right.

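A schematic of the interleaving Greg describes, purely illustrative (not
code from the series; lock_page_cgroup() is the per-page bit spinlock from
the first bullet):

    task context, CPU0                softirq, same CPU0
    ------------------                ------------------
    lock_page_cgroup(pc);
      ... interrupted ...             __memcg_kmem_free_page(page, order);
                                        lock_page_cgroup(pc);  /* same pc */
                                        /* spins forever: the interrupted
                                           holder can never run to unlock,
                                           which is why the two factors
                                           above have to hold */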

Re: [PATCH v2 06/11] memcg: kmem controller infrastructure

2012-08-22 Thread Greg Thelen
On Wed, Aug 22 2012, Glauber Costa wrote:

> On 08/22/2012 01:50 AM, Greg Thelen wrote:
>> On Thu, Aug 09 2012, Glauber Costa wrote:
>> 
>>> This patch introduces infrastructure for tracking kernel memory pages to
>>> a given memcg. This will happen whenever the caller includes the
>>> __GFP_KMEMCG flag, and the task belongs to a memcg other than the root.
>>>
>>> In memcontrol.h those functions are wrapped in inline accessors.  The
>>> idea is to later on, patch those with static branches, so we don't incur
>>> any overhead when no mem cgroups with limited kmem are being used.
>>>
>>> [ v2: improved comments and standardized function names ]
>>>
>>> Signed-off-by: Glauber Costa <glom...@parallels.com>
>>> CC: Christoph Lameter <c...@linux.com>
>>> CC: Pekka Enberg <penb...@cs.helsinki.fi>
>>> CC: Michal Hocko <mho...@suse.cz>
>>> CC: Kamezawa Hiroyuki <kamezawa.hir...@jp.fujitsu.com>
>>> CC: Johannes Weiner <han...@cmpxchg.org>
>>> ---
>>>  include/linux/memcontrol.h |  79 +++
>>>  mm/memcontrol.c| 185 +
>>>  2 files changed, 264 insertions(+)
>>>
>>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>>> index 8d9489f..75b247e 100644
>>> --- a/include/linux/memcontrol.h
>>> +++ b/include/linux/memcontrol.h
>>> @@ -21,6 +21,7 @@
>>>  #define _LINUX_MEMCONTROL_H
>>>  #include <linux/cgroup.h>
>>>  #include <linux/vm_event_item.h>
>>> +#include <linux/hardirq.h>
>>>  
>>>  struct mem_cgroup;
>>>  struct page_cgroup;
>>> @@ -399,6 +400,11 @@ struct sock;
>>>  #ifdef CONFIG_MEMCG_KMEM
>>>  void sock_update_memcg(struct sock *sk);
>>>  void sock_release_memcg(struct sock *sk);
>>> +
>>> +#define memcg_kmem_on 1
>>> +bool __memcg_kmem_new_page(gfp_t gfp, void *handle, int order);
>>> +void __memcg_kmem_commit_page(struct page *page, void *handle, int order);
>>> +void __memcg_kmem_free_page(struct page *page, int order);
>>>  #else
>>>  static inline void sock_update_memcg(struct sock *sk)
>>>  {
>>> @@ -406,6 +412,79 @@ static inline void sock_update_memcg(struct sock *sk)
>>>  static inline void sock_release_memcg(struct sock *sk)
>>>  {
>>>  }
>>> +
>>> +#define memcg_kmem_on 0
>>> +static inline bool
>>> +__memcg_kmem_new_page(gfp_t gfp, void *handle, int order)
>>> +{
>>> +   return false;
>>> +}
>>> +
>>> +static inline void  __memcg_kmem_free_page(struct page *page, int order)
>>> +{
>>> +}
>>> +
>>> +static inline void
>>> +__memcg_kmem_commit_page(struct page *page, struct mem_cgroup *handle, int order)
>>> +{
>>> +}
>>>  #endif /* CONFIG_MEMCG_KMEM */
>>> +
>>> +/**
>>> + * memcg_kmem_new_page: verify if a new kmem allocation is allowed.
>>> + * @gfp: the gfp allocation flags.
>>> + * @handle: a pointer to the memcg this was charged against.
>>> + * @order: allocation order.
>>> + *
>>> + * returns true if the memcg where the current task belongs can hold this
>>> + * allocation.
>>> + *
>>> + * We return true automatically if this allocation is not to be accounted to
>>> + * any memcg.
>>> + */
>>> +static __always_inline bool
>>> +memcg_kmem_new_page(gfp_t gfp, void *handle, int order)
>>> +{
>>> +   if (!memcg_kmem_on)
>>> +   return true;
>>> +   if (!(gfp & __GFP_KMEMCG) || (gfp & __GFP_NOFAIL))
>>> +   return true;
>>> +   if (in_interrupt() || (!current->mm) || (current->flags & PF_KTHREAD))
>>> +   return true;
>>> +   return __memcg_kmem_new_page(gfp, handle, order);
>>> +}
>>> +
>>> +/**
>>> + * memcg_kmem_free_page: uncharge pages from memcg
>>> + * @page: pointer to struct page being freed
>>> + * @order: allocation order.
>>> + *
>>> + * there is no need to specify memcg here, since it is embedded in page_cgroup
>>> + */
>>> +static __always_inline void
>>> +memcg_kmem_free_page(struct page *page, int order)
>>> +{
>>> +   if (memcg_kmem_on)
>>> +   __memcg_kmem_free_page(page, order);
>>> +}
>>> +
>>> +/**
>>> + * memcg_kmem_commit_page: embeds correct memcg in a page
>>> + * @page: pointer to struct page recently allocated
>>> + * @handle: the memcg structure we charged against
>>> + * @order: allocation order.
>>> + *
>>> + * Needs to be called after memcg_kmem_new_page, regardless of success or
>>> + * failure of the allocation. if @page is NULL, this function will revert the
>>> + * charges. Otherwise, it will commit the memcg given by @handle to the
>>> + * corresponding page_cgroup.
>>> + */
>>> +static __always_inline void
>>> +memcg_kmem_commit_page(struct page *page, struct mem_cgroup *handle, int order)
>>> +{
>>> +   if (memcg_kmem_on)
>>> +   __memcg_kmem_commit_page(page, handle, order);
>>> +}
>>>  #endif /* _LINUX_MEMCONTROL_H */
>>>  
>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>> index 54e93de..e9824c1 100644
>>> --- a/mm/memcontrol.c
>>> +++ b/mm/memcontrol.c
>>> @@ -10,6 +10,10 @@
>>>   * Copyright (C) 2009 Nokia Corporation
>>>   * Author: Kirill A. Shutemov
>>>   *
>>> + * Kernel Memory Controller
>>> + * Copyright (C) 2012 Parallels Inc. and Google Inc.
>>> + * Authors: Glauber Costa and Suleiman Souhlal
>>> + *

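The intended call pattern for the three accessors quoted above is roughly
the following sketch (caller names are hypothetical; the real page
allocator hooks arrive in a later patch of the series):

    /* hypothetical caller: pre-charge, allocate, then commit or revert */
    struct page *alloc_kmem_pages_sketch(gfp_t gfp, int order)
    {
            struct mem_cgroup *memcg = NULL;
            struct page *page;

            /* may refuse if the task's memcg is at its kmem limit */
            if (!memcg_kmem_new_page(gfp, &memcg, order))
                    return NULL;

            page = alloc_pages(gfp, order);

            /* commits the memcg on success; reverts the charge when
             * page == NULL, as the docstring above requires */
            memcg_kmem_commit_page(page, memcg, order);
            return page;
    }

    void free_kmem_pages_sketch(struct page *page, int order)
    {
            /* uncharge: the memcg is recovered from the page_cgroup */
            memcg_kmem_free_page(page, order);
            __free_pages(page, order);
    }
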
Re: [PATCH v2 06/11] memcg: kmem controller infrastructure

2012-08-22 Thread Glauber Costa
On 08/22/2012 01:50 AM, Greg Thelen wrote:
> On Thu, Aug 09 2012, Glauber Costa wrote:
> 
>> This patch introduces infrastructure for tracking kernel memory pages to
>> a given memcg. This will happen whenever the caller includes the
>> __GFP_KMEMCG flag, and the task belongs to a memcg other than the root.
>>
>> In memcontrol.h those functions are wrapped in inline accessors.  The
>> idea is to later on, patch those with static branches, so we don't incur
>> any overhead when no mem cgroups with limited kmem are being used.
>>
>> [ v2: improved comments and standardized function names ]
>>
>> Signed-off-by: Glauber Costa <glom...@parallels.com>
>> CC: Christoph Lameter <c...@linux.com>
>> CC: Pekka Enberg <penb...@cs.helsinki.fi>
>> CC: Michal Hocko <mho...@suse.cz>
>> CC: Kamezawa Hiroyuki <kamezawa.hir...@jp.fujitsu.com>
>> CC: Johannes Weiner <han...@cmpxchg.org>
>> ---
>>  include/linux/memcontrol.h |  79 +++
>>  mm/memcontrol.c| 185 +
>>  2 files changed, 264 insertions(+)
>>
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index 8d9489f..75b247e 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -21,6 +21,7 @@
>>  #define _LINUX_MEMCONTROL_H
>>  #include <linux/cgroup.h>
>>  #include <linux/vm_event_item.h>
>> +#include <linux/hardirq.h>
>>  
>>  struct mem_cgroup;
>>  struct page_cgroup;
>> @@ -399,6 +400,11 @@ struct sock;
>>  #ifdef CONFIG_MEMCG_KMEM
>>  void sock_update_memcg(struct sock *sk);
>>  void sock_release_memcg(struct sock *sk);
>> +
>> +#define memcg_kmem_on 1
>> +bool __memcg_kmem_new_page(gfp_t gfp, void *handle, int order);
>> +void __memcg_kmem_commit_page(struct page *page, void *handle, int order);
>> +void __memcg_kmem_free_page(struct page *page, int order);
>>  #else
>>  static inline void sock_update_memcg(struct sock *sk)
>>  {
>> @@ -406,6 +412,79 @@ static inline void sock_update_memcg(struct sock *sk)
>>  static inline void sock_release_memcg(struct sock *sk)
>>  {
>>  }
>> +
>> +#define memcg_kmem_on 0
>> +static inline bool
>> +__memcg_kmem_new_page(gfp_t gfp, void *handle, int order)
>> +{
>> +	return false;
>> +}
>> +
>> +static inline void  __memcg_kmem_free_page(struct page *page, int order)
>> +{
>> +}
>> +
>> +static inline void
>> +__memcg_kmem_commit_page(struct page *page, struct mem_cgroup *handle, int order)
>> +{
>> +}
>>  #endif /* CONFIG_MEMCG_KMEM */
>> +
>> +/**
>> + * memcg_kmem_new_page: verify if a new kmem allocation is allowed.
>> + * @gfp: the gfp allocation flags.
>> + * @handle: a pointer to the memcg this was charged against.
>> + * @order: allocation order.
>> + *
>> + * returns true if the memcg where the current task belongs can hold this
>> + * allocation.
>> + *
>> + * We return true automatically if this allocation is not to be accounted to
>> + * any memcg.
>> + */
>> +static __always_inline bool
>> +memcg_kmem_new_page(gfp_t gfp, void *handle, int order)
>> +{
>> +	if (!memcg_kmem_on)
>> +		return true;
>> +	if (!(gfp & __GFP_KMEMCG) || (gfp & __GFP_NOFAIL))
>> +		return true;
>> +	if (in_interrupt() || (!current->mm) || (current->flags & PF_KTHREAD))
>> +		return true;
>> +	return __memcg_kmem_new_page(gfp, handle, order);
>> +}
>> +
>> +/**
>> + * memcg_kmem_free_page: uncharge pages from memcg
>> + * @page: pointer to struct page being freed
>> + * @order: allocation order.
>> + *
>> + * there is no need to specify memcg here, since it is embedded in page_cgroup
>> + */
>> +static __always_inline void
>> +memcg_kmem_free_page(struct page *page, int order)
>> +{
>> +	if (memcg_kmem_on)
>> +		__memcg_kmem_free_page(page, order);
>> +}
>> +
>> +/**
>> + * memcg_kmem_commit_page: embeds correct memcg in a page
>> + * @page: pointer to struct page recently allocated
>> + * @handle: the memcg structure we charged against
>> + * @order: allocation order.
>> + *
>> + * Needs to be called after memcg_kmem_new_page, regardless of success or
>> + * failure of the allocation. if @page is NULL, this function will revert the
>> + * charges. Otherwise, it will commit the memcg given by @handle to the
>> + * corresponding page_cgroup.
>> + */
>> +static __always_inline void
>> +memcg_kmem_commit_page(struct page *page, struct mem_cgroup *handle, int order)
>> +{
>> +	if (memcg_kmem_on)
>> +		__memcg_kmem_commit_page(page, handle, order);
>> +}
>>  #endif /* _LINUX_MEMCONTROL_H */
>>  
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 54e93de..e9824c1 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -10,6 +10,10 @@
>>   * Copyright (C) 2009 Nokia Corporation
>>   * Author: Kirill A. Shutemov
>>   *
>> + * Kernel Memory Controller
>> + * Copyright (C) 2012 Parallels Inc. and Google Inc.
>> + * Authors: Glauber Costa and Suleiman Souhlal
>> + *
>>   * This program is free software; you can redistribute it and/or modify
>>   * it under the terms of the GNU General Public License as published by
>>   * the Free Software 

Re: [PATCH v2 06/11] memcg: kmem controller infrastructure

2012-08-21 Thread Greg Thelen
On Thu, Aug 09 2012, Glauber Costa wrote:

> This patch introduces infrastructure for tracking kernel memory pages to
> a given memcg. This will happen whenever the caller includes the
> __GFP_KMEMCG flag, and the task belongs to a memcg other than the root.
>
> In memcontrol.h those functions are wrapped in inline accessors.  The
> idea is to later on, patch those with static branches, so we don't incur
> any overhead when no mem cgroups with limited kmem are being used.
>
> [ v2: improved comments and standardized function names ]
>
> Signed-off-by: Glauber Costa <glom...@parallels.com>
> CC: Christoph Lameter <c...@linux.com>
> CC: Pekka Enberg <penb...@cs.helsinki.fi>
> CC: Michal Hocko <mho...@suse.cz>
> CC: Kamezawa Hiroyuki <kamezawa.hir...@jp.fujitsu.com>
> CC: Johannes Weiner <han...@cmpxchg.org>
> ---
>  include/linux/memcontrol.h |  79 +++
>  mm/memcontrol.c| 185 +
>  2 files changed, 264 insertions(+)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 8d9489f..75b247e 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -21,6 +21,7 @@
>  #define _LINUX_MEMCONTROL_H
>  #include <linux/cgroup.h>
>  #include <linux/vm_event_item.h>
> +#include <linux/hardirq.h>
>  
>  struct mem_cgroup;
>  struct page_cgroup;
> @@ -399,6 +400,11 @@ struct sock;
>  #ifdef CONFIG_MEMCG_KMEM
>  void sock_update_memcg(struct sock *sk);
>  void sock_release_memcg(struct sock *sk);
> +
> +#define memcg_kmem_on 1
> +bool __memcg_kmem_new_page(gfp_t gfp, void *handle, int order);
> +void __memcg_kmem_commit_page(struct page *page, void *handle, int order);
> +void __memcg_kmem_free_page(struct page *page, int order);
>  #else
>  static inline void sock_update_memcg(struct sock *sk)
>  {
> @@ -406,6 +412,79 @@ static inline void sock_update_memcg(struct sock *sk)
>  static inline void sock_release_memcg(struct sock *sk)
>  {
>  }
> +
> +#define memcg_kmem_on 0
> +static inline bool
> +__memcg_kmem_new_page(gfp_t gfp, void *handle, int order)
> +{
> + return false;
> +}
> +
> +static inline void  __memcg_kmem_free_page(struct page *page, int order)
> +{
> +}
> +
> +static inline void
> +__memcg_kmem_commit_page(struct page *page, struct mem_cgroup *handle, int order)
> +{
> +}
>  #endif /* CONFIG_MEMCG_KMEM */
> +
> +/**
> + * memcg_kmem_new_page: verify if a new kmem allocation is allowed.
> + * @gfp: the gfp allocation flags.
> + * @handle: a pointer to the memcg this was charged against.
> + * @order: allocation order.
> + *
> + * returns true if the memcg where the current task belongs can hold this
> + * allocation.
> + *
> + * We return true automatically if this allocation is not to be accounted to
> + * any memcg.
> + */
> +static __always_inline bool
> +memcg_kmem_new_page(gfp_t gfp, void *handle, int order)
> +{
> + if (!memcg_kmem_on)
> + return true;
> + if (!(gfp & __GFP_KMEMCG) || (gfp & __GFP_NOFAIL))
> + return true;
> + if (in_interrupt() || (!current->mm) || (current->flags & PF_KTHREAD))
> + return true;
> + return __memcg_kmem_new_page(gfp, handle, order);
> +}
> +
> +/**
> + * memcg_kmem_free_page: uncharge pages from memcg
> + * @page: pointer to struct page being freed
> + * @order: allocation order.
> + *
> + * there is no need to specify memcg here, since it is embedded in page_cgroup
> + */
> +static __always_inline void
> +memcg_kmem_free_page(struct page *page, int order)
> +{
> + if (memcg_kmem_on)
> + __memcg_kmem_free_page(page, order);
> +}
> +
> +/**
> + * memcg_kmem_commit_page: embeds correct memcg in a page
> + * @page: pointer to struct page recently allocated
> + * @handle: the memcg structure we charged against
> + * @order: allocation order.
> + *
> + * Needs to be called after memcg_kmem_new_page, regardless of success or
> + * failure of the allocation. if @page is NULL, this function will revert the
> + * charges. Otherwise, it will commit the memcg given by @handle to the
> + * corresponding page_cgroup.
> + */
> +static __always_inline void
> +memcg_kmem_commit_page(struct page *page, struct mem_cgroup *handle, int order)
> +{
> + if (memcg_kmem_on)
> + __memcg_kmem_commit_page(page, handle, order);
> +}
>  #endif /* _LINUX_MEMCONTROL_H */
>  
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 54e93de..e9824c1 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -10,6 +10,10 @@
>   * Copyright (C) 2009 Nokia Corporation
>   * Author: Kirill A. Shutemov
>   *
> + * Kernel Memory Controller
> + * Copyright (C) 2012 Parallels Inc. and Google Inc.
> + * Authors: Glauber Costa and Suleiman Souhlal
> + *
>   * This program is free software; you can redistribute it and/or modify
>   * it under the terms of the GNU General Public License as published by
>   * the Free Software Foundation; either version 2 of the License, or
> @@ -434,6 +438,9 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *s)
>  #include <net/ip.h>
>  
>  static bool 

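The changelog's plan to "patch those with static branches" would replace
the compile-time memcg_kmem_on constant with something along these lines
(a sketch only; at this point in the series the #define is still used, and
memcg_kmem_enabled_key is a hypothetical name):

    #include <linux/jump_label.h>

    /* flipped the first time some memcg enables a kmem limit */
    struct static_key memcg_kmem_enabled_key = STATIC_KEY_INIT_FALSE;

    static __always_inline bool memcg_kmem_on(void)
    {
            /* compiles to a patchable jump, so the disabled case adds
             * essentially no overhead to the allocation fast path */
            return static_key_false(&memcg_kmem_enabled_key);
    }
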
Re: [PATCH v2 06/11] memcg: kmem controller infrastructure

2012-08-20 Thread Glauber Costa
On 08/20/2012 05:36 PM, Kamezawa Hiroyuki wrote:
> (2012/08/16 2:00), Glauber Costa wrote:
>> On 08/15/2012 08:38 PM, Greg Thelen wrote:
>>> On Wed, Aug 15 2012, Glauber Costa wrote:
>>>
 On 08/14/2012 10:58 PM, Greg Thelen wrote:
> On Mon, Aug 13 2012, Glauber Costa wrote:
>
> +WARN_ON(mem_cgroup_is_root(memcg));
> +size = (1 << order) << PAGE_SHIFT;
> +memcg_uncharge_kmem(memcg, size);
> +mem_cgroup_put(memcg);
>>> Why do we need ref-counting here ? kmem res_counter cannot work as
>>> reference ?
>> This is of course the pair of the mem_cgroup_get() you commented on
>> earlier. If we need one, we need the other. If we don't need one, we
>> don't need the other =)
>>
>> The guarantee we're trying to give here is that the memcg
>> structure will
>> stay around while there are dangling charges to kmem, that we decided
>> not to move (remember: moving it for the stack is simple, for the
>> slab
>> is very complicated and ill-defined, and I believe it is better to
>> treat
>> all kmem equally here)
>
> By keeping memcg structures hanging around until the last referring
> kmem
> page is uncharged do such zombie memcg each consume a css_id and thus
> put pressure on the 64k css_id space?  I imagine in pathological cases
> this would prevent creation of new cgroups until these zombies are
> dereferenced.

 Yes, but although this patch makes it more likely, it doesn't introduce
 that. If the tasks, for instance, grab a reference to the cgroup dentry
 in the filesystem (like their CWD, etc), they will also keep the cgroup
 around.
>>>
>>> Fair point.  But this doesn't seems like a feature.  It's probably not
>>> needed initially, but what do you think about creating a
>>> memcg_kernel_context structure which is allocated when memcg is
>>> allocated?  Kernel pages charged to a memcg would have
>>> page_cgroup->mem_cgroup=memcg_kernel_context rather than memcg.  This
>>> would allow the mem_cgroup and its css_id to be deleted when the cgroup
>>> is unlinked from cgroupfs while allowing for the active kernel pages to
>>> continue pointing to a valid memcg_kernel_context.  This would be a
>>> reference counted structure much like you are doing with memcg.  When a
>>> memcg is deleted the memcg_kernel_context would be linked into its
>>> surviving parent memcg.  This would avoid needing to visit each kernel
>>> page.
>>
>> You need more, you need at least the res_counters to stay around as well. And
>> probably other fields.
>>
>> So my fear here is that as you add fields to that structure, you can
>> defeat a bit the goal of reducing memory consumption. Still leaves the
>> css space, yes. But by doing this we can introduce some subtle bugs by
>> having a field in the wrong structure.
>>
> 
> Hm, can't we free css_id and delete css structure from the css_id idr tree
> when a memcg goes zombie ?
> 
> Thanks,
> -Kame

Kame,

I wrote a patch that does exactly that. Can you take a look? (I posted
it already)
I actually need to go back to it, because greg seems to be right saying
that that will break things for memsw. But a simplified version may work.

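The structure Greg proposes would look roughly like this (field names are
hypothetical, with Glauber's caveat that the res_counter, and possibly
more, would have to survive folded in as comments):

    /* outlives the mem_cgroup; holds state for dangling kmem charges */
    struct memcg_kernel_context {
            atomic_t refcnt;         /* one reference per charged kmem page */
            struct res_counter kmem; /* must stay around for uncharges; per
                                      * Glauber, more fields may end up here */
            struct memcg_kernel_context *parent; /* re-linked into the
                                                  * surviving parent when the
                                                  * memcg is deleted */
    };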


Re: [PATCH v2 06/11] memcg: kmem controller infrastructure

2012-08-20 Thread Kamezawa Hiroyuki

(2012/08/16 2:00), Glauber Costa wrote:

On 08/15/2012 08:38 PM, Greg Thelen wrote:

On Wed, Aug 15 2012, Glauber Costa wrote:


On 08/14/2012 10:58 PM, Greg Thelen wrote:

On Mon, Aug 13 2012, Glauber Costa wrote:


+   WARN_ON(mem_cgroup_is_root(memcg));
+   size = (1 << order) << PAGE_SHIFT;
+   memcg_uncharge_kmem(memcg, size);
+   mem_cgroup_put(memcg);

Why do we need ref-counting here ? kmem res_counter cannot work as
reference ?

This is of course the pair of the mem_cgroup_get() you commented on
earlier. If we need one, we need the other. If we don't need one, we
don't need the other =)

The guarantee we're trying to give here is that the memcg structure will
stay around while there are dangling charges to kmem, that we decided
not to move (remember: moving it for the stack is simple, for the slab
is very complicated and ill-defined, and I believe it is better to treat
all kmem equally here)


By keeping memcg structures hanging around until the last referring kmem
page is uncharged do such zombie memcg each consume a css_id and thus
put pressure on the 64k css_id space?  I imagine in pathological cases
this would prevent creation of new cgroups until these zombies are
dereferenced.


Yes, but although this patch makes it more likely, it doesn't introduce
that. If the tasks, for instance, grab a reference to the cgroup dentry
in the filesystem (like their CWD, etc), they will also keep the cgroup
around.


Fair point.  But this doesn't seems like a feature.  It's probably not
needed initially, but what do you think about creating a
memcg_kernel_context structure which is allocated when memcg is
allocated?  Kernel pages charged to a memcg would have
page_cgroup->mem_cgroup=memcg_kernel_context rather than memcg.  This
would allow the mem_cgroup and its css_id to be deleted when the cgroup
is unlinked from cgroupfs while allowing for the active kernel pages to
continue pointing to a valid memcg_kernel_context.  This would be a
reference counted structure much like you are doing with memcg.  When a
memcg is deleted the memcg_kernel_context would be linked into its
surviving parent memcg.  This would avoid needing to visit each kernel
page.


You need more, you need at least the res_counters to stay around as well. And
probably other fields.

So my fear here is that as you add fields to that structure, you can
defeat a bit the goal of reducing memory consumption. Still leaves the
css space, yes. But by doing this we can introduce some subtle bugs by
having a field in the wrong structure.



Hm, can't we free css_id and delete css structure from the css_id idr tree
when a memcg goes zombie ?

Thanks,
-Kame



Re: [PATCH v2 06/11] memcg: kmem controller infrastructure

2012-08-17 Thread Glauber Costa
On 08/17/2012 06:36 AM, Kamezawa Hiroyuki wrote:
> I just want to see the same logic used in mem_cgroup_uncharge_common().
> Hmm, when setting pc->mem_cgroup, things happen in this order:
>    set pc->mem_cgroup
>    set Used bit
> If you clear pc->mem_cgroup, the order
>    unset Used bit
>    clear pc->mem_cgroup
> seems reasonable.

Makes sense. I'll make sure we're consistent here.

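The agreed-on symmetry, sketched (locking elided; it mirrors the order used
in mem_cgroup_uncharge_common()):

    /* charge side */
    pc->mem_cgroup = memcg;     /* 1. set the owner */
    SetPageCgroupUsed(pc);      /* 2. then publish via the Used bit */

    /* uncharge side, reversed */
    ClearPageCgroupUsed(pc);    /* 1. unpublish first */
    pc->mem_cgroup = NULL;      /* 2. then clear the owner */
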
Re: [PATCH v2 06/11] memcg: kmem controller infrastructure

2012-08-16 Thread Kamezawa Hiroyuki
(2012/08/13 17:28), Glauber Costa wrote:
 + * Needs to be called after memcg_kmem_new_page, regardless of success or
 + * failure of the allocation. if @page is NULL, this function will revert the
 + * charges. Otherwise, it will commit the memcg given by @handle to the
 + * corresponding page_cgroup.
 + */
 +static __always_inline void
 +memcg_kmem_commit_page(struct page *page, struct mem_cgroup *handle, int order)
 +{
 +  if (memcg_kmem_on)
 +  __memcg_kmem_commit_page(page, handle, order);
 +}
>> Don't these 2 functions have short-cuts ?
> 
> Sorry kame, what exactly do you mean?
> 
I meant avoiding the function call. But please ignore, I missed the following patches.


>> if (memcg_kmem_on && handle) ?
> I guess this can be done to avoid a function call.
> 
>> Maybe free() needs to access page_cgroup...
>>
> Can you also be a bit more specific here?
> 

Please ignore, I misunderstood the usage of free_accounted_pages().

 +bool __memcg_kmem_new_page(gfp_t gfp, void *_handle, int order)
 +{
 +  struct mem_cgroup *memcg;
 +  struct mem_cgroup **handle = (struct mem_cgroup **)_handle;
 +  bool ret = true;
 +  size_t size;
 +  struct task_struct *p;
 +
 +  *handle = NULL;
 +  rcu_read_lock();
 +  p = rcu_dereference(current->mm->owner);
 +  memcg = mem_cgroup_from_task(p);
 +  if (!memcg_kmem_enabled(memcg))
 +  goto out;
 +
 +  mem_cgroup_get(memcg);
 +
>> This mem_cgroup_get() will be a potential performance problem.
>> Do you have a good idea for avoiding the atomic counter access here ?
>> I think some kind of percpu counter or a feature to disable "move task"
>> would help.
> 
> 
> 
> 
 +  pc = lookup_page_cgroup(page);
 +  lock_page_cgroup(pc);
 +  pc->mem_cgroup = memcg;
 +  SetPageCgroupUsed(pc);
 +  unlock_page_cgroup(pc);
 +}
 +
 +void __memcg_kmem_free_page(struct page *page, int order)
 +{
 +  struct mem_cgroup *memcg;
 +  size_t size;
 +  struct page_cgroup *pc;
 +
 +  if (mem_cgroup_disabled())
 +  return;
 +
 +  pc = lookup_page_cgroup(page);
 +  lock_page_cgroup(pc);
 +  memcg = pc->mem_cgroup;
 +  pc->mem_cgroup = NULL;
> 
>> shouldn't this happen after checking "Used" bit ?
>> Ah, BTW, why do you need to clear pc->memcg ?
> 
> As for clearing pc->memcg, I think I'm just being overzealous. I can't
> foresee any problems due to removing it.
> 
> As for the Used bit, what difference does it make when we clear it?
>
I just want to see the same logic used in mem_cgroup_uncharge_common().
Hmm, when setting pc->mem_cgroup, things happen in this order:
   set pc->mem_cgroup
   set Used bit
If you clear pc->mem_cgroup, the order
   unset Used bit
   clear pc->mem_cgroup
seems reasonable.


 
 +  if (!PageCgroupUsed(pc)) {
 +  unlock_page_cgroup(pc);
 +  return;
 +  }
 +  ClearPageCgroupUsed(pc);
 +  unlock_page_cgroup(pc);
 +
 +  /*
 +   * Checking if kmem accounted is enabled won't work for uncharge, since
 +   * it is possible that the user enabled kmem tracking, allocated, and
 +   * then disabled it again.
 +   *
 +   * We trust if there is a memcg associated with the page, it is a valid
 +   * allocation
 +   */
 +  if (!memcg)
 +  return;
 +
 +  WARN_ON(mem_cgroup_is_root(memcg));
 +  size = (1 << order) << PAGE_SHIFT;
 +  memcg_uncharge_kmem(memcg, size);
 +  mem_cgroup_put(memcg);
>> Why do we need ref-counting here ? kmem res_counter cannot work as
>> reference ?
> This is of course the pair of the mem_cgroup_get() you commented on
> earlier. If we need one, we need the other. If we don't need one, we
> don't need the other =)
> 
> The guarantee we're trying to give here is that the memcg structure will
> stay around while there are dangling charges to kmem, that we decided
> not to move (remember: moving it for the stack is simple, for the slab
> is very complicated and ill-defined, and I believe it is better to treat
> all kmem equally here)
> 
> So maybe we can be clever here, and avoid reference counting at all
> times. We call mem_cgroup_get() when the first charge occurs, and then
> go for mem_cgroup_put() when our count reaches 0.
> 
> What do you think about that?
> 

I think that should work. I don't want to add unoptimized atomic counter ops
in this very hot path.

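Glauber's scheme, sketched (kmem_charged_bytes is a hypothetical per-memcg
counter, and the races around the zero crossings are glossed over; the
point is that get/put run once per memcg lifetime, not once per page):

    /* charge path: take the reference only on the first kmem charge */
    if (atomic64_add_return(size, &memcg->kmem_charged_bytes) == size)
            mem_cgroup_get(memcg);

    /* uncharge path: drop it only when the last charge goes away */
    if (atomic64_sub_return(size, &memcg->kmem_charged_bytes) == 0)
            mem_cgroup_put(memcg);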

> 
 +#ifdef CONFIG_MEMCG_KMEM
 +int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, s64 delta)
 +{
>> What does 'delta' mean ?
>>
> I can change it to something like nr_bytes, more informative.
> 
 +  struct res_counter *fail_res;
 +  struct mem_cgroup *_memcg;
 +  int ret;
 +  bool may_oom;
 +  bool nofail = false;
 +
 +  may_oom = (gfp & __GFP_WAIT) && (gfp & __GFP_FS) &&
 +  !(gfp & __GFP_NORETRY);
 +
 +  ret = 0;
 +
 +  

Re: [PATCH v2 06/11] memcg: kmem controller infrastructure

2012-08-16 Thread Glauber Costa
On 08/16/2012 07:05 PM, Michal Hocko wrote:
> On Thu 16-08-12 13:57:07, Glauber Costa wrote:
>> On 08/16/2012 01:53 PM, Michal Hocko wrote:
>>> On Wed 15-08-12 18:27:45, Glauber Costa wrote:

>>
>> I see now, you seem to be right.
>
> No I am not because it seems that I am really blind these days...
> We were doing this in mem_cgroup_do_charge for ages:
>   if (!(gfp_mask & __GFP_WAIT))
> return CHARGE_WOULDBLOCK;
>
> /me goes to hide and get with further feedback with a clean head.
>
> Sorry about that.
>
 I am as well, since I went to look at mem_cgroup_do_charge() and missed
 that.
>>>
>>> I thought we are not doing atomic allocations in user pages accounting
>>> but I was obviously wrong because at least shmem uses atomic
>>> allocations for ages.
>>>
 Do you have any other concerns specific to this patch ?
>>>
>>> I understood you changed also handle thingy. So the patch should be
>>> correct.
>>> Do you plan to send an updated version?
>>>
>> That depends more on you than on me! =)
>>
>> Do you still have any concerns regarding the u+k charging as it stands
>> now? That would be the last big concern I heard during this iteration.
> 
> Well, I am still not 100% sure because I still see technical
> difficulties that are not addressed by the patchset (memcg-oom, memcg
> slab shrinking, possibly others). More importantly this is changing the
> current semantic of the limit so we should better be careful about it
> and check that we are not making the code tight to specific workloads
> without a way out.
> 
> On the other hand I do not want to block the progress here without
> having _really_ good arguments against that couldn't be handled later
> (and it seems that some of my concerns are work in progress already).
> 
> I have to admit I like several things about the patchset. Especially the
> way it enables easy-to-setup (aka don't care about kmem details, just
> make sure you can cap the thing) as well as "I know exactly what I want
> to do" usecases.
> It is also nice that only users of the feature are affected by
> potential issues.
> 
> So I think it is worth a broader attention which could produce other use
> cases which could show potential drawbacks from the u+k semantic but I
> would be still very careful about merging it to the Linus tree and only
> merge it after at least the memcg reclaim path is slab aware. Living in
> the -mm tree should help us with the testing coverage.
> 
> Does it sound reasonable?
> 
What I really want is to have it in an "official" tree so it starts
getting used and tested without me having to rebase at every single change.

If Andrew is okay merging this into -mm, it is fine for me.



Re: [PATCH v2 06/11] memcg: kmem controller infrastructure

2012-08-16 Thread Michal Hocko
On Thu 16-08-12 13:57:07, Glauber Costa wrote:
> On 08/16/2012 01:53 PM, Michal Hocko wrote:
> > On Wed 15-08-12 18:27:45, Glauber Costa wrote:
> >>
> 
>  I see now, you seem to be right.
> >>>
> >>> No I am not because it seems that I am really blind these days...
> >>> We were doing this in mem_cgroup_do_charge for ages:
> >>>   if (!(gfp_mask & __GFP_WAIT))
> >>> return CHARGE_WOULDBLOCK;
> >>>
> >>> /me goes to hide and get with further feedback with a clean head.
> >>>
> >>> Sorry about that.
> >>>
> >> I am as well, since I went to look at mem_cgroup_do_charge() and missed
> >> that.
> > 
> > I thought we are not doing atomic allocations in user pages accounting
> > but I was obviously wrong because at least shmem uses atomic
> > allocations for ages.
> > 
> >> Do you have any other concerns specific to this patch ?
> > 
> > I understood you also changed the handle thingy, so the patch should
> > be correct.
> > Do you plan to send an updated version?
> > 
> That depends more on you than on me! =)
> 
> Do you still have any concerns regarding the u+k charging as it stands
> now? That would be the last big concern I heard during this iteration.

Well, I am still not 100% sure, because I still see technical
difficulties that are not addressed by the patchset (memcg-oom, memcg
slab shrinking, possibly others). More importantly, this changes the
current semantics of the limit, so we had better be careful about it
and check that we are not tying the code to specific workloads without
a way out.

On the other hand, I do not want to block the progress here without
_really_ good arguments against it that couldn't be handled later
(and it seems that some of my concerns are work in progress already).

I have to admit I like several things about the patchset, especially
the way it enables both easy-to-setup usecases (aka don't care about
kmem details, just make sure you can cap the thing) and "I know exactly
what I want to do" usecases.
It is also nice that only users of the feature are affected by
potential issues.

So I think it is worth broader attention, which could produce other use
cases that show potential drawbacks of the u+k semantic, but I would
still be very careful about merging it into the Linus tree, and would
only merge it after at least the memcg reclaim path is slab aware.
Living in the -mm tree should help us with the testing coverage.

Does that sound reasonable?
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 06/11] memcg: kmem controller infrastructure

2012-08-16 Thread Glauber Costa
On 08/16/2012 01:53 PM, Michal Hocko wrote:
> On Wed 15-08-12 18:27:45, Glauber Costa wrote:
>>

 I see now, you seem to be right.
>>>
>>> No I am not because it seems that I am really blind these days...
>>> We were doing this in mem_cgroup_do_charge for ages:
>>> if (!(gfp_mask & __GFP_WAIT))
>>> 	return CHARGE_WOULDBLOCK;
>>>
>>> /me goes to hide and come back with further feedback and a clean head.
>>>
>>> Sorry about that.
>>>
>> I am as well, since I went to look at mem_cgroup_do_charge() and missed
>> that.
> 
> I thought we were not doing atomic allocations in the user pages
> accounting, but I was obviously wrong, because at least shmem has
> used atomic allocations for ages.
> 
>> Do you have any other concerns specific to this patch ?
> 
> I understood you also changed the handle thingy, so the patch should
> be correct.
> Do you plan to send an updated version?
> 
That depends more on you than on me! =)

Do you still have any concerns regarding the u+k charging as it stands
now? That would be the last big concern I heard during this iteration.

If you are happy with the answers you got so far, and believe it is
acceptable to proceed with the charging this way, I will be ready to
send an updated version soon.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 06/11] memcg: kmem controller infrastructure

2012-08-16 Thread Michal Hocko
On Wed 15-08-12 18:27:45, Glauber Costa wrote:
> 
> >>
> >> I see now, you seem to be right.
> > 
> > No I am not because it seems that I am really blind these days...
> > We were doing this in mem_cgroup_do_charge for ages:
> > if (!(gfp_mask & __GFP_WAIT))
> > 	return CHARGE_WOULDBLOCK;
> > 
> > /me goes to hide and come back with further feedback and a clean head.
> > 
> > Sorry about that.
> > 
> I am as well, since I went to look at mem_cgroup_do_charge() and missed
> that.

I thought we were not doing atomic allocations in the user pages
accounting, but I was obviously wrong, because at least shmem has
used atomic allocations for ages.

> Do you have any other concerns specific to this patch ?

I understood you also changed the handle thingy, so the patch should
be correct.
Do you plan to send an updated version?

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 06/11] memcg: kmem controller infrastructure

2012-08-16 Thread Glauber Costa
On 08/16/2012 07:37 AM, Greg Thelen wrote:
> On Wed, Aug 15 2012, Glauber Costa wrote:
> 
>> On 08/15/2012 09:12 PM, Greg Thelen wrote:
>>> On Wed, Aug 15 2012, Glauber Costa wrote:
>>>
 On 08/15/2012 08:38 PM, Greg Thelen wrote:
> On Wed, Aug 15 2012, Glauber Costa wrote:
>
>> On 08/14/2012 10:58 PM, Greg Thelen wrote:
>>> On Mon, Aug 13 2012, Glauber Costa wrote:
>>>
>>> +   WARN_ON(mem_cgroup_is_root(memcg));
>>> +   size = (1 << order) << PAGE_SHIFT;
>>> +   memcg_uncharge_kmem(memcg, size);
>>> +   mem_cgroup_put(memcg);
> Why do we need ref-counting here ? kmem res_counter cannot work as
> reference ?
 This is of course the pair of the mem_cgroup_get() you commented on
 earlier. If we need one, we need the other. If we don't need one, we
 don't need the other =)

 The guarantee we're trying to give here is that the memcg structure will
 stay around while there are dangling charges to kmem, that we decided
 not to move (remember: moving it for the stack is simple, for the slab
 is very complicated and ill-defined, and I believe it is better to treat
 all kmem equally here)
>>>
>>> By keeping memcg structures hanging around until the last referring kmem
>>> page is uncharged do such zombie memcg each consume a css_id and thus
>>> put pressure on the 64k css_id space?  I imagine in pathological cases
>>> this would prevent creation of new cgroups until these zombies are
>>> dereferenced.
>>
>> Yes, but although this patch makes it more likely, it doesn't introduce
>> that. If the tasks, for instance, grab a reference to the cgroup dentry
>> in the filesystem (like their CWD, etc), they will also keep the cgroup
>> around.
>
> Fair point.  But this doesn't seems like a feature.  It's probably not
> needed initially, but what do you think about creating a
> memcg_kernel_context structure which is allocated when memcg is
> allocated?  Kernel pages charged to a memcg would have
> page_cgroup->mem_cgroup=memcg_kernel_context rather than memcg.  This
> would allow the mem_cgroup and its css_id to be deleted when the cgroup
> is unlinked from cgroupfs while allowing for the active kernel pages to
> continue pointing to a valid memcg_kernel_context.  This would be a
> reference counted structure much like you are doing with memcg.  When a
> memcg is deleted the memcg_kernel_context would be linked into its
> surviving parent memcg.  This would avoid needing to visit each kernel
> page.

 You need more, you need at least the res_counters to stay around as
 well. And probably other fields.
>>>
>>> I am not sure the res_counters would need to stay around.  Once a
>>> memcg_kernel_context has been reparented, then any future kernel page
>>> uncharge calls will uncharge the parent res_counter.
>>
>> Well, if you hold the memcg due to a reference, like in the dentry case,
>> then fine. But if this is a dangling charge, as will be the case with
>> the slab, then you have to uncharge it.
>>
>> An arbitrary number of parents might have been deleted as well, so you
>> need to traverse them all until you reach a live parent to uncharge from.
> 
> I was thinking that each time a memcg is deleted move the
> memcg_kernel_context from the victim memcg to its parent.  When moving,
> also update the context to refer to the parent and link context to
> parent:
>   for_each_kernel_context(kernel_context, memcg) {
>     kernel_context->memcg = memcg->parent;
>     list_add(&kernel_context->list, &memcg->parent->kernel_contexts);
>   }
> 
> Whenever pages referring to a memcg_kernel_context are uncharged they
> will uncharge the nearest surviving parent memcg.
> 
>> To do that, your counters have to be still alive.
> 
> The counters of nearest surviving parent will be alive and pointed to by
> memcg_kernel_context->memcg.
> 
 So my fear here is that as you add fields to that structure, you can
 defeat a bit the goal of reducing memory consumption. Still leaves the
 css space, yes. But by doing this we can introduce some subtle bugs by
 having a field in the wrong structure.

 Did you observe that to be a big problem in your systems?
>>>
>>> No I have not seen this yet.  But our past solutions have reparented
>>> kmem_cache's to root memcg so we have been avoiding zombie memcg.  My
>>> concerns with your approach are just a suspicion because we have been
>>> experimenting with accounting of even more kernel memory (e.g. vmalloc,
>>> kernel stacks, page tables).  As the scope of such accounting grows the
>>> chance of long lived charged pages grows and thus the chance of zombies
>>> which exhaust the css_id space grows.
>>

Can't we just free the css_id, and establish the convention that it
should not be used after mem_cgroup_destroy()? The memory will still
stay around, sure, but at least the pressure on the css_id space goes
away.

I am testing a patch that does precisely that here, and will let you
know of the results. But if you were willing to have a smaller structure
just to serve as a zombie, any approach that works for it would have to
assume the css_id was already freed, so I don't anticipate huge
problems.

Re: [PATCH v2 06/11] memcg: kmem controller infrastructure

2012-08-16 Thread Kamezawa Hiroyuki
(2012/08/13 17:28), Glauber Costa wrote:
>>> + * Needs to be called after memcg_kmem_new_page, regardless of success or
>>> + * failure of the allocation. if @page is NULL, this function will revert the
>>> + * charges. Otherwise, it will commit the memcg given by @handle to the
>>> + * corresponding page_cgroup.
>>> + */
>>> +static __always_inline void
>>> +memcg_kmem_commit_page(struct page *page, struct mem_cgroup *handle, int order)
>>> +{
>>> +	if (memcg_kmem_on)
>>> +		__memcg_kmem_commit_page(page, handle, order);
>>> +}
>> Don't these 2 functions have short-cuts ?
>
> Sorry kame, what exactly do you mean?
>
I meant avoiding the function call. But please ignore, I missed the
following patches.


>> if (memcg_kmem_on && handle) ?
>> I guess this can be done to avoid a function call.
>>
>> Maybe free() needs to access page_cgroup...
>
> Can you also be a bit more specific here?
>
Please ignore, I misunderstood the usage of free_accounted_pages().

>>> +bool __memcg_kmem_new_page(gfp_t gfp, void *_handle, int order)
>>> +{
>>> +	struct mem_cgroup *memcg;
>>> +	struct mem_cgroup **handle = (struct mem_cgroup **)_handle;
>>> +	bool ret = true;
>>> +	size_t size;
>>> +	struct task_struct *p;
>>> +
>>> +	*handle = NULL;
>>> +	rcu_read_lock();
>>> +	p = rcu_dereference(current->mm->owner);
>>> +	memcg = mem_cgroup_from_task(p);
>>> +	if (!memcg_kmem_enabled(memcg))
>>> +		goto out;
>>> +
>>> +	mem_cgroup_get(memcg);
>>> +
>> This mem_cgroup_get() will be a potential performance problem.
>> Don't you have a good idea to avoid accessing the atomic counter here ?
>> I think some kind of percpu counter or a feature to disable task moving
>> will be a help.
>>
>>> +	pc = lookup_page_cgroup(page);
>>> +	lock_page_cgroup(pc);
>>> +	pc->mem_cgroup = memcg;
>>> +	SetPageCgroupUsed(pc);
>>> +	unlock_page_cgroup(pc);
>>> +}
>>> +
>>> +void __memcg_kmem_free_page(struct page *page, int order)
>>> +{
>>> +	struct mem_cgroup *memcg;
>>> +	size_t size;
>>> +	struct page_cgroup *pc;
>>> +
>>> +	if (mem_cgroup_disabled())
>>> +		return;
>>> +
>>> +	pc = lookup_page_cgroup(page);
>>> +	lock_page_cgroup(pc);
>>> +	memcg = pc->mem_cgroup;
>>> +	pc->mem_cgroup = NULL;
>>
>> shouldn't this happen after checking the Used bit ?
>> Ah, BTW, why do you need to clear pc->memcg ?
>
> As for clearing pc->memcg, I think I'm just being overzealous. I can't
> foresee any problems due to removing it.
>
> As for the Used bit, what difference does it make when we clear it?

I just want to see the same logic used in mem_cgroup_uncharge_common().
Hmm, at setting pc->mem_cgroup, things happen in
   set pc->mem_cgroup
   set Used bit
order. If you clear pc->mem_cgroup, then
   unset Used bit
   clear pc->mem_cgroup
seems reasonable.
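
For illustration, a minimal sketch of that symmetric ordering (helper
names are made up; locking elided, field and flag helpers as in the
patch):

	/* commit: set the pointer first, then let the Used bit validate it */
	static void kmem_commit(struct page_cgroup *pc, struct mem_cgroup *memcg)
	{
		pc->mem_cgroup = memcg;		/* 1) set pointer */
		SetPageCgroupUsed(pc);		/* 2) then mark valid */
	}

	/* uncharge: invalidate first, only then clear the pointer */
	static struct mem_cgroup *kmem_uncommit(struct page_cgroup *pc)
	{
		struct mem_cgroup *memcg = pc->mem_cgroup;

		if (!PageCgroupUsed(pc))
			return NULL;
		ClearPageCgroupUsed(pc);	/* 1) unset Used bit */
		pc->mem_cgroup = NULL;		/* 2) then clear pointer */
		return memcg;
	}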


 
>>> +	if (!PageCgroupUsed(pc)) {
>>> +		unlock_page_cgroup(pc);
>>> +		return;
>>> +	}
>>> +	ClearPageCgroupUsed(pc);
>>> +	unlock_page_cgroup(pc);
>>> +
>>> +	/*
>>> +	 * Checking if kmem accounted is enabled won't work for uncharge, since
>>> +	 * it is possible that the user enabled kmem tracking, allocated, and
>>> +	 * then disabled it again.
>>> +	 *
>>> +	 * We trust if there is a memcg associated with the page, it is a valid
>>> +	 * allocation
>>> +	 */
>>> +	if (!memcg)
>>> +		return;
>>> +
>>> +	WARN_ON(mem_cgroup_is_root(memcg));
>>> +	size = (1 << order) << PAGE_SHIFT;
>>> +	memcg_uncharge_kmem(memcg, size);
>>> +	mem_cgroup_put(memcg);
>> Why do we need ref-counting here ? kmem res_counter cannot work as
>> reference ?
> This is of course the pair of the mem_cgroup_get() you commented on
> earlier. If we need one, we need the other. If we don't need one, we
> don't need the other =)
>
> The guarantee we're trying to give here is that the memcg structure will
> stay around while there are dangling charges to kmem, that we decided
> not to move (remember: moving it for the stack is simple, for the slab
> is very complicated and ill-defined, and I believe it is better to treat
> all kmem equally here)
>
> So maybe we can be clever here, and avoid reference counting at all
> times. We call mem_cgroup_get() when the first charge occurs, and then
> go for mem_cgroup_put() when our count reaches 0.
>
> What do you think about that?
>

I think that should work. I don't want to add not-optimized atomic counter
ops in this very hot path.
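
A sketch of that scheme (the kmem_charges counter and the helper names
are made up, and the 0 <-> 1 transition races would still need care):

	static void kmem_charge_get(struct mem_cgroup *memcg, long pages)
	{
		/* only the very first outstanding charge takes a reference */
		if (atomic_long_add_return(pages, &memcg->kmem_charges) == pages)
			mem_cgroup_get(memcg);
	}

	static void kmem_uncharge_put(struct mem_cgroup *memcg, long pages)
	{
		/* the last outstanding charge going away drops it */
		if (atomic_long_sub_return(pages, &memcg->kmem_charges) == 0)
			mem_cgroup_put(memcg);
	}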


 
>>> +#ifdef CONFIG_MEMCG_KMEM
>>> +int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, s64 delta)
>>> +{
>> What does 'delta' mean ?
>
> I can change it to something like nr_bytes, more informative.
>
>>> +	struct res_counter *fail_res;
>>> +	struct mem_cgroup *_memcg;
>>> +	int ret;
>>> +	bool may_oom;
>>> +	bool nofail = false;
>>> +
>>> +	may_oom = (gfp & __GFP_WAIT) && (gfp & __GFP_FS) &&
>>> +		!(gfp & __GFP_NORETRY);
>>> +
>>> +	ret = 0;
>>> +
>>> +	if (!memcg)
>>> +		return ret;
>>> +
>>> +	_memcg = memcg;
>>> +	ret = __mem_cgroup_try_charge(NULL, gfp, delta / PAGE_SIZE,
>>> +				      &_memcg, may_oom);
>>> +
>>> +	if (ret == -EINTR)  {
>>> +		nofail = true;
>>> +		/*
>>> +		 * __mem_cgroup_try_charge() chose to bypass to root due to
>>> +		 * OOM kill or fatal signal.  Since our only options are to
>>> +		 * either fail the allocation or charge it to this cgroup, do

Re: [PATCH v2 06/11] memcg: kmem controller infrastructure

2012-08-15 Thread Greg Thelen
On Wed, Aug 15 2012, Glauber Costa wrote:

> On 08/15/2012 09:12 PM, Greg Thelen wrote:
>> On Wed, Aug 15 2012, Glauber Costa wrote:
>> 
>>> On 08/15/2012 08:38 PM, Greg Thelen wrote:
 On Wed, Aug 15 2012, Glauber Costa wrote:

> On 08/14/2012 10:58 PM, Greg Thelen wrote:
>> On Mon, Aug 13 2012, Glauber Costa wrote:
>>
>> +WARN_ON(mem_cgroup_is_root(memcg));
>> +size = (1 << order) << PAGE_SHIFT;
>> +memcg_uncharge_kmem(memcg, size);
>> +mem_cgroup_put(memcg);
 Why do we need ref-counting here ? kmem res_counter cannot work as
 reference ?
>>> This is of course the pair of the mem_cgroup_get() you commented on
>>> earlier. If we need one, we need the other. If we don't need one, we
>>> don't need the other =)
>>>
>>> The guarantee we're trying to give here is that the memcg structure will
>>> stay around while there are dangling charges to kmem, that we decided
>>> not to move (remember: moving it for the stack is simple, for the slab
>>> is very complicated and ill-defined, and I believe it is better to treat
>>> all kmem equally here)
>>
>> By keeping memcg structures hanging around until the last referring kmem
>> page is uncharged do such zombie memcg each consume a css_id and thus
>> put pressure on the 64k css_id space?  I imagine in pathological cases
>> this would prevent creation of new cgroups until these zombies are
>> dereferenced.
>
> Yes, but although this patch makes it more likely, it doesn't introduce
> that. If the tasks, for instance, grab a reference to the cgroup dentry
> in the filesystem (like their CWD, etc), they will also keep the cgroup
> around.

 Fair point.  But this doesn't seems like a feature.  It's probably not
 needed initially, but what do you think about creating a
 memcg_kernel_context structure which is allocated when memcg is
 allocated?  Kernel pages charged to a memcg would have
 page_cgroup->mem_cgroup=memcg_kernel_context rather than memcg.  This
 would allow the mem_cgroup and its css_id to be deleted when the cgroup
 is unlinked from cgroupfs while allowing for the active kernel pages to
 continue pointing to a valid memcg_kernel_context.  This would be a
 reference counted structure much like you are doing with memcg.  When a
 memcg is deleted the memcg_kernel_context would be linked into its
 surviving parent memcg.  This would avoid needing to visit each kernel
 page.
>>>
>>> You need more, you need at least the res_counters to stay around as
>>> well. And probably other fields.
>> 
>> I am not sure the res_counters would need to stay around.  Once a
>> memcg_kernel_context has been reparented, then any future kernel page
>> uncharge calls will uncharge the parent res_counter.
>
> Well, if you hold the memcg due to a reference, like in the dentry case,
> then fine. But if this is a dangling charge, as will be the case with
> the slab, then you have to uncharge it.
>
> An arbitrary number of parents might have been deleted as well, so you
> need to traverse them all until you reach a live parent to uncharge from.

I was thinking that each time a memcg is deleted move the
memcg_kernel_context from the victim memcg to its parent.  When moving,
also update the context to refer to the parent and link context to
parent:
  for_each_kernel_context(kernel_context, memcg) {
    kernel_context->memcg = memcg->parent;
    list_add(&kernel_context->list, &memcg->parent->kernel_contexts);
  }

Whenever pages referring to a memcg_kernel_context are uncharged they
will uncharge the nearest surviving parent memcg.
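
To make that concrete, a sketch of what such a structure and the
reparenting step above could look like (all names hypothetical, nothing
here is from the patchset):

	struct memcg_kernel_context {
		struct mem_cgroup *memcg;  /* nearest live ancestor to uncharge */
		atomic_t refcnt;           /* one ref per outstanding kmem page */
		struct list_head list;     /* linked into memcg->kernel_contexts */
	};

	/* on memcg deletion: hand every context over to the parent */
	static void reparent_kernel_contexts(struct mem_cgroup *memcg)
	{
		struct memcg_kernel_context *ctx, *tmp;

		list_for_each_entry_safe(ctx, tmp, &memcg->kernel_contexts, list) {
			ctx->memcg = memcg->parent;
			list_move(&ctx->list, &memcg->parent->kernel_contexts);
		}
	}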

> To do that, your counters have to be still alive.

The counters of nearest surviving parent will be alive and pointed to by
memcg_kernel_context->memcg.

>>> So my fear here is that as you add fields to that structure, you can
>>> defeat a bit the goal of reducing memory consumption. Still leaves the
>>> css space, yes. But by doing this we can introduce some subtle bugs by
>>> having a field in the wrong structure.
>>>
>>> Did you observe that to be a big problem in your systems?
>> 
>> No I have not seen this yet.  But our past solutions have reparented
>> kmem_cache's to root memcg so we have been avoiding zombie memcg.  My
>> concerns with your approach are just a suspicion because we have been
>> experimenting with accounting of even more kernel memory (e.g. vmalloc,
>> kernel stacks, page tables).  As the scope of such accounting grows the
>> chance of long lived charged pages grows and thus the chance of zombies
>> which exhaust the css_id space grows.
>
> Well, since we agree this can all be done under the hood, I'd say let's
> wait until a problem actually exists, since the solution is likely to be
> a bit convoluted...
>
> I personally believe that if we won't have a lot of task movement, most
> of the data will go away as the cgroup dies. The remainder 

Re: [PATCH v2 06/11] memcg: kmem controller infrastructure

2012-08-15 Thread Glauber Costa
On 08/15/2012 09:12 PM, Greg Thelen wrote:
> On Wed, Aug 15 2012, Glauber Costa wrote:
> 
>> On 08/15/2012 08:38 PM, Greg Thelen wrote:
>>> On Wed, Aug 15 2012, Glauber Costa wrote:
>>>
 On 08/14/2012 10:58 PM, Greg Thelen wrote:
> On Mon, Aug 13 2012, Glauber Costa wrote:
>
> + WARN_ON(mem_cgroup_is_root(memcg));
> + size = (1 << order) << PAGE_SHIFT;
> + memcg_uncharge_kmem(memcg, size);
> + mem_cgroup_put(memcg);
>>> Why do we need ref-counting here ? kmem res_counter cannot work as
>>> reference ?
>> This is of course the pair of the mem_cgroup_get() you commented on
>> earlier. If we need one, we need the other. If we don't need one, we
>> don't need the other =)
>>
>> The guarantee we're trying to give here is that the memcg structure will
>> stay around while there are dangling charges to kmem, that we decided
>> not to move (remember: moving it for the stack is simple, for the slab
>> is very complicated and ill-defined, and I believe it is better to treat
>> all kmem equally here)
>
> By keeping memcg structures hanging around until the last referring kmem
> page is uncharged do such zombie memcg each consume a css_id and thus
> put pressure on the 64k css_id space?  I imagine in pathological cases
> this would prevent creation of new cgroups until these zombies are
> dereferenced.

 Yes, but although this patch makes it more likely, it doesn't introduce
 that. If the tasks, for instance, grab a reference to the cgroup dentry
 in the filesystem (like their CWD, etc), they will also keep the cgroup
 around.
>>>
>>> Fair point.  But this doesn't seems like a feature.  It's probably not
>>> needed initially, but what do you think about creating a
>>> memcg_kernel_context structure which is allocated when memcg is
>>> allocated?  Kernel pages charged to a memcg would have
>>> page_cgroup->mem_cgroup=memcg_kernel_context rather than memcg.  This
>>> would allow the mem_cgroup and its css_id to be deleted when the cgroup
>>> is unlinked from cgroupfs while allowing for the active kernel pages to
>>> continue pointing to a valid memcg_kernel_context.  This would be a
>>> reference counted structure much like you are doing with memcg.  When a
>>> memcg is deleted the memcg_kernel_context would be linked into its
>>> surviving parent memcg.  This would avoid needing to visit each kernel
>>> page.
>>
>> You need more, you need at least the res_counters to stay around as
>> well. And probably other fields.
> 
> I am not sure the res_counters would need to stay around.  Once a
> memcg_kernel_context has been reparented, then any future kernel page
> uncharge calls will uncharge the parent res_counter.

Well, if you hold the memcg due to a reference, like in the dentry case,
then fine. But if this is a dangling charge, as will be the case with
the slab, then you have to uncharge it.

An arbitrary number of parents might have been deleted as well, so you
need to traverse them all until you reach a live parent to uncharge from.

To do that, your counters have to be still alive.

> 
>> So my fear here is that as you add fields to that structure, you can
>> defeat a bit the goal of reducing memory consumption. Still leaves the
>> css space, yes. But by doing this we can introduce some subtle bugs by
>> having a field in the wrong structure.
>>
>> Did you observe that to be a big problem in your systems?
> 
> No I have not seen this yet.  But our past solutions have reparented
> kmem_cache's to root memcg so we have been avoiding zombie memcg.  My
> concerns with your approach are just a suspicion because we have been
> experimenting with accounting of even more kernel memory (e.g. vmalloc,
> kernel stacks, page tables).  As the scope of such accounting grows the
> chance of long lived charged pages grows and thus the chance of zombies
> which exhaust the css_id space grows.

Well, since we agree this can all be done under the hood, I'd say let's
wait until a problem actually exists, since the solution is likely to be
a bit convoluted...

I personally believe that if we won't have a lot of task movement, most of
the data will go away as the cgroup dies. The remainder shouldn't be too
much to hold it in memory for a lot of time. This is of course assuming
a real use case, not an adversarial scenario, which is quite easy to
come up with: just create a task, hold a bunch of kmem, move the task
away, delete the cgroup, etc.

That said, nothing stops us to actively try to create a scenario that
would demonstrate such a problem.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 06/11] memcg: kmem controller infrastructure

2012-08-15 Thread Greg Thelen
On Wed, Aug 15 2012, Glauber Costa wrote:

> On 08/15/2012 08:38 PM, Greg Thelen wrote:
>> On Wed, Aug 15 2012, Glauber Costa wrote:
>> 
>>> On 08/14/2012 10:58 PM, Greg Thelen wrote:
 On Mon, Aug 13 2012, Glauber Costa wrote:

 +  WARN_ON(mem_cgroup_is_root(memcg));
 +  size = (1 << order) << PAGE_SHIFT;
 +  memcg_uncharge_kmem(memcg, size);
 +  mem_cgroup_put(memcg);
>> Why do we need ref-counting here ? kmem res_counter cannot work as
>> reference ?
> This is of course the pair of the mem_cgroup_get() you commented on
> earlier. If we need one, we need the other. If we don't need one, we
> don't need the other =)
>
> The guarantee we're trying to give here is that the memcg structure will
> stay around while there are dangling charges to kmem, that we decided
> not to move (remember: moving it for the stack is simple, for the slab
> is very complicated and ill-defined, and I believe it is better to treat
> all kmem equally here)

 By keeping memcg structures hanging around until the last referring kmem
 page is uncharged do such zombie memcg each consume a css_id and thus
 put pressure on the 64k css_id space?  I imagine in pathological cases
 this would prevent creation of new cgroups until these zombies are
 dereferenced.
>>>
>>> Yes, but although this patch makes it more likely, it doesn't introduce
>>> that. If the tasks, for instance, grab a reference to the cgroup dentry
>>> in the filesystem (like their CWD, etc), they will also keep the cgroup
>>> around.
>> 
>> Fair point.  But this doesn't seems like a feature.  It's probably not
>> needed initially, but what do you think about creating a
>> memcg_kernel_context structure which is allocated when memcg is
>> allocated?  Kernel pages charged to a memcg would have
>> page_cgroup->mem_cgroup=memcg_kernel_context rather than memcg.  This
>> would allow the mem_cgroup and its css_id to be deleted when the cgroup
>> is unlinked from cgroupfs while allowing for the active kernel pages to
>> continue pointing to a valid memcg_kernel_context.  This would be a
>> reference counted structure much like you are doing with memcg.  When a
>> memcg is deleted the memcg_kernel_context would be linked into its
>> surviving parent memcg.  This would avoid needing to visit each kernel
>> page.
>
> You need more, you need at least the res_counters to stay around as
> well. And probably other fields.

I am not sure the res_counters would need to stay around.  Once a
memcg_kernel_context has been reparented, then any future kernel page
uncharge calls will uncharge the parent res_counter.

> So my fear here is that as you add fields to that structure, you can
> defeat a bit the goal of reducing memory consumption. Still leaves the
> css space, yes. But by doing this we can introduce some subtle bugs by
> having a field in the wrong structure.
>
> Did you observe that to be a big problem in your systems?

No I have not seen this yet.  But our past solutions have reparented
kmem_cache's to root memcg so we have been avoiding zombie memcg.  My
concerns with your approach are just a suspicion because we have been
experimenting with accounting of even more kernel memory (e.g. vmalloc,
kernel stacks, page tables).  As the scope of such accounting grows the
chance of long lived charged pages grows and thus the chance of zombies
which exhaust the css_id space grows.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 06/11] memcg: kmem controller infrastructure

2012-08-15 Thread Glauber Costa
On 08/15/2012 08:38 PM, Greg Thelen wrote:
> On Wed, Aug 15 2012, Glauber Costa wrote:
> 
>> On 08/14/2012 10:58 PM, Greg Thelen wrote:
>>> On Mon, Aug 13 2012, Glauber Costa wrote:
>>>
>>> +   WARN_ON(mem_cgroup_is_root(memcg));
>>> +   size = (1 << order) << PAGE_SHIFT;
>>> +   memcg_uncharge_kmem(memcg, size);
>>> +   mem_cgroup_put(memcg);
> Why do we need ref-counting here ? kmem res_counter cannot work as
> reference ?
 This is of course the pair of the mem_cgroup_get() you commented on
 earlier. If we need one, we need the other. If we don't need one, we
 don't need the other =)

 The guarantee we're trying to give here is that the memcg structure will
 stay around while there are dangling charges to kmem, that we decided
 not to move (remember: moving it for the stack is simple, for the slab
 is very complicated and ill-defined, and I believe it is better to treat
 all kmem equally here)
>>>
>>> By keeping memcg structures hanging around until the last referring kmem
>>> page is uncharged do such zombie memcg each consume a css_id and thus
>>> put pressure on the 64k css_id space?  I imagine in pathological cases
>>> this would prevent creation of new cgroups until these zombies are
>>> dereferenced.
>>
>> Yes, but although this patch makes it more likely, it doesn't introduce
>> that. If the tasks, for instance, grab a reference to the cgroup dentry
>> in the filesystem (like their CWD, etc), they will also keep the cgroup
>> around.
> 
> Fair point.  But this doesn't seems like a feature.  It's probably not
> needed initially, but what do you think about creating a
> memcg_kernel_context structure which is allocated when memcg is
> allocated?  Kernel pages charged to a memcg would have
> page_cgroup->mem_cgroup=memcg_kernel_context rather than memcg.  This
> would allow the mem_cgroup and its css_id to be deleted when the cgroup
> is unlinked from cgroupfs while allowing for the active kernel pages to
> continue pointing to a valid memcg_kernel_context.  This would be a
> reference counted structure much like you are doing with memcg.  When a
> memcg is deleted the memcg_kernel_context would be linked into its
> surviving parent memcg.  This would avoid needing to visit each kernel
> page.

You need more, you need at least the res_counters to stay around as
well. And probably other fields.

So my fear here is that as you add fields to that structure, you can
defeat a bit the goal of reducing memory consumption. Still leaves the
css space, yes. But by doing this we can introduce some subtle bugs by
having a field in the wrong structure.

Did you observe that to be a big problem in your systems?

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 06/11] memcg: kmem controller infrastructure

2012-08-15 Thread Greg Thelen
On Wed, Aug 15 2012, Glauber Costa wrote:

> On 08/14/2012 10:58 PM, Greg Thelen wrote:
>> On Mon, Aug 13 2012, Glauber Costa wrote:
>> 
>> +WARN_ON(mem_cgroup_is_root(memcg));
>> +size = (1 << order) << PAGE_SHIFT;
>> +memcg_uncharge_kmem(memcg, size);
>> +mem_cgroup_put(memcg);
 Why do we need ref-counting here ? kmem res_counter cannot work as
 reference ?
>>> This is of course the pair of the mem_cgroup_get() you commented on
>>> earlier. If we need one, we need the other. If we don't need one, we
>>> don't need the other =)
>>>
>>> The guarantee we're trying to give here is that the memcg structure will
>>> stay around while there are dangling charges to kmem, that we decided
>>> not to move (remember: moving it for the stack is simple, for the slab
>>> is very complicated and ill-defined, and I believe it is better to treat
>>> all kmem equally here)
>> 
>> By keeping memcg structures hanging around until the last referring kmem
>> page is uncharged do such zombie memcg each consume a css_id and thus
>> put pressure on the 64k css_id space?  I imagine in pathological cases
>> this would prevent creation of new cgroups until these zombies are
>> dereferenced.
>
> Yes, but although this patch makes it more likely, it doesn't introduce
> that. If the tasks, for instance, grab a reference to the cgroup dentry
> in the filesystem (like their CWD, etc), they will also keep the cgroup
> around.

Fair point.  But this doesn't seems like a feature.  It's probably not
needed initially, but what do you think about creating a
memcg_kernel_context structure which is allocated when memcg is
allocated?  Kernel pages charged to a memcg would have
page_cgroup->mem_cgroup=memcg_kernel_context rather than memcg.  This
would allow the mem_cgroup and its css_id to be deleted when the cgroup
is unlinked from cgroupfs while allowing for the active kernel pages to
continue pointing to a valid memcg_kernel_context.  This would be a
reference counted structure much like you are doing with memcg.  When a
memcg is deleted the memcg_kernel_context would be linked into its
surviving parent memcg.  This would avoid needing to visit each kernel
page.

>> Is there any way to see how much kmem such zombie memcg are consuming?
>> I think we could find these with
>> for_each_mem_cgroup_tree(root_mem_cgroup).
>
> Yes, just need an interface for that. But I think it is something that
> can be addressed orthogonaly to this work, in a separate patch, not as
> some fundamental limitation.

Agreed.

>> Basically, I'm wanting to know where kernel memory has been
>> allocated.  For live memcg, an admin can cat
>> memory.kmem.usage_in_bytes.  But for zombie memcg, I'm not sure how
>> to get this info.  It looks like the root_mem_cgroup
>> memory.kmem.usage_in_bytes is not hierarchically charged.
>> 
>
> Not sure what you mean by not being hierarchically charged. It should
> be, when use_hierarchy = 1. As a matter of fact, I just tested it, and I
> do see kmem being charged all the way to the root cgroup when hierarchy
> is used. (we just can't limit it there)

You're correct, my mistake.

I think the procedure to determine the amount of zombie kmem is:
  root_mem_cgroup.kmem_usage_in_bytes -
    sum(all top level memcg memory.kmem_usage_in_bytes)
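
A hypothetical userspace sketch of that computation (assuming the v1
memory controller mounted at /sys/fs/cgroup/memory and the kmem file
introduced by this patchset):

	#include <dirent.h>
	#include <stdio.h>

	static long long read_kmem(const char *dir)
	{
		char path[512];
		long long v = 0;
		FILE *f;

		snprintf(path, sizeof(path), "%s/memory.kmem.usage_in_bytes", dir);
		f = fopen(path, "r");
		if (!f)
			return 0;
		if (fscanf(f, "%lld", &v) != 1)
			v = 0;
		fclose(f);
		return v;
	}

	int main(void)
	{
		const char *root = "/sys/fs/cgroup/memory";
		long long zombie = read_kmem(root);	/* hierarchical total */
		struct dirent *de;
		DIR *d = opendir(root);

		while (d && (de = readdir(d)) != NULL) {
			char child[512];

			if (de->d_type != DT_DIR || de->d_name[0] == '.')
				continue;
			snprintf(child, sizeof(child), "%s/%s", root, de->d_name);
			zombie -= read_kmem(child);	/* minus live top-level groups */
		}
		if (d)
			closedir(d);
		printf("zombie kmem: %lld bytes\n", zombie);
		return 0;
	}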
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 06/11] memcg: kmem controller infrastructure

2012-08-15 Thread Glauber Costa

>>
>> I see now, you seem to be right.
> 
> No I am not because it seems that I am really blind these days...
> We were doing this in mem_cgroup_do_charge for ages:
>   if (!(gfp_mask & __GFP_WAIT))
> 	return CHARGE_WOULDBLOCK;
> 
> /me goes to hide and come back with further feedback and a clean head.
> 
> Sorry about that.
> 
I am as well, since I went to look at mem_cgroup_do_charge() and missed
that.

Do you have any other concerns specific to this patch ?

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 06/11] memcg: kmem controller infrastructure

2012-08-15 Thread Michal Hocko
On Wed 15-08-12 18:01:51, Glauber Costa wrote:
> On 08/15/2012 05:09 PM, Michal Hocko wrote:
> > On Wed 15-08-12 13:42:24, Glauber Costa wrote:
> > [...]
>  +
>  +ret = 0;
>  +
>  +if (!memcg)
>  +return ret;
>  +
>  +_memcg = memcg;
>  +ret = __mem_cgroup_try_charge(NULL, gfp, delta / PAGE_SIZE,
>  +&_memcg, may_oom);
> >>>
> >>> This is really dangerous, because atomic allocations, which seem to
> >>> be possible here, could result in deadlocks because of the reclaim.
> >>
> >> Can you elaborate on how this would happen?
> > 
> > Say you have an atomic allocation and we hit the limit so we get either
> > to reclaim which can sleep or to oom which can sleep as well (depending
> > on the oom_control).
> > 
> 
> I see now, you seem to be right.

No I am not because it seems that I am really blind these days...
We were doing this in mem_cgroup_do_charge for ages:
if (!(gfp_mask & __GFP_WAIT))
	return CHARGE_WOULDBLOCK;

/me goes to hide and come back with further feedback and a clean head.

Sorry about that.
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 06/11] memcg: kmem controller infrastructure

2012-08-15 Thread Glauber Costa
On 08/15/2012 05:09 PM, Michal Hocko wrote:
> On Wed 15-08-12 13:42:24, Glauber Costa wrote:
> [...]
 +
 +  ret = 0;
 +
 +  if (!memcg)
 +  return ret;
 +
 +  _memcg = memcg;
 +  ret = __mem_cgroup_try_charge(NULL, gfp, delta / PAGE_SIZE,
 +  &_memcg, may_oom);
>>>
>>> This is really dangerous, because atomic allocations, which seem to
>>> be possible here, could result in deadlocks because of the reclaim.
>>
>> Can you elaborate on how this would happen?
> 
> Say you have an atomic allocation and we hit the limit so we get either
> to reclaim which can sleep or to oom which can sleep as well (depending
> on the oom_control).
> 

I see now, you seem to be right.

How about we change the following code in mem_cgroup_do_charge:

if (gfp_mask & __GFP_NORETRY)
return CHARGE_NOMEM;

to:

if ((gfp_mask & __GFP_NORETRY) || (gfp_mask & __GFP_ATOMIC))
return CHARGE_NOMEM;

?

Would this take care of the issue ?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 06/11] memcg: kmem controller infrastructure

2012-08-15 Thread Michal Hocko
On Wed 15-08-12 13:42:24, Glauber Costa wrote:
[...]
> >> +
> >> +  ret = 0;
> >> +
> >> +  if (!memcg)
> >> +  return ret;
> >> +
> >> +  _memcg = memcg;
> >> +  ret = __mem_cgroup_try_charge(NULL, gfp, delta / PAGE_SIZE,
> >> +  &_memcg, may_oom);
> > 
> > This is really dangerous, because atomic allocations, which seem to
> > be possible here, could result in deadlocks because of the reclaim.
> 
> Can you elaborate on how this would happen?

Say you have an atomic allocation and we hit the limit so we get either
to reclaim which can sleep or to oom which can sleep as well (depending
on the oom_control).

> > Also, as I have mentioned in the other email in this thread: why
> > should we reclaim just because of a kernel allocation when we are not
> > reclaiming any of it, because shrink_slab is ignored in the memcg
> > reclaim?
> 
> Don't get too distracted by the fact that shrink_slab is ignored. It is
> temporary, and while this being ignored now leads to suboptimal
> behavior, it will 1st, only affect its users, and 2nd, not be disastrous.

It's not just about shrink_slab; it is also about triggering the
memcg-oom, which doesn't consider kmem-accounted memory, so the wrong
tasks could be killed. It is true that the impact is packed inside the
group (hierarchy), so you are right that it won't be disastrous.
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 06/11] memcg: kmem controller infrastructure

2012-08-15 Thread Glauber Costa
On 08/15/2012 01:42 PM, Glauber Costa wrote:
>> Also, as I have mentioned in the other email in this thread: why
>> should we reclaim just because of a kernel allocation when we are not
>> reclaiming any of it, because shrink_slab is ignored in the memcg
>> reclaim?
> 
> Don't get too distracted by the fact that shrink_slab is ignored. It is
> temporary, and while this being ignored now leads to suboptimal
> behavior, it will 1st, only affect its users, and 2nd, not be disastrous.
> 
> I see this as more or less on par with the soft limit reclaim
> problem we had. It is not ideal, but it already provided functionality
> 

Okay, I sent the e-mail before finishing it... duh

What I meant in that last sentence is that the situation, while the
memcg-aware shrinkers haven't landed in the kernel, is more or less the
same (obviously not exactly) as with the soft reclaim work. It is an
evolutionary approach that provides some functionality that is not yet
perfect, but already solves lots of problems for people willing to live
with its temporary drawbacks.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 06/11] memcg: kmem controller infrastructure

2012-08-15 Thread Glauber Costa

>> + * memcg_kmem_new_page: verify if a new kmem allocation is allowed.
>> + * @gfp: the gfp allocation flags.
>> + * @handle: a pointer to the memcg this was charged against.
>> + * @order: allocation order.
>> + *
>> + * returns true if the memcg where the current task belongs can hold this
>> + * allocation.
>> + *
>> + * We return true automatically if this allocation is not to be accounted to
>> + * any memcg.
>> + */
>> +static __always_inline bool
>> +memcg_kmem_new_page(gfp_t gfp, void *handle, int order)
>> +{
>> +if (!memcg_kmem_on)
>> +return true;
>> +if (!(gfp & __GFP_KMEMCG) || (gfp & __GFP_NOFAIL))
> 
> OK, I see the point behind __GFP_NOFAIL but it would deserve a comment
> or a mention in the changelog.

documentation can't hurt!

Just added.
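
Spelled out, the calling convention the comment describes looks like
this (the allocator glue below is only a sketch, not code from the
patch):

	struct page *alloc_pages_accounted(gfp_t gfp, int order)
	{
		struct mem_cgroup *memcg;
		struct page *page;

		/* charge first; fail the allocation if over the kmem limit */
		if (!memcg_kmem_new_page(gfp, &memcg, order))
			return NULL;

		page = alloc_pages(gfp, order);

		/* always called: commits on success, reverts the charge
		 * when page == NULL */
		memcg_kmem_commit_page(page, memcg, order);
		return page;
	}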

> [...]
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 54e93de..e9824c1 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
> [...]
>> +EXPORT_SYMBOL(__memcg_kmem_new_page);
> 
> Why is this exported?
> 

It shouldn't be. Removed.

>> +
>> +void __memcg_kmem_commit_page(struct page *page, void *handle, int order)
>> +{
>> +struct page_cgroup *pc;
>> +struct mem_cgroup *memcg = handle;
>> +
>> +if (!memcg)
>> +return;
>> +
>> +WARN_ON(mem_cgroup_is_root(memcg));
>> +/* The page allocation must have failed. Revert */
>> +if (!page) {
>> +size_t size = PAGE_SIZE << order;
>> +
>> +memcg_uncharge_kmem(memcg, size);
>> +mem_cgroup_put(memcg);
>> +return;
>> +}
>> +
>> +pc = lookup_page_cgroup(page);
>> +lock_page_cgroup(pc);
>> +pc->mem_cgroup = memcg;
>> +SetPageCgroupUsed(pc);
> 
> Don't we need a write barrier before assigning memcg? Same as
> __mem_cgroup_commit_charge. This tests the Used bit always from within
> lock_page_cgroup so it should be safe but I am not 100% sure about the
> rest of the code.
> 
Well, I don't see the reason, precisely because we'll always grab it
from within the locked region. That should ensure all the necessary
serialization.

>> +#ifdef CONFIG_MEMCG_KMEM
>> +int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, s64 delta)
>> +{
>> +struct res_counter *fail_res;
>> +struct mem_cgroup *_memcg;
>> +int ret;
>> +bool may_oom;
>> +bool nofail = false;
>> +
>> +may_oom = (gfp & __GFP_WAIT) && (gfp & __GFP_FS) &&
>> +!(gfp & __GFP_NORETRY);
> 
> This deserves a comment.
> 
can't hurt!! =)
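
Something along these lines, say (the wording is just a sketch, not the
final comment):

	/*
	 * Conditions under which we can wait for the oom_killer:
	 * the allocation has to be able to sleep (__GFP_WAIT), to
	 * enter the filesystem (__GFP_FS), and it must not have been
	 * told to give up quickly (__GFP_NORETRY).
	 */
	may_oom = (gfp & __GFP_WAIT) && (gfp & __GFP_FS) &&
		  !(gfp & __GFP_NORETRY);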

>> +
>> +ret = 0;
>> +
>> +if (!memcg)
>> +return ret;
>> +
>> +_memcg = memcg;
>> +ret = __mem_cgroup_try_charge(NULL, gfp, delta / PAGE_SIZE,
>> +&_memcg, may_oom);
> 
> This is really dangerous because atomic allocation which seem to be
> possible could result in deadlocks because of the reclaim. 

Can you elaborate on how this would happen?

> Also, as I
> have mentioned in the other email in this thread: why should we reclaim
> just because of a kernel allocation when we are not reclaiming any of it
> because shrink_slab is ignored in the memcg reclaim?


Don't get too distracted by the fact that shrink_slab is ignored. It is
temporary, and while this being ignored now leads to suboptimal
behavior, it will 1st, only affect its users, and 2nd, not be disastrous.

I see this as more or less on a par with the soft limit reclaim
problem we had. It is not ideal, but it already provided functionality

>> +
>> +if (ret == -EINTR)  {
>> +nofail = true;
>> +/*
>> + * __mem_cgroup_try_charge() chose to bypass to root due to
>> + * OOM kill or fatal signal.  Since our only options are to
>> + * either fail the allocation or charge it to this cgroup, do
>> + * it as a temporary condition. But we can't fail. From a
>> + * kmem/slab perspective, the cache has already been selected,
>> + * by mem_cgroup_get_kmem_cache(), so it is too late to change
>> + * our minds
>> + */
>> +res_counter_charge_nofail(&memcg->res, delta, &fail_res);
>> +if (do_swap_account)
>> +res_counter_charge_nofail(&memcg->memsw, delta,
>> +  &fail_res);
> 
> Hmmm, this is kind of ugly but I guess unavoidable with the current
> implementation. Oh well...
> 

Oh well...



Re: [PATCH v2 06/11] memcg: kmem controller infrastructure

2012-08-15 Thread Glauber Costa
On 08/14/2012 10:58 PM, Greg Thelen wrote:
> On Mon, Aug 13 2012, Glauber Costa wrote:
> 
> + WARN_ON(mem_cgroup_is_root(memcg));
> + size = (1 << order) << PAGE_SHIFT;
> + memcg_uncharge_kmem(memcg, size);
> + mem_cgroup_put(memcg);
>>> Why do we need ref-counting here ? kmem res_counter cannot work as
>>> reference ?
>> This is of course the pair of the mem_cgroup_get() you commented on
>> earlier. If we need one, we need the other. If we don't need one, we
>> don't need the other =)
>>
>> The guarantee we're trying to give here is that the memcg structure will
>> stay around while there are dangling charges to kmem, that we decided
>> not to move (remember: moving it for the stack is simple, for the slab
>> is very complicated and ill-defined, and I believe it is better to treat
>> all kmem equally here)
> 
> By keeping memcg structures hanging around until the last referring kmem
> page is uncharged do such zombie memcg each consume a css_id and thus
> put pressure on the 64k css_id space?  I imagine in pathological cases
> this would prevent creation of new cgroups until these zombies are
> dereferenced.

Yes, but although this patch makes it more likely, it doesn't introduce
that. If the tasks, for instance, grab a reference to the cgroup dentry
in the filesystem (like their CWD, etc), they will also keep the cgroup
around.


> Is there any way to see how much kmem such zombie memcg are consuming?
> I think we could find these with
> for_each_mem_cgroup_tree(root_mem_cgroup).

Yes, just need an interface for that. But I think it is something that
can be addressed orthogonally to this work, in a separate patch, not as
some fundamental limitation.

>  Basically, I'm wanting to
> know where kernel memory has been allocated.  For live memcg, an admin
> can cat memory.kmem.usage_in_bytes.  But for zombie memcg, I'm not sure
> how to get this info.  It looks like the root_mem_cgroup
> memory.kmem.usage_in_bytes is not hierarchically charged.
> 

Not sure what you mean by not being hierarchically charged. It should
be, when use_hierarchy = 1. As a matter of fact, I just tested it, and I
do see kmem being charged all the way to the root cgroup when hierarchy
is used. (we just can't limit it there)


Re: [PATCH v2 06/11] memcg: kmem controller infrastructure

2012-08-15 Thread Michal Hocko
On Wed 15-08-12 13:42:24, Glauber Costa wrote:
[...]
> > > +
> > > + ret = 0;
> > > +
> > > + if (!memcg)
> > > + return ret;
> > > +
> > > + _memcg = memcg;
> > > + ret = __mem_cgroup_try_charge(NULL, gfp, delta / PAGE_SIZE,
> > > + &_memcg, may_oom);
> > 
> > This is really dangerous because atomic allocations, which seem to be
> > possible, could result in deadlocks because of the reclaim. 
> 
> Can you elaborate on how this would happen?

Say you have an atomic allocation and we hit the limit, so we get either
to reclaim, which can sleep, or to oom, which can sleep as well
(depending on the oom_control).

> > Also, as I have mentioned in the other email in this thread: why
> > should we reclaim just because of a kernel allocation when we are not
> > reclaiming any of it because shrink_slab is ignored in the memcg
> > reclaim?
> 
> Don't get too distracted by the fact that shrink_slab is ignored. It is
> temporary, and while this being ignored now leads to suboptimal
> behavior, it will 1st, only affect its users, and 2nd, not be disastrous.

It's not just about shrink_slab; it is also about triggering memcg-oom,
which doesn't consider kmem-accounted memory, so the wrong tasks could
be killed. It is true that the impact is contained inside the group
(hierarchy), so you are right that it won't be disastrous.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2 06/11] memcg: kmem controller infrastructure

2012-08-15 Thread Glauber Costa
On 08/15/2012 05:09 PM, Michal Hocko wrote:
> On Wed 15-08-12 13:42:24, Glauber Costa wrote:
> [...]
>>>> +
>>>> + ret = 0;
>>>> +
>>>> + if (!memcg)
>>>> + return ret;
>>>> +
>>>> + _memcg = memcg;
>>>> + ret = __mem_cgroup_try_charge(NULL, gfp, delta / PAGE_SIZE,
>>>> + &_memcg, may_oom);
>>>
>>> This is really dangerous because atomic allocations, which seem to be
>>> possible, could result in deadlocks because of the reclaim. 
>>
>> Can you elaborate on how this would happen?
> 
> Say you have an atomic allocation and we hit the limit, so we get either
> to reclaim, which can sleep, or to oom, which can sleep as well
> (depending on the oom_control).
> 

I see now, you seem to be right.

How about we change the following code in mem_cgroup_do_charge:

	if (gfp_mask & __GFP_NORETRY)
		return CHARGE_NOMEM;

to:

	if ((gfp_mask & __GFP_NORETRY) || (gfp_mask & __GFP_ATOMIC))
		return CHARGE_NOMEM;

?

Would this take care of the issue ?


Re: [PATCH v2 06/11] memcg: kmem controller infrastructure

2012-08-15 Thread Michal Hocko
On Wed 15-08-12 18:01:51, Glauber Costa wrote:
> On 08/15/2012 05:09 PM, Michal Hocko wrote:
> > On Wed 15-08-12 13:42:24, Glauber Costa wrote:
> > [...]
> > > > > +
> > > > > + ret = 0;
> > > > > +
> > > > > + if (!memcg)
> > > > > + return ret;
> > > > > +
> > > > > + _memcg = memcg;
> > > > > + ret = __mem_cgroup_try_charge(NULL, gfp, delta / PAGE_SIZE,
> > > > > + &_memcg, may_oom);
> > > > 
> > > > This is really dangerous because atomic allocations, which seem to be
> > > > possible, could result in deadlocks because of the reclaim. 
> > > 
> > > Can you elaborate on how this would happen?
> > 
> > Say you have an atomic allocation and we hit the limit, so we get either
> > to reclaim, which can sleep, or to oom, which can sleep as well
> > (depending on the oom_control).
> 
> I see now, you seem to be right.

No I am not, because it seems that I am really blind these days...
We were doing this in mem_cgroup_do_charge for ages:

	if (!(gfp_mask & __GFP_WAIT))
		return CHARGE_WOULDBLOCK;

/me goes to hide and get back with further feedback with a clean head.

Sorry about that.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2 06/11] memcg: kmem controller infrastructure

2012-08-15 Thread Glauber Costa


>> I see now, you seem to be right.
> 
> No I am not, because it seems that I am really blind these days...
> We were doing this in mem_cgroup_do_charge for ages:
> 	if (!(gfp_mask & __GFP_WAIT))
> 		return CHARGE_WOULDBLOCK;
> 
> /me goes to hide and get back with further feedback with a clean head.
> 
> Sorry about that.
> 
I am as well, since I went to look at mem_cgroup_do_charge() and missed
that.

Do you have any other concerns specific to this patch ?



Re: [PATCH v2 06/11] memcg: kmem controller infrastructure

2012-08-15 Thread Greg Thelen
On Wed, Aug 15 2012, Glauber Costa wrote:

> On 08/14/2012 10:58 PM, Greg Thelen wrote:
>> On Mon, Aug 13 2012, Glauber Costa wrote:
>>
>>>>> + WARN_ON(mem_cgroup_is_root(memcg));
>>>>> + size = (1 << order) << PAGE_SHIFT;
>>>>> + memcg_uncharge_kmem(memcg, size);
>>>>> + mem_cgroup_put(memcg);
>>>> Why do we need ref-counting here ? kmem res_counter cannot work as
>>>> reference ?
>>> This is of course the pair of the mem_cgroup_get() you commented on
>>> earlier. If we need one, we need the other. If we don't need one, we
>>> don't need the other =)
>>>
>>> The guarantee we're trying to give here is that the memcg structure will
>>> stay around while there are dangling charges to kmem, that we decided
>>> not to move (remember: moving it for the stack is simple, for the slab
>>> is very complicated and ill-defined, and I believe it is better to treat
>>> all kmem equally here)
>>
>> By keeping memcg structures hanging around until the last referring kmem
>> page is uncharged do such zombie memcg each consume a css_id and thus
>> put pressure on the 64k css_id space?  I imagine in pathological cases
>> this would prevent creation of new cgroups until these zombies are
>> dereferenced.
>
> Yes, but although this patch makes it more likely, it doesn't introduce
> that. If the tasks, for instance, grab a reference to the cgroup dentry
> in the filesystem (like their CWD, etc), they will also keep the cgroup
> around.

Fair point.  But this doesn't seem like a feature.  It's probably not
needed initially, but what do you think about creating a
memcg_kernel_context structure which is allocated when a memcg is
allocated?  Kernel pages charged to a memcg would have
page_cgroup->mem_cgroup=memcg_kernel_context rather than memcg.  This
would allow the mem_cgroup and its css_id to be deleted when the cgroup
is unlinked from cgroupfs, while allowing the active kernel pages to
continue pointing to a valid memcg_kernel_context.  This would be a
reference-counted structure, much like you are doing with memcg.  When a
memcg is deleted the memcg_kernel_context would be linked into its
surviving parent memcg.  This would avoid needing to visit each kernel
page.
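
[A minimal sketch of the structure being proposed, just to make the idea
concrete; all of the names and fields below are hypothetical and nothing
like this exists in the series under review:]

	/* Stable anchor for kmem charges; may outlive its mem_cgroup. */
	struct memcg_kernel_context {
		struct mem_cgroup *memcg; /* nearest live (grand)parent */
		atomic_t refcnt;          /* one ref per charged kernel page */
		struct list_head list;    /* on memcg->kernel_contexts */
	};

Deleting a cgroup would then only have to re-point context->memcg at the
surviving parent instead of visiting every charged page.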

>> Is there any way to see how much kmem such zombie memcg are consuming?
>> I think we could find these with
>> for_each_mem_cgroup_tree(root_mem_cgroup).
>
> Yes, just need an interface for that. But I think it is something that
> can be addressed orthogonally to this work, in a separate patch, not as
> some fundamental limitation.

Agreed.

>> Basically, I'm wanting to know where kernel memory has been
>> allocated.  For live memcg, an admin can cat
>> memory.kmem.usage_in_bytes.  But for zombie memcg, I'm not sure how
>> to get this info.  It looks like the root_mem_cgroup
>> memory.kmem.usage_in_bytes is not hierarchically charged.
>
> Not sure what you mean by not being hierarchically charged. It should
> be, when use_hierarchy = 1. As a matter of fact, I just tested it, and I
> do see kmem being charged all the way to the root cgroup when hierarchy
> is used. (we just can't limit it there)

You're correct, my mistake.

I think the procedure to determine the amount of zombie kmem is:
  root_mem_cgroup memory.kmem.usage_in_bytes -
    sum(all top-level memcg memory.kmem.usage_in_bytes)
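
[To make the procedure concrete, a small userspace sketch.  It assumes
the memory controller is mounted at /sys/fs/cgroup/memory and that only
top-level groups need to be summed; both are assumptions of this
example, not something established in the thread:]

	#include <dirent.h>
	#include <stdio.h>

	static long long kmem_usage(const char *dir)
	{
		char path[512];
		long long v = -1;
		FILE *f;

		snprintf(path, sizeof(path),
			 "%s/memory.kmem.usage_in_bytes", dir);
		f = fopen(path, "r");
		if (!f)
			return -1; /* not a cgroup directory */
		if (fscanf(f, "%lld", &v) != 1)
			v = -1;
		fclose(f);
		return v;
	}

	int main(void)
	{
		const char *root = "/sys/fs/cgroup/memory";
		long long total = kmem_usage(root), children = 0, v;
		char path[512];
		struct dirent *de;
		DIR *d = opendir(root);

		if (!d || total < 0)
			return 1;
		while ((de = readdir(d))) {
			if (de->d_name[0] == '.')
				continue;
			snprintf(path, sizeof(path), "%s/%s",
				 root, de->d_name);
			v = kmem_usage(path); /* fails for plain files */
			if (v >= 0)
				children += v;
		}
		closedir(d);
		printf("zombie kmem estimate: %lld bytes\n",
		       total - children);
		return 0;
	}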


Re: [PATCH v2 06/11] memcg: kmem controller infrastructure

2012-08-15 Thread Glauber Costa
On 08/15/2012 08:38 PM, Greg Thelen wrote:
> On Wed, Aug 15 2012, Glauber Costa wrote:
> 
>> On 08/14/2012 10:58 PM, Greg Thelen wrote:
>>> On Mon, Aug 13 2012, Glauber Costa wrote:
>>>
>>>>>> + WARN_ON(mem_cgroup_is_root(memcg));
>>>>>> + size = (1 << order) << PAGE_SHIFT;
>>>>>> + memcg_uncharge_kmem(memcg, size);
>>>>>> + mem_cgroup_put(memcg);
>>>>> Why do we need ref-counting here ? kmem res_counter cannot work as
>>>>> reference ?
>>>> This is of course the pair of the mem_cgroup_get() you commented on
>>>> earlier. If we need one, we need the other. If we don't need one, we
>>>> don't need the other =)
>>>>
>>>> The guarantee we're trying to give here is that the memcg structure will
>>>> stay around while there are dangling charges to kmem, that we decided
>>>> not to move (remember: moving it for the stack is simple, for the slab
>>>> is very complicated and ill-defined, and I believe it is better to treat
>>>> all kmem equally here)
>>>
>>> By keeping memcg structures hanging around until the last referring kmem
>>> page is uncharged do such zombie memcg each consume a css_id and thus
>>> put pressure on the 64k css_id space?  I imagine in pathological cases
>>> this would prevent creation of new cgroups until these zombies are
>>> dereferenced.
>>
>> Yes, but although this patch makes it more likely, it doesn't introduce
>> that. If the tasks, for instance, grab a reference to the cgroup dentry
>> in the filesystem (like their CWD, etc), they will also keep the cgroup
>> around.
> 
> Fair point.  But this doesn't seem like a feature.  It's probably not
> needed initially, but what do you think about creating a
> memcg_kernel_context structure which is allocated when a memcg is
> allocated?  Kernel pages charged to a memcg would have
> page_cgroup->mem_cgroup=memcg_kernel_context rather than memcg.  This
> would allow the mem_cgroup and its css_id to be deleted when the cgroup
> is unlinked from cgroupfs, while allowing the active kernel pages to
> continue pointing to a valid memcg_kernel_context.  This would be a
> reference-counted structure, much like you are doing with memcg.  When a
> memcg is deleted the memcg_kernel_context would be linked into its
> surviving parent memcg.  This would avoid needing to visit each kernel
> page.

You need more, you need at least the res_counters to stay around as well. And
probably other fields.

So my fear here is that as you add fields to that structure, you can
defeat a bit the goal of reducing memory consumption. Still leaves the
css space, yes. But by doing this we can introduce some subtle bugs by
having a field in the wrong structure.

Did you observe that to be a big problem in your systems?



Re: [PATCH v2 06/11] memcg: kmem controller infrastructure

2012-08-15 Thread Greg Thelen
On Wed, Aug 15 2012, Glauber Costa wrote:

> On 08/15/2012 08:38 PM, Greg Thelen wrote:
>> On Wed, Aug 15 2012, Glauber Costa wrote:
>>
>>> On 08/14/2012 10:58 PM, Greg Thelen wrote:
>>>> On Mon, Aug 13 2012, Glauber Costa wrote:
>>>>
>>>>>>> + WARN_ON(mem_cgroup_is_root(memcg));
>>>>>>> + size = (1 << order) << PAGE_SHIFT;
>>>>>>> + memcg_uncharge_kmem(memcg, size);
>>>>>>> + mem_cgroup_put(memcg);
>>>>>> Why do we need ref-counting here ? kmem res_counter cannot work as
>>>>>> reference ?
>>>>> This is of course the pair of the mem_cgroup_get() you commented on
>>>>> earlier. If we need one, we need the other. If we don't need one, we
>>>>> don't need the other =)
>>>>>
>>>>> The guarantee we're trying to give here is that the memcg structure will
>>>>> stay around while there are dangling charges to kmem, that we decided
>>>>> not to move (remember: moving it for the stack is simple, for the slab
>>>>> is very complicated and ill-defined, and I believe it is better to treat
>>>>> all kmem equally here)
>>>>
>>>> By keeping memcg structures hanging around until the last referring kmem
>>>> page is uncharged do such zombie memcg each consume a css_id and thus
>>>> put pressure on the 64k css_id space?  I imagine in pathological cases
>>>> this would prevent creation of new cgroups until these zombies are
>>>> dereferenced.
>>>
>>> Yes, but although this patch makes it more likely, it doesn't introduce
>>> that. If the tasks, for instance, grab a reference to the cgroup dentry
>>> in the filesystem (like their CWD, etc), they will also keep the cgroup
>>> around.
>>
>> Fair point.  But this doesn't seem like a feature.  It's probably not
>> needed initially, but what do you think about creating a
>> memcg_kernel_context structure which is allocated when a memcg is
>> allocated?  Kernel pages charged to a memcg would have
>> page_cgroup->mem_cgroup=memcg_kernel_context rather than memcg.  This
>> would allow the mem_cgroup and its css_id to be deleted when the cgroup
>> is unlinked from cgroupfs, while allowing the active kernel pages to
>> continue pointing to a valid memcg_kernel_context.  This would be a
>> reference-counted structure, much like you are doing with memcg.  When a
>> memcg is deleted the memcg_kernel_context would be linked into its
>> surviving parent memcg.  This would avoid needing to visit each kernel
>> page.
>
> You need more, you need at least the res_counters to stay around as well. And
> probably other fields.

I am not sure the res_counters would need to stay around.  Once a
memcg_kernel_context has been reparented, then any future kernel page
uncharge calls will uncharge the parent res_counter.

> So my fear here is that as you add fields to that structure, you can
> defeat a bit the goal of reducing memory consumption. Still leaves the
> css space, yes. But by doing this we can introduce some subtle bugs by
> having a field in the wrong structure.
>
> Did you observe that to be a big problem in your systems?

No I have not seen this yet.  But our past solutions have reparented
kmem_cache's to the root memcg so we have been avoiding zombie memcg.  My
concerns with your approach are just a suspicion because we have been
experimenting with accounting of even more kernel memory (e.g. vmalloc,
kernel stacks, page tables).  As the scope of such accounting grows the
chance of long-lived charged pages grows and thus the chance of zombies
which exhaust the css_id space grows.


Re: [PATCH v2 06/11] memcg: kmem controller infrastructure

2012-08-15 Thread Glauber Costa
On 08/15/2012 09:12 PM, Greg Thelen wrote:
> On Wed, Aug 15 2012, Glauber Costa wrote:
> 
>> On 08/15/2012 08:38 PM, Greg Thelen wrote:
>>> On Wed, Aug 15 2012, Glauber Costa wrote:
>>>
>>>> On 08/14/2012 10:58 PM, Greg Thelen wrote:
>>>>> On Mon, Aug 13 2012, Glauber Costa wrote:
>>>>>
>>>>>>>> + WARN_ON(mem_cgroup_is_root(memcg));
>>>>>>>> + size = (1 << order) << PAGE_SHIFT;
>>>>>>>> + memcg_uncharge_kmem(memcg, size);
>>>>>>>> + mem_cgroup_put(memcg);
>>>>>>> Why do we need ref-counting here ? kmem res_counter cannot work as
>>>>>>> reference ?
>>>>>> This is of course the pair of the mem_cgroup_get() you commented on
>>>>>> earlier. If we need one, we need the other. If we don't need one, we
>>>>>> don't need the other =)
>>>>>>
>>>>>> The guarantee we're trying to give here is that the memcg structure will
>>>>>> stay around while there are dangling charges to kmem, that we decided
>>>>>> not to move (remember: moving it for the stack is simple, for the slab
>>>>>> is very complicated and ill-defined, and I believe it is better to treat
>>>>>> all kmem equally here)
>>>>>
>>>>> By keeping memcg structures hanging around until the last referring kmem
>>>>> page is uncharged do such zombie memcg each consume a css_id and thus
>>>>> put pressure on the 64k css_id space?  I imagine in pathological cases
>>>>> this would prevent creation of new cgroups until these zombies are
>>>>> dereferenced.
>>>>
>>>> Yes, but although this patch makes it more likely, it doesn't introduce
>>>> that. If the tasks, for instance, grab a reference to the cgroup dentry
>>>> in the filesystem (like their CWD, etc), they will also keep the cgroup
>>>> around.
>>>
>>> Fair point.  But this doesn't seem like a feature.  It's probably not
>>> needed initially, but what do you think about creating a
>>> memcg_kernel_context structure which is allocated when a memcg is
>>> allocated?  Kernel pages charged to a memcg would have
>>> page_cgroup->mem_cgroup=memcg_kernel_context rather than memcg.  This
>>> would allow the mem_cgroup and its css_id to be deleted when the cgroup
>>> is unlinked from cgroupfs, while allowing the active kernel pages to
>>> continue pointing to a valid memcg_kernel_context.  This would be a
>>> reference-counted structure, much like you are doing with memcg.  When a
>>> memcg is deleted the memcg_kernel_context would be linked into its
>>> surviving parent memcg.  This would avoid needing to visit each kernel
>>> page.
>>
>> You need more, you need at least the res_counters to stay around as well. And
>> probably other fields.
> 
> I am not sure the res_counters would need to stay around.  Once a
> memcg_kernel_context has been reparented, then any future kernel page
> uncharge calls will uncharge the parent res_counter.

Well, if you hold the memcg due to a reference, like in the dentry case,
then fine. But if this is a dangling charge, as will be the case with
the slab, then you have to uncharge it.

An arbitrary number of parents might have been deleted as well, so you
need to traverse them all until you reach a live parent to uncharge from.

To do that, your counters have to be still alive.
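
[A sketch of the walk this implies, purely to illustrate the point;
memcg_is_dead() is a hypothetical helper, not an existing function:]

	static struct mem_cgroup *nearest_live_parent(struct mem_cgroup *memcg)
	{
		while (memcg && memcg_is_dead(memcg))
			memcg = parent_mem_cgroup(memcg);
		return memcg; /* NULL would mean falling back to root */
	}

Every memcg visited on that walk must still have valid res_counters,
which is exactly the requirement being pointed out here.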

 
>> So my fear here is that as you add fields to that structure, you can
>> defeat a bit the goal of reducing memory consumption. Still leaves the
>> css space, yes. But by doing this we can introduce some subtle bugs by
>> having a field in the wrong structure.
>>
>> Did you observe that to be a big problem in your systems?
> 
> No I have not seen this yet.  But our past solutions have reparented
> kmem_cache's to the root memcg so we have been avoiding zombie memcg.  My
> concerns with your approach are just a suspicion because we have been
> experimenting with accounting of even more kernel memory (e.g. vmalloc,
> kernel stacks, page tables).  As the scope of such accounting grows the
> chance of long-lived charged pages grows and thus the chance of zombies
> which exhaust the css_id space grows.

Well, since we agree this can all be done under the hood, I'd say let's
wait until a problem actually exists, since the solution is likely to be
a bit convoluted...

I personally believe that if we don't have a lot of task movement, most of
the data will go away as the cgroup dies. The remainder shouldn't be too
much to hold in memory for a long time. This is of course assuming
a real use case, not an adversarial scenario, which is quite easy to
come up with: just create a task, hold a bunch of kmem, move the task
away, delete the cgroup, etc.

That said, nothing stops us from actively trying to create a scenario that
would demonstrate such a problem.




Re: [PATCH v2 06/11] memcg: kmem controller infrastructure

2012-08-15 Thread Greg Thelen
On Wed, Aug 15 2012, Glauber Costa wrote:

> On 08/15/2012 09:12 PM, Greg Thelen wrote:
>> On Wed, Aug 15 2012, Glauber Costa wrote:
>>
>>> On 08/15/2012 08:38 PM, Greg Thelen wrote:
>>>> On Wed, Aug 15 2012, Glauber Costa wrote:
>>>>
>>>>> On 08/14/2012 10:58 PM, Greg Thelen wrote:
>>>>>> On Mon, Aug 13 2012, Glauber Costa wrote:
>>>>>>
>>>>>>>>> + WARN_ON(mem_cgroup_is_root(memcg));
>>>>>>>>> + size = (1 << order) << PAGE_SHIFT;
>>>>>>>>> + memcg_uncharge_kmem(memcg, size);
>>>>>>>>> + mem_cgroup_put(memcg);
>>>>>>>> Why do we need ref-counting here ? kmem res_counter cannot work as
>>>>>>>> reference ?
>>>>>>> This is of course the pair of the mem_cgroup_get() you commented on
>>>>>>> earlier. If we need one, we need the other. If we don't need one, we
>>>>>>> don't need the other =)
>>>>>>>
>>>>>>> The guarantee we're trying to give here is that the memcg structure will
>>>>>>> stay around while there are dangling charges to kmem, that we decided
>>>>>>> not to move (remember: moving it for the stack is simple, for the slab
>>>>>>> is very complicated and ill-defined, and I believe it is better to treat
>>>>>>> all kmem equally here)
>>>>>>
>>>>>> By keeping memcg structures hanging around until the last referring kmem
>>>>>> page is uncharged do such zombie memcg each consume a css_id and thus
>>>>>> put pressure on the 64k css_id space?  I imagine in pathological cases
>>>>>> this would prevent creation of new cgroups until these zombies are
>>>>>> dereferenced.
>>>>>
>>>>> Yes, but although this patch makes it more likely, it doesn't introduce
>>>>> that. If the tasks, for instance, grab a reference to the cgroup dentry
>>>>> in the filesystem (like their CWD, etc), they will also keep the cgroup
>>>>> around.
>>>>
>>>> Fair point.  But this doesn't seem like a feature.  It's probably not
>>>> needed initially, but what do you think about creating a
>>>> memcg_kernel_context structure which is allocated when a memcg is
>>>> allocated?  Kernel pages charged to a memcg would have
>>>> page_cgroup->mem_cgroup=memcg_kernel_context rather than memcg.  This
>>>> would allow the mem_cgroup and its css_id to be deleted when the cgroup
>>>> is unlinked from cgroupfs, while allowing the active kernel pages to
>>>> continue pointing to a valid memcg_kernel_context.  This would be a
>>>> reference-counted structure, much like you are doing with memcg.  When a
>>>> memcg is deleted the memcg_kernel_context would be linked into its
>>>> surviving parent memcg.  This would avoid needing to visit each kernel
>>>> page.
>>>
>>> You need more, you need at least the res_counters to stay around as well. And
>>> probably other fields.
>>
>> I am not sure the res_counters would need to stay around.  Once a
>> memcg_kernel_context has been reparented, then any future kernel page
>> uncharge calls will uncharge the parent res_counter.
>
> Well, if you hold the memcg due to a reference, like in the dentry case,
> then fine. But if this is a dangling charge, as will be the case with
> the slab, then you have to uncharge it.
>
> An arbitrary number of parents might have been deleted as well, so you
> need to traverse them all until you reach a live parent to uncharge from.

I was thinking that each time a memcg is deleted we move the
memcg_kernel_context from the victim memcg to its parent.  When moving,
also update the context to refer to the parent and link the context to
the parent:

  for_each_kernel_context(kernel_context, memcg) {
          kernel_context->memcg = memcg->parent;
          list_add(&kernel_context->list, &memcg->parent->kernel_contexts);
  }

Whenever pages referring to a memcg_kernel_context are uncharged they
will uncharge the nearest surviving parent memcg.

> To do that, your counters have to be still alive.

The counters of the nearest surviving parent will be alive and pointed
to by memcg_kernel_context->memcg.

>>> So my fear here is that as you add fields to that structure, you can
>>> defeat a bit the goal of reducing memory consumption. Still leaves the
>>> css space, yes. But by doing this we can introduce some subtle bugs by
>>> having a field in the wrong structure.
>>>
>>> Did you observe that to be a big problem in your systems?
>>
>> No I have not seen this yet.  But our past solutions have reparented
>> kmem_cache's to the root memcg so we have been avoiding zombie memcg.  My
>> concerns with your approach are just a suspicion because we have been
>> experimenting with accounting of even more kernel memory (e.g. vmalloc,
>> kernel stacks, page tables).  As the scope of such accounting grows the
>> chance of long-lived charged pages grows and thus the chance of zombies
>> which exhaust the css_id space grows.
>
> Well, since we agree this can all be done under the hood, I'd say let's
> wait until a problem actually exists, since the solution is likely to be
> a bit convoluted...
>
> I personally believe that if we don't have a lot of task movement, most of
> the data will go away as the cgroup dies. The remainder shouldn't be too
> much to hold in memory for a long time. This is of course assuming
> a real use case, not an adversarial scenario, which is quite easy to
> come up with: just create a task, hold a bunch of kmem, move the task
> away, delete the cgroup, etc.
>
> That said, nothing stops us from actively trying to create a scenario that
> would demonstrate such a problem.

Re: [PATCH v2 06/11] memcg: kmem controller infrastructure

2012-08-14 Thread Greg Thelen
On Mon, Aug 13 2012, Glauber Costa wrote:

>>> > + WARN_ON(mem_cgroup_is_root(memcg));
>>> > + size = (1 << order) << PAGE_SHIFT;
>>> > + memcg_uncharge_kmem(memcg, size);
>>> > + mem_cgroup_put(memcg);
>> Why do we need ref-counting here ? kmem res_counter cannot work as
>> reference ?
> This is of course the pair of the mem_cgroup_get() you commented on
> earlier. If we need one, we need the other. If we don't need one, we
> don't need the other =)
>
> The guarantee we're trying to give here is that the memcg structure will
> stay around while there are dangling charges to kmem, that we decided
> not to move (remember: moving it for the stack is simple, for the slab
> is very complicated and ill-defined, and I believe it is better to treat
> all kmem equally here)

By keeping memcg structures hanging around until the last referring kmem
page is uncharged do such zombie memcg each consume a css_id and thus
put pressure on the 64k css_id space?  I imagine in pathological cases
this would prevent creation of new cgroups until these zombies are
dereferenced.

Is there any way to see how much kmem such zombie memcg are consuming?
I think we could find these with
for_each_mem_cgroup_tree(root_mem_cgroup).  Basically, I'm wanting to
know where kernel memory has been allocated.  For live memcg, an admin
can cat memory.kmem.usage_in_bytes.  But for zombie memcg, I'm not sure
how to get this info.  It looks like the root_mem_cgroup
memory.kmem.usage_in_bytes is not hierarchically charged.


Re: [PATCH v2 06/11] memcg: kmem controller infrastructure

2012-08-14 Thread Michal Hocko
On Thu 09-08-12 17:01:14, Glauber Costa wrote:
> This patch introduces infrastructure for tracking kernel memory pages to
> a given memcg. This will happen whenever the caller includes the flag
> __GFP_KMEMCG flag, and the task belong to a memcg other than the root.
> 
> In memcontrol.h those functions are wrapped in inline accessors.  The
> idea is to later on, patch those with static branches, so we don't incur
> any overhead when no mem cgroups with limited kmem are being used.
> 
> [ v2: improved comments and standardized function names ]
> 
> Signed-off-by: Glauber Costa 
> CC: Christoph Lameter 
> CC: Pekka Enberg 
> CC: Michal Hocko 
> CC: Kamezawa Hiroyuki 
> CC: Johannes Weiner 
> ---
>  include/linux/memcontrol.h |  79 +++
>  mm/memcontrol.c| 185 
> +
>  2 files changed, 264 insertions(+)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 8d9489f..75b247e 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
[...]
> +/**
> + * memcg_kmem_new_page: verify if a new kmem allocation is allowed.
> + * @gfp: the gfp allocation flags.
> + * @handle: a pointer to the memcg this was charged against.
> + * @order: allocation order.
> + *
> + * returns true if the memcg where the current task belongs can hold this
> + * allocation.
> + *
> + * We return true automatically if this allocation is not to be accounted to
> + * any memcg.
> + */
> +static __always_inline bool
> +memcg_kmem_new_page(gfp_t gfp, void *handle, int order)
> +{
> + if (!memcg_kmem_on)
> + return true;
> + if (!(gfp & __GFP_KMEMCG) || (gfp & __GFP_NOFAIL))

OK, I see the point behind __GFP_NOFAIL but it would deserve a comment
or a mention in the changelog.

[...]
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 54e93de..e9824c1 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
[...]
> +EXPORT_SYMBOL(__memcg_kmem_new_page);

Why is this exported?

> +
> +void __memcg_kmem_commit_page(struct page *page, void *handle, int order)
> +{
> + struct page_cgroup *pc;
> + struct mem_cgroup *memcg = handle;
> +
> + if (!memcg)
> + return;
> +
> + WARN_ON(mem_cgroup_is_root(memcg));
> + /* The page allocation must have failed. Revert */
> + if (!page) {
> + size_t size = PAGE_SIZE << order;
> +
> + memcg_uncharge_kmem(memcg, size);
> + mem_cgroup_put(memcg);
> + return;
> + }
> +
> + pc = lookup_page_cgroup(page);
> + lock_page_cgroup(pc);
> + pc->mem_cgroup = memcg;
> + SetPageCgroupUsed(pc);

Don't we need a write barrier before assigning memcg? Same as
__mem_cgroup_commit_charge. This tests the Used bit always from within
lock_page_cgroup so it should be safe but I am not 100% sure about the
rest of the code.

[...]
> +EXPORT_SYMBOL(__memcg_kmem_free_page);

Why is the symbol exported?

>  #endif /* CONFIG_MEMCG_KMEM */
>  
>  #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
> @@ -5759,3 +5878,69 @@ static int __init enable_swap_account(char *s)
>  __setup("swapaccount=", enable_swap_account);
>  
>  #endif
> +
> +#ifdef CONFIG_MEMCG_KMEM
> +int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, s64 delta)
> +{
> + struct res_counter *fail_res;
> + struct mem_cgroup *_memcg;
> + int ret;
> + bool may_oom;
> + bool nofail = false;
> +
> + may_oom = (gfp & __GFP_WAIT) && (gfp & __GFP_FS) &&
> + !(gfp & __GFP_NORETRY);

This deserves a comment.

> +
> + ret = 0;
> +
> + if (!memcg)
> + return ret;
> +
> + _memcg = memcg;
> + ret = __mem_cgroup_try_charge(NULL, gfp, delta / PAGE_SIZE,
> + &_memcg, may_oom);

This is really dangerous because atomic allocations, which seem to be
possible, could result in deadlocks because of the reclaim. Also, as I
have mentioned in the other email in this thread: why should we reclaim
just because of a kernel allocation when we are not reclaiming any of it
because shrink_slab is ignored in the memcg reclaim?

> +
> + if (ret == -EINTR)  {
> + nofail = true;
> + /*
> +  * __mem_cgroup_try_charge() chose to bypass to root due to
> +  * OOM kill or fatal signal.  Since our only options are to
> +  * either fail the allocation or charge it to this cgroup, do
> +  * it as a temporary condition. But we can't fail. From a
> +  * kmem/slab perspective, the cache has already been selected,
> +  * by mem_cgroup_get_kmem_cache(), so it is too late to change
> +  * our minds
> +  */
> + res_counter_charge_nofail(&memcg->res, delta, &fail_res);
> + if (do_swap_account)
> + res_counter_charge_nofail(&memcg->memsw, delta,
> +   &fail_res);

Hmmm, this is kind of ugly but I guess unavoidable with the current
implementation. Oh well...

Re: [PATCH v2 06/11] memcg: kmem controller infrastructure

2012-08-14 Thread Glauber Costa
On 08/10/2012 09:27 PM, Kamezawa Hiroyuki wrote:
>> +bool __memcg_kmem_new_page(gfp_t gfp, void *_handle, int order)
>> > +{
>> > +  struct mem_cgroup *memcg;
>> > +  struct mem_cgroup **handle = (struct mem_cgroup **)_handle;
>> > +  bool ret = true;
>> > +  size_t size;
>> > +  struct task_struct *p;
>> > +
>> > +  *handle = NULL;
>> > +  rcu_read_lock();
>> > +  p = rcu_dereference(current->mm->owner);
>> > +  memcg = mem_cgroup_from_task(p);
>> > +  if (!memcg_kmem_enabled(memcg))
>> > +  goto out;
>> > +
>> > +  mem_cgroup_get(memcg);
>> > +
> This mem_cgroup_get() will be a potential performance problem.
> Don't you have a good idea to avoid accessing the atomic counter here ?
> I think some kind of percpu counter or a feature to disable "move task"
> will be a help.
> 
> 

I have just sent out a proposal to deal with this. I tried the trick of
marking only the first charge and last uncharge, and it works quite
alright at the cost of a bit test on most calls to memcg_kmem_charge.

Please let me know what you think.
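
[A toy, userspace illustration of the "mark only the first charge and
last uncharge" idea, with C11 atomics standing in for the real machinery.
All names are made up, and the genuinely hard part -- ordering a final
put against a racing first get -- is deliberately not solved here:]

	#include <stdatomic.h>
	#include <stddef.h>

	struct memcg_stub {
		atomic_size_t kmem; /* accounted kernel bytes */
		atomic_int    refs; /* stands in for mem_cgroup_get/put */
	};

	static void kmem_charge(struct memcg_stub *m, size_t size)
	{
		/* Only the 0 -> nonzero transition takes a reference. */
		if (atomic_fetch_add(&m->kmem, size) == 0)
			atomic_fetch_add(&m->refs, 1);
	}

	static void kmem_uncharge(struct memcg_stub *m, size_t size)
	{
		/* Only the transition back to zero drops it. */
		if (atomic_fetch_sub(&m->kmem, size) == size)
			atomic_fetch_sub(&m->refs, 1);
	}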



Re: [PATCH v2 06/11] memcg: kmem controller infrastructure

2012-08-13 Thread Greg Thelen
On Mon, Aug 13 2012, Glauber Costa wrote:

>> 
>> Here's the dmesg splat.
>> 
>
> Do you always get this report in the same way?
> I managed to get a softirq inconsistency like yours, but the complaint
> goes for a different lock.

Yes, I repeatedly get the same dmesg splat below.

Once I apply your 'execute the whole memcg freeing in rcu callback' patch,
the warnings are not printed.  I'll take a closer look at the patch
soon.

>> [  335.550398] =
>> [  335.554739] [ INFO: inconsistent lock state ]
>> [  335.559091] 3.5.0-dbg-DEV #3 Tainted: GW
>> [  335.563946] -
>> [  335.568290] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
>> [  335.574286] swapper/10/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
>> [  335.579508]  (&(>lock)->rlock){+.?...}, at: [] 
>> __mem_cgroup_free+0x8d/0x1b0
>> [  335.588525] {SOFTIRQ-ON-W} state was registered at:
>> [  335.593389]   [] __lock_acquire+0x623/0x1a50
>> [  335.599200]   [] lock_acquire+0x95/0x150
>> [  335.604670]   [] _raw_spin_lock+0x41/0x50
>> [  335.610232]   [] __mem_cgroup_free+0x8d/0x1b0
>> [  335.616135]   [] mem_cgroup_put+0x45/0x50
>> [  335.621696]   [] mem_cgroup_destroy+0x22/0x30
>> [  335.627592]   [] cgroup_diput+0xbf/0x160
>> [  335.633062]   [] d_delete+0x12f/0x1a0
>> [  335.638276]   [] vfs_rmdir+0x11e/0x140
>> [  335.643565]   [] do_rmdir+0x113/0x130
>> [  335.648773]   [] sys_rmdir+0x16/0x20
>> [  335.653900]   [] cstar_dispatch+0x7/0x1f
>> [  335.659370] irq event stamp: 399732
>> [  335.662846] hardirqs last  enabled at (399732): [] 
>> res_counter_uncharge_until+0x68/0xa0
>> [  335.672383] hardirqs last disabled at (399731): [] 
>> res_counter_uncharge_until+0x28/0xa0
>> [  335.681916] softirqs last  enabled at (399710): [] 
>> _local_bh_enable+0x13/0x20
>> [  335.690590] softirqs last disabled at (399711): [] 
>> call_softirq+0x1c/0x30
>> [  335.698914]
>> [  335.698914] other info that might help us debug this:
>> [  335.705415]  Possible unsafe locking scenario:
>> [  335.705415]
>> [  335.711317]CPU0
>> [  335.713757]
>> [  335.716198]   lock(&(>lock)->rlock);
>> [  335.720282]   
>> [  335.722896] lock(&(>lock)->rlock);
>> [  335.727153]
>> [  335.727153]  *** DEADLOCK ***
>> [  335.727153]
>> [  335.733055] no locks held by swapper/10/0.
>> [  335.737141]
>> [  335.737141] stack backtrace:
>> [  335.741483] Pid: 0, comm: swapper/10 Tainted: GW3.5.0-dbg-DEV 
>> #3
>> [  335.748510] Call Trace:
>> [  335.750952][] print_usage_bug+0x1fc/0x20d
>> [  335.757286]  [] ? save_stack_trace+0x2f/0x50
>> [  335.763098]  [] mark_lock+0x29d/0x300
>> [  335.768309]  [] ? 
>> print_irq_inversion_bug.part.36+0x1f0/0x1f0
>> [  335.775599]  [] __lock_acquire+0x5ac/0x1a50
>> [  335.781323]  [] ? __lock_acquire+0x2e4/0x1a50
>> [  335.787224]  [] ? __mem_cgroup_free+0x8d/0x1b0
>> [  335.793212]  [] lock_acquire+0x95/0x150
>> [  335.798594]  [] ? __mem_cgroup_free+0x8d/0x1b0
>> [  335.804581]  [] ? res_counter_uncharge_until+0x3d/0xa0
>> [  335.811263]  [] _raw_spin_lock+0x41/0x50
>> [  335.816731]  [] ? __mem_cgroup_free+0x8d/0x1b0
>> [  335.822724]  [] __mem_cgroup_free+0x8d/0x1b0
>> [  335.828538]  [] mem_cgroup_put+0x45/0x50
>> [  335.834002]  [] __memcg_kmem_free_page+0xa6/0x110
>> [  335.840256]  [] free_accounted_pages+0x99/0xa0
>> [  335.846243]  [] free_task+0x3f/0x70
>> [  335.851278]  [] __put_task_struct+0xbc/0x130
>> [  335.857094]  [] delayed_put_task_struct+0x54/0xd0
>> [  335.863338]  [] __rcu_process_callbacks+0x1e4/0x490
>> [  335.869757]  [] rcu_process_callbacks+0x2f/0x80
>> [  335.875835]  [] __do_softirq+0xc5/0x270
>> [  335.881218]  [] ? clockevents_program_event+0x74/0x100
>> [  335.887895]  [] ? tick_program_event+0x24/0x30
>> [  335.893882]  [] call_softirq+0x1c/0x30
>> [  335.899179]  [] do_softirq+0x8d/0xc0
>> [  335.904301]  [] irq_exit+0xae/0xe0
>> [  335.909251]  [] smp_apic_timer_interrupt+0x6e/0x99
>> [  335.915591]  [] apic_timer_interrupt+0x6c/0x80
>> [  335.921583][] ? default_idle+0x67/0x270
>> [  335.927741]  [] ? default_idle+0x65/0x270
>> 


Re: [PATCH v2 06/11] memcg: kmem controller infrastructure

2012-08-13 Thread Glauber Costa
> 
> Here's the dmesg splat.
> 

Do you always get this report in the same way?
I managed to get a softirq inconsistency like yours, but the complaint
goes for a different lock.


> [  335.550398] =================================
> [  335.554739] [ INFO: inconsistent lock state ]
> [  335.559091] 3.5.0-dbg-DEV #3 Tainted: G        W
> [  335.563946] ---------------------------------
> [  335.568290] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
> [  335.574286] swapper/10/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
> [  335.579508]  (&(&rtpz->lock)->rlock){+.?...}, at: [<ffffffff8118216d>] __mem_cgroup_free+0x8d/0x1b0
> [  335.588525] {SOFTIRQ-ON-W} state was registered at:
> [  335.593389]   [<ffffffff810cb073>] __lock_acquire+0x623/0x1a50
> [  335.599200]   [<ffffffff810cca55>] lock_acquire+0x95/0x150
> [  335.604670]   [<ffffffff81582531>] _raw_spin_lock+0x41/0x50
> [  335.610232]   [<ffffffff8118216d>] __mem_cgroup_free+0x8d/0x1b0
> [  335.616135]   [<ffffffff811822d5>] mem_cgroup_put+0x45/0x50
> [  335.621696]   [<ffffffff81182302>] mem_cgroup_destroy+0x22/0x30
> [  335.627592]   [<ffffffff810e093f>] cgroup_diput+0xbf/0x160
> [  335.633062]   [<ffffffff811a07ef>] d_delete+0x12f/0x1a0
> [  335.638276]   [<ffffffff8119671e>] vfs_rmdir+0x11e/0x140
> [  335.643565]   [<ffffffff81199173>] do_rmdir+0x113/0x130
> [  335.648773]   [<ffffffff8119a5e6>] sys_rmdir+0x16/0x20
> [  335.653900]   [<ffffffff8158c74f>] cstar_dispatch+0x7/0x1f
> [  335.659370] irq event stamp: 399732
> [  335.662846] hardirqs last  enabled at (399732): [<ffffffff810e8e08>] res_counter_uncharge_until+0x68/0xa0
> [  335.672383] hardirqs last disabled at (399731): [<ffffffff810e8dc8>] res_counter_uncharge_until+0x28/0xa0
> [  335.681916] softirqs last  enabled at (399710): [<ffffffff81085dd3>] _local_bh_enable+0x13/0x20
> [  335.690590] softirqs last disabled at (399711): [<ffffffff8158c48c>] call_softirq+0x1c/0x30
> [  335.698914]
> [  335.698914] other info that might help us debug this:
> [  335.705415]  Possible unsafe locking scenario:
> [  335.705415]
> [  335.711317]        CPU0
> [  335.713757]        ----
> [  335.716198]   lock(&(&rtpz->lock)->rlock);
> [  335.720282]   <Interrupt>
> [  335.722896]     lock(&(&rtpz->lock)->rlock);
> [  335.727153]
> [  335.727153]  *** DEADLOCK ***
> [  335.727153]
> [  335.733055] no locks held by swapper/10/0.
> [  335.737141]
> [  335.737141] stack backtrace:
> [  335.741483] Pid: 0, comm: swapper/10 Tainted: G        W    3.5.0-dbg-DEV #3
> [  335.748510] Call Trace:
> [  335.750952]  <IRQ>  [<ffffffff81579a27>] print_usage_bug+0x1fc/0x20d
> [  335.757286]  [<ffffffff81058a9f>] ? save_stack_trace+0x2f/0x50
> [  335.763098]  [<ffffffff810ca9ed>] mark_lock+0x29d/0x300
> [  335.768309]  [<ffffffff810c9e10>] ? print_irq_inversion_bug.part.36+0x1f0/0x1f0
> [  335.775599]  [<ffffffff810caffc>] __lock_acquire+0x5ac/0x1a50
> [  335.781323]  [<ffffffff810cad34>] ? __lock_acquire+0x2e4/0x1a50
> [  335.787224]  [<ffffffff8118216d>] ? __mem_cgroup_free+0x8d/0x1b0
> [  335.793212]  [<ffffffff810cca55>] lock_acquire+0x95/0x150
> [  335.798594]  [<ffffffff8118216d>] ? __mem_cgroup_free+0x8d/0x1b0
> [  335.804581]  [<ffffffff810e8ddd>] ? res_counter_uncharge_until+0x3d/0xa0
> [  335.811263]  [<ffffffff81582531>] _raw_spin_lock+0x41/0x50
> [  335.816731]  [<ffffffff8118216d>] ? __mem_cgroup_free+0x8d/0x1b0
> [  335.822724]  [<ffffffff8118216d>] __mem_cgroup_free+0x8d/0x1b0
> [  335.828538]  [<ffffffff811822d5>] mem_cgroup_put+0x45/0x50
> [  335.834002]  [<ffffffff811828a6>] __memcg_kmem_free_page+0xa6/0x110
> [  335.840256]  [<ffffffff81138109>] free_accounted_pages+0x99/0xa0
> [  335.846243]  [<ffffffff8107b09f>] free_task+0x3f/0x70
> [  335.851278]  [<ffffffff8107b18c>] __put_task_struct+0xbc/0x130
> [  335.857094]  [<ffffffff81081524>] delayed_put_task_struct+0x54/0xd0
> [  335.863338]  [<ffffffff810fd354>] __rcu_process_callbacks+0x1e4/0x490
> [  335.869757]  [<ffffffff810fd62f>] rcu_process_callbacks+0x2f/0x80
> [  335.875835]  [<ffffffff810862f5>] __do_softirq+0xc5/0x270
> [  335.881218]  [<ffffffff810c49b4>] ? clockevents_program_event+0x74/0x100
> [  335.887895]  [<ffffffff810c5d94>] ? tick_program_event+0x24/0x30
> [  335.893882]  [<ffffffff8158c48c>] call_softirq+0x1c/0x30
> [  335.899179]  [<ffffffff8104cefd>] do_softirq+0x8d/0xc0
> [  335.904301]  [<ffffffff810867de>] irq_exit+0xae/0xe0
> [  335.909251]  [<ffffffff8158cc3e>] smp_apic_timer_interrupt+0x6e/0x99
> [  335.915591]  [<ffffffff8158ba9c>] apic_timer_interrupt+0x6c/0x80
> [  335.921583]  <EOI>  [<ffffffff810530e7>] ? default_idle+0x67/0x270
> [  335.927741]  [<ffffffff810530e5>] ? default_idle+0x65/0x270
> 



Re: [PATCH v2 06/11] memcg: kmem controller infrastructure

2012-08-13 Thread Glauber Costa
>> > + * Needs to be called after memcg_kmem_new_page, regardless of success or
>> > + * failure of the allocation. if @page is NULL, this function will revert 
>> > the
>> > + * charges. Otherwise, it will commit the memcg given by @handle to the
>> > + * corresponding page_cgroup.
>> > + */
>> > +static __always_inline void
>> > +memcg_kmem_commit_page(struct page *page, struct mem_cgroup *handle, int 
>> > order)
>> > +{
>> > +  if (memcg_kmem_on)
>> > +  __memcg_kmem_commit_page(page, handle, order);
>> > +}
> Don't these 2 functions have short-cuts ?

Sorry kame, what exactly do you mean?

> if (memcg_kmem_on && handle) ?
I guess this can be done to avoid a function call.
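Concretely, I read the suggestion as something like this (an untested
sketch: the same wrapper as in the patch, just with the extra handle
test):

static __always_inline void
memcg_kmem_commit_page(struct page *page, struct mem_cgroup *handle, int order)
{
	/* skip the call for root/unaccounted allocations (handle == NULL) */
	if (memcg_kmem_on && handle)
		__memcg_kmem_commit_page(page, handle, order);
}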

> Maybe free() needs to access page_cgroup...
> 
Can you also be a bit more specific here?

>> > +bool __memcg_kmem_new_page(gfp_t gfp, void *_handle, int order)
>> > +{
>> > +  struct mem_cgroup *memcg;
>> > +  struct mem_cgroup **handle = (struct mem_cgroup **)_handle;
>> > +  bool ret = true;
>> > +  size_t size;
>> > +  struct task_struct *p;
>> > +
>> > +  *handle = NULL;
>> > +  rcu_read_lock();
>> > +  p = rcu_dereference(current->mm->owner);
>> > +  memcg = mem_cgroup_from_task(p);
>> > +  if (!memcg_kmem_enabled(memcg))
>> > +  goto out;
>> > +
>> > +  mem_cgroup_get(memcg);
>> > +
> This mem_cgroup_get() will be a potential performance problem.
> Don't you have good idea to avoid accessing atomic counter here ?
> I think some kind of percpu counter or a feature to disable "move task"
> will be a help.
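Just to make the discussion concrete, one shape the percpu idea could
take (purely illustrative; the kmem_outstanding field does not exist in
this series):

#include <linux/percpu_counter.h>

static inline void memcg_kmem_account(struct mem_cgroup *memcg, s64 nr_bytes)
{
	/* the hot charge path touches only a percpu slot, no shared atomic */
	percpu_counter_add(&memcg->kmem_outstanding, nr_bytes);
}

static inline bool memcg_has_dangling_kmem(struct mem_cgroup *memcg)
{
	/* slow path, e.g. at destruction: do kmem charges still linger? */
	return percpu_counter_sum(&memcg->kmem_outstanding) != 0;
}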




>> > +  pc = lookup_page_cgroup(page);
>> > +  lock_page_cgroup(pc);
>> > +  pc->mem_cgroup = memcg;
>> > +  SetPageCgroupUsed(pc);
>> > +  unlock_page_cgroup(pc);
>> > +}
>> > +
>> > +void __memcg_kmem_free_page(struct page *page, int order)
>> > +{
>> > +  struct mem_cgroup *memcg;
>> > +  size_t size;
>> > +  struct page_cgroup *pc;
>> > +
>> > +  if (mem_cgroup_disabled())
>> > +  return;
>> > +
>> > +  pc = lookup_page_cgroup(page);
>> > +  lock_page_cgroup(pc);
>> > +  memcg = pc->mem_cgroup;
>> > +  pc->mem_cgroup = NULL;

> shouldn't this happen after checking "Used" bit ?
> Ah, BTW, why do you need to clear pc->memcg ?

As for clearing pc->memcg, I think I'm just being overzealous. I can't
foresee any problems due to removing it.

As for the Used bit, what difference does it make when we clear it?
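If the point is ordering, the reordered free path would look roughly
like this (sketch only, with the pc->mem_cgroup = NULL store dropped as
discussed above):

void __memcg_kmem_free_page(struct page *page, int order)
{
	struct mem_cgroup *memcg;
	struct page_cgroup *pc;

	if (mem_cgroup_disabled())
		return;

	pc = lookup_page_cgroup(page);
	lock_page_cgroup(pc);
	if (!PageCgroupUsed(pc)) {
		/* page was never accounted: nothing to uncharge */
		unlock_page_cgroup(pc);
		return;
	}
	memcg = pc->mem_cgroup;	/* only read once Used says it is valid */
	ClearPageCgroupUsed(pc);
	unlock_page_cgroup(pc);

	if (!memcg)
		return;

	WARN_ON(mem_cgroup_is_root(memcg));
	memcg_uncharge_kmem(memcg, (1 << order) << PAGE_SHIFT);
	mem_cgroup_put(memcg);
}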

>> > +  if (!PageCgroupUsed(pc)) {
>> > +  unlock_page_cgroup(pc);
>> > +  return;
>> > +  }
>> > +  ClearPageCgroupUsed(pc);
>> > +  unlock_page_cgroup(pc);
>> > +
>> > +  /*
>> > +   * Checking if kmem accounted is enabled won't work for uncharge, since
>> > +   * it is possible that the user enabled kmem tracking, allocated, and
>> > +   * then disabled it again.
>> > +   *
>> > +   * We trust if there is a memcg associated with the page, it is a valid
>> > +   * allocation
>> > +   */
>> > +  if (!memcg)
>> > +  return;
>> > +
>> > +  WARN_ON(mem_cgroup_is_root(memcg));
>> > +  size = (1 << order) << PAGE_SHIFT;
>> > +  memcg_uncharge_kmem(memcg, size);
>> > +  mem_cgroup_put(memcg);
> Why do we need ref-counting here ? kmem res_counter cannot work as
> reference ?
This is of course the pair of the mem_cgroup_get() you commented on
earlier. If we need one, we need the other. If we don't need one, we
don't need the other =)

The guarantee we're trying to give here is that the memcg structure will
stay around while there are dangling charges to kmem, that we decided
not to move (remember: moving it for the stack is simple, for the slab
is very complicated and ill-defined, and I believe it is better to treat
all kmem equally here)

So maybe we can be clever here, and avoid reference counting at all
times. We call mem_cgroup_get() when the first charge occurs, and then
go for mem_cgroup_put() when our count reaches 0.

What do you think about that?
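Roughly like this (illustrative only; kmem_outstanding is a counter
this series does not have, and the 0 <-> nonzero transitions would need
more care than shown here):

static void memcg_kmem_pin(struct mem_cgroup *memcg, s64 nr_bytes)
{
	/* first outstanding charge pins the memcg */
	if (atomic64_add_return(nr_bytes, &memcg->kmem_outstanding) == nr_bytes)
		mem_cgroup_get(memcg);
}

static void memcg_kmem_unpin(struct mem_cgroup *memcg, s64 nr_bytes)
{
	/* last uncharge drops the pin */
	if (atomic64_sub_return(nr_bytes, &memcg->kmem_outstanding) == 0)
		mem_cgroup_put(memcg);
}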


>> > +#ifdef CONFIG_MEMCG_KMEM
>> > +int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, s64 delta)
>> > +{
> What does 'delta' means ?
> 
I can change it to something like nr_bytes, more informative.

>> > +  struct res_counter *fail_res;
>> > +  struct mem_cgroup *_memcg;
>> > +  int ret;
>> > +  bool may_oom;
>> > +  bool nofail = false;
>> > +
>> > +  may_oom = (gfp & __GFP_WAIT) && (gfp & __GFP_FS) &&
>> > +  !(gfp & __GFP_NORETRY);
>> > +
>> > +  ret = 0;
>> > +
>> > +  if (!memcg)
>> > +  return ret;
>> > +
>> > +  _memcg = memcg;
>> > +  ret = __mem_cgroup_try_charge(NULL, gfp, delta / PAGE_SIZE,
>> > +  &_memcg, may_oom);
>> > +
>> > +  if (ret == -EINTR)  {
>> > +  nofail = true;
>> > +  /*
>> > +   * __mem_cgroup_try_charge() chose to bypass to root due to
>> > +   * OOM kill or fatal signal.  Since our only options are to
>> > +   * either fail the allocation or charge it to this cgroup, do
>> > +   * it as a temporary condition. But we can't fail. From a
>> > +   * kmem/slab perspective, the cache has already been selected,
>> > +   * by mem_cgroup_get_kmem_cache(), so it is too late to change
>> > +   * our minds
>> > +   */
>> > +  res_counter_charge_nofail(&memcg->res, delta, &fail_res);
>> > +  if (do_swap_account)
>> > +          res_counter_charge_nofail(&memcg->memsw, delta,
>> > +                                    &fail_res);

Re: [PATCH v2 06/11] memcg: kmem controller infrastructure

2012-08-13 Thread Glauber Costa
On 08/11/2012 09:11 AM, Greg Thelen wrote:
> On Thu, Aug 09 2012, Glauber Costa wrote:
> 
>> This patch introduces infrastructure for tracking kernel memory pages to
>> a given memcg. This will happen whenever the caller includes the flag
>> __GFP_KMEMCG flag, and the task belong to a memcg other than the root.
>>
>> In memcontrol.h those functions are wrapped in inline accessors.  The
>> idea is to later on, patch those with static branches, so we don't incur
>> any overhead when no mem cgroups with limited kmem are being used.
>>
>> [ v2: improved comments and standardized function names ]
>>
>> Signed-off-by: Glauber Costa 
>> CC: Christoph Lameter 
>> CC: Pekka Enberg 
>> CC: Michal Hocko 
>> CC: Kamezawa Hiroyuki 
>> CC: Johannes Weiner 
>> ---
>>  include/linux/memcontrol.h |  79 +++
>>  mm/memcontrol.c| 185 
>> +
>>  2 files changed, 264 insertions(+)
>>
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index 8d9489f..75b247e 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -21,6 +21,7 @@
>>  #define _LINUX_MEMCONTROL_H
>>  #include <linux/cgroup.h>
>>  #include <linux/vm_event_item.h>
>> +#include <linux/hardirq.h>
>>  
>>  struct mem_cgroup;
>>  struct page_cgroup;
>> @@ -399,6 +400,11 @@ struct sock;
>>  #ifdef CONFIG_MEMCG_KMEM
>>  void sock_update_memcg(struct sock *sk);
>>  void sock_release_memcg(struct sock *sk);
>> +
>> +#define memcg_kmem_on 1
>> +bool __memcg_kmem_new_page(gfp_t gfp, void *handle, int order);
>> +void __memcg_kmem_commit_page(struct page *page, void *handle, int order);
>> +void __memcg_kmem_free_page(struct page *page, int order);
>>  #else
>>  static inline void sock_update_memcg(struct sock *sk)
>>  {
>> @@ -406,6 +412,79 @@ static inline void sock_update_memcg(struct sock *sk)
>>  static inline void sock_release_memcg(struct sock *sk)
>>  {
>>  }
>> +
>> +#define memcg_kmem_on 0
>> +static inline bool
>> +__memcg_kmem_new_page(gfp_t gfp, void *handle, int order)
>> +{
>> +return false;
>> +}
>> +
>> +static inline void  __memcg_kmem_free_page(struct page *page, int order)
>> +{
>> +}
>> +
>> +static inline void
>> +__memcg_kmem_commit_page(struct page *page, struct mem_cgroup *handle, int 
>> order)
>> +{
>> +}
>>  #endif /* CONFIG_MEMCG_KMEM */
>> +
>> +/**
>> + * memcg_kmem_new_page: verify if a new kmem allocation is allowed.
>> + * @gfp: the gfp allocation flags.
>> + * @handle: a pointer to the memcg this was charged against.
>> + * @order: allocation order.
>> + *
>> + * returns true if the memcg where the current task belongs can hold this
>> + * allocation.
>> + *
>> + * We return true automatically if this allocation is not to be accounted to
>> + * any memcg.
>> + */
>> +static __always_inline bool
>> +memcg_kmem_new_page(gfp_t gfp, void *handle, int order)
>> +{
>> +if (!memcg_kmem_on)
>> +return true;
>> +if (!(gfp & __GFP_KMEMCG) || (gfp & __GFP_NOFAIL))
>> +return true;
>> +if (in_interrupt() || (!current->mm) || (current->flags & PF_KTHREAD))
>> +return true;
>> +return __memcg_kmem_new_page(gfp, handle, order);
>> +}
>> +
>> +/**
>> + * memcg_kmem_free_page: uncharge pages from memcg
>> + * @page: pointer to struct page being freed
>> + * @order: allocation order.
>> + *
>> + * there is no need to specify memcg here, since it is embedded in 
>> page_cgroup
>> + */
>> +static __always_inline void
>> +memcg_kmem_free_page(struct page *page, int order)
>> +{
>> +if (memcg_kmem_on)
>> +__memcg_kmem_free_page(page, order);
>> +}
>> +
>> +/**
>> + * memcg_kmem_commit_page: embeds correct memcg in a page
>> + * @handle: a pointer to the memcg this was charged against.
>> + * @page: pointer to struct page recently allocated
>> + * @handle: the memcg structure we charged against
>> + * @order: allocation order.
>> + *
>> + * Needs to be called after memcg_kmem_new_page, regardless of success or
>> + * failure of the allocation. if @page is NULL, this function will revert 
>> the
>> + * charges. Otherwise, it will commit the memcg given by @handle to the
>> + * corresponding page_cgroup.
>> + */
>> +static __always_inline void
>> +memcg_kmem_commit_page(struct page *page, struct mem_cgroup *handle, int 
>> order)
>> +{
>> +if (memcg_kmem_on)
>> +__memcg_kmem_commit_page(page, handle, order);
>> +}
>>  #endif /* _LINUX_MEMCONTROL_H */
>>  
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 54e93de..e9824c1 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -10,6 +10,10 @@
>>   * Copyright (C) 2009 Nokia Corporation
>>   * Author: Kirill A. Shutemov
>>   *
>> + * Kernel Memory Controller
>> + * Copyright (C) 2012 Parallels Inc. and Google Inc.
>> + * Authors: Glauber Costa and Suleiman Souhlal
>> + *
>>   * This program is free software; you can redistribute it and/or modify
>>   * it under the terms of the GNU General Public License as published by
>>   * the Free Software 


Re: [PATCH v2 06/11] memcg: kmem controller infrastructure

2012-08-13 Thread Greg Thelen
On Mon, Aug 13 2012, Glauber Costa wrote:

 
>> Here's the dmesg splat.
>>
>
> Do you always get this report in the same way?
> I managed to get a softirq inconsistency like yours, but the complaint
> goes for a different lock.

Yes, I repeatedly get the same dmesg splat below.

Once I applied your 'execute the whole memcg freeing in rcu callback'
patch, the warnings are no longer printed.  I'll take a closer look at
the patch soon.
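
For reference, the shape of that fix as I understand it (field and
function names here are my guesses, not necessarily the actual patch):

static void free_mem_cgroup_rcu(struct rcu_head *head)
{
	struct mem_cgroup *memcg = container_of(head, struct mem_cgroup,
						rcu_freeing);

	/*
	 * __mem_cgroup_free() now always runs from the RCU softirq, so
	 * the zone-tree lock is only ever taken in one context and the
	 * SOFTIRQ-ON-W vs. IN-SOFTIRQ-W inconsistency goes away.
	 */
	__mem_cgroup_free(memcg);
}

static void mem_cgroup_put_final(struct mem_cgroup *memcg)
{
	if (atomic_dec_and_test(&memcg->refcnt))
		call_rcu(&memcg->rcu_freeing, free_mem_cgroup_rcu);
}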

> [  335.550398] =================================
> [  335.554739] [ INFO: inconsistent lock state ]
> [  335.559091] 3.5.0-dbg-DEV #3 Tainted: G        W
> [  335.563946] ---------------------------------
> [  335.568290] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
> [  335.574286] swapper/10/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
> [  335.579508]  (&(&rtpz->lock)->rlock){+.?...}, at: [<ffffffff8118216d>] __mem_cgroup_free+0x8d/0x1b0
> [  335.588525] {SOFTIRQ-ON-W} state was registered at:
> [  335.593389]   [<ffffffff810cb073>] __lock_acquire+0x623/0x1a50
> [  335.599200]   [<ffffffff810cca55>] lock_acquire+0x95/0x150
> [  335.604670]   [<ffffffff81582531>] _raw_spin_lock+0x41/0x50
> [  335.610232]   [<ffffffff8118216d>] __mem_cgroup_free+0x8d/0x1b0
> [  335.616135]   [<ffffffff811822d5>] mem_cgroup_put+0x45/0x50
> [  335.621696]   [<ffffffff81182302>] mem_cgroup_destroy+0x22/0x30
> [  335.627592]   [<ffffffff810e093f>] cgroup_diput+0xbf/0x160
> [  335.633062]   [<ffffffff811a07ef>] d_delete+0x12f/0x1a0
> [  335.638276]   [<ffffffff8119671e>] vfs_rmdir+0x11e/0x140
> [  335.643565]   [<ffffffff81199173>] do_rmdir+0x113/0x130
> [  335.648773]   [<ffffffff8119a5e6>] sys_rmdir+0x16/0x20
> [  335.653900]   [<ffffffff8158c74f>] cstar_dispatch+0x7/0x1f
> [  335.659370] irq event stamp: 399732
> [  335.662846] hardirqs last  enabled at (399732): [<ffffffff810e8e08>] res_counter_uncharge_until+0x68/0xa0
> [  335.672383] hardirqs last disabled at (399731): [<ffffffff810e8dc8>] res_counter_uncharge_until+0x28/0xa0
> [  335.681916] softirqs last  enabled at (399710): [<ffffffff81085dd3>] _local_bh_enable+0x13/0x20
> [  335.690590] softirqs last disabled at (399711): [<ffffffff8158c48c>] call_softirq+0x1c/0x30
> [  335.698914]
> [  335.698914] other info that might help us debug this:
> [  335.705415]  Possible unsafe locking scenario:
> [  335.705415]
> [  335.711317]        CPU0
> [  335.713757]        ----
> [  335.716198]   lock(&(&rtpz->lock)->rlock);
> [  335.720282]   <Interrupt>
> [  335.722896]     lock(&(&rtpz->lock)->rlock);
> [  335.727153]
> [  335.727153]  *** DEADLOCK ***
> [  335.727153]
> [  335.733055] no locks held by swapper/10/0.
> [  335.737141]
> [  335.737141] stack backtrace:
> [  335.741483] Pid: 0, comm: swapper/10 Tainted: G        W    3.5.0-dbg-DEV #3
> [  335.748510] Call Trace:
> [  335.750952]  <IRQ>  [<ffffffff81579a27>] print_usage_bug+0x1fc/0x20d
> [  335.757286]  [<ffffffff81058a9f>] ? save_stack_trace+0x2f/0x50
> [  335.763098]  [<ffffffff810ca9ed>] mark_lock+0x29d/0x300
> [  335.768309]  [<ffffffff810c9e10>] ? print_irq_inversion_bug.part.36+0x1f0/0x1f0
> [  335.775599]  [<ffffffff810caffc>] __lock_acquire+0x5ac/0x1a50
> [  335.781323]  [<ffffffff810cad34>] ? __lock_acquire+0x2e4/0x1a50
> [  335.787224]  [<ffffffff8118216d>] ? __mem_cgroup_free+0x8d/0x1b0
> [  335.793212]  [<ffffffff810cca55>] lock_acquire+0x95/0x150
> [  335.798594]  [<ffffffff8118216d>] ? __mem_cgroup_free+0x8d/0x1b0
> [  335.804581]  [<ffffffff810e8ddd>] ? res_counter_uncharge_until+0x3d/0xa0
> [  335.811263]  [<ffffffff81582531>] _raw_spin_lock+0x41/0x50
> [  335.816731]  [<ffffffff8118216d>] ? __mem_cgroup_free+0x8d/0x1b0
> [  335.822724]  [<ffffffff8118216d>] __mem_cgroup_free+0x8d/0x1b0
> [  335.828538]  [<ffffffff811822d5>] mem_cgroup_put+0x45/0x50
> [  335.834002]  [<ffffffff811828a6>] __memcg_kmem_free_page+0xa6/0x110
> [  335.840256]  [<ffffffff81138109>] free_accounted_pages+0x99/0xa0
> [  335.846243]  [<ffffffff8107b09f>] free_task+0x3f/0x70
> [  335.851278]  [<ffffffff8107b18c>] __put_task_struct+0xbc/0x130
> [  335.857094]  [<ffffffff81081524>] delayed_put_task_struct+0x54/0xd0
> [  335.863338]  [<ffffffff810fd354>] __rcu_process_callbacks+0x1e4/0x490
> [  335.869757]  [<ffffffff810fd62f>] rcu_process_callbacks+0x2f/0x80
> [  335.875835]  [<ffffffff810862f5>] __do_softirq+0xc5/0x270
> [  335.881218]  [<ffffffff810c49b4>] ? clockevents_program_event+0x74/0x100
> [  335.887895]  [<ffffffff810c5d94>] ? tick_program_event+0x24/0x30
> [  335.893882]  [<ffffffff8158c48c>] call_softirq+0x1c/0x30
> [  335.899179]  [<ffffffff8104cefd>] do_softirq+0x8d/0xc0
> [  335.904301]  [<ffffffff810867de>] irq_exit+0xae/0xe0
> [  335.909251]  [<ffffffff8158cc3e>] smp_apic_timer_interrupt+0x6e/0x99
> [  335.915591]  [<ffffffff8158ba9c>] apic_timer_interrupt+0x6c/0x80
> [  335.921583]  <EOI>  [<ffffffff810530e7>] ? default_idle+0x67/0x270
> [  335.927741]  [<ffffffff810530e5>] ? default_idle+0x65/0x270
 


Re: [PATCH v2 06/11] memcg: kmem controller infrastructure

2012-08-10 Thread Greg Thelen
On Thu, Aug 09 2012, Glauber Costa wrote:

> This patch introduces infrastructure for tracking kernel memory pages to
> a given memcg. This will happen whenever the caller includes the flag
> __GFP_KMEMCG flag, and the task belong to a memcg other than the root.
>
> In memcontrol.h those functions are wrapped in inline accessors.  The
> idea is to later on, patch those with static branches, so we don't incur
> any overhead when no mem cgroups with limited kmem are being used.
>
> [ v2: improved comments and standardized function names ]
>
> Signed-off-by: Glauber Costa 
> CC: Christoph Lameter 
> CC: Pekka Enberg 
> CC: Michal Hocko 
> CC: Kamezawa Hiroyuki 
> CC: Johannes Weiner 
> ---
>  include/linux/memcontrol.h |  79 +++
>  mm/memcontrol.c| 185 
> +
>  2 files changed, 264 insertions(+)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 8d9489f..75b247e 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -21,6 +21,7 @@
>  #define _LINUX_MEMCONTROL_H
>  #include <linux/cgroup.h>
>  #include <linux/vm_event_item.h>
> +#include <linux/hardirq.h>
>  
>  struct mem_cgroup;
>  struct page_cgroup;
> @@ -399,6 +400,11 @@ struct sock;
>  #ifdef CONFIG_MEMCG_KMEM
>  void sock_update_memcg(struct sock *sk);
>  void sock_release_memcg(struct sock *sk);
> +
> +#define memcg_kmem_on 1
> +bool __memcg_kmem_new_page(gfp_t gfp, void *handle, int order);
> +void __memcg_kmem_commit_page(struct page *page, void *handle, int order);
> +void __memcg_kmem_free_page(struct page *page, int order);
>  #else
>  static inline void sock_update_memcg(struct sock *sk)
>  {
> @@ -406,6 +412,79 @@ static inline void sock_update_memcg(struct sock *sk)
>  static inline void sock_release_memcg(struct sock *sk)
>  {
>  }
> +
> +#define memcg_kmem_on 0
> +static inline bool
> +__memcg_kmem_new_page(gfp_t gfp, void *handle, int order)
> +{
> + return false;
> +}
> +
> +static inline void  __memcg_kmem_free_page(struct page *page, int order)
> +{
> +}
> +
> +static inline void
> +__memcg_kmem_commit_page(struct page *page, struct mem_cgroup *handle, int 
> order)
> +{
> +}
>  #endif /* CONFIG_MEMCG_KMEM */
> +
> +/**
> + * memcg_kmem_new_page: verify if a new kmem allocation is allowed.
> + * @gfp: the gfp allocation flags.
> + * @handle: a pointer to the memcg this was charged against.
> + * @order: allocation order.
> + *
> + * returns true if the memcg where the current task belongs can hold this
> + * allocation.
> + *
> + * We return true automatically if this allocation is not to be accounted to
> + * any memcg.
> + */
> +static __always_inline bool
> +memcg_kmem_new_page(gfp_t gfp, void *handle, int order)
> +{
> + if (!memcg_kmem_on)
> + return true;
> + if (!(gfp & __GFP_KMEMCG) || (gfp & __GFP_NOFAIL))
> + return true;
> + if (in_interrupt() || (!current->mm) || (current->flags & PF_KTHREAD))
> + return true;
> + return __memcg_kmem_new_page(gfp, handle, order);
> +}
> +
> +/**
> + * memcg_kmem_free_page: uncharge pages from memcg
> + * @page: pointer to struct page being freed
> + * @order: allocation order.
> + *
> + * there is no need to specify memcg here, since it is embedded in 
> page_cgroup
> + */
> +static __always_inline void
> +memcg_kmem_free_page(struct page *page, int order)
> +{
> + if (memcg_kmem_on)
> + __memcg_kmem_free_page(page, order);
> +}
> +
> +/**
> + * memcg_kmem_commit_page: embeds correct memcg in a page
> + * @handle: a pointer to the memcg this was charged against.
> + * @page: pointer to struct page recently allocated
> + * @handle: the memcg structure we charged against
> + * @order: allocation order.
> + *
> + * Needs to be called after memcg_kmem_new_page, regardless of success or
> + * failure of the allocation. if @page is NULL, this function will revert the
> + * charges. Otherwise, it will commit the memcg given by @handle to the
> + * corresponding page_cgroup.
> + */
> +static __always_inline void
> +memcg_kmem_commit_page(struct page *page, struct mem_cgroup *handle, int 
> order)
> +{
> + if (memcg_kmem_on)
> + __memcg_kmem_commit_page(page, handle, order);
> +}
>  #endif /* _LINUX_MEMCONTROL_H */
>  
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 54e93de..e9824c1 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -10,6 +10,10 @@
>   * Copyright (C) 2009 Nokia Corporation
>   * Author: Kirill A. Shutemov
>   *
> + * Kernel Memory Controller
> + * Copyright (C) 2012 Parallels Inc. and Google Inc.
> + * Authors: Glauber Costa and Suleiman Souhlal
> + *
>   * This program is free software; you can redistribute it and/or modify
>   * it under the terms of the GNU General Public License as published by
>   * the Free Software Foundation; either version 2 of the License, or
> @@ -434,6 +438,9 @@ struct mem_cgroup *mem_cgroup_from_css(struct 
> cgroup_subsys_state *s)
>  #include <net/ip.h>
>  
>  static bool 

Re: [PATCH v2 06/11] memcg: kmem controller infrastructure

2012-08-10 Thread Kamezawa Hiroyuki
(2012/08/09 22:01), Glauber Costa wrote:
> This patch introduces infrastructure for tracking kernel memory pages to
> a given memcg. This will happen whenever the caller includes the flag
> __GFP_KMEMCG flag, and the task belong to a memcg other than the root.
> 
> In memcontrol.h those functions are wrapped in inline accessors.  The
> idea is to later on, patch those with static branches, so we don't incur
> any overhead when no mem cgroups with limited kmem are being used.
> 
> [ v2: improved comments and standardized function names ]
> 
> Signed-off-by: Glauber Costa 
> CC: Christoph Lameter 
> CC: Pekka Enberg 
> CC: Michal Hocko 
> CC: Kamezawa Hiroyuki 
> CC: Johannes Weiner 
> ---
>   include/linux/memcontrol.h |  79 +++
>   mm/memcontrol.c| 185 
> +
>   2 files changed, 264 insertions(+)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 8d9489f..75b247e 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -21,6 +21,7 @@
>   #define _LINUX_MEMCONTROL_H
>   #include <linux/cgroup.h>
>   #include <linux/vm_event_item.h>
> +#include <linux/hardirq.h>
>   
>   struct mem_cgroup;
>   struct page_cgroup;
> @@ -399,6 +400,11 @@ struct sock;
>   #ifdef CONFIG_MEMCG_KMEM
>   void sock_update_memcg(struct sock *sk);
>   void sock_release_memcg(struct sock *sk);
> +
> +#define memcg_kmem_on 1
> +bool __memcg_kmem_new_page(gfp_t gfp, void *handle, int order);
> +void __memcg_kmem_commit_page(struct page *page, void *handle, int order);
> +void __memcg_kmem_free_page(struct page *page, int order);
>   #else
>   static inline void sock_update_memcg(struct sock *sk)
>   {
> @@ -406,6 +412,79 @@ static inline void sock_update_memcg(struct sock *sk)
>   static inline void sock_release_memcg(struct sock *sk)
>   {
>   }
> +
> +#define memcg_kmem_on 0
> +static inline bool
> +__memcg_kmem_new_page(gfp_t gfp, void *handle, int order)
> +{
> + return false;
> +}
> +
> +static inline void  __memcg_kmem_free_page(struct page *page, int order)
> +{
> +}
> +
> +static inline void
> +__memcg_kmem_commit_page(struct page *page, struct mem_cgroup *handle, int 
> order)
> +{
> +}
>   #endif /* CONFIG_MEMCG_KMEM */
> +
> +/**
> + * memcg_kmem_new_page: verify if a new kmem allocation is allowed.
> + * @gfp: the gfp allocation flags.
> + * @handle: a pointer to the memcg this was charged against.
> + * @order: allocation order.
> + *
> + * returns true if the memcg where the current task belongs can hold this
> + * allocation.
> + *
> + * We return true automatically if this allocation is not to be accounted to
> + * any memcg.
> + */
> +static __always_inline bool
> +memcg_kmem_new_page(gfp_t gfp, void *handle, int order)
> +{
> + if (!memcg_kmem_on)
> + return true;
> + if (!(gfp & __GFP_KMEMCG) || (gfp & __GFP_NOFAIL))
> + return true;
> + if (in_interrupt() || (!current->mm) || (current->flags & PF_KTHREAD))
> + return true;
> + return __memcg_kmem_new_page(gfp, handle, order);
> +}
> +
> +/**
> + * memcg_kmem_free_page: uncharge pages from memcg
> + * @page: pointer to struct page being freed
> + * @order: allocation order.
> + *
> + * there is no need to specify memcg here, since it is embedded in 
> page_cgroup
> + */
> +static __always_inline void
> +memcg_kmem_free_page(struct page *page, int order)
> +{
> + if (memcg_kmem_on)
> + __memcg_kmem_free_page(page, order);
> +}
> +
> +/**
> + * memcg_kmem_commit_page: embeds correct memcg in a page
> + * @handle: a pointer to the memcg this was charged against.
> + * @page: pointer to struct page recently allocated
> + * @handle: the memcg structure we charged against
> + * @order: allocation order.
> + *
> + * Needs to be called after memcg_kmem_new_page, regardless of success or
> + * failure of the allocation. if @page is NULL, this function will revert the
> + * charges. Otherwise, it will commit the memcg given by @handle to the
> + * corresponding page_cgroup.
> + */
> +static __always_inline void
> +memcg_kmem_commit_page(struct page *page, struct mem_cgroup *handle, int 
> order)
> +{
> + if (memcg_kmem_on)
> + __memcg_kmem_commit_page(page, handle, order);
> +}

Don't these 2 functions have short-cuts ?

if (memcg_kmem_on && handle) ?

Maybe free() needs to access page_cgroup...



>   #endif /* _LINUX_MEMCONTROL_H */
>   
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 54e93de..e9824c1 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -10,6 +10,10 @@
>* Copyright (C) 2009 Nokia Corporation
>* Author: Kirill A. Shutemov
>*
> + * Kernel Memory Controller
> + * Copyright (C) 2012 Parallels Inc. and Google Inc.
> + * Authors: Glauber Costa and Suleiman Souhlal
> + *
>* This program is free software; you can redistribute it and/or modify
>* it under the terms of the GNU General Public License as published by
>* the Free Software Foundation; either version 2 of the 

[PATCH v2 06/11] memcg: kmem controller infrastructure

2012-08-09 Thread Glauber Costa
This patch introduces infrastructure for tracking kernel memory pages to
a given memcg. This will happen whenever the caller includes the
__GFP_KMEMCG flag and the task belongs to a memcg other than the root.

In memcontrol.h those functions are wrapped in inline accessors.  The
idea is to later on, patch those with static branches, so we don't incur
any overhead when no mem cgroups with limited kmem are being used.

[ v2: improved comments and standardized function names ]

Signed-off-by: Glauber Costa <glom...@parallels.com>
CC: Christoph Lameter <c...@linux.com>
CC: Pekka Enberg <penb...@cs.helsinki.fi>
CC: Michal Hocko <mho...@suse.cz>
CC: Kamezawa Hiroyuki <kamezawa.hir...@jp.fujitsu.com>
CC: Johannes Weiner <han...@cmpxchg.org>
---
 include/linux/memcontrol.h |  79 +++
 mm/memcontrol.c| 185 +
 2 files changed, 264 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 8d9489f..75b247e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -21,6 +21,7 @@
 #define _LINUX_MEMCONTROL_H
 #include <linux/cgroup.h>
 #include <linux/vm_event_item.h>
+#include <linux/hardirq.h>
 
 struct mem_cgroup;
 struct page_cgroup;
@@ -399,6 +400,11 @@ struct sock;
 #ifdef CONFIG_MEMCG_KMEM
 void sock_update_memcg(struct sock *sk);
 void sock_release_memcg(struct sock *sk);
+
+#define memcg_kmem_on 1
+bool __memcg_kmem_new_page(gfp_t gfp, void *handle, int order);
+void __memcg_kmem_commit_page(struct page *page, void *handle, int order);
+void __memcg_kmem_free_page(struct page *page, int order);
 #else
 static inline void sock_update_memcg(struct sock *sk)
 {
@@ -406,6 +412,79 @@ static inline void sock_update_memcg(struct sock *sk)
 static inline void sock_release_memcg(struct sock *sk)
 {
 }
+
+#define memcg_kmem_on 0
+static inline bool
+__memcg_kmem_new_page(gfp_t gfp, void *handle, int order)
+{
+   return false;
+}
+
+static inline void  __memcg_kmem_free_page(struct page *page, int order)
+{
+}
+
+static inline void
+__memcg_kmem_commit_page(struct page *page, struct mem_cgroup *handle, int 
order)
+{
+}
 #endif /* CONFIG_MEMCG_KMEM */
+
+/**
+ * memcg_kmem_new_page: verify if a new kmem allocation is allowed.
+ * @gfp: the gfp allocation flags.
+ * @handle: a pointer to the memcg this was charged against.
+ * @order: allocation order.
+ *
+ * returns true if the memcg where the current task belongs can hold this
+ * allocation.
+ *
+ * We return true automatically if this allocation is not to be accounted to
+ * any memcg.
+ */
+static __always_inline bool
+memcg_kmem_new_page(gfp_t gfp, void *handle, int order)
+{
+   if (!memcg_kmem_on)
+   return true;
+   if (!(gfp & __GFP_KMEMCG) || (gfp & __GFP_NOFAIL))
+   return true;
+   if (in_interrupt() || (!current->mm) || (current->flags & PF_KTHREAD))
+   return true;
+   return __memcg_kmem_new_page(gfp, handle, order);
+}
+
+/**
+ * memcg_kmem_free_page: uncharge pages from memcg
+ * @page: pointer to struct page being freed
+ * @order: allocation order.
+ *
+ * there is no need to specify memcg here, since it is embedded in page_cgroup
+ */
+static __always_inline void
+memcg_kmem_free_page(struct page *page, int order)
+{
+   if (memcg_kmem_on)
+   __memcg_kmem_free_page(page, order);
+}
+
+/**
+ * memcg_kmem_commit_page: embeds correct memcg in a page
+ * @handle: a pointer to the memcg this was charged against.
+ * @page: pointer to struct page recently allocated
+ * @handle: the memcg structure we charged against
+ * @order: allocation order.
+ *
+ * Needs to be called after memcg_kmem_new_page, regardless of success or
+ * failure of the allocation. if @page is NULL, this function will revert the
+ * charges. Otherwise, it will commit the memcg given by @handle to the
+ * corresponding page_cgroup.
+ */
+static __always_inline void
+memcg_kmem_commit_page(struct page *page, struct mem_cgroup *handle, int order)
+{
+   if (memcg_kmem_on)
+   __memcg_kmem_commit_page(page, handle, order);
+}
 #endif /* _LINUX_MEMCONTROL_H */
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 54e93de..e9824c1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -10,6 +10,10 @@
  * Copyright (C) 2009 Nokia Corporation
  * Author: Kirill A. Shutemov
  *
+ * Kernel Memory Controller
+ * Copyright (C) 2012 Parallels Inc. and Google Inc.
+ * Authors: Glauber Costa and Suleiman Souhlal
+ *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License as published by
  * the Free Software Foundation; either version 2 of the License, or
@@ -434,6 +438,9 @@ struct mem_cgroup *mem_cgroup_from_css(struct 
cgroup_subsys_state *s)
 #include <net/ip.h>
 
 static bool mem_cgroup_is_root(struct mem_cgroup *memcg);
+static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, s64 delta);
+static void memcg_uncharge_kmem(struct mem_cgroup *memcg, s64 delta);
+
 void sock_update_memcg(struct sock *sk)
 {
if (mem_cgroup_sockets_enabled) {
@@ -488,6 +495,118 @@ struct cg_proto 
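
[ Editor's note: the remainder of this hunk is truncated in the archive.
One detail worth calling out from the header changes above: memcg_kmem_on
is a plain compile-time constant, so with CONFIG_MEMCG_KMEM disabled the
__always_inline wrappers collapse to "return true"/no-ops and the
out-of-line calls vanish. A minimal, self-contained sketch of that guard
pattern follows; all names in it are made up for illustration and it is
not part of the patch. ]

/*
 * Illustrative sketch of the compile-time guard pattern (hypothetical
 * names; build with -DCONFIG_FEATURE to enable the real hook).
 */
#include <stdbool.h>
#include <stdio.h>

#ifdef CONFIG_FEATURE
#define feature_on 1
static bool __feature_hook(int arg)
{
	printf("hook(%d)\n", arg);
	return arg > 0;
}
#else
#define feature_on 0
static inline bool __feature_hook(int arg)
{
	return false;
}
#endif

static inline bool feature_hook(int arg)
{
	/* feature_on is a constant: when it is 0, the compiler folds this
	 * to "return true" and eliminates the call below as dead code. */
	if (!feature_on)
		return true;
	return __feature_hook(arg);
}

int main(void)
{
	printf("allowed: %d\n", feature_hook(1));
	return 0;
}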

[PATCH v2 06/11] memcg: kmem controller infrastructure

2012-08-09 Thread Glauber Costa
This patch introduces infrastructure for tracking kernel memory pages
charged to a given memcg. This happens whenever the caller passes the
__GFP_KMEMCG flag and the task belongs to a memcg other than the root.

In memcontrol.h those functions are wrapped in inline accessors. The
idea is to later patch them with static branches, so we don't incur
any overhead when no mem cgroups with limited kmem are being used.

[ v2: improved comments and standardized function names ]
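
[ Editor's note: a hedged sketch of the intended call protocol, to make
the description above concrete. my_alloc_pages() and its surroundings
are hypothetical; only the memcg_kmem_* calls and alloc_pages() are
real. The matching free path would call memcg_kmem_free_page(page,
order). ]

struct page *my_alloc_pages(gfp_t gfp, int order)
{
	struct mem_cgroup *memcg = NULL;	/* filled in by the charge path */
	struct page *page;

	/* Ask whether the current task's memcg can hold this charge. */
	if (!memcg_kmem_new_page(gfp, &memcg, order))
		return NULL;			/* over the kmem limit */

	page = alloc_pages(gfp, order);

	/*
	 * Always commit afterwards: with page == NULL this reverts the
	 * charge taken above; otherwise it records @memcg in the new
	 * page's page_cgroup.
	 */
	memcg_kmem_commit_page(page, memcg, order);
	return page;
}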

Signed-off-by: Glauber Costa glom...@parallels.com
CC: Christoph Lameter c...@linux.com
CC: Pekka Enberg penb...@cs.helsinki.fi
CC: Michal Hocko mho...@suse.cz
CC: Kamezawa Hiroyuki kamezawa.hir...@jp.fujitsu.com
CC: Johannes Weiner han...@cmpxchg.org
---
 include/linux/memcontrol.h |  79 +++
 mm/memcontrol.c| 185 +
 2 files changed, 264 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 8d9489f..75b247e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -21,6 +21,7 @@
 #define _LINUX_MEMCONTROL_H
 #include <linux/cgroup.h>
 #include <linux/vm_event_item.h>
+#include <linux/hardirq.h>
 
 struct mem_cgroup;
 struct page_cgroup;
@@ -399,6 +400,11 @@ struct sock;
 #ifdef CONFIG_MEMCG_KMEM
 void sock_update_memcg(struct sock *sk);
 void sock_release_memcg(struct sock *sk);
+
+#define memcg_kmem_on 1
+bool __memcg_kmem_new_page(gfp_t gfp, void *handle, int order);
+void __memcg_kmem_commit_page(struct page *page, void *handle, int order);
+void __memcg_kmem_free_page(struct page *page, int order);
 #else
 static inline void sock_update_memcg(struct sock *sk)
 {
@@ -406,6 +412,79 @@ static inline void sock_update_memcg(struct sock *sk)
 static inline void sock_release_memcg(struct sock *sk)
 {
 }
+
+#define memcg_kmem_on 0
+static inline bool
+__memcg_kmem_new_page(gfp_t gfp, void *handle, int order)
+{
+   return false;
+}
+
+static inline void  __memcg_kmem_free_page(struct page *page, int order)
+{
+}
+
+static inline void
+__memcg_kmem_commit_page(struct page *page, struct mem_cgroup *handle, int order)
+{
+}
 #endif /* CONFIG_MEMCG_KMEM */
+
+/**
+ * memcg_kmem_new_page: verify if a new kmem allocation is allowed.
+ * @gfp: the gfp allocation flags.
+ * @handle: a pointer to the memcg this was charged against.
+ * @order: allocation order.
+ *
+ * returns true if the memcg where the current task belongs can hold this
+ * allocation.
+ *
+ * We return true automatically if this allocation is not to be accounted to
+ * any memcg.
+ */
+static __always_inline bool
+memcg_kmem_new_page(gfp_t gfp, void *handle, int order)
+{
+   if (!memcg_kmem_on)
+   return true;
+   if (!(gfp & __GFP_KMEMCG) || (gfp & __GFP_NOFAIL))
+   return true;
+   if (in_interrupt() || (!current->mm) || (current->flags & PF_KTHREAD))
+   return true;
+   return __memcg_kmem_new_page(gfp, handle, order);
+}
+
+/**
+ * memcg_kmem_free_page: uncharge pages from memcg
+ * @page: pointer to struct page being freed
+ * @order: allocation order.
+ *
+ * There is no need to specify the memcg here, since it is embedded in the page_cgroup.
+ */
+static __always_inline void
+memcg_kmem_free_page(struct page *page, int order)
+{
+   if (memcg_kmem_on)
+   __memcg_kmem_free_page(page, order);
+}
+
+/**
+ * memcg_kmem_commit_page: embeds correct memcg in a page
+ * @page: pointer to struct page recently allocated
+ * @handle: the memcg structure we charged against
+ * @order: allocation order.
+ *
+ * Needs to be called after memcg_kmem_new_page, regardless of success or
+ * failure of the allocation. If @page is NULL, this function will revert the
+ * charges. Otherwise, it will commit the memcg given by @handle to the
+ * corresponding page_cgroup.
+ */
+static __always_inline void
+memcg_kmem_commit_page(struct page *page, struct mem_cgroup *handle, int order)
+{
+   if (memcg_kmem_on)
+   __memcg_kmem_commit_page(page, handle, order);
+}
 #endif /* _LINUX_MEMCONTROL_H */
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 54e93de..e9824c1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -10,6 +10,10 @@
  * Copyright (C) 2009 Nokia Corporation
  * Author: Kirill A. Shutemov
  *
+ * Kernel Memory Controller
+ * Copyright (C) 2012 Parallels Inc. and Google Inc.
+ * Authors: Glauber Costa and Suleiman Souhlal
+ *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License as published by
  * the Free Software Foundation; either version 2 of the License, or
@@ -434,6 +438,9 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *s)
 #include <net/ip.h>
 
 static bool mem_cgroup_is_root(struct mem_cgroup *memcg);
+static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, s64 delta);
+static void