Re: [PATCH v2 07/11] mm: Allocate kernel pages to the right memcg

2012-08-15 Thread Glauber Costa
On 08/15/2012 05:22 PM, Mel Gorman wrote:
>> > I believe it
>> > to be a better and less complicated approach than letting a page appear
>> > and then charging it. Besides being consistent with the rest of memcg,
>> > it won't create unnecessary disturbance in the page allocator
>> > when the allocation is going to fail.
>> > 
> I still don't get why you did not just return a mem_cgroup instead of a
> handle.
> 

Forgot this one, sorry:

The reason is to keep the semantics simple.

What should we return if the code is not compiled in? If we return NULL
for failure, the test becomes

memcg = memcg_kmem_charge_page(gfp, order);
if (!memcg)
  exit;

If the code is not compiled in, we'd either return positive garbage, or
we'd need to wrap the test inside an #ifdef.

I personally believe it to be a lot clearer to standardize on true
to mean "allocation can proceed".

The compiled-out case becomes:

if (!true)
   exit;

which is easily compiled away altogether. Now of course, using struct
mem_cgroup makes sense, and I have already changed that here.
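
(For reference, a minimal sketch of the two stubs implied above --
hypothetical, using the names from this thread -- where the compiled-out
stub is what lets the whole test vanish:)

#ifdef CONFIG_MEMCG_KMEM
/* real implementation: look up the memcg and charge it speculatively */
bool memcg_kmem_new_page(gfp_t gfp, void *_handle, int order);
#else
static inline bool memcg_kmem_new_page(gfp_t gfp, void *_handle, int order)
{
	return true;	/* "allocation can proceed"; the branch folds away */
}
#endif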


Re: [PATCH v2 07/11] mm: Allocate kernel pages to the right memcg

2012-08-15 Thread Glauber Costa
>>
>> As for the type, do you think using struct mem_cgroup would be less
>> confusing?
>>
> 
> Yes and returning the mem_cgroup or NULL instead of bool.

Ok. struct mem_cgroup it is.

> 
>> The placeholder is there, but it is later patched
>> to the final thing.
>> With that explained, if you want me to change it to something else, I
>> can do it. Should I ?
>>
> 
> Not in this patch anyway. I would have preferred a pattern like this but
> that's about it.
> 
> #ifdef CONFIG_MEMCG_KMEM
> extern struct static_key memcg_kmem_enabled_key;
> static inline bool memcg_kmem_enabled(void)
> {
>	return static_key_false(&memcg_kmem_enabled_key);
> }
> #else
> 
> static inline bool memcg_kmem_enabled(void)
> {
>return false;
> }
> #endif
> 

humm, I'll have to think about this name.
"memcg_kmem_enabled" means it is enabled in this cgroup. It is actually
used inside memcontrol.c to denote precisely that.

Now the static branch, of course, means it is globally enabled. Or, as I
called it here, "on".


> Two reasons. One, it does not use the terms "on" and "enabled"
> interchangeably.  The other reason is down to taste as I'm copying the
> pattern I used myself for sk_memalloc_socks(). Of course I am biased.
> 
> Also, why is the key exported?
> 

Same reason. The slab will now have inline functions that will test
against that. The alloc functions themselves are inside the page
allocator, and the exports can go away.

But the static branch will still be tested inside inlined functions in
the slab.

That said, for the sake of simplicity, I can make it go away here, and
add that to the right place later.
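
(Illustration only -- a hypothetical slab-side inline, not code from the
series; being inlined into modular users is what would force the key to
stay exported:)

static inline bool memcg_kmem_should_charge(gfp_t gfp)
{
	/* hypothetical helper: global static key first, then the gfp flag */
	return static_key_false(&memcg_kmem_enabled_key) &&
	       (gfp & __GFP_KMEMCG);
}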

>>> I also find it *very* strange to have a function named as if it is an
>>> allocation-style function when in fact it's looking up a mem_cgroup
>>> and charging it (and uncharging it in the error path if necessary). If
>>> it was called memcg_kmem_newpage_charge I might have found it a little
>>> better.
>>
>> I don't feel strongly about names in general. I can change it.
>> Will update to memcg_kmem_newpage_charge() and memcg_kmem_page_uncharge().
>>
> 
> I would prefer that anyway. Names have meaning and people make assumptions on
> the implementation depending on the name. We should try to be as consistent
> as possible or maintenance becomes harder. I know there are areas where
> we are not consistent at all but we should not compound the problem.

memcg_kmem_page_charge() is even better I believe, and that is what I
changed this to in my tree.

>>> As this thing is called from within the allocator, it's not clear why
>>> __memcg_kmem_new_page is exported. I can't imagine why a module would call
>>> it directly although maybe you cover that somewhere else in the series.
>>
>> Okay, more people commented on this, so let me clarify: They shouldn't
>> be. They were initially exported when this was about the slab only,
>> because they could be called from inlined functions from the allocators.
>> Now that the charge/uncharge was moved to the page allocator - which
>> already allowed me the big benefit of separating this into two pieces,
>> none of this needs to be exported.
>>
>> Sorry for not noticing this myself, but thanks for the eyes =)
>>
> 
> You're welcome. I expect to see all the exports disappear then. If there
> are any exports left I think it would be important to document why they
> have to be exported. This is particularly true because they are
> EXPORT_SYMBOL not EXPORT_SYMBOL_GPL. I think it would be good to know in
> advance why a module (particularly an out-of-tree one) would be
> interested.
> 

I will remove them all for now.


Re: [PATCH v2 07/11] mm: Allocate kernel pages to the right memcg

2012-08-15 Thread Mel Gorman
On Wed, Aug 15, 2012 at 01:08:08PM +0400, Glauber Costa wrote:
> On 08/14/2012 07:16 PM, Mel Gorman wrote:
> > On Thu, Aug 09, 2012 at 05:01:15PM +0400, Glauber Costa wrote:
> >> When a process tries to allocate a page with the __GFP_KMEMCG flag, the
> >> page allocator will call the corresponding memcg functions to validate
> >> the allocation. Tasks in the root memcg can always proceed.
> >>
> >> To avoid adding markers to the page - and a kmem flag that would
> >> necessarily follow, as much as doing page_cgroup lookups for no reason,
> > 
> > As you already guessed, doing a page_cgroup in the page allocator free
> > path would be a no-go.
> 
> Specifically yes, but in general, you will be able to observe that I am
> taking all the possible measures to make sure existing paths are
> disturbed as little as possible.
> 
> Thanks for your review here
> 
> >>  
> >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >> index b956cec..da341dc 100644
> >> --- a/mm/page_alloc.c
> >> +++ b/mm/page_alloc.c
> >> @@ -2532,6 +2532,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int 
> >> order,
> >>struct page *page = NULL;
> >>int migratetype = allocflags_to_migratetype(gfp_mask);
> >>unsigned int cpuset_mems_cookie;
> >> +  void *handle = NULL;
> >>  
> >>gfp_mask &= gfp_allowed_mask;
> >>  
> >> @@ -2543,6 +2544,13 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int 
> >> order,
> >>return NULL;
> >>  
> >>/*
> >> +   * Will only have any effect when __GFP_KMEMCG is set.
> >> +   * This is verified in the (always inline) callee
> >> +   */
>> +  if (!memcg_kmem_new_page(gfp_mask, &handle, order))
> > 
> > memcg_kmem_new_page takes a void * parameter already but here you are
> > passing in a void **. This probably happens to work because you do this
> > 
> > struct mem_cgroup **handle = (struct mem_cgroup **)_handle;
> > 
> > but that appears to defeat the purpose of having an opaque type as a
> > "handle". You have to treat it different then passing it into the commit
> > function because it expects a void *. The motivation for an opaque type
> > is completely unclear to me and how it is managed with a mix of void *
> > and void ** is very confusing.
> 
> okay.
> 
> The opaque type exists because I am doing speculative charging.

I do not get why speculative charging would mandate an opaque type or
"handle". It looks like a fairly standard prepare/commit pattern to me.

> I believe it
> to be a better and less complicated approach than letting a page appear
> and then charging it. Besides being consistent with the rest of memcg,
> it won't create unnecessary disturbance in the page allocator
> when the allocation is going to fail.
> 

I still don't get why you did not just return a mem_cgroup instead of a
handle.

> Now, tasks can move between memcgs, so we can't rely on grabbing it from
> current in commit_page, so we pass it around as a handle.

You could just as easily passed around the mem_cgroup and it would have
been less obscure. Maybe this makes sense from a memcg context and matches
some coding pattern there that I'm not aware of.

> Also, even if
> the task could not move, we already got it once from the task, and that
> is not for free. Better save it.
> 
> Aside from the handle needed, the cost is more or less the same compared
> to doing it in one pass. All we do by using speculative charging is to
> split the cost in two, and do it from two places.
> We'd have to charge + update page_cgroup anyway.
> 
> As for the type, do you think using struct mem_cgroup would be less
> confusing?
> 

Yes and returning the mem_cgroup or NULL instead of bool.

> > On a similar note I spotted #define memcg_kmem_on 1 . That is also
> different just for the sake of it. The convention is to do something
> > like this
> > 
> > /* This helps us to avoid #ifdef CONFIG_NUMA */
> > #ifdef CONFIG_NUMA
> > #define NUMA_BUILD 1
> > #else
> > #define NUMA_BUILD 0
> > #endif
> 
> For simple defines, yes. But a later patch will turn this into a static
> branch test. memcg_kmem_on will always be 0 when compile-disabled, but
> when enabled will expand to static_branch(&...).
> 

I see.

> 
> > memcg_kmem_on was difficult to guess based on its name. I thought initially
> > that it would only be active if a memcg existed or at least something like
> > mem_cgroup_disabled() but it's actually enabled if CONFIG_MEMCG_KMEM is set.
> 
> For now. And I thought that adding the static branch in this patch would
> only confuse matters.

Ah, I see now. I had stopped reading the series once I reached this
patch. I don't think it would have mattered much to collapse the two
patches together but ok.

The static key handling does look a little suspicious. You appear to do
reference counting in memcg_update_kmem_limit for every mem_cgroup_write()
but decrement it on memcg exit. This does not appear as if it would be
symmetric if the memcg files were written to multiple times (maybe that's
not allowed?). Either way, the comment says it can never be disabled but
as you have static_key_slow_dec calls it would appear that you *do*
support them being disabled. Confusing.
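
(For contrast, the usual symmetric use of a static key -- the generic
kernel pattern, not the patch's code:)

	/* first kmem-limited memcg appears: patch the branch in, once */
	static_key_slow_inc(&memcg_kmem_enabled_key);

	/* matching decrement when that memcg goes away */
	static_key_slow_dec(&memcg_kmem_enabled_key);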

Re: [PATCH v2 07/11] mm: Allocate kernel pages to the right memcg

2012-08-15 Thread Michal Hocko
On Thu 09-08-12 17:01:15, Glauber Costa wrote:
[...]
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index b956cec..da341dc 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2532,6 +2532,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
>   struct page *page = NULL;
>   int migratetype = allocflags_to_migratetype(gfp_mask);
>   unsigned int cpuset_mems_cookie;
> + void *handle = NULL;
>  
>   gfp_mask &= gfp_allowed_mask;
>  
> @@ -2543,6 +2544,13 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
>   return NULL;
>  
>   /*
> +  * Will only have any effect when __GFP_KMEMCG is set.
> +  * This is verified in the (always inline) callee
> +  */
> + if (!memcg_kmem_new_page(gfp_mask, &handle, order))
> + return NULL;

When the previous patch introduced this function I thought the handle
obfuscation was to prevent spreading struct mem_cgroup inside the
page allocator but memcg_kmem_commit_page uses the type directly. So why
that obfuscation? Even handle as a name sounds unnecessarily confusing.
I would go with struct mem_cgroup **memcgp or even return the pointer on
success or NULL otherwise.
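
(Either suggested form would look roughly like this -- hypothetical
prototypes, not the series' final API:)

bool memcg_kmem_new_page(gfp_t gfp, struct mem_cgroup **memcgp, int order);

struct mem_cgroup *memcg_kmem_new_page(gfp_t gfp, int order);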

[...]
> +EXPORT_SYMBOL(__free_accounted_pages);

Why exported?

Btw. this is called from call_rcu context but it itself calls call_rcu
down the chain in mem_cgroup_put. Is it safe?

[...]
> +EXPORT_SYMBOL(free_accounted_pages);

here again
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2 07/11] mm: Allocate kernel pages to the right memcg

2012-08-15 Thread Glauber Costa
On 08/14/2012 07:16 PM, Mel Gorman wrote:
> On Thu, Aug 09, 2012 at 05:01:15PM +0400, Glauber Costa wrote:
>> When a process tries to allocate a page with the __GFP_KMEMCG flag, the
>> page allocator will call the corresponding memcg functions to validate
>> the allocation. Tasks in the root memcg can always proceed.
>>
>> To avoid adding markers to the page - and a kmem flag that would
>> necessarily follow, as much as doing page_cgroup lookups for no reason,
> 
> As you already guessed, doing a page_cgroup in the page allocator free
> path would be a no-go.

Specifically yes, but in general, you will be able to observe that I am
taking all the possible measures to make sure existing paths are
disturbed as little as possible.

Thanks for your review here

>>  
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index b956cec..da341dc 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -2532,6 +2532,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
>>  struct page *page = NULL;
>>  int migratetype = allocflags_to_migratetype(gfp_mask);
>>  unsigned int cpuset_mems_cookie;
>> +void *handle = NULL;
>>  
>>  gfp_mask &= gfp_allowed_mask;
>>  
>> @@ -2543,6 +2544,13 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
>>  return NULL;
>>  
>>  /*
>> + * Will only have any effect when __GFP_KMEMCG is set.
>> + * This is verified in the (always inline) callee
>> + */
>> +if (!memcg_kmem_new_page(gfp_mask, &handle, order))
> 
> memcg_kmem_new_page takes a void * parameter already but here you are
> passing in a void **. This probably happens to work because you do this
> 
> struct mem_cgroup **handle = (struct mem_cgroup **)_handle;
> 
> but that appears to defeat the purpose of having an opaque type as a
> "handle". You have to treat it different then passing it into the commit
> function because it expects a void *. The motivation for an opaque type
> is completely unclear to me and how it is managed with a mix of void *
> and void ** is very confusing.


okay.

The opaque type exists because I am doing speculative charging. I believe it
to be a better and less complicated approach than letting a page appear
and then charging it. Besides being consistent with the rest of memcg,
it won't create unnecessary disturbance in the page allocator
when the allocation is going to fail.

Now, tasks can move between memcgs, so we can't rely on grabbing it from
current in commit_page, so we pass it around as a handle. Also, even if
the task could not move, we already got it once from the task, and that
is not for free. Better save it.

Aside from the handle needed, the cost is more or less the same compared
to doing it in one pass. All we do by using speculative charging is to
split the cost in two, and do it from two places.
We'd have to charge + update page_cgroup anyway.

As for the type, do you think using struct mem_cgroup would be less
confusing?

> On a similar note I spotted #define memcg_kmem_on 1 . That is also
> different just for the sake of it. The convention is to do something
> like this
> 
> /* This helps us to avoid #ifdef CONFIG_NUMA */
> #ifdef CONFIG_NUMA
> #define NUMA_BUILD 1
> #else
> #define NUMA_BUILD 0
> #endif


For simple defines, yes. But a later patch will turn this into a static
branch test. memcg_kmem_on will always be 0 when compile-disabled, but
when enabled will expand to static_branch(&...).
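
(Roughly this shape once the later patch lands -- a sketch, not the
final code:)

#ifdef CONFIG_MEMCG_KMEM
extern struct static_key memcg_kmem_enabled_key;
#define memcg_kmem_on	static_branch(&memcg_kmem_enabled_key)
#else
#define memcg_kmem_on	0
#endif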


> memcg_kmem_on was difficult to guess based on its name. I thought initially
> that it would only be active if a memcg existed or at least something like
> mem_cgroup_disabled() but it's actually enabled if CONFIG_MEMCG_KMEM is set.

For now. And I thought that adding the static branch in this patch would
only confuse matters. The placeholder is there, but it is later patched
to the final thing.

With that explained, if you want me to change it to something else, I
can do it. Should I ?

> I also find it *very* strange to have a function named as if it is an
> allocation-style function when in fact it's looking up a mem_cgroup
> and charging it (and uncharging it in the error path if necessary). If
> it was called memcg_kmem_newpage_charge I might have found it a little
> better.

I don't feel strongly about names in general. I can change it.
Will update to memcg_kmem_newpage_charge() and memcg_kmem_page_uncharge().

> This whole operation also looks very expensive (cgroup lookups, RCU locks
> taken etc) but I guess you're willing to take that cost in the name of
> isolating containers from each other. However, I strongly suggest that
> this overhead is measured in advance. It should not stop the series being
> merged as such but it should be understood because if the cost is high
> then this feature will be avoided like the plague. I am skeptical that
> distributions would enable this by default, at least not without support
> for cgroup_disable=kmem

Enabling this feature will bring you nothing, therefore, no (or little)
overhead. Nothing of this will be patched in until the first memcg gets
kmem limited.


Re: [PATCH v2 07/11] mm: Allocate kernel pages to the right memcg

2012-08-14 Thread Mel Gorman
On Thu, Aug 09, 2012 at 05:01:15PM +0400, Glauber Costa wrote:
> When a process tries to allocate a page with the __GFP_KMEMCG flag, the
> page allocator will call the corresponding memcg functions to validate
> the allocation. Tasks in the root memcg can always proceed.
> 
> To avoid adding markers to the page - and a kmem flag that would
> necessarily follow, as much as doing page_cgroup lookups for no reason,

As you already guessed, doing a page_cgroup in the page allocator free
path would be a no-go.

This is my first time glancing at the series and I'm only paying close
attention to this patch so pardon me if my observations have been made
already.

> whoever is marking its allocations with __GFP_KMEMCG flag is responsible
> for telling the page allocator that this is such an allocation at
> free_pages() time. This is done by the invocation of
> __free_accounted_pages() and free_accounted_pages().
> 
> Signed-off-by: Glauber Costa 
> CC: Christoph Lameter 
> CC: Pekka Enberg 
> CC: Michal Hocko 
> CC: Kamezawa Hiroyuki 
> CC: Johannes Weiner 
> CC: Suleiman Souhlal 
> ---
>  include/linux/gfp.h |  3 +++
>  mm/page_alloc.c | 38 ++
>  2 files changed, 41 insertions(+)
> 
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index d8eae4d..029570f 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -370,6 +370,9 @@ extern void free_pages(unsigned long addr, unsigned int order);
>  extern void free_hot_cold_page(struct page *page, int cold);
>  extern void free_hot_cold_page_list(struct list_head *list, int cold);
>  
> +extern void __free_accounted_pages(struct page *page, unsigned int order);
> +extern void free_accounted_pages(unsigned long addr, unsigned int order);
> +
>  #define __free_page(page) __free_pages((page), 0)
>  #define free_page(addr) free_pages((addr), 0)
>  
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index b956cec..da341dc 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2532,6 +2532,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
>   struct page *page = NULL;
>   int migratetype = allocflags_to_migratetype(gfp_mask);
>   unsigned int cpuset_mems_cookie;
> + void *handle = NULL;
>  
>   gfp_mask &= gfp_allowed_mask;
>  
> @@ -2543,6 +2544,13 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
>   return NULL;
>  
>   /*
> +  * Will only have any effect when __GFP_KMEMCG is set.
> +  * This is verified in the (always inline) callee
> +  */
> + if (!memcg_kmem_new_page(gfp_mask, &handle, order))

memcg_kmem_new_page takes a void * parameter already but here you are
passing in a void **. This probably happens to work because you do this

struct mem_cgroup **handle = (struct mem_cgroup **)_handle;

but that appears to defeat the purpose of having an opaque type as a
"handle". You have to treat it different then passing it into the commit
function because it expects a void *. The motivation for an opaque type
is completely unclear to me and how it is managed with a mix of void *
and void ** is very confusing.
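
(Concretely, the mix being described, reconstructed from the quoted
patch: &handle is a void ** that silently converts to the void *
parameter, while the commit side takes the handle as a plain void *:)

	void *handle = NULL;

	memcg_kmem_new_page(gfp_mask, &handle, order);	/* void ** in */
	memcg_kmem_commit_page(page, handle, order);	/* void *  in */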

On a similar note I spotted #define memcg_kmem_on 1 . That is also
different just for the sake of it. The convention is to do something
like this

/* This helps us to avoid #ifdef CONFIG_NUMA */
#ifdef CONFIG_NUMA
#define NUMA_BUILD 1
#else
#define NUMA_BUILD 0
#endif

memcg_kmem_on was difficult to guess based on its name. I thought initially
that it would only be active if a memcg existed or at least something like
mem_cgroup_disabled() but it's actually enabled if CONFIG_MEMCG_KMEM is set.

I also find it *very* strange to have a function named as if it is an
allocation-style function when in fact it's looking up a mem_cgroup
and charging it (and uncharging it in the error path if necessary). If
it was called memcg_kmem_newpage_charge I might have found it a little
better.  While I believe you have to take care to avoid confusion with
mem_cgroup_newpage_charge, it would be preferable if the APIs were similar.
memcg is hard enough as it is to understand without having different APIs.

This whole operation also looks very expensive (cgroup lookups, RCU locks
taken etc) but I guess you're willing to take that cost in the name of
isolating containers from each other. However, I strongly suggest that
this overhead is measured in advance. It should not stop the series being
merged as such but it should be understood because if the cost is high
then this feature will be avoided like the plague. I am skeptical that
distributions would enable this by default, at least not without support
for cgroup_disable=kmem

As this thing is called from within the allocator, it's not clear why
__memcg_kmem_new_page is exported. I can't imagine why a module would call
it directly although maybe you cover that somewhere else in the series.

From the point of view of a hook, that is acceptable but just barely. I have
slammed other hooks 


Re: [PATCH v2 07/11] mm: Allocate kernel pages to the right memcg

2012-08-13 Thread Mel Gorman
On Mon, Aug 13, 2012 at 12:03:38PM +0400, Glauber Costa wrote:
> On 08/10/2012 09:33 PM, Kamezawa Hiroyuki wrote:
> > (2012/08/09 22:01), Glauber Costa wrote:
> >> When a process tries to allocate a page with the __GFP_KMEMCG flag, the
> >> page allocator will call the corresponding memcg functions to validate
> >> the allocation. Tasks in the root memcg can always proceed.
> >>
> >> To avoid adding markers to the page - and a kmem flag that would
> >> necessarily follow, as much as doing page_cgroup lookups for no reason,
> >> whoever is marking its allocations with __GFP_KMEMCG flag is responsible
> >> for telling the page allocator that this is such an allocation at
> >> free_pages() time. This is done by the invocation of
> >> __free_accounted_pages() and free_accounted_pages().
> >>
> >> Signed-off-by: Glauber Costa 
> >> CC: Christoph Lameter 
> >> CC: Pekka Enberg 
> >> CC: Michal Hocko 
> >> CC: Kamezawa Hiroyuki 
> >> CC: Johannes Weiner 
> >> CC: Suleiman Souhlal 
> > 
> > Ah, ok. free_accounted_page() seems good.
> > 
> > Acked-by: KAMEZAWA Hiroyuki 
> > 
> > I myself am okay with this. But...
> > 
> > Because you add a new hook to alloc_pages(), please get Ack from Mel
> > before requesting merge.
> > 
> > Thanks,
> > -Kame
> 
> Absolutely.
> 
> Mel, would you mind taking a look at this series and commenting on this?
> 

It'll take me a few days but I'll get around to it.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH v2 07/11] mm: Allocate kernel pages to the right memcg

2012-08-13 Thread Glauber Costa
On 08/10/2012 09:33 PM, Kamezawa Hiroyuki wrote:
> (2012/08/09 22:01), Glauber Costa wrote:
>> When a process tries to allocate a page with the __GFP_KMEMCG flag, the
>> page allocator will call the corresponding memcg functions to validate
>> the allocation. Tasks in the root memcg can always proceed.
>>
>> To avoid adding markers to the page - and a kmem flag that would
>> necessarily follow, as much as doing page_cgroup lookups for no reason,
>> whoever is marking its allocations with __GFP_KMEMCG flag is responsible
>> for telling the page allocator that this is such an allocation at
>> free_pages() time. This is done by the invocation of
>> __free_accounted_pages() and free_accounted_pages().
>>
>> Signed-off-by: Glauber Costa 
>> CC: Christoph Lameter 
>> CC: Pekka Enberg 
>> CC: Michal Hocko 
>> CC: Kamezawa Hiroyuki 
>> CC: Johannes Weiner 
>> CC: Suleiman Souhlal 
> 
> Ah, ok. free_accounted_page() seems good.
> 
> Acked-by: KAMEZAWA Hiroyuki 
> 
> I myself am okay with this. But...
> 
> Because you add a new hook to alloc_pages(), please get Ack from Mel
> before requesting merge.
> 
> Thanks,
> -Kame

Absolutely.

Mel, would you mind taking a look at this series and commenting on this?

Thanks in advance.


Re: [PATCH v2 07/11] mm: Allocate kernel pages to the right memcg

2012-08-13 Thread Glauber Costa
On 08/10/2012 09:36 PM, Greg Thelen wrote:
> On Thu, Aug 09 2012, Glauber Costa wrote:
> 
>> When a process tries to allocate a page with the __GFP_KMEMCG flag, the
>> page allocator will call the corresponding memcg functions to validate
>> the allocation. Tasks in the root memcg can always proceed.
>>
>> To avoid adding markers to the page - and a kmem flag that would
>> necessarily follow, as much as doing page_cgroup lookups for no reason,
>> whoever is marking its allocations with __GFP_KMEMCG flag is responsible
>> for telling the page allocator that this is such an allocation at
>> free_pages() time. This is done by the invocation of
>> __free_accounted_pages() and free_accounted_pages().
>>
>> Signed-off-by: Glauber Costa 
>> CC: Christoph Lameter 
>> CC: Pekka Enberg 
>> CC: Michal Hocko 
>> CC: Kamezawa Hiroyuki 
>> CC: Johannes Weiner 
>> CC: Suleiman Souhlal 
>> ---
>>  include/linux/gfp.h |  3 +++
>>  mm/page_alloc.c | 38 ++
>>  2 files changed, 41 insertions(+)
>>
>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
>> index d8eae4d..029570f 100644
>> --- a/include/linux/gfp.h
>> +++ b/include/linux/gfp.h
>> @@ -370,6 +370,9 @@ extern void free_pages(unsigned long addr, unsigned int order);
>>  extern void free_hot_cold_page(struct page *page, int cold);
>>  extern void free_hot_cold_page_list(struct list_head *list, int cold);
>>  
>> +extern void __free_accounted_pages(struct page *page, unsigned int order);
>> +extern void free_accounted_pages(unsigned long addr, unsigned int order);
>> +
>>  #define __free_page(page) __free_pages((page), 0)
>>  #define free_page(addr) free_pages((addr), 0)
>>  
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index b956cec..da341dc 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -2532,6 +2532,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
>>  struct page *page = NULL;
>>  int migratetype = allocflags_to_migratetype(gfp_mask);
>>  unsigned int cpuset_mems_cookie;
>> +void *handle = NULL;
>>  
>>  gfp_mask &= gfp_allowed_mask;
>>  
>> @@ -2543,6 +2544,13 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
>>  return NULL;
>>  
>>  /*
>> + * Will only have any effect when __GFP_KMEMCG is set.
>> + * This is verified in the (always inline) callee
>> + */
>> +if (!memcg_kmem_new_page(gfp_mask, &handle, order))
>> +return NULL;
>> +
>> +/*
>>   * Check the zones suitable for the gfp_mask contain at least one
>>   * valid zone. It's possible to have an empty zonelist as a result
>>   * of GFP_THISNODE and a memoryless node
>> @@ -2583,6 +2591,8 @@ out:
>>  if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
>>  goto retry_cpuset;
>>  
>> +memcg_kmem_commit_page(page, handle, order);
>> +
>>  return page;
>>  }
>>  EXPORT_SYMBOL(__alloc_pages_nodemask);
>> @@ -2635,6 +2645,34 @@ void free_pages(unsigned long addr, unsigned int order)
>>  
>>  EXPORT_SYMBOL(free_pages);
>>  
>> +/*
>> + * __free_accounted_pages and free_accounted_pages will free pages allocated
>> + * with __GFP_KMEMCG.
>> + *
>> + * Those pages are accounted to a particular memcg, embedded in the
>> + * corresponding page_cgroup. To avoid adding a hit in the allocator to search
>> + * for that information only to find out that it is NULL for users who have no
>> + * interest in that whatsoever, we provide these functions.
>> + *
>> + * The caller knows better which flags it relies on.
>> + */
>> +void __free_accounted_pages(struct page *page, unsigned int order)
>> +{
>> +memcg_kmem_free_page(page, order);
>> +__free_pages(page, order);
>> +}
>> +EXPORT_SYMBOL(__free_accounted_pages);
>> +
>> +void free_accounted_pages(unsigned long addr, unsigned int order)
>> +{
>> +if (addr != 0) {
>> +VM_BUG_ON(!virt_addr_valid((void *)addr));
>> +memcg_kmem_free_page(virt_to_page((void *)addr), order);
>> +__free_pages(virt_to_page((void *)addr), order);
> 
> Nit.  Is there any reason not to replace the above two lines with:
>   __free_accounted_pages(virt_to_page((void *)addr), order);
> 
Not any particular reason. If people prefer it this way, I can do that
with no problems.
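
(I.e. the suggested simplification, derived from the quoted patch:)

void free_accounted_pages(unsigned long addr, unsigned int order)
{
	if (addr != 0) {
		VM_BUG_ON(!virt_addr_valid((void *)addr));
		__free_accounted_pages(virt_to_page((void *)addr), order);
	}
}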


Re: [PATCH v2 07/11] mm: Allocate kernel pages to the right memcg

2012-08-13 Thread Glauber Costa
On 08/10/2012 09:36 PM, Greg Thelen wrote:
 On Thu, Aug 09 2012, Glauber Costa wrote:
 
 When a process tries to allocate a page with the __GFP_KMEMCG flag, the
 page allocator will call the corresponding memcg functions to validate
 the allocation. Tasks in the root memcg can always proceed.

 To avoid adding markers to the page - and a kmem flag that would
 necessarily follow, as much as doing page_cgroup lookups for no reason,
 whoever is marking its allocations with __GFP_KMEMCG flag is responsible
 for telling the page allocator that this is such an allocation at
 free_pages() time. This is done by the invocation of
 __free_accounted_pages() and free_accounted_pages().

 Signed-off-by: Glauber Costa glom...@parallels.com
 CC: Christoph Lameter c...@linux.com
 CC: Pekka Enberg penb...@cs.helsinki.fi
 CC: Michal Hocko mho...@suse.cz
 CC: Kamezawa Hiroyuki kamezawa.hir...@jp.fujitsu.com
 CC: Johannes Weiner han...@cmpxchg.org
 CC: Suleiman Souhlal sulei...@google.com
 ---
  include/linux/gfp.h |  3 +++
  mm/page_alloc.c | 38 ++
  2 files changed, 41 insertions(+)

 diff --git a/include/linux/gfp.h b/include/linux/gfp.h
 index d8eae4d..029570f 100644
 --- a/include/linux/gfp.h
 +++ b/include/linux/gfp.h
 @@ -370,6 +370,9 @@ extern void free_pages(unsigned long addr, unsigned int 
 order);
  extern void free_hot_cold_page(struct page *page, int cold);
  extern void free_hot_cold_page_list(struct list_head *list, int cold);
  
 +extern void __free_accounted_pages(struct page *page, unsigned int order);
 +extern void free_accounted_pages(unsigned long addr, unsigned int order);
 +
  #define __free_page(page) __free_pages((page), 0)
  #define free_page(addr) free_pages((addr), 0)
  
 diff --git a/mm/page_alloc.c b/mm/page_alloc.c
 index b956cec..da341dc 100644
 --- a/mm/page_alloc.c
 +++ b/mm/page_alloc.c
 @@ -2532,6 +2532,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int 
 order,
  struct page *page = NULL;
  int migratetype = allocflags_to_migratetype(gfp_mask);
  unsigned int cpuset_mems_cookie;
 +void *handle = NULL;
  
  gfp_mask = gfp_allowed_mask;
  
 @@ -2543,6 +2544,13 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int 
 order,
  return NULL;
  
  /*
 + * Will only have any effect when __GFP_KMEMCG is set.
 + * This is verified in the (always inline) callee
 + */
 +if (!memcg_kmem_new_page(gfp_mask, handle, order))
 +return NULL;
 +
 +/*
   * Check the zones suitable for the gfp_mask contain at least one
   * valid zone. It's possible to have an empty zonelist as a result
   * of GFP_THISNODE and a memoryless node
 @@ -2583,6 +2591,8 @@ out:
  if (unlikely(!put_mems_allowed(cpuset_mems_cookie)  !page))
  goto retry_cpuset;
  
 +memcg_kmem_commit_page(page, handle, order);
 +
  return page;
  }
  EXPORT_SYMBOL(__alloc_pages_nodemask);
 @@ -2635,6 +2645,34 @@ void free_pages(unsigned long addr, unsigned int 
 order)
  
  EXPORT_SYMBOL(free_pages);
  
 +/*
 + * __free_accounted_pages and free_accounted_pages will free pages allocated
 + * with __GFP_KMEMCG.
 + *
 + * Those pages are accounted to a particular memcg, embedded in the
 + * corresponding page_cgroup. To avoid adding a hit in the allocator to 
 search
 + * for that information only to find out that it is NULL for users who have 
 no
 + * interest in that whatsoever, we provide these functions.
 + *
 + * The caller knows better which flags it relies on.
 + */
 +void __free_accounted_pages(struct page *page, unsigned int order)
 +{
 +memcg_kmem_free_page(page, order);
 +__free_pages(page, order);
 +}
 +EXPORT_SYMBOL(__free_accounted_pages);
 +
 +void free_accounted_pages(unsigned long addr, unsigned int order)
 +{
 +if (addr != 0) {
 +VM_BUG_ON(!virt_addr_valid((void *)addr));
 +memcg_kmem_free_page(virt_to_page((void *)addr), order);
 +__free_pages(virt_to_page((void *)addr), order);
 
 Nit.  Is there any reason not to replace the above two lines with:
   __free_accounted_pages(virt_to_page((void *)addr), order);
 
Not any particular reason. If people prefer it this way, I can do that
with no problems.

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 07/11] mm: Allocate kernel pages to the right memcg

2012-08-13 Thread Glauber Costa
On 08/10/2012 09:33 PM, Kamezawa Hiroyuki wrote:
 (2012/08/09 22:01), Glauber Costa wrote:
 When a process tries to allocate a page with the __GFP_KMEMCG flag, the
 page allocator will call the corresponding memcg functions to validate
 the allocation. Tasks in the root memcg can always proceed.

 To avoid adding markers to the page - and a kmem flag that would
 necessarily follow, as much as doing page_cgroup lookups for no reason,
 whoever is marking its allocations with __GFP_KMEMCG flag is responsible
 for telling the page allocator that this is such an allocation at
 free_pages() time. This is done by the invocation of
 __free_accounted_pages() and free_accounted_pages().

 Signed-off-by: Glauber Costa glom...@parallels.com
 CC: Christoph Lameter c...@linux.com
 CC: Pekka Enberg penb...@cs.helsinki.fi
 CC: Michal Hocko mho...@suse.cz
 CC: Kamezawa Hiroyuki kamezawa.hir...@jp.fujitsu.com
 CC: Johannes Weiner han...@cmpxchg.org
 CC: Suleiman Souhlal sulei...@google.com
 
 Ah, ok. free_accounted_page() seems good.
 
 Acked-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com
 
 I myself is okay with this. But...
 
 Because you add a new hook to alloc_pages(), please get Ack from Mel
 before requesting merge.
 
 Thanks,
 -Kame

Absolutely.

Mel, would you mind taking a look at this series and commenting on this?

Thanks in advance.



Re: [PATCH v2 07/11] mm: Allocate kernel pages to the right memcg

2012-08-13 Thread Mel Gorman
On Mon, Aug 13, 2012 at 12:03:38PM +0400, Glauber Costa wrote:
> On 08/10/2012 09:33 PM, Kamezawa Hiroyuki wrote:
>> (2012/08/09 22:01), Glauber Costa wrote:
>>> When a process tries to allocate a page with the __GFP_KMEMCG flag, the
>>> page allocator will call the corresponding memcg functions to validate
>>> the allocation. Tasks in the root memcg can always proceed.
>>>
>>> To avoid adding markers to the page - and a kmem flag that would
>>> necessarily follow, as much as doing page_cgroup lookups for no reason,
>>> whoever is marking its allocations with __GFP_KMEMCG flag is responsible
>>> for telling the page allocator that this is such an allocation at
>>> free_pages() time. This is done by the invocation of
>>> __free_accounted_pages() and free_accounted_pages().
>>>
>>> Signed-off-by: Glauber Costa glom...@parallels.com
>>> CC: Christoph Lameter c...@linux.com
>>> CC: Pekka Enberg penb...@cs.helsinki.fi
>>> CC: Michal Hocko mho...@suse.cz
>>> CC: Kamezawa Hiroyuki kamezawa.hir...@jp.fujitsu.com
>>> CC: Johannes Weiner han...@cmpxchg.org
>>> CC: Suleiman Souhlal sulei...@google.com
>>
>> Ah, ok. free_accounted_page() seems good.
>>
>> Acked-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com
>>
>> I myself am okay with this. But...
>>
>> Because you add a new hook to alloc_pages(), please get Ack from Mel
>> before requesting merge.
>>
>> Thanks,
>> -Kame
>
> Absolutely.
>
> Mel, would you mind taking a look at this series and commenting on this?
>
It'll take me a few days but I'll get around to it.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH v2 07/11] mm: Allocate kernel pages to the right memcg

2012-08-10 Thread Greg Thelen
On Thu, Aug 09 2012, Glauber Costa wrote:

> When a process tries to allocate a page with the __GFP_KMEMCG flag, the
> page allocator will call the corresponding memcg functions to validate
> the allocation. Tasks in the root memcg can always proceed.
>
> To avoid adding markers to the page - and a kmem flag that would
> necessarily follow, as much as doing page_cgroup lookups for no reason,
> whoever is marking its allocations with __GFP_KMEMCG flag is responsible
> for telling the page allocator that this is such an allocation at
> free_pages() time. This is done by the invocation of
> __free_accounted_pages() and free_accounted_pages().
>
> Signed-off-by: Glauber Costa glom...@parallels.com
> CC: Christoph Lameter c...@linux.com
> CC: Pekka Enberg penb...@cs.helsinki.fi
> CC: Michal Hocko mho...@suse.cz
> CC: Kamezawa Hiroyuki kamezawa.hir...@jp.fujitsu.com
> CC: Johannes Weiner han...@cmpxchg.org
> CC: Suleiman Souhlal sulei...@google.com
> ---
>  include/linux/gfp.h |  3 +++
>  mm/page_alloc.c | 38 ++
>  2 files changed, 41 insertions(+)
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index d8eae4d..029570f 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -370,6 +370,9 @@ extern void free_pages(unsigned long addr, unsigned int order);
>  extern void free_hot_cold_page(struct page *page, int cold);
>  extern void free_hot_cold_page_list(struct list_head *list, int cold);
>  
> +extern void __free_accounted_pages(struct page *page, unsigned int order);
> +extern void free_accounted_pages(unsigned long addr, unsigned int order);
> +
>  #define __free_page(page) __free_pages((page), 0)
>  #define free_page(addr) free_pages((addr), 0)
>  
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index b956cec..da341dc 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2532,6 +2532,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
>   struct page *page = NULL;
>   int migratetype = allocflags_to_migratetype(gfp_mask);
>   unsigned int cpuset_mems_cookie;
> + void *handle = NULL;
>  
>   gfp_mask &= gfp_allowed_mask;
>  
> @@ -2543,6 +2544,13 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
>   return NULL;
>  
>   /*
> +  * Will only have any effect when __GFP_KMEMCG is set.
> +  * This is verified in the (always inline) callee
> +  */
> + if (!memcg_kmem_new_page(gfp_mask, &handle, order))
> + return NULL;
> +
> + /*
>* Check the zones suitable for the gfp_mask contain at least one
>* valid zone. It's possible to have an empty zonelist as a result
>* of GFP_THISNODE and a memoryless node
> @@ -2583,6 +2591,8 @@ out:
>   if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
>   goto retry_cpuset;
>  
> + memcg_kmem_commit_page(page, handle, order);
> +
>   return page;
>  }
>  EXPORT_SYMBOL(__alloc_pages_nodemask);
> @@ -2635,6 +2645,34 @@ void free_pages(unsigned long addr, unsigned int order)
>  
>  EXPORT_SYMBOL(free_pages);
>  
> +/*
> + * __free_accounted_pages and free_accounted_pages will free pages allocated
> + * with __GFP_KMEMCG.
> + *
> + * Those pages are accounted to a particular memcg, embedded in the
> + * corresponding page_cgroup. To avoid adding a hit in the allocator to search
> + * for that information only to find out that it is NULL for users who have no
> + * interest in that whatsoever, we provide these functions.
> + *
> + * The caller knows better which flags it relies on.
> + */
> +void __free_accounted_pages(struct page *page, unsigned int order)
> +{
> + memcg_kmem_free_page(page, order);
> + __free_pages(page, order);
> +}
> +EXPORT_SYMBOL(__free_accounted_pages);
> +
> +void free_accounted_pages(unsigned long addr, unsigned int order)
> +{
> + if (addr != 0) {
> + VM_BUG_ON(!virt_addr_valid((void *)addr));
> + memcg_kmem_free_page(virt_to_page((void *)addr), order);
> + __free_pages(virt_to_page((void *)addr), order);

Nit.  Is there any reason not to replace the above two lines with:
__free_accounted_pages(virt_to_page((void *)addr), order);

> + }
> +}
> +EXPORT_SYMBOL(free_accounted_pages);
> +
>  static void *make_alloc_exact(unsigned long addr, unsigned order, size_t size)
>  {
>   if (addr) {


Re: [PATCH v2 07/11] mm: Allocate kernel pages to the right memcg

2012-08-10 Thread Kamezawa Hiroyuki
(2012/08/09 22:01), Glauber Costa wrote:
> When a process tries to allocate a page with the __GFP_KMEMCG flag, the
> page allocator will call the corresponding memcg functions to validate
> the allocation. Tasks in the root memcg can always proceed.
> 
> To avoid adding markers to the page - and a kmem flag that would
> necessarily follow, as much as doing page_cgroup lookups for no reason,
> whoever is marking its allocations with __GFP_KMEMCG flag is responsible
> for telling the page allocator that this is such an allocation at
> free_pages() time. This is done by the invocation of
> __free_accounted_pages() and free_accounted_pages().
> 
> Signed-off-by: Glauber Costa glom...@parallels.com
> CC: Christoph Lameter c...@linux.com
> CC: Pekka Enberg penb...@cs.helsinki.fi
> CC: Michal Hocko mho...@suse.cz
> CC: Kamezawa Hiroyuki kamezawa.hir...@jp.fujitsu.com
> CC: Johannes Weiner han...@cmpxchg.org
> CC: Suleiman Souhlal sulei...@google.com

Ah, ok. free_accounted_page() seems good.

Acked-by: KAMEZAWA Hiroyuki 

I myself am okay with this. But...

Because you add a new hook to alloc_pages(), please get Ack from Mel
before requesting merge.

Thanks,
-Kame




> ---
>   include/linux/gfp.h |  3 +++
>   mm/page_alloc.c | 38 ++
>   2 files changed, 41 insertions(+)
> 
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index d8eae4d..029570f 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -370,6 +370,9 @@ extern void free_pages(unsigned long addr, unsigned int order);
>   extern void free_hot_cold_page(struct page *page, int cold);
>   extern void free_hot_cold_page_list(struct list_head *list, int cold);
>   
> +extern void __free_accounted_pages(struct page *page, unsigned int order);
> +extern void free_accounted_pages(unsigned long addr, unsigned int order);
> +
>   #define __free_page(page) __free_pages((page), 0)
>   #define free_page(addr) free_pages((addr), 0)
>   
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index b956cec..da341dc 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2532,6 +2532,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
>   struct page *page = NULL;
>   int migratetype = allocflags_to_migratetype(gfp_mask);
>   unsigned int cpuset_mems_cookie;
> + void *handle = NULL;
>   
>   gfp_mask &= gfp_allowed_mask;
>   
> @@ -2543,6 +2544,13 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
>   return NULL;
>   
>   /*
> +  * Will only have any effect when __GFP_KMEMCG is set.
> +  * This is verified in the (always inline) callee
> +  */
> + if (!memcg_kmem_new_page(gfp_mask, &handle, order))
> + return NULL;
> +
> + /*
>* Check the zones suitable for the gfp_mask contain at least one
>* valid zone. It's possible to have an empty zonelist as a result
>* of GFP_THISNODE and a memoryless node
> @@ -2583,6 +2591,8 @@ out:
>   if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
>   goto retry_cpuset;
>   
> + memcg_kmem_commit_page(page, handle, order);
> +
>   return page;
>   }
>   EXPORT_SYMBOL(__alloc_pages_nodemask);
> @@ -2635,6 +2645,34 @@ void free_pages(unsigned long addr, unsigned int order)
>   
>   EXPORT_SYMBOL(free_pages);
>   
> +/*
> + * __free_accounted_pages and free_accounted_pages will free pages allocated
> + * with __GFP_KMEMCG.
> + *
> + * Those pages are accounted to a particular memcg, embedded in the
> + * corresponding page_cgroup. To avoid adding a hit in the allocator to search
> + * for that information only to find out that it is NULL for users who have no
> + * interest in that whatsoever, we provide these functions.
> + *
> + * The caller knows better which flags it relies on.
> + */
> +void __free_accounted_pages(struct page *page, unsigned int order)
> +{
> + memcg_kmem_free_page(page, order);
> + __free_pages(page, order);
> +}
> +EXPORT_SYMBOL(__free_accounted_pages);
> +
> +void free_accounted_pages(unsigned long addr, unsigned int order)
> +{
> + if (addr != 0) {
> + VM_BUG_ON(!virt_addr_valid((void *)addr));
> + memcg_kmem_free_page(virt_to_page((void *)addr), order);
> + __free_pages(virt_to_page((void *)addr), order);
> + }
> +}
> +EXPORT_SYMBOL(free_accounted_pages);
> +
>   static void *make_alloc_exact(unsigned long addr, unsigned order, size_t size)
>   {
>   if (addr) {
> 


Re: [PATCH v2 07/11] mm: Allocate kernel pages to the right memcg

2012-08-09 Thread Glauber Costa
On 08/09/2012 08:33 PM, Greg Thelen wrote:
> On Thu, Aug 09 2012, Glauber Costa wrote:
> 
>> When a process tries to allocate a page with the __GFP_KMEMCG flag, the
>> page allocator will call the corresponding memcg functions to validate
>> the allocation. Tasks in the root memcg can always proceed.
>>
>> To avoid adding markers to the page - and a kmem flag that would
>> necessarily follow, as much as doing page_cgroup lookups for no reason,
>> whoever is marking its allocations with __GFP_KMEMCG flag is responsible
>> for telling the page allocator that this is such an allocation at
>> free_pages() time. This is done by the invocation of
>> __free_accounted_pages() and free_accounted_pages().
>>
>> Signed-off-by: Glauber Costa glom...@parallels.com
>> CC: Christoph Lameter c...@linux.com
>> CC: Pekka Enberg penb...@cs.helsinki.fi
>> CC: Michal Hocko mho...@suse.cz
>> CC: Kamezawa Hiroyuki kamezawa.hir...@jp.fujitsu.com
>> CC: Johannes Weiner han...@cmpxchg.org
>> CC: Suleiman Souhlal sulei...@google.com
>> ---
>>  include/linux/gfp.h |  3 +++
>>  mm/page_alloc.c | 38 ++
>>  2 files changed, 41 insertions(+)
>>
>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
>> index d8eae4d..029570f 100644
>> --- a/include/linux/gfp.h
>> +++ b/include/linux/gfp.h
>> @@ -370,6 +370,9 @@ extern void free_pages(unsigned long addr, unsigned int order);
>>  extern void free_hot_cold_page(struct page *page, int cold);
>>  extern void free_hot_cold_page_list(struct list_head *list, int cold);
>>  
>> +extern void __free_accounted_pages(struct page *page, unsigned int order);
>> +extern void free_accounted_pages(unsigned long addr, unsigned int order);
>> +
>>  #define __free_page(page) __free_pages((page), 0)
>>  #define free_page(addr) free_pages((addr), 0)
>>  
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index b956cec..da341dc 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -2532,6 +2532,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
>>  struct page *page = NULL;
>>  int migratetype = allocflags_to_migratetype(gfp_mask);
>>  unsigned int cpuset_mems_cookie;
>> +void *handle = NULL;
>>  
>>  gfp_mask &= gfp_allowed_mask;
>>  
>> @@ -2543,6 +2544,13 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
>>  return NULL;
>>  
>>  /*
>> + * Will only have any effect when __GFP_KMEMCG is set.
>> + * This is verified in the (always inline) callee
>> + */
>> +if (!memcg_kmem_new_page(gfp_mask, &handle, order))
>> +return NULL;
>> +
>> +/*
>>   * Check the zones suitable for the gfp_mask contain at least one
>>   * valid zone. It's possible to have an empty zonelist as a result
>>   * of GFP_THISNODE and a memoryless node
> 
> If memcg_kmem_new_page() succeeds then it may have obtained a memcg
> reference with mem_cgroup_get().  I think this reference is leaked when
> returning below:
> 
>   /*
>* Check the zones suitable for the gfp_mask contain at least one
>* valid zone. It's possible to have an empty zonelist as a result
>* of GFP_THISNODE and a memoryless node
>*/
>   if (unlikely(!zonelist->_zonerefs->zone))
>   return NULL;
> 
> I suspect the easiest fix is to swap the call to memcg_kmem_new_page()
> and the (!zonelist->_zonerefs->zone) check.
> 
You are right, indeed.
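
Concretely, the suggested swap would make the entry of
__alloc_pages_nodemask look something like this (a sketch against the
hunk quoted above, untested):

	/*
	 * Check the zones suitable for the gfp_mask contain at least one
	 * valid zone. It's possible to have an empty zonelist as a result
	 * of GFP_THISNODE and a memoryless node
	 */
	if (unlikely(!zonelist->_zonerefs->zone))
		return NULL;

	/*
	 * Will only have any effect when __GFP_KMEMCG is set.
	 * This is verified in the (always inline) callee. Charging after
	 * the zonelist check means this early return no longer holds a
	 * memcg reference that would have to be dropped.
	 */
	if (!memcg_kmem_new_page(gfp_mask, &handle, order))
		return NULL;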



Re: [PATCH v2 07/11] mm: Allocate kernel pages to the right memcg

2012-08-09 Thread Greg Thelen
On Thu, Aug 09 2012, Glauber Costa wrote:

> When a process tries to allocate a page with the __GFP_KMEMCG flag, the
> page allocator will call the corresponding memcg functions to validate
> the allocation. Tasks in the root memcg can always proceed.
>
> To avoid adding markers to the page - and a kmem flag that would
> necessarily follow, as much as doing page_cgroup lookups for no reason,
> whoever is marking its allocations with __GFP_KMEMCG flag is responsible
> for telling the page allocator that this is such an allocation at
> free_pages() time. This is done by the invocation of
> __free_accounted_pages() and free_accounted_pages().
>
> Signed-off-by: Glauber Costa glom...@parallels.com
> CC: Christoph Lameter c...@linux.com
> CC: Pekka Enberg penb...@cs.helsinki.fi
> CC: Michal Hocko mho...@suse.cz
> CC: Kamezawa Hiroyuki kamezawa.hir...@jp.fujitsu.com
> CC: Johannes Weiner han...@cmpxchg.org
> CC: Suleiman Souhlal sulei...@google.com
> ---
>  include/linux/gfp.h |  3 +++
>  mm/page_alloc.c | 38 ++
>  2 files changed, 41 insertions(+)
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index d8eae4d..029570f 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -370,6 +370,9 @@ extern void free_pages(unsigned long addr, unsigned int order);
>  extern void free_hot_cold_page(struct page *page, int cold);
>  extern void free_hot_cold_page_list(struct list_head *list, int cold);
>  
> +extern void __free_accounted_pages(struct page *page, unsigned int order);
> +extern void free_accounted_pages(unsigned long addr, unsigned int order);
> +
>  #define __free_page(page) __free_pages((page), 0)
>  #define free_page(addr) free_pages((addr), 0)
>  
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index b956cec..da341dc 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2532,6 +2532,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
>   struct page *page = NULL;
>   int migratetype = allocflags_to_migratetype(gfp_mask);
>   unsigned int cpuset_mems_cookie;
> + void *handle = NULL;
>  
>   gfp_mask &= gfp_allowed_mask;
>  
> @@ -2543,6 +2544,13 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
>   return NULL;
>  
>   /*
> +  * Will only have any effect when __GFP_KMEMCG is set.
> +  * This is verified in the (always inline) callee
> +  */
> + if (!memcg_kmem_new_page(gfp_mask, &handle, order))
> + return NULL;
> +
> + /*
>* Check the zones suitable for the gfp_mask contain at least one
>* valid zone. It's possible to have an empty zonelist as a result
>* of GFP_THISNODE and a memoryless node

If memcg_kmem_new_page() succeeds then it may have obtained a memcg
reference with mem_cgroup_get().  I think this reference is leaked when
returning below:

/*
 * Check the zones suitable for the gfp_mask contain at least one
 * valid zone. It's possible to have an empty zonelist as a result
 * of GFP_THISNODE and a memoryless node
 */
if (unlikely(!zonelist->_zonerefs->zone))
return NULL;

I suspect the easiest fix is to swap the call to memcg_kmem_new_page()
and the (!zonelist->_zonerefs->zone) check.


[PATCH v2 07/11] mm: Allocate kernel pages to the right memcg

2012-08-09 Thread Glauber Costa
When a process tries to allocate a page with the __GFP_KMEMCG flag, the
page allocator will call the corresponding memcg functions to validate
the allocation. Tasks in the root memcg can always proceed.

To avoid adding markers to the page - and a kmem flag that would
necessarily follow, as much as doing page_cgroup lookups for no reason,
whoever is marking its allocations with __GFP_KMEMCG flag is responsible
for telling the page allocator that this is such an allocation at
free_pages() time. This is done by the invocation of
__free_accounted_pages() and free_accounted_pages().
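
To illustrate the intended calling convention (a hypothetical caller;
the function names below are made up for the example and are not part
of this series):

	/* Allocate a buffer charged to the current task's memcg. */
	static void *kmemcg_buf_alloc(unsigned int order)
	{
		return (void *)__get_free_pages(GFP_KERNEL | __GFP_KMEMCG, order);
	}

	static void kmemcg_buf_free(void *buf, unsigned int order)
	{
		/* A plain free_pages() here would leave the memcg charge behind. */
		free_accounted_pages((unsigned long)buf, order);
	}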

Signed-off-by: Glauber Costa glom...@parallels.com
CC: Christoph Lameter c...@linux.com
CC: Pekka Enberg penb...@cs.helsinki.fi
CC: Michal Hocko mho...@suse.cz
CC: Kamezawa Hiroyuki kamezawa.hir...@jp.fujitsu.com
CC: Johannes Weiner han...@cmpxchg.org
CC: Suleiman Souhlal sulei...@google.com
---
 include/linux/gfp.h |  3 +++
 mm/page_alloc.c | 38 ++
 2 files changed, 41 insertions(+)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index d8eae4d..029570f 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -370,6 +370,9 @@ extern void free_pages(unsigned long addr, unsigned int order);
 extern void free_hot_cold_page(struct page *page, int cold);
 extern void free_hot_cold_page_list(struct list_head *list, int cold);
 
+extern void __free_accounted_pages(struct page *page, unsigned int order);
+extern void free_accounted_pages(unsigned long addr, unsigned int order);
+
 #define __free_page(page) __free_pages((page), 0)
 #define free_page(addr) free_pages((addr), 0)
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b956cec..da341dc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2532,6 +2532,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
struct page *page = NULL;
int migratetype = allocflags_to_migratetype(gfp_mask);
unsigned int cpuset_mems_cookie;
+   void *handle = NULL;
 
gfp_mask &= gfp_allowed_mask;
 
@@ -2543,6 +2544,13 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
return NULL;
 
/*
+* Will only have any effect when __GFP_KMEMCG is set.
+* This is verified in the (always inline) callee
+*/
+   if (!memcg_kmem_new_page(gfp_mask, &handle, order))
+   return NULL;
+
+   /*
 * Check the zones suitable for the gfp_mask contain at least one
 * valid zone. It's possible to have an empty zonelist as a result
 * of GFP_THISNODE and a memoryless node
@@ -2583,6 +2591,8 @@ out:
if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
goto retry_cpuset;
 
+   memcg_kmem_commit_page(page, handle, order);
+
return page;
 }
 EXPORT_SYMBOL(__alloc_pages_nodemask);
@@ -2635,6 +2645,34 @@ void free_pages(unsigned long addr, unsigned int order)
 
 EXPORT_SYMBOL(free_pages);
 
+/*
+ * __free_accounted_pages and free_accounted_pages will free pages allocated
+ * with __GFP_KMEMCG.
+ *
+ * Those pages are accounted to a particular memcg, embedded in the
+ * corresponding page_cgroup. To avoid adding a hit in the allocator to search
+ * for that information only to find out that it is NULL for users who have no
+ * interest in that whatsoever, we provide these functions.
+ *
+ * The caller knows better which flags it relies on.
+ */
+void __free_accounted_pages(struct page *page, unsigned int order)
+{
+   memcg_kmem_free_page(page, order);
+   __free_pages(page, order);
+}
+EXPORT_SYMBOL(__free_accounted_pages);
+
+void free_accounted_pages(unsigned long addr, unsigned int order)
+{
+   if (addr != 0) {
+   VM_BUG_ON(!virt_addr_valid((void *)addr));
+   memcg_kmem_free_page(virt_to_page((void *)addr), order);
+   __free_pages(virt_to_page((void *)addr), order);
+   }
+}
+EXPORT_SYMBOL(free_accounted_pages);
+
 static void *make_alloc_exact(unsigned long addr, unsigned order, size_t size)
 {
if (addr) {
-- 
1.7.11.2
