Re: [PATCH v2] Make transparent hugepages cpuset aware

2013-07-29 Thread Alex Thorlton
We've re-evaluated the need for a patch to support finer-grained control
over thp and, based on tests performed by our benchmarking team, we've
confirmed that we'd still like to implement some method to support this.
Here's an e-mail from John Baron (jba...@sgi.com), on our benchmarking
team, containing data which shows a decrease in performance for some
SPEC OMP benchmarks when thp is enabled:

> Here are results for SPEC OMP benchmarks on UV2 using 512 threads / 64
> sockets.  These show the performance ratio for jobs run with THP
> disabled versus THP enabled (so > 1.0 means THP disabled is faster).  
> One possible reason for lower performance with THP enabled is that the 
> larger page granularity can result in more remote data accesses.
> 
> 
> SPEC OMP2012:
> 
> 350.md          1.0
> 351.bwaves      1.3
> 352.nab         1.0
> 357.bt331       0.9
> 358.botsalgn    1.0
> 359.botsspar    1.1
> 360.ilbdc       1.8
> 362.fma3d       1.0
> 363.swim        1.4
> 367.imagick     0.9
> 370.mgrid331    1.1
> 371.applu331    0.9
> 372.smithwa     1.0
> 376.kdtree      1.0
> 
> SPEC OMPL2001:
> 
> 311.wupwise_l   1.1
> 313.swim_l      1.5
> 315.mgrid_l     1.0
> 317.applu_l     1.1
> 321.equake_l    5.8
> 325.apsi_l      1.5
> 327.gafort_l    1.0
> 329.fma3d_l     1.0
> 331.art_l       0.8
> 
> One could argue that real-world applications could be modified to avoid
> these kinds of effects, but (a) it is not always possible to modify code
> (e.g. in benchmark situations) and (b) even if it is possible to do so,
> it is not necessarily easy to do so (e.g. for customers with large
> legacy Fortran codes).
> 
> We have also observed on Intel Sandy Bridge processors that, as
> counter-intuitive as it may seem, local memory bandwidth is actually
> slightly lower with THP enabled (1-2%), even with unit stride data
> accesses.  This may not seem like much of a performance hit but
> it is important for HPC workloads.  No code modification will help here.

In light of the previous issues discussed in this thread, and some
suggestions from David Rientjes:

> why not make it per-process so users don't have to configure
> cpusets to control it?

Robin and I have come up with a proposal to replicate behavior similar
to what this patch introduced, but at the per-process level instead of
the cpuset level.

Our idea would be to add a flag somewhere in the task_struct to keep
track of whether or not thp is enabled for each task.  The flag would be
controlled by an additional prctl() option, allowing programmers to set
or unset it via the prctl() syscall.  We would also introduce some code
into the clone() syscall to ensure that this flag is copied down from
parent to child tasks when necessary.  The flag would be checked in the
same place the per-cpuset flag was checked in my original patch, thereby
allowing the same behavior to be replicated on a per-process level.

In this way, we will also be able to get static binaries to behave
appropriately: a small userland program can set the flag and then exec
the static binary for which we need to disable thp.
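
For illustration, a wrapper along these lines would do it.  Note that
PR_SET_THP_DISABLED and its value are placeholders for whatever prctl()
option we end up adding, so this is a sketch of the idea rather than
working code against today's kernels:

/*
 * nothp: set the (hypothetical) per-process "no THP" flag, then exec the
 * real program.  The flag survives exec, so the static binary runs with
 * thp disabled without being modified or relinked.
 */
#include <stdio.h>
#include <unistd.h>
#include <sys/prctl.h>

#ifndef PR_SET_THP_DISABLED
#define PR_SET_THP_DISABLED 47          /* placeholder value for illustration */
#endif

int main(int argc, char **argv)
{
        if (argc < 2) {
                fprintf(stderr, "usage: %s <program> [args...]\n", argv[0]);
                return 1;
        }

        if (prctl(PR_SET_THP_DISABLED, 1, 0, 0, 0) != 0)
                perror("prctl(PR_SET_THP_DISABLED)");

        execvp(argv[1], &argv[1]);
        perror("execvp");
        return 1;
}

Usage would simply be "./nothp ./static_binary <args>".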

This solution allows us to incorporate the behavior that we're looking
for into the kernel, without abusing cpusets for the purpose of
containerization.

Please let me know if anyone has any objections to this approach, or if
you have any suggestions as to how we could improve upon this idea.

Thanks!

- Alex


Re: [PATCH v2] Make transparent hugepages cpuset aware

2013-06-20 Thread David Rientjes
On Thu, 20 Jun 2013, Mike Galbraith wrote:

> > I'm suspecting that you're referring to enlarged rss because of 
> > khugepaged's max_ptes_none and because you're abusing the purpose of 
> > cpusets for containerization.
> 
> Why is containerization an abuse?  What's wrong with renting out chunks
> of a big box farm on the fly like a task motel?  If a realtime customer
> checks in, he's not gonna be thrilled about sharing a room.
> 

We "abused" cpusets for containerization for years; I'm not implying any
negative connotation to it.  See the
Documentation/x86/x86_64/fake-numa-for-cpusets document that I wrote.  That
doesn't mean we should be controlling thp through cpusets, though; if he's
complaining about static binaries where he can't use a malloc hook, then
why not make it per-process so users don't have to configure cpusets to
control it?


Re: [PATCH v2] Make transparent hugepages cpuset aware

2013-06-19 Thread Li Zefan
Cc: Tejun, and cgroup ML

>> Here are the entries in the cpuset:
>> cgroup.event_control  mem_exclusive    memory_pressure_enabled  notify_on_release         tasks
>> cgroup.procs          mem_hardwall     memory_spread_page       release_agent
>> cpu_exclusive         memory_migrate   memory_spread_slab       sched_load_balance
>> cpus                  memory_pressure  mems                     sched_relax_domain_level
>>
>> There are scheduler, slab allocator, page_cache layout, etc controls.
> 
> I think this is mostly for historical reasons since cpusets were 
> introduced before cgroups.
> 
>> Why _NOT_ add a thp control to that nicely contained central location?
>> It is a concise set of controls for the job.
>>
> 
> All of the above seem to be for cpusets' primary purpose, i.e. NUMA 
> optimizations.  It has nothing to do with transparent hugepages.  (I'm not 
> saying thp has anything to do with memcg either, but a "memory controller" 
> seems more appropriate for controlling thp behavior.)
> 
>> Maybe I am misunderstanding.  Are you saying you want to put memcg
>> information into the cpuset or something like that?
>>
> 
> I'm saying there's absolutely no reason to have thp controlled by a 
> cpuset, or ANY cgroup for that matter, since you chose not to respond to 
> the question I asked: why do you want to control thp behavior for certain 
> static binaries and not others?  Where is the performance regression or 
> the downside?  Is it because of max_ptes_none for certain jobs blowing up 
> the rss?  We need information, and even if it were justifiable then it 
> wouldn't have anything to do with ANY cgroup but rather a per-process 
> control.  It has nothing to do with cpusets whatsoever.
> 
> (And I'm very curious why you didn't even cc the cpusets maintainer on 
> this patch in the first place who would probably say the same thing.)

I don't know whom you were referring to here. It's Paul Jackson who invented
cpusets, and then Paul Menage took over the maintainership, but he wasn't
doing much maintainer work. Now it's me and Tejun maintaining cpusets.
(Long ago Ingo requested that cpuset patches be cc'ed to him and Peter.)

Back to this patch: I'm definitely on your side. This feature doesn't
interact with existing cpuset features, and it doesn't need anything
that cpusets provide. In short, it has nothing to do with cpusets, hence
it shouldn't belong to cpusets.

We're cleaning up all the messes in cgroups, and this patch moves in the
opposite direction.



Re: [PATCH v2] Make transparent hugepages cpuset aware

2013-06-19 Thread Mike Galbraith
On Wed, 2013-06-19 at 19:43 -0700, David Rientjes wrote: 
>   
> I'm suspecting that you're referring to enlarged rss because of 
> khugepaged's max_ptes_none and because you're abusing the purpose of 
> cpusets for containerization.

Why is containerization an abuse?  What's wrong with renting out chunks
of a big box farm on the fly like a task motel?  If a realtime customer
checks in, he's not gonna be thrilled about sharing a room.

-Mike



Re: [PATCH v2] Make transparent hugepages cpuset aware

2013-06-19 Thread David Rientjes
On Wed, 19 Jun 2013, Robin Holt wrote:

> cpusets was not for NUMA.  It has no preference for "nodes" or anything like
> that.  It was for splitting a machine into layered smaller groups.  Usually,
> we see one cpuset which contains the batch scheduler.  The batch scheduler then
> creates cpusets for jobs it starts.  Has nothing to do with nodes.  That is
> more an administrator issue.  They set the minimum grouping of resources
> for scheduled jobs.
> 

I disagree with all of the above: it's not what Paul Jackson developed 
cpusets for, it's not what he wrote in Documentation/cgroups/cpusets.txt, 
and it's not why libnuma immediately supported it.  Cpusets is for NUMA, 
like it or not.

> > I'm saying there's absolutely no reason to have thp controlled by a 
> > cpuset, or ANY cgroup for that matter, since you chose not to respond to 
> > the question I asked: why do you want to control thp behavior for certain 
> > static binaries and not others?  Where is the performance regression or 
> > the downside?  Is it because of max_ptes_none for certain jobs blowing up 
> > the rss?  We need information, and even if it were justifiable then it 
> > wouldn't have anything to do with ANY cgroup but rather a per-process 
> > control.  It has nothing to do with cpusets whatsoever.
> 
> It was a request from our benchmarking group that has found some jobs
> benefit from thp, while others are harmed.  Let me ask them for more
> details.
> 

Yes, please, because if some jobs are harmed by thp then we need to fix 
that regression and not paper over it with some cpuset-based 
solution.  People should be able to run with CONFIG_TRANSPARENT_HUGEPAGE 
enabled and not be required to enable CONFIG_CPUSETS for optimal behavior.  
I'm suspecting that you're referring to enlarged rss because of 
khugepaged's max_ptes_none and because you're abusing the purpose of 
cpusets for containerization.


Re: [PATCH v2] Make transparent hugepages cpuset aware

2013-06-19 Thread Robin Holt
On Wed, Jun 19, 2013 at 02:24:07PM -0700, David Rientjes wrote:
> On Wed, 19 Jun 2013, Robin Holt wrote:
> 
> > The convenience being that many batch schedulers have added cpuset
> > support.  They create the cpusets and configure them as appropriate
> > for the job as determined by a mixture of input from the submitting
> > user but still under the control of the administrator.  That seems like
> > a fairly significant convenience given that it took years to get the
> > batch schedulers to adopt cpusets in the first place.  At this point,
> > expanding their use of cpusets is under the control of the system
> > administrator and would not require any additional development on
> > the batch scheduler developers' part.
> > 
> 
> You can't say the same for memcg?

I am not aware of batch scheduler support for memory controllers.
The request came from our benchmarking group.

> > Here are the entries in the cpuset:
> > cgroup.event_control  mem_exclusive    memory_pressure_enabled  notify_on_release         tasks
> > cgroup.procs          mem_hardwall     memory_spread_page       release_agent
> > cpu_exclusive         memory_migrate   memory_spread_slab       sched_load_balance
> > cpus                  memory_pressure  mems                     sched_relax_domain_level
> > 
> > There are scheduler, slab allocator, page_cache layout, etc controls.
> 
> I think this is mostly for historical reasons since cpusets were 
> introduced before cgroups.
> 
> > Why _NOT_ add a thp control to that nicely contained central location?
> > It is a concise set of controls for the job.
> > 
> 
> All of the above seem to be for cpusets' primary purpose, i.e. NUMA 
> optimizations.  It has nothing to do with transparent hugepages.  (I'm not 
> saying thp has anything to do with memcg either, but a "memory controller" 
> seems more appropriate for controlling thp behavior.)

cpusets was not for NUMA.  It has no preference for "nodes" or anything like
that.  It was for splitting a machine into layered smaller groups.  Usually,
we see one cpuset which contains the batch scheduler.  The batch scheduler then
creates cpusets for jobs it starts.  Has nothing to do with nodes.  That is
more an administrator issue.  They set the minimum grouping of resources
for scheduled jobs.

> > Maybe I am misunderstanding.  Are you saying you want to put memcg
> > information into the cpuset or something like that?
> > 
> 
> I'm saying there's absolutely no reason to have thp controlled by a 
> cpuset, or ANY cgroup for that matter, since you chose not to respond to 
> the question I asked: why do you want to control thp behavior for certain 
> static binaries and not others?  Where is the performance regression or 
> the downside?  Is it because of max_ptes_none for certain jobs blowing up 
> the rss?  We need information, and even if it were justifiable then it 
> wouldn't have anything to do with ANY cgroup but rather a per-process 
> control.  It has nothing to do with cpusets whatsoever.

It was a request from our benchmarking group that has found some jobs
benefit from thp, while others are harmed.  Let me ask them for more
details.

> (And I'm very curious why you didn't even cc the cpusets maintainer on 
> this patch in the first place who would probably say the same thing.)

I didn't know there was a cpuset maintainer.  Paul Jackson (SGI retired)
had originally worked to get cpusets introduced and then converted to
use cgroups.  I had never known there was a maintainer after him.  Sorry
for that.

Robin


Re: [PATCH v2] Make transparent hugepages cpuset aware

2013-06-19 Thread David Rientjes
On Wed, 19 Jun 2013, Robin Holt wrote:

> The convenience being that many batch schedulers have added cpuset
> support.  They create the cpusets and configure them as appropriate
> for the job as determined by a mixture of input from the submitting
> user but still under the control of the administrator.  That seems like
> a fairly significant convenience given that it took years to get the
> batch schedulers to adopt cpusets in the first place.  At this point,
> expanding their use of cpusets is under the control of the system
> administrator and would not require any additional development on
> the batch scheduler developers' part.
> 

You can't say the same for memcg?

> Here are the entries in the cpuset:
> cgroup.event_control  mem_exclusive    memory_pressure_enabled  notify_on_release         tasks
> cgroup.procs          mem_hardwall     memory_spread_page       release_agent
> cpu_exclusive         memory_migrate   memory_spread_slab       sched_load_balance
> cpus                  memory_pressure  mems                     sched_relax_domain_level
> 
> There are scheduler, slab allocator, page_cache layout, etc controls.

I think this is mostly for historical reasons since cpusets were 
introduced before cgroups.

> Why _NOT_ add a thp control to that nicely contained central location?
> It is a concise set of controls for the job.
> 

All of the above seem to be for cpusets' primary purpose, i.e. NUMA 
optimizations.  It has nothing to do with transparent hugepages.  (I'm not 
saying thp has anything to do with memcg either, but a "memory controller" 
seems more appropriate for controlling thp behavior.)

> Maybe I am misunderstanding.  Are you saying you want to put memcg
> information into the cpuset or something like that?
> 

I'm saying there's absolutely no reason to have thp controlled by a 
cpuset, or ANY cgroup for that matter, since you chose not to respond to 
the question I asked: why do you want to control thp behavior for certain 
static binaries and not others?  Where is the performance regression or 
the downside?  Is it because of max_ptes_none for certain jobs blowing up 
the rss?  We need information, and even if it were justifiable then it 
wouldn't have anything to do with ANY cgroup but rather a per-process 
control.  It has nothing to do with cpusets whatsoever.

(And I'm very curious why you didn't even cc the cpusets maintainer on 
this patch in the first place who would probably say the same thing.)


Re: [PATCH v2] Make transparent hugepages cpuset aware

2013-06-19 Thread Robin Holt
On Tue, Jun 18, 2013 at 05:01:23PM -0700, David Rientjes wrote:
> On Tue, 18 Jun 2013, Alex Thorlton wrote:
> 
> > Thanks for your input; however, I believe the method of using a malloc
> > hook falls apart when it comes to static binaries, since we won't have
> > any shared libraries to hook into.  Although using a malloc hook is a
> > perfectly suitable solution for most cases, we're looking to implement a
> > solution that can be used in all situations.
> > 
> 
> I guess the question would be why you don't want your malloc memory backed 
> by thp pages for certain static binaries and not others?  Is it because of 
> an increased rss due to khugepaged collapsing memory because of its 
> default max_ptes_none value?
> 
> > Aside from that particular shortcoming of the malloc hook solution,
> > there are some other situations where having a cpuset-based option is a
> > much simpler and more efficient solution than the alternatives.
> 
> Sure, but why should this be a cpuset based solution?  What is special 
> about cpusets that makes certain statically allocated binaries not want 
> memory backed by thp while others do?  This still seems based solely on 
> convenience instead of any hard requirement.

The convenience being that many batch schedulers have added cpuset
support.  They create the cpusets and configure them as appropriate
for the job as determined by a mixture of input from the submitting
user but still under the control of the administrator.  That seems like
a fairly significant convenience given that it took years to get the
batch schedulers to adopt cpusets in the first place.  At this point,
expanding their use of cpusets is under the control of the system
administrator and would not require any additional development on
the batch scheduler developers' part.

> > One
> > such situation that comes to mind would be an environment where a batch
> > scheduler is in use to ration system resources.  If an administrator
> > determines that a user's jobs run more efficiently with thp always on,
> > the administrator can simply set the user's jobs to always run with that
> > setting, instead of having to coordinate with that user to get them to
> > run their jobs in a different way.  I feel that, for cases such as this,
> > this additional flag is in line with the other capabilities that
> > cgroups and cpusets provide.
> > 
> 
> That sounds like a memcg, i.e. container, type of an issue, not a cpuset 
> issue which is more geared toward NUMA optimizations.  User jobs should 
> always run more efficiently with thp always on, the worst-case scenario 
> should be if they run with the same performance as thp set to never.  In 
> other words, there shouldn't be any regression that requires certain 
> cpusets to disable thp because of a performance regression.  If there are 
> any, we'd like to investigate that separately from this patch.

Here are the entries in the cpuset:
cgroup.event_control  mem_exclusive    memory_pressure_enabled  notify_on_release         tasks
cgroup.procs          mem_hardwall     memory_spread_page       release_agent
cpu_exclusive         memory_migrate   memory_spread_slab       sched_load_balance
cpus                  memory_pressure  mems                     sched_relax_domain_level

There are scheduler, slab allocator, page_cache layout, etc controls.
Why _NOT_ add a thp control to that nicely contained central location?
It is a concise set of controls for the job.

Maybe I am misunderstanding.  Are you saying you want to put memcg
information into the cpuset or something like that?

Robin


Re: [PATCH v2] Make transparent hugepages cpuset aware

2013-06-18 Thread David Rientjes
On Tue, 18 Jun 2013, Alex Thorlton wrote:

> Thanks for your input; however, I believe the method of using a malloc
> hook falls apart when it comes to static binaries, since we won't have
> any shared libraries to hook into.  Although using a malloc hook is a
> perfectly suitable solution for most cases, we're looking to implement a
> solution that can be used in all situations.
> 

I guess the question would be why you don't want your malloc memory backed 
by thp pages for certain static binaries and not others?  Is it because of 
an increased rss due to khugepaged collapsing memory because of its 
default max_ptes_none value?

> Aside from that particular shortcoming of the malloc hook solution,
> there are some other situations where having a cpuset-based option is a
> much simpler and more efficient solution than the alternatives.

Sure, but why should this be a cpuset based solution?  What is special 
about cpusets that makes certain statically allocated binaries not want 
memory backed by thp while others do?  This still seems based solely on 
convenience instead of any hard requirement.

> One
> such situation that comes to mind would be an environment where a batch
> scheduler is in use to ration system resources.  If an administrator
> determines that a user's jobs run more efficiently with thp always on,
> the administrator can simply set the user's jobs to always run with that
> setting, instead of having to coordinate with that user to get them to
> run their jobs in a different way.  I feel that, for cases such as this,
> this additional flag is in line with the other capabilities that
> cgroups and cpusets provide.
> 

That sounds like a memcg, i.e. container, type of an issue, not a cpuset 
issue which is more geared toward NUMA optimizations.  User jobs should 
always run more efficiently with thp always on, the worst-case scenario 
should be if they run with the same performance as thp set to never.  In 
other words, there shouldn't be any regression that requires certain 
cpusets to disable thp because of a performance regression.  If there are 
any, we'd like to investigate that separately from this patch.


Re: [PATCH v2] Make transparent hugepages cpuset aware

2013-06-18 Thread Alex Thorlton
On Tue, Jun 11, 2013 at 03:20:09PM -0700, David Rientjes wrote:
> On Tue, 11 Jun 2013, Alex Thorlton wrote:
> 
> > This patch adds the ability to control THPs on a per cpuset basis.  Please
> > see the additions to Documentation/cgroups/cpusets.txt for more information.
> 
> What's missing from both this changelog and the documentation you point to
> is why this change is needed.
> 
> I can understand how you would want a subset of processes to not use thp
> when it is enabled.  This is typically where MADV_NOHUGEPAGE is used with
> some type of malloc hook.
> 
> I don't think we need to do this on a cpuset level, so unfortunately I
> think this needs to be reworked.  Would it make sense to add a per-process
> tunable to always get MADV_NOHUGEPAGE behavior for all of its sbrk() and
> mmap() calls?  Perhaps, but then you would need to justify why it can't be
> done with a malloc hook in userspace.
> 
> This seems to just be working around a userspace issue or for a matter of
> convenience, right?

David,

Thanks for your input; however, I believe the method of using a malloc
hook falls apart when it comes to static binaries, since we won't have
any shared libraries to hook into.  Although using a malloc hook is a
perfectly suitable solution for most cases, we're looking to implement a
solution that can be used in all situations.

Aside from that particular shortcoming of the malloc hook solution,
there are some other situations where having a cpuset-based option is a
much simpler and more efficient solution than the alternatives.  One
such situation that comes to mind would be an environment where a batch
scheduler is in use to ration system resources.  If an administrator
determines that a user's jobs run more efficiently with thp always on,
the administrator can simply set the user's jobs to always run with that
setting, instead of having to coordinate with that user to get them to
run their jobs in a different way.  I feel that, for cases such as this,
this additional flag is in line with the other capabilities that
cgroups and cpusets provide.

While there are likely numerous other situations where having a flag to
control thp on the cpuset level makes things a bit easier to manage, the
one real show-stopper here is that we really have no other options when
it comes to static binaries.

- Alex Thorlton


Re: [PATCH v2] Make transparent hugepages cpuset aware

2013-06-11 Thread David Rientjes
On Tue, 11 Jun 2013, Alex Thorlton wrote:

> This patch adds the ability to control THPs on a per cpuset basis.  Please see
> the additions to Documentation/cgroups/cpusets.txt for more information.
> 

What's missing from both this changelog and the documentation you point to 
is why this change is needed.

I can understand how you would want a subset of processes to not use thp 
when it is enabled.  This is typically where MADV_NOHUGEPAGE is used with 
some type of malloc hook.
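
To be concrete, such a hook could be as simple as an LD_PRELOAD shim along
these lines (an illustrative sketch only, not code from the tree, and it
glosses over the usual dlsym()/malloc bootstrap caveats):

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/mman.h>

/*
 * Forward to the real malloc(), then advise the fully-contained pages of
 * large blocks with MADV_NOHUGEPAGE so THP leaves them alone.
 */
void *malloc(size_t size)
{
        static void *(*real_malloc)(size_t);
        long psz = sysconf(_SC_PAGESIZE);
        void *p;

        if (!real_malloc)
                real_malloc = (void *(*)(size_t))dlsym(RTLD_NEXT, "malloc");

        p = real_malloc(size);

        if (p && size >= (2UL << 20)) { /* only bother with big blocks */
                uintptr_t start = ((uintptr_t)p + psz - 1) & ~((uintptr_t)psz - 1);
                uintptr_t end   = ((uintptr_t)p + size) & ~((uintptr_t)psz - 1);

                if (end > start)
                        madvise((void *)start, end - start, MADV_NOHUGEPAGE);
        }
        return p;
}

Build it with "gcc -shared -fPIC -o nothp.so nothp.c -ldl" and run the job
with LD_PRELOAD=./nothp.so, which is exactly the part that falls apart for
a statically linked binary.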

I don't think we need to do this on a cpuset level, so unfortunately I 
think this needs to be reworked.  Would it make sense to add a per-process 
tunable to always get MADV_NOHUGEPAGE behavior for all of its sbrk() and 
mmap() calls?  Perhaps, but then you would need to justify why it can't be 
done with a malloc hook in userspace.

This seems to just be working around a userspace issue or for a matter of 
convenience, right?


[PATCH v2] Make transparent hugepages cpuset aware

2013-06-11 Thread Alex Thorlton
This patch adds the ability to control THPs on a per cpuset basis.  Please see
the additions to Documentation/cgroups/cpusets.txt for more information.

Signed-off-by: Alex Thorlton <athorl...@sgi.com>
Reviewed-by: Robin Holt <h...@sgi.com>
Cc: Li Zefan <lize...@huawei.com>
Cc: Rob Landley <r...@landley.net>
Cc: Andrew Morton <a...@linux-foundation.org>
Cc: Mel Gorman <mgor...@suse.de>
Cc: Rik van Riel <r...@redhat.com>
Cc: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
Cc: Johannes Weiner <han...@cmpxchg.org>
Cc: Xiao Guangrong <xiaoguangr...@linux.vnet.ibm.com>
Cc: David Rientjes <rient...@google.com>
Cc: linux-...@vger.kernel.org
Cc: linux...@kvack.org
---
Changes since last patch version:
- Modified transparent_hugepage_enable to always check the vma for the
  VM_NOHUGEPAGE flag and to always check is_vma_temporary_stack.
- Moved cpuset_update_child_thp_flags above cpuset_update_top_thp_flags

 Documentation/cgroups/cpusets.txt |  50 ++-
 include/linux/cpuset.h|   5 ++
 include/linux/huge_mm.h   |  27 +-
 kernel/cpuset.c   | 181 ++
 mm/huge_memory.c  |   3 +
 5 files changed, 263 insertions(+), 3 deletions(-)

diff --git a/Documentation/cgroups/cpusets.txt b/Documentation/cgroups/cpusets.txt
index 12e01d4..b7b2c83 100644
--- a/Documentation/cgroups/cpusets.txt
+++ b/Documentation/cgroups/cpusets.txt
@@ -22,12 +22,14 @@ CONTENTS:
   1.6 What is memory spread ?
   1.7 What is sched_load_balance ?
   1.8 What is sched_relax_domain_level ?
-  1.9 How do I use cpusets ?
+  1.9 What is thp_enabled ?
+  1.10 How do I use cpusets ?
 2. Usage Examples and Syntax
   2.1 Basic Usage
   2.2 Adding/removing cpus
   2.3 Setting flags
   2.4 Attaching processes
+  2.5 Setting thp_enabled flags
 3. Questions
 4. Contact
 
@@ -581,7 +583,34 @@ If your situation is:
 then increasing 'sched_relax_domain_level' would benefit you.
 
 
-1.9 How do I use cpusets ?
+1.9 What is thp_enabled ?
+---
+
+The thp_enabled file contained within each cpuset controls how transparent
+hugepages are handled within that cpuset.
+
+The root cpuset's thp_enabled flags mirror the flags set in
+/sys/kernel/mm/transparent_hugepage/enabled.  The flags in the root cpuset can
+only be modified by changing /sys/kernel/mm/transparent_hugepage/enabled. The
+thp_enabled file for the root cpuset is read only.  These flags cause the
+root cpuset to behave as one might expect:
+
+- When set to always, THPs are used whenever practical
+- When set to madvise, THPs are used only on chunks of memory that have the
+  MADV_HUGEPAGE flag set
+- When set to never, THPs are never allowed for tasks in this cpuset
+
+The behavior of thp_enabled for children of the root cpuset is where things
+become a bit more interesting.  The child cpusets accept the same flags as the
+root, but also have a default flag, which, when set, causes a cpuset to use the
+behavior of its parent.  When a child cpuset is created, its default flag is
+always initially set.
+
+Since the flags on child cpusets are allowed to differ from the flags on their
+parents, we are able to enable THPs for tasks in specific cpusets, and disable
+them in others.
+
+1.10 How do I use cpusets ?
 --
 
 In order to minimize the impact of cpusets on critical kernel
@@ -733,6 +762,7 @@ cpuset.cpus            cpuset.sched_load_balance
 cpuset.mem_exclusive   cpuset.sched_relax_domain_level
 cpuset.mem_hardwall    notify_on_release
 cpuset.memory_migrate  tasks
+thp_enabled
 
 Reading them will give you information about the state of this cpuset:
 the CPUs and Memory Nodes it can use, the processes that are using
@@ -814,6 +844,22 @@ If you have several tasks to attach, you have to do it one after another:
...
 # /bin/echo PIDn > tasks
 
+2.5 Setting thp_enabled flags
+-
+
+The syntax for setting these flags is similar to setting thp flags in
+/sys/kernel/mm/transparent_hugepage/enabled.  In order to change the flags you
+simply echo the name of the flag you want to set to the thp_enabled file of the
+desired cpuset:
+
+# /bin/echo always > thp_enabled   -> always use THPs when practical
+# /bin/echo madvise > thp_enabled  -> only use THPs in madvise sections
+# /bin/echo never > thp_enabled-> never use THPs
+# /bin/echo default > thp_enabled  -> use parent cpuset's THP flags
+
+Note that the flags on the root cpuset cannot be changed in /dev/cpuset.  These
+flags are mirrored from /sys/kernel/mm/transparent_hugepage/enabled and can only
+be modified there.
 
 3. Questions
 
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index cc1b01c..624aafd 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -19,9 +19,12 @@ extern int number_of_cpusets;  /* How many cpusets are defined in system? */
 
 extern int cpuset_init(void);
 extern void cpuset_init_smp(void);
+extern void cpuset_update_top_thp_flags(void);
 extern void cpuset_update_active_cpus(bool cpu_online);
 extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
 extern void 
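
As a quick illustration of the interface documented in sections 1.9 and 2.5 above, here is a minimal, hypothetical C sketch of how a userland tool might drive the thp_enabled file.  It assumes the cpuset filesystem is mounted at /dev/cpuset and invents a child cpuset named "nothp"; only the flag strings (always, madvise, never, default) come from the patch itself.

/*
 * Hypothetical sketch only: assumes the cpuset filesystem is mounted at
 * /dev/cpuset and that this patch's thp_enabled file exists.  The child
 * cpuset name "nothp" and the cpu/node values are made up for the example.
 */
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

static int write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return -1;
	}
	fputs(val, f);
	return fclose(f);
}

int main(void)
{
	char pid[32];

	/* New child cpusets come up with the "default" flag set (section 1.9). */
	if (mkdir("/dev/cpuset/nothp", 0755) && errno != EEXIST)
		perror("mkdir");

	/*
	 * A cpuset needs cpus and mems before tasks can be attached; the file
	 * names may be cpuset.cpus/cpuset.mems depending on mount options.
	 */
	write_str("/dev/cpuset/nothp/cpus", "0");
	write_str("/dev/cpuset/nothp/mems", "0");

	/* Disable THP for every task in this cpuset (section 2.5). */
	write_str("/dev/cpuset/nothp/thp_enabled", "never");

	/* Move the current process into the cpuset. */
	snprintf(pid, sizeof(pid), "%d", getpid());
	return write_str("/dev/cpuset/nothp/tasks", pid);
}

Since a freshly created child starts with the default flag, the write of "never" above is what actually makes its behavior diverge from the parent cpuset.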

Re: [PATCH v2] Make transparent hugepages cpuset aware

2013-06-11 Thread David Rientjes
On Tue, 11 Jun 2013, Alex Thorlton wrote:

> This patch adds the ability to control THPs on a per cpuset basis.  Please see
> the additions to Documentation/cgroups/cpusets.txt for more information.

What's missing from both this changelog and the documentation you point to 
is why this change is needed.

I can understand how you would want a subset of processes to not use thp 
when it is enabled.  This is typically where MADV_NOHUGEPAGE is used with 
some type of malloc hook.

I don't think we need to do this on a cpuset level, so unfortunately I 
think this needs to be reworked.  Would it make sense to add a per-process 
tunable to always get MADV_NOHUGEPAGE behavior for all of its sbrk() and 
mmap() calls?  Perhaps, but then you would need to justify why it can't be 
done with a malloc hook in userspace.

This seems to just be working around a userspace issue, or to be a matter of 
convenience, right?
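
For reference, here is a minimal, hypothetical sketch of the userspace alternative described above: allocate through a wrapper that maps a region and marks it MADV_NOHUGEPAGE, so the kernel never backs it with transparent hugepages.  The helper name is made up, and a real deployment would more likely hook the allocator (for example via an LD_PRELOAD shim) than call a bespoke function.

/*
 * Illustrative only: a wrapper allocation that opts a region out of THP
 * using MADV_NOHUGEPAGE.  A real setup would wire this into the allocator
 * (for example through an LD_PRELOAD shim); the helper name is made up.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

static void *alloc_no_thp(size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return NULL;

	/* Ask the kernel not to back this range with transparent hugepages. */
	if (madvise(p, len, MADV_NOHUGEPAGE))
		perror("madvise(MADV_NOHUGEPAGE)");

	return p;
}

int main(void)
{
	size_t len = 64UL << 20;	/* 64 MB test region */
	void *buf = alloc_no_thp(len);

	if (!buf)
		return 1;
	memset(buf, 0, len);		/* fault the pages in as base pages */
	munmap(buf, len);
	return 0;
}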
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/