Re: [PATCH v2] Make transparent hugepages cpuset aware
We've re-evaluated the need for a patch supporting finer-grained control
over thp and, based on tests performed by our benchmarking team, we would
still like to implement some method to support this. Here's an e-mail from
John Baron (jba...@sgi.com), on our benchmarking team, containing data
which shows a decrease in performance for some SPEC OMP benchmarks when
thp is enabled:

> Here are results for SPEC OMP benchmarks on UV2 using 512 threads / 64
> sockets. These show the performance ratio for jobs run with THP
> disabled versus THP enabled (so > 1.0 means THP disabled is faster).
> One possible reason for lower performance with THP enabled is that the
> larger page granularity can result in more remote data accesses.
>
> SPEC OMP2012:
>
> 350.md        1.0
> 351.bwaves    1.3
> 352.nab       1.0
> 357.bt331     0.9
> 358.botsalgn  1.0
> 359.botsspar  1.1
> 360.ilbdc     1.8
> 362.fma3d     1.0
> 363.swim      1.4
> 367.imagick   0.9
> 370.mgrid331  1.1
> 371.applu331  0.9
> 372.smithwa   1.0
> 376.kdtree    1.0
>
> SPEC OMPL2001:
>
> 311.wupwise_l 1.1
> 313.swim_l    1.5
> 315.mgrid_l   1.0
> 317.applu_l   1.1
> 321.equake_l  5.8
> 325.apsi_l    1.5
> 327.gafort_l  1.0
> 329.fma3d_l   1.0
> 331.art_l     0.8
>
> One could argue that real-world applications could be modified to avoid
> these kinds of effects, but (a) it is not always possible to modify code
> (e.g. in benchmark situations) and (b) even if it is possible to do so,
> it is not necessarily easy to do so (e.g. for customers with large
> legacy Fortran codes).
>
> We have also observed on Intel Sandy Bridge processors that, as
> counter-intuitive as it may seem, local memory bandwidth is actually
> slightly lower with THP enabled (1-2%), even with unit-stride data
> accesses. This may not seem like much of a performance hit, but it is
> important for HPC workloads. No code modification will help here.
In light of the previous issues discussed in this thread, and a suggestion
from David Rientjes:

> why not make it per-process so users don't have to configure
> cpusets to control it?

Robin and I have come up with a proposal to replicate behavior similar to
what this patch introduced, only at the per-process level instead of the
cpuset level. Our idea is to add a flag somewhere in the task_struct to
keep track of whether or not thp is enabled for each task. The flag would
be controlled by an additional option in prctl(), allowing programmers to
set/unset it via the prctl() syscall. We would also introduce some code
into the clone() syscall to ensure that this flag is copied down from
parent to child tasks when necessary.

The flag would be checked in the same place the per-cpuset flag was
checked in my original patch, thereby allowing the same behavior to be
replicated at the per-process level. In this way, we will also be able to
get static binaries to behave appropriately by setting this flag in a
userland program, and then having that program exec the static binary for
which we need to disable thp.

This solution allows us to incorporate the behavior that we're looking for
into the kernel, without abusing cpusets for the purpose of
containerization.

Please let me know if anyone has any objections to this approach, or if
you have any suggestions as to how we could improve upon this idea.
Thanks!

- Alex
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] Make transparent hugepages cpuset aware
On Thu, 20 Jun 2013, Mike Galbraith wrote:

> > I'm suspecting that you're referring to enlarged rss because of
> > khugepaged's max_ptes_none and because you're abusing the purpose of
> > cpusets for containerization.
>
> Why is containerization an abuse? What's wrong with renting out chunks
> of a big box farm on the fly like a task motel? If a realtime customer
> checks in, he's not gonna be thrilled about sharing a room.

We "abused" cpusets for containerization for years, and I'm not implying
any negative connotation to it; see the
Documentation/x86/x86_64/fake-numa-for-cpusets document that I wrote. It
doesn't suggest that we should be controlling thp through cpusets; if he's
complaining about static binaries where he can't use a malloc hook, then
why not make it per-process so users don't have to configure cpusets to
control it?
Re: [PATCH v2] Make transparent hugepages cpuset aware
Cc: Tejun, and cgroup ML

>> Here are the entries in the cpuset:
>>
>> cgroup.event_control  mem_exclusive    memory_pressure_enabled  notify_on_release   tasks
>> cgroup.procs          mem_hardwall     memory_spread_page       release_agent
>> cpu_exclusive         memory_migrate   memory_spread_slab       sched_load_balance
>> cpus                  memory_pressure  mems                     sched_relax_domain_level
>>
>> There are scheduler, slab allocator, page_cache layout, etc. controls.
>
> I think this is mostly for historical reasons since cpusets were
> introduced before cgroups.
>
>> Why _NOT_ add a thp control to that nicely contained central location?
>> It is a concise set of controls for the job.
>
> All of the above seem to be for cpusets' primary purpose, i.e. NUMA
> optimizations. It has nothing to do with transparent hugepages. (I'm not
> saying thp has anything to do with memcg either, but a "memory
> controller" seems more appropriate for controlling thp behavior.)
>
>> Maybe I am misunderstanding. Are you saying you want to put memcg
>> information into the cpuset or something like that?
>
> I'm saying there's absolutely no reason to have thp controlled by a
> cpuset, or ANY cgroup for that matter, since you chose not to respond to
> the question I asked: why do you want to control thp behavior for
> certain static binaries and not others? Where is the performance
> regression or the downside? Is it because of max_ptes_none for certain
> jobs blowing up the rss? We need information, and even if it were
> justifiable then it wouldn't have anything to do with ANY cgroup but
> rather a per-process control. It has nothing to do with cpusets
> whatsoever.
>
> (And I'm very curious why you didn't even cc the cpusets maintainer on
> this patch in the first place, who would probably say the same thing.)

I don't know whom you were referring to here. It's Paul Jackson who
invented cpusets, and then Paul Menage took over the maintainership, but
he wasn't doing much maintainer's work. Now it's me and Tejun maintaining
cpusets. (Long ago Ingo once requested that cpuset patches be cc'ed to him
and Peter.)

Back to this patch: I'm definitely on your side. This feature doesn't
interact with existing cpuset features, and it doesn't need anything that
cpuset provides. In a word, it has nothing to do with cpusets, hence it
shouldn't belong to cpusets. We're clearing up all the messes in cgroups,
and this patch acts in the opposite direction.
Re: [PATCH v2] Make transparent hugepages cpuset aware
On Wed, 2013-06-19 at 19:43 -0700, David Rientjes wrote:

> I'm suspecting that you're referring to enlarged rss because of
> khugepaged's max_ptes_none and because you're abusing the purpose of
> cpusets for containerization.

Why is containerization an abuse? What's wrong with renting out chunks of
a big box farm on the fly like a task motel? If a realtime customer checks
in, he's not gonna be thrilled about sharing a room.

-Mike
Re: [PATCH v2] Make transparent hugepages cpuset aware
On Wed, 19 Jun 2013, Robin Holt wrote:

> cpusets was not for NUMA. It has no preference for "nodes" or anything
> like that. It was for splitting a machine into layered smaller groups.
> Usually, we see one cpuset which contains the batch scheduler. The
> batch scheduler then creates cpusets for jobs it starts. Has nothing to
> do with nodes. That is more an administrator issue. They set the
> minimum grouping of resources for scheduled jobs.

I disagree with all of the above; it's not what Paul Jackson developed
cpusets for, it's not what he wrote in Documentation/cgroups/cpusets.txt,
and it's not why libnuma immediately supported it. Cpusets is for NUMA,
like it or not.

> > I'm saying there's absolutely no reason to have thp controlled by a
> > cpuset, or ANY cgroup for that matter, since you chose not to respond
> > to the question I asked: why do you want to control thp behavior for
> > certain static binaries and not others? Where is the performance
> > regression or the downside? Is it because of max_ptes_none for
> > certain jobs blowing up the rss? We need information, and even if it
> > were justifiable then it wouldn't have anything to do with ANY cgroup
> > but rather a per-process control. It has nothing to do with cpusets
> > whatsoever.
>
> It was a request from our benchmarking group that has found some jobs
> benefit from thp, while others are harmed. Let me ask them for more
> details.

Yes, please, because if some jobs are harmed by thp then we need to fix
that regression and not paper over it with some cpuset-based solution.
People should be able to run with CONFIG_TRANSPARENT_HUGEPAGE enabled and
not be required to enable CONFIG_CPUSETS for optimal behavior.

I'm suspecting that you're referring to enlarged rss because of
khugepaged's max_ptes_none and because you're abusing the purpose of
cpusets for containerization.
Re: [PATCH v2] Make transparent hugepages cpuset aware
On Wed, Jun 19, 2013 at 02:24:07PM -0700, David Rientjes wrote:
> On Wed, 19 Jun 2013, Robin Holt wrote:
>
> > The convenience being that many batch schedulers have added cpuset
> > support. They create the cpusets and configure them as appropriate
> > for the job, as determined by a mixture of input from the submitting
> > user but still under the control of the administrator. That seems
> > like a fairly significant convenience given that it took years to get
> > the batch schedulers to adopt cpusets in the first place. At this
> > point, expanding their use of cpusets is under the control of the
> > system administrator and would not require any additional development
> > on the batch scheduler developers' part.
>
> You can't say the same for memcg?

I am not aware of batch scheduler support for memory controllers. The
request came from our benchmarking group.

> > Here are the entries in the cpuset:
> >
> > cgroup.event_control  mem_exclusive    memory_pressure_enabled  notify_on_release   tasks
> > cgroup.procs          mem_hardwall     memory_spread_page       release_agent
> > cpu_exclusive         memory_migrate   memory_spread_slab       sched_load_balance
> > cpus                  memory_pressure  mems                     sched_relax_domain_level
> >
> > There are scheduler, slab allocator, page_cache layout, etc. controls.
>
> I think this is mostly for historical reasons since cpusets were
> introduced before cgroups.
>
> > Why _NOT_ add a thp control to that nicely contained central
> > location? It is a concise set of controls for the job.
>
> All of the above seem to be for cpusets' primary purpose, i.e. NUMA
> optimizations. It has nothing to do with transparent hugepages. (I'm
> not saying thp has anything to do with memcg either, but a "memory
> controller" seems more appropriate for controlling thp behavior.)

cpusets was not for NUMA. It has no preference for "nodes" or anything
like that. It was for splitting a machine into layered smaller groups.
Usually, we see one cpuset which contains the batch scheduler. The batch
scheduler then creates cpusets for jobs it starts. Has nothing to do with
nodes. That is more an administrator issue. They set the minimum grouping
of resources for scheduled jobs.

> > Maybe I am misunderstanding. Are you saying you want to put memcg
> > information into the cpuset or something like that?
>
> I'm saying there's absolutely no reason to have thp controlled by a
> cpuset, or ANY cgroup for that matter, since you chose not to respond
> to the question I asked: why do you want to control thp behavior for
> certain static binaries and not others? Where is the performance
> regression or the downside? Is it because of max_ptes_none for certain
> jobs blowing up the rss? We need information, and even if it were
> justifiable then it wouldn't have anything to do with ANY cgroup but
> rather a per-process control. It has nothing to do with cpusets
> whatsoever.

It was a request from our benchmarking group that has found some jobs
benefit from thp, while others are harmed. Let me ask them for more
details.

> (And I'm very curious why you didn't even cc the cpusets maintainer on
> this patch in the first place, who would probably say the same thing.)

I didn't know there was a cpuset maintainer. Paul Jackson (SGI, retired)
had originally worked to get cpusets introduced and then converted to use
cgroups. I had never known there was a maintainer after him. Sorry for
that.

Robin
Re: [PATCH v2] Make transparent hugepages cpuset aware
On Wed, 19 Jun 2013, Robin Holt wrote:

> The convenience being that many batch schedulers have added cpuset
> support. They create the cpusets and configure them as appropriate for
> the job, as determined by a mixture of input from the submitting user
> but still under the control of the administrator. That seems like a
> fairly significant convenience given that it took years to get the
> batch schedulers to adopt cpusets in the first place. At this point,
> expanding their use of cpusets is under the control of the system
> administrator and would not require any additional development on the
> batch scheduler developers' part.

You can't say the same for memcg?

> Here are the entries in the cpuset:
>
> cgroup.event_control  mem_exclusive    memory_pressure_enabled  notify_on_release   tasks
> cgroup.procs          mem_hardwall     memory_spread_page       release_agent
> cpu_exclusive         memory_migrate   memory_spread_slab       sched_load_balance
> cpus                  memory_pressure  mems                     sched_relax_domain_level
>
> There are scheduler, slab allocator, page_cache layout, etc. controls.

I think this is mostly for historical reasons since cpusets were
introduced before cgroups.

> Why _NOT_ add a thp control to that nicely contained central location?
> It is a concise set of controls for the job.

All of the above seem to be for cpusets' primary purpose, i.e. NUMA
optimizations. It has nothing to do with transparent hugepages. (I'm not
saying thp has anything to do with memcg either, but a "memory controller"
seems more appropriate for controlling thp behavior.)

> Maybe I am misunderstanding. Are you saying you want to put memcg
> information into the cpuset or something like that?

I'm saying there's absolutely no reason to have thp controlled by a
cpuset, or ANY cgroup for that matter, since you chose not to respond to
the question I asked: why do you want to control thp behavior for certain
static binaries and not others? Where is the performance regression or the
downside? Is it because of max_ptes_none for certain jobs blowing up the
rss? We need information, and even if it were justifiable then it wouldn't
have anything to do with ANY cgroup but rather a per-process control. It
has nothing to do with cpusets whatsoever.

(And I'm very curious why you didn't even cc the cpusets maintainer on
this patch in the first place, who would probably say the same thing.)
Re: [PATCH v2] Make transparent hugepages cpuset aware
On Tue, Jun 18, 2013 at 05:01:23PM -0700, David Rientjes wrote:
> On Tue, 18 Jun 2013, Alex Thorlton wrote:
>
> > Thanks for your input, however, I believe the method of using a
> > malloc hook falls apart when it comes to static binaries, since we
> > won't have any shared libraries to hook into. Although using a malloc
> > hook is a perfectly suitable solution for most cases, we're looking
> > to implement a solution that can be used in all situations.
>
> I guess the question would be why you don't want your malloc memory
> backed by thp pages for certain static binaries and not others? Is it
> because of an increased rss due to khugepaged collapsing memory because
> of its default max_ptes_none value?
>
> > Aside from that particular shortcoming of the malloc hook solution,
> > there are some other situations where a cpuset-based option is a much
> > simpler and more efficient solution than the alternatives.
>
> Sure, but why should this be a cpuset based solution? What is special
> about cpusets that makes certain statically allocated binaries not want
> memory backed by thp while others do? This still seems based solely on
> convenience instead of any hard requirement.

The convenience being that many batch schedulers have added cpuset
support. They create the cpusets and configure them as appropriate for the
job, as determined by a mixture of input from the submitting user but
still under the control of the administrator. That seems like a fairly
significant convenience given that it took years to get the batch
schedulers to adopt cpusets in the first place. At this point, expanding
their use of cpusets is under the control of the system administrator and
would not require any additional development on the batch scheduler
developers' part.

> > One such situation that comes to mind would be an environment where a
> > batch scheduler is in use to ration system resources. If an
> > administrator determines that a user's jobs run more efficiently with
> > thp always on, the administrator can simply set the user's jobs to
> > always run with that setting, instead of having to coordinate with
> > that user to get them to run their jobs in a different way. I feel
> > that, for cases such as this, this additional flag is in line with
> > the other capabilities that cgroups and cpusets provide.
>
> That sounds like a memcg, i.e. container, type of an issue, not a
> cpuset issue, which is more geared toward NUMA optimizations. User jobs
> should always run more efficiently with thp always on; the worst-case
> scenario should be if they run with the same performance as thp set to
> never. In other words, there shouldn't be any regression that requires
> certain cpusets to disable thp because of a performance regression. If
> there are any, we'd like to investigate that separately from this
> patch.

Here are the entries in the cpuset:

cgroup.event_control  mem_exclusive    memory_pressure_enabled  notify_on_release   tasks
cgroup.procs          mem_hardwall     memory_spread_page       release_agent
cpu_exclusive         memory_migrate   memory_spread_slab       sched_load_balance
cpus                  memory_pressure  mems                     sched_relax_domain_level

There are scheduler, slab allocator, page_cache layout, etc. controls.
Why _NOT_ add a thp control to that nicely contained central location? It
is a concise set of controls for the job.

Maybe I am misunderstanding. Are you saying you want to put memcg
information into the cpuset or something like that?

Robin
Re: [PATCH v2] Make transparent hugepages cpuset aware
On Wed, Jun 19, 2013 at 02:24:07PM -0700, David Rientjes wrote: On Wed, 19 Jun 2013, Robin Holt wrote: The convenience being that many batch schedulers have added cpuset support. They create the cpuset's and configure them as appropriate for the job as determined by a mixture of input from the submitting user but still under the control of the administrator. That seems like a fairly significant convenience given that it took years to get the batch schedulers to adopt cpusets in the first place. At this point, expanding their use of cpusets is under the control of the system administrator and would not require any additional development on the batch scheduler developers part. You can't say the same for memcg? I am not aware of batch scheduler support for memory controllers. The request came from our benchmarking group. Here are the entries in the cpuset: cgroup.event_control mem_exclusivememory_pressure_enabled notify_on_release tasks cgroup.procs mem_hardwall memory_spread_page release_agent cpu_exclusive memory_migrate memory_spread_slab sched_load_balance cpus memory_pressure mems sched_relax_domain_level There are scheduler, slab allocator, page_cache layout, etc controls. I think this is mostly for historical reasons since cpusets were introduced before cgroups. Why _NOT_ add a thp control to that nicely contained central location? It is a concise set of controls for the job. All of the above seem to be for cpusets primary purpose, i.e. NUMA optimizations. It has nothing to do with transparent hugepages. (I'm not saying thp has anything to do with memcg either, but a memory controller seems more appropriate for controlling thp behavior.) cpusets was not for NUMA. It has no preference for nodes or anything like that. It was for splitting a machine into layered smaller groups. Usually, we see one cpuset with contains the batch scheduler. The batch scheduler then creates cpusets for jobs it starts. Has nothing to do with nodes. That is more an administrator issue. 
They set the minimum grouping of resources for scheduled jobs.

> > Maybe I am misunderstanding.  Are you saying you want to put memcg
> > information into the cpuset or something like that?
>
> I'm saying there's absolutely no reason to have thp controlled by a
> cpuset, or ANY cgroup for that matter, since you chose not to respond
> to the question I asked: why do you want to control thp behavior for
> certain static binaries and not others?  Where is the performance
> regression or the downside?  Is it because of max_ptes_none for certain
> jobs blowing up the rss?  We need information, and even if it were
> justifiable, it wouldn't have anything to do with ANY cgroup but rather
> a per-process control.  It has nothing to do with cpusets whatsoever.

It was a request from our benchmarking group that has found some jobs
benefit from thp, while others are harmed.  Let me ask them for more
details.

> (And I'm very curious why you didn't even cc the cpusets maintainer on
> this patch in the first place who would probably say the same thing.)

I didn't know there was a cpuset maintainer.  Paul Jackson (SGI,
retired) had originally worked to get cpusets introduced and then
converted to use cgroups.  I had never known there was a maintainer
after him.  Sorry for that.

Robin
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Re: [PATCH v2] Make transparent hugepages cpuset aware
On Wed, 19 Jun 2013, Robin Holt wrote:

> cpusets was not for NUMA.  It has no preference for nodes or anything
> like that.  It was for splitting a machine into layered smaller groups.
> Usually, we see one cpuset which contains the batch scheduler.  The
> batch scheduler then creates cpusets for jobs it starts.  Has nothing
> to do with nodes.  That is more an administrator issue.  They set the
> minimum grouping of resources for scheduled jobs.

I disagree with all of the above: it's not what Paul Jackson developed
cpusets for, it's not what he wrote in
Documentation/cgroups/cpusets.txt, and it's not why libnuma immediately
supported it.  Cpusets is for NUMA, like it or not.

> > I'm saying there's absolutely no reason to have thp controlled by a
> > cpuset, or ANY cgroup for that matter, since you chose not to respond
> > to the question I asked: why do you want to control thp behavior for
> > certain static binaries and not others?  Where is the performance
> > regression or the downside?  Is it because of max_ptes_none for
> > certain jobs blowing up the rss?  We need information, and even if it
> > were justifiable, it wouldn't have anything to do with ANY cgroup but
> > rather a per-process control.  It has nothing to do with cpusets
> > whatsoever.
>
> It was a request from our benchmarking group that has found some jobs
> benefit from thp, while others are harmed.  Let me ask them for more
> details.

Yes, please, because if some jobs are harmed by thp then we need to fix
that regression and not paper over it with some cpuset-based solution.
People should be able to run with CONFIG_TRANSPARENT_HUGEPAGE enabled
and not be required to enable CONFIG_CPUSETS for optimal behavior.  I'm
suspecting that you're referring to enlarged rss because of khugepaged's
max_ptes_none and because you're abusing the purpose of cpusets for
containerization.
Re: [PATCH v2] Make transparent hugepages cpuset aware
On Wed, 2013-06-19 at 19:43 -0700, David Rientjes wrote:

> I'm suspecting that you're referring to enlarged rss because of
> khugepaged's max_ptes_none and because you're abusing the purpose of
> cpusets for containerization.

Why is containerization an abuse?  What's wrong with renting out chunks
of a big box farm on the fly, like a task motel?  If a realtime customer
checks in, he's not gonna be thrilled about sharing a room.

-Mike
Re: [PATCH v2] Make transparent hugepages cpuset aware
Cc: Tejun, and cgroup ML

> > Here are the entries in the cpuset:
> >
> > cgroup.event_control  mem_exclusive    memory_pressure_enabled  notify_on_release  tasks
> > cgroup.procs          mem_hardwall     memory_spread_page       release_agent
> > cpu_exclusive         memory_migrate   memory_spread_slab       sched_load_balance
> > cpus                  memory_pressure  mems                     sched_relax_domain_level
> >
> > There are scheduler, slab allocator, page_cache layout, etc. controls.
> > I think this is mostly for historical reasons since cpusets were
> > introduced before cgroups.  Why _NOT_ add a thp control to that nicely
> > contained central location?  It is a concise set of controls for the
> > job.
>
> All of the above seem to be for cpusets' primary purpose, i.e. NUMA
> optimizations.  It has nothing to do with transparent hugepages.  (I'm
> not saying thp has anything to do with memcg either, but a memory
> controller seems more appropriate for controlling thp behavior.)
>
> > Maybe I am misunderstanding.  Are you saying you want to put memcg
> > information into the cpuset or something like that?
>
> I'm saying there's absolutely no reason to have thp controlled by a
> cpuset, or ANY cgroup for that matter, since you chose not to respond
> to the question I asked: why do you want to control thp behavior for
> certain static binaries and not others?  Where is the performance
> regression or the downside?  Is it because of max_ptes_none for certain
> jobs blowing up the rss?  We need information, and even if it were
> justifiable, it wouldn't have anything to do with ANY cgroup but rather
> a per-process control.  It has nothing to do with cpusets whatsoever.
>
> (And I'm very curious why you didn't even cc the cpusets maintainer on
> this patch in the first place who would probably say the same thing.)

Don't know whom you were referring to here.  It's Paul Jackson who
invented cpusets, and then Paul Menage took over the maintainership,
but he wasn't doing much maintainer's work.  Now it's me and Tejun
maintaining cpusets.  (Long ago Ingo once requested cpuset patches
should be cced to him and Peter.)
Back to this patch, I'm definitely on your side.  This feature doesn't
interact with existing cpuset features, and it doesn't need anything
that cpuset provides.  In a word, it has nothing to do with cpusets,
hence it shouldn't belong to cpusets.  We're clearing up all the messes
in cgroups, and this patch goes in the opposite direction.
Re: [PATCH v2] Make transparent hugepages cpuset aware
On Tue, 18 Jun 2013, Alex Thorlton wrote:

> Thanks for your input, however, I believe the method of using a malloc
> hook falls apart when it comes to static binaries, since we won't have
> any shared libraries to hook into.  Although using a malloc hook is a
> perfectly suitable solution for most cases, we're looking to implement
> a solution that can be used in all situations.

I guess the question would be why you don't want your malloc memory
backed by thp pages for certain static binaries and not others?  Is it
because of an increased rss due to khugepaged collapsing memory because
of its default max_ptes_none value?

> Aside from that particular shortcoming of the malloc hook solution,
> there are some other situations where having a cpuset-based option is
> a much simpler and more efficient solution than the alternatives.

Sure, but why should this be a cpuset-based solution?  What is special
about cpusets that makes certain statically allocated binaries not want
memory backed by thp while others do?  This still seems based solely on
convenience instead of any hard requirement.

> One such situation that comes to mind would be an environment where a
> batch scheduler is in use to ration system resources.  If an
> administrator determines that a user's jobs run more efficiently with
> thp always on, the administrator can simply set the user's jobs to
> always run with that setting, instead of having to coordinate with
> that user to get them to run their jobs in a different way.  I feel
> that, for cases such as this, this additional flag is in line with the
> other capabilities that cgroups and cpusets provide.

That sounds like a memcg, i.e. container, type of an issue, not a
cpuset issue, which is more geared toward NUMA optimizations.  User
jobs should always run more efficiently with thp always on; the
worst-case scenario should be if they run with the same performance as
thp set to never.  In other words, there shouldn't be any regression
that requires certain cpusets to disable thp because of a performance
regression.  If there are any, we'd like to investigate that
separately from this patch.
Re: [PATCH v2] Make transparent hugepages cpuset aware
On Tue, Jun 11, 2013 at 03:20:09PM -0700, David Rientjes wrote:
> On Tue, 11 Jun 2013, Alex Thorlton wrote:
>
> > This patch adds the ability to control THPs on a per cpuset basis.
> > Please see the additions to Documentation/cgroups/cpusets.txt for
> > more information.
>
> What's missing from both this changelog and the documentation you
> point to is why this change is needed.
>
> I can understand how you would want a subset of processes to not use
> thp when it is enabled.  This is typically where MADV_NOHUGEPAGE is
> used with some type of malloc hook.
>
> I don't think we need to do this on a cpuset level, so unfortunately I
> think this needs to be reworked.  Would it make sense to add a
> per-process tunable to always get MADV_NOHUGEPAGE behavior for all of
> its sbrk() and mmap() calls?  Perhaps, but then you would need to
> justify why it can't be done with a malloc hook in userspace.
>
> This seems to just be working around a userspace issue or for a matter
> of convenience, right?

David,

Thanks for your input, however, I believe the method of using a malloc
hook falls apart when it comes to static binaries, since we won't have
any shared libraries to hook into.  Although using a malloc hook is a
perfectly suitable solution for most cases, we're looking to implement
a solution that can be used in all situations.

Aside from that particular shortcoming of the malloc hook solution,
there are some other situations where having a cpuset-based option is a
much simpler and more efficient solution than the alternatives.  One
such situation that comes to mind would be an environment where a batch
scheduler is in use to ration system resources.  If an administrator
determines that a user's jobs run more efficiently with thp always on,
the administrator can simply set the user's jobs to always run with
that setting, instead of having to coordinate with that user to get
them to run their jobs in a different way.  I feel that, for cases such
as this, this additional flag is in line with the other capabilities
that cgroups and cpusets provide.

While there are likely numerous other situations where having a flag to
control thp on the cpuset level makes things a bit easier to manage,
the one real show-stopper here is that we really have no other options
when it comes to static binaries.

- Alex Thorlton
Re: [PATCH v2] Make transparent hugepages cpuset aware
On Tue, 11 Jun 2013, Alex Thorlton wrote:

> This patch adds the ability to control THPs on a per cpuset basis.
> Please see the additions to Documentation/cgroups/cpusets.txt for more
> information.

What's missing from both this changelog and the documentation you point
to is why this change is needed.

I can understand how you would want a subset of processes to not use
thp when it is enabled.  This is typically where MADV_NOHUGEPAGE is
used with some type of malloc hook.

I don't think we need to do this on a cpuset level, so unfortunately I
think this needs to be reworked.  Would it make sense to add a
per-process tunable to always get MADV_NOHUGEPAGE behavior for all of
its sbrk() and mmap() calls?  Perhaps, but then you would need to
justify why it can't be done with a malloc hook in userspace.

This seems to just be working around a userspace issue or for a matter
of convenience, right?
[PATCH v2] Make transparent hugepages cpuset aware
This patch adds the ability to control THPs on a per cpuset basis.  Please see
the additions to Documentation/cgroups/cpusets.txt for more information.

Signed-off-by: Alex Thorlton <athorl...@sgi.com>
Reviewed-by: Robin Holt <h...@sgi.com>
Cc: Li Zefan <lize...@huawei.com>
Cc: Rob Landley <r...@landley.net>
Cc: Andrew Morton <a...@linux-foundation.org>
Cc: Mel Gorman <mgor...@suse.de>
Cc: Rik van Riel <r...@redhat.com>
Cc: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
Cc: Johannes Weiner <han...@cmpxchg.org>
Cc: Xiao Guangrong <xiaoguangr...@linux.vnet.ibm.com>
Cc: David Rientjes <rient...@google.com>
Cc: linux-...@vger.kernel.org
Cc: linux...@kvack.org
---
Changes since last patch version:
 - Modified transparent_hugepage_enable to always check the vma for the
   VM_NOHUGEPAGE flag and to always check is_vma_temporary_stack.
 - Moved cpuset_update_child_thp_flags above cpuset_update_top_thp_flags

 Documentation/cgroups/cpusets.txt |  50 ++-
 include/linux/cpuset.h            |   5 ++
 include/linux/huge_mm.h           |  27 +-
 kernel/cpuset.c                   | 181 ++
 mm/huge_memory.c                  |   3 +
 5 files changed, 263 insertions(+), 3 deletions(-)

diff --git a/Documentation/cgroups/cpusets.txt b/Documentation/cgroups/cpusets.txt
index 12e01d4..b7b2c83 100644
--- a/Documentation/cgroups/cpusets.txt
+++ b/Documentation/cgroups/cpusets.txt
@@ -22,12 +22,14 @@ CONTENTS:
   1.6 What is memory spread ?
   1.7 What is sched_load_balance ?
   1.8 What is sched_relax_domain_level ?
-  1.9 How do I use cpusets ?
+  1.9 What is thp_enabled ?
+  1.10 How do I use cpusets ?
 2. Usage Examples and Syntax
   2.1 Basic Usage
   2.2 Adding/removing cpus
   2.3 Setting flags
   2.4 Attaching processes
+  2.5 Setting thp_enabled flags
 3. Questions
 4. Contact

@@ -581,7 +583,34 @@ If your situation is:
 then increasing 'sched_relax_domain_level' would benefit you.

-1.9 How do I use cpusets ?
+1.9 What is thp_enabled ?
+---
+
+The thp_enabled file contained within each cpuset controls how transparent
+hugepages are handled within that cpuset.
+
+The root cpuset's thp_enabled flags mirror the flags set in
+/sys/kernel/mm/transparent_hugepage/enabled.  The flags in the root cpuset can
+only be modified by changing /sys/kernel/mm/transparent_hugepage/enabled.  The
+thp_enabled file for the root cpuset is read only.  These flags cause the
+root cpuset to behave as one might expect:
+
+- When set to always, THPs are used whenever practical
+- When set to madvise, THPs are used only on chunks of memory that have the
+  MADV_HUGEPAGE flag set
+- When set to never, THPs are never allowed for tasks in this cpuset
+
+The behavior of thp_enabled for children of the root cpuset is where things
+become a bit more interesting.  The child cpusets accept the same flags as the
+root, but also have a default flag, which, when set, causes a cpuset to use the
+behavior of its parent.  When a child cpuset is created, its default flag is
+always initially set.
+
+Since the flags on child cpusets are allowed to differ from the flags on their
+parents, we are able to enable THPs for tasks in specific cpusets, and disable
+them in others.
+
+1.10 How do I use cpusets ?
 --

 In order to minimize the impact of cpusets on critical kernel

@@ -733,6 +762,7 @@
 cpuset.cpus              cpuset.sched_load_balance
 cpuset.mem_exclusive     cpuset.sched_relax_domain_level
 cpuset.mem_hardwall      notify_on_release
 cpuset.memory_migrate    tasks
+thp_enabled

 Reading them will give you information about the state of this cpuset:
 the CPUs and Memory Nodes it can use, the processes that are using

@@ -814,6 +844,22 @@ If you have several tasks to attach, you have to do it one after another:
 	...
 # /bin/echo PIDn > tasks

+2.5 Setting thp_enabled flags
+-
+
+The syntax for setting these flags is similar to setting thp flags in
+/sys/kernel/mm/transparent_hugepage/enabled.  In order to change the flags you
+simply echo the name of the flag you want to set to the thp_enabled file of the
+desired cpuset:
+
+# /bin/echo always > thp_enabled   -> always use THPs when practical
+# /bin/echo madvise > thp_enabled  -> only use THPs in madvise sections
+# /bin/echo never > thp_enabled    -> never use THPs
+# /bin/echo default > thp_enabled  -> use parent cpuset's THP flags
+
+Note that the flags on the root cpuset cannot be changed in /dev/cpuset.  These
+flags are mirrored from /sys/kernel/mm/transparent_hugepage/enabled and can only
+be modified there.

 3. Questions

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index cc1b01c..624aafd 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -19,9 +19,12 @@
 extern int number_of_cpusets;	/* How many cpusets are defined in system? */

 extern int cpuset_init(void);
 extern void cpuset_init_smp(void);
+extern void cpuset_update_top_thp_flags(void);
 extern void cpuset_update_active_cpus(bool cpu_online);
 extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
 extern void