Re: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg
On 2018/05/25 2:45, Mike Kravetz wrote:
[...]
>> THP does not guarantee to use the Huge Page, but may use the normal page.
>
> Note. You do not want to use THP because "THP does not guarantee".
[...]
>> One of the answers I have reached is to use HugeTLBfs by overcommitting
>> without creating a pool (this is the surplus hugepage).
>
> Using hugetlbfs overcommit also does not provide a guarantee. Without
> doing much research, I would say the failure rate for obtaining a huge
> page via THP and hugetlbfs overcommit is about the same. The most
> difficult issue in both cases will be obtaining a "huge page" number of
> pages from the buddy allocator.

Yes. When multiple hugetlb page sizes are not used, as on x86, the number
of base pages behind a THP and a hugetlb page is the same, so the failure
rate of obtaining such a compound page is the same, as you said.

> I really do not think hugetlbfs overcommit will provide any benefit over
> THP for your use case.

I think that what you say is absolutely right.

> Also, new user space code is required to "fall back"
> to normal pages in the case of hugetlbfs page allocation failure. This
> is not needed in the THP case.

I understand the superiority of THP, but there are cases where khugepaged
occupies a CPU because of page fragmentation. If a persistent pool is set
up once, instead of using overcommit, I think hugetlb can be superior, for
example with memory allocation performance exceeding THP.

I will try to find a good way to use hugetlb pages.

I sincerely thank you for your help.

--
Thanks,
Tsukada
Re: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg
On 2018/05/24 22:24, Michal Hocko wrote:
[...]
> I do not see anything like that. adjust_pool_surplus is simply an
> accounting thing. At least the last time I've checked. Maybe your
> patchset handles that?

As you said, my patch did not consider handling when manipulating the pool.
And even if that handling were done well, it would not be a valid reason to
charge surplus hugepages to memcg.

[...]
>> Absolutely you are saying the right thing, but, for example, can mlock(2)ed
>> pages be swapped out by reclaim? (What is the difference between mlock(2)ed
>> pages and hugetlb pages?)
>
> No, mlocked pages cannot be reclaimed and that is why we restrict them to
> a relatively small amount.

I understood the concept of memcg.

[...]
> Fatal? Not sure. It simply tries to add an alien memory to the memcg
> concept so I would presume an unexpected behavior (e.g. not being able
> to reclaim memcg or, over reclaim, thrashing etc.).

As you said, it must be an alien. Thanks to the interaction up to here, I
understood that my solution is inappropriate. I will look for another way.

Thank you for your kind explanation.

--
Thanks,
Tsukada
Re: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg
On 05/23/2018 09:26 PM, TSUKADA Koutaro wrote:
>
> I do not know if it is really a strong use case, but I will explain my
> motive in detail. English is not my native language, so please pardon
> my poor English.
>
> I am one of the developers of software that manages the resources used
> by user jobs on an HPC cluster running Linux. The resource is mainly
> memory. The HPC cluster may be shared and used by multiple people.
> Therefore, the memory used by each user must be strictly controlled;
> otherwise a user's job will run away, and not only will it hamper the
> other users, it will crash the entire system in OOM.
>
> Some users of HPC are very nervous about performance. Jobs are executed
> while synchronizing with MPI communication using multiple compute nodes.
> Since CPU wait time occurs when synchronizing, they want to minimize the
> variation in execution time at each node to reduce waiting times as much
> as possible. We call this variation noise.
>
> THP does not guarantee to use the Huge Page, but may use the normal page.

Note. You do not want to use THP because "THP does not guarantee".

> This mechanism is one cause of variation (noise).
>
> Users who know this mechanism will be hesitant to use THP. However,
> these users also know the benefit of the Huge Page's TLB hit rate, and
> the Huge Page seems attractive. It seems natural that such users are
> interested in HugeTLBfs; I do not know at all whether it is the right
> approach or not.
>
> At the very least, our HPC system is pursuing high versatility and we
> have to consider whether we can provide it if users want to use HugeTLBfs.
>
> In order to use HugeTLBfs we need to create a persistent pool, but in
> our use case of sharing nodes, it would be impossible to create, delete
> or resize the pool.
>
> One of the answers I have reached is to use HugeTLBfs by overcommitting
> without creating a pool (this is the surplus hugepage).
Using hugetlbfs overcommit also does not provide a guarantee. Without
doing much research, I would say the failure rate for obtaining a huge
page via THP and hugetlbfs overcommit is about the same. The most
difficult issue in both cases will be obtaining a "huge page" number of
pages from the buddy allocator.

I really do not think hugetlbfs overcommit will provide any benefit over
THP for your use case. Also, new user space code is required to "fall back"
to normal pages in the case of hugetlbfs page allocation failure. This
is not needed in the THP case.

--
Mike Kravetz
Re: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg
On Thu 24-05-18 21:58:49, TSUKADA Koutaro wrote:
> On 2018/05/24 17:20, Michal Hocko wrote:
> > On Thu 24-05-18 13:39:59, TSUKADA Koutaro wrote:
> >> On 2018/05/23 3:54, Michal Hocko wrote:
> > [...]
> >>> I am also quite confused why you keep distinguishing surplus hugetlb
> >>> pages from regular preallocated ones. Being a surplus page is an
> >>> implementation detail that we use for an internal accounting rather than
> >>> something to exhibit to the userspace even more than we do currently.
> >>
> >> I apologize for having confused.
> >>
> >> The hugetlb pages obtained from the pool do not waste the buddy pool.
> >
> > Because they have already been allocated from the buddy allocator so the
> > end result is very much the same.
> >
> >> On the other hand, surplus hugetlb pages waste the buddy pool. Due to
> >> this difference in property, I thought it could be distinguished.
> >
> > But this is simply not correct. Surplus pages are fluid. If you increase
> > the hugetlb size they will become regular persistent hugetlb pages.
>
> I really cannot understand what's wrong with this. That page is obviously
> released before being added to the persistent pool, and at that time it is
> uncharged from the memcg to which the task belongs (this assumes my
> patch-set). After that, the same page obtained from the pool is not a
> surplus hugepage, so it will not be charged to memcg again.

I do not see anything like that. adjust_pool_surplus is simply an
accounting thing. At least the last time I've checked. Maybe your
patchset handles that?

> >> Although my memcg knowledge is extremely limited, memcg is accounting for
> >> various kinds of pages obtained from the buddy pool by the task belonging
> >> to it. I would like to argue that surplus hugepage has specificity in
> >> terms of obtaining from the buddy pool, and that it is specially permitted
> >> charge requirements for memcg.
> >
> > Not really. Memcg accounts primarily for reclaimable memory. We do
> > account for some non-reclaimable slabs but the life time should be at
> > least bound to a process life time. Otherwise the memcg oom killer
> > behavior is not guaranteed to unclutter the situation. Hugetlb pages are
> > simply persistent. Well, to be completely honest tmpfs pages have a
> > similar problem but lacking the swap space for them is kind of a
> > configuration bug.
>
> Absolutely you are saying the right thing, but, for example, can mlock(2)ed
> pages be swapped out by reclaim? (What is the difference between mlock(2)ed
> pages and hugetlb pages?)

No, mlocked pages cannot be reclaimed and that is why we restrict them to
a relatively small amount.

> >> It seems very strange to charge a hugetlb page to memcg, but essentially
> >> it only charges the usage of the compound page obtained from the buddy
> >> pool, and even if that page is used as a hugetlb page after that, memcg
> >> is not interested in that.
> >
> > Ohh, it is very much interested. The primary goal of memcg is to enforce
> > the limit. How are you going to do that in an absence of the reclaimable
> > memory? And quite a lot of it because hugetlb pages usually consume a
> > lot of memory.
>
> Simply kill any of the tasks belonging to that memcg. Maybe no one wants
> reclaim at the time of accounting surplus hugepages.

But that will not release the hugetlb memory, will it?

> [...]
> >> I could not understand the intention of this question, sorry. When
> >> resizing the pool, I think that the number of surplus hugepages in use
> >> does not change. Could you explain what you were concerned about?
> >
> > It does change when you change the hugetlb pool size, migrate pages
> > between per-numa pools (have a look at adjust_pool_surplus).
>
> As I looked at it, what kind of fatal problem is caused by charging surplus
> hugepages to memcg by just manipulating a counter of statistical
> information?

Fatal? Not sure. It simply tries to add an alien memory to the memcg
concept so I would presume an unexpected behavior (e.g. not being able
to reclaim memcg or, over reclaim, thrashing etc.).

--
Michal Hocko
SUSE Labs
Re: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg
On 2018/05/24 17:20, Michal Hocko wrote:
> On Thu 24-05-18 13:39:59, TSUKADA Koutaro wrote:
>> On 2018/05/23 3:54, Michal Hocko wrote:
> [...]
>>> I am also quite confused why you keep distinguishing surplus hugetlb
>>> pages from regular preallocated ones. Being a surplus page is an
>>> implementation detail that we use for an internal accounting rather than
>>> something to exhibit to the userspace even more than we do currently.
>>
>> I apologize for having confused.
>>
>> The hugetlb pages obtained from the pool do not waste the buddy pool.
>
> Because they have already been allocated from the buddy allocator so the
> end result is very much the same.
>
>> On the other hand, surplus hugetlb pages waste the buddy pool. Due to
>> this difference in property, I thought it could be distinguished.
>
> But this is simply not correct. Surplus pages are fluid. If you increase
> the hugetlb size they will become regular persistent hugetlb pages.

I really cannot understand what's wrong with this. That page is obviously
released before being added to the persistent pool, and at that time it is
uncharged from the memcg to which the task belongs (this assumes my
patch-set). After that, the same page obtained from the pool is not a
surplus hugepage, so it will not be charged to memcg again.

>> Although my memcg knowledge is extremely limited, memcg is accounting for
>> various kinds of pages obtained from the buddy pool by the task belonging
>> to it. I would like to argue that surplus hugepage has specificity in
>> terms of obtaining from the buddy pool, and that it is specially permitted
>> charge requirements for memcg.
>
> Not really. Memcg accounts primarily for reclaimable memory. We do
> account for some non-reclaimable slabs but the life time should be at
> least bound to a process life time. Otherwise the memcg oom killer
> behavior is not guaranteed to unclutter the situation. Hugetlb pages are
> simply persistent. Well, to be completely honest tmpfs pages have a
> similar problem but lacking the swap space for them is kind of a
> configuration bug.

Absolutely you are saying the right thing, but, for example, can mlock(2)ed
pages be swapped out by reclaim? (What is the difference between mlock(2)ed
pages and hugetlb pages?)

>> It seems very strange to charge a hugetlb page to memcg, but essentially
>> it only charges the usage of the compound page obtained from the buddy
>> pool, and even if that page is used as a hugetlb page after that, memcg
>> is not interested in that.
>
> Ohh, it is very much interested. The primary goal of memcg is to enforce
> the limit. How are you going to do that in an absence of the reclaimable
> memory? And quite a lot of it because hugetlb pages usually consume a
> lot of memory.

Simply kill any of the tasks belonging to that memcg. Maybe no one wants
reclaim at the time of accounting surplus hugepages.

[...]
>> I could not understand the intention of this question, sorry. When
>> resizing the pool, I think that the number of surplus hugepages in use
>> does not change. Could you explain what you were concerned about?
>
> It does change when you change the hugetlb pool size, migrate pages
> between per-numa pools (have a look at adjust_pool_surplus).

As I looked at it, what kind of fatal problem is caused by charging surplus
hugepages to memcg by just manipulating a counter of statistical
information?

--
Thanks,
Tsukada
Re: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg
On Thu 24-05-18 13:26:12, TSUKADA Koutaro wrote:
[...]
> I do not know if it is really a strong use case, but I will explain my
> motive in detail. English is not my native language, so please pardon
> my poor English.
>
> I am one of the developers of software that manages the resources used
> by user jobs on an HPC cluster running Linux. The resource is mainly
> memory. The HPC cluster may be shared and used by multiple people.
> Therefore, the memory used by each user must be strictly controlled;
> otherwise a user's job will run away, and not only will it hamper the
> other users, it will crash the entire system in OOM.
>
> Some users of HPC are very nervous about performance. Jobs are executed
> while synchronizing with MPI communication using multiple compute nodes.
> Since CPU wait time occurs when synchronizing, they want to minimize the
> variation in execution time at each node to reduce waiting times as much
> as possible. We call this variation noise.
>
> THP does not guarantee to use the Huge Page, but may use the normal page.
> This mechanism is one cause of variation (noise).
>
> Users who know this mechanism will be hesitant to use THP. However,
> these users also know the benefit of the Huge Page's TLB hit rate, and
> the Huge Page seems attractive. It seems natural that such users are
> interested in HugeTLBfs; I do not know at all whether it is the right
> approach or not.

Sure, asking for a guarantee makes hugetlb pages attractive. But nothing
is really for free, especially any resource _guarantee_, and you usually
have to pay an additional configuration price.

> At the very least, our HPC system is pursuing high versatility and we
> have to consider whether we can provide it if users want to use HugeTLBfs.
>
> In order to use HugeTLBfs we need to create a persistent pool, but in
> our use case of sharing nodes, it would be impossible to create, delete
> or resize the pool.

Why? I can see this would be quite a PITA but not really impossible.

> One of the answers I have reached is to use HugeTLBfs by overcommitting
> without creating a pool (this is the surplus hugepage).
>
> Surplus hugepages are hugetlb pages, but I think at least that consuming
> the buddy pool is a decisive difference from hugetlb pages of the
> persistent pool. If nr_overcommit_hugepages is assumed to be infinite,
> allocating pages for surplus hugepages from the buddy pool is all
> unlimited even if being limited by memcg.

Not really, you can specify how much you can overcommit hugetlb pages.

> In extreme cases, overcommitment will allow users to exhaust
> the entire memory of the system. Of course, this can be prevented by the
> hugetlb cgroup, but even if we set the limit for memcg and hugetlb cgroup
> respectively, as I asked in the first mail (set limit to 10GB), the
> control will not work.

--
Michal Hocko
SUSE Labs
Re: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg
On Thu 24-05-18 13:39:59, TSUKADA Koutaro wrote: > On 2018/05/23 3:54, Michal Hocko wrote: [...] > > I am also quite confused why you keep distinguishing surplus hugetlb > > pages from regular preallocated ones. Being a surplus page is an > > implementation detail that we use for an internal accounting rather than > > something to exhibit to the userspace even more than we do currently. > > I apologize for having confused you. > > The hugetlb pages obtained from the pool do not waste the buddy pool. Because they have already been allocated from the buddy allocator, the end result is the very same. > On > the other hand, surplus hugetlb pages waste the buddy pool. Due to this > difference in property, I thought it could be distinguished. But this is simply not correct. Surplus pages are fluid. If you increase the hugetlb size they will become regular persistent hugetlb pages. > Although my memcg knowledge is extremely limited, memcg is accounting for > various kinds of pages obtained from the buddy pool by the task belonging > to it. I would like to argue that surplus hugepages are special in > terms of being obtained from the buddy pool, and that charging them > to memcg should be specially permitted. Not really. Memcg accounts primarily for reclaimable memory. We do account for some non-reclaimable slabs but the lifetime should be at least bound to a process lifetime. Otherwise the memcg oom killer behavior is not guaranteed to unclutter the situation. Hugetlb pages are simply persistent. Well, to be completely honest, tmpfs pages have a similar problem, but lacking the swap space for them is kind of a configuration bug. > It seems very strange to charge hugetlb pages to memcg, but essentially > it only charges the usage of the compound page obtained from the buddy pool, > and even if that page is used as a hugetlb page after that, memcg is not > interested in that. Ohh, it is very much interested. The primary goal of memcg is to enforce the limit.
How are you going to do that in an absence of the reclaimable memory? And quite a lot of it because hugetlb pages usually consume a lot of memory. > I will completely apologize if my way of thinking is wrong. It would be > greatly appreciated if you could mention why we can not charge surplus > hugepages to memcg. > > > Just look at what should happen when you need to adjust accounting - e.g. > > due to the pool resize. Are you going to uncharge those surplus pages > > from memcg to reflect their persistence? > > > > I could not understand the intention of this question, sorry. When resizing > the pool, I think that the number of surplus hugepages in use does not > change. Could you explain what you were concerned about? It does change when you change the hugetlb pool size or migrate pages between per-numa pools (have a look at adjust_pool_surplus). -- Michal Hocko SUSE Labs
Re: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg
On 2018/05/23 3:54, Michal Hocko wrote: > On Tue 22-05-18 22:04:23, TSUKADA Koutaro wrote: >> On 2018/05/22 3:07, Mike Kravetz wrote: >>> On 05/17/2018 09:27 PM, TSUKADA Koutaro wrote: Thanks to Mike Kravetz for comment on the previous version patch. The purpose of this patch-set is to make it possible to control whether or not to charge surplus hugetlb pages obtained by overcommitting to memory cgroup. In the future, I am trying to accomplish limiting the memory usage of applications that use both normal pages and hugetlb pages by the memory cgroup (not the hugetlb cgroup). Applications that use shared libraries like libhugetlbfs.so use both normal pages and hugetlb pages, but we do not know how much to use each. Please suppose you want to manage the memory usage of such applications by cgroup. How do you set the memory cgroup and hugetlb cgroup limit when you want to limit memory usage to 10GB? If you set a limit of 10GB for each, the user can use a total of 20GB of memory and can not limit it well. Since it is difficult to estimate the ratio used by user of normal pages and hugetlb pages, setting limits of 2GB to memory cgroup and 8GB to hugetlb cgroup is not a very good idea. In such a case, I thought that by using my patch-set, we could manage resources just by setting 10GB as the limit of memory cgroup (there is no limit on the hugetlb cgroup). In this patch-set, introduce the charge_surplus_huge_pages (boolean) to struct hstate. If it is true, it charges to the memory cgroup to which the task that obtained surplus hugepages belongs. If it is false, do nothing as before, and the default value is false. The charge_surplus_huge_pages can be controlled via procfs or sysfs interfaces. Since THP is very effective in environments with kernel page size of 4KB, such as x86, there is no reason to positively use HugeTLBfs, so I think that there is no situation to enable charge_surplus_huge_pages. 
However, in some distributions such as arm64, the page size of the kernel is 64KB, and the size of THP is too huge at 512MB, making it difficult to use. HugeTLBfs may support multiple huge page sizes, and in such a special environment there is a desire to use HugeTLBfs. >>> >>> One of the basic questions/concerns I have is accounting for surplus huge >>> pages in the default memory resource controller. The existing hugetlb >>> resource controller already takes hugetlbfs huge pages into account, >>> including surplus pages. This series would allow surplus pages to be >>> accounted for in the default memory controller, or the hugetlb controller >>> or both. >>> >>> I understand that current mechanisms do not meet the needs of the above >>> use case. The question is whether this is an appropriate way to approach >>> the issue. > > I do share your view Mike! > >>> My cgroup experience and knowledge is extremely limited, but >>> it does not appear that any other resource can be controlled by multiple >>> controllers. Therefore, I am concerned that this may be going against >>> basic cgroup design philosophy. >> >> Thank you for your feedback. >> That makes sense, surplus hugepages are charged to both memcg and hugetlb >> cgroup, which may be contrary to cgroup design philosophy. >> >> Based on the above advice, I have considered the following improvements, >> what do you think about it? >> >> The 'charge_surplus_hugepages' of v2 patch-set was an option to switch >> "whether to charge memcg in addition to hugetlb cgroup", but it will be >> abolished. Instead, change to "switch only to memcg instead of hugetlb >> cgroup" option. This is called 'surplus_charge_to_memcg'. > > This all looks so hackish and ad-hoc that I would be tempted to give it > an outright nack, but let's hear more about why we need this fiddling > at all. 
I've asked in another email so I guess I will get an answer there > but let me just emphasize again that I absolutely detest a possibility > to put hugetlb pages into the memcg mix. They just do not belong there. > Try to look at previous discussions why it has been decided to have a > separate hugetlb controller at all. > > I am also quite confused why you keep distinguishing surplus hugetlb > pages from regular preallocated ones. Being a surplus page is an > implementation detail that we use for an internal accounting rather than > something to exhibit to the userspace even more than we do currently. I apologize for having confused you. The hugetlb pages obtained from the pool do not waste the buddy pool. On the other hand, surplus hugetlb pages waste the buddy pool. Due to this difference in property, I thought it could be distinguished. Although my memcg knowledge is extremely limited, memcg is accounting for various kinds of pages obtained from the buddy pool by the task belonging to it. I would like to argue that surplus hugepage has
Re: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg
On 2018/05/22 22:51, Michal Hocko wrote: > On Fri 18-05-18 13:27:27, TSUKADA Koutaro wrote: >> The purpose of this patch-set is to make it possible to control whether or >> not to charge surplus hugetlb pages obtained by overcommitting to memory >> cgroup. In the future, I am trying to accomplish limiting the memory usage >> of applications that use both normal pages and hugetlb pages by the memory >> cgroup (not the hugetlb cgroup). > > There was a deliberate decision to keep hugetlb and "normal" memory > cgroup controllers separate. Mostly because hugetlb memory is an > artificial memory subsystem on its own and it doesn't fit into the rest > of memcg accounted memory very well. I believe we want to keep that > status quo. > >> Applications that use shared libraries like libhugetlbfs.so use both normal >> pages and hugetlb pages, but we do not know how much to use each. Please >> suppose you want to manage the memory usage of such applications by cgroup. >> How do you set the memory cgroup and hugetlb cgroup limit when you want to >> limit memory usage to 10GB? > > Well such a usecase requires an explicit configuration already. Either > by using special wrappers or modifying the code. So I would argue that > you have quite a good knowledge of the setup. If you need a greater > flexibility then just do not use hugetlb at all and rely on THP. > [...] > >> In this patch-set, introduce the charge_surplus_huge_pages (boolean) to >> struct hstate. If it is true, it charges to the memory cgroup to which the >> task that obtained surplus hugepages belongs. If it is false, do nothing as >> before, and the default value is false. The charge_surplus_huge_pages can >> be controlled via procfs or sysfs interfaces. > > I do not really think this is a good idea. We really do not want to make > the current hugetlb code more complex than it is already. The current > hugetlb cgroup controller is simple and works at least somehow. 
I would > not add more on top unless there is a _really_ strong usecase behind. > Please make sure to describe such a usecase in detail before we even > start considering the code. Thank you for your time. I do not know if it is really a strong use case, but I will explain my motive in detail. English is not my native language, so please pardon my poor English. I am one of the developers of software that manages the resources used by user jobs on Linux HPC clusters. The resource is mainly memory. The HPC cluster may be shared and used by multiple people. Therefore, the memory used by each user must be strictly controlled, otherwise a user's job may run away; not only will it hamper the other users, it may crash the entire system in OOM. Some users of HPC are very nervous about performance. Jobs are executed while synchronizing with MPI communication using multiple compute nodes. Since CPU wait time will occur when synchronizing, they want to minimize the variation in execution time at each node to reduce waiting times as much as possible. We call this variation a noise. THP does not guarantee to use the Huge Page, but may use the normal page. This mechanism is one cause of variation (noise). The users who know this mechanism will be hesitant to use THP. However, the users also know the benefits of the Huge Page's TLB hit rate performance, and the Huge Page seems to be attractive. It seems natural that these users are interested in HugeTLBfs; I do not know at all whether it is the right approach or not. At the very least, our HPC system is pursuing high versatility and we have to consider whether we can provide it if users want to use HugeTLBfs. In order to use HugeTLBfs we need to create a persistent pool, but in our use case of sharing nodes, it would be impossible to create, delete or resize the pool. One of the answers I have reached is to use HugeTLBfs by overcommitting without creating a pool (this is the surplus hugepage). 
Surplus hugepages are hugetlb pages, but I think that consuming the buddy pool is a decisive difference from hugetlb pages of the persistent pool. If nr_overcommit_hugepages is assumed to be infinite, allocating pages for surplus hugepages from the buddy pool is effectively unlimited even if limited by memcg. In extreme cases, overcommitment will allow users to exhaust the entire memory of the system. Of course, this can be prevented by the hugetlb cgroup, but even if we set the limit for memcg and hugetlb cgroup respectively, as I asked in the first mail (set limit to 10GB), the control will not work. I thought I could charge surplus hugepages to memcg, but maybe I did not have enough knowledge about memcg. I would like to reply to another mail for details. >> Since THP is very effective in environments with kernel page size of 4KB, >> such as x86, there is no reason to positively use HugeTLBfs, so I think >> that there is no situation to enable charge_surplus_huge_pages. However, in >> some distributions such as arm64, the page size of the kernel is 64KB, and
Re: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg
On 05/22/2018 06:04 AM, TSUKADA Koutaro wrote: > > I stared at the commit log of mm/hugetlb_cgroup.c, but it did not seem to > have specially considered surplus hugepages. Later, I will send a mail > to hugetlb cgroup's committer to ask about surplus hugepages charge > specifications. > I went back and looked at surplus huge page allocation. Previously, I made a statement that the hugetlb controller accounts for surplus huge pages. Turns out that may not be 100% correct. Thanks to Michal, all surplus huge page allocation is performed via the alloc_surplus_huge_page() routine. This will ultimately call into the buddy allocator without any cgroup charges. Calls to alloc_surplus_huge_page are made from: - alloc_huge_page() when allocating a huge page to a mapping/file. In this case, appropriate calls to the hugetlb controller are in place. So, any limits are enforced here. - gather_surplus_pages() when allocating and setting aside 'reserved' huge pages. No accounting is performed here. Do note that in this case the allocated huge pages are not assigned to the mapping/file. Even though 'reserved', they are deposited into the global pool and also counted as 'free'. When these reserved pages are ultimately used to populate a file/mapping, the code path goes through alloc_huge_page() where appropriate calls to the hugetlb controller are in place. So, the bottom line is that surplus huge pages are not accounted for when they are allocated as 'reserves'. It is not until these reserves are actually used that accounting limits are checked. This 'seems' to align with general allocation of huge pages within the pool. No accounting is done until they are actually allocated to a mapping/file. -- Mike Kravetz
Re: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg
On Tue 22-05-18 22:04:23, TSUKADA Koutaro wrote: > On 2018/05/22 3:07, Mike Kravetz wrote: > > On 05/17/2018 09:27 PM, TSUKADA Koutaro wrote: > >> Thanks to Mike Kravetz for comment on the previous version patch. > >> > >> The purpose of this patch-set is to make it possible to control whether or > >> not to charge surplus hugetlb pages obtained by overcommitting to memory > >> cgroup. In the future, I am trying to accomplish limiting the memory usage > >> of applications that use both normal pages and hugetlb pages by the memory > >> cgroup (not the hugetlb cgroup). > >> > >> Applications that use shared libraries like libhugetlbfs.so use both normal > >> pages and hugetlb pages, but we do not know how much to use each. Please > >> suppose you want to manage the memory usage of such applications by cgroup. > >> How do you set the memory cgroup and hugetlb cgroup limit when you want to > >> limit memory usage to 10GB? > >> > >> If you set a limit of 10GB for each, the user can use a total of 20GB of > >> memory and can not limit it well. Since it is difficult to estimate the > >> ratio used by user of normal pages and hugetlb pages, setting limits of 2GB > >> to memory cgroup and 8GB to hugetlb cgroup is not a very good idea. In such a > >> case, I thought that by using my patch-set, we could manage resources just > >> by setting 10GB as the limit of memory cgroup (there is no limit on the hugetlb > >> cgroup). > >> > >> In this patch-set, introduce the charge_surplus_huge_pages (boolean) to > >> struct hstate. If it is true, it charges to the memory cgroup to which the > >> task that obtained surplus hugepages belongs. If it is false, do nothing as > >> before, and the default value is false. The charge_surplus_huge_pages can > >> be controlled via procfs or sysfs interfaces. 
> >> > >> Since THP is very effective in environments with kernel page size of 4KB, > >> such as x86, there is no reason to positively use HugeTLBfs, so I think > >> that there is no situation to enable charge_surplus_huge_pages. However, in > >> some distributions such as arm64, the page size of the kernel is 64KB, and > >> the size of THP is too huge at 512MB, making it difficult to use. HugeTLBfs > >> may support multiple huge page sizes, and in such a special environment > >> there is a desire to use HugeTLBfs. > > > > One of the basic questions/concerns I have is accounting for surplus huge > > pages in the default memory resource controller. The existing hugetlb > > resource controller already takes hugetlbfs huge pages into account, > > including surplus pages. This series would allow surplus pages to be > > accounted for in the default memory controller, or the hugetlb controller > > or both. > > > > I understand that current mechanisms do not meet the needs of the above > > use case. The question is whether this is an appropriate way to approach > > the issue. I do share your view Mike! > > My cgroup experience and knowledge is extremely limited, but > > it does not appear that any other resource can be controlled by multiple > > controllers. Therefore, I am concerned that this may be going against > > basic cgroup design philosophy. > > Thank you for your feedback. > That makes sense, surplus hugepages are charged to both memcg and hugetlb > cgroup, which may be contrary to cgroup design philosophy. > > Based on the above advice, I have considered the following improvements, > what do you think about it? > > The 'charge_surplus_hugepages' of v2 patch-set was an option to switch > "whether to charge memcg in addition to hugetlb cgroup", but it will be > abolished. Instead, change to "switch only to memcg instead of hugetlb > cgroup" option. This is called 'surplus_charge_to_memcg'. 
This all looks so hackish and ad-hoc that I would be tempted to give it
an outright nack, but let's hear more about why we need this fiddling at
all. I've asked in another email so I guess I will get an answer there,
but let me just emphasize again that I absolutely detest the possibility
of putting hugetlb pages into the memcg mix. They just do not belong
there. Try to look at previous discussions on why it was decided to have
a separate controller for hugetlb pages at all.

I am also quite confused why you keep distinguishing surplus hugetlb
pages from regular preallocated ones. Being a surplus page is an
implementation detail that we use for internal accounting rather than
something to exhibit to userspace even more than we do currently. Just
look at what should happen when you need to adjust accounting - e.g. due
to a pool resize. Are you going to uncharge those surplus pages from
memcg to reflect their persistence?
-- 
Michal Hocko
SUSE Labs
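For reference, the pool-resize scenario raised above can be observed with the standard hugetlb interfaces. The sketch below only reads `/proc` (so it is safe anywhere), and the arithmetic at the end illustrates the mechanic in question: growing the persistent pool absorbs existing surplus pages rather than allocating new ones, at which point a page charged as "surplus" silently becomes persistent.

```shell
# Current persistent pool size and surplus count (read-only, standard paths).
np=/proc/sys/vm/nr_hugepages
[ -r "$np" ] && cat "$np"
grep -E 'HugePages_(Total|Surp)' /proc/meminfo 2>/dev/null || true

# Illustrative arithmetic: with 16 surplus pages in use, growing the
# persistent pool by 64 pages absorbs all 16 of them -- they stop being
# "surplus" without ever being freed or re-allocated.
surp=16
grow=64
absorbed=$surp
[ "$grow" -lt "$surp" ] && absorbed=$grow
echo "surplus pages converted to persistent: $absorbed"
```

This is exactly why per-memcg charging of surplus pages raises the uncharge question: the surplus/persistent distinction can change under the application's feet.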
Re: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg
On Fri 18-05-18 13:27:27, TSUKADA Koutaro wrote:
> Thanks to Mike Kravetz for comment on the previous version patch.

I am sorry that I didn't join the discussion for the previous version but
time just didn't allow that. So sorry if I am repeating something already
sorted out.

> The purpose of this patch-set is to make it possible to control whether or
> not to charge surplus hugetlb pages obtained by overcommitting to memory
> cgroup. In the future, I am trying to accomplish limiting the memory usage
> of applications that use both normal pages and hugetlb pages by the memory
> cgroup(not use the hugetlb cgroup).

There was a deliberate decision to keep hugetlb and "normal" memory
cgroup controllers separate. Mostly because hugetlb memory is an
artificial memory subsystem on its own and it doesn't fit into the rest
of memcg accounted memory very well. I believe we want to keep that
status quo.

> Applications that use shared libraries like libhugetlbfs.so use both normal
> pages and hugetlb pages, but we do not know how much to use each. Please
> suppose you want to manage the memory usage of such applications by cgroup.
> How do you set the memory cgroup and hugetlb cgroup limit when you want to
> limit memory usage to 10GB?

Well, such a usecase requires an explicit configuration already, either
by using special wrappers or by modifying the code. So I would argue that
you already have quite good knowledge of the setup. If you need greater
flexibility then just do not use hugetlb at all and rely on THP.

[...]

> In this patch-set, introduce the charge_surplus_huge_pages(boolean) to
> struct hstate. If it is true, it charges to the memory cgroup to which the
> task that obtained surplus hugepages belongs. If it is false, do nothing as
> before, and the default value is false. The charge_surplus_huge_pages can
> be controlled via procfs or sysfs interfaces.

I do not really think this is a good idea.

We really do not want to make the current hugetlb code more complex than
it already is. The current hugetlb cgroup controller is simple and works
at least somehow. I would not add more on top unless there is a _really_
strong usecase behind it. Please make sure to describe such a usecase in
detail before we even start considering the code.

> Since THP is very effective in environments with kernel page size of 4KB,
> such as x86, there is no reason to positively use HugeTLBfs, so I think
> that there is no situation to enable charge_surplus_huge_pages. However, in
> some distributions such as arm64, the page size of the kernel is 64KB, and
> the size of THP is too huge as 512MB, making it difficult to use. HugeTLBfs
> may support multiple huge page sizes, and in such a special environment
> there is a desire to use HugeTLBfs.

Well, then I would argue that you shouldn't use 64kB pages for your
setup, or should allow THP for smaller sizes. Really, hugetlb pages are
by no means a substitute here.
-- 
Michal Hocko
SUSE Labs
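The 512MB figure quoted for arm64 follows directly from the base page size: a PMD-sized THP spans one full page table's worth of base pages. A small sketch of that arithmetic (the sysfs path is the standard THP location; whether it exists depends on kernel config):

```shell
# The active THP policy is the bracketed value (e.g. [madvise]).
thp=/sys/kernel/mm/transparent_hugepage/enabled
[ -r "$thp" ] && cat "$thp"

# A PMD-sized THP covers one full page table of base pages; with 8-byte
# page table entries that is (page_size / 8) entries of page_size each.
pte_bytes=8
thp_mb_64=$(( (64 * 1024 / pte_bytes) * 64 / 1024 ))   # 64KB base pages
thp_mb_4=$((  (4  * 1024 / pte_bytes) * 4  / 1024 ))   # 4KB base pages
echo "64KB base pages -> ${thp_mb_64}MB THP"   # the unwieldy arm64 case
echo "4KB base pages  -> ${thp_mb_4}MB THP"    # the common x86 case
```

The same arithmetic shows why "allow THP for smaller sizes" is the natural counter-proposal: the huge-page granularity is a consequence of the base page size, not of hugetlbfs versus THP.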
Re: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg
On 2018/05/22 3:07, Mike Kravetz wrote:
> On 05/17/2018 09:27 PM, TSUKADA Koutaro wrote:
>> Thanks to Mike Kravetz for comment on the previous version patch.
>>
>> The purpose of this patch-set is to make it possible to control whether or
>> not to charge surplus hugetlb pages obtained by overcommitting to memory
>> cgroup. In the future, I am trying to accomplish limiting the memory usage
>> of applications that use both normal pages and hugetlb pages by the memory
>> cgroup(not use the hugetlb cgroup).
>>
>> Applications that use shared libraries like libhugetlbfs.so use both normal
>> pages and hugetlb pages, but we do not know how much to use each. Please
>> suppose you want to manage the memory usage of such applications by cgroup.
>> How do you set the memory cgroup and hugetlb cgroup limit when you want to
>> limit memory usage to 10GB?
>>
>> If you set a limit of 10GB for each, the user can use a total of 20GB of
>> memory and can not limit it well. Since it is difficult to estimate the
>> ratio used by user of normal pages and hugetlb pages, setting limits of 2GB
>> to memory cgroup and 8GB to hugetlb cgroup is not a very good idea. In such a
>> case, I thought that by using my patch-set, we could manage resources just
>> by setting 10GB as the limit of memory cgroup(there is no limit to hugetlb
>> cgroup).
>>
>> In this patch-set, introduce the charge_surplus_huge_pages(boolean) to
>> struct hstate. If it is true, it charges to the memory cgroup to which the
>> task that obtained surplus hugepages belongs. If it is false, do nothing as
>> before, and the default value is false. The charge_surplus_huge_pages can
>> be controlled via procfs or sysfs interfaces.
>>
>> Since THP is very effective in environments with kernel page size of 4KB,
>> such as x86, there is no reason to positively use HugeTLBfs, so I think
>> that there is no situation to enable charge_surplus_huge_pages. However, in
>> some distributions such as arm64, the page size of the kernel is 64KB, and
>> the size of THP is too huge as 512MB, making it difficult to use. HugeTLBfs
>> may support multiple huge page sizes, and in such a special environment
>> there is a desire to use HugeTLBfs.
>
> One of the basic questions/concerns I have is accounting for surplus huge
> pages in the default memory resource controller. The existing hugetlb
> resource controller already takes hugetlbfs huge pages into account,
> including surplus pages. This series would allow surplus pages to be
> accounted for in the default memory controller, or the hugetlb controller,
> or both.
>
> I understand that current mechanisms do not meet the needs of the above
> use case. The question is whether this is an appropriate way to approach
> the issue. My cgroup experience and knowledge is extremely limited, but
> it does not appear that any other resource can be controlled by multiple
> controllers. Therefore, I am concerned that this may be going against
> basic cgroup design philosophy.

Thank you for your feedback.
That makes sense: surplus hugepages are charged to both memcg and hugetlb
cgroup, which may be contrary to cgroup design philosophy.

Based on the above advice, I have considered the following improvements;
what do you think about them?

The 'charge_surplus_hugepages' of the v2 patch-set was an option to switch
"whether to charge memcg in addition to hugetlb cgroup", but it will be
abolished. Instead, change to a "charge only to memcg instead of hugetlb
cgroup" option. This is called 'surplus_charge_to_memcg'.

The surplus_charge_to_memcg option is created per hugetlb cgroup. If it is
false(default), the charge destination cgroup of the various page types is
the same as in the current kernel version. If it becomes true, hugetlb
cgroup stops accounting for surplus hugepages, and memcg starts accounting
instead.

A table showing which cgroups are charged:

  page types          | current  v2(off)  v2(on)  v3(off)  v3(on)
  --------------------+--------------------------------------------
  normal + THP        |    m        m        m       m        m
  hugetlb(persistent) |    h        h        h       h        h
  hugetlb(surplus)    |    h        h       m+h      h        m

  v2: charge_surplus_hugepages option
  v3: next version, surplus_charge_to_memcg option
  m: memory cgroup
  h: hugetlb cgroup

> It would be good to get comments from people more cgroup knowledgeable,
> and especially from those involved in the decision to do separate hugetlb
> control.

I stared at the commit log of mm/hugetlb_cgroup.c, but it did not seem to
give special consideration to surplus hugepages. Later, I will send a mail
to the hugetlb cgroup's committer to ask about the surplus hugepage charge
specification.
-- 
Thanks,
Tsukada
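The surplus pages in the table's last row come from the hugetlb overcommit interface. A minimal sketch of how they are enabled and observed, using the standard sysctl and `/proc/meminfo` fields (the write is guarded, so this is a no-op without privileges; the 512-page figure is just an example):

```shell
# Surplus hugepages can only appear when overcommit is allowed:
oc=/proc/sys/vm/nr_overcommit_hugepages
[ -w "$oc" ] && echo 512 > "$oc"    # allow up to 512 surplus pages

# HugePages_Surp counts pages currently allocated beyond the persistent
# pool -- exactly the pages whose charging destination v2/v3 change.
grep -E 'HugePages_(Total|Free|Surp)' /proc/meminfo 2>/dev/null || true

# Worst-case memory the overcommit window adds on top of the pool:
page_kb=2048       # 2MB hugepage (the x86-64 default hugetlb size)
max_surplus=512
window_mb=$((max_surplus * page_kb / 1024))
echo "overcommit window: ${window_mb}MB"
```

That window is the quantity being argued over: under v3(on) it would count against the memcg limit together with normal pages, instead of against the hugetlb cgroup limit.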
Re: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg
Hi Punit,

On 2018/05/21 23:52, Punit Agrawal wrote:
> Hi Tsukada,
>
> I was staring at memcg code to better understand your changes and had
> the below thought.
>
> TSUKADA Koutaro writes:
>
> [...]
>
>> In this patch-set, introduce the charge_surplus_huge_pages(boolean) to
>> struct hstate. If it is true, it charges to the memory cgroup to which the
>> task that obtained surplus hugepages belongs. If it is false, do nothing as
>> before, and the default value is false. The charge_surplus_huge_pages can
>> be controlled via procfs or sysfs interfaces.
>
> Instead of tying the surplus huge page charging control per-hstate,
> could the control be made per-memcg?
>
> This can be done by introducing a per-memory controller file in sysfs
> (memory.charge_surplus_hugepages?) that indicates whether surplus
> hugepages are to be charged to the controller and forms part of the
> total limit. IIUC, the limit already accounts for page and swap cache
> pages.
>
> This would allow the control to be enabled per-cgroup and also keep the
> userspace control interface in one place.
>
> As said earlier, I'm not familiar with memcg so the above might not be
> feasible, but I think it'll lead to a more coherent user
> interface. Hopefully, more knowledgeable folks on the thread can chime
> in.

Thank you for the good advice. As you mentioned, it is better to be able
to control this per-memcg. After organizing my thoughts, I will develop
the next version of the patch-set to solve these issues and challenge
again.

Thanks,
Tsukada

> Thanks,
> Punit
>
>> Since THP is very effective in environments with kernel page size of 4KB,
>> such as x86, there is no reason to positively use HugeTLBfs, so I think
>> that there is no situation to enable charge_surplus_huge_pages. However, in
>> some distributions such as arm64, the page size of the kernel is 64KB, and
>> the size of THP is too huge as 512MB, making it difficult to use. HugeTLBfs
>> may support multiple huge page sizes, and in such a special environment
>> there is a desire to use HugeTLBfs.
>>
>> The patch set is for 4.17.0-rc3+. I don't know whether the patch-set is
>> acceptable or not, so I have just done a simple test.
>>
>> Thanks,
>> Tsukada
>>
>> TSUKADA Koutaro (7):
>>   hugetlb: introduce charge_surplus_huge_pages to struct hstate
>>   hugetlb: supports migrate charging for surplus hugepages
>>   memcg: use compound_order rather than hpage_nr_pages
>>   mm, sysctl: make charging surplus hugepages controllable
>>   hugetlb: add charge_surplus_hugepages attribute
>>   Documentation, hugetlb: describe about charge_surplus_hugepages
>>   memcg: supports movement of surplus hugepages statistics
>>
>>  Documentation/vm/hugetlbpage.txt |   6 +
>>  include/linux/hugetlb.h          |   4 +
>>  kernel/sysctl.c                  |   7 +
>>  mm/hugetlb.c                     | 148 +++
>>  mm/memcontrol.c                  | 109 +++-
>>  5 files changed, 269 insertions(+), 5 deletions(-)
Re: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg
On 05/17/2018 09:27 PM, TSUKADA Koutaro wrote:
> Thanks to Mike Kravetz for comment on the previous version patch.
>
> The purpose of this patch-set is to make it possible to control whether or
> not to charge surplus hugetlb pages obtained by overcommitting to memory
> cgroup. In the future, I am trying to accomplish limiting the memory usage
> of applications that use both normal pages and hugetlb pages by the memory
> cgroup(not use the hugetlb cgroup).
>
> Applications that use shared libraries like libhugetlbfs.so use both normal
> pages and hugetlb pages, but we do not know how much to use each. Please
> suppose you want to manage the memory usage of such applications by cgroup.
> How do you set the memory cgroup and hugetlb cgroup limit when you want to
> limit memory usage to 10GB?
>
> If you set a limit of 10GB for each, the user can use a total of 20GB of
> memory and can not limit it well. Since it is difficult to estimate the
> ratio used by user of normal pages and hugetlb pages, setting limits of 2GB
> to memory cgroup and 8GB to hugetlb cgroup is not a very good idea. In such a
> case, I thought that by using my patch-set, we could manage resources just
> by setting 10GB as the limit of memory cgroup(there is no limit to hugetlb
> cgroup).
>
> In this patch-set, introduce the charge_surplus_huge_pages(boolean) to
> struct hstate. If it is true, it charges to the memory cgroup to which the
> task that obtained surplus hugepages belongs. If it is false, do nothing as
> before, and the default value is false. The charge_surplus_huge_pages can
> be controlled via procfs or sysfs interfaces.
>
> Since THP is very effective in environments with kernel page size of 4KB,
> such as x86, there is no reason to positively use HugeTLBfs, so I think
> that there is no situation to enable charge_surplus_huge_pages. However, in
> some distributions such as arm64, the page size of the kernel is 64KB, and
> the size of THP is too huge as 512MB, making it difficult to use. HugeTLBfs
> may support multiple huge page sizes, and in such a special environment
> there is a desire to use HugeTLBfs.

One of the basic questions/concerns I have is accounting for surplus huge
pages in the default memory resource controller. The existing hugetlb
resource controller already takes hugetlbfs huge pages into account,
including surplus pages. This series would allow surplus pages to be
accounted for in the default memory controller, or the hugetlb controller,
or both.

I understand that current mechanisms do not meet the needs of the above
use case. The question is whether this is an appropriate way to approach
the issue. My cgroup experience and knowledge is extremely limited, but
it does not appear that any other resource can be controlled by multiple
controllers. Therefore, I am concerned that this may be going against
basic cgroup design philosophy.

It would be good to get comments from people more cgroup knowledgeable,
and especially from those involved in the decision to do separate hugetlb
control.
-- 
Mike Kravetz

> The patch set is for 4.17.0-rc3+. I don't know whether the patch-set is
> acceptable or not, so I have just done a simple test.
>
> Thanks,
> Tsukada
>
> TSUKADA Koutaro (7):
>   hugetlb: introduce charge_surplus_huge_pages to struct hstate
>   hugetlb: supports migrate charging for surplus hugepages
>   memcg: use compound_order rather than hpage_nr_pages
>   mm, sysctl: make charging surplus hugepages controllable
>   hugetlb: add charge_surplus_hugepages attribute
>   Documentation, hugetlb: describe about charge_surplus_hugepages
>   memcg: supports movement of surplus hugepages statistics
>
>  Documentation/vm/hugetlbpage.txt |   6 +
>  include/linux/hugetlb.h          |   4 +
>  kernel/sysctl.c                  |   7 +
>  mm/hugetlb.c                     | 148 +++
>  mm/memcontrol.c                  | 109 +++-
>  5 files changed, 269 insertions(+), 5 deletions(-)
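The double-limit problem from the cover letter can be made concrete with the existing cgroup v1 interfaces. In the sketch below, the mount points and the `app` group name are assumptions for illustration, and the writes are guarded so it is a no-op where those groups do not exist:

```shell
# The 10GB scenario: with separate controllers, each limit is enforced
# independently, so a task may consume up to the SUM of the two limits.
limit_gb=10
limit_bytes=$((limit_gb * 1024 * 1024 * 1024))

memcg=/sys/fs/cgroup/memory/app      # assumed cgroup v1 group
hugecg=/sys/fs/cgroup/hugetlb/app    # assumed cgroup v1 group

[ -d "$memcg" ]  && echo "$limit_bytes" > "$memcg/memory.limit_in_bytes"
[ -d "$hugecg" ] && echo "$limit_bytes" > "$hugecg/hugetlb.2MB.limit_in_bytes"

# Worst case under the split limits: 10GB of normal pages charged to
# memcg plus 10GB of hugetlb pages charged to the hugetlb controller.
worst_gb=$((2 * limit_gb))
echo "worst-case combined usage: ${worst_gb}GB"
```

This is the gap the series tries to close: neither controller alone can bound the application's total footprint when the normal/hugetlb split is unknown in advance.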
Re: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg
On 05/17/2018 09:27 PM, TSUKADA Koutaro wrote:
> Thanks to Mike Kravetz for comment on the previous version patch.
>
> The purpose of this patch-set is to make it possible to control whether or
> not to charge surplus hugetlb pages obtained by overcommitting to the
> memory cgroup. In the future, I am trying to accomplish limiting the
> memory usage of applications that use both normal pages and hugetlb pages
> by the memory cgroup (not the hugetlb cgroup).
>
> Applications that use shared libraries like libhugetlbfs.so use both
> normal pages and hugetlb pages, but we do not know how much of each they
> use. Suppose you want to manage the memory usage of such applications by
> cgroup. How do you set the memory cgroup and hugetlb cgroup limits when
> you want to limit memory usage to 10GB?
>
> If you set a limit of 10GB for each, the user can use a total of 20GB of
> memory, so the limit is not effective. Since it is difficult to estimate
> the ratio of normal pages to hugetlb pages a user will need, setting
> limits of 2GB for the memory cgroup and 8GB for the hugetlb cgroup is not
> a very good idea either. In such a case, I thought that by using my
> patch-set, we could manage resources just by setting 10GB as the limit of
> the memory cgroup (with no limit on the hugetlb cgroup).
>
> In this patch-set, introduce charge_surplus_huge_pages (boolean) in
> struct hstate. If it is true, surplus hugepages are charged to the memory
> cgroup to which the task that obtained them belongs. If it is false,
> nothing is done, as before, and the default value is false.
> charge_surplus_huge_pages can be controlled via procfs or sysfs
> interfaces.
>
> Since THP is very effective in environments with a kernel page size of
> 4KB, such as x86, there is no reason to prefer HugeTLBfs, so I think
> there is no situation in which to enable charge_surplus_huge_pages.
> However, on some architectures such as arm64, the page size of the kernel
> is 64KB, and the THP size of 512MB is too huge to be usable. HugeTLBfs
> may support multiple huge page sizes, and in such a special environment
> there is a desire to use HugeTLBfs.

One of the basic questions/concerns I have is accounting for surplus huge
pages in the default memory resource controller. The existing hugetlb
resource controller already takes hugetlbfs huge pages into account,
including surplus pages. This series would allow surplus pages to be
accounted for in the default memory controller, the hugetlb controller,
or both.

I understand that current mechanisms do not meet the needs of the above
use case. The question is whether this is an appropriate way to approach
the issue. My cgroup experience and knowledge is extremely limited, but it
does not appear that any other resource can be controlled by multiple
controllers. Therefore, I am concerned that this may be going against
basic cgroup design philosophy.

It would be good to get comments from people more cgroup knowledgeable,
and especially from those involved in the decision to do separate hugetlb
control.
-- 
Mike Kravetz

> The patch set is for 4.17.0-rc3+. I don't know whether the patch-set is
> acceptable or not, so I have only done a simple test.
>
> Thanks,
> Tsukada
>
> TSUKADA Koutaro (7):
>   hugetlb: introduce charge_surplus_huge_pages to struct hstate
>   hugetlb: supports migrate charging for surplus hugepages
>   memcg: use compound_order rather than hpage_nr_pages
>   mm, sysctl: make charging surplus hugepages controllable
>   hugetlb: add charge_surplus_hugepages attribute
>   Documentation, hugetlb: describe about charge_surplus_hugepages
>   memcg: supports movement of surplus hugepages statistics
>
>  Documentation/vm/hugetlbpage.txt |   6 +
>  include/linux/hugetlb.h          |   4 +
>  kernel/sysctl.c                  |   7 +
>  mm/hugetlb.c                     | 148 +++
>  mm/memcontrol.c                  | 109 +++-
>  5 files changed, 269 insertions(+), 5 deletions(-)
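For concreteness, the double-accounting problem described in the quoted cover letter can be reproduced with existing interfaces today. The sketch below is illustrative only; it assumes a 2MB default huge page size, a cgroup-v1 layout mounted under /sys/fs/cgroup, and a hypothetical group named "app":

```shell
# Create a group in both the memory and the hugetlb controllers
# (paths assume a typical cgroup-v1 mount layout).
mkdir /sys/fs/cgroup/memory/app
mkdir /sys/fs/cgroup/hugetlb/app

# Limit each controller to 10GB. Because the two controllers account
# independently, a task in "app" may consume up to 10GB of normal pages
# PLUS 10GB of hugetlb pages -- 20GB in total.
echo 10G > /sys/fs/cgroup/memory/app/memory.limit_in_bytes
echo 10G > /sys/fs/cgroup/hugetlb/app/hugetlb.2MB.limit_in_bytes

# Allow up to 5120 surplus 2MB hugetlb pages (10GB) to be allocated on
# demand from the buddy allocator, i.e. overcommit without a persistent
# pool -- these are the surplus pages the series wants to charge to memcg.
echo 5120 > /proc/sys/vm/nr_overcommit_hugepages
```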
Re: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg
Hi Tsukada,

I was staring at memcg code to better understand your changes and had the
below thought.

TSUKADA Koutaro writes:

[...]

> In this patch-set, introduce charge_surplus_huge_pages (boolean) in
> struct hstate. If it is true, surplus hugepages are charged to the memory
> cgroup to which the task that obtained them belongs. If it is false,
> nothing is done, as before, and the default value is false.
> charge_surplus_huge_pages can be controlled via procfs or sysfs
> interfaces.

Instead of tying the surplus huge page charging control to the hstate,
could the control be made per-memcg? This can be done by introducing a
per-memory-controller file (memory.charge_surplus_hugepages?) that
indicates whether surplus hugepages are to be charged to the controller
and form part of the total limit. IIUC, the limit already accounts for
page cache and swap cache pages.

This would allow the control to be enabled per-cgroup and would also keep
the userspace control interface in one place.

As said earlier, I'm not familiar with memcg, so the above might not be
feasible, but I think it'll lead to a more coherent user interface.
Hopefully, more knowledgeable folks on the thread can chime in.

Thanks,
Punit

> Since THP is very effective in environments with a kernel page size of
> 4KB, such as x86, there is no reason to prefer HugeTLBfs, so I think
> there is no situation in which to enable charge_surplus_huge_pages.
> However, on some architectures such as arm64, the page size of the kernel
> is 64KB, and the THP size of 512MB is too huge to be usable. HugeTLBfs
> may support multiple huge page sizes, and in such a special environment
> there is a desire to use HugeTLBfs.
>
> The patch set is for 4.17.0-rc3+. I don't know whether the patch-set is
> acceptable or not, so I have only done a simple test.
>
> Thanks,
> Tsukada
>
> TSUKADA Koutaro (7):
>   hugetlb: introduce charge_surplus_huge_pages to struct hstate
>   hugetlb: supports migrate charging for surplus hugepages
>   memcg: use compound_order rather than hpage_nr_pages
>   mm, sysctl: make charging surplus hugepages controllable
>   hugetlb: add charge_surplus_hugepages attribute
>   Documentation, hugetlb: describe about charge_surplus_hugepages
>   memcg: supports movement of surplus hugepages statistics
>
>  Documentation/vm/hugetlbpage.txt |   6 +
>  include/linux/hugetlb.h          |   4 +
>  kernel/sysctl.c                  |   7 +
>  mm/hugetlb.c                     | 148 +++
>  mm/memcontrol.c                  | 109 +++-
>  5 files changed, 269 insertions(+), 5 deletions(-)
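To make the two interface shapes under discussion concrete: the sketch below contrasts Punit's suggested per-memcg knob with the per-hstate control from the posted series. Both paths are hypothetical — memory.charge_surplus_hugepages is only a name floated in the reply above, and the sysctl path is an assumption based on the "mm, sysctl" patch title; neither exists in any released kernel:

```shell
# Punit's proposal: a per-memcg file, so charging of surplus hugetlb
# pages can be enabled for one cgroup only (hypothetical path/name,
# cgroup-v1 layout assumed).
echo 1 > /sys/fs/cgroup/memory/app/memory.charge_surplus_hugepages

# The posted series instead flips the behaviour globally per hstate,
# e.g. via a sysctl (exact path assumed from the patch title
# "mm, sysctl: make charging surplus hugepages controllable").
echo 1 > /proc/sys/vm/charge_surplus_hugepages
```

The trade-off Punit points at: the first form keeps all memcg policy under the memory controller's own directory, while the second mixes a memcg accounting decision into the hugetlb-wide tunables.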