Re: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg

2018-05-24 Thread TSUKADA Koutaro
On 2018/05/25 2:45, Mike Kravetz wrote:
[...]
>> THP does not guarantee the use of huge pages and may fall back to normal pages.
> 
> Note.  You do not want to use THP because "THP does not guarantee".

[...]
>> One of the answers I have reached is to use HugeTLBfs by overcommitting
>> without creating a pool (this is the surplus hugepage).
> 
> Using hugetlbfs overcommit also does not provide a guarantee.  Without
> doing much research, I would say the failure rate for obtaining a huge
> page via THP and hugetlbfs overcommit is about the same.  The most
> difficult issue in both cases will be obtaining a "huge page" number of
> pages from the buddy allocator.

Yes. If multiple hugetlb page sizes are not used, as on x86, the number of
pages needed for a THP and for a hugetlb page is the same, so the failure
rate of obtaining a compound page is the same, as you said.

> I really do not think hugetlbfs overcommit will provide any benefit over
> THP for your use case.

I think that what you say is absolutely right.

>  Also, new user space code is required to "fall back"
> to normal pages in the case of hugetlbfs page allocation failure.  This
> is not needed in the THP case.

I understand the advantages of THP, but there are cases where khugepaged
occupies the CPU due to page fragmentation. If, instead of overcommitting,
a persistent pool is set up once, I think that hugetlb can be superior,
for example with memory allocation performance exceeding that of THP. I
will try to find a good way to use hugetlb pages.
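
For illustration, a minimal sketch of sizing such a persistent pool once
from a privileged setup step (this assumes the default 2MB huge page size
and the standard /proc/sys/vm/nr_hugepages file; the kernel may allocate
fewer pages than requested, so the value is read back):

#include <stdio.h>

/*
 * Sketch: reserve a persistent hugetlb pool once, e.g. at node setup.
 * Writes the standard sysctl file; requires root. The kernel may
 * allocate fewer pages than requested, so read the value back.
 */
int main(void)
{
        const char *path = "/proc/sys/vm/nr_hugepages";
        long want = 5120;       /* 5120 x 2MB pages = 10GB */
        long got = 0;
        FILE *f = fopen(path, "w");

        if (!f || fprintf(f, "%ld\n", want) < 0)
                return 1;
        fclose(f);

        f = fopen(path, "r");
        if (!f || fscanf(f, "%ld", &got) != 1)
                return 1;
        fclose(f);

        return got == want ? 0 : 1;     /* nonzero: pool came up short */
}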

I sincerely thank you for your help.

-- 
Thanks,
Tsukada



Re: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg

2018-05-24 Thread TSUKADA Koutaro
On 2018/05/24 22:24, Michal Hocko wrote:
[...]
> I do not see anything like that. adjust_pool_surplus is simply an
> accounting thing. At least the last time I've checked. Maybe your
> patchset handles that?

As you said, my patch did not consider that handling when manipulating
the pool. And even if that handling were done well, it would not be a
valid reason to charge surplus hugepages to memcg.

[...]
>> You are absolutely right, but, for example, can mlock(2)ed pages be
>> swapped out by reclaim? (What is the difference between mlock(2)ed
>> pages and hugetlb pages?)
> 
> No, mlocked pages cannot be reclaimed, and that is why we restrict them
> to a relatively small amount.

I understood the concept of memcg.

[...]
> Fatal? Not sure. It simply tries to add an alien memory to the memcg
> concept, so I would presume unexpected behavior (e.g. not being able to
> reclaim memcg memory, over-reclaim, thrashing etc.).

As you said, it must be alien. Thanks to the discussion up to this point,
I understand that my solution is inappropriate. I will look for another
way.

Thank you for your kind explanation.

-- 
Thanks,
Tsukada




Re: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg

2018-05-24 Thread Mike Kravetz
On 05/23/2018 09:26 PM, TSUKADA Koutaro wrote:
> 
> I do not know if it is really a strong use case, but I will explain my
> motive in detail. English is not my native language, so please pardon
> my poor English.
> 
> I am one of the developers of software that manages the resources used
> by user jobs on HPC clusters running Linux. The resource is mainly
> memory. An HPC cluster may be shared and used by multiple people.
> Therefore, the memory used by each user must be strictly controlled;
> otherwise a user's job may run away, and not only will it hamper the
> other users, it can crash the entire system in an OOM.
> 
> Some users of HPC are very nervous about performance. Jobs are executed
> while synchronizing via MPI communication across multiple compute nodes.
> Since CPU wait time occurs when synchronizing, they want to minimize the
> variation in execution time at each node to reduce waiting times as much
> as possible. We call this variation noise.
> 
> THP does not guarantee the use of huge pages and may fall back to normal
> pages.

Note.  You do not want to use THP because "THP does not guarantee".

> This mechanism is one cause of variation (noise).
> 
> Users who know about this mechanism will be hesitant to use THP. However,
> they also know the benefit of the huge page's TLB hit rate for
> performance, so huge pages seem attractive. It seems natural that these
> users are interested in HugeTLBfs; I do not know at all whether it is the
> right approach or not.
> 
> At the very least, our HPC system pursues high versatility, and we have
> to consider whether we can provide HugeTLBfs if users want to use it.
> 
> In order to use HugeTLBfs we need to create a persistent pool, but in
> our use case of shared nodes, it would be impossible to create, delete,
> or resize the pool.
> 
> One of the answers I have reached is to use HugeTLBfs by overcommitting
> without creating a pool (this is the surplus hugepage).

Using hugetlbfs overcommit also does not provide a guarantee.  Without
doing much research, I would say the failure rate for obtaining a huge
page via THP and hugetlbfs overcommit is about the same.  The most
difficult issue in both cases will be obtaining a "huge page" number of
pages from the buddy allocator.

I really do not think hugetlbfs overcommit will provide any benefit over
THP for your use case.  Also, new user space code is required to "fall back"
to normal pages in the case of hugetlbfs page allocation failure.  This
is not needed in the THP case.
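
As a rough sketch (assuming anonymous mappings, a length that is a
multiple of the huge page size, and with error handling trimmed), such
fallback code could look like this:

#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/*
 * Sketch of the user-space fallback described above: try a hugetlb
 * mapping first; if no huge page can be obtained (e.g. the overcommit
 * allocation failed), fall back to normal pages. len must be a
 * multiple of the huge page size for MAP_HUGETLB.
 */
static void *alloc_buffer(size_t len)
{
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

        if (p != MAP_FAILED)
                return p;               /* backed by hugetlb pages */

        /* hugetlb reservation failed; use normal (THP-eligible) pages */
        p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        return p == MAP_FAILED ? NULL : p;
}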
-- 
Mike Kravetz


Re: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg

2018-05-24 Thread Michal Hocko
On Thu 24-05-18 21:58:49, TSUKADA Koutaro wrote:
> On 2018/05/24 17:20, Michal Hocko wrote:
> > On Thu 24-05-18 13:39:59, TSUKADA Koutaro wrote:
> >> On 2018/05/23 3:54, Michal Hocko wrote:
> > [...]
> >>> I am also quite confused why you keep distinguishing surplus hugetlb
> >>> pages from regular preallocated ones. Being a surplus page is an
> >>> implementation detail that we use for an internal accounting rather than
> >>> something to exhibit to the userspace even more than we do currently.
> >>
> >> I apologize for having confused you.
> >>
> >> The hugetlb pages obtained from the pool do not consume the buddy pool.
> > 
> > Because they have already been allocated from the buddy allocator, the
> > end result is very much the same.
> > 
> >> On the other hand, surplus hugetlb pages consume the buddy pool. Due to
> >> this difference in property, I thought they could be distinguished.
> > 
> > But this is simply not correct. Surplus pages are fluid. If you increase
> > the hugetlb pool size they will become regular persistent hugetlb pages.
> 
> I really cannot understand what's wrong with this. That page is
> obviously released before being added to the persistent pool, and at that
> time it is uncharged from the memcg to which the task belongs (this
> assumes my patch-set). After that, the same page obtained from the pool
> is not a surplus hugepage, so it will not be charged to memcg again.

I do not see anything like that. adjust_pool_surplus is simply an
accounting thing. At least the last time I've checked. Maybe your
patchset handles that?
 
> >> Although my memcg knowledge is extremely limited, memcg accounts for
> >> various kinds of pages obtained from the buddy pool by the tasks
> >> belonging to it. I would like to argue that surplus hugepages are
> >> special in how they are obtained from the buddy pool, and that charging
> >> them to memcg should therefore be specially permitted.
> > 
> > Not really. Memcg accounts primarily for reclaimable memory. We do
> > account for some non-reclaimable slabs, but their lifetime should be at
> > least bound to a process lifetime. Otherwise the memcg oom killer
> > behavior is not guaranteed to unclutter the situation. Hugetlb pages are
> > simply persistent. Well, to be completely honest, tmpfs pages have a
> > similar problem, but lacking the swap space for them is kind of a
> > configuration bug.
> 
> You are absolutely right, but, for example, can mlock(2)ed pages be
> swapped out by reclaim? (What is the difference between mlock(2)ed pages
> and hugetlb pages?)

No, mlocked pages cannot be reclaimed, and that is why we restrict them
to a relatively small amount.
 
> >> It seems very strange to charge hugetlb pages to memcg, but
> >> essentially it only charges the usage of the compound page obtained
> >> from the buddy pool, and even if that page is used as a hugetlb page
> >> after that, memcg is not interested in it.
> > 
> > Ohh, it is very much interested. The primary goal of memcg is to
> > enforce the limit. How are you going to do that in the absence of
> > reclaimable memory? And quite a lot of it, because hugetlb pages
> > usually consume a lot of memory.
> 
> Simply kill any of the tasks belonging to that memcg. Maybe no one
> wants reclaim at the time surplus hugepages are charged.

But that will not release the hugetlb memory, will it?
 
> [...]
> >> I could not understand the intention of this question, sorry. When
> >> resizing the pool, I think that the number of surplus hugepages in use
> >> does not change. Could you explain what you were concerned about?
> > 
> > It does change when you change the hugetlb pool size or migrate pages
> > between per-NUMA pools (have a look at adjust_pool_surplus).
> 
> As far as I can see, what kind of fatal problem is caused by charging
> surplus hugepages to memcg, when it is just a matter of manipulating a
> statistics counter?

Fatal? Not sure. It simply tries to add an alien memory to the memcg
concept, so I would presume unexpected behavior (e.g. not being able to
reclaim memcg memory, over-reclaim, thrashing etc.).
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg

2018-05-24 Thread TSUKADA Koutaro
On 2018/05/24 17:20, Michal Hocko wrote:
> On Thu 24-05-18 13:39:59, TSUKADA Koutaro wrote:
>> On 2018/05/23 3:54, Michal Hocko wrote:
> [...]
>>> I am also quite confused why you keep distinguishing surplus hugetlb
>>> pages from regular preallocated ones. Being a surplus page is an
>>> implementation detail that we use for an internal accounting rather than
>>> something to exhibit to the userspace even more than we do currently.
>>
>> I apologize for having confused you.
>>
>> The hugetlb pages obtained from the pool do not consume the buddy pool.
> 
> Because they have already been allocated from the buddy allocator, the
> end result is very much the same.
> 
>> On the other hand, surplus hugetlb pages consume the buddy pool. Due to
>> this difference in property, I thought they could be distinguished.
> 
> But this is simply not correct. Surplus pages are fluid. If you increase
> the hugetlb pool size they will become regular persistent hugetlb pages.

I really cannot understand what's wrong with this. That page is obviously
released before being added to the persistent pool, and at that time it is
uncharged from the memcg to which the task belongs (this assumes my
patch-set). After that, the same page obtained from the pool is not a
surplus hugepage, so it will not be charged to memcg again.

>> Although my memcg knowledge is extremely limited, memcg accounts for
>> various kinds of pages obtained from the buddy pool by the tasks
>> belonging to it. I would like to argue that surplus hugepages are
>> special in how they are obtained from the buddy pool, and that charging
>> them to memcg should therefore be specially permitted.
> 
> Not really. Memcg accounts primarily for reclaimable memory. We do
> account for some non-reclaimable slabs, but their lifetime should be at
> least bound to a process lifetime. Otherwise the memcg oom killer
> behavior is not guaranteed to unclutter the situation. Hugetlb pages are
> simply persistent. Well, to be completely honest, tmpfs pages have a
> similar problem, but lacking the swap space for them is kind of a
> configuration bug.

You are absolutely right, but, for example, can mlock(2)ed pages be
swapped out by reclaim? (What is the difference between mlock(2)ed pages
and hugetlb pages?)

>> It seems very strange to charge hugetlb pages to memcg, but essentially
>> it only charges the usage of the compound page obtained from the buddy
>> pool, and even if that page is used as a hugetlb page after that, memcg
>> is not interested in it.
> 
> Ohh, it is very much interested. The primary goal of memcg is to enforce
> the limit. How are you going to do that in the absence of reclaimable
> memory? And quite a lot of it, because hugetlb pages usually consume a
> lot of memory.

Simply kill any of the tasks belonging to that memcg. Maybe no one wants
reclaim at the time surplus hugepages are charged.

[...]
>> I could not understand the intention of this question, sorry. When
>> resizing the pool, I think that the number of surplus hugepages in use
>> does not change. Could you explain what you were concerned about?
> 
> It does change when you change the hugetlb pool size or migrate pages
> between per-NUMA pools (have a look at adjust_pool_surplus).

As far as I can see, what kind of fatal problem is caused by charging
surplus hugepages to memcg, when it is just a matter of manipulating a
statistics counter?

-- 
Thanks,
Tsukada



Re: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg

2018-05-24 Thread Michal Hocko
On Thu 24-05-18 13:26:12, TSUKADA Koutaro wrote:
[...]
> I do not know if it is really a strong use case, but I will explain my
> motive in detail. English is not my native language, so please pardon
> my poor English.
> 
> I am one of the developers of software that manages the resources used
> by user jobs on HPC clusters running Linux. The resource is mainly
> memory. An HPC cluster may be shared and used by multiple people.
> Therefore, the memory used by each user must be strictly controlled;
> otherwise a user's job may run away, and not only will it hamper the
> other users, it can crash the entire system in an OOM.
> 
> Some users of HPC are very nervous about performance. Jobs are executed
> while synchronizing via MPI communication across multiple compute nodes.
> Since CPU wait time occurs when synchronizing, they want to minimize the
> variation in execution time at each node to reduce waiting times as much
> as possible. We call this variation noise.
> 
> THP does not guarantee the use of huge pages and may fall back to normal
> pages. This mechanism is one cause of variation (noise).
> 
> Users who know about this mechanism will be hesitant to use THP. However,
> they also know the benefit of the huge page's TLB hit rate for
> performance, so huge pages seem attractive. It seems natural that these
> users are interested in HugeTLBfs; I do not know at all whether it is the
> right approach or not.

Sure, asking for a guarantee makes hugetlb pages attractive. But nothing
is really for free, especially any resource _guarantee_, and you usually
have to pay an additional configuration price.
 
> At the very least, our HPC system pursues high versatility, and we have
> to consider whether we can provide HugeTLBfs if users want to use it.
> 
> In order to use HugeTLBfs we need to create a persistent pool, but in
> our use case of shared nodes, it would be impossible to create, delete,
> or resize the pool.

Why? I can see this would be quite a PITA but not really impossible.

> One of the answers I have reached is to use HugeTLBfs by overcommitting
> without creating a pool (this is the surplus hugepage).
> 
> Surplus hugepages are hugetlb pages, but I think that consuming the buddy
> pool is at least a decisive difference from the hugetlb pages of a
> persistent pool. If nr_overcommit_hugepages is assumed to be infinite,
> allocating pages for surplus hugepages from the buddy pool is unlimited
> even when the task is limited by memcg.

Not really, you can specify how much you can overcommit hugetlb pages.
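
For reference, a sketch of capping the overcommit through the standard
sysctl file (assuming the default 2MB huge page size and root
privileges):

#include <stdio.h>

/*
 * Sketch: bound how many surplus hugetlb pages may be taken from the
 * buddy allocator, via the standard sysctl file (requires root).
 * With 2MB pages, 5120 surplus pages cap the overcommit at 10GB.
 */
int main(void)
{
        FILE *f = fopen("/proc/sys/vm/nr_overcommit_hugepages", "w");

        if (!f)
                return 1;
        fprintf(f, "%d\n", 5120);
        return fclose(f) ? 1 : 0;
}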

> In extreme cases, overcommitment will allow users to exhaust the entire
> memory of the system. Of course, this can be prevented by the hugetlb
> cgroup, but even if we set limits for the memcg and the hugetlb cgroup
> respectively, as I asked in the first mail (set the limit to 10GB), the
> control will not work.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg

2018-05-24 Thread Michal Hocko
On Thu 24-05-18 13:39:59, TSUKADA Koutaro wrote:
> On 2018/05/23 3:54, Michal Hocko wrote:
[...]
> > I am also quite confused why you keep distinguishing surplus hugetlb
> > pages from regular preallocated ones. Being a surplus page is an
> > implementation detail that we use for an internal accounting rather than
> > something to exhibit to the userspace even more than we do currently.
> 
> I apologize for having confused you.
> 
> The hugetlb pages obtained from the pool do not consume the buddy pool.

Because they have already been allocated from the buddy allocator, the
end result is very much the same.

> On the other hand, surplus hugetlb pages consume the buddy pool. Due to
> this difference in property, I thought they could be distinguished.

But this is simply not correct. Surplus pages are fluid. If you increase
the hugetlb pool size they will become regular persistent hugetlb pages.
 
> Although my memcg knowledge is extremely limited, memcg accounts for
> various kinds of pages obtained from the buddy pool by the tasks
> belonging to it. I would like to argue that surplus hugepages are special
> in how they are obtained from the buddy pool, and that charging them to
> memcg should therefore be specially permitted.

Not really. Memcg accounts primarily for reclaimable memory. We do
account for some non-reclaimable slabs, but their lifetime should be at
least bound to a process lifetime. Otherwise the memcg oom killer
behavior is not guaranteed to unclutter the situation. Hugetlb pages are
simply persistent. Well, to be completely honest, tmpfs pages have a
similar problem, but lacking the swap space for them is kind of a
configuration bug.

> It seems very strange to charge hugetlb pages to memcg, but essentially
> it only charges the usage of the compound page obtained from the buddy
> pool, and even if that page is used as a hugetlb page after that, memcg
> is not interested in it.

Ohh, it is very much interested. The primary goal of memcg is to enforce
the limit. How are you going to do that in the absence of reclaimable
memory? And quite a lot of it, because hugetlb pages usually consume a
lot of memory.

> I completely apologize if my way of thinking is wrong. It would be
> greatly appreciated if you could explain why we cannot charge surplus
> hugepages to memcg.
> 
> > Just look at what [sw]hould happen when you need to adjust accounting -
> > e.g. due to a pool resize. Are you going to uncharge those surplus pages
> > from memcg to reflect their persistence?
> > 
> 
> I could not understand the intention of this question, sorry. When
> resizing the pool, I think that the number of surplus hugepages in use
> does not change. Could you explain what you were concerned about?

It does change when you change the hugetlb pool size or migrate pages
between per-NUMA pools (have a look at adjust_pool_surplus).
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg

2018-05-23 Thread TSUKADA Koutaro
On 2018/05/23 3:54, Michal Hocko wrote:
> On Tue 22-05-18 22:04:23, TSUKADA Koutaro wrote:
>> On 2018/05/22 3:07, Mike Kravetz wrote:
>>> On 05/17/2018 09:27 PM, TSUKADA Koutaro wrote:
 Thanks to Mike Kravetz for comments on the previous version of the patch.

 The purpose of this patch-set is to make it possible to control whether
 or not surplus hugetlb pages obtained by overcommitting are charged to
 the memory cgroup. In the future, I am trying to accomplish limiting the
 memory usage of applications that use both normal pages and hugetlb
 pages with the memory cgroup (not the hugetlb cgroup).

 Applications that use shared libraries like libhugetlbfs.so use both
 normal pages and hugetlb pages, but we do not know how much of each they
 will use. Suppose you want to manage the memory usage of such
 applications by cgroup. How do you set the memory cgroup and hugetlb
 cgroup limits when you want to limit memory usage to 10GB?

 If you set a limit of 10GB for each, the user can use a total of 20GB of
 memory and it cannot be limited well. Since it is difficult to estimate
 the ratio of normal pages to hugetlb pages that a user will use, setting
 limits of 2GB on the memory cgroup and 8GB on the hugetlb cgroup is not a
 very good idea. In such a case, I thought that by using my patch-set we
 could manage resources just by setting 10GB as the limit of the memory
 cgroup (with no limit on the hugetlb cgroup).

 In this patch-set, introduce charge_surplus_huge_pages (boolean) in
 struct hstate. If it is true, surplus hugepages are charged to the memory
 cgroup to which the task that obtained them belongs. If it is false,
 nothing is done, as before; the default value is false.
 charge_surplus_huge_pages can be controlled via procfs or sysfs
 interfaces.

 Since THP is very effective in environments with a kernel page size of
 4KB, such as x86, there is no reason to actively use HugeTLBfs, so I
 think there is no situation in which charge_surplus_huge_pages would be
 enabled. However, in some distributions, such as on arm64, the kernel
 page size is 64KB, and the THP size, at 512MB, is too huge to be
 practical. HugeTLBfs may support multiple huge page sizes, and in such a
 special environment there is a desire to use HugeTLBfs.
>>>
>>> One of the basic questions/concerns I have is accounting for surplus huge
>>> pages in the default memory resource controller.  The existing hugetlb
>>> resource controller already takes hugetlbfs huge pages into account,
>>> including surplus pages.  This series would allow surplus pages to be
>>> accounted for in the default memory controller, or the hugetlb controller,
>>> or both.
>>>
>>> I understand that current mechanisms do not meet the needs of the above
>>> use case.  The question is whether this is an appropriate way to approach
>>> the issue.
> 
> I do share your view Mike!
> 
>>> My cgroup experience and knowledge is extremely limited, but
>>> it does not appear that any other resource can be controlled by multiple
>>> controllers.  Therefore, I am concerned that this may be going against
>>> basic cgroup design philosophy.
>>
>> Thank you for your feedback.
>> That makes sense, surplus hugepages are charged to both memcg and hugetlb
>> cgroup, which may be contrary to cgroup design philosophy.
>>
>> Based on the above advice, I have considered the following improvement;
>> what do you think about it?
>>
>> The 'charge_surplus_hugepages' option of the v2 patch-set switched
>> "whether to charge memcg in addition to the hugetlb cgroup", but it will
>> be abolished. Instead, it becomes an option to "charge only memcg instead
>> of the hugetlb cgroup". This is called 'surplus_charge_to_memcg'.
> 
> This all looks so hackish and ad-hoc that I would be tempted to give it
> an outright nack, but let's hear more about why we need this fiddling at
> all. I've asked in the other email so I guess I will get an answer there,
> but let me just emphasize again that I absolutely detest the possibility
> of putting hugetlb pages into the memcg mix. They just do not belong
> there. Try to look at the previous discussions on why it was decided to
> have separate hugetlb pages at all.
> 
> I am also quite confused why you keep distinguishing surplus hugetlb
> pages from regular preallocated ones. Being a surplus page is an
> implementation detail that we use for an internal accounting rather than
> something to exhibit to the userspace even more than we do currently.

I apologize for having confused you.

The hugetlb pages obtained from the pool do not consume the buddy pool. On
the other hand, surplus hugetlb pages consume the buddy pool. Due to this
difference in property, I thought they could be distinguished.

Although my memcg knowledge is extremely limited, memcg accounts for
various kinds of pages obtained from the buddy pool by the tasks belonging
to it. I would like to argue that surplus hugepages are special in how
they are obtained from the buddy pool, and that charging them to memcg
should therefore be specially permitted.

Re: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg

2018-05-23 Thread TSUKADA Koutaro
On 2018/05/22 22:51, Michal Hocko wrote:
> On Fri 18-05-18 13:27:27, TSUKADA Koutaro wrote:
>> The purpose of this patch-set is to make it possible to control whether
>> or not surplus hugetlb pages obtained by overcommitting are charged to
>> the memory cgroup. In the future, I am trying to accomplish limiting the
>> memory usage of applications that use both normal pages and hugetlb
>> pages with the memory cgroup (not the hugetlb cgroup).
> 
> There was a deliberate decision to keep hugetlb and "normal" memory
> cgroup controllers separate. Mostly because hugetlb memory is an
> artificial memory subsystem on its own and it doesn't fit into the rest
> of memcg accounted memory very well. I believe we want to keep that
> status quo.
> 
>> Applications that use shared libraries like libhugetlbfs.so use both
>> normal pages and hugetlb pages, but we do not know how much of each they
>> will use. Suppose you want to manage the memory usage of such
>> applications by cgroup. How do you set the memory cgroup and hugetlb
>> cgroup limits when you want to limit memory usage to 10GB?
> 
> Well such a usecase requires an explicit configuration already. Either
> by using special wrappers or by modifying the code. So I would argue that
> you have quite a good knowledge of the setup. If you need greater
> flexibility then just do not use hugetlb at all and rely on THP.
> [...]
> 
>> In this patch-set, introduce charge_surplus_huge_pages (boolean) in
>> struct hstate. If it is true, surplus hugepages are charged to the
>> memory cgroup to which the task that obtained them belongs. If it is
>> false, nothing is done, as before; the default value is false.
>> charge_surplus_huge_pages can be controlled via procfs or sysfs
>> interfaces.
> 
> I do not really think this is a good idea. We really do not want to make
> the current hugetlb code more complex than it is already. The current
> hugetlb cgroup controller is simple and works at least somehow. I would
> not add more on top unless there is a _really_ strong usecase behind it.
> Please make sure to describe such a usecase in detail before we even
> start considering the code.

Thank you for your time.

I do not know if it is really a strong use case, but I will explain my
motive in detail. English is not my native language, so please pardon
my poor English.

I am one of the developers of software that manages the resources used by
user jobs on HPC clusters running Linux. The resource is mainly memory. An
HPC cluster may be shared and used by multiple people. Therefore, the
memory used by each user must be strictly controlled; otherwise a user's
job may run away, and not only will it hamper the other users, it can
crash the entire system in an OOM.

Some users of HPC are very nervous about performance. Jobs are executed
while synchronizing via MPI communication across multiple compute nodes.
Since CPU wait time occurs when synchronizing, they want to minimize the
variation in execution time at each node to reduce waiting times as much
as possible. We call this variation noise.

THP does not guarantee the use of huge pages and may fall back to normal
pages. This mechanism is one cause of variation (noise).

Users who know about this mechanism will be hesitant to use THP. However,
they also know the benefit of the huge page's TLB hit rate for
performance, so huge pages seem attractive. It seems natural that these
users are interested in HugeTLBfs; I do not know at all whether it is the
right approach or not.

At the very least, our HPC system pursues high versatility, and we have
to consider whether we can provide HugeTLBfs if users want to use it.

In order to use HugeTLBfs we need to create a persistent pool, but in our
use case of shared nodes, it would be impossible to create, delete, or
resize the pool.

One of the answers I have reached is to use HugeTLBfs by overcommitting
without creating a pool (this is the surplus hugepage).

Surplus hugepages are hugetlb pages, but I think that consuming the buddy
pool is at least a decisive difference from the hugetlb pages of a
persistent pool. If nr_overcommit_hugepages is assumed to be infinite,
allocating pages for surplus hugepages from the buddy pool is unlimited
even when the task is limited by memcg. In extreme cases, overcommitment
will allow users to exhaust the entire memory of the system. Of course,
this can be prevented by the hugetlb cgroup, but even if we set limits for
the memcg and the hugetlb cgroup respectively, as I asked in the first
mail (set the limit to 10GB), the control will not work.
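
To make the failure concrete, here is a sketch of the configuration
described above (assuming cgroup v1 mounted at /sys/fs/cgroup, a 2MB
default huge page size, and an already-created group named "job1"; the
names are only examples):

#include <stdio.h>

/* Sketch: write a value into a cgroup v1 control file. */
static int cg_write(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");

        if (!f)
                return -1;
        fprintf(f, "%s\n", val);
        return fclose(f);
}

int main(void)
{
        /* Each controller enforces its own 10GB... */
        cg_write("/sys/fs/cgroup/memory/job1/memory.limit_in_bytes", "10G");
        cg_write("/sys/fs/cgroup/hugetlb/job1/hugetlb.2MB.limit_in_bytes",
                 "10G");
        /*
         * ...so a job that mixes normal pages and surplus hugetlb pages
         * can still consume up to 20GB in total: the two limits are
         * independent, not a combined 10GB budget.
         */
        return 0;
}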

I thought I could charge surplus hugepages to memcg, but maybe I do not
have enough knowledge about memcg. I will reply to the other mail with
the details.

>> Since THP is very effective in environments with a kernel page size of
>> 4KB, such as x86, there is no reason to actively use HugeTLBfs, so I
>> think there is no situation in which charge_surplus_huge_pages would be
>> enabled. However, in some distributions, such as on arm64, the kernel
>> page size is 64KB, and the THP size, at 512MB, is too huge to be
>> practical. HugeTLBfs may support multiple huge page sizes, and in such
>> a special environment there is a desire to use HugeTLBfs.

Re: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg

2018-05-23 Thread TSUKADA Koutaro
On 2018/05/22 22:51, Michal Hocko wrote:
> On Fri 18-05-18 13:27:27, TSUKADA Koutaro wrote:
>> The purpose of this patch-set is to make it possible to control whether or
>> not to charge surplus hugetlb pages obtained by overcommitting to memory
>> cgroup. In the future, I am trying to accomplish limiting the memory usage
>> of applications that use both normal pages and hugetlb pages by the memory
>> cgroup(not use the hugetlb cgroup).
> 
> There was a deliberate decision to keep hugetlb and "normal" memory
> cgroup controllers separate. Mostly because hugetlb memory is an
> artificial memory subsystem on its own and it doesn't fit into the rest
> of memcg accounted memory very well. I believe we want to keep that
> status quo.
> 
>> Applications that use shared libraries like libhugetlbfs.so use both normal
>> pages and hugetlb pages, but we do not know how much to use each. Please
>> suppose you want to manage the memory usage of such applications by cgroup
>> How do you set the memory cgroup and hugetlb cgroup limit when you want to
>> limit memory usage to 10GB?
> 
> Well such a usecase requires an explicit configuration already. Either
> by using special wrappers or modifying the code. So I would argue that
> you have quite a good knowlege of the setup. If you need a greater
> flexibility then just do not use hugetlb at all and rely on THP.
> [...]
> 
>> In this patch-set, introduce the charge_surplus_huge_pages(boolean) to
>> struct hstate. If it is true, it charges to the memory cgroup to which the
>> task that obtained surplus hugepages belongs. If it is false, do nothing as
>> before, and the default value is false. The charge_surplus_huge_pages can
>> be controlled procfs or sysfs interfaces.
> 
> I do not really think this is a good idea. We really do not want to make
> the current hugetlb code more complex than it is already. The current
> hugetlb cgroup controller is simple and works at least somehow. I would
> not add more on top unless there is a _really_ strong usecase behind.
> Please make sure to describe such a usecase in details before we even
> start considering the code.

Thank you for your time.

I do not know if it is really a strong use case, but I will explain my
motive in detail. English is not my native language, so please pardon
my poor English.

I am one of the developers for software that managing the resource used
from user job at HPC-Cluster with Linux. The resource is memory mainly.
The HPC-Cluster may be shared by multiple people and used. Therefore, the
memory used by each user must be strictly controlled, otherwise the
user's job will runaway, not only will it hamper the other users, it will
crash the entire system in OOM.

Some users of HPC are very nervous about performance. Jobs are executed
while synchronizing with MPI communication using multiple compute nodes.
Since CPU wait time will occur when synchronizing, they want to minimize
the variation in execution time at each node to reduce waiting times as
much as possible. We call this variation a noise.

THP does not guarantee the use of huge pages; it may fall back to normal
pages. This mechanism is one cause of such variation (noise), as the sketch
below illustrates.
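
For illustration, a minimal userspace sketch (not from the original mail;
the mapping size is arbitrary and error handling is abbreviated) of this
non-guarantee: madvise(MADV_HUGEPAGE) is only a hint, and the AnonHugePages
fields in /proc/self/smaps may cover far less than the mapping, varying
from run to run:

/* thp_check.c - request THP and report how much was actually granted.
 * Illustrative sketch only; error handling is abbreviated. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define LEN (256UL << 20)	/* 256MB of anonymous memory */

int main(void)
{
	void *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return 1;
	madvise(p, LEN, MADV_HUGEPAGE);	/* a hint, not a guarantee */
	memset(p, 0, LEN);		/* fault everything in */

	/* Each mapping's AnonHugePages line shows how much is THP-backed;
	 * under fragmentation it can be far less than LEN, and it differs
	 * between runs -- which is exactly the noise described above. */
	char line[256];
	FILE *f = fopen("/proc/self/smaps", "r");
	while (f && fgets(line, sizeof(line), f))
		if (strncmp(line, "AnonHugePages:", 14) == 0)
			fputs(line, stdout);
	if (f)
		fclose(f);
	return 0;
}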

Users who know about this mechanism will hesitate to use THP. However,
they also know the TLB hit-rate benefits of huge pages, and huge pages
remain attractive. It seems natural that such users become interested in
HugeTLBfs; I do not know at all whether it is the right approach or not.

At the very least, our HPC system pursues high versatility, and we have to
consider whether we can support HugeTLBfs if users want to use it.

In order to use HugeTLBfs we would need to create a persistent pool, but in
our use case of shared nodes, it would be impossible to create, delete or
resize the pool.

One of the answers I have reached is to use HugeTLBfs by overcommitting,
without creating a pool (this is the surplus hugepage); see the sketch
below.
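
A minimal sketch of that mode of use (assuming a 2MB hugepage size and
sufficient privilege; the numbers are arbitrary): leave nr_hugepages at 0
and raise only nr_overcommit_hugepages, then map with MAP_HUGETLB:

/* surplus.c - use hugetlb pages with no persistent pool, via overcommit.
 * Illustrative sketch; assumes 2MB hugepages and root privilege. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define LEN (64UL << 21)	/* 64 hugepages of 2MB */

int main(void)
{
	/* nr_hugepages stays 0; permit up to 64 surplus pages */
	int fd = open("/proc/sys/vm/nr_overcommit_hugepages", O_WRONLY);
	if (fd < 0 || write(fd, "64", 2) < 0) {
		perror("nr_overcommit_hugepages");
		return 1;
	}
	close(fd);

	/* The reservation made here is satisfied with surplus hugepages
	 * pulled from the buddy allocator; it can fail at any time if a
	 * compound page cannot be assembled. */
	void *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	memset(p, 0, LEN);	/* fault in the hugepages */
	return 0;
}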

A surplus hugepage is still a hugetlb page, but I think that consuming the
buddy pool is a decisive difference from a hugetlb page of the persistent
pool. If nr_overcommit_hugepages is assumed to be infinite, allocating
pages for surplus hugepages from the buddy pool is entirely unlimited, even
for a task limited by memcg. In extreme cases, overcommitment will allow
users to exhaust the entire memory of the system. Of course, this can be
prevented by the hugetlb cgroup, but even if we set limits for memcg and
the hugetlb cgroup respectively, as I asked in the first mail (set the
limit to 10GB), the control will not work; a sketch follows.
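
A sketch of that failing configuration under cgroup v1 (the paths follow
the standard v1 layout, but the "job1" group name is made up; not part of
the original mail): the memcg limit simply never sees the surplus pages:

/* limits.c - illustrate the 10GB question above under cgroup v1.
 * Paths follow the standard v1 layout; "job1" is a made-up group. */
#include <stdio.h>

static void set(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");
	if (!f) {
		perror(path);
		return;
	}
	fputs(val, f);
	fclose(f);
}

int main(void)
{
	/* Cap the job at 10GB of normal memory ... */
	set("/sys/fs/cgroup/memory/job1/memory.limit_in_bytes", "10G");
	/* ... but leave hugetlb.2MB.limit_in_bytes unlimited.  A task in
	 * job1 can now mmap(MAP_HUGETLB) surplus hugepages well past 10GB:
	 * they come from the buddy allocator, yet memcg never charges them.
	 * Capping the hugetlb cgroup at 10GB as well just re-creates the
	 * 20GB problem from the cover letter. */
	return 0;
}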

I thought I could charge surplus hugepages to memcg, but maybe I do not
have enough knowledge about memcg. I will reply to another mail with the
details.

>> Since THP is very effective in environments with kernel page size of 4KB,
>> such as x86, there is no reason to positively use HugeTLBfs, so I think
>> that there is no situation to enable charge_surplus_huge_pages. However, in
>> some distributions such as arm64, the page size of the kernel is 64KB, and
>> the size of THP is too huge as 512MB, making it difficult to use. HugeTLBfs
>> may support multiple huge page sizes, and in such a special environment
>> there is a desire to use HugeTLBfs.

Re: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg

2018-05-22 Thread Mike Kravetz
On 05/22/2018 06:04 AM, TSUKADA Koutaro wrote:
> 
> I stared at the commit log of mm/hugetlb_cgroup.c, but it did not seem to
> have specially considered of surplus hugepages. Later, I will send a mail
> to hugetlb cgroup's committer to ask about surplus hugepages charge
> specifications.
> 

I went back and looked at surplus huge page allocation.  Previously, I made
a statement that the hugetlb controller accounts for surplus huge pages.
Turns out that may not be 100% correct.

Thanks to Michal, all surplus huge page allocation is performed via the
alloc_surplus_huge_page() routine.  This will ultimately call into the
buddy allocator without any cgroup charges.  Calls to alloc_surplus_huge_page
are made from:
- alloc_huge_page() when allocating a huge page to a mapping/file.  In this
  case, appropriate calls to the hugetlb controller are in place.  So, any
  limits are enforced here.
- gather_surplus_pages() when allocating and setting aside 'reserved' huge
  pages. No accounting is performed here.  Do note that in this case the
  allocated huge pages are not assigned to the mapping/file.  Even though
  'reserved', they are deposited into the global pool and also counted as
  'free'.  When these reserved pages are ultimately used to populate a
  file/mapping, the code path goes through alloc_huge_page() where appropriate
  calls to the hugetlb controller are in place.

So, the bottom line is that surplus huge pages are not accounted for when
they are allocated as 'reserves'.  It is not until these reserves are actually
used that accounting limits are checked.  This 'seems' to align with general
allocation of huge pages within the pool.  No accounting is done until they
are actually allocated to a mapping/file.
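
The difference can be observed from userspace. A sketch (not from the
original mail; it assumes 2MB hugepages and nr_overcommit_hugepages already
raised): at mmap time gather_surplus_pages() raises HugePages_Surp and
HugePages_Free with no cgroup charge, and the hugetlb controller charge
only appears once the pages are faulted through alloc_huge_page():

/* observe.c - watch surplus reserves appear before any cgroup charge.
 * Sketch only; assumes 2MB hugepages and nr_overcommit_hugepages >= 8. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define LEN (8UL << 21)	/* 8 x 2MB */

static void counters(const char *tag)
{
	char line[128];
	FILE *f = fopen("/proc/meminfo", "r");
	while (f && fgets(line, sizeof(line), f))
		if (strstr(line, "HugePages_Free") ||
		    strstr(line, "HugePages_Surp"))
			printf("%s %s", tag, line);
	if (f)
		fclose(f);
}

int main(void)
{
	counters("before mmap: ");
	void *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (p == MAP_FAILED)
		return 1;
	/* reserves gathered: Surp and Free both rise, no cgroup charge yet */
	counters("after mmap:  ");
	memset(p, 0, LEN);
	/* the faults went through alloc_huge_page(), where charging happens */
	counters("after fault: ");
	return 0;
}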

-- 
Mike Kravetz


Re: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg

2018-05-22 Thread Michal Hocko
On Tue 22-05-18 22:04:23, TSUKADA Koutaro wrote:
> On 2018/05/22 3:07, Mike Kravetz wrote:
> > On 05/17/2018 09:27 PM, TSUKADA Koutaro wrote:
> >> Thanks to Mike Kravetz for comment on the previous version patch.
> >>
> >> The purpose of this patch-set is to make it possible to control whether or
> >> not to charge surplus hugetlb pages obtained by overcommitting to memory
> >> cgroup. In the future, I am trying to accomplish limiting the memory usage
> >> of applications that use both normal pages and hugetlb pages by the memory
> >> cgroup (not the hugetlb cgroup).
> >>
> >> Applications that use shared libraries like libhugetlbfs.so use both normal
> >> pages and hugetlb pages, but we do not know how much of each they use.
> >> Please suppose you want to manage the memory usage of such applications by
> >> cgroup. How do you set the memory cgroup and hugetlb cgroup limits when you
> >> want to limit memory usage to 10GB?
> >>
> >> If you set a limit of 10GB for each, the user can use a total of 20GB of
> >> memory, and we cannot limit it well. Since it is difficult to estimate the
> >> ratio of normal pages and hugetlb pages a user will use, setting limits of
> >> 2GB for the memory cgroup and 8GB for the hugetlb cgroup is not a very good
> >> idea. In such a case, I thought that by using my patch-set, we could manage
> >> resources just by setting 10GB as the limit of the memory cgroup (with no
> >> limit on the hugetlb cgroup).
> >>
> >> In this patch-set, introduce the charge_surplus_huge_pages(boolean) to
> >> struct hstate. If it is true, it charges to the memory cgroup to which the
> >> task that obtained surplus hugepages belongs. If it is false, do nothing as
> >> before, and the default value is false. The charge_surplus_huge_pages can
> >> be controlled via procfs or sysfs interfaces.
> >>
> >> Since THP is very effective in environments with kernel page size of 4KB,
> >> such as x86, there is no reason to positively use HugeTLBfs, so I think
> >> that there is no situation to enable charge_surplus_huge_pages. However, in
> >> some distributions such as arm64, the page size of the kernel is 64KB, and
> >> the size of THP is too huge as 512MB, making it difficult to use. HugeTLBfs
> >> may support multiple huge page sizes, and in such a special environment
> >> there is a desire to use HugeTLBfs.
> > 
> > One of the basic questions/concerns I have is accounting for surplus huge
> > pages in the default memory resource controller.  The existing hugetlb
> > resource controller already takes hugetlbfs huge pages into account,
> > including surplus pages.  This series would allow surplus pages to be
> > accounted for in the default memory controller, or the hugetlb controller
> > or both.
> > 
> > I understand that current mechanisms do not meet the needs of the above
> > use case.  The question is whether this is an appropriate way to approach
> > the issue.

I do share your view Mike!

> > My cgroup experience and knowledge is extremely limited, but
> > it does not appear that any other resource can be controlled by multiple
> > controllers.  Therefore, I am concerned that this may be going against
> > basic cgroup design philosophy.
> 
> Thank you for your feedback.
> That makes sense, surplus hugepages are charged to both memcg and hugetlb
> cgroup, which may be contrary to cgroup design philosophy.
> 
> Based on the above advice, I have considered the following improvements,
> what do you think about?
> 
> The 'charge_surplus_hugepages' of v2 patch-set was an option to switch
> "whether to charge memcg in addition to hugetlb cgroup", but it will be
> abolished. Instead, change to "switch only to memcg instead of hugetlb
> cgroup" option. This is called 'surplus_charge_to_memcg'.

This all looks so hackish and ad-hoc that I would be tempted to give it
an outright nack, but let's hear more about why we need this fiddling at
all. I've asked in another email, so I guess I will get an answer there,
but let me just emphasize again that I absolutely detest the possibility
of putting hugetlb pages into the memcg mix. They just do not belong there.
Try to look at the previous discussions on why it was decided to keep
hugetlb pages separate at all.

I am also quite confused why you keep distinguishing surplus hugetlb
pages from regular preallocated ones. Being a surplus page is an
implementation detail that we use for internal accounting rather than
something to exhibit to userspace even more than we do currently.
Just look at what should happen when you need to adjust the accounting -
e.g. due to a pool resize, as sketched below. Are you going to uncharge
those surplus pages from memcg to reflect their persistence?
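
To make the resize scenario concrete, a sketch (not part of the original
mail; it assumes 2MB hugepages and that some surplus pages are in use, e.g.
from the overcommit example earlier in the thread): raising nr_hugepages
converts in-use surplus pages to persistent ones in place, so any memcg
charge made at allocation time would go stale:

/* resize.c - grow the pool while surplus pages are in use.
 * Sketch only; assumes 2MB hugepages and existing surplus pages. */
#include <stdio.h>
#include <string.h>

static void counters(const char *tag)
{
	char line[128];
	FILE *f = fopen("/proc/meminfo", "r");
	while (f && fgets(line, sizeof(line), f))
		if (strstr(line, "HugePages_Total") ||
		    strstr(line, "HugePages_Surp"))
			printf("%s %s", tag, line);
	if (f)
		fclose(f);
}

int main(void)
{
	counters("before:");
	/* Growing nr_hugepages first re-labels up to that many in-use
	 * surplus pages as persistent (HugePages_Surp drops while
	 * HugePages_Total rises) -- the pages themselves never move. */
	FILE *f = fopen("/proc/sys/vm/nr_hugepages", "w");
	if (!f) {
		perror("nr_hugepages");
		return 1;
	}
	fputs("8", f);
	fclose(f);
	counters("after: ");
	return 0;
}
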
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg

2018-05-22 Thread Michal Hocko
On Fri 18-05-18 13:27:27, TSUKADA Koutaro wrote:
> Thanks to Mike Kravetz for comment on the previous version patch.

I am sorry that I didn't join the discussion for the previous version
but time just didn't allow that. So sorry if I am repeating something
already sorted out.

> The purpose of this patch-set is to make it possible to control whether or
> not to charge surplus hugetlb pages obtained by overcommitting to memory
> cgroup. In the future, I am trying to accomplish limiting the memory usage
> of applications that use both normal pages and hugetlb pages by the memory
> cgroup (not the hugetlb cgroup).

There was a deliberate decision to keep hugetlb and "normal" memory
cgroup controllers separate. Mostly because hugetlb memory is an
artificial memory subsystem on its own and it doesn't fit into the rest
of memcg accounted memory very well. I believe we want to keep that
status quo.

> Applications that use shared libraries like libhugetlbfs.so use both normal
> pages and hugetlb pages, but we do not know how much of each they use.
> Please suppose you want to manage the memory usage of such applications by
> cgroup. How do you set the memory cgroup and hugetlb cgroup limits when you
> want to limit memory usage to 10GB?

Well such a usecase requires an explicit configuration already. Either
by using special wrappers or modifying the code. So I would argue that
you have quite a good knowledge of the setup. If you need a greater
flexibility then just do not use hugetlb at all and rely on THP.
[...]

> In this patch-set, introduce the charge_surplus_huge_pages(boolean) to
> struct hstate. If it is true, it charges to the memory cgroup to which the
> task that obtained surplus hugepages belongs. If it is false, do nothing as
> before, and the default value is false. The charge_surplus_huge_pages can
> be controlled via procfs or sysfs interfaces.

I do not really think this is a good idea. We really do not want to make
the current hugetlb code more complex than it is already. The current
hugetlb cgroup controller is simple and works at least somehow. I would
not add more on top unless there is a _really_ strong usecase behind.
Please make sure to describe such a usecase in details before we even
start considering the code.

> Since THP is very effective in environments with kernel page size of 4KB,
> such as x86, there is no reason to positively use HugeTLBfs, so I think
> that there is no situation to enable charge_surplus_huge_pages. However, in
> some distributions such as arm64, the page size of the kernel is 64KB, and
> the size of THP is too huge as 512MB, making it difficult to use. HugeTLBfs
> may support multiple huge page sizes, and in such a special environment
> there is a desire to use HugeTLBfs.

Well, then I would argue that you either shouldn't use 64kB pages for your
setup, or you should allow THP for smaller sizes. Really, hugetlb pages are
by no means a substitute here.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg

2018-05-22 Thread TSUKADA Koutaro
On 2018/05/22 3:07, Mike Kravetz wrote:
> On 05/17/2018 09:27 PM, TSUKADA Koutaro wrote:
>> Thanks to Mike Kravetz for comment on the previous version patch.
>>
>> The purpose of this patch-set is to make it possible to control whether or
>> not to charge surplus hugetlb pages obtained by overcommitting to memory
>> cgroup. In the future, I am trying to accomplish limiting the memory usage
>> of applications that use both normal pages and hugetlb pages by the memory
>> cgroup (not the hugetlb cgroup).
>>
>> Applications that use shared libraries like libhugetlbfs.so use both normal
>> pages and hugetlb pages, but we do not know how much of each they use.
>> Please suppose you want to manage the memory usage of such applications by
>> cgroup. How do you set the memory cgroup and hugetlb cgroup limits when you
>> want to limit memory usage to 10GB?
>>
>> If you set a limit of 10GB for each, the user can use a total of 20GB of
>> memory, and we cannot limit it well. Since it is difficult to estimate the
>> ratio of normal pages and hugetlb pages a user will use, setting limits of
>> 2GB for the memory cgroup and 8GB for the hugetlb cgroup is not a very good
>> idea. In such a case, I thought that by using my patch-set, we could manage
>> resources just by setting 10GB as the limit of the memory cgroup (with no
>> limit on the hugetlb cgroup).
>>
>> In this patch-set, introduce the charge_surplus_huge_pages(boolean) to
>> struct hstate. If it is true, it charges to the memory cgroup to which the
>> task that obtained surplus hugepages belongs. If it is false, do nothing as
>> before, and the default value is false. The charge_surplus_huge_pages can
>> be controlled via procfs or sysfs interfaces.
>>
>> Since THP is very effective in environments with kernel page size of 4KB,
>> such as x86, there is no reason to positively use HugeTLBfs, so I think
>> that there is no situation to enable charge_surplus_huge_pages. However, in
>> some distributions such as arm64, the page size of the kernel is 64KB, and
>> the size of THP is too huge as 512MB, making it difficult to use. HugeTLBfs
>> may support multiple huge page sizes, and in such a special environment
>> there is a desire to use HugeTLBfs.
> 
> One of the basic questions/concerns I have is accounting for surplus huge
> pages in the default memory resource controller.  The existing hugetlb
> resource controller already takes hugetlbfs huge pages into account,
> including surplus pages.  This series would allow surplus pages to be
> accounted for in the default memory controller, or the hugetlb controller
> or both.
> 
> I understand that current mechanisms do not meet the needs of the above
> use case.  The question is whether this is an appropriate way to approach
> the issue.  My cgroup experience and knowledge is extremely limited, but
> it does not appear that any other resource can be controlled by multiple
> controllers.  Therefore, I am concerned that this may be going against
> basic cgroup design philosophy.

Thank you for your feedback.
That makes sense, surplus hugepages are charged to both memcg and hugetlb
cgroup, which may be contrary to cgroup design philosophy.

Based on the above advice, I have considered the following improvement;
what do you think?

The 'charge_surplus_hugepages' option of the v2 patch-set switched "whether
to charge memcg in addition to the hugetlb cgroup"; it will be abolished.
Instead, I will change it to an option that switches "charging only to
memcg instead of the hugetlb cgroup". This is called 'surplus_charge_to_memcg'.

The surplus_charge_to_memcg option is created per hugetlb cgroup. If it is
false (the default), the charge destination for the various page types is
the same as in the current kernel. If it becomes true, the hugetlb cgroup
stops accounting for surplus hugepages, and memcg starts accounting for
them instead.

A table showing which cgroups are charged:

page types          | current  v2(off)  v2(on)  v3(off)  v3(on)
--------------------+------------------------------------------
normal + THP        |    m        m       m        m       m
hugetlb(persistent) |    h        h       h        h       h
hugetlb(surplus)    |    h        h      m+h       h       m
---------------------------------------------------------------

 v2: charge_surplus_hugepages option
 v3: next version, surplus_charge_to_memcg option
  m: memory cgroup
  h: hugetlb cgroup
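
From userspace, the proposed v3 knob might be toggled as in the sketch
below. To be clear, the file name is made up here (the mail only says the
option lives per hugetlb cgroup) and nothing like it exists in any kernel;
the "job1" group is also illustrative:

/* v3_knob.c - hypothetical usage of the v3 option described above.
 * surplus_charge_to_memcg is a PROPOSED per-hugetlb-cgroup file; neither
 * its exact name nor the "job1" group exists on real kernels. */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/sys/fs/cgroup/hugetlb/job1/"
			"hugetlb.2MB.surplus_charge_to_memcg", "w");
	if (!f) {
		perror("open");	/* expected: the interface is only proposed */
		return 1;
	}
	/* "1": surplus hugepages move to the memcg column of the table */
	fputs("1", f);
	fclose(f);
	return 0;
}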

> 
> It would be good to get comments from people more cgroup knowledgeable,
> and especially from those involved in the decision to do separate hugetlb
> control.
> 

I stared at the commit log of mm/hugetlb_cgroup.c, but it did not seem to
give any special consideration to surplus hugepages. Later, I will send a
mail to the hugetlb cgroup's committer to ask about the intended charging
semantics for surplus hugepages.

-- 
Thanks,
Tsukada



Re: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg

2018-05-22 Thread TSUKADA Koutaro
Hi Punit,

On 2018/05/21 23:52, Punit Agrawal wrote:
> Hi Tsukada,
> 
> I was staring at memcg code to better understand your changes and had
> the below thought.
> 
> TSUKADA Koutaro  writes:
> 
> [...]
> 
>> In this patch-set, introduce the charge_surplus_huge_pages(boolean) to
>> struct hstate. If it is true, it charges to the memory cgroup to which the
>> task that obtained surplus hugepages belongs. If it is false, do nothing as
>> before, and the default value is false. The charge_surplus_huge_pages can
>> be controlled via procfs or sysfs interfaces.
> 
> Instead of tying the surplus huge page charging control per-hstate,
> could the control be made per-memcg?
> 
> This can be done by introducing a per-memory controller file in sysfs
> (memory.charge_surplus_hugepages?) that indicates whether surplus
> hugepages are to be charged to the controller and forms part of the
> total limit. IIUC, the limit already accounts for page and swap cache
> pages.
> 
> This would allow the control to be enabled per-cgroup and also keep the
> userspace control interface in one place.
> 
> As said earlier, I'm not familiar with memcg so the above might not be
> feasible, but I think it'll lead to a more coherent user
> interface. Hopefully, more knowledgeable folks on the thread can chime
> in.
> 

Thank you for the good advice.
As you mentioned, it is better to be able to control this per-memcg. After
organizing my thoughts, I will develop the next version of the patch-set to
solve these issues and try again.

Thanks,
Tsukada

> Thanks,
> Punit
> 
>> Since THP is very effective in environments with kernel page size of 4KB,
>> such as x86, there is no reason to positively use HugeTLBfs, so I think
>> that there is no situation to enable charge_surplus_huge_pages. However, in
>> some distributions such as arm64, the page size of the kernel is 64KB, and
>> the size of THP is too huge as 512MB, making it difficult to use. HugeTLBfs
>> may support multiple huge page sizes, and in such a special environment
>> there is a desire to use HugeTLBfs.
>>
>> The patch set is for 4.17.0-rc3+. I don't know whether the patch-set is
>> acceptable or not, so I have just done a simple test.
>>
>> Thanks,
>> Tsukada
>>
>> TSUKADA Koutaro (7):
>>   hugetlb: introduce charge_surplus_huge_pages to struct hstate
>>   hugetlb: supports migrate charging for surplus hugepages
>>   memcg: use compound_order rather than hpage_nr_pages
>>   mm, sysctl: make charging surplus hugepages controllable
>>   hugetlb: add charge_surplus_hugepages attribute
>>   Documentation, hugetlb: describe about charge_surplus_hugepages
>>   memcg: supports movement of surplus hugepages statistics
>>
>>  Documentation/vm/hugetlbpage.txt |6 +
>>  include/linux/hugetlb.h  |4 +
>>  kernel/sysctl.c  |7 +
>>  mm/hugetlb.c |  148 +++
>>  mm/memcontrol.c  |  109 +++-
>>  5 files changed, 269 insertions(+), 5 deletions(-)




Re: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg

2018-05-21 Thread Mike Kravetz
On 05/17/2018 09:27 PM, TSUKADA Koutaro wrote:
> Thanks to Mike Kravetz for comment on the previous version patch.
> 
> The purpose of this patch-set is to make it possible to control whether or
> not to charge surplus hugetlb pages obtained by overcommitting to memory
> cgroup. In the future, I am trying to accomplish limiting the memory usage
> of applications that use both normal pages and hugetlb pages by the memory
> cgroup (not the hugetlb cgroup).
> 
> Applications that use shared libraries like libhugetlbfs.so use both normal
> pages and hugetlb pages, but we do not know how much of each they use.
> Please suppose you want to manage the memory usage of such applications by
> cgroup. How do you set the memory cgroup and hugetlb cgroup limits when you
> want to limit memory usage to 10GB?
> 
> If you set a limit of 10GB for each, the user can use a total of 20GB of
> memory, and we cannot limit it well. Since it is difficult to estimate the
> ratio of normal pages and hugetlb pages a user will use, setting limits of
> 2GB for the memory cgroup and 8GB for the hugetlb cgroup is not a very good
> idea. In such a case, I thought that by using my patch-set, we could manage
> resources just by setting 10GB as the limit of the memory cgroup (with no
> limit on the hugetlb cgroup).
> 
> In this patch-set, introduce the charge_surplus_huge_pages(boolean) to
> struct hstate. If it is true, it charges to the memory cgroup to which the
> task that obtained surplus hugepages belongs. If it is false, do nothing as
> before, and the default value is false. The charge_surplus_huge_pages can
> be controlled via procfs or sysfs interfaces.
> 
> Since THP is very effective in environments with kernel page size of 4KB,
> such as x86, there is no reason to positively use HugeTLBfs, so I think
> that there is no situation to enable charge_surplus_huge_pages. However, in
> some distributions such as arm64, the page size of the kernel is 64KB, and
> the size of THP is too huge as 512MB, making it difficult to use. HugeTLBfs
> may support multiple huge page sizes, and in such a special environment
> there is a desire to use HugeTLBfs.

One of the basic questions/concerns I have is accounting for surplus huge
pages in the default memory resource controller.  The existing hugetlb
resource controller already takes hugetlbfs huge pages into account,
including surplus pages.  This series would allow surplus pages to be
accounted for in the default memory controller, or the hugetlb controller
or both.

I understand that current mechanisms do not meet the needs of the above
use case.  The question is whether this is an appropriate way to approach
the issue.  My cgroup experience and knowledge is extremely limited, but
it does not appear that any other resource can be controlled by multiple
controllers.  Therefore, I am concerned that this may be going against
basic cgroup design philosophy.

It would be good to get comments from people more cgroup knowledgeable,
and especially from those involved in the decision to do separate hugetlb
control.

-- 
Mike Kravetz

> 
> The patch set is for 4.17.0-rc3+. I don't know whether the patch-set is
> acceptable or not, so I have just done a simple test.
> 
> Thanks,
> Tsukada
> 
> TSUKADA Koutaro (7):
>   hugetlb: introduce charge_surplus_huge_pages to struct hstate
>   hugetlb: supports migrate charging for surplus hugepages
>   memcg: use compound_order rather than hpage_nr_pages
>   mm, sysctl: make charging surplus hugepages controllable
>   hugetlb: add charge_surplus_hugepages attribute
>   Documentation, hugetlb: describe about charge_surplus_hugepages
>   memcg: supports movement of surplus hugepages statistics
> 
>  Documentation/vm/hugetlbpage.txt |6 +
>  include/linux/hugetlb.h  |4 +
>  kernel/sysctl.c  |7 +
>  mm/hugetlb.c |  148 +++
>  mm/memcontrol.c  |  109 +++-
>  5 files changed, 269 insertions(+), 5 deletions(-)
> 


Re: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg

2018-05-21 Thread Punit Agrawal
Hi Tsukada,

I was staring at memcg code to better understand your changes and had
the below thought.

TSUKADA Koutaro  writes:

[...]

> In this patch-set, introduce the charge_surplus_huge_pages(boolean) to
> struct hstate. If it is true, it charges to the memory cgroup to which the
> task that obtained surplus hugepages belongs. If it is false, do nothing as
> before, and the default value is false. The charge_surplus_huge_pages can
> be controlled via procfs or sysfs interfaces.

Instead of tying the surplus huge page charging control per-hstate,
could the control be made per-memcg?

This can be done by introducing a per-memory controller file in sysfs
(memory.charge_surplus_hugepages?) that indicates whether surplus
hugepages are to be charged to the controller and forms part of the
total limit. IIUC, the limit already accounts for page and swap cache
pages.

This would allow the control to be enabled per-cgroup and also keep the
userspace control interface in one place.
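
From userspace, that suggestion might look like the sketch below. To be
clear, memory.charge_surplus_hugepages is only the hypothetical interface
floated above; it does not exist in any kernel, and the cgroup v1 path and
"job1" group name are illustrative:

/* knob.c - hypothetical usage of the per-memcg control suggested above.
 * memory.charge_surplus_hugepages is a PROPOSED file, not a real one. */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/sys/fs/cgroup/memory/job1/"
			"memory.charge_surplus_hugepages", "w");
	if (!f) {
		perror("open");	/* expected to fail on today's kernels */
		return 1;
	}
	/* "1": surplus hugepages allocated by tasks in this group would
	 * count toward the group's memory limit. */
	fputs("1", f);
	fclose(f);
	return 0;
}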

As said earlier, I'm not familiar with memcg so the above might not be
feasible, but I think it'll lead to a more coherent user
interface. Hopefully, more knowledgeable folks on the thread can chime
in.

Thanks,
Punit

> Since THP is very effective in environments with kernel page size of 4KB,
> such as x86, there is no reason to positively use HugeTLBfs, so I think
> that there is no situation to enable charge_surplus_huge_pages. However, in
> some distributions such as arm64, the page size of the kernel is 64KB, and
> the size of THP is too huge as 512MB, making it difficult to use. HugeTLBfs
> may support multiple huge page sizes, and in such a special environment
> there is a desire to use HugeTLBfs.
>
> The patch set is for 4.17.0-rc3+. I don't know whether the patch-set is
> acceptable or not, so I have just done a simple test.
>
> Thanks,
> Tsukada
>
> TSUKADA Koutaro (7):
>   hugetlb: introduce charge_surplus_huge_pages to struct hstate
>   hugetlb: supports migrate charging for surplus hugepages
>   memcg: use compound_order rather than hpage_nr_pages
>   mm, sysctl: make charging surplus hugepages controllable
>   hugetlb: add charge_surplus_hugepages attribute
>   Documentation, hugetlb: describe about charge_surplus_hugepages
>   memcg: supports movement of surplus hugepages statistics
>
>  Documentation/vm/hugetlbpage.txt |6 +
>  include/linux/hugetlb.h  |4 +
>  kernel/sysctl.c  |7 +
>  mm/hugetlb.c |  148 +++
>  mm/memcontrol.c  |  109 +++-
>  5 files changed, 269 insertions(+), 5 deletions(-)

