Re: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg

2018-05-24 Thread TSUKADA Koutaro
On 2018/05/25 2:45, Mike Kravetz wrote:
[...]
>> THP does not guarantee the use of huge pages, and may fall back to normal
>> pages.
> 
> Note.  You do not want to use THP because "THP does not guarantee".

[...]
>> One of the answers I have reached is to use HugeTLBfs by overcommitting,
>> without creating a pool (this is the surplus hugepage).
> 
> Using hugetlbfs overcommit also does not provide a guarantee.  Without
> doing much research, I would say the failure rate for obtaining a huge
> page via THP and hugetlbfs overcommit is about the same.  The most
> difficult issue in both cases will be obtaining a "huge page" number of
> pages from the buddy allocator.

Yes. When multiple hugetlb page sizes are not used, as on x86, the number
of pages needed for THP and for hugetlb is the same, so the failure rate of
obtaining a compound page is the same, as you said.

> I really do not think hugetlbfs overcommit will provide any benefit over
> THP for your use case.

I think that what you say is absolutely right.

>  Also, new user space code is required to "fall back"
> to normal pages in the case of hugetlbfs page allocation failure.  This
> is not needed in the THP case.

I understand the superiority of THP, but there are cases where khugepaged
occupies a CPU due to page fragmentation. If a persistent pool is set up
once instead of overcommitting, I think hugetlb can be superior, for
example with memory allocation performance exceeding THP. I will try to
find a good way to use hugetlb pages.

I sincerely thank you for your help.

-- 
Thanks,
Tsukada



Re: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg

2018-05-24 Thread TSUKADA Koutaro
On 2018/05/24 22:24, Michal Hocko wrote:
[...]
> I do not see anything like that. adjust_pool_surplus is simply an
> accounting thing. At least the last time I've checked. Maybe your
> patchset handles that?

As you said, my patch did not consider that handling when manipulating the
pool. And even if that handling were done well, it would not be a valid
reason to charge surplus hugepages to memcg.

[...]
>> You are absolutely right, but, for example, can mlock(2)ed
>> pages be swapped out by reclaim? (What is the difference between mlock(2)ed
>> pages and hugetlb pages?)
> 
> No, mlocked pages cannot be reclaimed and that is why we restrict them to
> a relatively small amount.

I understood the concept of memcg.

[...]
> Fatal? Not sure. It simply tries to add an alien memory to the memcg
> concept so I would presume an unexpected behavior (e.g. not being able
> to reclaim memcg or, over reclaim, thrashing etc.).

As you said, it must be alien. Thanks to the discussion up to this point,
I have understood that my solution is inappropriate. I will look for
another way.

Thank you for your kind explanation.

-- 
Thanks,
Tsukada




Re: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg

2018-05-24 Thread TSUKADA Koutaro
On 2018/05/24 17:20, Michal Hocko wrote:
> On Thu 24-05-18 13:39:59, TSUKADA Koutaro wrote:
>> On 2018/05/23 3:54, Michal Hocko wrote:
> [...]
>>> I am also quite confused why you keep distinguishing surplus hugetlb
>>> pages from regular preallocated ones. Being a surplus page is an
>>> implementation detail that we use for an internal accounting rather than
>>> something to exhibit to the userspace even more than we do currently.
>>
>> I apologize for the confusion.
>>
>> The hugetlb pages obtained from the pool do not consume the buddy pool.
> 
> Because they have already been allocated from the buddy allocator, the
> end result is the same.
> 
>> On
>> the other hand, surplus hugetlb pages consume the buddy pool. Due to this
>> difference in properties, I thought they could be distinguished.
> 
> But this is simply not correct. Surplus pages are fluid. If you increase
> the hugetlb size they will become regular persistent hugetlb pages.

I really cannot understand what is wrong with this. The page is obviously
released before being added to the persistent pool, and at that time it is
uncharged from the memcg to which the task belongs (this assumes my
patch-set). After that, the same page obtained from the pool is no longer
a surplus hugepage, so it will not be charged to memcg again.

>> Although my memcg knowledge is extremely limited, memcg is accounting for
>> various kinds of pages obtained from the buddy pool by the tasks belonging
>> to it. I would like to argue that surplus hugepages are special in that
>> they are obtained from the buddy pool, and that charging them to memcg
>> should be specially permitted.
> 
> Not really. Memcg accounts primarily for reclaimable memory. We do
> account for some non-reclaimable slabs but the life time should be at
> least bound to a process life time. Otherwise the memcg oom killer
> behavior is not guaranteed to unclutter the situation. Hugetlb pages are
> simply persistent. Well, to be completely honest tmpfs pages have a
> similar problem, but lacking the swap space for them is kind of a
> configuration bug.

You are absolutely right, but, for example, can mlock(2)ed
pages be swapped out by reclaim? (What is the difference between mlock(2)ed
pages and hugetlb pages?)

>> It may seem very strange to charge a hugetlb page to memcg, but essentially
>> it only charges the usage of the compound page obtained from the buddy pool,
>> and even if that page is used as a hugetlb page after that, memcg is not
>> interested in that.
> 
> Ohh, it is very much interested. The primary goal of memcg is to enforce
> the limit. How are you going to do that in an absence of the reclaimable
> memory? And quite a lot of it because hugetlb pages usually consume a
> lot of memory.

Simply kill one of the tasks belonging to that memcg. Maybe no one expects
reclaim at the time surplus hugepages are charged.

[...]
>> I could not understand the intention of this question, sorry. When
>> resizing the pool, I think that the number of surplus hugepages in use
>> does not change. Could you explain what you were concerned about?
> 
> It does change when you change the hugetlb pool size, migrate pages
> between per-numa pools (have a look at adjust_pool_surplus).

As far as I can see, what kind of fatal problem is caused by charging
surplus hugepages to memcg, when adjust_pool_surplus is just manipulating
statistical counters?
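
For reference, my reading of adjust_pool_surplus() in 4.17-era mm/hugetlb.c
is that it only moves counters and never allocates or frees a page. A
simplified sketch (the per-node selection loop is elided into a hypothetical
pick_node() helper):

static int adjust_pool_surplus(struct hstate *h, nodemask_t *nodes_allowed,
			       int delta)
{
	int node;

	VM_BUG_ON(delta != -1 && delta != 1);

	/* find a node whose surplus count can absorb the delta (elided) */
	node = pick_node(h, nodes_allowed, delta);
	if (node < 0)
		return 0;

	/* pure accounting: no page is allocated or freed here */
	h->surplus_huge_pages += delta;
	h->surplus_huge_pages_node[node] += delta;
	return 1;
}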

-- 
Thanks,
Tsukada



Re: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg

2018-05-23 Thread TSUKADA Koutaro
On 2018/05/23 3:54, Michal Hocko wrote:
> On Tue 22-05-18 22:04:23, TSUKADA Koutaro wrote:
>> On 2018/05/22 3:07, Mike Kravetz wrote:
>>> On 05/17/2018 09:27 PM, TSUKADA Koutaro wrote:
>>>> Thanks to Mike Kravetz for the comments on the previous version of this
>>>> patch.
>>>>
>>>> The purpose of this patch-set is to make it possible to control whether or
>>>> not to charge surplus hugetlb pages obtained by overcommitting to the
>>>> memory cgroup. In the future, I am trying to accomplish limiting the memory
>>>> usage of applications that use both normal pages and hugetlb pages by the
>>>> memory cgroup (not by the hugetlb cgroup).
>>>>
>>>> Applications that use shared libraries like libhugetlbfs.so use both normal
>>>> pages and hugetlb pages, but we do not know how much of each they will use.
>>>> Please suppose you want to manage the memory usage of such applications by
>>>> cgroup. How do you set the memory cgroup and hugetlb cgroup limits when you
>>>> want to limit memory usage to 10GB?
>>>>
>>>> If you set a limit of 10GB for each, the user can use a total of 20GB of
>>>> memory, so it cannot be limited well. Since it is difficult to estimate the
>>>> ratio of normal pages and hugetlb pages a user will use, setting limits of
>>>> 2GB on the memory cgroup and 8GB on the hugetlb cgroup is not a very good
>>>> idea. In such a case, I thought that by using my patch-set, we could manage
>>>> resources just by setting 10GB as the limit of the memory cgroup (with no
>>>> limit on the hugetlb cgroup).
>>>>
>>>> In this patch-set, introduce charge_surplus_huge_pages (boolean) in
>>>> struct hstate. If it is true, surplus hugepages are charged to the memory
>>>> cgroup to which the task that obtained them belongs. If it is false, do
>>>> nothing as before, and the default value is false. charge_surplus_huge_pages
>>>> can be controlled via procfs or sysfs interfaces.
>>>>
>>>> Since THP is very effective in environments with a kernel page size of 4KB,
>>>> such as x86, there is no reason to actively use HugeTLBfs, so I think
>>>> that there is no situation in which to enable charge_surplus_huge_pages.
>>>> However, on some distributions, such as arm64 ones, the kernel page size is
>>>> 64KB, and the THP size is then as huge as 512MB, making THP difficult to
>>>> use. HugeTLBfs may support multiple huge page sizes, and in such a special
>>>> environment there is a desire to use HugeTLBfs.
>>>
>>> One of the basic questions/concerns I have is accounting for surplus huge
>>> pages in the default memory resource controller.  The existing hugetlb
>>> resource controller already takes hugetlbfs huge pages into account,
>>> including surplus pages.  This series would allow surplus pages to be
>>> accounted for in the default memory controller, or the hugetlb controller,
>>> or both.
>>>
>>> I understand that current mechanisms do not meet the needs of the above
>>> use case.  The question is whether this is an appropriate way to approach
>>> the issue.
> 
> I do share your view Mike!
> 
>>> My cgroup experience and knowledge is extremely limited, but
>>> it does not appear that any other resource can be controlled by multiple
>>> controllers.  Therefore, I am concerned that this may be going against
>>> basic cgroup design philosophy.
>>
>> Thank you for your feedback.
>> That makes sense: surplus hugepages would be charged to both memcg and the
>> hugetlb cgroup, which may be contrary to the cgroup design philosophy.
>>
>> Based on the above advice, I have considered the following improvement;
>> what do you think about it?
>>
>> The 'charge_surplus_hugepages' of the v2 patch-set was an option to choose
>> "whether to charge memcg in addition to the hugetlb cgroup", but it will be
>> abolished. Instead, it is changed to an option to "charge only memcg instead
>> of the hugetlb cgroup". This is called 'surplus_charge_to_memcg'.
> 
> This all looks so hackish and ad-hoc that I would be tempted to give it
> an outright nack, but let's hear more about why we need this fiddling
> at all. I've asked in another email so I guess I will get an answer there,
> but let me just emphasize again that I absolutely detest the possibility
> of putting hugetlb pages into the memcg mix. They just do not belong there.
> Try to look at the previous discussions of why it was decided to have
> separate hugetlb pages at all.
> 
> I am also quite confused why you keep distinguishing surplus hugetlb
> pages from regular preallocated ones. [...]


Re: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg

2018-05-23 Thread TSUKADA Koutaro
On 2018/05/22 22:51, Michal Hocko wrote:
> On Fri 18-05-18 13:27:27, TSUKADA Koutaro wrote:
>> The purpose of this patch-set is to make it possible to control whether or
>> not to charge surplus hugetlb pages obtained by overcommitting to the
>> memory cgroup. In the future, I am trying to accomplish limiting the memory
>> usage of applications that use both normal pages and hugetlb pages by the
>> memory cgroup (not by the hugetlb cgroup).
> 
> There was a deliberate decision to keep hugetlb and "normal" memory
> cgroup controllers separate. Mostly because hugetlb memory is an
> artificial memory subsystem on its own and it doesn't fit into the rest
> of memcg accounted memory very well. I believe we want to keep that
> status quo.
> 
>> Applications that use shared libraries like libhugetlbfs.so use both normal
>> pages and hugetlb pages, but we do not know how much of each they will use.
>> Please suppose you want to manage the memory usage of such applications by
>> cgroup. How do you set the memory cgroup and hugetlb cgroup limits when you
>> want to limit memory usage to 10GB?
> 
> Well, such a usecase requires an explicit configuration already. Either
> by using special wrappers or modifying the code. So I would argue that
> you have quite a good knowledge of the setup. If you need greater
> flexibility then just do not use hugetlb at all and rely on THP.
> [...]
> 
>> In this patch-set, introduce charge_surplus_huge_pages (boolean) in
>> struct hstate. If it is true, surplus hugepages are charged to the memory
>> cgroup to which the task that obtained them belongs. If it is false, do
>> nothing as before, and the default value is false. charge_surplus_huge_pages
>> can be controlled via procfs or sysfs interfaces.
> 
> I do not really think this is a good idea. We really do not want to make
> the current hugetlb code more complex than it is already. The current
> hugetlb cgroup controller is simple and works at least somehow. I would
> not add more on top unless there is a _really_ strong usecase behind it.
> Please make sure to describe such a usecase in detail before we even
> start considering the code.

Thank you for your time.

I do not know if it is really a strong use case, but I will explain my
motivation in detail. English is not my native language, so please pardon
my poor English.

I am one of the developers of software that manages the resources used by
user jobs on HPC clusters running Linux. The resource is mainly memory.
An HPC cluster may be shared and used by multiple people. Therefore, the
memory used by each user must be strictly controlled; otherwise a user's
job may run away, and not only hamper the other users but crash the entire
system through OOM.

Some HPC users are very nervous about performance. Jobs are executed
while synchronizing via MPI communication across multiple compute nodes.
Since CPU wait time occurs when synchronizing, they want to minimize the
variation in execution time on each node to reduce waiting times as much
as possible. We call this variation noise.

THP does not guarantee the use of huge pages and may fall back to normal
pages. This mechanism is one cause of variation (noise).

Users who know this mechanism will be hesitant to use THP. However, these
users also know the benefit of the huge page's TLB hit rate, and huge
pages seem attractive to them. It seems natural that such users are
interested in HugeTLBfs; I do not know at all whether it is the right
approach or not.

At the very least, our HPC system is pursuing high versatility, and we
have to consider whether we can provide HugeTLBfs if users want to use it.

In order to use HugeTLBfs we need to create a persistent pool, but in our
use case of sharing nodes, it would be impossible to create, delete or
resize the pool.

One of the answers I have reached is to use HugeTLBfs by overcommitting,
without creating a pool (this is the surplus hugepage).

A surplus hugepage is a hugetlb page, but I think that consuming the buddy
pool is, at the least, a decisive difference from a hugetlb page of the
persistent pool. If nr_overcommit_hugepages is assumed to be infinite,
allocating pages for surplus hugepages from the buddy pool is entirely
unlimited even if the task is limited by memcg. In extreme cases,
overcommitment will allow users to exhaust the entire memory of the
system. Of course, this can be prevented by the hugetlb cgroup, but even
if we set limits for memcg and the hugetlb cgroup respectively, as I asked
in the first mail (set the limit to 10GB), the control will not work.
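
To make the scenario concrete, below is a minimal user-space sketch of the
allocation path I am describing (illustration only; 2MB hugepages and a
zero-sized persistent pool are assumed, error handling is abbreviated):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	/* With no persistent pool, allow surplus hugepages to be created. */
	int fd = open("/proc/sys/vm/nr_overcommit_hugepages", O_WRONLY);
	if (fd < 0 || write(fd, "1024", 4) < 0) {
		perror("nr_overcommit_hugepages");
		return 1;
	}
	close(fd);

	/*
	 * Every hugepage faulted here is a surplus hugepage taken straight
	 * from the buddy allocator; without this patch-set, memcg does not
	 * account for any of it.
	 */
	size_t len = 64UL << 20;	/* 32 x 2MB */
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	memset(p, 0, len);		/* fault in the surplus hugepages */
	munmap(p, len);
	return 0;
}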

I thought I could charge surplus hugepages to memcg, but maybe I did not
have enough knowledge about memcg. I will reply to the other mail with the
details.

>> Since THP is very effective in environments with a kernel page size of 4KB,
>> such as x86, there is no reason to actively use HugeTLBfs, so I think
>> that there is no situation in which to enable charge_surplus_huge_pages. [...]


Re: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg

2018-05-22 Thread TSUKADA Koutaro
On 2018/05/22 3:07, Mike Kravetz wrote:
> On 05/17/2018 09:27 PM, TSUKADA Koutaro wrote:
>> Thanks to Mike Kravetz for the comments on the previous version of this
>> patch.
>>
>> The purpose of this patch-set is to make it possible to control whether or
>> not to charge surplus hugetlb pages obtained by overcommitting to the
>> memory cgroup. In the future, I am trying to accomplish limiting the memory
>> usage of applications that use both normal pages and hugetlb pages by the
>> memory cgroup (not by the hugetlb cgroup).
>>
>> Applications that use shared libraries like libhugetlbfs.so use both normal
>> pages and hugetlb pages, but we do not know how much of each they will use.
>> Please suppose you want to manage the memory usage of such applications by
>> cgroup. How do you set the memory cgroup and hugetlb cgroup limits when you
>> want to limit memory usage to 10GB?
>>
>> If you set a limit of 10GB for each, the user can use a total of 20GB of
>> memory, so it cannot be limited well. Since it is difficult to estimate the
>> ratio of normal pages and hugetlb pages a user will use, setting limits of
>> 2GB on the memory cgroup and 8GB on the hugetlb cgroup is not a very good
>> idea. In such a case, I thought that by using my patch-set, we could manage
>> resources just by setting 10GB as the limit of the memory cgroup (with no
>> limit on the hugetlb cgroup).
>>
>> In this patch-set, introduce charge_surplus_huge_pages (boolean) in
>> struct hstate. If it is true, surplus hugepages are charged to the memory
>> cgroup to which the task that obtained them belongs. If it is false, do
>> nothing as before, and the default value is false. charge_surplus_huge_pages
>> can be controlled via procfs or sysfs interfaces.
>>
>> Since THP is very effective in environments with a kernel page size of 4KB,
>> such as x86, there is no reason to actively use HugeTLBfs, so I think
>> that there is no situation in which to enable charge_surplus_huge_pages.
>> However, on some distributions, such as arm64 ones, the kernel page size is
>> 64KB, and the THP size is then as huge as 512MB, making THP difficult to
>> use. HugeTLBfs may support multiple huge page sizes, and in such a special
>> environment there is a desire to use HugeTLBfs.
> 
> One of the basic questions/concerns I have is accounting for surplus huge
> pages in the default memory resource controller.  The existing hugetlb
> resource controller already takes hugetlbfs huge pages into account,
> including surplus pages.  This series would allow surplus pages to be
> accounted for in the default memory controller, or the hugetlb controller,
> or both.
> 
> I understand that current mechanisms do not meet the needs of the above
> use case.  The question is whether this is an appropriate way to approach
> the issue.  My cgroup experience and knowledge is extremely limited, but
> it does not appear that any other resource can be controlled by multiple
> controllers.  Therefore, I am concerned that this may be going against
> basic cgroup design philosophy.

Thank you for your feedback.
That makes sense: surplus hugepages would be charged to both memcg and the
hugetlb cgroup, which may be contrary to the cgroup design philosophy.

Based on the above advice, I have considered the following improvement;
what do you think about it?

The 'charge_surplus_hugepages' of the v2 patch-set was an option to choose
"whether to charge memcg in addition to the hugetlb cgroup", but it will be
abolished. Instead, it is changed to an option to "charge only memcg instead
of the hugetlb cgroup". This is called 'surplus_charge_to_memcg'.

The surplus_charge_to_memcg option is created per hugetlb cgroup. If it is
false (the default), the charge destination for each page type is the same
as in the current kernel. If it becomes true, the hugetlb cgroup stops
accounting for surplus hugepages, and memcg starts accounting for them
instead.

A table showing which cgroups are charged:

page types          | current | v2(off) | v2(on) | v3(off) | v3(on)
--------------------+---------+---------+--------+---------+--------
normal + THP        |    m    |    m    |    m   |    m    |    m
hugetlb(persistent) |    h    |    h    |    h   |    h    |    h
hugetlb(surplus)    |    h    |    h    |   m+h  |    h    |    m
--------------------------------------------------------------------

 v2: charge_surplus_hugepages option
 v3: next version, surplus_charge_to_memcg option
  m: memory cgroup
  h: hugetlb cgroup

> 
> It would be good to get comments from people more cgroup knowledgeable,
> and especially from those involved in the decision to do separate hugetlb
> control.
> 

I stared at the commit log of mm/hugetlb_cgroup.c, but it did not seem to
have specially considered surplus hugepages. Later, I will send a mail to
the hugetlb cgroup's committer to ask about the surplus hugepage charging
specification.

-- 
Thanks,
Tsukada



Re: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg

2018-05-22 Thread TSUKADA Koutaro
Hi Punit,

On 2018/05/21 23:52, Punit Agrawal wrote:
> Hi Tsukada,
> 
> I was staring at memcg code to better understand your changes and had
> the below thought.
> 
> TSUKADA Koutaro <tsuk...@ascade.co.jp> writes:
> 
> [...]
> 
>> In this patch-set, introduce charge_surplus_huge_pages (boolean) in
>> struct hstate. If it is true, surplus hugepages are charged to the memory
>> cgroup to which the task that obtained them belongs. If it is false, do
>> nothing as before, and the default value is false. charge_surplus_huge_pages
>> can be controlled via procfs or sysfs interfaces.
> 
> Instead of tying the surplus huge page charging control per-hstate,
> could the control be made per-memcg?
> 
> This can be done by introducing a per-memory controller file in sysfs
> (memory.charge_surplus_hugepages?) that indicates whether surplus
> hugepages are to be charged to the controller and forms part of the
> total limit. IIUC, the limit already accounts for page and swap cache
> pages.
> 
> This would allow the control to be enabled per-cgroup and also keep the
> userspace control interface in one place.
> 
> As said earlier, I'm not familiar with memcg so the above might not be
> feasible, but I think it'll lead to a more coherent user
> interface. Hopefully, more knowledgeable folks on the thread can chime
> in.
> 

Thank you for the good advice.
As you mentioned, it is better to be able to control this per-memcg. After
organizing my thoughts, I will develop a next version of the patch-set that
solves these issues, and challenge again.
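
If I understand the suggestion correctly, the control file could be wired up
roughly as below. This is only a sketch of the idea, not code from the
series; it assumes the cgroup-v1 cftype interface, a new boolean member in
struct mem_cgroup, and a hypothetical file name:

static u64 mem_cgroup_charge_surplus_read(struct cgroup_subsys_state *css,
					  struct cftype *cft)
{
	return mem_cgroup_from_css(css)->charge_surplus_hugepages;
}

static int mem_cgroup_charge_surplus_write(struct cgroup_subsys_state *css,
					   struct cftype *cft, u64 val)
{
	if (val > 1)
		return -EINVAL;
	mem_cgroup_from_css(css)->charge_surplus_hugepages = val;
	return 0;
}

/* candidate entry for the memcg file table in mm/memcontrol.c */
static struct cftype charge_surplus_hugepages_file = {
	.name = "charge_surplus_hugepages",
	.read_u64 = mem_cgroup_charge_surplus_read,
	.write_u64 = mem_cgroup_charge_surplus_write,
};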

Thanks,
Tsukada

> Thanks,
> Punit
> 
>> Since THP is very effective in environments with a kernel page size of 4KB,
>> such as x86, there is no reason to actively use HugeTLBfs, so I think
>> that there is no situation in which to enable charge_surplus_huge_pages.
>> However, on some distributions, such as arm64 ones, the kernel page size is
>> 64KB, and the THP size is then as huge as 512MB, making THP difficult to
>> use. HugeTLBfs may support multiple huge page sizes, and in such a special
>> environment there is a desire to use HugeTLBfs.
>>
>> The patch-set is for 4.17.0-rc3+. I don't know whether the patch-set is
>> acceptable or not, so I have just done a simple test.
>>
>> Thanks,
>> Tsukada
>>
>> TSUKADA Koutaro (7):
>>   hugetlb: introduce charge_surplus_huge_pages to struct hstate
>>   hugetlb: supports migrate charging for surplus hugepages
>>   memcg: use compound_order rather than hpage_nr_pages
>>   mm, sysctl: make charging surplus hugepages controllable
>>   hugetlb: add charge_surplus_hugepages attribute
>>   Documentation, hugetlb: describe about charge_surplus_hugepages
>>   memcg: supports movement of surplus hugepages statistics
>>
>>  Documentation/vm/hugetlbpage.txt |    6 +
>>  include/linux/hugetlb.h          |    4 +
>>  kernel/sysctl.c                  |    7 +
>>  mm/hugetlb.c                     |  148 ++++++++++++++++++++++++++++++++++
>>  mm/memcontrol.c                  |  109 ++++++++++++++++++++++++++++++++-
>>  5 files changed, 269 insertions(+), 5 deletions(-)
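
As a usage illustration only: with the v2 interface above, the knob would be
flipped from user space roughly as follows. The exact paths are inferred
from the titles of patches 4/7 and 5/7 and are assumptions, not confirmed by
the series:

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	/* global sysctl switch (the kernel/sysctl.c change; path assumed) */
	int fd = open("/proc/sys/vm/charge_surplus_hugepages", O_WRONLY);
	if (fd >= 0) {
		write(fd, "1", 1);
		close(fd);
	}

	/* per-hstate sysfs attribute (the mm/hugetlb.c change; name assumed) */
	fd = open("/sys/kernel/mm/hugepages/hugepages-2048kB/"
		  "charge_surplus_hugepages", O_WRONLY);
	if (fd >= 0) {
		write(fd, "1", 1);
		close(fd);
	}
	return 0;
}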




Re: [PATCH v2 3/7] memcg: use compound_order rather than hpage_nr_pages

2018-05-20 Thread TSUKADA Koutaro

On 2018/05/19 2:51, Punit Agrawal wrote:

Punit Agrawal <punit.agra...@arm.com> writes:


Tsukada-san,

I am not familiar with memcg so can't comment about whether the patchset
is the right way to solve the problem outlined in the cover letter but
had a couple of comments about this patch.

TSUKADA Koutaro <tsuk...@ascade.co.jp> writes:


The current memcg implementation assumes that a compound page is THP. In
order to be able to charge surplus hugepages, we use compound_order.

Signed-off-by: TSUKADA Koutaro <tsuk...@ascade.co.jp>


Please move this before Patch 1/7. This is to prevent wrong accounting
of pages to memcg for size != PMD_SIZE.


I just noticed that the default state is off so the change isn't enabled
until the sysfs node is exposed in the next patch. Please ignore this
comment.

One below still applies.




---
  mm/memcontrol.c |   10 +++++-----
  1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2bd3df3..a8f1ff8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4483,7 +4483,7 @@ static int mem_cgroup_move_account(struct page *page,
 				   struct mem_cgroup *to)
 {
 	unsigned long flags;
-	unsigned int nr_pages = compound ? hpage_nr_pages(page) : 1;
+	unsigned int nr_pages = compound ? (1 << compound_order(page)) : 1;


Instead of replacing calls to hpage_nr_pages(), is it possible to modify
it to do the calculation?


Thank you for reviewing my code, and please just call me Tsukada.

I think it is possible to modify hpage_nr_pages() itself rather than
replacing the calls to it.

Inferring from the processing that hpage_nr_pages() is meant to do, I
thought that its definition could be moved outside of
CONFIG_TRANSPARENT_HUGEPAGE. It seems that both THP and HugeTLBfs pages can
be handled correctly because compound_order() decides based on whether the
page is a PageHead or not.
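
To make the size issue concrete (my numbers, assuming x86-64; not part of
the patch):

/*
 * THP:          2MB, order 9  -> old hpage_nr_pages() == HPAGE_PMD_NR == 512
 * hugetlb 2MB:  order 9       -> 1 << compound_order(page) == 512 (same)
 * hugetlb 1GB:  order 18      -> 1 << compound_order(page) == 262144,
 *               while the old PageTransHuge()-based helper would return
 *               512 for any compound head, mis-sizing hugetlb pages whose
 *               size is not PMD_SIZE.
 */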

Also, I would like to use compound_order() inside hpage_nr_pages(), but
since huge_mm.h is included before mm.h, where compound_order() is defined,
I moved hpage_nr_pages() to mm.h.

Instead of patch 3/7, does the following patch implement what you
intended?

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index a8a1262..1186ab7 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -204,12 +204,6 @@ static inline spinlock_t *pud_trans_huge_lock(pud_t *pud,
else
return NULL;
 }
-static inline int hpage_nr_pages(struct page *page)
-{
-   if (unlikely(PageTransHuge(page)))
-   return HPAGE_PMD_NR;
-   return 1;
-}

 struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
pmd_t *pmd, int flags);
@@ -254,8 +248,6 @@ static inline bool thp_migration_supported(void)
 #define HPAGE_PUD_MASK ({ BUILD_BUG(); 0; })
 #define HPAGE_PUD_SIZE ({ BUILD_BUG(); 0; })

-#define hpage_nr_pages(x) 1
-
 static inline bool transparent_hugepage_enabled(struct vm_area_struct *vma)
 {
return false;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1ac1f06..082f2ee 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -673,6 +673,12 @@ static inline unsigned int compound_order(struct page *page)
return page[1].compound_order;
 }

+static inline int hpage_nr_pages(struct page *page)
+{
+   VM_BUG_ON_PAGE(PageTail(page), page);
+   return (1 << compound_order(page));
+}
+
 static inline void set_compound_order(struct page *page, unsigned int order)
 {
page[1].compound_order = order;

--
Thanks,
Tsukada



[PATCH v2 7/7] memcg: supports movement of surplus hugepages statistics

2018-05-17 Thread TSUKADA Koutaro
When a task that has charged surplus hugepages moves to another memory
cgroup, update the statistical information correctly.

Signed-off-by: TSUKADA Koutaro <tsuk...@ascade.co.jp>
---
 mm/memcontrol.c |   99 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 99 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a8f1ff8..63f0922 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4698,12 +4698,110 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
return 0;
 }

+#ifdef CONFIG_HUGETLB_PAGE
+static enum mc_target_type get_mctgt_type_hugetlb(struct vm_area_struct *vma,
+   unsigned long addr, pte_t *pte, union mc_target *target)
+{
+   struct page *page = NULL;
+   pte_t entry;
+   enum mc_target_type ret = MC_TARGET_NONE;
+
+   if (!(mc.flags & MOVE_ANON))
+   return ret;
+
+   entry = huge_ptep_get(pte);
+   if (!pte_present(entry))
+   return ret;
+
+   page = pte_page(entry);
+   VM_BUG_ON_PAGE(!page || !PageHead(page), page);
+   if (likely(!PageSurplusCharge(page)))
+   return ret;
+   if (page->mem_cgroup == mc.from) {
+   ret = MC_TARGET_PAGE;
+   if (target) {
+   get_page(page);
+   target->page = page;
+   }
+   }
+
+   return ret;
+}
+
+static int hugetlb_count_precharge_pte_range(pte_t *pte, unsigned long hmask,
+   unsigned long addr, unsigned long end,
+   struct mm_walk *walk)
+{
+   struct vm_area_struct *vma = walk->vma;
+   struct mm_struct *mm = walk->mm;
+   spinlock_t *ptl;
+   union mc_target target;
+
+   ptl = huge_pte_lock(hstate_vma(vma), mm, pte);
+	if (get_mctgt_type_hugetlb(vma, addr, pte, &target) == MC_TARGET_PAGE) {
+   mc.precharge += (1 << compound_order(target.page));
+   put_page(target.page);
+   }
+   spin_unlock(ptl);
+
+   return 0;
+}
+
+static int hugetlb_move_charge_pte_range(pte_t *pte, unsigned long hmask,
+   unsigned long addr, unsigned long end,
+   struct mm_walk *walk)
+{
+   struct vm_area_struct *vma = walk->vma;
+   struct mm_struct *mm = walk->mm;
+   spinlock_t *ptl;
+   enum mc_target_type target_type;
+   union mc_target target;
+   struct page *page;
+   unsigned long nr_pages;
+
+   ptl = huge_pte_lock(hstate_vma(vma), mm, pte);
+	target_type = get_mctgt_type_hugetlb(vma, addr, pte, &target);
+   if (target_type == MC_TARGET_PAGE) {
+   page = target.page;
+   nr_pages = (1 << compound_order(page));
+   if (mc.precharge < nr_pages) {
+   put_page(page);
+   goto unlock;
+   }
+   if (!mem_cgroup_move_account(page, true, mc.from, mc.to)) {
+   mc.precharge -= nr_pages;
+   mc.moved_charge += nr_pages;
+   }
+   put_page(page);
+   }
+unlock:
+   spin_unlock(ptl);
+
+   return 0;
+}
+#else
+static int hugetlb_count_precharge_pte_range(pte_t *pte, unsigned long hmask,
+   unsigned long addr, unsigned long end,
+   struct mm_walk *walk)
+{
+   return 0;
+}
+
+static int hugetlb_move_charge_pte_range(pte_t *pte, unsigned long hmask,
+   unsigned long addr, unsigned long end,
+   struct mm_walk *walk)
+{
+   return 0;
+}
+#endif
+
 static unsigned long mem_cgroup_count_precharge(struct mm_struct *mm)
 {
unsigned long precharge;

struct mm_walk mem_cgroup_count_precharge_walk = {
.pmd_entry = mem_cgroup_count_precharge_pte_range,
+   .hugetlb_entry = hugetlb_count_precharge_pte_range,
.mm = mm,
};
down_read(&mm->mmap_sem);
@@ -4981,6 +5079,7 @@ static void mem_cgroup_move_charge(void)
 {
struct mm_walk mem_cgroup_move_charge_walk = {
.pmd_entry = mem_cgroup_move_charge_pte_range,
+   .hugetlb_entry = hugetlb_move_charge_pte_range,
.mm = mc.mm,
};

-- 
Tsukada



[PATCH v2 6/7] Documentation, hugetlb: describe about charge_surplus_hugepages,

2018-05-17 Thread TSUKADA Koutaro
Add a description about charge_surplus_hugepages.

Signed-off-by: TSUKADA Koutaro <tsuk...@ascade.co.jp>
---
 hugetlbpage.txt |6 ++
 1 file changed, 6 insertions(+)

diff --git a/Documentation/vm/hugetlbpage.txt b/Documentation/vm/hugetlbpage.txt
index faf077d..af8d112 100644
--- a/Documentation/vm/hugetlbpage.txt
+++ b/Documentation/vm/hugetlbpage.txt
@@ -129,6 +129,11 @@ number of "surplus" huge pages from the kernel's normal page pool, when the
 persistent huge page pool is exhausted. As these surplus huge pages become
 unused, they are freed back to the kernel's normal page pool.

+/proc/sys/vm/charge_surplus_hugepages indicates whether to charge "surplus"
+huge pages obtained from the normal page pool to the memory cgroup. If true,
+the amount that can be overcommitted is limited to the memory usage allowed
+by the memory cgroup to which the task belongs. The default value is false.
+
 When increasing the huge page pool size via nr_hugepages, any existing surplus
 pages will first be promoted to persistent huge pages.  Then, additional
 huge pages will be allocated, if necessary and if possible, to fulfill
@@ -169,6 +174,7 @@ Inside each of these directories, the same set of files will exist:
free_hugepages
resv_hugepages
surplus_hugepages
+   charge_surplus_hugepages

 which function as described above for the default huge page-sized case.

-- 
Tsukada




[PATCH v2 5/7] hugetlb: add charge_surplus_hugepages attribute

2018-05-17 Thread TSUKADA Koutaro
Add an entry for charge_surplus_hugepages to sysfs.
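As a usage sketch (this program is not part of the patch; the
hugepages-2048kB directory name assumes x86 2MB huge pages, and root is
needed to write the attribute), the knob can be toggled per huge page size:

/* sketch: toggle and read back the per-hstate attribute */
#include <stdio.h>

int main(void)
{
	const char *path =
		"/sys/kernel/mm/hugepages/hugepages-2048kB/charge_surplus_hugepages";
	FILE *f = fopen(path, "w");
	int val = -1;

	if (!f) {
		perror("fopen");	/* attribute absent or no permission */
		return 1;
	}
	fprintf(f, "1\n");		/* enable charging for this hstate */
	fclose(f);

	f = fopen(path, "r");
	if (f && fscanf(f, "%d", &val) == 1)
		printf("charge_surplus_hugepages = %d\n", val);
	if (f)
		fclose(f);
	return 0;
}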

Signed-off-by: TSUKADA Koutaro <tsuk...@ascade.co.jp>
---
 hugetlb.c |   25 +
 1 file changed, 25 insertions(+)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9a9549c..2f9bdbc 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2662,6 +2662,30 @@ static ssize_t surplus_hugepages_show(struct kobject *kobj,
 }
 HSTATE_ATTR_RO(surplus_hugepages);

+static ssize_t charge_surplus_hugepages_show(struct kobject *kobj,
+   struct kobj_attribute *attr, char *buf)
+{
+   struct hstate *h = kobj_to_hstate(kobj, NULL);
+   return sprintf(buf, "%d\n", h->charge_surplus_huge_pages);
+}
+
+static ssize_t charge_surplus_hugepages_store(struct kobject *kobj,
+   struct kobj_attribute *attr, const char *buf, size_t len)
+{
+   int err;
+   unsigned long input;
+   struct hstate *h = kobj_to_hstate(kobj, NULL);
+
+   err = kstrtoul(buf, 10, &input);
+   if (err)
+   return err;
+
+   h->charge_surplus_huge_pages = input ? true : false;
+
+   return len;
+}
+HSTATE_ATTR(charge_surplus_hugepages);
+
 static struct attribute *hstate_attrs[] = {
&nr_hugepages_attr.attr,
&nr_overcommit_hugepages_attr.attr,
@@ -2671,6 +2695,7 @@ static ssize_t surplus_hugepages_show(struct kobject *kobj,
 #ifdef CONFIG_NUMA
&nr_hugepages_mempolicy_attr.attr,
 #endif
+   &charge_surplus_hugepages_attr.attr,
NULL,
 };

-- 
Tsukada




[PATCH v2 4/7] mm, sysctl: make charging surplus hugepages controllable

2018-05-17 Thread TSUKADA Koutaro
Make charging of the default hstate's surplus hugepages controllable via
/proc/sys/vm/charge_surplus_hugepages.

Signed-off-by: TSUKADA Koutaro <tsuk...@ascade.co.jp>
---
 include/linux/hugetlb.h |2 ++
 kernel/sysctl.c |7 +++
 mm/hugetlb.c|   21 +
 3 files changed, 30 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 33fe5be..9314b07 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -80,6 +80,8 @@ struct hugepage_subpool *hugepage_new_subpool(struct hstate *h, long max_hpages,
 void reset_vma_resv_huge_pages(struct vm_area_struct *vma);
 int hugetlb_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
 int hugetlb_overcommit_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
+int hugetlb_charge_surplus_handler(struct ctl_table *, int, void __user *,
+   size_t *, loff_t *);
 int hugetlb_treat_movable_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);

 #ifdef CONFIG_NUMA
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 6a78cf7..d562d64 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1394,6 +1394,13 @@ static int sysrq_sysctl_handler(struct ctl_table *table, int write,
.mode   = 0644,
.proc_handler   = hugetlb_overcommit_handler,
},
+   {
+   .procname   = "charge_surplus_hugepages",
+   .data   = NULL,
+   .maxlen = sizeof(int),
+   .mode   = 0644,
+   .proc_handler   = hugetlb_charge_surplus_handler,
+   },
 #endif
{
.procname   = "lowmem_reserve_ratio",
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2e7b543..9a9549c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3069,6 +3069,27 @@ int hugetlb_overcommit_handler(struct ctl_table *table, int write,
return ret;
 }

+int hugetlb_charge_surplus_handler(struct ctl_table *table, int write,
+   void __user *buffer, size_t *length, loff_t *ppos)
+{
+   struct hstate *h = &default_hstate;
+   int tmp, ret;
+
+   if (!hugepages_supported())
+   return -EOPNOTSUPP;
+
+   tmp = h->charge_surplus_huge_pages ? 1 : 0;
+   table->data = &tmp;
+   table->maxlen = sizeof(int);
+   ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
+   if (ret)
+   goto out;
+
+   if (write)
+   h->charge_surplus_huge_pages = tmp ? true : false;
+out:
+   return ret;
+}
 #endif /* CONFIG_SYSCTL */

 void hugetlb_report_meminfo(struct seq_file *m)

--
Tsukada



[PATCH v2 3/7] memcg: use compound_order rather than hpage_nr_pages

2018-05-17 Thread TSUKADA Koutaro
The current memcg implementation assumes that a compound page is a THP. In
order to be able to charge surplus hugepages, use compound_order instead.
For example, on an arm64 kernel with 64KB base pages, a 2MB contiguous-bit
hugetlb page consists of 32 base pages, which compound_order reports
correctly, while hpage_nr_pages would return HPAGE_PMD_NR (8192).

Signed-off-by: TSUKADA Koutaro <tsuk...@ascade.co.jp>
---
 memcontrol.c |   10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2bd3df3..a8f1ff8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4483,7 +4483,7 @@ static int mem_cgroup_move_account(struct page *page,
   struct mem_cgroup *to)
 {
unsigned long flags;
-   unsigned int nr_pages = compound ? hpage_nr_pages(page) : 1;
+   unsigned int nr_pages = compound ? (1 << compound_order(page)) : 1;
int ret;
bool anon;

@@ -5417,7 +5417,7 @@ int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
  bool compound)
 {
struct mem_cgroup *memcg = NULL;
-   unsigned int nr_pages = compound ? hpage_nr_pages(page) : 1;
+   unsigned int nr_pages = compound ? (1 << compound_order(page)) : 1;
int ret = 0;

if (mem_cgroup_disabled())
@@ -5478,7 +5478,7 @@ int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
 void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
  bool lrucare, bool compound)
 {
-   unsigned int nr_pages = compound ? hpage_nr_pages(page) : 1;
+   unsigned int nr_pages = compound ? (1 << compound_order(page)) : 1;

VM_BUG_ON_PAGE(!page->mapping, page);
VM_BUG_ON_PAGE(PageLRU(page) && !lrucare, page);
@@ -5522,7 +5522,7 @@ void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
 void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg,
bool compound)
 {
-   unsigned int nr_pages = compound ? hpage_nr_pages(page) : 1;
+   unsigned int nr_pages = compound ? (1 << compound_order(page)) : 1;

if (mem_cgroup_disabled())
return;
@@ -5729,7 +5729,7 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage)

/* Force-charge the new page. The old one will be freed soon */
compound = PageTransHuge(newpage);
-   nr_pages = compound ? hpage_nr_pages(newpage) : 1;
+   nr_pages = compound ? (1 << compound_order(newpage)) : 1;

page_counter_charge(&memcg->memory, nr_pages);
if (do_memsw_account())

-- 
Tsukada



[PATCH v2 2/7] hugetlb: support migrate charging for surplus hugepages

2018-05-17 Thread TSUKADA Koutaro
Surplus hugepages allocated for migration are also charged to the memory cgroup.

Signed-off-by: TSUKADA Koutaro <tsuk...@ascade.co.jp>
---
 hugetlb.c |2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 679c151f..2e7b543 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1687,6 +1687,8 @@ static struct page *alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask,
if (!page)
return NULL;

+   surplus_hugepage_set_charge(h, page);
+
/*
 * We do not account these pages as surplus because they are only
 * temporary and will be released properly on the last reference

-- 
Tsukada



[PATCH v2 1/7] hugetlb: introduce charge_surplus_huge_pages to struct hstate

2018-05-17 Thread TSUKADA Koutaro
The charge_surplus_huge_pages flag indicates whether surplus huge pages
obtained from the normal page pool are charged to the memory cgroup. The
default value is false.

This patch implements the core part of charging surplus hugepages. Use the
private and mem_cgroup members of the second page of a compound hugepage for
surplus hugepage charging.

Mark the page when a surplus hugepage is obtained from the normal pool, and
charge it to the memory cgroup at alloc_huge_page. Once the mapping of the
page is decided, commit the charge. Surplus hugepages are uncharged, or the
charge is cancelled, at free_huge_page.

Signed-off-by: TSUKADA Koutaro <tsuk...@ascade.co.jp>
---
 include/linux/hugetlb.h |2
 mm/hugetlb.c|  100 
 2 files changed, 102 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 36fa6a2..33fe5be 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -158,6 +158,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
unsigned long address, unsigned long end, pgprot_t newprot);

 bool is_hugetlb_entry_migration(pte_t pte);
+bool PageSurplusCharge(struct page *page);

 #else /* !CONFIG_HUGETLB_PAGE */

@@ -338,6 +339,7 @@ struct hstate {
unsigned int nr_huge_pages_node[MAX_NUMNODES];
unsigned int free_huge_pages_node[MAX_NUMNODES];
unsigned int surplus_huge_pages_node[MAX_NUMNODES];
+   bool charge_surplus_huge_pages; /* default to off */
 #ifdef CONFIG_CGROUP_HUGETLB
/* cgroup control files */
struct cftype cgroup_files[5];
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2186791..679c151f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -36,6 +36,7 @@
 #include <linux/node.h>
 #include <linux/userfaultfd_k.h>
 #include <linux/page_owner.h>
+#include <linux/memcontrol.h>
 #include "internal.h"

 int hugetlb_max_hstate __read_mostly;
@@ -1236,6 +1237,90 @@ static inline void ClearPageHugeTemporary(struct page *page)
page[2].mapping = NULL;
 }

+#define HUGETLB_SURPLUS_CHARGE 1UL
+
+bool PageSurplusCharge(struct page *page)
+{
+   if (!PageHuge(page))
+   return false;
+   return page[1].private == HUGETLB_SURPLUS_CHARGE;
+}
+
+static inline void SetPageSurplusCharge(struct page *page)
+{
+   page[1].private = HUGETLB_SURPLUS_CHARGE;
+}
+
+static inline void ClearPageSurplusCharge(struct page *page)
+{
+   page[1].private = 0;
+}
+
+static inline void
+set_surplus_hugepage_memcg(struct page *page, struct mem_cgroup *memcg)
+{
+   page[1].mem_cgroup = memcg;
+}
+
+static inline struct mem_cgroup *get_surplus_hugepage_memcg(struct page *page)
+{
+   return page[1].mem_cgroup;
+}
+
+static void surplus_hugepage_set_charge(struct hstate *h, struct page *page)
+{
+   if (likely(!h->charge_surplus_huge_pages))
+   return;
+   if (unlikely(!page))
+   return;
+   SetPageSurplusCharge(page);
+}
+
+static int surplus_hugepage_try_charge(struct page *page, struct mm_struct *mm)
+{
+   struct mem_cgroup *memcg;
+
+   if (likely(!PageSurplusCharge(page)))
+   return 0;
+
+   if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg, true)) {
+   /* mem_cgroup oom invoked */
+   ClearPageSurplusCharge(page);
+   return -ENOMEM;
+   }
+   set_surplus_hugepage_memcg(page, memcg);
+
+   return 0;
+}
+
+static void surplus_hugepage_commit_charge(struct page *page)
+{
+   struct mem_cgroup *memcg;
+
+   if (likely(!PageSurplusCharge(page)))
+   return;
+
+   memcg = get_surplus_hugepage_memcg(page);
+   mem_cgroup_commit_charge(page, memcg, false, true);
+   set_surplus_hugepage_memcg(page, NULL);
+}
+
+static void surplus_hugepage_finalize_charge(struct page *page)
+{
+   struct mem_cgroup *memcg;
+
+   if (likely(!PageSurplusCharge(page)))
+   return;
+
+   memcg = get_surplus_hugepage_memcg(page);
+   if (memcg)
+   mem_cgroup_cancel_charge(page, memcg, true);
+   else
+   mem_cgroup_uncharge(page);
+   set_surplus_hugepage_memcg(page, NULL);
+   ClearPageSurplusCharge(page);
+}
+
 void free_huge_page(struct page *page)
 {
/*
@@ -1248,6 +1333,8 @@ void free_huge_page(struct page *page)
	struct hugepage_subpool *spool =
		(struct hugepage_subpool *)page_private(page);
bool restore_reserve;

+   surplus_hugepage_finalize_charge(page);
+
set_page_private(page, 0);
page->mapping = NULL;
VM_BUG_ON_PAGE(page_count(page), page);
@@ -1583,6 +1670,8 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
 out_unlock:
spin_unlock(&hugetlb_lock);

+   surplus_hugepage_set_charge(h, page);
+
return page;
 }

@@ -2062,6 +2151,11 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
hugetlb_cgroup_commit_charge(idx, pages_per_huge_page(h), h_cg, page);
spin_unlock(&hugetlb_lock);

+   if (unlikely(surplus_hugepage_try_charge(page, vma->vm_mm))) {

[PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg

2018-05-17 Thread TSUKADA Koutaro
Thanks to Mike Kravetz for comment on the previous version patch.

The purpose of this patch-set is to make it possible to control whether or
not surplus hugetlb pages obtained by overcommitting are charged to the
memory cgroup. My eventual goal is to limit the memory usage of applications
that use both normal pages and hugetlb pages with the memory cgroup alone
(not with the hugetlb cgroup).

Applications that use shared libraries like libhugetlbfs.so use both normal
pages and hugetlb pages, but we do not know in advance how much of each they
will use. Suppose you want to manage the memory usage of such applications
with cgroups: how do you set the memory cgroup and hugetlb cgroup limits
when you want to limit memory usage to 10GB?

If you set a limit of 10GB for each, the user can use a total of 20GB of
memory, so usage is not limited well. Since it is difficult to estimate the
ratio of normal pages to hugetlb pages a user will consume, setting limits
of 2GB for the memory cgroup and 8GB for the hugetlb cgroup is not a very
good idea either. In such a case, I thought that with my patch-set we could
manage resources just by setting 10GB as the limit of the memory cgroup
(with no limit on the hugetlb cgroup).

This patch-set introduces a boolean, charge_surplus_huge_pages, to struct
hstate. If it is true, surplus hugepages are charged to the memory cgroup to
which the task that obtained them belongs. If it is false, nothing changes
from the current behavior; the default value is false. The
charge_surplus_huge_pages flag can be controlled through procfs or sysfs
interfaces.
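
As a minimal usage sketch (this program is not part of the patches; it
assumes the patch-set is applied, 2MB default huge pages, no persistent
pool, nr_overcommit_hugepages raised, and charge_surplus_hugepages set to
1), every huge page faulted in below is a surplus page and should be
charged to the caller's memory cgroup:

/* sketch only: a pool-less surplus hugepage user */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define LEN (8UL << 20)	/* four 2MB huge pages */

int main(void)
{
	char *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");		/* no surplus page was available */
		return 1;
	}
	memset(p, 0, LEN);	/* fault in the surplus huge pages */
	/* the 8MB should now appear in this task's memory cgroup usage */
	getchar();		/* pause here to inspect memory.stat */
	munmap(p, LEN);
	return 0;
}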

Since THP is very effective in environments with a kernel page size of 4KB,
such as x86, there is little reason to actively use HugeTLBfs there, so I
think there is no situation in which charge_surplus_huge_pages would be
enabled. However, some distributions on arm64 use a kernel page size of
64KB, and the THP size is then 512MB, which is too large to use easily.
HugeTLBfs can support multiple huge page sizes, and in such a special
environment there is a desire to use HugeTLBfs.

The patch-set is for 4.17.0-rc3+. I do not know whether it is acceptable or
not, so I have only done a simple test.

Thanks,
Tsukada

TSUKADA Koutaro (7):
  hugetlb: introduce charge_surplus_huge_pages to struct hstate
  hugetlb: supports migrate charging for surplus hugepages
  memcg: use compound_order rather than hpage_nr_pages
  mm, sysctl: make charging surplus hugepages controllable
  hugetlb: add charge_surplus_hugepages attribute
  Documentation, hugetlb: describe about charge_surplus_hugepages
  memcg: supports movement of surplus hugepages statistics

 Documentation/vm/hugetlbpage.txt |6 +
 include/linux/hugetlb.h  |4 +
 kernel/sysctl.c  |7 +
 mm/hugetlb.c |  148 +++
 mm/memcontrol.c  |  109 +++-
 5 files changed, 269 insertions(+), 5 deletions(-)

-- 
Tsukada




Re: [PATCH] memcg, hugetlb: pages allocated for hugetlb's overcommit will be charged to memcg

2018-05-07 Thread TSUKADA Koutaro
On 2018/05/04 13:26, Mike Kravetz wrote:
> Thank you for the explanation of your use case.  I now understand what
> you were trying to accomplish with your patch.
> 
> Your use case reminds me of the session at LSFMM titled "swap accounting".
> https://lwn.net/Articles/753162/
> 
> I hope someone with more cgroup expertise (Johannes? Aneesh?) can provide
> comments.  My experience in that area is limited.

I am waiting for comments from those with expertise. The point is whether a
surplus hugetlb page allocated directly from the buddy pool should be
charged to the memory cgroup or not.

> One question that comes to mind is "Why would the user/application writer
> use hugetlbfs overcommit instead of THP?".  For hugetlbfs overcommit, they
> need to be prepared for huge page allocations to fail.  So, in some cases
> they may not be able to use any hugetlbfs pages.  This is not much different
> than THP.  However, in the THP case huge page allocation failures and fall
> back to base pages is transparent to the user.  With THP, the normal memory
> cgroup controller should work well.

Certainly THP is much easier to use than hugetlb on a kernel with a 4KB page
size. On the other hand, some distributions (SUSE, RHEL) use a page size of
64KB, and the THP size in that case is 512MB (not 2MB). I am afraid that a
512MB huge page is somewhat difficult to use.

In hugetlbfs, the number of available page sizes increases by using the
contiguous bit supported by the aarch64 architecture, so 2MB, 512MB, 16GB,
and 4TB pages can be used in a 64KB environment (with a 64KB granule, 32
contiguous PTEs give 2MB, a PMD block gives 512MB, 32 contiguous PMD blocks
give 16GB, and a PUD block gives 4TB; actually, only 2MB is usable...). I
also believe THP is best in the 4KB environment, but I am considering how to
use huge pages in the 64KB environment.
-- 
Tsukada Koutaro




Re: [PATCH] memcg, hugetlb: pages allocated for hugetlb's overcommit will be charged to memcg

2018-05-03 Thread TSUKADA Koutaro
On 2018/05/03 11:33, Mike Kravetz wrote:
> On 05/01/2018 11:54 PM, TSUKADA Koutaro wrote:
>> On 2018/05/02 13:41, Mike Kravetz wrote:
>>> What is the reason for not charging pages at allocation/reserve time?  I am
>>> not an expert in memcg accounting, but I would think the pages should be
>>> charged at allocation time.  Otherwise, a task could allocate a large number
>>> of (reserved) pages that are not charged to a memcg.  memcg charges in other
>>> code paths seem to happen at huge page allocation time.
>>
>> If we charge to memcg at page allocation time, the new page is not yet
>> registered in rmap, so it will be accounted as 'cache' inside the memcg. 
>> Then,
>> after being registered in rmap, memcg will account as 'RSS' if the task moves
>> cgroup, so I am worried about the possibility of inconsistency in statistics
>> (memory.stat).
>>
>> As you said, in this patch, there may be a problem that a memory leak occurs
>> due to unused pages after being reserved.
>>
>>>> This patch targets RHELSA(kernel-alt-4.11.0-45.6.1.el7a.src.rpm).
>>>
>>> It would be very helpful to rebase this patch on a recent mainline kernel.
>>> The code to allocate surplus huge pages has been significantly changed in
>>> recent kernels.
>>>
>>> I have no doubt that this is a real issue and we are not correctly charging
>>> surplus to a memcg.  But your patch will be hard to evaluate when based on
>>> an older distro kernel.
>> I apologize for the patch of the old kernel. The patch was rewritten
>> for 4.17-rc2(6d08b06).
> 
> Thank you very much for rebasing the patch.
> 
> I did not look closely at your patch until now.  My first thought was that
> you  were changing/expanding the existing accounting.  However, it appears
> that you want to account for hugetlb surplus pages in the memory cgroup.
> Is there any reason why the hugetlb cgroup resource controller does not meet
> your needs?  It a quick look at the code, it does appear to handle surplus
> pages correctly.

Yes, basically it is exactly what you are talking about, but my usage is
somewhat special. I would like users who submit jobs on an HPC cluster to
use hugetlb pages. When submitting a job, the user specifies a memory
resource (for example, sbatch --mem in slurm).

If the user specifies 10GB, we assume that the system administrator has set
a limit of 10GB for the memory cgroup and the hugetlb cgroup respectively,
does not create a hugetlb pool, and configures the system so that it can
overcommit. Then users can use 10GB of normal pages plus another 10GB of
hugetlb pages by overcommitting, which means a user can use 20GB in total.
However, the administrator wants to restrict normal pages and hugetlb pages
to 10GB in total.

Since it is difficult to estimate the ratio of normal pages to hugetlb pages
used, setting limits of 2GB for the memory cgroup and 8GB for the hugetlb
cgroup is not a very good idea.

In such a case, with my patch, I thought that the administrator can manage
the resources just by setting 10GB as the limit of the memory cgroup (no
limit is set for the hugetlb cgroup).
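
A sketch of the administrator-side setup I have in mind (the cgroup path
and values are only examples; cgroup v1 is assumed, and the
charge_surplus_hugepages knob only exists with the later v2 patch-set
applied):

/* sketch: pool-less overcommit plus a single 10GB memcg limit */
#include <stdio.h>

static int write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	fprintf(f, "%s\n", val);
	return fclose(f);
}

int main(void)
{
	/* no persistent pool; allow a large amount of overcommit */
	write_str("/proc/sys/vm/nr_hugepages", "0");
	write_str("/proc/sys/vm/nr_overcommit_hugepages", "1048576");
	/* charge surplus hugepages to the task's memory cgroup */
	write_str("/proc/sys/vm/charge_surplus_hugepages", "1");
	/* one 10GB limit for the job; no hugetlb cgroup limit is set */
	write_str("/sys/fs/cgroup/memory/job0/memory.limit_in_bytes",
		  "10737418240");
	return 0;
}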

-- 




Re: [PATCH] memcg, hugetlb: pages allocated for hugetlb's overcommit will be charged to memcg

2018-05-02 Thread TSUKADA Koutaro
On 2018/05/02 13:41, Mike Kravetz wrote:
> What is the reason for not charging pages at allocation/reserve time?  I am
> not an expert in memcg accounting, but I would think the pages should be
> charged at allocation time.  Otherwise, a task could allocate a large number
> of (reserved) pages that are not charged to a memcg.  memcg charges in other
> code paths seem to happen at huge page allocation time.

If we charge to memcg at page allocation time, the new page is not yet
registered in rmap, so it will be accounted as 'cache' inside the memcg.
Then, after it is registered in rmap, memcg will account it as 'RSS' if the
task moves cgroup, so I am worried about the possibility of an inconsistency
in the statistics (memory.stat).

As you said, with this patch there may be a problem that a memory leak
occurs due to pages that remain unused after being reserved.

>> This patch targets RHELSA(kernel-alt-4.11.0-45.6.1.el7a.src.rpm).
> 
> It would be very helpful to rebase this patch on a recent mainline kernel.
> The code to allocate surplus huge pages has been significantly changed in
> recent kernels.
> 
> I have no doubt that this is a real issue and we are not correctly charging
> surplus to a memcg.  But your patch will be hard to evaluate when based on
> an older distro kernel.
I apologize for basing the patch on the old kernel. The patch was rewritten
for 4.17-rc2 (6d08b06).

Mark the page as overcommitted at alloc_surplus_huge_page. Since the path
through alloc_pool_huge_page is creating a pool, I do not think it is
necessary to mark pages there.

Signed-off-by: TSUKADA koutaro <tsuk...@ascade.co.jp>
---
 include/linux/hugetlb.h |   45 +
 mm/hugetlb.c|   74 +++
 mm/memcontrol.c |   99 ++--
 3 files changed, 214 insertions(+), 4 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 36fa6a2..a92c4e2 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -532,6 +532,51 @@ static inline void set_huge_swap_pte_at(struct mm_struct *mm, unsigned long addr
set_huge_pte_at(mm, addr, ptep, pte);
 }
 #endif
+
+#define HUGETLB_OVERCOMMIT 1UL
+#define HUGETLB_OVERCOMMIT_CHARGED 2UL
+
+static inline void add_hugetlb_compound_private(struct page *page,
+   unsigned long val)
+{
+   page[1].private |= val;
+}
+
+static inline unsigned long get_hugetlb_compound_private(struct page *page)
+{
+   return page_private(&page[1]);
+}
+
+static inline void add_hugetlb_overcommit(struct page *page)
+{
+   add_hugetlb_compound_private(page, HUGETLB_OVERCOMMIT);
+}
+
+static inline void del_hugetlb_overcommit(struct page *page)
+{
+   page[1].private &= ~(HUGETLB_OVERCOMMIT);
+}
+
+static inline int is_hugetlb_overcommit(struct page *page)
+{
+   return (get_hugetlb_compound_private(page) & HUGETLB_OVERCOMMIT);
+}
+
+static inline void add_hugetlb_overcommit_charged(struct page *page)
+{
+   add_hugetlb_compound_private(page, HUGETLB_OVERCOMMIT_CHARGED);
+}
+
+static inline void del_hugetlb_overcommit_charged(struct page *page)
+{
+   page[1].private &= ~(HUGETLB_OVERCOMMIT_CHARGED);
+}
+
+static inline int is_hugetlb_overcommit_charged(struct page *page)
+{
+   return (get_hugetlb_compound_private(page) &
+   HUGETLB_OVERCOMMIT_CHARGED);
+}
 #else  /* CONFIG_HUGETLB_PAGE */
 struct hstate {};
 #define alloc_huge_page(v, a, r) NULL
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2186791..fd34f15 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -36,6 +36,7 @@
 #include <linux/node.h>
 #include <linux/userfaultfd_k.h>
 #include <linux/page_owner.h>
+#include <linux/memcontrol.h>
 #include "internal.h"

 int hugetlb_max_hstate __read_mostly;
@@ -1236,6 +1237,17 @@ static inline void ClearPageHugeTemporary(struct page *page)
page[2].mapping = NULL;
 }

+static void hugetlb_overcommit_finalize(struct page *page)
+{
+   if (is_hugetlb_overcommit_charged(page)) {
+   del_hugetlb_overcommit_charged(page);
+   mem_cgroup_uncharge(page);
+   }
+
+   if (is_hugetlb_overcommit(page))
+   del_hugetlb_overcommit(page);
+}
+
 void free_huge_page(struct page *page)
 {
/*
@@ -1248,6 +1260,8 @@ void free_huge_page(struct page *page)
	struct hugepage_subpool *spool =
		(struct hugepage_subpool *)page_private(page);
bool restore_reserve;

+   hugetlb_overcommit_finalize(page);
+
set_page_private(page, 0);
page->mapping = NULL;
VM_BUG_ON_PAGE(page_count(page), page);
@@ -1562,6 +1576,8 @@ int dissolve_free_huge_pages(unsigned long start_pfn, unsigned long end_pfn)
page = alloc_fresh_huge_page(h, gfp_mask, nid, nmask);
if (!page)
return NULL;
+   else
+   add_hugetlb_overcommit(page);

spin_lock(&hugetlb_lock);
/*
@@ -3509,6 +3525,8 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,

[PATCH] memcg, hugetlb: pages allocated for hugetlb's overcommit will be charged to memcg

2018-05-01 Thread TSUKADA Koutaro
If nr_overcommit_hugepages is assumed to be infinite, allocating pages for
hugetlb's overcommit from the buddy pool is effectively unlimited, even if
the task is limited by memcg. The purpose of this patch is that if we
allocate a hugetlb page from the buddy pool, we should charge it to memcg.

A straightforward way for user applications to use hugetlb pages is to
create a pool (nr_hugepages), but root privileges are required. In the field
of HPC, for example, giving root privileges to general users is difficult.
Also, creating a pool means we have to estimate exactly how much hugetlb
will be used, which feels a bit troublesome.

In such a case, using hugetlb's overcommit feature, we considered letting
the user use hugetlb pages only through overcommit, without creating any
pool. However, as mentioned earlier, in the current implementation pages can
be allocated without limit under overcommit. Therefore, by introducing memcg
charging, I wanted to be able to manage the memory resources used by the
user application with memcg's limit alone.

This patch targets RHELSA(kernel-alt-4.11.0-45.6.1.el7a.src.rpm). The patch
does the following things.

When allocating a page from the buddy allocator in
__hugetlb_alloc_buddy_huge_page, set the HUGETLB_OVERCOMMIT flag in that
page[1].private. When a page with HUGETLB_OVERCOMMIT set is actually used
(at hugepage_add_new_anon_rmap or huge_add_to_page_cache), try to charge it
to memcg. If the charge is successful, add the HUGETLB_OVERCOMMIT_CHARGED
flag to that page[1].

The page charged to memcg will finally be uncharged at free_huge_page.
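
The flag handling above can be illustrated with a small self-contained
sketch (a plain unsigned long stands in for page[1].private; this is an
illustration, not kernel code). Note that setting a flag uses |= while
clearing must use &= with the complement, which is what the del_* helpers
in the patch below are meant to do:

/* sketch: the page[1].private overcommit flag scheme in userspace */
#include <assert.h>

#define HUGETLB_OVERCOMMIT		1UL
#define HUGETLB_OVERCOMMIT_CHARGED	2UL

int main(void)
{
	unsigned long private = 0;	/* stands in for page[1].private */

	private |= HUGETLB_OVERCOMMIT;		/* marked at allocation */
	private |= HUGETLB_OVERCOMMIT_CHARGED;	/* marked after charging */
	assert(private & HUGETLB_OVERCOMMIT);
	assert(private & HUGETLB_OVERCOMMIT_CHARGED);

	private &= ~HUGETLB_OVERCOMMIT_CHARGED;	/* cleared at uncharge */
	private &= ~HUGETLB_OVERCOMMIT;		/* cleared at free */
	assert(private == 0);
	return 0;
}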

Modification of memcontrol.c is for updating of statistical information
when the task moves cgroup. The hpage_nr_pages works correctly for thp,
but it is not suitable for such as hugetlb which uses the contiguous bit
of aarch64, so need to modify it.

Signed-off-by: TSUKADA Koutaro <tsuk...@ascade.co.jp>
---
 include/linux/hugetlb.h |   45 ++
 mm/hugetlb.c|   80 +++
 mm/memcontrol.c |   98 ++--
 3 files changed, 219 insertions(+), 4 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index d67675e..67991cb 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -511,6 +511,51 @@ static inline void set_huge_swap_pte_at(struct mm_struct 
*mm, unsigned long addr
set_huge_pte_at(mm, addr, ptep, pte);
 }
 #endif
+
+#define HUGETLB_OVERCOMMIT 1UL
+#define HUGETLB_OVERCOMMIT_CHARGED 2UL
+
+static inline void add_hugetlb_compound_private(struct page *page,
+   unsigned long val)
+{
+   page[1].private |= val;
+}
+
+static inline unsigned long get_hugetlb_compound_private(struct page *page)
+{
+   return page_private([1]);
+}
+
+static inline void add_hugetlb_overcommit(struct page *page)
+{
+   add_hugetlb_compound_private(page, HUGETLB_OVERCOMMIT);
+}
+
+static inline void del_hugetlb_overcommit(struct page *page)
+{
+   add_hugetlb_compound_private(page, ~(HUGETLB_OVERCOMMIT));
+}
+
+static inline int is_hugetlb_overcommit(struct page *page)
+{
+   return (get_hugetlb_compound_private(page) & HUGETLB_OVERCOMMIT);
+}
+
+static inline void add_hugetlb_overcommit_charged(struct page *page)
+{
+   add_hugetlb_compound_private(page, HUGETLB_OVERCOMMIT_CHARGED);
+}
+
+static inline void del_hugetlb_overcommit_charged(struct page *page)
+{
+   add_hugetlb_compound_private(page, ~(HUGETLB_OVERCOMMIT_CHARGED));
+}
+
+static inline int is_hugetlb_overcommit_charged(struct page *page)
+{
+   return (get_hugetlb_compound_private(page) &
+   HUGETLB_OVERCOMMIT_CHARGED);
+}
 #else  /* CONFIG_HUGETLB_PAGE */
 struct hstate {};
 #define alloc_huge_page(v, a, r) NULL
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6191fb9..2cd91d9 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 

 #include 
 #include 
@@ -1227,6 +1228,17 @@ static void clear_page_huge_active(struct page *page)
ClearPagePrivate([1]);
 }

+static void hugetlb_overcommit_finalize(struct page *page)
+{
+   if (is_hugetlb_overcommit_charged(page)) {
+   del_hugetlb_overcommit_charged(page);
+   mem_cgroup_uncharge(page);
+   }
+   if (is_hugetlb_overcommit(page)) {
+   del_hugetlb_overcommit(page);
+   }
+}
+
 void free_huge_page(struct page *page)
 {
/*
@@ -1239,6 +1251,8 @@ void free_huge_page(struct page *page)
(struct hugepage_subpool *)page_private(page);
bool restore_reserve;

+   hugetlb_overcommit_finalize(page);
+
set_page_private(page, 0);
page->mapping = NULL;
VM_BUG_ON_PAGE(page_count(page), page);
@@ -1620,6 +1634,13 @@ static struct page *__alloc_buddy_huge_page(struct 
hstate *h,
spin_unlock(_loc

[PATCH] memcg, hugetlb: pages allocated for hugetlb's overcommit will be charged to memcg

2018-05-01 Thread TSUKADA Koutaro
If nr_overcommit_hugepages is effectively unlimited, the pages allocated
for hugetlb's overcommit come from the buddy allocator without any bound,
even when the allocating task is constrained by memcg. The point of this
patch is that a hugetlb page allocated from the buddy allocator should be
charged to memcg.

A straightforward way for user applications to use hugetlb pages is to
create a pool (nr_hugepages), but that requires root privileges. In HPC
environments, for example, giving root privileges to general users is
difficult. Creating a pool also means estimating in advance exactly how
many hugetlb pages will be used, which is troublesome.

In such a case, hugetlb's overcommit feature can let the user obtain
hugetlb pages without creating any pool. However, as mentioned above, in
the current implementation overcommitted pages can be allocated without
limit. Therefore, by introducing memcg charging, the memory resources used
by the user application can be managed solely through memcg's limits.
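For illustration, here is a minimal, hypothetical userspace sketch (not
part of this patch): it obtains an overcommitted hugetlb page with
mmap(MAP_HUGETLB), assuming the administrator has raised
vm.nr_overcommit_hugepages and placed the task in a memcg with a limit.

#define _GNU_SOURCE		/* MAP_HUGETLB */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 2UL * 1024 * 1024;	/* one 2MB hugepage on x86_64 */
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");	/* no pool page and no overcommit room */
		return 1;
	}
	memset(p, 0, len);	/* fault in: the point where memcg would charge */
	munmap(p, len);
	return 0;
}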

This patch targets RHELSA (kernel-alt-4.11.0-45.6.1.el7a.src.rpm). The patch
does the following things.

When a page is allocated from the buddy allocator in
__hugetlb_alloc_buddy_huge_page, the HUGETLB_OVERCOMMIT flag is set in
page[1].private. When a page with HUGETLB_OVERCOMMIT set is actually used
(in hugepage_add_new_anon_rmap or huge_add_to_page_cache), the page is
charged to memcg. If the charge succeeds, the HUGETLB_OVERCOMMIT_CHARGED
flag is added to page[1].private.

The page charged to memcg will finally be uncharged at free_huge_page.
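To summarize the lifecycle, here is a self-contained model of the
page[1].private flag arithmetic (plain userspace C, not kernel code; priv
stands in for page[1].private). Note that clearing a flag must use
"&= ~flag"; OR-ing the complement would set every other bit, which is why
the del_* helpers in the diff below clear the bit instead of reusing the
OR helper.

#include <assert.h>

#define HUGETLB_OVERCOMMIT		1UL
#define HUGETLB_OVERCOMMIT_CHARGED	2UL

int main(void)
{
	unsigned long priv = 0;

	/* __hugetlb_alloc_buddy_huge_page: mark the page as overcommitted */
	priv |= HUGETLB_OVERCOMMIT;

	/* hugepage_add_new_anon_rmap/huge_add_to_page_cache: charge succeeded */
	priv |= HUGETLB_OVERCOMMIT_CHARGED;
	assert(priv == (HUGETLB_OVERCOMMIT | HUGETLB_OVERCOMMIT_CHARGED));

	/* free_huge_page -> hugetlb_overcommit_finalize: uncharge and clear */
	priv &= ~HUGETLB_OVERCOMMIT_CHARGED;
	priv &= ~HUGETLB_OVERCOMMIT;
	assert(priv == 0);
	return 0;
}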

The modification to memcontrol.c updates the statistical information when
a task moves between cgroups. hpage_nr_pages() works correctly for THP,
but it is not suitable for hugetlb pages such as those using the aarch64
contiguous bit, so it needs to be adjusted.
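As a sketch of that adjustment (my assumption, not the exact hunk applied
to memcontrol.c), a statistics helper could derive the count from the
compound order so that every hstate size, including aarch64
contiguous-bit hugepages, is counted correctly:

/* Hypothetical helper, kernel context assumed: count base pages for
 * memcg statistics. Unlike hpage_nr_pages(), this also handles hugetlb
 * sizes that differ from HPAGE_PMD_NR. */
static inline unsigned long memcg_nr_pages(struct page *page)
{
	page = compound_head(page);
	if (PageHuge(page))		/* hugetlb: any hstate size */
		return 1UL << compound_order(page);
	if (PageTransHuge(page))	/* THP */
		return HPAGE_PMD_NR;
	return 1;
}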

Signed-off-by: TSUKADA Koutaro 
---
 include/linux/hugetlb.h |   45 ++
 mm/hugetlb.c            |   80 +++
 mm/memcontrol.c         |   98 ++--
 3 files changed, 219 insertions(+), 4 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index d67675e..67991cb 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -511,6 +511,51 @@ static inline void set_huge_swap_pte_at(struct mm_struct *mm, unsigned long addr
set_huge_pte_at(mm, addr, ptep, pte);
 }
 #endif
+
+#define HUGETLB_OVERCOMMIT 1UL
+#define HUGETLB_OVERCOMMIT_CHARGED 2UL
+
+static inline void add_hugetlb_compound_private(struct page *page,
+   unsigned long val)
+{
+   page[1].private |= val;
+}
+
+static inline unsigned long get_hugetlb_compound_private(struct page *page)
+{
+   return page_private(&page[1]);
+}
+
+static inline void add_hugetlb_overcommit(struct page *page)
+{
+   add_hugetlb_compound_private(page, HUGETLB_OVERCOMMIT);
+}
+
+static inline void del_hugetlb_overcommit(struct page *page)
+{
+   page[1].private &= ~(HUGETLB_OVERCOMMIT);
+}
+
+static inline int is_hugetlb_overcommit(struct page *page)
+{
+   return (get_hugetlb_compound_private(page) & HUGETLB_OVERCOMMIT);
+}
+
+static inline void add_hugetlb_overcommit_charged(struct page *page)
+{
+   add_hugetlb_compound_private(page, HUGETLB_OVERCOMMIT_CHARGED);
+}
+
+static inline void del_hugetlb_overcommit_charged(struct page *page)
+{
+   page[1].private &= ~(HUGETLB_OVERCOMMIT_CHARGED);
+}
+
+static inline int is_hugetlb_overcommit_charged(struct page *page)
+{
+   return (get_hugetlb_compound_private(page) &
+   HUGETLB_OVERCOMMIT_CHARGED);
+}
 #else  /* CONFIG_HUGETLB_PAGE */
 struct hstate {};
 #define alloc_huge_page(v, a, r) NULL
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6191fb9..2cd91d9 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 

 #include 
 #include 
@@ -1227,6 +1228,17 @@ static void clear_page_huge_active(struct page *page)
ClearPagePrivate(&page[1]);
 }

+static void hugetlb_overcommit_finalize(struct page *page)
+{
+   if (is_hugetlb_overcommit_charged(page)) {
+   del_hugetlb_overcommit_charged(page);
+   mem_cgroup_uncharge(page);
+   }
+   if (is_hugetlb_overcommit(page)) {
+   del_hugetlb_overcommit(page);
+   }
+}
+
 void free_huge_page(struct page *page)
 {
/*
@@ -1239,6 +1251,8 @@ void free_huge_page(struct page *page)
struct hugepage_subpool *spool =
	(struct hugepage_subpool *)page_private(page);
bool restore_reserve;

+   hugetlb_overcommit_finalize(page);
+
set_page_private(page, 0);
page->mapping = NULL;
VM_BUG_ON_PAGE(page_count(page), page);
@@ -1620,6 +1634,13 @@ static struct page *__alloc_buddy_huge_page(struct hstate *h,
spin_unlock(&hugetlb_lock);