Re: [PATCH 0/2] Make core_pattern support namespace

2016-03-19 Thread Kamezawa Hiroyuki
On 2016/03/16 18:23, Zhao Lei wrote:
> We discussed patch titled:
>   [PATCH] Make core_pattern support namespace
> before.
> 
> The above patch can solve half of the problem of a custom core_dump
> pattern in a container, but there is another problem that limits a
> custom core_pattern in a container: a pipe-type core_pattern will
> write the core dump into the host's filesystem.
> (See the discussion of that patch for details.)
> 
> Now we can solve the second problem with [PATCH 1/2], so I am sending
> the original patch together with it.
> 

Let me confirm my understanding of your design.

This patch uses fork+execve() rather than calling the usermode helper (UMH)
in the pipe-coredump path, and the coredump-pipe process runs as a child of
the dumped process. Right?
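
For reference, a minimal user-space analogy of the flow described above (an
illustration only, not the patch's kernel code): the dumping process forks a
child, which execve()s the core_pattern helper so that it inherits the dumped
process's namespaces, with the core streamed to the helper's stdin through a
pipe.

/*
 * User-space analogy only: start the pipe-type core_pattern helper via
 * fork+execve as a child of the dumping process, feeding the core image
 * through a pipe to its stdin.  Paths and arguments are illustrative.
 */
#include <unistd.h>

static void run_core_helper(char *const helper_argv[])
{
	int pipefd[2];

	if (pipe(pipefd))
		return;

	if (fork() == 0) {			/* child: becomes the helper */
		dup2(pipefd[0], STDIN_FILENO);	/* helper reads the core from stdin */
		close(pipefd[1]);
		execve(helper_argv[0], helper_argv, NULL);
		_exit(127);			/* reached only if execve() fails */
	}
	close(pipefd[0]);
	/* the parent (the dumped task's context) writes core data to pipefd[1] */
}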

Doesn't this break existing solutions actually used in distros?

BTW, this is the first time I have seen _do_fork() called from outside
fork.c. Isn't it better to add a new function in fork.c if we really need
this?

Thanks,
-Kame




Re: call_usermodehelper in containers

2016-02-19 Thread Kamezawa Hiroyuki

On 2016/02/19 14:37, Ian Kent wrote:

On Fri, 2016-02-19 at 12:08 +0900, Kamezawa Hiroyuki wrote:

On 2016/02/19 5:45, Eric W. Biederman wrote:

Personally I am a fan of the don't-be-clever, capture-a-kernel-thread
approach, as it is very easy to see what, if any, exploitation
opportunities there are.  The justification for something more clever
is trickier.  Of course we do something that from this perspective would
be considered ``clever'' today with kthreadd and user mode helpers.



I read the old discussion. Let me try to clarify how to create a helper
kernel thread to run the usermode helper using kthreadd.

0) Define a trigger to create an independent usermodehelper
environment for a container.
Option A) when some namespace (pid, uid, etc.) is created
Option B) when a new nsproxy is created
Option C) when a new system call or some sysctl is invoked,
e.g. make_private_usermode_helper() or similar.

   It's expected this should be triggered by the init process of a
container with some capability.
   And the scope of the effect should be defined: pid namespace? nsproxy?
or a new namespace?

1) Create a helper thread.
task = kthread_create(kthread_work_fn, ?, ?, "usermodehelper")
switch the task's nsproxy to current's (switch_task_namespaces())
switch the task's cgroups to current's (cgroup_attach_task_all())
switch the task's cred to current's
copy the task's capabilities from current
(and anything else?)
wake_up_process()

And create a link between the kthread_wq and the container.


Not sure I quite understand this but I thought the difficulty with this
approach previously (even though the approach was very much incomplete)
was knowing that all the "moving parts" would not allow vulnerabilities.


Ok, that was discussed.


And it looks like this would require a kernel thread for each instance.
So for a thousand containers that each mount an NFS mount that means, at
least, 1000 additional kernel threads. Might be able to sell that, if we
were lucky, but from a system administration POV it's horrible.


I agree.


There's also the question of existence (aka. lifetime) to deal with
since the thread above needs to be created at a time other than the
usermode helper callback.

What happens for SIGKILL on a container?


It depends on how the helper kthread is tied to a container-related object.
If the kthread is linked with some namespace, we can kill it when the
namespace goes away.
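
For example, if the helper task were stored on the pid namespace, tearing the
namespace down could stop it. A minimal sketch, assuming a hypothetical
umh_task field on struct pid_namespace (no such field exists today):

/*
 * Hypothetical sketch: stop a per-namespace helper kthread when the pid
 * namespace goes away.  The umh_task field is assumed purely to illustrate
 * tying the helper's lifetime to the namespace.
 */
#include <linux/kthread.h>
#include <linux/pid_namespace.h>

static void pidns_stop_umh(struct pid_namespace *ns)
{
	if (ns->umh_task) {
		kthread_stop(ns->umh_task);	/* waits for the kthread to exit */
		ns->umh_task = NULL;
	}
}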

So, in your opinion:
 - a helper thread should be spawned on demand, and
 - its lifetime should be clear; it would be good for it to have the same
   lifetime as the container.

I wonder whether there is any solution to the "moving parts" problem other
than calling do_fork() or copy_process() with the container's init process
context, if we do everything in the kernel.
Is that possible?

Thanks,
-Kame






Re: call_usermodehelper in containers

2016-02-18 Thread Kamezawa Hiroyuki
On 2016/02/19 5:45, Eric W. Biederman wrote: 
> Personally I am a fan of the don't-be-clever, capture-a-kernel-thread
> approach, as it is very easy to see what, if any, exploitation
> opportunities there are.  The justification for something more clever
> is trickier.  Of course we do something that from this perspective would
> be considered ``clever'' today with kthreadd and user mode helpers.
> 

I read the old discussion. Let me try to clarify how to create a helper
kernel thread to run the usermode helper using kthreadd.

0) Define a trigger to create an independent usermodehelper environment for
a container.
   Option A) when some namespace (pid, uid, etc.) is created
   Option B) when a new nsproxy is created
   Option C) when a new system call or some sysctl is invoked,
   e.g. make_private_usermode_helper() or similar.

  It's expected this should be triggered by the init process of a container
with some capability.
  And the scope of the effect should be defined: pid namespace? nsproxy? or
a new namespace?

1) Create a helper thread.
   task = kthread_create(kthread_work_fn, ?, ?, "usermodehelper")
   switch the task's nsproxy to current's (switch_task_namespaces())
   switch the task's cgroups to current's (cgroup_attach_task_all())
   switch the task's cred to current's
   copy the task's capabilities from current
   (and anything else?)
   wake_up_process()

   And create a link between the kthread_wq and the container.

2) Modify call_usermodehelper() to use kthread_worker.
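
A rough sketch of step 1) above, as an illustration only: the mail's
kthread_work_fn is spelled kthread_worker_fn in the tree, and error handling,
cred/capability copying and locking are all omitted, so this is not working
kernel code.

/*
 * Rough sketch of step 1): spawn a helper kthread that inherits the
 * caller's namespaces and cgroups, so usermode helpers queued to its
 * kthread_worker run inside the container.
 */
#include <linux/kthread.h>
#include <linux/nsproxy.h>
#include <linux/cgroup.h>
#include <linux/sched.h>
#include <linux/err.h>

static struct task_struct *spawn_container_umh_kthread(struct kthread_worker *worker)
{
	struct task_struct *task;

	task = kthread_create(kthread_worker_fn, worker, "usermodehelper");
	if (IS_ERR(task))
		return task;

	get_nsproxy(current->nsproxy);			/* reference handed to the kthread */
	switch_task_namespaces(task, current->nsproxy);	/* container's namespaces */
	cgroup_attach_task_all(current, task);		/* container's cgroups */
	/* copying creds/capabilities from current would go here */

	wake_up_process(task);
	return task;
}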


It seems the problem is which object a container-private usermode helper
should be tied to.

Regards,
-Kame



Re: call_usermodehelper in containers

2016-02-17 Thread Kamezawa Hiroyuki
On 2016/02/18 11:57, Eric W. Biederman wrote:
> 
> Ccing The containers list because a related discussion is happening there
> and somehow this thread has never made it there.
> 
> Ian Kent  writes:
> 
>> On Mon, 2013-11-18 at 18:28 +0100, Oleg Nesterov wrote:
>>> On 11/15, Eric W. Biederman wrote:

 I don't understand that one.  Having a preforked thread with the proper
 environment that can act like kthreadd in terms of spawning user mode
 helpers works and is simple.
>>
>> Forgive me replying to such an old thread but ...
>>
>> After realizing workqueues can't be used to pre-create threads to run
>> usermode helpers I've returned to look at this.
> 
> If someone can wind up with a good implementation I will be happy.
> 
>>> Can't we ask ->child_reaper to create the non-daemonized kernel thread
>>> with the "right" ->nsproxy, ->fs, etc?
>>
>> Eric, do you think this approach would be sufficient too?
>>
>> Probably wouldn't be quite right for user namespaces but should provide
>> what's needed for other cases?
>>
>> It certainly has the advantage of not having to maintain a plague of
>> processes waiting around to execute helpers.
> 
> That certainly sounds attractive.  Especially for the case of everyone
> who wants to set a core pattern in a container.
> 
> I am fuzzy on all of the details right now, but what I do remember is
> that in the kernel the user mode helper concepts when they attempted to
> scrub a process's environment were quite error prone until we managed to
> get kthreadd(pid 2) on the scene which always had a clean environment.
> 
> If we are going to tie this kind of thing to the pid namespace I
> recommend simply denying it if you are in a user namespace without
> an appropriate pid namespace.  AKA simply not allowing things to be set up
> if current->pid_ns->user_ns != current->user_ns.
> 
Can't this be handled by a simple capability like CAP_SYS_USERMODEHELPER?

With the user_ns check, it seems a core-dump catcher on the host will not
work if a user namespace is used.
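
For what it's worth, a minimal sketch of the two checks being contrasted here
(CAP_SYS_USERMODEHELPER is only a name suggested in this thread, not an
existing capability; CAP_SYS_ADMIN stands in for it below):

/*
 * Sketch of the two gating options discussed above.
 */
#include <linux/pid_namespace.h>
#include <linux/capability.h>
#include <linux/cred.h>
#include <linux/sched.h>

static bool may_setup_private_umh(void)
{
	/* Eric's rule: deny if the pid namespace does not belong to the
	 * caller's user namespace. */
	if (task_active_pid_ns(current)->user_ns != current_user_ns())
		return false;

	/* capability-based gate raised in this reply */
	return ns_capable(current_user_ns(), CAP_SYS_ADMIN);
}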

Thanks,
-Kame




Re: [Propose] Isolate core_pattern in mnt namespace.

2015-12-21 Thread Kamezawa Hiroyuki
On 2015/12/22 6:52, Eric W. Biederman wrote:
> Dongsheng Yang  writes:
> 
>> On 12/20/2015 05:47 PM, Eric W. Biederman wrote:
>>> Dongsheng Yang  writes:
>>>
 On 12/20/2015 10:37 AM, Al Viro wrote:
> On Sun, Dec 20, 2015 at 10:14:29AM +0800, Dongsheng Yang wrote:
>> On 12/17/2015 07:23 PM, Dongsheng Yang wrote:
>>> Hi guys,
>>>We are working on making core dump behaviour isolated in
>>> container. But the problem is, the /proc/sys/kernel/core_pattern
>>> is a kernel wide setting, not belongs to a container.
>>>
>>>So we want to add core_pattern into mnt namespace. What
>>> do you think about it?
>>
>> Hi Eric,
>>  I found your patch about "net: Implement the per network namespace
>> sysctl infrastructure", I want to do the similar thing
>> in mnt namespace. Is that suggested way?
>
> Why mnt namespace and not something else?

 Hi Al,

 Well, because core_pattern indicates the path to store core file.
 In different mnt namespace, we would like to change the path with
 different value.

 In addition, let's consider other namespaces:
 UTS ns: contains information about the kernel and arch, not proper for
 core_pattern.
 IPC ns: communication information, not proper for core_pattern.
 PID ns: core_pattern is not related to pid.
 net ns: obviously no.
 user ns: not proper either.

 Then I believe it's better to do this in mnt namespace. of course,
 core_pattern is just one example. After this infrastructure finished,
 we can implement more sysctls as per-mnt if necessary, I think.

 Al, what do you think about this idea?
>>>
>>> The hard part is not the sysctl.  The hard part is starting the usermode
>>> helper, in an environment that it can deal with.  The mount namespace
>>> really provides you with no help there.
>>
>> Do you mean the core dump helper? But I think I don't want to touch it
>> in my development. I think I can use non-pipe way to get what I want,
>> Let me try to explain what I want here.
>>
>> (1). introduce a --core-path option in docker run command to specify the
>> path in host to store core file in one container.
>> E.g: docker run --core-path=/core/test --name=test IMAGE
>>
>> (2). When the container starting, docker attach a volume to it, similar
>> with "-v /core/test:/var/lib/docker/coredump". That means, the path of
>> /var/lib/docker/coredump in container is a link to /core/test in host.
>>
>> (3). Set the /proc/sys/kernel/core_pattern in container as
>> "/var/lib/docker/coredump". But that should not affect the core_pattern
>> in host or other containers.
>>
>> Then I think I can collect the core files from each container and save
>> them in the paths where I want.
> 
> For your case that sounds like it would work.  Unfortunately, for this to
> be generally applicable and to let the OS in the container control its
> fate, the core dump pattern needs to be supported.
> 
> Otherwise something clever in userspace that can be written now should
> be sufficient to fill the gap.  There is enough information for the user
> mode helper to implement the policy you would like today.
> 
Let me clarify my understanding.

1) Running the user-mode helper in a container:
   it's not supported by the kernel; the user-mode helper always runs on the host.

2) Running the user-mode helper on the host:
   it's supported in the newest distro (FC23); abrt supports containers.
   A summary is here: https://github.com/abrt/abrt/wiki/Containers-and-chroots

If a guest user doesn't want to pass a core to the host owner, core_pattern
should be configurable, but it can't be.

Thanks,
-Kame




Re: [Propose] Isolate core_pattern in mnt namespace.

2015-12-20 Thread Kamezawa Hiroyuki
On 2015/12/20 18:47, Eric W. Biederman wrote:
> Dongsheng Yang  writes:
> 
>> On 12/20/2015 10:37 AM, Al Viro wrote:
>>> On Sun, Dec 20, 2015 at 10:14:29AM +0800, Dongsheng Yang wrote:
 On 12/17/2015 07:23 PM, Dongsheng Yang wrote:
> Hi guys,
>   We are working on making core dump behaviour isolated in
> container. But the problem is, the /proc/sys/kernel/core_pattern
> is a kernel wide setting, not belongs to a container.
>
>   So we want to add core_pattern into mnt namespace. What
> do you think about it?

 Hi Eric,
I found your patch about "net: Implement the per network namespace
 sysctl infrastructure", I want to do the similar thing
 in mnt namespace. Is that suggested way?
>>>
>>> Why mnt namespace and not something else?
>>
>> Hi Al,
>>
>> Well, because core_pattern indicates the path to store core file.
>> In different mnt namespace, we would like to change the path with
>> different value.
>>
>> In addition, let's consider other namespaces:
>> UTS ns: contains information about the kernel and arch, not proper for
>> core_pattern.
>> IPC ns: communication information, not proper for core_pattern.
>> PID ns: core_pattern is not related to pid.
>> net ns: obviously no.
>> user ns: not proper either.
>>
>> Then I believe it's better to do this in mnt namespace. of course,
>> core_pattern is just one example. After this infrastructure finished,
>> we can implement more sysctls as per-mnt if necessary, I think.
>>
>> Al, what do you think about this idea?
> 
> The hard part is not the sysctl.  The hard part is starting the usermode
> helper, in an environment that it can deal with.  The mount namespace
> really provides you with no help there.
> 
Let me ask: I think the user-mode helper should be found in the container's
namespace. Or do you mean running the user-mode helper outside of a container?

I think if a user needs to send cores outside of a container, a file
descriptor or a network endpoint should be able to be passed to the sysctl.

Thanks,
-Kame



> Eric




Re: [PATCH v2 7/7] Documentation: cgroup: add memory.swap.{current,max} description

2015-12-17 Thread Kamezawa Hiroyuki
On 2015/12/17 21:30, Vladimir Davydov wrote:
> The rationale of separate swap counter is given by Johannes Weiner.
> 
> Signed-off-by: Vladimir Davydov 
> ---
> Changes in v2:
>   - Add rationale of separate swap counter provided by Johannes.
> 
>   Documentation/cgroup.txt | 33 +
>   1 file changed, 33 insertions(+)
> 
> diff --git a/Documentation/cgroup.txt b/Documentation/cgroup.txt
> index 31d1f7bf12a1..f441564023e1 100644
> --- a/Documentation/cgroup.txt
> +++ b/Documentation/cgroup.txt
> @@ -819,6 +819,22 @@ PAGE_SIZE multiple when read back.
>   the cgroup.  This may not exactly match the number of
>   processes killed but should generally be close.
>   
> +  memory.swap.current
> +
> + A read-only single value file which exists on non-root
> + cgroups.
> +
> + The total amount of swap currently being used by the cgroup
> + and its descendants.
> +
> +  memory.swap.max
> +
> + A read-write single value file which exists on non-root
> + cgroups.  The default is "max".
> +
> + Swap usage hard limit.  If a cgroup's swap usage reaches this
> + limit, anonymous memory of the cgroup will not be swapped out.
> +
>   
>   5-2-2. General Usage
>   
> @@ -1291,3 +1307,20 @@ allocation from the slack available in other groups or 
> the rest of the
>   system than killing the group.  Otherwise, memory.max is there to
>   limit this type of spillover and ultimately contain buggy or even
>   malicious applications.
> +
> +The combined memory+swap accounting and limiting is replaced by real
> +control over swap space.
> +
> +The main argument for a combined memory+swap facility in the original
> +cgroup design was that global or parental pressure would always be
> +able to swap all anonymous memory of a child group, regardless of the
> +child's own (possibly untrusted) configuration.  However, untrusted
> +groups can sabotage swapping by other means - such as referencing its
> +anonymous memory in a tight loop - and an admin can not assume full
> +swappability when overcommitting untrusted jobs.
> +
> +For trusted jobs, on the other hand, a combined counter is not an
> +intuitive userspace interface, and it flies in the face of the idea
> +that cgroup controllers should account and limit specific physical
> +resources.  Swap space is a resource like all others in the system,
> +and that's why unified hierarchy allows distributing it separately.
> 
Could you give a hint here on how to calculate the amount of swap cache,
which is counted in both memory.current and swap.current?

Thanks,
-Kame







Re: [PATCH v3 2/2] mm: Introduce kernelcore=mirror option

2015-12-17 Thread Kamezawa Hiroyuki

On 2015/12/18 3:43, Luck, Tony wrote:

As Tony requested, we may need a knob to stop a fallback in "movable->normal", 
later.



If the mirrored memory is small and the other is large,
I think we can both enable "non-mirrored -> normal" and "normal -> 
non-mirrored".


The size of mirrored memory can be configured by software (EFI variable).
So, having both is just overkill, and normal->non-mirrored fallback is
meaningless considering what the feature wants to guarantee.


In the original removable usage we wanted to guarantee that Linux did not 
allocate any
kernel objects in removable memory - because that would prevent later removal 
of that
memory.

Mirror case is the same - we don't want to allocate kernel structures in 
non-mirrored memory
because an uncorrectable error in one of them would crash the system.

But I think some users might like some flexibility here.  If the system doesn't 
have enough
memory for the kernel (non-movable or mirrored), then it seems odd to end up 
crashing
the system at the point of memory exhaustion (a likely result ... the kernel 
can try to reclaim
some pages from SLAB, but that might only return a few pages, if the shortage 
continues
the system will perform poorly and eventually fail).

The whole point of removable memory or mirrored memory is to provide better 
availability.

I'd vote for a mode where running out of memory for kernel results in a

warn_on_once("Ran out of mirrored/non-removable memory for kernel - now 
allocating from all zones\n")

because I think most people would like the system to stay up rather than worry 
about some future problem that may never happen.


Hmm... like this?
  sysctl.vm.fallback_mirror_memory = 0  // never fall back (default)
  sysctl.vm.fallback_mirror_memory = 1  // user memory may be allocated from the mirrored zone
  sysctl.vm.fallback_mirror_memory = 2  // the kernel normally allocates from the mirrored zone but may fall back before OOM
  sysctl.vm.fallback_mirror_memory = 3  // 1+2

However, I believe my customers' choice will always be 0. The above can be
implemented in a clean way (adding a mirrored/not-mirrored flag to zones and
controlling the fallback zonelist walk).
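
A hypothetical sketch of such a knob (the sysctl name follows this mail;
zone_is_mirrored() and the exact bit semantics are my assumptions, and
nothing like this exists in the kernel):

/*
 * Hypothetical only.  Assumes ZONE_NORMAL is mirrored and ZONE_MOVABLE is
 * not: bit 0 lets user (movable) allocations spill into mirrored memory,
 * bit 1 lets kernel allocations spill into non-mirrored memory instead of
 * going OOM.  zone_is_mirrored() is an assumed helper.
 */
#include <linux/gfp.h>
#include <linux/mmzone.h>

int sysctl_fallback_mirror_memory;	/* 0: never fall back (default) */

static bool mirror_fallback_allowed(gfp_t gfp_mask, struct zone *target)
{
	bool target_mirrored = zone_is_mirrored(target);

	if (gfp_mask & __GFP_MOVABLE)	/* user allocation */
		return !target_mirrored || (sysctl_fallback_mirror_memory & 1);

	/* kernel allocation */
	return target_mirrored || (sysctl_fallback_mirror_memory & 2);
}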

BTW, we need Taku's patch to make progress. I think the other developments
should be done in another development cycle. What does he need to get your
Acks?

Thanks,
-Kame





Re: [PATCH v3 2/2] mm: Introduce kernelcore=mirror option

2015-12-16 Thread Kamezawa Hiroyuki
On 2015/12/17 13:48, Xishi Qiu wrote:
> On 2015/12/17 10:53, Kamezawa Hiroyuki wrote:
> 
>> On 2015/12/17 11:47, Xishi Qiu wrote:
>>> On 2015/12/17 9:38, Izumi, Taku wrote:
>>>
>>>> Dear Xishi,
>>>>
>>>>Sorry for late.
>>>>
>>>>> -Original Message-
>>>>> From: Xishi Qiu [mailto:qiuxi...@huawei.com]
>>>>> Sent: Friday, December 11, 2015 6:44 PM
>>>>> To: Izumi, Taku/泉 拓
>>>>> Cc: Luck, Tony; linux-kernel@vger.kernel.org; linux...@kvack.org; 
>>>>> a...@linux-foundation.org; Kamezawa, Hiroyuki/亀澤 寛
>>>>> 之; m...@csn.ul.ie; Hansen, Dave; m...@codeblueprint.co.uk
>>>>> Subject: Re: [PATCH v3 2/2] mm: Introduce kernelcore=mirror option
>>>>>
>>>>> On 2015/12/11 13:53, Izumi, Taku wrote:
>>>>>
>>>>>> Dear Xishi,
>>>>>>
>>>>>>> Hi Taku,
>>>>>>>
>>>>>>> Whether it is possible that we rewrite the fallback function in buddy 
>>>>>>> system
>>>>>>> when zone_movable and mirrored_kernelcore are both enabled?
>>>>>>
>>>>>> What does "when zone_movable and mirrored_kernelcore are both 
>>>>>> enabled?" mean ?
>>>>>>
>>>>>> My patchset just provides a new way to create ZONE_MOVABLE.
>>>>>>
>>>>>
>>>>> Hi Taku,
>>>>>
>>>
>>> Hi Taku,
>>>
>>> We can NOT specify kernelcore= "nn[KMG]" and "mirror" at the same time.
>>> So when we use "mirror", in fact, the movable zone is a new zone. I think 
>>> it is
>>> more appropriate with this name "mirrored zone", and also we can rewrite the
>>> fallback function in buddy system in this case.
>>
>> kernelcore ="mirrored zone" ?
> 
> No, it's zone_names[MAX_NR_ZONES]
> How about "Movable", -> "Non-mirrored"?
> 
That will break many user apps. I think we don't have enough reason. 

>>
>> BTW, let me confirm.
>>
>>ZONE_NORMAL = mirrored
>>ZONE_MOVABLE = not mirrored.
>>
> 
> Yes,
> 
>> so, the new zone is "not-mirrored" zone.
>>
>> Now, fallback function is
>>
>> movable -> normal -> DMA.
>>
>> As Tony requested, we may need a knob to stop a fallback in 
>> "movable->normal", later.
>>
> 
> If the mirrored memory is small and the other is large,
> I think we can both enable "non-mirrored -> normal" and "normal -> 
> non-mirrored".

The size of mirrored memory can be configured by software (EFI variable).
So, having both is just overkill, and normal->non-mirrored fallback is
meaningless considering what the feature wants to guarantee.

Thanks,
-Kame






Re: [PATCH 1/7] mm: memcontrol: charge swap to cgroup2

2015-12-16 Thread Kamezawa Hiroyuki

On 2015/12/17 12:32, Johannes Weiner wrote:

On Thu, Dec 17, 2015 at 11:46:27AM +0900, Kamezawa Hiroyuki wrote:

On 2015/12/16 20:09, Johannes Weiner wrote:

On Wed, Dec 16, 2015 at 12:18:30PM +0900, Kamezawa Hiroyuki wrote:

  - swap-full notification via vmpressure or something mechanism.


Why?



I think it's a sign of an unhealthy condition, where the file cache drop rate
starts to rise.
But I forgot that there is already a resource threshold notifier. Does the
notifier work for swap.usage?


That will be reflected in vmpressure or other distress mechanisms. I'm
not convinced "ran out of swap space" needs special casing in any way.


Most users check for swap space shortage as a "system alarm" in enterprise
systems. At least, our customers check for swap-full.


  - force swap-in at reducing swap.limit


Why?


If swap is full, swap.limit cannot be reduced even if there is available
memory in the cgroup.
Another cgroup cannot make use of the swap resource while it's occupied by
the other cgroup.
The job scheduler should have a chance to fix the situation.


I don't see why swap space allowance would need to be as dynamically
adjustable as the memory allowance. There is usually no need to be as
tight with swap space as with memory, and the performance penalty of
swapping, even with flash drives, is high enough that swap space acts
as an overflow vessel rather than be part of the regularly backing of
the anonymous/shmem working set. It really is NOT obvious that swap
space would need to be adjusted on the fly, and that it's important
that reducing the limit will be reflected in consumption right away.



From my OS support experience, some customers consider swap space a resource.



We shouldn't be adding hundreds of lines of likely terrible heuristics
code* on speculation that somebody MIGHT find this useful in real life.
We should wait until we are presented with a real usecase that applies
to a whole class of users, and then see what the true requirements are.


OK, we should wait.  I'm just guessing that (Japanese) HPC people will want
the feature for their job control. I hear many programs rely on swap.


* If a group has 200M swapped out and the swap limit is reduced by 10M
below the current consumption, which pages would you swap in? There is
no LRU list for swap space.


If a rotation can happen on a swap-in caused by a real page fault, random
swap-in when reducing swap.limit will work well enough.

Thanks,
-Kame





Re: [PATCH v3 2/2] mm: Introduce kernelcore=mirror option

2015-12-16 Thread Kamezawa Hiroyuki
On 2015/12/17 11:47, Xishi Qiu wrote:
> On 2015/12/17 9:38, Izumi, Taku wrote:
> 
>> Dear Xishi,
>>
>>   Sorry for late.
>>
>>> -Original Message-
>>> From: Xishi Qiu [mailto:qiuxi...@huawei.com]
>>> Sent: Friday, December 11, 2015 6:44 PM
>>> To: Izumi, Taku/泉 拓
>>> Cc: Luck, Tony; linux-kernel@vger.kernel.org; linux...@kvack.org; 
>>> a...@linux-foundation.org; Kamezawa, Hiroyuki/亀澤 寛
>>> 之; m...@csn.ul.ie; Hansen, Dave; m...@codeblueprint.co.uk
>>> Subject: Re: [PATCH v3 2/2] mm: Introduce kernelcore=mirror option
>>>
>>> On 2015/12/11 13:53, Izumi, Taku wrote:
>>>
>>>> Dear Xishi,
>>>>
>>>>> Hi Taku,
>>>>>
>>>>> Whether it is possible that we rewrite the fallback function in buddy 
>>>>> system
>>>>> when zone_movable and mirrored_kernelcore are both enabled?
>>>>
>>>>What does "when zone_movable and mirrored_kernelcore are both enabled?" 
>>>> mean ?
>>>>
>>>>My patchset just provides a new way to create ZONE_MOVABLE.
>>>>
>>>
>>> Hi Taku,
>>>
> 
> Hi Taku,
> 
> We can NOT specify kernelcore= "nn[KMG]" and "mirror" at the same time.
> So when we use "mirror", in fact, the movable zone is a new zone. I think it 
> is
> more appropriate with this name "mirrored zone", and also we can rewrite the
> fallback function in buddy system in this case.

kernelcore ="mirrored zone" ?

BTW, let me confirm.

  ZONE_NORMAL = mirrored
  ZONE_MOVABLE = not mirrored.

so, the new zone is "not-mirrored" zone.

Now, fallback function is

   movable -> normal -> DMA.

As Tony requested, we may need a knob to stop a fallback in "movable->normal", 
later.

Thanks,
-Kame








Re: [PATCH 1/7] mm: memcontrol: charge swap to cgroup2

2015-12-16 Thread Kamezawa Hiroyuki

On 2015/12/16 20:09, Johannes Weiner wrote:

On Wed, Dec 16, 2015 at 12:18:30PM +0900, Kamezawa Hiroyuki wrote:

Hmm, my requests are
  - set the same capabilities as mlock() to set swap.limit=0


Setting swap.max is already privileged operation.


Sure.


  - swap-full notification via vmpressure or something mechanism.


Why?



I think it's a sign of an unhealthy condition, where the file cache drop rate
starts to rise.
But I forgot that there is already a resource threshold notifier. Does the
notifier work for swap.usage?


  - OOM-Killer's available memory calculation may be corrupted, please check.


Vladimir updated mem_cgroup_get_limit().


I'll check it.


  - force swap-in at reducing swap.limit


Why?


If swap is full, swap.limit cannot be reduced even if there is available
memory in the cgroup.
Another cgroup cannot make use of the swap resource while it's occupied by
the other cgroup.
The job scheduler should have a chance to fix the situation.

Thanks,
-Kame



Re: [PATCH 1/7] mm: memcontrol: charge swap to cgroup2

2015-12-15 Thread Kamezawa Hiroyuki

On 2015/12/16 2:21, Michal Hocko wrote:
 

I completely agree that malicious/untrusted users absolutely have to
be capped by the hard limit. Then the separate swap limit would work
for sure. But I am less convinced about usefulness of the rigid (to
the global memory pressure) swap limit without the hard limit. All the
memory that could have been swapped out will make a memory pressure to
the rest of the system without being punished for it too much. Memcg
is allowed to grow over the high limit (in the current implementation)
without any way to shrink back in other words.

My understanding was that the primary use case for the swap limit is to
handle potential (not only malicious but also unexpectedly misbehaving
application) anon memory consumption runaways more gracefully without
the massive disruption on the global level. I simply didn't see swap
space partitioning as important enough because an alternative to swap
usage is to consume primary memory which is a more precious resource
IMO. Swap storage is really cheap and runtime expandable resource which
is not the case for the primary memory in general. Maybe there are other
use cases I am not aware of, though. Do you want to guarantee the swap
availability?



At the time of the first implementation, an NEC engineer explained their use
case in the HPC area. At that time, there was no swap support.

Consider 2 workloads partitioned into groups A and B; total swap is 100GB.
  A: memory.limit = 40G
  B: memory.limit = 40G

The job scheduler runs applications in A and B in turn. Apps in A stop while
apps in B are running.

If App-A requires 120GB of anonymous memory, it uses 80GB of swap. So, App-B
can use only 20GB of swap. This can cause trouble if App-B needs 100GB of
anonymous memory. They need some knob to control the amount of swap per cgroup.

The point is that, at least for their customer, swap is a "resource" which
should be under control. With their use case, memory usage and swap usage have
the same meaning, so a mem+swap limit doesn't cause trouble.
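
To make the scenario concrete, a minimal sketch (not part of any patch here; the
cgroup paths and the 40G/50G values are only illustrative) of how a job scheduler
could partition memory and swap between the two groups with the proposed
memory.swap.max knob:

#include <stdio.h>

/* write one value into a cgroup control file */
static int set_limit(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	fputs(val, f);
	return fclose(f);
}

int main(void)
{
	/* hard memory limits, as in the scenario above */
	set_limit("/sys/fs/cgroup/A/memory.max", "40G");
	set_limit("/sys/fs/cgroup/B/memory.max", "40G");

	/* explicit swap partitioning, so neither job can starve the other */
	set_limit("/sys/fs/cgroup/A/memory.swap.max", "50G");
	set_limit("/sys/fs/cgroup/B/memory.swap.max", "50G");
	return 0;
}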

Thanks,
-Kame






Re: [PATCH 1/7] mm: memcontrol: charge swap to cgroup2

2015-12-15 Thread Kamezawa Hiroyuki

On 2015/12/15 23:50, Johannes Weiner wrote:

On Tue, Dec 15, 2015 at 12:22:41PM +0900, Kamezawa Hiroyuki wrote:

On 2015/12/15 4:42, Vladimir Davydov wrote:

Anyway, if you don't trust a container you'd better set the hard memory
limit so that it can't hurt others no matter what it runs and how it
tweaks its sub-tree knobs.


Limiting swap can easily cause "OOM-Killer even while there are available swap"
with easy mistake. Can't you add "swap excess" switch to sysctl to allow global
memory reclaim can ignore swap limitation ?


That never worked with a combined memory+swap limit, either. How could
it? The parent might swap you out under pressure, but simply touching
a few of your anon pages causes them to get swapped back in, thrashing
with whatever the parent was trying to do. Your ability to swap it out
is simply no protection against a group touching its pages.

Allowing the parent to exceed swap with separate counters makes even
less sense, because every page swapped out frees up a page of memory
that the child can reuse. For every swap page that exceeds the limit,
the child gets a free memory page! The child doesn't even have to
cause swapin, it can just steal whatever the parent tried to free up,
and meanwhile its combined memory & swap footprint explodes.


Sure.


The answer is and always should have been: don't overcommit untrusted
cgroups. Think of swap as a resource you distribute, not as breathing
room for the parents to rely on. Because it can't and could never.


OK, don't overcommit.


And the new separate swap counter makes this explicit.


Hmm, my requests are
 - set the same capability requirement as mlock() to set swap.limit=0
 - swap-full notification via vmpressure or some other mechanism
 - the OOM killer's available-memory calculation may be corrupted, please check
 - force swap-in when reducing swap.limit

Thanks,
-Kame




Re: [PATCH 1/7] mm: memcontrol: charge swap to cgroup2

2015-12-15 Thread Kamezawa Hiroyuki

On 2015/12/15 20:02, Vladimir Davydov wrote:

On Tue, Dec 15, 2015 at 12:22:41PM +0900, Kamezawa Hiroyuki wrote:

On 2015/12/15 4:42, Vladimir Davydov wrote:

On Mon, Dec 14, 2015 at 04:30:37PM +0100, Michal Hocko wrote:

On Thu 10-12-15 14:39:14, Vladimir Davydov wrote:

In the legacy hierarchy we charge memsw, which is dubious, because:

  - memsw.limit must be >= memory.limit, so it is impossible to limit
swap usage less than memory usage. Taking into account the fact that
the primary limiting mechanism in the unified hierarchy is
memory.high while memory.limit is either left unset or set to a very
large value, moving memsw.limit knob to the unified hierarchy would
effectively make it impossible to limit swap usage according to the
user preference.

  - memsw.usage != memory.usage + swap.usage, because a page occupying
both swap entry and a swap cache page is charged only once to memsw
counter. As a result, it is possible to effectively eat up to
memory.limit of memory pages *and* memsw.limit of swap entries, which
looks unexpected.

That said, we should provide a different swap limiting mechanism for
cgroup2.
This patch adds mem_cgroup->swap counter, which charges the actual
number of swap entries used by a cgroup. It is only charged in the
unified hierarchy, while the legacy hierarchy memsw logic is left
intact.


I agree that the previous semantic was awkward. The problem I can see
with this approach is that once the swap limit is reached the anon
memory pressure might spill over to other and unrelated memcgs during
the global memory pressure. I guess this is what Kame referred to as
anon would become mlocked basically. This would be even more of an issue
with resource delegation to sub-hierarchies because nobody will prevent
setting the swap amount to a small value and use that as an anon memory
protection.


AFAICS such anon memory protection has a side-effect: real-life
workloads need page cache to run smoothly (at least for mapping
executables). Disabling swapping would switch pressure to page caches,
resulting in performance degradation. So, I don't think per memcg swap
limit can be abused to boost your workload on an overcommitted system.

If you mean malicious users, well, they already have plenty ways to eat
all available memory up to the hard limit by creating unreclaimable
kernel objects.


"protect anon" user's malicious degree is far lower than such cracker like 
users.


What do you mean by "malicious degree"? What is such a user trying to
achieve? Killing the system? Well, there are much more effective ways to
do so. Or does it want to exploit a system specific feature to get
benefit for itself? If so, it will hardly win by mlocking all anonymous
memory, because this will result in higher pressure exerted upon its
page cache and dcache, which normal workloads just can't get along
without.



I wanted to say that almost all application developers will want to set
swap.limit=0 if allowed.
So, it's ordinary users who can kill the system if a swap imbalance is allowed.
 



Anyway, if you don't trust a container you'd better set the hard memory
limit so that it can't hurt others no matter what it runs and how it
tweaks its sub-tree knobs.



Limiting swap can easily cause "OOM-Killer even while there are
available swap" with easy mistake.


What do you mean by "easy mistake"? Misconfiguration? If so, it's a lame
excuse IMO. Admin should take system configuration seriously. If the
host is not overcommitted, it's trivial. Otherwise, there's always a
chance that things will go south, so it's not going to be easy. It's up
to admin to analyze risks and set limits accordingly. Exporting knobs
with clear meaning is the best we can do here. swap.max is one such knob
It defines maximal usage of swap resource. Allowing to breach it just
does not add up.


Can't you add "swap excess" switch to sysctl to allow global memory
reclaim can ignore swap limitation ?


I'd be opposed to it, because this would obscure the user API. OTOH, a
kind of swap soft limit (swap.high?) might be considered. I'm not sure
if it's really necessary though, because all arguments for it do not
look convincing to me for now. So, personally, I would refrain from
implementing it until it is really called for by users of cgroup v2.



Considering my customers, running the OOM killer while there is free swap space
is a system error rather than their misconfiguration.

BTW, mlock() requires CAP_IPC_LOCK.
Please make the default unlimited and check that capability when setting the
swap limit, at least.
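
A hedged sketch only, assuming the memcontrol.c context of this patch set (the
handler name, the missing locking, and the exact policy are illustrative, not the
actual patch): the requested capability check could sit in the memory.swap.max
write handler roughly like this:

/*
 * Illustrative fragment: reject a restrictive memory.swap.max unless the
 * writer has CAP_IPC_LOCK, mirroring the mlock() requirement.
 */
static ssize_t swap_max_write(struct kernfs_open_file *of,
			      char *buf, size_t nbytes, loff_t off)
{
	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
	unsigned long max;
	int err;

	buf = strstrip(buf);
	err = page_counter_memparse(buf, "max", &max);
	if (err)
		return err;

	/* restricting swap pins anon memory much like mlock() does */
	if (max != PAGE_COUNTER_MAX && !capable(CAP_IPC_LOCK))
		return -EPERM;

	err = page_counter_limit(&memcg->swap, max);
	if (err)
		return err;

	return nbytes;
}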

Thanks,
-Kame



Thanks,
Vladimir

Re: [PATCH 1/7] mm: memcontrol: charge swap to cgroup2

2015-12-15 Thread Kamezawa Hiroyuki

On 2015/12/15 17:30, Vladimir Davydov wrote:

On Tue, Dec 15, 2015 at 12:12:40PM +0900, Kamezawa Hiroyuki wrote:

On 2015/12/15 0:30, Michal Hocko wrote:

On Thu 10-12-15 14:39:14, Vladimir Davydov wrote:

In the legacy hierarchy we charge memsw, which is dubious, because:

  - memsw.limit must be >= memory.limit, so it is impossible to limit
swap usage less than memory usage. Taking into account the fact that
the primary limiting mechanism in the unified hierarchy is
memory.high while memory.limit is either left unset or set to a very
large value, moving memsw.limit knob to the unified hierarchy would
effectively make it impossible to limit swap usage according to the
user preference.

  - memsw.usage != memory.usage + swap.usage, because a page occupying
both swap entry and a swap cache page is charged only once to memsw
counter. As a result, it is possible to effectively eat up to
memory.limit of memory pages *and* memsw.limit of swap entries, which
looks unexpected.

That said, we should provide a different swap limiting mechanism for
cgroup2.
This patch adds mem_cgroup->swap counter, which charges the actual
number of swap entries used by a cgroup. It is only charged in the
unified hierarchy, while the legacy hierarchy memsw logic is left
intact.


I agree that the previous semantic was awkward. The problem I can see
with this approach is that once the swap limit is reached the anon
memory pressure might spill over to other and unrelated memcgs during
the global memory pressure. I guess this is what Kame referred to as
anon would become mlocked basically. This would be even more of an issue
with resource delegation to sub-hierarchies because nobody will prevent
setting the swap amount to a small value and use that as an anon memory
protection.

I guess this was the reason why this approach hasn't been chosen before


Yes. At that age, "never break global VM" was the policy. And "mlock" can be
used for attacking system.


If we are talking about "attacking system" from inside a container,
there are much easier and disruptive ways, e.g. running a fork-bomb or
creating pipes - such memory can't be reclaimed and global OOM killer
won't help.


You're right. We just wanted to avoid affecting global memory reclaim through
per-cgroup settings.

Thanks,
-Kame






Re: [PATCH 1/7] mm: memcontrol: charge swap to cgroup2

2015-12-14 Thread Kamezawa Hiroyuki

On 2015/12/15 0:30, Michal Hocko wrote:

On Thu 10-12-15 14:39:14, Vladimir Davydov wrote:

In the legacy hierarchy we charge memsw, which is dubious, because:

  - memsw.limit must be >= memory.limit, so it is impossible to limit
swap usage less than memory usage. Taking into account the fact that
the primary limiting mechanism in the unified hierarchy is
memory.high while memory.limit is either left unset or set to a very
large value, moving memsw.limit knob to the unified hierarchy would
effectively make it impossible to limit swap usage according to the
user preference.

  - memsw.usage != memory.usage + swap.usage, because a page occupying
both swap entry and a swap cache page is charged only once to memsw
counter. As a result, it is possible to effectively eat up to
memory.limit of memory pages *and* memsw.limit of swap entries, which
looks unexpected.

That said, we should provide a different swap limiting mechanism for
cgroup2.
This patch adds mem_cgroup->swap counter, which charges the actual
number of swap entries used by a cgroup. It is only charged in the
unified hierarchy, while the legacy hierarchy memsw logic is left
intact.


I agree that the previous semantic was awkward. The problem I can see
with this approach is that once the swap limit is reached the anon
memory pressure might spill over to other and unrelated memcgs during
the global memory pressure. I guess this is what Kame referred to as
anon would become mlocked basically. This would be even more of an issue
with resource delegation to sub-hierarchies because nobody will prevent
setting the swap amount to a small value and use that as an anon memory
protection.

I guess this was the reason why this approach hasn't been chosen before


Yes. At that time, "never break the global VM" was the policy, and "mlock"
could be used for attacking the system.


but I think we can come up with a way to stop the run away consumption
even when the swap is accounted separately. All of them are quite nasty
but let me try.

We could allow charges to fail even for the high limit if the excess is
way above the amount of reclaimable memory in the given memcg/hierarchy.
A runaway load would be stopped before it can cause a considerable
damage outside of its hierarchy this way even when the swap limit
is configured small.
Now that goes against the high limit semantic which should only throttle
the consumer and shouldn't cause any functional failures but maybe this
is acceptable for the overall system stability. An alternative would
be to throttle in the high limit reclaim context proportionally to
the excess. This is normally done by the reclaim itself but with no
reclaimable memory this wouldn't work that way.


This seems hard to use for users who want to control resources precisely,
even if stability is good.


Another option would be to ignore the swap limit during the global
reclaim. This wouldn't stop the runaway loads but they would at least
see their fair share of the reclaim. The swap excess could be then used
as a "handicap" for a more aggressive throttling during high limit reclaim
or to trigger hard limit sooner.


This seems to work, but users need to understand that the swap limit can be exceeded.


Or we could teach the global OOM killer to select abusive anon memory
users with restricted swap. That would require to iterate through all
memcgs and checks whether their anon consumption is in a large excess to
their swap limit and fallback to the memcg OOM victim selection if that
is the case. This adds more complexity to the OOM killer path so I am
not sure this is generally acceptable, though.



I think this is not acceptable.


My question now is. Is the knob usable/useful even without additional
heuristics? Do we want to protect swap space so rigidly that a swap
limited memcg can cause bigger problems than without the swap limit
globally?



Swap requires some limit. If not, an application can eat up all swap,
and it will never be freed until the application accesses it or
swapoff runs.
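
For illustration (a toy user-space example, not from the patch; the 4GB size is
arbitrary): once pages like these have been pushed out, the swap slots stay
consumed until the process touches them again or swapoff runs:

#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	size_t sz = 4UL << 30;		/* 4GB of anonymous memory */
	char *p = malloc(sz);

	if (!p)
		return 1;
	memset(p, 0xaa, sz);		/* fault every page in */

	/*
	 * Under memory pressure this anon memory is swapped out; while the
	 * process sleeps without touching it, the swap space stays occupied.
	 */
	pause();
	return 0;
}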

Thanks,
-Kame


The swap usage can be monitored using new memory.swap.current file and
limited using memory.swap.max.

Signed-off-by: Vladimir Davydov 
---
  include/linux/memcontrol.h |   1 +
  include/linux/swap.h   |   5 ++
  mm/memcontrol.c| 123 +
  mm/shmem.c |   4 ++
  mm/swap_state.c|   5 ++
  5 files changed, 129 insertions(+), 9 deletions(-)


[...]






Re: [PATCH 1/7] mm: memcontrol: charge swap to cgroup2

2015-12-14 Thread Kamezawa Hiroyuki

On 2015/12/15 4:42, Vladimir Davydov wrote:

On Mon, Dec 14, 2015 at 04:30:37PM +0100, Michal Hocko wrote:

On Thu 10-12-15 14:39:14, Vladimir Davydov wrote:

In the legacy hierarchy we charge memsw, which is dubious, because:

  - memsw.limit must be >= memory.limit, so it is impossible to limit
swap usage less than memory usage. Taking into account the fact that
the primary limiting mechanism in the unified hierarchy is
memory.high while memory.limit is either left unset or set to a very
large value, moving memsw.limit knob to the unified hierarchy would
effectively make it impossible to limit swap usage according to the
user preference.

  - memsw.usage != memory.usage + swap.usage, because a page occupying
both swap entry and a swap cache page is charged only once to memsw
counter. As a result, it is possible to effectively eat up to
memory.limit of memory pages *and* memsw.limit of swap entries, which
looks unexpected.

That said, we should provide a different swap limiting mechanism for
cgroup2.
This patch adds mem_cgroup->swap counter, which charges the actual
number of swap entries used by a cgroup. It is only charged in the
unified hierarchy, while the legacy hierarchy memsw logic is left
intact.


I agree that the previous semantic was awkward. The problem I can see
with this approach is that once the swap limit is reached the anon
memory pressure might spill over to other and unrelated memcgs during
the global memory pressure. I guess this is what Kame referred to as
anon would become mlocked basically. This would be even more of an issue
with resource delegation to sub-hierarchies because nobody will prevent
setting the swap amount to a small value and use that as an anon memory
protection.


AFAICS such anon memory protection has a side-effect: real-life
workloads need page cache to run smoothly (at least for mapping
executables). Disabling swapping would switch pressure to page caches,
resulting in performance degradation. So, I don't think per memcg swap
limit can be abused to boost your workload on an overcommitted system.

If you mean malicious users, well, they already have plenty ways to eat
all available memory up to the hard limit by creating unreclaimable
kernel objects.


"protect anon" user's malicious degree is far lower than such cracker like 
users.


Anyway, if you don't trust a container you'd better set the hard memory
limit so that it can't hurt others no matter what it runs and how it
tweaks its sub-tree knobs.



Limiting swap can easily cause "OOM killer even while there is available swap"
through a simple mistake. Can't you add a "swap excess" switch to sysctl to
allow global memory reclaim to ignore the swap limitation ?
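
Purely as a sketch of the knob being asked for here (the sysctl and the helper
below are hypothetical; nothing like them exists in the kernel):

/* hypothetical: 0 = enforce swap.max strictly, 1 = global reclaim may exceed it */
int sysctl_memcg_swap_excess __read_mostly;

static bool memcg_swap_limited(struct mem_cgroup *memcg, bool global_reclaim)
{
	/* under global memory pressure, optionally ignore the per-memcg limit */
	if (global_reclaim && sysctl_memcg_swap_excess)
		return false;

	return page_counter_read(&memcg->swap) >=
	       READ_ONCE(memcg->swap.limit);
}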

Regards,
-Kame











Re: [PATCH 1/7] mm: memcontrol: charge swap to cgroup2

2015-12-10 Thread Kamezawa Hiroyuki
On 2015/12/10 20:39, Vladimir Davydov wrote:
> In the legacy hierarchy we charge memsw, which is dubious, because:
> 
>   - memsw.limit must be >= memory.limit, so it is impossible to limit
> swap usage less than memory usage. Taking into account the fact that
> the primary limiting mechanism in the unified hierarchy is
> memory.high while memory.limit is either left unset or set to a very
> large value, moving memsw.limit knob to the unified hierarchy would
> effectively make it impossible to limit swap usage according to the
> user preference.
> 
>   - memsw.usage != memory.usage + swap.usage, because a page occupying
> both swap entry and a swap cache page is charged only once to memsw
> counter. As a result, it is possible to effectively eat up to
> memory.limit of memory pages *and* memsw.limit of swap entries, which
> looks unexpected.
> 
> That said, we should provide a different swap limiting mechanism for
> cgroup2.
> 
> This patch adds mem_cgroup->swap counter, which charges the actual
> number of swap entries used by a cgroup. It is only charged in the
> unified hierarchy, while the legacy hierarchy memsw logic is left
> intact.
> 
> The swap usage can be monitored using new memory.swap.current file and
> limited using memory.swap.max.
> 
> Signed-off-by: Vladimir Davydov 

setting swap.max=0 will work like mlock ?

Thanks,
-Kame


> ---
>   include/linux/memcontrol.h |   1 +
>   include/linux/swap.h   |   5 ++
>   mm/memcontrol.c| 123 
> +
>   mm/shmem.c |   4 ++
>   mm/swap_state.c|   5 ++
>   5 files changed, 129 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index c6a5ed2f2744..993c9a26b637 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -169,6 +169,7 @@ struct mem_cgroup {
>   
>   /* Accounted resources */
>   struct page_counter memory;
> + struct page_counter swap;
>   struct page_counter memsw;
>   struct page_counter kmem;
>   
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 457181844b6e..f4b3ccdcba91 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -368,11 +368,16 @@ static inline int mem_cgroup_swappiness(struct 
> mem_cgroup *mem)
>   #endif
>   #ifdef CONFIG_MEMCG_SWAP
>   extern void mem_cgroup_swapout(struct page *page, swp_entry_t entry);
> +extern int mem_cgroup_charge_swap(struct page *page, swp_entry_t entry);
>   extern void mem_cgroup_uncharge_swap(swp_entry_t entry);
>   #else
>   static inline void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
>   {
>   }
> +static inline int mem_cgroup_charge_swap(struct page *page, swp_entry_t 
> entry)
> +{
> + return 0;
> +}
>   static inline void mem_cgroup_uncharge_swap(swp_entry_t entry)
>   {
>   }
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 7f5c6abf5421..9d10e2819ec4 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1212,7 +1212,7 @@ void mem_cgroup_print_oom_info(struct mem_cgroup 
> *memcg, struct task_struct *p)
>   pr_cont(":");
>   
>   for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
> - if (i == MEM_CGROUP_STAT_SWAP && !do_memsw_account())
> + if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account)
>   continue;
>   pr_cont(" %s:%luKB", mem_cgroup_stat_names[i],
>   K(mem_cgroup_read_stat(iter, i)));
> @@ -1248,12 +1248,15 @@ static unsigned long mem_cgroup_get_limit(struct 
> mem_cgroup *memcg)
>   {
>   unsigned long limit;
>   
> - limit = memcg->memory.limit;
> + limit = READ_ONCE(memcg->memory.limit);
>   if (mem_cgroup_swappiness(memcg)) {
>   unsigned long memsw_limit;
> + unsigned long swap_limit;
>   
> - memsw_limit = memcg->memsw.limit;
> - limit = min(limit + total_swap_pages, memsw_limit);
> + memsw_limit = READ_ONCE(memcg->memsw.limit);
> + swap_limit = min(READ_ONCE(memcg->swap.limit),
> +  (unsigned long)total_swap_pages);
> + limit = min(limit + swap_limit, memsw_limit);
>   }
>   return limit;
>   }
> @@ -4226,6 +4229,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
>   page_counter_init(&memcg->memory, NULL);
>   memcg->high = PAGE_COUNTER_MAX;
>   memcg->soft_limit = PAGE_COUNTER_MAX;
> + page_counter_init(&memcg->swap, NULL);
>   page_counter_init(&memcg->memsw, NULL);
>   page_counter_init(&memcg->kmem, NULL);
>   }
> @@ -4276,6 +4280,7 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
>   page_counter_init(&memcg->memory, &parent->memory);
>   memcg->high = PAGE_COUNTER_MAX;
>   memcg->soft_limit = PAGE_COUNTER_MAX;
> +


Re: [PATCH] mm: Introduce kernelcore=reliable option

2015-11-03 Thread Kamezawa Hiroyuki

On 2015/10/31 4:42, Luck, Tony wrote:

If each memory controller has the same distance/latency, you (your firmware)
don't need to allocate reliable memory per memory controller.
If distance is a problem, another node should be allocated.

...is this behavior (splitting the zone) really required ?


It's useful from a memory bandwidth perspective to have allocations
spread across both memory controllers. Keeping a whole bunch of
Xeon cores fed needs all the bandwidth you can get.



Hmm. But the physical address layout is not tied to the dual memory controllers.
I think the reliable range can be made contiguous by the firmware...

-Kame






Re: [RFC 1/3] mm, oom: refactor oom detection

2015-10-30 Thread Kamezawa Hiroyuki

On 2015/10/30 17:23, Michal Hocko wrote:

On Fri 30-10-15 14:23:59, KAMEZAWA Hiroyuki wrote:

On 2015/10/30 0:17, mho...@kernel.org wrote:

[...]

@@ -3135,13 +3145,56 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int 
order,
if (gfp_mask & __GFP_NORETRY)
goto noretry;

-   /* Keep reclaiming pages as long as there is reasonable progress */
+   /*
+* Do not retry high order allocations unless they are __GFP_REPEAT
+* and even then do not retry endlessly.
+*/
pages_reclaimed += did_some_progress;
-   if ((did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER) ||
-   ((gfp_mask & __GFP_REPEAT) && pages_reclaimed < (1 << order))) {
-   /* Wait for some write requests to complete then retry */
-   wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC, HZ/50);
-   goto retry;
+   if (order > PAGE_ALLOC_COSTLY_ORDER) {
+   if (!(gfp_mask & __GFP_REPEAT) || pages_reclaimed >= (1<<order))
+   goto noretry;
+
+   if (did_some_progress)
+   goto retry;

why directly retry here ?


Because I wanted to preserve the previous logic for GFP_REPEAT as much
as possible here and do an incremental change in the later patch.



I see.


[...]


@@ -3150,8 +3203,10 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int 
order,
goto got_pg;

/* Retry as long as the OOM killer is making progress */
-   if (did_some_progress)
+   if (did_some_progress) {
+   stall_backoff = 0;
goto retry;
+   }


Umm ? I'm sorry that I didn't notice page allocation may fail even
if order < PAGE_ALLOC_COSTLY_ORDER.  I thought old logic ignores
did_some_progress. It seems a big change.


__alloc_pages_may_oom will set did_some_progress


So, now, 0-order page allocation may fail in a OOM situation ?


No they don't normally and this patch doesn't change the logic here.



I understand your patch doesn't change the behavior.
Looking into __alloc_pages_may_oom(), *did_some_progress is finally set by

 if (out_of_memory() || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL))
*did_some_progress = 1;

...depends on out_of_memory() return value.
Now, allocation may fail if the OOM killer is disabled. Isn't it complicated ?

Shouldn't we have

 if (order < PAGE_ALLOC_COSTLY_ORDER)
goto retry;

here ?
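
In context, the suggested check would sit right after the OOM retry, roughly
like this (a sketch on top of the quoted RFC hunk, not a tested patch):

	/* Retry as long as the OOM killer is making progress */
	if (did_some_progress) {
		stall_backoff = 0;
		goto retry;
	}

	/*
	 * Suggested addition: when out_of_memory() made no progress (e.g. the
	 * OOM killer is disabled), keep retrying low-order allocations
	 * instead of letting them fail.
	 */
	if (order < PAGE_ALLOC_COSTLY_ORDER)
		goto retry;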

Thanks,
-Kame








Re: [PATCH] mm: Introduce kernelcore=reliable option

2015-10-30 Thread Kamezawa Hiroyuki
On 2015/10/23 10:44, Luck, Tony wrote:
> First part of each memory controller. I have two memory controllers on each 
> node
> 

If each memory controller has the same distance/latency, you (your firmware)
don't need to allocate reliable memory per memory controller.
If distance is a problem, another node should be allocated.

...is this behavior (splitting the zone) really required ?

Thanks,
-Kame




Re: [RFC 2/3] mm: throttle on IO only when there are too many dirty and writeback pages

2015-10-29 Thread Kamezawa Hiroyuki
On 2015/10/30 0:17, mho...@kernel.org wrote:
> From: Michal Hocko 
> 
> wait_iff_congested has been used to throttle allocator before it retried
> another round of direct reclaim to allow the writeback to make some
> progress and prevent reclaim from looping over dirty/writeback pages
> without making any progress. We used to do congestion_wait before
> 0e093d99763e ("writeback: do not sleep on the congestion queue if
> there are no congested BDIs or if significant congestion is not being
> encountered in the current zone") but that led to undesirable stalls
> and sleeping for the full timeout even when the BDI wasn't congested.
> Hence wait_iff_congested was used instead. But it seems that even
> wait_iff_congested doesn't work as expected. We might have a small file
> LRU list with all pages dirty/writeback and yet the bdi is not congested
> so this is just a cond_resched in the end and can end up triggering pre
> mature OOM.
> 
> This patch replaces the unconditional wait_iff_congested by
> congestion_wait which is executed only if we _know_ that the last round
> of direct reclaim didn't make any progress and dirty+writeback pages are
> more than a half of the reclaimable pages on the zone which might be
> usable for our target allocation. This shouldn't reintroduce stalls
> fixed by 0e093d99763e because congestion_wait is called only when we
> are getting hopeless when sleeping is a better choice than OOM with many
> pages under IO.
> 
> Signed-off-by: Michal Hocko 
> ---
>   mm/page_alloc.c | 19 +--
>   1 file changed, 17 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 9c0abb75ad53..0518ca6a9776 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3191,8 +3191,23 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int 
> order,
>*/
>   if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
>   ac->high_zoneidx, alloc_flags, target)) {
> - /* Wait for some write requests to complete then retry 
> */
> - wait_iff_congested(zone, BLK_RW_ASYNC, HZ/50);
> + unsigned long writeback = zone_page_state(zone, 
> NR_WRITEBACK),
> +   dirty = zone_page_state(zone, 
> NR_FILE_DIRTY);
> +
> + if (did_some_progress)
> + goto retry;
> +
> + /*
> +  * If we didn't make any progress and have a lot of
> +  * dirty + writeback pages then we should wait for
> +  * an IO to complete to slow down the reclaim and
> +  * prevent from pre mature OOM
> +  */
> + if (2*(writeback + dirty) > reclaimable)

Doesn't this add unnecessary latency if other zones have enough clean memory ?
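
(For reference, the condition in question pulled out by itself; the numbers below are
made up. The point is that this is evaluated per zone, so one dirty-heavy zone can
trigger the wait even if another zone still has plenty of clean reclaimable memory.)

#include <stdbool.h>
#include <stdio.h>

/*
 * The throttling predicate from the patch, in isolation: sleep only if more
 * than half of the reclaimable pages of this zone are dirty or under
 * writeback.
 */
static bool should_throttle(unsigned long writeback, unsigned long dirty,
                            unsigned long reclaimable)
{
    return 2 * (writeback + dirty) > reclaimable;
}

int main(void)
{
    /* made-up numbers: zone A is mostly dirty, zone B is mostly clean */
    printf("zone A: %s\n", should_throttle(3000, 2000, 8000) ? "wait" : "retry");
    printf("zone B: %s\n", should_throttle(100, 200, 8000) ? "wait" : "retry");
    return 0;
}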


Thanks,
-Kame
 




Re: [RFC 1/3] mm, oom: refactor oom detection

2015-10-29 Thread Kamezawa Hiroyuki
On 2015/10/30 0:17, mho...@kernel.org wrote:
> From: Michal Hocko 
> 
> __alloc_pages_slowpath has traditionally relied on the direct reclaim
> and did_some_progress as an indicator that it makes sense to retry
> allocation rather than declaring OOM. shrink_zones had to rely on
> zone_reclaimable if shrink_zone didn't make any progress to prevent
> from pre mature OOM killer invocation - the LRU might be full of dirty
> or writeback pages and direct reclaim cannot clean those up.
> 
> zone_reclaimable will allow to rescan the reclaimable lists several
> times and restart if a page is freed. This is really subtle behavior
> and it might lead to a livelock when a single freed page keeps allocator
> looping but the current task will not be able to allocate that single
> page. OOM killer would be more appropriate than looping without any
> progress for unbounded amount of time.
> 
> This patch changes OOM detection logic and pulls it out from shrink_zone
> which is too low to be appropriate for any high level decisions such as OOM
> which is per zonelist property. It is __alloc_pages_slowpath which knows
> how many attempts have been done and what was the progress so far
> therefore it is more appropriate to implement this logic.
> 
> The new heuristic tries to be more deterministic and easier to follow.
> Retrying makes sense only if the currently reclaimable memory + free
> pages would allow the current allocation request to succeed (as per
> __zone_watermark_ok) at least for one zone in the usable zonelist.
> 
> This alone wouldn't be sufficient, though, because the writeback might
> get stuck and reclaimable pages might be pinned for a really long time
> or even depend on the current allocation context. Therefore there is a
> feedback mechanism implemented which reduces the reclaim target after
> each reclaim round without any progress. This means that we should
> eventually converge to only NR_FREE_PAGES as the target and fail on the
> wmark check and proceed to OOM. The backoff is simple and linear with
> 1/16 of the reclaimable pages for each round without any progress. We
> are optimistic and reset counter for successful reclaim rounds.
> 
> Costly high order pages mostly preserve their semantic and those without
> __GFP_REPEAT fail right away while those which have the flag set will
> back off after the amount of reclaimable pages reaches equivalent of the
> requested order. The only difference is that if there was no progress
> during the reclaim we rely on zone watermark check. This is more logical
> thing to do than previous 1< zone_reclaimable faking the progress.
> 
> Signed-off-by: Michal Hocko 
> ---
>   include/linux/swap.h |  1 +
>   mm/page_alloc.c  | 69 
> ++--
>   mm/vmscan.c  | 10 +---
>   3 files changed, 64 insertions(+), 16 deletions(-)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 9c7c4b418498..8298e1dc20f9 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -317,6 +317,7 @@ extern void lru_cache_add_active_or_unevictable(struct 
> page *page,
>   struct vm_area_struct *vma);
>   
>   /* linux/mm/vmscan.c */
> +extern unsigned long zone_reclaimable_pages(struct zone *zone);
>   extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>   gfp_t gfp_mask, nodemask_t *mask);
>   extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c73913648357..9c0abb75ad53 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2972,6 +2972,13 @@ static inline bool is_thp_gfp_mask(gfp_t gfp_mask)
>   return (gfp_mask & (GFP_TRANSHUGE | __GFP_KSWAPD_RECLAIM)) == 
> GFP_TRANSHUGE;
>   }
>   
> +/*
> + * Number of backoff steps for potentially reclaimable pages if the direct 
> reclaim
> + * cannot make any progress. Each step will reduce 1/MAX_STALL_BACKOFF of the
> + * reclaimable memory.
> + */
> +#define MAX_STALL_BACKOFF 16
> +
>   static inline struct page *
>   __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>   struct alloc_context *ac)
> @@ -2984,6 +2991,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int 
> order,
>   enum migrate_mode migration_mode = MIGRATE_ASYNC;
>   bool deferred_compaction = false;
>   int contended_compaction = COMPACT_CONTENDED_NONE;
> + struct zone *zone;
> + struct zoneref *z;
> + int stall_backoff = 0;
>   
>   /*
>* In the slowpath, we sanity check order to avoid ever trying to
> @@ -3135,13 +3145,56 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int 
> order,
>   if (gfp_mask & __GFP_NORETRY)
>   goto noretry;
>   
> - /* Keep reclaiming pages as long as there is reasonable progress */
> + /*
> +  * Do not retry high order allocations unless they are 


Re: [PATCH] mm: Introduce kernelcore=reliable option

2015-10-22 Thread Kamezawa Hiroyuki

On 2015/10/22 3:17, Luck, Tony wrote:

+   if (reliable_kernelcore) {
+   for_each_memblock(memory, r) {
+   if (memblock_is_mirror(r))
+   continue;

Should we have a safety check here that there is some mirrored memory?  If you 
give
the kernelcore=reliable option on a machine which doesn't have any mirror 
configured,
then we'll mark all memory as removable.


You're right.


What happens then?  Do kernel allocations fail?  Or do they fall back to using 
removable memory?


Maybe the kernel cannot boot because NORMAL zone is empty.


Is there a /proc or /sys file that shows the current counts for the removable 
zone?  I just
tried this patch with a high percentage of memory marked as mirror ... but I'd 
like to see
how much is actually being used to tune things a bit.



I think /proc/zoneinfo can show detailed numbers per zone. Do we need something
in meminfo as well?
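
(For example, a rough sketch of pulling a few per-zone counters out of /proc/zoneinfo
from userspace; exactly which fields appear depends on the kernel version, so treat
the field names as illustrative only.)

#include <stdio.h>
#include <string.h>

/* Print the zone headers and a few per-zone counters from /proc/zoneinfo. */
int main(void)
{
    char line[256];
    FILE *f = fopen("/proc/zoneinfo", "r");

    if (!f) {
        perror("/proc/zoneinfo");
        return 1;
    }
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, "Node", 4) == 0 ||    /* e.g. "Node 0, zone  Movable" */
            strstr(line, "nr_free_pages") ||
            strstr(line, "nr_inactive_file") ||
            strstr(line, "present"))
            fputs(line, stdout);
    }
    fclose(f);
    return 0;
}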

Thanks,
-Kame






Re: [PATCH][RFC] mm: Introduce kernelcore=reliable option

2015-10-13 Thread Kamezawa Hiroyuki

On 2015/10/09 19:36, Xishi Qiu wrote:

On 2015/10/9 17:24, Kamezawa Hiroyuki wrote:


On 2015/10/09 15:46, Xishi Qiu wrote:

On 2015/10/9 22:56, Taku Izumi wrote:


Xeon E7 v3 based systems supports Address Range Mirroring
and UEFI BIOS complied with UEFI spec 2.5 can notify which
ranges are reliable (mirrored) via EFI memory map.
Now Linux kernel utilize its information and allocates
boot time memory from reliable region.

My requirement is:
- allocate kernel memory from reliable region
- allocate user memory from non-reliable region

In order to meet my requirement, ZONE_MOVABLE is useful.
By arranging non-reliable range into ZONE_MOVABLE,
reliable memory is only used for kernel allocations.



Hi Taku,

You mean set non-mirrored memory to movable zone, and set
mirrored memory to normal zone, right? So kernel allocations
will use mirrored memory in normal zone, and user allocations
will use non-mirrored memory in movable zone.

My question is:
1) do we need to change the fallback function?


For *our* requirement, it's not required. But if someone want to prevent
user's memory allocation from NORMAL_ZONE, we need some change in zonelist
walking.



Hi Kame,

So we assume kernel will only use normal zone(mirrored), and users use movable
zone(non-mirrored) first if the memory is not enough, then use normal zone too.



Yes.


2) the mirrored region should locate at the start of normal
zone, right?


Precisely, "not-reliable" range of memory are handled by ZONE_MOVABLE.
This patch does only that.


I mean the mirrored region can not at the middle or end of the zone,
BIOS should report the memory like this,

e.g.
BIOS
node0: 0-4G mirrored, 4-8G mirrored, 8-16G non-mirrored
node1: 16-24G mirrored, 24-32G non-mirrored

OS
node0: DMA DMA32 are both mirrored, NORMAL(4-8G), MOVABLE(8-16G)
node1: NORMAL(16-24G), MOVABLE(24-32G)



I think zones can overlap even while they are aligned to MAX_ORDER.






I remember Kame has already suggested this idea. In my opinion,
I still think it's better to add a new migratetype or a new zone,
so both user and kernel could use mirrored memory.


Hi, Xishi.

I and Izumi-san discussed the implementation much and found using "zone"
is better approach.

The biggest reason is that zone is a unit of vmscan and all statistics and
handling the range of memory for a purpose. We can reuse all vmscan and
information codes by making use of zones. Introducing another structure will be 
messy.


Yes, add a new zone is better, but it will change much code, so reuse 
ZONE_MOVABLE
is simpler and easier, right?



I think so. If someone feels difficulty with ZONE_MOVABLE, adding a new zone will be
another job.
(*) Taku-san's boot option specifies that kernelcore be placed into reliable memory
    and doesn't specify anything about user memory.



His patch is very simple.



The following plan sounds good to me. Shall we rename the zone name when it is
used for mirrored memory, "movable" is a little confusion.



Maybe. I think that should be a separate discussion. With this patch and his
fake-reliable-memory patch, everyone can give it a try.



For your requirements. I and Izumi-san are discussing following plan.

  - Add a flag to show the zone is reliable or not, then, mark ZONE_MOVABLE as 
not-reliable.
  - Add __GFP_RELIABLE. This will allow alloc_pages() to skip not-reliable zone.
  - Add madvise() MADV_RELIABLE and modify page fault code's gfp flag with that 
flag.



like this?
user: madvise()/mmap()/or others -> add vma_reliable flag -> add gfp_reliable flag 
-> alloc_pages
kernel: use __GFP_RELIABLE flag in buddy allocation/slab/vmalloc...

yes.



Also we can introduce some interfaces in procfs or sysfs, right?



It's based on your use case. I think madvise() will be the 1st choice.
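
(To illustrate what the madvise() interface could look like from userspace. Note that
MADV_RELIABLE is only the flag proposed in this thread -- no such flag exists in any
kernel today -- so the value below is a placeholder and the call is expected to fail.)

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

/*
 * MADV_RELIABLE is only a proposal in this thread; the value below is a
 * hypothetical placeholder so that the sketch compiles.  On a real kernel
 * the call simply fails with EINVAL.
 */
#ifndef MADV_RELIABLE
#define MADV_RELIABLE 250    /* hypothetical value, for illustration only */
#endif

int main(void)
{
    size_t len = 1 << 20;
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    /* Ask that future faults in this range be backed by reliable (mirrored) memory. */
    if (madvise(p, len, MADV_RELIABLE) != 0)
        perror("madvise(MADV_RELIABLE)");    /* expected to fail on current kernels */
    munmap(p, len);
    return 0;
}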

Thanks,
-kame









Re: [PATCH][RFC] mm: Introduce kernelcore=reliable option

2015-10-09 Thread Kamezawa Hiroyuki

On 2015/10/09 15:46, Xishi Qiu wrote:

On 2015/10/9 22:56, Taku Izumi wrote:


Xeon E7 v3 based systems supports Address Range Mirroring
and UEFI BIOS complied with UEFI spec 2.5 can notify which
ranges are reliable (mirrored) via EFI memory map.
Now Linux kernel utilize its information and allocates
boot time memory from reliable region.

My requirement is:
   - allocate kernel memory from reliable region
   - allocate user memory from non-reliable region

In order to meet my requirement, ZONE_MOVABLE is useful.
By arranging non-reliable range into ZONE_MOVABLE,
reliable memory is only used for kernel allocations.



Hi Taku,

You mean set non-mirrored memory to movable zone, and set
mirrored memory to normal zone, right? So kernel allocations
will use mirrored memory in normal zone, and user allocations
will use non-mirrored memory in movable zone.

My question is:
1) do we need to change the fallback function?


For *our* requirement, it's not required. But if someone wants to prevent
user memory allocation from NORMAL_ZONE, we need some change in the zonelist
walking.


2) the mirrored region should locate at the start of normal
zone, right?


Precisely, "not-reliable" range of memory are handled by ZONE_MOVABLE.
This patch does only that.



I remember Kame has already suggested this idea. In my opinion,
I still think it's better to add a new migratetype or a new zone,
so both user and kernel could use mirrored memory.


Hi, Xishi.

Izumi-san and I discussed the implementation at length and found that using "zone"
is the better approach.

The biggest reason is that a zone is the unit of vmscan, of all the statistics, and
of handling a range of memory for a purpose. We can reuse all the vmscan and
statistics code by making use of zones. Introducing another structure would be messy.
His patch is very simple.

For your requirements, Izumi-san and I are discussing the following plan.

 - Add a flag to show whether the zone is reliable or not, then mark ZONE_MOVABLE
   as not-reliable.
 - Add __GFP_RELIABLE. This will allow alloc_pages() to skip not-reliable zones.
 - Add madvise() MADV_RELIABLE and modify the page fault code's gfp flags with
   that flag.


Thanks,
-Kame























Re: [Intel-wired-lan] [Patch V3 5/9] i40e: Use numa_mem_id() to better support memoryless node

2015-10-09 Thread Kamezawa Hiroyuki

On 2015/10/09 14:52, Jiang Liu wrote:

On 2015/10/9 4:20, Andrew Morton wrote:

On Wed, 19 Aug 2015 17:18:15 -0700 (PDT) David Rientjes  
wrote:


On Wed, 19 Aug 2015, Patil, Kiran wrote:


Acked-by: Kiran Patil 


Where's the call to preempt_disable() to prevent kernels with preemption
from making numa_node_id() invalid during this iteration?


David asked this question twice, received no answer and now the patch
is in the maintainer tree, destined for mainline.

If I was asked this question I would respond

   The use of numa_mem_id() is racy and best-effort.  If the unlikely
   race occurs, the memory allocation will occur on the wrong node, the
   overall result being very slightly suboptimal performance.  The
   existing use of numa_node_id() suffers from the same issue.

But I'm not the person proposing the patch.  Please don't just ignore
reviewer comments!

Hi Andrew,
Apologize for the slow response due to personal reasons!
And thanks for answering the question from David. To be honest,
I didn't know how to answer this question before. Actually this
question has puzzled me for a long time when dealing with memory
hot-removal. For normal cases, it only causes sub-optimal memory
allocation if schedule event happens between querying NUMA node id
and calling alloc_pages_node(). But what happens if system run into
following execution sequence?
1) node = numa_mem_id();
2) memory hot-removal event triggers
2.1) remove affected memory
2.2) reset pgdat to zero if node becomes empty after memory removal


I'm sorry if I misunderstand something.
After commit b0dc3a342af36f95a68fe229b8f0f73552c5ca08, there is no memset().


3) alloc_pages_node(), which may access zero-ed pgdat structure.


?



I haven't found a mechanism to protect system from above sequence yet,
so puzzled for a long time already:(. Does stop_machine() protect
system from such a execution sequence?


To access a pgdat, the pgdat's zones should be on a per-pgdat zonelist.
Now, __build_all_zonelists() is called under stop_machine(). That's the reason
you're asking what stop_machine() does. And, as you know, stop_machine() is not
protecting anything. The caller may fall back into a removed zone.

Then, let's think.

First, please note that "pgdat" is not removed (and cannot be removed), so
accessing the pgdat's memory will not cause a segmentation fault.

Only the contents are a problem. At removal, the zone's and the pgdat's
page-related information is cleared.

alloc_pages uses the zonelist/zoneref/cache to walk each zone without accessing
the pgdat itself. I think accessing the zonelist is safe because it's an array
updated under stop_machine().

So, the question is whether alloc_pages() can work correctly even if a zone
contains no pages. I think it should work.

(Note: zones are embedded in the pgdat, so zeroing the pgdat means zeroing the
zones and other structures; that would not work.)

So, what problem do you see now?
I'm sorry I can't chase the old discussions.
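
(As an aside, the query-then-allocate pattern David asked about can be mimicked from
userspace with libnuma, just to illustrate why it is inherently best-effort. This is
only an analogy, not the driver code.)

/* Build with: gcc -o onnode onnode.c -lnuma */
#define _GNU_SOURCE
#include <numa.h>
#include <sched.h>
#include <stdio.h>

/*
 * Step 1 looks up the current node, step 2 allocates on it.  If the task
 * migrates (or, in the kernel case, the node goes away) between the two
 * steps, the node id is stale and the result is merely suboptimal.
 */
int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support\n");
        return 1;
    }
    int node = numa_node_of_cpu(sched_getcpu());    /* step 1: query */
    void *p = numa_alloc_onnode(1 << 20, node);     /* step 2: allocate */

    if (!p) {
        perror("numa_alloc_onnode");
        return 1;
    }
    printf("allocated 1MB on node %d (best effort)\n", node);
    numa_free(p, 1 << 20);
    return 0;
}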

Thanks,
-Kame





















Re: [PATCH 03/11] x86/mm/hotplug: Don't remove PGD entries in remove_pagetable()

2015-10-05 Thread Kamezawa Hiroyuki
On 2015/09/22 15:23, Ingo Molnar wrote:
> So when memory hotplug removes a piece of physical memory from pagetable
> mappings, it also frees the underlying PGD entry.
> 
> This complicates PGD management, so don't do this. We can keep the
> PGD mapped and the PUD table all clear - it's only a single 4K page
> per 512 GB of memory hotplugged.
> 
> Cc: Andrew Morton 
> Cc: Andy Lutomirski 
> Cc: Borislav Petkov 
> Cc: Brian Gerst 
> Cc: Denys Vlasenko 
> Cc: H. Peter Anvin 
> Cc: Linus Torvalds 
> Cc: Oleg Nesterov 
> Cc: Peter Zijlstra 
> Cc: Rik van Riel 
> Cc: Thomas Gleixner 
> Cc: Waiman Long 
> Cc: linux...@kvack.org
> Signed-off-by: Ingo Molnar 

Ishimatsu-san, Tang-san, please check.

Doesn't this patch affect the issues of
 5255e0a79fcc0ff47b387af92bd9ef5729b1b859
 9661d5bcd058fe15b4138a00d96bd36516134543
?

-Kame

> ---
>   arch/x86/mm/init_64.c | 27 ---
>   1 file changed, 27 deletions(-)
> 
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 7129e7647a76..60b0cc3f2819 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -780,27 +780,6 @@ static void __meminit free_pmd_table(pmd_t *pmd_start, 
> pud_t *pud)
>   spin_unlock(&init_mm.page_table_lock);
>   }
>   
> -/* Return true if pgd is changed, otherwise return false. */
> -static bool __meminit free_pud_table(pud_t *pud_start, pgd_t *pgd)
> -{
> - pud_t *pud;
> - int i;
> -
> - for (i = 0; i < PTRS_PER_PUD; i++) {
> - pud = pud_start + i;
> - if (pud_val(*pud))
> - return false;
> - }
> -
> - /* free a pud table */
> - free_pagetable(pgd_page(*pgd), 0);
> - spin_lock(&init_mm.page_table_lock);
> - pgd_clear(pgd);
> - spin_unlock(&init_mm.page_table_lock);
> -
> - return true;
> -}
> -
>   static void __meminit
>   remove_pte_table(pte_t *pte_start, unsigned long addr, unsigned long end,
>bool direct)
> @@ -992,7 +971,6 @@ remove_pagetable(unsigned long start, unsigned long end, 
> bool direct)
>   unsigned long addr;
>   pgd_t *pgd;
>   pud_t *pud;
> - bool pgd_changed = false;
>   
>   for (addr = start; addr < end; addr = next) {
>   next = pgd_addr_end(addr, end);
> @@ -1003,13 +981,8 @@ remove_pagetable(unsigned long start, unsigned long 
> end, bool direct)
>   
>   pud = (pud_t *)pgd_page_vaddr(*pgd);
>   remove_pud_table(pud, addr, next, direct);
> - if (free_pud_table(pud, pgd))
> - pgd_changed = true;
>   }
>   
> - if (pgd_changed)
> - sync_global_pgds(start, end - 1, 1);
> -
>   flush_tlb_all();
>   }
>   
> 






Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-08-24 Thread Kamezawa Hiroyuki

On 2015/08/25 8:15, Paul Turner wrote:

On Mon, Aug 24, 2015 at 3:49 PM, Tejun Heo  wrote:

Hello,

On Mon, Aug 24, 2015 at 03:03:05PM -0700, Paul Turner wrote:

Hmm... I was hoping for an actual configurations and usage scenarios.
Preferably something people can set up and play with.


This is much easier to set up and play with synthetically.  Just
create the 10 threads and 100 threads above then experiment with
configurations designed at guaranteeing the set of 100 threads
relatively uniform throughput regardless of how many are active.   I
don't think trying to run a VM stack adds anything except complexity
of reproduction here.


Well, but that loses most of details and why such use cases matter to
begin with.  We can imagine up stuff to induce arbitrary set of
requirements.


All that's being proved or disproved here is that it's difficult to
coordinate the consumption of asymmetric thread pools using nice.  The
constraints here were drawn from a real-world example.




I take that the
CPU intensive helper threads are usually IO workers?  Is the scenario
where the VM is set up with a lot of IO devices and different ones may
consume large amount of CPU cycles at any given point?


Yes, generally speaking there are a few major classes of IO (flash,
disk, network) that a guest may invoke.  Each of these backends is
separate and chooses its own threading.


Hmmm... if that's the case, would limiting iops on those IO devices
(or classes of them) work?  qemu already implements IO limit mechanism
after all.


No.

1) They should proceed at the maximum rate that they can that's still
within their provisioning budget.
2) The cost/IO is both inconsistent and changes over time.  Attempting
to micro-optimize every backend for this is infeasible, this is
exactly the type of problem that the scheduler can usefully help
arbitrate.
3) Even pretending (2) is fixable, dynamically dividing these
right-to-work tokens between different I/O device backends is
extremely complex.



I think I should explain my customer's use case of qemu + cpuset/cpu (via libvirt).

(1) Isolating hypervisor threads.
    As already discussed, hypervisor threads are isolated by cpuset. But the purpose
    is to avoid _latency_ spikes caused by hypervisor behavior, so "nice" cannot be
    the solution, as already discussed.

(2) Fixed-rate vcpu service.
    Using the cpu controller's quota/period feature, my customer creates vcpu models
    like Low (1GHz), Mid (2GHz), and High (3GHz) for an IaaS system.

    To do this, each vcpu should be quota-limited independently, with per-thread cpu
    control.

In particular, method (1) is used by several enterprise customers to stabilize their
systems.

Sub-process control should be provided in some way.
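
(To make (2) concrete, a rough sketch of how a fixed-rate vcpu model maps onto the cpu
controller's quota/period files. The cgroup path and the 3GHz-host arithmetic are just
examples; in practice libvirt writes these values, not a hand-rolled tool.)

#include <stdio.h>

/*
 * Example only: on a host whose cores run at ~3GHz, a "1GHz" vcpu model
 * roughly corresponds to a 1/3 CPU quota for that vcpu thread's cgroup.
 */
static int write_val(const char *path, long val)
{
    FILE *f = fopen(path, "w");

    if (!f)
        return -1;
    fprintf(f, "%ld\n", val);
    return fclose(f);
}

int main(void)
{
    /* hypothetical per-vcpu cgroup, created beforehand */
    const char *dir = "/sys/fs/cgroup/cpu/machine/guest1/vcpu0";
    char path[256];
    long period_us = 100000;            /* 100ms period */
    long quota_us  = period_us / 3;     /* ~1GHz worth of time on a 3GHz host */

    snprintf(path, sizeof(path), "%s/cpu.cfs_period_us", dir);
    if (write_val(path, period_us))
        perror(path);
    snprintf(path, sizeof(path), "%s/cpu.cfs_quota_us", dir);
    if (write_val(path, quota_us))
        perror(path);
    return 0;
}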

Thanks,
-Kame





Anyways, a point here is that threads of the same process competing
isn't a new problem.  There are many ways to make those threads play
nice as the application itself often has to be involved anyway,
especially for something like qemu which is heavily involved in
provisioning resources.


It's certainly not a new problem, but it's a real one, and it's
_hard_.  You're proposing removing the best known solution.



cgroups can be a nice brute-force add-on which lets sysadmins do wild
things but it's inherently hacky and incomplete for coordinating
threads.  For example, what is it gonna do if qemu cloned vcpus and IO
helpers dynamically off of the same parent thread?


We're talking about sub-process usage here.  This is the application
coordinating itself, NOT the sysadmin.  Processes are becoming larger
and larger, we need many of the same controls within them that we have
between them.


  It requires
application's cooperation anyway but at the same time is painful to
actually interact from those applications.


As discussed elsewhere on thread this is really not a problem if you
define consistent rules with respect to which parts are managed by
who.  The argument of potential interference is no different to
messing with an application's on-disk configuration behind its back.
Alternate strawmen which greatly improve this from where we are today
have also been proposed.



Thanks.

--
tejun





Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-08-18 Thread Kamezawa Hiroyuki

On 2015/08/19 5:31, Tejun Heo wrote:

Hello, Paul.

On Mon, Aug 17, 2015 at 09:03:30PM -0700, Paul Turner wrote:

2) Control within an address-space.  For subsystems with fungible resources,
e.g. CPU, it can be useful for an address space to partition its own
threads.  Losing the capability to do this against the CPU controller would
be a large set-back for instance.  Occasionally, it is useful to share these
groupings between address spaces when processes are cooperative, but this is
less of a requirement.

This is important to us.


Sure, let's build a proper interface for that.  Do you actually need
sub-hierarchy inside a process?  Can you describe your use case in
detail and why having hierarchical CPU cycle distribution is essential
for your use case?


An actual per-thread use case among our customers is qemu-kvm + cpuset.
Customers pin each vcpu and qemu-kvm's worker threads to cpus.
For example, pinning 4 vcpus to cpus 2-6 and pinning the qemu main thread and
the others (vhost) to cpus 0-1.

This is an actual kvm tuning used by our customers for performance guarantees.

In another case, the cpu cgroup's throttling feature is used per vcpu for VM cpu
sizing.
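
(For illustration, the per-thread pinning described above can be done with
sched_setaffinity(); the cpu numbers are just the ones from the example, and a real
setup would typically do this through libvirt rather than by hand.)

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Pin the calling thread to a cpu range, e.g. a vcpu thread to cpus 2-6. */
static int pin_to_cpus(int first, int last)
{
    cpu_set_t set;
    int cpu;

    CPU_ZERO(&set);
    for (cpu = first; cpu <= last; cpu++)
        CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);    /* pid 0 == calling thread */
}

int main(void)
{
    if (pin_to_cpus(2, 6))    /* vcpu threads: cpus 2-6 in the example above */
        perror("sched_setaffinity");
    else
        printf("pinned to cpus 2-6\n");
    return 0;
}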

Thanks,
-Kame





Re: [PATCH 0/3] Make workingset detection logic memcg aware

2015-08-11 Thread Kamezawa Hiroyuki

On 2015/08/10 17:14, Vladimir Davydov wrote:

On Sun, Aug 09, 2015 at 11:12:25PM +0900, Kamezawa Hiroyuki wrote:

On 2015/08/08 22:05, Vladimir Davydov wrote:

On Fri, Aug 07, 2015 at 10:38:16AM +0900, Kamezawa Hiroyuki wrote:

...

All ? hmm. It seems that mixture of record of global memory pressure and of 
local memory
pressure is just wrong.


What makes you think so? An example of misbehavior caused by this would
be nice to have.



By design, memcg's LRU aging logic is independent from global memory 
allocation/pressure.


Assume there are 4 containers (using much page-cache) with 1GB limit on 4GB 
server,
   # container A  workingset=600M   limit=1G (sleepy)
   # container B  workingset=300M   limit=1G (work often)
   # container C  workingset=500M   limit=1G (work slowly)
   # container D  workingset=1.2G   limit=1G (work hard)
container D can drive the zone's distance counter because of local memory 
reclaim.
If active/inactive = 1:1, container D page can be activated.
At kswapd(global reclaim) runs, all container's LRU will rotate.

Possibility of refault in A, B, C is reduced by container D's counter updates.


This does not necessarily mean we have to use different inactive_age
counter for global and local memory pressure. In your example, having
inactive_age per lruvec and using it for evictions on both global and
local memory pressure would work just fine.



you're right.

 




 if (current memcg == recorded memcg && eviction distance is 
okay)
  activate page.
 else
  inactivate
At page-out
 if (global memory pressure)
 record eviction id with using zone's counter.
 else if (memcg local memory pressure)
 record eviction id with memcg's counter.



I don't understand how this is supposed to work when a memory cgroup
experiences both local and global pressure simultaneously.



I think updating global distance counter by local reclaim may update counter 
too much.


But if the inactive_age counter was per lruvec, then we wouldn't need to
bother about it.


yes.

Anyway, what I understand now is that we need to reduce the influence of one
memcg's behavior on other memcgs. Your way divides the counters completely; my
idea was implementing a different counter. Doing it by calculation would be good
because we can't have enough record space.


Thanks,
-Kame

Re: [PATCH 0/3] Make workingset detection logic memcg aware

2015-08-09 Thread Kamezawa Hiroyuki

On 2015/08/08 22:05, Vladimir Davydov wrote:

On Fri, Aug 07, 2015 at 10:38:16AM +0900, Kamezawa Hiroyuki wrote:

On 2015/08/06 17:59, Vladimir Davydov wrote:

On Wed, Aug 05, 2015 at 10:34:58AM +0900, Kamezawa Hiroyuki wrote:

I wonder, rather than collecting more data, rough calculation can help the 
situation.
for example,

(refault_distance calculated in zone) * memcg_reclaim_ratio < memcg's 
active list

If one of per-zone calc or per-memcg calc returns true, refault should be true.

memcg_reclaim_ratio is the percentage of scan in a memcg against in a zone.


This particular formula wouldn't work I'm afraid. If there are two
isolated cgroups issuing local reclaim on the same zone, the refault
distance needed for activation would be reduced by half for no apparent
reason.


Hmm, you mean activation in memcg means activation in global LRU, and it's not a
valid reason. Current implementation does have the same issue, right ?

i.e. when a container has been hitting its limit for a while, and then, a file 
cache is
pushed out but came back soon, it can be easily activated.

I'd like to confirm what you want to do.

  1) avoid activating a file cache when it was kicked out because of memcg's 
local limit.


No, that's not what I want. I want pages of the workingset to get
activated on refault no matter if they were evicted on global memory
pressure or due to hitting a memory cgroup limit.



Sure.


  2) maintain active/inactive ratio in memcg properly as global LRU does.
  3) reclaim shadow entry at proper timing.

All ? hmm. It seems that mixture of record of global memory pressure and of 
local memory
pressure is just wrong.


What makes you think so? An example of misbehavior caused by this would
be nice to have.



By design, memcg's LRU aging logic is independent from global memory 
allocation/pressure.


Assume there are 4 containers (using much page-cache) with 1GB limit on 4GB 
server,
  # container A  workingset=600M   limit=1G (sleepy)
  # container B  workingset=300M   limit=1G (work often)
  # container C  workingset=500M   limit=1G (work slowly)
  # container D  workingset=1.2G   limit=1G (work hard)
 
container D can drive the zone's distance counter because of local memory reclaim.

If active/inactive = 1:1, container D page can be activated.
At kswapd(global reclaim) runs, all container's LRU will rotate.

Possibility of refault in A, B, C is reduced by container D's counter updates.

But yes, some _real_ tests are required.



Now, the record is

eviction | node | zone | 2bit.

How about changing this as

 0 |eviction | node | zone | 2bit
 1 |eviction |  memcgid| 2bit

Assume each memcg has an eviction counter, which ignoring node/zone.
i.e. memcg local reclaim happens against memcg not against zone.

At page-in,
 if (the 1st bit is 0)
 compare eviction counter with zone's counter and activate the 
page if needed.
 else if (the 1st bit is 1)
 compare eviction counter with the memcg (if exists)


Having a single counter per memcg won't scale with the number of NUMA
nodes.


It doesn't matter; we can use a lazy counter like a pcpu counter because it
doesn't need to be very accurate.
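
A rough sketch of such a lazy counter with percpu_counter (struct and helper
names are only illustration, not proposed code):

#include <linux/percpu_counter.h>

/* one approximate eviction counter per memcg, cheap to bump at eviction;
 * initialised elsewhere with percpu_counter_init()
 */
struct memcg_workingset {
	struct percpu_counter inactive_age;
};

static inline void memcg_note_eviction(struct memcg_workingset *ws)
{
	percpu_counter_add(&ws->inactive_age, 1);
}

static inline unsigned long memcg_inactive_age(struct memcg_workingset *ws)
{
	/* an approximate read is enough; exactness is not required here */
	return percpu_counter_read_positive(&ws->inactive_age);
}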



 if (current memcg == recorded memcg && eviction distance is 
okay)
  activate page.
 else
  inactivate
At page-out
 if (global memory pressure)
 record eviction id with using zone's counter.
 else if (memcg local memory pressure)
 record eviction id with memcg's counter.



I don't understand how this is supposed to work when a memory cgroup
experiences both local and global pressure simultaneously.



I think updating global distance counter by local reclaim may update counter 
too much.
Above is to avoid updating zone's counter and keep memcg's LRU active/inactive 
balanced.


Also, what if a memory cgroup is protected by memory.low? Such a cgroup
may have all its pages in the active list, because it is never scanned.


If the LRU is never scanned, all file caches tend to stay INACTIVE... they never
refault.


This will affect the refault distance of other cgroups, making
activations unpredictable.





Thanks,
-Kame








Re: [PATCH 0/3] Make workingset detection logic memcg aware

2015-08-06 Thread Kamezawa Hiroyuki

On 2015/08/06 17:59, Vladimir Davydov wrote:

On Wed, Aug 05, 2015 at 10:34:58AM +0900, Kamezawa Hiroyuki wrote:


Reading discussion, I feel storing more data is difficult, too.


Yep, even with the current 16-bit memcg id. Things would get even worse
if we wanted to extend it one day (will we?)



I wonder, rather than collecting more data, rough calculation can help the 
situation.
for example,

(refault_distance calculated in zone) * memcg_reclaim_ratio < memcg's 
active list

If one of per-zone calc or per-memcg calc returns true, refault should be true.

memcg_reclaim_ratio is the percentage of scan in a memcg against in a zone.


This particular formula wouldn't work I'm afraid. If there are two
isolated cgroups issuing local reclaim on the same zone, the refault
distance needed for activation would be reduced by half for no apparent
reason.


Hmm, you mean activation in memcg means activation in global LRU, and it's not a
valid reason. Current implementation does have the same issue, right ?

i.e. when a container has been hitting its limit for a while, and then, a file 
cache is
pushed out but came back soon, it can be easily activated.

I'd like to confirm what you want to do.

 1) avoid activating a file cache when it was kicked out because of memcg's 
local limit.
 2) maintain active/inactive ratio in memcg properly as global LRU does.
 3) reclaim shadow entry at proper timing.

All ? hmm. It seems that mixture of record of global memory pressure and of 
local memory
pressure is just wrong.

Now, the record is

eviction | node | zone | 2bit.

How about changing this as

0 |eviction | node | zone | 2bit
1 |eviction |  memcgid| 2bit

Assume each memcg has an eviction counter, which ignores node/zone.
i.e. memcg local reclaim happens against memcg not against zone.

At page-in,
if (the 1st bit is 0)
compare eviction counter with zone's counter and activate the 
page if needed.
else if (the 1st bit is 1)
compare eviction counter with the memcg (if exists)
if (current memcg == recorded memcg && eviction distance is 
okay)
 activate page.
else
 inactivate
  
At page-out

if (global memory pressure)
record eviction id with using zone's counter.
else if (memcg local memory pressure)
record eviction id with memcg's counter.

By this,
   1) locally reclaimed pages cannot be activated unless they are refaulted in the 
same memcg.
  In this case, activating in the memcg has some meaning.

   2) At global memory pressure, distance is properly calculated based on 
global system status.
  global memory pressure can ignore memcg's behavior.

about shadow entries, kmemcg should take care of it
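
To make the two record formats above concrete, a small sketch of the packing
(field widths and the flag position are illustrative only, and the radix-tree
special bits are left out):

#define WS_MEMCG_FLAG	1UL	/* lowest bit: 1 = recorded against a memcg */
#define WS_ZONE_BITS	2
#define WS_NODE_BITS	10
#define WS_MEMCGID_BITS	16	/* current memcg id is 16 bit */

static unsigned long ws_pack_zone(unsigned long eviction, int node, int zone)
{
	unsigned long val = eviction;

	val = (val << WS_NODE_BITS) | node;
	val = (val << WS_ZONE_BITS) | zone;
	return val << 1;			/* flag bit stays 0 */
}

static unsigned long ws_pack_memcg(unsigned long eviction, unsigned short memcgid)
{
	return (((eviction << WS_MEMCGID_BITS) | memcgid) << 1) | WS_MEMCG_FLAG;
}

static bool ws_record_is_memcg(unsigned long entry)
{
	return entry & WS_MEMCG_FLAG;
}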


Thanks,
-Kame





Re: [PATCH v2 1/3] cgroup: define controller file conventions

2015-08-05 Thread Kamezawa Hiroyuki

On 2015/08/05 16:47, Michal Hocko wrote:

On Wed 05-08-15 09:39:40, KAMEZAWA Hiroyuki wrote:
[...]

so, for memory controller, we'll have


We currently have only current, low, high, max and events currently.
All other knobs are either deprecated or waiting for a usecase to emerge
before they get added.



Sure. I think the following have users.
 - *.stat - for checking health of a cgroup, or for debug
 - *.pressure_level   - for notifying memory pressure
 - *.swappiness   - for adjusting LRU activity per application type.
 - *.oom_control  - for surviving/notifying out of memory
        memcg's oom can be recovered if the limit goes up rather
        than a kill.
        But I know people say this knob is not useful. This
        will require discussion.

Hm. If we don't want to increase the number of files, NETLINK or a system call is
another choice of subsystem-specific interface?

-Kame


(in alphabet order)
memory.failcnt
memory.force_empty  (<= should this be removed ?)
memory.kmem.failcnt
memory.kmem.max
memory.kmem.max_usage
memory.kmem.slabinfo
memory.kmem.tcp.failcnt
memory.kmem.tcp.max
memory.kmem.tcp.max_usage
memory.kmem.tcp.usage
memory.kmem.usage
memory.max
memory.max_usage
memory.move_charge_at_immigrate
memory.numa_stat
memory.oom_control
memory.pressure_level
memory.high
memory.swappiness
memory.usage
memory.use_hierarchy (<= removed)






Re: [PATCH 0/3] Make workingset detection logic memcg aware

2015-08-04 Thread Kamezawa Hiroyuki
On 2015/08/03 21:04, Vladimir Davydov wrote:
> Hi,
> 
> Currently, workingset detection logic is not memcg aware - inactive_age
> is maintained per zone. As a result, if memory cgroups are used,
> refaulted file pages are activated randomly. This patch set makes
> inactive_age per lruvec so that workingset detection will work correctly
> for memory cgroup reclaim.
> 
> Thanks,
> 

Reading discussion, I feel storing more data is difficult, too.

I wonder, rather than collecting more data, rough calculation can help the 
situation.
for example,

   (refault_distance calculated in zone) * memcg_reclaim_ratio < memcg's 
active list

If one of per-zone calc or per-memcg calc returns true, refault should be true.

memcg_reclaim_ratio is the percentage of scan in a memcg against in a zone.
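
Roughly, the check could look like this (zone_active_file(), memcg_active_file()
and memcg_scan_percent() are assumed helpers here, not existing functions):

static bool refault_should_activate(unsigned long refault_distance,
				    struct zone *zone, struct mem_cgroup *memcg)
{
	/* per-zone test, as the current workingset code does */
	if (refault_distance <= zone_active_file(zone))
		return true;

	/* per-memcg test: scale the distance by this memcg's share of zone scans */
	return refault_distance * memcg_scan_percent(zone, memcg) / 100
			<= memcg_active_file(memcg);
}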


Thanks,
-Kame





Re: [PATCH v2 1/3] cgroup: define controller file conventions

2015-08-04 Thread Kamezawa Hiroyuki

On 2015/08/05 4:31, Tejun Heo wrote:

 From 6abc8ca19df0078de17dc38340db3002ed489ce7 Mon Sep 17 00:00:00 2001
From: Tejun Heo 
Date: Tue, 4 Aug 2015 15:20:55 -0400

Traditionally, each cgroup controller implemented whatever interface
it wanted leading to interfaces which are widely inconsistent.
Examining the requirements of the controllers readily yield that there
are only a few control schemes shared among all.

Two major controllers already had to implement new interface for the
unified hierarchy due to significant structural changes.  Let's take
the chance to establish common conventions throughout all controllers.

This patch defines CGROUP_WEIGHT_MIN/DFL/MAX to be used on all weight
based control knobs and documents the conventions that controllers
should follow on the unified hierarchy.  Except for io.weight knob,
all existing unified hierarchy knobs are already compliant.  A
follow-up patch will update io.weight.

v2: Added descriptions of min, low and high knobs.

Signed-off-by: Tejun Heo 
Acked-by: Johannes Weiner 
Cc: Li Zefan 
Cc: Peter Zijlstra 
---
Hello,

Added low/high descriptions and applied to the following git branch.

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git 
for-4.3-unified-base

The branch currently only contains this patch and will stay stable so
that it can be pulled from.  I kept the base weight as DFL for now.
If we decide to change it, I'll apply the change on top.

Thanks.

  Documentation/cgroups/unified-hierarchy.txt | 80 ++---
  include/linux/cgroup.h  |  9 
  2 files changed, 81 insertions(+), 8 deletions(-)

diff --git a/Documentation/cgroups/unified-hierarchy.txt 
b/Documentation/cgroups/unified-hierarchy.txt
index 86847a7..1ee9caf 100644
--- a/Documentation/cgroups/unified-hierarchy.txt
+++ b/Documentation/cgroups/unified-hierarchy.txt
@@ -23,10 +23,13 @@ CONTENTS
  5. Other Changes
5-1. [Un]populated Notification
5-2. Other Core Changes
-  5-3. Per-Controller Changes
-5-3-1. blkio
-5-3-2. cpuset
-5-3-3. memory
+  5-3. Controller File Conventions
+5-3-1. Format
+5-3-2. Control Knobs
+  5-4. Per-Controller Changes
+5-4-1. blkio
+5-4-2. cpuset
+5-4-3. memory
  6. Planned Changes
6-1. CAP for resource control

@@ -372,14 +375,75 @@ supported and the interface files "release_agent" and
  - The "cgroup.clone_children" file is removed.


-5-3. Per-Controller Changes
+5-3. Controller File Conventions

-5-3-1. blkio
+5-3-1. Format
+
+In general, all controller files should be in one of the following
+formats whenever possible.
+
+- Values only files
+
+  VAL0 VAL1...\n
+
+- Flat keyed files
+
+  KEY0 VAL0\n
+  KEY1 VAL1\n
+  ...
+
+- Nested keyed files
+
+  KEY0 SUB_KEY0=VAL00 SUB_KEY1=VAL01...
+  KEY1 SUB_KEY0=VAL10 SUB_KEY1=VAL11...
+  ...
+
+For a writeable file, the format for writing should generally match
+reading; however, controllers may allow omitting later fields or
+implement restricted shortcuts for most common use cases.
+
+For both flat and nested keyed files, only the values for a single key
+can be written at a time.  For nested keyed files, the sub key pairs
+may be specified in any order and not all pairs have to be specified.
+
+
+5-3-2. Control Knobs
+
+- Settings for a single feature should generally be implemented in a
+  single file.
+
+- In general, the root cgroup should be exempt from resource control
+  and thus shouldn't have resource control knobs.
+
+- If a controller implements ratio based resource distribution, the
+  control knob should be named "weight" and have the range [1, 10000]
+  and 100 should be the default value.  The values are chosen to allow
+  enough and symmetric bias in both directions while keeping it
+  intuitive (the default is 100%).
+
+- If a controller implements an absolute resource guarantee and/or
+  limit, the control knobs should be named "min" and "max"
+  respectively.  If a controller implements best effort resource
+  gurantee and/or limit, the control knobs should be named "low" and
+  "high" respectively.
+
+  In the above four control files, the special token "max" should be
+  used to represent upward infinity for both reading and writing.
+

so, for memory controller, we'll have

(in alphabet order)
memory.failcnt
memory.force_empty  (<= should this be removed ?)
memory.kmem.failcnt
memory.kmem.max
memory.kmem.max_usage
memory.kmem.slabinfo
memory.kmem.tcp.failcnt
memory.kmem.tcp.max
memory.kmem.tcp.max_usage
memory.kmem.tcp.usage
memory.kmem.usage
memory.max
memory.max_usage
memory.move_charge_at_immigrate
memory.numa_stat
memory.oom_control
memory.pressure_level
memory.high
memory.swappiness
memory.usage
memory.use_hierarchy (<= removed)

?
-Kame


Re: [RFC v2 PATCH 2/8] mm: introduce MIGRATE_MIRROR to manage the mirrored pages

2015-06-30 Thread Kamezawa Hiroyuki

On 2015/06/30 11:45, Xishi Qiu wrote:

On 2015/6/29 15:32, Kamezawa Hiroyuki wrote:


On 2015/06/27 11:24, Xishi Qiu wrote:

This patch introduces a new migratetype called "MIGRATE_MIRROR", it is used to
allocate mirrored pages.
When cat /proc/pagetypeinfo, you can see the count of free mirrored blocks.

Signed-off-by: Xishi Qiu 


My fear about this approach is that it may break something existing.

Now, when we add the MIGRATE_MIRROR type, we'll hide attributes of pageblocks such as
MIGRATE_UNMOVABLE, MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE.

Logically, the MIRROR attribute is independent from page mobility, and this
overwrite will lose some information.

Then,


---
   include/linux/mmzone.h | 9 +
   mm/page_alloc.c| 3 +++
   mm/vmstat.c| 3 +++
   3 files changed, 15 insertions(+)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 54d74f6..54e891a 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -39,6 +39,9 @@ enum {
   MIGRATE_UNMOVABLE,
   MIGRATE_RECLAIMABLE,
   MIGRATE_MOVABLE,
+#ifdef CONFIG_MEMORY_MIRROR
+MIGRATE_MIRROR,
+#endif


I think
 MIGRATE_MIRROR_UNMOVABLE,
 MIGRATE_MIRROR_RECLAIMABLE,
 MIGRATE_MIRROR_MOVABLE, <== adding this may need discuss.
 MIGRATE_MIRROR_RESERVED,<== reserved pages should be 
maintained per mirrored/unmirrored.



Hi Kame,

You mean add 3 or 4 new migratetype?



Yes. But please check how large NR_MIGRATETYPE_BITS will become.
I think this will not have a big impact on x86-64.


should be added with the following fallback list.

/*
  * MIRROR page range is defined by firmware at boot. The range is limited
  * and is used only for kernel memory mirroring.
  */
[MIGRATE_UNMOVABLE_MIRROR]   = {MIGRATE_RECLAIMABLE_MIRROR, MIGRATE_RESERVE}
[MIGRATE_RECLAIMABLE_MIRROR] = {MIGRATE_UNMOVABLE_MIRROR, MIGRATE_RESERVE}



Why not like this:
{MIGRATE_RECLAIMABLE_MIRROR, MIGRATE_MIRROR_RESERVED, MIGRATE_RESERVE}



 My mistake.
[MIGRATE_UNMOVABLE_MIRROR]   = {MIGRATE_RECLAIMABLE_MIRROR, 
MIGRATE_RESERVE_MIRROR}
[MIGRATE_RECLAIMABLE_MIRROR] = {MIGRATE_UNMOVABLE_MIRROR, 
MIGRATE_RESERVE_MIRROR}

was my intention. This means mirrored memory and unmirrored memory are separated 
completely.
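
In mm/page_alloc.c terms, the fallback table could then look roughly like below;
the MIGRATE_*_MIRROR entries are only a sketch of the idea above, and the fallback
walk would also have to learn to stop at MIGRATE_RESERVE_MIRROR for the mirrored rows:

static int fallbacks[MIGRATE_TYPES][4] = {
	[MIGRATE_UNMOVABLE]          = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE, MIGRATE_RESERVE },
	[MIGRATE_RECLAIMABLE]        = { MIGRATE_UNMOVABLE, MIGRATE_MOVABLE, MIGRATE_RESERVE },
	[MIGRATE_MOVABLE]            = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, MIGRATE_RESERVE },
	/* mirrored types never fall back to unmirrored pageblocks */
	[MIGRATE_UNMOVABLE_MIRROR]   = { MIGRATE_RECLAIMABLE_MIRROR, MIGRATE_RESERVE_MIRROR },
	[MIGRATE_RECLAIMABLE_MIRROR] = { MIGRATE_UNMOVABLE_MIRROR, MIGRATE_RESERVE_MIRROR },
};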

But this should affect kswapd or other memory reclaim logic.

For example, kswapd stops when free pages are more than the high watermark.
But the case where mirrored/unmirrored pages are exhausted is not handled in this series.
You need some extra check in the memory reclaim logic if you go with a migratetype.
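
As a sketch of the extra check that would be needed (free_mirror_pages() and
high_wmark_mirror() are assumed helpers, nothing that exists today):

static bool zone_needs_reclaim(struct zone *zone)
{
	/* today's check: only the total free pages are compared */
	if (zone_page_state(zone, NR_FREE_PAGES) < high_wmark_pages(zone))
		return true;

	/* with a mirror migratetype, the mirrored pool must be checked separately */
	if (free_mirror_pages(zone) < high_wmark_mirror(zone))
		return true;

	return false;
}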




Then, we'll not lose the original information of "Reclaimable Pages".

One problem here is whether we should have MIGRATE_RESERVE_MIRROR.

If we never allow users to allocate mirrored memory, we should have 
MIGRATE_RESERVE_MIRROR.
But it seems to require much more code change to do that.

Creating a zone or adding attributes to zones is another design choice.



If we add a new zone, mirror_zone will span others, I'm worry about this
maybe have problems.


Yes, that's a problem. And zone id bits are a very limited resource.
(But memory reclaim logic can be unchanged.)

Anyway, I'd like to see your solution with the above changes first rather than adding 
zones.

Thanks,
-Kame



Re: [RFC v2 PATCH 7/8] mm: add the buddy system interface

2015-06-29 Thread Kamezawa Hiroyuki

On 2015/06/30 10:31, Xishi Qiu wrote:

On 2015/6/30 9:01, Kamezawa Hiroyuki wrote:


On 2015/06/30 8:11, Luck, Tony wrote:

@@ -814,7 +814,7 @@ int __init_memblock memblock_clear_hotplug(phys_addr_t 
base, phys_addr_t size)
*/
   int __init_memblock memblock_mark_mirror(phys_addr_t base, phys_addr_t size)
   {
-system_has_some_mirror = true;
+static_key_slow_inc(_has_mirror);

   return memblock_setclr_flag(base, size, 1, MEMBLOCK_MIRROR);
   }


This generates some WARN_ON noise when called from efi_find_mirror():



It seems jump_label_init() is called after memory initialization. 
(init/main.c::start_kernel())
So, it may be difficult to use static_key function for our purpose because
kernel memory allocation may occur before jump_label is ready.

Thanks,
-Kame



Hi Kame,

How about like this? Use static bool in bootmem, and use jump label in buddy 
system.
This means we use two variable to do it.



I think it can be done, but it should be done in a separate patch with enough
comments/changelog.
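
A rough sketch of the two-variable idea (names are illustrative): memblock code
only sets a plain bool, and the static key is bumped later, once jump_label_init()
has already run:

static bool memblock_has_mirror __initdata;
static struct static_key system_has_mirror = STATIC_KEY_INIT_FALSE;

int __init_memblock memblock_mark_mirror(phys_addr_t base, phys_addr_t size)
{
	memblock_has_mirror = true;	/* safe this early, no jump label involved */
	return memblock_setclr_flag(base, size, 1, MEMBLOCK_MIRROR);
}

static int __init mirror_static_key_init(void)
{
	if (memblock_has_mirror)
		static_key_slow_inc(&system_has_mirror);
	return 0;
}
early_initcall(mirror_static_key_init);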

Thanks,
-Kame





Re: [RFC v2 PATCH 7/8] mm: add the buddy system interface

2015-06-29 Thread Kamezawa Hiroyuki

On 2015/06/30 8:11, Luck, Tony wrote:

@@ -814,7 +814,7 @@ int __init_memblock memblock_clear_hotplug(phys_addr_t 
base, phys_addr_t size)
   */
  int __init_memblock memblock_mark_mirror(phys_addr_t base, phys_addr_t size)
  {
-   system_has_some_mirror = true;
+   static_key_slow_inc(_has_mirror);

return memblock_setclr_flag(base, size, 1, MEMBLOCK_MIRROR);
  }


This generates some WARN_ON noise when called from efi_find_mirror():



It seems jump_label_init() is called after memory initialization. 
(init/main.c::start_kernel())
So, it may be difficult to use static_key function for our purpose because
kernel memory allocation may occur before jump_label is ready.

Thanks,
-Kame


[0.00] e820: last_pfn = 0x7b800 max_arch_pfn = 0x4
[0.00] [ cut here ]
[0.00] WARNING: CPU: 0 PID: 0 at kernel/jump_label.c:61 
static_key_slow_inc+0x57/0xc0()
[0.00] static_key_slow_inc used before call to jump_label_init
[0.00] Modules linked in:

[0.00] CPU: 0 PID: 0 Comm: swapper Not tainted 4.1.0 #4
[0.00] Hardware name: Intel Corporation BRICKLAND/BRICKLAND, BIOS 
BRHSXSD1.86B.0065.R01.1505011640 05/01/2015
[0.00]   ee366a8dff38f745 81997d68 
816683b4
[0.00]   81997dc0 81997da8 
8107b0aa
[0.00]  81d48822 81f281a0 4000 
001fcb7a4000
[0.00] Call Trace:
[0.00]  [] dump_stack+0x45/0x57
[0.00]  [] warn_slowpath_common+0x8a/0xc0
[0.00]  [] warn_slowpath_fmt+0x55/0x70
[0.00]  [] ? memblock_add_range+0x175/0x19e
[0.00]  [] static_key_slow_inc+0x57/0xc0
[0.00]  [] memblock_mark_mirror+0x19/0x33
[0.00]  [] efi_find_mirror+0x59/0xdd
[0.00]  [] setup_arch+0x642/0xccf
[0.00]  [] ? early_idt_handler_array+0x120/0x120
[0.00]  [] ? printk+0x55/0x6b
[0.00]  [] ? early_idt_handler_array+0x120/0x120
[0.00]  [] start_kernel+0xe8/0x4eb
[0.00]  [] ? early_idt_handler_array+0x120/0x120
[0.00]  [] ? early_idt_handler_array+0x120/0x120
[0.00]  [] x86_64_start_reservations+0x2a/0x2c
[0.00]  [] x86_64_start_kernel+0x14c/0x16f
[0.00] ---[ end trace baa7fa0514e3bc58 ]---
[0.00] [ cut here ]










Re: [RFC v2 PATCH 4/8] mm: add mirrored memory to buddy system

2015-06-29 Thread Kamezawa Hiroyuki

On 2015/06/27 11:25, Xishi Qiu wrote:

Before free bootmem, set mirrored pageblock's migratetype to MIGRATE_MIRROR, so
they could free to buddy system's MIGRATE_MIRROR list.
When set reserved memory, skip the mirrored memory.

Signed-off-by: Xishi Qiu 
---
  include/linux/memblock.h |  3 +++
  mm/memblock.c| 21 +
  mm/nobootmem.c   |  3 +++
  mm/page_alloc.c  |  3 +++
  4 files changed, 30 insertions(+)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 97f71ca..53be030 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -81,6 +81,9 @@ int memblock_mark_hotplug(phys_addr_t base, phys_addr_t size);
  int memblock_clear_hotplug(phys_addr_t base, phys_addr_t size);
  int memblock_mark_mirror(phys_addr_t base, phys_addr_t size);
  ulong choose_memblock_flags(void);
+#ifdef CONFIG_MEMORY_MIRROR
+void memblock_mark_migratemirror(void);
+#endif

  /* Low level functions */
  int memblock_add_range(struct memblock_type *type,
diff --git a/mm/memblock.c b/mm/memblock.c
index 7612876..0d0b210 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -19,6 +19,7 @@
  #include 
  #include 
  #include 
+#include 

  #include 
  #include 
@@ -818,6 +819,26 @@ int __init_memblock memblock_mark_mirror(phys_addr_t base, 
phys_addr_t size)
return memblock_setclr_flag(base, size, 1, MEMBLOCK_MIRROR);
  }

+#ifdef CONFIG_MEMORY_MIRROR
+void __init_memblock memblock_mark_migratemirror(void)
+{
+   unsigned long start_pfn, end_pfn, pfn;
+   int i, node;
+   struct page *page;
+
+   printk(KERN_DEBUG "Mirrored memory:\n");
+   for_each_mirror_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn,
+   &node) {
+   printk(KERN_DEBUG "  node %3d: [mem %#010llx-%#010llx]\n",
+   node, PFN_PHYS(start_pfn), PFN_PHYS(end_pfn) - 1);
+   for (pfn = start_pfn; pfn < end_pfn;
+   pfn += pageblock_nr_pages) {
+   page = pfn_to_page(pfn);
+   set_pageblock_migratetype(page, MIGRATE_MIRROR);
+   }
+   }
+}
+#endif

  /**
   * __next__mem_range - next function for for_each_free_mem_range() etc.
diff --git a/mm/nobootmem.c b/mm/nobootmem.c
index 5258386..31aa6d4 100644
--- a/mm/nobootmem.c
+++ b/mm/nobootmem.c
@@ -129,6 +129,9 @@ static unsigned long __init free_low_memory_core_early(void)
u64 i;

memblock_clear_hotplug(0, -1);
+#ifdef CONFIG_MEMORY_MIRROR
+   memblock_mark_migratemirror();
+#endif

for_each_free_mem_range(i, NUMA_NO_NODE, MEMBLOCK_NONE, &start, &end,
NULL)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6e4d79f..aea78a5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4118,6 +4118,9 @@ static void setup_zone_migrate_reserve(struct zone *zone)

block_migratetype = get_pageblock_migratetype(page);

+   if (is_migrate_mirror(block_migratetype))
+   continue;
+


If the mirrored area does not have reserved memory, this will break the page
allocator's logic.

I think both the mirrored and unmirrored ranges should have a reserved area.

Thanks,
-Kame


/* Only test what is necessary when the reserves are not met */
if (reserve > 0) {
/*






Re: [RFC v2 PATCH 2/8] mm: introduce MIGRATE_MIRROR to manage the mirrored pages

2015-06-29 Thread Kamezawa Hiroyuki

On 2015/06/27 11:24, Xishi Qiu wrote:

This patch introduces a new migratetype called "MIGRATE_MIRROR", it is used to
allocate mirrored pages.
When cat /proc/pagetypeinfo, you can see the count of free mirrored blocks.

Signed-off-by: Xishi Qiu 


My fear about this approach is that it may break something existing.

Now, when we add the MIGRATE_MIRROR type, we'll hide attributes of pageblocks such as
MIGRATE_UNMOVABLE, MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE.

Logically, the MIRROR attribute is independent from page mobility, and this
overwrite will lose some information.

Then,


---
  include/linux/mmzone.h | 9 +
  mm/page_alloc.c| 3 +++
  mm/vmstat.c| 3 +++
  3 files changed, 15 insertions(+)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 54d74f6..54e891a 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -39,6 +39,9 @@ enum {
MIGRATE_UNMOVABLE,
MIGRATE_RECLAIMABLE,
MIGRATE_MOVABLE,
+#ifdef CONFIG_MEMORY_MIRROR
+   MIGRATE_MIRROR,
+#endif


I think
MIGRATE_MIRROR_UNMOVABLE,
MIGRATE_MIRROR_RECLAIMABLE,
MIGRATE_MIRROR_MOVABLE, <== adding this may need discuss.
MIGRATE_MIRROR_RESERVED,<== reserved pages should be maintained 
per mirrored/unmirrored.

should be added with the following fallback list.

/*
 * MIRROR page range is defined by firmware at boot. The range is limited
 * and is used only for kernel memory mirroring.
 */
[MIGRATE_UNMOVABLE_MIRROR]   = {MIGRATE_RECLAIMABLE_MIRROR, MIGRATE_RESERVE}
[MIGRATE_RECLAIMABLE_MIRROR] = {MIGRATE_UNMOVABLE_MIRROR, MIGRATE_RESERVE}

Then, we'll not lose the original information of "Reclaimable Pages".

One problem here is whether we should have MIGRATE_RESERVE_MIRROR.

If we never allow users to allocate mirrored memory, we should have 
MIGRATE_RESERVE_MIRROR.
But it seems to require much more code change to do that.

Creating a zone or adding attributes to zones is another design choice.

Anyway, your patch doesn't take care of the reserved memory calculation at this 
point.
Please check setup_zone_migrate_reserve(). That will be a problem.

Thanks,
-Kame


MIGRATE_PCPTYPES,   /* the number of types on the pcp lists */
MIGRATE_RESERVE = MIGRATE_PCPTYPES,
  #ifdef CONFIG_CMA
@@ -69,6 +72,12 @@ enum {
  #  define is_migrate_cma(migratetype) false
  #endif

+#ifdef CONFIG_MEMORY_MIRROR
+#  define is_migrate_mirror(migratetype) unlikely((migratetype) == 
MIGRATE_MIRROR)
+#else
+#  define is_migrate_mirror(migratetype) false
+#endif
+
  #define for_each_migratetype_order(order, type) \
for (order = 0; order < MAX_ORDER; order++) \
for (type = 0; type < MIGRATE_TYPES; type++)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ebffa0e..6e4d79f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3216,6 +3216,9 @@ static void show_migration_types(unsigned char type)
[MIGRATE_UNMOVABLE] = 'U',
[MIGRATE_RECLAIMABLE]   = 'E',
[MIGRATE_MOVABLE]   = 'M',
+#ifdef CONFIG_MEMORY_MIRROR
+   [MIGRATE_MIRROR]= 'O',
+#endif
[MIGRATE_RESERVE]   = 'R',
  #ifdef CONFIG_CMA
[MIGRATE_CMA]   = 'C',
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 4f5cd97..d0323e0 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -901,6 +901,9 @@ static char * const migratetype_names[MIGRATE_TYPES] = {
"Unmovable",
"Reclaimable",
"Movable",
+#ifdef CONFIG_MEMORY_MIRROR
+   "Mirror",
+#endif
"Reserve",
  #ifdef CONFIG_CMA
"CMA",






Re: [RFC v2 PATCH 1/8] mm: add a new config to manage the code

2015-06-29 Thread Kamezawa Hiroyuki

On 2015/06/27 11:23, Xishi Qiu wrote:

This patch introduces a new config called "CONFIG_ACPI_MIRROR_MEMORY", set it

 CONFIG_MEMORY_MIRROR

off by default.

Signed-off-by: Xishi Qiu 
---
  mm/Kconfig | 8 
  1 file changed, 8 insertions(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index 390214d..c40bb8b 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -200,6 +200,14 @@ config MEMORY_HOTREMOVE
depends on MEMORY_HOTPLUG && ARCH_ENABLE_MEMORY_HOTREMOVE
depends on MIGRATION

+config MEMORY_MIRROR


  In the following patches, you use CONFIG_MEMORY_MIRROR.

I think the name is too generic; besides, it depends on ACPI.
But I'm not sure address based memory mirror is planned on other platforms.

So, hmm. How about dividing the config into 2 parts like attached ? (just an 
example)

Thanks,
-Kame
>From 88213b0f76e2f603c5a38690cbd85a4df1e646ba Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki 
Date: Mon, 29 Jun 2015 15:35:47 +0900
Subject: [PATCH] add a new config option for memory mirror

Add a new config option "CONFIG_MEMORY_MIRROR" for kernel assisted
memory mirroring.

In the UEFI 2.5 spec, address based memory mirror is defined, and it allows
the system to create a partial memory mirror.

The feature ensures important (kernel) memory is mirrored by using
the address based memory mirror.

Now this depends on cpu architecture: Haswell? Broadwell?
---
 arch/x86/Kconfig | 6 ++
 mm/Kconfig   | 9 +
 2 files changed, 15 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index e33e01b..56f17df 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -596,6 +596,12 @@ config X86_SUPPORTS_MEMORY_FAILURE
depends on X86_64 || !SPARSEMEM
select ARCH_SUPPORTS_MEMORY_FAILURE
 
+config X86_SUPPORTS_MEMORY_MIRROR
+   def_bool y
+   # UEFI 2.5spec. address based memory mirror, supported only after XXX
+   depends on X86_64 && ARCH_SUPPORTS_MEMORY_FAILURE
+   select ARCH_MEMORY_MIRROR
+
 config STA2X11
bool "STA2X11 Companion Chip Support"
depends on X86_32_NON_STANDARD && PCI
diff --git a/mm/Kconfig b/mm/Kconfig
index b3a60ee..e14dc2d 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -200,6 +200,15 @@ config MEMORY_HOTREMOVE
depends on MEMORY_HOTPLUG && ARCH_ENABLE_MEMORY_HOTREMOVE
depends on MIGRATION
 
+config MEMORY_MIRROR
+   bool "Address range mirroring support"
+   depends on ARCH_MEMORY_MIRROR
+   default n
+   help
+ This feature allows the kernel to assist address based memory
+ mirror supported by architecture/firmware, and place some types
+ of memory (especially, kernel memory) into the mirrored range.
+
 #
 # If we have space for more page flags then we can enable additional
 # optimizations and functionality.
-- 
1.9.3



Re: [RFC v2 PATCH 4/8] mm: add mirrored memory to buddy system

2015-06-29 Thread Kamezawa Hiroyuki

On 2015/06/27 11:25, Xishi Qiu wrote:

Before free bootmem, set mirrored pageblock's migratetype to MIGRATE_MIRROR, so
they could free to buddy system's MIGRATE_MIRROR list.
When set reserved memory, skip the mirrored memory.

Signed-off-by: Xishi Qiu qiuxi...@huawei.com
---
  include/linux/memblock.h |  3 +++
  mm/memblock.c| 21 +
  mm/nobootmem.c   |  3 +++
  mm/page_alloc.c  |  3 +++
  4 files changed, 30 insertions(+)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 97f71ca..53be030 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -81,6 +81,9 @@ int memblock_mark_hotplug(phys_addr_t base, phys_addr_t size);
  int memblock_clear_hotplug(phys_addr_t base, phys_addr_t size);
  int memblock_mark_mirror(phys_addr_t base, phys_addr_t size);
  ulong choose_memblock_flags(void);
+#ifdef CONFIG_MEMORY_MIRROR
+void memblock_mark_migratemirror(void);
+#endif

  /* Low level functions */
  int memblock_add_range(struct memblock_type *type,
diff --git a/mm/memblock.c b/mm/memblock.c
index 7612876..0d0b210 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -19,6 +19,7 @@
  #include linux/debugfs.h
  #include linux/seq_file.h
  #include linux/memblock.h
+#include linux/page-isolation.h

  #include asm-generic/sections.h
  #include linux/io.h
@@ -818,6 +819,26 @@ int __init_memblock memblock_mark_mirror(phys_addr_t base, 
phys_addr_t size)
return memblock_setclr_flag(base, size, 1, MEMBLOCK_MIRROR);
  }

+#ifdef CONFIG_MEMORY_MIRROR
+void __init_memblock memblock_mark_migratemirror(void)
+{
+   unsigned long start_pfn, end_pfn, pfn;
+   int i, node;
+   struct page *page;
+
+	printk(KERN_DEBUG "Mirrored memory:\n");
+	for_each_mirror_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn,
+			&node) {
+		printk(KERN_DEBUG "  node %3d: [mem %#010llx-%#010llx]\n",
+			node, PFN_PHYS(start_pfn), PFN_PHYS(end_pfn) - 1);
+		for (pfn = start_pfn; pfn < end_pfn;
+   pfn += pageblock_nr_pages) {
+   page = pfn_to_page(pfn);
+   set_pageblock_migratetype(page, MIGRATE_MIRROR);
+   }
+   }
+}
+#endif

  /**
   * __next__mem_range - next function for for_each_free_mem_range() etc.
diff --git a/mm/nobootmem.c b/mm/nobootmem.c
index 5258386..31aa6d4 100644
--- a/mm/nobootmem.c
+++ b/mm/nobootmem.c
@@ -129,6 +129,9 @@ static unsigned long __init free_low_memory_core_early(void)
u64 i;

memblock_clear_hotplug(0, -1);
+#ifdef CONFIG_MEMORY_MIRROR
+   memblock_mark_migratemirror();
+#endif

	for_each_free_mem_range(i, NUMA_NO_NODE, MEMBLOCK_NONE, &start, &end,
				NULL)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6e4d79f..aea78a5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4118,6 +4118,9 @@ static void setup_zone_migrate_reserve(struct zone *zone)

block_migratetype = get_pageblock_migratetype(page);

+   if (is_migrate_mirror(block_migratetype))
+   continue;
+


If the mirrored area has no reserved memory, this will break the page
allocator's logic.

I think both the mirrored and unmirrored ranges should have a reserved area.

Thanks,
-Kame


/* Only test what is necessary when the reserves are not met */
	if (reserve > 0) {
/*






Re: [RFC v2 PATCH 2/8] mm: introduce MIGRATE_MIRROR to manage the mirrored pages

2015-06-29 Thread Kamezawa Hiroyuki

On 2015/06/27 11:24, Xishi Qiu wrote:

This patch introduces a new migratetype called MIGRATE_MIRROR, it is used to
allocate mirrored pages.
When cat /proc/pagetypeinfo, you can see the count of free mirrored blocks.

Signed-off-by: Xishi Qiu qiuxi...@huawei.com


My fear about this approach is that it may break something existing.

Now, when we add the MIGRATE_MIRROR type, we'll hide the pageblocks' attributes
MIGRATE_UNMOVABLE, MIGRATE_RECLAIMABLE, and MIGRATE_MOVABLE.

Logically, the MIRROR attribute is independent of page mobility, and this
overwrite will lose some information.

Then,


---
  include/linux/mmzone.h | 9 +
  mm/page_alloc.c| 3 +++
  mm/vmstat.c| 3 +++
  3 files changed, 15 insertions(+)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 54d74f6..54e891a 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -39,6 +39,9 @@ enum {
MIGRATE_UNMOVABLE,
MIGRATE_RECLAIMABLE,
MIGRATE_MOVABLE,
+#ifdef CONFIG_MEMORY_MIRROR
+   MIGRATE_MIRROR,
+#endif


I think
MIGRATE_MIRROR_UNMOVABLE,
MIGRATE_MIRROR_RECLAIMABLE,
MIGRATE_MIRROR_MOVABLE,     == adding this may need discussion.
MIGRATE_MIRROR_RESERVED,    == reserved pages should be maintained
                               per mirrored/unmirrored.

should be added with the following fallback list.

/*
 * MIRROR page range is defined by firmware at boot. The range is limited
 * and is used only for kernel memory mirroring.
 */
[MIGRATE_UNMOVABLE_MIRROR]   = {MIGRATE_RECLAIMABLE_MIRROR, MIGRATE_RESERVE}
[MIGRATE_RECLAIMABLE_MIRROR] = {MIGRATE_UNMOVABLE_MIRROR, MIGRATE_RESERVE}

Then, we'll not lose the original information of Reclaimable pages.
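
To make it concrete, the fallbacks[] table in mm/page_alloc.c might end up
looking like the sketch below (the *_MIRROR names are the ones proposed above,
not existing code, and the existing rows are abbreviated):

static int fallbacks[MIGRATE_TYPES][4] = {
	[MIGRATE_UNMOVABLE]          = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE, MIGRATE_RESERVE },
	[MIGRATE_RECLAIMABLE]        = { MIGRATE_UNMOVABLE, MIGRATE_MOVABLE, MIGRATE_RESERVE },
	[MIGRATE_MOVABLE]            = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, MIGRATE_RESERVE },
	/* mirrored kernel allocations never fall back to unmirrored memory */
	[MIGRATE_UNMOVABLE_MIRROR]   = { MIGRATE_RECLAIMABLE_MIRROR, MIGRATE_RESERVE },
	[MIGRATE_RECLAIMABLE_MIRROR] = { MIGRATE_UNMOVABLE_MIRROR, MIGRATE_RESERVE },
};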

One problem here is whether we should have MIGRATE_RESERVE_MIRROR.

If we never allow users to allocate mirrored memory, we should have 
MIGRATE_RESERVE_MIRROR.
But it seems to require much more code change to do that.

Creating a zone or adding an attribute to zones is another design choice.

Anyway, your patch doesn't take care of the reserved memory calculation at this
point.
Please check setup_zone_migrate_reserve(); that will be a problem.

Thanks,
-Kame


MIGRATE_PCPTYPES,   /* the number of types on the pcp lists */
MIGRATE_RESERVE = MIGRATE_PCPTYPES,
  #ifdef CONFIG_CMA
@@ -69,6 +72,12 @@ enum {
  #  define is_migrate_cma(migratetype) false
  #endif

+#ifdef CONFIG_MEMORY_MIRROR
+#  define is_migrate_mirror(migratetype) unlikely((migratetype) == 
MIGRATE_MIRROR)
+#else
+#  define is_migrate_mirror(migratetype) false
+#endif
+
  #define for_each_migratetype_order(order, type) \
	for (order = 0; order < MAX_ORDER; order++) \
		for (type = 0; type < MIGRATE_TYPES; type++)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ebffa0e..6e4d79f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3216,6 +3216,9 @@ static void show_migration_types(unsigned char type)
[MIGRATE_UNMOVABLE] = 'U',
[MIGRATE_RECLAIMABLE]   = 'E',
[MIGRATE_MOVABLE]   = 'M',
+#ifdef CONFIG_MEMORY_MIRROR
+   [MIGRATE_MIRROR]= 'O',
+#endif
[MIGRATE_RESERVE]   = 'R',
  #ifdef CONFIG_CMA
[MIGRATE_CMA]   = 'C',
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 4f5cd97..d0323e0 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -901,6 +901,9 @@ static char * const migratetype_names[MIGRATE_TYPES] = {
	"Unmovable",
	"Reclaimable",
	"Movable",
+#ifdef CONFIG_MEMORY_MIRROR
+	"Mirror",
+#endif
	"Reserve",
  #ifdef CONFIG_CMA
	"CMA",






Re: [RFC v2 PATCH 7/8] mm: add the buddy system interface

2015-06-29 Thread Kamezawa Hiroyuki

On 2015/06/30 10:31, Xishi Qiu wrote:

On 2015/6/30 9:01, Kamezawa Hiroyuki wrote:


On 2015/06/30 8:11, Luck, Tony wrote:

@@ -814,7 +814,7 @@ int __init_memblock memblock_clear_hotplug(phys_addr_t 
base, phys_addr_t size)
*/
   int __init_memblock memblock_mark_mirror(phys_addr_t base, phys_addr_t size)
   {
-system_has_some_mirror = true;
+static_key_slow_inc(&system_has_mirror);

   return memblock_setclr_flag(base, size, 1, MEMBLOCK_MIRROR);
   }


This generates some WARN_ON noise when called from efi_find_mirror():



It seems jump_label_init() is called after memory initialization. 
(init/main.c::start_kernel())
So, it may be difficult to use the static_key functions for our purpose because
kernel memory allocation may occur before jump_label is ready.

Thanks,
-Kame



Hi Kame,

How about like this? Use a static bool in bootmem, and use a jump label in the
buddy system.
This means we use two variables to do it.



I think it can be done, but it should be done in a separate patch with enough
comments/changelog.
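
For illustration, the two-variable approach could look roughly like this sketch
(assumed names only; memblock_mark_mirror() runs from setup_arch() long before
jump_label_init(), so only a plain bool can be set there, and the static key is
bumped later from an initcall):

static bool mirror_enabled_boot __read_mostly;
struct static_key memory_mirror_enabled = STATIC_KEY_INIT_FALSE;

int __init_memblock memblock_mark_mirror(phys_addr_t base, phys_addr_t size)
{
	mirror_enabled_boot = true;	/* too early for static_key_slow_inc() */
	return memblock_setclr_flag(base, size, 1, MEMBLOCK_MIRROR);
}

static int __init memory_mirror_enable_key(void)
{
	if (mirror_enabled_boot)
		static_key_slow_inc(&memory_mirror_enabled);
	return 0;
}
early_initcall(memory_mirror_enable_key);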

Thanks,
-Kame





Re: [RFC v2 PATCH 7/8] mm: add the buddy system interface

2015-06-29 Thread Kamezawa Hiroyuki

On 2015/06/30 8:11, Luck, Tony wrote:

@@ -814,7 +814,7 @@ int __init_memblock memblock_clear_hotplug(phys_addr_t 
base, phys_addr_t size)
   */
  int __init_memblock memblock_mark_mirror(phys_addr_t base, phys_addr_t size)
  {
-   system_has_some_mirror = true;
+   static_key_slow_inc(&system_has_mirror);

return memblock_setclr_flag(base, size, 1, MEMBLOCK_MIRROR);
  }


This generates some WARN_ON noise when called from efi_find_mirror():



It seems jump_label_init() is called after memory initialization. 
(init/main.c::start_kernel())
So, it may be difficult to use the static_key functions for our purpose because
kernel memory allocation may occur before jump_label is ready.

Thanks,
-Kame


[0.00] e820: last_pfn = 0x7b800 max_arch_pfn = 0x4
[0.00] [ cut here ]
[0.00] WARNING: CPU: 0 PID: 0 at kernel/jump_label.c:61 
static_key_slow_inc+0x57/0xc0()
[0.00] static_key_slow_inc used before call to jump_label_init
[0.00] Modules linked in:

[0.00] CPU: 0 PID: 0 Comm: swapper Not tainted 4.1.0 #4
[0.00] Hardware name: Intel Corporation BRICKLAND/BRICKLAND, BIOS 
BRHSXSD1.86B.0065.R01.1505011640 05/01/2015
[0.00]   ee366a8dff38f745 81997d68 
816683b4
[0.00]   81997dc0 81997da8 
8107b0aa
[0.00]  81d48822 81f281a0 4000 
001fcb7a4000
[0.00] Call Trace:
[0.00]  [816683b4] dump_stack+0x45/0x57
[0.00]  [8107b0aa] warn_slowpath_common+0x8a/0xc0
[0.00]  [8107b135] warn_slowpath_fmt+0x55/0x70
[0.00]  [81660273] ? memblock_add_range+0x175/0x19e
[0.00]  [81176c57] static_key_slow_inc+0x57/0xc0
[0.00]  [81660655] memblock_mark_mirror+0x19/0x33
[0.00]  [81b12c18] efi_find_mirror+0x59/0xdd
[0.00]  [81afb8a6] setup_arch+0x642/0xccf
[0.00]  [81af3120] ? early_idt_handler_array+0x120/0x120
[0.00]  [81663480] ? printk+0x55/0x6b
[0.00]  [81af3120] ? early_idt_handler_array+0x120/0x120
[0.00]  [81af3d93] start_kernel+0xe8/0x4eb
[0.00]  [81af3120] ? early_idt_handler_array+0x120/0x120
[0.00]  [81af3120] ? early_idt_handler_array+0x120/0x120
[0.00]  [81af35ee] x86_64_start_reservations+0x2a/0x2c
[0.00]  [81af373c] x86_64_start_kernel+0x14c/0x16f
[0.00] ---[ end trace baa7fa0514e3bc58 ]---
[0.00] [ cut here ]










Re: [RFC PATCH 10/12] mm: add the buddy system interface

2015-06-26 Thread Kamezawa Hiroyuki

On 2015/06/26 10:43, Xishi Qiu wrote:

On 2015/6/26 7:54, Kamezawa Hiroyuki wrote:


On 2015/06/25 18:44, Xishi Qiu wrote:

On 2015/6/10 11:06, Kamezawa Hiroyuki wrote:


On 2015/06/09 19:04, Xishi Qiu wrote:

On 2015/6/9 15:12, Kamezawa Hiroyuki wrote:


On 2015/06/04 22:04, Xishi Qiu wrote:

Add the buddy system interface for address range mirroring feature.
Allocate mirrored pages in MIGRATE_MIRROR list. If there is no mirrored pages
left, use other types pages.

Signed-off-by: Xishi Qiu 
---
 mm/page_alloc.c | 40 +++-
 1 file changed, 39 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d4d2066..0fb55288 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -599,6 +599,26 @@ static inline bool is_mirror_pfn(unsigned long pfn)

 return false;
 }
+
+static inline bool change_to_mirror(gfp_t gfp_flags, int high_zoneidx)
+{
+/*
+ * Do not alloc mirrored memory below 4G, because 0-4G is
+ * all mirrored by default, and the list is always empty.
+ */
+if (high_zoneidx < ZONE_NORMAL)
+return false;
+
+/* Alloc mirrored memory for only kernel */
+if (gfp_flags & __GFP_MIRROR)
+return true;


GFP_KERNEL itself should imply mirror, I think.



Hi Kame,

How about like this: #define GFP_KERNEL (__GFP_WAIT | __GFP_IO | __GFP_FS | 
__GFP_MIRROR) ?



Hmm, it cannot cover GFP_ATOMIC et al.

I guess, mirrored memory should be allocated if !__GFP_HIGHMEM or !__GFP_MOVABLE



Hi Kame,

Can we distinguish allocations from user or kernel only by GFP flags?



Allocation from user and file caches are now *always* done with __GFP_MOVABLE.

By this, pages will be allocated from MIGRATE_MOVABLE migration type.
MOVABLE migration type means it can
be the target for page compaction or memory-hot-remove.

Thanks,
-Kame



So if we want all kernel memory allocated from mirror, how about a change like
this?
__alloc_pages_nodemask()
  gfpflags_to_migratetype()
    if (!(gfp_mask & __GFP_MOVABLE))
      return MIGRATE_MIRROR


Maybe using it with a jump label can reduce the performance impact.
==
static inline bool memory_mirror_enabled(void)
{
return static_key_false(&memory_mirror_enabled);
}



gfpflags_to_migratetype()
  if (memory_mirror_enabled()) { /* We want to mirror all unmovable pages */
  if (!(gfp_mask & __GFP_MOVABLE))
   return MIGRATE_MIRROR
  }
==

BTW, I think the current memory compaction code scans ranges of the MOVABLE
migrate type.
So, if you use a migration type other than MOVABLE for user pages, you may see
page fragmentation. If you want to expand this MIRROR to user pages, please check
mm/compaction.c.
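
For reference, the combined check being discussed would look roughly like the
sketch below (not the posted patch; it assumes the static key and the
MIGRATE_MIRROR type from the snippets above):

static inline int gfpflags_to_migratetype(const gfp_t gfp_flags)
{
	/* route every non-movable (i.e. kernel) allocation to mirrored memory */
	if (static_key_false(&memory_mirror_enabled) &&
	    !(gfp_flags & __GFP_MOVABLE))
		return MIGRATE_MIRROR;

	/* existing grouping by mobility */
	return (((gfp_flags & __GFP_MOVABLE) != 0) << 1) |
		((gfp_flags & __GFP_RECLAIMABLE) != 0);
}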


Thanks,
-Kame






Re: [RFC PATCH 10/12] mm: add the buddy system interface

2015-06-25 Thread Kamezawa Hiroyuki

On 2015/06/25 18:44, Xishi Qiu wrote:

On 2015/6/10 11:06, Kamezawa Hiroyuki wrote:


On 2015/06/09 19:04, Xishi Qiu wrote:

On 2015/6/9 15:12, Kamezawa Hiroyuki wrote:


On 2015/06/04 22:04, Xishi Qiu wrote:

Add the buddy system interface for address range mirroring feature.
Allocate mirrored pages in MIGRATE_MIRROR list. If there is no mirrored pages
left, use other types pages.

Signed-off-by: Xishi Qiu 
---
mm/page_alloc.c | 40 +++-
1 file changed, 39 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d4d2066..0fb55288 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -599,6 +599,26 @@ static inline bool is_mirror_pfn(unsigned long pfn)

return false;
}
+
+static inline bool change_to_mirror(gfp_t gfp_flags, int high_zoneidx)
+{
+/*
+ * Do not alloc mirrored memory below 4G, because 0-4G is
+ * all mirrored by default, and the list is always empty.
+ */
+if (high_zoneidx < ZONE_NORMAL)
+return false;
+
+/* Alloc mirrored memory for only kernel */
+if (gfp_flags & __GFP_MIRROR)
+return true;


GFP_KERNEL itself should imply mirror, I think.



Hi Kame,

How about like this: #define GFP_KERNEL (__GFP_WAIT | __GFP_IO | __GFP_FS | 
__GFP_MIRROR) ?



Hmm, it cannot cover GFP_ATOMIC et al.

I guess, mirrored memory should be allocated if !__GFP_HIGHMEM or !__GFP_MOVABLE



Hi Kame,

Can we distinguish allocations from user or kernel only by GFP flags?



Allocation from user and file caches are now *always* done with __GFP_MOVABLE.

By this, pages will be allocated from MIGRATE_MOVABLE migration type.
MOVABLE migration type means it can
be the target for page compaction or memory-hot-remove.

Thanks,
-Kame









Re: [RFC PATCH 10/12] mm: add the buddy system interface

2015-06-15 Thread Kamezawa Hiroyuki

On 2015/06/16 2:20, Luck, Tony wrote:

On Mon, Jun 15, 2015 at 05:47:27PM +0900, Kamezawa Hiroyuki wrote:

So, there are 3 ideas.

  (1) kernel only from MIRROR / user only from MOVABLE (Tony)
  (2) kernel only from MIRROR / user from MOVABLE + MIRROR(ASAP)  (AKPM 
suggested)
  This makes use of the fact MOVABLE memory is reclaimable but Tony pointed 
out
  the memory reclaim can be critical for GFP_ATOMIC.
  (3) kernel only from MIRROR / user from MOVABLE, special user from MIRROR 
(Xishi)

2 Implementation ideas.
   - creating ZONE
   - creating new alloation attribute

I'm not convinced we need some new structure in mm. Isn't it good to use
ZONE_MOVABLE for not-mirrored memory?
Then, disable fallback from ZONE_MOVABLE -> ZONE_NORMAL for (1) and (3).


We might need to rename it ... right now the memory hotplug
people use ZONE_MOVABLE to indicate regions of physical memory
that can be removed from the system.  I'm wondering whether
people will want systems that have both removable and mirrored
areas?  Then we have four attribute combinations:

mirror=no  removable=no  - prefer to use for user, could use for kernel if we 
run out of mirror
mirror=no  removable=yes - can only be used for user (kernel allocation makes 
it not-removable)
mirror=yes removable=no  - use for kernel, possibly for special users if we 
define some interface
mirror=yes removable=yes - must not use for kernel ... would have to give to 
user ... seems like a bad idea to configure a system this way



Thank you for the clarification. I see the "mirror=no, removable=no" case may require a
new name.

IMHO, the value of Address-Based-Memory-Mirror is that users can protect their 
system's
important functions without using full-memory mirror. So, I feel thinking
"mirror=no, removable=no" just makes our discussion/implemenation complex 
without real
user value.

Shouldn't we start with just thinking of the 2 cases
 mirror=no  removable=yes
 mirror=yes removable=no
?

And then, if the naming is problem, alias name can be added.

Thanks,
-Kame








Re: [RFC PATCH 10/12] mm: add the buddy system interface

2015-06-15 Thread Kamezawa Hiroyuki

On 2015/06/11 5:40, Luck, Tony wrote:

I guess, mirrored memory should be allocated if !__GFP_HIGHMEM or !__GFP_MOVABLE


HIGHMEM shouldn't matter - partial memory mirror only makes any sense on X86_64 
systems ... 32-bit kernels
don't even boot on systems with 64GB, and the minimum rational configuration 
for a machine that supports
mirror is 128GB (4 cpu sockets * 2 memory controller per socket * 4 channels 
per controller * 4GB DIMM ...
leaving any channels empty likely leaves you short of memory bandwidth for 
these high core count processors).

MOVABLE is mostly the opposite of MIRROR - we never want to fill a kernel 
allocation from a MOVABLE page. I
want all kernel allocations to be from MIRROR.



So, there are 3 ideas.

 (1) kernel only from MIRROR / user only from MOVABLE (Tony)
 (2) kernel only from MIRROR / user from MOVABLE + MIRROR(ASAP)  (AKPM 
suggested)
 This makes use of the fact MOVABLE memory is reclaimable but Tony pointed 
out
 the memory reclaim can be critical for GFP_ATOMIC.
 (3) kernel only from MIRROR / user from MOVABLE, special user from MIRROR 
(Xishi)

2 Implementation ideas.
  - creating ZONE
  - creating new alloation attribute

I'm not convinced we need some new structure in mm. Isn't it good to use
ZONE_MOVABLE for not-mirrored memory?
Then, disable fallback from ZONE_MOVABLE -> ZONE_NORMAL for (1) and (3).

Thanks,
-Kame




Re: [RFC PATCH 08/12] mm: use mirrorable to switch allocate mirrored memory

2015-06-09 Thread Kamezawa Hiroyuki

On 2015/06/09 19:09, Xishi Qiu wrote:

On 2015/6/9 15:06, Kamezawa Hiroyuki wrote:


On 2015/06/04 22:02, Xishi Qiu wrote:

Add a new interface in path /proc/sys/vm/mirrorable. When set to 1, it means
we should allocate mirrored memory for both user and kernel processes.

Signed-off-by: Xishi Qiu 


I can't see why we need this switch. If this is set, will all GFP_HIGHUSER
allocations use mirrored memory?

Or will you add special MMAP/madvise flag to use mirrored memory ?



Hi Kame,

Yes,

MMAP/madvise
-> add VM_MIRROR
-> add GFP_MIRROR
-> use MIGRATE_MIRROR list to alloc mirrored pages

So user can use mirrored memory. What do you think?



I see. Please explain it (your final plan) in the patch description or in the
cover letter of the patches.
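
For what it's worth, what I understood from the chain above is something like
this sketch (VM_MIRROR and __GFP_MIRROR are the names proposed in this series,
nothing that exists today):

static gfp_t anon_fault_gfp(struct vm_area_struct *vma)
{
	gfp_t gfp = GFP_HIGHUSER_MOVABLE;

	if (vma->vm_flags & VM_MIRROR)	/* set via the new mmap/madvise flag */
		gfp |= __GFP_MIRROR;	/* allocate from the MIGRATE_MIRROR list */
	return gfp;
}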

Thanks,
-Kame




Re: [RFC PATCH 01/12] mm: add a new config to manage the code

2015-06-09 Thread Kamezawa Hiroyuki

On 2015/06/09 19:10, Xishi Qiu wrote:

On 2015/6/9 14:44, Kamezawa Hiroyuki wrote:


On 2015/06/04 21:56, Xishi Qiu wrote:

This patch introduces a new config called "CONFIG_ACPI_MIRROR_MEMORY"; it is
used to turn the feature on/off.

Signed-off-by: Xishi Qiu 
---
   mm/Kconfig | 8 
   1 file changed, 8 insertions(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index 390214d..4f2a726 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -200,6 +200,14 @@ config MEMORY_HOTREMOVE
   depends on MEMORY_HOTPLUG && ARCH_ENABLE_MEMORY_HOTREMOVE
   depends on MIGRATION

+config MEMORY_MIRROR
+bool "Address range mirroring support"
+depends on X86 && NUMA
+default y
+help
+  This feature depends on hardware and firmware support.
+  ACPI or EFI records the mirror info.


default y...no runtime influence when the user doesn't use memory mirror ?



It is a new feature, so how about like this: default y -> n?



It's okay with me. But it's better to check the performance impact before merging,
because you modified core memory management code.

Thanks,
-Kame
 





Re: [RFC PATCH 10/12] mm: add the buddy system interface

2015-06-09 Thread Kamezawa Hiroyuki

On 2015/06/09 19:04, Xishi Qiu wrote:

On 2015/6/9 15:12, Kamezawa Hiroyuki wrote:


On 2015/06/04 22:04, Xishi Qiu wrote:

Add the buddy system interface for address range mirroring feature.
Allocate mirrored pages in MIGRATE_MIRROR list. If there is no mirrored pages
left, use other types pages.

Signed-off-by: Xishi Qiu 
---
   mm/page_alloc.c | 40 +++-
   1 file changed, 39 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d4d2066..0fb55288 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -599,6 +599,26 @@ static inline bool is_mirror_pfn(unsigned long pfn)

   return false;
   }
+
+static inline bool change_to_mirror(gfp_t gfp_flags, int high_zoneidx)
+{
+/*
+ * Do not alloc mirrored memory below 4G, because 0-4G is
+ * all mirrored by default, and the list is always empty.
+ */
+if (high_zoneidx < ZONE_NORMAL)
+return false;
+
+/* Alloc mirrored memory for only kernel */
+if (gfp_flags & __GFP_MIRROR)
+return true;


GFP_KERNEL itself should imply mirror, I think.



Hi Kame,

How about like this: #define GFP_KERNEL (__GFP_WAIT | __GFP_IO | __GFP_FS | 
__GFP_MIRROR) ?



Hmm, it cannot cover GFP_ATOMIC et al.

I guess, mirrored memory should be allocated if !__GFP_HIGHMEM or !__GFP_MOVABLE
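
In code, the test I have in mind is just this (a sketch, not part of the
posted series):

static inline bool want_mirror(gfp_t gfp_mask)
{
	/* neither highmem/user nor movable => kernel memory that should be mirrored */
	return !(gfp_mask & (__GFP_HIGHMEM | __GFP_MOVABLE));
}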

thanks,
-Kame

Thanks,
-Kame





Re: [RFC PATCH 10/12] mm: add the buddy system interface

2015-06-09 Thread Kamezawa Hiroyuki

On 2015/06/04 22:04, Xishi Qiu wrote:

Add the buddy system interface for address range mirroring feature.
Allocate mirrored pages in MIGRATE_MIRROR list. If there is no mirrored pages
left, use other types pages.

Signed-off-by: Xishi Qiu 
---
  mm/page_alloc.c | 40 +++-
  1 file changed, 39 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d4d2066..0fb55288 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -599,6 +599,26 @@ static inline bool is_mirror_pfn(unsigned long pfn)

return false;
  }
+
+static inline bool change_to_mirror(gfp_t gfp_flags, int high_zoneidx)
+{
+   /*
+* Do not alloc mirrored memory below 4G, because 0-4G is
+* all mirrored by default, and the list is always empty.
+*/
+   if (high_zoneidx < ZONE_NORMAL)
+   return false;
+
+   /* Alloc mirrored memory for only kernel */
+   if (gfp_flags & __GFP_MIRROR)
+   return true;


GFP_KERNEL itself should imply mirror, I think.


+
+   /* Alloc mirrored memory for both user and kernel */
+   if (sysctl_mirrorable)
+   return true;


Reading this, I think this sysctl is not good. The user cannot know what is 
mirrored
because memory may not be mirrored until the sysctl is set.

Thanks,
-Kame



+
+   return false;
+}
  #endif

  /*
@@ -1796,7 +1816,10 @@ struct page *buffered_rmqueue(struct zone 
*preferred_zone,
WARN_ON_ONCE(order > 1);
}
	spin_lock_irqsave(&zone->lock, flags);
-   page = __rmqueue(zone, order, migratetype);
+   if (is_migrate_mirror(migratetype))
+   page = __rmqueue_smallest(zone, order, migratetype);
+   else
+   page = __rmqueue(zone, order, migratetype);
	spin_unlock(&zone->lock);
if (!page)
goto failed;
@@ -2928,6 +2951,11 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int 
order,
if (IS_ENABLED(CONFIG_CMA) && ac.migratetype == MIGRATE_MOVABLE)
alloc_flags |= ALLOC_CMA;

+#ifdef CONFIG_MEMORY_MIRROR
+   if (change_to_mirror(gfp_mask, ac.high_zoneidx))
+   ac.migratetype = MIGRATE_MIRROR;
+#endif
+
  retry_cpuset:
cpuset_mems_cookie = read_mems_allowed_begin();

@@ -2943,9 +2971,19 @@ retry_cpuset:

/* First allocation attempt */
alloc_mask = gfp_mask|__GFP_HARDWALL;
+retry:
	page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac);
if (unlikely(!page)) {
/*
+* If there is no mirrored memory, we will alloc other
+* types memory.
+*/
+   if (is_migrate_mirror(ac.migratetype)) {
+   ac.migratetype = gfpflags_to_migratetype(gfp_mask);
+   goto retry;
+   }
+
+   /*
 * Runtime PM, block IO and its error handling path
 * can deadlock because I/O on the device might not
 * complete.






Re: [RFC PATCH 08/12] mm: use mirrorable to switch allocate mirrored memory

2015-06-09 Thread Kamezawa Hiroyuki

On 2015/06/04 22:02, Xishi Qiu wrote:

Add a new interface in path /proc/sys/vm/mirrorable. When set to 1, it means
we should allocate mirrored memory for both user and kernel processes.

Signed-off-by: Xishi Qiu 


I can't see why we need this switch. If this is set, will all GFP_HIGHUSER
allocations use mirrored memory?

Or will you add special MMAP/madvise flag to use mirrored memory ?

Thanks,
-Kame


---
  include/linux/mmzone.h | 1 +
  kernel/sysctl.c| 9 +
  mm/page_alloc.c| 1 +
  3 files changed, 11 insertions(+)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index f82e3ae..20888dd 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -85,6 +85,7 @@ struct mirror_info {
  };

  extern struct mirror_info mirror_info;
+extern int sysctl_mirrorable;
  #  define is_migrate_mirror(migratetype) unlikely((migratetype) == 
MIGRATE_MIRROR)
  #else
  #  define is_migrate_mirror(migratetype) false
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 2082b1a..dc2625e 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1514,6 +1514,15 @@ static struct ctl_table vm_table[] = {
.extra2 = ,
},
  #endif
+#ifdef CONFIG_MEMORY_MIRROR
+   {
+   .procname   = "mirrorable",
+   .data   = &sysctl_mirrorable,
+   .maxlen = sizeof(sysctl_mirrorable),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec_minmax,
+   },
+#endif
{
.procname   = "user_reserve_kbytes",
.data   = _user_reserve_kbytes,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 249a8f6..63b90ca 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -212,6 +212,7 @@ int user_min_free_kbytes = -1;

  #ifdef CONFIG_MEMORY_MIRROR
  struct mirror_info mirror_info;
+int sysctl_mirrorable = 0;
  #endif

  static unsigned long __meminitdata nr_kernel_pages;






Re: [RFC PATCH 07/12] mm: introduce __GFP_MIRROR to allocate mirrored pages

2015-06-09 Thread Kamezawa Hiroyuki

On 2015/06/04 22:02, Xishi Qiu wrote:

This patch introduces a new gfp flag called "__GFP_MIRROR", it is used to
allocate mirrored pages through buddy system.

Signed-off-by: Xishi Qiu 


In Tony's original proposal, the motivation was to mirror all kernel memory.

Is the purpose of this patch making mirrored range available for user space ?

But, hmm... I don't think adding a new GFP flag is a good idea. It adds many 
conditional jumps.

How about using GFP_KERNEL for user memory if the user wants mirrored memory,
while mirroring all kernel memory?

Thanks,
-Kame


---
  include/linux/gfp.h | 5 -
  1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 15928f0..89d0091 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -35,6 +35,7 @@ struct vm_area_struct;
  #define ___GFP_NO_KSWAPD  0x40u
  #define ___GFP_OTHER_NODE 0x80u
  #define ___GFP_WRITE  0x100u
+#define ___GFP_MIRROR  0x200u
  /* If the above are modified, __GFP_BITS_SHIFT may need updating */

  /*
@@ -95,13 +96,15 @@ struct vm_area_struct;
  #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of 
other node */
  #define __GFP_WRITE   ((__force gfp_t)___GFP_WRITE)   /* Allocator intends to 
dirty page */

+#define __GFP_MIRROR   ((__force gfp_t)___GFP_MIRROR)  /* Allocate mirrored 
memory */
+
  /*
   * This may seem redundant, but it's a way of annotating false positives vs.
   * allocations that simply cannot be supported (e.g. page tables).
   */
  #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)

-#define __GFP_BITS_SHIFT 25	/* Room for N __GFP_FOO bits */
+#define __GFP_BITS_SHIFT 26	/* Room for N __GFP_FOO bits */
  #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))

  /* This equals 0, but use constants in case they ever change */





