Re: [PATCH 0/2] Make core_pattern support namespace

2016-03-19 Thread Kamezawa Hiroyuki
On 2016/03/16 18:23, Zhao Lei wrote:
> We discussed patch titled:
>   [PATCH] Make core_pattern support namespace
> before.
> 
> Above patch can solve half problem of custom core_dump pattern
> in container, but there are also another problem that limit
> custom core_pattern in container, it is the pipe-type core_pattern
> will write core dump into host's filesystem.
> (See discussion of that patch for detail)
> 
> Now we can solve the second problem by [PATCH 1/2], I send
> the origional patch with it.
> 

Let me confirm my understanding of your design.

This patch uses fork+execve() rather than calling the usermode helper (UMH) in
the pipe-coredump path, and the coredump-pipe process runs as a child of the
process being dumped.
Right?

Doesn't this break existing solutions actually used in distros?

BTW, this is the first time I have seen _do_fork() called from outside fork.c.
Isn't it better to add a new function in fork.c if we really need this?
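Something like the following might be enough (a rough sketch only; the function
name is made up, and the flags just mirror what sys_fork() passes today):

	/* kernel/fork.c: keep _do_fork() private and let fs/coredump.c call this */
	pid_t fork_from_current(void)
	{
		/* the child is forked from the dumping process, so it inherits
		 * that process's namespaces, cgroups and credentials */
		return _do_fork(SIGCHLD, 0, 0, NULL, NULL, 0);
	}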

Thanks,
-Kame



Re: call_usermodehelper in containers

2016-02-19 Thread Kamezawa Hiroyuki

On 2016/02/19 14:37, Ian Kent wrote:

On Fri, 2016-02-19 at 12:08 +0900, Kamezawa Hiroyuki wrote:

On 2016/02/19 5:45, Eric W. Biederman wrote:

Personally I am a fan of the don't be clever and capture a kernel
thread
approach as it is very easy to see you what if any exploitation
opportunities there are.  The justifications for something more
clever
is trickier.  Of course we do something that from this perspective
would
be considered ``clever'' today with kthreadd and user mode helpers.



I read the old discussion. Let me clarify my idea: create a helper kernel thread
to run the usermode helper, using kthreadd.

0) define a trigger to create an independent usermodehelper
environment for a container.
Option A) at creating some namespace (pid, uid, etc...)
Option B) at creating a new nsproxy
Option C).at a new systemcall is called or some sysctl,
make_private_usermode_helper() or some,

   It's expected this should be triggered by init process of a
container with some capability.
   And scope of the effect should be defined. pid namespace ? nsporxy ?
or new namespace ?

1) create a helper thread.
task = kthread_create(kthread_work_fn, ?, ?, "usermodehelper")
switch task's nsproxy to current.(swtich_task_namespaces())
switch task's cgroups to current (cgroup_attach_task_all())
switch task's cred to current.
copy task's capability from current
(and any other ?)
wake_up_process()

And create a link between kthread_wq and container.


Not sure I quite understand this but I thought the difficulty with this
approach previously (even though the approach was very much incomplete)
was knowing that all the "moving parts" would not allow vulnerabilities.


Ok, that was discussed.


And it looks like this would require a kernel thread for each instance.
So for a thousand containers that each mount an NFS mount that means, at
least, 1000 additional kernel threads. Might be able to sell that, if we
were lucky, but from an system administration POV it's horrible.


I agree.


There's also the question of existence (aka. lifetime) to deal with
since the thread above needs to be created at a time other than the
usermode helper callback.

What happens for SIGKILL on a container?


It depends on how the helper kthread is tied to a container-related object.
If the kthread is linked with some namespace, we can kill it when that namespace
goes away.

So, in your opinion:
 - a helper thread should be spawned on demand
 - its lifetime should be clear; ideally the same lifetime as the container.

I wonder whether there is any solution to the "moving parts" problem, other than
calling do_fork() or copy_process() in the context of the container's init
process, if we do everything in the kernel.
Is that possible?

Thanks,
-Kame




Re: call_usermodehelper in containers

2016-02-18 Thread Kamezawa Hiroyuki
On 2016/02/19 5:45, Eric W. Biederman wrote: 
> Personally I am a fan of the don't be clever and capture a kernel thread
> approach as it is very easy to see you what if any exploitation
> opportunities there are.  The justifications for something more clever
> is trickier.  Of course we do something that from this perspective would
> be considered ``clever'' today with kthreadd and user mode helpers.
> 

I read the old discussion. Let me clarify my idea: create a helper kernel thread
to run the usermode helper, using kthreadd.

0) define a trigger to create an independent usermodehelper environment for a
container.
   Option A) when some namespace (pid, uid, etc.) is created
   Option B) when a new nsproxy is created
   Option C) when a new syscall or some sysctl is invoked,
make_private_usermode_helper() or similar

  It's expected this would be triggered by the init process of a container with
some capability.
  And the scope of the effect should be defined: pid namespace? nsproxy? or a
new namespace?

1) create a helper thread (a rough sketch follows after step 2).
   task = kthread_create(kthread_worker_fn, ?, ?, "usermodehelper")
   switch the task's nsproxy to current's (switch_task_namespaces())
   switch the task's cgroups to current's (cgroup_attach_task_all())
   switch the task's cred to current's
   copy the task's capabilities from current
   (and anything else?)
   wake_up_process()

   And create a link between the kthread_wq and the container.

2) modify call_usermodehelper() to use kthread_worker
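A minimal sketch of step 1), assuming a per-container kthread_worker object is
kept somewhere on the container side (the function name is made up, and error
handling is omitted):

	static struct task_struct *spawn_container_umh_thread(struct kthread_worker *worker)
	{
		struct task_struct *task;

		/* kthreadd gives us a clean kernel thread to start from */
		task = kthread_create(kthread_worker_fn, worker,
				      "usermodehelper/%d", task_pid_nr(current));
		if (IS_ERR(task))
			return task;

		/* make the worker see the caller's (container init's) environment */
		get_nsproxy(current->nsproxy);
		switch_task_namespaces(task, current->nsproxy);
		cgroup_attach_task_all(current, task);
		/* cred/capability copy would go here; there is no ready-made
		 * helper for that part */

		wake_up_process(task);
		return task;
	}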


It seems the problem is which object a container-private usermode helper should
be tied to.

Regards,
-Kame


Re: call_usermodehelper in containers

2016-02-17 Thread Kamezawa Hiroyuki
On 2016/02/18 11:57, Eric W. Biederman wrote:
> 
> Ccing The containers list because a related discussion is happening there
> and somehow this thread has never made it there.
> 
> Ian Kent  writes:
> 
>> On Mon, 2013-11-18 at 18:28 +0100, Oleg Nesterov wrote:
>>> On 11/15, Eric W. Biederman wrote:

 I don't understand that one.  Having a preforked thread with the
 proper
 environment that can act like kthreadd in terms of spawning user
 mode
 helpers works and is simple.
>>
>> Forgive me replying to such an old thread but ...
>>
>> After realizing workqueues can't be used to pre-create threads to run
>> usermode helpers I've returned to look at this.
> 
> If someone can wind up with a good implementation I will be happy.
> 
>>> Can't we ask ->child_reaper to create the non-daemonized kernel thread
>>> with the "right" ->nsproxy, ->fs, etc?
>>
>> Eric, do you think this approach would be sufficient too?
>>
>> Probably wouldn't be quite right for user namespaces but should provide
>> what's needed for other cases?
>>
>> It certainly has the advantage of not having to maintain a plague of
>> processes waiting around to execute helpers.
> 
> That certainly sounds attractive.  Especially for the case of everyone
> who wants to set a core pattern in a container.
> 
> I am fuzzy on all of the details right now, but what I do remember is
> that in the kernel the user mode helper concepts when they attempted to
> scrub a processes environment were quite error prone until we managed to
> get kthreadd(pid 2) on the scene which always had a clean environment.
> 
> If we are going to tie this kind of thing to the pid namespace I
> recommend simplying denying it if you are in a user namespace without
> an approrpriate pid namespace.  AKA simply not allowing thigns to be setup
> if current->pid_ns->user_ns != current->user_ns.
> 
Can't this be handled by a simple capability like CAP_SYS_USERMODEHELPER?

With that user_ns check, it seems a core-dump catcher running on the host would
not work when user namespaces are used.

Thanks,
-Kame



Re: [Propose] Isolate core_pattern in mnt namespace.

2015-12-21 Thread Kamezawa Hiroyuki
On 2015/12/22 6:52, Eric W. Biederman wrote:
> Dongsheng Yang  writes:
> 
>> On 12/20/2015 05:47 PM, Eric W. Biederman wrote:
>>> Dongsheng Yang  writes:
>>>
 On 12/20/2015 10:37 AM, Al Viro wrote:
> On Sun, Dec 20, 2015 at 10:14:29AM +0800, Dongsheng Yang wrote:
>> On 12/17/2015 07:23 PM, Dongsheng Yang wrote:
>>> Hi guys,
>>>We are working on making core dump behaviour isolated in
>>> container. But the problem is, the /proc/sys/kernel/core_pattern
>>> is a kernel wide setting, not belongs to a container.
>>>
>>>So we want to add core_pattern into mnt namespace. What
>>> do you think about it?
>>
>> Hi Eric,
>>  I found your patch about "net: Implement the per network namespace
>> sysctl infrastructure", I want to do the similar thing
>> in mnt namespace. Is that suggested way?
>
> Why mnt namespace and not something else?

 Hi Al,

 Well, because core_pattern indicates the path to store core file.
 In different mnt namespace, we would like to change the path with
 different value.

 In addition, Let's considering other namespaces:
 UTS ns: contains informations of kernel and arch, not proper for 
 core_pattern.
 IPC ns: communication informations, not proper for core_pattern
 PID ns: core_pattern is not related with pid
 net ns: obviousely no.
 user ns: not proper too.

 Then I believe it's better to do this in mnt namespace. of course,
 core_pattern is just one example. After this infrastructure finished,
 we can implement more sysctls as per-mnt if necessary, I think.

 Al, what do you think about this idea?
>>>
>>> The hard part is not the sysctl.  The hard part is starting the usermode
>>> helper, in an environment that it can deal with.  The mount namespace
>>> really provides you with no help there.
>>
>> Do you mean the core dump helper? But I think I don't want to touch it
>> in my development. I think I can use non-pipe way to get what I want,
>> Let me try to explain what I want here.
>>
>> (1). introduce a --core-path option in docker run command to specify the
>> path in host to store core file in one container.
>> E.g: docker run --core-path=/core/test --name=test IMAGE
>>
>> (2). When the container starting, docker attach a volume to it, similar
>> with "-v /core/test:/var/lib/docker/coredump". That means, the path of
>> /var/lib/docker/coredump in container is a link to /core/test in host.
>>
>> (3). Set the /proc/sys/kernel/core_pattern in container as
>> "/var/lib/docker/coredump". But that should not affect the core_pattern
>> in host or other containers.
>>
>> Then I think I can collect the core files from each container and save
>> them in the paths where I want.
> 
> For your case that sounds like it would work.  Unfortunately for this to
> be generally applicable and to let the OS in the contianer control it's
> fate the core dump pattern needs to be supported.
> 
> Otherwise something clever in userspace that can be written now should
> be sufficient to fill the gap.  There is enough information for the user
> mode helper to implement the policy you would like today.
> 
Let me clarify my understanding.

1) Running the usermode helper in a container:
   not supported by the kernel; the usermode helper always runs on the host.

2) Running the usermode helper on the host:
   supported in the newest distro (FC23); abrt supports containers.
   A summary is here: https://github.com/abrt/abrt/wiki/Containers-and-chroots

If a guest user doesn't want to pass a core to the host owner, core_pattern
should be configurable from within the container, but it can't be.
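For reference, case 2) relies on a pipe-type core_pattern configured on the
host, something like the following (the helper path is purely illustrative;
abrt installs its own hook, and %p/%u/%g/%s/%t/%e are standard core_pattern
format specifiers):

  |/usr/local/sbin/core-collector %p %u %g %s %t %e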

Thanks,
-Kame




Re: [Propose] Isolate core_pattern in mnt namespace.

2015-12-20 Thread Kamezawa Hiroyuki
On 2015/12/20 18:47, Eric W. Biederman wrote:
> Dongsheng Yang  writes:
> 
>> On 12/20/2015 10:37 AM, Al Viro wrote:
>>> On Sun, Dec 20, 2015 at 10:14:29AM +0800, Dongsheng Yang wrote:
 On 12/17/2015 07:23 PM, Dongsheng Yang wrote:
> Hi guys,
>   We are working on making core dump behaviour isolated in
> container. But the problem is, the /proc/sys/kernel/core_pattern
> is a kernel wide setting, not belongs to a container.
>
>   So we want to add core_pattern into mnt namespace. What
> do you think about it?

 Hi Eric,
I found your patch about "net: Implement the per network namespace
 sysctl infrastructure", I want to do the similar thing
 in mnt namespace. Is that suggested way?
>>>
>>> Why mnt namespace and not something else?
>>
>> Hi Al,
>>
>> Well, because core_pattern indicates the path to store core file.
>> In different mnt namespace, we would like to change the path with
>> different value.
>>
>> In addition, Let's considering other namespaces:
>> UTS ns: contains informations of kernel and arch, not proper for 
>> core_pattern.
>> IPC ns: communication informations, not proper for core_pattern
>> PID ns: core_pattern is not related with pid
>> net ns: obviousely no.
>> user ns: not proper too.
>>
>> Then I believe it's better to do this in mnt namespace. of course,
>> core_pattern is just one example. After this infrastructure finished,
>> we can implement more sysctls as per-mnt if necessary, I think.
>>
>> Al, what do you think about this idea?
> 
> The hard part is not the sysctl.  The hard part is starting the usermode
> helper, in an environment that it can deal with.  The mount namespace
> really provides you with no help there.
> 
Let me ask: I think the usermode helper should be looked up in the container's
namespace. Or do you mean running the usermode helper outside of a container?

I think that if a user needs to send cores outside of a container, a file
descriptor or a network endpoint should be able to be passed via sysctl.

Thanks,
-Kame



> Eric




Re: [PATCH v2 7/7] Documentation: cgroup: add memory.swap.{current,max} description

2015-12-17 Thread Kamezawa Hiroyuki
On 2015/12/17 21:30, Vladimir Davydov wrote:
> The rationale of separate swap counter is given by Johannes Weiner.
> 
> Signed-off-by: Vladimir Davydov 
> ---
> Changes in v2:
>   - Add rationale of separate swap counter provided by Johannes.
> 
>   Documentation/cgroup.txt | 33 +
>   1 file changed, 33 insertions(+)
> 
> diff --git a/Documentation/cgroup.txt b/Documentation/cgroup.txt
> index 31d1f7bf12a1..f441564023e1 100644
> --- a/Documentation/cgroup.txt
> +++ b/Documentation/cgroup.txt
> @@ -819,6 +819,22 @@ PAGE_SIZE multiple when read back.
>   the cgroup.  This may not exactly match the number of
>   processes killed but should generally be close.
>   
> +  memory.swap.current
> +
> + A read-only single value file which exists on non-root
> + cgroups.
> +
> + The total amount of swap currently being used by the cgroup
> + and its descendants.
> +
> +  memory.swap.max
> +
> + A read-write single value file which exists on non-root
> + cgroups.  The default is "max".
> +
> + Swap usage hard limit.  If a cgroup's swap usage reaches this
> + limit, anonymous memory of the cgroup will not be swapped out.
> +
>   
>   5-2-2. General Usage
>   
> @@ -1291,3 +1307,20 @@ allocation from the slack available in other groups or 
> the rest of the
>   system than killing the group.  Otherwise, memory.max is there to
>   limit this type of spillover and ultimately contain buggy or even
>   malicious applications.
> +
> +The combined memory+swap accounting and limiting is replaced by real
> +control over swap space.
> +
> +The main argument for a combined memory+swap facility in the original
> +cgroup design was that global or parental pressure would always be
> +able to swap all anonymous memory of a child group, regardless of the
> +child's own (possibly untrusted) configuration.  However, untrusted
> +groups can sabotage swapping by other means - such as referencing its
> +anonymous memory in a tight loop - and an admin can not assume full
> +swappability when overcommitting untrusted jobs.
> +
> +For trusted jobs, on the other hand, a combined counter is not an
> +intuitive userspace interface, and it flies in the face of the idea
> +that cgroup controllers should account and limit specific physical
> +resources.  Swap space is a resource like all others in the system,
> +and that's why unified hierarchy allows distributing it separately.
> 
Could you give a hint here on how to calculate the amount of swap cache, which
is counted in both memory.current and swap.current?

Thanks,
-Kame







Re: [PATCH v3 2/2] mm: Introduce kernelcore=mirror option

2015-12-17 Thread Kamezawa Hiroyuki

On 2015/12/18 3:43, Luck, Tony wrote:

As Tony requested, we may need a knob to stop a fallback in "movable->normal", 
later.



If the mirrored memory is small and the other is large,
I think we can both enable "non-mirrored -> normal" and "normal -> 
non-mirrored".


Size of mirrored memory can be configured by software(EFI var).
So, having both is just overkill and normal->non-mirroed fallback is 
meaningless considering
what the feature want to guarantee.


In the original removable usage we wanted to guarantee that Linux did not 
allocate any
kernel objects in removable memory - because that would prevent later removal 
of that
memory.

Mirror case is the same - we don't want to allocate kernel structures in 
non-mirrored memory
because an uncorrectable error in one of them would crash the system.

But I think some users might like some flexibility here.  If the system doesn't 
have enough
memory for the kernel (non-movable or mirrored), then it seems odd to end up 
crashing
the system at the point of memory exhaustion (a likely result ... the kernel 
can try to reclaim
some pages from SLAB, but that might only return a few pages, if the shortage 
continues
the system will perform poorly and eventually fail).

The whole point of removable memory or mirrored memory is to provide better 
availability.

I'd vote for a mode where running out of memory for kernel results in a

warn_on_once("Ran out of mirrored/non-removable memory for kernel - now 
allocating from all zones\n")

because I think most people would like the system to stay up rather than worry 
about some future problem that may never happen.


Hmm... like this?
  sysctl.vm.fallback_mirror_memory = 0  // never fall back  # default
  sysctl.vm.fallback_mirror_memory = 1  // user memory may be allocated from the mirrored zone
  sysctl.vm.fallback_mirror_memory = 2  // the kernel, which usually allocates from the mirrored zone, may fall back before OOM
  sysctl.vm.fallback_mirror_memory = 3  // 1+2

However, I believe my customers' choice will always be 0, and the above can be
implemented in a clean way
(adding a mirrored/non-mirrored flag to zones and controlling the fallback
zonelist walk).
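A very rough sketch of that zonelist-walk control, with invented names
(zone_is_mirrored() and sysctl_fallback_mirror_memory do not exist; this only
illustrates the idea):

	/* inside the allocator's zonelist walk, e.g. get_page_from_freelist() */
	for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx, nodemask) {
		/*
		 * movable (user) allocation: do not fall back into the
		 * mirrored zone unless mode 1 is set
		 */
		if ((gfp_mask & __GFP_MOVABLE) && zone_is_mirrored(zone) &&
		    !(sysctl_fallback_mirror_memory & 1))
			continue;
		/* ... existing watermark checks and allocation follow ... */
	}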

BTW, we need Taku's patch to make progress. I think the other development items
should be done in another development cycle. What does he need to get your Acks?

Thanks,
-Kame






Re: [PATCH v3 2/2] mm: Introduce kernelcore=mirror option

2015-12-16 Thread Kamezawa Hiroyuki
On 2015/12/17 13:48, Xishi Qiu wrote:
> On 2015/12/17 10:53, Kamezawa Hiroyuki wrote:
> 
>> On 2015/12/17 11:47, Xishi Qiu wrote:
>>> On 2015/12/17 9:38, Izumi, Taku wrote:
>>>
>>>> Dear Xishi,
>>>>
>>>>Sorry for late.
>>>>
>>>>> -Original Message-
>>>>> From: Xishi Qiu [mailto:qiuxi...@huawei.com]
>>>>> Sent: Friday, December 11, 2015 6:44 PM
>>>>> To: Izumi, Taku/泉 拓
>>>>> Cc: Luck, Tony; linux-kernel@vger.kernel.org; linux...@kvack.org; 
>>>>> a...@linux-foundation.org; Kamezawa, Hiroyuki/亀澤 寛
>>>>> 之; m...@csn.ul.ie; Hansen, Dave; m...@codeblueprint.co.uk
>>>>> Subject: Re: [PATCH v3 2/2] mm: Introduce kernelcore=mirror option
>>>>>
>>>>> On 2015/12/11 13:53, Izumi, Taku wrote:
>>>>>
>>>>>> Dear Xishi,
>>>>>>
>>>>>>> Hi Taku,
>>>>>>>
>>>>>>> Whether it is possible that we rewrite the fallback function in buddy 
>>>>>>> system
>>>>>>> when zone_movable and mirrored_kernelcore are both enabled?
>>>>>>
>>>>>> What does "when zone_movable and mirrored_kernelcore are both 
>>>>>> enabled?" mean ?
>>>>>>
>>>>>> My patchset just provides a new way to create ZONE_MOVABLE.
>>>>>>
>>>>>
>>>>> Hi Taku,
>>>>>
>>>
>>> Hi Taku,
>>>
>>> We can NOT specify kernelcore= "nn[KMG]" and "mirror" at the same time.
>>> So when we use "mirror", in fact, the movable zone is a new zone. I think 
>>> it is
>>> more appropriate with this name "mirrored zone", and also we can rewrite the
>>> fallback function in buddy system in this case.
>>
>> kernelcore ="mirrored zone" ?
> 
> No, it's zone_names[MAX_NR_ZONES]
> How about "Movable", -> "Non-mirrored"?
> 
That would break many user apps. I don't think we have a strong enough reason to rename it.

>>
>> BTW, let me confirm.
>>
>>ZONE_NORMAL = mirrored
>>ZONE_MOVABLE = not mirrored.
>>
> 
> Yes,
> 
>> so, the new zone is "not-mirrored" zone.
>>
>> Now, fallback function is
>>
>> movable -> normal -> DMA.
>>
>> As Tony requested, we may need a knob to stop a fallback in 
>> "movable->normal", later.
>>
> 
> If the mirrored memory is small and the other is large,
> I think we can both enable "non-mirrored -> normal" and "normal -> 
> non-mirrored".

The size of mirrored memory can be configured by software (an EFI variable).
So having both is overkill, and a normal->non-mirrored fallback is meaningless
considering what the feature wants to guarantee.

Thanks,
-Kame






Re: [PATCH 1/7] mm: memcontrol: charge swap to cgroup2

2015-12-16 Thread Kamezawa Hiroyuki

On 2015/12/17 12:32, Johannes Weiner wrote:

On Thu, Dec 17, 2015 at 11:46:27AM +0900, Kamezawa Hiroyuki wrote:

On 2015/12/16 20:09, Johannes Weiner wrote:

On Wed, Dec 16, 2015 at 12:18:30PM +0900, Kamezawa Hiroyuki wrote:

  - swap-full notification via vmpressure or something mechanism.


Why?



I think it's a sign of unhealthy condition, starting file cache drop rate to 
rise.
But I forgot that there are resource threshold notifier already. Does the 
notifier work
for swap.usage ?


That will be reflected in vmpressure or other distress mechanisms. I'm
not convinced "ran out of swap space" needs special casing in any way.


Most users treat a swap space shortage as a "system alarm" in enterprise systems.
At least, our customers check for swap-full.


  - force swap-in at reducing swap.limit


Why?


If full, swap.limit cannot be reduced even if there are available memory in a 
cgroup.
Another cgroup cannot make use of the swap resource while it's occupied by 
other cgroup.
The job scheduler should have a chance to fix the situation.


I don't see why swap space allowance would need to be as dynamically
adjustable as the memory allowance. There is usually no need to be as
tight with swap space as with memory, and the performance penalty of
swapping, even with flash drives, is high enough that swap space acts
as an overflow vessel rather than be part of the regularly backing of
the anonymous/shmem working set. It really is NOT obvious that swap
space would need to be adjusted on the fly, and that it's important
that reducing the limit will be reflected in consumption right away.



In my OS support experience, some customers consider swap space a resource.



We shouldn't be adding hundreds of lines of likely terrible heuristics
code* on speculation that somebody MIGHT find this useful in real life.
We should wait until we are presented with a real usecase that applies
to a whole class of users, and then see what the true requirements are.


OK, we should wait. I'm just guessing that (Japanese) HPC people will want the
feature for their job control. I hear many programs rely on swap.


* If a group has 200M swapped out and the swap limit is reduced by 10M
below the current consumption, which pages would you swap in? There is
no LRU list for swap space.


If a rotation can happen on swap-in from a real page fault, then random swap-in
when reducing swap.limit will work well enough.

Thanks,
-Kame





Re: [PATCH v3 2/2] mm: Introduce kernelcore=mirror option

2015-12-16 Thread Kamezawa Hiroyuki
On 2015/12/17 11:47, Xishi Qiu wrote:
> On 2015/12/17 9:38, Izumi, Taku wrote:
> 
>> Dear Xishi,
>>
>>   Sorry for late.
>>
>>> -Original Message-
>>> From: Xishi Qiu [mailto:qiuxi...@huawei.com]
>>> Sent: Friday, December 11, 2015 6:44 PM
>>> To: Izumi, Taku/泉 拓
>>> Cc: Luck, Tony; linux-kernel@vger.kernel.org; linux...@kvack.org; 
>>> a...@linux-foundation.org; Kamezawa, Hiroyuki/亀澤 寛
>>> 之; m...@csn.ul.ie; Hansen, Dave; m...@codeblueprint.co.uk
>>> Subject: Re: [PATCH v3 2/2] mm: Introduce kernelcore=mirror option
>>>
>>> On 2015/12/11 13:53, Izumi, Taku wrote:
>>>
>>>> Dear Xishi,
>>>>
>>>>> Hi Taku,
>>>>>
>>>>> Whether it is possible that we rewrite the fallback function in buddy 
>>>>> system
>>>>> when zone_movable and mirrored_kernelcore are both enabled?
>>>>
>>>>What does "when zone_movable and mirrored_kernelcore are both enabled?" 
>>>> mean ?
>>>>
>>>>My patchset just provides a new way to create ZONE_MOVABLE.
>>>>
>>>
>>> Hi Taku,
>>>
> 
> Hi Taku,
> 
> We can NOT specify kernelcore= "nn[KMG]" and "mirror" at the same time.
> So when we use "mirror", in fact, the movable zone is a new zone. I think it 
> is
> more appropriate with this name "mirrored zone", and also we can rewrite the
> fallback function in buddy system in this case.

kernelcore ="mirrored zone" ?

BTW, let me confirm.

  ZONE_NORMAL = mirrored
  ZONE_MOVABLE = not mirrored.

So the new zone is the "not-mirrored" zone.

Now, the fallback order is

   movable -> normal -> DMA.

As Tony requested, we may later need a knob to stop the "movable->normal"
fallback.

Thanks,
-Kame








Re: [PATCH 1/7] mm: memcontrol: charge swap to cgroup2

2015-12-16 Thread Kamezawa Hiroyuki

On 2015/12/16 20:09, Johannes Weiner wrote:

On Wed, Dec 16, 2015 at 12:18:30PM +0900, Kamezawa Hiroyuki wrote:

Hmm, my requests are
  - set the same capabilities as mlock() to set swap.limit=0


Setting swap.max is already privileged operation.


Sure.


  - swap-full notification via vmpressure or something mechanism.


Why?



I think it's a sign of an unhealthy condition, where the file cache drop rate
starts to rise.
But I forgot that there is already a resource threshold notifier. Does that
notifier work for swap.usage?


  - OOM-Killer's available memory calculation may be corrupted, please check.


Vladimir updated mem_cgroup_get_limit().


I'll check it.


  - force swap-in at reducing swap.limit


Why?


If swap is full, swap.limit cannot be reduced even if there is available memory
in the cgroup.
Another cgroup cannot make use of the swap resource while it is occupied by the
other cgroup.
The job scheduler should have a chance to fix the situation.

Thanks,
-Kame



Re: [PATCH 1/7] mm: memcontrol: charge swap to cgroup2

2015-12-15 Thread Kamezawa Hiroyuki

On 2015/12/16 2:21, Michal Hocko wrote:
 

I completely agree that malicious/untrusted users absolutely have to
be capped by the hard limit. Then the separate swap limit would work
for sure. But I am less convinced about usefulness of the rigid (to
the global memory pressure) swap limit without the hard limit. All the
memory that could have been swapped out will make a memory pressure to
the rest of the system without being punished for it too much. Memcg
is allowed to grow over the high limit (in the current implementation)
without any way to shrink back in other words.

My understanding was that the primary use case for the swap limit is to
handle potential (not only malicious but also unexpectedly misbehaving
application) anon memory consumption runaways more gracefully without
the massive disruption on the global level. I simply didn't see swap
space partitioning as important enough because an alternative to swap
usage is to consume primary memory which is a more precious resource
IMO. Swap storage is really cheap and runtime expandable resource which
is not the case for the primary memory in general. Maybe there are other
use cases I am not aware of, though. Do you want to guarantee the swap
availability?



At the time of the first implementation, an NEC engineer explained their use
case in the HPC area. At that time, there was no swap support.

Consider two workloads partitioned into groups A and B, with 100GB of total swap:
  A: memory.limit = 40G
  B: memory.limit = 40G

The job scheduler runs applications in A and B in turn: apps in A stop while
apps in B run.

If App-A requires 120GB of anonymous memory, it uses 80GB of swap, so App-B can
use only 20GB of swap. This causes trouble if App-B needs 100GB of anonymous
memory. They need some knob to control the amount of swap per cgroup.

The point is that, at least for their customer, swap is a "resource" which
should be under control. In their use case, memory usage and swap usage have
the same meaning, so a mem+swap limit doesn't cause trouble.
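With the separate swap counter proposed in this series, that partitioning could
be expressed directly, for example (values purely illustrative):
  A: memory.swap.max = 40G
  B: memory.swap.max = 60G
so that one group can no longer consume the other group's share of swap.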

Thanks,
-Kame






Re: [PATCH 1/7] mm: memcontrol: charge swap to cgroup2

2015-12-15 Thread Kamezawa Hiroyuki

On 2015/12/15 23:50, Johannes Weiner wrote:

On Tue, Dec 15, 2015 at 12:22:41PM +0900, Kamezawa Hiroyuki wrote:

On 2015/12/15 4:42, Vladimir Davydov wrote:

Anyway, if you don't trust a container you'd better set the hard memory
limit so that it can't hurt others no matter what it runs and how it
tweaks its sub-tree knobs.


Limiting swap can easily cause "OOM-Killer even while there are available swap"
with easy mistake. Can't you add "swap excess" switch to sysctl to allow global
memory reclaim can ignore swap limitation ?


That never worked with a combined memory+swap limit, either. How could
it? The parent might swap you out under pressure, but simply touching
a few of your anon pages causes them to get swapped back in, thrashing
with whatever the parent was trying to do. Your ability to swap it out
is simply no protection against a group touching its pages.

Allowing the parent to exceed swap with separate counters makes even
less sense, because every page swapped out frees up a page of memory
that the child can reuse. For every swap page that exceeds the limit,
the child gets a free memory page! The child doesn't even have to
cause swapin, it can just steal whatever the parent tried to free up,
and meanwhile its combined memory & swap footprint explodes.


Sure.


The answer is and always should have been: don't overcommit untrusted
cgroups. Think of swap as a resource you distribute, not as breathing
room for the parents to rely on. Because it can't and could never.


OK, don't overcommit.


And the new separate swap counter makes this explicit.


Hmm, my requests are:
 - require the same capability as mlock() to set swap.limit=0
 - swap-full notification via vmpressure or some other mechanism
 - the OOM killer's available-memory calculation may no longer be correct, please check
 - force swap-in when reducing swap.limit

Thanks,
-Kame




Re: [PATCH 1/7] mm: memcontrol: charge swap to cgroup2

2015-12-15 Thread Kamezawa Hiroyuki

On 2015/12/15 20:02, Vladimir Davydov wrote:

On Tue, Dec 15, 2015 at 12:22:41PM +0900, Kamezawa Hiroyuki wrote:

On 2015/12/15 4:42, Vladimir Davydov wrote:

On Mon, Dec 14, 2015 at 04:30:37PM +0100, Michal Hocko wrote:

On Thu 10-12-15 14:39:14, Vladimir Davydov wrote:

In the legacy hierarchy we charge memsw, which is dubious, because:

  - memsw.limit must be >= memory.limit, so it is impossible to limit
swap usage less than memory usage. Taking into account the fact that
the primary limiting mechanism in the unified hierarchy is
memory.high while memory.limit is either left unset or set to a very
large value, moving memsw.limit knob to the unified hierarchy would
effectively make it impossible to limit swap usage according to the
user preference.

  - memsw.usage != memory.usage + swap.usage, because a page occupying
both swap entry and a swap cache page is charged only once to memsw
counter. As a result, it is possible to effectively eat up to
memory.limit of memory pages *and* memsw.limit of swap entries, which
looks unexpected.

That said, we should provide a different swap limiting mechanism for
cgroup2.
This patch adds mem_cgroup->swap counter, which charges the actual
number of swap entries used by a cgroup. It is only charged in the
unified hierarchy, while the legacy hierarchy memsw logic is left
intact.


I agree that the previous semantic was awkward. The problem I can see
with this approach is that once the swap limit is reached the anon
memory pressure might spill over to other and unrelated memcgs during
the global memory pressure. I guess this is what Kame referred to as
anon would become mlocked basically. This would be even more of an issue
with resource delegation to sub-hierarchies because nobody will prevent
setting the swap amount to a small value and use that as an anon memory
protection.


AFAICS such anon memory protection has a side-effect: real-life
workloads need page cache to run smoothly (at least for mapping
executables). Disabling swapping would switch pressure to page caches,
resulting in performance degradation. So, I don't think per memcg swap
limit can be abused to boost your workload on an overcommitted system.

If you mean malicious users, well, they already have plenty ways to eat
all available memory up to the hard limit by creating unreclaimable
kernel objects.


"protect anon" user's malicious degree is far lower than such cracker like 
users.


What do you mean by "malicious degree"? What is such a user trying to
achieve? Killing the system? Well, there are much more effective ways to
do so. Or does it want to exploit a system specific feature to get
benefit for itself? If so, it will hardly win by mlocking all anonymous
memory, because this will result in higher pressure exerted upon its
page cache and dcache, which normal workloads just can't get along
without.



I wanted to say that almost all application developers would want to set
swap.limit=0 if allowed.
So it is ordinary users who could hurt the system if such a swap imbalance is
allowed.
 



Anyway, if you don't trust a container you'd better set the hard memory
limit so that it can't hurt others no matter what it runs and how it
tweaks its sub-tree knobs.



Limiting swap can easily cause "OOM-Killer even while there are
available swap" with easy mistake.


What do you mean by "easy mistake"? Misconfiguration? If so, it's a lame
excuse IMO. Admin should take system configuration seriously. If the
host is not overcommitted, it's trivial. Otherwise, there's always a
chance that things will go south, so it's not going to be easy. It's up
to admin to analyze risks and set limits accordingly. Exporting knobs
with clear meaning is the best we can do here. swap.max is one such knob
It defines maximal usage of swap resource. Allowing to breach it just
does not add up.


Can't you add "swap excess" switch to sysctl to allow global memory
reclaim can ignore swap limitation ?


I'd be opposed to it, because this would obscure the user API. OTOH, a
kind of swap soft limit (swap.high?) might be considered. I'm not sure
if it's really necessary though, because all arguments for it do not
look convincing to me for now. So, personally, I would refrain from
implementing it until it is really called for by users of cgroup v2.



For my customers, invoking the OOM killer while there is free swap space is a
system error rather than their misconfiguration.

BTW, mlock() requires CAP_IPC_LOCK.
Please at least make the default unlimited and check a capability when setting
the swap limit.
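A minimal sketch of such a check, assuming it sits in the cgroup2 write handler
for memory.swap.max (the handler name follows the usual cftype convention, and
the choice of CAP_IPC_LOCK, mirroring mlock(), is only a suggestion):

	static ssize_t swap_max_write(struct kernfs_open_file *of,
				      char *buf, size_t nbytes, loff_t off)
	{
		/* tightening swap can pin anon memory like mlock() does,
		 * so require the same capability (suggestion only) */
		if (!capable(CAP_IPC_LOCK))
			return -EPERM;

		/* ... parse the value and update memcg->swap.limit as the
		 * patch does ... */
		return nbytes;
	}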

Thanks,
-Kame



Thanks,
Vladimir

Re: [PATCH 1/7] mm: memcontrol: charge swap to cgroup2

2015-12-15 Thread Kamezawa Hiroyuki

On 2015/12/15 17:30, Vladimir Davydov wrote:

On Tue, Dec 15, 2015 at 12:12:40PM +0900, Kamezawa Hiroyuki wrote:

On 2015/12/15 0:30, Michal Hocko wrote:

On Thu 10-12-15 14:39:14, Vladimir Davydov wrote:

In the legacy hierarchy we charge memsw, which is dubious, because:

  - memsw.limit must be >= memory.limit, so it is impossible to limit
swap usage less than memory usage. Taking into account the fact that
the primary limiting mechanism in the unified hierarchy is
memory.high while memory.limit is either left unset or set to a very
large value, moving memsw.limit knob to the unified hierarchy would
effectively make it impossible to limit swap usage according to the
user preference.

  - memsw.usage != memory.usage + swap.usage, because a page occupying
both swap entry and a swap cache page is charged only once to memsw
counter. As a result, it is possible to effectively eat up to
memory.limit of memory pages *and* memsw.limit of swap entries, which
looks unexpected.

That said, we should provide a different swap limiting mechanism for
cgroup2.
This patch adds mem_cgroup->swap counter, which charges the actual
number of swap entries used by a cgroup. It is only charged in the
unified hierarchy, while the legacy hierarchy memsw logic is left
intact.


I agree that the previous semantic was awkward. The problem I can see
with this approach is that once the swap limit is reached the anon
memory pressure might spill over to other and unrelated memcgs during
the global memory pressure. I guess this is what Kame referred to as
anon would become mlocked basically. This would be even more of an issue
with resource delegation to sub-hierarchies because nobody will prevent
setting the swap amount to a small value and use that as an anon memory
protection.

I guess this was the reason why this approach hasn't been chosen before


Yes. At that age, "never break global VM" was the policy. And "mlock" can be
used for attacking system.


If we are talking about "attacking system" from inside a container,
there are much easier and disruptive ways, e.g. running a fork-bomb or
creating pipes - such memory can't be reclaimed and global OOM killer
won't help.


You're right. We just wanted to avoid each cgroup's settings affecting global
memory reclaim.

Thanks,
-Kame





Re: [PATCH 1/7] mm: memcontrol: charge swap to cgroup2

2015-12-14 Thread Kamezawa Hiroyuki

On 2015/12/15 0:30, Michal Hocko wrote:

On Thu 10-12-15 14:39:14, Vladimir Davydov wrote:

In the legacy hierarchy we charge memsw, which is dubious, because:

  - memsw.limit must be >= memory.limit, so it is impossible to limit
swap usage less than memory usage. Taking into account the fact that
the primary limiting mechanism in the unified hierarchy is
memory.high while memory.limit is either left unset or set to a very
large value, moving memsw.limit knob to the unified hierarchy would
effectively make it impossible to limit swap usage according to the
user preference.

  - memsw.usage != memory.usage + swap.usage, because a page occupying
both swap entry and a swap cache page is charged only once to memsw
counter. As a result, it is possible to effectively eat up to
memory.limit of memory pages *and* memsw.limit of swap entries, which
looks unexpected.

That said, we should provide a different swap limiting mechanism for
cgroup2.
This patch adds mem_cgroup->swap counter, which charges the actual
number of swap entries used by a cgroup. It is only charged in the
unified hierarchy, while the legacy hierarchy memsw logic is left
intact.


I agree that the previous semantic was awkward. The problem I can see
with this approach is that once the swap limit is reached the anon
memory pressure might spill over to other and unrelated memcgs during
the global memory pressure. I guess this is what Kame referred to as
anon would become mlocked basically. This would be even more of an issue
with resource delegation to sub-hierarchies because nobody will prevent
setting the swap amount to a small value and use that as an anon memory
protection.

I guess this was the reason why this approach hasn't been chosen before


Yes. At that time, "never break the global VM" was the policy, and "mlock" could
be used for attacking the system.


but I think we can come up with a way to stop the run away consumption
even when the swap is accounted separately. All of them are quite nasty
but let me try.

We could allow charges to fail even for the high limit if the excess is
way above the amount of reclaimable memory in the given memcg/hierarchy.
A runaway load would be stopped before it can cause a considerable
damage outside of its hierarchy this way even when the swap limit
is configured small.
Now that goes against the high limit semantic which should only throttle
the consumer and shouldn't cause any functional failures but maybe this
is acceptable for the overall system stability. An alternative would
be to throttle in the high limit reclaim context proportionally to
the excess. This is normally done by the reclaim itself but with no
reclaimable memory this wouldn't work that way.


This seems hard to use for users who want to control resources precisely, even
if stability is good.


Another option would be to ignore the swap limit during the global
reclaim. This wouldn't stop the runaway loads but they would at least
see their fair share of the reclaim. The swap excess could be then used
as a "handicap" for a more aggressive throttling during high limit reclaim
or to trigger hard limit sooner.


This seems to work, but users would need to understand that the swap limit can
be exceeded.


Or we could teach the global OOM killer to select abusive anon memory
users with restricted swap. That would require to iterate through all
memcgs and checks whether their anon consumption is in a large excess to
their swap limit and fallback to the memcg OOM victim selection if that
is the case. This adds more complexity to the OOM killer path so I am
not sure this is generally acceptable, though.



I think this is not acceptable.


My question now is. Is the knob usable/useful even without additional
heuristics? Do we want to protect swap space so rigidly that a swap
limited memcg can cause bigger problems than without the swap limit
globally?



Swap requires some limit. If not, an application can eat up all the swap, and it
will never be freed until the application accesses it again or swapoff runs.

Thanks,
-Kame


The swap usage can be monitored using new memory.swap.current file and
limited using memory.swap.max.

Signed-off-by: Vladimir Davydov 
---
  include/linux/memcontrol.h |   1 +
  include/linux/swap.h   |   5 ++
  mm/memcontrol.c| 123 +
  mm/shmem.c |   4 ++
  mm/swap_state.c|   5 ++
  5 files changed, 129 insertions(+), 9 deletions(-)


[...]






Re: [PATCH 1/7] mm: memcontrol: charge swap to cgroup2

2015-12-14 Thread Kamezawa Hiroyuki

On 2015/12/15 4:42, Vladimir Davydov wrote:

On Mon, Dec 14, 2015 at 04:30:37PM +0100, Michal Hocko wrote:

On Thu 10-12-15 14:39:14, Vladimir Davydov wrote:

In the legacy hierarchy we charge memsw, which is dubious, because:

  - memsw.limit must be >= memory.limit, so it is impossible to limit
swap usage less than memory usage. Taking into account the fact that
the primary limiting mechanism in the unified hierarchy is
memory.high while memory.limit is either left unset or set to a very
large value, moving memsw.limit knob to the unified hierarchy would
effectively make it impossible to limit swap usage according to the
user preference.

  - memsw.usage != memory.usage + swap.usage, because a page occupying
both swap entry and a swap cache page is charged only once to memsw
counter. As a result, it is possible to effectively eat up to
memory.limit of memory pages *and* memsw.limit of swap entries, which
looks unexpected.

That said, we should provide a different swap limiting mechanism for
cgroup2.
This patch adds mem_cgroup->swap counter, which charges the actual
number of swap entries used by a cgroup. It is only charged in the
unified hierarchy, while the legacy hierarchy memsw logic is left
intact.


I agree that the previous semantic was awkward. The problem I can see
with this approach is that once the swap limit is reached the anon
memory pressure might spill over to other and unrelated memcgs during
the global memory pressure. I guess this is what Kame referred to as
anon would become mlocked basically. This would be even more of an issue
with resource delegation to sub-hierarchies because nobody will prevent
setting the swap amount to a small value and use that as an anon memory
protection.


AFAICS such anon memory protection has a side-effect: real-life
workloads need page cache to run smoothly (at least for mapping
executables). Disabling swapping would switch pressure to page caches,
resulting in performance degradation. So, I don't think per memcg swap
limit can be abused to boost your workload on an overcommitted system.

If you mean malicious users, well, they already have plenty ways to eat
all available memory up to the hard limit by creating unreclaimable
kernel objects.


"protect anon" user's malicious degree is far lower than such cracker like 
users.


Anyway, if you don't trust a container you'd better set the hard memory
limit so that it can't hurt others no matter what it runs and how it
tweaks its sub-tree knobs.



Limiting swap can easily cause an "OOM killer even while there is available
swap" situation through a simple mistake. Can't you add a "swap excess" switch
to sysctl so that global memory reclaim can ignore the swap limitation?

Regards,
-Kame










Re: [PATCH 1/7] mm: memcontrol: charge swap to cgroup2

2015-12-10 Thread Kamezawa Hiroyuki
On 2015/12/10 20:39, Vladimir Davydov wrote:
> In the legacy hierarchy we charge memsw, which is dubious, because:
> 
>   - memsw.limit must be >= memory.limit, so it is impossible to limit
> swap usage less than memory usage. Taking into account the fact that
> the primary limiting mechanism in the unified hierarchy is
> memory.high while memory.limit is either left unset or set to a very
> large value, moving memsw.limit knob to the unified hierarchy would
> effectively make it impossible to limit swap usage according to the
> user preference.
> 
>   - memsw.usage != memory.usage + swap.usage, because a page occupying
> both swap entry and a swap cache page is charged only once to memsw
> counter. As a result, it is possible to effectively eat up to
> memory.limit of memory pages *and* memsw.limit of swap entries, which
> looks unexpected.
> 
> That said, we should provide a different swap limiting mechanism for
> cgroup2.
> 
> This patch adds mem_cgroup->swap counter, which charges the actual
> number of swap entries used by a cgroup. It is only charged in the
> unified hierarchy, while the legacy hierarchy memsw logic is left
> intact.
> 
> The swap usage can be monitored using new memory.swap.current file and
> limited using memory.swap.max.
> 
> Signed-off-by: Vladimir Davydov 

Will setting swap.max=0 work like mlock?

Thanks,
-Kame


> ---
>   include/linux/memcontrol.h |   1 +
>   include/linux/swap.h   |   5 ++
>   mm/memcontrol.c| 123 
> +
>   mm/shmem.c |   4 ++
>   mm/swap_state.c|   5 ++
>   5 files changed, 129 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index c6a5ed2f2744..993c9a26b637 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -169,6 +169,7 @@ struct mem_cgroup {
>   
>   /* Accounted resources */
>   struct page_counter memory;
> + struct page_counter swap;
>   struct page_counter memsw;
>   struct page_counter kmem;
>   
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 457181844b6e..f4b3ccdcba91 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -368,11 +368,16 @@ static inline int mem_cgroup_swappiness(struct 
> mem_cgroup *mem)
>   #endif
>   #ifdef CONFIG_MEMCG_SWAP
>   extern void mem_cgroup_swapout(struct page *page, swp_entry_t entry);
> +extern int mem_cgroup_charge_swap(struct page *page, swp_entry_t entry);
>   extern void mem_cgroup_uncharge_swap(swp_entry_t entry);
>   #else
>   static inline void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
>   {
>   }
> +static inline int mem_cgroup_charge_swap(struct page *page, swp_entry_t 
> entry)
> +{
> + return 0;
> +}
>   static inline void mem_cgroup_uncharge_swap(swp_entry_t entry)
>   {
>   }
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 7f5c6abf5421..9d10e2819ec4 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1212,7 +1212,7 @@ void mem_cgroup_print_oom_info(struct mem_cgroup 
> *memcg, struct task_struct *p)
>   pr_cont(":");
>   
>   for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
> - if (i == MEM_CGROUP_STAT_SWAP && !do_memsw_account())
> + if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account)
>   continue;
>   pr_cont(" %s:%luKB", mem_cgroup_stat_names[i],
>   K(mem_cgroup_read_stat(iter, i)));
> @@ -1248,12 +1248,15 @@ static unsigned long mem_cgroup_get_limit(struct 
> mem_cgroup *memcg)
>   {
>   unsigned long limit;
>   
> - limit = memcg->memory.limit;
> + limit = READ_ONCE(memcg->memory.limit);
>   if (mem_cgroup_swappiness(memcg)) {
>   unsigned long memsw_limit;
> + unsigned long swap_limit;
>   
> - memsw_limit = memcg->memsw.limit;
> - limit = min(limit + total_swap_pages, memsw_limit);
> + memsw_limit = READ_ONCE(memcg->memsw.limit);
> + swap_limit = min(READ_ONCE(memcg->swap.limit),
> +  (unsigned long)total_swap_pages);
> + limit = min(limit + swap_limit, memsw_limit);
>   }
>   return limit;
>   }
> @@ -4226,6 +4229,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state 
> *parent_css)
>   page_counter_init(&memcg->memory, NULL);
>   memcg->high = PAGE_COUNTER_MAX;
>   memcg->soft_limit = PAGE_COUNTER_MAX;
> + page_counter_init(&memcg->swap, NULL);
>   page_counter_init(&memcg->memsw, NULL);
>   page_counter_init(&memcg->kmem, NULL);
>   }
> @@ -4276,6 +4280,7 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
>   page_counter_init(&memcg->memory, &parent->memory);
>   memcg->high = PAGE_COUNTER_MAX;
>   memcg-

Re: [PATCH] mm: Introduce kernelcore=reliable option

2015-11-03 Thread Kamezawa Hiroyuki

On 2015/10/31 4:42, Luck, Tony wrote:

If each memory controller has the same distance/latency, you (your firmware) 
don't need
to allocate reliable memory per each memory controller.
If distance is problem, another node should be allocated.

...is the behavior(splitting zone) really required ?


It's useful from a memory bandwidth perspective to have allocations
spread across both memory controllers. Keeping a whole bunch of
Xeon cores fed needs all the bandwidth you can get.



Hmm. But the physical address layout is not tied to the dual memory controllers.
I think the reliable range could be made contiguous by firmware...

-Kame




Re: [RFC 1/3] mm, oom: refactor oom detection

2015-10-30 Thread Kamezawa Hiroyuki

On 2015/10/30 17:23, Michal Hocko wrote:

On Fri 30-10-15 14:23:59, KAMEZAWA Hiroyuki wrote:

On 2015/10/30 0:17, mho...@kernel.org wrote:

[...]

@@ -3135,13 +3145,56 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int 
order,
if (gfp_mask & __GFP_NORETRY)
goto noretry;

-   /* Keep reclaiming pages as long as there is reasonable progress */
+   /*
+* Do not retry high order allocations unless they are __GFP_REPEAT
+* and even then do not retry endlessly.
+*/
pages_reclaimed += did_some_progress;
-   if ((did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER) ||
-   ((gfp_mask & __GFP_REPEAT) && pages_reclaimed < (1 << order))) {
-   /* Wait for some write requests to complete then retry */
-   wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC, HZ/50);
-   goto retry;
+   if (order > PAGE_ALLOC_COSTLY_ORDER) {
+   if (!(gfp_mask & __GFP_REPEAT) || pages_reclaimed >= (1<

why directly retry here ?


Because I wanted to preserve the previous logic for GFP_REPEAT as much
as possible here and do an incremental change in the later patch.



I see.


[...]


@@ -3150,8 +3203,10 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int 
order,
goto got_pg;

/* Retry as long as the OOM killer is making progress */
-   if (did_some_progress)
+   if (did_some_progress) {
+   stall_backoff = 0;
goto retry;
+   }


Umm ? I'm sorry that I didn't notice page allocation may fail even
if order < PAGE_ALLOC_COSTLY_ORDER.  I thought old logic ignores
did_some_progress. It seems a big change.


__alloc_pages_may_oom will set did_some_progress


So, now, 0-order page allocation may fail in a OOM situation ?


No they don't normally and this patch doesn't change the logic here.



I understand that your patch doesn't change the behavior.
Looking into __alloc_pages_may_oom(), *did_some_progress is finally set by

	if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL))
		*did_some_progress = 1;

...so it depends on out_of_memory()'s return value.
Now, an allocation may fail if the OOM killer is disabled. Isn't that complicated?

Shouldn't we have

	if (order < PAGE_ALLOC_COSTLY_ORDER)
		goto retry;

here?

Thanks,
-Kame








Re: [PATCH] mm: Introduce kernelcore=reliable option

2015-10-29 Thread Kamezawa Hiroyuki
On 2015/10/23 10:44, Luck, Tony wrote:
> First part of each memory controller. I have two memory controllers on each 
> node
> 

If each memory controller has the same distance/latency, you (your firmware)
don't need to allocate reliable memory per memory controller.
If distance is the problem, another node should be allocated.

...is this behavior (splitting the zone) really required?

Thanks,
-Kame



Re: [RFC 2/3] mm: throttle on IO only when there are too many dirty and writeback pages

2015-10-29 Thread Kamezawa Hiroyuki
On 2015/10/30 0:17, mho...@kernel.org wrote:
> From: Michal Hocko 
> 
> wait_iff_congested has been used to throttle allocator before it retried
> another round of direct reclaim to allow the writeback to make some
> progress and prevent reclaim from looping over dirty/writeback pages
> without making any progress. We used to do congestion_wait before
> 0e093d99763e ("writeback: do not sleep on the congestion queue if
> there are no congested BDIs or if significant congestion is not being
> encountered in the current zone") but that led to undesirable stalls
> and sleeping for the full timeout even when the BDI wasn't congested.
> Hence wait_iff_congested was used instead. But it seems that even
> wait_iff_congested doesn't work as expected. We might have a small file
> LRU list with all pages dirty/writeback and yet the bdi is not congested
> so this is just a cond_resched in the end and can end up triggering pre
> mature OOM.
> 
> This patch replaces the unconditional wait_iff_congested by
> congestion_wait which is executed only if we _know_ that the last round
> of direct reclaim didn't make any progress and dirty+writeback pages are
> more than a half of the reclaimable pages on the zone which might be
> usable for our target allocation. This shouldn't reintroduce stalls
> fixed by 0e093d99763e because congestion_wait is called only when we
> are getting hopeless when sleeping is a better choice than OOM with many
> pages under IO.
> 
> Signed-off-by: Michal Hocko 
> ---
>   mm/page_alloc.c | 19 +--
>   1 file changed, 17 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 9c0abb75ad53..0518ca6a9776 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3191,8 +3191,23 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int 
> order,
>*/
>   if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
>   ac->high_zoneidx, alloc_flags, target)) {
> - /* Wait for some write requests to complete then retry 
> */
> - wait_iff_congested(zone, BLK_RW_ASYNC, HZ/50);
> + unsigned long writeback = zone_page_state(zone, 
> NR_WRITEBACK),
> +   dirty = zone_page_state(zone, 
> NR_FILE_DIRTY);
> +
> + if (did_some_progress)
> + goto retry;
> +
> + /*
> +  * If we didn't make any progress and have a lot of
> +  * dirty + writeback pages then we should wait for
> +  * an IO to complete to slow down the reclaim and
> +  * prevent from pre mature OOM
> +  */
> + if (2*(writeback + dirty) > reclaimable)

Doesn't this add unnecessary latency if other zones have enough clean memory ?


Thanks,
-Kame
 




Re: [RFC 1/3] mm, oom: refactor oom detection

2015-10-29 Thread Kamezawa Hiroyuki
On 2015/10/30 0:17, mho...@kernel.org wrote:
> From: Michal Hocko 
> 
> __alloc_pages_slowpath has traditionally relied on the direct reclaim
> and did_some_progress as an indicator that it makes sense to retry
> allocation rather than declaring OOM. shrink_zones had to rely on
> zone_reclaimable if shrink_zone didn't make any progress to prevent
> from pre mature OOM killer invocation - the LRU might be full of dirty
> or writeback pages and direct reclaim cannot clean those up.
> 
> zone_reclaimable will allow to rescan the reclaimable lists several
> times and restart if a page is freed. This is really subtle behavior
> and it might lead to a livelock when a single freed page keeps allocator
> looping but the current task will not be able to allocate that single
> page. OOM killer would be more appropriate than looping without any
> progress for unbounded amount of time.
> 
> This patch changes OOM detection logic and pulls it out from shrink_zone
> which is too low to be appropriate for any high level decisions such as OOM
> which is per zonelist property. It is __alloc_pages_slowpath which knows
> how many attempts have been done and what was the progress so far
> therefore it is more appropriate to implement this logic.
> 
> The new heuristic tries to be more deterministic and easier to follow.
> Retrying makes sense only if the currently reclaimable memory + free
> pages would allow the current allocation request to succeed (as per
> __zone_watermark_ok) at least for one zone in the usable zonelist.
> 
> This alone wouldn't be sufficient, though, because the writeback might
> get stuck and reclaimable pages might be pinned for a really long time
> or even depend on the current allocation context. Therefore there is a
> feedback mechanism implemented which reduces the reclaim target after
> each reclaim round without any progress. This means that we should
> eventually converge to only NR_FREE_PAGES as the target and fail on the
> wmark check and proceed to OOM. The backoff is simple and linear with
> 1/16 of the reclaimable pages for each round without any progress. We
> are optimistic and reset counter for successful reclaim rounds.
> 
> Costly high order pages mostly preserve their semantic and those without
> __GFP_REPEAT fail right away while those which have the flag set will
> back off after the amount of reclaimable pages reaches equivalent of the
> requested order. The only difference is that if there was no progress
> during the reclaim we rely on zone watermark check. This is more logical
> thing to do than previous 1<<order vs. zone_reclaimable faking the progress.
> 
> Signed-off-by: Michal Hocko 
> ---
>   include/linux/swap.h |  1 +
>   mm/page_alloc.c  | 69 
> ++--
>   mm/vmscan.c  | 10 +---
>   3 files changed, 64 insertions(+), 16 deletions(-)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 9c7c4b418498..8298e1dc20f9 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -317,6 +317,7 @@ extern void lru_cache_add_active_or_unevictable(struct 
> page *page,
>   struct vm_area_struct *vma);
>   
>   /* linux/mm/vmscan.c */
> +extern unsigned long zone_reclaimable_pages(struct zone *zone);
>   extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>   gfp_t gfp_mask, nodemask_t *mask);
>   extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c73913648357..9c0abb75ad53 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2972,6 +2972,13 @@ static inline bool is_thp_gfp_mask(gfp_t gfp_mask)
>   return (gfp_mask & (GFP_TRANSHUGE | __GFP_KSWAPD_RECLAIM)) == 
> GFP_TRANSHUGE;
>   }
>   
> +/*
> + * Number of backoff steps for potentially reclaimable pages if the direct 
> reclaim
> + * cannot make any progress. Each step will reduce 1/MAX_STALL_BACKOFF of the
> + * reclaimable memory.
> + */
> +#define MAX_STALL_BACKOFF 16
> +
>   static inline struct page *
>   __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>   struct alloc_context *ac)
> @@ -2984,6 +2991,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int 
> order,
>   enum migrate_mode migration_mode = MIGRATE_ASYNC;
>   bool deferred_compaction = false;
>   int contended_compaction = COMPACT_CONTENDED_NONE;
> + struct zone *zone;
> + struct zoneref *z;
> + int stall_backoff = 0;
>   
>   /*
>* In the slowpath, we sanity check order to avoid ever trying to
> @@ -3135,13 +3145,56 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int 
> order,
>   if (gfp_mask & __GFP_NORETRY)
>   goto noretry;
>   
> - /* Keep reclaiming pages as long as there is reasonable progress */
> + /*
> +  * Do not retry high order allocations unless they are __GFP_

Re: [PATCH] mm: Introduce kernelcore=reliable option

2015-10-22 Thread Kamezawa Hiroyuki

On 2015/10/22 3:17, Luck, Tony wrote:

+   if (reliable_kernelcore) {
+   for_each_memblock(memory, r) {
+   if (memblock_is_mirror(r))
+   continue;

Should we have a safety check here that there is some mirrored memory?  If you 
give
the kernelcore=reliable option on a machine which doesn't have any mirror 
configured,
then we'll mark all memory as removable.


You're right.


What happens then?  Do kernel allocations fail?  Or do they fall back to using 
removable memory?


Maybe the kernel cannot boot because the NORMAL zone is empty.


Is there a /proc or /sys file that shows the current counts for the removable 
zone?  I just
tried this patch with a high percentage of memory marked as mirror ... but I'd 
like to see
how much is actually being used to tune things a bit.



I think /proc/zoneinfo can show detailed numbers per zone. Do we need something
similar in meminfo?

Thanks,
-Kame




Re: [PATCH][RFC] mm: Introduce kernelcore=reliable option

2015-10-13 Thread Kamezawa Hiroyuki

On 2015/10/09 19:36, Xishi Qiu wrote:

On 2015/10/9 17:24, Kamezawa Hiroyuki wrote:


On 2015/10/09 15:46, Xishi Qiu wrote:

On 2015/10/9 22:56, Taku Izumi wrote:


Xeon E7 v3 based systems supports Address Range Mirroring
and UEFI BIOS complied with UEFI spec 2.5 can notify which
ranges are reliable (mirrored) via EFI memory map.
Now Linux kernel utilize its information and allocates
boot time memory from reliable region.

My requirement is:
- allocate kernel memory from reliable region
- allocate user memory from non-reliable region

In order to meet my requirement, ZONE_MOVABLE is useful.
By arranging non-reliable range into ZONE_MOVABLE,
reliable memory is only used for kernel allocations.



Hi Taku,

You mean set non-mirrored memory to movable zone, and set
mirrored memory to normal zone, right? So kernel allocations
will use mirrored memory in normal zone, and user allocations
will use non-mirrored memory in movable zone.

My question is:
1) do we need to change the fallback function?


For *our* requirement, it's not required. But if someone wants to prevent
users' memory allocation from NORMAL_ZONE, we need some change in zonelist
walking.



Hi Kame,

So we assume kernel will only use normal zone(mirrored), and users use movable
zone(non-mirrored) first if the memory is not enough, then use normal zone too.



Yes.


2) the mirrored region should locate at the start of normal
zone, right?


Precisely, "not-reliable" range of memory are handled by ZONE_MOVABLE.
This patch does only that.


I mean the mirrored region can not at the middle or end of the zone,
BIOS should report the memory like this,

e.g.
BIOS
node0: 0-4G mirrored, 4-8G mirrored, 8-16G non-mirrored
node1: 16-24G mirrored, 24-32G non-mirrored

OS
node0: DMA DMA32 are both mirrored, NORMAL(4-8G), MOVABLE(8-16G)
node1: NORMAL(16-24G), MOVABLE(24-32G)



I think zones can overlap even while they are aligned to MAX_ORDER.






I remember Kame has already suggested this idea. In my opinion,
I still think it's better to add a new migratetype or a new zone,
so both user and kernel could use mirrored memory.


Hi, Xishi.

Izumi-san and I discussed the implementation a lot and found that using a "zone"
is the better approach.

The biggest reason is that a zone is the unit of vmscan, of statistics, and of
handling a range of memory for a purpose. We can reuse all the vmscan and
statistics code by making use of zones. Introducing another structure would be
messy.


Yes, add a new zone is better, but it will change much code, so reuse 
ZONE_MOVABLE
is simpler and easier, right?



I think so. If someone has difficulty with ZONE_MOVABLE, adding a zone will be
another job.
(*) Taku-san's boot option only specifies that kernelcore be placed into reliable
    memory and doesn't specify anything about user memory.



His patch is very simple.



The following plan sounds good to me. Shall we rename the zone when it is
used for mirrored memory? "movable" is a little confusing.



Maybe. I think that should be a separate discussion. With this patch and his
fake-reliable-memory patch, everyone can give it a try.



For your requirements, Izumi-san and I are discussing the following plan.

 - Add a flag to show whether a zone is reliable or not, then mark ZONE_MOVABLE as
   not-reliable.
 - Add __GFP_RELIABLE. This will allow alloc_pages() to skip not-reliable zones.
 - Add madvise(MADV_RELIABLE) and modify the page fault code's gfp flags with that
   flag.



like this?
user: madvise()/mmap()/or others -> add vma_reliable flag -> add gfp_reliable flag 
-> alloc_pages
kernel: use __GFP_RELIABLE flag in buddy allocation/slab/vmalloc...

yes.
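
As a rough sketch of that chain (note that __GFP_RELIABLE, VM_RELIABLE,
MADV_RELIABLE and the helper below are only names from the plan above -- none of
this exists yet):

/* madvise(addr, len, MADV_RELIABLE) would set VM_RELIABLE on the vma */
static inline gfp_t vma_fault_gfp(struct vm_area_struct *vma)
{
	gfp_t gfp = GFP_HIGHUSER_MOVABLE;

	/* user memory normally goes to ZONE_MOVABLE (the not-reliable zones) */
	if (vma->vm_flags & VM_RELIABLE)
		gfp = (gfp & ~__GFP_MOVABLE) | __GFP_RELIABLE;

	return gfp;
}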



Also we can introduce some interfaces in procfs or sysfs, right?



That depends on your use case. I think madvise() will be the first choice.

Thanks,
-kame







Re: [PATCH][RFC] mm: Introduce kernelcore=reliable option

2015-10-09 Thread Kamezawa Hiroyuki

On 2015/10/09 15:46, Xishi Qiu wrote:

On 2015/10/9 22:56, Taku Izumi wrote:


Xeon E7 v3 based systems supports Address Range Mirroring
and UEFI BIOS complied with UEFI spec 2.5 can notify which
ranges are reliable (mirrored) via EFI memory map.
Now Linux kernel utilize its information and allocates
boot time memory from reliable region.

My requirement is:
   - allocate kernel memory from reliable region
   - allocate user memory from non-reliable region

In order to meet my requirement, ZONE_MOVABLE is useful.
By arranging non-reliable range into ZONE_MOVABLE,
reliable memory is only used for kernel allocations.



Hi Taku,

You mean set non-mirrored memory to movable zone, and set
mirrored memory to normal zone, right? So kernel allocations
will use mirrored memory in normal zone, and user allocations
will use non-mirrored memory in movable zone.

My question is:
1) do we need to change the fallback function?


For *our* requirement, it's not required. But if someone wants to prevent
users' memory allocation from NORMAL_ZONE, we need some change in zonelist
walking.


2) the mirrored region should locate at the start of normal
zone, right?


Precisely, "not-reliable" range of memory are handled by ZONE_MOVABLE.
This patch does only that.



I remember Kame has already suggested this idea. In my opinion,
I still think it's better to add a new migratetype or a new zone,
so both user and kernel could use mirrored memory.


Hi, Xishi.

Izumi-san and I discussed the implementation a lot and found that using a "zone"
is the better approach.

The biggest reason is that a zone is the unit of vmscan, of statistics, and of
handling a range of memory for a purpose. We can reuse all the vmscan and
statistics code by making use of zones. Introducing another structure would be
messy.
His patch is very simple.

For your requirements, Izumi-san and I are discussing the following plan.

 - Add a flag to show whether a zone is reliable or not, then mark ZONE_MOVABLE as
   not-reliable.
 - Add __GFP_RELIABLE. This will allow alloc_pages() to skip not-reliable zones.
 - Add madvise(MADV_RELIABLE) and modify the page fault code's gfp flags with that
   flag.


Thanks,
-Kame


Re: [Intel-wired-lan] [Patch V3 5/9] i40e: Use numa_mem_id() to better support memoryless node

2015-10-09 Thread Kamezawa Hiroyuki

On 2015/10/09 14:52, Jiang Liu wrote:

On 2015/10/9 4:20, Andrew Morton wrote:

On Wed, 19 Aug 2015 17:18:15 -0700 (PDT) David Rientjes  
wrote:


On Wed, 19 Aug 2015, Patil, Kiran wrote:


Acked-by: Kiran Patil 


Where's the call to preempt_disable() to prevent kernels with preemption
from making numa_node_id() invalid during this iteration?


David asked this question twice, received no answer and now the patch
is in the maintainer tree, destined for mainline.

If I was asked this question I would respond

   The use of numa_mem_id() is racy and best-effort.  If the unlikely
   race occurs, the memory allocation will occur on the wrong node, the
   overall result being very slightly suboptimal performance.  The
   existing use of numa_node_id() suffers from the same issue.

But I'm not the person proposing the patch.  Please don't just ignore
reviewer comments!

Hi Andrew,
Apologize for the slow response due to personal reasons!
And thanks for answering the question from David. To be honest,
I didn't know how to answer this question before. Actually this
question has puzzled me for a long time when dealing with memory
hot-removal. For normal cases, it only causes sub-optimal memory
allocation if schedule event happens between querying NUMA node id
and calling alloc_pages_node(). But what happens if system run into
following execution sequence?
1) node = numa_mem_id();
2) memory hot-removal event triggers
2.1) remove affected memory
2.2) reset pgdat to zero if node becomes empty after memory removal


I'm sorry if I misunderstand something.
After commit b0dc3a342af36f95a68fe229b8f0f73552c5ca08, there is no memset().


3) alloc_pages_node(), which may access zero-ed pgdat structure.


?



I haven't found a mechanism to protect system from above sequence yet,
so puzzled for a long time already:(. Does stop_machine() protect
system from such a execution sequence?


To access a pgdat, one of the pgdat's zones has to be on some zonelist.
Now, __build_all_zonelists() is called under stop_machine(); that's the reason
you're asking what stop_machine() does. And, as you know, stop_machine() is not
protecting anything here: the caller may fall back into the removed zone.

Then, let's think.

First, please note that the "pgdat" itself is not removed (and cannot be removed),
so accessing the pgdat's memory will not cause a fault.

Only its contents are the problem. At removal, the zone's and the pgdat's
page-related information is cleared.

alloc_pages() uses the zonelist/zoneref arrays to walk the zones without accessing
the pgdat itself. I think accessing the zonelist is safe because it's an array updated
under stop_machine().
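
(Roughly, the walk looks like the following -- simplified from
get_page_from_freelist(), just to illustrate that only the zoneref array and the
zones themselves are touched:)

	struct zoneref *z;
	struct zone *zone;

	/* zonelist->_zonerefs[] is only rebuilt under stop_machine() */
	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
		/* an emptied zone simply fails the watermark check */
		if (!zone_watermark_ok(zone, order, low_wmark_pages(zone),
				       high_zoneidx, alloc_flags))
			continue;

		page = buffered_rmqueue(preferred_zone, zone, order,
					gfp_mask, migratetype);
		if (page)
			break;
	}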

So the question is whether alloc_pages() works correctly even if a zone contains no
pages. I think it should.

(Note: the zones are embedded in the pgdat, so zeroing the pgdat would zero the zones
and other structures as well; that would not work.)

So, what problem do you see now?
I'm sorry I can't chase the old discussions.

Thanks,
-Kame


Re: [PATCH 03/11] x86/mm/hotplug: Don't remove PGD entries in remove_pagetable()

2015-10-05 Thread Kamezawa Hiroyuki
On 2015/09/22 15:23, Ingo Molnar wrote:
> So when memory hotplug removes a piece of physical memory from pagetable
> mappings, it also frees the underlying PGD entry.
> 
> This complicates PGD management, so don't do this. We can keep the
> PGD mapped and the PUD table all clear - it's only a single 4K page
> per 512 GB of memory hotplugged.
> 
> Cc: Andrew Morton 
> Cc: Andy Lutomirski 
> Cc: Borislav Petkov 
> Cc: Brian Gerst 
> Cc: Denys Vlasenko 
> Cc: H. Peter Anvin 
> Cc: Linus Torvalds 
> Cc: Oleg Nesterov 
> Cc: Peter Zijlstra 
> Cc: Rik van Riel 
> Cc: Thomas Gleixner 
> Cc: Waiman Long 
> Cc: linux...@kvack.org
> Signed-off-by: Ingo Molnar 

Ishimatsu-san, Tang-san, please check.

Doesn't this patch affect the issues addressed by
 5255e0a79fcc0ff47b387af92bd9ef5729b1b859
 9661d5bcd058fe15b4138a00d96bd36516134543
?

-Kame

> ---
>   arch/x86/mm/init_64.c | 27 ---
>   1 file changed, 27 deletions(-)
> 
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 7129e7647a76..60b0cc3f2819 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -780,27 +780,6 @@ static void __meminit free_pmd_table(pmd_t *pmd_start, 
> pud_t *pud)
>   spin_unlock(&init_mm.page_table_lock);
>   }
>   
> -/* Return true if pgd is changed, otherwise return false. */
> -static bool __meminit free_pud_table(pud_t *pud_start, pgd_t *pgd)
> -{
> - pud_t *pud;
> - int i;
> -
> - for (i = 0; i < PTRS_PER_PUD; i++) {
> - pud = pud_start + i;
> - if (pud_val(*pud))
> - return false;
> - }
> -
> - /* free a pud table */
> - free_pagetable(pgd_page(*pgd), 0);
> - spin_lock(&init_mm.page_table_lock);
> - pgd_clear(pgd);
> - spin_unlock(&init_mm.page_table_lock);
> -
> - return true;
> -}
> -
>   static void __meminit
>   remove_pte_table(pte_t *pte_start, unsigned long addr, unsigned long end,
>bool direct)
> @@ -992,7 +971,6 @@ remove_pagetable(unsigned long start, unsigned long end, 
> bool direct)
>   unsigned long addr;
>   pgd_t *pgd;
>   pud_t *pud;
> - bool pgd_changed = false;
>   
>   for (addr = start; addr < end; addr = next) {
>   next = pgd_addr_end(addr, end);
> @@ -1003,13 +981,8 @@ remove_pagetable(unsigned long start, unsigned long 
> end, bool direct)
>   
>   pud = (pud_t *)pgd_page_vaddr(*pgd);
>   remove_pud_table(pud, addr, next, direct);
> - if (free_pud_table(pud, pgd))
> - pgd_changed = true;
>   }
>   
> - if (pgd_changed)
> - sync_global_pgds(start, end - 1, 1);
> -
>   flush_tlb_all();
>   }
>   
> 




Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-08-24 Thread Kamezawa Hiroyuki

On 2015/08/25 8:15, Paul Turner wrote:

On Mon, Aug 24, 2015 at 3:49 PM, Tejun Heo  wrote:

Hello,

On Mon, Aug 24, 2015 at 03:03:05PM -0700, Paul Turner wrote:

Hmm... I was hoping for an actual configurations and usage scenarios.
Preferably something people can set up and play with.


This is much easier to set up and play with synthetically.  Just
create the 10 threads and 100 threads above then experiment with
configurations designed at guaranteeing the set of 100 threads
relatively uniform throughput regardless of how many are active.   I
don't think trying to run a VM stack adds anything except complexity
of reproduction here.


Well, but that loses most of details and why such use cases matter to
begin with.  We can imagine up stuff to induce arbitrary set of
requirements.


All that's being proved or disproved here is that it's difficult to
coordinate the consumption of asymmetric thread pools using nice.  The
constraints here were drawn from a real-world example.




I take that the
CPU intensive helper threads are usually IO workers?  Is the scenario
where the VM is set up with a lot of IO devices and different ones may
consume large amount of CPU cycles at any given point?


Yes, generally speaking there are a few major classes of IO (flash,
disk, network) that a guest may invoke.  Each of these backends is
separate and chooses its own threading.


Hmmm... if that's the case, would limiting iops on those IO devices
(or classes of them) work?  qemu already implements IO limit mechanism
after all.


No.

1) They should proceed at the maximum rate that they can that's still
within their provisioning budget.
2) The cost/IO is both inconsistent and changes over time.  Attempting
to micro-optimize every backend for this is infeasible, this is
exactly the type of problem that the scheduler can usefully help
arbitrate.
3) Even pretending (2) is fixable, dynamically dividing these
right-to-work tokens between different I/O device backends is
extremely complex.



I think I should explain my customer's use case of qemu + cpuset/cpu (via libvirt).

(1) Isolating hypervisor threads.
   As already discussed, hypervisor threads are isolated with cpuset. Their purpose
   is to avoid _latency_ spikes caused by hypervisor behavior, so "nice" cannot be the
   solution, as already discussed.

(2) Fixed-rate vcpu service.
   Using the cpu controller's quota/period feature, my customer creates vcpu models
   such as Low(1GHz), Mid(2GHz), High(3GHz) for an IaaS system.

   To do this, each vcpu should be quota-limited independently, with per-thread cpu control.
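   (For illustration with made-up numbers: on a 3GHz host, a "1GHz" vcpu roughly maps
   to cpu.cfs_period_us=100000 / cpu.cfs_quota_us=33333 on that vcpu's own group, a
   "2GHz" one to a quota of 66666, and so on -- which today is only expressible by
   giving each vcpu thread its own cpu cgroup.)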

In particular, method (1) is used by several enterprise customers to stabilize their
systems.

Sub-process control should be provided in some way.

Thanks,
-Kame





Anyways, a point here is that threads of the same process competing
isn't a new problem.  There are many ways to make those threads play
nice as the application itself often has to be involved anyway,
especially for something like qemu which is heavily involved in
provisioning resources.


It's certainly not a new problem, but it's a real one, and it's
_hard_.  You're proposing removing the best known solution.



cgroups can be a nice brute-force add-on which lets sysadmins do wild
things but it's inherently hacky and incomplete for coordinating
threads.  For example, what is it gonna do if qemu cloned vcpus and IO
helpers dynamically off of the same parent thread?


We're talking about sub-process usage here.  This is the application
coordinating itself, NOT the sysadmin.  Processes are becoming larger
and larger, we need many of the same controls within them that we have
between them.


  It requires
application's cooperation anyway but at the same time is painful to
actually interact from those applications.


As discussed elsewhere on thread this is really not a problem if you
define consistent rules with respect to which parts are managed by
who.  The argument of potential interference is no different to
messing with an application's on-disk configuration behind its back.
Alternate strawmen which greatly improve this from where we are today
have also been proposed.



Thanks.

--
tejun



Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

2015-08-18 Thread Kamezawa Hiroyuki

On 2015/08/19 5:31, Tejun Heo wrote:

Hello, Paul.

On Mon, Aug 17, 2015 at 09:03:30PM -0700, Paul Turner wrote:

2) Control within an address-space.  For subsystems with fungible resources,
e.g. CPU, it can be useful for an address space to partition its own
threads.  Losing the capability to do this against the CPU controller would
be a large set-back for instance.  Occasionally, it is useful to share these
groupings between address spaces when processes are cooperative, but this is
less of a requirement.

This is important to us.


Sure, let's build a proper interface for that.  Do you actually need
sub-hierarchy inside a process?  Can you describe your use case in
detail and why having hierarchical CPU cycle distribution is essential
for your use case?


An actual per-thread use case among our customers is qemu-kvm + cpuset.
Customers pin each vcpu and qemu-kvm's worker threads to cpus.
For example, pinning 4 vcpus to cpus 2-6 and pinning the qemu main thread and
the others (vhost) to cpus 0-1.

This is an actual kvm tuning used by our customers for performance guarantees.

In another case, the cpu cgroup's throttling feature is used per vcpu for vm cpu
sizing.

Thanks,
-Kame



Re: [PATCH 0/3] Make workingset detection logic memcg aware

2015-08-11 Thread Kamezawa Hiroyuki

On 2015/08/10 17:14, Vladimir Davydov wrote:

On Sun, Aug 09, 2015 at 11:12:25PM +0900, Kamezawa Hiroyuki wrote:

On 2015/08/08 22:05, Vladimir Davydov wrote:

On Fri, Aug 07, 2015 at 10:38:16AM +0900, Kamezawa Hiroyuki wrote:

...

All ? hmm. It seems that mixture of record of global memory pressure and of 
local memory
pressure is just wrong.


What makes you think so? An example of misbehavior caused by this would
be nice to have.



By design, memcg's LRU aging logic is independent from global memory 
allocation/pressure.


Assume there are 4 containers (using a lot of page cache) with a 1GB limit on a 4GB
server:
   # container A  workingset=600M   limit=1G (sleepy)
   # container B  workingset=300M   limit=1G (works often)
   # container C  workingset=500M   limit=1G (works slowly)
   # container D  workingset=1.2G   limit=1G (works hard)
Container D can drive the zone's distance counter because of its local memory
reclaim.
If active/inactive = 1:1, container D's pages can be activated.
When kswapd (global reclaim) runs, all containers' LRUs will rotate.

The possibility of refault in A, B, and C is reduced by container D's counter updates.


This does not necessarily mean we have to use different inactive_age
counter for global and local memory pressure. In your example, having
inactive_age per lruvec and using it for evictions on both global and
local memory pressure would work just fine.



you're right.

 




 if (current memcg == recorded memcg && eviction distance is 
okay)
  activate page.
 else
  inactivate
At page-out
 if (global memory pressure)
 record eviction id with using zone's counter.
 else if (memcg local memory pressure)
 record eviction id with memcg's counter.



I don't understand how this is supposed to work when a memory cgroup
experiences both local and global pressure simultaneously.



I think updating the global distance counter from local reclaim may update the
counter too much.


But if the inactive_age counter was per lruvec, then we wouldn't need to
bother about it.


yes.

Anyway, what I understand now is that we need to reduce the influence of one memcg's
behavior on other memcgs. Your way divides the counter completely; my idea was to
implement a different counter. Doing it by calculation would be good because we can't
have enough record space.


Thanks,
-Kame



Re: [PATCH 0/3] Make workingset detection logic memcg aware

2015-08-09 Thread Kamezawa Hiroyuki

On 2015/08/08 22:05, Vladimir Davydov wrote:

On Fri, Aug 07, 2015 at 10:38:16AM +0900, Kamezawa Hiroyuki wrote:

On 2015/08/06 17:59, Vladimir Davydov wrote:

On Wed, Aug 05, 2015 at 10:34:58AM +0900, Kamezawa Hiroyuki wrote:

I wonder, rather than collecting more data, rough calculation can help the 
situation.
for example,

(refault_distance calculated in zone) * memcg_reclaim_ratio < memcg's 
active list

If one of per-zone calc or per-memcg calc returns true, refault should be true.

memcg_reclaim_ratio is the percentage of scan in a memcg against in a zone.


This particular formula wouldn't work I'm afraid. If there are two
isolated cgroups issuing local reclaim on the same zone, the refault
distance needed for activation would be reduced by half for no apparent
reason.


Hmm, you mean activation in memcg means activation in global LRU, and it's not a
valid reason. Current implementation does have the same issue, right ?

i.e. when a container has been hitting its limit for a while, and then, a file 
cache is
pushed out but came back soon, it can be easily activated.

I'd like to confirm what you want to do.

  1) avoid activating a file cache when it was kicked out because of memcg's 
local limit.


No, that's not what I want. I want pages of the workingset to get
activated on refault no matter if they were evicted on global memory
pressure or due to hitting a memory cgroup limit.



Sure.


  2) maintain active/inactive ratio in memcg properly as global LRU does.
  3) reclaim shadow entry at proper timing.

All ? hmm. It seems that mixture of record of global memory pressure and of 
local memory
pressure is just wrong.


What makes you think so? An example of misbehavior caused by this would
be nice to have.



By design, memcg's LRU aging logic is independent from global memory 
allocation/pressure.


Assume there are 4 containers (using a lot of page cache) with a 1GB limit on a 4GB
server:
  # container A  workingset=600M   limit=1G (sleepy)
  # container B  workingset=300M   limit=1G (works often)
  # container C  workingset=500M   limit=1G (works slowly)
  # container D  workingset=1.2G   limit=1G (works hard)

Container D can drive the zone's distance counter because of its local memory reclaim.

If active/inactive = 1:1, container D's pages can be activated.
When kswapd (global reclaim) runs, all containers' LRUs will rotate.

The possibility of refault in A, B, and C is reduced by container D's counter updates.

But yes, some _real_ tests are required.



Now, the record is

eviction | node | zone | 2bit.

How about changing this as

 0 |eviction | node | zone | 2bit
 1 |eviction |  memcgid| 2bit

Assume each memcg has an eviction counter, which ignoring node/zone.
i.e. memcg local reclaim happens against memcg not against zone.

At page-in,
 if (the 1st bit is 0)
 compare eviction counter with zone's counter and activate the 
page if needed.
 else if (the 1st bit is 1)
 compare eviction counter with the memcg (if exists)


Having a single counter per memcg won't scale with the number of NUMA
nodes.


It doesn't matter; we can use a lazy counter, like a percpu counter, because it
doesn't need to be very accurate.



 if (current memcg == recorded memcg && eviction distance is 
okay)
  activate page.
 else
  inactivate
At page-out
 if (global memory pressure)
 record eviction id with using zone's counter.
 else if (memcg local memory pressure)
 record eviction id with memcg's counter.



I don't understand how this is supposed to work when a memory cgroup
experiences both local and global pressure simultaneously.



I think updating the global distance counter from local reclaim may update the
counter too much.
The above is to avoid updating the zone's counter and to keep the memcg's LRU
active/inactive ratio balanced.


Also, what if a memory cgroup is protected by memory.low? Such a cgroup
may have all its pages in the active list, because it is never scanned.


If the LRU is never scanned, all file caches tend to stay INACTIVE... they never
refault.


This will affect the refault distance of other cgroups, making
activations unpredictable.





Thanks,
-Kame









Re: [PATCH 0/3] Make workingset detection logic memcg aware

2015-08-06 Thread Kamezawa Hiroyuki

On 2015/08/06 17:59, Vladimir Davydov wrote:

On Wed, Aug 05, 2015 at 10:34:58AM +0900, Kamezawa Hiroyuki wrote:


Reading discussion, I feel storing more data is difficult, too.


Yep, even with the current 16-bit memcg id. Things would get even worse
if we wanted to extend it one day (will we?)



I wonder, rather than collecting more data, rough calculation can help the 
situation.
for example,

(refault_distance calculated in zone) * memcg_reclaim_ratio < memcg's 
active list

If one of per-zone calc or per-memcg calc returns true, refault should be true.

memcg_reclaim_ratio is the percentage of scan in a memcg against in a zone.


This particular formula wouldn't work I'm afraid. If there are two
isolated cgroups issuing local reclaim on the same zone, the refault
distance needed for activation would be reduced by half for no apparent
reason.


Hmm, you mean that activation in a memcg means activation in the global LRU, and
that's not a valid reason. The current implementation has the same issue, right?

I.e. when a container has been hitting its limit for a while and then a file cache
page is pushed out but comes back soon, it can easily be activated.

I'd like to confirm what you want to do.

 1) avoid activating a file cache when it was kicked out because of memcg's 
local limit.
 2) maintain active/inactive ratio in memcg properly as global LRU does.
 3) reclaim shadow entry at proper timing.

All of them? Hmm. It seems that mixing records of global memory pressure and of
local memory pressure is just wrong.

Now, the record is

eviction | node | zone | 2bit.

How about changing this as

0 |eviction | node | zone | 2bit
1 |eviction |  memcgid| 2bit

Assume each memcg has an eviction counter, which ignores node/zone,
i.e. memcg local reclaim happens against the memcg, not against a zone.

At page-in:
    if (the 1st bit is 0)
        compare the eviction counter with the zone's counter and activate the
        page if needed.
    else if (the 1st bit is 1)
        compare the eviction counter with the memcg's (if the memcg still exists)
        if (current memcg == recorded memcg && eviction distance is okay)
            activate the page.
        else
            inactivate it.

At page-out:
    if (global memory pressure)
        record the eviction id using the zone's counter.
    else if (memcg local memory pressure)
        record the eviction id using the memcg's counter.

By this,
  1) locally reclaimed pages cannot be activated unless they are refaulted in the
     same memcg. In this case, activating in the memcg has some meaning.

  2) At global memory pressure, the distance is properly calculated based on
     global system status; global memory pressure can ignore memcg behavior.

About the shadow entries, kmemcg should take care of them.
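
(A rough sketch of how such a record could be packed -- this is not the existing
mm/workingset.c code, and pack_shadow_memcg() and the exact bit positions are only
for illustration:)

/* type bit selects a zone-based vs. a memcg-based record; the low bits stay
 * reserved for the radix tree exceptional entry as today */
#define WORKINGSET_MEMCG_RECORD	1UL

static void *pack_shadow_memcg(unsigned short memcgid, unsigned long eviction)
{
	eviction = (eviction << 16) | memcgid;
	eviction = (eviction << 1) | WORKINGSET_MEMCG_RECORD;
	eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT);

	return (void *)(eviction | RADIX_TREE_EXCEPTIONAL_ENTRY);
}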


Thanks,
-Kame






Re: [PATCH v2 1/3] cgroup: define controller file conventions

2015-08-05 Thread Kamezawa Hiroyuki

On 2015/08/05 16:47, Michal Hocko wrote:

On Wed 05-08-15 09:39:40, KAMEZAWA Hiroyuki wrote:
[...]

so, for memory controller, we'll have


We currently have only current, low, high, max and events currently.
All other knobs are either deprecated or waiting for a usecase to emerge
before they get added.



Sure. I think the following have users.
 - *.stat - for checking the health of a cgroup, or for debugging
 - *.pressure_level   - for notifying memory pressure
 - *.swappiness   - for adjusting LRU activity per application type.
 - *.oom_control  - for surviving/notifying out of memory;
                    a memcg's OOM can be recovered by raising the limit rather
                    than killing.
                    But I know people say this knob is not useful; this will
                    require discussion.

Hm. If we don't want to increase the number of files, would netlink or a system call
be another choice for a subsystem-specific interface?

-Kame


(in alphabet order)
memory.failcnt
memory.force_empty  (<= should this be removed ?)
memory.kmem.failcnt
memory.kmem.max
memory.kmem.max_usage
memory.kmem.slabinfo
memory.kmem.tcp.failcnt
memory.kmem.tcp.max
memory.kmem.tcp.max_usage
memory.kmem.tcp.usage
memory.kmem.usage
memory.max
memory.max_usage
memory.move_charge_at_immigrate
memory.numa_stat
memory.oom_control
memory.pressure_level
memory.high
memory.swappiness
memory.usage
memory.use_hierarchy (<= removed)







Re: [PATCH 0/3] Make workingset detection logic memcg aware

2015-08-04 Thread Kamezawa Hiroyuki
On 2015/08/03 21:04, Vladimir Davydov wrote:
> Hi,
> 
> Currently, workingset detection logic is not memcg aware - inactive_age
> is maintained per zone. As a result, if memory cgroups are used,
> refaulted file pages are activated randomly. This patch set makes
> inactive_age per lruvec so that workingset detection will work correctly
> for memory cgroup reclaim.
> 
> Thanks,
> 

Reading the discussion, I feel storing more data is difficult, too.

I wonder whether, rather than collecting more data, a rough calculation can help the
situation. For example:

   (refault_distance calculated in the zone) * memcg_reclaim_ratio < memcg's
   active list

If either the per-zone calculation or the per-memcg calculation returns true, the
refault should be treated as a workingset hit.

memcg_reclaim_ratio is the ratio of pages scanned in the memcg to pages scanned in
the zone.
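
(In code, the check would be something like the sketch below; every name here is
made up, and the multiplication just avoids the division by zone_scanned:)

static bool memcg_refault_ok(unsigned long zone_refault_distance,
			     unsigned long memcg_scanned,
			     unsigned long zone_scanned,
			     unsigned long memcg_active_file)
{
	/*
	 * zone_refault_distance * (memcg_scanned / zone_scanned)
	 *		< memcg_active_file
	 * rewritten without the division.
	 */
	return zone_refault_distance * memcg_scanned <
	       memcg_active_file * zone_scanned;
}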


Thanks,
-Kame





Re: [PATCH v2 1/3] cgroup: define controller file conventions

2015-08-04 Thread Kamezawa Hiroyuki

On 2015/08/05 4:31, Tejun Heo wrote:

 From 6abc8ca19df0078de17dc38340db3002ed489ce7 Mon Sep 17 00:00:00 2001
From: Tejun Heo 
Date: Tue, 4 Aug 2015 15:20:55 -0400

Traditionally, each cgroup controller implemented whatever interface
it wanted leading to interfaces which are widely inconsistent.
Examining the requirements of the controllers readily yield that there
are only a few control schemes shared among all.

Two major controllers already had to implement new interface for the
unified hierarchy due to significant structural changes.  Let's take
the chance to establish common conventions throughout all controllers.

This patch defines CGROUP_WEIGHT_MIN/DFL/MAX to be used on all weight
based control knobs and documents the conventions that controllers
should follow on the unified hierarchy.  Except for io.weight knob,
all existing unified hierarchy knobs are already compliant.  A
follow-up patch will update io.weight.

v2: Added descriptions of min, low and high knobs.

Signed-off-by: Tejun Heo 
Acked-by: Johannes Weiner 
Cc: Li Zefan 
Cc: Peter Zijlstra 
---
Hello,

Added low/high descriptions and applied to the following git branch.

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git 
for-4.3-unified-base

The branch currently only contains this patch and will stay stable so
that it can be pulled from.  I kept the base weight as DFL for now.
If we decide to change it, I'll apply the change on top.

Thanks.

  Documentation/cgroups/unified-hierarchy.txt | 80 ++---
  include/linux/cgroup.h  |  9 
  2 files changed, 81 insertions(+), 8 deletions(-)

diff --git a/Documentation/cgroups/unified-hierarchy.txt 
b/Documentation/cgroups/unified-hierarchy.txt
index 86847a7..1ee9caf 100644
--- a/Documentation/cgroups/unified-hierarchy.txt
+++ b/Documentation/cgroups/unified-hierarchy.txt
@@ -23,10 +23,13 @@ CONTENTS
  5. Other Changes
5-1. [Un]populated Notification
5-2. Other Core Changes
-  5-3. Per-Controller Changes
-5-3-1. blkio
-5-3-2. cpuset
-5-3-3. memory
+  5-3. Controller File Conventions
+5-3-1. Format
+5-3-2. Control Knobs
+  5-4. Per-Controller Changes
+5-4-1. blkio
+5-4-2. cpuset
+5-4-3. memory
  6. Planned Changes
6-1. CAP for resource control

@@ -372,14 +375,75 @@ supported and the interface files "release_agent" and
  - The "cgroup.clone_children" file is removed.


-5-3. Per-Controller Changes
+5-3. Controller File Conventions

-5-3-1. blkio
+5-3-1. Format
+
+In general, all controller files should be in one of the following
+formats whenever possible.
+
+- Values only files
+
+  VAL0 VAL1...\n
+
+- Flat keyed files
+
+  KEY0 VAL0\n
+  KEY1 VAL1\n
+  ...
+
+- Nested keyed files
+
+  KEY0 SUB_KEY0=VAL00 SUB_KEY1=VAL01...
+  KEY1 SUB_KEY0=VAL10 SUB_KEY1=VAL11...
+  ...
+
+For a writeable file, the format for writing should generally match
+reading; however, controllers may allow omitting later fields or
+implement restricted shortcuts for most common use cases.
+
+For both flat and nested keyed files, only the values for a single key
+can be written at a time.  For nested keyed files, the sub key pairs
+may be specified in any order and not all pairs have to be specified.
+
+
+5-3-2. Control Knobs
+
+- Settings for a single feature should generally be implemented in a
+  single file.
+
+- In general, the root cgroup should be exempt from resource control
+  and thus shouldn't have resource control knobs.
+
+- If a controller implements ratio based resource distribution, the
+  control knob should be named "weight" and have the range [1, 10000]
+  and 100 should be the default value.  The values are chosen to allow
+  enough and symmetric bias in both directions while keeping it
+  intuitive (the default is 100%).
+
+- If a controller implements an absolute resource guarantee and/or
+  limit, the control knobs should be named "min" and "max"
+  respectively.  If a controller implements best effort resource
+  gurantee and/or limit, the control knobs should be named "low" and
+  "high" respectively.
+
+  In the above four control files, the special token "max" should be
+  used to represent upward infinity for both reading and writing.
+

so, for memory controller, we'll have

(in alphabet order)
memory.failcnt
memory.force_empty  (<= should this be removed ?)
memory.kmem.failcnt
memory.kmem.max
memory.kmem.max_usage
memory.kmem.slabinfo
memory.kmem.tcp.failcnt
memory.kmem.tcp.max
memory.kmem.tcp.max_usage
memory.kmem.tcp.usage
memory.kmem.usage
memory.max
memory.max_usage
memory.move_charge_at_immigrate
memory.numa_stat
memory.oom_control
memory.pressure_level
memory.high
memory.swappiness
memory.usage
memory.use_hierarchy (<= removed)

?
-Kame



Re: [RFC v2 PATCH 2/8] mm: introduce MIGRATE_MIRROR to manage the mirrored pages

2015-06-30 Thread Kamezawa Hiroyuki

On 2015/06/30 11:45, Xishi Qiu wrote:

On 2015/6/29 15:32, Kamezawa Hiroyuki wrote:


On 2015/06/27 11:24, Xishi Qiu wrote:

This patch introduces a new migratetype called "MIGRATE_MIRROR", it is used to
allocate mirrored pages.
When cat /proc/pagetypeinfo, you can see the count of free mirrored blocks.

Signed-off-by: Xishi Qiu 


My fear about this approach is that it may break something existing.

Now, when we add the MIGRATE_MIRROR type, we'll hide the pageblocks' attributes
MIGRATE_UNMOVABLE, MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE.

Logically, the MIRROR attribute is independent of page mobility, and this overwrite
will lose some information.

Then,


---
   include/linux/mmzone.h | 9 +
   mm/page_alloc.c| 3 +++
   mm/vmstat.c| 3 +++
   3 files changed, 15 insertions(+)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 54d74f6..54e891a 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -39,6 +39,9 @@ enum {
   MIGRATE_UNMOVABLE,
   MIGRATE_RECLAIMABLE,
   MIGRATE_MOVABLE,
+#ifdef CONFIG_MEMORY_MIRROR
+MIGRATE_MIRROR,
+#endif


I think
 MIGRATE_MIRROR_UNMOVABLE,
 MIGRATE_MIRROR_RECLAIMABLE,
 MIGRATE_MIRROR_MOVABLE, <== adding this may need discuss.
 MIGRATE_MIRROR_RESERVED,<== reserved pages should be 
maintained per mirrored/unmirrored.



Hi Kame,

You mean add 3 or 4 new migratetype?



 Yes. But please check what NR_MIGRATETYPE_BITS will become.
I think this will not have a big impact on x86-64.


should be added with the following fallback list.

/*
  * MIRROR page range is defined by firmware at boot. The range is limited
  * and is used only for kernel memory mirroring.
  */
[MIGRATE_UNMOVABLE_MIRROR]   = {MIGRATE_RECLAIMABLE_MIRROR, MIGRATE_RESERVE}
[MIGRATE_RECLAIMABLE_MIRROR] = {MIGRATE_UNMOVABLE_MIRROR, MIGRATE_RESERVE}



Why not like this:
{MIGRATE_RECLAIMABLE_MIRROR, MIGRATE_MIRROR_RESERVED, MIGRATE_RESERVE}



 My mistake.
[MIGRATE_UNMOVABLE_MIRROR]   = {MIGRATE_RECLAIMABLE_MIRROR, MIGRATE_RESERVE_MIRROR}
[MIGRATE_RECLAIMABLE_MIRROR] = {MIGRATE_UNMOVABLE_MIRROR, MIGRATE_RESERVE_MIRROR}

was my intention. This means mirrored memory and unmirrored memory are separated
completely.

But this should affect kswapd and other memory reclaim logic.

For example, kswapd stops when free pages are above the high watermark, but the cases
where mirrored or unmirrored pages are exhausted are not handled in this series.
You need some extra checks in the memory reclaim logic if you go with the migratetype
approach.




Then we'll not lose the original "Reclaimable Pages" information.

One problem here is whether we should have MIGRATE_RESERVE_MIRROR.

If we never allow users to allocate mirrored memory, we should have
MIGRATE_RESERVE_MIRROR.
But it seems to require much more code change to do that.

Creating a zone or adding an attribute to zones is another design choice.



If we add a new zone, the mirror_zone will span the others; I worry that this
may cause problems.


Yes, that's a problem. And the zone id bits are a very limited resource.
(But the memory reclaim logic could stay unchanged.)

Anyway, I'd like to see your solution with the above changes first, rather than adding
zones.

Thanks,
-Kame




Re: [RFC v2 PATCH 7/8] mm: add the buddy system interface

2015-06-29 Thread Kamezawa Hiroyuki

On 2015/06/30 10:31, Xishi Qiu wrote:

On 2015/6/30 9:01, Kamezawa Hiroyuki wrote:


On 2015/06/30 8:11, Luck, Tony wrote:

@@ -814,7 +814,7 @@ int __init_memblock memblock_clear_hotplug(phys_addr_t 
base, phys_addr_t size)
*/
   int __init_memblock memblock_mark_mirror(phys_addr_t base, phys_addr_t size)
   {
-system_has_some_mirror = true;
+static_key_slow_inc(&system_has_mirror);

   return memblock_setclr_flag(base, size, 1, MEMBLOCK_MIRROR);
   }


This generates some WARN_ON noise when called from efi_find_mirror():



It seems jump_label_init() is called after memory initialization
(init/main.c::start_kernel()).
So, it may be difficult to use the static_key functions for our purpose, because
kernel memory allocation may occur before jump_label is ready.

Thanks,
-Kame



Hi Kame,

How about this: use a static bool in bootmem, and use a jump label in the buddy
system.
This means we use two variables to do it.



I think it can be done, but it should be done in a separate patch with enough
comments/changelog.
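
Something like the following, as a sketch only -- mirror_static_key_init() is a
made-up hook that would have to run after jump_label_init():

/* set from memblock_mark_mirror(), safe before jump_label_init() */
static bool system_has_some_mirror __read_mostly;

/* used on the allocator fast path once jump labels work */
static struct static_key system_has_mirror = STATIC_KEY_INIT_FALSE;

void __init mirror_static_key_init(void)
{
	if (system_has_some_mirror)
		static_key_slow_inc(&system_has_mirror);
}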

Thanks,
-Kame





Re: [RFC v2 PATCH 7/8] mm: add the buddy system interface

2015-06-29 Thread Kamezawa Hiroyuki

On 2015/06/30 8:11, Luck, Tony wrote:

@@ -814,7 +814,7 @@ int __init_memblock memblock_clear_hotplug(phys_addr_t 
base, phys_addr_t size)
   */
  int __init_memblock memblock_mark_mirror(phys_addr_t base, phys_addr_t size)
  {
-   system_has_some_mirror = true;
+   static_key_slow_inc(&system_has_mirror);

return memblock_setclr_flag(base, size, 1, MEMBLOCK_MIRROR);
  }


This generates some WARN_ON noise when called from efi_find_mirror():



It seems jump_label_init() is called after memory initialization
(init/main.c::start_kernel()).
So, it may be difficult to use the static_key functions for our purpose, because
kernel memory allocation may occur before jump_label is ready.

Thanks,
-Kame


[0.00] e820: last_pfn = 0x7b800 max_arch_pfn = 0x4
[0.00] [ cut here ]
[0.00] WARNING: CPU: 0 PID: 0 at kernel/jump_label.c:61 
static_key_slow_inc+0x57/0xc0()
[0.00] static_key_slow_inc used before call to jump_label_init
[0.00] Modules linked in:

[0.00] CPU: 0 PID: 0 Comm: swapper Not tainted 4.1.0 #4
[0.00] Hardware name: Intel Corporation BRICKLAND/BRICKLAND, BIOS 
BRHSXSD1.86B.0065.R01.1505011640 05/01/2015
[0.00]   ee366a8dff38f745 81997d68 
816683b4
[0.00]   81997dc0 81997da8 
8107b0aa
[0.00]  81d48822 81f281a0 4000 
001fcb7a4000
[0.00] Call Trace:
[0.00]  [] dump_stack+0x45/0x57
[0.00]  [] warn_slowpath_common+0x8a/0xc0
[0.00]  [] warn_slowpath_fmt+0x55/0x70
[0.00]  [] ? memblock_add_range+0x175/0x19e
[0.00]  [] static_key_slow_inc+0x57/0xc0
[0.00]  [] memblock_mark_mirror+0x19/0x33
[0.00]  [] efi_find_mirror+0x59/0xdd
[0.00]  [] setup_arch+0x642/0xccf
[0.00]  [] ? early_idt_handler_array+0x120/0x120
[0.00]  [] ? printk+0x55/0x6b
[0.00]  [] ? early_idt_handler_array+0x120/0x120
[0.00]  [] start_kernel+0xe8/0x4eb
[0.00]  [] ? early_idt_handler_array+0x120/0x120
[0.00]  [] ? early_idt_handler_array+0x120/0x120
[0.00]  [] x86_64_start_reservations+0x2a/0x2c
[0.00]  [] x86_64_start_kernel+0x14c/0x16f
[0.00] ---[ end trace baa7fa0514e3bc58 ]---
[0.00] [ cut here ]










Re: [RFC v2 PATCH 4/8] mm: add mirrored memory to buddy system

2015-06-29 Thread Kamezawa Hiroyuki

On 2015/06/27 11:25, Xishi Qiu wrote:

Before free bootmem, set mirrored pageblock's migratetype to MIGRATE_MIRROR, so
they could free to buddy system's MIGRATE_MIRROR list.
When set reserved memory, skip the mirrored memory.

Signed-off-by: Xishi Qiu 
---
  include/linux/memblock.h |  3 +++
  mm/memblock.c| 21 +
  mm/nobootmem.c   |  3 +++
  mm/page_alloc.c  |  3 +++
  4 files changed, 30 insertions(+)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 97f71ca..53be030 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -81,6 +81,9 @@ int memblock_mark_hotplug(phys_addr_t base, phys_addr_t size);
  int memblock_clear_hotplug(phys_addr_t base, phys_addr_t size);
  int memblock_mark_mirror(phys_addr_t base, phys_addr_t size);
  ulong choose_memblock_flags(void);
+#ifdef CONFIG_MEMORY_MIRROR
+void memblock_mark_migratemirror(void);
+#endif

  /* Low level functions */
  int memblock_add_range(struct memblock_type *type,
diff --git a/mm/memblock.c b/mm/memblock.c
index 7612876..0d0b210 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -19,6 +19,7 @@
  #include 
  #include 
  #include 
+#include 

  #include 
  #include 
@@ -818,6 +819,26 @@ int __init_memblock memblock_mark_mirror(phys_addr_t base, 
phys_addr_t size)
return memblock_setclr_flag(base, size, 1, MEMBLOCK_MIRROR);
  }

+#ifdef CONFIG_MEMORY_MIRROR
+void __init_memblock memblock_mark_migratemirror(void)
+{
+   unsigned long start_pfn, end_pfn, pfn;
+   int i, node;
+   struct page *page;
+
+   printk(KERN_DEBUG "Mirrored memory:\n");
+   for_each_mirror_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn,
+   &node) {
+   printk(KERN_DEBUG "  node %3d: [mem %#010llx-%#010llx]\n",
+   node, PFN_PHYS(start_pfn), PFN_PHYS(end_pfn) - 1);
+   for (pfn = start_pfn; pfn < end_pfn;
+   pfn += pageblock_nr_pages) {
+   page = pfn_to_page(pfn);
+   set_pageblock_migratetype(page, MIGRATE_MIRROR);
+   }
+   }
+}
+#endif

  /**
   * __next__mem_range - next function for for_each_free_mem_range() etc.
diff --git a/mm/nobootmem.c b/mm/nobootmem.c
index 5258386..31aa6d4 100644
--- a/mm/nobootmem.c
+++ b/mm/nobootmem.c
@@ -129,6 +129,9 @@ static unsigned long __init free_low_memory_core_early(void)
u64 i;

memblock_clear_hotplug(0, -1);
+#ifdef CONFIG_MEMORY_MIRROR
+   memblock_mark_migratemirror();
+#endif

for_each_free_mem_range(i, NUMA_NO_NODE, MEMBLOCK_NONE, &start, &end,
NULL)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6e4d79f..aea78a5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4118,6 +4118,9 @@ static void setup_zone_migrate_reserve(struct zone *zone)

block_migratetype = get_pageblock_migratetype(page);

+   if (is_migrate_mirror(block_migratetype))
+   continue;
+


If the mirrored area does not have reserved memory, this will break the page
allocator's logic.

I think both the mirrored and the unmirrored ranges should have a reserved area.

Thanks,
-Kame


/* Only test what is necessary when the reserves are not met */
if (reserve > 0) {
/*






Re: [RFC v2 PATCH 2/8] mm: introduce MIGRATE_MIRROR to manage the mirrored pages

2015-06-29 Thread Kamezawa Hiroyuki

On 2015/06/27 11:24, Xishi Qiu wrote:

This patch introduces a new migratetype called "MIGRATE_MIRROR", it is used to
allocate mirrored pages.
When cat /proc/pagetypeinfo, you can see the count of free mirrored blocks.

Signed-off-by: Xishi Qiu 


My fear about this approach is that it may break something existing.

Now, when we add the MIGRATE_MIRROR type, we hide the pageblocks' original attributes:
MIGRATE_UNMOVABLE, MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE.

Logically, the MIRROR attribute is independent of page mobility, and this overwrite
loses some of that information.

Then,


---
  include/linux/mmzone.h | 9 +
  mm/page_alloc.c| 3 +++
  mm/vmstat.c| 3 +++
  3 files changed, 15 insertions(+)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 54d74f6..54e891a 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -39,6 +39,9 @@ enum {
MIGRATE_UNMOVABLE,
MIGRATE_RECLAIMABLE,
MIGRATE_MOVABLE,
+#ifdef CONFIG_MEMORY_MIRROR
+   MIGRATE_MIRROR,
+#endif


I think
MIGRATE_MIRROR_UNMOVABLE,
MIGRATE_MIRROR_RECLAIMABLE,
MIGRATE_MIRROR_MOVABLE,     <== adding this may need discussion.
MIGRATE_MIRROR_RESERVED,    <== reserved pages should be maintained per mirrored/unmirrored.

should be added with the following fallback list.

/*
 * MIRROR page range is defined by firmware at boot. The range is limited
 * and is used only for kernel memory mirroring.
 */
[MIGRATE_UNMOVABLE_MIRROR]   = {MIGRATE_RECLAIMABLE_MIRROR, MIGRATE_RESERVE}
[MIGRATE_RECLAIMABLE_MIRROR] = {MIGRATE_UNMOVABLE_MIRROR, MIGRATE_RESERVE}

Then, we'll not lose the original information of "Reclaimable Pages".
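
(To make that concrete, here is a rough, hedged sketch of how the fallback table in
mm/page_alloc.c might be extended; the *_MIRROR entries and the exact ordering are
assumptions drawn from this comment, not tested code. MIGRATE_RESERVE acts as the
terminator of each list, as in the existing table.)

==
/* Sketch only: mirrored types never fall back into unmirrored pageblocks. */
static int fallbacks[MIGRATE_TYPES][4] = {
	/* existing entries, abbreviated (CMA/ISOLATE omitted) */
	[MIGRATE_UNMOVABLE]          = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE, MIGRATE_RESERVE },
	[MIGRATE_RECLAIMABLE]        = { MIGRATE_UNMOVABLE,   MIGRATE_MOVABLE, MIGRATE_RESERVE },
	[MIGRATE_MOVABLE]            = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, MIGRATE_RESERVE },
	/* new mirrored entries */
	[MIGRATE_UNMOVABLE_MIRROR]   = { MIGRATE_RECLAIMABLE_MIRROR, MIGRATE_RESERVE },
	[MIGRATE_RECLAIMABLE_MIRROR] = { MIGRATE_UNMOVABLE_MIRROR,   MIGRATE_RESERVE },
};
==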

One problem here is whether we should have MIGRATE_RESERVE_MIRROR.

If we never allow users to allocate mirrored memory, we should have
MIGRATE_RESERVE_MIRROR, but it seems to require much more code change to do that.

Creating a zone, or adding an attribute to zones, is another design choice.

Anyway, your patch doesn't take care of the reserved memory calculation at this
point. Please check setup_zone_migrate_reserve(); that will be a problem.

Thanks,
-Kame


MIGRATE_PCPTYPES,   /* the number of types on the pcp lists */
MIGRATE_RESERVE = MIGRATE_PCPTYPES,
  #ifdef CONFIG_CMA
@@ -69,6 +72,12 @@ enum {
  #  define is_migrate_cma(migratetype) false
  #endif

+#ifdef CONFIG_MEMORY_MIRROR
+#  define is_migrate_mirror(migratetype) unlikely((migratetype) == 
MIGRATE_MIRROR)
+#else
+#  define is_migrate_mirror(migratetype) false
+#endif
+
  #define for_each_migratetype_order(order, type) \
for (order = 0; order < MAX_ORDER; order++) \
for (type = 0; type < MIGRATE_TYPES; type++)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ebffa0e..6e4d79f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3216,6 +3216,9 @@ static void show_migration_types(unsigned char type)
[MIGRATE_UNMOVABLE] = 'U',
[MIGRATE_RECLAIMABLE]   = 'E',
[MIGRATE_MOVABLE]   = 'M',
+#ifdef CONFIG_MEMORY_MIRROR
+   [MIGRATE_MIRROR]= 'O',
+#endif
[MIGRATE_RESERVE]   = 'R',
  #ifdef CONFIG_CMA
[MIGRATE_CMA]   = 'C',
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 4f5cd97..d0323e0 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -901,6 +901,9 @@ static char * const migratetype_names[MIGRATE_TYPES] = {
"Unmovable",
"Reclaimable",
"Movable",
+#ifdef CONFIG_MEMORY_MIRROR
+   "Mirror",
+#endif
"Reserve",
  #ifdef CONFIG_CMA
"CMA",






Re: [RFC v2 PATCH 1/8] mm: add a new config to manage the code

2015-06-28 Thread Kamezawa Hiroyuki

On 2015/06/27 11:23, Xishi Qiu wrote:

This patch introduces a new config called "CONFIG_ACPI_MIRROR_MEMORY", set it

 CONFIG_MEMORY_MIRROR

off by default.

Signed-off-by: Xishi Qiu 
---
  mm/Kconfig | 8 
  1 file changed, 8 insertions(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index 390214d..c40bb8b 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -200,6 +200,14 @@ config MEMORY_HOTREMOVE
depends on MEMORY_HOTPLUG && ARCH_ENABLE_MEMORY_HOTREMOVE
depends on MIGRATION

+config MEMORY_MIRROR


  In the following patches, you use CONFIG_MEMORY_MIRROR.

I think the name is too generic; besides, it depends on ACPI.
But I'm not sure whether address-based memory mirroring is planned on other platforms.

So, hmm. How about dividing the config into 2 parts like the attached? (just an
example)

Thanks,
-Kame
>From 88213b0f76e2f603c5a38690cbd85a4df1e646ba Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki 
Date: Mon, 29 Jun 2015 15:35:47 +0900
Subject: [PATCH] add a new config option for memory mirror

Add a new config option "CONFIG_MEMORY_MIRROR" for kernel assisted
memory mirroring.

The UEFI 2.5 spec defines address-based memory mirror, which allows
the system to create a partial memory mirror.

The feature ensures that important (kernel) memory is mirrored by using
the address-based memory mirror.

Now this depends on the cpu architecture: Haswell? Broadwell?
---
 arch/x86/Kconfig | 6 ++
 mm/Kconfig   | 9 +
 2 files changed, 15 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index e33e01b..56f17df 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -596,6 +596,12 @@ config X86_SUPPORTS_MEMORY_FAILURE
depends on X86_64 || !SPARSEMEM
select ARCH_SUPPORTS_MEMORY_FAILURE
 
+config X86_SUPPORTS_MEMORY_MIRROR
+   def_bool y
+   # UEFI 2.5spec. address based memory mirror, supported only after XXX
+   depends on X86_64 && ARCH_SUPPORTS_MEMORY_FAILURE
+   select ARCH_MEMORY_MIRROR
+
 config STA2X11
bool "STA2X11 Companion Chip Support"
depends on X86_32_NON_STANDARD && PCI
diff --git a/mm/Kconfig b/mm/Kconfig
index b3a60ee..e14dc2d 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -200,6 +200,15 @@ config MEMORY_HOTREMOVE
depends on MEMORY_HOTPLUG && ARCH_ENABLE_MEMORY_HOTREMOVE
depends on MIGRATION
 
+config MEMORY_MIRROR
+   bool "Address range mirroring support"
+   depends on ARCH_MEMORY_MIRROR
+   default n
+   help
+ This feature allows the kernel to assist address based memory
+ mirror supported by architecture/firmware. And place some types
+  of memory (especially, kernel memory) placed into mirrored range.
+
 #
 # If we have space for more page flags then we can enable additional
 # optimizations and functionality.
-- 
1.9.3



Re: [RFC PATCH 10/12] mm: add the buddy system interface

2015-06-26 Thread Kamezawa Hiroyuki

On 2015/06/26 10:43, Xishi Qiu wrote:

On 2015/6/26 7:54, Kamezawa Hiroyuki wrote:


On 2015/06/25 18:44, Xishi Qiu wrote:

On 2015/6/10 11:06, Kamezawa Hiroyuki wrote:


On 2015/06/09 19:04, Xishi Qiu wrote:

On 2015/6/9 15:12, Kamezawa Hiroyuki wrote:


On 2015/06/04 22:04, Xishi Qiu wrote:

Add the buddy system interface for address range mirroring feature.
Allocate mirrored pages in MIGRATE_MIRROR list. If there is no mirrored pages
left, use other types pages.

Signed-off-by: Xishi Qiu 
---
 mm/page_alloc.c | 40 +++-
 1 file changed, 39 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d4d2066..0fb55288 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -599,6 +599,26 @@ static inline bool is_mirror_pfn(unsigned long pfn)

 return false;
 }
+
+static inline bool change_to_mirror(gfp_t gfp_flags, int high_zoneidx)
+{
+/*
+ * Do not alloc mirrored memory below 4G, because 0-4G is
+ * all mirrored by default, and the list is always empty.
+ */
+if (high_zoneidx < ZONE_NORMAL)
+return false;
+
+/* Alloc mirrored memory for only kernel */
+if (gfp_flags & __GFP_MIRROR)
+return true;


GFP_KERNEL itself should imply mirror, I think.



Hi Kame,

How about like this: #define GFP_KERNEL (__GFP_WAIT | __GFP_IO | __GFP_FS | 
__GFP_MIRROR) ?



Hm it cannot cover GFP_ATOMIC at el.

I guess, mirrored memory should be allocated if !__GFP_HIGHMEM or !__GFP_MOVABLE



Hi Kame,

Can we distinguish allocations form user or kernel only by GFP flags?



Allocation from user and file caches are now *always* done with __GFP_MOVABLE.

By this, pages will be allocated from MIGRATE_MOVABLE migration type.
MOVABLE migration type means it's can
be the target for page compaction or memory-hot-remove.

Thanks,
-Kame



So if we want all kernel memory allocated from mirror, how about change like 
this?
__alloc_pages_nodemask()
   gfpflags_to_migratetype()
 if (!(gfp_mask & __GFP_MOVABLE))
return MIGRATE_MIRROR


Maybe using a jump label can reduce the performance impact.
==
static inline bool memory_mirror_enabled(void)
{
return static_key_false(&memory_mirror_enabled);
}



gfpflags_to_migratetype()
  if (memory_mirror_enabled()) { /* We want to mirror all unmovable pages */
  if (!(gfp_mask & __GFP_MOVABLE))
   return MIGRATE_MIRROR
  }
==
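
(A slightly fuller, hedged sketch of this jump-label idea; the key name, the boot-time
enable hook, and the simplified tail of gfpflags_to_migratetype() are assumptions, not
merged kernel code.)

==
#include <linux/jump_label.h>
#include <linux/init.h>
#include <linux/gfp.h>

static struct static_key memory_mirror_enabled = STATIC_KEY_INIT_FALSE;

/* Called once at boot, e.g. when the firmware reports a mirrored range. */
static void __init enable_memory_mirror(void)
{
	static_key_slow_inc(&memory_mirror_enabled);
}

static int gfpflags_to_migratetype(const gfp_t gfp_flags)
{
	/* With mirroring enabled, every non-movable allocation goes to MIRROR. */
	if (static_key_false(&memory_mirror_enabled) &&
	    !(gfp_flags & __GFP_MOVABLE))
		return MIGRATE_MIRROR;

	/* ... existing decoding into UNMOVABLE/RECLAIMABLE/MOVABLE ... */
	return MIGRATE_UNMOVABLE;
}
==

If the key is never enabled, static_key_false() compiles to a patched-out branch, so
systems without mirrored memory should see almost no extra cost on this hot path.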

BTW, I think the current memory compaction code scans ranges of the MOVABLE migrate
type. So, if you use a migration type other than MOVABLE for user pages, you may see
page fragmentation. If you want to expand MIRROR to user pages, please check
mm/compaction.c.


Thanks,
-Kame






Re: [RFC PATCH 10/12] mm: add the buddy system interface

2015-06-25 Thread Kamezawa Hiroyuki

On 2015/06/25 18:44, Xishi Qiu wrote:

On 2015/6/10 11:06, Kamezawa Hiroyuki wrote:


On 2015/06/09 19:04, Xishi Qiu wrote:

On 2015/6/9 15:12, Kamezawa Hiroyuki wrote:


On 2015/06/04 22:04, Xishi Qiu wrote:

Add the buddy system interface for address range mirroring feature.
Allocate mirrored pages in MIGRATE_MIRROR list. If there is no mirrored pages
left, use other types pages.

Signed-off-by: Xishi Qiu 
---
mm/page_alloc.c | 40 +++-
1 file changed, 39 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d4d2066..0fb55288 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -599,6 +599,26 @@ static inline bool is_mirror_pfn(unsigned long pfn)

return false;
}
+
+static inline bool change_to_mirror(gfp_t gfp_flags, int high_zoneidx)
+{
+/*
+ * Do not alloc mirrored memory below 4G, because 0-4G is
+ * all mirrored by default, and the list is always empty.
+ */
+if (high_zoneidx < ZONE_NORMAL)
+return false;
+
+/* Alloc mirrored memory for only kernel */
+if (gfp_flags & __GFP_MIRROR)
+return true;


GFP_KERNEL itself should imply mirror, I think.



Hi Kame,

How about like this: #define GFP_KERNEL (__GFP_WAIT | __GFP_IO | __GFP_FS | 
__GFP_MIRROR) ?



Hm it cannot cover GFP_ATOMIC at el.

I guess, mirrored memory should be allocated if !__GFP_HIGHMEM or !__GFP_MOVABLE



Hi Kame,

Can we distinguish allocations form user or kernel only by GFP flags?



Allocations from user and file caches are now *always* done with __GFP_MOVABLE.

By this, pages will be allocated from the MIGRATE_MOVABLE migration type.
The MOVABLE migration type means the pages can
be the target of page compaction or memory hot-remove.

Thanks,
-Kame









Re: [RFC PATCH 10/12] mm: add the buddy system interface

2015-06-15 Thread Kamezawa Hiroyuki

On 2015/06/16 2:20, Luck, Tony wrote:

On Mon, Jun 15, 2015 at 05:47:27PM +0900, Kamezawa Hiroyuki wrote:

So, there are 3 ideas.

  (1) kernel only from MIRROR / user only from MOVABLE (Tony)
  (2) kernel only from MIRROR / user from MOVABLE + MIRROR(ASAP)  (AKPM 
suggested)
  This makes use of the fact MOVABLE memory is reclaimable but Tony pointed 
out
  the memory reclaim can be critical for GFP_ATOMIC.
  (3) kernel only from MIRROR / user from MOVABLE, special user from MIRROR 
(Xishi)

2 Implementation ideas.
   - creating ZONE
   - creating new alloation attribute

I don't convince whether we need some new structure in mm. Isn't it good to use
ZONE_MOVABLE for not-mirrored memory ?
Then, disable fallback from ZONE_MOVABLE -> ZONE_NORMAL for (1) and (3)


We might need to rename it ... right now the memory hotplug
people use ZONE_MOVABLE to indicate regions of physical memory
that can be removed from the system.  I'm wondering whether
people will want systems that have both removable and mirrored
areas?  Then we have four attribute combinations:

mirror=no  removable=no  - prefer to use for user, could use for kernel if we 
run out of mirror
mirror=no  removable=yes - can only be used for user (kernel allocation makes 
it not-removable)
mirror=yes removable=no  - use for kernel, possibly for special users if we 
define some interface
mirror=yes removable=yes - must not use for kernel ... would have to give to 
user ... seems like a bad idea to configure a system this way



Thank you for the clarification. I see that the "mirror=no, removable=no" case may require a
new name.

IMHO, the value of address-based memory mirror is that users can protect their
system's important functions without using a full-memory mirror. So, I feel that
considering "mirror=no, removable=no" just makes our discussion/implementation complex
without real user value.

Shouldn't we start with just thinking about the 2 cases of
 mirror=no  removable=yes
 mirror=yes removable=no
?

And then, if the naming is a problem, an alias name can be added.

Thanks,
-Kame








Re: [RFC PATCH 10/12] mm: add the buddy system interface

2015-06-15 Thread Kamezawa Hiroyuki

On 2015/06/11 5:40, Luck, Tony wrote:

I guess, mirrored memory should be allocated if !__GFP_HIGHMEM or !__GFP_MOVABLE


HIGHMEM shouldn't matter - partial memory mirror only makes any sense on X86_64 
systems ... 32-bit kernels
don't even boot on systems with 64GB, and the minimum rational configuration 
for a machine that supports
mirror is 128GB (4 cpu sockets * 2 memory controller per socket * 4 channels 
per controller * 4GB DIMM ...
leaving any channels empty likely leaves you short of memory bandwidth for 
these high core count processors).

MOVABLE is mostly the opposite of MIRROR - we never want to fill a kernel 
allocation from a MOVABLE page. I
want all kernel allocations to be from MIRROR.



So, there are 3 ideas.

 (1) kernel only from MIRROR / user only from MOVABLE (Tony)
 (2) kernel only from MIRROR / user from MOVABLE + MIRROR (ASAP)  (AKPM suggested)
     This makes use of the fact that MOVABLE memory is reclaimable, but Tony pointed out
     that memory reclaim can be critical for GFP_ATOMIC.
 (3) kernel only from MIRROR / user from MOVABLE, special user from MIRROR (Xishi)

2 implementation ideas:
  - creating a ZONE
  - creating a new allocation attribute

I'm not convinced that we need some new structure in mm. Isn't it good to use
ZONE_MOVABLE for not-mirrored memory?
Then, disable fallback from ZONE_MOVABLE -> ZONE_NORMAL for (1) and (3).

Thanks,
-Kame




Re: [RFC PATCH 08/12] mm: use mirrorable to switch allocate mirrored memory

2015-06-09 Thread Kamezawa Hiroyuki

On 2015/06/09 19:09, Xishi Qiu wrote:

On 2015/6/9 15:06, Kamezawa Hiroyuki wrote:


On 2015/06/04 22:02, Xishi Qiu wrote:

Add a new interface in path /proc/sys/vm/mirrorable. When set to 1, it means
we should allocate mirrored memory for both user and kernel processes.

Signed-off-by: Xishi Qiu 


I can't see why do we need this switch. If this is set, all GFP_HIGHUSER will 
use
mirrored memory ?

Or will you add special MMAP/madvise flag to use mirrored memory ?



Hi Kame,

Yes,

MMAP/madvise
-> add VM_MIRROR
-> add GFP_MIRROR
-> use MIGRATE_MIRROR list to alloc mirrored pages

So user can use mirrored memory. What do you think?



I see. Please explain it (your final plan) in the patch description or in the cover
letter of the patch series.

Thanks,
-Kame




Re: [RFC PATCH 01/12] mm: add a new config to manage the code

2015-06-09 Thread Kamezawa Hiroyuki

On 2015/06/09 19:10, Xishi Qiu wrote:

On 2015/6/9 14:44, Kamezawa Hiroyuki wrote:


On 2015/06/04 21:56, Xishi Qiu wrote:

This patch introduces a new config called "CONFIG_ACPI_MIRROR_MEMORY", it is
used to on/off the feature.

Signed-off-by: Xishi Qiu 
---
   mm/Kconfig | 8 
   1 file changed, 8 insertions(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index 390214d..4f2a726 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -200,6 +200,14 @@ config MEMORY_HOTREMOVE
   depends on MEMORY_HOTPLUG && ARCH_ENABLE_MEMORY_HOTREMOVE
   depends on MIGRATION

+config MEMORY_MIRROR
+bool "Address range mirroring support"
+depends on X86 && NUMA
+default y
+help
+  This feature depends on hardware and firmware support.
+  ACPI or EFI records the mirror info.


default y...no runtime influence when the user doesn't use memory mirror ?



It is a new feature, so how about like this: default y -> n?



It's okay with me. But it's better to check the performance impact before merging,
because you modified core memory management code.

Thanks,
-Kame
 





Re: [RFC PATCH 10/12] mm: add the buddy system interface

2015-06-09 Thread Kamezawa Hiroyuki

On 2015/06/09 19:04, Xishi Qiu wrote:

On 2015/6/9 15:12, Kamezawa Hiroyuki wrote:


On 2015/06/04 22:04, Xishi Qiu wrote:

Add the buddy system interface for address range mirroring feature.
Allocate mirrored pages in MIGRATE_MIRROR list. If there is no mirrored pages
left, use other types pages.

Signed-off-by: Xishi Qiu 
---
   mm/page_alloc.c | 40 +++-
   1 file changed, 39 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d4d2066..0fb55288 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -599,6 +599,26 @@ static inline bool is_mirror_pfn(unsigned long pfn)

   return false;
   }
+
+static inline bool change_to_mirror(gfp_t gfp_flags, int high_zoneidx)
+{
+/*
+ * Do not alloc mirrored memory below 4G, because 0-4G is
+ * all mirrored by default, and the list is always empty.
+ */
+if (high_zoneidx < ZONE_NORMAL)
+return false;
+
+/* Alloc mirrored memory for only kernel */
+if (gfp_flags & __GFP_MIRROR)
+return true;


GFP_KERNEL itself should imply mirror, I think.



Hi Kame,

How about like this: #define GFP_KERNEL (__GFP_WAIT | __GFP_IO | __GFP_FS | 
__GFP_MIRROR) ?



Hmm, it cannot cover GFP_ATOMIC et al.

I guess mirrored memory should be allocated if !__GFP_HIGHMEM or !__GFP_MOVABLE.

Thanks,
-Kame





Re: [RFC PATCH 10/12] mm: add the buddy system interface

2015-06-09 Thread Kamezawa Hiroyuki

On 2015/06/04 22:04, Xishi Qiu wrote:

Add the buddy system interface for address range mirroring feature.
Allocate mirrored pages in MIGRATE_MIRROR list. If there is no mirrored pages
left, use other types pages.

Signed-off-by: Xishi Qiu 
---
  mm/page_alloc.c | 40 +++-
  1 file changed, 39 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d4d2066..0fb55288 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -599,6 +599,26 @@ static inline bool is_mirror_pfn(unsigned long pfn)

return false;
  }
+
+static inline bool change_to_mirror(gfp_t gfp_flags, int high_zoneidx)
+{
+   /*
+* Do not alloc mirrored memory below 4G, because 0-4G is
+* all mirrored by default, and the list is always empty.
+*/
+   if (high_zoneidx < ZONE_NORMAL)
+   return false;
+
+   /* Alloc mirrored memory for only kernel */
+   if (gfp_flags & __GFP_MIRROR)
+   return true;


GFP_KERNEL itself should imply mirror, I think.


+
+   /* Alloc mirrored memory for both user and kernel */
+   if (sysctl_mirrorable)
+   return true;


Reading this, I think this sysctl is not good. The user cannot know what is 
mirrored
because memory may not be mirrored until the sysctl is set.

Thanks,
-Kame



+
+   return false;
+}
  #endif

  /*
@@ -1796,7 +1816,10 @@ struct page *buffered_rmqueue(struct zone 
*preferred_zone,
WARN_ON_ONCE(order > 1);
}
spin_lock_irqsave(&zone->lock, flags);
-   page = __rmqueue(zone, order, migratetype);
+   if (is_migrate_mirror(migratetype))
+   page = __rmqueue_smallest(zone, order, migratetype);
+   else
+   page = __rmqueue(zone, order, migratetype);
spin_unlock(&zone->lock);
if (!page)
goto failed;
@@ -2928,6 +2951,11 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int 
order,
if (IS_ENABLED(CONFIG_CMA) && ac.migratetype == MIGRATE_MOVABLE)
alloc_flags |= ALLOC_CMA;

+#ifdef CONFIG_MEMORY_MIRROR
+   if (change_to_mirror(gfp_mask, ac.high_zoneidx))
+   ac.migratetype = MIGRATE_MIRROR;
+#endif
+
  retry_cpuset:
cpuset_mems_cookie = read_mems_allowed_begin();

@@ -2943,9 +2971,19 @@ retry_cpuset:

/* First allocation attempt */
alloc_mask = gfp_mask|__GFP_HARDWALL;
+retry:
page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac);
if (unlikely(!page)) {
/*
+* If there is no mirrored memory, we will alloc other
+* types memory.
+*/
+   if (is_migrate_mirror(ac.migratetype)) {
+   ac.migratetype = gfpflags_to_migratetype(gfp_mask);
+   goto retry;
+   }
+
+   /*
 * Runtime PM, block IO and its error handling path
 * can deadlock because I/O on the device might not
 * complete.






Re: [RFC PATCH 08/12] mm: use mirrorable to switch allocate mirrored memory

2015-06-09 Thread Kamezawa Hiroyuki

On 2015/06/04 22:02, Xishi Qiu wrote:

Add a new interface in path /proc/sys/vm/mirrorable. When set to 1, it means
we should allocate mirrored memory for both user and kernel processes.

Signed-off-by: Xishi Qiu 


I can't see why we need this switch. If this is set, will all GFP_HIGHUSER allocations
use mirrored memory?

Or will you add a special MMAP/madvise flag to use mirrored memory?

Thanks,
-Kame


---
  include/linux/mmzone.h | 1 +
  kernel/sysctl.c| 9 +
  mm/page_alloc.c| 1 +
  3 files changed, 11 insertions(+)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index f82e3ae..20888dd 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -85,6 +85,7 @@ struct mirror_info {
  };

  extern struct mirror_info mirror_info;
+extern int sysctl_mirrorable;
  #  define is_migrate_mirror(migratetype) unlikely((migratetype) == 
MIGRATE_MIRROR)
  #else
  #  define is_migrate_mirror(migratetype) false
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 2082b1a..dc2625e 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1514,6 +1514,15 @@ static struct ctl_table vm_table[] = {
.extra2 = &one,
},
  #endif
+#ifdef CONFIG_MEMORY_MIRROR
+   {
+   .procname   = "mirrorable",
+   .data   = &sysctl_mirrorable,
+   .maxlen = sizeof(sysctl_mirrorable),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec_minmax,
+   },
+#endif
{
.procname   = "user_reserve_kbytes",
.data   = &sysctl_user_reserve_kbytes,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 249a8f6..63b90ca 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -212,6 +212,7 @@ int user_min_free_kbytes = -1;

  #ifdef CONFIG_MEMORY_MIRROR
  struct mirror_info mirror_info;
+int sysctl_mirrorable = 0;
  #endif

  static unsigned long __meminitdata nr_kernel_pages;






Re: [RFC PATCH 07/12] mm: introduce __GFP_MIRROR to allocate mirrored pages

2015-06-09 Thread Kamezawa Hiroyuki

On 2015/06/04 22:02, Xishi Qiu wrote:

This patch introduces a new gfp flag called "__GFP_MIRROR", it is used to
allocate mirrored pages through buddy system.

Signed-off-by: Xishi Qiu 


In Tony's original proposal, the motivation was to mirror all kernel memory.

Is the purpose of this patch to make the mirrored range available to user space?

But, hmm... I don't think adding a new GFP flag is a good idea. It adds many
conditional jumps.

How about using GFP_KERNEL for user memory if the user wants mirrored memory,
given that all kernel memory is mirrored anyway?

Thanks,
-Kame


---
  include/linux/gfp.h | 5 -
  1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 15928f0..89d0091 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -35,6 +35,7 @@ struct vm_area_struct;
  #define ___GFP_NO_KSWAPD  0x40u
  #define ___GFP_OTHER_NODE 0x80u
  #define ___GFP_WRITE  0x100u
+#define ___GFP_MIRROR  0x200u
  /* If the above are modified, __GFP_BITS_SHIFT may need updating */

  /*
@@ -95,13 +96,15 @@ struct vm_area_struct;
  #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of 
other node */
  #define __GFP_WRITE   ((__force gfp_t)___GFP_WRITE)   /* Allocator intends to 
dirty page */

+#define __GFP_MIRROR   ((__force gfp_t)___GFP_MIRROR)  /* Allocate mirrored 
memory */
+
  /*
   * This may seem redundant, but it's a way of annotating false positives vs.
   * allocations that simply cannot be supported (e.g. page tables).
   */
  #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)

-#define __GFP_BITS_SHIFT 25/* Room for N __GFP_FOO bits */
+#define __GFP_BITS_SHIFT 26/* Room for N __GFP_FOO bits */
  #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))

  /* This equals 0, but use constants in case they ever change */






Re: [RFC PATCH 02/12] mm: introduce mirror_info

2015-06-09 Thread Kamezawa Hiroyuki

On 2015/06/04 21:57, Xishi Qiu wrote:

This patch introduces a new struct called "mirror_info", it is used to storage
the mirror address range which reported by EFI or ACPI.

TBD: call add_mirror_info() to fill it.

Signed-off-by: Xishi Qiu 
---
  arch/x86/mm/numa.c |  3 +++
  include/linux/mmzone.h | 15 +++
  mm/page_alloc.c| 33 +
  3 files changed, 51 insertions(+)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 4053bb5..781fd68 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -619,6 +619,9 @@ static int __init numa_init(int (*init_func)(void))
/* In case that parsing SRAT failed. */
WARN_ON(memblock_clear_hotplug(0, ULLONG_MAX));
numa_reset_distance();
+#ifdef CONFIG_MEMORY_MIRROR
+   memset(&mirror_info, 0, sizeof(mirror_info));
+#endif

ret = init_func();
if (ret < 0)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 54d74f6..1fae07b 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -69,6 +69,21 @@ enum {
  #  define is_migrate_cma(migratetype) false
  #endif

+#ifdef CONFIG_MEMORY_MIRROR
+struct numa_mirror_info {
+   int node;
+   unsigned long start;
+   unsigned long size;
+};
+
+struct mirror_info {
+   int count;
+   struct numa_mirror_info info[MAX_NUMNODES];
+};


MAX_NUMNODES may not be enough when the firmware cannot use contiguous
addresses for mirroring.



+
+extern struct mirror_info mirror_info;
+#endif


If this structure is not updated after boot, __read_mostly should be
helpful.
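
(A hedged sketch combining both points above; the __read_mostly annotation and the
bounds check are illustrative and assume the table is only written while parsing
firmware tables during boot.)

==
/* Filled once during boot, then only read on allocation paths. */
struct mirror_info mirror_info __read_mostly;

static int __init record_mirror_range(int node, unsigned long start,
				      unsigned long size)
{
	if (mirror_info.count >= MAX_NUMNODES)	/* the MAX_NUMNODES concern above */
		return -ENOSPC;

	mirror_info.info[mirror_info.count].node  = node;
	mirror_info.info[mirror_info.count].start = start;
	mirror_info.info[mirror_info.count].size  = size;
	mirror_info.count++;
	return 0;
}
==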



+
  #define for_each_migratetype_order(order, type) \
for (order = 0; order < MAX_ORDER; order++) \
for (type = 0; type < MIGRATE_TYPES; type++)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ebffa0e..41a95a7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -210,6 +210,10 @@ static char * const zone_names[MAX_NR_ZONES] = {
  int min_free_kbytes = 1024;
  int user_min_free_kbytes = -1;

+#ifdef CONFIG_MEMORY_MIRROR
+struct mirror_info mirror_info;
+#endif
+
  static unsigned long __meminitdata nr_kernel_pages;
  static unsigned long __meminitdata nr_all_pages;
  static unsigned long __meminitdata dma_reserve;
@@ -545,6 +549,31 @@ static inline int page_is_buddy(struct page *page, struct 
page *buddy,
return 0;
  }

+#ifdef CONFIG_MEMORY_MIRROR
+static void __init add_mirror_info(int node,
+   unsigned long start, unsigned long size)
+{
+   mirror_info.info[mirror_info.count].node = node;
+   mirror_info.info[mirror_info.count].start = start;
+   mirror_info.info[mirror_info.count].size = size;
+
+   mirror_info.count++;
+}
+
+static void __init print_mirror_info(void)
+{
+   int i;
+
+   printk("Mirror info\n");
+   for (i = 0; i < mirror_info.count; i++)
+   printk("  node %3d: [mem %#010lx-%#010lx]\n",
+   mirror_info.info[i].node,
+   mirror_info.info[i].start,
+   mirror_info.info[i].start +
+   mirror_info.info[i].size - 1);
+}
+#endif
+
  /*
   * Freeing function for a buddy system allocator.
   *
@@ -5438,6 +5467,10 @@ void __init free_area_init_nodes(unsigned long 
*max_zone_pfn)
   (u64)zone_movable_pfn[i] << PAGE_SHIFT);
}

+#ifdef CONFIG_MEMORY_MIRROR
+   print_mirror_info();
+#endif
+
/* Print out the early node map */
pr_info("Early memory node ranges\n");
for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid)






Re: [RFC PATCH 01/12] mm: add a new config to manage the code

2015-06-08 Thread Kamezawa Hiroyuki

On 2015/06/04 21:56, Xishi Qiu wrote:

This patch introduces a new config called "CONFIG_ACPI_MIRROR_MEMORY", it is
used to on/off the feature.

Signed-off-by: Xishi Qiu 
---
  mm/Kconfig | 8 
  1 file changed, 8 insertions(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index 390214d..4f2a726 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -200,6 +200,14 @@ config MEMORY_HOTREMOVE
depends on MEMORY_HOTPLUG && ARCH_ENABLE_MEMORY_HOTREMOVE
depends on MIGRATION

+config MEMORY_MIRROR
+   bool "Address range mirroring support"
+   depends on X86 && NUMA
+   default y
+   help
+ This feature depends on hardware and firmware support.
+ ACPI or EFI records the mirror info.


default y...no runtime influence when the user doesn't use memory mirror ?

Thanks,
-Kame





Re: [RFC PATCH 03/12] mm: introduce MIGRATE_MIRROR to manage the mirrored, pages

2015-06-08 Thread Kamezawa Hiroyuki

On 2015/06/04 21:58, Xishi Qiu wrote:

This patch introduces a new MIGRATE_TYPES entry called "MIGRATE_MIRROR"; it is used
to store the list of mirrored pages.
With cat /proc/pagetypeinfo, you can see the count of free mirrored blocks.



I guess you need to add Mel to CC.


e.g.
euler-linux:~ # cat /proc/pagetypeinfo
Page block order: 9
Pages per block:  512

Free pages count per migrate type at order    0      1      2      3      4      5      6      7      8      9     10
Node    0, zone      DMA, type    Unmovable   1      1      0      0      2      1      1      0      1      0      0
Node    0, zone      DMA, type  Reclaimable   0      0      0      0      0      0      0      0      0      0      0
Node    0, zone      DMA, type      Movable   0      0      0      0      0      0      0      0      0      0      3
Node    0, zone      DMA, type       Mirror   0      0      0      0      0      0      0      0      0      0      0
Node    0, zone      DMA, type      Reserve   0      0      0      0      0      0      0      0      0      1      0
Node    0, zone      DMA, type      Isolate   0      0      0      0      0      0      0      0      0      0      0
Node    0, zone    DMA32, type    Unmovable   0      0      1      0      0      0      0      1      1      1      0
Node    0, zone    DMA32, type  Reclaimable   0      0      0      0      0      0      0      0      0      0      0
Node    0, zone    DMA32, type      Movable   1      2      6      6      6      4      5      3      3   2738
Node    0, zone    DMA32, type       Mirror   0      0      0      0      0      0      0      0      0      0      0
Node    0, zone    DMA32, type      Reserve   0      0      0      0      0      0      0      0      0      0      1
Node    0, zone    DMA32, type      Isolate   0      0      0      0      0      0      0      0      0      0      0
Node    0, zone   Normal, type    Unmovable   0      0      0      0      0      0      0      0      0      0      0
Node    0, zone   Normal, type  Reclaimable   0      0      0      0      0      0      0      0      0      0      0
Node    0, zone   Normal, type      Movable   0      0      1      1      0      0      0      2      1      0   4254
Node    0, zone   Normal, type       Mirror   148104 63     70     26     11      2      2      1   1973
Node    0, zone   Normal, type      Reserve   0      0      0      0      0      0      0      0      0      0      1
Node    0, zone   Normal, type      Isolate   0      0      0      0      0      0      0      0      0      0      0

Number of blocks type     Unmovable  Reclaimable      Movable       Mirror      Reserve      Isolate
Node 0, zone      DMA     1060  10
Node 0, zone    DMA32     20 15250  10
Node 0, zone   Normal     00 8702 2048  20

Page block order: 9
Pages per block:  512

Free pages count per migrate type at order    0      1      2      3      4      5      6      7      8      9     10
Node    1, zone   Normal, type    Unmovable   0      0      0      0      0      0      0      0      0      0      0
Node    1, zone   Normal, type  Reclaimable   0      0      0      0      0      0      0      0      0      0      0
Node    1, zone   Normal, type      Movable   2      2      1      1      2      1      2      2      2      3   3996
Node    1, zone   Normal, type       Mirror  68     94     57      6      8      1      0      0      3      1   2003
Node    1, zone   Normal, type      Reserve   0      0      0      0      0      0      0      0      0      0      1
Node    1, zone   Normal, type      Isolate   0      0      0      0      0      0      0      0      0      0      0

Number of blocks type     Unmovable  Reclaimable      Movable       Mirror      Reserve      Isolate
Node 1, zone   Normal     00 8190 4096  20


Signed-off-by: Xishi Qiu 
---
  include/linux/mmzone.h | 6 ++
  mm/page_alloc.c| 3 +++
  mm/vmstat.c| 3 +++
  3 files changed, 12 insertions(+)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 1fae07b..b444335 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -39,6 +39,9 @@ enum {
MIGRATE_UNMOVABLE,
MIGRATE_RECLAIMABLE,
MIGRATE_MOVABLE,
+#ifdef CONFIG_MEMORY_MIRROR
+   MIGRATE_MIRROR,
+#endif


I can't see how the fallback logic will work from reading this patch.
I think an update to the fallback order array should be in this patch...


MIGRATE_PCPTYPES,   /* the number of types on the pcp lists */
MIGRATE_RESERVE = MIGRATE_PCPTYPES,
  #ifdef CONFIG_CMA
@@ -82,6 +85,9 @@ struct mirror_info {
 

Re: [RESEND RFC PATCH 2/2] gfp: use the best near online node if the target node is offline

2015-04-27 Thread Kamezawa Hiroyuki

On 2015/04/25 5:01, Andrew Morton wrote:

On Fri, 24 Apr 2015 17:58:33 +0800 Gu Zheng  wrote:


Since the change to the cpu <--> mapping (map the cpu to the physical
node for all possible at the boot), the node of cpu may be not present,
so we use the best near online node if the node is not present in the low
level allocation APIs.

...

--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -298,9 +298,31 @@ __alloc_pages(gfp_t gfp_mask, unsigned int order,
return __alloc_pages_nodemask(gfp_mask, order, zonelist, NULL);
  }

+static int find_near_online_node(int node)
+{
+   int n, val;
+   int min_val = INT_MAX;
+   int best_node = -1;
+
+   for_each_online_node(n) {
+   val = node_distance(node, n);
+
+   if (val < min_val) {
+   min_val = val;
+   best_node = n;
+   }
+   }
+
+   return best_node;
+}


This should be `inline' if it's in a header file.

But it is far too large to be inlined anyway - please move it to a .c file.

And please document it.  A critical thing to describe is how we
determine whether a node is "near".  There are presumably multiple ways
in which we could decide that a node is "near" (number of hops, minimum
latency, ...).  Which one did you choose, and why?


  static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
unsigned int order)
  {
+   /* Offline node, use the best near online node */
+   if (!node_online(nid))
+   nid = find_near_online_node(nid);
+
/* Unknown node is current node */
if (nid < 0)
nid = numa_node_id();
@@ -311,7 +333,11 @@ static inline struct page *alloc_pages_node(int nid, gfp_t 
gfp_mask,
  static inline struct page *alloc_pages_exact_node(int nid, gfp_t gfp_mask,
unsigned int order)
  {
-   VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid));
+   /* Offline node, use the best near online node */
+   if (!node_online(nid))
+   nid = find_near_online_node(nid);


In above VM_BUG_ON(), !node_online(nid) is the bug.


+
+   VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);

return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
  }


Ouch.  These functions are called very frequently, and adding overhead
to them is a big deal.  And the patch even adds overhead to non-x86
architectures which don't benefit from it!

Is there no way this problem can be fixed somewhere else?  Preferably
by fixing things up at hotplug time.


I agree. The results should be cached; if necessary, in a per-cpu cache line.
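
(A hedged sketch of that caching idea; the array name and the idea of recomputing it
from a node-hotplug notifier are illustrative only.)

==
#include <linux/nodemask.h>
#include <linux/topology.h>
#include <linux/kernel.h>
#include <linux/cache.h>

static int nearest_online_node[MAX_NUMNODES] __read_mostly;

/* Recompute from a node hotplug notifier, not on every allocation. */
static void update_nearest_online_node(void)
{
	int nid, n;

	for (nid = 0; nid < MAX_NUMNODES; nid++) {
		int best = NUMA_NO_NODE, min_val = INT_MAX;

		for_each_online_node(n) {
			int val = node_distance(nid, n);

			if (val < min_val) {
				min_val = val;
				best = n;
			}
		}
		nearest_online_node[nid] = best;
	}
}
==

Then alloc_pages_node() only needs "if (!node_online(nid)) nid = nearest_online_node[nid];",
which is a single array load on the hot path.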


Thanks,
-Kame




Re: [PATCH 0/2] workqueue: fix a bug when numa mapping is changed

2015-04-01 Thread Kamezawa Hiroyuki

On 2015/04/02 10:36, Gu Zheng wrote:

Hi Kame, TJ,

On 04/01/2015 04:30 PM, Kamezawa Hiroyuki wrote:


On 2015/04/01 12:02, Tejun Heo wrote:

On Wed, Apr 01, 2015 at 11:55:11AM +0900, Kamezawa Hiroyuki wrote:

Now, hot-added cpus will have the lowest free cpu id.

Because of this, in most of systems which has only cpu-hot-add, cpu-ids are 
always
contiguous even after cpu hot add.
In enterprise, this would be considered as imcompatibility.

determining cpuid <-> lapicid at boot will make cpuids sparse. That may corrupt
exisiting script or configuration/resource management software.


Ugh... so, cpu number allocation on hot-add is part of userland
interface that we're locked into?


We checked most of RHEL7 packages and didn't find a problem yet.
But, for examle, we know some performance test team's test program assumed 
contiguous
cpuids and it failed. It was an easy case because we can ask them to fix the 
application
but I guess there will be some amount of customers that cpuids are contiguous.


Tying hotplug and id allocation
order together usually isn't a good idea.  What if the cpu up fails
while running the notifiers?  The ID is already allocated and the next
cpu being brought up will be after a hole anyway.  Is this even
actually gonna affect userland?



Maybe. It's not fail-safe but

In general, all kernel engineers (and skilled userland engineers) knows that
cpuids cannot be always contiguous and cpuids/nodeids should be checked before
running programs. I think most of engineers should be aware of that but many
users have their own assumption :(

Basically, I don't have strong objections, you're right technically.

In summary...
  - users should not assume cpuids are contiguous.
  - all possible ids should be fixed at boot time.
  - For uses, some clarification document should be somewhere in Documenatation.


Fine to me.



So, Gu-san
  1) determine all possible ids at boot.
  2) clarify cpuid/nodeid can have hole because of 1) in Documenation.
  3) It would be good if other guys give us ack.


Also fine.
But before this going, could you please reconsider determining the ids when 
firstly
present (the implementation on this patchset)?
Though it is not the perfect one in some words, but we can ignore the doubts 
that
mentioned above as the cpu/node hotplug is not frequent behaviours, and there 
seems
not anything harmful to us if we go this way.



Is it such heavy work?  Hmm. My requests are:

Implement your patches as
- Please don't change the current behavior at boot.
- Remember all possible apicids and give them future cpuids if not assigned.
as step 1.

Please fix the dynamic pxm <-> node detection in step 2.

In the future, memory-less node handling in x86 should be revisited.

Thanks,
-Kame









Re: [PATCH 0/2] workqueue: fix a bug when numa mapping is changed

2015-04-01 Thread Kamezawa Hiroyuki

On 2015/04/01 12:02, Tejun Heo wrote:

On Wed, Apr 01, 2015 at 11:55:11AM +0900, Kamezawa Hiroyuki wrote:

Now, hot-added cpus will have the lowest free cpu id.

Because of this, in most of systems which has only cpu-hot-add, cpu-ids are 
always
contiguous even after cpu hot add.
In enterprise, this would be considered as imcompatibility.

determining cpuid <-> lapicid at boot will make cpuids sparse. That may corrupt
exisiting script or configuration/resource management software.


Ugh... so, cpu number allocation on hot-add is part of userland
interface that we're locked into?


We checked most of the RHEL7 packages and didn't find a problem yet.
But, for example, we know a performance test team's test program assumed contiguous
cpuids and it failed. It was an easy case because we could ask them to fix the
application, but I guess there will be some amount of customers that assume cpuids
are contiguous.


Tying hotplug and id allocation
order together usually isn't a good idea.  What if the cpu up fails
while running the notifiers?  The ID is already allocated and the next
cpu being brought up will be after a hole anyway.  Is this even
actually gonna affect userland?



Maybe. It's not fail-safe, but:

In general, all kernel engineers (and skilled userland engineers) know that
cpuids cannot always be contiguous and that cpuids/nodeids should be checked before
running programs. I think most engineers should be aware of that, but many
users have their own assumptions :(

Basically, I don't have strong objections; you're right technically.

In summary...
 - users should not assume cpuids are contiguous.
 - all possible ids should be fixed at boot time.
 - For users, some clarifying documentation should be somewhere in Documentation.

So, Gu-san:
 1) determine all possible ids at boot.
 2) clarify that cpuids/nodeids can have holes because of 1) in Documentation.
 3) It would be good if other guys give us their ack.

In the end,
I myself think a naming system like udev for cpuid/numaid will be necessary.
Can that renaming feature be a cgroup/namespace feature? If possible,
all containers could have cpuids starting from 0.

Thanks,
-Kame



Re: [PATCH 1/2] x86/cpu hotplug: make apicid <--> cpuid mapping persistent

2015-03-31 Thread Kamezawa Hiroyuki
On 2015/03/30 18:58, Gu Zheng wrote:
> Hi Kame-san,
> 
> On 03/27/2015 12:31 AM, Kamezawa Hiroyuki wrote:
> 
>> On 2015/03/26 13:55, Gu Zheng wrote:
>>> Hi Kame-san,
>>> On 03/26/2015 11:19 AM, Kamezawa Hiroyuki wrote:
>>>
>>>> On 2015/03/26 11:17, Gu Zheng wrote:
>>>>> Previously, we build the apicid <--> cpuid mapping when the cpu is 
>>>>> present, but
>>>>> the relationship will be changed if the cpu/node hotplug happenned, 
>>>>> because we
>>>>> always choose the first free cpuid for the hot added cpu (whether it is 
>>>>> new-add
>>>>> or re-add), so this the cpuid <--> node mapping changed if node hot plug
>>>>> occurred, and it causes the wq sub-system allocation failture:
>>>>>  ==
>>>>> SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
>>>>>  cache: kmalloc-192, object size: 192, buffer size: 192, default
>>>>> order:
>>>>>1, min order: 0
>>>>>  node 0: slabs: 6172, objs: 259224, free: 245741
>>>>>  node 1: slabs: 3261, objs: 136962, free: 127656
>>>>>  ==
>>>>> So here we build the persistent [lapic id] <--> cpuid mapping when the 
>>>>> cpu first
>>>>> present, and never change it.
>>>>>
>>>>> Suggested-by: KAMEZAWA Hiroyuki 
>>>>> Signed-off-by: Gu Zheng 
>>>>> ---
>>>>> arch/x86/kernel/apic/apic.c |   31 ++-
>>>>> 1 files changed, 30 insertions(+), 1 deletions(-)
>>>>>
>>>>> diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
>>>>> index ad3639a..d539ebc 100644
>>>>> --- a/arch/x86/kernel/apic/apic.c
>>>>> +++ b/arch/x86/kernel/apic/apic.c
>>>>> @@ -2038,6 +2038,30 @@ void disconnect_bsp_APIC(int virt_wire_setup)
>>>>>   apic_write(APIC_LVT1, value);
>>>>> }
>>>>> 
>>>>> +/*
>>>>> + * Logic cpu number(cpuid) to local APIC id persistent mappings.
>>>>> + * Do not clear the mapping even if cpu hot removed.
>>>>> + * */
>>>>> +static int apicid_to_x86_cpu[MAX_LOCAL_APIC] = {
>>>>> + [0 ... MAX_LOCAL_APIC - 1] = -1,
>>>>> +};
>>>>
>>>>
>>>> This patch cannot handle x2apic, which is 32bit.
>>>
>>> IMO, if the apicid is too big (larger than MAX_LOCAL_APIC), we will skip
>>> generating a logic cpu number for it, so it seems no problem here.
>>>
>> you mean MAX_LOCAL_APIC=32768 ? isn't it too wasting ?
> 
> I use the big array here to keep the same format with the existed ones:
> int apic_version[MAX_LOCAL_APIC];
> 
> s16 __apicid_to_node[MAX_LOCAL_APIC] = {
>   [0 ... MAX_LOCAL_APIC-1] = NUMA_NO_NODE
> };
> Or we should also say "NO" to them?

This will not survive when someone try to enlarge MAX_LOCAL_APIC for
introdcing new cpu model because x2apic is 32bit by definition.

Please create long surviving infrastructure.

Thanks,
-Kame







Re: [PATCH 0/2] workqueue: fix a bug when numa mapping is changed

2015-03-31 Thread Kamezawa Hiroyuki

On 2015/04/01 0:28, Tejun Heo wrote:

Hello, Kamezawa.

On Tue, Mar 31, 2015 at 03:09:05PM +0900, Kamezawa Hiroyuki wrote:

But this may be considered as API change for most hot-add users.


Hmm... Why would it be?  What can that possibly break?



Now, hot-added cpus will get the lowest free cpu id.

Because of this, in most systems which have only cpu hot-add, cpu ids are always
contiguous even after cpu hot-add.
In enterprise, this would be considered an incompatibility.

Determining the cpuid <-> lapicid mapping at boot will make cpuids sparse. That may break
existing scripts or configuration/resource management software.


So, for now, I vote for detemining ids at online but record it is a good way.


If we know the information during boot, let's please do it at boot
time by all means.


I basically agree. I'm just thinking about the impact of this small incompatibility
of forcing the ideal way, given my standing point.

Thanks,
-Kame



Re: [PATCH 0/2] workqueue: fix a bug when numa mapping is changed

2015-03-30 Thread Kamezawa Hiroyuki

On 2015/03/30 18:49, Gu Zheng wrote:

Hi Kame-san,

On 03/27/2015 12:42 AM, Kamezawa Hiroyuki wrote:


On 2015/03/27 0:18, Tejun Heo wrote:

Hello,

On Thu, Mar 26, 2015 at 01:04:00PM +0800, Gu Zheng wrote:

wq generates the numa affinity (pool->node) for all the possible cpu's
per cpu workqueue at init stage, that means the affinity of currently un-present
ones' may be incorrect, so we need to update the pool->node for the new added 
cpu
to the correct node when preparing online, otherwise it will try to create 
worker
on invalid node if node hotplug occurred.


If the mapping is gonna be static once the cpus show up, any chance we
can initialize that for all possible cpus during boot?



I think the kernel can define all possible

  cpuid <-> lapicid <-> pxm <-> nodeid

mapping at boot with using firmware table information.


Could you explain more?


You can scan the firmware's tables and store all the provided information for possible
cpus/pxms somewhere, even while future cpus/mems are not yet onlined, rather than
determining it on demand.
But this may be considered an API change for most hot-add users.

So, for now, I vote for determining ids at online time but recording them; that is a good way.

Thanks,
-Kame












Re: [PATCH 0/2] workqueue: fix a bug when numa mapping is changed

2015-03-26 Thread Kamezawa Hiroyuki

On 2015/03/27 0:18, Tejun Heo wrote:

Hello,

On Thu, Mar 26, 2015 at 01:04:00PM +0800, Gu Zheng wrote:

wq generates the numa affinity (pool->node) for all the possible cpu's
per cpu workqueue at init stage, that means the affinity of currently un-present
ones' may be incorrect, so we need to update the pool->node for the new added 
cpu
to the correct node when preparing online, otherwise it will try to create 
worker
on invalid node if node hotplug occurred.


If the mapping is gonna be static once the cpus show up, any chance we
can initialize that for all possible cpus during boot?



I think the kernel can define all possible

 cpuid <-> lapicid <-> pxm <-> nodeid

mappings at boot using the firmware table information.

One concern is the current x86 logic for memory-less nodes vs. memory hotplug
(as I explained before).

My idea is:
  step 1. build all possible cpuid <-> apicid <-> pxm <-> node id mappings at boot.

But this may be overwritten by x86's memory-less node logic. So,
  step 2. check whether the node is online before calling kmalloc; if offline, use -1,
          rather than updating the workqueue's attribute (a minimal sketch follows).
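
A minimal, hedged sketch of that step-2 check (the helper name and call site are
illustrative; the point is only that the cached node is validated right before the
allocation):

==
#include <linux/nodemask.h>
#include <linux/slab.h>
#include <linux/numa.h>

/*
 * Guard the cached node just before allocating, instead of rewriting
 * pool->node on every hotplug event.
 */
static void *alloc_on_pool_node(int node, size_t size, gfp_t gfp)
{
	if (node != NUMA_NO_NODE && !node_online(node))
		node = NUMA_NO_NODE;	/* i.e. the "-1" above */

	return kmalloc_node(size, gfp, node);
}
==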

Thanks,
-Kame



Re: [PATCH 1/2] x86/cpu hotplug: make apicid <--> cpuid mapping persistent

2015-03-26 Thread Kamezawa Hiroyuki
On 2015/03/26 13:55, Gu Zheng wrote:
> Hi Kame-san,
> On 03/26/2015 11:19 AM, Kamezawa Hiroyuki wrote:
> 
>> On 2015/03/26 11:17, Gu Zheng wrote:
>>> Previously, we build the apicid <--> cpuid mapping when the cpu is present, 
>>> but
>>> the relationship will be changed if the cpu/node hotplug happenned, because 
>>> we
>>> always choose the first free cpuid for the hot added cpu (whether it is 
>>> new-add
>>> or re-add), so this the cpuid <--> node mapping changed if node hot plug
>>> occurred, and it causes the wq sub-system allocation failture:
>>> ==
>>>SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
>>> cache: kmalloc-192, object size: 192, buffer size: 192, default
>>> order:
>>>   1, min order: 0
>>> node 0: slabs: 6172, objs: 259224, free: 245741
>>> node 1: slabs: 3261, objs: 136962, free: 127656
>>> ==
>>> So here we build the persistent [lapic id] <--> cpuid mapping when the cpu 
>>> first
>>> present, and never change it.
>>>
>>> Suggested-by: KAMEZAWA Hiroyuki 
>>> Signed-off-by: Gu Zheng 
>>> ---
>>>arch/x86/kernel/apic/apic.c |   31 ++-
>>>1 files changed, 30 insertions(+), 1 deletions(-)
>>>
>>> diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
>>> index ad3639a..d539ebc 100644
>>> --- a/arch/x86/kernel/apic/apic.c
>>> +++ b/arch/x86/kernel/apic/apic.c
>>> @@ -2038,6 +2038,30 @@ void disconnect_bsp_APIC(int virt_wire_setup)
>>> apic_write(APIC_LVT1, value);
>>>}
>>>
>>> +/*
>>> + * Logic cpu number(cpuid) to local APIC id persistent mappings.
>>> + * Do not clear the mapping even if cpu hot removed.
>>> + * */
>>> +static int apicid_to_x86_cpu[MAX_LOCAL_APIC] = {
>>> +   [0 ... MAX_LOCAL_APIC - 1] = -1,
>>> +};
>>
>>
>> This patch cannot handle x2apic, which is 32bit.
> 
> IMO, if the apicid is too big (larger than MAX_LOCAL_APIC), we will skip
> generating a logic cpu number for it, so it seems no problem here.
> 
You mean MAX_LOCAL_APIC = 32768? Isn't that too wasteful?

Anyway, APIC IDs are sparse values. Please use a proper data structure.

Thanks,
-Kame

>>
>> As far as I understand, it depends on CPU's spec and the newest cpu has 9bit 
>> apicid, at least.
>>
>> But you can't create inifinit array.
>>
>> If you can't allocate the array dynamically, How about adding
>>
>>   static int cpuid_to_apicid[MAX_CPU] = {}
>>
>> or using idr library ? (please see lib/idr.c)
>>
>> I guess you can update this map after boot(after mm initialization)
>> and make use of idr library.
>>
>> About this patch, Nack.
>>
>> -Kame
>>
>>
>>
>>> +
>>> +/*
>>> + * Internal cpu id bits, set the bit once cpu present, and never clear it.
>>> + * */
>>> +static cpumask_t cpuid_mask = CPU_MASK_NONE;
>>> +
>>> +static int get_cpuid(int apicid)
>>> +{
>>> +   int cpuid;
>>> +
>>> +   cpuid = apicid_to_x86_cpu[apicid];
>>> +   if (cpuid == -1)
>>> +   cpuid = cpumask_next_zero(-1, &cpuid_mask);
>>> +
>>> +   return cpuid;
>>> +}
>>> +
>>>int generic_processor_info(int apicid, int version)
>>>{
>>> int cpu, max = nr_cpu_ids;
>>> @@ -2115,7 +2139,10 @@ int generic_processor_info(int apicid, int version)
>>>  */
>>> cpu = 0;
>>> } else
>>> -   cpu = cpumask_next_zero(-1, cpu_present_mask);
>>> +   cpu = get_cpuid(apicid);
>>> +
>>> +   /* Store the mapping */
>>> +   apicid_to_x86_cpu[apicid] = cpu;
>>>
>>> /*
>>>  * Validate version
>>> @@ -2144,6 +2171,8 @@ int generic_processor_info(int apicid, int version)
>>> early_per_cpu(x86_cpu_to_logical_apicid, cpu) =
>>> apic->x86_32_early_logical_apicid(cpu);
>>>#endif
>>> +   /* Mark this cpu id as uesed (already mapping a local apic id) */
>>> +   cpumask_set_cpu(cpu, &cpuid_mask);
>>> set_cpu_possible(cpu, true);
>>> set_cpu_present(cpu, true);
>>>
>>>
>>
>>
>> .
>>
> 
> 




Re: [PATCH 1/2] x86/cpu hotplug: make apicid <--> cpuid mapping persistent

2015-03-25 Thread Kamezawa Hiroyuki
On 2015/03/26 11:17, Gu Zheng wrote:
> Previously, we build the apicid <--> cpuid mapping when the cpu is present, 
> but
> the relationship will be changed if the cpu/node hotplug happenned, because we
> always choose the first free cpuid for the hot added cpu (whether it is 
> new-add
> or re-add), so this the cpuid <--> node mapping changed if node hot plug
> occurred, and it causes the wq sub-system allocation failture:
>==
>   SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
>cache: kmalloc-192, object size: 192, buffer size: 192, default
> order:
>  1, min order: 0
>node 0: slabs: 6172, objs: 259224, free: 245741
>node 1: slabs: 3261, objs: 136962, free: 127656
>==
> So here we build the persistent [lapic id] <--> cpuid mapping when the cpu 
> first
> present, and never change it.
> 
> Suggested-by: KAMEZAWA Hiroyuki 
> Signed-off-by: Gu Zheng 
> ---
>   arch/x86/kernel/apic/apic.c |   31 ++-
>   1 files changed, 30 insertions(+), 1 deletions(-)
> 
> diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
> index ad3639a..d539ebc 100644
> --- a/arch/x86/kernel/apic/apic.c
> +++ b/arch/x86/kernel/apic/apic.c
> @@ -2038,6 +2038,30 @@ void disconnect_bsp_APIC(int virt_wire_setup)
>   apic_write(APIC_LVT1, value);
>   }
>   
> +/*
> + * Logic cpu number(cpuid) to local APIC id persistent mappings.
> + * Do not clear the mapping even if cpu hot removed.
> + * */
> +static int apicid_to_x86_cpu[MAX_LOCAL_APIC] = {
> + [0 ... MAX_LOCAL_APIC - 1] = -1,
> +};


This patch cannot handle x2apic, which is 32bit.

As far as I understand, it depends on the CPU's spec, and the newest cpus have at least
9-bit apicids.

But you can't create an infinite array.

If you can't allocate the array dynamically, how about adding

  static int cpuid_to_apicid[MAX_CPU] = {}

or using the idr library? (please see lib/idr.c)

I guess you can update this map after boot (after mm initialization)
and make use of the idr library.
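
(A hedged sketch of the idr idea; the helper names, the cpuid + 1 encoding that keeps
cpu 0 distinct from "not found", and the absence of locking are all illustrative.)

==
#include <linux/idr.h>
#include <linux/gfp.h>

static DEFINE_IDR(apicid_to_cpuid_map);		/* sparse apicid -> cpuid + 1 */

static int apicid_store_cpuid(int apicid, int cpuid)
{
	/* Insert exactly at index 'apicid'; fails if already mapped. */
	return idr_alloc(&apicid_to_cpuid_map, (void *)(long)(cpuid + 1),
			 apicid, apicid + 1, GFP_KERNEL);
}

static int apicid_to_cpuid(int apicid)
{
	void *v = idr_find(&apicid_to_cpuid_map, apicid);

	return v ? (int)(long)v - 1 : -1;
}
==

This keeps the map sparse, so large x2apic ids don't force a MAX_LOCAL_APIC-sized
static array.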

About this patch, Nack.

-Kame



> +
> +/*
> + * Internal cpu id bits, set the bit once cpu present, and never clear it.
> + * */
> +static cpumask_t cpuid_mask = CPU_MASK_NONE;
> +
> +static int get_cpuid(int apicid)
> +{
> + int cpuid;
> +
> + cpuid = apicid_to_x86_cpu[apicid];
> + if (cpuid == -1)
> + cpuid = cpumask_next_zero(-1, &cpuid_mask);
> +
> + return cpuid;
> +}
> +
>   int generic_processor_info(int apicid, int version)
>   {
>   int cpu, max = nr_cpu_ids;
> @@ -2115,7 +2139,10 @@ int generic_processor_info(int apicid, int version)
>*/
>   cpu = 0;
>   } else
> - cpu = cpumask_next_zero(-1, cpu_present_mask);
> + cpu = get_cpuid(apicid);
> +
> + /* Store the mapping */
> + apicid_to_x86_cpu[apicid] = cpu;
>   
>   /*
>* Validate version
> @@ -2144,6 +2171,8 @@ int generic_processor_info(int apicid, int version)
>   early_per_cpu(x86_cpu_to_logical_apicid, cpu) =
>   apic->x86_32_early_logical_apicid(cpu);
>   #endif
> + /* Mark this cpu id as used (already mapped to a local apic id) */
> + cpumask_set_cpu(cpu, &cpuid_mask);
>   set_cpu_possible(cpu, true);
>   set_cpu_present(cpu, true);
>   
> 




Re: [PATCH 0/2] workqueue: fix a bug when numa mapping is changed

2015-03-25 Thread Kamezawa Hiroyuki
On 2015/03/26 11:17, Gu Zheng wrote:
> Yasuaki Ishimatsu found that with node online/offline, the cpu<->node
> relationship is re-established. Workqueue uses info which was established
> at boot time, but it may be changed by node hotplugging.
> 
> Once pool->node points to a stale node, following allocation failure
> happens.
>==
>   SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
>cache: kmalloc-192, object size: 192, buffer size: 192, default
> order:
>  1, min order: 0
>node 0: slabs: 6172, objs: 259224, free: 245741
>node 1: slabs: 3261, objs: 136962, free: 127656
>==
> 
> As the apicid <--> node relationship is persistent, so the root cause is the
 ^^^
   pxm.

> cpu-id <-> lapicid mapping is not persistent (because the current
> implementation always chooses the first free cpu id for the newly added cpu),
> so if we can build a persistent cpu-id <-> lapicid relationship, this problem
> will be fixed.
> 
> Please refer to https://lkml.org/lkml/2015/2/27/145 for the previous 
> discussion.
> 
> Gu Zheng (2):
>x86/cpu hotplug: make lapicid <-> cpuid mapping persistent
>workqueue: update per cpu workqueue's numa affinity when cpu
>  preparing online

why is patch (2/2) required?

Thanks,
-Kame

> 
>   arch/x86/kernel/apic/apic.c |   31 ++-
>   kernel/workqueue.c  |1 +
>   2 files changed, 31 insertions(+), 1 deletions(-)
> 




Re: [PATCH] workqueue: update numa affinity when node hotplug

2015-03-04 Thread Kamezawa Hiroyuki

On 2015/03/05 10:23, Gu Zheng wrote:

Hi Kamezawa-san,
On 03/04/2015 01:45 PM, Kamezawa Hiroyuki wrote:


On 2015/03/03 22:18, Tejun Heo wrote:

Hello, Kame.

On Tue, Mar 03, 2015 at 03:53:46PM +0900, Kamezawa Hiroyuki wrote:

relationship between proximity domain and lapic id doesn't change.
relationship between lapic-id and cpu-id changes.

pxm <-> memory address  : no change
pxm <-> lapicid : no change
pxm <-> node id : no change
lapicid <-> cpu id  : change.


So, we're changing the cpu ID to NUMA node mapping because current
NUMA code is ignoring PXM for memoryless nodes?  That's it?



For memory-less node case, yes.
Another problem is that lapicid <-> cpuid relationship is not persistent.



I personally think the proper fix is building a persistent cpu-id <-> lapicid
relationship as pxm does, rather than creating a band-aid.


Oh if this is possible, I agree that's the right direction too.



Implementation is a bit complicated now :(.


Ah well, even then, the obviously right thing to do is updating NUMA
code to always keep track of PXM information.  We don't really want to
pile NUMA hacks in random users of NUMA code.



We'd like to start from making apicid <-> cpuid persistent because memory-less
node case doesn't cause panic.

Gu-san, what do you think?


Fine by me. But it seems that the change will break the re-use of a free cpuid
when hot-adding new cpus. I am afraid it may affect other sub-systems, though I
cannot point out a specific example.



That may be a concern. But, IMHO, applications should not expect the cpuid of a
newly added cpu.

If an application takes care of the placement of [thread, memory], the OS API
should be refreshed to allow the user to say "please place [thread, memory]
nearby each other" rather than specifying [cpu, memory]. (As Peter Z.'s
schedNUMA does.)

If an application takes care of the physical placement of hardware (ethernet,
SSD, etc.), it should not depend on cpuid.

Some of the required features may not be achieved yet, but the cpuid of a newly
added cpu should not be a big problem for applications, I think.

Open discussion with the (x86/sched/mm) maintainers will be required, anyway.
I think it's worth trying.

Thanks,
-Kame







Re: node-hotplug: is memset 0 safe in try_offline_node()?

2015-03-04 Thread Kamezawa Hiroyuki

On 2015/03/04 17:03, Xishi Qiu wrote:

On 2015/3/4 11:56, Gu Zheng wrote:


Hi Xishi,
On 03/04/2015 10:52 AM, Xishi Qiu wrote:


On 2015/3/4 10:22, Xishi Qiu wrote:


On 2015/3/3 18:20, Gu Zheng wrote:


Hi Xishi,
On 03/03/2015 11:30 AM, Xishi Qiu wrote:


When hot-removing a numa node, we will clear pgdat,
but is memset 0 safe in try_offline_node()?


It is not safe here. In fact, this is a temporary solution here.
As you know, pgdat is accessed lock-less now, so a protection
mechanism (RCU?) is needed to make it completely safe here,
but it seems a bit over-kill.
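
(Editorial illustration only, not code from this thread: if the lock-less
readers ran under rcu_read_lock(), the hot-remove path could, assuming x86's
node_data[] array, do something like

	struct pglist_data *pgdat = NODE_DATA(nid);

	/* unpublish the node so new lock-less readers cannot find it;
	   arch_refresh_nodedata() would republish it on re-add */
	node_data[nid] = NULL;
	synchronize_rcu();	/* wait for pre-existing lock-less readers */
	memset(pgdat, 0, sizeof(*pgdat));

which is the kind of protection meant by "RCU?" above, and also why it looks
like over-kill for this case.)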



Hi Gu,

Can we just remove "memset(pgdat, 0, sizeof(*pgdat));" ?
I find this will be fine in the stress test except the warning
when hot-add memory.


As you see, it will trigger the warning in free_area_init_node().
Could you try the following patch? It will reset the pgdat before reusing it.

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 1778628..0717649 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1092,6 +1092,9 @@ static pg_data_t __ref *hotadd_new_pgdat(int nid, u64 
start)
 return NULL;

 arch_refresh_nodedata(nid, pgdat);
+   } else {
+   /* Reset the pgdat to reuse */
+   memset(pgdat, 0, sizeof(*pgdat));
 }


Hi Gu,

If the schedule lasts a long time, next_zone may still access the pgdat here,
so it is not safe enough, right?



How about just resetting pgdat->nr_zones and pgdat->classzone_idx to 0 rather
than memset()?

It seems breaking the pointer information in pgdat is not an option.
Just the proper "values" should be reset.
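
(A minimal sketch of this suggestion -- an illustration only, not a submitted
patch, and the helper name is invented -- could be:

	/* reset only the values that trip free_area_init_node(); keep the
	 * pointers (zonelists, per-cpu structures) in the reused pgdat */
	static void reset_pgdat_for_reuse(pg_data_t *pgdat)
	{
		pgdat->nr_zones = 0;
		pgdat->classzone_idx = 0;
	}

called from hotadd_new_pgdat() in place of the full memset() shown above.)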

Thanks,
-Kame





Re: [PATCH] workqueue: update numa affinity when node hotplug

2015-03-03 Thread Kamezawa Hiroyuki

On 2015/03/03 22:18, Tejun Heo wrote:

Hello, Kame.

On Tue, Mar 03, 2015 at 03:53:46PM +0900, Kamezawa Hiroyuki wrote:

relationship between proximity domain and lapic id doesn't change.
relationship between lapic-id and cpu-id changes.

pxm <-> memory address  : no change
pxm <-> lapicid : no change
pxm <-> node id : no change
lapicid <-> cpu id  : change.


So, we're changing the cpu ID to NUMA node mapping because current
NUMA code is ignoring PXM for memoryless nodes?  That's it?



For memory-less node case, yes.
Another problem is that lapicid <-> cpuid relationship is not persistent.



I personally think the proper fix is building a persistent cpu-id <-> lapicid
relationship as pxm does, rather than creating a band-aid.


Oh if this is possible, I agree that's the right direction too.



Implementation is a bit complicated now :(.


Ah well, even then, the obviously right thing to do is updating NUMA
code to always keep track of PXM information.  We don't really want to
pile NUMA hacks in random users of NUMA code.



We'd like to start from making apicid <-> cpuid persistent because memory-less
node case doesn't cause panic.

Gu-san, what do you think?

Thanks,
-Kame

P.S.
Finally, I want something like udev for cpuid/numaid...




Re: [PATCH] workqueue: update numa affinity when node hotplug

2015-03-02 Thread Kamezawa Hiroyuki

On 2015/03/03 1:28, Tejun Heo wrote:

Hello,

On Mon, Mar 02, 2015 at 05:41:05PM +0900, Kamezawa Hiroyuki wrote:

Let me start by explaining the current behavior.

- cpu-id is determined when a new processor (lapicid/x2apicid) is found.
   cpu-id<->nodeid relationship is _not_ recorded.


Is this something from the firmware side or is it just that we aren't
maintaining the association right now?



I think it's just not maintained.


- node-id is determined when a new pxm (firmware info) is found.
   pxm<->nodeid relationship is recorded.

By this, there are 2 cases of cpu<->nodeid change.

Case A) In x86, cpus on memory-less nodes are all tied to existing nodes (round
robin).
When memory hot-add happens and a new node comes up, cpus are moved to the newly
added node based on pxm.


Ah, okay, so the firmware doesn't provide proximity information at all
for memory-less nodes so we end up putting all of them somewhere
random and when memory is added to one of the memory-less nodes, the
mapping information changes?



With a memory-less node, the proximity domain for processors is given but ignored.
When memory (node) hotplug happens, the information is revisited and the
cpuid<->nodeid relationship is updated.


Am I understanding it correctly?  If so, it's super weird tho.  Why
wouldn't there be proximity information for a memless node?  Not
having memory doesn't mean it's at the same distance from all existing
nodes.


Firmware gives a pxm for a memory-less node but it's ignored.
I'm not sure why the current implementation is like this.


Case B) When adding a node after removing another node, if their pxms were
different from each other, the cpu<->node relationship changes.


I don't get this either.  Does proximity relationship actually change?
Or is it that we're assigning different IDs to the same thing? Isn't
proximity pretty much hardwired to how the system is architected to
begin with?



relationship between proximity domain and lapic id doesn't change.
relationship between lapic-id and cpu-id changes.

pxm <-> memory address  : no change
pxm <-> lapicid : no change
pxm <-> node id : no change
lapicid <-> cpu id  : change.


I personally think the proper fix is building a persistent cpu-id <-> lapicid
relationship as pxm does, rather than creating a band-aid.


Oh if this is possible, I agree that's the right direction too.



Implementation is a bit complicated now :(.

Thanks,
-Kame


Thanks.






Re: [PATCH] workqueue: update numa affinity when node hotplug

2015-03-02 Thread Kamezawa Hiroyuki

On 2015/02/27 20:54, Tejun Heo wrote:

Hello,

On Fri, Feb 27, 2015 at 06:04:52PM +0800, Gu Zheng wrote:

Yasuaki Ishimatsu found that with node online/offline, the cpu<->node
relationship is re-established. Workqueue uses info which was established at
boot time, but it may be changed by node hotplugging.


I've asked this a couple times now but can somebody please justify why
cpu<->node relationship needs to change?  If it has to change, that's
okay but let's please first make sure that we understand why such
thing is necessary so that we can figure out what kind of facilities
are necessary for such dynamism.

Thanks.



Let me start by explaining the current behavior.

- cpu-id is determined when a new processor (lapicid/x2apicid) is found.
  cpu-id<->nodeid relationship is _not_ recorded.

- node-id is determined when a new pxm (firmware info) is found.
  pxm<->nodeid relationship is recorded.

By this, there are 2 cases of cpu<->nodeid change.

Case A) In x86, cpus on memory-less nodes are all tied to existing nodes (round
robin).
When memory hot-add happens and a new node comes up, cpus are moved to the newly
added node based on pxm.

Case B) When adding a node after removing another node, if their pxms were
different from each other, the cpu<->node relationship changes.

I personally think the proper fix is building a persistent cpu-id <-> lapicid
relationship as pxm does, rather than creating a band-aid.

But I have no good idea how to handle case A), which is used now.
Hm, Case A will not be a problem for the case of workqueue's kmalloc because a
node with ZONE_NORMAL cannot be removed.
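
(Editorial illustration of the Case A round-robin tying: a condensed,
approximate sketch of arch/x86/mm/numa.c::numa_init_array() -- simplified here,
so treat the details as an assumption -- is:

	static void __init numa_init_array(void)
	{
		int i, rr = first_node(node_online_map);

		for (i = 0; i < nr_cpu_ids; i++) {
			/* this cpu already has a node (one with memory) */
			if (early_cpu_to_node(i) != NUMA_NO_NODE)
				continue;
			/* otherwise: round-robin onto some online node */
			numa_set_node(i, rr);
			rr = next_node(rr, node_online_map);
			if (rr == MAX_NUMNODES)
				rr = first_node(node_online_map);
		}
	}

cpus whose firmware node was ignored end up on whichever node the round robin
picks, which is why they can move when the real node is hot-added later.)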

Thanks,
-Kame



Re: [PATCH 1/2] workqueue: update numa affinity info at node hotplug

2014-12-16 Thread Kamezawa Hiroyuki

(2014/12/17 12:22), Kamezawa Hiroyuki wrote:

(2014/12/17 10:36), Lai Jiangshan wrote:

On 12/17/2014 12:45 AM, Kamezawa Hiroyuki wrote:

With node online/offline, the cpu<->node relationship is re-established.
Workqueue uses info which was established at boot time but
it may be changed by node hotplugging.

Once pool->node points to a stale node, following allocation failure
happens.
   ==
  SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
   cache: kmalloc-192, object size: 192, buffer size: 192, default
order:
 1, min order: 0
   node 0: slabs: 6172, objs: 259224, free: 245741
   node 1: slabs: 3261, objs: 136962, free: 127656
 ==
This patch updates per cpu workqueue pool's node affinity and
updates wq_numa_possible_cpumask at node online/offline event.
This update of mask is very important because it affects cpumasks
and preferred node detection.

Unbound workqueue's per node pool are updated by
by wq_update_unbound_numa() at CPU_DOWN_PREPARE of the last cpu, by existing 
code.
What is important here is to avoid wrong node detection when a cpu gets onlined.
And it's handled by wq_numa_possible_cpumask update introduced by this patch.

Changelog v3->v4:
  - added workqueue_node_unregister
  - clear wq_numa_possible_cpumask at node offline.
  - merged a patch which handles per cpu pools.
  - clear per-cpu-pool's pool->node at node offlining.
  - set per-cpu-pool's pool->node at node onlining.
  - dropped modification to get_unbound_pool()
  - dropped per-cpu-pool handling at cpu online/offline.

Reported-by:  Yasuaki Ishimatsu 
Signed-off-by: KAMEZAWA Hiroyuki 
---
  include/linux/memory_hotplug.h |  3 +++
  kernel/workqueue.c | 58 +-
  mm/memory_hotplug.c|  6 -
  3 files changed, 65 insertions(+), 2 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 8f1a419..7b4a292 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -270,4 +270,7 @@ extern void sparse_remove_one_section(struct zone *zone, 
struct mem_section *ms)
  extern struct page *sparse_decode_mem_map(unsigned long coded_mem_map,
unsigned long pnum);

+/* update for workqueues */
+void workqueue_node_register(int node);
+void workqueue_node_unregister(int node);
  #endif /* __LINUX_MEMORY_HOTPLUG_H */
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 6202b08..f6ad05a 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -266,7 +266,7 @@ struct workqueue_struct {
  static struct kmem_cache *pwq_cache;

  static cpumask_var_t *wq_numa_possible_cpumask;
-/* possible CPUs of each node */
+/* PL: possible CPUs of each node */

  static bool wq_disable_numa;
  module_param_named(disable_numa, wq_disable_numa, bool, 0444);
@@ -4563,6 +4563,62 @@ static void restore_unbound_workers_cpumask(struct 
worker_pool *pool, int cpu)
  WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task,
pool->attrs->cpumask) < 0);
  }
+#ifdef CONFIG_MEMORY_HOTPLUG
+
+static void workqueue_update_cpu_numa_affinity(int cpu, int node)
+{
+struct worker_pool *pool;
+
+if (node != cpu_to_node(cpu))
+return;
+cpumask_set_cpu(cpu, wq_numa_possible_cpumask[node]);
+for_each_cpu_worker_pool(pool, cpu)
+pool->node = node;


Again, You need to check and update all the wq->numa_pwq_tbl[oldnode],
but in this patchset, the required information is lost and we can't find out 
oldnode.


cpus of oldnode, 16-31(online),48,56,64,72(offline,randomly assigned to the 
oldnode by numa_init_array())

and then cpu#48 is allocated for newnode and online

Now, the wq->numa_pwq_tbl[oldnode]'s cpumask still has cpu#48, and work may be
scheduled to cpu#48.
See the information of my patch 4/5



That will not cause page allocation failure, right ? If so, it's out of scope 
of this patch 1/2.

I think it's handled in patch 2/2, isn't it ?


Let me correct my words. The main purpose of this patch 1/2 is handling the case
"node disappears" after boot,
and to try to handle physical node hotplug cases.

Changes of the cpu<->node relationship at CPU_ONLINE are handled in patch 2/2.

Thanks,
-Kame





Re: [PATCH 1/2] workqueue: update numa affinity info at node hotplug

2014-12-16 Thread Kamezawa Hiroyuki

(2014/12/17 10:36), Lai Jiangshan wrote:

On 12/17/2014 12:45 AM, Kamezawa Hiroyuki wrote:

With node online/offline, the cpu<->node relationship is re-established.
Workqueue uses info which was established at boot time but
it may be changed by node hotplugging.

Once pool->node points to a stale node, following allocation failure
happens.
   ==
  SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
   cache: kmalloc-192, object size: 192, buffer size: 192, default
order:
 1, min order: 0
   node 0: slabs: 6172, objs: 259224, free: 245741
   node 1: slabs: 3261, objs: 136962, free: 127656
 ==
This patch updates per cpu workqueue pool's node affinity and
updates wq_numa_possible_cpumask at node online/offline event.
This update of mask is very important because it affects cpumasks
and preferred node detection.

Unbound workqueue's per node pool are updated by
by wq_update_unbound_numa() at CPU_DOWN_PREPARE of the last cpu, by existing 
code.
What is important here is to avoid wrong node detection when a cpu gets onlined.
And it's handled by wq_numa_possible_cpumask update introduced by this patch.

Changelog v3->v4:
  - added workqueue_node_unregister
  - clear wq_numa_possible_cpumask at node offline.
  - merged a patch which handles per cpu pools.
  - clear per-cpu-pool's pool->node at node offlining.
  - set per-cpu-pool's pool->node at node onlining.
  - dropped modification to get_unbound_pool()
  - dropped per-cpu-pool handling at cpu online/offline.

Reported-by:  Yasuaki Ishimatsu 
Signed-off-by: KAMEZAWA Hiroyuki 
---
  include/linux/memory_hotplug.h |  3 +++
  kernel/workqueue.c | 58 +-
  mm/memory_hotplug.c|  6 -
  3 files changed, 65 insertions(+), 2 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 8f1a419..7b4a292 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -270,4 +270,7 @@ extern void sparse_remove_one_section(struct zone *zone, 
struct mem_section *ms)
  extern struct page *sparse_decode_mem_map(unsigned long coded_mem_map,
  unsigned long pnum);

+/* update for workqueues */
+void workqueue_node_register(int node);
+void workqueue_node_unregister(int node);
  #endif /* __LINUX_MEMORY_HOTPLUG_H */
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 6202b08..f6ad05a 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -266,7 +266,7 @@ struct workqueue_struct {
  static struct kmem_cache *pwq_cache;

  static cpumask_var_t *wq_numa_possible_cpumask;
-   /* possible CPUs of each node */
+   /* PL: possible CPUs of each node */

  static bool wq_disable_numa;
  module_param_named(disable_numa, wq_disable_numa, bool, 0444);
@@ -4563,6 +4563,62 @@ static void restore_unbound_workers_cpumask(struct 
worker_pool *pool, int cpu)
WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task,
  pool->attrs->cpumask) < 0);
  }
+#ifdef CONFIG_MEMORY_HOTPLUG
+
+static void workqueue_update_cpu_numa_affinity(int cpu, int node)
+{
+   struct worker_pool *pool;
+
+   if (node != cpu_to_node(cpu))
+   return;
+   cpumask_set_cpu(cpu, wq_numa_possible_cpumask[node]);
+   for_each_cpu_worker_pool(pool, cpu)
+   pool->node = node;


Again, You need to check and update all the wq->numa_pwq_tbl[oldnode],
but in this patchset, the required information is lost and we can't find out 
oldnode.


cpus of oldnode, 16-31(online),48,56,64,72(offline,randomly assigned to the 
oldnode by numa_init_array())

and then cpu#48 is allocated for newnode and online

Now, the wq->numa_pwq_tbl[oldnode]'s cpumask still has cpu#48, and work may be
scheduled to cpu#48.
See the information of my patch 4/5



That will not cause page allocation failure, right ? If so, it's out of scope 
of this patch 1/2.

I think it's handled in patch 2/2, isn't it ?

Thanks,
-Kame






+}
+
+/*
+ * When a cpu is physically added, cpu<->node relationship is established
+ * based on firmware info. We can catch the whole view when a new NODE_DATA()
+ * comes up (a node is added).
+ * If we don't update the info, pool->node will point to a not-online node
+ * and the kernel will have allocation failures.
+ *
+ * Update wq_numa_possible_cpumask at online and clear it at offline.
+ */
+void workqueue_node_register(int node)
+{
+   int cpu;
+
+   mutex_lock(&wq_pool_mutex);
+   for_each_possible_cpu(cpu)
+   workqueue_update_cpu_numa_affinity(cpu, node);
+   /* unbound workqueue will be updated when the 1st cpu comes up.*/
+   mutex_unlock(&wq_pool_mutex);
+}
+
+void workqueue_node_unregister(int node)
+{
+   struct worker_pool *p

[PATCH 2/2] workqueue: update cpumask at CPU_ONLINE if necessary

2014-12-16 Thread Kamezawa Hiroyuki
In some cases, a cpu's numa affinity will be changed in cpu_up().
It happens after a new node is onlined.
(In x86, online cpus are tied to an onlined node at boot,
 so, if memory is added later, the cpu mapping can be changed at cpu_up().)

Although wq_numa_possible_cpumask et al. are maintained against
node hotplug, this case should be handled.

Signed-off-by: KAMEZAWA Hiroyuki 
---
 kernel/workqueue.c | 23 +++
 1 file changed, 23 insertions(+)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index f6ad05a..59d8be5 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -4618,6 +4618,27 @@ void workqueue_node_unregister(int node)
mutex_unlock(&wq_pool_mutex);
 }
 
+static void workqueue_may_update_numa_affinity(int cpu)
+{
+   int curnode = cpu_to_node(cpu);
+   int node;
+
+   if (likely(cpumask_test_cpu(cpu, wq_numa_possible_cpumask[curnode])))
+   return;
+
+   /* cpu<->node relationship is changed in cpu_up() */
+   for_each_node_state(node, N_POSSIBLE)
+   cpumask_clear_cpu(cpu, wq_numa_possible_cpumask[node]);
+
+   workqueue_update_cpu_numa_affinity(cpu, curnode);
+}
+#else
+
+static void workqueue_may_update_numa_affinity(int cpu)
+{
+   return;
+}
+
 #endif
 
 /*
@@ -4647,6 +4668,8 @@ static int workqueue_cpu_up_callback(struct 
notifier_block *nfb,
case CPU_ONLINE:
mutex_lock(&wq_pool_mutex);
 
+   workqueue_may_update_numa_affinity(cpu);
+
for_each_pool(pool, pi) {
mutex_lock(&pool->attach_mutex);
 
-- 
1.8.3.1





[PATCH 1/2] workqueue: update numa affinity info at node hotplug

2014-12-16 Thread Kamezawa Hiroyuki
With node online/offline, the cpu<->node relationship is re-established.
Workqueue uses info which was established at boot time but
it may be changed by node hotplugging.

Once pool->node points to a stale node, following allocation failure
happens.
  ==
 SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
  cache: kmalloc-192, object size: 192, buffer size: 192, default
order:
1, min order: 0
  node 0: slabs: 6172, objs: 259224, free: 245741
  node 1: slabs: 3261, objs: 136962, free: 127656
==
This patch updates per cpu workqueue pool's node affinity and
updates wq_numa_possible_cpumask at node online/offline event.
This update of mask is very important because it affects cpumasks
and preferred node detection.

Unbound workqueue's per node pool are updated by
by wq_update_unbound_numa() at CPU_DOWN_PREPARE of the last cpu, by existing 
code.
What is important here is to avoid wrong node detection when a cpu gets onlined.
And it's handled by wq_numa_possible_cpumask update introduced by this patch.

Changelog v3->v4:
 - added workqueue_node_unregister
 - clear wq_numa_possible_cpumask at node offline.
 - merged a patch which handles per cpu pools.
 - clear per-cpu-pool's pool->node at node offlining.
 - set per-cpu-pool's pool->node at node onlining.
 - dropped modification to get_unbound_pool()
 - dropped per-cpu-pool handling at cpu online/offline.

Reported-by:  Yasuaki Ishimatsu 
Signed-off-by: KAMEZAWA Hiroyuki 
---
 include/linux/memory_hotplug.h |  3 +++
 kernel/workqueue.c | 58 +-
 mm/memory_hotplug.c|  6 -
 3 files changed, 65 insertions(+), 2 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 8f1a419..7b4a292 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -270,4 +270,7 @@ extern void sparse_remove_one_section(struct zone *zone, 
struct mem_section *ms)
 extern struct page *sparse_decode_mem_map(unsigned long coded_mem_map,
  unsigned long pnum);
 
+/* update for workqueues */
+void workqueue_node_register(int node);
+void workqueue_node_unregister(int node);
 #endif /* __LINUX_MEMORY_HOTPLUG_H */
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 6202b08..f6ad05a 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -266,7 +266,7 @@ struct workqueue_struct {
 static struct kmem_cache *pwq_cache;
 
 static cpumask_var_t *wq_numa_possible_cpumask;
-   /* possible CPUs of each node */
+   /* PL: possible CPUs of each node */
 
 static bool wq_disable_numa;
 module_param_named(disable_numa, wq_disable_numa, bool, 0444);
@@ -4563,6 +4563,62 @@ static void restore_unbound_workers_cpumask(struct 
worker_pool *pool, int cpu)
WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task,
  pool->attrs->cpumask) < 0);
 }
+#ifdef CONFIG_MEMORY_HOTPLUG
+
+static void workqueue_update_cpu_numa_affinity(int cpu, int node)
+{
+   struct worker_pool *pool;
+
+   if (node != cpu_to_node(cpu))
+   return;
+   cpumask_set_cpu(cpu, wq_numa_possible_cpumask[node]);
+   for_each_cpu_worker_pool(pool, cpu)
+   pool->node = node;
+}
+
+/*
+ * When a cpu is physically added, cpu<->node relationship is established
+ * based on firmware info. We can catch the whole view when a new NODE_DATA()
+ * comes up (a node is added).
+ * If we don't update the info, pool->node will point to a not-online node
+ * and the kernel will have allocation failures.
+ *
+ * Update wq_numa_possible_cpumask at online and clear it at offline.
+ */
+void workqueue_node_register(int node)
+{
+   int cpu;
+
+   mutex_lock(&wq_pool_mutex);
+   for_each_possible_cpu(cpu)
+   workqueue_update_cpu_numa_affinity(cpu, node);
+   /* unbound workqueue will be updated when the 1st cpu comes up.*/
+   mutex_unlock(&wq_pool_mutex);
+}
+
+void workqueue_node_unregister(int node)
+{
+   struct worker_pool *pool;
+   int cpu;
+
+   mutex_lock(&wq_pool_mutex);
+   cpumask_clear(wq_numa_possible_cpumask[node]);
+   for_each_possible_cpu(cpu) {
+   if (node == cpu_to_node(cpu))
+   for_each_cpu_worker_pool(pool, cpu)
+   pool->node = NUMA_NO_NODE;
+   }
+   /*
+* unbound workqueue's per-node pwqs are already refreshed
+* by wq_update_unbound_numa() at CPU_DOWN_PREPARE of the last cpu
+* on this node, because all cpus of this node went down.
+* (see wq_calc_node_cpumask()). per-node unbound pwqs has been replaced
+* with wq->dfl_pwq, already.
+*/
+   mutex_unlock(&wq_pool_mutex);
+}
+
+#endif
 
 /*
  * Workqueu

[PATCH 0/2] workqueue: fix a bug when numa mapping is changed v4

2014-12-16 Thread Kamezawa Hiroyuki
This is v4. Thank you for the hints/comments on previous versions.

I think this version only contains necessary things and is not invasive.
I tested several patterns of node hotplug and it seems to work well.

Changes since v3
 - removed changes against get_unbound_pool()
 - removed code in the cpu offline event.
 - added node unregister callback.
   clear wq_numa_possible_mask at node offline rather than cpu offline.
 - updated per-cpu pool's pool->node at node_(un)register.
 - added more comments.
 - almost all code is under CONFIG_MEMORY_HOTPLUG

 include/linux/memory_hotplug.h |3 +
 kernel/workqueue.c |   81 -
 mm/memory_hotplug.c|6 ++-
 3 files changed, 88 insertions(+), 2 deletions(-)

Original problem was a memory allocation failure because pool->node
points to a not-online node. This happens when the cpu<->node mapping changes.

Yasuaki Ishimatsu hit an allocation failure bug when the numa mapping
between CPU and node is changed. This was the last scene:
 SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
  cache: kmalloc-192, object size: 192, buffer size: 192, default order: 1, min 
order: 0
  node 0: slabs: 6172, objs: 259224, free: 245741
  node 1: slabs: 3261, objs: 136962, free: 127656




Re: [PATCH 3/4] workqueue: Update workqueue's possible cpumask when a new node, coming up.

2014-12-16 Thread Kamezawa Hiroyuki

(2014/12/16 17:10), Kamezawa Hiroyuki wrote:

(2014/12/16 16:49), Lai Jiangshan wrote:

On 12/15/2014 07:18 PM, Kamezawa Hiroyuki wrote:

Workqueue keeps cpu<->node relationship including all possible cpus.
The original information was made at boot but it may change when
a new node is added.

Update the information if a new node is ready, using the node-hotplug callback.

Signed-off-by: KAMEZAWA Hiroyuki 
---
  include/linux/memory_hotplug.h |  2 ++
  kernel/workqueue.c | 28 +++-
  mm/memory_hotplug.c|  4 
  3 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 8f1a419..cd3cb67 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -270,4 +270,6 @@ extern void sparse_remove_one_section(struct zone *zone, 
struct mem_section *ms)
  extern struct page *sparse_decode_mem_map(unsigned long coded_mem_map,
unsigned long pnum);

+/* update for workqueues */
+void workqueue_node_register(int node);
  #endif /* __LINUX_MEMORY_HOTPLUG_H */
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 2fd0bd7..5499b76 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -266,7 +266,7 @@ struct workqueue_struct {
  static struct kmem_cache *pwq_cache;

  static cpumask_var_t *wq_numa_possible_cpumask;
-/* possible CPUs of each node */
+/* PL: possible CPUs of each node */

  static bool wq_disable_numa;
  module_param_named(disable_numa, wq_disable_numa, bool, 0444);
@@ -4559,6 +4559,32 @@ static void restore_unbound_workers_cpumask(struct 
worker_pool *pool, int cpu)
  WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task,
pool->attrs->cpumask) < 0);
  }
+#ifdef CONFIG_MEMORY_HOTPLUG
+
+static void reflesh_wq_possible_mask(int cpu, int node)
+{
+int oldnode;
+for_each_node_state(oldnode, N_POSSIBLE)
+cpumask_clear_cpu(cpu, wq_numa_possible_cpumask[oldnode]);


You need to check and update all the wq->numa_pwq_tbl[oldnode]



You mean: if I drop patch 1/4, it will be required, right?
I'll add that check.


Ah, isn't it handled by wq_update_unbound_numa() called at 
workqueue_cpu_down_callback() ?
Because node's online cpumask will be empty, new pwq will be attached.

Thanks
-Kame






Re: [PATCH 3/4] workqueue: Update workqueue's possible cpumask when a new node, coming up.

2014-12-16 Thread Kamezawa Hiroyuki

(2014/12/16 16:49), Lai Jiangshan wrote:

On 12/15/2014 07:18 PM, Kamezawa Hiroyuki wrote:

Workqueue keeps cpu<->node relationship including all possible cpus.
The original information was made at boot but it may change when
a new node is added.

Update the information if a new node is ready, using the node-hotplug callback.

Signed-off-by: KAMEZAWA Hiroyuki 
---
  include/linux/memory_hotplug.h |  2 ++
  kernel/workqueue.c | 28 +++-
  mm/memory_hotplug.c|  4 
  3 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 8f1a419..cd3cb67 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -270,4 +270,6 @@ extern void sparse_remove_one_section(struct zone *zone, 
struct mem_section *ms)
  extern struct page *sparse_decode_mem_map(unsigned long coded_mem_map,
  unsigned long pnum);

+/* update for workqueues */
+void workqueue_node_register(int node);
  #endif /* __LINUX_MEMORY_HOTPLUG_H */
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 2fd0bd7..5499b76 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -266,7 +266,7 @@ struct workqueue_struct {
  static struct kmem_cache *pwq_cache;

  static cpumask_var_t *wq_numa_possible_cpumask;
-   /* possible CPUs of each node */
+   /* PL: possible CPUs of each node */

  static bool wq_disable_numa;
  module_param_named(disable_numa, wq_disable_numa, bool, 0444);
@@ -4559,6 +4559,32 @@ static void restore_unbound_workers_cpumask(struct 
worker_pool *pool, int cpu)
WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task,
  pool->attrs->cpumask) < 0);
  }
+#ifdef CONFIG_MEMORY_HOTPLUG
+
+static void reflesh_wq_possible_mask(int cpu, int node)
+{
+   int oldnode;
+   for_each_node_state(oldnode, N_POSSIBLE)
+   cpumask_clear_cpu(cpu, wq_numa_possible_cpumask[oldnode]);


You need to check and update all the wq->numa_pwq_tbl[oldnode]



You mean: if I drop patch 1/4, it will be required, right?
I'll add that check.

Thanks,
-Kame



+   cpumask_set_cpu(cpu, wq_numa_possible_cpumask[node]);
+}
+/*
+ * When a cpu is physically added, cpu<->node relationship is established.
+ * We can catch the whole view when a new NODE_DATA() comes up.
+ * Update cpumask used for sched affinity of workers.
+ */
+void workqueue_node_register(int node)
+{
+   int cpu;
+   mutex_lock(&wq_pool_mutex);
+   for_each_possible_cpu(cpu) {
+   if (node == cpu_to_node(cpu))
+   reflesh_wq_possible_mask(cpu, node);
+   }
+   mutex_unlock(&wq_pool_mutex);
+}
+
+#endif

  /*
   * Workqueues should be brought up before normal priority CPU notifiers.
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 1bf4807..d0c1ebb 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1121,6 +1121,10 @@ static pg_data_t __ref *hotadd_new_pgdat(int nid, u64 
start)
 * online_pages() and offline_pages().
 */
reset_node_present_pages(pgdat);
+   /*
+* Update workqueue's cpu/node affinity info.
+*/
+   workqueue_node_register(nid);

return pgdat;
  }







Re: [PATCH 1/4] workqueue:Fix unbound workqueue's node affinity detection

2014-12-15 Thread Kamezawa Hiroyuki

(2014/12/16 14:30), Lai Jiangshan wrote:

On 12/15/2014 07:14 PM, Kamezawa Hiroyuki wrote:

Unbound wq pool's node attribute is calculated at its allocation.
But it's now calculated based on possible cpu<->node information
which can be wrong after cpu hotplug/unplug.

If wrong pool->node is set, following allocation error will happen.
==
  SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
   cache: kmalloc-192, object size: 192, buffer size: 192, default order:
1, min order: 0
   node 0: slabs: 6172, objs: 259224, free: 245741
   node 1: slabs: 3261, objs: 136962, free: 127656
==

This patch fixes the node detection by making use of online cpu info.
Unlike the possible-cpumask approach, the best node can be calculated by the
degree of overlap between attr->cpumask and the node's online cpumask.
This change doesn't break the original purpose of the old calculation.

Note: it's expected that this function is called as
   pool_detect_best_node
   get_unbound_pool
   alloc_unbound_pwq
   wq_update_unbound_numa
   called at CPU_ONLINE/CPU_DOWN_PREPARE
and the latest online cpu info can be applied to a new wq pool,
which replaces old one.

Signed-off-by: KAMEZAWA Hiroyuki 
---
  kernel/workqueue.c | 38 ++
  1 file changed, 26 insertions(+), 12 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 09b685d..7809154 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3440,6 +3440,31 @@ static void put_unbound_pool(struct worker_pool *pool)
  }

  /**
+ * pool_detect_best_node - detect a node which contains specified cpumask.
+ * Should be called with wq_pool_mutex held.
+ * Returns a online node where the most of given cpus are tied to.
+ */
+static int pool_detect_best_node(const struct cpumask *cpumask)
+{
+   int node, best, match, selected = NUMA_NO_NODE;
+   static struct cpumask andmask; /* under wq_pool_mutex */
+
+   if (!wq_numa_enabled ||
+   cpumask_subset(cpu_online_mask, cpumask))
+   goto out;
+   best = 0;
+   /* select a node which contains the most number of cpu */
+   for_each_node_state(node, N_ONLINE) {
+   cpumask_and(&andmask, cpumask, cpumask_of_node(node));
+   match = cpumask_weight(&andmask);
+   if (match > best) {
+   best = match;
+   selected = node;
+   }
+   }
+out:
+   return selected;
+}



This is a mixture of fix and development.  Why not just keep the original 
calculation?


Since wq_numa_possible_mask is broken, it shouldn't be used if possible.
In this patch series, the mask is updated only when a node is coming up.
Is it better to clear it at node offline?


If the mask covers multiple nodes, NUMA_NO_NODE is the best for pool->node
after the pool is created. The memory allocation will then select the best node
for manage_workers(), based on which CPU the worker is actually running on.



I'll drop this and try to keep the original code as much as possible.

Thanks,
-Kame




Re: [PATCH 2/4] workqueue: update per-cpu workqueue's node affinity at,online-offline

2014-12-15 Thread Kamezawa Hiroyuki

(2014/12/16 14:32), Lai Jiangshan wrote:

On 12/15/2014 07:16 PM, Kamezawa Hiroyuki wrote:

The percpu workqueue pools are persistent and never freed.
But cpu<->node relationship can be changed by cpu hotplug and pool->node
can point to an offlined node.

If pool->node points to an offlined node,
following allocation failure can happen.
 ==
  SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
   cache: kmalloc-192, object size: 192, buffer size: 192, default
order:
 1, min order: 0
   node 0: slabs: 6172, objs: 259224, free: 245741
   node 1: slabs: 3261, objs: 136962, free: 127656
 ==

This patch clears per-cpu workqueue pool's node affinity at
cpu offlining and restore it at cpu onlining.

Signed-off-by: KAMEZAWA Hiroyuki 
---
  kernel/workqueue.c | 11 ++-
  1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 7809154..2fd0bd7 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -4586,6 +4586,11 @@ static int workqueue_cpu_up_callback(struct 
notifier_block *nfb,
case CPU_DOWN_FAILED:
case CPU_ONLINE:
mutex_lock(&wq_pool_mutex);
+   /*
+* now cpu <-> node info is established. update numa node
+*/
+   for_each_cpu_worker_pool(pool, cpu)
+   pool->node = cpu_to_node(cpu);

for_each_pool(pool, pi) {
mutex_lock(&pool->attach_mutex);
@@ -4619,6 +4624,7 @@ static int workqueue_cpu_down_callback(struct 
notifier_block *nfb,
int cpu = (unsigned long)hcpu;
struct work_struct unbind_work;
struct workqueue_struct *wq;
+   struct worker_pool *pool;

switch (action & ~CPU_TASKS_FROZEN) {
case CPU_DOWN_PREPARE:
@@ -4626,10 +4632,13 @@ static int workqueue_cpu_down_callback(struct 
notifier_block *nfb,
INIT_WORK_ONSTACK(&unbind_work, wq_unbind_fn);
queue_work_on(cpu, system_highpri_wq, &unbind_work);

-   /* update NUMA affinity of unbound workqueues */
mutex_lock(&wq_pool_mutex);
+   /* update NUMA affinity of unbound workqueues */
list_for_each_entry(wq, &workqueues, list)
wq_update_unbound_numa(wq, cpu, false);
+   /* clear per-cpu workqueues's numa affinity. */
+   for_each_cpu_worker_pool(pool, cpu)
+   pool->node = NUMA_NO_NODE; /* restored at online */



If the node is still online, it is better to keep the original pool->node.



Hm, ok, I'll drop this code, or move it all to the node online event handler.

Thanks,
-Kame




[PATCH 4/4] workqueue: Handle cpu-node affinity change at CPU_ONLINE.

2014-12-15 Thread Kamezawa Hiroyuki
This is a corner case fix. In general, the cpu-to-node relationship is
fixed when cpu/memory become visible to the kernel, based on firmware
information. But in some cases, the cpu-to-node relationship is updated
at CPU_ONLINE.

In arch/x86/mm/numa.c::numa_init_array(), a cpu will be tied to
a random numa node if firmware says its node is a memory-less node.
Such a cpu's node affinity can be changed in cpu_up().

For example,
  step 1. boot with the mem= boot option to hide memory.
  step 2. online the hidden memory using memory hot-add.
  step 3. offline cpus.
  step 4. online cpus.

In step 4, a cpu's numa node affinity can be modified if the memory-less
node turned out to be a usual node in step 2.

This patch handles the event in CPU_ONLINE callback of workqueue.

Signed-off-by: KAMEZAWA Hiroyuki 
---
 kernel/workqueue.c | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 5499b76..cc0e1d4 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -4583,6 +4583,20 @@ void workqueue_node_register(int node)
}
mutex_unlock(&wq_pool_mutex);
 }
+/*
+ * When a cpu boots up with a memory-less node, the cpu is tied to
+ * a random node (see arch/x86/mm/numa.c::numa_init_array()).
+ * The relationship is fixed when a proper memory node comes up and
+ * it's visible after CPU_ONLINE. Check and refresh it.
+ */
+void check_reflesh_node_cpumask(int cpu)
+{
+   int node = cpu_to_node(cpu);
+
+   if (likely(cpumask_test_cpu(cpu, wq_numa_possible_cpumask[node])))
+   return;
+   reflesh_wq_possible_mask(cpu, node);
+}
 
 #endif
 
@@ -4617,6 +4631,7 @@ static int workqueue_cpu_up_callback(struct 
notifier_block *nfb,
 */
for_each_cpu_worker_pool(pool, cpu)
pool->node = cpu_to_node(cpu);
+   check_reflesh_node_cpumask(cpu);
 
for_each_pool(pool, pi) {
mutex_lock(&pool->attach_mutex);
-- 
1.8.3.1





[PATCH 3/4] workqueue: Update workqueue's possible cpumask when a new node, coming up.

2014-12-15 Thread Kamezawa Hiroyuki
Workqueue keeps cpu<->node relationship including all possible cpus.
The original information was made at boot but it may change when
a new node is added.

Update the information if a new node is ready, using the node-hotplug callback.

Signed-off-by: KAMEZAWA Hiroyuki 
---
 include/linux/memory_hotplug.h |  2 ++
 kernel/workqueue.c | 28 +++-
 mm/memory_hotplug.c|  4 
 3 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 8f1a419..cd3cb67 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -270,4 +270,6 @@ extern void sparse_remove_one_section(struct zone *zone, 
struct mem_section *ms)
 extern struct page *sparse_decode_mem_map(unsigned long coded_mem_map,
  unsigned long pnum);
 
+/* update for workqueues */
+void workqueue_node_register(int node);
 #endif /* __LINUX_MEMORY_HOTPLUG_H */
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 2fd0bd7..5499b76 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -266,7 +266,7 @@ struct workqueue_struct {
 static struct kmem_cache *pwq_cache;
 
 static cpumask_var_t *wq_numa_possible_cpumask;
-   /* possible CPUs of each node */
+   /* PL: possible CPUs of each node */
 
 static bool wq_disable_numa;
 module_param_named(disable_numa, wq_disable_numa, bool, 0444);
@@ -4559,6 +4559,32 @@ static void restore_unbound_workers_cpumask(struct 
worker_pool *pool, int cpu)
WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task,
  pool->attrs->cpumask) < 0);
 }
+#ifdef CONFIG_MEMORY_HOTPLUG
+
+static void reflesh_wq_possible_mask(int cpu, int node)
+{
+   int oldnode;
+   for_each_node_state(oldnode, N_POSSIBLE)
+   cpumask_clear_cpu(cpu, wq_numa_possible_cpumask[oldnode]);
+   cpumask_set_cpu(cpu, wq_numa_possible_cpumask[node]);
+}
+/*
+ * When a cpu is physically added, cpu<->node relationship is established.
+ * We can catch the whole view when a new NODE_DATA() comes up.
+ * Update cpumask used for sched affinity of workers.
+ */
+void workqueue_node_register(int node)
+{
+   int cpu;
+   mutex_lock(&wq_pool_mutex);
+   for_each_possible_cpu(cpu) {
+   if (node == cpu_to_node(cpu))
+   reflesh_wq_possible_mask(cpu, node);
+   }
+   mutex_unlock(&wq_pool_mutex);
+}
+
+#endif
 
 /*
  * Workqueues should be brought up before normal priority CPU notifiers.
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 1bf4807..d0c1ebb 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1121,6 +1121,10 @@ static pg_data_t __ref *hotadd_new_pgdat(int nid, u64 
start)
 * online_pages() and offline_pages().
 */
reset_node_present_pages(pgdat);
+   /*
+* Update workqueue's cpu/node affinity info.
+*/
+   workqueue_node_register(nid);
 
return pgdat;
 }
-- 
1.8.3.1





[PATCH 2/4] workqueue: update per-cpu workqueue's node affinity at,online-offline

2014-12-15 Thread Kamezawa Hiroyuki
The percpu workqueue pools are persistent and never freed.
But cpu<->node relationship can be changed by cpu hotplug and pool->node
can point to an offlined node.

If pool->node points to an offlined node,
following allocation failure can happen.
==
 SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
  cache: kmalloc-192, object size: 192, buffer size: 192, default
order:
1, min order: 0
  node 0: slabs: 6172, objs: 259224, free: 245741
  node 1: slabs: 3261, objs: 136962, free: 127656
==

This patch clears per-cpu workqueue pool's node affinity at
cpu offlining and restore it at cpu onlining.

Signed-off-by: KAMEZAWA Hiroyuki 
---
 kernel/workqueue.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 7809154..2fd0bd7 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -4586,6 +4586,11 @@ static int workqueue_cpu_up_callback(struct 
notifier_block *nfb,
case CPU_DOWN_FAILED:
case CPU_ONLINE:
mutex_lock(&wq_pool_mutex);
+   /*
+* now cpu <-> node info is established. update numa node
+*/
+   for_each_cpu_worker_pool(pool, cpu)
+   pool->node = cpu_to_node(cpu);
 
for_each_pool(pool, pi) {
mutex_lock(&pool->attach_mutex);
@@ -4619,6 +4624,7 @@ static int workqueue_cpu_down_callback(struct 
notifier_block *nfb,
int cpu = (unsigned long)hcpu;
struct work_struct unbind_work;
struct workqueue_struct *wq;
+   struct worker_pool *pool;
 
switch (action & ~CPU_TASKS_FROZEN) {
case CPU_DOWN_PREPARE:
@@ -4626,10 +4632,13 @@ static int workqueue_cpu_down_callback(struct 
notifier_block *nfb,
INIT_WORK_ONSTACK(&unbind_work, wq_unbind_fn);
queue_work_on(cpu, system_highpri_wq, &unbind_work);
 
-   /* update NUMA affinity of unbound workqueues */
mutex_lock(&wq_pool_mutex);
+   /* update NUMA affinity of unbound workqueues */
list_for_each_entry(wq, &workqueues, list)
wq_update_unbound_numa(wq, cpu, false);
+   /* clear per-cpu workqueues's numa affinity. */
+   for_each_cpu_worker_pool(pool, cpu)
+   pool->node = NUMA_NO_NODE; /* restored at online */
mutex_unlock(&wq_pool_mutex);
 
/* wait for per-cpu unbinding to finish */
-- 
1.8.3.1






[PATCH 1/4] workqueue:Fix unbound workqueue's node affinity detection

2014-12-15 Thread Kamezawa Hiroyuki
Unbound wq pool's node attribute is calculated at its allocation.
But it's now calculated based on possible cpu<->node information
which can be wrong after cpu hotplug/unplug.

If wrong pool->node is set, following allocation error will happen.
==
 SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
  cache: kmalloc-192, object size: 192, buffer size: 192, default order:
1, min order: 0
  node 0: slabs: 6172, objs: 259224, free: 245741
  node 1: slabs: 3261, objs: 136962, free: 127656
==

This patch fixes the node detection by making use of online cpu info.
Unlike the possible-cpumask approach, the best node can be calculated by the
degree of overlap between attr->cpumask and the node's online cpumask.
This change doesn't break the original purpose of the old calculation.

Note: it's expected that this function is called as
  pool_detect_best_node
  get_unbound_pool
  alloc_unbound_pwq
  wq_update_unbound_numa
  called at CPU_ONLINE/CPU_DOWN_PREPARE
and the latest online cpu info can be applied to a new wq pool,
which replaces old one.

Signed-off-by: KAMEZAWA Hiroyuki 
---
 kernel/workqueue.c | 38 ++
 1 file changed, 26 insertions(+), 12 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 09b685d..7809154 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3440,6 +3440,31 @@ static void put_unbound_pool(struct worker_pool *pool)
 }
 
 /**
+ * pool_detect_best_node - detect a node which contains specified cpumask.
+ * Should be called with wq_pool_mutex held.
+ * Returns a online node where the most of given cpus are tied to. 
+ */
+static int pool_detect_best_node(const struct cpumask *cpumask)
+{
+   int node, best, match, selected = NUMA_NO_NODE;
+   static struct cpumask andmask; /* under wq_pool_mutex */
+
+   if (!wq_numa_enabled ||
+   cpumask_subset(cpu_online_mask, cpumask))
+   goto out;
+   best = 0;
+   /* select a node which contains the most number of cpu */
+   for_each_node_state(node, N_ONLINE) {
+   cpumask_and(&andmask, cpumask, cpumask_of_node(node));
+   match = cpumask_weight(&andmask);
+   if (match > best) {
+   best = match;
+   selected = node;
+   }
+   }
+out:
+   return selected;
+}
+
+/**
  * get_unbound_pool - get a worker_pool with the specified attributes
  * @attrs: the attributes of the worker_pool to get
  *
@@ -3457,7 +3482,6 @@ static struct worker_pool *get_unbound_pool(const struct 
workqueue_attrs *attrs)
 {
u32 hash = wqattrs_hash(attrs);
struct worker_pool *pool;
-   int node;
 
lockdep_assert_held(&wq_pool_mutex);
 
@@ -3482,17 +3506,7 @@ static struct worker_pool *get_unbound_pool(const struct 
workqueue_attrs *attrs)
 * 'struct workqueue_attrs' comments for detail.
 */
pool->attrs->no_numa = false;
-
-   /* if cpumask is contained inside a NUMA node, we belong to that node */
-   if (wq_numa_enabled) {
-   for_each_node(node) {
-   if (cpumask_subset(pool->attrs->cpumask,
-  wq_numa_possible_cpumask[node])) {
-   pool->node = node;
-   break;
-   }
-   }
-   }
+   pool->node = pool_detect_best_node(pool->attrs->cpumask);
 
if (worker_pool_assign_id(pool) < 0)
goto fail;
-- 
1.8.3.1





[PATCH 0/4] workqueue: fix memory allocation after numa mapping is changed v3

2014-12-15 Thread Kamezawa Hiroyuki
Lai-san, Tejun-san,

Thank you for the review; this is fix v3. It has been tested on a NUMA node
hotplug machine and seems to work well.

The problem is a memory allocation failure because pool->node information can be
stale after node hotplug. Patches (1,2) try to fix the pool->node calculation.
Patches (3,4) try to update the cpumask calculation.
(Fixing the memory allocation bug requires only patches 1,2, but the cpumask
 should be updated, I think.)

Changelog since v2.
 - reordered patches and split them, one for each problem case.
- removed unnecessary calls pointed out by Lai-san.
- restore node/cpu relationship when a new node comes online.
- handle corner case at CPU_ONLINE.

1/4  fix unbound workqueue's memory node affinity calculation.
2/4 ...  update percpu workqueue's memory node affinity at online/offline
3/4 ...  update workqueue's possible cpumask when a new node onlined.
4/4 ...  handle cpu-node affinity change at CPU_ONLINE.

Thanks,
-Kame



Re: [PATCH 4/4] workqueue: handle change in cpu-node relationship.

2014-12-14 Thread Kamezawa Hiroyuki

(2014/12/15 14:19), Lai Jiangshan wrote:

On 12/15/2014 12:04 PM, Kamezawa Hiroyuki wrote:

(2014/12/15 12:34), Lai Jiangshan wrote:

On 12/15/2014 10:55 AM, Kamezawa Hiroyuki wrote:

(2014/12/15 11:48), Lai Jiangshan wrote:

On 12/15/2014 10:20 AM, Kamezawa Hiroyuki wrote:

(2014/12/15 11:12), Lai Jiangshan wrote:

On 12/14/2014 12:38 AM, Kamezawa Hiroyuki wrote:

Although workqueue detects relationship between cpu<->node at boot,
it is finally determined in cpu_up().
This patch tries to update pool->node using online status of cpus.

1. When a node goes down, clear per-cpu pool's node attr.
2. When a cpu comes up, update per-cpu pool's node attr.
3. When a cpu comes up, update the possible node cpumask workqueue is using for
sched.
4. Detect the best node for unbound pool's cpumask using the latest info.

Signed-off-by: KAMEZAWA Hiroyuki 
---
 kernel/workqueue.c | 67 
++
 1 file changed, 53 insertions(+), 14 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 07b4eb5..259b3ba 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -266,7 +266,8 @@ struct workqueue_struct {
 static struct kmem_cache *pwq_cache;

 static cpumask_var_t *wq_numa_possible_cpumask;
-/* possible CPUs of each node */
+/* possible CPUs of each node initialized with possible info at boot.
+   but modified at cpu hotplug to be adjusted to real info.  */

 static bool wq_disable_numa;
 module_param_named(disable_numa, wq_disable_numa, bool, 0444);
@@ -3449,6 +3450,31 @@ static void put_unbound_pool(struct worker_pool *pool)
 call_rcu_sched(&pool->rcu, rcu_free_pool);
 }

+/*
+ * detect best node for given cpumask.
+ */
+static int pool_detect_best_node(const struct cpumask *cpumask)
+{
+int node, best, match, selected;
+static struct cpumask andmask; /* we're under mutex */
+
+/* Is any node okay ? */
+if (!wq_numa_enabled ||
+cpumask_subset(cpu_online_mask, cpumask))
+return NUMA_NO_NODE;
+best = 0;
+selected = NUMA_NO_NODE;
+/* select a node which contains the most cpu of cpumask */
+for_each_node_state(node, N_ONLINE) {
+cpumask_and(&andmask, cpumask, cpumask_of_node(node));
+match = cpumask_weight(&andmask);
+if (match > best)
+selected = node;
+}
+return selected;
+}
+
+
 /**
  * get_unbound_pool - get a worker_pool with the specified attributes
  * @attrs: the attributes of the worker_pool to get
@@ -3467,7 +3493,6 @@ static struct worker_pool *get_unbound_pool(const struct 
workqueue_attrs *attrs)
 {
 u32 hash = wqattrs_hash(attrs);
 struct worker_pool *pool;
-int node;

 lockdep_assert_held(&wq_pool_mutex);

@@ -3492,17 +3517,7 @@ static struct worker_pool *get_unbound_pool(const struct 
workqueue_attrs *attrs)
  * 'struct workqueue_attrs' comments for detail.
  */
 pool->attrs->no_numa = false;
-
-/* if cpumask is contained inside a NUMA node, we belong to that node */
-if (wq_numa_enabled) {
-for_each_node(node) {
-if (cpumask_subset(pool->attrs->cpumask,
-   wq_numa_possible_cpumask[node])) {
-pool->node = node;
-break;
-}
-}
-}
+pool->node = pool_detect_best_node(pool->attrs->cpumask);

 if (worker_pool_assign_id(pool) < 0)
 goto fail;
@@ -4567,7 +4582,7 @@ static int workqueue_cpu_up_callback(struct 
notifier_block *nfb,
 int cpu = (unsigned long)hcpu;
 struct worker_pool *pool;
 struct workqueue_struct *wq;
-int pi;
+int pi, node;

 switch (action & ~CPU_TASKS_FROZEN) {
 case CPU_UP_PREPARE:
@@ -4583,6 +4598,16 @@ static int workqueue_cpu_up_callback(struct 
notifier_block *nfb,
 case CPU_ONLINE:
 mutex_lock(&wq_pool_mutex);

+/* now cpu <-> node info is established, update the info. */
+if (!wq_disable_numa) {





+for_each_node_state(node, N_POSSIBLE)
+cpumask_clear_cpu(cpu,
+wq_numa_possible_cpumask[node]);


The wq code tries to reuse the original pwqs/pools when the node still has a cpu
online. These 3 lines of code will put the original pwqs/pools on the road to
dying and create a new set of pwqs/pools.


because the result of wq_calc_node_cpumask() changes ?


Yes.


Do you mean some comment should be added here ? Or an explanation for your reply
for [3/4] ?


this fix [4/4] breaks the original design.



I'm sorry that I can't understand what this patch breaks.


the pwqs/pools should be kept if the node still has a cpu online.


So, the fix's grand design should be

  1. drop old pwq/pools only at node offline.
  2. set proper 

Re: [PATCH 4/4] workqueue: handle change in cpu-node relationship.

2014-12-14 Thread Kamezawa Hiroyuki

(2014/12/15 12:34), Lai Jiangshan wrote:

On 12/15/2014 10:55 AM, Kamezawa Hiroyuki wrote:

(2014/12/15 11:48), Lai Jiangshan wrote:

On 12/15/2014 10:20 AM, Kamezawa Hiroyuki wrote:

(2014/12/15 11:12), Lai Jiangshan wrote:

On 12/14/2014 12:38 AM, Kamezawa Hiroyuki wrote:

Although workqueue detects relationship between cpu<->node at boot,
it is finally determined in cpu_up().
This patch tries to update pool->node using online status of cpus.

1. When a node goes down, clear per-cpu pool's node attr.
2. When a cpu comes up, update per-cpu pool's node attr.
3. When a cpu comes up, update possible node cpumask workqueue is using for 
sched.
4. Detect the best node for unbound pool's cpumask using the latest info.

Signed-off-by: KAMEZAWA Hiroyuki 
---
kernel/workqueue.c | 67 
++
1 file changed, 53 insertions(+), 14 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 07b4eb5..259b3ba 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -266,7 +266,8 @@ struct workqueue_struct {
static struct kmem_cache *pwq_cache;

static cpumask_var_t *wq_numa_possible_cpumask;
-/* possible CPUs of each node */
+/* possible CPUs of each node initialized with possible info at boot.
+   but modified at cpu hotplug to be adjusted to real info.  */

static bool wq_disable_numa;
module_param_named(disable_numa, wq_disable_numa, bool, 0444);
@@ -3449,6 +3450,31 @@ static void put_unbound_pool(struct worker_pool *pool)
call_rcu_sched(&pool->rcu, rcu_free_pool);
}

+/*
+ * detect best node for given cpumask.
+ */
+static int pool_detect_best_node(const struct cpumask *cpumask)
+{
+int node, best, match, selected;
+static struct cpumask andmask; /* we're under mutex */
+
+/* Is any node okay ? */
+if (!wq_numa_enabled ||
+cpumask_subset(cpu_online_mask, cpumask))
+return NUMA_NO_NODE;
+best = 0;
+selected = NUMA_NO_NODE;
+/* select a node which contains the most cpu of cpumask */
+for_each_node_state(node, N_ONLINE) {
+cpumask_and(&andmask, cpumask, cpumask_of_node(node));
+match = cpumask_weight(&andmask);
+if (match > best)
+selected = node;
+}
+return selected;
+}
+
+
/**
 * get_unbound_pool - get a worker_pool with the specified attributes
 * @attrs: the attributes of the worker_pool to get
@@ -3467,7 +3493,6 @@ static struct worker_pool *get_unbound_pool(const struct 
workqueue_attrs *attrs)
{
u32 hash = wqattrs_hash(attrs);
struct worker_pool *pool;
-int node;

lockdep_assert_held(&wq_pool_mutex);

@@ -3492,17 +3517,7 @@ static struct worker_pool *get_unbound_pool(const struct 
workqueue_attrs *attrs)
 * 'struct workqueue_attrs' comments for detail.
 */
pool->attrs->no_numa = false;
-
-/* if cpumask is contained inside a NUMA node, we belong to that node */
-if (wq_numa_enabled) {
-for_each_node(node) {
-if (cpumask_subset(pool->attrs->cpumask,
-   wq_numa_possible_cpumask[node])) {
-pool->node = node;
-break;
-}
-}
-}
+pool->node = pool_detect_best_node(pool->attrs->cpumask);

if (worker_pool_assign_id(pool) < 0)
goto fail;
@@ -4567,7 +4582,7 @@ static int workqueue_cpu_up_callback(struct 
notifier_block *nfb,
int cpu = (unsigned long)hcpu;
struct worker_pool *pool;
struct workqueue_struct *wq;
-int pi;
+int pi, node;

switch (action & ~CPU_TASKS_FROZEN) {
case CPU_UP_PREPARE:
@@ -4583,6 +4598,16 @@ static int workqueue_cpu_up_callback(struct 
notifier_block *nfb,
case CPU_ONLINE:
mutex_lock(&wq_pool_mutex);

+/* now cpu <-> node info is established, update the info. */
+if (!wq_disable_numa) {





+for_each_node_state(node, N_POSSIBLE)
+cpumask_clear_cpu(cpu,
+wq_numa_possible_cpumask[node]);


The wq code tries to reuse the original pwqs/pools when the node still has a cpu
online. These 3 lines of code will put the original pwqs/pools on the road to
dying and create a new set of pwqs/pools.


because the result of wq_calc_node_cpumask() changes ?


Yes.


Do you mean some comment should be added here ? Or an explanation for your reply
for [3/4] ?


this fix [4/4] breaks the original design.



I'm sorry that I can't understand what this patch breaks.


the pwqs/pools should be kept if the node still has a cpu online.


So, the fix's grand design should be

 1. drop old pwq/pools only at node offline.
 2. set proper pool->node based on online node info.
 3. update pool->node of per-cpu-pool at cpu ONLINE.

Hm. (1) is  done because cpumask_of_n
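
A minimal sketch of item 3 in the list above (the helper name is my assumption;
for_each_cpu_worker_pool() is the workqueue-internal per-cpu pool iterator),
restating what the CPU_ONLINE hunk in the posted patch does:

/* sketch: refresh per-cpu pools once the cpu<->node mapping is final */
static void wq_refresh_percpu_pool_node(int cpu)
{
	struct worker_pool *pool;

	for_each_cpu_worker_pool(pool, cpu)
		pool->node = cpu_to_node(cpu);
}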

Re: [PATCH 4/4] workqueue: handle change in cpu-node relationship.

2014-12-14 Thread Kamezawa Hiroyuki

(2014/12/15 11:48), Lai Jiangshan wrote:

On 12/15/2014 10:20 AM, Kamezawa Hiroyuki wrote:

(2014/12/15 11:12), Lai Jiangshan wrote:

On 12/14/2014 12:38 AM, Kamezawa Hiroyuki wrote:

Although workqueue detects relationship between cpu<->node at boot,
it is finally determined in cpu_up().
This patch tries to update pool->node using online status of cpus.

1. When a node goes down, clear per-cpu pool's node attr.
2. When a cpu comes up, update per-cpu pool's node attr.
3. When a cpu comes up, update possible node cpumask workqueue is using for 
sched.
4. Detect the best node for unbound pool's cpumask using the latest info.

Signed-off-by: KAMEZAWA Hiroyuki 
---
   kernel/workqueue.c | 67 
++
   1 file changed, 53 insertions(+), 14 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 07b4eb5..259b3ba 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -266,7 +266,8 @@ struct workqueue_struct {
   static struct kmem_cache *pwq_cache;

   static cpumask_var_t *wq_numa_possible_cpumask;
-/* possible CPUs of each node */
+/* possible CPUs of each node initialized with possible info at boot.
+   but modified at cpu hotplug to be adjusted to real info.  */

   static bool wq_disable_numa;
   module_param_named(disable_numa, wq_disable_numa, bool, 0444);
@@ -3449,6 +3450,31 @@ static void put_unbound_pool(struct worker_pool *pool)
   call_rcu_sched(&pool->rcu, rcu_free_pool);
   }

+/*
+ * detect best node for given cpumask.
+ */
+static int pool_detect_best_node(const struct cpumask *cpumask)
+{
+int node, best, match, selected;
+static struct cpumask andmask; /* we're under mutex */
+
+/* Is any node okay ? */
+if (!wq_numa_enabled ||
+cpumask_subset(cpu_online_mask, cpumask))
+return NUMA_NO_NODE;
+best = 0;
+selected = NUMA_NO_NODE;
+/* select a node which contains the most cpu of cpumask */
+for_each_node_state(node, N_ONLINE) {
+cpumask_and(&andmask, cpumask, cpumask_of_node(node));
+match = cpumask_weight(&andmask);
+if (match > best)
+selected = node;
+}
+return selected;
+}
+
+
   /**
* get_unbound_pool - get a worker_pool with the specified attributes
* @attrs: the attributes of the worker_pool to get
@@ -3467,7 +3493,6 @@ static struct worker_pool *get_unbound_pool(const struct 
workqueue_attrs *attrs)
   {
   u32 hash = wqattrs_hash(attrs);
   struct worker_pool *pool;
-int node;

   lockdep_assert_held(&wq_pool_mutex);

@@ -3492,17 +3517,7 @@ static struct worker_pool *get_unbound_pool(const struct 
workqueue_attrs *attrs)
* 'struct workqueue_attrs' comments for detail.
*/
   pool->attrs->no_numa = false;
-
-/* if cpumask is contained inside a NUMA node, we belong to that node */
-if (wq_numa_enabled) {
-for_each_node(node) {
-if (cpumask_subset(pool->attrs->cpumask,
-   wq_numa_possible_cpumask[node])) {
-pool->node = node;
-break;
-}
-}
-}
+pool->node = pool_detect_best_node(pool->attrs->cpumask);

   if (worker_pool_assign_id(pool) < 0)
   goto fail;
@@ -4567,7 +4582,7 @@ static int workqueue_cpu_up_callback(struct 
notifier_block *nfb,
   int cpu = (unsigned long)hcpu;
   struct worker_pool *pool;
   struct workqueue_struct *wq;
-int pi;
+int pi, node;

   switch (action & ~CPU_TASKS_FROZEN) {
   case CPU_UP_PREPARE:
@@ -4583,6 +4598,16 @@ static int workqueue_cpu_up_callback(struct 
notifier_block *nfb,
   case CPU_ONLINE:
   mutex_lock(&wq_pool_mutex);

+/* now cpu <-> node info is established, update the info. */
+if (!wq_disable_numa) {





+for_each_node_state(node, N_POSSIBLE)
+cpumask_clear_cpu(cpu,
+wq_numa_possible_cpumask[node]);


The wq code tries to reuse the original pwqs/pools when the node still has a cpu
online. These 3 lines of code will put the original pwqs/pools on the road to
dying and create a new set of pwqs/pools.


because the result of wq_calc_node_cpumask() changes ?


Yes.


Do you mean some comment should be added here ? Or an explanation for your reply
for [3/4] ?


this fix [4/4] breaks the original design.



I'm sorry that I can't understand what this patch breaks.
Do you mean it's better to work with broken wq_numa_possible_cpumask ?

I guess removing wq_numa_possible_mask entirely may be the best way,
by using the online_mask of the node.
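
A minimal sketch of that direction (an assumption, not a posted change; the helper
name is made up): derive a node's CPU mask from live online information instead of
the cached wq_numa_possible_cpumask[] table.

/* sketch: a node's CPUs according to the current online state */
static void wq_node_online_cpumask(int node, struct cpumask *dst)
{
	cpumask_and(dst, cpumask_of_node(node), cpu_online_mask);
}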

Thanks,
-Kame






Re: [PATCH 4/4] workqueue: handle change in cpu-node relationship.

2014-12-14 Thread Kamezawa Hiroyuki

(2014/12/15 11:12), Lai Jiangshan wrote:

On 12/14/2014 12:38 AM, Kamezawa Hiroyuki wrote:

Although workqueue detects relationship between cpu<->node at boot,
it is finally determined in cpu_up().
This patch tries to update pool->node using online status of cpus.

1. When a node goes down, clear per-cpu pool's node attr.
2. When a cpu comes up, update per-cpu pool's node attr.
3. When a cpu comes up, update possible node cpumask workqueue is using for 
sched.
4. Detect the best node for unbound pool's cpumask using the latest info.

Signed-off-by: KAMEZAWA Hiroyuki 
---
  kernel/workqueue.c | 67 ++
  1 file changed, 53 insertions(+), 14 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 07b4eb5..259b3ba 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -266,7 +266,8 @@ struct workqueue_struct {
  static struct kmem_cache *pwq_cache;

  static cpumask_var_t *wq_numa_possible_cpumask;
-   /* possible CPUs of each node */
+   /* possible CPUs of each node initialized with possible info at boot.
+   but modified at cpu hotplug to be adjusted to real info.  */

  static bool wq_disable_numa;
  module_param_named(disable_numa, wq_disable_numa, bool, 0444);
@@ -3449,6 +3450,31 @@ static void put_unbound_pool(struct worker_pool *pool)
call_rcu_sched(&pool->rcu, rcu_free_pool);
  }

+/*
+ * detect best node for given cpumask.
+ */
+static int pool_detect_best_node(const struct cpumask *cpumask)
+{
+   int node, best, match, selected;
+   static struct cpumask andmask; /* we're under mutex */
+
+   /* Is any node okay ? */
+   if (!wq_numa_enabled ||
+   cpumask_subset(cpu_online_mask, cpumask))
+   return NUMA_NO_NODE;
+   best = 0;
+   selected = NUMA_NO_NODE;
+   /* select a node which contains the most cpu of cpumask */
+   for_each_node_state(node, N_ONLINE) {
+   cpumask_and(&andmask, cpumask, cpumask_of_node(node));
+   match = cpumask_weight(&andmask);
+   if (match > best)
+   selected = node;
+   }
+   return selected;
+}
+
+
  /**
   * get_unbound_pool - get a worker_pool with the specified attributes
   * @attrs: the attributes of the worker_pool to get
@@ -3467,7 +3493,6 @@ static struct worker_pool *get_unbound_pool(const struct 
workqueue_attrs *attrs)
  {
u32 hash = wqattrs_hash(attrs);
struct worker_pool *pool;
-   int node;

lockdep_assert_held(&wq_pool_mutex);

@@ -3492,17 +3517,7 @@ static struct worker_pool *get_unbound_pool(const struct 
workqueue_attrs *attrs)
 * 'struct workqueue_attrs' comments for detail.
 */
pool->attrs->no_numa = false;
-
-   /* if cpumask is contained inside a NUMA node, we belong to that node */
-   if (wq_numa_enabled) {
-   for_each_node(node) {
-   if (cpumask_subset(pool->attrs->cpumask,
-  wq_numa_possible_cpumask[node])) {
-   pool->node = node;
-   break;
-   }
-   }
-   }
+   pool->node = pool_detect_best_node(pool->attrs->cpumask);

if (worker_pool_assign_id(pool) < 0)
goto fail;
@@ -4567,7 +4582,7 @@ static int workqueue_cpu_up_callback(struct 
notifier_block *nfb,
int cpu = (unsigned long)hcpu;
struct worker_pool *pool;
struct workqueue_struct *wq;
-   int pi;
+   int pi, node;

switch (action & ~CPU_TASKS_FROZEN) {
case CPU_UP_PREPARE:
@@ -4583,6 +4598,16 @@ static int workqueue_cpu_up_callback(struct 
notifier_block *nfb,
case CPU_ONLINE:
mutex_lock(&wq_pool_mutex);

+   /* now cpu <-> node info is established, update the info. */
+   if (!wq_disable_numa) {





+   for_each_node_state(node, N_POSSIBLE)
+   cpumask_clear_cpu(cpu,
+   wq_numa_possible_cpumask[node]);


The wq code tries to reuse the original pwqs/pools when the node still has a cpu
online. These 3 lines of code will put the original pwqs/pools on the road to
dying and create a new set of pwqs/pools.


because the result of wq_calc_node_cpumask() changes ?
Do you mean some comment should be added here ? Or an explanation for your reply
for [3/4] ?

Thanks,
-Kame








+   node = cpu_to_node(cpu);
+   cpumask_set_cpu(cpu, wq_numa_possible_cpumask[node]);
+   }
+   for_each_cpu_worker_pool(pool, cpu)
+   pool->node = cpu_to_node(cpu);
for_each_pool(pool, pi) {

Re: [PATCH 3/4] workqueue: remove per-node unbound pool when node goes offline.

2014-12-14 Thread Kamezawa Hiroyuki

(2014/12/15 11:06), Lai Jiangshan wrote:

On 12/14/2014 12:35 AM, Kamezawa Hiroyuki wrote:

remove node aware unbound pools if node goes offline.

scan unbound workqueue and remove numa affine pool when
a node goes offline.

Signed-off-by: KAMEZAWA Hiroyuki 
---
  kernel/workqueue.c | 29 +
  1 file changed, 29 insertions(+)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 35f4f00..07b4eb5 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -4921,11 +4921,40 @@ early_initcall(init_workqueues);
   * cached pools per cpu should be freed at node unplug
   */

+/*
+ * Replace per-node pwq with dfl_pwq because this node disappears.
+ * The new pool will be set at CPU_ONLINE by wq_update_unbound_numa.
+ */
+static void wq_release_unbound_numa(struct workqueue_struct *wq, int nid)
+{
+   struct pool_workqueue *old_pwq = NULL;
+
+   if (!wq_numa_enabled || !(wq->flags & WQ_UNBOUND))
+   return;
+   mutex_lock(&wq->mutex);
+   if (wq->unbound_attrs->no_numa)
+   goto out_unlock;
+   spin_lock_irq(&wq->dfl_pwq->pool->lock);
+   get_pwq(wq->dfl_pwq);
+   spin_unlock_irq(&wq->dfl_pwq->pool->lock);
+   old_pwq = numa_pwq_tbl_install(wq, nid, wq->dfl_pwq);
+out_unlock:
+   mutex_unlock(&wq->mutex);
+   put_pwq_unlocked(old_pwq);
+   return;
+}



We have already done it in wq_update_unbound_numa().



Ah, hm. CPU_OFFLINE event does this, ok.
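
For context, a rough sketch of that existing path (my reading of the CPU_DOWN
handling, not a quote from the patch): at CPU_DOWN_PREPARE each unbound workqueue
is asked to fall back for the CPU that is leaving.

	/* sketch: in the cpu-down notifier, under wq_pool_mutex */
	list_for_each_entry(wq, &workqueues, list)
		wq_update_unbound_numa(wq, cpu, false);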

Thanks,
-Kame



+
  void workqueue_register_numanode(int nid)
  {
  }

  void workqueue_unregister_numanode(int nid)
  {
+   struct workqueue_struct *wq;
+
+   mutex_lock(&wq_pool_mutex);
+   list_for_each_entry(wq, &workqueues, list)
+   wq_release_unbound_numa(wq, nid);
+   mutex_unlock(&wq_pool_mutex);
  }
  #endif







[PATCH 4/4] workqueue: handle change in cpu-node relationship.

2014-12-13 Thread Kamezawa Hiroyuki
Although workqueue detects relationship between cpu<->node at boot,
it is finally determined in cpu_up().
This patch tries to update pool->node using online status of cpus.

1. When a node goes down, clear per-cpu pool's node attr.
2. When a cpu comes up, update per-cpu pool's node attr.
3. When a cpu comes up, update possible node cpumask workqueue is using for 
sched.
4. Detect the best node for unbound pool's cpumask using the latest info.

Signed-off-by: KAMEZAWA Hiroyuki 
---
 kernel/workqueue.c | 67 ++
 1 file changed, 53 insertions(+), 14 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 07b4eb5..259b3ba 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -266,7 +266,8 @@ struct workqueue_struct {
 static struct kmem_cache *pwq_cache;
 
 static cpumask_var_t *wq_numa_possible_cpumask;
-   /* possible CPUs of each node */
+   /* possible CPUs of each node initialized with possible info at boot.
+   but modified at cpu hotplug to be adjusted to real info.  */
 
 static bool wq_disable_numa;
 module_param_named(disable_numa, wq_disable_numa, bool, 0444);
@@ -3449,6 +3450,31 @@ static void put_unbound_pool(struct worker_pool *pool)
call_rcu_sched(&pool->rcu, rcu_free_pool);
 }
 
+/*
+ * detect best node for given cpumask.
+ */
+static int pool_detect_best_node(const struct cpumask *cpumask)
+{
+   int node, best, match, selected;
+   static struct cpumask andmask; /* we're under mutex */
+
+   /* Is any node okay ? */
+   if (!wq_numa_enabled ||
+   cpumask_subset(cpu_online_mask, cpumask))
+   return NUMA_NO_NODE;
+   best = 0;
+   selected = NUMA_NO_NODE;
+   /* select a node which contains the most cpu of cpumask */
+   for_each_node_state(node, N_ONLINE) {
+   cpumask_and(&andmask, cpumask, cpumask_of_node(node));
+   match = cpumask_weight(&andmask);
+   if (match > best)
+   selected = node;
+   }
+   return selected;
+}
+
+
 /**
  * get_unbound_pool - get a worker_pool with the specified attributes
  * @attrs: the attributes of the worker_pool to get
@@ -3467,7 +3493,6 @@ static struct worker_pool *get_unbound_pool(const struct 
workqueue_attrs *attrs)
 {
u32 hash = wqattrs_hash(attrs);
struct worker_pool *pool;
-   int node;
 
lockdep_assert_held(&wq_pool_mutex);
 
@@ -3492,17 +3517,7 @@ static struct worker_pool *get_unbound_pool(const struct 
workqueue_attrs *attrs)
 * 'struct workqueue_attrs' comments for detail.
 */
pool->attrs->no_numa = false;
-
-   /* if cpumask is contained inside a NUMA node, we belong to that node */
-   if (wq_numa_enabled) {
-   for_each_node(node) {
-   if (cpumask_subset(pool->attrs->cpumask,
-  wq_numa_possible_cpumask[node])) {
-   pool->node = node;
-   break;
-   }
-   }
-   }
+   pool->node = pool_detect_best_node(pool->attrs->cpumask);
 
if (worker_pool_assign_id(pool) < 0)
goto fail;
@@ -4567,7 +4582,7 @@ static int workqueue_cpu_up_callback(struct 
notifier_block *nfb,
int cpu = (unsigned long)hcpu;
struct worker_pool *pool;
struct workqueue_struct *wq;
-   int pi;
+   int pi, node;
 
switch (action & ~CPU_TASKS_FROZEN) {
case CPU_UP_PREPARE:
@@ -4583,6 +4598,16 @@ static int workqueue_cpu_up_callback(struct 
notifier_block *nfb,
case CPU_ONLINE:
mutex_lock(&wq_pool_mutex);
 
+   /* now cpu <-> node info is established, update the info. */
+   if (!wq_disable_numa) {
+   for_each_node_state(node, N_POSSIBLE)
+   cpumask_clear_cpu(cpu,
+   wq_numa_possible_cpumask[node]);
+   node = cpu_to_node(cpu);
+   cpumask_set_cpu(cpu, wq_numa_possible_cpumask[node]);
+   }
+   for_each_cpu_worker_pool(pool, cpu)
+   pool->node = cpu_to_node(cpu);
for_each_pool(pool, pi) {
mutex_lock(&pool->attach_mutex);
 
@@ -4951,7 +4976,21 @@ void workqueue_register_numanode(int nid)
 void workqueue_unregister_numanode(int nid)
 {
struct workqueue_struct *wq;
+   const struct cpumask *nodecpumask;
+   struct worker_pool *pool;
+   int cpu;
 
+   /* at this point, cpu-to-node relationship is not lost */
+   nodecpumask = cpumask_of_node(nid);
+   for_each_cpu(cpu, nodecpumask) {
+   /*
+* pool is allocated at boot an
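
A minimal sketch of changelog item 1 above ("when a node goes down, clear per-cpu
pool's node attr"); the helper name and its placement are my assumption:

/* sketch: make the per-cpu pools of the departing node forget their node */
static void wq_clear_percpu_pool_node(int nid)
{
	struct worker_pool *pool;
	int cpu;

	for_each_cpu(cpu, cpumask_of_node(nid))
		for_each_cpu_worker_pool(pool, cpu)
			pool->node = NUMA_NO_NODE;
}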

[PATCH 3/4] workqueue: remove per-node unbound pool when node goes offline.

2014-12-13 Thread Kamezawa Hiroyuki
remove node aware unbound pools if node goes offline.

scan unbound workqueue and remove numa affine pool when
a node goes offline.

Signed-off-by: KAMEZAWA Hiroyuki 
---
 kernel/workqueue.c | 29 +
 1 file changed, 29 insertions(+)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 35f4f00..07b4eb5 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -4921,11 +4921,40 @@ early_initcall(init_workqueues);
  * cached pools per cpu should be freed at node unplug
  */
 
+/*
+ * Replace per-node pwq with dfl_pwq because this node disappears.
+ * The new pool will be set at CPU_ONLINE by wq_update_unbound_numa.
+ */
+static void wq_release_unbound_numa(struct workqueue_struct *wq, int nid)
+{
+   struct pool_workqueue *old_pwq = NULL;
+
+   if (!wq_numa_enabled || !(wq->flags & WQ_UNBOUND))
+   return;
+   mutex_lock(&wq->mutex);
+   if (wq->unbound_attrs->no_numa)
+   goto out_unlock;
+   spin_lock_irq(&wq->dfl_pwq->pool->lock);
+   get_pwq(wq->dfl_pwq);
+   spin_unlock_irq(&wq->dfl_pwq->pool->lock);
+   old_pwq = numa_pwq_tbl_install(wq, nid, wq->dfl_pwq);
+out_unlock:
+   mutex_unlock(&wq->mutex);
+   put_pwq_unlocked(old_pwq);
+   return;
+}
+
 void workqueue_register_numanode(int nid)
 {
 }
  
 void workqueue_unregister_numanode(int nid)
 {
+   struct workqueue_struct *wq;
+
+   mutex_lock(&wq_pool_mutex);
+   list_for_each_entry(wq, &workqueues, list)
+   wq_release_unbound_numa(wq, nid);
+   mutex_unlock(&wq_pool_mutex);
 }
 #endif
-- 
1.8.3.1





[PATCH 0/4] workqueue: fix bug when numa mapping is changed v2.

2014-12-13 Thread Kamezawa Hiroyuki

Yasuaki Ishimatsu hit an allocation failure bug when the numa mapping
between CPU and node is changed. This was the failure log:
 SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
  cache: kmalloc-192, object size: 192, buffer size: 192, default order: 1, min order: 0
  node 0: slabs: 6172, objs: 259224, free: 245741
  node 1: slabs: 3261, objs: 136962, free: 127656

Yasuaki and I have a host which supports node hotplug; this is my fix.
I tested several hotplug patterns and found no issues now.
Of course I read Lai's patch and Tejun's comments, and I hope I have reflected them.

1/4 ... add node-hotplug event callback.
2/4 ... add a sanity check (for debug)
3/4 ... remove per-node unbound workqueue if node goes offline.
4/4 ... update per-cpu pool's information and cpumasks, node information
based on the latest (cpu, node) information.

Thanks,
-Kame



[PATCH 2/4] workqueue: add warning if pool->node is offline

2014-12-13 Thread Kamezawa Hiroyuki
Add warning if pool->node is offline.

This patch was originally made for debugging.
I think adding a warning here can show what may happen.

Signed-off-by: KAMEZAWA Hiroyuki 
---
 kernel/workqueue.c | 16 +---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index f6cb357c..35f4f00 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -760,6 +760,15 @@ static bool too_many_workers(struct worker_pool *pool)
return nr_idle > 2 && (nr_idle - 2) * MAX_IDLE_WORKERS_RATIO >= nr_busy;
 }
 
+/* Check pool->node */
+static int pool_to_node(struct worker_pool *pool)
+{
+   if (WARN_ON_ONCE(pool->node != NUMA_NO_NODE
+ && !node_online(pool->node)))
+   return NUMA_NO_NODE;
+   return pool->node;
+}
+
 /*
  * Wake up functions.
  */
@@ -1679,7 +1688,7 @@ static struct worker *create_worker(struct worker_pool 
*pool)
if (id < 0)
goto fail;
 
-   worker = alloc_worker(pool->node);
+   worker = alloc_worker(pool_to_node(pool));
if (!worker)
goto fail;
 
@@ -1692,7 +1701,8 @@ static struct worker *create_worker(struct worker_pool 
*pool)
else
snprintf(id_buf, sizeof(id_buf), "u%d:%d", pool->id, id);
 
-   worker->task = kthread_create_on_node(worker_thread, worker, pool->node,
+   worker->task = kthread_create_on_node(worker_thread, worker,
+ pool_to_node(pool),
  "kworker/%s", id_buf);
if (IS_ERR(worker->task))
goto fail;
@@ -3651,7 +3661,7 @@ static struct pool_workqueue *alloc_unbound_pwq(struct 
workqueue_struct *wq,
if (!pool)
return NULL;
 
-   pwq = kmem_cache_alloc_node(pwq_cache, GFP_KERNEL, pool->node);
+   pwq = kmem_cache_alloc_node(pwq_cache, GFP_KERNEL, pool_to_node(pool));
if (!pwq) {
put_unbound_pool(pool);
return NULL;
-- 
1.8.3.1





[PATCH 1/4] workqueue: add a hook for node hotplug

2014-12-13 Thread Kamezawa Hiroyuki
Subject: [PATCH 1/4] add callback of node hotplug for workqueue.

Because workqueue is numa aware, its pool has node information,
and it should be maintained against node hotplug.

When a node which exists at boot is unplugged, the following error
is detected.
==
SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
  cache: kmalloc-192, object size: 192, buffer size: 192, default order: 1, min 
order: 0
  node 0: slabs: 6172, objs: 259224, free: 245741
  node 1: slabs: 3261, objs: 136962, free: 127656
==
This is because pool->node points to a stale node.

This patch adds callback function at node hotplug.

Signed-off-by: KAMEZAWA Hiroyuki 
---
+/*
+ * pool->node should be cleared and
+ * cached pools per cpu should be freed at node unplug
+ */
+
+void workqueue_register_numanode(int nid)
+{
+}
+ 
+void workqueue_unregister_numanode(int nid)
+{
+}
+#endif
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 1bf4807..504b071 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1162,7 +1162,8 @@ int try_online_node(int nid)
build_all_zonelists(NULL, NULL);
mutex_unlock(&zonelists_mutex);
}
-
+   /* Now zonelist for the pgdat is ready */
+   workqueue_register_numanode(nid);
 out:
mem_hotplug_done();
return ret;
@@ -1914,7 +1915,11 @@ static int check_and_unmap_cpu_on_node(pg_data_t *pgdat)
ret = check_cpu_on_node(pgdat);
if (ret)
return ret;
-
+   /*
+* There is no online cpu on the node and this node will go.
+* make workqueue to forget this node.
+*/
+   workqueue_unregister_numanode(pgdat->node_id);
/*
 * the node will be offlined when we come here, so we can clear
 * the cpu_to_node() now.
-- 
1.8.3.1





Re: [PATCH 2/2] mem-hotplug: Fix wrong check for zone->pageset initialization in online_pages().

2014-11-04 Thread Kamezawa Hiroyuki
(2014/10/31 18:46), Tang Chen wrote:
> When we are doing memory hot-add, the following functions are called:
> 
> add_memory()
> |--> hotadd_new_pgdat()
>   |--> free_area_init_node()
>|--> free_area_init_core()
> |--> zone->present_pages = realsize;   /* 1. zone is 
> populated */
>  |--> zone_pcp_init()
>   |--> zone->pageset = &boot_pageset;  /* 2. 
> zone->pageset is set to boot_pageset */
> 
> There are two problems here:
> 1. Zones could be populated before any memory is onlined.
> 2. All the zones on a newly added node have the same pageset pointing to 
> boot_pageset.
> 
> The above two problems will result in the following problem:
> When we online memory on one node, e.g node2, the following code is executed:
> 
> online_pages()
> {
>  ..
>  if (!populated_zone(zone)) {
>  need_zonelists_rebuild = 1;
>  build_all_zonelists(NULL, zone);
>  }
>  ..
> }
> 
> Because of problem 1, the zone has been populated, and the 
> build_all_zonelists()
>will never called. zone->pageset won't be updated.
> Because of problem 2, All the zones on a newly added node have the same 
> pageset
>pointing to boot_pageset.
> And as a result, when we online memory on node2, node3's meminfo will corrupt.
> Pages on node2 may be freed to node3.
> 
> # for ((i = 2048; i < 2064; i++)); do echo online_movable > 
> /sys/devices/system/node/node2/memory$i/state; done
> # cat /sys/devices/system/node/node2/meminfo
> Node 2 MemTotal:   33554432 kB
> Node 2 MemFree:33549092 kB
> Node 2 MemUsed:5340 kB
> ..
> # cat /sys/devices/system/node/node3/meminfo
> Node 3 MemTotal:  0 kB
> Node 3 MemFree:   248 kB/* corrupted */
> Node 3 MemUsed:   0 kB
> ..
> 
> We have to populate some zones before onlining memory, otherwise no memory 
> could be onlined.
> So when onlining pages, we should also check if zone->pageset is pointing to 
> boot_pageset.
> 
> Signed-off-by: Gu Zheng 
> Signed-off-by: Tang Chen 
> ---
>   include/linux/mm.h  | 1 +
>   mm/memory_hotplug.c | 6 +-
>   mm/page_alloc.c | 5 +
>   3 files changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 02d11ee..83e6505 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1732,6 +1732,7 @@ void warn_alloc_failed(gfp_t gfp_mask, int order, const 
> char *fmt, ...);
>   
>   extern void setup_per_cpu_pageset(void);
>   
> +extern bool zone_pcp_initialized(struct zone *zone);
>   extern void zone_pcp_update(struct zone *zone);
>   extern void zone_pcp_reset(struct zone *zone);
>   
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 3ab01b2..bc0de0f 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1013,9 +1013,13 @@ int __ref online_pages(unsigned long pfn, unsigned 
> long nr_pages, int online_typ
>* If this zone is not populated, then it is not in zonelist.
>* This means the page allocator ignores this zone.
>* So, zonelist must be updated after online.
> +  *
> +  * If this zone is populated, zone->pageset could be initialized
> +  * to boot_pageset for the first time a node is added. If so,
> +  * zone->pageset should be allocated.
>*/
>   mutex_lock(&zonelists_mutex);
> - if (!populated_zone(zone)) {
> + if (!populated_zone(zone) || !zone_pcp_initialized(zone)) {

Please don't add another strange meaning to zone's pcplist.

If you say zone->present_pages doesn't mean the zone has pages in the buddy list
any more, please rewrite all parts using zone->present_pages, including
populated_zone(). I think there are several parts that calculate parameters based
on present_pages.

I myself don't welcome having both the complicated "managed pages" and "present
pages"...

How about adding 

static inline int managed_zone(struct zone *zone)
{
return (!!zone->managed_pages);
}

for this bug fix ?

Other parts that use present_pages should be considered carefully.
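
A minimal sketch of what the check in online_pages() might look like with that
helper (an illustration of the suggestion, not a posted patch):

	/* inside online_pages(), under zonelists_mutex (sketch only) */
	if (!managed_zone(zone)) {
		need_zonelists_rebuild = 1;
		build_all_zonelists(NULL, zone);
	}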

Thanks,
-Kame



>   need_zonelists_rebuild = 1;
>   build_all_zonelists(NULL, zone);
>   }
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 736d8e1..4ff1540 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6456,6 +6456,11 @@ void __meminit zone_pcp_update(struct zone *zone)
>   }
>   #endif
>   
> +bool zone_pcp_initialized(struct zone *zone)
> +{
> + return (zone->pageset != &boot_pageset);
> +}
> +
>   void zone_pcp_reset(struct zone *zone)
>   {
>   unsigned long flags;
> 



Re: [patch 3/3] mm: move page->mem_cgroup bad page handling into generic code

2014-11-04 Thread Kamezawa Hiroyuki
(2014/11/02 12:15), Johannes Weiner wrote:
> Now that the external page_cgroup data structure and its lookup is
> gone, let the generic bad_page() check for page->mem_cgroup sanity.
> 
> Signed-off-by: Johannes Weiner 

Acked-by: KAMEZAWA Hiroyuki 




Re: [patch 2/3] mm: page_cgroup: rename file to mm/swap_cgroup.c

2014-11-04 Thread Kamezawa Hiroyuki
(2014/11/02 12:15), Johannes Weiner wrote:
> Now that the external page_cgroup data structure and its lookup is
> gone, the only code remaining in there is swap slot accounting.
> 
> Rename it and move the conditional compilation into mm/Makefile.
> 
> Signed-off-by: Johannes Weiner 

Acked-by: KAMEZAWA Hiroyuki 




Re: [patch 1/3] mm: embed the memcg pointer directly into struct page

2014-11-04 Thread Kamezawa Hiroyuki
(2014/11/02 12:15), Johannes Weiner wrote:
> Memory cgroups used to have 5 per-page pointers.  To allow users to
> disable that amount of overhead during runtime, those pointers were
> allocated in a separate array, with a translation layer between them
> and struct page.
> 
> There is now only one page pointer remaining: the memcg pointer, that
> indicates which cgroup the page is associated with when charged.  The
> complexity of runtime allocation and the runtime translation overhead
> is no longer justified to save that *potential* 0.19% of memory.  With
> CONFIG_SLUB, page->mem_cgroup actually sits in the doubleword padding
> after the page->private member and doesn't even increase struct page,
> and then this patch actually saves space.  Remaining users that care
> can still compile their kernels without CONFIG_MEMCG.
> 
>    text    data     bss      dec     hex filename
> 8828345 1725264  983040 11536649 b00909  vmlinux.old
> 8827425 1725264  966656 11519345 afc571  vmlinux.new
> 
> Signed-off-by: Johannes Weiner 
> ---
>   include/linux/memcontrol.h  |   6 +-
>   include/linux/mm_types.h|   5 +
>   include/linux/mmzone.h  |  12 --
>   include/linux/page_cgroup.h |  53 
>   init/main.c |   7 -
>   mm/memcontrol.c | 124 +
>   mm/page_alloc.c |   2 -
>   mm/page_cgroup.c| 319 
> ----
>   8 files changed, 41 insertions(+), 487 deletions(-)
> 

Great! 
Acked-by: KAMEZAWA Hiroyuki 

BTW, shouldn't the init/Kconfig comments be updated ?
(I'm sorry if they have been updated since your latest fix.)






Re: [patch 4/4] mm: memcontrol: remove unnecessary PCG_USED pc->mem_cgroup valid flag

2014-10-21 Thread Kamezawa Hiroyuki
(2014/10/21 0:22), Johannes Weiner wrote:
> pc->mem_cgroup had to be left intact after uncharge for the final LRU
> removal, and !PCG_USED indicated whether the page was uncharged.  But
> since 0a31bc97c80c ("mm: memcontrol: rewrite uncharge API") pages are
> uncharged after the final LRU removal.  Uncharge can simply clear the
> pointer and the PCG_USED/PageCgroupUsed sites can test that instead.
> 
> Because this is the last page_cgroup flag, this patch reduces the
> memcg per-page overhead to a single pointer.
> 
> Signed-off-by: Johannes Weiner 

awesome.
Acked-by: KAMEZAWA Hiroyuki 

> ---
>   include/linux/page_cgroup.h |  10 -
>   mm/memcontrol.c | 107 
> +---
>   2 files changed, 42 insertions(+), 75 deletions(-)
> 
> diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
> index 97536e685843..1289be6b436c 100644
> --- a/include/linux/page_cgroup.h
> +++ b/include/linux/page_cgroup.h
> @@ -1,11 +1,6 @@
>   #ifndef __LINUX_PAGE_CGROUP_H
>   #define __LINUX_PAGE_CGROUP_H
>   
> -enum {
> - /* flags for mem_cgroup */
> - PCG_USED = 0x01,/* This page is charged to a memcg */
> -};
> -
>   struct pglist_data;
>   
>   #ifdef CONFIG_MEMCG
> @@ -19,7 +14,6 @@ struct mem_cgroup;
>* then the page cgroup for pfn always exists.
>*/
>   struct page_cgroup {
> - unsigned long flags;
>   struct mem_cgroup *mem_cgroup;
>   };
>   
> @@ -39,10 +33,6 @@ static inline void page_cgroup_init(void)
>   
>   struct page_cgroup *lookup_page_cgroup(struct page *page);
>   
> -static inline int PageCgroupUsed(struct page_cgroup *pc)
> -{
> - return !!(pc->flags & PCG_USED);
> -}
>   #else /* !CONFIG_MEMCG */
>   struct page_cgroup;
>   
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 1d66ac49e702..48d49c6b08d1 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1284,14 +1284,12 @@ struct lruvec *mem_cgroup_page_lruvec(struct page 
> *page, struct zone *zone)
>   
>   pc = lookup_page_cgroup(page);
>   memcg = pc->mem_cgroup;
> -
>   /*
>* Swapcache readahead pages are added to the LRU - and
> -  * possibly migrated - before they are charged.  Ensure
> -  * pc->mem_cgroup is sane.
> +  * possibly migrated - before they are charged.
>*/
> - if (!PageLRU(page) && !PageCgroupUsed(pc) && memcg != root_mem_cgroup)
> - pc->mem_cgroup = memcg = root_mem_cgroup;
> + if (!memcg)
> + memcg = root_mem_cgroup;
>   
>   mz = mem_cgroup_page_zoneinfo(memcg, page);
>   lruvec = &mz->lruvec;
> @@ -2141,7 +2139,7 @@ void __mem_cgroup_begin_update_page_stat(struct page 
> *page,
>   pc = lookup_page_cgroup(page);
>   again:
>   memcg = pc->mem_cgroup;
> - if (unlikely(!memcg || !PageCgroupUsed(pc)))
> + if (unlikely(!memcg))
>   return;
>   /*
>* If this memory cgroup is not under account moving, we don't
> @@ -2154,7 +2152,7 @@ again:
>   return;
>   
>   move_lock_mem_cgroup(memcg, flags);
> - if (memcg != pc->mem_cgroup || !PageCgroupUsed(pc)) {
> + if (memcg != pc->mem_cgroup) {
>   move_unlock_mem_cgroup(memcg, flags);
>   goto again;
>   }
> @@ -2186,7 +2184,7 @@ void mem_cgroup_update_page_stat(struct page *page,
>   
>   pc = lookup_page_cgroup(page);
>   memcg = pc->mem_cgroup;
> - if (unlikely(!memcg || !PageCgroupUsed(pc)))
> + if (unlikely(!memcg))
>   return;
>   
>   this_cpu_add(memcg->stat->count[idx], val);
> @@ -2525,9 +2523,10 @@ struct mem_cgroup *try_get_mem_cgroup_from_page(struct 
> page *page)
>   VM_BUG_ON_PAGE(!PageLocked(page), page);
>   
>   pc = lookup_page_cgroup(page);
> - if (PageCgroupUsed(pc)) {
> - memcg = pc->mem_cgroup;
> - if (memcg && !css_tryget_online(&memcg->css))
> + memcg = pc->mem_cgroup;
> +
> + if (memcg) {
> + if (!css_tryget_online(&memcg->css))
>   memcg = NULL;
>   } else if (PageSwapCache(page)) {
>   ent.val = page_private(page);
> @@ -2578,7 +2577,7 @@ static void commit_charge(struct page *page, struct 
> mem_cgroup *memcg,
>   struct page_cgroup *pc = lookup_page_cgroup(page);
>   int isolated;
>   
> - VM_BUG_ON_PAGE(PageCgroupUsed(pc), page);
> + VM_BUG_ON_PAGE(pc->mem_cgroup, page);
>   /*
>* we don't need page_cgroup_lock about tail pages, becase they
