Re: [PATCH 0/2] Make core_pattern support namespace
On 2016/03/16 18:23, Zhao Lei wrote:
> We discussed the patch titled:
>   [PATCH] Make core_pattern support namespace
> before.
>
> The above patch solves half of the problem of a custom core_dump
> pattern in a container, but there is another problem that limits a
> custom core_pattern in a container: a pipe-type core_pattern will
> write the core dump into the host's filesystem.
> (See the discussion of that patch for details.)
>
> Now we can solve the second problem with [PATCH 1/2], so I am sending
> the original patch together with it.
> Let me know your thoughts on the design...

This patch uses fork()+execve() rather than calling the usermode helper
(UMH) in the pipe-coredump path, and the coredump-pipe process runs as a
child of the process being dumped. Right? Doesn't this break the
existing solution actually used in distros?

BTW, this is the first time I have seen _do_fork() called from outside
fork.c. Wouldn't it be better to add a new function in fork.c if we
really need this?

Thanks,
-Kame
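The behavioral difference under discussion can be sketched in userspace. Below is a minimal Python analog (illustration only, not kernel code) of the fork+exec model: the dumping process spawns the pipe helper as its own child and streams the "core" into the helper's stdin, so the helper inherits the dumping process's namespaces and filesystem view rather than the host's. The helper command (`cat` here) is a stand-in for a real core catcher.

```python
import subprocess

def dump_core_via_child_helper(core_bytes: bytes,
                               helper_argv=("cat",)) -> bytes:
    """Run the pipe helper as a child of the dumping process and feed it
    the core image on stdin -- a userspace analog of replacing the
    usermode-helper call with fork()+execve() in the pipe-coredump path.
    The child inherits the parent's namespaces, cwd, and mounts."""
    proc = subprocess.Popen(list(helper_argv),
                            stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE)
    out, _ = proc.communicate(core_bytes)
    return out

if __name__ == "__main__":
    # `cat` simply echoes the fake core back, demonstrating the data path.
    print(len(dump_core_via_child_helper(b"ELF-core-data")), "bytes piped")
```

By contrast, the UMH model would have kthreadd spawn the helper with a clean host environment, which is exactly why existing host-side core catchers (abrt, systemd-coredump) work today.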
Re: call_usermodehelper in containers
On 2016/02/19 14:37, Ian Kent wrote:
> On Fri, 2016-02-19 at 12:08 +0900, Kamezawa Hiroyuki wrote:
>> On 2016/02/19 5:45, Eric W. Biederman wrote:
>>> Personally I am a fan of the don't-be-clever, capture-a-kernel-thread
>>> approach, as it is very easy to see what, if any, exploitation
>>> opportunities there are. The justification for something more clever
>>> is trickier. Of course we do something that from this perspective
>>> would be considered "clever" today with kthreadd and user mode
>>> helpers.
>>
>> I read the old discussion; let me clarify the idea of creating a
>> helper kernel thread to run the usermode helper via kthreadd.
>>
>> 0) Define a trigger to create an independent usermodehelper
>>    environment for a container.
>>    Option A) at creation of some namespace (pid, uid, etc.)
>>    Option B) at creation of a new nsproxy
>>    Option C) when a new system call or sysctl is invoked, e.g.
>>              make_private_usermode_helper() or similar.
>>    It's expected this would be triggered by the init process of a
>>    container holding some capability. The scope of the effect should
>>    also be defined: pid namespace? nsproxy? or a new namespace?
>>
>> 1) Create a helper thread:
>>      task = kthread_create(kthread_work_fn, ?, ?, "usermodehelper")
>>      switch the task's nsproxy to current (switch_task_namespaces())
>>      switch the task's cgroups to current (cgroup_attach_task_all())
>>      switch the task's cred to current
>>      copy the task's capabilities from current (and anything else?)
>>      wake_up_process()
>>    and create a link between the kthread_wq and the container.
>
> Not sure I quite understand this, but I thought the difficulty with
> this approach previously (even though the approach was very much
> incomplete) was knowing that all the "moving parts" would not allow
> vulnerabilities.

Ok, that was discussed.

> And it looks like this would require a kernel thread for each instance.
> So for a thousand containers that each mount an NFS mount that means,
> at least, 1000 additional kernel threads. Might be able to sell that,
> if we were lucky, but from a system administration POV it's horrible.

I agree.
> There's also the question of existence (aka lifetime) to deal with,
> since the thread above needs to be created at a time other than the
> usermode helper callback. What happens on SIGKILL of a container?

It depends on how the helper kthread is tied to a container-related
object. If the kthread is linked with some namespace, we can kill it
when the namespace goes away.

So, in your opinion:
 - a helper thread should be spawned on demand, and
 - its lifetime should be clear; ideally the same lifetime as the
   container.

I wonder whether there is any solution to the "moving parts" problem,
other than calling do_fork() or copy_process() in the context of the
container's init process, if we do everything in the kernel. Is that
possible?

Thanks,
-Kame
Re: call_usermodehelper in containers
On 2016/02/19 5:45, Eric W. Biederman wrote:
> Personally I am a fan of the don't-be-clever, capture-a-kernel-thread
> approach, as it is very easy to see what, if any, exploitation
> opportunities there are. The justification for something more clever
> is trickier. Of course we do something that from this perspective
> would be considered "clever" today with kthreadd and user mode
> helpers.

I read the old discussion; let me clarify the idea of creating a helper
kernel thread to run the usermode helper via kthreadd.

0) Define a trigger to create an independent usermodehelper environment
   for a container.
   Option A) at creation of some namespace (pid, uid, etc.)
   Option B) at creation of a new nsproxy
   Option C) when a new system call or sysctl is invoked, e.g.
             make_private_usermode_helper() or similar.
   It's expected this would be triggered by the init process of a
   container holding some capability. The scope of the effect should
   also be defined: pid namespace? nsproxy? or a new namespace?

1) Create a helper thread:
     task = kthread_create(kthread_work_fn, ?, ?, "usermodehelper")
     switch the task's nsproxy to current (switch_task_namespaces())
     switch the task's cgroups to current (cgroup_attach_task_all())
     switch the task's cred to current
     copy the task's capabilities from current (and anything else?)
     wake_up_process()
   and create a link between the kthread_wq and the container.

2) Modify call_usermodehelper() to use the kthread_worker.

It seems the problem is which object a container-private usermode
helper should be tied to.

Regards,
-Kame
Re: call_usermodehelper in containers
On 2016/02/18 11:57, Eric W. Biederman wrote:
>
> Ccing the containers list because a related discussion is happening
> there and somehow this thread has never made it there.
>
> Ian Kent writes:
>
>> On Mon, 2013-11-18 at 18:28 +0100, Oleg Nesterov wrote:
>>> On 11/15, Eric W. Biederman wrote:
>>>> I don't understand that one. Having a preforked thread with the
>>>> proper environment that can act like kthreadd in terms of spawning
>>>> user mode helpers works and is simple.
>>
>> Forgive me replying to such an old thread but ...
>>
>> After realizing workqueues can't be used to pre-create threads to run
>> usermode helpers I've returned to look at this.
>
> If someone can wind up with a good implementation I will be happy.
>
>>> Can't we ask ->child_reaper to create the non-daemonized kernel
>>> thread with the "right" ->nsproxy, ->fs, etc?
>>
>> Eric, do you think this approach would be sufficient too?
>>
>> Probably wouldn't be quite right for user namespaces but should
>> provide what's needed for other cases?
>>
>> It certainly has the advantage of not having to maintain a plague of
>> processes waiting around to execute helpers.
>
> That certainly sounds attractive. Especially for the case of everyone
> who wants to set a core pattern in a container.
>
> I am fuzzy on all of the details right now, but what I do remember is
> that the kernel's user mode helper concepts, when they attempted to
> scrub a process's environment, were quite error prone until we managed
> to get kthreadd (pid 2) on the scene, which always had a clean
> environment.
>
> If we are going to tie this kind of thing to the pid namespace I
> recommend simply denying it if you are in a user namespace without an
> appropriate pid namespace. AKA simply not allowing things to be set up
> if current->pid_ns->user_ns != current->user_ns.

Couldn't this be handled by a simple capability like
CAP_SYS_USERMODEHELPER? The user_ns check seems to mean that a
core-dump catcher on the host will not work when user namespaces are
used.
Thanks, -Kame
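The two policies being compared can be modeled as a toy permission check. This is illustration only: `CAP_SYS_USERMODEHELPER` is Kame's hypothetical capability, and the field names below are stand-ins for the kernel's task structures, not kernel API.

```python
from dataclasses import dataclass

@dataclass
class Task:
    user_ns: str          # user namespace the task runs in
    pid_ns_user_ns: str   # user namespace that owns the task's pid namespace
    caps: frozenset       # capabilities held, e.g. {"CAP_SYS_USERMODEHELPER"}

def may_setup_private_umh_ns_check(t: Task) -> bool:
    """Eric's rule: deny setup unless the task's pid namespace is owned
    by the task's own user namespace."""
    return t.pid_ns_user_ns == t.user_ns

def may_setup_private_umh_cap_check(t: Task) -> bool:
    """Kame's alternative: gate the setup on a (hypothetical) capability
    instead of the namespace relationship."""
    return "CAP_SYS_USERMODEHELPER" in t.caps
```

Under the namespace rule, a container whose pid namespace is still owned by the host's user namespace is denied setup even if the admin trusts it; the capability rule lets the admin delegate that decision explicitly.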
Re: [Propose] Isolate core_pattern in mnt namespace.
On 2015/12/22 6:52, Eric W. Biederman wrote:
> Dongsheng Yang writes:
>
>> On 12/20/2015 05:47 PM, Eric W. Biederman wrote:
>>> Dongsheng Yang writes:
>>>> On 12/20/2015 10:37 AM, Al Viro wrote:
>>>>> On Sun, Dec 20, 2015 at 10:14:29AM +0800, Dongsheng Yang wrote:
>>>>>> On 12/17/2015 07:23 PM, Dongsheng Yang wrote:
>>>>>>> Hi guys,
>>>>>>> We are working on making core dump behaviour isolated in a
>>>>>>> container. But the problem is, /proc/sys/kernel/core_pattern is
>>>>>>> a kernel-wide setting that does not belong to a container.
>>>>>>>
>>>>>>> So we want to add core_pattern to the mnt namespace. What do you
>>>>>>> think about it?
>>>>>>
>>>>>> Hi Eric,
>>>>>> I found your patch "net: Implement the per network namespace
>>>>>> sysctl infrastructure", and I want to do a similar thing in the
>>>>>> mnt namespace. Is that the suggested way?
>>>>>
>>>>> Why the mnt namespace and not something else?
>>>>
>>>> Hi Al,
>>>> Well, because core_pattern indicates the path to store the core
>>>> file. In a different mnt namespace, we would like to change the
>>>> path to a different value.
>>>>
>>>> In addition, let's consider the other namespaces:
>>>>   UTS ns: contains information about the kernel and arch, not
>>>>           proper for core_pattern.
>>>>   IPC ns: communication information, not proper for core_pattern.
>>>>   PID ns: core_pattern is not related to pids.
>>>>   net ns: obviously no.
>>>>   user ns: not proper either.
>>>>
>>>> So I believe it's better to do this in the mnt namespace. Of
>>>> course, core_pattern is just one example. After this infrastructure
>>>> is finished, we can implement more sysctls as per-mnt if necessary,
>>>> I think.
>>>>
>>>> Al, what do you think about this idea?
>>>
>>> The hard part is not the sysctl. The hard part is starting the
>>> usermode helper in an environment that it can deal with. The mount
>>> namespace really provides you with no help there.
>>
>> Do you mean the core dump helper? But I don't want to touch it in my
>> development. I think I can use the non-pipe way to get what I want.
>> Let me try to explain what I want here.
>>
>> (1)
>> Introduce a --core-path option in the docker run command to specify
>> the path on the host to store core files from one container, e.g.:
>>   docker run --core-path=/core/test --name=test IMAGE
>>
>> (2) When the container starts, docker attaches a volume to it,
>> similar to "-v /core/test:/var/lib/docker/coredump". That means the
>> path /var/lib/docker/coredump in the container is a link to
>> /core/test on the host.
>>
>> (3) Set /proc/sys/kernel/core_pattern in the container to
>> "/var/lib/docker/coredump". But that should not affect the
>> core_pattern on the host or in other containers.
>>
>> Then I think I can collect the core files from each container and
>> save them in the paths where I want.
>
> For your case that sounds like it would work. Unfortunately, for this
> to be generally applicable, and to let the OS in the container control
> its fate, the core dump pattern needs to be supported.
>
> Otherwise, something clever in userspace that can be written now
> should be sufficient to fill the gap. There is enough information for
> the user mode helper to implement the policy you would like today.

Let me clarify my understanding.

1) Running the usermode helper in a container: not supported by the
   kernel. The usermode helper always runs on the host.

2) Running the usermode helper on the host: supported in the newest
   distros (FC23; abrt supports containers). A summary is here:
   https://github.com/abrt/abrt/wiki/Containers-and-chroots

If a guest user doesn't want to pass a core to the host owner,
core_pattern should be configurable, but it can't be.

Thanks,
-Kame
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
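The "something clever in userspace" Eric mentions can be approximated today with a host-side pipe helper. Below is a minimal Python sketch (the helper path and per-container directory mapping are assumptions for illustration, not an existing tool): the kernel would invoke the helper through a pipe-type core_pattern such as `|/usr/local/bin/core-router %p %e`, and the helper writes the core it reads on stdin into a per-container directory.

```python
import os
import sys

def route_core(pid: str, comm: str, core_stream, container_dir: str) -> str:
    """Write the core image read from `core_stream` (the helper's stdin
    when invoked by the kernel) into a per-container directory, named
    after the crashing process. Returns the destination path."""
    os.makedirs(container_dir, exist_ok=True)
    dest = os.path.join(container_dir, f"core.{comm}.{pid}")
    with open(dest, "wb") as f:
        while chunk := core_stream.read(65536):
            f.write(chunk)
    return dest

if __name__ == "__main__" and len(sys.argv) >= 3:
    # Invoked by the kernel via a (hypothetical) core_pattern:
    #   |/usr/local/bin/core-router %p %e
    # %p = pid of the dumping process, %e = executable name.
    route_core(sys.argv[1], sys.argv[2], sys.stdin.buffer, "/core/test")
```

A real deployment would look up the container for %p (e.g. via its cgroup) to choose `container_dir`; that lookup is exactly the policy the user mode helper can implement today, as Eric notes.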
Re: [Propose] Isolate core_pattern in mnt namespace.
On 2015/12/20 18:47, Eric W. Biederman wrote:
> Dongsheng Yang writes:
>
>> On 12/20/2015 10:37 AM, Al Viro wrote:
>>> On Sun, Dec 20, 2015 at 10:14:29AM +0800, Dongsheng Yang wrote:
>>>> On 12/17/2015 07:23 PM, Dongsheng Yang wrote:
>>>>> Hi guys,
>>>>> We are working on making core dump behaviour isolated in a
>>>>> container. But the problem is, /proc/sys/kernel/core_pattern is a
>>>>> kernel-wide setting that does not belong to a container.
>>>>>
>>>>> So we want to add core_pattern to the mnt namespace. What do you
>>>>> think about it?
>>>>
>>>> Hi Eric,
>>>> I found your patch "net: Implement the per network namespace sysctl
>>>> infrastructure", and I want to do a similar thing in the mnt
>>>> namespace. Is that the suggested way?
>>>
>>> Why the mnt namespace and not something else?
>>
>> Hi Al,
>>
>> Well, because core_pattern indicates the path to store the core file.
>> In a different mnt namespace, we would like to change the path to a
>> different value.
>>
>> In addition, let's consider the other namespaces:
>>   UTS ns: contains information about the kernel and arch, not proper
>>           for core_pattern.
>>   IPC ns: communication information, not proper for core_pattern.
>>   PID ns: core_pattern is not related to pids.
>>   net ns: obviously no.
>>   user ns: not proper either.
>>
>> So I believe it's better to do this in the mnt namespace. Of course,
>> core_pattern is just one example. After this infrastructure is
>> finished, we can implement more sysctls as per-mnt if necessary, I
>> think.
>>
>> Al, what do you think about this idea?
>
> The hard part is not the sysctl. The hard part is starting the
> usermode helper in an environment that it can deal with. The mount
> namespace really provides you with no help there.

Let me ask: I think the usermode helper should be found in the
container's namespace. Or do you mean running the usermode helper
outside of a container?

I think if a user needs to send cores outside of a container, a file
descriptor or a network endpoint should be able to be passed to the
sysctl.
Thanks,
-Kame

> Eric
Re: [PATCH v2 7/7] Documentation: cgroup: add memory.swap.{current,max} description
On 2015/12/17 21:30, Vladimir Davydov wrote:
> The rationale of the separate swap counter is given by Johannes
> Weiner.
>
> Signed-off-by: Vladimir Davydov
> ---
> Changes in v2:
>  - Add rationale of separate swap counter provided by Johannes.
>
>  Documentation/cgroup.txt | 33 +
>  1 file changed, 33 insertions(+)
>
> diff --git a/Documentation/cgroup.txt b/Documentation/cgroup.txt
> index 31d1f7bf12a1..f441564023e1 100644
> --- a/Documentation/cgroup.txt
> +++ b/Documentation/cgroup.txt
> @@ -819,6 +819,22 @@ PAGE_SIZE multiple when read back.
>  	the cgroup.  This may not exactly match the number of
>  	processes killed but should generally be close.
>
> +  memory.swap.current
> +
> +	A read-only single value file which exists on non-root
> +	cgroups.
> +
> +	The total amount of swap currently being used by the cgroup
> +	and its descendants.
> +
> +  memory.swap.max
> +
> +	A read-write single value file which exists on non-root
> +	cgroups.  The default is "max".
> +
> +	Swap usage hard limit.  If a cgroup's swap usage reaches this
> +	limit, anonymous memory of the cgroup will not be swapped out.
> +
>
>  5-2-2. General Usage
>
> @@ -1291,3 +1307,20 @@ allocation from the slack available in other groups or
>  the rest of the system than killing the group.  Otherwise, memory.max
>  is there to limit this type of spillover and ultimately contain buggy
>  or even malicious applications.
> +
> +The combined memory+swap accounting and limiting is replaced by real
> +control over swap space.
> +
> +The main argument for a combined memory+swap facility in the original
> +cgroup design was that global or parental pressure would always be
> +able to swap all anonymous memory of a child group, regardless of the
> +child's own (possibly untrusted) configuration.  However, untrusted
> +groups can sabotage swapping by other means - such as referencing its
> +anonymous memory in a tight loop - and an admin can not assume full
> +swappability when overcommitting untrusted jobs.
> +
> +For trusted jobs, on the other hand, a combined counter is not an
> +intuitive userspace interface, and it flies in the face of the idea
> +that cgroup controllers should account and limit specific physical
> +resources.  Swap space is a resource like all others in the system,
> +and that's why unified hierarchy allows distributing it separately.

Could you give a hint here on how to calculate the amount of swap
cache, which is counted in both memory.current and swap.current?

Thanks,
-Kame
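Kame's question concerns pages sitting in the swap cache: such a page occupies a memory page charged to memory.current *and* a swap slot charged to swap.current, so naively summing the two counters double-counts it. A toy accounting model (illustration only; these are plain numbers, not kernel counters):

```python
def distinct_charged_pages(memory_current: int,
                           swap_current: int,
                           swap_cache: int) -> int:
    """Pages in the swap cache are charged to both memory.current and
    swap.current, so the number of distinct charged pages is the sum of
    the two counters minus that overlap."""
    return memory_current + swap_current - swap_cache
```

For example, with 100 resident pages, 40 charged swap slots, and 10 pages that are in the swap cache (charged on both sides), 130 distinct pages are charged in total. The open question in the thread is how userspace can learn the `swap_cache` term for a given cgroup.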
Re: [PATCH v3 2/2] mm: Introduce kernelcore=mirror option
On 2015/12/18 3:43, Luck, Tony wrote:
>>> As Tony requested, we may need a knob to stop the
>>> "movable -> normal" fallback later.
>>
>> If the mirrored memory is small and the other is large, I think we
>> can enable both "non-mirrored -> normal" and "normal ->
>> non-mirrored".
>>
>> The size of mirrored memory can be configured by software (an EFI
>> variable). So having both is overkill, and the "normal ->
>> non-mirrored" fallback is meaningless considering what the feature
>> wants to guarantee.
>
> In the original removable usage we wanted to guarantee that Linux did
> not allocate any kernel objects in removable memory - because that
> would prevent later removal of that memory. The mirror case is the
> same - we don't want to allocate kernel structures in non-mirrored
> memory because an uncorrectable error in one of them would crash the
> system.
>
> But I think some users might like some flexibility here. If the system
> doesn't have enough memory for the kernel (non-movable or mirrored),
> then it seems odd to end up crashing the system at the point of memory
> exhaustion (a likely result ... the kernel can try to reclaim some
> pages from SLAB, but that might only return a few pages; if the
> shortage continues the system will perform poorly and eventually
> fail). The whole point of removable memory or mirrored memory is to
> provide better availability.
>
> I'd vote for a mode where running out of memory for the kernel results
> in a warn_on_once("Ran out of mirrored/non-removable memory for kernel
> - now allocating from all zones\n"), because I think most people would
> like the system to stay up rather than worry about some future problem
> that may never happen.

Hmm... like this?

  sysctl.vm.fallback_mirror_memory = 0  # never fall back (default)
  sysctl.vm.fallback_mirror_memory = 1  # user memory may be allocated
                                        # from the mirrored zone
  sysctl.vm.fallback_mirror_memory = 2  # the kernel allocates memory
                                        # from the mirrored zone before
                                        # OOM
  sysctl.vm.fallback_mirror_memory = 3  # 1+2

However, I believe my customers' choice is always 0. The above can be
implemented in a clean way (adding a mirrored-or-not flag to zones and
controlling the fallback zonelist walk).

BTW, we need Taku's patch to make progress. I think the other
development should be done in another development cycle. What does he
need to get your Acks?

Thanks,
-Kame
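The proposed knob reads naturally as a two-bit mask; the following is my reading of the sketch above, modeled in Python for clarity. Note `vm.fallback_mirror_memory` is Kame's proposal in this thread, not an existing sysctl.

```python
# Bit 0: user (movable) allocations may fall back into the mirrored zone.
USER_FALLBACK_TO_MIRROR = 1
# Bit 1: kernel allocations may fall back out of the mirrored zone
# before declaring OOM (Tony's warn-and-stay-up mode).
KERNEL_FALLBACK_BEFORE_OOM = 2

def user_may_use_mirrored(mode: int) -> bool:
    """Whether user memory may spill into the mirrored (kernel) zone."""
    return bool(mode & USER_FALLBACK_TO_MIRROR)

def kernel_may_fall_back(mode: int) -> bool:
    """Whether kernel memory may spill into non-mirrored zones
    instead of triggering OOM."""
    return bool(mode & KERNEL_FALLBACK_BEFORE_OOM)
```

Mode 0 (the default Kame expects his customers to keep) forbids both fallbacks, preserving the hard guarantee; mode 3 trades the guarantee for availability, as Tony argued.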
Re: [PATCH v3 2/2] mm: Introduce kernelcore=mirror option
On 2015/12/17 13:48, Xishi Qiu wrote:
> On 2015/12/17 10:53, Kamezawa Hiroyuki wrote:
>> On 2015/12/17 11:47, Xishi Qiu wrote:
>>> On 2015/12/17 9:38, Izumi, Taku wrote:
>>>> Dear Xishi,
>>>>
>>>> Sorry for the late reply.
>>>>
>>>>> -----Original Message-----
>>>>> From: Xishi Qiu [mailto:qiuxi...@huawei.com]
>>>>> Sent: Friday, December 11, 2015 6:44 PM
>>>>> Subject: Re: [PATCH v3 2/2] mm: Introduce kernelcore=mirror option
>>>>>
>>>>> On 2015/12/11 13:53, Izumi, Taku wrote:
>>>>>> Dear Xishi,
>>>>>>
>>>>>>> Hi Taku,
>>>>>>>
>>>>>>> Would it be possible to rewrite the fallback function in the
>>>>>>> buddy system when zone_movable and mirrored_kernelcore are both
>>>>>>> enabled?
>>>>>>
>>>>>> What does "when zone_movable and mirrored_kernelcore are both
>>>>>> enabled" mean? My patchset just provides a new way to create
>>>>>> ZONE_MOVABLE.
>>>
>>> Hi Taku,
>>>
>>> We can NOT specify kernelcore="nn[KMG]" and "mirror" at the same
>>> time. So when we use "mirror", the movable zone is in fact a new
>>> zone. I think it is more appropriate to name it "mirrored zone", and
>>> we can also rewrite the fallback function in the buddy system in
>>> this case.
>>
>> kernelcore="mirrored zone"?
>
> No, it's zone_names[MAX_NR_ZONES]. How about renaming "Movable" to
> "Non-mirrored"?

That would break many user apps. I don't think we have enough reason.

>> BTW, let me confirm:
>>
>>   ZONE_NORMAL  = mirrored
>>   ZONE_MOVABLE = not mirrored
>
> Yes.
>
>> So the new zone is the "not-mirrored" zone.
>>
>> Now, the fallback order is
>>
>>   movable -> normal -> DMA.
>>
>> As Tony requested, we may need a knob to stop the "movable -> normal"
>> fallback later.
> If the mirrored memory is small and the other is large, I think we can
> enable both "non-mirrored -> normal" and "normal -> non-mirrored".

The size of mirrored memory can be configured by software (an EFI
variable). So having both is overkill, and the "normal -> non-mirrored"
fallback is meaningless considering what the feature wants to
guarantee.

Thanks,
-Kame
Re: [PATCH 1/7] mm: memcontrol: charge swap to cgroup2
On 2015/12/17 12:32, Johannes Weiner wrote:
> On Thu, Dec 17, 2015 at 11:46:27AM +0900, Kamezawa Hiroyuki wrote:
>> On 2015/12/16 20:09, Johannes Weiner wrote:
>>> On Wed, Dec 16, 2015 at 12:18:30PM +0900, Kamezawa Hiroyuki wrote:
>>>> - swap-full notification via vmpressure or some such mechanism.
>>>
>>> Why?
>>
>> I think it's a sign of an unhealthy condition: the file cache drop
>> rate starts to rise. But I forgot that there is a resource threshold
>> notifier already. Does the notifier work for swap.usage?
>
> That will be reflected in vmpressure or other distress mechanisms. I'm
> not convinced "ran out of swap space" needs special casing in any way.

Most users check swap space shortage as a "system alarm" in enterprise
systems. At least, our customers check for swap-full.

>>>> - force swap-in at reducing swap.limit
>>>
>>> Why?
>>
>> If full, swap.limit cannot be reduced even if there is available
>> memory in the cgroup. Another cgroup cannot make use of the swap
>> resource while it's occupied by another cgroup. The job scheduler
>> should have a chance to fix the situation.
>
> I don't see why the swap space allowance would need to be as
> dynamically adjustable as the memory allowance. There is usually no
> need to be as tight with swap space as with memory, and the
> performance penalty of swapping, even with flash drives, is high
> enough that swap space acts as an overflow vessel rather than as part
> of the regular backing of the anonymous/shmem working set. It really
> is NOT obvious that swap space would need to be adjusted on the fly,
> and that it's important that reducing the limit be reflected in
> consumption right away.

With my OS support experience, some customers consider swap space a
resource.

> We shouldn't be adding hundreds of lines of likely terrible heuristics
> code* on speculation that somebody MIGHT find this useful in real
> life. We should wait until we are presented with a real usecase that
> applies to a whole class of users, and then see what the true
> requirements are.

Ok, we should wait.
I'm just guessing that (Japanese) HPC people will want the feature for
their job control. I hear many programs rely on swap.

> * If a group has 200M swapped out and the swap limit is reduced by 10M
>   below the current consumption, which pages would you swap in? There
>   is no LRU list for swap space.

If rotation can happen on swap-in by a real page fault, random swap-in
when reducing swap.limit will work well enough.

Thanks,
-Kame
Re: [PATCH v3 2/2] mm: Introduce kernelcore=mirror option
On 2015/12/17 11:47, Xishi Qiu wrote:
> On 2015/12/17 9:38, Izumi, Taku wrote:
>> Dear Xishi,
>>
>> Sorry for the late reply.
>>
>>> -----Original Message-----
>>> From: Xishi Qiu [mailto:qiuxi...@huawei.com]
>>> Sent: Friday, December 11, 2015 6:44 PM
>>> Subject: Re: [PATCH v3 2/2] mm: Introduce kernelcore=mirror option
>>>
>>> On 2015/12/11 13:53, Izumi, Taku wrote:
>>>> Dear Xishi,
>>>>
>>>>> Hi Taku,
>>>>>
>>>>> Would it be possible to rewrite the fallback function in the buddy
>>>>> system when zone_movable and mirrored_kernelcore are both enabled?
>>>>
>>>> What does "when zone_movable and mirrored_kernelcore are both
>>>> enabled" mean? My patchset just provides a new way to create
>>>> ZONE_MOVABLE.
>
> Hi Taku,
>
> We can NOT specify kernelcore="nn[KMG]" and "mirror" at the same time.
> So when we use "mirror", the movable zone is in fact a new zone. I
> think it is more appropriate to name it "mirrored zone", and we can
> also rewrite the fallback function in the buddy system in this case.

kernelcore="mirrored zone"?

BTW, let me confirm:

  ZONE_NORMAL  = mirrored
  ZONE_MOVABLE = not mirrored

So the new zone is the "not-mirrored" zone.

Now, the fallback order is

  movable -> normal -> DMA.

As Tony requested, we may need a knob to stop the "movable -> normal"
fallback later.

Thanks,
-Kame
Re: [PATCH 1/7] mm: memcontrol: charge swap to cgroup2
On 2015/12/16 20:09, Johannes Weiner wrote:
> On Wed, Dec 16, 2015 at 12:18:30PM +0900, Kamezawa Hiroyuki wrote:
>> Hmm, my requests are:
>> - set the same capabilities as mlock() for setting swap.limit=0
>
> Setting swap.max is already a privileged operation.

Sure.

>> - swap-full notification via vmpressure or some such mechanism
>
> Why?

I think it's a sign of an unhealthy condition: the file cache drop rate
starts to rise. But I forgot that there is a resource threshold
notifier already. Does the notifier work for swap.usage?

>> - the OOM-killer's available-memory calculation may be corrupted;
>>   please check
>
> Vladimir updated mem_cgroup_get_limit().

I'll check it.

>> - force swap-in when reducing swap.limit
>
> Why?

If full, swap.limit cannot be reduced even if there is available memory
in the cgroup. Another cgroup cannot make use of the swap resource
while it's occupied by another cgroup. The job scheduler should have a
chance to fix the situation.

Thanks,
-Kame
Re: [PATCH 1/7] mm: memcontrol: charge swap to cgroup2
On 2015/12/16 2:21, Michal Hocko wrote:
> I completely agree that malicious/untrusted users absolutely have to
> be capped by the hard limit. Then the separate swap limit would work
> for sure. But I am less convinced about the usefulness of a rigid (to
> the global memory pressure) swap limit without the hard limit. All the
> memory that could have been swapped out will put memory pressure on
> the rest of the system without being punished for it too much. A memcg
> is allowed to grow over the high limit (in the current implementation)
> without any way to shrink back, in other words.
>
> My understanding was that the primary use case for the swap limit is
> to handle potential anon memory consumption runaways (not only
> malicious but also unexpectedly misbehaving applications) more
> gracefully, without massive disruption at the global level. I simply
> didn't see swap space partitioning as important enough, because the
> alternative to swap usage is to consume primary memory, which is a
> more precious resource IMO. Swap storage is a really cheap and
> runtime-expandable resource, which is not the case for primary memory
> in general. Maybe there are other use cases I am not aware of, though.
> Do you want to guarantee swap availability?

At the first implementation, an NEC engineer explained their use case
in the HPC area. At that time, there was no swap support.

Consider two workloads partitioned into groups A and B, with 100GB of
total swap:

  A: memory.limit = 40G
  B: memory.limit = 40G

The job scheduler runs applications in A and B in turn; apps in A stop
while apps in B run. If an app in A requires 120GB of anonymous memory,
it uses 80GB of swap, so apps in B can use only the remaining 20GB.
This causes trouble if an app in B needs 100GB of anonymous memory.
They need some knob to control the amount of swap per cgroup.

The point is, at least for their customer, swap is a "resource" which
should be under control. With their use case, memory usage and swap
usage have the same meaning, so a mem+swap limit doesn't cause trouble.
Thanks,
-Kame
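The arithmetic of the HPC scenario above can be made explicit with a small model. The numbers come from the example in the mail; the assumption encoded in `swap_needed` is that anonymous memory beyond memory.limit must live in swap.

```python
def swap_needed(anon_gb: float, memory_limit_gb: float) -> float:
    """Anonymous memory beyond the memory limit has to be backed by
    swap (it cannot stay resident)."""
    return max(0.0, anon_gb - memory_limit_gb)

total_swap = 100.0                                       # GB, shared
a_swap = swap_needed(anon_gb=120, memory_limit_gb=40)    # A consumes swap
b_available = total_swap - a_swap                        # what's left for B
b_swap = swap_needed(anon_gb=100, memory_limit_gb=40)    # what B would need
shortfall = b_swap - b_available                         # B's deficit
```

Without a per-cgroup swap limit, nothing prevents A from taking 80GB of the shared 100GB, leaving B short by `shortfall` GB and heading for OOM even though the scheduler "fairly" gave each group the same memory.limit.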
Re: [PATCH 1/7] mm: memcontrol: charge swap to cgroup2
On 2015/12/15 23:50, Johannes Weiner wrote:
> On Tue, Dec 15, 2015 at 12:22:41PM +0900, Kamezawa Hiroyuki wrote:
>> On 2015/12/15 4:42, Vladimir Davydov wrote:
>>> Anyway, if you don't trust a container you'd better set the hard
>>> memory limit so that it can't hurt others no matter what it runs and
>>> how it tweaks its sub-tree knobs.
>>
>> Limiting swap can easily cause an "OOM-killer even while there is
>> available swap" situation by easy mistake. Can't you add a "swap
>> excess" switch to sysctl so that global memory reclaim can ignore the
>> swap limitation?
>
> That never worked with a combined memory+swap limit, either. How could
> it? The parent might swap you out under pressure, but simply touching
> a few of your anon pages causes them to get swapped back in, thrashing
> with whatever the parent was trying to do. Your ability to swap it out
> is simply no protection against a group touching its pages.
>
> Allowing the parent to exceed swap with separate counters makes even
> less sense, because every page swapped out frees up a page of memory
> that the child can reuse. For every swap page that exceeds the limit,
> the child gets a free memory page! The child doesn't even have to
> cause swapin; it can just steal whatever the parent tried to free up,
> and meanwhile its combined memory & swap footprint explodes.

Sure.

> The answer is, and always should have been: don't overcommit untrusted
> cgroups. Think of swap as a resource you distribute, not as breathing
> room for the parents to rely on. Because it can't be, and could never
> be.

Ok, don't overcommit.

> And the new separate swap counter makes this explicit.

Hmm, my requests are:
- set the same capabilities as mlock() for setting swap.limit=0
- swap-full notification via vmpressure or some such mechanism
- the OOM-killer's available-memory calculation may be corrupted;
  please check
- force swap-in when reducing swap.limit

Thanks,
-Kame
Re: [PATCH 1/7] mm: memcontrol: charge swap to cgroup2
On 2015/12/15 20:02, Vladimir Davydov wrote:
> On Tue, Dec 15, 2015 at 12:22:41PM +0900, Kamezawa Hiroyuki wrote:
>> On 2015/12/15 4:42, Vladimir Davydov wrote:
>>> On Mon, Dec 14, 2015 at 04:30:37PM +0100, Michal Hocko wrote:
>>>> On Thu 10-12-15 14:39:14, Vladimir Davydov wrote:
>>>>> In the legacy hierarchy we charge memsw, which is dubious, because:
>>>>>
>>>>>  - memsw.limit must be >= memory.limit, so it is impossible to
>>>>>    limit swap usage to less than memory usage. Taking into account
>>>>>    the fact that the primary limiting mechanism in the unified
>>>>>    hierarchy is memory.high while memory.limit is either left
>>>>>    unset or set to a very large value, moving the memsw.limit knob
>>>>>    to the unified hierarchy would effectively make it impossible
>>>>>    to limit swap usage according to the user's preference.
>>>>>
>>>>>  - memsw.usage != memory.usage + swap.usage, because a page
>>>>>    occupying both a swap entry and a swap cache page is charged
>>>>>    only once to the memsw counter. As a result, it is possible to
>>>>>    effectively eat up to memory.limit of memory pages *and*
>>>>>    memsw.limit of swap entries, which looks unexpected.
>>>>>
>>>>> That said, we should provide a different swap limiting mechanism
>>>>> for cgroup2. This patch adds the mem_cgroup->swap counter, which
>>>>> charges the actual number of swap entries used by a cgroup. It is
>>>>> only charged in the unified hierarchy, while the legacy hierarchy
>>>>> memsw logic is left intact.
>>>>
>>>> I agree that the previous semantic was awkward. The problem I can
>>>> see with this approach is that once the swap limit is reached, the
>>>> anon memory pressure might spill over to other and unrelated memcgs
>>>> during global memory pressure. I guess this is what Kame referred
>>>> to as "anon would become mlocked", basically. This would be even
>>>> more of an issue with resource delegation to sub-hierarchies,
>>>> because nobody will prevent setting the swap amount to a small
>>>> value and using that as anon memory protection.
>>>
>>> AFAICS such anon memory protection has a side effect: real-life
>>> workloads need page cache to run smoothly (at least for mapping
>>> executables).
Disabling swapping would switch pressure to page caches, resulting in performance degradation. So, I don't think the per-memcg swap limit can be abused to boost your workload on an overcommitted system. If you mean malicious users, well, they already have plenty of ways to eat all available memory up to the hard limit by creating unreclaimable kernel objects.

A "protect anon" user's malicious degree is far lower than that of such cracker-like users.

What do you mean by "malicious degree"? What is such a user trying to achieve? Killing the system? Well, there are much more effective ways to do so. Or does it want to exploit a system-specific feature to get a benefit for itself? If so, it will hardly win by mlocking all anonymous memory, because this will result in higher pressure exerted upon its page cache and dcache, which normal workloads just can't get along without.

I wanted to say that almost all application developers want to set swap.limit=0 if allowed. So, it's ordinary users who can kill the system if swap imbalance is allowed.

Anyway, if you don't trust a container you'd better set the hard memory limit so that it can't hurt others no matter what it runs and how it tweaks its sub-tree knobs.

Limiting swap can easily cause "OOM-Killer even while there is available swap" through an easy mistake.

What do you mean by "easy mistake"? Misconfiguration? If so, it's a lame excuse IMO. Admin should take system configuration seriously. If the host is not overcommitted, it's trivial. Otherwise, there's always a chance that things will go south, so it's not going to be easy. It's up to the admin to analyze risks and set limits accordingly. Exporting knobs with a clear meaning is the best we can do here. swap.max is one such knob. It defines the maximal usage of the swap resource. Allowing to breach it just does not add up.

Can't you add a "swap excess" switch to sysctl so that global memory reclaim can ignore the swap limitation?

I'd be opposed to it, because this would obscure the user API.
OTOH, a kind of swap soft limit (swap.high?) might be considered. I'm not sure if it's really necessary though, because all arguments for it do not look convincing to me for now. So, personally, I would refrain from implementing it until it is really called for by users of cgroup v2.

Considering my customers, running the OOM-Killer while there is free swap space is a system error rather than their misconfiguration.

BTW, mlock() requires CAP_IPC_LOCK. Please set the default to unlimited and check the capability when setting the swap limit, at least.

Thanks,
-Kame

Thanks,
Vladimir
-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
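The hard swap.max semantics defended above ("allowing to breach it just does not add up") can be sketched as a toy charge counter. This is a user-space Python model for illustration only; `SwapCounter` and its methods are made-up names, not the kernel's page_counter API:

```python
class SwapCounter:
    """Toy model of a per-cgroup swap charge counter (illustrative only)."""

    def __init__(self, max_pages):
        self.usage = 0
        self.max = max_pages

    def try_charge(self, nr_pages):
        # A hard swap.max: the charge fails rather than breaching the limit,
        # so the anon pages stay resident -- effectively mlocked, which is
        # exactly the concern raised in the thread.
        if self.usage + nr_pages > self.max:
            return False
        self.usage += nr_pages
        return True

    def uncharge(self, nr_pages):
        self.usage -= nr_pages


swap = SwapCounter(max_pages=4)
assert swap.try_charge(3)        # 3 swap entries charged
assert not swap.try_charge(2)    # would exceed swap.max -> refused
swap.uncharge(3)
assert swap.try_charge(4)        # freed entries can be charged again
```

The proposed swap.high would differ only in that `try_charge` would always succeed while signalling the caller to throttle once usage crosses the soft threshold.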
Re: [PATCH 1/7] mm: memcontrol: charge swap to cgroup2
On 2015/12/15 17:30, Vladimir Davydov wrote: On Tue, Dec 15, 2015 at 12:12:40PM +0900, Kamezawa Hiroyuki wrote: On 2015/12/15 0:30, Michal Hocko wrote: On Thu 10-12-15 14:39:14, Vladimir Davydov wrote: In the legacy hierarchy we charge memsw, which is dubious, because: - memsw.limit must be >= memory.limit, so it is impossible to limit swap usage less than memory usage. Taking into account the fact that the primary limiting mechanism in the unified hierarchy is memory.high while memory.limit is either left unset or set to a very large value, moving memsw.limit knob to the unified hierarchy would effectively make it impossible to limit swap usage according to the user preference. - memsw.usage != memory.usage + swap.usage, because a page occupying both swap entry and a swap cache page is charged only once to memsw counter. As a result, it is possible to effectively eat up to memory.limit of memory pages *and* memsw.limit of swap entries, which looks unexpected. That said, we should provide a different swap limiting mechanism for cgroup2. This patch adds mem_cgroup->swap counter, which charges the actual number of swap entries used by a cgroup. It is only charged in the unified hierarchy, while the legacy hierarchy memsw logic is left intact. I agree that the previous semantic was awkward. The problem I can see with this approach is that once the swap limit is reached the anon memory pressure might spill over to other and unrelated memcgs during the global memory pressure. I guess this is what Kame referred to as anon would become mlocked basically. This would be even more of an issue with resource delegation to sub-hierarchies because nobody will prevent setting the swap amount to a small value and use that as an anon memory protection. I guess this was the reason why this approach hasn't been chosen before Yes. At that age, "never break global VM" was the policy. And "mlock" can be used for attacking system. 
If we are talking about "attacking the system" from inside a container, there are much easier and more disruptive ways, e.g. running a fork-bomb or creating pipes - such memory can't be reclaimed and the global OOM killer won't help.

You're right. We just wanted to avoid affecting global memory reclaim by per-cgroup settings.

Thanks,
-Kame
Re: [PATCH 1/7] mm: memcontrol: charge swap to cgroup2
On 2015/12/15 0:30, Michal Hocko wrote: On Thu 10-12-15 14:39:14, Vladimir Davydov wrote: In the legacy hierarchy we charge memsw, which is dubious, because: - memsw.limit must be >= memory.limit, so it is impossible to limit swap usage less than memory usage. Taking into account the fact that the primary limiting mechanism in the unified hierarchy is memory.high while memory.limit is either left unset or set to a very large value, moving memsw.limit knob to the unified hierarchy would effectively make it impossible to limit swap usage according to the user preference. - memsw.usage != memory.usage + swap.usage, because a page occupying both swap entry and a swap cache page is charged only once to memsw counter. As a result, it is possible to effectively eat up to memory.limit of memory pages *and* memsw.limit of swap entries, which looks unexpected. That said, we should provide a different swap limiting mechanism for cgroup2. This patch adds mem_cgroup->swap counter, which charges the actual number of swap entries used by a cgroup. It is only charged in the unified hierarchy, while the legacy hierarchy memsw logic is left intact. I agree that the previous semantic was awkward. The problem I can see with this approach is that once the swap limit is reached the anon memory pressure might spill over to other and unrelated memcgs during the global memory pressure. I guess this is what Kame referred to as anon would become mlocked basically. This would be even more of an issue with resource delegation to sub-hierarchies because nobody will prevent setting the swap amount to a small value and use that as an anon memory protection. I guess this was the reason why this approach hasn't been chosen before Yes. At that age, "never break global VM" was the policy. And "mlock" can be used for attacking system. but I think we can come up with a way to stop the run away consumption even when the swap is accounted separately. All of them are quite nasty but let me try. 
We could allow charges to fail even for the high limit if the excess is way above the amount of reclaimable memory in the given memcg/hierarchy. A runaway load would be stopped before it can cause considerable damage outside of its hierarchy this way, even when the swap limit is configured small. Now that goes against the high limit semantic, which should only throttle the consumer and shouldn't cause any functional failures, but maybe this is acceptable for the overall system stability.

An alternative would be to throttle in the high limit reclaim context proportionally to the excess. This is normally done by the reclaim itself, but with no reclaimable memory this wouldn't work that way.

This seems hard to use for users who want to control resources precisely, even if the stability is good.

Another option would be to ignore the swap limit during the global reclaim. This wouldn't stop the runaway loads but they would at least see their fair share of the reclaim. The swap excess could then be used as a "handicap" for more aggressive throttling during high limit reclaim or to trigger the hard limit sooner.

This seems to work. But users need to understand that the swap limit can be exceeded.

Or we could teach the global OOM killer to select abusive anon memory users with restricted swap. That would require iterating through all memcgs, checking whether their anon consumption is in large excess of their swap limit, and falling back to the memcg OOM victim selection if that is the case. This adds more complexity to the OOM killer path, so I am not sure this is generally acceptable, though.

I think this is not acceptable.

My question now is: is the knob usable/useful even without additional heuristics? Do we want to protect swap space so rigidly that a swap-limited memcg can cause bigger problems than without the swap limit, globally?

Swap requires some limit. If not, an application can eat up all swap and it will never be freed until the application accesses it or swapoff runs.
Thanks,
-Kame

The swap usage can be monitored using new memory.swap.current file and limited using memory.swap.max.

Signed-off-by: Vladimir Davydov
---
 include/linux/memcontrol.h |   1 +
 include/linux/swap.h       |   5 ++
 mm/memcontrol.c            | 123 +
 mm/shmem.c                 |   4 ++
 mm/swap_state.c            |   5 ++
 5 files changed, 129 insertions(+), 9 deletions(-)

[...]
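The third option above — ignore the swap limit during global reclaim and keep the excess as a "handicap" — can be sketched as two small predicates. This is an illustrative Python model of the idea under discussion, not kernel code; the function names are mine:

```python
def may_swap_out(swap_usage, swap_max, global_reclaim):
    # Option discussed in the thread: global reclaim is allowed to overrun
    # the per-cgroup swap limit, so a runaway anon load still pays its fair
    # share of reclaim instead of spilling pressure onto other memcgs.
    if global_reclaim:
        return True
    return swap_usage < swap_max


def throttle_handicap(swap_usage, swap_max):
    # The excess over swap.max could then bias high-limit throttling or
    # trigger the hard limit sooner, as Michal suggests.
    return max(0, swap_usage - swap_max)


assert may_swap_out(10, 10, global_reclaim=True)       # limit ignored globally
assert not may_swap_out(10, 10, global_reclaim=False)  # memcg reclaim honors it
assert throttle_handicap(16, 10) == 6
```

This is exactly the trade-off Kame flags: usable, but users must understand that swap.max stops being a hard guarantee under global memory pressure.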
Re: [PATCH 1/7] mm: memcontrol: charge swap to cgroup2
On 2015/12/15 4:42, Vladimir Davydov wrote: On Mon, Dec 14, 2015 at 04:30:37PM +0100, Michal Hocko wrote: On Thu 10-12-15 14:39:14, Vladimir Davydov wrote: In the legacy hierarchy we charge memsw, which is dubious, because: - memsw.limit must be >= memory.limit, so it is impossible to limit swap usage less than memory usage. Taking into account the fact that the primary limiting mechanism in the unified hierarchy is memory.high while memory.limit is either left unset or set to a very large value, moving memsw.limit knob to the unified hierarchy would effectively make it impossible to limit swap usage according to the user preference. - memsw.usage != memory.usage + swap.usage, because a page occupying both swap entry and a swap cache page is charged only once to memsw counter. As a result, it is possible to effectively eat up to memory.limit of memory pages *and* memsw.limit of swap entries, which looks unexpected. That said, we should provide a different swap limiting mechanism for cgroup2. This patch adds mem_cgroup->swap counter, which charges the actual number of swap entries used by a cgroup. It is only charged in the unified hierarchy, while the legacy hierarchy memsw logic is left intact. I agree that the previous semantic was awkward. The problem I can see with this approach is that once the swap limit is reached the anon memory pressure might spill over to other and unrelated memcgs during the global memory pressure. I guess this is what Kame referred to as anon would become mlocked basically. This would be even more of an issue with resource delegation to sub-hierarchies because nobody will prevent setting the swap amount to a small value and use that as an anon memory protection. AFAICS such anon memory protection has a side-effect: real-life workloads need page cache to run smoothly (at least for mapping executables). Disabling swapping would switch pressure to page caches, resulting in performance degradation. 
So, I don't think the per-memcg swap limit can be abused to boost your workload on an overcommitted system. If you mean malicious users, well, they already have plenty of ways to eat all available memory up to the hard limit by creating unreclaimable kernel objects.

A "protect anon" user's malicious degree is far lower than that of such cracker-like users.

Anyway, if you don't trust a container you'd better set the hard memory limit so that it can't hurt others no matter what it runs and how it tweaks its sub-tree knobs.

Limiting swap can easily cause "OOM-Killer even while there is available swap" through an easy mistake.

Can't you add a "swap excess" switch to sysctl so that global memory reclaim can ignore the swap limitation?

Regards,
-Kame
Re: [PATCH 1/7] mm: memcontrol: charge swap to cgroup2
On 2015/12/10 20:39, Vladimir Davydov wrote:
> In the legacy hierarchy we charge memsw, which is dubious, because:
>
> - memsw.limit must be >= memory.limit, so it is impossible to limit
>   swap usage less than memory usage. Taking into account the fact that
>   the primary limiting mechanism in the unified hierarchy is
>   memory.high while memory.limit is either left unset or set to a very
>   large value, moving memsw.limit knob to the unified hierarchy would
>   effectively make it impossible to limit swap usage according to the
>   user preference.
>
> - memsw.usage != memory.usage + swap.usage, because a page occupying
>   both swap entry and a swap cache page is charged only once to memsw
>   counter. As a result, it is possible to effectively eat up to
>   memory.limit of memory pages *and* memsw.limit of swap entries, which
>   looks unexpected.
>
> That said, we should provide a different swap limiting mechanism for
> cgroup2.
>
> This patch adds mem_cgroup->swap counter, which charges the actual
> number of swap entries used by a cgroup. It is only charged in the
> unified hierarchy, while the legacy hierarchy memsw logic is left
> intact.
>
> The swap usage can be monitored using new memory.swap.current file and
> limited using memory.swap.max.
>
> Signed-off-by: Vladimir Davydov

Setting swap.max=0 will work like mlock?
Thanks, -Kame > --- > include/linux/memcontrol.h | 1 + > include/linux/swap.h | 5 ++ > mm/memcontrol.c| 123 > + > mm/shmem.c | 4 ++ > mm/swap_state.c| 5 ++ > 5 files changed, 129 insertions(+), 9 deletions(-) > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index c6a5ed2f2744..993c9a26b637 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -169,6 +169,7 @@ struct mem_cgroup { > > /* Accounted resources */ > struct page_counter memory; > + struct page_counter swap; > struct page_counter memsw; > struct page_counter kmem; > > diff --git a/include/linux/swap.h b/include/linux/swap.h > index 457181844b6e..f4b3ccdcba91 100644 > --- a/include/linux/swap.h > +++ b/include/linux/swap.h > @@ -368,11 +368,16 @@ static inline int mem_cgroup_swappiness(struct > mem_cgroup *mem) > #endif > #ifdef CONFIG_MEMCG_SWAP > extern void mem_cgroup_swapout(struct page *page, swp_entry_t entry); > +extern int mem_cgroup_charge_swap(struct page *page, swp_entry_t entry); > extern void mem_cgroup_uncharge_swap(swp_entry_t entry); > #else > static inline void mem_cgroup_swapout(struct page *page, swp_entry_t entry) > { > } > +static inline int mem_cgroup_charge_swap(struct page *page, swp_entry_t > entry) > +{ > + return 0; > +} > static inline void mem_cgroup_uncharge_swap(swp_entry_t entry) > { > } > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 7f5c6abf5421..9d10e2819ec4 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -1212,7 +1212,7 @@ void mem_cgroup_print_oom_info(struct mem_cgroup > *memcg, struct task_struct *p) > pr_cont(":"); > > for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) { > - if (i == MEM_CGROUP_STAT_SWAP && !do_memsw_account()) > + if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account) > continue; > pr_cont(" %s:%luKB", mem_cgroup_stat_names[i], > K(mem_cgroup_read_stat(iter, i))); > @@ -1248,12 +1248,15 @@ static unsigned long mem_cgroup_get_limit(struct > mem_cgroup *memcg) > { > unsigned long limit; > 
> - limit = memcg->memory.limit; > + limit = READ_ONCE(memcg->memory.limit); > if (mem_cgroup_swappiness(memcg)) { > unsigned long memsw_limit; > + unsigned long swap_limit; > > - memsw_limit = memcg->memsw.limit; > - limit = min(limit + total_swap_pages, memsw_limit); > + memsw_limit = READ_ONCE(memcg->memsw.limit); > + swap_limit = min(READ_ONCE(memcg->swap.limit), > + (unsigned long)total_swap_pages); > + limit = min(limit + swap_limit, memsw_limit); > } > return limit; > } > @@ -4226,6 +4229,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state > *parent_css) > page_counter_init(&memcg->memory, NULL); > memcg->high = PAGE_COUNTER_MAX; > memcg->soft_limit = PAGE_COUNTER_MAX; > + page_counter_init(&memcg->swap, NULL); > page_counter_init(&memcg->memsw, NULL); > page_counter_init(&memcg->kmem, NULL); > } > @@ -4276,6 +4280,7 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css) > page_counter_init(&memcg->memory, &parent->memory); > memcg->high = PAGE_COUNTER_MAX; > memcg-
Re: [PATCH] mm: Introduce kernelcore=reliable option
On 2015/10/31 4:42, Luck, Tony wrote:

If each memory controller has the same distance/latency, you (your firmware) don't need to allocate reliable memory per each memory controller. If distance is a problem, another node should be allocated. ...is the behavior (splitting the zone) really required?

It's useful from a memory bandwidth perspective to have allocations spread across both memory controllers. Keeping a whole bunch of Xeon cores fed needs all the bandwidth you can get.

Hmm. But the physical address layout is not tied to the dual memory controllers. I think the reliable range can be made contiguous by the firmware...

-Kame
Re: [RFC 1/3] mm, oom: refactor oom detection
On 2015/10/30 17:23, Michal Hocko wrote: On Fri 30-10-15 14:23:59, KAMEZAWA Hiroyuki wrote: On 2015/10/30 0:17, mho...@kernel.org wrote: [...] @@ -3135,13 +3145,56 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, if (gfp_mask & __GFP_NORETRY) goto noretry; - /* Keep reclaiming pages as long as there is reasonable progress */ + /* +* Do not retry high order allocations unless they are __GFP_REPEAT +* and even then do not retry endlessly. +*/ pages_reclaimed += did_some_progress; - if ((did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER) || - ((gfp_mask & __GFP_REPEAT) && pages_reclaimed < (1 << order))) { - /* Wait for some write requests to complete then retry */ - wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC, HZ/50); - goto retry; + if (order > PAGE_ALLOC_COSTLY_ORDER) { + if (!(gfp_mask & __GFP_REPEAT) || pages_reclaimed >= (1< why directly retry here ? Because I wanted to preserve the previous logic for GFP_REPEAT as much as possible here and do an incremental change in the later patch. I see. [...] @@ -3150,8 +3203,10 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, goto got_pg; /* Retry as long as the OOM killer is making progress */ - if (did_some_progress) + if (did_some_progress) { + stall_backoff = 0; goto retry; + } Umm ? I'm sorry that I didn't notice page allocation may fail even if order < PAGE_ALLOC_COSTLY_ORDER. I thought old logic ignores did_some_progress. It seems a big change. __alloc_pages_may_oom will set did_some_progress So, now, 0-order page allocation may fail in a OOM situation ? No they don't normally and this patch doesn't change the logic here. I understand your patch doesn't change the behavior. Looking into __alloc_pages_may_oom(), *did_some_progress is finally set by if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) *did_some_progress = 1; ...depends on out_of_memory() return value. Now, allocation may fail if oom-killer is disabled Isn't it complicated ? 
Shouldn't we have

	if (order < PAGE_ALLOC_COSTLY_ORDER)
		goto retry;

here?

Thanks,
-Kame
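The retry heuristic being discussed — a linear backoff of the reclaim target, reset whenever reclaim makes progress — can be sketched as follows. The constant mirrors the patch's MAX_STALL_BACKOFF; the helper name and the Python framing are illustrative, not the patch's code:

```python
MAX_STALL_BACKOFF = 16  # from the patch: trim 1/16 of reclaimable pages per round


def reclaim_target(reclaimable, stall_backoff):
    # Each reclaim round without progress trims one more 1/16 slice off the
    # reclaimable-page estimate fed to the watermark check. After 16 fruitless
    # rounds the target reaches zero, the watermark check sees only the free
    # pages, and the allocator proceeds to OOM instead of looping forever.
    stall_backoff = min(stall_backoff, MAX_STALL_BACKOFF)
    return reclaimable - reclaimable * stall_backoff // MAX_STALL_BACKOFF


assert reclaim_target(1600, 0) == 1600                  # optimistic first pass
assert reclaim_target(1600, 4) == 1200                  # 4 stalled rounds
assert reclaim_target(1600, MAX_STALL_BACKOFF) == 0     # converged: free pages only
```

Any round where `did_some_progress` is non-zero resets `stall_backoff` to 0, which is the behavior Kame is probing: on that path the !costly-order allocation retries rather than failing.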
Re: [PATCH] mm: Introduce kernelcore=reliable option
On 2015/10/23 10:44, Luck, Tony wrote:
> First part of each memory controller. I have two memory controllers on each
> node

If each memory controller has the same distance/latency, you (your firmware) don't need to allocate reliable memory per each memory controller. If distance is a problem, another node should be allocated.

...is the behavior (splitting the zone) really required?

Thanks,
-Kame
Re: [RFC 2/3] mm: throttle on IO only when there are too many dirty and writeback pages
On 2015/10/30 0:17, mho...@kernel.org wrote: > From: Michal Hocko > > wait_iff_congested has been used to throttle allocator before it retried > another round of direct reclaim to allow the writeback to make some > progress and prevent reclaim from looping over dirty/writeback pages > without making any progress. We used to do congestion_wait before > 0e093d99763e ("writeback: do not sleep on the congestion queue if > there are no congested BDIs or if significant congestion is not being > encountered in the current zone") but that led to undesirable stalls > and sleeping for the full timeout even when the BDI wasn't congested. > Hence wait_iff_congested was used instead. But it seems that even > wait_iff_congested doesn't work as expected. We might have a small file > LRU list with all pages dirty/writeback and yet the bdi is not congested > so this is just a cond_resched in the end and can end up triggering pre > mature OOM. > > This patch replaces the unconditional wait_iff_congested by > congestion_wait which is executed only if we _know_ that the last round > of direct reclaim didn't make any progress and dirty+writeback pages are > more than a half of the reclaimable pages on the zone which might be > usable for our target allocation. This shouldn't reintroduce stalls > fixed by 0e093d99763e because congestion_wait is called only when we > are getting hopeless when sleeping is a better choice than OOM with many > pages under IO. 
> > Signed-off-by: Michal Hocko > --- > mm/page_alloc.c | 19 +-- > 1 file changed, 17 insertions(+), 2 deletions(-) > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 9c0abb75ad53..0518ca6a9776 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -3191,8 +3191,23 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int > order, >*/ > if (__zone_watermark_ok(zone, order, min_wmark_pages(zone), > ac->high_zoneidx, alloc_flags, target)) { > - /* Wait for some write requests to complete then retry > */ > - wait_iff_congested(zone, BLK_RW_ASYNC, HZ/50); > + unsigned long writeback = zone_page_state(zone, > NR_WRITEBACK), > + dirty = zone_page_state(zone, > NR_FILE_DIRTY); > + > + if (did_some_progress) > + goto retry; > + > + /* > + * If we didn't make any progress and have a lot of > + * dirty + writeback pages then we should wait for > + * an IO to complete to slow down the reclaim and > + * prevent from pre mature OOM > + */ > + if (2*(writeback + dirty) > reclaimable) Doesn't this add unnecessary latency if other zones have enough clean memory ? Thanks, -Kame -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
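The condition the patch adds — sleep on congestion only when reclaim made no progress and dirty+writeback pages dominate the zone's reclaimable pages — can be modeled as a small predicate. This is an illustrative Python sketch of the patch's test, not kernel code; the function name is mine:

```python
def should_wait_for_io(did_some_progress, nr_writeback, nr_dirty, reclaimable):
    # Model of the patch's logic: if the last direct-reclaim round made
    # progress, just retry; otherwise call congestion_wait only when more
    # than half of the zone's reclaimable pages are dirty or under
    # writeback, i.e. 2 * (writeback + dirty) > reclaimable.
    if did_some_progress:
        return False
    return 2 * (nr_writeback + nr_dirty) > reclaimable


assert not should_wait_for_io(True, 500, 500, 1000)   # progress -> retry now
assert should_wait_for_io(False, 300, 300, 1000)      # 1200 > 1000 -> sleep
assert not should_wait_for_io(False, 100, 100, 1000)  # mostly clean -> no stall
```

Note the check is evaluated per zone inside the zonelist walk, which is exactly the latency concern Kame raises: one IO-bound zone can insert a sleep even when other zones still hold clean, reclaimable memory.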
Re: [RFC 1/3] mm, oom: refactor oom detection
On 2015/10/30 0:17, mho...@kernel.org wrote: > From: Michal Hocko > > __alloc_pages_slowpath has traditionally relied on the direct reclaim > and did_some_progress as an indicator that it makes sense to retry > allocation rather than declaring OOM. shrink_zones had to rely on > zone_reclaimable if shrink_zone didn't make any progress to prevent > from pre mature OOM killer invocation - the LRU might be full of dirty > or writeback pages and direct reclaim cannot clean those up. > > zone_reclaimable will allow to rescan the reclaimable lists several > times and restart if a page is freed. This is really subtle behavior > and it might lead to a livelock when a single freed page keeps allocator > looping but the current task will not be able to allocate that single > page. OOM killer would be more appropriate than looping without any > progress for unbounded amount of time. > > This patch changes OOM detection logic and pulls it out from shrink_zone > which is too low to be appropriate for any high level decisions such as OOM > which is per zonelist property. It is __alloc_pages_slowpath which knows > how many attempts have been done and what was the progress so far > therefore it is more appropriate to implement this logic. > > The new heuristic tries to be more deterministic and easier to follow. > Retrying makes sense only if the currently reclaimable memory + free > pages would allow the current allocation request to succeed (as per > __zone_watermark_ok) at least for one zone in the usable zonelist. > > This alone wouldn't be sufficient, though, because the writeback might > get stuck and reclaimable pages might be pinned for a really long time > or even depend on the current allocation context. Therefore there is a > feedback mechanism implemented which reduces the reclaim target after > each reclaim round without any progress. This means that we should > eventually converge to only NR_FREE_PAGES as the target and fail on the > wmark check and proceed to OOM. 
The backoff is simple and linear with > 1/16 of the reclaimable pages for each round without any progress. We > are optimistic and reset counter for successful reclaim rounds. > > Costly high order pages mostly preserve their semantic and those without > __GFP_REPEAT fail right away while those which have the flag set will > back off after the amount of reclaimable pages reaches equivalent of the > requested order. The only difference is that if there was no progress > during the reclaim we rely on zone watermark check. This is more logical > thing to do than previous 1< zone_reclaimable faking the progress. > > Signed-off-by: Michal Hocko > --- > include/linux/swap.h | 1 + > mm/page_alloc.c | 69 > ++-- > mm/vmscan.c | 10 +--- > 3 files changed, 64 insertions(+), 16 deletions(-) > > diff --git a/include/linux/swap.h b/include/linux/swap.h > index 9c7c4b418498..8298e1dc20f9 100644 > --- a/include/linux/swap.h > +++ b/include/linux/swap.h > @@ -317,6 +317,7 @@ extern void lru_cache_add_active_or_unevictable(struct > page *page, > struct vm_area_struct *vma); > > /* linux/mm/vmscan.c */ > +extern unsigned long zone_reclaimable_pages(struct zone *zone); > extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order, > gfp_t gfp_mask, nodemask_t *mask); > extern int __isolate_lru_page(struct page *page, isolate_mode_t mode); > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index c73913648357..9c0abb75ad53 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -2972,6 +2972,13 @@ static inline bool is_thp_gfp_mask(gfp_t gfp_mask) > return (gfp_mask & (GFP_TRANSHUGE | __GFP_KSWAPD_RECLAIM)) == > GFP_TRANSHUGE; > } > > +/* > + * Number of backoff steps for potentially reclaimable pages if the direct > reclaim > + * cannot make any progress. Each step will reduce 1/MAX_STALL_BACKOFF of the > + * reclaimable memory. 
> + */ > +#define MAX_STALL_BACKOFF 16 > + > static inline struct page * > __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, > struct alloc_context *ac) > @@ -2984,6 +2991,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int > order, > enum migrate_mode migration_mode = MIGRATE_ASYNC; > bool deferred_compaction = false; > int contended_compaction = COMPACT_CONTENDED_NONE; > + struct zone *zone; > + struct zoneref *z; > + int stall_backoff = 0; > > /* >* In the slowpath, we sanity check order to avoid ever trying to > @@ -3135,13 +3145,56 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int > order, > if (gfp_mask & __GFP_NORETRY) > goto noretry; > > - /* Keep reclaiming pages as long as there is reasonable progress */ > + /* > + * Do not retry high order allocations unless they are __GFP_
Re: [PATCH] mm: Introduce kernelcore=reliable option
On 2015/10/22 3:17, Luck, Tony wrote:

+	if (reliable_kernelcore) {
+		for_each_memblock(memory, r) {
+			if (memblock_is_mirror(r))
+				continue;

Should we have a safety check here that there is some mirrored memory? If you give the kernelcore=reliable option on a machine which doesn't have any mirror configured, then we'll mark all memory as removable.

You're right.

What happens then? Do kernel allocations fail? Or do they fall back to using removable memory?

Maybe the kernel cannot boot because the NORMAL zone is empty.

Is there a /proc or /sys file that shows the current counts for the removable zone? I just tried this patch with a high percentage of memory marked as mirror ... but I'd like to see how much is actually being used to tune things a bit.

I think /proc/zoneinfo can show detailed numbers per zone. Do we need something in meminfo, too?

Thanks,
-Kame
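Pulling per-zone counts out of /proc/zoneinfo is a simple parse of the "Node N, zone NAME" headers and the counter lines below them. A minimal Python sketch, run here against an abridged, made-up sample (real output carries many more counters and slightly varying indentation across kernel versions):

```python
def per_zone_free_pages(zoneinfo_text):
    """Collect nr_free_pages per (node, zone) from /proc/zoneinfo-style text."""
    zones, current = {}, None
    for raw in zoneinfo_text.splitlines():
        line = raw.strip()
        if line.startswith("Node") and "zone" in line:
            node, zone = line.split(", zone")
            current = (node.strip(), zone.strip())
        elif current and line.startswith("nr_free_pages"):
            zones[current] = int(line.split()[1])
    return zones


sample = """\
Node 0, zone      DMA
    nr_free_pages 3975
Node 0, zone   Normal
    nr_free_pages 219041
Node 0, zone  Movable
    nr_free_pages 819200
"""
counts = per_zone_free_pages(sample)
assert counts[("Node 0", "DMA")] == 3975
assert counts[("Node 0", "Movable")] == 819200
```

On a mirrored-memory setup the Movable row answers Tony's question directly: it shows how much non-mirrored memory remains free versus the mirrored Normal zone.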
Re: [PATCH][RFC] mm: Introduce kernelcore=reliable option
On 2015/10/09 19:36, Xishi Qiu wrote: On 2015/10/9 17:24, Kamezawa Hiroyuki wrote: On 2015/10/09 15:46, Xishi Qiu wrote: On 2015/10/9 22:56, Taku Izumi wrote: Xeon E7 v3 based systems supports Address Range Mirroring and UEFI BIOS complied with UEFI spec 2.5 can notify which ranges are reliable (mirrored) via EFI memory map. Now Linux kernel utilize its information and allocates boot time memory from reliable region. My requirement is: - allocate kernel memory from reliable region - allocate user memory from non-reliable region In order to meet my requirement, ZONE_MOVABLE is useful. By arranging non-reliable range into ZONE_MOVABLE, reliable memory is only used for kernel allocations. Hi Taku, You mean set non-mirrored memory to movable zone, and set mirrored memory to normal zone, right? So kernel allocations will use mirrored memory in normal zone, and user allocations will use non-mirrored memory in movable zone. My question is: 1) do we need to change the fallback function? For *our* requirement, it's not required. But if someone want to prevent user's memory allocation from NORMAL_ZONE, we need some change in zonelist walking. Hi Kame, So we assume kernel will only use normal zone(mirrored), and users use movable zone(non-mirrored) first if the memory is not enough, then use normal zone too. Yes. 2) the mirrored region should locate at the start of normal zone, right? Precisely, "not-reliable" range of memory are handled by ZONE_MOVABLE. This patch does only that. I mean the mirrored region can not at the middle or end of the zone, BIOS should report the memory like this, e.g. BIOS node0: 0-4G mirrored, 4-8G mirrored, 8-16G non-mirrored node1: 16-24G mirrored, 24-32G non-mirrored OS node0: DMA DMA32 are both mirrored, NORMAL(4-8G), MOVABLE(8-16G) node1: NORMAL(16-24G), MOVABLE(24-32G) I think zones can be overlapped even while they are aligned to MAX_ORDER. I remember Kame has already suggested this idea. 
In my opinion, I still think it's better to add a new migratetype or a new zone, so both user and kernel could use mirrored memory. Hi, Xishi. Izumi-san and I discussed the implementation a lot and found that using "zone" is the better approach. The biggest reason is that a zone is the unit of vmscan and of all statistics, and it handles a range of memory for a purpose. We can reuse all the vmscan and information code by making use of zones. Introducing another structure would be messy. Yes, adding a new zone is better, but it will change a lot of code, so reusing ZONE_MOVABLE is simpler and easier, right? I think so. If someone feels difficulty with ZONE_MOVABLE, adding a zone will be another job. (*) Taku-san's boot option is to specify kernelcore to be placed into reliable memory and doesn't specify anything about users. His patch is very simple. The following plan sounds good to me. Shall we rename the zone when it is used for mirrored memory? "movable" is a little confusing. Maybe. I think it should be another discussion. With this patch and his fake-reliable-memory patch, everyone can give it a try. For your requirements, Izumi-san and I are discussing the following plan. - Add a flag to show whether the zone is reliable or not, then mark ZONE_MOVABLE as not-reliable. - Add __GFP_RELIABLE. This will allow alloc_pages() to skip the not-reliable zone. - Add madvise() MADV_RELIABLE and modify the page fault code's gfp flag with that flag. like this? user: madvise()/mmap()/or others -> add vma_reliable flag -> add gfp_reliable flag -> alloc_pages kernel: use __GFP_RELIABLE flag in buddy allocation/slab/vmalloc... yes. Also we can introduce some interfaces in procfs or sysfs, right? It's based on your use case. I think madvise() will be the 1st choice. Thanks, -kame -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
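The flow proposed in this mail (madvise() -> vma flag -> gfp flag -> alloc_pages) boils down to a zonelist walk that skips zones marked not-reliable when the caller passes the new flag. A minimal sketch in Python, not kernel code; `GFP_RELIABLE`, the per-zone `reliable` flag, and the two-zone zonelist are assumptions drawn from the plan above:

```python
# Model of the proposed __GFP_RELIABLE behavior: an allocation carrying
# the flag skips zones marked not-reliable (ZONE_MOVABLE in this plan).
class Zone:
    def __init__(self, name, reliable, free_pages):
        self.name = name
        self.reliable = reliable      # proposed per-zone "reliable" flag
        self.free_pages = free_pages

GFP_RELIABLE = 0x1                    # hypothetical gfp bit

def alloc_page(zonelist, gfp):
    for zone in zonelist:             # zonelist walk, preferred zone first
        if (gfp & GFP_RELIABLE) and not zone.reliable:
            continue                  # skip non-mirrored memory
        if zone.free_pages > 0:
            zone.free_pages -= 1
            return zone.name
    return None                       # the kernel would enter reclaim here

# movable (non-mirrored) is preferred for ordinary user allocations
zonelist = [Zone("ZONE_MOVABLE", False, 2), Zone("ZONE_NORMAL", True, 2)]
```

With this model, a page fault in a `MADV_RELIABLE` vma would add `GFP_RELIABLE` and land in `ZONE_NORMAL`, while a plain allocation is satisfied from `ZONE_MOVABLE` first.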
Re: [PATCH][RFC] mm: Introduce kernelcore=reliable option
On 2015/10/09 15:46, Xishi Qiu wrote: On 2015/10/9 22:56, Taku Izumi wrote: Xeon E7 v3 based systems support Address Range Mirroring, and a UEFI BIOS compliant with UEFI spec 2.5 can notify which ranges are reliable (mirrored) via the EFI memory map. Now the Linux kernel utilizes this information and allocates boot-time memory from the reliable region. My requirement is: - allocate kernel memory from the reliable region - allocate user memory from the non-reliable region In order to meet my requirement, ZONE_MOVABLE is useful. By arranging the non-reliable range into ZONE_MOVABLE, reliable memory is only used for kernel allocations. Hi Taku, You mean set non-mirrored memory to the movable zone, and set mirrored memory to the normal zone, right? So kernel allocations will use mirrored memory in the normal zone, and user allocations will use non-mirrored memory in the movable zone. My question is: 1) do we need to change the fallback function? For *our* requirement, it's not required. But if someone wants to prevent user memory allocation from ZONE_NORMAL, we need some change in the zonelist walking. 2) the mirrored region should be located at the start of the normal zone, right? Precisely, the "not-reliable" ranges of memory are handled by ZONE_MOVABLE. This patch does only that. I remember Kame has already suggested this idea. In my opinion, I still think it's better to add a new migratetype or a new zone, so both user and kernel could use mirrored memory. Hi, Xishi. Izumi-san and I discussed the implementation a lot and found that using "zone" is the better approach. The biggest reason is that a zone is the unit of vmscan and of all statistics, and it handles a range of memory for a purpose. We can reuse all the vmscan and information code by making use of zones. Introducing another structure would be messy. His patch is very simple. For your requirements, Izumi-san and I are discussing the following plan. - Add a flag to show whether the zone is reliable or not, then mark ZONE_MOVABLE as not-reliable. - Add __GFP_RELIABLE. 
This will allow alloc_pages() to skip the not-reliable zone. - Add madvise() MADV_RELIABLE and modify the page fault code's gfp flag with that flag. Thanks, -Kame
Re: [Intel-wired-lan] [Patch V3 5/9] i40e: Use numa_mem_id() to better support memoryless node
On 2015/10/09 14:52, Jiang Liu wrote: On 2015/10/9 4:20, Andrew Morton wrote: On Wed, 19 Aug 2015 17:18:15 -0700 (PDT) David Rientjes wrote: On Wed, 19 Aug 2015, Patil, Kiran wrote: Acked-by: Kiran Patil Where's the call to preempt_disable() to prevent kernels with preemption from making numa_node_id() invalid during this iteration? David asked this question twice, received no answer and now the patch is in the maintainer tree, destined for mainline. If I was asked this question I would respond: "The use of numa_mem_id() is racy and best-effort. If the unlikely race occurs, the memory allocation will occur on the wrong node, the overall result being very slightly suboptimal performance. The existing use of numa_node_id() suffers from the same issue." But I'm not the person proposing the patch. Please don't just ignore reviewer comments! Hi Andrew, Apologies for the slow response due to personal reasons! And thanks for answering the question from David. To be honest, I didn't know how to answer this question before. Actually this question has puzzled me for a long time when dealing with memory hot-removal. For normal cases, it only causes sub-optimal memory allocation if a schedule event happens between querying the NUMA node id and calling alloc_pages_node(). But what happens if the system runs into the following execution sequence? 1) node = numa_mem_id(); 2) memory hot-removal event triggers 2.1) remove affected memory 2.2) reset pgdat to zero if the node becomes empty after memory removal I'm sorry if I misunderstand something. After commit b0dc3a342af36f95a68fe229b8f0f73552c5ca08, there is no memset(). 3) alloc_pages_node(), which may access the zero-ed pgdat structure. ? I haven't found a mechanism to protect the system from the above sequence yet, so I've been puzzled for a long time already:(. Does stop_machine() protect the system from such an execution sequence? To access a pgdat, the pgdat's zones should be on the per-pgdat zonelist. Now, __build_all_zonelists() is called under stop_machine(). 
That's the reason you're asking what stop_machine() does. And, as you know, stop_machine() is not protecting anything. The caller may fall back into the removed zone. Then, let's think. At first, please note that "pgdat" is not removed (and cannot be removed), so accessing the pgdat's memory will not cause a segmentation fault. Just the contents are the problem. At removal, the zone's page-related information and the pgdat's page-related information are cleared. alloc_pages uses zonelist/zoneref/cache to walk each zone without accessing the pgdat itself. I think accessing the zonelist is safe because it's an array updated by stop_machine(). So, the question is whether alloc_pages() can work correctly even if a zone contains no pages. I think it should work. (Note: zones are included in the pgdat. So, zeroing the pgdat means zeroing the zones and other structures; that will not work.) So, what problem do you see now? I'm sorry I can't chase the old discussions. Thanks, -Kame
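Kame's point is that a stale numa_mem_id() is only a performance issue because the allocator walks the zonelist rather than dereferencing pgdat contents directly: if the preferred node's zone was emptied by hot-remove, the walk simply falls through to a fallback node. A small Python model of that argument; the two-node layout and the structures are illustrative, not kernel code:

```python
# Model of why the numa_mem_id() race is benign: allocation walks the
# per-node zonelist (rebuilt under stop_machine()) and falls past a zone
# that was emptied by hot-remove, instead of crashing on stale pgdat data.
class Zone:
    def __init__(self, node, free_pages):
        self.node = node
        self.free_pages = free_pages

def alloc_pages_node(zonelists, nid):
    # zonelists[nid] prefers zones of nid but includes other nodes as
    # fallback, like the kernel's node zonelists.
    for zone in zonelists[nid]:
        if zone.free_pages > 0:
            zone.free_pages -= 1
            return zone.node          # node the page actually came from
    return None

z0, z1 = Zone(0, 4), Zone(1, 4)
zonelists = {0: [z0, z1], 1: [z1, z0]}

nid = 0               # racy numa_mem_id() read happens here...
z0.free_pages = 0     # ...then node 0's memory is hot-removed
```

The subsequent `alloc_pages_node(zonelists, nid)` still succeeds, just on node 1: slightly suboptimal placement, exactly the best-effort behavior described above.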
Re: [PATCH 03/11] x86/mm/hotplug: Don't remove PGD entries in remove_pagetable()
On 2015/09/22 15:23, Ingo Molnar wrote: > So when memory hotplug removes a piece of physical memory from pagetable > mappings, it also frees the underlying PGD entry. > > This complicates PGD management, so don't do this. We can keep the > PGD mapped and the PUD table all clear - it's only a single 4K page > per 512 GB of memory hotplugged. > > Cc: Andrew Morton > Cc: Andy Lutomirski > Cc: Borislav Petkov > Cc: Brian Gerst > Cc: Denys Vlasenko > Cc: H. Peter Anvin > Cc: Linus Torvalds > Cc: Oleg Nesterov > Cc: Peter Zijlstra > Cc: Rik van Riel > Cc: Thomas Gleixner > Cc: Waiman Long > Cc: linux...@kvack.org > Signed-off-by: Ingo Molnar Ishimatsu-san, Tang-san, please check. Doesn't this patch affect the issues of 5255e0a79fcc0ff47b387af92bd9ef5729b1b859 9661d5bcd058fe15b4138a00d96bd36516134543 ? -Kame > --- > arch/x86/mm/init_64.c | 27 --- > 1 file changed, 27 deletions(-) > > diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c > index 7129e7647a76..60b0cc3f2819 100644 > --- a/arch/x86/mm/init_64.c > +++ b/arch/x86/mm/init_64.c > @@ -780,27 +780,6 @@ static void __meminit free_pmd_table(pmd_t *pmd_start, > pud_t *pud) > spin_unlock(&init_mm.page_table_lock); > } > > -/* Return true if pgd is changed, otherwise return false. 
*/ > -static bool __meminit free_pud_table(pud_t *pud_start, pgd_t *pgd) > -{ > - pud_t *pud; > - int i; > - > - for (i = 0; i < PTRS_PER_PUD; i++) { > - pud = pud_start + i; > - if (pud_val(*pud)) > - return false; > - } > - > - /* free a pud table */ > - free_pagetable(pgd_page(*pgd), 0); > - spin_lock(&init_mm.page_table_lock); > - pgd_clear(pgd); > - spin_unlock(&init_mm.page_table_lock); > - > - return true; > -} > - > static void __meminit > remove_pte_table(pte_t *pte_start, unsigned long addr, unsigned long end, >bool direct) > @@ -992,7 +971,6 @@ remove_pagetable(unsigned long start, unsigned long end, > bool direct) > unsigned long addr; > pgd_t *pgd; > pud_t *pud; > - bool pgd_changed = false; > > for (addr = start; addr < end; addr = next) { > next = pgd_addr_end(addr, end); > @@ -1003,13 +981,8 @@ remove_pagetable(unsigned long start, unsigned long > end, bool direct) > > pud = (pud_t *)pgd_page_vaddr(*pgd); > remove_pud_table(pud, addr, next, direct); > - if (free_pud_table(pud, pgd)) > - pgd_changed = true; > } > > - if (pgd_changed) > - sync_global_pgds(start, end - 1, 1); > - > flush_tlb_all(); > } > >
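The "single 4K page per 512 GB of memory hotplugged" figure in the changelog follows directly from x86-64 4-level page-table geometry; a quick arithmetic check (pure calculation, no kernel code):

```python
# x86-64 4-level paging: each page table is one 4KiB page holding
# 512 eight-byte entries.
PAGE_SIZE = 4096
ENTRY_SIZE = 8
PTRS_PER_TABLE = PAGE_SIZE // ENTRY_SIZE          # 512 entries per table

PUD_ENTRY_COVERAGE = 1 << 30                      # one PUD entry maps 1GiB
PGD_ENTRY_COVERAGE = PTRS_PER_TABLE * PUD_ENTRY_COVERAGE  # 512GiB per PGD entry

# Keeping the (now empty) PUD table mapped, as this patch does, costs one
# 4KiB page per 512GiB of removed memory:
cost_bytes_per_gib = PAGE_SIZE / (PGD_ENTRY_COVERAGE >> 30)
```

So the trade the patch makes is 8 bytes of retained page-table memory per GiB removed, in exchange for never having to clear PGD entries or call sync_global_pgds().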
Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy
On 2015/08/25 8:15, Paul Turner wrote: On Mon, Aug 24, 2015 at 3:49 PM, Tejun Heo wrote: Hello, On Mon, Aug 24, 2015 at 03:03:05PM -0700, Paul Turner wrote: Hmm... I was hoping for an actual configurations and usage scenarios. Preferably something people can set up and play with. This is much easier to set up and play with synthetically. Just create the 10 threads and 100 threads above then experiment with configurations designed at guaranteeing the set of 100 threads relatively uniform throughput regardless of how many are active. I don't think trying to run a VM stack adds anything except complexity of reproduction here. Well, but that loses most of details and why such use cases matter to begin with. We can imagine up stuff to induce arbitrary set of requirements. All that's being proved or disproved here is that it's difficult to coordinate the consumption of asymmetric thread pools using nice. The constraints here were drawn from a real-world example. I take that the CPU intensive helper threads are usually IO workers? Is the scenario where the VM is set up with a lot of IO devices and different ones may consume large amount of CPU cycles at any given point? Yes, generally speaking there are a few major classes of IO (flash, disk, network) that a guest may invoke. Each of these backends is separate and chooses its own threading. Hmmm... if that's the case, would limiting iops on those IO devices (or classes of them) work? qemu already implements IO limit mechanism after all. No. 1) They should proceed at the maximum rate that they can that's still within their provisioning budget. 2) The cost/IO is both inconsistent and changes over time. Attempting to micro-optimize every backend for this is infeasible, this is exactly the type of problem that the scheduler can usefully help arbitrate. 3) Even pretending (2) is fixable, dynamically dividing these right-to-work tokens between different I/O device backends is extremely complex. 
I think I should explain my customer's use case of qemu + cpuset/cpu (via libvirt) (1) Isolating hypervisor threads. As already discussed, hypervisor threads are isolated by cpuset. But their purpose is to avoid _latency_ spikes caused by hypervisor behavior. So, "nice" cannot be the solution, as already discussed. (2) Fixed-rate vcpu service. Using the cpu controller's quota/period feature, my customer creates vcpu models like Low(1GHz), Mid(2GHz), High(3GHz) for an IaaS system. To do this, each vcpu should be quota-limited independently, with per-thread cpu control. Especially, method (1) is used by several enterprise customers for stabilizing their systems. Sub-process control should be provided in some way. Thanks, -Kame Anyways, a point here is that threads of the same process competing isn't a new problem. There are many ways to make those threads play nice as the application itself often has to be involved anyway, especially for something like qemu which is heavily involved in provisioning resources. It's certainly not a new problem, but it's a real one, and it's _hard_. You're proposing removing the best known solution. cgroups can be a nice brute-force add-on which lets sysadmins do wild things but it's inherently hacky and incomplete for coordinating threads. For example, what is it gonna do if qemu cloned vcpus and IO helpers dynamically off of the same parent thread? We're talking about sub-process usage here. This is the application coordinating itself, NOT the sysadmin. Processes are becoming larger and larger, we need many of the same controls within them that we have between them. It requires the application's cooperation anyway, but at the same time it is painful to actually interact with from those applications. As discussed elsewhere on thread this is really not a problem if you define consistent rules with respect to which parts are managed by whom. 
The argument of potential interference is no different to messing with an application's on-disk configuration behind its back. Alternate strawmen which greatly improve this from where we are today have also been proposed. Thanks. -- tejun
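The fixed-rate vcpu models Kame describes (Low/Mid/High) map onto the cpu controller's cfs_quota_us/cfs_period_us pair: a vcpu sold as X GHz on host cores of H GHz gets a quota of period * X / H. A sketch of that sizing arithmetic; the 100ms period and 3GHz host frequency are assumptions for illustration, and the helper name is made up:

```python
# cpu.cfs_quota_us sizing for a "fixed rate" vcpu: the vcpu's thread may
# run quota microseconds out of every period microseconds, which caps it
# at the advertised fraction of one host core.
def vcpu_quota_us(period_us, vcpu_ghz, host_ghz):
    return int(period_us * vcpu_ghz / host_ghz)

PERIOD_US = 100000     # assumed cfs period (100ms)
HOST_GHZ = 3.0         # assumed host core frequency

low = vcpu_quota_us(PERIOD_US, 1.0, HOST_GHZ)   # "Low(1GHz)" model
mid = vcpu_quota_us(PERIOD_US, 2.0, HOST_GHZ)   # "Mid(2GHz)" model
```

Applying this per vcpu requires placing each vcpu thread in its own cpu cgroup, which is exactly the sub-process (per-thread) control the thread argues over.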
Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy
On 2015/08/19 5:31, Tejun Heo wrote: Hello, Paul. On Mon, Aug 17, 2015 at 09:03:30PM -0700, Paul Turner wrote: 2) Control within an address-space. For subsystems with fungible resources, e.g. CPU, it can be useful for an address space to partition its own threads. Losing the capability to do this against the CPU controller would be a large set-back for instance. Occasionally, it is useful to share these groupings between address spaces when processes are cooperative, but this is less of a requirement. This is important to us. Sure, let's build a proper interface for that. Do you actually need sub-hierarchy inside a process? Can you describe your use case in detail and why having hierarchical CPU cycle distribution is essential for your use case? An actual per-thread use case among our customers is qemu-kvm + cpuset. Customers pin each vcpu and qemu-kvm's worker threads to cpus. For example, pinning 4 vcpus to cpus 2-6 and pinning the qemu main thread and others (vhost) to cpus 0-1. This is an actual kvm tuning used by our customers for performance guarantees. In another case, the cpu cgroup's throttling feature is used per vcpu for vm cpu sizing. Thanks, -Kame
Re: [PATCH 0/3] Make workingset detection logic memcg aware
On 2015/08/10 17:14, Vladimir Davydov wrote: On Sun, Aug 09, 2015 at 11:12:25PM +0900, Kamezawa Hiroyuki wrote: On 2015/08/08 22:05, Vladimir Davydov wrote: On Fri, Aug 07, 2015 at 10:38:16AM +0900, Kamezawa Hiroyuki wrote: ... All ? hmm. It seems that the mixture of records of global memory pressure and of local memory pressure is just wrong. What makes you think so? An example of misbehavior caused by this would be nice to have. By design, memcg's LRU aging logic is independent from global memory allocation/pressure. Assume there are 4 containers (using a lot of page cache) with a 1GB limit on a 4GB server, # container A workingset=600M limit=1G (sleepy) # container B workingset=300M limit=1G (work often) # container C workingset=500M limit=1G (work slowly) # container D workingset=1.2G limit=1G (work hard) container D can drive the zone's distance counter because of local memory reclaim. If active/inactive = 1:1, a container D page can be activated. When kswapd (global reclaim) runs, all containers' LRUs will rotate. The possibility of refault in A, B, C is reduced by container D's counter updates. This does not necessarily mean we have to use different inactive_age counters for global and local memory pressure. In your example, having inactive_age per lruvec and using it for evictions on both global and local memory pressure would work just fine. you're right. if (current memcg == recorded memcg && eviction distance is okay) activate page. else inactivate At page-out if (global memory pressure) record the eviction id using the zone's counter. else if (memcg local memory pressure) record the eviction id using the memcg's counter. I don't understand how this is supposed to work when a memory cgroup experiences both local and global pressure simultaneously. I think updating the global distance counter from local reclaim may update the counter too much. But if the inactive_age counter was per lruvec, then we wouldn't need to bother about it. yes. 
Anyway, what I understand now is that we need to reduce the influence of one memcg's behavior on other memcgs. Your way divides the counter completely; my idea was implementing a different counter. Doing it by calculation will be good because we can't have enough record space. Thanks, -Kame
Re: [PATCH 0/3] Make workingset detection logic memcg aware
On 2015/08/08 22:05, Vladimir Davydov wrote: On Fri, Aug 07, 2015 at 10:38:16AM +0900, Kamezawa Hiroyuki wrote: On 2015/08/06 17:59, Vladimir Davydov wrote: On Wed, Aug 05, 2015 at 10:34:58AM +0900, Kamezawa Hiroyuki wrote: I wonder, rather than collecting more data, a rough calculation can help the situation. for example, (refault_distance calculated in zone) * memcg_reclaim_ratio < memcg's active list If either the per-zone calc or the per-memcg calc returns true, refault should be true. memcg_reclaim_ratio is the percentage of scan in a memcg relative to the zone. This particular formula wouldn't work I'm afraid. If there are two isolated cgroups issuing local reclaim on the same zone, the refault distance needed for activation would be reduced by half for no apparent reason. Hmm, you mean activation in a memcg means activation in the global LRU, and it's not a valid reason. The current implementation does have the same issue, right? i.e. when a container has been hitting its limit for a while, and then a file cache is pushed out but comes back soon, it can be easily activated. I'd like to confirm what you want to do. 1) avoid activating a file cache when it was kicked out because of memcg's local limit. No, that's not what I want. I want pages of the workingset to get activated on refault no matter if they were evicted on global memory pressure or due to hitting a memory cgroup limit. Sure. 2) maintain the active/inactive ratio in a memcg properly as the global LRU does. 3) reclaim shadow entries at the proper timing. All ? hmm. It seems that the mixture of records of global memory pressure and of local memory pressure is just wrong. What makes you think so? An example of misbehavior caused by this would be nice to have. By design, memcg's LRU aging logic is independent from global memory allocation/pressure. 
Assume there are 4 containers (using a lot of page cache) with a 1GB limit on a 4GB server, # container A workingset=600M limit=1G (sleepy) # container B workingset=300M limit=1G (work often) # container C workingset=500M limit=1G (work slowly) # container D workingset=1.2G limit=1G (work hard) container D can drive the zone's distance counter because of local memory reclaim. If active/inactive = 1:1, a container D page can be activated. When kswapd (global reclaim) runs, all containers' LRUs will rotate. The possibility of refault in A, B, C is reduced by container D's counter updates. But yes, some _real_ tests are required. Now, the record is eviction | node | zone | 2bit. How about changing this to 0 |eviction | node | zone | 2bit 1 |eviction | memcgid| 2bit Assume each memcg has an eviction counter, which ignores node/zone. i.e. memcg local reclaim happens against the memcg, not against the zone. At page-in, if (the 1st bit is 0) compare the eviction counter with the zone's counter and activate the page if needed. else if (the 1st bit is 1) compare the eviction counter with the memcg's (if it exists) Having a single counter per memcg won't scale with the number of NUMA nodes. It doesn't matter; we can use a lazy counter like a percpu counter because it doesn't need to be very accurate. if (current memcg == recorded memcg && eviction distance is okay) activate page. else inactivate At page-out if (global memory pressure) record the eviction id using the zone's counter. else if (memcg local memory pressure) record the eviction id using the memcg's counter. I don't understand how this is supposed to work when a memory cgroup experiences both local and global pressure simultaneously. I think updating the global distance counter from local reclaim may update the counter too much. The above is to avoid updating the zone's counter and to keep the memcg's LRU active/inactive balanced. Also, what if a memory cgroup is protected by memory.low? Such a cgroup may have all its pages in the active list, because it is never scanned. 
If the LRU is never scanned, all file caches tend to be INACTIVE... they never refault. This will affect the refault distance of other cgroups, making activations unpredictable. Thanks, -Kame
Re: [PATCH 0/3] Make workingset detection logic memcg aware
On 2015/08/06 17:59, Vladimir Davydov wrote: On Wed, Aug 05, 2015 at 10:34:58AM +0900, Kamezawa Hiroyuki wrote: Reading the discussion, I feel storing more data is difficult, too. Yep, even with the current 16-bit memcg id. Things would get even worse if we wanted to extend it one day (will we?) I wonder, rather than collecting more data, a rough calculation can help the situation. for example, (refault_distance calculated in zone) * memcg_reclaim_ratio < memcg's active list If either the per-zone calc or the per-memcg calc returns true, refault should be true. memcg_reclaim_ratio is the percentage of scan in a memcg relative to the zone. This particular formula wouldn't work I'm afraid. If there are two isolated cgroups issuing local reclaim on the same zone, the refault distance needed for activation would be reduced by half for no apparent reason. Hmm, you mean activation in a memcg means activation in the global LRU, and it's not a valid reason. The current implementation does have the same issue, right? i.e. when a container has been hitting its limit for a while, and then a file cache is pushed out but comes back soon, it can be easily activated. I'd like to confirm what you want to do. 1) avoid activating a file cache when it was kicked out because of memcg's local limit. 2) maintain the active/inactive ratio in a memcg properly as the global LRU does. 3) reclaim shadow entries at the proper timing. All ? hmm. It seems that the mixture of records of global memory pressure and of local memory pressure is just wrong. Now, the record is eviction | node | zone | 2bit. How about changing this to 0 |eviction | node | zone | 2bit 1 |eviction | memcgid| 2bit Assume each memcg has an eviction counter, which ignores node/zone. i.e. memcg local reclaim happens against the memcg, not against the zone. At page-in, if (the 1st bit is 0) compare the eviction counter with the zone's counter and activate the page if needed. 
else if (the 1st bit is 1) compare the eviction counter with the memcg's (if it exists) if (current memcg == recorded memcg && eviction distance is okay) activate page. else inactivate At page-out if (global memory pressure) record the eviction id using the zone's counter. else if (memcg local memory pressure) record the eviction id using the memcg's counter. By this, 1) locally reclaimed pages cannot be activated unless they are refaulted in the same memcg. In this case, activating in the memcg has some meaning. 2) At global memory pressure, the distance is properly calculated based on global system status. Global memory pressure can ignore the memcg's behavior. About shadow entries: kmemcg should take care of them. Thanks, -Kame
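The tagged shadow-entry scheme sketched above amounts to bit packing with one type bit selecting between a (node, zone, zone-counter) record and a (memcgid, memcg-counter) record. A Python model of the pack/unpack step; the field widths are illustrative, not the kernel's actual radix-tree shadow-entry layout:

```python
# Two-flavor shadow entry from the proposal:
#   type bit 0: | eviction | node | zone |  (evicted by global reclaim)
#   type bit 1: | eviction | memcgid |      (evicted by memcg-local reclaim)
NODE_BITS, ZONE_BITS, MEMCG_BITS = 10, 2, 16   # illustrative widths

def pack_global(eviction, node, zone):
    return (((eviction << NODE_BITS | node) << ZONE_BITS | zone) << 1) | 0

def pack_memcg(eviction, memcg_id):
    return ((eviction << MEMCG_BITS | memcg_id) << 1) | 1

def unpack(entry):
    if entry & 1:                              # memcg-local eviction record
        entry >>= 1
        return ("memcg", entry >> MEMCG_BITS,
                entry & ((1 << MEMCG_BITS) - 1))
    entry >>= 1
    zone = entry & ((1 << ZONE_BITS) - 1)
    entry >>= ZONE_BITS
    return ("global", entry >> NODE_BITS,
            entry & ((1 << NODE_BITS) - 1), zone)
```

At refault, the type bit tells the kernel which eviction counter (zone's or memcg's) the stored distance should be compared against, which is what keeps the two kinds of pressure from polluting each other's records.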
Re: [PATCH v2 1/3] cgroup: define controller file conventions
On 2015/08/05 16:47, Michal Hocko wrote: On Wed 05-08-15 09:39:40, KAMEZAWA Hiroyuki wrote: [...] so, for memory controller, we'll have We currently have only current, low, high, max and events. All other knobs are either deprecated or waiting for a usecase to emerge before they get added. Sure. I think the following have users. - *.stat - for checking the health of a cgroup, or for debug - *.pressure_level - for notifying memory pressure - *.swappiness - for adjusting LRU activity per application type. - *.oom_control - for surviving/notifying out of memory memcg's oom can be recovered if the limit goes up, rather than killing. But I know people say this knob is not useful. This will require discussion. Hm. If we don't want to increase files, is NETLINK or a syscall another choice of subsystem-specific interface? -Kame (in alphabetical order) memory.failcnt memory.force_empty (<= should this be removed ?) memory.kmem.failcnt memory.kmem.max memory.kmem.max_usage memory.kmem.slabinfo memory.kmem.tcp.failcnt memory.kmem.tcp.max memory.kmem.tcp.max_usage memory.kmem.tcp.usage memory.kmem.usage memory.max memory.max_usage memory.move_charge_at_immigrate memory.numa_stat memory.oom_control memory.pressure_level memory.high memory.swappiness memory.usage memory.use_hierarchy (<= removed)
Re: [PATCH 0/3] Make workingset detection logic memcg aware
On 2015/08/03 21:04, Vladimir Davydov wrote: > Hi, > > Currently, workingset detection logic is not memcg aware - inactive_age > is maintained per zone. As a result, if memory cgroups are used, > refaulted file pages are activated randomly. This patch set makes > inactive_age per lruvec so that workingset detection will work correctly > for memory cgroup reclaim. > > Thanks, > Reading the discussion, I feel storing more data is difficult, too. I wonder, rather than collecting more data, a rough calculation can help the situation. for example, (refault_distance calculated in zone) * memcg_reclaim_ratio < memcg's active list If either the per-zone calc or the per-memcg calc returns true, refault should be true. memcg_reclaim_ratio is the percentage of scan in a memcg relative to the zone. Thanks, -Kame
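Kame's rough calculation can be written down directly. The scaled per-memcg test and the `memcg_reclaim_ratio` definition are taken from the mail above; the plain per-zone test mirrors the existing workingset check, and everything else is illustrative:

```python
# Activate a refaulted page if either the per-zone test or the scaled
# per-memcg test passes, per the proposed rough calculation.
def should_activate(refault_distance, zone_active, memcg_active,
                    memcg_reclaim_ratio):
    # memcg_reclaim_ratio: fraction of the zone's scan done by this memcg
    per_zone = refault_distance < zone_active
    per_memcg = refault_distance * memcg_reclaim_ratio < memcg_active
    return per_zone or per_memcg
```

Vladimir's objection maps onto the second term: two memcgs locally reclaiming the same zone each contribute to the zone's distance counter, so scaling by a per-memcg ratio can make the activation threshold drop for reasons unrelated to the refaulting cgroup.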
Re: [PATCH v2 1/3] cgroup: define controller file conventions
On 2015/08/05 4:31, Tejun Heo wrote: From 6abc8ca19df0078de17dc38340db3002ed489ce7 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Tue, 4 Aug 2015 15:20:55 -0400 Traditionally, each cgroup controller implemented whatever interface it wanted leading to interfaces which are widely inconsistent. Examining the requirements of the controllers readily yield that there are only a few control schemes shared among all. Two major controllers already had to implement new interface for the unified hierarchy due to significant structural changes. Let's take the chance to establish common conventions throughout all controllers. This patch defines CGROUP_WEIGHT_MIN/DFL/MAX to be used on all weight based control knobs and documents the conventions that controllers should follow on the unified hierarchy. Except for io.weight knob, all existing unified hierarchy knobs are already compliant. A follow-up patch will update io.weight. v2: Added descriptions of min, low and high knobs. Signed-off-by: Tejun Heo Acked-by: Johannes Weiner Cc: Li Zefan Cc: Peter Zijlstra --- Hello, Added low/high descriptions and applied to the following git branch. git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-4.3-unified-base The branch currently only contains this patch and will stay stable so that it can be pulled from. I kept the base weight as DFL for now. If we decide to change it, I'll apply the change on top. Thanks. Documentation/cgroups/unified-hierarchy.txt | 80 ++--- include/linux/cgroup.h | 9 2 files changed, 81 insertions(+), 8 deletions(-) diff --git a/Documentation/cgroups/unified-hierarchy.txt b/Documentation/cgroups/unified-hierarchy.txt index 86847a7..1ee9caf 100644 --- a/Documentation/cgroups/unified-hierarchy.txt +++ b/Documentation/cgroups/unified-hierarchy.txt @@ -23,10 +23,13 @@ CONTENTS 5. Other Changes 5-1. [Un]populated Notification 5-2. Other Core Changes - 5-3. Per-Controller Changes -5-3-1. blkio -5-3-2. cpuset -5-3-3. memory + 5-3. 
Controller File Conventions +5-3-1. Format +5-3-2. Control Knobs + 5-4. Per-Controller Changes +5-4-1. blkio +5-4-2. cpuset +5-4-3. memory 6. Planned Changes 6-1. CAP for resource control @@ -372,14 +375,75 @@ supported and the interface files "release_agent" and - The "cgroup.clone_children" file is removed. -5-3. Per-Controller Changes +5-3. Controller File Conventions -5-3-1. blkio +5-3-1. Format + +In general, all controller files should be in one of the following +formats whenever possible. + +- Values only files + + VAL0 VAL1...\n + +- Flat keyed files + + KEY0 VAL0\n + KEY1 VAL1\n + ... + +- Nested keyed files + + KEY0 SUB_KEY0=VAL00 SUB_KEY1=VAL01... + KEY1 SUB_KEY0=VAL10 SUB_KEY1=VAL11... + ... + +For a writeable file, the format for writing should generally match +reading; however, controllers may allow omitting later fields or +implement restricted shortcuts for most common use cases. + +For both flat and nested keyed files, only the values for a single key +can be written at a time. For nested keyed files, the sub key pairs +may be specified in any order and not all pairs have to be specified. + + +5-3-2. Control Knobs + +- Settings for a single feature should generally be implemented in a + single file. + +- In general, the root cgroup should be exempt from resource control + and thus shouldn't have resource control knobs. + +- If a controller implements ratio based resource distribution, the + control knob should be named "weight" and have the range [1, 10000] + and 100 should be the default value. The values are chosen to allow + enough and symmetric bias in both directions while keeping it + intuitive (the default is 100%). + +- If a controller implements an absolute resource guarantee and/or + limit, the control knobs should be named "min" and "max" + respectively. If a controller implements best effort resource + guarantee and/or limit, the control knobs should be named "low" and + "high" respectively. 
+ + In the above four control files, the special token "max" should be + used to represent upward infinity for both reading and writing. + so, for memory controller, we'll have (in alphabetical order) memory.failcnt memory.force_empty (<= should this be removed ?) memory.kmem.failcnt memory.kmem.max memory.kmem.max_usage memory.kmem.slabinfo memory.kmem.tcp.failcnt memory.kmem.tcp.max memory.kmem.tcp.max_usage memory.kmem.tcp.usage memory.kmem.usage memory.max memory.max_usage memory.move_charge_at_immigrate memory.numa_stat memory.oom_control memory.pressure_level memory.high memory.swappiness memory.usage memory.use_hierarchy (<= removed) ? -Kame -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC v2 PATCH 2/8] mm: introduce MIGRATE_MIRROR to manage the mirrored pages
On 2015/06/30 11:45, Xishi Qiu wrote: On 2015/6/29 15:32, Kamezawa Hiroyuki wrote: On 2015/06/27 11:24, Xishi Qiu wrote: This patch introduces a new migratetype called "MIGRATE_MIRROR", it is used to allocate mirrored pages. When cat /proc/pagetypeinfo, you can see the count of free mirrored blocks. Signed-off-by: Xishi Qiu My fear about this approach is that this may break something existing. Now, when we add the MIGRATE_MIRROR type, we'll hide the attributes of pageblocks such as MIGRATE_UNMOVABLE, MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE. Logically, the MIRROR attribute is independent from page mobility, and this overwriting will lose some information. Then, --- include/linux/mmzone.h | 9 + mm/page_alloc.c| 3 +++ mm/vmstat.c| 3 +++ 3 files changed, 15 insertions(+) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 54d74f6..54e891a 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -39,6 +39,9 @@ enum { MIGRATE_UNMOVABLE, MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE, +#ifdef CONFIG_MEMORY_MIRROR +MIGRATE_MIRROR, +#endif I think MIGRATE_MIRROR_UNMOVABLE, MIGRATE_MIRROR_RECLAIMABLE, MIGRATE_MIRROR_MOVABLE, <== adding this may need discussion. MIGRATE_MIRROR_RESERVED,<== reserved pages should be maintained per mirrored/unmirrored. Hi Kame, You mean add 3 or 4 new migratetypes? yes. But please check how large NR_MIGRATETYPE_BITS will be. I think this will not have a big impact on x86-64. should be added with the following fallback list. /* * MIRROR page range is defined by firmware at boot. The range is limited * and is used only for kernel memory mirroring. */ [MIGRATE_UNMOVABLE_MIRROR] = {MIGRATE_RECLAIMABLE_MIRROR, MIGRATE_RESERVE} [MIGRATE_RECLAIMABLE_MIRROR] = {MIGRATE_UNMOVABLE_MIRROR, MIGRATE_RESERVE} Why not like this: {MIGRATE_RECLAIMABLE_MIRROR, MIGRATE_MIRROR_RESERVED, MIGRATE_RESERVE} My mistake. 
[MIGRATE_UNMOVABLE_MIRROR] = {MIGRATE_RECLAIMABLE_MIRROR, MIGRATE_RESERVE_MIRROR} [MIGRATE_RECLAIMABLE_MIRROR] = {MIGRATE_UNMOVABLE_MIRROR, MIGRATE_RESERVE_MIRROR} was my intention. This means mirrored memory and unmirrored memory are separated completely. But this should affect kswapd or other memory reclaim logic. For example, kswapd stops when free pages exceed the high watermark, but the cases where mirrored/unmirrored pages are exhausted are not handled in this series. You need some extra checks in the memory reclaim logic if you go with the migration type. Then, we'll not lose the original information of "Reclaimable Pages". One problem here is whether we should have MIGRATE_RESERVE_MIRROR. If we never allow users to allocate mirrored memory, we should have MIGRATE_RESERVE_MIRROR. But it seems to require much more code change to do that. Creating a zone or adding an attribute to zones is another design choice. If we add a new zone, mirror_zone will span others, I'm worried this may have problems. Yes, that's a problem. And zone-id bits are a very limited resource. (But memory reclaim logic can be unchanged.) Anyway, I'd like to see your solution with the above changes first rather than adding zones. Thanks, -Kame 
Re: [RFC v2 PATCH 7/8] mm: add the buddy system interface
On 2015/06/30 10:31, Xishi Qiu wrote: On 2015/6/30 9:01, Kamezawa Hiroyuki wrote: On 2015/06/30 8:11, Luck, Tony wrote: @@ -814,7 +814,7 @@ int __init_memblock memblock_clear_hotplug(phys_addr_t base, phys_addr_t size) */ int __init_memblock memblock_mark_mirror(phys_addr_t base, phys_addr_t size) { -system_has_some_mirror = true; +static_key_slow_inc(&system_has_mirror); return memblock_setclr_flag(base, size, 1, MEMBLOCK_MIRROR); } This generates some WARN_ON noise when called from efi_find_mirror(): It seems jump_label_init() is called after memory initialization. (init/main.c::start_kernel()) So, it may be difficult to use the static_key functions for our purpose because kernel memory allocation may occur before jump_label is ready. Thanks, -Kame Hi Kame, How about like this? Use a static bool in bootmem, and use a jump label in the buddy system. This means we use two variables to do it. I think it can be done but it should be done in a separate patch with enough comments/changelog. Thanks, -Kame 
Re: [RFC v2 PATCH 7/8] mm: add the buddy system interface
On 2015/06/30 8:11, Luck, Tony wrote: @@ -814,7 +814,7 @@ int __init_memblock memblock_clear_hotplug(phys_addr_t base, phys_addr_t size) */ int __init_memblock memblock_mark_mirror(phys_addr_t base, phys_addr_t size) { - system_has_some_mirror = true; + static_key_slow_inc(&system_has_mirror); return memblock_setclr_flag(base, size, 1, MEMBLOCK_MIRROR); } This generates some WARN_ON noise when called from efi_find_mirror(): It seems jump_label_init() is called after memory initialization. (init/main.c::start_kernel()) So, it may be difficut to use static_key function for our purpose because kernel memory allocation may occur before jump_label is ready. Thanks, -Kame [0.00] e820: last_pfn = 0x7b800 max_arch_pfn = 0x4 [0.00] [ cut here ] [0.00] WARNING: CPU: 0 PID: 0 at kernel/jump_label.c:61 static_key_slow_inc+0x57/0xc0() [0.00] static_key_slow_inc used before call to jump_label_init [0.00] Modules linked in: [0.00] CPU: 0 PID: 0 Comm: swapper Not tainted 4.1.0 #4 [0.00] Hardware name: Intel Corporation BRICKLAND/BRICKLAND, BIOS BRHSXSD1.86B.0065.R01.1505011640 05/01/2015 [0.00] ee366a8dff38f745 81997d68 816683b4 [0.00] 81997dc0 81997da8 8107b0aa [0.00] 81d48822 81f281a0 4000 001fcb7a4000 [0.00] Call Trace: [0.00] [] dump_stack+0x45/0x57 [0.00] [] warn_slowpath_common+0x8a/0xc0 [0.00] [] warn_slowpath_fmt+0x55/0x70 [0.00] [] ? memblock_add_range+0x175/0x19e [0.00] [] static_key_slow_inc+0x57/0xc0 [0.00] [] memblock_mark_mirror+0x19/0x33 [0.00] [] efi_find_mirror+0x59/0xdd [0.00] [] setup_arch+0x642/0xccf [0.00] [] ? early_idt_handler_array+0x120/0x120 [0.00] [] ? printk+0x55/0x6b [0.00] [] ? early_idt_handler_array+0x120/0x120 [0.00] [] start_kernel+0xe8/0x4eb [0.00] [] ? early_idt_handler_array+0x120/0x120 [0.00] [] ? 
early_idt_handler_array+0x120/0x120 [0.00] [] x86_64_start_reservations+0x2a/0x2c [0.00] [] x86_64_start_kernel+0x14c/0x16f [0.00] ---[ end trace baa7fa0514e3bc58 ]--- [0.00] [ cut here ]
Re: [RFC v2 PATCH 4/8] mm: add mirrored memory to buddy system
On 2015/06/27 11:25, Xishi Qiu wrote: Before free bootmem, set mirrored pageblock's migratetype to MIGRATE_MIRROR, so they could free to buddy system's MIGRATE_MIRROR list. When set reserved memory, skip the mirrored memory. Signed-off-by: Xishi Qiu --- include/linux/memblock.h | 3 +++ mm/memblock.c| 21 + mm/nobootmem.c | 3 +++ mm/page_alloc.c | 3 +++ 4 files changed, 30 insertions(+) diff --git a/include/linux/memblock.h b/include/linux/memblock.h index 97f71ca..53be030 100644 --- a/include/linux/memblock.h +++ b/include/linux/memblock.h @@ -81,6 +81,9 @@ int memblock_mark_hotplug(phys_addr_t base, phys_addr_t size); int memblock_clear_hotplug(phys_addr_t base, phys_addr_t size); int memblock_mark_mirror(phys_addr_t base, phys_addr_t size); ulong choose_memblock_flags(void); +#ifdef CONFIG_MEMORY_MIRROR +void memblock_mark_migratemirror(void); +#endif /* Low level functions */ int memblock_add_range(struct memblock_type *type, diff --git a/mm/memblock.c b/mm/memblock.c index 7612876..0d0b210 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -19,6 +19,7 @@ #include #include #include +#include #include #include @@ -818,6 +819,26 @@ int __init_memblock memblock_mark_mirror(phys_addr_t base, phys_addr_t size) return memblock_setclr_flag(base, size, 1, MEMBLOCK_MIRROR); } +#ifdef CONFIG_MEMORY_MIRROR +void __init_memblock memblock_mark_migratemirror(void) +{ + unsigned long start_pfn, end_pfn, pfn; + int i, node; + struct page *page; + + printk(KERN_DEBUG "Mirrored memory:\n"); + for_each_mirror_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, + &node) { + printk(KERN_DEBUG " node %3d: [mem %#010llx-%#010llx]\n", + node, PFN_PHYS(start_pfn), PFN_PHYS(end_pfn) - 1); + for (pfn = start_pfn; pfn < end_pfn; + pfn += pageblock_nr_pages) { + page = pfn_to_page(pfn); + set_pageblock_migratetype(page, MIGRATE_MIRROR); + } + } +} +#endif /** * __next__mem_range - next function for for_each_free_mem_range() etc. 
diff --git a/mm/nobootmem.c b/mm/nobootmem.c index 5258386..31aa6d4 100644 --- a/mm/nobootmem.c +++ b/mm/nobootmem.c @@ -129,6 +129,9 @@ static unsigned long __init free_low_memory_core_early(void) u64 i; memblock_clear_hotplug(0, -1); +#ifdef CONFIG_MEMORY_MIRROR + memblock_mark_migratemirror(); +#endif for_each_free_mem_range(i, NUMA_NO_NODE, MEMBLOCK_NONE, &start, &end, NULL) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 6e4d79f..aea78a5 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4118,6 +4118,9 @@ static void setup_zone_migrate_reserve(struct zone *zone) block_migratetype = get_pageblock_migratetype(page); + if (is_migrate_mirror(block_migratetype)) + continue; + If mirrored area will not have reserved memory, this should break the page allocator's logic. I think both of mirrored and unmirrored range should have reserved area. Thanks, -Kame /* Only test what is necessary when the reserves are not met */ if (reserve > 0) { /* -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC v2 PATCH 2/8] mm: introduce MIGRATE_MIRROR to manage the mirrored pages
On 2015/06/27 11:24, Xishi Qiu wrote: This patch introduces a new migratetype called "MIGRATE_MIRROR", it is used to allocate mirrored pages. When cat /proc/pagetypeinfo, you can see the count of free mirrored blocks. Signed-off-by: Xishi Qiu My fear about this approach is that this may break something existing. Now, when we add the MIGRATE_MIRROR type, we'll hide the attributes of pageblocks such as MIGRATE_UNMOVABLE, MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE. Logically, the MIRROR attribute is independent from page mobility, and this overwriting will lose some information. Then, --- include/linux/mmzone.h | 9 + mm/page_alloc.c| 3 +++ mm/vmstat.c| 3 +++ 3 files changed, 15 insertions(+) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 54d74f6..54e891a 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -39,6 +39,9 @@ enum { MIGRATE_UNMOVABLE, MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE, +#ifdef CONFIG_MEMORY_MIRROR + MIGRATE_MIRROR, +#endif I think MIGRATE_MIRROR_UNMOVABLE, MIGRATE_MIRROR_RECLAIMABLE, MIGRATE_MIRROR_MOVABLE, <== adding this may need discussion. MIGRATE_MIRROR_RESERVED,<== reserved pages should be maintained per mirrored/unmirrored. should be added with the following fallback list. /* * MIRROR page range is defined by firmware at boot. The range is limited * and is used only for kernel memory mirroring. */ [MIGRATE_UNMOVABLE_MIRROR] = {MIGRATE_RECLAIMABLE_MIRROR, MIGRATE_RESERVE} [MIGRATE_RECLAIMABLE_MIRROR] = {MIGRATE_UNMOVABLE_MIRROR, MIGRATE_RESERVE} Then, we'll not lose the original information of "Reclaimable Pages". One problem here is whether we should have MIGRATE_RESERVE_MIRROR. If we never allow users to allocate mirrored memory, we should have MIGRATE_RESERVE_MIRROR. But it seems to require much more code change to do that. Creating a zone or adding an attribute to zones is another design choice. Anyway, your patch doesn't take care of the reserved memory calculation at this point. 
Please check setup_zone_migrate_reserve() That will be a problem. Thanks, -Kame MIGRATE_PCPTYPES, /* the number of types on the pcp lists */ MIGRATE_RESERVE = MIGRATE_PCPTYPES, #ifdef CONFIG_CMA @@ -69,6 +72,12 @@ enum { # define is_migrate_cma(migratetype) false #endif +#ifdef CONFIG_MEMORY_MIRROR +# define is_migrate_mirror(migratetype) unlikely((migratetype) == MIGRATE_MIRROR) +#else +# define is_migrate_mirror(migratetype) false +#endif + #define for_each_migratetype_order(order, type) \ for (order = 0; order < MAX_ORDER; order++) \ for (type = 0; type < MIGRATE_TYPES; type++) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index ebffa0e..6e4d79f 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3216,6 +3216,9 @@ static void show_migration_types(unsigned char type) [MIGRATE_UNMOVABLE] = 'U', [MIGRATE_RECLAIMABLE] = 'E', [MIGRATE_MOVABLE] = 'M', +#ifdef CONFIG_MEMORY_MIRROR + [MIGRATE_MIRROR]= 'O', +#endif [MIGRATE_RESERVE] = 'R', #ifdef CONFIG_CMA [MIGRATE_CMA] = 'C', diff --git a/mm/vmstat.c b/mm/vmstat.c index 4f5cd97..d0323e0 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -901,6 +901,9 @@ static char * const migratetype_names[MIGRATE_TYPES] = { "Unmovable", "Reclaimable", "Movable", +#ifdef CONFIG_MEMORY_MIRROR + "Mirror", +#endif "Reserve", #ifdef CONFIG_CMA "CMA", -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC v2 PATCH 1/8] mm: add a new config to manage the code
On 2015/06/27 11:23, Xishi Qiu wrote: This patch introduces a new config called "CONFIG_ACPI_MIRROR_MEMORY", and sets CONFIG_MEMORY_MIRROR off by default. Signed-off-by: Xishi Qiu --- mm/Kconfig | 8 1 file changed, 8 insertions(+) diff --git a/mm/Kconfig b/mm/Kconfig index 390214d..c40bb8b 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -200,6 +200,14 @@ config MEMORY_HOTREMOVE depends on MEMORY_HOTPLUG && ARCH_ENABLE_MEMORY_HOTREMOVE depends on MIGRATION +config MEMORY_MIRROR In the following patches, you use CONFIG_MEMORY_MIRROR. I think the name is too generic; besides, it depends on ACPI. But I'm not sure whether address based memory mirror is planned on other platforms. So, hmm. How about dividing the config into 2 parts like the attached ? (just an example) Thanks, -Kame >From 88213b0f76e2f603c5a38690cbd85a4df1e646ba Mon Sep 17 00:00:00 2001 From: KAMEZAWA Hiroyuki Date: Mon, 29 Jun 2015 15:35:47 +0900 Subject: [PATCH] add a new config option for memory mirror Add a new config option "CONFIG_MEMORY_MIRROR" for kernel assisted memory mirroring. In the UEFI 2.5 spec, address based memory mirror is defined and it allows the system to create a partial memory mirror. The feature guards important (kernel) memory by mirroring it, using the address based memory mirror. Now this depends on cpu architecture Haswell? Broadwell? --- arch/x86/Kconfig | 6 ++ mm/Kconfig | 9 + 2 files changed, 15 insertions(+) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index e33e01b..56f17df 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -596,6 +596,12 @@ config X86_SUPPORTS_MEMORY_FAILURE depends on X86_64 || !SPARSEMEM select ARCH_SUPPORTS_MEMORY_FAILURE +config X86_SUPPORTS_MEMORY_MIRROR + def_bool y + # UEFI 2.5 spec. 
address based memory mirror, supported only after XXX + depends on X86_64 && ARCH_SUPPORTS_MEMORY_FAILURE + select ARCH_MEMORY_MIRROR + config STA2X11 bool "STA2X11 Companion Chip Support" depends on X86_32_NON_STANDARD && PCI diff --git a/mm/Kconfig b/mm/Kconfig index b3a60ee..e14dc2d 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -200,6 +200,15 @@ config MEMORY_HOTREMOVE depends on MEMORY_HOTPLUG && ARCH_ENABLE_MEMORY_HOTREMOVE depends on MIGRATION +config MEMORY_MIRROR + bool "Address range mirroring support" + depends on ARCH_MEMORY_MIRROR + default n + help + This feature allows the kernel to assist address based memory + mirror supported by architecture/firmware, and places some types + of memory (especially kernel memory) into the mirrored range. + # # If we have space for more page flags then we can enable additional # optimizations and functionality. -- 1.9.3
Re: [RFC PATCH 10/12] mm: add the buddy system interface
On 2015/06/26 10:43, Xishi Qiu wrote: On 2015/6/26 7:54, Kamezawa Hiroyuki wrote: On 2015/06/25 18:44, Xishi Qiu wrote: On 2015/6/10 11:06, Kamezawa Hiroyuki wrote: On 2015/06/09 19:04, Xishi Qiu wrote: On 2015/6/9 15:12, Kamezawa Hiroyuki wrote: On 2015/06/04 22:04, Xishi Qiu wrote: Add the buddy system interface for address range mirroring feature. Allocate mirrored pages in MIGRATE_MIRROR list. If there are no mirrored pages left, use other types of pages. Signed-off-by: Xishi Qiu --- mm/page_alloc.c | 40 +++- 1 file changed, 39 insertions(+), 1 deletion(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index d4d2066..0fb55288 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -599,6 +599,26 @@ static inline bool is_mirror_pfn(unsigned long pfn) return false; } + +static inline bool change_to_mirror(gfp_t gfp_flags, int high_zoneidx) +{ +/* + * Do not alloc mirrored memory below 4G, because 0-4G is + * all mirrored by default, and the list is always empty. + */ +if (high_zoneidx < ZONE_NORMAL) +return false; + +/* Alloc mirrored memory for only kernel */ +if (gfp_flags & __GFP_MIRROR) +return true; GFP_KERNEL itself should imply mirror, I think. Hi Kame, How about like this: #define GFP_KERNEL (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_MIRROR) ? Hm, it cannot cover GFP_ATOMIC et al. I guess mirrored memory should be allocated if !__GFP_HIGHMEM or !__GFP_MOVABLE Hi Kame, Can we distinguish allocations from user or kernel only by GFP flags? Allocations from user and file caches are now *always* done with __GFP_MOVABLE. By this, pages will be allocated from the MIGRATE_MOVABLE migration type. The MOVABLE migration type means it can be the target for page compaction or memory-hot-remove. Thanks, -Kame So if we want all kernel memory allocated from mirror, how about a change like this? __alloc_pages_nodemask() gfpflags_to_migratetype() if (!(gfp_mask & __GFP_MOVABLE)) return MIGRATE_MIRROR Maybe using it with a jump label can reduce the performance impact. 
== static inline bool memory_mirror_enabled(void) { return static_key_false(&memory_mirror_enabled); } gfpflags_to_migratetype() if (memory_mirror_enabled()) { /* We want to mirror all unmovable pages */ if (!(gfp_mask & __GFP_MOVABLE)) return MIGRATE_MIRROR } == BTW, I think the current memory compaction code scans ranges of the MOVABLE migrate type. So, if you use a migration type other than MOVABLE for user pages, you may see page fragmentation. If you want to expand this MIRROR to user pages, please check mm/compaction.c Thanks, -Kame 
Re: [RFC PATCH 10/12] mm: add the buddy system interface
On 2015/06/25 18:44, Xishi Qiu wrote: On 2015/6/10 11:06, Kamezawa Hiroyuki wrote: On 2015/06/09 19:04, Xishi Qiu wrote: On 2015/6/9 15:12, Kamezawa Hiroyuki wrote: On 2015/06/04 22:04, Xishi Qiu wrote: Add the buddy system interface for address range mirroring feature. Allocate mirrored pages in MIGRATE_MIRROR list. If there is no mirrored pages left, use other types pages. Signed-off-by: Xishi Qiu --- mm/page_alloc.c | 40 +++- 1 file changed, 39 insertions(+), 1 deletion(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index d4d2066..0fb55288 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -599,6 +599,26 @@ static inline bool is_mirror_pfn(unsigned long pfn) return false; } + +static inline bool change_to_mirror(gfp_t gfp_flags, int high_zoneidx) +{ +/* + * Do not alloc mirrored memory below 4G, because 0-4G is + * all mirrored by default, and the list is always empty. + */ +if (high_zoneidx < ZONE_NORMAL) +return false; + +/* Alloc mirrored memory for only kernel */ +if (gfp_flags & __GFP_MIRROR) +return true; GFP_KERNEL itself should imply mirror, I think. Hi Kame, How about like this: #define GFP_KERNEL (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_MIRROR) ? Hm it cannot cover GFP_ATOMIC at el. I guess, mirrored memory should be allocated if !__GFP_HIGHMEM or !__GFP_MOVABLE Hi Kame, Can we distinguish allocations form user or kernel only by GFP flags? Allocation from user and file caches are now *always* done with __GFP_MOVABLE. By this, pages will be allocated from MIGRATE_MOVABLE migration type. MOVABLE migration type means it's can be the target for page compaction or memory-hot-remove. Thanks, -Kame -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH 10/12] mm: add the buddy system interface
On 2015/06/16 2:20, Luck, Tony wrote: On Mon, Jun 15, 2015 at 05:47:27PM +0900, Kamezawa Hiroyuki wrote: So, there are 3 ideas. (1) kernel only from MIRROR / user only from MOVABLE (Tony) (2) kernel only from MIRROR / user from MOVABLE + MIRROR(ASAP) (AKPM suggested) This makes use of the fact that MOVABLE memory is reclaimable, but Tony pointed out the memory reclaim can be critical for GFP_ATOMIC. (3) kernel only from MIRROR / user from MOVABLE, special user from MIRROR (Xishi) 2 implementation ideas. - creating a ZONE - creating a new allocation attribute I'm not convinced that we need some new structure in mm. Isn't it good to use ZONE_MOVABLE for not-mirrored memory ? Then, disable fallback from ZONE_MOVABLE -> ZONE_NORMAL for (1) and (3) We might need to rename it ... right now the memory hotplug people use ZONE_MOVABLE to indicate regions of physical memory that can be removed from the system. I'm wondering whether people will want systems that have both removable and mirrored areas? Then we have four attribute combinations: mirror=no removable=no - prefer to use for user, could use for kernel if we run out of mirror mirror=no removable=yes - can only be used for user (kernel allocation makes it not-removable) mirror=yes removable=no - use for kernel, possibly for special users if we define some interface mirror=yes removable=yes - must not use for kernel ... would have to give to user ... seems like a bad idea to configure a system this way Thank you for the clarification. I see the "mirror=no, removable=no" case may require a new name. IMHO, the value of Address-Based-Memory-Mirror is that users can protect their system's important functions without using full-memory mirror. So, I feel thinking about "mirror=no, removable=no" just makes our discussion/implementation complex without real user value. Shouldn't we start with just thinking about the 2 cases of mirror=no removable=yes mirror=yes removable=no ? And then, if the naming is a problem, an alias name can be added. 
Thanks, -Kame 
Re: [RFC PATCH 10/12] mm: add the buddy system interface
On 2015/06/11 5:40, Luck, Tony wrote: I guess mirrored memory should be allocated if !__GFP_HIGHMEM or !__GFP_MOVABLE HIGHMEM shouldn't matter - partial memory mirror only makes any sense on X86_64 systems ... 32-bit kernels don't even boot on systems with 64GB, and the minimum rational configuration for a machine that supports mirror is 128GB (4 cpu sockets * 2 memory controllers per socket * 4 channels per controller * 4GB DIMM ... leaving any channels empty likely leaves you short of memory bandwidth for these high core count processors). MOVABLE is mostly the opposite of MIRROR - we never want to fill a kernel allocation from a MOVABLE page. I want all kernel allocations to be from MIRROR. So, there are 3 ideas. (1) kernel only from MIRROR / user only from MOVABLE (Tony) (2) kernel only from MIRROR / user from MOVABLE + MIRROR(ASAP) (AKPM suggested) This makes use of the fact that MOVABLE memory is reclaimable, but Tony pointed out the memory reclaim can be critical for GFP_ATOMIC. (3) kernel only from MIRROR / user from MOVABLE, special user from MIRROR (Xishi) 2 implementation ideas. - creating a ZONE - creating a new allocation attribute I'm not convinced that we need some new structure in mm. Isn't it good to use ZONE_MOVABLE for not-mirrored memory ? Then, disable fallback from ZONE_MOVABLE -> ZONE_NORMAL for (1) and (3) Thanks, -Kame 
Re: [RFC PATCH 08/12] mm: use mirrorable to switch allocate mirrored memory
On 2015/06/09 19:09, Xishi Qiu wrote: On 2015/6/9 15:06, Kamezawa Hiroyuki wrote: On 2015/06/04 22:02, Xishi Qiu wrote: Add a new interface in path /proc/sys/vm/mirrorable. When set to 1, it means we should allocate mirrored memory for both user and kernel processes. Signed-off-by: Xishi Qiu I can't see why we need this switch. If this is set, will all GFP_HIGHUSER use mirrored memory ? Or will you add a special MMAP/madvise flag to use mirrored memory ? Hi Kame, Yes, MMAP/madvise -> add VM_MIRROR -> add GFP_MIRROR -> use MIGRATE_MIRROR list to alloc mirrored pages So users can use mirrored memory. What do you think? I see. Please explain it (your final plan) in the patch description or in the cover page of the patches. Thanks, -Kame 
Re: [RFC PATCH 01/12] mm: add a new config to manage the code
On 2015/06/09 19:10, Xishi Qiu wrote: On 2015/6/9 14:44, Kamezawa Hiroyuki wrote: On 2015/06/04 21:56, Xishi Qiu wrote: This patch introduces a new config called "CONFIG_ACPI_MIRROR_MEMORY", it is used to turn the feature on/off. Signed-off-by: Xishi Qiu --- mm/Kconfig | 8 1 file changed, 8 insertions(+) diff --git a/mm/Kconfig b/mm/Kconfig index 390214d..4f2a726 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -200,6 +200,14 @@ config MEMORY_HOTREMOVE depends on MEMORY_HOTPLUG && ARCH_ENABLE_MEMORY_HOTREMOVE depends on MIGRATION +config MEMORY_MIRROR +bool "Address range mirroring support" +depends on X86 && NUMA +default y +help + This feature depends on hardware and firmware support. + ACPI or EFI records the mirror info. default y...no runtime influence when the user doesn't use memory mirror ? It is a new feature, so how about like this: default y -> n? It's okay with me. But it's better to check the performance impact before merging because you modified core code of memory management. Thanks, -Kame 
Re: [RFC PATCH 10/12] mm: add the buddy system interface
On 2015/06/09 19:04, Xishi Qiu wrote: On 2015/6/9 15:12, Kamezawa Hiroyuki wrote: On 2015/06/04 22:04, Xishi Qiu wrote: Add the buddy system interface for address range mirroring feature. Allocate mirrored pages in MIGRATE_MIRROR list. If there are no mirrored pages left, use other types of pages. Signed-off-by: Xishi Qiu --- mm/page_alloc.c | 40 +++- 1 file changed, 39 insertions(+), 1 deletion(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index d4d2066..0fb55288 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -599,6 +599,26 @@ static inline bool is_mirror_pfn(unsigned long pfn) return false; } + +static inline bool change_to_mirror(gfp_t gfp_flags, int high_zoneidx) +{ +/* + * Do not alloc mirrored memory below 4G, because 0-4G is + * all mirrored by default, and the list is always empty. + */ +if (high_zoneidx < ZONE_NORMAL) +return false; + +/* Alloc mirrored memory for only kernel */ +if (gfp_flags & __GFP_MIRROR) +return true; GFP_KERNEL itself should imply mirror, I think. Hi Kame, How about like this: #define GFP_KERNEL (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_MIRROR) ? Hm, it cannot cover GFP_ATOMIC et al. I guess mirrored memory should be allocated if !__GFP_HIGHMEM or !__GFP_MOVABLE thanks, -Kame Thanks, -Kame 
Re: [RFC PATCH 10/12] mm: add the buddy system interface
On 2015/06/04 22:04, Xishi Qiu wrote: Add the buddy system interface for address range mirroring feature. Allocate mirrored pages in MIGRATE_MIRROR list. If there is no mirrored pages left, use other types pages. Signed-off-by: Xishi Qiu --- mm/page_alloc.c | 40 +++- 1 file changed, 39 insertions(+), 1 deletion(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index d4d2066..0fb55288 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -599,6 +599,26 @@ static inline bool is_mirror_pfn(unsigned long pfn) return false; } + +static inline bool change_to_mirror(gfp_t gfp_flags, int high_zoneidx) +{ + /* +* Do not alloc mirrored memory below 4G, because 0-4G is +* all mirrored by default, and the list is always empty. +*/ + if (high_zoneidx < ZONE_NORMAL) + return false; + + /* Alloc mirrored memory for only kernel */ + if (gfp_flags & __GFP_MIRROR) + return true; GFP_KERNEL itself should imply mirror, I think. + + /* Alloc mirrored memory for both user and kernel */ + if (sysctl_mirrorable) + return true; Reading this, I think this sysctl is not good. The user cannot know what is mirrored because memory may not be mirrored until the sysctl is set. 
Thanks, -Kame + + return false; +} #endif /* @@ -1796,7 +1816,10 @@ struct page *buffered_rmqueue(struct zone *preferred_zone, WARN_ON_ONCE(order > 1); } spin_lock_irqsave(&zone->lock, flags); - page = __rmqueue(zone, order, migratetype); + if (is_migrate_mirror(migratetype)) + page = __rmqueue_smallest(zone, order, migratetype); + else + page = __rmqueue(zone, order, migratetype); spin_unlock(&zone->lock); if (!page) goto failed; @@ -2928,6 +2951,11 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, if (IS_ENABLED(CONFIG_CMA) && ac.migratetype == MIGRATE_MOVABLE) alloc_flags |= ALLOC_CMA; +#ifdef CONFIG_MEMORY_MIRROR + if (change_to_mirror(gfp_mask, ac.high_zoneidx)) + ac.migratetype = MIGRATE_MIRROR; +#endif + retry_cpuset: cpuset_mems_cookie = read_mems_allowed_begin(); @@ -2943,9 +2971,19 @@ retry_cpuset: /* First allocation attempt */ alloc_mask = gfp_mask|__GFP_HARDWALL; +retry: page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac); if (unlikely(!page)) { /* +* If there is no mirrored memory, we will alloc other +* types memory. +*/ + if (is_migrate_mirror(ac.migratetype)) { + ac.migratetype = gfpflags_to_migratetype(gfp_mask); + goto retry; + } + + /* * Runtime PM, block IO and its error handling path * can deadlock because I/O on the device might not * complete. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH 08/12] mm: use mirrorable to switch allocate mirrored memory
On 2015/06/04 22:02, Xishi Qiu wrote: Add a new interface in path /proc/sys/vm/mirrorable. When set to 1, it means we should allocate mirrored memory for both user and kernel processes. Signed-off-by: Xishi Qiu I can't see why do we need this switch. If this is set, all GFP_HIGHUSER will use mirrored memory ? Or will you add special MMAP/madvise flag to use mirrored memory ? Thanks, -Kame --- include/linux/mmzone.h | 1 + kernel/sysctl.c| 9 + mm/page_alloc.c| 1 + 3 files changed, 11 insertions(+) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index f82e3ae..20888dd 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -85,6 +85,7 @@ struct mirror_info { }; extern struct mirror_info mirror_info; +extern int sysctl_mirrorable; # define is_migrate_mirror(migratetype) unlikely((migratetype) == MIGRATE_MIRROR) #else # define is_migrate_mirror(migratetype) false diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 2082b1a..dc2625e 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -1514,6 +1514,15 @@ static struct ctl_table vm_table[] = { .extra2 = &one, }, #endif +#ifdef CONFIG_MEMORY_MIRROR + { + .procname = "mirrorable", + .data = &sysctl_mirrorable, + .maxlen = sizeof(sysctl_mirrorable), + .mode = 0644, + .proc_handler = proc_dointvec_minmax, + }, +#endif { .procname = "user_reserve_kbytes", .data = &sysctl_user_reserve_kbytes, diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 249a8f6..63b90ca 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -212,6 +212,7 @@ int user_min_free_kbytes = -1; #ifdef CONFIG_MEMORY_MIRROR struct mirror_info mirror_info; +int sysctl_mirrorable = 0; #endif static unsigned long __meminitdata nr_kernel_pages; -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH 07/12] mm: introduce __GFP_MIRROR to allocate mirrored pages
On 2015/06/04 22:02, Xishi Qiu wrote: This patch introduces a new gfp flag called "__GFP_MIRROR", it is used to allocate mirrored pages through buddy system. Signed-off-by: Xishi Qiu In Tony's original proposal, the motivation was to mirror all kernel memory. Is the purpose of this patch making mirrored range available for user space ? But, hmm... I don't think adding a new GFP flag is a good idea. It adds many conditional jumps. How about using GFP_KERNEL for user memory if the user wants mirrored memory with mirroring all kernel memory? Thanks, -Kame --- include/linux/gfp.h | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/include/linux/gfp.h b/include/linux/gfp.h index 15928f0..89d0091 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -35,6 +35,7 @@ struct vm_area_struct; #define ___GFP_NO_KSWAPD 0x40u #define ___GFP_OTHER_NODE 0x80u #define ___GFP_WRITE 0x100u +#define ___GFP_MIRROR 0x200u /* If the above are modified, __GFP_BITS_SHIFT may need updating */ /* @@ -95,13 +96,15 @@ struct vm_area_struct; #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */ #define __GFP_WRITE ((__force gfp_t)___GFP_WRITE) /* Allocator intends to dirty page */ +#define __GFP_MIRROR ((__force gfp_t)___GFP_MIRROR) /* Allocate mirrored memory */ + /* * This may seem redundant, but it's a way of annotating false positives vs. * allocations that simply cannot be supported (e.g. page tables). 
*/ #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK) -#define __GFP_BITS_SHIFT 25/* Room for N __GFP_FOO bits */ +#define __GFP_BITS_SHIFT 26/* Room for N __GFP_FOO bits */ #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1)) /* This equals 0, but use constants in case they ever change */
Re: [RFC PATCH 02/12] mm: introduce mirror_info
On 2015/06/04 21:57, Xishi Qiu wrote: This patch introduces a new struct called "mirror_info", it is used to storage the mirror address range which reported by EFI or ACPI. TBD: call add_mirror_info() to fill it. Signed-off-by: Xishi Qiu --- arch/x86/mm/numa.c | 3 +++ include/linux/mmzone.h | 15 +++ mm/page_alloc.c| 33 + 3 files changed, 51 insertions(+) diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c index 4053bb5..781fd68 100644 --- a/arch/x86/mm/numa.c +++ b/arch/x86/mm/numa.c @@ -619,6 +619,9 @@ static int __init numa_init(int (*init_func)(void)) /* In case that parsing SRAT failed. */ WARN_ON(memblock_clear_hotplug(0, ULLONG_MAX)); numa_reset_distance(); +#ifdef CONFIG_MEMORY_MIRROR + memset(&mirror_info, 0, sizeof(mirror_info)); +#endif ret = init_func(); if (ret < 0) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 54d74f6..1fae07b 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -69,6 +69,21 @@ enum { # define is_migrate_cma(migratetype) false #endif +#ifdef CONFIG_MEMORY_MIRROR +struct numa_mirror_info { + int node; + unsigned long start; + unsigned long size; +}; + +struct mirror_info { + int count; + struct numa_mirror_info info[MAX_NUMNODES]; +}; MAX_NUMNODE may not be enough when the firmware cannot use contiguous address for mirroing. + +extern struct mirror_info mirror_info; +#endif If this structure will not be updated after boot, read_mostly should be helpful. 
+ #define for_each_migratetype_order(order, type) \ for (order = 0; order < MAX_ORDER; order++) \ for (type = 0; type < MIGRATE_TYPES; type++) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index ebffa0e..41a95a7 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -210,6 +210,10 @@ static char * const zone_names[MAX_NR_ZONES] = { int min_free_kbytes = 1024; int user_min_free_kbytes = -1; +#ifdef CONFIG_MEMORY_MIRROR +struct mirror_info mirror_info; +#endif + static unsigned long __meminitdata nr_kernel_pages; static unsigned long __meminitdata nr_all_pages; static unsigned long __meminitdata dma_reserve; @@ -545,6 +549,31 @@ static inline int page_is_buddy(struct page *page, struct page *buddy, return 0; } +#ifdef CONFIG_MEMORY_MIRROR +static void __init add_mirror_info(int node, + unsigned long start, unsigned long size) +{ + mirror_info.info[mirror_info.count].node = node; + mirror_info.info[mirror_info.count].start = start; + mirror_info.info[mirror_info.count].size = size; + + mirror_info.count++; +} + +static void __init print_mirror_info(void) +{ + int i; + + printk("Mirror info\n"); + for (i = 0; i < mirror_info.count; i++) + printk(" node %3d: [mem %#010lx-%#010lx]\n", + mirror_info.info[i].node, + mirror_info.info[i].start, + mirror_info.info[i].start + + mirror_info.info[i].size - 1); +} +#endif + /* * Freeing function for a buddy system allocator. * @@ -5438,6 +5467,10 @@ void __init free_area_init_nodes(unsigned long *max_zone_pfn) (u64)zone_movable_pfn[i] << PAGE_SHIFT); } +#ifdef CONFIG_MEMORY_MIRROR + print_mirror_info(); +#endif + /* Print out the early node map */ pr_info("Early memory node ranges\n"); for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH 01/12] mm: add a new config to manage the code
On 2015/06/04 21:56, Xishi Qiu wrote: This patch introduces a new config called "CONFIG_ACPI_MIRROR_MEMORY", it is used to on/off the feature. Signed-off-by: Xishi Qiu --- mm/Kconfig | 8 1 file changed, 8 insertions(+) diff --git a/mm/Kconfig b/mm/Kconfig index 390214d..4f2a726 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -200,6 +200,14 @@ config MEMORY_HOTREMOVE depends on MEMORY_HOTPLUG && ARCH_ENABLE_MEMORY_HOTREMOVE depends on MIGRATION +config MEMORY_MIRROR + bool "Address range mirroring support" + depends on X86 && NUMA + default y + help + This feature depends on hardware and firmware support. + ACPI or EFI records the mirror info. default y... is there really no runtime influence when the user doesn't use memory mirroring? Thanks, -Kame
Re: [RFC PATCH 03/12] mm: introduce MIGRATE_MIRROR to manage the mirrored pages
On 2015/06/04 21:58, Xishi Qiu wrote: This patch introduces a new MIGRATE_TYPES called "MIGRATE_MIRROR", it is used to storage the mirrored pages list. When cat /proc/pagetypeinfo, you can see the count of free mirrored blocks. I guess you need to add Mel to CC. e.g. euler-linux:~ # cat /proc/pagetypeinfo Page block order: 9 Pages per block: 512 Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10 Node0, zone DMA, typeUnmovable 1 1 0 0 2 1 1 0 1 0 0 Node0, zone DMA, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0 Node0, zone DMA, type Movable 0 0 0 0 0 0 0 0 0 0 3 Node0, zone DMA, type Mirror 0 0 0 0 0 0 0 0 0 0 0 Node0, zone DMA, type Reserve 0 0 0 0 0 0 0 0 0 1 0 Node0, zone DMA, type Isolate 0 0 0 0 0 0 0 0 0 0 0 Node0, zoneDMA32, typeUnmovable 0 0 1 0 0 0 0 1 1 1 0 Node0, zoneDMA32, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0 Node0, zoneDMA32, type Movable 1 2 6 6 6 4 5 3 3 2738 Node0, zoneDMA32, type Mirror 0 0 0 0 0 0 0 0 0 0 0 Node0, zoneDMA32, type Reserve 0 0 0 0 0 0 0 0 0 0 1 Node0, zoneDMA32, type Isolate 0 0 0 0 0 0 0 0 0 0 0 Node0, zone Normal, typeUnmovable 0 0 0 0 0 0 0 0 0 0 0 Node0, zone Normal, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0 Node0, zone Normal, type Movable 0 0 1 1 0 0 0 2 1 0 4254 Node0, zone Normal, type Mirror148104 63 70 26 11 2 2 1 1973 Node0, zone Normal, type Reserve 0 0 0 0 0 0 0 0 0 0 1 Node0, zone Normal, type Isolate 0 0 0 0 0 0 0 0 0 0 0 Number of blocks type Unmovable Reclaimable Movable Mirror Reserve Isolate Node 0, zone DMA1060 10 Node 0, zoneDMA3220 15250 10 Node 0, zone Normal00 8702 2048 20 Page block order: 9 Pages per block: 512 Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10 Node1, zone Normal, typeUnmovable 0 0 0 0 0 0 0 0 0 0 0 Node1, zone Normal, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0 Node1, zone Normal, type Movable 2 2 1 1 2 1 2 2 2 3 3996 Node1, zone Normal, type Mirror 68 94 57 6 8 1 0 0 3 1 2003 Node1, zone Normal, type Reserve 0 0 0 0 0 0 0 0 0 0 1 Node1, zone Normal, type Isolate 0 0 0 0 0 0 
0 0 0 0 0 Number of blocks type Unmovable Reclaimable Movable Mirror Reserve Isolate Node 1, zone Normal00 8190 4096 20 Signed-off-by: Xishi Qiu --- include/linux/mmzone.h | 6 ++ mm/page_alloc.c| 3 +++ mm/vmstat.c| 3 +++ 3 files changed, 12 insertions(+) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 1fae07b..b444335 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -39,6 +39,9 @@ enum { MIGRATE_UNMOVABLE, MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE, +#ifdef CONFIG_MEMORY_MIRROR + MIGRATE_MIRROR, +#endif I can't imagine how the fallback logic will work at reading this patch. I think an update for fallback order array should be in this patch... MIGRATE_PCPTYPES, /* the number of types on the pcp lists */ MIGRATE_RESERVE = MIGRATE_PCPTYPES, #ifdef CONFIG_CMA @@ -82,6 +85,9 @@ struct mirror_info {
Re: [RESEND RFC PATCH 2/2] gfp: use the best near online node if the target node is offline
On 2015/04/25 5:01, Andrew Morton wrote: On Fri, 24 Apr 2015 17:58:33 +0800 Gu Zheng wrote: Since the change to the cpu <--> mapping (map the cpu to the physical node for all possible at the boot), the node of cpu may be not present, so we use the best near online node if the node is not present in the low level allocation APIs. ... --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -298,9 +298,31 @@ __alloc_pages(gfp_t gfp_mask, unsigned int order, return __alloc_pages_nodemask(gfp_mask, order, zonelist, NULL); } +static int find_near_online_node(int node) +{ + int n, val; + int min_val = INT_MAX; + int best_node = -1; + + for_each_online_node(n) { + val = node_distance(node, n); + + if (val < min_val) { + min_val = val; + best_node = n; + } + } + + return best_node; +} This should be `inline' if it's in a header file. But it is far too large to be inlined anyway - please move it to a .c file. And please document it. A critical thing to describe is how we determine whether a node is "near". There are presumably multiple ways in which we could decide that a node is "near" (number of hops, minimum latency, ...). Which one did you choose, and why? static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order) { + /* Offline node, use the best near online node */ + if (!node_online(nid)) + nid = find_near_online_node(nid); + /* Unknown node is current node */ if (nid < 0) nid = numa_node_id(); @@ -311,7 +333,11 @@ static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask, static inline struct page *alloc_pages_exact_node(int nid, gfp_t gfp_mask, unsigned int order) { - VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid)); + /* Offline node, use the best near online node */ + if (!node_online(nid)) + nid = find_near_online_node(nid); In above VM_BUG_ON(), !node_online(nid) is the bug. + + VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES); return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask)); } Ouch. 
These functions are called very frequently, and adding overhead to them is a big deal. And the patch even adds overhead to non-x86 architectures which don't benefit from it! Is there no way this problem can be fixed somewhere else? Preferably by fixing things up at hotplug time. I agree. The results should be cached; if necessary, in a per-cpu line. Thanks, -Kame
Re: [PATCH 0/2] workqueue: fix a bug when numa mapping is changed
On 2015/04/02 10:36, Gu Zheng wrote: Hi Kame, TJ, On 04/01/2015 04:30 PM, Kamezawa Hiroyuki wrote: On 2015/04/01 12:02, Tejun Heo wrote: On Wed, Apr 01, 2015 at 11:55:11AM +0900, Kamezawa Hiroyuki wrote: Now, hot-added cpus will have the lowest free cpu id. Because of this, in most of systems which has only cpu-hot-add, cpu-ids are always contiguous even after cpu hot add. In enterprise, this would be considered as imcompatibility. determining cpuid <-> lapicid at boot will make cpuids sparse. That may corrupt exisiting script or configuration/resource management software. Ugh... so, cpu number allocation on hot-add is part of userland interface that we're locked into? We checked most of RHEL7 packages and didn't find a problem yet. But, for examle, we know some performance test team's test program assumed contiguous cpuids and it failed. It was an easy case because we can ask them to fix the application but I guess there will be some amount of customers that cpuids are contiguous. Tying hotplug and id allocation order together usually isn't a good idea. What if the cpu up fails while running the notifiers? The ID is already allocated and the next cpu being brought up will be after a hole anyway. Is this even actually gonna affect userland? Maybe. It's not fail-safe but In general, all kernel engineers (and skilled userland engineers) knows that cpuids cannot be always contiguous and cpuids/nodeids should be checked before running programs. I think most of engineers should be aware of that but many users have their own assumption :( Basically, I don't have strong objections, you're right technically. In summary... - users should not assume cpuids are contiguous. - all possible ids should be fixed at boot time. - For uses, some clarification document should be somewhere in Documenatation. Fine to me. So, Gu-san 1) determine all possible ids at boot. 2) clarify cpuid/nodeid can have hole because of 1) in Documenation. 3) It would be good if other guys give us ack. 
Also fine. But before going this way, could you please reconsider determining the ids when they first become present (the implementation in this patchset)? Though it is not perfect in some ways, we can ignore the doubts mentioned above since cpu/node hotplug is not a frequent operation, and there seems to be nothing harmful in going this way. Is it such heavy work? Hmm. My requests are: implement your patches as - Please don't change current behavior at boot. - Remember all possible apicids and give them future cpuids if not assigned. as step 1. Please fix dynamic pxm<->node detection in step 2. In future, memory-less node handling in x86 should be revisited. Thanks, -Kame
Re: [PATCH 0/2] workqueue: fix a bug when numa mapping is changed
On 2015/04/01 12:02, Tejun Heo wrote: On Wed, Apr 01, 2015 at 11:55:11AM +0900, Kamezawa Hiroyuki wrote: Now, hot-added cpus will have the lowest free cpu id. Because of this, in most systems which have only cpu-hot-add, cpu-ids are always contiguous even after cpu hot add. In enterprise, this would be considered an incompatibility. Determining cpuid <-> lapicid at boot will make cpuids sparse. That may break existing scripts or configuration/resource management software. Ugh... so, cpu number allocation on hot-add is part of userland interface that we're locked into? We checked most of RHEL7 packages and didn't find a problem yet. But, for example, we know some performance test team's test program assumed contiguous cpuids and it failed. It was an easy case because we could ask them to fix the application, but I guess there will be some number of customers who assume cpuids are contiguous. Tying hotplug and id allocation order together usually isn't a good idea. What if the cpu up fails while running the notifiers? The ID is already allocated and the next cpu being brought up will be after a hole anyway. Is this even actually gonna affect userland? Maybe. It's not fail-safe, but in general, all kernel engineers (and skilled userland engineers) know that cpuids cannot always be contiguous and cpuids/nodeids should be checked before running programs. I think most engineers should be aware of that, but many users have their own assumptions :( Basically, I don't have strong objections; you're right technically. In summary... - users should not assume cpuids are contiguous. - all possible ids should be fixed at boot time. - For users, some clarification document should be somewhere in Documentation. So, Gu-san 1) determine all possible ids at boot. 2) clarify cpuid/nodeid can have holes because of 1) in Documentation. 3) It would be good if other guys give us ack. In future, I myself think a naming system like udev for cpuid/numaid is necessary, at last.
Can that renaming feature be a cgroup/namespace feature? If possible, all containers could have cpuids starting from 0. Thanks, -Kame
Re: [PATCH 1/2] x86/cpu hotplug: make apicid <--> cpuid mapping persistent
On 2015/03/30 18:58, Gu Zheng wrote: > Hi Kame-san, > > On 03/27/2015 12:31 AM, Kamezawa Hiroyuki wrote: > >> On 2015/03/26 13:55, Gu Zheng wrote: >>> Hi Kame-san, >>> On 03/26/2015 11:19 AM, Kamezawa Hiroyuki wrote: >>> >>>> On 2015/03/26 11:17, Gu Zheng wrote: >>>>> Previously, we build the apicid <--> cpuid mapping when the cpu is >>>>> present, but >>>>> the relationship will be changed if the cpu/node hotplug happenned, >>>>> because we >>>>> always choose the first free cpuid for the hot added cpu (whether it is >>>>> new-add >>>>> or re-add), so this the cpuid <--> node mapping changed if node hot plug >>>>> occurred, and it causes the wq sub-system allocation failture: >>>>> == >>>>> SLUB: Unable to allocate memory on node 2 (gfp=0x80d0) >>>>> cache: kmalloc-192, object size: 192, buffer size: 192, default >>>>> order: >>>>>1, min order: 0 >>>>> node 0: slabs: 6172, objs: 259224, free: 245741 >>>>> node 1: slabs: 3261, objs: 136962, free: 127656 >>>>> == >>>>> So here we build the persistent [lapic id] <--> cpuid mapping when the >>>>> cpu first >>>>> present, and never change it. >>>>> >>>>> Suggested-by: KAMEZAWA Hiroyuki >>>>> Signed-off-by: Gu Zheng >>>>> --- >>>>> arch/x86/kernel/apic/apic.c | 31 ++- >>>>> 1 files changed, 30 insertions(+), 1 deletions(-) >>>>> >>>>> diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c >>>>> index ad3639a..d539ebc 100644 >>>>> --- a/arch/x86/kernel/apic/apic.c >>>>> +++ b/arch/x86/kernel/apic/apic.c >>>>> @@ -2038,6 +2038,30 @@ void disconnect_bsp_APIC(int virt_wire_setup) >>>>> apic_write(APIC_LVT1, value); >>>>> } >>>>> >>>>> +/* >>>>> + * Logic cpu number(cpuid) to local APIC id persistent mappings. >>>>> + * Do not clear the mapping even if cpu hot removed. >>>>> + * */ >>>>> +static int apicid_to_x86_cpu[MAX_LOCAL_APIC] = { >>>>> + [0 ... MAX_LOCAL_APIC - 1] = -1, >>>>> +}; >>>> >>>> >>>> This patch cannot handle x2apic, which is 32bit. 
>>> IMO, if the apicid is too big (larger than MAX_LOCAL_APIC), we will skip >>> generating a logic cpu number for it, so it seems no problem here. >>> >> you mean MAX_LOCAL_APIC=32768 ? isn't it too wasting ? > > I use the big array here to keep the same format with the existed ones: > int apic_version[MAX_LOCAL_APIC]; > > s16 __apicid_to_node[MAX_LOCAL_APIC] = { > [0 ... MAX_LOCAL_APIC-1] = NUMA_NO_NODE > }; > Or we should also say "NO" to them? This will not survive when someone tries to enlarge MAX_LOCAL_APIC for introducing a new cpu model, because x2apic is 32-bit by definition. Please create long-surviving infrastructure. Thanks, -Kame
Re: [PATCH 0/2] workqueue: fix a bug when numa mapping is changed
On 2015/04/01 0:28, Tejun Heo wrote: Hello, Kamezawa. On Tue, Mar 31, 2015 at 03:09:05PM +0900, Kamezawa Hiroyuki wrote: But this may be considered as API change for most hot-add users. Hmm... Why would it be? What can that possibly break? Now, hot-added cpus will have the lowest free cpu id. Because of this, in most systems which have only cpu-hot-add, cpu-ids are always contiguous even after cpu hot add. In enterprise, this would be considered an incompatibility. Determining cpuid <-> lapicid at boot will make cpuids sparse. That may break existing scripts or configuration/resource management software. So, for now, I vote for determining ids at online time, but recording them is a good way. If we know the information during boot, let's please do it at boot time by all means. I basically agree. I'm just thinking about the influence of this small incompatibility of forcing the ideal way, from my standing point. Thanks, -Kame
Re: [PATCH 0/2] workqueue: fix a bug when numa mapping is changed
On 2015/03/30 18:49, Gu Zheng wrote: Hi Kame-san, On 03/27/2015 12:42 AM, Kamezawa Hiroyuki wrote: On 2015/03/27 0:18, Tejun Heo wrote: Hello, On Thu, Mar 26, 2015 at 01:04:00PM +0800, Gu Zheng wrote: wq generates the numa affinity (pool->node) for all the possible cpu's per cpu workqueue at init stage, that means the affinity of currently un-present ones' may be incorrect, so we need to update the pool->node for the new added cpu to the correct node when preparing online, otherwise it will try to create worker on invalid node if node hotplug occurred. If the mapping is gonna be static once the cpus show up, any chance we can initialize that for all possible cpus during boot? I think the kernel can define all possible cpuid <-> lapicid <-> pxm <-> nodeid mappings at boot using firmware table information. Could you explain more? You can scan the firmware's tables and store all provided information for possible cpus/pxms somewhere, even while future cpus/mems are not yet onlined, rather than determining them on demand. But this may be considered an API change for most hot-add users. So, for now, I vote for determining ids at online time, but recording them is a good way. Thanks, -Kame
Re: [PATCH 0/2] workqueue: fix a bug when numa mapping is changed
On 2015/03/27 0:18, Tejun Heo wrote: Hello, On Thu, Mar 26, 2015 at 01:04:00PM +0800, Gu Zheng wrote: wq generates the numa affinity (pool->node) for all the possible cpu's per cpu workqueue at init stage, that means the affinity of currently un-present ones' may be incorrect, so we need to update the pool->node for the new added cpu to the correct node when preparing online, otherwise it will try to create worker on invalid node if node hotplug occurred. If the mapping is gonna be static once the cpus show up, any chance we can initialize that for all possible cpus during boot? I think the kernel can define the complete cpuid <-> lapicid <-> pxm <-> nodeid mapping at boot using firmware table information. One concern is the current x86 logic for memory-less nodes vs. memory hotplug (as I explained before). My idea is: step 1. build all possible mappings (cpuid <-> apicid <-> pxm <-> node id) at boot. But this may be overwritten by x86's memory-less node logic. So, step 2. check whether the node is online before calling kmalloc; if offline, use -1 rather than updating the workqueue's attributes. Thanks, -Kame
Re: [PATCH 1/2] x86/cpu hotplug: make apicid <--> cpuid mapping persistent
On 2015/03/26 13:55, Gu Zheng wrote: > Hi Kame-san, > On 03/26/2015 11:19 AM, Kamezawa Hiroyuki wrote: > >> On 2015/03/26 11:17, Gu Zheng wrote: >>> Previously, we build the apicid <--> cpuid mapping when the cpu is present, >>> but >>> the relationship will be changed if the cpu/node hotplug happenned, because >>> we >>> always choose the first free cpuid for the hot added cpu (whether it is >>> new-add >>> or re-add), so this the cpuid <--> node mapping changed if node hot plug >>> occurred, and it causes the wq sub-system allocation failture: >>> == >>>SLUB: Unable to allocate memory on node 2 (gfp=0x80d0) >>> cache: kmalloc-192, object size: 192, buffer size: 192, default >>> order: >>> 1, min order: 0 >>> node 0: slabs: 6172, objs: 259224, free: 245741 >>> node 1: slabs: 3261, objs: 136962, free: 127656 >>> == >>> So here we build the persistent [lapic id] <--> cpuid mapping when the cpu >>> first >>> present, and never change it. >>> >>> Suggested-by: KAMEZAWA Hiroyuki >>> Signed-off-by: Gu Zheng >>> --- >>>arch/x86/kernel/apic/apic.c | 31 ++- >>>1 files changed, 30 insertions(+), 1 deletions(-) >>> >>> diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c >>> index ad3639a..d539ebc 100644 >>> --- a/arch/x86/kernel/apic/apic.c >>> +++ b/arch/x86/kernel/apic/apic.c >>> @@ -2038,6 +2038,30 @@ void disconnect_bsp_APIC(int virt_wire_setup) >>> apic_write(APIC_LVT1, value); >>>} >>> >>> +/* >>> + * Logic cpu number(cpuid) to local APIC id persistent mappings. >>> + * Do not clear the mapping even if cpu hot removed. >>> + * */ >>> +static int apicid_to_x86_cpu[MAX_LOCAL_APIC] = { >>> + [0 ... MAX_LOCAL_APIC - 1] = -1, >>> +}; >> >> >> This patch cannot handle x2apic, which is 32bit. > > IMO, if the apicid is too big (larger than MAX_LOCAL_APIC), we will skip > generating a logic cpu number for it, so it seems no problem here. > you mean MAX_LOCAL_APIC=32768 ? isn't it too wasting ? Anyway, APIC IDs are sparse values. Please use proper structure. 
Thanks, -Kame >> >> As far as I understand, it depends on CPU's spec and the newest cpu has 9bit >> apicid, at least. >> >> But you can't create inifinit array. >> >> If you can't allocate the array dynamically, How about adding >> >> static int cpuid_to_apicid[MAX_CPU] = {} >> >> or using idr library ? (please see lib/idr.c) >> >> I guess you can update this map after boot(after mm initialization) >> and make use of idr library. >> >> About this patch, Nack. >> >> -Kame >> >> >> >>> + >>> +/* >>> + * Internal cpu id bits, set the bit once cpu present, and never clear it. >>> + * */ >>> +static cpumask_t cpuid_mask = CPU_MASK_NONE; >>> + >>> +static int get_cpuid(int apicid) >>> +{ >>> + int cpuid; >>> + >>> + cpuid = apicid_to_x86_cpu[apicid]; >>> + if (cpuid == -1) >>> + cpuid = cpumask_next_zero(-1, &cpuid_mask); >>> + >>> + return cpuid; >>> +} >>> + >>>int generic_processor_info(int apicid, int version) >>>{ >>> int cpu, max = nr_cpu_ids; >>> @@ -2115,7 +2139,10 @@ int generic_processor_info(int apicid, int version) >>> */ >>> cpu = 0; >>> } else >>> - cpu = cpumask_next_zero(-1, cpu_present_mask); >>> + cpu = get_cpuid(apicid); >>> + >>> + /* Store the mapping */ >>> + apicid_to_x86_cpu[apicid] = cpu; >>> >>> /* >>> * Validate version >>> @@ -2144,6 +2171,8 @@ int generic_processor_info(int apicid, int version) >>> early_per_cpu(x86_cpu_to_logical_apicid, cpu) = >>> apic->x86_32_early_logical_apicid(cpu); >>>#endif >>> + /* Mark this cpu id as uesed (already mapping a local apic id) */ >>> + cpumask_set_cpu(cpu, &cpuid_mask); >>> set_cpu_possible(cpu, true); >>> set_cpu_present(cpu, true); >>> >>> >> >> >> . >> > > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/2] x86/cpu hotplug: make apicid <--> cpuid mapping persistent
On 2015/03/26 11:17, Gu Zheng wrote: > Previously, we build the apicid <--> cpuid mapping when the cpu is present, > but > the relationship will be changed if the cpu/node hotplug happenned, because we > always choose the first free cpuid for the hot added cpu (whether it is > new-add > or re-add), so this the cpuid <--> node mapping changed if node hot plug > occurred, and it causes the wq sub-system allocation failture: >== > SLUB: Unable to allocate memory on node 2 (gfp=0x80d0) >cache: kmalloc-192, object size: 192, buffer size: 192, default > order: > 1, min order: 0 >node 0: slabs: 6172, objs: 259224, free: 245741 >node 1: slabs: 3261, objs: 136962, free: 127656 >== > So here we build the persistent [lapic id] <--> cpuid mapping when the cpu > first > present, and never change it. > > Suggested-by: KAMEZAWA Hiroyuki > Signed-off-by: Gu Zheng > --- > arch/x86/kernel/apic/apic.c | 31 ++- > 1 files changed, 30 insertions(+), 1 deletions(-) > > diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c > index ad3639a..d539ebc 100644 > --- a/arch/x86/kernel/apic/apic.c > +++ b/arch/x86/kernel/apic/apic.c > @@ -2038,6 +2038,30 @@ void disconnect_bsp_APIC(int virt_wire_setup) > apic_write(APIC_LVT1, value); > } > > +/* > + * Logic cpu number(cpuid) to local APIC id persistent mappings. > + * Do not clear the mapping even if cpu hot removed. > + * */ > +static int apicid_to_x86_cpu[MAX_LOCAL_APIC] = { > + [0 ... MAX_LOCAL_APIC - 1] = -1, > +}; This patch cannot handle x2apic, which is 32bit. As far as I understand, it depends on CPU's spec and the newest cpu has 9bit apicid, at least. But you can't create inifinit array. If you can't allocate the array dynamically, How about adding static int cpuid_to_apicid[MAX_CPU] = {} or using idr library ? (please see lib/idr.c) I guess you can update this map after boot(after mm initialization) and make use of idr library. About this patch, Nack. 
-Kame > + > +/* > + * Internal cpu id bits, set the bit once cpu present, and never clear it. > + * */ > +static cpumask_t cpuid_mask = CPU_MASK_NONE; > + > +static int get_cpuid(int apicid) > +{ > + int cpuid; > + > + cpuid = apicid_to_x86_cpu[apicid]; > + if (cpuid == -1) > + cpuid = cpumask_next_zero(-1, &cpuid_mask); > + > + return cpuid; > +} > + > int generic_processor_info(int apicid, int version) > { > int cpu, max = nr_cpu_ids; > @@ -2115,7 +2139,10 @@ int generic_processor_info(int apicid, int version) >*/ > cpu = 0; > } else > - cpu = cpumask_next_zero(-1, cpu_present_mask); > + cpu = get_cpuid(apicid); > + > + /* Store the mapping */ > + apicid_to_x86_cpu[apicid] = cpu; > > /* >* Validate version > @@ -2144,6 +2171,8 @@ int generic_processor_info(int apicid, int version) > early_per_cpu(x86_cpu_to_logical_apicid, cpu) = > apic->x86_32_early_logical_apicid(cpu); > #endif > + /* Mark this cpu id as uesed (already mapping a local apic id) */ > + cpumask_set_cpu(cpu, &cpuid_mask); > set_cpu_possible(cpu, true); > set_cpu_present(cpu, true); > > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/2] workqueue: fix a bug when numa mapping is changed
On 2015/03/26 11:17, Gu Zheng wrote:
> Yasuaki Ishimatsu found that with node online/offline, cpu<->node
> relationship is established. Because workqueue uses a info which was
> established at boot time, but it may be changed by node hotpluging.
>
> Once pool->node points to a stale node, following allocation failure
> happens.
> ==
>  SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
>   cache: kmalloc-192, object size: 192, buffer size: 192, default order: 1, min order: 0
>   node 0: slabs: 6172, objs: 259224, free: 245741
>   node 1: slabs: 3261, objs: 136962, free: 127656
> ==
>
> As the apicid <--> node relationship is persistent, so the root cause is the
       ^^^ pxm.
> cpu-id <-> lapicid mapping is not persistent (because the currently
> implementation always choose the first free cpu id for the new added cpu),
> so if we can build persistent cpu-id <-> lapicid relationship, this problem
> will be fixed.
>
> Please refer to https://lkml.org/lkml/2015/2/27/145 for the previous
> discussion.
>
> Gu Zheng (2):
>   x86/cpu hotplug: make lapicid <-> cpuid mapping persistent
>   workqueue: update per cpu workqueue's numa affinity when cpu preparing online

Why is patch (2/2) required?

Thanks,
-Kame

>  arch/x86/kernel/apic/apic.c | 31 ++-
>  kernel/workqueue.c          |  1 +
>  2 files changed, 31 insertions(+), 1 deletions(-)
Re: [PATCH] workqueue: update numa affinity when node hotplug
On 2015/03/05 10:23, Gu Zheng wrote:
Hi Kamezawa-san,

On 03/04/2015 01:45 PM, Kamezawa Hiroyuki wrote:
On 2015/03/03 22:18, Tejun Heo wrote:
Hello, Kame.

On Tue, Mar 03, 2015 at 03:53:46PM +0900, Kamezawa Hiroyuki wrote:
relationship between proximity domain and lapic id doesn't change. relationship between lapic-id and cpu-id changes.
pxm <-> memory address : no change
pxm <-> lapicid : no change
pxm <-> node id : no change
lapicid <-> cpu id : change.

So, we're changing the cpu ID to NUMA node mapping because current NUMA code is ignoring PXM for memoryless nodes? That's it?

For the memory-less node case, yes. Another problem is that the lapicid <-> cpuid relationship is not persistent.

I personally think the proper fix is building a persistent cpu-id <-> lapicid relationship, as pxm does, rather than creating a band-aid.

Oh if this is possible, I agree that's the right direction too.

Implementation is a bit complicated now :(.

Ah well, even then, the obviously right thing to do is updating NUMA code to always keep track of PXM information. We don't really want to pile NUMA hacks in random users of NUMA code.

We'd like to start from making apicid <-> cpuid persistent, because the memory-less node case doesn't cause a panic. Gu-san, what do you think?

Fine by me. But it seems that the change will break the re-use of free cpuids when hot-adding new cpus. I am afraid it may affect other sub-systems, though I cannot point out a specific example.

That may be a concern. But, IMHO, applications should not expect the cpuid of a newly added cpu. If an application takes care of the placement of [thread, memory], the OS API should be refreshed to allow the user to say "please place [thread, memory] near each other" rather than specifying [cpu, memory]. (As in Peter Z.'s schedNUMA.) If an application takes care of the physical placement of hardware (ethernet, SSD, etc.), it should not depend on cpuid.
Some of the required features may not be achieved yet, but the cpuid of a newly added cpu should not be a big problem for applications, I think. Open discussion with the (x86/sched/mm) maintainers will be required, anyway. I think it's worth trying.

Thanks,
-Kame
Re: node-hotplug: is memset 0 safe in try_offline_node()?
On 2015/03/04 17:03, Xishi Qiu wrote:
On 2015/3/4 11:56, Gu Zheng wrote:
Hi Xishi,

On 03/04/2015 10:52 AM, Xishi Qiu wrote:
On 2015/3/4 10:22, Xishi Qiu wrote:
On 2015/3/3 18:20, Gu Zheng wrote:
Hi Xishi,

On 03/03/2015 11:30 AM, Xishi Qiu wrote:
When hot-removing a numa node, we will clear the pgdat, but is memset 0 safe in try_offline_node()?

It is not safe here. In fact, this is a temporary solution. As you know, pgdat is accessed lock-less now, so a protection mechanism (RCU?) is needed to make it completely safe here, but that seems a bit over-kill.

Hi Gu,
Can we just remove "memset(pgdat, 0, sizeof(*pgdat));"? I find this is fine in the stress test, except for the warning when hot-adding memory. As you see, it will trigger the warning in free_area_init_node().

Could you try the following patch? It will reset the pgdat before reusing it.

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 1778628..0717649 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1092,6 +1092,9 @@ static pg_data_t __ref *hotadd_new_pgdat(int nid, u64 start)
 			return NULL;
 		arch_refresh_nodedata(nid, pgdat);
+	} else {
+		/* Reset the pgdat to reuse */
+		memset(pgdat, 0, sizeof(*pgdat));
 	}

Hi Gu,
If the schedule lasts a long time, next_zone may still access the pgdat here, so it is not safe enough, right?

How about just resetting pgdat->nr_zones and pgdat->classzone_idx to 0 rather than using memset()? It seems breaking the pointer information in pgdat is not a good choice. Only the proper "values" should be reset.

Thanks,
-Kame
Re: [PATCH] workqueue: update numa affinity when node hotplug
On 2015/03/03 22:18, Tejun Heo wrote:
Hello, Kame.

On Tue, Mar 03, 2015 at 03:53:46PM +0900, Kamezawa Hiroyuki wrote:
relationship between proximity domain and lapic id doesn't change. relationship between lapic-id and cpu-id changes.
pxm <-> memory address : no change
pxm <-> lapicid : no change
pxm <-> node id : no change
lapicid <-> cpu id : change.

So, we're changing the cpu ID to NUMA node mapping because current NUMA code is ignoring PXM for memoryless nodes? That's it?

For the memory-less node case, yes. Another problem is that the lapicid <-> cpuid relationship is not persistent.

I personally think the proper fix is building a persistent cpu-id <-> lapicid relationship, as pxm does, rather than creating a band-aid.

Oh if this is possible, I agree that's the right direction too.

Implementation is a bit complicated now :(.

Ah well, even then, the obviously right thing to do is updating NUMA code to always keep track of PXM information. We don't really want to pile NUMA hacks in random users of NUMA code.

We'd like to start from making apicid <-> cpuid persistent, because the memory-less node case doesn't cause a panic. Gu-san, what do you think?

Thanks,
-Kame

P.S. Finally, I want something like udev for cpuid/numaid...
Re: [PATCH] workqueue: update numa affinity when node hotplug
On 2015/03/03 1:28, Tejun Heo wrote: Hello, On Mon, Mar 02, 2015 at 05:41:05PM +0900, Kamezawa Hiroyuki wrote: Let me start from explaining current behavior. - cpu-id is determined when a new processor(lapicid/x2apicid) is founded. cpu-id<->nodeid relationship is _not_ recorded. Is this something from the firmware side or is it just that we aren't maintaining the association right now? I think it's not just maintained. - node-id is determined when a new pxm(firmware info) is founded. pxm<->nodeid relationship is recorded. By this, there are 2 cases of cpu<->nodeid change. Case A) In x86, cpus on memory-less nodes are all tied to existing nodes(round robin). At memory-hotadd happens and a new node comes, cpus are moved to a newly added node based on pxm. Ah, okay, so the firmware doesn't provide proximity information at all for memory-less nodes so we end up putting all of them somewhere random and when memory is added to one of the memory-less nodes, the mapping information changes? With memory-less node, proximity domain for processors are given but ignored. When memory(node) hotplug happens, the information revisited and cpuid<->nodeid relationship is updated. Am I understanding it correctly? If so, it's super weird tho. Why wouldn't there be proximity information for a memless node? Not having memory doesn't mean it's at the same distance from all existing nodes. Firmware gives pxm for memory-less node but it's ignored. I'm not sure why the current implemetaion is. Case B) Adding a node after removing another node, if pxm of them were different from each other, cpu<->node relatiionship changes. I don't get this either. Does proximity relationship actually change? Or is it that we're assigning different IDs to the same thing?Isn't proximity pretty much hardwired to how the system is architected to begin with? relationship between proximity domain and lapic id doesn't change. relationship between lapic-id and cpu-id changes. 
pxm <-> memory address : no change
pxm <-> lapicid : no change
pxm <-> node id : no change
lapicid <-> cpu id : change.

I personally think the proper fix is building a persistent cpu-id <-> lapicid relationship, as pxm does, rather than creating a band-aid.

Oh if this is possible, I agree that's the right direction too.

Implementation is a bit complicated now :(.

Thanks,
-Kame

Thanks.
Re: [PATCH] workqueue: update numa affinity when node hotplug
On 2015/02/27 20:54, Tejun Heo wrote:
Hello,

On Fri, Feb 27, 2015 at 06:04:52PM +0800, Gu Zheng wrote:
Yasuaki Ishimatsu found that with node online/offline, the cpu<->node relationship is re-established, because workqueue uses info which was established at boot time, but it may be changed by node hotplugging.

I've asked this a couple of times now, but can somebody please justify why the cpu<->node relationship needs to change? If it has to change, that's okay, but let's please first make sure that we understand why such a thing is necessary so that we can figure out what kind of facilities are necessary for such dynamism.

Thanks.

Let me start by explaining the current behavior.
- cpu-id is determined when a new processor (lapicid/x2apicid) is found. The cpu-id<->nodeid relationship is _not_ recorded.
- node-id is determined when a new pxm (firmware info) is found. The pxm<->nodeid relationship is recorded.

By this, there are 2 cases of cpu<->nodeid change.

Case A) In x86, cpus on memory-less nodes are all tied to existing nodes (round robin). When memory hot-add happens and a new node comes up, cpus are moved to the newly added node based on pxm.

Case B) Adding a node after removing another node: if their pxms were different from each other, the cpu<->node relationship changes.

I personally think the proper fix is building a persistent cpu-id <-> lapicid relationship, as pxm does, rather than creating a band-aid. But I have no good idea how to handle case A), which is in use now.

Hm, Case A will not be a problem for the case of workqueue's kmalloc, because a node with ZONE_NORMAL cannot be removed.

Thanks,
-Kame
Re: [PATCH 1/2] workqueue: update numa affinity info at node hotplug
(2014/12/17 12:22), Kamezawa Hiroyuki wrote: (2014/12/17 10:36), Lai Jiangshan wrote: On 12/17/2014 12:45 AM, Kamezawa Hiroyuki wrote: With node online/offline, cpu<->node relationship is established. Workqueue uses a info which was established at boot time but it may be changed by node hotpluging. Once pool->node points to a stale node, following allocation failure happens. == SLUB: Unable to allocate memory on node 2 (gfp=0x80d0) cache: kmalloc-192, object size: 192, buffer size: 192, default order: 1, min order: 0 node 0: slabs: 6172, objs: 259224, free: 245741 node 1: slabs: 3261, objs: 136962, free: 127656 == This patch updates per cpu workqueue pool's node affinity and updates wq_numa_possible_cpumask at node online/offline event. This update of mask is very important because it affects cpumasks and preferred node detection. Unbound workqueue's per node pool are updated by by wq_update_unbound_numa() at CPU_DOWN_PREPARE of the last cpu, by existing code. What important here is to avoid wrong node detection when a cpu get onlined. And it's handled by wq_numa_possible_cpumask update introduced by this patch. Changelog v3->v4: - added workqueue_node_unregister - clear wq_numa_possible_cpumask at node offline. - merged a patch which handles per cpu pools. - clear per-cpu-pool's pool->node at node offlining. - set per-cpu-pool's pool->node at node onlining. - dropped modification to get_unbound_pool() - dropped per-cpu-pool handling at cpu online/offline. 
Reported-by: Yasuaki Ishimatsu Signed-off-by: KAMEZAWA Hiroyuki --- include/linux/memory_hotplug.h | 3 +++ kernel/workqueue.c | 58 +- mm/memory_hotplug.c| 6 - 3 files changed, 65 insertions(+), 2 deletions(-) diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h index 8f1a419..7b4a292 100644 --- a/include/linux/memory_hotplug.h +++ b/include/linux/memory_hotplug.h @@ -270,4 +270,7 @@ extern void sparse_remove_one_section(struct zone *zone, struct mem_section *ms) extern struct page *sparse_decode_mem_map(unsigned long coded_mem_map, unsigned long pnum); +/* update for workqueues */ +void workqueue_node_register(int node); +void workqueue_node_unregister(int node); #endif /* __LINUX_MEMORY_HOTPLUG_H */ diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 6202b08..f6ad05a 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -266,7 +266,7 @@ struct workqueue_struct { static struct kmem_cache *pwq_cache; static cpumask_var_t *wq_numa_possible_cpumask; -/* possible CPUs of each node */ +/* PL: possible CPUs of each node */ static bool wq_disable_numa; module_param_named(disable_numa, wq_disable_numa, bool, 0444); @@ -4563,6 +4563,62 @@ static void restore_unbound_workers_cpumask(struct worker_pool *pool, int cpu) WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask) < 0); } +#ifdef CONFIG_MEMORY_HOTPLUG + +static void workqueue_update_cpu_numa_affinity(int cpu, int node) +{ +struct worker_pool *pool; + +if (node != cpu_to_node(cpu)) +return; +cpumask_set_cpu(cpu, wq_numa_possible_cpumask[node]); +for_each_cpu_worker_pool(pool, cpu) +pool->node = node; Again, You need to check and update all the wq->numa_pwq_tbl[oldnode], but in this patchset, the required information is lost and we can't find out oldnode. 
cpus of oldnode, 16-31 (online), 48, 56, 64, 72 (offline, randomly assigned to the oldnode by numa_init_array()), and then cpu#48 is allocated for the newnode and onlined. Now, the wq->numa_pwq_tbl[oldnode]'s cpumask still has cpu#48, and work may be scheduled to cpu#48. See the information of my patch 4/5.

That will not cause page allocation failure, right? If so, it's out of the scope of this patch 1/2. I think it's handled in patch 2/2, isn't it?

Let me correct my words. The main purpose of this patch 1/2 is handling the case where a node disappears after boot, and trying to handle physical node hotplug cases. Changes of the cpu<->node relationship at CPU_ONLINE are handled in patch 2/2.

Thanks,
-Kame
Re: [PATCH 1/2] workqueue: update numa affinity info at node hotplug
(2014/12/17 10:36), Lai Jiangshan wrote: On 12/17/2014 12:45 AM, Kamezawa Hiroyuki wrote: With node online/offline, cpu<->node relationship is established. Workqueue uses a info which was established at boot time but it may be changed by node hotpluging. Once pool->node points to a stale node, following allocation failure happens. == SLUB: Unable to allocate memory on node 2 (gfp=0x80d0) cache: kmalloc-192, object size: 192, buffer size: 192, default order: 1, min order: 0 node 0: slabs: 6172, objs: 259224, free: 245741 node 1: slabs: 3261, objs: 136962, free: 127656 == This patch updates per cpu workqueue pool's node affinity and updates wq_numa_possible_cpumask at node online/offline event. This update of mask is very important because it affects cpumasks and preferred node detection. Unbound workqueue's per node pool are updated by by wq_update_unbound_numa() at CPU_DOWN_PREPARE of the last cpu, by existing code. What important here is to avoid wrong node detection when a cpu get onlined. And it's handled by wq_numa_possible_cpumask update introduced by this patch. Changelog v3->v4: - added workqueue_node_unregister - clear wq_numa_possible_cpumask at node offline. - merged a patch which handles per cpu pools. - clear per-cpu-pool's pool->node at node offlining. - set per-cpu-pool's pool->node at node onlining. - dropped modification to get_unbound_pool() - dropped per-cpu-pool handling at cpu online/offline. 
Reported-by: Yasuaki Ishimatsu Signed-off-by: KAMEZAWA Hiroyuki --- include/linux/memory_hotplug.h | 3 +++ kernel/workqueue.c | 58 +- mm/memory_hotplug.c| 6 - 3 files changed, 65 insertions(+), 2 deletions(-) diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h index 8f1a419..7b4a292 100644 --- a/include/linux/memory_hotplug.h +++ b/include/linux/memory_hotplug.h @@ -270,4 +270,7 @@ extern void sparse_remove_one_section(struct zone *zone, struct mem_section *ms) extern struct page *sparse_decode_mem_map(unsigned long coded_mem_map, unsigned long pnum); +/* update for workqueues */ +void workqueue_node_register(int node); +void workqueue_node_unregister(int node); #endif /* __LINUX_MEMORY_HOTPLUG_H */ diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 6202b08..f6ad05a 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -266,7 +266,7 @@ struct workqueue_struct { static struct kmem_cache *pwq_cache; static cpumask_var_t *wq_numa_possible_cpumask; - /* possible CPUs of each node */ + /* PL: possible CPUs of each node */ static bool wq_disable_numa; module_param_named(disable_numa, wq_disable_numa, bool, 0444); @@ -4563,6 +4563,62 @@ static void restore_unbound_workers_cpumask(struct worker_pool *pool, int cpu) WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask) < 0); } +#ifdef CONFIG_MEMORY_HOTPLUG + +static void workqueue_update_cpu_numa_affinity(int cpu, int node) +{ + struct worker_pool *pool; + + if (node != cpu_to_node(cpu)) + return; + cpumask_set_cpu(cpu, wq_numa_possible_cpumask[node]); + for_each_cpu_worker_pool(pool, cpu) + pool->node = node; Again, You need to check and update all the wq->numa_pwq_tbl[oldnode], but in this patchset, the required information is lost and we can't find out oldnode. 
cpus of oldnode, 16-31(online),48,56,64,72(offline,randomly assigned to the oldnode by numa_init_array()) and then cpu#48 is allocated for newnode and online Now, the wq->numa_pwq_tbl[oldnode]'s cpumask still have cpu#48, and it may be scheduled to cpu#48. See the information of my patch 4/5 That will not cause page allocation failure, right ? If so, it's out of scope of this patch 1/2. I think it's handled in patch 2/2, isn't it ? Thanks, -Kame +} + +/* + * When a cpu is physically added, cpu<->node relationship is established + * based on firmware info. We can catch the whole view when a new NODE_DATA() + * coming up (a node is added). + * If we don't update the info, pool->node will points to a not-online node + * and the kernel will have allocation failure. + * + * Update wp_numa_possible_mask at online and clear it at offline. + */ +void workqueue_node_register(int node) +{ + int cpu; + + mutex_lock(&wq_pool_mutex); + for_each_possible_cpu(cpu) + workqueue_update_cpu_numa_affinity(cpu, node); + /* unbound workqueue will be updated when the 1st cpu comes up.*/ + mutex_unlock(&wq_pool_mutex); +} + +void workqueue_node_unregister(int node) +{ + struct worker_pool *p
[PATCH 2/2] workqueue: update cpumask at CPU_ONLINE if necessary
In some case, cpu's numa affinity will be changed in cpu_up(). It happens after a new node is onlined. (in x86, online cpus are tied to onlined node at boot. so, if memory is added later, cpu mapping can be changed at cpu_up() Although wq_numa_possible_cpumask at el. are maintained against node hotplug, this case should be handled. Signed-off-by: KAMEZAWA Hiroyuki --- kernel/workqueue.c | 23 +++ 1 file changed, 23 insertions(+) diff --git a/kernel/workqueue.c b/kernel/workqueue.c index f6ad05a..59d8be5 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -4618,6 +4618,27 @@ void workqueue_node_unregister(int node) mutex_unlock(&wq_pool_mutex); } +static void workqueue_may_update_numa_affinity(int cpu) +{ + int curnode = cpu_to_node(cpu); + int node; + + if (likely(cpumask_test_cpu(cpu, wq_numa_possible_cpumask[curnode]))) + return; + + /* cpu<->node relationship is changed in cpu_up() */ + for_each_node_state(node, N_POSSIBLE) + cpumask_clear_cpu(cpu, wq_numa_possible_cpumask[node]); + + workqueue_update_cpu_numa_affinity(cpu, curnode); +} +#else + +static void workqueue_may_update_numa_affinity(int cpu) +{ + return; +} + #endif /* @@ -4647,6 +4668,8 @@ static int workqueue_cpu_up_callback(struct notifier_block *nfb, case CPU_ONLINE: mutex_lock(&wq_pool_mutex); + workqueue_may_update_numa_affinity(cpu); + for_each_pool(pool, pi) { mutex_lock(&pool->attach_mutex); -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/2] workqueue: update numa affinity info at node hotplug
With node online/offline, the cpu<->node relationship is re-established. Workqueue uses info which was established at boot time, but it may be changed by node hotplugging. Once pool->node points to a stale node, the following allocation failure happens.

==
 SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
  cache: kmalloc-192, object size: 192, buffer size: 192, default order: 1, min order: 0
  node 0: slabs: 6172, objs: 259224, free: 245741
  node 1: slabs: 3261, objs: 136962, free: 127656
==

This patch updates the per cpu workqueue pool's node affinity and updates wq_numa_possible_cpumask at node online/offline events. This update of the mask is very important because it affects cpumasks and preferred node detection.

Unbound workqueue's per node pools are updated by wq_update_unbound_numa() at CPU_DOWN_PREPARE of the last cpu, by existing code. What is important here is to avoid wrong node detection when a cpu gets onlined, and that is handled by the wq_numa_possible_cpumask update introduced by this patch.

Changelog v3->v4:
- added workqueue_node_unregister
- clear wq_numa_possible_cpumask at node offline.
- merged a patch which handles per cpu pools.
- clear per-cpu-pool's pool->node at node offlining.
- set per-cpu-pool's pool->node at node onlining.
- dropped modification to get_unbound_pool()
- dropped per-cpu-pool handling at cpu online/offline.
Reported-by: Yasuaki Ishimatsu Signed-off-by: KAMEZAWA Hiroyuki --- include/linux/memory_hotplug.h | 3 +++ kernel/workqueue.c | 58 +- mm/memory_hotplug.c| 6 - 3 files changed, 65 insertions(+), 2 deletions(-) diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h index 8f1a419..7b4a292 100644 --- a/include/linux/memory_hotplug.h +++ b/include/linux/memory_hotplug.h @@ -270,4 +270,7 @@ extern void sparse_remove_one_section(struct zone *zone, struct mem_section *ms) extern struct page *sparse_decode_mem_map(unsigned long coded_mem_map, unsigned long pnum); +/* update for workqueues */ +void workqueue_node_register(int node); +void workqueue_node_unregister(int node); #endif /* __LINUX_MEMORY_HOTPLUG_H */ diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 6202b08..f6ad05a 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -266,7 +266,7 @@ struct workqueue_struct { static struct kmem_cache *pwq_cache; static cpumask_var_t *wq_numa_possible_cpumask; - /* possible CPUs of each node */ + /* PL: possible CPUs of each node */ static bool wq_disable_numa; module_param_named(disable_numa, wq_disable_numa, bool, 0444); @@ -4563,6 +4563,62 @@ static void restore_unbound_workers_cpumask(struct worker_pool *pool, int cpu) WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask) < 0); } +#ifdef CONFIG_MEMORY_HOTPLUG + +static void workqueue_update_cpu_numa_affinity(int cpu, int node) +{ + struct worker_pool *pool; + + if (node != cpu_to_node(cpu)) + return; + cpumask_set_cpu(cpu, wq_numa_possible_cpumask[node]); + for_each_cpu_worker_pool(pool, cpu) + pool->node = node; +} + +/* + * When a cpu is physically added, cpu<->node relationship is established + * based on firmware info. We can catch the whole view when a new NODE_DATA() + * coming up (a node is added). + * If we don't update the info, pool->node will points to a not-online node + * and the kernel will have allocation failure. 
+ * + * Update wp_numa_possible_mask at online and clear it at offline. + */ +void workqueue_node_register(int node) +{ + int cpu; + + mutex_lock(&wq_pool_mutex); + for_each_possible_cpu(cpu) + workqueue_update_cpu_numa_affinity(cpu, node); + /* unbound workqueue will be updated when the 1st cpu comes up.*/ + mutex_unlock(&wq_pool_mutex); +} + +void workqueue_node_unregister(int node) +{ + struct worker_pool *pool; + int cpu; + + mutex_lock(&wq_pool_mutex); + cpumask_clear(wq_numa_possible_cpumask[node]); + for_each_possible_cpu(cpu) { + if (node == cpu_to_node(cpu)) + for_each_cpu_worker_pool(pool, cpu) + pool->node = NUMA_NO_NODE; + } + /* +* unbound workqueue's per-node pwqs are already refleshed +* by wq_update_unbound_numa() at CPU_DOWN_PREPARE of the last cpu +* on this node, because all cpus of this node went down. +* (see wq_calc_node_cpumask()). per-node unbound pwqs has been replaced +* with wq->dfl_pwq, already. +*/ + mutex_unlock(&wq_pool_mutex); +} + +#endif /* * Workqueu
[PATCH 0/2] workqueue: fix a bug when numa mapping is changed v4
This is v4. Thank you for the hints/comments on previous versions. I think this version only contains the necessary things and is not invasive. Tested several patterns of node hotplug and it seems to work well.

Changes since v3
- removed changes against get_unbound_pool()
- removed code in the cpu offline event.
- added a node unregister callback; clear wq_numa_possible_mask at node offline rather than cpu offline.
- update per-cpu pool's pool->node at node_(un)register.
- added more comments.
- almost all code is under CONFIG_MEMORY_HOTPLUG

 include/linux/memory_hotplug.h |  3 +
 kernel/workqueue.c             | 81 -
 mm/memory_hotplug.c            |  6 ++-
 3 files changed, 88 insertions(+), 2 deletions(-)

The original problem was a memory allocation failure because pool->node points to a not-online node. This happens when the cpu<->node mapping changes. Yasuaki Ishimatsu hit an allocation failure bug when the numa mapping between CPU and node changed. This was the last scene:

 SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
  cache: kmalloc-192, object size: 192, buffer size: 192, default order: 1, min order: 0
  node 0: slabs: 6172, objs: 259224, free: 245741
  node 1: slabs: 3261, objs: 136962, free: 127656
Re: [PATCH 3/4] workqueue: Update workqueue's possible cpumask when a new node, coming up.
(2014/12/16 17:10), Kamezawa Hiroyuki wrote: (2014/12/16 16:49), Lai Jiangshan wrote: On 12/15/2014 07:18 PM, Kamezawa Hiroyuki wrote: Workqueue keeps cpu<->node relationship including all possible cpus. The original information was made at boot but it may change when a new node is added. Update information if a new node is ready with using node-hotplug callback. Signed-off-by: KAMEZAWA Hiroyuki --- include/linux/memory_hotplug.h | 2 ++ kernel/workqueue.c | 28 +++- mm/memory_hotplug.c| 4 3 files changed, 33 insertions(+), 1 deletion(-) diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h index 8f1a419..cd3cb67 100644 --- a/include/linux/memory_hotplug.h +++ b/include/linux/memory_hotplug.h @@ -270,4 +270,6 @@ extern void sparse_remove_one_section(struct zone *zone, struct mem_section *ms) extern struct page *sparse_decode_mem_map(unsigned long coded_mem_map, unsigned long pnum); +/* update for workqueues */ +void workqueue_node_register(int node); #endif /* __LINUX_MEMORY_HOTPLUG_H */ diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 2fd0bd7..5499b76 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -266,7 +266,7 @@ struct workqueue_struct { static struct kmem_cache *pwq_cache; static cpumask_var_t *wq_numa_possible_cpumask; -/* possible CPUs of each node */ +/* PL: possible CPUs of each node */ static bool wq_disable_numa; module_param_named(disable_numa, wq_disable_numa, bool, 0444); @@ -4559,6 +4559,32 @@ static void restore_unbound_workers_cpumask(struct worker_pool *pool, int cpu) WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask) < 0); } +#ifdef CONFIG_MEMORY_HOTPLUG + +static void reflesh_wq_possible_mask(int cpu, int node) +{ +int oldnode; +for_each_node_state(oldnode, N_POSSIBLE) +cpumask_clear_cpu(cpu, wq_numa_possible_cpumask[oldnode]); You need to check and update all the wq->numa_pwq_tbl[oldnode] you mean : if I drop patch 1/4, it will be required. Right ? I'll add that check. 
Ah, isn't it handled by wq_update_unbound_numa() called at workqueue_cpu_down_callback()? Because the node's online cpumask will be empty, a new pwq will be attached.

Thanks,
-Kame
Re: [PATCH 3/4] workqueue: Update workqueue's possible cpumask when a new node comes up.
(2014/12/16 16:49), Lai Jiangshan wrote: On 12/15/2014 07:18 PM, Kamezawa Hiroyuki wrote: Workqueue keeps cpu<->node relationship including all possible cpus. The original information was made at boot but it may change when a new node is added. Update information if a new node is ready with using node-hotplug callback. Signed-off-by: KAMEZAWA Hiroyuki --- include/linux/memory_hotplug.h | 2 ++ kernel/workqueue.c | 28 +++- mm/memory_hotplug.c| 4 3 files changed, 33 insertions(+), 1 deletion(-) diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h index 8f1a419..cd3cb67 100644 --- a/include/linux/memory_hotplug.h +++ b/include/linux/memory_hotplug.h @@ -270,4 +270,6 @@ extern void sparse_remove_one_section(struct zone *zone, struct mem_section *ms) extern struct page *sparse_decode_mem_map(unsigned long coded_mem_map, unsigned long pnum); +/* update for workqueues */ +void workqueue_node_register(int node); #endif /* __LINUX_MEMORY_HOTPLUG_H */ diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 2fd0bd7..5499b76 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -266,7 +266,7 @@ struct workqueue_struct { static struct kmem_cache *pwq_cache; static cpumask_var_t *wq_numa_possible_cpumask; - /* possible CPUs of each node */ + /* PL: possible CPUs of each node */ static bool wq_disable_numa; module_param_named(disable_numa, wq_disable_numa, bool, 0444); @@ -4559,6 +4559,32 @@ static void restore_unbound_workers_cpumask(struct worker_pool *pool, int cpu) WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask) < 0); } +#ifdef CONFIG_MEMORY_HOTPLUG + +static void reflesh_wq_possible_mask(int cpu, int node) +{ + int oldnode; + for_each_node_state(oldnode, N_POSSIBLE) + cpumask_clear_cpu(cpu, wq_numa_possible_cpumask[oldnode]); You need to check and update all the wq->numa_pwq_tbl[oldnode] you mean : if I drop patch 1/4, it will be required. Right ? I'll add that check. 
Thanks, -Kame + cpumask_set_cpu(cpu, wq_numa_possible_cpumask[node]); +} +/* + * When a cpu is physically added, cpu<->node relationship is established. + * We can catch the whole view when a new NODE_DATA() coming up. + * Update cpumask used for sched affinity of workers. + */ +void workqueue_node_register(int node) +{ + int cpu; + mutex_lock(&wq_pool_mutex); + for_each_possible_cpu(cpu) { + if (node == cpu_to_node(cpu)) + reflesh_wq_possible_mask(cpu, node); + } + mutex_unlock(&wq_pool_mutex); +} + +#endif /* * Workqueues should be brought up before normal priority CPU notifiers. diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 1bf4807..d0c1ebb 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1121,6 +1121,10 @@ static pg_data_t __ref *hotadd_new_pgdat(int nid, u64 start) * online_pages() and offline_pages(). */ reset_node_present_pages(pgdat); + /* +* Update workqueue's cpu/node affinity info. +*/ + workqueue_node_register(nid); return pgdat; } -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/4] workqueue: Fix unbound workqueue's node affinity detection
(2014/12/16 14:30), Lai Jiangshan wrote: On 12/15/2014 07:14 PM, Kamezawa Hiroyuki wrote: Unbound wq pool's node attribute is calculated at its allocation. But it's now calculated based on possible cpu<->node information which can be wrong after cpu hotplug/unplug. If wrong pool->node is set, following allocation error will happen. == SLUB: Unable to allocate memory on node 2 (gfp=0x80d0) cache: kmalloc-192, object size: 192, buffer size: 192, default order: 1, min order: 0 node 0: slabs: 6172, objs: 259224, free: 245741 node 1: slabs: 3261, objs: 136962, free: 127656 == This patch fixes the node detection by making use of online cpu info. Unlike cpumask, the best node can be calculated by degree of overlap between attr->cpumask and numanode->online_cpumask. This change doesn't corrupt original purpose of the old calculation. Note: it's expected that this function is called as pool_detect_best_node get_unbound_pool alloc_unbound_pwq wq_update_unbound_numa called at CPU_ONLINE/CPU_DOWN_PREPARE and the latest online cpu info can be applied to a new wq pool, which replaces old one. Signed-off-by: KAMEZAWA Hiroyuki --- kernel/workqueue.c | 38 ++ 1 file changed, 26 insertions(+), 12 deletions(-) diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 09b685d..7809154 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -3440,6 +3440,31 @@ static void put_unbound_pool(struct worker_pool *pool) } /** + * pool_detect_best_node - detect a node which contains specified cpumask. + * Should be called with wq_pool_mutex held. + * Returns a online node where the most of given cpus are tied to. 
+ */ +static int pool_detect_best_node(const struct cpumask *cpumask) +{ + int node, best, match, selected = NUMA_NO_NODE; + static struct cpumask andmask; /* under wq_pool_mutex */ + + if (!wq_numa_enabled || + cpumask_subset(cpu_online_mask, cpumask)) + goto out; + best = 0; + /* select a node which contains the most number of cpu */ + for_each_node_state(node, N_ONLINE) { + cpumask_and(&andmask, cpumask, cpumask_of_node(node)); + match = cpumask_weight(&andmask); + if (match > best) + selected = best; + } +out: + return selected; +} This is a mixture of fix and development. Why not just keep the original calculation? Just because wq_numa_possible_mask is broken, it shouldn't be used if possible. In this patch series, the mask is updated only when a node comes up. Is it better to clear it at node offline? If the mask covers multiple nodes, NUMA_NO_NODE is the best for pool->node after the pool is created. The memory allocation will then select the best node for manage_workers(), based on which CPU the worker is actually running on. I'll drop this and try to keep the original code as much as possible. Thanks, -Kame
Re: [PATCH 2/4] workqueue: update per-cpu workqueue's node affinity at online/offline
(2014/12/16 14:32), Lai Jiangshan wrote: On 12/15/2014 07:16 PM, Kamezawa Hiroyuki wrote: The percpu workqueue pool are persistend and never be freed. But cpu<->node relationship can be changed by cpu hotplug and pool->node can point to an offlined node. If pool->node points to an offlined node, following allocation failure can happen. == SLUB: Unable to allocate memory on node 2 (gfp=0x80d0) cache: kmalloc-192, object size: 192, buffer size: 192, default order: 1, min order: 0 node 0: slabs: 6172, objs: 259224, free: 245741 node 1: slabs: 3261, objs: 136962, free: 127656 == This patch clears per-cpu workqueue pool's node affinity at cpu offlining and restore it at cpu onlining. Signed-off-by: KAMEZAWA Hiroyuki --- kernel/workqueue.c | 11 ++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 7809154..2fd0bd7 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -4586,6 +4586,11 @@ static int workqueue_cpu_up_callback(struct notifier_block *nfb, case CPU_DOWN_FAILED: case CPU_ONLINE: mutex_lock(&wq_pool_mutex); + /* +* now cpu <-> node info is established. 
update numa node +*/ + for_each_cpu_worker_pool(pool, cpu) + pool->node = cpu_to_node(cpu); for_each_pool(pool, pi) { mutex_lock(&pool->attach_mutex); @@ -4619,6 +4624,7 @@ static int workqueue_cpu_down_callback(struct notifier_block *nfb, int cpu = (unsigned long)hcpu; struct work_struct unbind_work; struct workqueue_struct *wq; + struct worker_pool *pool; switch (action & ~CPU_TASKS_FROZEN) { case CPU_DOWN_PREPARE: @@ -4626,10 +4632,13 @@ static int workqueue_cpu_down_callback(struct notifier_block *nfb, INIT_WORK_ONSTACK(&unbind_work, wq_unbind_fn); queue_work_on(cpu, system_highpri_wq, &unbind_work); - /* update NUMA affinity of unbound workqueues */ mutex_lock(&wq_pool_mutex); + /* update NUMA affinity of unbound workqueues */ list_for_each_entry(wq, &workqueues, list) wq_update_unbound_numa(wq, cpu, false); + /* clear per-cpu workqueues's numa affinity. */ + for_each_cpu_worker_pool(pool, cpu) + pool->node = NUMA_NO_NODE; /* restored at online */ If the node is still online, it is better to keep the original pool->node. Hm, ok, I'll drop this code. Or move it all to the node-online event handler. Thanks, -Kame
[PATCH 4/4] workqueue: Handle cpu-node affinity change at CPU_ONLINE.
This is a corner-case fix. In general, the cpu-to-node relationship is fixed when cpus/memory become visible to the kernel, based on firmware information. But in some cases the cpu-to-node relationship is updated at CPU_ONLINE. In arch/x86/mm/numa.c::numa_init_array(), a cpu will be tied to a random numa node if the node the firmware reports is memory-less. Such a cpu's node affinity can be changed in cpu_up(). For example: step 1. boot with the mem= boot option to hide memory. step 2. online the hidden memory using memory hot-add. step 3. offline cpus. step 4. online cpus. In step 4, a cpu's numa node affinity can be modified if the memory-less node has become a usual node via step 2. This patch handles the event in workqueue's CPU_ONLINE callback. Signed-off-by: KAMEZAWA Hiroyuki --- kernel/workqueue.c | 15 +++ 1 file changed, 15 insertions(+) diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 5499b76..cc0e1d4 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -4583,6 +4583,20 @@ void workqueue_node_register(int node) } mutex_unlock(&wq_pool_mutex); } +/* + * When a cpu boots up on a memory-less node, the cpu is tied to + * a random node (see arch/x86/mm/numa.c::numa_init_array()). + * The relationship is fixed when a proper memory node comes up and + * it's visible after CPU_ONLINE. Check and refresh it.
+ */ +void check_reflesh_node_cpumask(int cpu) +{ + int node = cpu_to_node(cpu); + + if (likely(cpumask_test_cpu(cpu, wq_numa_possible_cpumask[node]))) + return; + reflesh_wq_possible_mask(cpu, node); +} #endif @@ -4617,6 +4631,7 @@ static int workqueue_cpu_up_callback(struct notifier_block *nfb, */ for_each_cpu_worker_pool(pool, cpu) pool->node = cpu_to_node(cpu); + check_reflesh_node_cpumask(cpu); for_each_pool(pool, pi) { mutex_lock(&pool->attach_mutex); -- 1.8.3.1
[PATCH 3/4] workqueue: Update workqueue's possible cpumask when a new node comes up.
Workqueue keeps cpu<->node relationship including all possible cpus. The original information was made at boot but it may change when a new node is added. Update information if a new node is ready with using node-hotplug callback. Signed-off-by: KAMEZAWA Hiroyuki --- include/linux/memory_hotplug.h | 2 ++ kernel/workqueue.c | 28 +++- mm/memory_hotplug.c| 4 3 files changed, 33 insertions(+), 1 deletion(-) diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h index 8f1a419..cd3cb67 100644 --- a/include/linux/memory_hotplug.h +++ b/include/linux/memory_hotplug.h @@ -270,4 +270,6 @@ extern void sparse_remove_one_section(struct zone *zone, struct mem_section *ms) extern struct page *sparse_decode_mem_map(unsigned long coded_mem_map, unsigned long pnum); +/* update for workqueues */ +void workqueue_node_register(int node); #endif /* __LINUX_MEMORY_HOTPLUG_H */ diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 2fd0bd7..5499b76 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -266,7 +266,7 @@ struct workqueue_struct { static struct kmem_cache *pwq_cache; static cpumask_var_t *wq_numa_possible_cpumask; - /* possible CPUs of each node */ + /* PL: possible CPUs of each node */ static bool wq_disable_numa; module_param_named(disable_numa, wq_disable_numa, bool, 0444); @@ -4559,6 +4559,32 @@ static void restore_unbound_workers_cpumask(struct worker_pool *pool, int cpu) WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask) < 0); } +#ifdef CONFIG_MEMORY_HOTPLUG + +static void reflesh_wq_possible_mask(int cpu, int node) +{ + int oldnode; + for_each_node_state(oldnode, N_POSSIBLE) + cpumask_clear_cpu(cpu, wq_numa_possible_cpumask[oldnode]); + cpumask_set_cpu(cpu, wq_numa_possible_cpumask[node]); +} +/* + * When a cpu is physically added, cpu<->node relationship is established. + * We can catch the whole view when a new NODE_DATA() coming up. + * Update cpumask used for sched affinity of workers. 
+ */ +void workqueue_node_register(int node) +{ + int cpu; + mutex_lock(&wq_pool_mutex); + for_each_possible_cpu(cpu) { + if (node == cpu_to_node(cpu)) + reflesh_wq_possible_mask(cpu, node); + } + mutex_unlock(&wq_pool_mutex); +} + +#endif /* * Workqueues should be brought up before normal priority CPU notifiers. diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 1bf4807..d0c1ebb 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1121,6 +1121,10 @@ static pg_data_t __ref *hotadd_new_pgdat(int nid, u64 start) * online_pages() and offline_pages(). */ reset_node_present_pages(pgdat); + /* +* Update workqueue's cpu/node affinity info. +*/ + workqueue_node_register(nid); return pgdat; } -- 1.8.3.1
[PATCH 2/4] workqueue: update per-cpu workqueue's node affinity at online/offline
The per-cpu workqueue pools are persistent and never freed. But the cpu<->node relationship can be changed by cpu hotplug, and pool->node can point to an offlined node. If pool->node points to an offlined node, the following allocation failure can happen. == SLUB: Unable to allocate memory on node 2 (gfp=0x80d0) cache: kmalloc-192, object size: 192, buffer size: 192, default order: 1, min order: 0 node 0: slabs: 6172, objs: 259224, free: 245741 node 1: slabs: 3261, objs: 136962, free: 127656 == This patch clears a per-cpu workqueue pool's node affinity at cpu offlining and restores it at cpu onlining. Signed-off-by: KAMEZAWA Hiroyuki --- kernel/workqueue.c | 11 ++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 7809154..2fd0bd7 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -4586,6 +4586,11 @@ static int workqueue_cpu_up_callback(struct notifier_block *nfb, case CPU_DOWN_FAILED: case CPU_ONLINE: mutex_lock(&wq_pool_mutex); + /* +* now cpu <-> node info is established. update numa node +*/ + for_each_cpu_worker_pool(pool, cpu) + pool->node = cpu_to_node(cpu); for_each_pool(pool, pi) { mutex_lock(&pool->attach_mutex); @@ -4619,6 +4624,7 @@ static int workqueue_cpu_down_callback(struct notifier_block *nfb, int cpu = (unsigned long)hcpu; struct work_struct unbind_work; struct workqueue_struct *wq; + struct worker_pool *pool; switch (action & ~CPU_TASKS_FROZEN) { case CPU_DOWN_PREPARE: @@ -4626,10 +4632,13 @@ static int workqueue_cpu_down_callback(struct notifier_block *nfb, INIT_WORK_ONSTACK(&unbind_work, wq_unbind_fn); queue_work_on(cpu, system_highpri_wq, &unbind_work); - /* update NUMA affinity of unbound workqueues */ mutex_lock(&wq_pool_mutex); + /* update NUMA affinity of unbound workqueues */ list_for_each_entry(wq, &workqueues, list) wq_update_unbound_numa(wq, cpu, false); + /* clear per-cpu workqueues's numa affinity.
*/ + for_each_cpu_worker_pool(pool, cpu) + pool->node = NUMA_NO_NODE; /* restored at online */ mutex_unlock(&wq_pool_mutex); /* wait for per-cpu unbinding to finish */ -- 1.8.3.1
[PATCH 1/4] workqueue: Fix unbound workqueue's node affinity detection
Unbound wq pool's node attribute is calculated at its allocation. But it's now calculated based on possible cpu<->node information which can be wrong after cpu hotplug/unplug. If wrong pool->node is set, following allocation error will happen. == SLUB: Unable to allocate memory on node 2 (gfp=0x80d0) cache: kmalloc-192, object size: 192, buffer size: 192, default order: 1, min order: 0 node 0: slabs: 6172, objs: 259224, free: 245741 node 1: slabs: 3261, objs: 136962, free: 127656 == This patch fixes the node detection by making use of online cpu info. Unlike cpumask, the best node can be calculated by degree of overlap between attr->cpumask and numanode->online_cpumask. This change doesn't corrupt original purpose of the old calculation. Note: it's expected that this function is called as pool_detect_best_node get_unbound_pool alloc_unbound_pwq wq_update_unbound_numa called at CPU_ONLINE/CPU_DOWN_PREPARE and the latest online cpu info can be applied to a new wq pool, which replaces old one. Signed-off-by: KAMEZAWA Hiroyuki --- kernel/workqueue.c | 38 ++ 1 file changed, 26 insertions(+), 12 deletions(-) diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 09b685d..7809154 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -3440,6 +3440,31 @@ static void put_unbound_pool(struct worker_pool *pool) } /** + * pool_detect_best_node - detect a node which contains specified cpumask. + * Should be called with wq_pool_mutex held. + * Returns a online node where the most of given cpus are tied to. 
+ */ +static int pool_detect_best_node(const struct cpumask *cpumask) +{ + int node, best, match, selected = NUMA_NO_NODE; + static struct cpumask andmask; /* under wq_pool_mutex */ + + if (!wq_numa_enabled || + cpumask_subset(cpu_online_mask, cpumask)) + goto out; + best = 0; + /* select a node which contains the most number of cpu */ + for_each_node_state(node, N_ONLINE) { + cpumask_and(&andmask, cpumask, cpumask_of_node(node)); + match = cpumask_weight(&andmask); + if (match > best) + selected = best; + } +out: + return selected; +} + +/** * get_unbound_pool - get a worker_pool with the specified attributes * @attrs: the attributes of the worker_pool to get * @@ -3457,7 +3482,6 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs) { u32 hash = wqattrs_hash(attrs); struct worker_pool *pool; - int node; lockdep_assert_held(&wq_pool_mutex); @@ -3482,17 +3506,7 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs) * 'struct workqueue_attrs' comments for detail. */ pool->attrs->no_numa = false; - - /* if cpumask is contained inside a NUMA node, we belong to that node */ - if (wq_numa_enabled) { - for_each_node(node) { - if (cpumask_subset(pool->attrs->cpumask, - wq_numa_possible_cpumask[node])) { - pool->node = node; - break; - } - } - } + pool->node = pool_detect_best_node(pool->attrs->cpumask); if (worker_pool_assign_id(pool) < 0) goto fail; -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 0/4] workqueue: fix memory allocation after numa mapping is changed v3
Lai-san, Tejun-san, thank you for the review; this is fix v3. It has been tested on a NUMA node hotplug machine and seems to work well. The problem is a memory allocation failure caused by pool->node information going stale after node hotplug. Patches 1-2 fix the pool->node calculation; patches 3-4 update the cpumask calculation. (Fixing the memory allocation bug only requires patches 1-2, but the cpumask should be updated as well, I think.) Changelog since v2: - reordered and split the patches, one per problem case. - removed unnecessary calls pointed out by Lai-san. - restore the node/cpu relationship when a new node comes online. - handle a corner case at CPU_ONLINE. 1/4 ... fix unbound workqueue's memory node affinity calculation. 2/4 ... update per-cpu workqueue's memory node affinity at online/offline. 3/4 ... update workqueue's possible cpumask when a new node comes online. 4/4 ... handle cpu-node affinity change at CPU_ONLINE. Thanks, -Kame
Re: [PATCH 4/4] workqueue: handle change in cpu-node relationship.
(2014/12/15 14:19), Lai Jiangshan wrote: On 12/15/2014 12:04 PM, Kamezawa Hiroyuki wrote: (2014/12/15 12:34), Lai Jiangshan wrote: On 12/15/2014 10:55 AM, Kamezawa Hiroyuki wrote: (2014/12/15 11:48), Lai Jiangshan wrote: On 12/15/2014 10:20 AM, Kamezawa Hiroyuki wrote: (2014/12/15 11:12), Lai Jiangshan wrote: On 12/14/2014 12:38 AM, Kamezawa Hiroyuki wrote: Although workqueue detects relationship between cpu<->node at boot, it is finally determined in cpu_up(). This patch tries to update pool->node using online status of cpus. 1. When a node goes down, clear per-cpu pool's node attr. 2. When a cpu comes up, update per-cpu pool's node attr. 3. When a cpu comes up, update possinle node cpumask workqueue is using for sched. 4. Detect the best node for unbound pool's cpumask using the latest info. Signed-off-by: KAMEZAWA Hiroyuki --- kernel/workqueue.c | 67 ++ 1 file changed, 53 insertions(+), 14 deletions(-) diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 07b4eb5..259b3ba 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -266,7 +266,8 @@ struct workqueue_struct { static struct kmem_cache *pwq_cache; static cpumask_var_t *wq_numa_possible_cpumask; -/* possible CPUs of each node */ +/* possible CPUs of each node initialized with possible info at boot. + but modified at cpu hotplug to be adjusted to real info. */ static bool wq_disable_numa; module_param_named(disable_numa, wq_disable_numa, bool, 0444); @@ -3449,6 +3450,31 @@ static void put_unbound_pool(struct worker_pool *pool) call_rcu_sched(&pool->rcu, rcu_free_pool); } +/* + * detect best node for given cpumask. + */ +static int pool_detect_best_node(const struct cpumask *cpumask) +{ +int node, best, match, selected; +static struct cpumask andmask; /* we're under mutex */ + +/* Is any node okay ? 
*/ +if (!wq_numa_enabled || +cpumask_subset(cpu_online_mask, cpumask)) +return NUMA_NO_NODE; +best = 0; +selected = NUMA_NO_NODE; +/* select a node which contains the most cpu of cpumask */ +for_each_node_state(node, N_ONLINE) { +cpumask_and(&andmask, cpumask, cpumask_of_node(node)); +match = cpumask_weight(&andmask); +if (match > best) +selected = node; +} +return selected; +} + + /** * get_unbound_pool - get a worker_pool with the specified attributes * @attrs: the attributes of the worker_pool to get @@ -3467,7 +3493,6 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs) { u32 hash = wqattrs_hash(attrs); struct worker_pool *pool; -int node; lockdep_assert_held(&wq_pool_mutex); @@ -3492,17 +3517,7 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs) * 'struct workqueue_attrs' comments for detail. */ pool->attrs->no_numa = false; - -/* if cpumask is contained inside a NUMA node, we belong to that node */ -if (wq_numa_enabled) { -for_each_node(node) { -if (cpumask_subset(pool->attrs->cpumask, - wq_numa_possible_cpumask[node])) { -pool->node = node; -break; -} -} -} +pool->node = pool_detect_best_node(pool->attrs->cpumask); if (worker_pool_assign_id(pool) < 0) goto fail; @@ -4567,7 +4582,7 @@ static int workqueue_cpu_up_callback(struct notifier_block *nfb, int cpu = (unsigned long)hcpu; struct worker_pool *pool; struct workqueue_struct *wq; -int pi; +int pi, node; switch (action & ~CPU_TASKS_FROZEN) { case CPU_UP_PREPARE: @@ -4583,6 +4598,16 @@ static int workqueue_cpu_up_callback(struct notifier_block *nfb, case CPU_ONLINE: mutex_lock(&wq_pool_mutex); +/* now cpu <-> node info is established, update the info. */ +if (!wq_disable_numa) { +for_each_node_state(node, N_POSSIBLE) +cpumask_clear_cpu(cpu, +wq_numa_possible_cpumask[node]); The wq code try to reuse the origin pwqs/pools when the node still have cpu online. 
these 3 lines of code will cause the original pwqs/pools to be put on the road to dying, and create a new set of pwqs/pools. Because the result of wq_calc_node_cpumask() changes? Yes. Do you mean some comment should be added here, or an explanation for your reply to [3/4]? This fix [4/4] breaks the original design. I'm sorry that I can't understand what this patch breaks. The pwqs/pools should be kept if the node still has CPUs online. So, the fix's grand design should be 1. drop old pwqs/pools only at node offline. 2. set proper
Re: [PATCH 4/4] workqueue: handle change in cpu-node relationship.
(2014/12/15 12:34), Lai Jiangshan wrote: On 12/15/2014 10:55 AM, Kamezawa Hiroyuki wrote: (2014/12/15 11:48), Lai Jiangshan wrote: On 12/15/2014 10:20 AM, Kamezawa Hiroyuki wrote: (2014/12/15 11:12), Lai Jiangshan wrote: On 12/14/2014 12:38 AM, Kamezawa Hiroyuki wrote: Although workqueue detects relationship between cpu<->node at boot, it is finally determined in cpu_up(). This patch tries to update pool->node using online status of cpus. 1. When a node goes down, clear per-cpu pool's node attr. 2. When a cpu comes up, update per-cpu pool's node attr. 3. When a cpu comes up, update possinle node cpumask workqueue is using for sched. 4. Detect the best node for unbound pool's cpumask using the latest info. Signed-off-by: KAMEZAWA Hiroyuki --- kernel/workqueue.c | 67 ++ 1 file changed, 53 insertions(+), 14 deletions(-) diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 07b4eb5..259b3ba 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -266,7 +266,8 @@ struct workqueue_struct { static struct kmem_cache *pwq_cache; static cpumask_var_t *wq_numa_possible_cpumask; -/* possible CPUs of each node */ +/* possible CPUs of each node initialized with possible info at boot. + but modified at cpu hotplug to be adjusted to real info. */ static bool wq_disable_numa; module_param_named(disable_numa, wq_disable_numa, bool, 0444); @@ -3449,6 +3450,31 @@ static void put_unbound_pool(struct worker_pool *pool) call_rcu_sched(&pool->rcu, rcu_free_pool); } +/* + * detect best node for given cpumask. + */ +static int pool_detect_best_node(const struct cpumask *cpumask) +{ +int node, best, match, selected; +static struct cpumask andmask; /* we're under mutex */ + +/* Is any node okay ? 
*/ +if (!wq_numa_enabled || +cpumask_subset(cpu_online_mask, cpumask)) +return NUMA_NO_NODE; +best = 0; +selected = NUMA_NO_NODE; +/* select a node which contains the most cpu of cpumask */ +for_each_node_state(node, N_ONLINE) { +cpumask_and(&andmask, cpumask, cpumask_of_node(node)); +match = cpumask_weight(&andmask); +if (match > best) +selected = node; +} +return selected; +} + + /** * get_unbound_pool - get a worker_pool with the specified attributes * @attrs: the attributes of the worker_pool to get @@ -3467,7 +3493,6 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs) { u32 hash = wqattrs_hash(attrs); struct worker_pool *pool; -int node; lockdep_assert_held(&wq_pool_mutex); @@ -3492,17 +3517,7 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs) * 'struct workqueue_attrs' comments for detail. */ pool->attrs->no_numa = false; - -/* if cpumask is contained inside a NUMA node, we belong to that node */ -if (wq_numa_enabled) { -for_each_node(node) { -if (cpumask_subset(pool->attrs->cpumask, - wq_numa_possible_cpumask[node])) { -pool->node = node; -break; -} -} -} +pool->node = pool_detect_best_node(pool->attrs->cpumask); if (worker_pool_assign_id(pool) < 0) goto fail; @@ -4567,7 +4582,7 @@ static int workqueue_cpu_up_callback(struct notifier_block *nfb, int cpu = (unsigned long)hcpu; struct worker_pool *pool; struct workqueue_struct *wq; -int pi; +int pi, node; switch (action & ~CPU_TASKS_FROZEN) { case CPU_UP_PREPARE: @@ -4583,6 +4598,16 @@ static int workqueue_cpu_up_callback(struct notifier_block *nfb, case CPU_ONLINE: mutex_lock(&wq_pool_mutex); +/* now cpu <-> node info is established, update the info. */ +if (!wq_disable_numa) { +for_each_node_state(node, N_POSSIBLE) +cpumask_clear_cpu(cpu, +wq_numa_possible_cpumask[node]); The wq code try to reuse the origin pwqs/pools when the node still have cpu online. 
these 3 lines of code will cause the original pwqs/pools to be put on the road to dying, and create a new set of pwqs/pools. Because the result of wq_calc_node_cpumask() changes? Yes. Do you mean some comment should be added here, or an explanation for your reply to [3/4]? This fix [4/4] breaks the original design. I'm sorry that I can't understand what this patch breaks. The pwqs/pools should be kept if the node still has CPUs online. So, the fix's grand design should be 1. drop old pwqs/pools only at node offline. 2. set proper pool->node based on online node info. 3. update pool->node of per-cpu pools at CPU_ONLINE. Hm. (1) is done because cpumask_of_n
Re: [PATCH 4/4] workqueue: handle change in cpu-node relationship.
(2014/12/15 11:48), Lai Jiangshan wrote: On 12/15/2014 10:20 AM, Kamezawa Hiroyuki wrote: (2014/12/15 11:12), Lai Jiangshan wrote: On 12/14/2014 12:38 AM, Kamezawa Hiroyuki wrote: Although workqueue detects relationship between cpu<->node at boot, it is finally determined in cpu_up(). This patch tries to update pool->node using online status of cpus. 1. When a node goes down, clear per-cpu pool's node attr. 2. When a cpu comes up, update per-cpu pool's node attr. 3. When a cpu comes up, update possinle node cpumask workqueue is using for sched. 4. Detect the best node for unbound pool's cpumask using the latest info. Signed-off-by: KAMEZAWA Hiroyuki --- kernel/workqueue.c | 67 ++ 1 file changed, 53 insertions(+), 14 deletions(-) diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 07b4eb5..259b3ba 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -266,7 +266,8 @@ struct workqueue_struct { static struct kmem_cache *pwq_cache; static cpumask_var_t *wq_numa_possible_cpumask; -/* possible CPUs of each node */ +/* possible CPUs of each node initialized with possible info at boot. + but modified at cpu hotplug to be adjusted to real info. */ static bool wq_disable_numa; module_param_named(disable_numa, wq_disable_numa, bool, 0444); @@ -3449,6 +3450,31 @@ static void put_unbound_pool(struct worker_pool *pool) call_rcu_sched(&pool->rcu, rcu_free_pool); } +/* + * detect best node for given cpumask. + */ +static int pool_detect_best_node(const struct cpumask *cpumask) +{ +int node, best, match, selected; +static struct cpumask andmask; /* we're under mutex */ + +/* Is any node okay ? 
*/ +if (!wq_numa_enabled || +cpumask_subset(cpu_online_mask, cpumask)) +return NUMA_NO_NODE; +best = 0; +selected = NUMA_NO_NODE; +/* select a node which contains the most cpu of cpumask */ +for_each_node_state(node, N_ONLINE) { +cpumask_and(&andmask, cpumask, cpumask_of_node(node)); +match = cpumask_weight(&andmask); +if (match > best) +selected = node; +} +return selected; +} + + /** * get_unbound_pool - get a worker_pool with the specified attributes * @attrs: the attributes of the worker_pool to get @@ -3467,7 +3493,6 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs) { u32 hash = wqattrs_hash(attrs); struct worker_pool *pool; -int node; lockdep_assert_held(&wq_pool_mutex); @@ -3492,17 +3517,7 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs) * 'struct workqueue_attrs' comments for detail. */ pool->attrs->no_numa = false; - -/* if cpumask is contained inside a NUMA node, we belong to that node */ -if (wq_numa_enabled) { -for_each_node(node) { -if (cpumask_subset(pool->attrs->cpumask, - wq_numa_possible_cpumask[node])) { -pool->node = node; -break; -} -} -} +pool->node = pool_detect_best_node(pool->attrs->cpumask); if (worker_pool_assign_id(pool) < 0) goto fail; @@ -4567,7 +4582,7 @@ static int workqueue_cpu_up_callback(struct notifier_block *nfb, int cpu = (unsigned long)hcpu; struct worker_pool *pool; struct workqueue_struct *wq; -int pi; +int pi, node; switch (action & ~CPU_TASKS_FROZEN) { case CPU_UP_PREPARE: @@ -4583,6 +4598,16 @@ static int workqueue_cpu_up_callback(struct notifier_block *nfb, case CPU_ONLINE: mutex_lock(&wq_pool_mutex); +/* now cpu <-> node info is established, update the info. */ +if (!wq_disable_numa) { +for_each_node_state(node, N_POSSIBLE) +cpumask_clear_cpu(cpu, +wq_numa_possible_cpumask[node]); The wq code try to reuse the origin pwqs/pools when the node still have cpu online. 
these 3 lines of code will cause the origin pwqs/pools to be on the road of dying, and create a new set of pwqs/pools. because the result of wq_calc_node_cpumask() changes ? Yes. Do you mean some comment should be added here ? or an explanation for your reply for [3/4] ? this fix [4/4] breaks the original design. I'm sorry that I can't understand what this patch breaks. Do you mean it's better to work with a broken wq_numa_possible_cpumask ? I guess removing wq_numa_possible_cpumask entirely may be the best way, by using the online_mask of the node.

Thanks,
-Kame
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
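The node-selection loop in the quoted pool_detect_best_node() can be sketched in plain C, with integer bitmasks standing in for cpumasks. Note that the loop in the posted patch appears never to update `best` after finding a better `match`, so every node with a nonzero overlap passes the `match > best` test; the sketch below includes the `best = match;` update that seems to be intended. This is an illustrative userspace sketch, not kernel code: `NR_NODES`, the value of `NUMA_NO_NODE`, and the popcount helper are assumptions made here.

```c
#include <stdint.h>

#define NUMA_NO_NODE (-1)
#define NR_NODES 4

/* Count set bits: stands in for cpumask_weight(). */
static int weight(uint64_t mask)
{
    int n = 0;

    for (; mask; mask &= mask - 1)
        n++;
    return n;
}

/*
 * Pick the online node whose cpumask overlaps the given cpumask the most,
 * mirroring the loop in pool_detect_best_node().  node_cpumask[] stands in
 * for cpumask_of_node(); online is a bitmask of online nodes.
 */
static int pool_detect_best_node(uint64_t cpumask,
                                 const uint64_t node_cpumask[NR_NODES],
                                 unsigned online)
{
    int node, best = 0, selected = NUMA_NO_NODE;

    for (node = 0; node < NR_NODES; node++) {
        int match;

        if (!(online & (1u << node)))
            continue;
        match = weight(cpumask & node_cpumask[node]);
        if (match > best) {
            best = match;   /* this update appears to be missing in the quoted patch */
            selected = node;
        }
    }
    return selected;
}
```

With the update in place, ties keep the first (lowest-numbered) node, and a cpumask overlapping no online node yields NUMA_NO_NODE, matching the fallback behavior discussed in the thread.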
Re: [PATCH 4/4] workqueue: handle change in cpu-node relationship.
(2014/12/15 11:12), Lai Jiangshan wrote: On 12/14/2014 12:38 AM, Kamezawa Hiroyuki wrote: Although workqueue detects relationship between cpu<->node at boot, it is finally determined in cpu_up(). This patch tries to update pool->node using online status of cpus. 1. When a node goes down, clear per-cpu pool's node attr. 2. When a cpu comes up, update per-cpu pool's node attr. 3. When a cpu comes up, update possinle node cpumask workqueue is using for sched. 4. Detect the best node for unbound pool's cpumask using the latest info. Signed-off-by: KAMEZAWA Hiroyuki --- kernel/workqueue.c | 67 ++ 1 file changed, 53 insertions(+), 14 deletions(-) diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 07b4eb5..259b3ba 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -266,7 +266,8 @@ struct workqueue_struct { static struct kmem_cache *pwq_cache; static cpumask_var_t *wq_numa_possible_cpumask; - /* possible CPUs of each node */ + /* possible CPUs of each node initialized with possible info at boot. + but modified at cpu hotplug to be adjusted to real info. */ static bool wq_disable_numa; module_param_named(disable_numa, wq_disable_numa, bool, 0444); @@ -3449,6 +3450,31 @@ static void put_unbound_pool(struct worker_pool *pool) call_rcu_sched(&pool->rcu, rcu_free_pool); } +/* + * detect best node for given cpumask. + */ +static int pool_detect_best_node(const struct cpumask *cpumask) +{ + int node, best, match, selected; + static struct cpumask andmask; /* we're under mutex */ + + /* Is any node okay ? 
*/ + if (!wq_numa_enabled || + cpumask_subset(cpu_online_mask, cpumask)) + return NUMA_NO_NODE; + best = 0; + selected = NUMA_NO_NODE; + /* select a node which contains the most cpu of cpumask */ + for_each_node_state(node, N_ONLINE) { + cpumask_and(&andmask, cpumask, cpumask_of_node(node)); + match = cpumask_weight(&andmask); + if (match > best) + selected = node; + } + return selected; +} + + /** * get_unbound_pool - get a worker_pool with the specified attributes * @attrs: the attributes of the worker_pool to get @@ -3467,7 +3493,6 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs) { u32 hash = wqattrs_hash(attrs); struct worker_pool *pool; - int node; lockdep_assert_held(&wq_pool_mutex); @@ -3492,17 +3517,7 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs) * 'struct workqueue_attrs' comments for detail. */ pool->attrs->no_numa = false; - - /* if cpumask is contained inside a NUMA node, we belong to that node */ - if (wq_numa_enabled) { - for_each_node(node) { - if (cpumask_subset(pool->attrs->cpumask, - wq_numa_possible_cpumask[node])) { - pool->node = node; - break; - } - } - } + pool->node = pool_detect_best_node(pool->attrs->cpumask); if (worker_pool_assign_id(pool) < 0) goto fail; @@ -4567,7 +4582,7 @@ static int workqueue_cpu_up_callback(struct notifier_block *nfb, int cpu = (unsigned long)hcpu; struct worker_pool *pool; struct workqueue_struct *wq; - int pi; + int pi, node; switch (action & ~CPU_TASKS_FROZEN) { case CPU_UP_PREPARE: @@ -4583,6 +4598,16 @@ static int workqueue_cpu_up_callback(struct notifier_block *nfb, case CPU_ONLINE: mutex_lock(&wq_pool_mutex); + /* now cpu <-> node info is established, update the info. */ + if (!wq_disable_numa) { + for_each_node_state(node, N_POSSIBLE) + cpumask_clear_cpu(cpu, + wq_numa_possible_cpumask[node]); The wq code try to reuse the origin pwqs/pools when the node still have cpu online. 
these 3 lines of code will cause the origin pwqs/pools to be on the road of dying, and create a new set of pwqs/pools. because the result of wq_calc_node_cpumask() changes ? Do you mean some comment should be added here ? or an explanation for your reply for [3/4] ?

Thanks,
-Kame

+ node = cpu_to_node(cpu); + cpumask_set_cpu(cpu, wq_numa_possible_cpumask[node]); + } + for_each_cpu_worker_pool(pool, cpu) + pool->node = cpu_to_node(cpu); for_each_pool(pool, pi) {
Re: [PATCH 3/4] workqueue: remove per-node unbound pool when node goes offline.
(2014/12/15 11:06), Lai Jiangshan wrote: On 12/14/2014 12:35 AM, Kamezawa Hiroyuki wrote: remove node aware unbound pools if a node goes offline. Scan unbound workqueues and remove the numa affine pool when a node goes offline. Signed-off-by: KAMEZAWA Hiroyuki --- kernel/workqueue.c | 29 + 1 file changed, 29 insertions(+) diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 35f4f00..07b4eb5 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -4921,11 +4921,40 @@ early_initcall(init_workqueues); * cached pools per cpu should be freed at node unplug */ +/* + * Replace per-node pwq with dfl_pwq because this node disappears. + * The new pool will be set at CPU_ONLINE by wq_update_unbound_numa. + */ +static void wq_release_unbound_numa(struct workqueue_struct *wq, int nid) +{ + struct pool_workqueue *old_pwq = NULL; + + if (!wq_numa_enabled || !(wq->flags & WQ_UNBOUND)) + return; + mutex_lock(&wq->mutex); + if (wq->unbound_attrs->no_numa) + goto out_unlock; + spin_lock_irq(&wq->dfl_pwq->pool->lock); + get_pwq(wq->dfl_pwq); + spin_unlock_irq(&wq->dfl_pwq->pool->lock); + old_pwq = numa_pwq_tbl_install(wq, nid, wq->dfl_pwq); +out_unlock: + mutex_unlock(&wq->mutex); + put_pwq_unlocked(old_pwq); + return; +}

We have already done it in wq_update_unbound_numa().

Ah, hm. CPU_OFFLINE event does this, ok.

Thanks,
-Kame

+ void workqueue_register_numanode(int nid) { } void workqueue_unregister_numanode(int nid) { + struct workqueue_struct *wq; + + mutex_lock(&wq_pool_mutex); + list_for_each_entry(wq, &workqueues, list) + wq_release_unbound_numa(wq, nid); + mutex_unlock(&wq_pool_mutex); } #endif
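wq_release_unbound_numa() above follows a common replace-and-release pattern: pin the fallback object, swap it into the per-node slot, then drop the reference that the slot held on the displaced object. A userspace sketch of that pattern, with plain counters instead of pwq locking (the names and the simplified refcounting here are illustrative, not the kernel API):

```c
#include <stddef.h>

struct pwq {
    int refcnt;
};

static void get_pwq(struct pwq *p) { p->refcnt++; }
static void put_pwq(struct pwq *p) { p->refcnt--; }

/*
 * Point slot at the default pwq and release the reference the slot held
 * on whatever was installed before, mirroring the
 * get_pwq(dfl_pwq) / numa_pwq_tbl_install() / put_pwq_unlocked(old_pwq)
 * sequence in the patch.
 */
static struct pwq *release_node_slot(struct pwq **slot, struct pwq *dfl)
{
    struct pwq *old = *slot;

    get_pwq(dfl);      /* the slot now owns a reference on dfl */
    *slot = dfl;
    if (old)
        put_pwq(old);  /* drop the slot's reference on the old pwq */
    return old;
}
```

Taking the new reference before installing, and dropping the old one only afterwards, ensures the slot never points at an object without holding a reference on it.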
[PATCH 4/4] workqueue: handle change in cpu-node relationship.
Although workqueue detects relationship between cpu<->node at boot, it is finally determined in cpu_up(). This patch tries to update pool->node using online status of cpus. 1. When a node goes down, clear per-cpu pool's node attr. 2. When a cpu comes up, update per-cpu pool's node attr. 3. When a cpu comes up, update possible node cpumask workqueue is using for sched. 4. Detect the best node for unbound pool's cpumask using the latest info. Signed-off-by: KAMEZAWA Hiroyuki --- kernel/workqueue.c | 67 ++ 1 file changed, 53 insertions(+), 14 deletions(-) diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 07b4eb5..259b3ba 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -266,7 +266,8 @@ struct workqueue_struct { static struct kmem_cache *pwq_cache; static cpumask_var_t *wq_numa_possible_cpumask; - /* possible CPUs of each node */ + /* possible CPUs of each node initialized with possible info at boot. + but modified at cpu hotplug to be adjusted to real info. */ static bool wq_disable_numa; module_param_named(disable_numa, wq_disable_numa, bool, 0444); @@ -3449,6 +3450,31 @@ static void put_unbound_pool(struct worker_pool *pool) call_rcu_sched(&pool->rcu, rcu_free_pool); } +/* + * detect best node for given cpumask. + */ +static int pool_detect_best_node(const struct cpumask *cpumask) +{ + int node, best, match, selected; + static struct cpumask andmask; /* we're under mutex */ + + /* Is any node okay ? 
*/ + if (!wq_numa_enabled || + cpumask_subset(cpu_online_mask, cpumask)) + return NUMA_NO_NODE; + best = 0; + selected = NUMA_NO_NODE; + /* select a node which contains the most cpu of cpumask */ + for_each_node_state(node, N_ONLINE) { + cpumask_and(&andmask, cpumask, cpumask_of_node(node)); + match = cpumask_weight(&andmask); + if (match > best) + selected = node; + } + return selected; +} + + /** * get_unbound_pool - get a worker_pool with the specified attributes * @attrs: the attributes of the worker_pool to get @@ -3467,7 +3493,6 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs) { u32 hash = wqattrs_hash(attrs); struct worker_pool *pool; - int node; lockdep_assert_held(&wq_pool_mutex); @@ -3492,17 +3517,7 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs) * 'struct workqueue_attrs' comments for detail. */ pool->attrs->no_numa = false; - - /* if cpumask is contained inside a NUMA node, we belong to that node */ - if (wq_numa_enabled) { - for_each_node(node) { - if (cpumask_subset(pool->attrs->cpumask, - wq_numa_possible_cpumask[node])) { - pool->node = node; - break; - } - } - } + pool->node = pool_detect_best_node(pool->attrs->cpumask); if (worker_pool_assign_id(pool) < 0) goto fail; @@ -4567,7 +4582,7 @@ static int workqueue_cpu_up_callback(struct notifier_block *nfb, int cpu = (unsigned long)hcpu; struct worker_pool *pool; struct workqueue_struct *wq; - int pi; + int pi, node; switch (action & ~CPU_TASKS_FROZEN) { case CPU_UP_PREPARE: @@ -4583,6 +4598,16 @@ static int workqueue_cpu_up_callback(struct notifier_block *nfb, case CPU_ONLINE: mutex_lock(&wq_pool_mutex); + /* now cpu <-> node info is established, update the info. 
*/ + if (!wq_disable_numa) { + for_each_node_state(node, N_POSSIBLE) + cpumask_clear_cpu(cpu, + wq_numa_possible_cpumask[node]); + node = cpu_to_node(cpu); + cpumask_set_cpu(cpu, wq_numa_possible_cpumask[node]); + } + for_each_cpu_worker_pool(pool, cpu) + pool->node = cpu_to_node(cpu); for_each_pool(pool, pi) { mutex_lock(&pool->attach_mutex); @@ -4951,7 +4976,21 @@ void workqueue_register_numanode(int nid) void workqueue_unregister_numanode(int nid) { struct workqueue_struct *wq; + const struct cpumask *nodecpumask; + struct worker_pool *pool; + int cpu; + /* at this point, cpu-to-node relationship is not lost */ + nodecpumask = cpumask_of_node(nid); + for_each_cpu(cpu, nodecpumask) { + /* +* pool is allcated at boot an
[PATCH 3/4] workqueue: remove per-node unbound pool when node goes offline.
remove node aware unbound pools if a node goes offline. Scan unbound workqueues and remove the numa affine pool when a node goes offline. Signed-off-by: KAMEZAWA Hiroyuki --- kernel/workqueue.c | 29 + 1 file changed, 29 insertions(+) diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 35f4f00..07b4eb5 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -4921,11 +4921,40 @@ early_initcall(init_workqueues); * cached pools per cpu should be freed at node unplug */ +/* + * Replace per-node pwq with dfl_pwq because this node disappears. + * The new pool will be set at CPU_ONLINE by wq_update_unbound_numa. + */ +static void wq_release_unbound_numa(struct workqueue_struct *wq, int nid) +{ + struct pool_workqueue *old_pwq = NULL; + + if (!wq_numa_enabled || !(wq->flags & WQ_UNBOUND)) + return; + mutex_lock(&wq->mutex); + if (wq->unbound_attrs->no_numa) + goto out_unlock; + spin_lock_irq(&wq->dfl_pwq->pool->lock); + get_pwq(wq->dfl_pwq); + spin_unlock_irq(&wq->dfl_pwq->pool->lock); + old_pwq = numa_pwq_tbl_install(wq, nid, wq->dfl_pwq); +out_unlock: + mutex_unlock(&wq->mutex); + put_pwq_unlocked(old_pwq); + return; +} + void workqueue_register_numanode(int nid) { } void workqueue_unregister_numanode(int nid) { + struct workqueue_struct *wq; + + mutex_lock(&wq_pool_mutex); + list_for_each_entry(wq, &workqueues, list) + wq_release_unbound_numa(wq, nid); + mutex_unlock(&wq_pool_mutex); } #endif -- 1.8.3.1
[PATCH 0/4] workqueue: fix bug when numa mapping is changed v2.
Yasuaki Ishimatsu hit an allocation failure bug when the numa mapping between CPU and node is changed. This was the last scene:

SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
cache: kmalloc-192, object size: 192, buffer size: 192, default order: 1, min order: 0
node 0: slabs: 6172, objs: 259224, free: 245741
node 1: slabs: 3261, objs: 136962, free: 127656

Yasuaki and I have a host which supports node hotplug; this is my fix. I tested several hotplug patterns and found no issues so far. Of course I read Lai's patch and Tejun's comments, and I hope I have reflected them.

1/4 ... add node-hotplug event callback.
2/4 ... add a sanity check (for debug)
3/4 ... remove per-node unbound workqueue if node goes offline.
4/4 ... update per-cpu pool's information and cpumasks, node information based on the latest (cpu, node) information.

Thanks,
-Kame
[PATCH 2/4] workqueue: add warning if pool->node is offline
Add a warning if pool->node is offline. This patch was originally made for debugging; I think adding a warning here shows what may happen. Signed-off-by: KAMEZAWA Hiroyuki --- kernel/workqueue.c | 16 +--- 1 file changed, 13 insertions(+), 3 deletions(-) diff --git a/kernel/workqueue.c b/kernel/workqueue.c index f6cb357c..35f4f00 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -760,6 +760,15 @@ static bool too_many_workers(struct worker_pool *pool) return nr_idle > 2 && (nr_idle - 2) * MAX_IDLE_WORKERS_RATIO >= nr_busy; } +/* Check pool->node */ +static int pool_to_node(struct worker_pool *pool) +{ + if (WARN_ON_ONCE(pool->node != NUMA_NO_NODE + && !node_online(pool->node))) + return NUMA_NO_NODE; + return pool->node; +} + /* * Wake up functions. */ @@ -1679,7 +1688,7 @@ static struct worker *create_worker(struct worker_pool *pool) if (id < 0) goto fail; - worker = alloc_worker(pool->node); + worker = alloc_worker(pool_to_node(pool)); if (!worker) goto fail; @@ -1692,7 +1701,8 @@ static struct worker *create_worker(struct worker_pool *pool) else snprintf(id_buf, sizeof(id_buf), "u%d:%d", pool->id, id); - worker->task = kthread_create_on_node(worker_thread, worker, pool->node, + worker->task = kthread_create_on_node(worker_thread, worker, + pool_to_node(pool), "kworker/%s", id_buf); if (IS_ERR(worker->task)) goto fail; @@ -3651,7 +3661,7 @@ static struct pool_workqueue *alloc_unbound_pwq(struct workqueue_struct *wq, if (!pool) return NULL; - pwq = kmem_cache_alloc_node(pwq_cache, GFP_KERNEL, pool->node); + pwq = kmem_cache_alloc_node(pwq_cache, GFP_KERNEL, pool_to_node(pool)); if (!pwq) { put_unbound_pool(pool); return NULL; -- 1.8.3.1
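pool_to_node() in the patch above is a defensive accessor: it warns once if the cached pool->node has gone stale (the node was offlined) and degrades to NUMA_NO_NODE so allocations fall back to any node instead of targeting a dead one. A userspace sketch of the same guard; node_online() and the WARN_ON_ONCE are stubbed here, and the stand-in names are assumptions:

```c
#include <stdio.h>

#define NUMA_NO_NODE (-1)

/* Stub: a bitmask of online nodes stands in for the kernel's node_online(). */
static unsigned online_nodes = 0x1;   /* only node 0 online in this sketch */

static int node_online(int nid)
{
    return nid >= 0 && (online_nodes & (1u << nid));
}

struct worker_pool {
    int node;
};

/*
 * Mirror of pool_to_node() in the patch: if the cached node is no longer
 * online, warn and fall back to NUMA_NO_NODE so the caller's allocation
 * is not pinned to a dead node.
 */
static int pool_to_node(const struct worker_pool *pool)
{
    if (pool->node != NUMA_NO_NODE && !node_online(pool->node)) {
        fprintf(stderr, "stale pool->node %d\n", pool->node); /* WARN_ON_ONCE stand-in */
        return NUMA_NO_NODE;
    }
    return pool->node;
}
```

Routing every reader of pool->node through such an accessor is what lets the later patches tolerate a window where the cpu<->node mapping has changed but the pool has not been updated yet.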
[PATCH 1/4] workqueue: add a hook for node hotplug
Subject: [PATCH 1/4] add callback of node hotplug for workqueue. Because workqueue is numa aware, its pool has node information, and it should be maintained against node hotplug. When a node which exists at boot is unplugged, the following error is detected. == SLUB: Unable to allocate memory on node 2 (gfp=0x80d0) cache: kmalloc-192, object size: 192, buffer size: 192, default order: 1, min order: 0 node 0: slabs: 6172, objs: 259224, free: 245741 node 1: slabs: 3261, objs: 136962, free: 127656 == This is because pool->node points to a stale node. This patch adds a callback function at node hotplug. Signed-off-by: KAMEZAWA Hiroyuki node should be cleared and + * cached pools per cpu should be freed at node unplug + */ + +void workqueue_register_numanode(int nid) +{ +} + +void workqueue_unregister_numanode(int nid) +{ +} +#endif diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 1bf4807..504b071 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1162,7 +1162,8 @@ int try_online_node(int nid) build_all_zonelists(NULL, NULL); mutex_unlock(&zonelists_mutex); } - + /* Now zonelist for the pgdat is ready */ + workqueue_register_numanode(nid); out: mem_hotplug_done(); return ret; @@ -1914,7 +1915,11 @@ static int check_and_unmap_cpu_on_node(pg_data_t *pgdat) ret = check_cpu_on_node(pgdat); if (ret) return ret; - + /* +* There is no online cpu on the node and this node will go. +* make workqueue to forget this node. +*/ + workqueue_unregister_numanode(pgdat->node_id); /* * the node will be offlined when we come here, so we can clear * the cpu_to_node() now. -- 1.8.3.1
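The hook pair in this patch is deliberately empty; the substance is the call sites. try_online_node() notifies workqueue only after the node's zonelists are ready, and the offline path notifies it before cpu_to_node() is cleared. A minimal userspace sketch of that ordering contract, with a bitmask of "nodes the workqueue layer trusts" standing in for real state (the hotplug call sites are simplified stand-ins, not the kernel functions):

```c
#define MAX_NODES 8

/* Set of nodes the (stand-in) workqueue layer currently trusts. */
static unsigned wq_known_nodes;

static void workqueue_register_numanode(int nid)
{
    wq_known_nodes |= 1u << nid;
}

static void workqueue_unregister_numanode(int nid)
{
    wq_known_nodes &= ~(1u << nid);
}

/* Online path: register only after the node's zonelists are built. */
static void try_online_node(int nid)
{
    /* ... build zonelists for nid first ... */
    workqueue_register_numanode(nid);   /* node is usable from here on */
}

/* Offline path: forget the node before its cpu<->node mapping is cleared. */
static void offline_node(int nid)
{
    workqueue_unregister_numanode(nid);
    /* ... clear cpu_to_node() etc. afterwards ... */
}
```

Placing the callbacks at exactly these points is what the later patches in the series rely on: workqueue never sees a node id whose mapping has already been torn down.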
Re: [PATCH 2/2] mem-hotplug: Fix wrong check for zone->pageset initialization in online_pages().
(2014/10/31 18:46), Tang Chen wrote: > When we are doing memory hot-add, the following functions are called: > > add_memory() > |--> hotadd_new_pgdat() > |--> free_area_init_node() >|--> free_area_init_core() > |--> zone->present_pages = realsize; /* 1. zone is > populated */ > |--> zone_pcp_init() > |--> zone->pageset = &boot_pageset; /* 2. > zone->pageset is set to boot_pageset */ > > There are two problems here: > 1. Zones could be populated before any memory is onlined. > 2. All the zones on a newly added node have the same pageset pointing to > boot_pageset. > > The above two problems will result in the following problem: > When we online memory on one node, e.g node2, the following code is executed: > > online_pages() > { > .. > if (!populated_zone(zone)) { > need_zonelists_rebuild = 1; > build_all_zonelists(NULL, zone); > } > .. > } > > Because of problem 1, the zone has been populated, and the > build_all_zonelists() >will never called. zone->pageset won't be updated. > Because of problem 2, All the zones on a newly added node have the same > pageset >pointing to boot_pageset. > And as a result, when we online memory on node2, node3's meminfo will corrupt. > Pages on node2 may be freed to node3. > > # for ((i = 2048; i < 2064; i++)); do echo online_movable > > /sys/devices/system/node/node2/memory$i/state; done > # cat /sys/devices/system/node/node2/meminfo > Node 2 MemTotal: 33554432 kB > Node 2 MemFree:33549092 kB > Node 2 MemUsed:5340 kB > .. > # cat /sys/devices/system/node/node3/meminfo > Node 3 MemTotal: 0 kB > Node 3 MemFree: 248 kB/* corrupted */ > Node 3 MemUsed: 0 kB > .. > > We have to populate some zones before onlining memory, otherwise no memory > could be onlined. > So when onlining pages, we should also check if zone->pageset is pointing to > boot_pageset. 
> > Signed-off-by: Gu Zheng > Signed-off-by: Tang Chen > --- > include/linux/mm.h | 1 + > mm/memory_hotplug.c | 6 +- > mm/page_alloc.c | 5 + > 3 files changed, 11 insertions(+), 1 deletion(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index 02d11ee..83e6505 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -1732,6 +1732,7 @@ void warn_alloc_failed(gfp_t gfp_mask, int order, const > char *fmt, ...); > > extern void setup_per_cpu_pageset(void); > > +extern bool zone_pcp_initialized(struct zone *zone); > extern void zone_pcp_update(struct zone *zone); > extern void zone_pcp_reset(struct zone *zone); > > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c > index 3ab01b2..bc0de0f 100644 > --- a/mm/memory_hotplug.c > +++ b/mm/memory_hotplug.c > @@ -1013,9 +1013,13 @@ int __ref online_pages(unsigned long pfn, unsigned > long nr_pages, int online_typ >* If this zone is not populated, then it is not in zonelist. >* This means the page allocator ignores this zone. >* So, zonelist must be updated after online. > + * > + * If this zone is populated, zone->pageset could be initialized > + * to boot_pageset for the first time a node is added. If so, > + * zone->pageset should be allocated. >*/ > mutex_lock(&zonelists_mutex); > - if (!populated_zone(zone)) { > + if (!populated_zone(zone) || !zone_pcp_initialized(zone)) {

Please don't add another strange meaning to zone's pcplist. If you say zone->present_pages no longer means the zone has pages in the buddy list, please rewrite all the parts that use zone->present_pages, including populated_zone(). I think there are several parts that calculate parameters based on present_pages. I myself don't welcome having both the complicated "managed pages" and "present pages"...

How about adding

static inline int managed_zone(struct zone *zone)
{
	return (!!zone->managed_pages);
}

for this bug fix ? Other parts using present_pages should be considered carefully.
Thanks,
-Kame

> need_zonelists_rebuild = 1; > build_all_zonelists(NULL, zone); > } > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 736d8e1..4ff1540 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -6456,6 +6456,11 @@ void __meminit zone_pcp_update(struct zone *zone) > } > #endif > > +bool zone_pcp_initialized(struct zone *zone) > +{ > + return (zone->pageset != &boot_pageset); > +} > + > void zone_pcp_reset(struct zone *zone) > { > unsigned long flags; >
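Kame's suggestion distinguishes two predicates: populated_zone() tests zone->present_pages (physical pages present in the zone's span, holes excluded), while the proposed managed_zone() tests zone->managed_pages (pages actually handed to the buddy allocator). A hot-added but not-yet-onlined zone can be "populated" while managing nothing, which is exactly the case the fix targets. A pared-down sketch of the two predicates; the struct here keeps only the two fields from the mail, everything else is omitted:

```c
struct zone {
    unsigned long present_pages;  /* physical pages present in the zone */
    unsigned long managed_pages;  /* pages managed by the buddy allocator */
};

/* Existing predicate: the zone spans some real memory. */
static int populated_zone(const struct zone *z)
{
    return !!z->present_pages;
}

/* Kame's proposed helper: the zone really has pages in the allocator. */
static int managed_zone(const struct zone *z)
{
    return !!z->managed_pages;
}
```

Under this split, online_pages() could test managed_zone() instead of overloading populated_zone(), avoiding the extra zone_pcp_initialized() meaning objected to in the reply.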
Re: [patch 3/3] mm: move page->mem_cgroup bad page handling into generic code
(2014/11/02 12:15), Johannes Weiner wrote: > Now that the external page_cgroup data structure and its lookup is > gone, let the generic bad_page() check for page->mem_cgroup sanity. > > Signed-off-by: Johannes Weiner Acked-by: KAMEZAWA Hiroyuki
Re: [patch 2/3] mm: page_cgroup: rename file to mm/swap_cgroup.c
(2014/11/02 12:15), Johannes Weiner wrote: > Now that the external page_cgroup data structure and its lookup is > gone, the only code remaining in there is swap slot accounting. > > Rename it and move the conditional compilation into mm/Makefile. > > Signed-off-by: Johannes Weiner Acked-by: KAMEZAWA Hiroyuki
Re: [patch 1/3] mm: embed the memcg pointer directly into struct page
(2014/11/02 12:15), Johannes Weiner wrote: > Memory cgroups used to have 5 per-page pointers. To allow users to > disable that amount of overhead during runtime, those pointers were > allocated in a separate array, with a translation layer between them > and struct page. > > There is now only one page pointer remaining: the memcg pointer, that > indicates which cgroup the page is associated with when charged. The > complexity of runtime allocation and the runtime translation overhead > is no longer justified to save that *potential* 0.19% of memory. With > CONFIG_SLUB, page->mem_cgroup actually sits in the doubleword padding > after the page->private member and doesn't even increase struct page, > and then this patch actually saves space. Remaining users that care > can still compile their kernels without CONFIG_MEMCG. > > text data bss dec hex filename > 8828345 1725264 983040 11536649 b00909 vmlinux.old > 8827425 1725264 966656 11519345 afc571 vmlinux.new > > Signed-off-by: Johannes Weiner > --- > include/linux/memcontrol.h | 6 +- > include/linux/mm_types.h | 5 + > include/linux/mmzone.h | 12 -- > include/linux/page_cgroup.h | 53 > init/main.c | 7 - > mm/memcontrol.c | 124 + > mm/page_alloc.c | 2 - > mm/page_cgroup.c | 319 ---- > 8 files changed, 41 insertions(+), 487 deletions(-) > Great! Acked-by: KAMEZAWA Hiroyuki BTW, shouldn't the init/Kconfig comments be updated ? (I'm sorry if it has been updated since your latest fix.)
Re: [patch 4/4] mm: memcontrol: remove unnecessary PCG_USED pc->mem_cgroup valid flag
(2014/10/21 0:22), Johannes Weiner wrote: > pc->mem_cgroup had to be left intact after uncharge for the final LRU > removal, and !PCG_USED indicated whether the page was uncharged. But > since 0a31bc97c80c ("mm: memcontrol: rewrite uncharge API") pages are > uncharged after the final LRU removal. Uncharge can simply clear the > pointer and the PCG_USED/PageCgroupUsed sites can test that instead. > > Because this is the last page_cgroup flag, this patch reduces the > memcg per-page overhead to a single pointer. > > Signed-off-by: Johannes Weiner awesome. Acked-by: KAMEZAWA Hiroyuki > --- > include/linux/page_cgroup.h | 10 - > mm/memcontrol.c | 107 > +--- > 2 files changed, 42 insertions(+), 75 deletions(-) > > diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h > index 97536e685843..1289be6b436c 100644 > --- a/include/linux/page_cgroup.h > +++ b/include/linux/page_cgroup.h > @@ -1,11 +1,6 @@ > #ifndef __LINUX_PAGE_CGROUP_H > #define __LINUX_PAGE_CGROUP_H > > -enum { > - /* flags for mem_cgroup */ > - PCG_USED = 0x01,/* This page is charged to a memcg */ > -}; > - > struct pglist_data; > > #ifdef CONFIG_MEMCG > @@ -19,7 +14,6 @@ struct mem_cgroup; >* then the page cgroup for pfn always exists. 
>*/ > struct page_cgroup { > - unsigned long flags; > struct mem_cgroup *mem_cgroup; > }; > > @@ -39,10 +33,6 @@ static inline void page_cgroup_init(void) > > struct page_cgroup *lookup_page_cgroup(struct page *page); > > -static inline int PageCgroupUsed(struct page_cgroup *pc) > -{ > - return !!(pc->flags & PCG_USED); > -} > #else /* !CONFIG_MEMCG */ > struct page_cgroup; > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 1d66ac49e702..48d49c6b08d1 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -1284,14 +1284,12 @@ struct lruvec *mem_cgroup_page_lruvec(struct page > *page, struct zone *zone) > > pc = lookup_page_cgroup(page); > memcg = pc->mem_cgroup; > - > /* >* Swapcache readahead pages are added to the LRU - and > - * possibly migrated - before they are charged. Ensure > - * pc->mem_cgroup is sane. > + * possibly migrated - before they are charged. >*/ > - if (!PageLRU(page) && !PageCgroupUsed(pc) && memcg != root_mem_cgroup) > - pc->mem_cgroup = memcg = root_mem_cgroup; > + if (!memcg) > + memcg = root_mem_cgroup; > > mz = mem_cgroup_page_zoneinfo(memcg, page); > lruvec = &mz->lruvec; > @@ -2141,7 +2139,7 @@ void __mem_cgroup_begin_update_page_stat(struct page > *page, > pc = lookup_page_cgroup(page); > again: > memcg = pc->mem_cgroup; > - if (unlikely(!memcg || !PageCgroupUsed(pc))) > + if (unlikely(!memcg)) > return; > /* >* If this memory cgroup is not under account moving, we don't > @@ -2154,7 +2152,7 @@ again: > return; > > move_lock_mem_cgroup(memcg, flags); > - if (memcg != pc->mem_cgroup || !PageCgroupUsed(pc)) { > + if (memcg != pc->mem_cgroup) { > move_unlock_mem_cgroup(memcg, flags); > goto again; > } > @@ -2186,7 +2184,7 @@ void mem_cgroup_update_page_stat(struct page *page, > > pc = lookup_page_cgroup(page); > memcg = pc->mem_cgroup; > - if (unlikely(!memcg || !PageCgroupUsed(pc))) > + if (unlikely(!memcg)) > return; > > this_cpu_add(memcg->stat->count[idx], val); > @@ -2525,9 +2523,10 @@ struct mem_cgroup 
*try_get_mem_cgroup_from_page(struct > page *page) > VM_BUG_ON_PAGE(!PageLocked(page), page); > > pc = lookup_page_cgroup(page); > - if (PageCgroupUsed(pc)) { > - memcg = pc->mem_cgroup; > - if (memcg && !css_tryget_online(&memcg->css)) > + memcg = pc->mem_cgroup; > + > + if (memcg) { > + if (!css_tryget_online(&memcg->css)) > memcg = NULL; > } else if (PageSwapCache(page)) { > ent.val = page_private(page); > @@ -2578,7 +2577,7 @@ static void commit_charge(struct page *page, struct > mem_cgroup *memcg, > struct page_cgroup *pc = lookup_page_cgroup(page); > int isolated; > > - VM_BUG_ON_PAGE(PageCgroupUsed(pc), page); > + VM_BUG_ON_PAGE(pc->mem_cgroup, page); > /* >* we don't need page_cgroup_lock about tail pages, becase they
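The diff above removes the last page_cgroup flag by letting a NULL pc->mem_cgroup mean "uncharged": every former `PageCgroupUsed(pc)` test becomes a plain pointer check. The before/after shape of that change can be sketched with the types pared down to the essentials (this is an illustrative model, not the kernel's struct layout):

```c
#include <assert.h>
#include <stddef.h>

struct mem_cgroup { int id; };

/* After the patch: one pointer, no flags word. */
struct page_cgroup {
    struct mem_cgroup *mem_cgroup;
};

/* Charged iff the pointer is set -- this replaces the old PCG_USED test. */
static int page_cgroup_used(const struct page_cgroup *pc)
{
    return pc->mem_cgroup != NULL;
}

static void commit_charge(struct page_cgroup *pc, struct mem_cgroup *memcg)
{
    assert(!page_cgroup_used(pc)); /* was VM_BUG_ON_PAGE(PageCgroupUsed(pc), page) */
    pc->mem_cgroup = memcg;
}

static void uncharge(struct page_cgroup *pc)
{
    pc->mem_cgroup = NULL; /* clearing the pointer replaces clearing PCG_USED */
}
```

The invariant the patch relies on is visible here: the pointer is set exactly while the page is charged, so no separate "used" bit can fall out of sync with it.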