Re: [RFC PATCH 00/13] Core scheduling v5
On Fri, Jun 26, 2020 at 11:10:28AM -0400, Joel Fernandes wrote:

> On Fri, Jun 26, 2020 at 10:36:01AM -0400, Vineeth Remanan Pillai wrote:
> > On Thu, Jun 25, 2020 at 9:47 PM Joel Fernandes wrote:
> > > On Thu, Jun 25, 2020 at 4:12 PM Vineeth Remanan Pillai wrote:
> > > [...]
> > > > TODO lists:
> > > >
> > > > - Interface discussions could not come to a conclusion in v5 and hence would
> > > >   like to restart the discussion and reach a consensus on it.
> > > >   - https://lwn.net/ml/linux-kernel/20200520222642.70679-1-j...@joelfernandes.org
> > >
> > > Thanks Vineeth, just want to add: I have a revised implementation of
> > > prctl(2) where you only pass a TID of a task you'd like to share a
> > > core with (credit to Peter for the idea [1]) so we can make use of
> > > ptrace_may_access() checks. I am currently finishing writing the
> > > kselftests for this and will post it all once it is ready.
> >
> > Thinking more about it, using TID/PID for prctl(2) and internally
> > using a task identifier to identify a coresched group may have
> > limitations. A coresched group can exist longer than the lifetime
> > of a task, and then there is a chance for that identifier to be
> > reused by a newer task which may or may not be a part of the same
> > coresched group.
>
> True, for the prctl(2) tagging (a task wanting to share a core with
> another) we will need some way of internally identifying groups which does
> not depend on any value that can be reused for another purpose.

That was my concern as well. That's why I was thinking it should be an
arbitrary, user/admin/orchestrator-defined value and not be the
responsibility of the kernel at all. However...

> [..]
> > What do you think about having a separate cgroup for coresched?
> > Both coresched cgroup and prctl() could co-exist, where prctl could
> > be used to isolate an individual process or task and the coresched cgroup
> > to group trusted processes.
>
> This sounds like a fine idea to me. I wonder how Tejun and Peter feel about
> having a new attribute-less CGroup controller for core-scheduling and just
> using that for tagging. (No need to even have a tag file; just adding/removing
> to/from the CGroup will tag.)

... this could be an interesting approach. Then the cookie could still be
the cgroup address as is, and there would be no need for the prctl. At least
so it seems.

Cheers,
Phil

> > > However a question: If using the prctl(2) on a CGroup-tagged task, we
> > > discussed in previous threads [2] to override the CGroup cookie such
> > > that the task may not share a core with any of the tasks in its CGroup
> > > anymore, and I think Peter and Phil are Ok with it. My question though is
> > > - would that not be confusing for anyone looking at the CGroup
> > > filesystem's "tag" and "tasks" files?
> >
> > Having a dedicated cgroup for coresched could solve this problem
> > as well. "coresched.tasks" inside the cgroup hierarchy would list all
> > the tasks in the group, and prctl can override this and take a task out
> > of the group.
>
> We don't even need coresched.tasks; just the existing 'tasks' file of
> CGroups can be used.
>
> > > To resolve this, I am proposing to add a new CGroup file
> > > 'tasks.coresched' to the CGroup, and this will only contain tasks that
> > > were assigned cookies due to their CGroup residency. As soon as one
> > > prctl(2)'s the task, it will stop showing up in the CGroup's
> > > "tasks.coresched" file (unless of course it was requesting to
> > > prctl-share a core with someone in its CGroup itself). Are folks Ok
> > > with this solution?
> >
> > As I mentioned above, IMHO cpu cgroups should not be used to account
> > for core scheduling as well. Cpu cgroups serve a different purpose,
> > and overloading them with core scheduling would not be flexible and
> > scalable. But if there is a consensus to move forward with cpu cgroups,
> > adding this new file seems to be okay with me.
>
> Yes, this is the problem. Many people use CPU controller CGroups already for
> other purposes. In that case, tagging a CGroup would make all the entities in
> the group able to share a core, which may not always make sense. Maybe a
> new CGroup controller is the answer (?).
>
> thanks,
>
> - Joel
> --
Re: [RFC PATCH 00/13] Core scheduling v5
Hi Vineeth,

On 2020/6/26 4:12, Vineeth Remanan Pillai wrote:

> On Wed, Mar 4, 2020 at 12:00 PM vpillai wrote:
> >
> > Fifth iteration of the Core-Scheduling feature.
>
> It's probably time for another iteration, and we are planning to post v6
> based on this branch:
> https://github.com/digitalocean/linux-coresched/tree/coresched/pre-v6-v5.7.y
>
> Just wanted to share the details about v6 here before posting the patch
> series. If there is no objection to the following, we shall be posting
> the v6 early next week.
>
> The main changes in v6 are the following:
> 1. Address Peter's comments in v5
>    - Code cleanup
>    - Remove fixes related to hotplugging
>    - Split out the patch for force-idle starvation
> 2. Fix for RCU deadlock
> 3. Minor rework of the core-wide priority comparison
> 4. IRQ Pause patch
> 5. Documentation
>    - https://github.com/digitalocean/linux-coresched/blob/coresched/pre-v6-v5.7.y/Documentation/admin-guide/hw-vuln/core-scheduling.rst
>
> This version is much leaner compared to v5 due to the removal of hotplug
> support. As a result, dynamic coresched enable/disable on cpus due to
> smt on/off on the core does not function anymore. I tried to reproduce the
> crashes during hotplug, but could not reproduce them reliably. The plan is
> to try to reproduce the crashes with v6, and document each corner case for
> crashes as we fix them. Previously, we randomly fixed the issues without
> clear documentation and the fixes became complex over time.
>
> TODO lists:
>
> - Interface discussions could not come to a conclusion in v5 and hence would
>   like to restart the discussion and reach a consensus on it.
>   - https://lwn.net/ml/linux-kernel/20200520222642.70679-1-j...@joelfernandes.org
>
> - Core-wide vruntime calculation needs rework:
>   - https://lwn.net/ml/linux-kernel/20200506143506.gh5...@hirez.programming.kicks-ass.net
>
> - Load balancing/migration changes ignore group weights:
>   - https://lwn.net/ml/linux-kernel/20200225034438.GA617271@ziqianlu-desktop.localdomain

According to Aaron's response below:
https://lwn.net/ml/linux-kernel/20200305085231.GA12108@ziqianlu-desktop.localdomain/

The following logic seems to be helpful for Aaron's case:

+	/*
+	 * Ignore cookie match if there is a big imbalance between the src rq
+	 * and dst rq.
+	 */
+	if ((src_rq->cfs.h_nr_running - rq->cfs.h_nr_running) > 1)
+		return true;

I didn't see any other comments on the patch here:
https://lwn.net/ml/linux-kernel/67e46f79-51c2-5b69-71c6-133ec10b6...@linux.intel.com/

Do we have another way to address this issue?

Thanks,
-Aubrey
Re: [RFC PATCH 00/13] Core scheduling v5
Hi Aubrey,

On Mon, Jun 29, 2020 at 8:34 AM Li, Aubrey wrote:

> > - Load balancing/migration changes ignore group weights:
> >   - https://lwn.net/ml/linux-kernel/20200225034438.GA617271@ziqianlu-desktop.localdomain
>
> According to Aaron's response below:
> https://lwn.net/ml/linux-kernel/20200305085231.GA12108@ziqianlu-desktop.localdomain/
>
> The following logic seems to be helpful for Aaron's case:
>
> +	/*
> +	 * Ignore cookie match if there is a big imbalance between the src rq
> +	 * and dst rq.
> +	 */
> +	if ((src_rq->cfs.h_nr_running - rq->cfs.h_nr_running) > 1)
> +		return true;
>
> I didn't see any other comments on the patch here:
> https://lwn.net/ml/linux-kernel/67e46f79-51c2-5b69-71c6-133ec10b6...@linux.intel.com/
>
> Do we have another way to address this issue?

We do not have a clear fix for this yet, and did not get much time to work
on it. I feel that the above change would not fix the real issue. The issue
is about not considering the weight of the group when we try to load
balance, but the above change checks only nr_running, which might not
always work. I feel that we should fix the real issue in v6 and probably
hold off on adding the workaround fix in the interim. I have added a TODO
specifically for this bug in v6. What do you think?

Thanks,
Vineeth
Re: [RFC PATCH 00/13] Core scheduling v5
On Fri, Jun 26, 2020 at 11:10 AM Joel Fernandes wrote:

> [..]
> > What do you think about having a separate cgroup for coresched?
> > Both coresched cgroup and prctl() could co-exist, where prctl could
> > be used to isolate an individual process or task and the coresched cgroup
> > to group trusted processes.
>
> This sounds like a fine idea to me. I wonder how Tejun and Peter feel about
> having a new attribute-less CGroup controller for core-scheduling and just
> using that for tagging. (No need to even have a tag file; just adding/removing
> to/from the CGroup will tag.)

Unless there are any major objections to this idea, or better ideas for
CGroup users, we will consider proposing a new CGroup controller for this.
The issue with CPU controller CGroups is that they may be configured in a
way that is incompatible with tagging.

And I was also thinking of a new clone flag, CLONE_CORE (which allows a
child to share a parent's core). This is because the fork semantics are not
clear, and it may be better to leave the behavior of fork to userspace IMHO
than to hard-code policy in the kernel.

Perhaps we can also discuss this at the scheduler MC at Plumbers. Any other
thoughts?

- Joel
Re: [RFC PATCH 00/13] Core scheduling v5
On Fri, Jun 26, 2020 at 11:10:28AM -0400, Joel Fernandes wrote:

> On Fri, Jun 26, 2020 at 10:36:01AM -0400, Vineeth Remanan Pillai wrote:
> > On Thu, Jun 25, 2020 at 9:47 PM Joel Fernandes wrote:
> > > On Thu, Jun 25, 2020 at 4:12 PM Vineeth Remanan Pillai wrote:
> > > [...]
> > > > TODO lists:
> > > >
> > > > - Interface discussions could not come to a conclusion in v5 and hence would
> > > >   like to restart the discussion and reach a consensus on it.
> > > >   - https://lwn.net/ml/linux-kernel/20200520222642.70679-1-j...@joelfernandes.org
> > >
> > > Thanks Vineeth, just want to add: I have a revised implementation of
> > > prctl(2) where you only pass a TID of a task you'd like to share a
> > > core with (credit to Peter for the idea [1]) so we can make use of
> > > ptrace_may_access() checks. I am currently finishing writing the
> > > kselftests for this and will post it all once it is ready.
> >
> > Thinking more about it, using TID/PID for prctl(2) and internally
> > using a task identifier to identify a coresched group may have
> > limitations. A coresched group can exist longer than the lifetime
> > of a task, and then there is a chance for that identifier to be
> > reused by a newer task which may or may not be a part of the same
> > coresched group.
>
> True, for the prctl(2) tagging (a task wanting to share a core with
> another) we will need some way of internally identifying groups which does
> not depend on any value that can be reused for another purpose.
>
> [..]
> > What do you think about having a separate cgroup for coresched?
> > Both coresched cgroup and prctl() could co-exist, where prctl could
> > be used to isolate an individual process or task and the coresched cgroup
> > to group trusted processes.
>
> This sounds like a fine idea to me. I wonder how Tejun and Peter feel about
> having a new attribute-less CGroup controller for core-scheduling and just
> using that for tagging. (No need to even have a tag file; just adding/removing
> to/from the CGroup will tag.)

+Tejun

thanks,

- Joel

> > > However a question: If using the prctl(2) on a CGroup-tagged task, we
> > > discussed in previous threads [2] to override the CGroup cookie such
> > > that the task may not share a core with any of the tasks in its CGroup
> > > anymore, and I think Peter and Phil are Ok with it. My question though is
> > > - would that not be confusing for anyone looking at the CGroup
> > > filesystem's "tag" and "tasks" files?
> >
> > Having a dedicated cgroup for coresched could solve this problem
> > as well. "coresched.tasks" inside the cgroup hierarchy would list all
> > the tasks in the group, and prctl can override this and take a task out
> > of the group.
>
> We don't even need coresched.tasks; just the existing 'tasks' file of
> CGroups can be used.
>
> > > To resolve this, I am proposing to add a new CGroup file
> > > 'tasks.coresched' to the CGroup, and this will only contain tasks that
> > > were assigned cookies due to their CGroup residency. As soon as one
> > > prctl(2)'s the task, it will stop showing up in the CGroup's
> > > "tasks.coresched" file (unless of course it was requesting to
> > > prctl-share a core with someone in its CGroup itself). Are folks Ok
> > > with this solution?
> >
> > As I mentioned above, IMHO cpu cgroups should not be used to account
> > for core scheduling as well. Cpu cgroups serve a different purpose,
> > and overloading them with core scheduling would not be flexible and
> > scalable. But if there is a consensus to move forward with cpu cgroups,
> > adding this new file seems to be okay with me.
>
> Yes, this is the problem. Many people use CPU controller CGroups already for
> other purposes. In that case, tagging a CGroup would make all the entities in
> the group able to share a core, which may not always make sense. Maybe a
> new CGroup controller is the answer (?).
>
> thanks,
>
> - Joel
Re: [RFC PATCH 00/13] Core scheduling v5
On Fri, Jun 26, 2020 at 10:36:01AM -0400, Vineeth Remanan Pillai wrote:

> On Thu, Jun 25, 2020 at 9:47 PM Joel Fernandes wrote:
> > On Thu, Jun 25, 2020 at 4:12 PM Vineeth Remanan Pillai wrote:
> > [...]
> > > TODO lists:
> > >
> > > - Interface discussions could not come to a conclusion in v5 and hence would
> > >   like to restart the discussion and reach a consensus on it.
> > >   - https://lwn.net/ml/linux-kernel/20200520222642.70679-1-j...@joelfernandes.org
> >
> > Thanks Vineeth, just want to add: I have a revised implementation of
> > prctl(2) where you only pass a TID of a task you'd like to share a
> > core with (credit to Peter for the idea [1]) so we can make use of
> > ptrace_may_access() checks. I am currently finishing writing the
> > kselftests for this and will post it all once it is ready.
>
> Thinking more about it, using TID/PID for prctl(2) and internally
> using a task identifier to identify a coresched group may have
> limitations. A coresched group can exist longer than the lifetime
> of a task, and then there is a chance for that identifier to be
> reused by a newer task which may or may not be a part of the same
> coresched group.

True, for the prctl(2) tagging (a task wanting to share a core with
another) we will need some way of internally identifying groups which does
not depend on any value that can be reused for another purpose.

[..]
> What do you think about having a separate cgroup for coresched?
> Both coresched cgroup and prctl() could co-exist, where prctl could
> be used to isolate an individual process or task and the coresched cgroup
> to group trusted processes.

This sounds like a fine idea to me. I wonder how Tejun and Peter feel about
having a new attribute-less CGroup controller for core-scheduling and just
using that for tagging. (No need to even have a tag file; just
adding/removing to/from the CGroup will tag.)

> > However a question: If using the prctl(2) on a CGroup-tagged task, we
> > discussed in previous threads [2] to override the CGroup cookie such
> > that the task may not share a core with any of the tasks in its CGroup
> > anymore, and I think Peter and Phil are Ok with it. My question though is
> > - would that not be confusing for anyone looking at the CGroup
> > filesystem's "tag" and "tasks" files?
>
> Having a dedicated cgroup for coresched could solve this problem
> as well. "coresched.tasks" inside the cgroup hierarchy would list all
> the tasks in the group, and prctl can override this and take a task out
> of the group.

We don't even need coresched.tasks; just the existing 'tasks' file of
CGroups can be used.

> > To resolve this, I am proposing to add a new CGroup file
> > 'tasks.coresched' to the CGroup, and this will only contain tasks that
> > were assigned cookies due to their CGroup residency. As soon as one
> > prctl(2)'s the task, it will stop showing up in the CGroup's
> > "tasks.coresched" file (unless of course it was requesting to
> > prctl-share a core with someone in its CGroup itself). Are folks Ok
> > with this solution?
>
> As I mentioned above, IMHO cpu cgroups should not be used to account
> for core scheduling as well. Cpu cgroups serve a different purpose,
> and overloading them with core scheduling would not be flexible and
> scalable. But if there is a consensus to move forward with cpu cgroups,
> adding this new file seems to be okay with me.

Yes, this is the problem. Many people use CPU controller CGroups already
for other purposes. In that case, tagging a CGroup would make all the
entities in the group able to share a core, which may not always make
sense. Maybe a new CGroup controller is the answer (?).

thanks,

- Joel
Re: [RFC PATCH 00/13] Core scheduling v5
On Thu, Jun 25, 2020 at 9:47 PM Joel Fernandes wrote:

> On Thu, Jun 25, 2020 at 4:12 PM Vineeth Remanan Pillai wrote:
> [...]
> > TODO lists:
> >
> > - Interface discussions could not come to a conclusion in v5 and hence would
> >   like to restart the discussion and reach a consensus on it.
> >   - https://lwn.net/ml/linux-kernel/20200520222642.70679-1-j...@joelfernandes.org
>
> Thanks Vineeth, just want to add: I have a revised implementation of
> prctl(2) where you only pass a TID of a task you'd like to share a
> core with (credit to Peter for the idea [1]) so we can make use of
> ptrace_may_access() checks. I am currently finishing writing the
> kselftests for this and will post it all once it is ready.

Thinking more about it, using TID/PID for prctl(2) and internally using a
task identifier to identify a coresched group may have limitations. A
coresched group can exist longer than the lifetime of a task, and then
there is a chance for that identifier to be reused by a newer task which
may or may not be a part of the same coresched group. A way to overcome
this is to have a coresched group with a separate identifier implemented
internally, and a mapping from task to group. And the cgroup framework
provides exactly that.

I feel we could use prctl for isolating individual tasks/processes and use
grouping frameworks like cgroup for core-scheduling groups. The cpu cgroup
might not be a good fit, as it has its own purpose. Users might not always
want a group of trusted tasks in the same cpu cgroup. Or all the processes
in an existing cpu cgroup might not be mutually trusted as well.

What do you think about having a separate cgroup for coresched? Both the
coresched cgroup and prctl() could co-exist, where prctl could be used to
isolate an individual process or task and the coresched cgroup to group
trusted processes.

> However a question: If using the prctl(2) on a CGroup tagged task, we
> discussed in previous threads [2] to override the CGroup cookie such
> that the task may not share a core with any of the tasks in its CGroup
> anymore, and I think Peter and Phil are Ok with it. My question though is
> - would that not be confusing for anyone looking at the CGroup
> filesystem's "tag" and "tasks" files?

Having a dedicated cgroup for coresched could solve this problem as well.
"coresched.tasks" inside the cgroup hierarchy would list all the tasks in
the group, and prctl can override this and take a task out of the group.

> To resolve this, I am proposing to add a new CGroup file
> 'tasks.coresched' to the CGroup, and this will only contain tasks that
> were assigned cookies due to their CGroup residency. As soon as one
> prctl(2)'s the task, it will stop showing up in the CGroup's
> "tasks.coresched" file (unless of course it was requesting to
> prctl-share a core with someone in its CGroup itself). Are folks Ok
> with this solution?

As I mentioned above, IMHO cpu cgroups should not be used to account for
core scheduling as well. Cpu cgroups serve a different purpose, and
overloading them with core scheduling would not be flexible and scalable.
But if there is a consensus to move forward with cpu cgroups, adding this
new file seems to be okay with me.

Thoughts/suggestions/concerns?

Thanks,
Vineeth
Re: [RFC PATCH 00/13] Core scheduling v5
On Thu, Jun 25, 2020 at 4:12 PM Vineeth Remanan Pillai wrote:
[...]
> TODO lists:
>
> - Interface discussions could not come to a conclusion in v5 and hence would
>   like to restart the discussion and reach a consensus on it.
>   - https://lwn.net/ml/linux-kernel/20200520222642.70679-1-j...@joelfernandes.org

Thanks Vineeth, just want to add: I have a revised implementation of
prctl(2) where you only pass a TID of a task you'd like to share a core
with (credit to Peter for the idea [1]) so we can make use of
ptrace_may_access() checks. I am currently finishing writing the
kselftests for this and will post it all once it is ready.

However a question: If using the prctl(2) on a CGroup-tagged task, we
discussed in previous threads [2] to override the CGroup cookie such that
the task may not share a core with any of the tasks in its CGroup anymore,
and I think Peter and Phil are Ok with it. My question though is - would
that not be confusing for anyone looking at the CGroup filesystem's "tag"
and "tasks" files?

To resolve this, I am proposing to add a new CGroup file 'tasks.coresched'
to the CGroup, and this will only contain tasks that were assigned cookies
due to their CGroup residency. As soon as one prctl(2)'s the task, it will
stop showing up in the CGroup's "tasks.coresched" file (unless of course it
was requesting to prctl-share a core with someone in its CGroup itself).
Are folks Ok with this solution?

[1] https://lore.kernel.org/lkml/20200528170128.gn2...@worktop.programming.kicks-ass.net/
[2] https://lore.kernel.org/lkml/20200524140046.ga5...@lorien.usersys.redhat.com/
Re: [RFC PATCH 00/13] Core scheduling v5
On Wed, Mar 4, 2020 at 12:00 PM vpillai wrote:
>
> Fifth iteration of the Core-Scheduling feature.

It's probably time for another iteration, and we are planning to post v6
based on this branch:
https://github.com/digitalocean/linux-coresched/tree/coresched/pre-v6-v5.7.y

Just wanted to share the details about v6 here before posting the patch
series. If there is no objection to the following, we shall be posting the
v6 early next week.

The main changes in v6 are the following:
1. Address Peter's comments in v5
   - Code cleanup
   - Remove fixes related to hotplugging
   - Split out the patch for force-idle starvation
2. Fix for RCU deadlock
3. Minor rework of the core-wide priority comparison
4. IRQ Pause patch
5. Documentation
   - https://github.com/digitalocean/linux-coresched/blob/coresched/pre-v6-v5.7.y/Documentation/admin-guide/hw-vuln/core-scheduling.rst

This version is much leaner compared to v5 due to the removal of hotplug
support. As a result, dynamic coresched enable/disable on cpus due to smt
on/off on the core does not function anymore. I tried to reproduce the
crashes during hotplug, but could not reproduce them reliably. The plan is
to try to reproduce the crashes with v6, and document each corner case for
crashes as we fix them. Previously, we randomly fixed the issues without
clear documentation and the fixes became complex over time.

TODO lists:

- Interface discussions could not come to a conclusion in v5 and hence would
  like to restart the discussion and reach a consensus on it.
  - https://lwn.net/ml/linux-kernel/20200520222642.70679-1-j...@joelfernandes.org

- Core-wide vruntime calculation needs rework:
  - https://lwn.net/ml/linux-kernel/20200506143506.gh5...@hirez.programming.kicks-ass.net

- Load balancing/migration changes ignore group weights:
  - https://lwn.net/ml/linux-kernel/20200225034438.GA617271@ziqianlu-desktop.localdomain

Please have a look and let me know comments/suggestions or anything missed.

Thanks,
Vineeth
Re: [RFC PATCH 00/13] Core scheduling v5
On Tue, 2020-04-14 at 16:21 +0200, Peter Zijlstra wrote:

> On Wed, Mar 04, 2020 at 04:59:50PM +, vpillai wrote:
> >
> > - Investigate the source of the overhead even when no tasks are tagged:
> >   https://lkml.org/lkml/2019/10/29/242
>
> - explain why we're all still doing this
>
> Seriously, what actual problems does it solve? The patch-set still isn't
> L1TF complete and afaict it does exactly nothing for MDS.

Hey Peter! Late to the party, I know... But I'm replying anyway. At least,
you'll have the chance to yell at me for this during OSPM. ;-P

> Like I've written many times now, back when the world was simpler and
> all we had to worry about was L1TF, core-scheduling made some sense, but
> how does it make sense today?

Indeed core-scheduling alone doesn't even completely solve L1TF. There are
the interrupt and the VMEXIT issues. Both are being discussed in this
thread and, FWIW, my personal opinion is that the way to go is what Alex
says here: <79529592-5d60-2a41-fbb6-4a5f8279f...@amazon.com> (e.g., when he
mentions solution 4, "Create a "safe" page table which runs with HT
enabled", etc.).

But let's stick to your point: if it were only for L1TF, then fine, but
it's all pointless because of MDS. My answer to this is very much focused
on my usecase, which is virtualization. I know you hate us, and you surely
have your good reasons, but you know... :-)

Correct me if I'm wrong, but I think that the "nice" thing about L1TF is
that it allows a VM to spy on another VM or on the host, but it does not
allow a regular task to spy on another task or on the kernel (well, it
would, but it's easily mitigated). The bad thing about MDS is that it
instead allows *all* of that.

Now, one thing that we absolutely want to avoid in virt is that a VM is
able to spy on other VMs or on the host. Sure, we also care about tasks
running in our VMs being safe but, really, inter-VM and VM-to-host
isolation is the primary concern of a hypervisor.

And how can a VM (or stuff running inside a VM) spy on another VM or on the
host, via L1TF or MDS? Well, if the attacker VM and the victim VM --or the
attacker VM and the host-- are running on the same core. If they're not, it
can't... which is basically an L1TF-only-looking scenario.

So, in virt, core-scheduling:

1) is the *only* way (aside from no-EPT) to prevent an attacker VM from
spying on a victim VM, if they're running concurrently, both in guest mode,
on the same core (and that's, of course, because with core-scheduling they
just won't be doing that :-) );

2) interrupts and VMEXITs need to be taken care of --which was the case
already when, as you said, "we had only L1TF". Once that is done we will
effectively prevent all VM-to-VM and VM-to-host attack scenarios.

Sure, it will still be possible, for instance, for task_A in VM1 to spy on
task_B, also in VM1. This seems to be, AFAIUI, Joel's usecase, so I'm happy
to leave it to him to defend that, as he's doing already (but indeed I'm
very happy to see that it is also getting attention).

Now, of course saying anything like "works for my own usecase so let's go
for it" does not fly. But since you were asking whether and how this
feature could make sense today, suppose that:

1) we get core-scheduling,
2) we find a solution for irqs and VMEXITs, as we would have to if there
   was only L1TF,
3) we manage to make the overhead of core-scheduling close to zero when
   it's there (I mean, enabled at compile time) but not used (I mean, no
   tagging of tasks, or whatever).

That would mean that virt people can enable core-scheduling, and achieve
good inter-VM and VM-to-host isolation, without imposing overhead on other
use cases that would leave core-scheduling disabled. And this is something
that I would think makes sense.

Of course, we're not there... because even when this series gives us point
1, we will still need 2, and we need to make sure we also satisfy 3 (and we
weren't, last time I checked ;-P). But I think it's worth keeping trying.

I'd also add a couple more ideas, still about core-scheduling in virt, but
from a different standpoint than security:

- if I tag vcpu0 and vcpu1 together[*], then vcpu2 and vcpu3 together, then
  vcpu4 and vcpu5 together, then I'm sure that each pair will always be
  scheduled on the same core. At which point I can define an SMT virtual
  topology for the VM that will make sense, even without pinning the vcpus;

- if I run VMs from different customers, when vcpu2 of VM1 and vcpu1 of VM2
  run on the same core, they influence each other's performance. If, e.g.,
  I bill based on time spent on CPUs, it means customer A's workload,
  running in VM1, may influence the billing of customer B, who owns VM2.
  With core scheduling, if I tag all the vcpus of each VM together, I won't
  have this any longer.

[*] with "tag together" I mean let them have the same tag which, ATM, would
be "put them in the same cgroup and enable cpu.tag".
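The "tag all the vcpus of a VM together" idea from the footnote can be
sketched as a configuration fragment against the v5 interface, where the
patch set exposes a cpu.tag file on the cpu controller. The cgroup paths
and TIDs below are illustrative assumptions; real vcpu thread IDs would
come from the VMM, and this requires root on a kernel carrying the series.

```shell
# One tagged group per VM, so vcpus of different VMs never share a core.
mkdir /sys/fs/cgroup/cpu/vm1
echo 1 > /sys/fs/cgroup/cpu/vm1/cpu.tag

# Move every vcpu thread of VM1 into the tagged group
# (TIDs are made up for the example).
for tid in 12001 12002 12003 12004; do
    echo "$tid" > /sys/fs/cgroup/cpu/vm1/tasks
done
```

With one such group per VM, the billing scenario above goes away: a core is
only ever shared among vcpus carrying the same tag, i.e. the same customer.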
Re: [RFC PATCH 00/13] Core scheduling v5
- Test environment:
  Intel Xeon Server platform
  CPU(s):               192
  On-line CPU(s) list:  0-191
  Thread(s) per core:   2
  Core(s) per socket:   48
  Socket(s):            2
  NUMA node(s):         4

- Kernel under test:
  Core scheduling v5 base
  https://github.com/digitalocean/linux-coresched/tree/coresched/v5-v5.5.y

- Test set based on sysbench 1.1.0-bd4b418:
  A: sysbench cpu in cgroup cpu 1 + sysbench mysql in cgroup mysql 1
     (192 workload tasks for each cgroup)
  B: sysbench cpu in cgroup cpu 1 + sysbench cpu in cgroup cpu 2 +
     sysbench mysql in cgroup mysql 1 + sysbench mysql in cgroup mysql 2
     (192 workload tasks for each cgroup)

- Test results briefing:

  1 Good results:
    1.1 For test set A, coresched could achieve same or better performance
        compared to smt_off, for both cpu workload and sysbench workload.
    1.2 For test set B, cpu workload, coresched could achieve better
        performance compared to smt_off.

  2 Bad results:
    2.1 For test set B, mysql workload, coresched performance is lower than
        smt_off; potential fairness issue between cpu workloads and mysql
        workloads.
    2.2 For test set B, cpu workload, potential fairness issue between the
        two cgroups' cpu workloads.

- Test results (Tput normalized to the default baseline, 192 tasks per
  cgroup):

  Test set A:

    cgroup       workload   default   coresched   smt_off
    cg cpu 1     cpu        1         0.95        0.54
    cg mysql 1   mysql      1         0.92        0.97

  Test set B:

    cgroup       workload   default   coresched   smt_off
    cg cpu 1     cpu        1         0.90        0.47
    cg cpu 2     cpu        1         1.32        0.66
    cg mysql 1   mysql      1         0.42        0.89
    cg mysql 2   mysql      1         0.42        0.89

> On Date: Wed, 4 Mar 2020 16:59:50 +, vpillai wrote:
> To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
>     mi...@kernel.org, t...@linutronix.de, p...@google.com,
>     torva...@linux-foundation.org
> CC: vpillai, linux-kernel@vger.kernel.org, fweis...@gmail.com,
>     keesc...@chromium.org, kerr...@google.com, Phil Auld, Aaron Lu,
>     Aubrey Li, aubrey...@linux.intel.com, Valentin Schneider, Mel Gorman,
>     Pawan Gupta, Paolo Bonzini, Joel Fernandes, j...@joelfernandes.org