Re: [RFC PATCH 00/13] Core scheduling v5

2020-06-30 Thread Phil Auld
On Fri, Jun 26, 2020 at 11:10:28AM -0400 Joel Fernandes wrote:
> On Fri, Jun 26, 2020 at 10:36:01AM -0400, Vineeth Remanan Pillai wrote:
> > On Thu, Jun 25, 2020 at 9:47 PM Joel Fernandes  
> > wrote:
> > >
> > > On Thu, Jun 25, 2020 at 4:12 PM Vineeth Remanan Pillai
> > >  wrote:
> > > [...]
> > > > TODO lists:
> > > >
> > > >  - Interface discussions could not come to a conclusion in v5 and hence 
> > > > would
> > > >like to restart the discussion and reach a consensus on it.
> > > >- 
> > > > https://lwn.net/ml/linux-kernel/20200520222642.70679-1-j...@joelfernandes.org
> > >
> > > Thanks Vineeth, just want to add: I have a revised implementation of
> > > prctl(2) where you only pass a TID of a task you'd like to share a
> > > core with (credit to Peter for the idea [1]) so we can make use of
> > > ptrace_may_access() checks. I am currently finishing writing of
> > > kselftests for this and post it all once it is ready.
> > >
> > Thinking more about it, using TID/PID for prctl(2) and internally
> > using a task identifier to identify coresched group may have
> > limitations. A coresched group can exist longer than the lifetime
> > of a task and then there is a chance for that identifier to be
> > reused by a newer task which may or may not be part of the same
> > coresched group.
> 
> True, for the prctl(2) tagging (a task wanting to share core with
> another) we will need some way of internally identifying groups which does
> not depend on any value that can be reused for another purpose.
>

That was my concern as well. That's why I was thinking it should be
an arbitrary, user/admin/orchestrator defined value and not be the
responsibility of the kernel at all.  However...


> [..]
> > What do you think about having a separate cgroup for coresched?
> > Both coresched cgroup and prctl() could co-exist where prctl could
> > be used to isolate individual process or task and coresched cgroup
> > to group trusted processes.
> 
> This sounds like a fine idea to me. I wonder how Tejun and Peter feel about
> having a new attribute-less CGroup controller for core-scheduling and just
> use that for tagging. (No need to even have a tag file, just adding/removing
> to/from CGroup will tag).
>

... this could be an interesting approach. Then the cookie could still
be the cgroup address as is and there would be no need for the prctl. At
least so it seems. 



Cheers,
Phil

> > > However a question: If using the prctl(2) on a CGroup tagged task, we
> > > discussed in previous threads [2] to override the CGroup cookie such
> > > that the task may not share a core with any of the tasks in its CGroup
> > > anymore and I think Peter and Phil are OK with it.  My question though is
> > > - would that not be confusing for anyone looking at the CGroup
> > > filesystem's "tag" and "tasks" files?
> > >
> > Having a dedicated cgroup for coresched could solve this problem
> > as well. "coresched.tasks" inside the cgroup hierarchy would list all
> > the tasks in the group and prctl can override this and take it out
> > of the group.
> 
> We don't even need coresched.tasks, just the existing 'tasks' of CGroups can
> be used.
> 
> > > To resolve this, I am proposing to add a new CGroup file
> > > 'tasks.coresched' to the CGroup, and this will only contain tasks that
> > > were assigned cookies due to their CGroup residency. As soon as one
> > > prctl(2)'s the task, it will stop showing up in the CGroup's
> > > "tasks.coresched" file (unless of course it was requesting to
> > > prctl-share a core with someone in its CGroup itself). Are folks Ok
> > > with this solution?
> > >
> > As I mentioned above, IMHO cpu cgroups should not be used to account
> > for core scheduling as well. Cpu cgroups serve a different purpose
> > and overloading it with core scheduling would not be flexible and
> > scalable. But if there is a consensus to move forward with cpu cgroups,
> > adding this new file seems to be okay with me.
> 
> Yes, this is the problem. Many people use CPU controller CGroups already for
> other purposes. In that case, tagging a CGroup would make all the entities in
> the group be able to share a core, which may not always make sense. Maybe a
> new CGroup controller is the answer (?).
> 
> thanks,
> 
>  - Joel
> 

-- 



Re: [RFC PATCH 00/13] Core scheduling v5

2020-06-29 Thread Li, Aubrey
Hi Vineeth,

On 2020/6/26 4:12, Vineeth Remanan Pillai wrote:
> On Wed, Mar 4, 2020 at 12:00 PM vpillai  wrote:
>>
>>
>> Fifth iteration of the Core-Scheduling feature.
>>
> It's probably time for another iteration, and we are planning to post v6 based
> on this branch:
>  https://github.com/digitalocean/linux-coresched/tree/coresched/pre-v6-v5.7.y
> 
> Just wanted to share the details about v6 here before posting the patch
> series. If there is no objection to the following, we shall be posting
> the v6 early next week.
> 
> The main changes in v6 are the following:
> 1. Address Peter's comments in v5
>    - Code cleanup
>    - Remove fixes related to hotplugging.
>    - Split the patch out for force idling starvation
> 2. Fix for RCU deadlock
> 3. Core-wide priority comparison minor rework.
> 4. IRQ Pause patch
> 5. Documentation
>- 
> https://github.com/digitalocean/linux-coresched/blob/coresched/pre-v6-v5.7.y/Documentation/admin-guide/hw-vuln/core-scheduling.rst
> 
> This version is much leaner compared to v5 due to the removal of hotplug
> support. As a result, dynamic coresched enable/disable on cpus driven by
> smt on/off on the core does not function anymore. I tried to reproduce the
> crashes during hotplug, but could not reproduce them reliably. The plan is
> to try to reproduce the crashes with v6, and document each corner case for
> crashes as we fix them. Previously, we randomly fixed the issues without
> clear documentation and the fixes became complex over time.
> 
> TODO lists:
> 
>  - Interface discussions could not come to a conclusion in v5 and hence would
>like to restart the discussion and reach a consensus on it.
>- 
> https://lwn.net/ml/linux-kernel/20200520222642.70679-1-j...@joelfernandes.org
> 
>  - Core wide vruntime calculation needs rework:
>- 
> https://lwn.net/ml/linux-kernel/20200506143506.gh5...@hirez.programming.kicks-ass.net
> 
>  - Load balancing/migration changes ignores group weights:
>- 
> https://lwn.net/ml/linux-kernel/20200225034438.GA617271@ziqianlu-desktop.localdomain

According to Aaron's response below:
https://lwn.net/ml/linux-kernel/20200305085231.GA12108@ziqianlu-desktop.localdomain/

The following logic seems to be helpful for Aaron's case.

+	/*
+	 * Ignore cookie match if there is a big imbalance between the src rq
+	 * and dst rq.
+	 */
+	if ((src_rq->cfs.h_nr_running - rq->cfs.h_nr_running) > 1)
+		return true;
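
For context, this is roughly where such a check would sit -- an illustrative
sketch only (the helper name and surrounding logic follow the shape of the
series' cookie-matching code, not the actual patch):

/* Illustrative: allow task 'p' onto dst_rq either when the cookies match
 * or when the source rq is clearly more loaded than the destination. */
static inline bool cookie_match_or_imbalanced(struct rq *dst_rq,
					      struct task_struct *p)
{
	struct rq *src_rq = task_rq(p);

	if (!sched_core_enabled(dst_rq))
		return true;

	/* Ignore cookie match if there is a big imbalance between the rqs. */
	if ((src_rq->cfs.h_nr_running - dst_rq->cfs.h_nr_running) > 1)
		return true;

	return dst_rq->core->core_cookie == p->core_cookie;
}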

I didn't see any other comments on the patch here:
https://lwn.net/ml/linux-kernel/67e46f79-51c2-5b69-71c6-133ec10b6...@linux.intel.com/

Do we have another way to address this issue?

Thanks,
-Aubrey


Re: [RFC PATCH 00/13] Core scheduling v5

2020-06-29 Thread Vineeth Remanan Pillai
Hi Aubrey,

On Mon, Jun 29, 2020 at 8:34 AM Li, Aubrey  wrote:
>
> >  - Load balancing/migration changes ignores group weights:
> >- 
> > https://lwn.net/ml/linux-kernel/20200225034438.GA617271@ziqianlu-desktop.localdomain
>
> According to Aaron's response below:
> https://lwn.net/ml/linux-kernel/20200305085231.GA12108@ziqianlu-desktop.localdomain/
>
> The following logic seems to be helpful for Aaron's case.
>
> +	/*
> +	 * Ignore cookie match if there is a big imbalance between the src rq
> +	 * and dst rq.
> +	 */
> +	if ((src_rq->cfs.h_nr_running - rq->cfs.h_nr_running) > 1)
> +		return true;
>
> I didn't see any other comments on the patch here:
> https://lwn.net/ml/linux-kernel/67e46f79-51c2-5b69-71c6-133ec10b6...@linux.intel.com/
>
> Do we have another way to address this issue?
>
We do not have a clear fix for this yet, and did not get much time to
work on this.

I feel that the above change would not fix the real issue.
The issue is that we do not consider the weight of the group when we
try to load balance, but the above change checks only nr_running,
which might not always work. I feel that we should fix the real
issue in v6 and probably hold off on adding the workaround fix in
the interim.  I have added a TODO specifically for this bug in v6.
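
To make the concern concrete (illustrative fragment only, not a proposed
fix): two runqueues can report identical h_nr_running while carrying very
different scheduler weight, so a heuristic would likely have to look at
weighted load rather than raw task counts, e.g.:

/*
 * Illustrative only: an imbalance test based on weighted load. Two rqs
 * with h_nr_running == 2 each can still be far apart in weight when one
 * of them only runs tasks from a low cpu.shares cgroup.
 */
static inline bool src_much_busier(struct rq *src_rq, struct rq *dst_rq)
{
	return src_rq->cfs.load.weight > 2 * dst_rq->cfs.load.weight;
}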

What do you think?

Thanks,
Vineeth


Re: [RFC PATCH 00/13] Core scheduling v5

2020-06-27 Thread Joel Fernandes
On Fri, Jun 26, 2020 at 11:10 AM Joel Fernandes  wrote:
> [..]
> > What do you think about having a separate cgroup for coresched?
> > Both coresched cgroup and prctl() could co-exist where prctl could
> > be used to isolate individual process or task and coresched cgroup
> > to group trusted processes.
>
> This sounds like a fine idea to me. I wonder how Tejun and Peter feel about
> having a new attribute-less CGroup controller for core-scheduling and just
> use that for tagging. (No need to even have a tag file, just adding/removing
> to/from CGroup will tag).

Unless there are any major objections to this idea, or better ideas
for CGroup users, we will consider proposing a new CGroup controller
for this. The issue with CPU controller CGroups is that they may be
configured in a way that is incompatible with tagging.

And I was also thinking of a new clone flag, CLONE_CORE, which would allow
a child to share its parent's core. This is because the fork semantics
are not clear, and it may be better to leave the behavior of fork to
userspace, IMHO, than to hard-code policy in the kernel.
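
Purely as a sketch of what that could look like from userspace (CLONE_CORE
below is a made-up flag value, not an existing ABI; the kernel side would
decide whether the child inherits the parent's cookie):

#define _GNU_SOURCE
#include <linux/sched.h>	/* struct clone_args */
#include <sys/syscall.h>
#include <signal.h>
#include <unistd.h>

/* Hypothetical flag: "child shares the parent's core-scheduling cookie". */
#define CLONE_CORE	0x400000000ULL	/* illustrative bit, not allocated */

static pid_t fork_same_core(void)
{
	struct clone_args args = {
		.flags       = CLONE_CORE,	/* opt in to cookie inheritance */
		.exit_signal = SIGCHLD,
	};

	/* Returns 0 in the child, the child's pid in the parent. */
	return syscall(SYS_clone3, &args, sizeof(args));
}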

Perhaps we can also discuss this at the scheduler MC at Plumbers.

Any other thoughts?

 - Joel


Re: [RFC PATCH 00/13] Core scheduling v5

2020-06-26 Thread Joel Fernandes
On Fri, Jun 26, 2020 at 11:10:28AM -0400, Joel Fernandes wrote:
> On Fri, Jun 26, 2020 at 10:36:01AM -0400, Vineeth Remanan Pillai wrote:
> > On Thu, Jun 25, 2020 at 9:47 PM Joel Fernandes  
> > wrote:
> > >
> > > On Thu, Jun 25, 2020 at 4:12 PM Vineeth Remanan Pillai
> > >  wrote:
> > > [...]
> > > > TODO lists:
> > > >
> > > >  - Interface discussions could not come to a conclusion in v5 and hence 
> > > > would
> > > >like to restart the discussion and reach a consensus on it.
> > > >- 
> > > > https://lwn.net/ml/linux-kernel/20200520222642.70679-1-j...@joelfernandes.org
> > >
> > > Thanks Vineeth, just want to add: I have a revised implementation of
> > > prctl(2) where you only pass a TID of a task you'd like to share a
> > > core with (credit to Peter for the idea [1]) so we can make use of
> > > ptrace_may_access() checks. I am currently finishing writing of
> > > kselftests for this and post it all once it is ready.
> > >
> > Thinking more about it, using TID/PID for prctl(2) and internally
> > using a task identifier to identify coresched group may have
> > limitations. A coresched group can exist longer than the lifetime
> > of a task and then there is a chance for that identifier to be
> > reused by a newer task which may or may not be part of the same
> > coresched group.
> 
> True, for the prctl(2) tagging (a task wanting to share core with
> another) we will need some way of internally identifying groups which does
> not depend on any value that can be reused for another purpose.
> 
> [..]
> > What do you think about having a separate cgroup for coresched?
> > Both coresched cgroup and prctl() could co-exist where prctl could
> > be used to isolate individual process or task and coresched cgroup
> > to group trusted processes.
> 
> This sounds like a fine idea to me. I wonder how Tejun and Peter feel about
> having a new attribute-less CGroup controller for core-scheduling and just
> use that for tagging. (No need to even have a tag file, just adding/removing
> to/from CGroup will tag).

+Tejun

thanks,

 - Joel


> > > However a question: If using the prctl(2) on a CGroup tagged task, we
> > > discussed in previous threads [2] to override the CGroup cookie such
> > > that the task may not share a core with any of the tasks in its CGroup
> > > anymore and I think Peter and Phil are OK with it.  My question though is
> > > - would that not be confusing for anyone looking at the CGroup
> > > filesystem's "tag" and "tasks" files?
> > >
> > Having a dedicated cgroup for coresched could solve this problem
> > as well. "coresched.tasks" inside the cgroup hierarchy would list all
> > the tasks in the group and prctl can override this and take it out
> > of the group.
> 
> We don't even need coresched.tasks, just the existing 'tasks' of CGroups can
> be used.
> 
> > > To resolve this, I am proposing to add a new CGroup file
> > > 'tasks.coresched' to the CGroup, and this will only contain tasks that
> > > were assigned cookies due to their CGroup residency. As soon as one
> > > prctl(2)'s the task, it will stop showing up in the CGroup's
> > > "tasks.coresched" file (unless of course it was requesting to
> > > prctl-share a core with someone in its CGroup itself). Are folks Ok
> > > with this solution?
> > >
> > As I mentioned above, IMHO cpu cgroups should not be used to account
> > for core scheduling as well. Cpu cgroups serve a different purpose
> > and overloading it with core scheduling would not be flexible and
> > scalable. But if there is a consensus to move forward with cpu cgroups,
> > adding this new file seems to be okay with me.
> 
> Yes, this is the problem. Many people use CPU controller CGroups already for
> other purposes. In that case, tagging a CGroup would make all the entities in
> the group be able to share a core, which may not always make sense. Maybe a
> new CGroup controller is the answer (?).
> 
> thanks,
> 
>  - Joel
> 


Re: [RFC PATCH 00/13] Core scheduling v5

2020-06-26 Thread Joel Fernandes
On Fri, Jun 26, 2020 at 10:36:01AM -0400, Vineeth Remanan Pillai wrote:
> On Thu, Jun 25, 2020 at 9:47 PM Joel Fernandes  wrote:
> >
> > On Thu, Jun 25, 2020 at 4:12 PM Vineeth Remanan Pillai
> >  wrote:
> > [...]
> > > TODO lists:
> > >
> > >  - Interface discussions could not come to a conclusion in v5 and hence 
> > > would
> > >like to restart the discussion and reach a consensus on it.
> > >- 
> > > https://lwn.net/ml/linux-kernel/20200520222642.70679-1-j...@joelfernandes.org
> >
> > Thanks Vineeth, just want to add: I have a revised implementation of
> > prctl(2) where you only pass a TID of a task you'd like to share a
> > core with (credit to Peter for the idea [1]) so we can make use of
> > ptrace_may_access() checks. I am currently finishing writing of
> > kselftests for this and post it all once it is ready.
> >
> Thinking more about it, using TID/PID for prctl(2) and internally
> using a task identifier to identify coresched group may have
> limitations. A coresched group can exist longer than the lifetime
> of a task and then there is a chance for that identifier to be
> reused by a newer task which may or may not be part of the same
> coresched group.

True, for the prctl(2) tagging (a task wanting to share core with
another) we will need some way of internally identifying groups which does
not depend on any value that can be reused for another purpose.

[..]
> What do you think about having a separate cgroup for coresched?
> Both coresched cgroup and prctl() could co-exist where prctl could
> be used to isolate individual process or task and coresched cgroup
> to group trusted processes.

This sounds like a fine idea to me. I wonder how Tejun and Peter feel about
having a new attribute-less CGroup controller for core-scheduling and just
use that for tagging. (No need to even have a tag file, just adding/removing
to/from CGroup will tag).

> > However a question: If using the prctl(2) on a CGroup tagged task, we
> > discussed in previous threads [2] to override the CGroup cookie such
> > that the task may not share a core with any of the tasks in its CGroup
> > anymore and I think Peter and Phil are OK with it.  My question though is
> > - would that not be confusing for anyone looking at the CGroup
> > filesystem's "tag" and "tasks" files?
> >
> Having a dedicated cgroup for coresched could solve this problem
> as well. "coresched.tasks" inside the cgroup hierarchy would list all
> the tasks in the group and prctl can override this and take it out
> of the group.

We don't even need coresched.tasks, just the existing 'tasks' of CGroups can
be used.

> > To resolve this, I am proposing to add a new CGroup file
> > 'tasks.coresched' to the CGroup, and this will only contain tasks that
> > were assigned cookies due to their CGroup residency. As soon as one
> > prctl(2)'s the task, it will stop showing up in the CGroup's
> > "tasks.coresched" file (unless of course it was requesting to
> > prctl-share a core with someone in its CGroup itself). Are folks Ok
> > with this solution?
> >
> As I mentioned above, IMHO cpu cgroups should not be used to account
> for core scheduling as well. Cpu cgroups serve a different purpose
> and overloading it with core scheduling would not be flexible and
> scalable. But if there is a consensus to move forward with cpu cgroups,
> adding this new file seems to be okay with me.

Yes, this is the problem. Many people use CPU controller CGroups already for
other purposes. In that case, tagging a CGroup would make all the entities in
the group be able to share a core, which may not always make sense. Maybe a
new CGroup controller is the answer (?).

thanks,

 - Joel



Re: [RFC PATCH 00/13] Core scheduling v5

2020-06-26 Thread Vineeth Remanan Pillai
On Thu, Jun 25, 2020 at 9:47 PM Joel Fernandes  wrote:
>
> On Thu, Jun 25, 2020 at 4:12 PM Vineeth Remanan Pillai
>  wrote:
> [...]
> > TODO lists:
> >
> >  - Interface discussions could not come to a conclusion in v5 and hence 
> > would
> >like to restart the discussion and reach a consensus on it.
> >- 
> > https://lwn.net/ml/linux-kernel/20200520222642.70679-1-j...@joelfernandes.org
>
> Thanks Vineeth, just want to add: I have a revised implementation of
> prctl(2) where you only pass a TID of a task you'd like to share a
> core with (credit to Peter for the idea [1]) so we can make use of
> ptrace_may_access() checks. I am currently finishing writing of
> kselftests for this and post it all once it is ready.
>
Thinking more about it, using TID/PID for prctl(2) and internally
using a task identifier to identify a coresched group may have
limitations. A coresched group can exist longer than the lifetime
of a task, and then there is a chance for that identifier to be
reused by a newer task which may or may not be part of the same
coresched group.

A way to overcome this is to have a coresched group with a separate
identifier implemented internally, and a mapping from task to the
group. And the cgroup framework provides exactly that.
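
Purely as an illustration of that direction (this is not code from the
series), the internal object could be a small refcounted group that tasks
point to, so a recycled TID can never alias a live group:

/* Kernel-side sketch, illustrative only. */
struct coresched_group {
	unsigned long	cookie;	/* what the core-wide pick logic compares */
	refcount_t	ref;	/* the group can outlive any single member */
};

/*
 * task_struct would then carry a pointer to its group (or NULL), and the
 * prctl()/cgroup paths would take and drop references as tasks join and
 * leave:
 *
 *	struct coresched_group *core_group;
 */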

I feel we could use prctl for isolating individual tasks/processes
and use grouping frameworks like cgroup for core scheduling groups.
The cpu cgroup might not be a good fit as it has its own purpose. Users
might not always want a group of trusted tasks in the same cpu cgroup,
and all the processes in an existing cpu cgroup might not be mutually
trusted either.

What do you think about having a separate cgroup for coresched?
Both coresched cgroup and prctl() could co-exist where prctl could
be used to isolate individual process or task and coresched cgroup
to group trusted processes.

> However a question: If using the prctl(2) on a CGroup tagged task, we
> discussed in previous threads [2] to override the CGroup cookie such
> that the task may not share a core with any of the tasks in its CGroup
> anymore and I think Peter and Phil are OK with it.  My question though is
> - would that not be confusing for anyone looking at the CGroup
> filesystem's "tag" and "tasks" files?
>
Having a dedicated cgroup for coresched could solve this problem
as well. "coresched.tasks" inside the cgroup hierarchy would list all
the tasks in the group and prctl can override this and take it out
of the group.

> To resolve this, I am proposing to add a new CGroup file
> 'tasks.coresched' to the CGroup, and this will only contain tasks that
> were assigned cookies due to their CGroup residency. As soon as one
> prctl(2)'s the task, it will stop showing up in the CGroup's
> "tasks.coresched" file (unless of course it was requesting to
> prctl-share a core with someone in its CGroup itself). Are folks Ok
> with this solution?
>
As I mentioned above, IMHO cpu cgroups should not be used to account
for core scheduling as well. Cpu cgroups serve a different purpose
and overloading it with core scheduling would not be flexible and
scalable. But if there is a consensus to move forward with cpu cgroups,
adding this new file seems to be okay with me.

Thoughts/suggestions/concerns?

Thanks,
Vineeth


Re: [RFC PATCH 00/13] Core scheduling v5

2020-06-25 Thread Joel Fernandes
On Thu, Jun 25, 2020 at 4:12 PM Vineeth Remanan Pillai
 wrote:
[...]
> TODO lists:
>
>  - Interface discussions could not come to a conclusion in v5 and hence would
>like to restart the discussion and reach a consensus on it.
>- 
> https://lwn.net/ml/linux-kernel/20200520222642.70679-1-j...@joelfernandes.org

Thanks Vineeth, just want to add: I have a revised implementation of
prctl(2) where you only pass the TID of a task you'd like to share a
core with (credit to Peter for the idea [1]) so we can make use of
ptrace_may_access() checks. I am currently finishing writing the
kselftests for this and will post it all once it is ready.
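
To make that concrete, the userspace side would be on the order of the
following sketch (the command name and number are placeholders, since the
exact ABI is part of what still needs discussion):

#include <sys/prctl.h>
#include <sys/types.h>

/* Placeholder name and value for the proposed command, not a real prctl. */
#define PR_SCHED_CORE_SHARE	0x53434f52

/*
 * Ask the kernel to let the calling task share a core with 'tid'. The
 * kernel side is expected to gate this with ptrace_may_access(), so it
 * only succeeds if the caller could also ptrace the target task.
 */
static int share_core_with(pid_t tid)
{
	return prctl(PR_SCHED_CORE_SHARE, (unsigned long)tid, 0, 0, 0);
}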

However, a question: if using prctl(2) on a CGroup-tagged task, we
discussed in previous threads [2] overriding the CGroup cookie such
that the task may not share a core with any of the tasks in its CGroup
anymore, and I think Peter and Phil are OK with it.  My question though
is: would that not be confusing for anyone looking at the CGroup
filesystem's "tag" and "tasks" files?

To resolve this, I am proposing to add a new CGroup file
'tasks.coresched' to the CGroup, and this will only contain tasks that
were assigned cookies due to their CGroup residency. As soon as one
prctl(2)'s the task, it will stop showing up in the CGroup's
"tasks.coresched" file (unless of course it was requesting to
prctl-share a core with someone in its CGroup itself). Are folks Ok
with this solution?

[1]  
https://lore.kernel.org/lkml/20200528170128.gn2...@worktop.programming.kicks-ass.net/
[2] 
https://lore.kernel.org/lkml/20200524140046.ga5...@lorien.usersys.redhat.com/


Re: [RFC PATCH 00/13] Core scheduling v5

2020-06-25 Thread Vineeth Remanan Pillai
On Wed, Mar 4, 2020 at 12:00 PM vpillai  wrote:
>
>
> Fifth iteration of the Core-Scheduling feature.
>
It's probably time for another iteration, and we are planning to post v6
based on this branch:
 https://github.com/digitalocean/linux-coresched/tree/coresched/pre-v6-v5.7.y

Just wanted to share the details about v6 here before posting the patch
series. If there is no objection to the following, we shall be posting
the v6 early next week.

The main changes in v6 are the following:
1. Address Peter's comments in v5
   - Code cleanup
   - Remove fixes related to hotplugging.
   - Split the patch out for force idling starvation
2. Fix for RCU deadlock
3. Core-wide priority comparison minor rework.
4. IRQ Pause patch
5. Documentation
   - 
https://github.com/digitalocean/linux-coresched/blob/coresched/pre-v6-v5.7.y/Documentation/admin-guide/hw-vuln/core-scheduling.rst

This version is much leaner compared to v5 due to the removal of hotplug
support. As a result, dynamic coresched enable/disable on cpus driven by
smt on/off on the core does not function anymore. I tried to reproduce the
crashes during hotplug, but could not reproduce them reliably. The plan is
to try to reproduce the crashes with v6, and document each corner case for
crashes as we fix them. Previously, we randomly fixed the issues without
clear documentation and the fixes became complex over time.

TODO lists:

 - Interface discussions could not come to a conclusion in v5 and hence would
   like to restart the discussion and reach a consensus on it.
   - 
https://lwn.net/ml/linux-kernel/20200520222642.70679-1-j...@joelfernandes.org

 - Core wide vruntime calculation needs rework:
   - 
https://lwn.net/ml/linux-kernel/20200506143506.gh5...@hirez.programming.kicks-ass.net

 - Load balancing/migration changes ignores group weights:
   - 
https://lwn.net/ml/linux-kernel/20200225034438.GA617271@ziqianlu-desktop.localdomain


Please have a look and let me know comments/suggestions or anything missed.

Thanks,
Vineeth


Re: [RFC PATCH 00/13] Core scheduling v5

2020-05-09 Thread Dario Faggioli
On Tue, 2020-04-14 at 16:21 +0200, Peter Zijlstra wrote:
> On Wed, Mar 04, 2020 at 04:59:50PM +, vpillai wrote:
> > 
> > - Investigate the source of the overhead even when no tasks are
> > tagged:
> >   https://lkml.org/lkml/2019/10/29/242
> 
>  - explain why we're all still doing this 
> 
> Seriously, what actual problems does it solve? The patch-set still
> isn't
> L1TF complete and afaict it does exactly nothing for MDS.
> 
Hey Peter! Late to the party, I know...

But I'm replying anyway. At least, you'll have the chance to yell at me
for this during OSPM. ;-P

> Like I've written many times now, back when the world was simpler and
> all we had to worry about was L1TF, core-scheduling made some sense,
> but
> how does it make sense today?
> 
Indeed core-scheduling alone doesn't even completely solve L1TF. There
are the interrupts and the VMEXITs issues. Both are being discussed in
this thread and, FWIW, my personal opinion is that the way to go is
what Alex says here:

<79529592-5d60-2a41-fbb6-4a5f8279f...@amazon.com>

(E.g., when he mentions solution 4 "Create a "safe" page table which
runs with HT enabled", etc).

But let's stick to your point: if it were only for L1TF, then fine, but
it's all pointless because of MDS. My answer to this is very much
focused on my usecase, which is virtualization. I know you hate us, and
you surely have your good reasons, but you know... :-)

Correct me if I'm wrong, but I think that the "nice" thing of L1TF is
that it allows a VM to spy on another VM or on the host, but it does
not allow a regular task to spy on another task or on the kernel (well,
it would, but it's easily mitigated).

The bad thing about MDS is that it instead allows *all* of that.

Now, one thing that we absolutely want to avoid in virt is a VM being
able to spy on other VMs or on the host. Sure, we also care about tasks
running in our VMs being safe, but, really, inter-VM and VM-to-host
isolation is the primary concern of a hypervisor.

And how can a VM (or stuff running inside a VM) spy on another VM or on
the host, via L1TF or MDS? Well, if the attacker VM and the victim VM
--or the attacker VM and the host-- are running on the same core. If
they're not, it can't... which basically looks like an L1TF-only
scenario.

So, in virt, core-scheduling:
1) is the *only* way (aside from no-EPT) to prevent an attacker VM from
   spying on a victim VM, if they're running concurrently, both in guest
   mode, on the same core (and that's, of course, because with
   core-scheduling they just won't be doing that :-) )
2) interrupts and VMEXITs need to be taken care of --which was already
   the case when, as you said, "we had only L1TF". Once that is done
   we will effectively prevent all VM-to-VM and VM-to-host attack
   scenarios.

Sure, it will still be possible, for instance, for task_A in VM1 to spy
on task_B, also in VM1. This seems to be, AFAIUI, Joel's usecase, so
I'm happy to leave it to him to defend that, as he's doing already (but
indeed I'm very happy to see that it is also getting attention).

Now, of course saying anything like "works for my own usecase so let's
go for it" does not fly. But since you were asking whether and how this
feature could make sense today, suppose that:
1) we get core-scheduling,
2) we find a solution for irqs and VMEXITs, as we would have to if 
   there was only L1TF,
3) we manage to make the overhead of core-scheduling close to zero 
   when it's there (I mean, enabled at compile time) but not used (I
   mean, no tagging of tasks, or whatever).

That would mean that virt people can enable core-scheduling, and
achieve good inter-VM and VM-to-host isolation, without imposing
overhead on other use cases, which would leave core-scheduling disabled.

And this is something that, I would think, makes sense.

Of course, we're not there... because even once this series gives
us point 1, we will also need 2, and we need to make sure we also
satisfy 3 (and we didn't, last time I checked ;-P).

But I think it's worth keeping trying.

I'd also add a couple of more ideas, still about core-scheduling in
virt, but from a different standpoint than security:
- if I tag vcpu0 and vcpu1 together[*], then vcpu2 and vcpu3 together,
  then vcpu4 and vcpu5 together, then I'm sure that each pair will
  always be scheduled on the same core. At which point I can define
  an SMT virtual topology, for the VM, that will make sense, even
  without pinning the vcpus;
- if I run VMs from different customers, when vcpu2 of VM1 and vcpu1
  of VM2 run on the same core, they influence each other's performance.
  If, e.g., I bill based on time spent on CPUs, it means customer
  A's workload, running in VM1, may influence the billing of customer
  B, who owns VM2. With core scheduling, if I tag all the vcpus of each
  VM together, I won't have this any longer.

[*] with "tag together" I mean let them have the same tag which, ATM
would be "put them in the same cgroup and enable cpu.tag".


Re: [RFC PATCH 00/13] Core scheduling v5

2020-05-08 Thread Ning, Hongyu


- Test environment:
Intel Xeon Server platform
CPU(s):  192
On-line CPU(s) list: 0-191
Thread(s) per core:  2
Core(s) per socket:  48
Socket(s):   2
NUMA node(s):4

- Kernel under test: 
Core scheduling v5 base
https://github.com/digitalocean/linux-coresched/tree/coresched/v5-v5.5.y

- Test set based on sysbench 1.1.0-bd4b418:
A: sysbench cpu in cgroup cpu 1 + sysbench mysql in cgroup mysql 1
   (192 workload tasks for each cgroup)
B: sysbench cpu in cgroup cpu 1 + sysbench cpu in cgroup cpu 2 +
   sysbench mysql in cgroup mysql 1 + sysbench mysql in cgroup mysql 2
   (192 workload tasks for each cgroup)

- Test results briefing:
1 Good results:
1.1 For test set A, coresched could achieve the same or better performance
    compared to smt_off, for both the cpu workload and the mysql workload
1.2 For test set B, cpu workload, coresched could achieve better performance
    compared to smt_off
2 Bad results:
2.1 For test set B, mysql workload, coresched performance is lower than
    smt_off; potential fairness issue between cpu workloads and mysql workloads
2.2 For test set B, cpu workload, potential fairness issue between the two
    cgroups' cpu workloads

- Test results:
Note: results in the following tables are throughput (Tput) normalized to
the default baseline.

-- Test set A Tput normalized results (192 tasks per cgroup):

+------------+-------------------+---------+-----------+---------+
| cgroup     | sysbench workload | default | coresched | smt_off |
+------------+-------------------+---------+-----------+---------+
| cg cpu 1   | cpu               | 1       | 0.95      | 0.54    |
| cg mysql 1 | mysql             | 1       | 0.92      | 0.97    |
+------------+-------------------+---------+-----------+---------+

-- Test set B Tput normalized results (192 tasks per cgroup):

+------------+-------------------+---------+-----------+---------+
| cgroup     | sysbench workload | default | coresched | smt_off |
+------------+-------------------+---------+-----------+---------+
| cg cpu 1   | cpu               | 1       | 0.9       | 0.47    |
| cg cpu 2   | cpu               | 1       | 1.32      | 0.66    |
| cg mysql 1 | mysql             | 1       | 0.42      | 0.89    |
| cg mysql 2 | mysql             | 1       | 0.42      | 0.89    |
+------------+-------------------+---------+-----------+---------+


> On Date: Wed,  4 Mar 2020 16:59:50 +, vpillai  
> wrote:
> To: Nishanth Aravamudan , Julien Desfossez 
> , Peter Zijlstra , Tim 
> Chen , mi...@kernel.org, t...@linutronix.de, 
> p...@google.com, torva...@linux-foundation.org
> CC: vpillai , linux-kernel@vger.kernel.org, 
> fweis...@gmail.com, keesc...@chromium.org, kerr...@google.com, Phil Auld 
> , Aaron Lu , Aubrey Li 
> , aubrey...@linux.intel.com, Valentin Schneider 
> , Mel Gorman , Pawan 
> Gupta , Paolo Bonzini 
> , Joel Fernandes , 
> j...@joelfernandes.org
> 
>