Re: [RFC][PATCH 00/16] sched: Core scheduling

2019-03-11 Thread Greg Kerr
On Mon, Mar 11, 2019 at 4:36 PM Subhra Mazumdar
 wrote:
>
>
> On 3/11/19 11:34 AM, Subhra Mazumdar wrote:
> >
> > On 3/10/19 9:23 PM, Aubrey Li wrote:
> >> On Sat, Mar 9, 2019 at 3:50 AM Subhra Mazumdar
> >>  wrote:
> >>> expected. Most of the performance recovery happens in patch 15 which,
> >>> unfortunately, is also the one that introduces the hard lockup.
> >>>
> >> After applying Subhra's patch, the following is triggered by enabling
> >> core sched when a cgroup is under heavy load.
> >>
> > It seems you are facing some other deadlock where printk is involved. Can
> > you drop the last patch (patch 16, "sched: Debug bits...") and try?
> >
> > Thanks,
> > Subhra
> >
> Never mind, I am seeing the same lockdep deadlock output even w/o patch 16.
> Btw the NULL fix had something missing; the following works.

Is this panic below, which occurs when I tag the first process,
related or known? If not, I will debug it tomorrow.

[   46.831828] BUG: unable to handle kernel NULL pointer dereference
at 
[   46.831829] core sched enabled
[   46.834261] #PF error: [WRITE]
[   46.834899] PGD 0 P4D 0
[   46.835438] Oops: 0002 [#1] SMP PTI
[   46.836158] CPU: 0 PID: 11 Comm: migration/0 Not tainted
5.0.0everyday-glory-03949-g2d8fdbb66245-dirty #7
[   46.838206] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.10.2-1 04/01/2014
[   46.839844] RIP: 0010:_raw_spin_lock+0x7/0x20
[   46.840448] Code: 00 00 00 65 81 05 25 ca 5c 51 00 02 00 00 31 c0
ba ff 00 00 00 f0 0f b1 17 74 05 e9 93 80 46 ff f3 c3 90 31 c0 ba 01
00 00 00  0f b1 17 74 07 89 c6 e9 1c 6e 46 ff f3 c3 66 2e 0f 1f 84
00 00
[   46.843000] RSP: 0018:b9d300cabe38 EFLAGS: 00010046
[   46.843744] RAX:  RBX:  RCX: 0004
[   46.844709] RDX: 0001 RSI: aea435ae RDI: 
[   46.845689] RBP: b9d300cabed8 R08:  R09: 00020800
[   46.846651] R10: af603ea0 R11: 0001 R12: af6576c0
[   46.847619] R13: 9a57366c8000 R14: 9a5737401300 R15: ade868f0
[   46.848584] FS:  () GS:9a5737a0()
knlGS:
[   46.849680] CS:  0010 DS:  ES:  CR0: 80050033
[   46.850455] CR2:  CR3: 0001d36fa000 CR4: 06f0
[   46.851415] DR0:  DR1:  DR2: 
[   46.852371] DR3:  DR6: fffe0ff0 DR7: 0400
[   46.853326] Call Trace:
[   46.853678]  __schedule+0x139/0x11f0
[   46.854167]  ? cpumask_next+0x16/0x20
[   46.854668]  ? cpu_stop_queue_work+0xc0/0xc0
[   46.855252]  ? sort_range+0x20/0x20
[   46.855742]  schedule+0x4e/0x60
[   46.856171]  smpboot_thread_fn+0x12a/0x160
[   46.856725]  kthread+0x112/0x120
[   46.857164]  ? kthread_stop+0xf0/0xf0
[   46.857661]  ret_from_fork+0x35/0x40
[   46.858146] Modules linked in:
[   46.858562] CR2: 
[   46.859022] ---[ end trace e9fff08f17bfd2be ]---

- Greg

>
> ->8
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1d0dac4..27cbc64 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4131,7 +4131,7 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
>  	 * Avoid running the skip buddy, if running something else can
>  	 * be done without getting too unfair.
>  	 */
> -	if (cfs_rq->skip == se) {
> +	if (cfs_rq->skip && cfs_rq->skip == se) {
>  		struct sched_entity *second;
>
>  		if (se == curr) {
> @@ -4149,13 +4149,15 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
>  	/*
>  	 * Prefer last buddy, try to return the CPU to a preempted task.
>  	 */
> -	if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
> +	if (left && cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
>  		se = cfs_rq->last;
>
>  	/*
>  	 * Someone really wants this to run. If it's not unfair, run it.
>  	 */
> -	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
> +	if (left && cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
>  		se = cfs_rq->next;
>
>  	clear_buddies(cfs_rq, se);
> @@ -6958,6 +6960,9 @@ pick_task_fair(struct rq *rq)
>
>  	se = pick_next_entity(cfs_rq, NULL);
>
> +	if (!(se || curr))
> +		return NULL;
> +
>  	if (curr) {
>  		if (se && curr->on_rq)
>  			update_curr(cfs_rq);
>


Re: [RFC][PATCH 00/16] sched: Core scheduling

2019-02-20 Thread Greg Kerr
On Wed, Feb 20, 2019 at 10:42:55AM +0100, Peter Zijlstra wrote:
> 
> A: Because it messes up the order in which people normally read text.
> Q: Why is top-posting such a bad thing?
> A: Top-posting.
> Q: What is the most annoying thing in e-mail?
> 
I am relieved to know that when my mail client embeds HTML tags into raw
text, it will only be the second most annoying thing I've done on
e-mail.

Speaking of annoying things to do, sorry for switching e-mail addresses
but this is easier to do from my personal e-mail.

> On Tue, Feb 19, 2019 at 02:07:01PM -0800, Greg Kerr wrote:
> > Thanks for posting this patchset Peter. Based on the patch titled, "sched: A
> > quick and dirty cgroup tagging interface," I believe cgroups are used to
> > define co-scheduling groups in this implementation.
> > 
> > Chrome OS engineers (kerr...@google.com, mpden...@google.com, and
> > pal...@google.com) are considering an interface that is usable by
> > unprivileged userspace apps. cgroups are a global resource that require
> > privileged access. Have you considered an interface that is akin to
> > namespaces? Consider the following strawperson API proposal (I understand
> > prctl() is generally used for process-specific actions, so we aren't
> > married to using prctl()):
> 
> I don't think we're anywhere near the point where I care about
> interfaces with this stuff.
> 
> Interfaces are a trivial but tedious matter once the rest works to
> satisfaction.
> 
I agree that the API itself is a bit of bike-shedding, and that's why I
provided a strawperson proposal to highlight the desired properties. I do
think the high-level semantics are important to agree upon.

Using cgroups could imply that a privileged user is meant to create and
track all the core scheduling groups. It sounds like you picked cgroups
for ease of prototyping rather than for its specific behavior?

> As it happens; there is actually a bug in that very cgroup patch that
> can cause undesired scheduling. Try spotting and fixing that.
> 
This is where I think the high level properties of core scheduling are
relevant. I'm not sure what bug is in the existing patch, but it's hard
for me to tell if the existing code behaves correctly without answering
questions, such as, "Should processes from two separate parents be
allowed to co-execute?"

> Another question is if we want to be L1TF complete (and how strict) or
> not, and if so, build the missing pieces (for instance we currently
> don't kick siblings on IRQ/trap/exception entry -- and yes that's nasty
> and horrible code and missing for that reason).
>
I assumed from the beginning that this should be safe across exceptions.
Is there a reason it shouldn't be?

> 
> So first; does this provide what we need? If that's sorted we can
> bike-shed on uapi/abi.
I agree on not bike-shedding about the API, but can we agree on some of
the high-level properties? For example, who generates the core scheduling
IDs, what properties about them are enforced, etc.?

Regards,

Greg Kerr


Re: [RFC][PATCH 00/16] sched: Core scheduling

2019-02-19 Thread Greg Kerr
Thanks for posting this patchset Peter. Based on the patch titled, "sched: A
quick and dirty cgroup tagging interface," I believe cgroups are used to
define co-scheduling groups in this implementation.

Chrome OS engineers (kerr...@google.com, mpden...@google.com, and
pal...@google.com) are considering an interface that is usable by unprivileged
userspace apps. cgroups are a global resource that require privileged access.
Have you considered an interface that is akin to namespaces? Consider the
following strawperson API proposal (I understand prctl() is generally used for
process-specific actions, so we aren't married to using prctl()):

# API Properties

The kernel introduces coscheduling groups, which specify which processes may
be executed together. An unprivileged process may use prctl() to create a
coscheduling group. The process may then join the coscheduling group, and
place any of its child processes into the coscheduling group. To provide
flexibility for unrelated processes to join pre-existing groups, an IPC
mechanism could send a coscheduling group handle between processes.

# Strawperson API Proposal
To create a new coscheduling group:
int coscheduling_group = prctl(PR_CREATE_COSCHEDULING_GROUP);

The return value is >= 0 on success and -1 on failure, with the following
possible values for errno:

ENOTSUP: This kernel doesn’t support the PR_CREATE_COSCHEDULING_GROUP
         operation.
EMFILE: The process’ kernel-side coscheduling group table is full.

To join a given process to the group:
pid_t process = /* self or child... */;
int status = prctl(PR_JOIN_COSCHEDULING_GROUP, coscheduling_group, process);
if (status) {
        err(errno, NULL);
}

The kernel will check and enforce that the given process ID really is the
caller’s own PID or a PID of one of the caller’s children, and that the given
group ID really exists. The return value is 0 on success and -1 on failure,
with the following possible values for errno:

EPERM: The caller could not join the given process to the coscheduling
       group because it was not the creator of the given coscheduling group.
EPERM: The caller could not join the given process to the coscheduling
       group because the given process was not the caller or one of the
       caller’s children.
EINVAL: The given group ID did not exist in the kernel-side coscheduling
        group table associated with the caller.
ESRCH: The given process did not exist.
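For concreteness, the checks above can be modeled in plain userspace C. This
is only an illustrative sketch of the proposed semantics, not kernel code:
the group table, create_group(), and join_group() are hypothetical stand-ins
for the kernel-side state and the two prctl() operations, and the target's
parent PID is passed in explicitly in place of a real ancestry check.

```c
#include <errno.h>
#include <sys/types.h>

/* Hypothetical model of the kernel-side coscheduling group table. */
#define MAX_GROUPS 4

struct cosched_group {
	int   in_use;
	pid_t creator;	/* only the creator may join processes to the group */
};

static struct cosched_group group_table[MAX_GROUPS];

/* Models PR_CREATE_COSCHEDULING_GROUP: returns a group ID >= 0, or -1
 * with errno = EMFILE when the caller's group table is full. */
static int create_group(pid_t caller)
{
	for (int i = 0; i < MAX_GROUPS; i++) {
		if (!group_table[i].in_use) {
			group_table[i].in_use = 1;
			group_table[i].creator = caller;
			return i;
		}
	}
	errno = EMFILE;
	return -1;
}

/* Models PR_JOIN_COSCHEDULING_GROUP with the checks listed above:
 * the group must exist, the caller must be its creator, and the target
 * must be the caller itself or one of the caller's children. */
static int join_group(int group, pid_t caller, pid_t target,
		      pid_t target_parent)
{
	if (group < 0 || group >= MAX_GROUPS || !group_table[group].in_use) {
		errno = EINVAL;	/* no such group in the caller's table */
		return -1;
	}
	if (group_table[group].creator != caller) {
		errno = EPERM;	/* only the creator may add members */
		return -1;
	}
	if (target != caller && target_parent != caller) {
		errno = EPERM;	/* target is neither self nor a child */
		return -1;
	}
	return 0;
}
```

With this model, a creator (say PID 100) can join itself and its children,
while an unrelated PID gets EPERM and a stale group ID gets EINVAL, matching
the errno list above.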

Regards,

Greg Kerr (kerr...@google.com)

On Mon, Feb 18, 2019 at 9:40 AM Peter Zijlstra  wrote:
>
>
> A much 'demanded' feature: core-scheduling :-(
>
> I still hate it with a passion, and that is part of why it took a little
> longer than 'promised'.
>
> While this one doesn't have all the 'features' of the previous (never
> published) version and isn't L1TF 'complete', I tend to like the structure
> better (relatively speaking: I hate it slightly less).
>
> This one is sched class agnostic and therefore, in principle, doesn't horribly
> wreck RT (in fact, RT could 'ab'use this by setting 'task->core_cookie = task'
> to force-idle siblings).
>
> Now, as hinted by that, there are semi sane reasons for actually having this.
> Various hardware features like Intel RDT - Memory Bandwidth Allocation, work
> per core (due to SMT fundamentally sharing caches) and therefore grouping
> related tasks on a core makes it more reliable.
>
> However; whichever way around you turn this cookie; it is expensive and nasty.
>
> It doesn't help that there are truly bonghit crazy proposals for using this
> out there, and I really hope to never see them in code.
>
> These patches are lightly tested and didn't insta explode, but no promises,
> they might just set your pets on fire.
>
> 'enjoy'
>
> @pjt; I know this isn't quite what we talked about, but this is where I ended
> up after I started typing. There's plenty design decisions to question and my
> changelogs don't even get close to beginning to cover them all. Feel free to
> ask.
>
> ---
>  include/linux/sched.h|   9 +-
>  kernel/Kconfig.preempt   |   8 +-
>  kernel/sched/core.c  | 762 ---
>  kernel/sched/deadline.c  |  99 +++---
>  kernel/sched/debug.c |   4 +-
>  kernel/sched/fair.c  | 129 +---
>  kernel/sched/idle.c  |  42 ++-
>  kernel/sched/pelt.h  |   2 +-
>  kernel/sched/rt.c|  96 +++---
>  kernel/sched/sched.h | 183 
>  kernel/sched/stop_task.c |  35 ++-
>  kernel/sched/topology.c  |   4 +-
>  kernel/stop_machine.c|   2 +
>  13 files changed, 1096 insertions(+), 279 deletions(-)
>
>