[PATCH 3/5] openrisc: traps: Don't send signals to kernel mode threads

2024-04-11 Thread Stafford Horne
OpenRISC exception handling sends signals to user processes on floating
point exceptions and trap instructions (for debugging) among others.
There is a bug where the trap handling logic may send these signals to
kernel threads. We should not send signals to kernel threads; if that
happens, treat it as an error.

This patch adds checks so that the kernel dies if these exceptions are
raised while executing in kernel mode.

Fixes: 27267655c531 ("openrisc: Support floating point user api")
Signed-off-by: Stafford Horne 
---
 arch/openrisc/kernel/traps.c | 48 +++++++++++++++++++++++++++++-------------------
 1 file changed, 29 insertions(+), 19 deletions(-)

diff --git a/arch/openrisc/kernel/traps.c b/arch/openrisc/kernel/traps.c
index 88fe27e4c10c..211ddaa0c5fa 100644
--- a/arch/openrisc/kernel/traps.c
+++ b/arch/openrisc/kernel/traps.c
@@ -180,29 +180,39 @@ asmlinkage void unhandled_exception(struct pt_regs *regs, int ea, int vector)
 
 asmlinkage void do_fpe_trap(struct pt_regs *regs, unsigned long address)
 {
-   int code = FPE_FLTUNK;
-   unsigned long fpcsr = regs->fpcsr;
-
-   if (fpcsr & SPR_FPCSR_IVF)
-   code = FPE_FLTINV;
-   else if (fpcsr & SPR_FPCSR_OVF)
-   code = FPE_FLTOVF;
-   else if (fpcsr & SPR_FPCSR_UNF)
-   code = FPE_FLTUND;
-   else if (fpcsr & SPR_FPCSR_DZF)
-   code = FPE_FLTDIV;
-   else if (fpcsr & SPR_FPCSR_IXF)
-   code = FPE_FLTRES;
-
-   /* Clear all flags */
-   regs->fpcsr &= ~SPR_FPCSR_ALLF;
-
-   force_sig_fault(SIGFPE, code, (void __user *)regs->pc);
+   if (user_mode(regs)) {
+   int code = FPE_FLTUNK;
+   unsigned long fpcsr = regs->fpcsr;
+
+   if (fpcsr & SPR_FPCSR_IVF)
+   code = FPE_FLTINV;
+   else if (fpcsr & SPR_FPCSR_OVF)
+   code = FPE_FLTOVF;
+   else if (fpcsr & SPR_FPCSR_UNF)
+   code = FPE_FLTUND;
+   else if (fpcsr & SPR_FPCSR_DZF)
+   code = FPE_FLTDIV;
+   else if (fpcsr & SPR_FPCSR_IXF)
+   code = FPE_FLTRES;
+
+   /* Clear all flags */
+   regs->fpcsr &= ~SPR_FPCSR_ALLF;
+
+   force_sig_fault(SIGFPE, code, (void __user *)regs->pc);
+   } else {
+   pr_emerg("KERNEL: Illegal fpe exception 0x%.8lx\n", regs->pc);
+   die("Die:", regs, SIGFPE);
+   }
 }
 
 asmlinkage void do_trap(struct pt_regs *regs, unsigned long address)
 {
-   force_sig_fault(SIGTRAP, TRAP_BRKPT, (void __user *)regs->pc);
+   if (user_mode(regs)) {
+   force_sig_fault(SIGTRAP, TRAP_BRKPT, (void __user *)regs->pc);
+   } else {
+   pr_emerg("KERNEL: Illegal trap exception 0x%.8lx\n", regs->pc);
+   die("Die:", regs, SIGILL);
+   }
 }
 
asmlinkage void do_unaligned_access(struct pt_regs *regs, unsigned long address)
-- 
2.44.0




Re: [PATCH] cgroup: Relax restrictions on kernel threads moving out of root cpu cgroup

2021-04-14 Thread Wei Wang
On Tue, Apr 6, 2021 at 6:39 PM Pavan Kondeti  wrote:
>
> On Tue, Apr 06, 2021 at 12:15:24PM -0400, Tejun Heo wrote:
> > Hello,
> >
> > On Tue, Apr 06, 2021 at 08:57:15PM +0530, Pavan Kondeti wrote:
> > > Yeah. The workqueue attrs comes in handy to reduce the nice/prio of a
> > > background workqueue if we identify that it is cpu intensive. However, this
> > > needs case by case analysis, tweaking etc. If there is no other alternative,
> > > we might end up chasing the background workers and reduce their nice value.
> >
> > There shouldn't be that many workqueues that consume a lot of cpu cycles.
> > The usual culprit is kswapd, IO related stuff (writeback, encryption), so it
> > shouldn't be a long list and we want them identified anyway.
> >
> Sure. I have not done a complete study on which workers in our system can
> compete with important tasks in other cgroups. We will have to do that to
> adjust the workqueue priority so that the impact can be minimized.
>

kswapd0 is actually migratable to a subgroup.

But echoing what Pavan said, the real world is not ideal and the
problematic drivers are inactive when their use case is not activated
in Android, e.g. connectivity, camera, etc. It is tricky sometimes to
track all of those.

> > > The problem at our hand (which you might be knowing already) is that, let's say
> > > we have 2 cgroups in our setup and we want to prioritize UX over background.
> > > IOW, reduce the cpu.shares of background cgroup. This helps in prioritizing
> > > Task-A and Task-B over Task-X and Task-Y. However, each individual kworker
> > > can potentially have CPU share equal to the entire UX cgroup.
> > >
> > > -kworker/0:0
> > > -kworker/1:0
> > > - UX
> > > Task-A
> > > Task-B
> > > - background
> > > Task-X
> > > Task-Y
> > >
> > > Hence we want to move all kernel threads to another cgroup so that this cgroup
> > > will have CPU share equal to UX.
> > >
> > > The patch presented here allows us to create the above setup. Any other
> > > alternative approaches to achieve the same without impacting any future
> > > designs/requirements?
> >
> > Not quite the same but we already have
> > /sys/devices/virtual/workqueue/cpumask which affects all unbound workqueues,
> > so maybe a global default priority knob would help here?
> >
>
> yeah, not exactly what we are looking for. It gives us the ability to restrict
> the priority of all unbound workqueues at the expense of not being able to
> prioritize one workqueue over another workqueue.
>

Same here. Android used to have its cgroup setup like this, where the
BG group can potentially be starved for too long (and sometimes
priority inversion is unavoidable, of course); that's the reason why I
removed the BG cgroup in Android
(https://android-review.googlesource.com/q/topic:remove_bg_cgroup).

-Task-A
-Task-B
-kworker/0:0
- background
Task-X
Task-Y

So leaving things in the root cgroup poses the same risk.

> Thanks,
> Pavan
> --
> Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
> Foundation Collaborative Project.
>


Re: [PATCH v4 08/12] perf record: introduce --threads= command line option

2021-04-12 Thread Namhyung Kim
Hello,

On Tue, Apr 6, 2021 at 5:49 PM Bayduraev, Alexey V
 wrote:
>
>
> Provide --threads option in perf record command line interface.
> The option can have a value in the form of masks that specify
> cpus to be monitored with data streaming threads and its layout
> in system topology. The masks can be filtered using cpu mask
> provided via -C option.
>
> The specification value can be user defined list of masks. Masks
> separated by colon define cpus to be monitored by one thread and
> affinity mask of that thread is separated by slash. For example:
> <cpus mask 1>/<affinity mask 1>:<cpus mask 2>/<affinity mask 2>
> specifies parallel threads layout that consists of two threads
> with corresponding assigned cpus to be monitored.
>
> The specification value can be a string e.g. "cpu", "core" or
> "socket" meaning creation of data streaming thread for every
> cpu or core or socket to monitor distinct cpus or cpus grouped
> by core or socket.
>
> The option provided with no or empty value defaults to per-cpu
> parallel threads layout creating data streaming thread for every
> cpu being monitored.
>
> Feature design and implementation are based on prototypes [1], [2].
>
> [1] git clone https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git -b perf/record_threads
> [2] https://lore.kernel.org/lkml/20180913125450.21342-1-jo...@kernel.org/
>
> Suggested-by: Jiri Olsa 
> Suggested-by: Namhyung Kim 
> Signed-off-by: Alexey Bayduraev 
> ---
[SNIP]
> +static int record__init_thread_masks_spec(struct record *rec, struct perf_cpu_map *cpus,
> +					  char **maps_spec, char **affinity_spec, u32 nr_spec)
> +{
> +   u32 s;
> +   int ret, nr_threads = 0;
> +   struct mmap_cpu_mask cpus_mask;
> +   struct thread_mask thread_mask, full_mask;
> +
> +	ret = record__mmap_cpu_mask_alloc(&cpus_mask, cpu__max_cpu());
> +	if (ret)
> +		return ret;
> +	record__mmap_cpu_mask_init(&cpus_mask, cpus);
> +	ret = record__thread_mask_alloc(&thread_mask, cpu__max_cpu());
> +	if (ret)
> +		goto out_free_cpu_mask;
> +	ret = record__thread_mask_alloc(&full_mask, cpu__max_cpu());
> +	if (ret)
> +		goto out_free_thread_mask;
> +	record__thread_mask_clear(&full_mask);
> +
> +	for (s = 0; s < nr_spec; s++) {
> +		record__thread_mask_clear(&thread_mask);
> +
> +		record__mmap_cpu_mask_init_spec(&thread_mask.maps, maps_spec[s]);
> +		record__mmap_cpu_mask_init_spec(&thread_mask.affinity, affinity_spec[s]);
> +
> +		if (!bitmap_and(thread_mask.maps.bits, thread_mask.maps.bits,
> +				cpus_mask.bits, thread_mask.maps.nbits) ||
> +		    !bitmap_and(thread_mask.affinity.bits, thread_mask.affinity.bits,
> +				cpus_mask.bits, thread_mask.affinity.nbits))
> +			continue;
> +
> +		ret = record__thread_mask_intersects(&thread_mask, &full_mask);
> +		if (ret)
> +			return ret;

I think you should free other masks.
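Something along these lines, reusing the cleanup labels that already exist at
the bottom of the function (an illustrative sketch only, not tested):

		ret = record__thread_mask_intersects(&thread_mask, &full_mask);
		if (ret)
			goto out_free_full_mask;	/* falls through and frees full, thread and cpus masks */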

> +		record__thread_mask_or(&full_mask, &full_mask, &thread_mask);
> +
> +		rec->thread_masks = realloc(rec->thread_masks,
> +					    (nr_threads + 1) * sizeof(struct thread_mask));
> +		if (!rec->thread_masks) {
> +			pr_err("Failed to allocate thread masks\n");
> +			ret = -ENOMEM;
> +			goto out_free_full_mask;

But this will leak rec->thread_masks as it's overwritten.
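One way out is to realloc() into a temporary pointer so the old block stays
reachable on failure; a minimal sketch:

		struct thread_mask *tmp;

		tmp = realloc(rec->thread_masks, (nr_threads + 1) * sizeof(*tmp));
		if (!tmp) {
			/* rec->thread_masks is still valid and can be freed on the error path */
			pr_err("Failed to allocate thread masks\n");
			ret = -ENOMEM;
			goto out_free_full_mask;
		}
		rec->thread_masks = tmp;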


> +		}
> +		rec->thread_masks[nr_threads] = thread_mask;
> +		pr_debug("thread_masks[%d]: addr=", nr_threads);
> +		mmap_cpu_mask__scnprintf(&rec->thread_masks[nr_threads].maps, "maps");
> +		pr_debug("thread_masks[%d]: addr=", nr_threads);
> +		mmap_cpu_mask__scnprintf(&rec->thread_masks[nr_threads].affinity, "affinity");
> +		nr_threads++;
> +		ret = record__thread_mask_alloc(&thread_mask, cpu__max_cpu());
> +		if (ret)
> +			return ret;

Ditto, use goto.

> +   }
> +
> +   rec->nr_threads = nr_threads;
> +   pr_debug("threads: nr_threads=%d\n", rec->nr_threads);
> +
> +out_free_full_mask:
> +	record__thread_mask_free(&full_mask);
> +out_free_thread_mask:
> +	record__thread_mask_free(&thread_mask);
> +out_free_cpu_mask:
> +	record__mmap_cpu_mask_free(&cpus_mask);
> +
> +   return 0;
> +}

[SNIP]
> +
> +static int 

[tip: sched/core] sched/fair: Ignore percpu threads for imbalance pulls

2021-04-09 Thread tip-bot2 for Lingutla Chandrasekhar
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 9bcb959d05eeb564dfc9cac13a59843a4fb2edf2
Gitweb:        https://git.kernel.org/tip/9bcb959d05eeb564dfc9cac13a59843a4fb2edf2
Author:        Lingutla Chandrasekhar 
AuthorDate:    Wed, 07 Apr 2021 23:06:26 +01:00
Committer:     Peter Zijlstra 
CommitterDate: Fri, 09 Apr 2021 18:02:20 +02:00

sched/fair: Ignore percpu threads for imbalance pulls

During load balance, LBF_SOME_PINNED will be set if any candidate task
cannot be detached due to CPU affinity constraints. This can result in
setting env->sd->parent->sgc->group_imbalance, which can lead to a group
being classified as group_imbalanced (rather than any of the other, lower
group_type) when balancing at a higher level.

In workloads involving a single task per CPU, LBF_SOME_PINNED can often be
set due to per-CPU kthreads being the only other runnable tasks on any
given rq. This results in changing the group classification during
load-balance at higher levels when in reality there is nothing that can be
done for this affinity constraint: per-CPU kthreads, as the name implies,
don't get to move around (modulo hotplug shenanigans).

It's not as clear for userspace tasks - a task could be in an N-CPU cpuset
with N-1 offline CPUs, making it an "accidental" per-CPU task rather than
an intended one. KTHREAD_IS_PER_CPU gives us an indisputable signal which
we can leverage here to not set LBF_SOME_PINNED.

Note that the aforementioned classification to group_imbalance (when
nothing can be done) is especially problematic on big.LITTLE systems, which
have a topology the likes of:

  DIE [          ]
  MC  [    ][    ]
       0  1  2  3
       L  L  B  B

  arch_scale_cpu_capacity(L) < arch_scale_cpu_capacity(B)

Here, setting LBF_SOME_PINNED due to a per-CPU kthread when balancing at MC
level on CPUs [0-1] will subsequently prevent CPUs [2-3] from classifying
the [0-1] group as group_misfit_task when balancing at DIE level. Thus, if
CPUs [0-1] are running CPU-bound (misfit) tasks, ill-timed per-CPU kthreads
can significantly delay the upmigration of said misfit tasks. Systems
relying on ASYM_PACKING are likely to face similar issues.

Signed-off-by: Lingutla Chandrasekhar 
[Use kthread_is_per_cpu() rather than p->nr_cpus_allowed]
[Reword changelog]
Signed-off-by: Valentin Schneider 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Dietmar Eggemann 
Reviewed-by: Vincent Guittot 
Link: https://lkml.kernel.org/r/20210407220628.3798191-2-valentin.schnei...@arm.com
---
 kernel/sched/fair.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bc34e35..1ad929b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7598,6 +7598,10 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
return 0;
 
+   /* Disregard pcpu kthreads; they are where they need to be. */
+   if ((p->flags & PF_KTHREAD) && kthread_is_per_cpu(p))
+   return 0;
+
if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) {
int cpu;
 


[tip: sched/core] sched/fair: Ignore percpu threads for imbalance pulls

2021-04-09 Thread tip-bot2 for Lingutla Chandrasekhar
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 8d25d10a4f5a5d87c062838358ab5b3ed7eaa131
Gitweb:        https://git.kernel.org/tip/8d25d10a4f5a5d87c062838358ab5b3ed7eaa131
Author:        Lingutla Chandrasekhar 
AuthorDate:    Wed, 07 Apr 2021 23:06:26 +01:00
Committer:     Peter Zijlstra 
CommitterDate: Fri, 09 Apr 2021 13:52:10 +02:00

sched/fair: Ignore percpu threads for imbalance pulls

During load balance, LBF_SOME_PINNED will be set if any candidate task
cannot be detached due to CPU affinity constraints. This can result in
setting env->sd->parent->sgc->group_imbalance, which can lead to a group
being classified as group_imbalanced (rather than any of the other, lower
group_type) when balancing at a higher level.

In workloads involving a single task per CPU, LBF_SOME_PINNED can often be
set due to per-CPU kthreads being the only other runnable tasks on any
given rq. This results in changing the group classification during
load-balance at higher levels when in reality there is nothing that can be
done for this affinity constraint: per-CPU kthreads, as the name implies,
don't get to move around (modulo hotplug shenanigans).

It's not as clear for userspace tasks - a task could be in an N-CPU cpuset
with N-1 offline CPUs, making it an "accidental" per-CPU task rather than
an intended one. KTHREAD_IS_PER_CPU gives us an indisputable signal which
we can leverage here to not set LBF_SOME_PINNED.

Note that the aforementioned classification to group_imbalance (when
nothing can be done) is especially problematic on big.LITTLE systems, which
have a topology the likes of:

  DIE [          ]
  MC  [    ][    ]
       0  1  2  3
       L  L  B  B

  arch_scale_cpu_capacity(L) < arch_scale_cpu_capacity(B)

Here, setting LBF_SOME_PINNED due to a per-CPU kthread when balancing at MC
level on CPUs [0-1] will subsequently prevent CPUs [2-3] from classifying
the [0-1] group as group_misfit_task when balancing at DIE level. Thus, if
CPUs [0-1] are running CPU-bound (misfit) tasks, ill-timed per-CPU kthreads
can significantly delay the upmigration of said misfit tasks. Systems
relying on ASYM_PACKING are likely to face similar issues.

Signed-off-by: Lingutla Chandrasekhar 
[Use kthread_is_per_cpu() rather than p->nr_cpus_allowed]
[Reword changelog]
Signed-off-by: Valentin Schneider 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Dietmar Eggemann 
Reviewed-by: Vincent Guittot 
Link: https://lkml.kernel.org/r/20210407220628.3798191-2-valentin.schnei...@arm.com
---
 kernel/sched/fair.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d0bd861..d10e33d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7598,6 +7598,10 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
return 0;
 
+   /* Disregard pcpu kthreads; they are where they need to be. */
+   if ((p->flags & PF_KTHREAD) && kthread_is_per_cpu(p))
+   return 0;
+
if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) {
int cpu;
 


[tip: sched/core] sched/fair: Ignore percpu threads for imbalance pulls

2021-04-09 Thread tip-bot2 for Lingutla Chandrasekhar
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 29b628b521119c0dfe151da302e11018cb32db4f
Gitweb:        https://git.kernel.org/tip/29b628b521119c0dfe151da302e11018cb32db4f
Author:        Lingutla Chandrasekhar 
AuthorDate:    Wed, 07 Apr 2021 23:06:26 +01:00
Committer:     Peter Zijlstra 
CommitterDate: Thu, 08 Apr 2021 23:09:44 +02:00

sched/fair: Ignore percpu threads for imbalance pulls

During load balance, LBF_SOME_PINNED will be set if any candidate task
cannot be detached due to CPU affinity constraints. This can result in
setting env->sd->parent->sgc->group_imbalance, which can lead to a group
being classified as group_imbalanced (rather than any of the other, lower
group_type) when balancing at a higher level.

In workloads involving a single task per CPU, LBF_SOME_PINNED can often be
set due to per-CPU kthreads being the only other runnable tasks on any
given rq. This results in changing the group classification during
load-balance at higher levels when in reality there is nothing that can be
done for this affinity constraint: per-CPU kthreads, as the name implies,
don't get to move around (modulo hotplug shenanigans).

It's not as clear for userspace tasks - a task could be in an N-CPU cpuset
with N-1 offline CPUs, making it an "accidental" per-CPU task rather than
an intended one. KTHREAD_IS_PER_CPU gives us an indisputable signal which
we can leverage here to not set LBF_SOME_PINNED.

Note that the aforementioned classification to group_imbalance (when
nothing can be done) is especially problematic on big.LITTLE systems, which
have a topology the likes of:

  DIE [          ]
  MC  [    ][    ]
       0  1  2  3
       L  L  B  B

  arch_scale_cpu_capacity(L) < arch_scale_cpu_capacity(B)

Here, setting LBF_SOME_PINNED due to a per-CPU kthread when balancing at MC
level on CPUs [0-1] will subsequently prevent CPUs [2-3] from classifying
the [0-1] group as group_misfit_task when balancing at DIE level. Thus, if
CPUs [0-1] are running CPU-bound (misfit) tasks, ill-timed per-CPU kthreads
can significantly delay the upmigration of said misfit tasks. Systems
relying on ASYM_PACKING are likely to face similar issues.

[Use kthread_is_per_cpu() rather than p->nr_cpus_allowed]
[Reword changelog]
Signed-off-by: Valentin Schneider 
Signed-off-by: Lingutla Chandrasekhar 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Dietmar Eggemann 
Reviewed-by: Vincent Guittot 
Link: https://lkml.kernel.org/r/20210407220628.3798191-2-valentin.schnei...@arm.com
---
 kernel/sched/fair.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d0bd861..d10e33d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7598,6 +7598,10 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
return 0;
 
+   /* Disregard pcpu kthreads; they are where they need to be. */
+   if ((p->flags & PF_KTHREAD) && kthread_is_per_cpu(p))
+   return 0;
+
if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) {
int cpu;
 


Re: [PATCH v4 08/12] perf record: introduce --threads= command line option

2021-04-08 Thread Jiri Olsa
On Tue, Apr 06, 2021 at 11:49:06AM +0300, Bayduraev, Alexey V wrote:

SNIP

> Suggested-by: Jiri Olsa 
> Suggested-by: Namhyung Kim 
> Signed-off-by: Alexey Bayduraev 
> ---
>  tools/include/linux/bitmap.h |  11 ++
>  tools/lib/bitmap.c   |  14 ++
>  tools/perf/builtin-record.c  | 317 ++-
>  tools/perf/util/record.h |   1 +
>  4 files changed, 341 insertions(+), 2 deletions(-)
> 
> diff --git a/tools/include/linux/bitmap.h b/tools/include/linux/bitmap.h
> index 477a1cae513f..2eb1d1084543 100644
> --- a/tools/include/linux/bitmap.h
> +++ b/tools/include/linux/bitmap.h
> @@ -18,6 +18,8 @@ int __bitmap_and(unsigned long *dst, const unsigned long *bitmap1,
>  int __bitmap_equal(const unsigned long *bitmap1,
>  const unsigned long *bitmap2, unsigned int bits);
>  void bitmap_clear(unsigned long *map, unsigned int start, int len);
> +int __bitmap_intersects(const unsigned long *bitmap1,
> + const unsigned long *bitmap2, unsigned int bits);
>  
> #define BITMAP_FIRST_WORD_MASK(start) (~0UL << ((start) & (BITS_PER_LONG - 1)))
>  
> @@ -178,4 +180,13 @@ static inline int bitmap_equal(const unsigned long *src1,
>   return __bitmap_equal(src1, src2, nbits);
>  }
>  
> +static inline int bitmap_intersects(const unsigned long *src1,
> + const unsigned long *src2, unsigned int nbits)
> +{
> + if (small_const_nbits(nbits))
> + return ((*src1 & *src2) & BITMAP_LAST_WORD_MASK(nbits)) != 0;
> + else
> + return __bitmap_intersects(src1, src2, nbits);
> +}
> +
>  #endif /* _PERF_BITOPS_H */
> diff --git a/tools/lib/bitmap.c b/tools/lib/bitmap.c
> index 5043747ef6c5..3cc3a5b43bb5 100644
> --- a/tools/lib/bitmap.c
> +++ b/tools/lib/bitmap.c
> @@ -86,3 +86,17 @@ int __bitmap_equal(const unsigned long *bitmap1,
>  
>   return 1;
>  }
> +
> +int __bitmap_intersects(const unsigned long *bitmap1,
> + const unsigned long *bitmap2, unsigned int bits)
> +{
> + unsigned int k, lim = bits/BITS_PER_LONG;
> + for (k = 0; k < lim; ++k)
> + if (bitmap1[k] & bitmap2[k])
> + return 1;
> +
> + if (bits % BITS_PER_LONG)
> + if ((bitmap1[k] & bitmap2[k]) & BITMAP_LAST_WORD_MASK(bits))
> + return 1;
> + return 0;
> +}

please move __bitmap_intersects function to the separate patch

jirka



Re: [PATCH v4 05/12] perf record: start threads in the beginning of trace streaming

2021-04-08 Thread Andi Kleen
> +	err = write(thread->pipes.ack[1], &msg, sizeof(msg));
> +	if (err == -1)
> +		pr_err("threads[%d]: failed to notify on start. Error %m", thread->tid);

It might be safer to not use %m. I'm not sure if all the non glibc
libcs that people use support it.
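A %m-free spelling would be something like the following, assuming errno is
still intact at the call site (illustrative sketch only):

	pr_err("threads[%d]: failed to notify on start: %s",
	       thread->tid, strerror(errno));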



[PATCH v5 1/3] sched/fair: Ignore percpu threads for imbalance pulls

2021-04-07 Thread Valentin Schneider
From: Lingutla Chandrasekhar 

During load balance, LBF_SOME_PINNED will be set if any candidate task
cannot be detached due to CPU affinity constraints. This can result in
setting env->sd->parent->sgc->group_imbalance, which can lead to a group
being classified as group_imbalanced (rather than any of the other, lower
group_type) when balancing at a higher level.

In workloads involving a single task per CPU, LBF_SOME_PINNED can often be
set due to per-CPU kthreads being the only other runnable tasks on any
given rq. This results in changing the group classification during
load-balance at higher levels when in reality there is nothing that can be
done for this affinity constraint: per-CPU kthreads, as the name implies,
don't get to move around (modulo hotplug shenanigans).

It's not as clear for userspace tasks - a task could be in an N-CPU cpuset
with N-1 offline CPUs, making it an "accidental" per-CPU task rather than
an intended one. KTHREAD_IS_PER_CPU gives us an indisputable signal which
we can leverage here to not set LBF_SOME_PINNED.

Note that the aforementioned classification to group_imbalance (when
nothing can be done) is especially problematic on big.LITTLE systems, which
have a topology the likes of:

  DIE [          ]
  MC  [    ][    ]
       0  1  2  3
       L  L  B  B

  arch_scale_cpu_capacity(L) < arch_scale_cpu_capacity(B)

Here, setting LBF_SOME_PINNED due to a per-CPU kthread when balancing at MC
level on CPUs [0-1] will subsequently prevent CPUs [2-3] from classifying
the [0-1] group as group_misfit_task when balancing at DIE level. Thus, if
CPUs [0-1] are running CPU-bound (misfit) tasks, ill-timed per-CPU kthreads
can significantly delay the upmigration of said misfit tasks. Systems
relying on ASYM_PACKING are likely to face similar issues.

Signed-off-by: Lingutla Chandrasekhar 
[Use kthread_is_per_cpu() rather than p->nr_cpus_allowed]
[Reword changelog]
Signed-off-by: Valentin Schneider 
Reviewed-by: Vincent Guittot 
Reviewed-by: Dietmar Eggemann 
---
 kernel/sched/fair.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6d73bdbb2d40..04d5e14fa261 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7567,6 +7567,10 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
return 0;
 
+   /* Disregard pcpu kthreads; they are where they need to be. */
+   if ((p->flags & PF_KTHREAD) && kthread_is_per_cpu(p))
+   return 0;
+
if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) {
int cpu;
 
-- 
2.25.1



Re: [PATCH] cgroup: Relax restrictions on kernel threads moving out of root cpu cgroup

2021-04-06 Thread Pavan Kondeti
On Tue, Apr 06, 2021 at 12:15:24PM -0400, Tejun Heo wrote:
> Hello,
> 
> On Tue, Apr 06, 2021 at 08:57:15PM +0530, Pavan Kondeti wrote:
> > Yeah. The workqueue attrs comes in handy to reduce the nice/prio of a
> > background workqueue if we identify that it is cpu intensive. However, this
> > needs case by case analysis, tweaking etc. If there is no other alternative,
> > we might end up chasing the background workers and reduce their nice value.
> 
> There shouldn't be that many workqueues that consume a lot of cpu cycles.
> The usual culprit is kswapd, IO related stuff (writeback, encryption), so it
> shouldn't be a long list and we want them identified anyway.
> 
Sure. I have not done a complete study on which workers in our system can
compete with important tasks in other cgroups. We will have to do that to
adjust the workqueue priority so that the impact can be minimized.

> > The problem at our hand (which you might be knowing already) is that, let's say
> > we have 2 cgroups in our setup and we want to prioritize UX over background.
> > IOW, reduce the cpu.shares of background cgroup. This helps in prioritizing
> > Task-A and Task-B over Task-X and Task-Y. However, each individual kworker
> > can potentially have CPU share equal to the entire UX cgroup.
> > 
> > -kworker/0:0
> > -kworker/1:0
> > - UX
> > Task-A
> > Task-B
> > - background
> > Task-X
> > Task-Y
> > Hence we want to move all kernel threads to another cgroup so that this cgroup
> > will have CPU share equal to UX.
> > 
> > The patch presented here allows us to create the above setup. Any other
> > alternative approaches to achieve the same without impacting any future
> > designs/requirements?
> 
> Not quite the same but we already have
> /sys/devices/virtual/workqueue/cpumask which affects all unbound workqueues,
> so maybe a global default priority knob would help here?
> 

yeah, not exactly what we are looking for. It gives us the ability to restrict
the priority of all unbound workqueues at the expense of not being able to
prioritize one workqueue over another workqueue.

Thanks,
Pavan
-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.



Re: [PATCH] cgroup: Relax restrictions on kernel threads moving out of root cpu cgroup

2021-04-06 Thread Tejun Heo
Hello,

On Tue, Apr 06, 2021 at 08:57:15PM +0530, Pavan Kondeti wrote:
> Yeah. The workqueue attrs comes in handy to reduce the nice/prio of a
> background workqueue if we identify that it is cpu intensive. However, this
> needs case by case analysis, tweaking etc. If there is no other alternative,
> we might end up chasing the background workers and reduce their nice value.

There shouldn't be that many workqueues that consume a lot of cpu cycles.
The usual culprit is kswapd, IO related stuff (writeback, encryption), so it
shouldn't be a long list and we want them identified anyway.

> The problem at our hand (which you might be knowing already) is that, let's say
> we have 2 cgroups in our setup and we want to prioritize UX over background.
> IOW, reduce the cpu.shares of background cgroup. This helps in prioritizing
> Task-A and Task-B over Task-X and Task-Y. However, each individual kworker
> can potentially have CPU share equal to the entire UX cgroup.
> 
> -kworker/0:0
> -kworker/1:0
> - UX
> Task-A
> Task-B
> - background
> Task-X
> Task-Y
> Hence we want to move all kernel threads to another cgroup so that this cgroup
> will have CPU share equal to UX.
> 
> The patch presented here allows us to create the above setup. Any other
> alternative approaches to achieve the same without impacting any future
> designs/requirements?

Not quite the same but we already have
/sys/devices/virtual/workqueue/cpumask which affects all unbound workqueues,
so maybe a global default priority knob would help here?

Thanks.

-- 
tejun


Re: [PATCH v4 1/3] sched/fair: Ignore percpu threads for imbalance pulls

2021-04-06 Thread Valentin Schneider
On 06/04/21 17:35, Dietmar Eggemann wrote:
> On 01/04/2021 21:30, Valentin Schneider wrote:
>> From: Lingutla Chandrasekhar 
>> 
>> During load balance, LBF_SOME_PINNED will bet set if any candidate task
>
> nitpick; s/bet/be ?
>

Yes indeed...

> [...]
>
> Reviewed-by: Dietmar Eggemann 


Re: [PATCH v4 1/3] sched/fair: Ignore percpu threads for imbalance pulls

2021-04-06 Thread Dietmar Eggemann
On 01/04/2021 21:30, Valentin Schneider wrote:
> From: Lingutla Chandrasekhar 
> 
> During load balance, LBF_SOME_PINNED will bet set if any candidate task

nitpick; s/bet/be ?

[...]

Reviewed-by: Dietmar Eggemann 


Re: [PATCH] cgroup: Relax restrictions on kernel threads moving out of root cpu cgroup

2021-04-06 Thread Pavan Kondeti
Hi Tejun,

On Tue, Apr 06, 2021 at 09:36:00AM -0400, Tejun Heo wrote:
> Hello,
> 
> On Tue, Apr 06, 2021 at 06:34:21PM +0530, Pavankumar Kondeti wrote:
> > In Android GKI, CONFIG_FAIR_GROUP_SCHED is enabled [1] to help prioritize
> > important work. Given that CPU shares of root cgroup can't be changed,
> > leaving the tasks inside root cgroup will give them higher share
> > compared to the other tasks inside important cgroups. This is mitigated
> > by moving all tasks inside root cgroup to a different cgroup after
> > Android is booted. However, there are many kernel tasks stuck in the
> > root cgroup after the boot.
> > 
> > We see all kworker threads are in the root cpu cgroup. This is because,
> > tasks with PF_NO_SETAFFINITY flag set are forbidden from cgroup migration.
> > This restriction is in place to avoid kworkers getting moved to a cpuset
> > which conflicts with kworker affinity. Relax this restriction by explicitly
> > checking if the task is moving out of a cpuset cgroup. This allows kworkers
> > to be moved out root cpu cgroup when cpu and cpuset cgroup controllers
> > are mounted on different hierarchies.
> > 
> > We also see kthreadd_task and any kernel thread created after the Android boot
> > also stuck in the root cgroup. The current code prevents kthreadd_task moving
> > out root cgroup to avoid the possibility of creating new RT kernel threads
> > inside a cgroup with no RT runtime allocated. Apply this restriction when tasks
> > are moving out of cpu cgroup under CONFIG_RT_GROUP_SCHED. This allows all
> > kernel threads to be moved out of root cpu cgroup if the kernel does not
> > enable RT group scheduling.
> 
> The fundamental reason why those kthreads are in the root cgroup is because
> they're doing work on behalf of the entire system and their resource usages
> can't be attributed to any specific cgroup. What we want to do is accounting
> actual usages to the originating cgroups so that cpu cycles spent by kswapd
> is charged to the originating cgroups, however well we can define them, and
> then throttle the origin if the consumption is going over budget for that
> cgroup's allocation. This is how we already handle shared IOs.

Thanks for your reply. I understand the reasoning on why we don't allow
kworkers to a custom cgroup.

> 
> The problem with the proposed patch is that it breaks the logical
> organization of resource hierarchy in a way which hinders proper future
> solutions.
> 
> If all you want is deprioritizing certain kworkers, please use workqueue
> attrs instead.
> 
Yeah. The workqueue attrs comes in handy to reduce the nice/prio of a
background workqueue if we identify that it is cpu intensive. However, this
needs case by case analysis, tweaking etc. If there is no other alternative,
we might end up chasing the background workers and reduce their nice value.

The problem at our hand (which you might be knowing already) is that, let's say
we have 2 cgroups in our setup and we want to prioritize UX over background.
IOW, reduce the cpu.shares of background cgroup. This helps in prioritizing
Task-A and Task-B over Task-X and Task-Y. However, each individual kworker
can potentially have CPU share equal to the entire UX cgroup.

-kworker/0:0
-kworker/1:0
- UX
Task-A
Task-B
- background
Task-X
Task-Y

Hence we want to move all kernel threads to another cgroup so that this cgroup
will have CPU share equal to UX.

The patch presented here allows us to create the above setup. Any other
alternative approaches to achieve the same without impacting any future
designs/requirements?

Thanks,
Pavan
-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.



Re: [PATCH] cgroup: Relax restrictions on kernel threads moving out of root cpu cgroup

2021-04-06 Thread Tejun Heo
Hello,

On Tue, Apr 06, 2021 at 06:34:21PM +0530, Pavankumar Kondeti wrote:
> In Android GKI, CONFIG_FAIR_GROUP_SCHED is enabled [1] to help prioritize
> important work. Given that CPU shares of root cgroup can't be changed,
> leaving the tasks inside root cgroup will give them higher share
> compared to the other tasks inside important cgroups. This is mitigated
> by moving all tasks inside root cgroup to a different cgroup after
> Android is booted. However, there are many kernel tasks stuck in the
> root cgroup after the boot.
> 
> We see all kworker threads are in the root cpu cgroup. This is because,
> tasks with PF_NO_SETAFFINITY flag set are forbidden from cgroup migration.
> This restriction is in place to avoid kworkers getting moved to a cpuset
> which conflicts with kworker affinity. Relax this restriction by explicitly
> checking if the task is moving out of a cpuset cgroup. This allows kworkers
> to be moved out root cpu cgroup when cpu and cpuset cgroup controllers
> are mounted on different hierarchies.
> 
> We also see kthreadd_task and any kernel thread created after the Android boot
> also stuck in the root cgroup. The current code prevents kthreadd_task moving
> out root cgroup to avoid the possibility of creating new RT kernel threads
> inside a cgroup with no RT runtime allocated. Apply this restriction when tasks
> are moving out of cpu cgroup under CONFIG_RT_GROUP_SCHED. This allows all
> kernel threads to be moved out of root cpu cgroup if the kernel does not
> enable RT group scheduling.

The fundamental reason why those kthreads are in the root cgroup is because
they're doing work on behalf of the entire system and their resource usages
can't be attributed to any specific cgroup. What we want to do is accounting
actual usages to the originating cgroups so that cpu cycles spent by kswapd
is charged to the originating cgroups, however well we can define them, and
then throttle the origin if the consumption is going over budget for that
cgroup's allocation. This is how we already handle shared IOs.

The problem with the proposed patch is that it breaks the logical
organization of resource hierarchy in a way which hinders proper future
solutions.

If all you want is deprioritizing certain kworkers, please use workqueue
attrs instead.
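(For reference, a minimal sketch of the attrs route, assuming an unbound
workqueue handle 'wq' and in-kernel code; illustrative only, not a tested
patch:

	struct workqueue_attrs *attrs = alloc_workqueue_attrs();

	if (attrs) {
		attrs->nice = 10;	/* deprioritize this workqueue's workers */
		apply_workqueue_attrs(wq, attrs);
		free_workqueue_attrs(attrs);
	}

Workqueues created with WQ_SYSFS additionally expose a per-workqueue 'nice'
file under /sys/devices/virtual/workqueue/ that does the same from
userspace.)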

Thanks.

-- 
tejun


[PATCH] cgroup: Relax restrictions on kernel threads moving out of root cpu cgroup

2021-04-06 Thread Pavankumar Kondeti
In Android GKI, CONFIG_FAIR_GROUP_SCHED is enabled [1] to help prioritize
important work. Given that CPU shares of root cgroup can't be changed,
leaving the tasks inside root cgroup will give them higher share
compared to the other tasks inside important cgroups. This is mitigated
by moving all tasks inside root cgroup to a different cgroup after
Android is booted. However, there are many kernel tasks stuck in the
root cgroup after the boot.

We see all kworker threads are in the root cpu cgroup. This is because,
tasks with PF_NO_SETAFFINITY flag set are forbidden from cgroup migration.
This restriction is in place to avoid kworkers getting moved to a cpuset
which conflicts with kworker affinity. Relax this restriction by explicitly
checking if the task is moving out of a cpuset cgroup. This allows kworkers
to be moved out root cpu cgroup when cpu and cpuset cgroup controllers
are mounted on different hierarchies.

We also see kthreadd_task and any kernel thread created after the Android boot
also stuck in the root cgroup. The current code prevents kthreadd_task moving
out root cgroup to avoid the possibility of creating new RT kernel threads
inside a cgroup with no RT runtime allocated. Apply this restriction when tasks
are moving out of cpu cgroup under CONFIG_RT_GROUP_SCHED. This allows all
kernel threads to be moved out of root cpu cgroup if the kernel does not
enable RT group scheduling.

[1] 
https://android.googlesource.com/kernel/common/+/f08f049de11c15a4251cb1db08cf0bee20bd9b59

Signed-off-by: Pavankumar Kondeti 
---
v2:
- Added cgroup_task_migration_allowed() wrapper function

 kernel/cgroup/cgroup-internal.h | 28 +++-
 kernel/cgroup/cgroup-v1.c   |  2 +-
 kernel/cgroup/cgroup.c  | 13 -
 3 files changed, 32 insertions(+), 11 deletions(-)

diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index bfbeabc..cd69302 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -202,6 +202,31 @@ static inline void get_css_set(struct css_set *cset)
	refcount_inc(&cset->refcount);
 }
 
+static inline bool cgroup_task_migration_allowed(struct task_struct *tsk,
+struct cgroup *dst_cgrp)
+{
+   /*
+* RT kthreads may be born in a cgroup with no rt_runtime allocated.
+* Just say no.
+*/
+#ifdef CONFIG_RT_GROUP_SCHED
+	if (tsk->no_cgroup_migration && (dst_cgrp->root->subsys_mask & (1U << cpu_cgrp_id)))
+   return false;
+#endif
+
+   /*
+* kthreads may acquire PF_NO_SETAFFINITY during initialization.
+* If userland migrates such a kthread to a non-root cgroup, it can
+* become trapped in a cpuset. Just say no.
+*/
+#ifdef CONFIG_CPUSETS
+   if ((tsk->no_cgroup_migration || (tsk->flags & PF_NO_SETAFFINITY)) &&
+   (dst_cgrp->root->subsys_mask & (1U << cpuset_cgrp_id)))
+   return false;
+#endif
+   return true;
+}
+
 bool cgroup_ssid_enabled(int ssid);
 bool cgroup_on_dfl(const struct cgroup *cgrp);
 bool cgroup_is_thread_root(struct cgroup *cgrp);
@@ -232,7 +257,8 @@ int cgroup_migrate(struct task_struct *leader, bool threadgroup,
 int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader,
   bool threadgroup);
 struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
-bool *locked)
+bool *locked,
+struct cgroup *dst_cgrp)
	__acquires(&cgroup_threadgroup_rwsem);
 void cgroup_procs_write_finish(struct task_struct *task, bool locked)
	__releases(&cgroup_threadgroup_rwsem);
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index a575178..d674a6c 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -497,7 +497,7 @@ static ssize_t __cgroup1_procs_write(struct kernfs_open_file *of,
if (!cgrp)
return -ENODEV;
 
-	task = cgroup_procs_write_start(buf, threadgroup, &locked);
+	task = cgroup_procs_write_start(buf, threadgroup, &locked, cgrp);
ret = PTR_ERR_OR_ZERO(task);
if (ret)
goto out_unlock;
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 9153b20..44cc653 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2744,7 +2744,8 @@ int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader,
 }
 
 struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
-bool *locked)
+bool *locked,
+struct cgroup *dst_cgrp)
	__acquires(&cgroup_threadgroup_rwsem)
 {
struct task_struct *tsk;
@@ -2783,13 +2784,7 @@ struct ta

Re: [PATCH] cgroup: Relax restrictions on kernel threads moving out of root cpu cgroup

2021-04-06 Thread Pavan Kondeti
Hi Quentin,

On Tue, Apr 06, 2021 at 12:10:41PM +, Quentin Perret wrote:
> Hi Pavan,
> 
> On Tuesday 06 Apr 2021 at 16:29:13 (+0530), Pavankumar Kondeti wrote:
> > In Android GKI, CONFIG_FAIR_GROUP_SCHED is enabled [1] to help prioritize
> > important work. Given that CPU shares of root cgroup can't be changed,
> > leaving the tasks inside root cgroup will give them higher share
> > compared to the other tasks inside important cgroups. This is mitigated
> > by moving all tasks inside root cgroup to a different cgroup after
> > Android is booted. However, there are many kernel tasks stuck in the
> > root cgroup after the boot.
> > 
> > We see all kworker threads are in the root cpu cgroup. This is because,
> > tasks with PF_NO_SETAFFINITY flag set are forbidden from cgroup migration.
> > This restriction is in place to avoid kworkers getting moved to a cpuset
> > which conflicts with kworker affinity. Relax this restriction by explicitly
> > checking if the task is moving out of a cpuset cgroup. This allows kworkers
> > to be moved out root cpu cgroup.
> > 
> > We also see kthreadd_task and any kernel thread created after the Android boot
> > also stuck in the root cgroup. The current code prevents kthreadd_task moving
> > out root cgroup to avoid the possibility of creating new RT kernel threads
> > inside a cgroup with no RT runtime allocated. Apply this restriction when tasks
> > are moving out of cpu cgroup under CONFIG_RT_GROUP_SCHED. This allows all
> > kernel threads to be moved out of root cpu cgroup if the kernel does not
> > enable RT group scheduling.
> 
> OK, so IIUC this only works with cgroup v1 -- the unified hierarchy in
> v2 forces you to keep cpu and cpuset in 'sync'. But that should be fine,
> so this looks like a nice improvement to me.
> 
Yes. I will mention this in commit description.

> >  
> >  struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
> > -bool *locked)
> > +bool *locked,
> > +struct cgroup *dst_cgrp)
> > 	__acquires(&cgroup_threadgroup_rwsem)
> >  {
> > struct task_struct *tsk;
> > @@ -2784,15 +2785,28 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
> > tsk = tsk->group_leader;
> >  
> > /*
> > +* RT kthreads may be born in a cgroup with no rt_runtime allocated.
> > +* Just say no.
> > +*/
> > +#ifdef CONFIG_RT_GROUP_SCHED
> > +	if (tsk->no_cgroup_migration && (dst_cgrp->root->subsys_mask & (1U << cpu_cgrp_id))) {
> > +   tsk = ERR_PTR(-EINVAL);
> > +   goto out_unlock_threadgroup;
> > +   }
> > +#endif
> > +
> > +   /*
> >  * kthreads may acquire PF_NO_SETAFFINITY during initialization.
> >  * If userland migrates such a kthread to a non-root cgroup, it can
> > -* become trapped in a cpuset, or RT kthread may be born in a
> > -* cgroup with no rt_runtime allocated.  Just say no.
> > +* become trapped in a cpuset. Just say no.
> >  */
> > -   if (tsk->no_cgroup_migration || (tsk->flags & PF_NO_SETAFFINITY)) {
> > +#ifdef CONFIG_CPUSETS
> > +	if ((tsk->no_cgroup_migration || (tsk->flags & PF_NO_SETAFFINITY)) &&
> > +	    (dst_cgrp->root->subsys_mask & (1U << cpuset_cgrp_id))) {
> > tsk = ERR_PTR(-EINVAL);
> > goto out_unlock_threadgroup;
> > }
> > +#endif
> 
> Nit: maybe move this #ifdefery out to a header?
> 
Agreed.

Thanks,
Pavan
-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.



Re: [PATCH] cgroup: Relax restrictions on kernel threads moving out of root cpu cgroup

2021-04-06 Thread Quentin Perret
Hi Pavan,

On Tuesday 06 Apr 2021 at 16:29:13 (+0530), Pavankumar Kondeti wrote:
> In Android GKI, CONFIG_FAIR_GROUP_SCHED is enabled [1] to help prioritize
> important work. Given that CPU shares of root cgroup can't be changed,
> leaving the tasks inside root cgroup will give them higher share
> compared to the other tasks inside important cgroups. This is mitigated
> by moving all tasks inside root cgroup to a different cgroup after
> Android is booted. However, there are many kernel tasks stuck in the
> root cgroup after the boot.
> 
> We see all kworker threads are in the root cpu cgroup. This is because,
> tasks with PF_NO_SETAFFINITY flag set are forbidden from cgroup migration.
> This restriction is in place to avoid kworkers getting moved to a cpuset
> which conflicts with kworker affinity. Relax this restriction by explicitly
> checking if the task is moving out of a cpuset cgroup. This allows kworkers
> to be moved out root cpu cgroup.
> 
> We also see kthreadd_task and any kernel thread created after the Android boot
> also stuck in the root cgroup. The current code prevents kthreadd_task moving
> out root cgroup to avoid the possibility of creating new RT kernel threads
> inside a cgroup with no RT runtime allocated. Apply this restriction when tasks
> are moving out of cpu cgroup under CONFIG_RT_GROUP_SCHED. This allows all
> kernel threads to be moved out of root cpu cgroup if the kernel does not
> enable RT group scheduling.

OK, so IIUC this only works with cgroup v1 -- the unified hierarchy in
v2 forces you to keep cpu and cpuset in 'sync'. But that should be fine,
so this looks like a nice improvement to me.

> [1] https://android.googlesource.com/kernel/common/+/f08f049de11c15a4251cb1db08cf0bee20bd9b59
> 
> Signed-off-by: Pavankumar Kondeti 
> ---
>  kernel/cgroup/cgroup-internal.h |  3 ++-
>  kernel/cgroup/cgroup-v1.c   |  2 +-
>  kernel/cgroup/cgroup.c  | 24 +++-
>  3 files changed, 22 insertions(+), 7 deletions(-)
> 
> diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
> index bfbeabc..a96ed9a 100644
> --- a/kernel/cgroup/cgroup-internal.h
> +++ b/kernel/cgroup/cgroup-internal.h
> @@ -232,7 +232,8 @@ int cgroup_migrate(struct task_struct *leader, bool threadgroup,
>  int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader,
>  bool threadgroup);
>  struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
> -  bool *locked)
> +  bool *locked,
> +  struct cgroup *dst_cgrp)
>   __acquires(&cgroup_threadgroup_rwsem);
>  void cgroup_procs_write_finish(struct task_struct *task, bool locked)
>   __releases(&cgroup_threadgroup_rwsem);
> diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
> index a575178..d674a6c 100644
> --- a/kernel/cgroup/cgroup-v1.c
> +++ b/kernel/cgroup/cgroup-v1.c
> @@ -497,7 +497,7 @@ static ssize_t __cgroup1_procs_write(struct kernfs_open_file *of,
>   if (!cgrp)
>   return -ENODEV;
>  
> -	task = cgroup_procs_write_start(buf, threadgroup, &locked);
> +	task = cgroup_procs_write_start(buf, threadgroup, &locked, cgrp);
>   ret = PTR_ERR_OR_ZERO(task);
>   if (ret)
>   goto out_unlock;
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 9153b20..41864a8 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -2744,7 +2744,8 @@ int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader,
>  }
>  
>  struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
> -  bool *locked)
> +  bool *locked,
> +  struct cgroup *dst_cgrp)
>   __acquires(&cgroup_threadgroup_rwsem)
>  {
>   struct task_struct *tsk;
> @@ -2784,15 +2785,28 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
>   tsk = tsk->group_leader;
>  
>   /*
> +  * RT kthreads may be born in a cgroup with no rt_runtime allocated.
> +  * Just say no.
> +  */
> +#ifdef CONFIG_RT_GROUP_SCHED
> +	if (tsk->no_cgroup_migration && (dst_cgrp->root->subsys_mask & (1U << cpu_cgrp_id))) {
> + tsk = ERR_PTR(-EINVAL);
> + goto out_unlock_threadgroup;
> + }
> +#endif
> +
> + /*
>* kthreads may acquire PF_NO_SETAFFINITY during initialization.
>* If userland migrates such a kthread to a non-root cgroup, it can
> -  * become trapped in

[PATCH] cgroup: Relax restrictions on kernel threads moving out of root cpu cgroup

2021-04-06 Thread Pavankumar Kondeti
In Android GKI, CONFIG_FAIR_GROUP_SCHED is enabled [1] to help prioritize
important work. Given that CPU shares of root cgroup can't be changed,
leaving the tasks inside root cgroup will give them higher share
compared to the other tasks inside important cgroups. This is mitigated
by moving all tasks inside root cgroup to a different cgroup after
Android is booted. However, there are many kernel tasks stuck in the
root cgroup after the boot.

We see all kworker threads are in the root cpu cgroup. This is because,
tasks with PF_NO_SETAFFINITY flag set are forbidden from cgroup migration.
This restriction is in place to avoid kworkers getting moved to a cpuset
which conflicts with kworker affinity. Relax this restriction by explicitly
checking if the task is moving out of a cpuset cgroup. This allows kworkers
to be moved out root cpu cgroup.

We also see kthreadd_task and any kernel thread created after the Android boot
also stuck in the root cgroup. The current code prevents kthreadd_task moving
out root cgroup to avoid the possibility of creating new RT kernel threads
inside a cgroup with no RT runtime allocated. Apply this restriction when tasks
are moving out of cpu cgroup under CONFIG_RT_GROUP_SCHED. This allows all
kernel threads to be moved out of root cpu cgroup if the kernel does not
enable RT group scheduling.

[1] 
https://android.googlesource.com/kernel/common/+/f08f049de11c15a4251cb1db08cf0bee20bd9b59

Signed-off-by: Pavankumar Kondeti 
---
 kernel/cgroup/cgroup-internal.h |  3 ++-
 kernel/cgroup/cgroup-v1.c   |  2 +-
 kernel/cgroup/cgroup.c  | 24 +++-
 3 files changed, 22 insertions(+), 7 deletions(-)

diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index bfbeabc..a96ed9a 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -232,7 +232,8 @@ int cgroup_migrate(struct task_struct *leader, bool threadgroup,
 int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader,
   bool threadgroup);
 struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
-bool *locked)
+bool *locked,
+struct cgroup *dst_cgrp)
	__acquires(&cgroup_threadgroup_rwsem);
 void cgroup_procs_write_finish(struct task_struct *task, bool locked)
	__releases(&cgroup_threadgroup_rwsem);
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index a575178..d674a6c 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -497,7 +497,7 @@ static ssize_t __cgroup1_procs_write(struct kernfs_open_file *of,
if (!cgrp)
return -ENODEV;
 
-	task = cgroup_procs_write_start(buf, threadgroup, &locked);
+	task = cgroup_procs_write_start(buf, threadgroup, &locked, cgrp);
ret = PTR_ERR_OR_ZERO(task);
if (ret)
goto out_unlock;
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 9153b20..41864a8 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2744,7 +2744,8 @@ int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader,
 }
 
 struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
-bool *locked)
+bool *locked,
+struct cgroup *dst_cgrp)
	__acquires(&cgroup_threadgroup_rwsem)
 {
struct task_struct *tsk;
@@ -2784,15 +2785,28 @@ struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
tsk = tsk->group_leader;
 
/*
+* RT kthreads may be born in a cgroup with no rt_runtime allocated.
+* Just say no.
+*/
+#ifdef CONFIG_RT_GROUP_SCHED
+	if (tsk->no_cgroup_migration && (dst_cgrp->root->subsys_mask & (1U << cpu_cgrp_id))) {
+   tsk = ERR_PTR(-EINVAL);
+   goto out_unlock_threadgroup;
+   }
+#endif
+
+   /*
 * kthreads may acquire PF_NO_SETAFFINITY during initialization.
 * If userland migrates such a kthread to a non-root cgroup, it can
-* become trapped in a cpuset, or RT kthread may be born in a
-* cgroup with no rt_runtime allocated.  Just say no.
+* become trapped in a cpuset. Just say no.
 */
-   if (tsk->no_cgroup_migration || (tsk->flags & PF_NO_SETAFFINITY)) {
+#ifdef CONFIG_CPUSETS
+	if ((tsk->no_cgroup_migration || (tsk->flags & PF_NO_SETAFFINITY)) &&
+	    (dst_cgrp->root->subsys_mask & (1U << cpuset_cgrp_id))) {
tsk = ERR_PTR(-EINVAL);
goto out_unlock_threadgroup;
}
+#endif
 
get_task_struct(tsk);
goto out_unlock_rcu;
@@ -4740,7 +4754,7 @@ static ssize_t __cgroup_

[PATCH v4 08/12] perf record: introduce --threads= command line option

2021-04-06 Thread Bayduraev, Alexey V


Provide --threads option in perf record command line interface.
The option can have a value in the form of masks that specify
cpus to be monitored with data streaming threads and its layout
in system topology. The masks can be filtered using cpu mask
provided via -C option.

The specification value can be user defined list of masks. Masks
separated by colon define cpus to be monitored by one thread and
affinity mask of that thread is separated by slash. For example:
<cpus mask 1>/<affinity mask 1>:<cpus mask 2>/<affinity mask 2>
specifies parallel threads layout that consists of two threads
with corresponding assigned cpus to be monitored.

The specification value can be a string e.g. "cpu", "core" or
"socket" meaning creation of data streaming thread for every
cpu or core or socket to monitor distinct cpus or cpus grouped
by core or socket.

The option provided with no or empty value defaults to per-cpu
parallel threads layout creating data streaming thread for every
cpu being monitored.
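
For illustration, on an assumed 8-CPU system the option could look like
the following (hypothetical values, using perf's usual cpu-list notation):

  perf record --threads=core ...           # one streaming thread per core
  perf record --threads=0-3/0:4-7/4 ...    # CPUs 0-3 streamed by a thread
                                           # bound to CPU 0, CPUs 4-7 by a
                                           # thread bound to CPU 4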

Feature design and implementation are based on prototypes [1], [2].

[1] git clone https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git -b perf/record_threads
[2] https://lore.kernel.org/lkml/20180913125450.21342-1-jo...@kernel.org/

Suggested-by: Jiri Olsa 
Suggested-by: Namhyung Kim 
Signed-off-by: Alexey Bayduraev 
---
 tools/include/linux/bitmap.h |  11 ++
 tools/lib/bitmap.c   |  14 ++
 tools/perf/builtin-record.c  | 317 ++-
 tools/perf/util/record.h |   1 +
 4 files changed, 341 insertions(+), 2 deletions(-)

diff --git a/tools/include/linux/bitmap.h b/tools/include/linux/bitmap.h
index 477a1cae513f..2eb1d1084543 100644
--- a/tools/include/linux/bitmap.h
+++ b/tools/include/linux/bitmap.h
@@ -18,6 +18,8 @@ int __bitmap_and(unsigned long *dst, const unsigned long *bitmap1,
 int __bitmap_equal(const unsigned long *bitmap1,
   const unsigned long *bitmap2, unsigned int bits);
 void bitmap_clear(unsigned long *map, unsigned int start, int len);
+int __bitmap_intersects(const unsigned long *bitmap1,
+   const unsigned long *bitmap2, unsigned int bits);
 
 #define BITMAP_FIRST_WORD_MASK(start) (~0UL << ((start) & (BITS_PER_LONG - 1)))
 
@@ -178,4 +180,13 @@ static inline int bitmap_equal(const unsigned long *src1,
return __bitmap_equal(src1, src2, nbits);
 }
 
+static inline int bitmap_intersects(const unsigned long *src1,
+   const unsigned long *src2, unsigned int nbits)
+{
+   if (small_const_nbits(nbits))
+   return ((*src1 & *src2) & BITMAP_LAST_WORD_MASK(nbits)) != 0;
+   else
+   return __bitmap_intersects(src1, src2, nbits);
+}
+
 #endif /* _PERF_BITOPS_H */
diff --git a/tools/lib/bitmap.c b/tools/lib/bitmap.c
index 5043747ef6c5..3cc3a5b43bb5 100644
--- a/tools/lib/bitmap.c
+++ b/tools/lib/bitmap.c
@@ -86,3 +86,17 @@ int __bitmap_equal(const unsigned long *bitmap1,
 
return 1;
 }
+
+int __bitmap_intersects(const unsigned long *bitmap1,
+   const unsigned long *bitmap2, unsigned int bits)
+{
+   unsigned int k, lim = bits/BITS_PER_LONG;
+   for (k = 0; k < lim; ++k)
+   if (bitmap1[k] & bitmap2[k])
+   return 1;
+
+   if (bits % BITS_PER_LONG)
+   if ((bitmap1[k] & bitmap2[k]) & BITMAP_LAST_WORD_MASK(bits))
+   return 1;
+   return 0;
+}
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index c1416d28ac6d..41a22f48037d 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -49,6 +49,7 @@
 #include "util/clockid.h"
 #include "asm/bug.h"
 #include "perf.h"
+#include "cputopo.h"
 
 #include 
 #include 
@@ -120,6 +121,20 @@ static const char *thread_msg_tags[THREAD_MSG__MAX] = {
"UNDEFINED", "READY"
 };
 
+enum thread_spec {
+   THREAD_SPEC__UNDEFINED = 0,
+   THREAD_SPEC__CPU,
+   THREAD_SPEC__CORE,
+   THREAD_SPEC__SOCKET,
+   THREAD_SPEC__NUMA,
+   THREAD_SPEC__USER,
+   THREAD_SPEC__MAX,
+};
+
+static const char *thread_spec_tags[THREAD_SPEC__MAX] = {
+   "undefined", "cpu", "core", "socket", "numa", "user"
+};
+
 struct record {
	struct perf_tool	tool;
struct record_opts  opts;
@@ -2662,6 +2677,63 @@ static void record__thread_mask_free(struct thread_mask *mask)
	record__mmap_cpu_mask_free(&mask->affinity);
 }
 
+static int record__thread_mask_or(struct thread_mask *dest, struct thread_mask *src1,
+				  struct thread_mask *src2)
+{
+	if (src1->maps.nbits != src2->maps.nbits || src1->affinity.nbits != src2->affinity.nbits ||
+	    dest->maps.nbits != src1->maps.nbits || dest->affinity.nbits != src1->affinity.nbits)
+		return -EINVAL;
+
+ 

[PATCH v4 05/12] perf record: start threads in the beginning of trace streaming

2021-04-06 Thread Bayduraev, Alexey V


Start threads in detached state because their management is implemented
via messaging, to avoid any scaling issues. Block signals prior to
thread start so that only the main tool thread is notified of external
async signals during data collection. The thread affinity mask is used
to assign the eligible cpus for each thread to run on. Wait and sync on
thread start using the thread ack pipe.

Signed-off-by: Alexey Bayduraev 
---
 tools/perf/builtin-record.c | 103 +++-
 1 file changed, 102 insertions(+), 1 deletion(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 4612314853c1..339198b2e37d 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -1412,6 +1412,64 @@ static void record__thread_munmap_filtered(struct fdarray *fda, int fd,
perf_mmap__put(map);
 }
 
+static void *record__thread(void *arg)
+{
+   enum thread_msg msg = THREAD_MSG__READY;
+   bool terminate = false;
+   struct fdarray *pollfd;
+   int err, ctlfd_pos;
+
+   thread = arg;
+   thread->tid = syscall(SYS_gettid);
+
+	err = write(thread->pipes.ack[1], &msg, sizeof(msg));
+	if (err == -1)
+		pr_err("threads[%d]: failed to notify on start. Error %m", thread->tid);
+
+	pr_debug("threads[%d]: started on cpu=%d\n", thread->tid, sched_getcpu());
+
+	pollfd = &thread->pollfd;
+	ctlfd_pos = thread->ctlfd_pos;
+
+   for (;;) {
+   unsigned long long hits = thread->samples;
+
+   if (record__mmap_read_all(thread->rec, false) < 0 || terminate)
+   break;
+
+   if (hits == thread->samples) {
+
+   err = fdarray__poll(pollfd, -1);
+   /*
+* Propagate error, only if there's any. Ignore positive
+* number of returned events and interrupt error.
+*/
+   if (err > 0 || (err < 0 && errno == EINTR))
+   err = 0;
+   thread->waking++;
+
+			if (fdarray__filter(pollfd, POLLERR | POLLHUP,
+					    record__thread_munmap_filtered, NULL) == 0)
+   break;
+   }
+
+   if (pollfd->entries[ctlfd_pos].revents & POLLHUP) {
+   terminate = true;
+   close(thread->pipes.msg[0]);
+   pollfd->entries[ctlfd_pos].fd = -1;
+   pollfd->entries[ctlfd_pos].events = 0;
+   }
+
+   pollfd->entries[ctlfd_pos].revents = 0;
+   }
+
+	err = write(thread->pipes.ack[1], &msg, sizeof(msg));
+	if (err == -1)
+		pr_err("threads[%d]: failed to notify on termination. Error %m", thread->tid);
+
+   return NULL;
+}
+
 static void record__init_features(struct record *rec)
 {
struct perf_session *session = rec->session;
@@ -1849,13 +1907,56 @@ static int record__terminate_thread(struct thread_data *thread_data)
 
 static int record__start_threads(struct record *rec)
 {
+   int t, tt, ret = 0, nr_threads = rec->nr_threads;
struct thread_data *thread_data = rec->thread_data;
+   sigset_t full, mask;
+   pthread_t handle;
+   pthread_attr_t attrs;
+
+	sigfillset(&full);
+	if (sigprocmask(SIG_SETMASK, &full, &mask)) {
+		pr_err("Failed to block signals on threads start. Error: %m\n");
+		return -1;
+	}
+
+	pthread_attr_init(&attrs);
+	pthread_attr_setdetachstate(&attrs, PTHREAD_CREATE_DETACHED);
+
+   for (t = 1; t < nr_threads; t++) {
+   enum thread_msg msg = THREAD_MSG__UNDEFINED;
+
+		pthread_attr_setaffinity_np(&attrs, MMAP_CPU_MASK_BYTES(&(thread_data[t].mask->affinity)),
+					    (cpu_set_t *)(thread_data[t].mask->affinity.bits));
+
+		if (pthread_create(&handle, &attrs, record__thread, &thread_data[t])) {
+			for (tt = 1; tt < t; tt++)
+				record__terminate_thread(&thread_data[tt]);
+			pr_err("Failed to start threads. Error: %m\n");
+			ret = -1;
+			goto out_err;
+		}
+
+		if (read(thread_data[t].pipes.ack[0], &msg, sizeof(msg)) > 0)
+			pr_debug2("threads[%d]: sent %s\n", rec->thread_data[t].tid,
+				  thread_msg_tags[msg]);
+	}
+
+	if (nr_threads > 1) {
+		sched_setaffinity(0, MMAP_CPU_MASK_BYTES(&thread_data[0].mask->affinity),
+				  (cpu_set_t *)thread_data[0].mask->affinity.bits);
+   }
 
	thread = &thread_data[0];
 
pr_debug("threads[%d]: started on cpu=%d\n", 

[PATCH v4 04/12] perf record: stop threads in the end of trace streaming

2021-04-06 Thread Bayduraev, Alexey V


Signal a thread to terminate by closing the write fd of its msg pipe.
Receive a THREAD_MSG__READY message as the confirmation of the
thread's termination. Stop the threads created for parallel trace
streaming prior to processing their stats.

Signed-off-by: Alexey Bayduraev 
---
 tools/perf/builtin-record.c | 30 ++
 1 file changed, 30 insertions(+)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index ecb6bf33ed85..4612314853c1 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -110,6 +110,16 @@ struct thread_data {
 
 static __thread struct thread_data *thread;
 
+enum thread_msg {
+   THREAD_MSG__UNDEFINED = 0,
+   THREAD_MSG__READY,
+   THREAD_MSG__MAX,
+};
+
+static const char *thread_msg_tags[THREAD_MSG__MAX] = {
+   "UNDEFINED", "READY"
+};
+
 struct record {
	struct perf_tool	tool;
struct record_opts  opts;
@@ -1820,6 +1830,23 @@ static void hit_auxtrace_snapshot_trigger(struct record *rec)
}
 }
 
+static int record__terminate_thread(struct thread_data *thread_data)
+{
+   int res;
+   enum thread_msg ack = THREAD_MSG__UNDEFINED;
+   pid_t tid = thread_data->tid;
+
+   close(thread_data->pipes.msg[1]);
+	res = read(thread_data->pipes.ack[0], &ack, sizeof(ack));
+   if (res != -1)
+   pr_debug2("threads[%d]: sent %s\n", tid, thread_msg_tags[ack]);
+   else
+   pr_err("threads[%d]: failed to recv msg=%s from tid=%d\n",
+  thread->tid, thread_msg_tags[ack], tid);
+
+   return 0;
+}
+
 static int record__start_threads(struct record *rec)
 {
struct thread_data *thread_data = rec->thread_data;
@@ -1836,6 +1863,9 @@ static int record__stop_threads(struct record *rec, unsigned long *waking)
int t;
struct thread_data *thread_data = rec->thread_data;
 
+   for (t = 1; t < rec->nr_threads; t++)
+		record__terminate_thread(&thread_data[t]);
+
for (t = 0; t < rec->nr_threads; t++) {
rec->samples += thread_data[t].samples;
*waking += thread_data[t].waking;
-- 
2.19.0




Re: [PATCH 0/6] Allow signals for IO threads

2021-04-02 Thread Stefan Metzmacher
Am 01.04.21 um 18:24 schrieb Linus Torvalds:
> On Thu, Apr 1, 2021 at 9:00 AM Stefan Metzmacher  wrote:
>>
>> I haven't tried it, but it seems gdb tries to use PTRACE_PEEKUSR
>> against the last thread tid listed under /proc//tasks/ in order to
>> get the architecture for the userspace application
> 
> Christ, what an odd hack. Why wouldn't it just do it on the initial
> thread you actually attached to?
> 
> Are you sure it's not simply because your test-case was to attach to
> the io_uring thread? Because the io_uring thread might as well be
> considered to be 64-bit.

  │   └─io_uring-cp,1396 Makefile file
  │   ├─{iou-mgr-1396},1397
  │   ├─{iou-wrk-1396},1398
  │   └─{iou-wrk-1396},1399

strace -ttT -o strace-uring-fail.txt gdb --pid 1396
(note strace -f would deadlock gdb with SIGSTOP)

The full file can be found here:
https://www.samba.org/~metze/strace-uring-fail.txt
(I guess there was a race and the workers 1398 and 1399 exited in between,
that's why it's using 1397):

18:46:35.429498 ptrace(PTRACE_PEEKUSER, 1397, 8*CS, [NULL]) = 0 <0.22>

>> so my naive assumption
>> would be that it wouldn't allow the detection of a 32-bit application
>> using a 64-bit kernel.
> 
> I'm not entirely convinced we want to care about a confused gdb
> implementation and somebody debugging a case that I don't believe
> happens in practice.
> 
> 32-bit user space is legacy. And legacy isn't io_uring. If somebody
> insists on doing odd things, they can live with the odd results.

Ok, I'd agree for 32-bit applications, but what about libraries?
E.g. distributions ship libraries like libsmbclient or nss modules
as 64-bit and 32-bit version, in order to support legacy applications
to run. Why shouldn't the 32-bit library builds support io_uring?
(Note libsmbclient don't use io_uring yet, but I guess it will be in future).

Any ideas regarding similar problems for other architectures?

metze




Re: [PATCH v4 1/3] sched/fair: Ignore percpu threads for imbalance pulls

2021-04-02 Thread Vincent Guittot
On Thu, 1 Apr 2021 at 21:30, Valentin Schneider
 wrote:
>
> From: Lingutla Chandrasekhar 
>
> During load balance, LBF_SOME_PINNED will be set if any candidate task
> cannot be detached due to CPU affinity constraints. This can result in
> setting env->sd->parent->sgc->group_imbalance, which can lead to a group
> being classified as group_imbalanced (rather than any of the other, lower
> group_type) when balancing at a higher level.
>
> In workloads involving a single task per CPU, LBF_SOME_PINNED can often be
> set due to per-CPU kthreads being the only other runnable tasks on any
> given rq. This results in changing the group classification during
> load-balance at higher levels when in reality there is nothing that can be
> done for this affinity constraint: per-CPU kthreads, as the name implies,
> don't get to move around (modulo hotplug shenanigans).
>
> It's not as clear for userspace tasks - a task could be in an N-CPU cpuset
> with N-1 offline CPUs, making it an "accidental" per-CPU task rather than
> an intended one. KTHREAD_IS_PER_CPU gives us an indisputable signal which
> we can leverage here to not set LBF_SOME_PINNED.
>
> Note that the aforementioned classification to group_imbalance (when
> nothing can be done) is especially problematic on big.LITTLE systems, which
> have a topology the likes of:
>
>   DIE [  ]
>   MC  [][]
>0  1  2  3
>L  L  B  B
>
>   arch_scale_cpu_capacity(L) < arch_scale_cpu_capacity(B)
>
> Here, setting LBF_SOME_PINNED due to a per-CPU kthread when balancing at MC
> level on CPUs [0-1] will subsequently prevent CPUs [2-3] from classifying
> the [0-1] group as group_misfit_task when balancing at DIE level. Thus, if
> CPUs [0-1] are running CPU-bound (misfit) tasks, ill-timed per-CPU kthreads
> can significantly delay the upmigration of said misfit tasks. Systems
> relying on ASYM_PACKING are likely to face similar issues.
>
> Signed-off-by: Lingutla Chandrasekhar 
> [Use kthread_is_per_cpu() rather than p->nr_cpus_allowed]
> [Reword changelog]
> Signed-off-by: Valentin Schneider 

Reviewed-by: Vincent Guittot 

> ---
>  kernel/sched/fair.c | 4 
>  1 file changed, 4 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6d73bdbb2d40..04d5e14fa261 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7567,6 +7567,10 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
> if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
> return 0;
>
> +   /* Disregard pcpu kthreads; they are where they need to be. */
> +   if ((p->flags & PF_KTHREAD) && kthread_is_per_cpu(p))
> +   return 0;
> +
> if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) {
> int cpu;
>
> --
> 2.25.1
>


[PATCH v4 1/3] sched/fair: Ignore percpu threads for imbalance pulls

2021-04-01 Thread Valentin Schneider
From: Lingutla Chandrasekhar 

During load balance, LBF_SOME_PINNED will be set if any candidate task
cannot be detached due to CPU affinity constraints. This can result in
setting env->sd->parent->sgc->group_imbalance, which can lead to a group
being classified as group_imbalanced (rather than any of the other, lower
group_type) when balancing at a higher level.

In workloads involving a single task per CPU, LBF_SOME_PINNED can often be
set due to per-CPU kthreads being the only other runnable tasks on any
given rq. This results in changing the group classification during
load-balance at higher levels when in reality there is nothing that can be
done for this affinity constraint: per-CPU kthreads, as the name implies,
don't get to move around (modulo hotplug shenanigans).

It's not as clear for userspace tasks - a task could be in an N-CPU cpuset
with N-1 offline CPUs, making it an "accidental" per-CPU task rather than
an intended one. KTHREAD_IS_PER_CPU gives us an indisputable signal which
we can leverage here to not set LBF_SOME_PINNED.

Note that the aforementioned classification to group_imbalance (when
nothing can be done) is especially problematic on big.LITTLE systems, which
have a topology the likes of:

  DIE [  ]
  MC  [][]
   0  1  2  3
   L  L  B  B

  arch_scale_cpu_capacity(L) < arch_scale_cpu_capacity(B)

Here, setting LBF_SOME_PINNED due to a per-CPU kthread when balancing at MC
level on CPUs [0-1] will subsequently prevent CPUs [2-3] from classifying
the [0-1] group as group_misfit_task when balancing at DIE level. Thus, if
CPUs [0-1] are running CPU-bound (misfit) tasks, ill-timed per-CPU kthreads
can significantly delay the upmigration of said misfit tasks. Systems
relying on ASYM_PACKING are likely to face similar issues.

Signed-off-by: Lingutla Chandrasekhar 
[Use kthread_is_per_cpu() rather than p->nr_cpus_allowed]
[Reword changelog]
Signed-off-by: Valentin Schneider 
---
 kernel/sched/fair.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6d73bdbb2d40..04d5e14fa261 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7567,6 +7567,10 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
return 0;
 
+   /* Disregard pcpu kthreads; they are where they need to be. */
+   if ((p->flags & PF_KTHREAD) && kthread_is_per_cpu(p))
+   return 0;
+
if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) {
int cpu;
 
-- 
2.25.1



[PATCH v14 5/6] locking/qspinlock: Avoid moving certain threads between waiting queues in CNA

2021-04-01 Thread Alex Kogan
Prohibit moving certain threads (e.g., in irq and nmi contexts)
to the secondary queue. Those prioritized threads will always stay
in the primary queue, and so will have a shorter wait time for the lock.

Signed-off-by: Alex Kogan 
Reviewed-by: Steve Sistare 
Reviewed-by: Waiman Long 
---
 kernel/locking/qspinlock_cna.h | 26 --
 1 file changed, 20 insertions(+), 6 deletions(-)

diff --git a/kernel/locking/qspinlock_cna.h b/kernel/locking/qspinlock_cna.h
index 0513360c11fe..29c3abbd3d94 100644
--- a/kernel/locking/qspinlock_cna.h
+++ b/kernel/locking/qspinlock_cna.h
@@ -4,6 +4,7 @@
 #endif
 
 #include 
+#include 
 
 /*
  * Implement a NUMA-aware version of MCS (aka CNA, or compact NUMA-aware lock).
@@ -35,7 +36,8 @@
  * running on the same NUMA node. If it is not, that waiter is detached from the
  * main queue and moved into the tail of the secondary queue. This way, we
  * gradually filter the primary queue, leaving only waiters running on the same
 - * preferred NUMA node.
 + * preferred NUMA node. Note that certain prioritized waiters (e.g., in
+ * irq and nmi contexts) are excluded from being moved to the secondary queue.
  *
  * We change the NUMA node preference after a waiter at the head of the
  * secondary queue spins for a certain amount of time (10ms, by default).
@@ -49,6 +51,8 @@
  *  Dave Dice 
  */
 
+#define CNA_PRIORITY_NODE  0xffff
+
 struct cna_node {
struct mcs_spinlock mcs;
u16 numa_node;
@@ -121,9 +125,10 @@ static int __init cna_init_nodes(void)
 
 static __always_inline void cna_init_node(struct mcs_spinlock *node)
 {
+   bool priority = !in_task() || irqs_disabled() || rt_task(current);
struct cna_node *cn = (struct cna_node *)node;
 
-   cn->numa_node = cn->real_numa_node;
+   cn->numa_node = priority ? CNA_PRIORITY_NODE : cn->real_numa_node;
cn->partial_order = LOCAL_WAITER_FOUND;
cn->start_time = 0;
 }
@@ -266,11 +271,13 @@ static void cna_order_queue(struct mcs_spinlock *node)
next_numa_node = ((struct cna_node *)next)->numa_node;
 
if (next_numa_node != numa_node) {
-   struct mcs_spinlock *nnext = READ_ONCE(next->next);
+   if (next_numa_node != CNA_PRIORITY_NODE) {
+   struct mcs_spinlock *nnext = READ_ONCE(next->next);
 
-   if (nnext) {
-   cna_splice_next(node, next, nnext);
-   next = nnext;
+   if (nnext) {
+   cna_splice_next(node, next, nnext);
+   next = nnext;
+   }
}
/*
 * Inherit NUMA node id of primary queue, to maintain the
@@ -287,6 +294,13 @@ static __always_inline u32 cna_wait_head_or_lock(struct qspinlock *lock,
struct cna_node *cn = (struct cna_node *)node;
 
if (!cn->start_time || !intra_node_threshold_reached(cn)) {
+   /*
+* We are at the head of the wait queue, no need to use
+* the fake NUMA node ID.
+*/
+   if (cn->numa_node == CNA_PRIORITY_NODE)
+   cn->numa_node = cn->real_numa_node;
+
/*
 * Try and put the time otherwise spent spin waiting on
 * _Q_LOCKED_PENDING_MASK to use by sorting our lists.
-- 
2.24.3 (Apple Git-128)



Re: [PATCH 0/6] Allow signals for IO threads

2021-04-01 Thread Linus Torvalds
On Thu, Apr 1, 2021 at 9:00 AM Stefan Metzmacher  wrote:
>
> I haven't tried it, but it seems gdb tries to use PTRACE_PEEKUSR
> against the last thread tid listed under /proc//tasks/ in order to
> get the architecture for the userspace application

Christ, what an odd hack. Why wouldn't it just do it on the initial
thread you actually attached to?

Are you sure it's not simply because your test-case was to attach to
the io_uring thread? Because the io_uring thread might as well be
considered to be 64-bit.

> so my naive assumption
> would be that it wouldn't allow the detection of a 32-bit application
> using a 64-bit kernel.

I'm not entirely convinced we want to care about a confused gdb
implementation and somebody debugging a case that I don't believe
happens in practice.

32-bit user space is legacy. And legacy isn't io_uring. If somebody
insists on doing odd things, they can live with the odd results.

  Linus


Re: [PATCH 2/8] kernel: unmask SIGSTOP for IO threads

2021-04-01 Thread Stefan Metzmacher
Hi Jens,

>>> I don't assume signals wanted by userspace should potentially be handled in an
>>> io_thread...
>>> e.g. things set with fcntl(fd, F_SETSIG,) used together with F_SETLEASE?
>>
>> I guess we do actually need it, if we're not fiddling with
>> wants_signal() for them. To quell Oleg's concerns, we can just move it
>> to post dup_task_struct(), that should eliminate any race concerns
>> there.
> 
> If that one is racy, don' we better also want this one?
> https://lore.kernel.org/io-uring/438b738c1e4827a7fdfe43087da88bbe17eedc72.1616197787.git.me...@samba.org/T/#u
> 
> And clear tsk->pf_io_worker ?

As the workers don't clone other workers I guess it's fine to defer this to 
5.13.

metze



Re: [PATCH 0/6] Allow signals for IO threads

2021-04-01 Thread Linus Torvalds
On Thu, Apr 1, 2021 at 7:58 AM Stefan Metzmacher  wrote:
>
> >
> > Ok, the following makes gdb happy again:
> >
> > --- a/arch/x86/kernel/process.c
> > +++ b/arch/x86/kernel/process.c
> > @@ -163,6 +163,8 @@ int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
> > /* Kernel thread ? */
> > if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
> > memset(childregs, 0, sizeof(struct pt_regs));
> > +   if (p->flags & PF_IO_WORKER)
> > +   childregs->cs = current_pt_regs()->cs;
> > kthread_frame_init(frame, sp, arg);
> > return 0;
> > }
>
> Would it be possible to fix this remaining problem before 5.12 final?

Please not that way.

But doing something like

childregs->cs = __USER_CS;
childregs->ss = __USER_DS;
childregs->ds = __USER_DS;
childregs->es = __USER_DS;

might make sense (just do it unconditionally, rather than making it
special to PF_IO_WORKER).

Does that make gdb happy too?

   Linus


Re: [PATCH 0/6] Allow signals for IO threads

2021-04-01 Thread Stefan Metzmacher


Am 01.04.21 um 17:39 schrieb Linus Torvalds:
> On Thu, Apr 1, 2021 at 7:58 AM Stefan Metzmacher  wrote:
>>
>>>
>>> Ok, the following makes gdb happy again:
>>>
>>> --- a/arch/x86/kernel/process.c
>>> +++ b/arch/x86/kernel/process.c
>>> @@ -163,6 +163,8 @@ int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
>>> /* Kernel thread ? */
>>> if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
>>> memset(childregs, 0, sizeof(struct pt_regs));
>>> +   if (p->flags & PF_IO_WORKER)
>>> +   childregs->cs = current_pt_regs()->cs;
>>> kthread_frame_init(frame, sp, arg);
>>> return 0;
>>> }
>>
>> Would it be possible to fix this remaining problem before 5.12 final?
> 
> Please not that way.
> 
> But doing something like
> 
> childregs->cs = __USER_CS;
> childregs->ss = __USER_DS;
> childregs->ds = __USER_DS;
> childregs->es = __USER_DS;
> 
> might make sense (just do it unconditionally, rather than making it
> special to PF_IO_WORKER).
> 
> Does that make gdb happy too?

I haven't tried it, but it seems gdb tries to use PTRACE_PEEKUSR
against the last thread tid listed under /proc//tasks/ in order to
get the architecture for the userspace application, so my naive assumption
would be that it wouldn't allow the detection of a 32-bit application
using a 64-bit kernel.

metze


Re: [PATCH 0/6] Allow signals for IO threads

2021-04-01 Thread Stefan Metzmacher
Hi Jens,

>> For help, type "help".
>> Type "apropos word" to search for commands related to "word".
>> Attaching to process 1320
>> [New LWP 1321]
>> [New LWP 1322]
>>
>> warning: Selected architecture i386:x86-64 is not compatible with reported 
>> target architecture i386
>>
>> warning: Architecture rejected target-supplied description
>> syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
>> 38  ../sysdeps/unix/sysv/linux/x86_64/syscall.S: No such file or 
>> directory.
>> (gdb)
> 
> Ok, the following makes gdb happy again:
> 
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -163,6 +163,8 @@ int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
> /* Kernel thread ? */
> if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
> memset(childregs, 0, sizeof(struct pt_regs));
> +   if (p->flags & PF_IO_WORKER)
> +   childregs->cs = current_pt_regs()->cs;
> kthread_frame_init(frame, sp, arg);
> return 0;
> }
> 
> I'm wondering if we should decouple the PF_KTHREAD and PF_IO_WORKER cases 
> even more
> and keep as much of a userspace-like copy_thread as possible.

Would it be possible to fix this remaining problem before 5.12 final?
(I don't think my change would be the correct fix actually
and other architectures may have similar problems).

Thanks!
metze





Re: [PATCH 2/7] io_uring: handle signals for IO threads like a normal thread

2021-03-27 Thread Jens Axboe
On 3/27/21 11:40 AM, Eric W. Biederman wrote:
> Jens Axboe  writes:
> 
>> On 3/26/21 4:38 PM, Jens Axboe wrote:
>>> OK good point, and follows the same logic even if it won't make a
>>> difference in my case. I'll make the change.
>>
>> Made the suggested edits and ran the quick tests and the KILL/STOP
>> testing, and no ill effects observed. Kicked off the longer runs now.
>>
>> Not a huge amount of changes from the posted series, but please peruse
>> here if you want to double check:
>>
>> https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-5.12
>>
>> And diff against v2 posted is below. Thanks!
> 
> That looks good.  Thanks.
> 
> Acked-by: "Eric W. Biederman" 

Thanks Eric, amended to add that.

-- 
Jens Axboe



Re: [PATCH 2/7] io_uring: handle signals for IO threads like a normal thread

2021-03-27 Thread Eric W. Biederman
Jens Axboe  writes:

> On 3/26/21 4:38 PM, Jens Axboe wrote:
>> OK good point, and follows the same logic even if it won't make a
>> difference in my case. I'll make the change.
>
> Made the suggested edits and ran the quick tests and the KILL/STOP
> testing, and no ill effects observed. Kicked off the longer runs now.
>
> Not a huge amount of changes from the posted series, but please peruse
> here if you want to double check:
>
> https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-5.12
>
> And diff against v2 posted is below. Thanks!

That looks good.  Thanks.

Acked-by: "Eric W. Biederman" 

>
> diff --git a/fs/io-wq.c b/fs/io-wq.c
> index 3e2f059a1737..7434eb40ca8c 100644
> --- a/fs/io-wq.c
> +++ b/fs/io-wq.c
> @@ -505,10 +505,9 @@ static int io_wqe_worker(void *data)
>   if (signal_pending(current)) {
>   struct ksignal ksig;
>  
> - if (fatal_signal_pending(current))
> - break;
> - if (get_signal(&ksig))
> + if (!get_signal(&ksig))
>   continue;
> + break;
>   }
>   if (ret)
>   continue;
> @@ -722,10 +721,9 @@ static int io_wq_manager(void *data)
>   if (signal_pending(current)) {
>   struct ksignal ksig;
>  
> - if (fatal_signal_pending(current))
> - set_bit(IO_WQ_BIT_EXIT, &wq->state);
> - else if (get_signal(&ksig))
> + if (!get_signal(&ksig))
>   continue;
> + set_bit(IO_WQ_BIT_EXIT, &wq->state);
>   }
>   } while (!test_bit(IO_WQ_BIT_EXIT, &wq->state));
>  
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 66ae46874d85..880abd8b6d31 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -6746,10 +6746,9 @@ static int io_sq_thread(void *data)
>   if (signal_pending(current)) {
>   struct ksignal ksig;
>  
> - if (fatal_signal_pending(current))
> - break;
> - if (get_signal(&ksig))
> + if (!get_signal(&ksig))
>   continue;
> + break;
>   }
>   sqt_spin = false;
>   cap_entries = !list_is_singular(&sqd->ctx_list);
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 5b75fbe3d2d6..f2718350bf4b 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -2752,15 +2752,6 @@ bool get_signal(struct ksignal *ksig)
>*/
>   current->flags |= PF_SIGNALED;
>  
> - /*
> -  * PF_IO_WORKER threads will catch and exit on fatal signals
> -  * themselves. They have cleanup that must be performed, so
> -  * we cannot call do_exit() on their behalf. coredumps also
> -  * do not apply to them.
> -  */
> - if (current->flags & PF_IO_WORKER)
> - return false;
> -
>   if (sig_kernel_coredump(signr)) {
>   if (print_fatal_signals)
>   print_fatal_signal(ksig->info.si_signo);
> @@ -2776,6 +2767,14 @@ bool get_signal(struct ksignal *ksig)
>   do_coredump(&ksig->info);
>   }
>  
> + /*
> +  * PF_IO_WORKER threads will catch and exit on fatal signals
> +  * themselves. They have cleanup that must be performed, so
> +  * we cannot call do_exit() on their behalf.
> +  */
> + if (current->flags & PF_IO_WORKER)
> + goto out;
> +
>   /*
>* Death signals, no core dump.
>*/
> @@ -2783,7 +2782,7 @@ bool get_signal(struct ksignal *ksig)
>   /* NOTREACHED */
>   }
>   spin_unlock_irq(&sighand->siglock);
> -
> +out:
>   ksig->sig = signr;
>  
>   if (!(ksig->ka.sa.sa_flags & SA_EXPOSE_TAGBITS))


Re: [PATCH 0/6] Allow signals for IO threads

2021-03-27 Thread Jens Axboe
On 3/26/21 7:46 PM, Stefan Metzmacher wrote:
> 
> Hi Jens,
> 
>> root@ub1704-166:~# LANG=C gdb --pid 1320
>> GNU gdb (Ubuntu 9.2-0ubuntu1~20.04) 9.2
>> Copyright (C) 2020 Free Software Foundation, Inc.
>> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
>> This is free software: you are free to change and redistribute it.
>> There is NO WARRANTY, to the extent permitted by law.
>> Type "show copying" and "show warranty" for details.
>> This GDB was configured as "x86_64-linux-gnu".
>> Type "show configuration" for configuration details.
>> For bug reporting instructions, please see:
>> <http://www.gnu.org/software/gdb/bugs/>.
>> Find the GDB manual and other documentation resources online at:
>> <http://www.gnu.org/software/gdb/documentation/>.
>>
>> For help, type "help".
>> Type "apropos word" to search for commands related to "word".
>> Attaching to process 1320
>> [New LWP 1321]
>> [New LWP 1322]
>>
>> warning: Selected architecture i386:x86-64 is not compatible with reported 
>> target architecture i386
>>
>> warning: Architecture rejected target-supplied description
>> syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
>> 38  ../sysdeps/unix/sysv/linux/x86_64/syscall.S: No such file or 
>> directory.
>> (gdb)
> 
> Ok, the following makes gdb happy again:
> 
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -163,6 +163,8 @@ int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
> /* Kernel thread ? */
> if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
> memset(childregs, 0, sizeof(struct pt_regs));
> +   if (p->flags & PF_IO_WORKER)
> +   childregs->cs = current_pt_regs()->cs;
> kthread_frame_init(frame, sp, arg);
> return 0;
>     }

Confirmed, it stops complaining about the arch at that point.

> I'm wondering if we should decouple the PF_KTHREAD and PF_IO_WORKER
> cases even more and keep as much of a userspace-like copy_thread as
> possible.

Probably makes sense, the only thing they really share is the func+arg
setup. Hence PF_IO_WORKER threads likely just use the rest of the init,
where it doesn't conflict with the frame setup.

-- 
Jens Axboe



Re: [PATCH 0/6] Allow signals for IO threads

2021-03-26 Thread Stefan Metzmacher


Hi Jens,

> root@ub1704-166:~# LANG=C gdb --pid 1320
> GNU gdb (Ubuntu 9.2-0ubuntu1~20.04) 9.2
> Copyright (C) 2020 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later 
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.
> Type "show copying" and "show warranty" for details.
> This GDB was configured as "x86_64-linux-gnu".
> Type "show configuration" for configuration details.
> For bug reporting instructions, please see:
> .
> Find the GDB manual and other documentation resources online at:
> .
> 
> For help, type "help".
> Type "apropos word" to search for commands related to "word".
> Attaching to process 1320
> [New LWP 1321]
> [New LWP 1322]
> 
> warning: Selected architecture i386:x86-64 is not compatible with reported 
> target architecture i386
> 
> warning: Architecture rejected target-supplied description
> syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
> 38  ../sysdeps/unix/sysv/linux/x86_64/syscall.S: No such file or 
> directory.
> (gdb)

Ok, the following makes gdb happy again:

--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -163,6 +163,8 @@ int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
/* Kernel thread ? */
if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
memset(childregs, 0, sizeof(struct pt_regs));
+   if (p->flags & PF_IO_WORKER)
+   childregs->cs = current_pt_regs()->cs;
kthread_frame_init(frame, sp, arg);
return 0;
}

I'm wondering if we should decouple the PF_KTHREAD and PF_IO_WORKER cases even 
more
and keep as much of a userspace-like copy_thread as possible.

metze


Re: [PATCH 2/7] io_uring: handle signals for IO threads like a normal thread

2021-03-26 Thread Jens Axboe
On 3/26/21 4:38 PM, Jens Axboe wrote:
> On 3/26/21 4:35 PM, Eric W. Biederman wrote:
>> Jens Axboe  writes:
>>
>>> On 3/26/21 4:23 PM, Eric W. Biederman wrote:
>>>> Jens Axboe  writes:
>>>>
>>>>> On 3/26/21 2:29 PM, Eric W. Biederman wrote:
>>>>>> Jens Axboe  writes:
>>>>>>
>>>>>>> We go through various hoops to disallow signals for the IO threads, but
>>>>>>> there's really no reason why we cannot just allow them. The IO threads
>>>>>>> never return to userspace like a normal thread, and hence don't go 
>>>>>>> through
>>>>>>> normal signal processing. Instead, just check for a pending signal as 
>>>>>>> part
>>>>>>> of the work loop, and call get_signal() to handle it for us if anything
>>>>>>> is pending.
>>>>>>>
>>>>>>> With that, we can support receiving signals, including special ones like
>>>>>>> SIGSTOP.
>>>>>>>
>>>>>>> Signed-off-by: Jens Axboe 
>>>>>>> ---
>>>>>>>  fs/io-wq.c| 24 +---
>>>>>>>  fs/io_uring.c | 12 
>>>>>>>  2 files changed, 25 insertions(+), 11 deletions(-)
>>>>>>>
>>>>>>> diff --git a/fs/io-wq.c b/fs/io-wq.c
>>>>>>> index b7c1fa932cb3..3e2f059a1737 100644
>>>>>>> --- a/fs/io-wq.c
>>>>>>> +++ b/fs/io-wq.c
>>>>>>> @@ -16,7 +16,6 @@
>>>>>>>  #include 
>>>>>>>  #include 
>>>>>>>  #include 
>>>>>>> -#include 
>>>>>>>  
>>>>>>>  #include "../kernel/sched/sched.h"
>>>>>>>  #include "io-wq.h"
>>>>>>> @@ -503,10 +502,16 @@ static int io_wqe_worker(void *data)
>>>>>>> if (io_flush_signals())
>>>>>>> continue;
>>>>>>> ret = schedule_timeout(WORKER_IDLE_TIMEOUT);
>>>>>>> -   if (try_to_freeze() || ret)
>>>>>>> +   if (signal_pending(current)) {
>>>>>>> +   struct ksignal ksig;
>>>>>>> +
>>>>>>> +   if (fatal_signal_pending(current))
>>>>>>> +   break;
>>>>>>> +   if (get_signal(&ksig))
>>>>>>> +   continue;
>>>>>> ^^
>>>>>>
>>>>>> That is wrong.  You are promising to deliver a signal to signal
>>>>>> handler and them simply discarding it.  Perhaps:
>>>>>>
>>>>>>  if (!get_signal(&ksig))
>>>>>>  continue;
>>>>>>  WARN_ON(!sig_kernel_stop(ksig->sig));
>>>>>> break;
>>>>>
>>>>> Thanks, updated.
>>>>
>>>> Gah.  Kill the WARN_ON.
>>>>
>>>> I was thinking "WARN_ON(!sig_kernel_fatal(ksig->sig));"
>>>> The function sig_kernel_fatal does not exist.
>>>>
>>>> Fatal is the state that is left when a signal is neither
>>>> ignored nor a stop signal, and does not have a handler.
>>>>
>>>> The rest of the logic still works.
>>>
>>> I've just come to the same conclusion myself after testing it.
>>> Of the 3 cases, most of them can do the continue, but doesn't
>>> really matter with the way the loop is structured. Anyway, looks
>>> like this now:
>>
>> This idiom in the code:
>>> +   if (signal_pending(current)) {
>>> +   struct ksignal ksig;
>>> +
>>> +   if (fatal_signal_pending(current))
>>> +   break;
>>> +   if (!get_signal(&ksig))
>>> +   continue;
>>>  }
>>
>> Needs to be:
>>> +   if (signal_pending(current)) {
>>> +   struct ksignal ksig;
>>> +
>>> +   if (!get_signal(&ksig))
>>> +   continue;
>>> +   

Re: [PATCH 2/7] io_uring: handle signals for IO threads like a normal thread

2021-03-26 Thread Jens Axboe
On 3/26/21 4:35 PM, Eric W. Biederman wrote:
> Jens Axboe  writes:
> 
>> On 3/26/21 4:23 PM, Eric W. Biederman wrote:
>>> Jens Axboe  writes:
>>>
>>>> On 3/26/21 2:29 PM, Eric W. Biederman wrote:
>>>>> Jens Axboe  writes:
>>>>>
>>>>>> We go through various hoops to disallow signals for the IO threads, but
>>>>>> there's really no reason why we cannot just allow them. The IO threads
>>>>>> never return to userspace like a normal thread, and hence don't go 
>>>>>> through
>>>>>> normal signal processing. Instead, just check for a pending signal as 
>>>>>> part
>>>>>> of the work loop, and call get_signal() to handle it for us if anything
>>>>>> is pending.
>>>>>>
>>>>>> With that, we can support receiving signals, including special ones like
>>>>>> SIGSTOP.
>>>>>>
>>>>>> Signed-off-by: Jens Axboe 
>>>>>> ---
>>>>>>  fs/io-wq.c| 24 +---
>>>>>>  fs/io_uring.c | 12 
>>>>>>  2 files changed, 25 insertions(+), 11 deletions(-)
>>>>>>
>>>>>> diff --git a/fs/io-wq.c b/fs/io-wq.c
>>>>>> index b7c1fa932cb3..3e2f059a1737 100644
>>>>>> --- a/fs/io-wq.c
>>>>>> +++ b/fs/io-wq.c
>>>>>> @@ -16,7 +16,6 @@
>>>>>>  #include 
>>>>>>  #include 
>>>>>>  #include 
>>>>>> -#include 
>>>>>>  
>>>>>>  #include "../kernel/sched/sched.h"
>>>>>>  #include "io-wq.h"
>>>>>> @@ -503,10 +502,16 @@ static int io_wqe_worker(void *data)
>>>>>>  if (io_flush_signals())
>>>>>>  continue;
>>>>>>  ret = schedule_timeout(WORKER_IDLE_TIMEOUT);
>>>>>> -if (try_to_freeze() || ret)
>>>>>> +if (signal_pending(current)) {
>>>>>> +struct ksignal ksig;
>>>>>> +
>>>>>> +if (fatal_signal_pending(current))
>>>>>> +break;
>>>>>> +if (get_signal(&ksig))
>>>>>> +continue;
>>>>> ^^
>>>>>
>>>>> That is wrong.  You are promising to deliver a signal to signal
>>>>> handler and them simply discarding it.  Perhaps:
>>>>>
>>>>>   if (!get_signal(&ksig))
>>>>>   continue;
>>>>>   WARN_ON(!sig_kernel_stop(ksig->sig));
>>>>> break;
>>>>
>>>> Thanks, updated.
>>>
>>> Gah.  Kill the WARN_ON.
>>>
>>> I was thinking "WARN_ON(!sig_kernel_fatal(ksig->sig));"
>>> The function sig_kernel_fatal does not exist.
>>>
>>> Fatal is the state that is left when a signal is neither
>>> ignored nor a stop signal, and does not have a handler.
>>>
>>> The rest of the logic still works.
>>
>> I've just come to the same conclusion myself after testing it.
>> Of the 3 cases, most of them can do the continue, but doesn't
>> really matter with the way the loop is structured. Anyway, looks
>> like this now:
> 
> This idiom in the code:
>> +if (signal_pending(current)) {
>> +struct ksignal ksig;
>> +
>> +if (fatal_signal_pending(current))
>> +break;
>> +if (!get_signal(&ksig))
>> +continue;
>>  }
> 
> Needs to be:
>> +if (signal_pending(current)) {
>> +struct ksignal ksig;
>> +
>> +if (!get_signal(&ksig))
>> +continue;
>> +break;
>>  }
> 
> Because any signal returned from get_signal is fatal in this case.
> It might make sense to "WARN_ON(ksig->ka.sa.sa_handler != SIG_DFL)".
> As the io workers don't handle that case.
> 
> It won't happen because you have everything blocked.
>
> The extra fatal_signal_pending(current) logic is just confusing in this
> case.

OK good point, and follows the same logic even if it won't make a
difference in my case. I'll make the change.

-- 
Jens Axboe



Re: [PATCH 2/7] io_uring: handle signals for IO threads like a normal thread

2021-03-26 Thread Eric W. Biederman
Jens Axboe  writes:

> On 3/26/21 4:23 PM, Eric W. Biederman wrote:
>> Jens Axboe  writes:
>> 
>>> On 3/26/21 2:29 PM, Eric W. Biederman wrote:
>>>> Jens Axboe  writes:
>>>>
>>>>> We go through various hoops to disallow signals for the IO threads, but
>>>>> there's really no reason why we cannot just allow them. The IO threads
>>>>> never return to userspace like a normal thread, and hence don't go through
>>>>> normal signal processing. Instead, just check for a pending signal as part
>>>>> of the work loop, and call get_signal() to handle it for us if anything
>>>>> is pending.
>>>>>
>>>>> With that, we can support receiving signals, including special ones like
>>>>> SIGSTOP.
>>>>>
>>>>> Signed-off-by: Jens Axboe 
>>>>> ---
>>>>>  fs/io-wq.c| 24 +---
>>>>>  fs/io_uring.c | 12 
>>>>>  2 files changed, 25 insertions(+), 11 deletions(-)
>>>>>
>>>>> diff --git a/fs/io-wq.c b/fs/io-wq.c
>>>>> index b7c1fa932cb3..3e2f059a1737 100644
>>>>> --- a/fs/io-wq.c
>>>>> +++ b/fs/io-wq.c
>>>>> @@ -16,7 +16,6 @@
>>>>>  #include 
>>>>>  #include 
>>>>>  #include 
>>>>> -#include 
>>>>>  
>>>>>  #include "../kernel/sched/sched.h"
>>>>>  #include "io-wq.h"
>>>>> @@ -503,10 +502,16 @@ static int io_wqe_worker(void *data)
>>>>>   if (io_flush_signals())
>>>>>   continue;
>>>>>   ret = schedule_timeout(WORKER_IDLE_TIMEOUT);
>>>>> - if (try_to_freeze() || ret)
>>>>> + if (signal_pending(current)) {
>>>>> + struct ksignal ksig;
>>>>> +
>>>>> + if (fatal_signal_pending(current))
>>>>> + break;
>>>>> + if (get_signal(&ksig))
>>>>> + continue;
>>>> ^^
>>>>
>>>> That is wrong.  You are promising to deliver a signal to signal
>>>> handler and them simply discarding it.  Perhaps:
>>>>
>>>>if (!get_signal(&ksig))
>>>>continue;
>>>>WARN_ON(!sig_kernel_stop(ksig->sig));
>>>> break;
>>>
>>> Thanks, updated.
>> 
>> Gah.  Kill the WARN_ON.
>> 
>> I was thinking "WARN_ON(!sig_kernel_fatal(ksig->sig));"
>> The function sig_kernel_fatal does not exist.
>> 
>> Fatal is the state that is left when a signal is neither
>> ignored nor a stop signal, and does not have a handler.
>> 
>> The rest of the logic still works.
>
> I've just come to the same conclusion myself after testing it.
> Of the 3 cases, most of them can do the continue, but doesn't
> really matter with the way the loop is structured. Anyway, looks
> like this now:

This idiom in the code:
> + if (signal_pending(current)) {
> + struct ksignal ksig;
> +
> + if (fatal_signal_pending(current))
> + break;
> + if (!get_signal(&ksig))
> + continue;
>  }

Needs to be:
> + if (signal_pending(current)) {
> + struct ksignal ksig;
> +
> + if (!get_signal(&ksig))
> + continue;
> + break;
>  }

Because any signal returned from get_signal is fatal in this case.
It might make sense to "WARN_ON(ksig->ka.sa.sa_handler != SIG_DFL)".
As the io workers don't handle that case.

It won't happen because you have everything blocked.

The extra fatal_signal_pending(current) logic is just confusing in this
case.

Eric


Re: [PATCH 2/7] io_uring: handle signals for IO threads like a normal thread

2021-03-26 Thread Jens Axboe
On 3/26/21 4:23 PM, Eric W. Biederman wrote:
> Jens Axboe  writes:
> 
>> On 3/26/21 2:29 PM, Eric W. Biederman wrote:
>>> Jens Axboe  writes:
>>>
>>>> We go through various hoops to disallow signals for the IO threads, but
>>>> there's really no reason why we cannot just allow them. The IO threads
>>>> never return to userspace like a normal thread, and hence don't go through
>>>> normal signal processing. Instead, just check for a pending signal as part
>>>> of the work loop, and call get_signal() to handle it for us if anything
>>>> is pending.
>>>>
>>>> With that, we can support receiving signals, including special ones like
>>>> SIGSTOP.
>>>>
>>>> Signed-off-by: Jens Axboe 
>>>> ---
>>>>  fs/io-wq.c| 24 +---
>>>>  fs/io_uring.c | 12 
>>>>  2 files changed, 25 insertions(+), 11 deletions(-)
>>>>
>>>> diff --git a/fs/io-wq.c b/fs/io-wq.c
>>>> index b7c1fa932cb3..3e2f059a1737 100644
>>>> --- a/fs/io-wq.c
>>>> +++ b/fs/io-wq.c
>>>> @@ -16,7 +16,6 @@
>>>>  #include 
>>>>  #include 
>>>>  #include 
>>>> -#include 
>>>>  
>>>>  #include "../kernel/sched/sched.h"
>>>>  #include "io-wq.h"
>>>> @@ -503,10 +502,16 @@ static int io_wqe_worker(void *data)
>>>>if (io_flush_signals())
>>>>continue;
>>>>ret = schedule_timeout(WORKER_IDLE_TIMEOUT);
>>>> -  if (try_to_freeze() || ret)
>>>> +  if (signal_pending(current)) {
>>>> +  struct ksignal ksig;
>>>> +
>>>> +  if (fatal_signal_pending(current))
>>>> +  break;
>>>> +  if (get_signal(&ksig))
>>>> +  continue;
>>> ^^
>>>
>>> That is wrong.  You are promising to deliver a signal to signal
>>> handler and them simply discarding it.  Perhaps:
>>>
>>> if (!get_signal(&ksig))
>>> continue;
>>> WARN_ON(!sig_kernel_stop(ksig->sig));
>>> break;
>>
>> Thanks, updated.
> 
> Gah.  Kill the WARN_ON.
> 
> I was thinking "WARN_ON(!sig_kernel_fatal(ksig->sig));"
> The function sig_kernel_fatal does not exist.
> 
> Fatal is the state that is left when a signal is neither
> ignored nor a stop signal, and does not have a handler.
> 
> The rest of the logic still works.

I've just come to the same conclusion myself after testing it.
Of the 3 cases, most of them can do the continue, but doesn't
really matter with the way the loop is structured. Anyway, looks
like this now:


commit 769186e30cd437f5e1a000e7cf00286948779da4
Author: Jens Axboe 
Date:   Thu Mar 25 18:16:06 2021 -0600

io_uring: handle signals for IO threads like a normal thread

We go through various hoops to disallow signals for the IO threads, but
there's really no reason why we cannot just allow them. The IO threads
never return to userspace like a normal thread, and hence don't go through
normal signal processing. Instead, just check for a pending signal as part
of the work loop, and call get_signal() to handle it for us if anything
is pending.

With that, we can support receiving signals, including special ones like
SIGSTOP.

Signed-off-by: Jens Axboe 

diff --git a/fs/io-wq.c b/fs/io-wq.c
index b7c1fa932cb3..7e5970c8b0be 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -16,7 +16,6 @@
 #include 
 #include 
 #include 
-#include 
 
 #include "../kernel/sched/sched.h"
 #include "io-wq.h"
@@ -503,10 +502,16 @@ static int io_wqe_worker(void *data)
if (io_flush_signals())
continue;
ret = schedule_timeout(WORKER_IDLE_TIMEOUT);
-   if (try_to_freeze() || ret)
+   if (signal_pending(current)) {
+   struct ksignal ksig;
+
+   if (fatal_signal_pending(current))
+   break;
+   if (!get_signal(&ksig))
+   continue;
+   }
+   if (ret)
continue;
-   if (fatal_signal_pending(current))
-   break;
/* timed out, exit unless we're the fixed worker */
if

Re: [PATCH 2/7] io_uring: handle signals for IO threads like a normal thread

2021-03-26 Thread Eric W. Biederman
Jens Axboe  writes:

> On 3/26/21 2:29 PM, Eric W. Biederman wrote:
>> Jens Axboe  writes:
>> 
>>> We go through various hoops to disallow signals for the IO threads, but
>>> there's really no reason why we cannot just allow them. The IO threads
>>> never return to userspace like a normal thread, and hence don't go through
>>> normal signal processing. Instead, just check for a pending signal as part
>>> of the work loop, and call get_signal() to handle it for us if anything
>>> is pending.
>>>
>>> With that, we can support receiving signals, including special ones like
>>> SIGSTOP.
>>>
>>> Signed-off-by: Jens Axboe 
>>> ---
>>>  fs/io-wq.c| 24 +---
>>>  fs/io_uring.c | 12 
>>>  2 files changed, 25 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/fs/io-wq.c b/fs/io-wq.c
>>> index b7c1fa932cb3..3e2f059a1737 100644
>>> --- a/fs/io-wq.c
>>> +++ b/fs/io-wq.c
>>> @@ -16,7 +16,6 @@
>>>  #include 
>>>  #include 
>>>  #include 
>>> -#include 
>>>  
>>>  #include "../kernel/sched/sched.h"
>>>  #include "io-wq.h"
>>> @@ -503,10 +502,16 @@ static int io_wqe_worker(void *data)
>>> if (io_flush_signals())
>>> continue;
>>> ret = schedule_timeout(WORKER_IDLE_TIMEOUT);
>>> -   if (try_to_freeze() || ret)
>>> +   if (signal_pending(current)) {
>>> +   struct ksignal ksig;
>>> +
>>> +   if (fatal_signal_pending(current))
>>> +   break;
>>> +   if (get_signal(&ksig))
>>> +   continue;
>> ^^
>> 
>> That is wrong.  You are promising to deliver a signal to signal
>> handler and them simply discarding it.  Perhaps:
>> 
>>  if (!get_signal(&ksig))
>>  continue;
>>  WARN_ON(!sig_kernel_stop(ksig->sig));
>> break;
>
> Thanks, updated.

Gah.  Kill the WARN_ON.

I was thinking "WARN_ON(!sig_kernel_fatal(ksig->sig));"
The function sig_kernel_fatal does not exist.

Fatal is the state that is left when a signal is neither
ignored nor a stop signal, and does not have a handler.

The rest of the logic still works.

Eric



Re: [PATCH 2/7] io_uring: handle signals for IO threads like a normal thread

2021-03-26 Thread Jens Axboe
On 3/26/21 2:29 PM, Eric W. Biederman wrote:
> Jens Axboe  writes:
> 
>> We go through various hoops to disallow signals for the IO threads, but
>> there's really no reason why we cannot just allow them. The IO threads
>> never return to userspace like a normal thread, and hence don't go through
>> normal signal processing. Instead, just check for a pending signal as part
>> of the work loop, and call get_signal() to handle it for us if anything
>> is pending.
>>
>> With that, we can support receiving signals, including special ones like
>> SIGSTOP.
>>
>> Signed-off-by: Jens Axboe 
>> ---
>>  fs/io-wq.c| 24 +---
>>  fs/io_uring.c | 12 
>>  2 files changed, 25 insertions(+), 11 deletions(-)
>>
>> diff --git a/fs/io-wq.c b/fs/io-wq.c
>> index b7c1fa932cb3..3e2f059a1737 100644
>> --- a/fs/io-wq.c
>> +++ b/fs/io-wq.c
>> @@ -16,7 +16,6 @@
>>  #include 
>>  #include 
>>  #include 
>> -#include 
>>  
>>  #include "../kernel/sched/sched.h"
>>  #include "io-wq.h"
>> @@ -503,10 +502,16 @@ static int io_wqe_worker(void *data)
>>  if (io_flush_signals())
>>  continue;
>>  ret = schedule_timeout(WORKER_IDLE_TIMEOUT);
>> -if (try_to_freeze() || ret)
>> +if (signal_pending(current)) {
>> +struct ksignal ksig;
>> +
>> +if (fatal_signal_pending(current))
>> +break;
>> +if (get_signal(&ksig))
>> +continue;
> ^^
> 
> That is wrong.  You are promising to deliver a signal to signal
> handler and them simply discarding it.  Perhaps:
> 
>   if (!get_signal(&ksig))
>   continue;
>   WARN_ON(!sig_kernel_stop(ksig->sig));
> break;

Thanks, updated.

-- 
Jens Axboe



Re: [PATCH 1/7] kernel: don't call do_exit() for PF_IO_WORKER threads

2021-03-26 Thread Jens Axboe
On 3/26/21 2:43 PM, Eric W. Biederman wrote:
> Jens Axboe  writes:
> 
>> Right now we're never calling get_signal() from PF_IO_WORKER threads, but
>> in preparation for doing so, don't handle a fatal signal for them. The
>> workers have state they need to cleanup when exiting, and they don't do
>> coredumps, so just return instead of performing either a dump or calling
>> do_exit() on their behalf. The threads themselves will detect a fatal
>> signal and do proper shutdown.
>>
>> Signed-off-by: Jens Axboe 
>> ---
>>  kernel/signal.c | 9 +
>>  1 file changed, 9 insertions(+)
>>
>> diff --git a/kernel/signal.c b/kernel/signal.c
>> index f2a1b898da29..e3e1b8fbfe8a 100644
>> --- a/kernel/signal.c
>> +++ b/kernel/signal.c
>> @@ -2756,6 +2756,15 @@ bool get_signal(struct ksignal *ksig)
>>       */
>>  current->flags |= PF_SIGNALED;
>>  
>> +/*
>> + * PF_IO_WORKER threads will catch and exit on fatal signals
>> + * themselves. They have cleanup that must be performed, so
>> + * we cannot call do_exit() on their behalf. coredumps also
>> + * do not apply to them.
>> + */
>> +if (current->flags & PF_IO_WORKER)
>> +return false;
>> +
> 
> Returning false when get_signal needs the caller to handle a signal
> adds a very weird and awkward special case to how get_signal returns
> arguments.
> 
> Instead you should simply break and let get_signal return SIGKILL like
> any other signal that has a handler that the caller of get_signal needs
> to handle.
> 
> Something like:
>> +/*
>> + * PF_IO_WORKER have cleanup that must be performed,
>> + * before calling do_exit().
>> + */
>> +if (current->flags & PF_IO_WORKER)
>> +break;
> 
> 
> As do_coredump does not call do_exit there is no reason to skip calling into
> the coredump handling either.   And allowing it will remove yet another
> special case from the io worker code.

Thanks, I'll turn it into a break, that does seem like a better idea in
general. Actually it wants to be a goto or similar, as a break will
assume that we have the sighand lock held. With the coredump being
irrelevant, I'll just put it before the do_exit() call.

-- 
Jens Axboe



Re: [PATCH 1/7] kernel: don't call do_exit() for PF_IO_WORKER threads

2021-03-26 Thread Eric W. Biederman
Jens Axboe  writes:

> Right now we're never calling get_signal() from PF_IO_WORKER threads, but
> in preparation for doing so, don't handle a fatal signal for them. The
> workers have state they need to cleanup when exiting, and they don't do
> coredumps, so just return instead of performing either a dump or calling
> do_exit() on their behalf. The threads themselves will detect a fatal
> signal and do proper shutdown.
>
> Signed-off-by: Jens Axboe 
> ---
>  kernel/signal.c | 9 +
>  1 file changed, 9 insertions(+)
>
> diff --git a/kernel/signal.c b/kernel/signal.c
> index f2a1b898da29..e3e1b8fbfe8a 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -2756,6 +2756,15 @@ bool get_signal(struct ksignal *ksig)
>*/
>   current->flags |= PF_SIGNALED;
>  
> + /*
> +  * PF_IO_WORKER threads will catch and exit on fatal signals
> +  * themselves. They have cleanup that must be performed, so
> +  * we cannot call do_exit() on their behalf. coredumps also
> +  * do not apply to them.
> +  */
> + if (current->flags & PF_IO_WORKER)
> + return false;
> +

Returning false when get_signal needs the caller to handle a signal
adds a very weird and awkward special case to how get_signal returns
arguments.

Instead you should simply break and let get_signal return SIGKILL like
any other signal that has a handler that the caller of get_signal needs
to handle.

Something like:
> + /*
> +  * PF_IO_WORKER have cleanup that must be performed,
> +  * before calling do_exit().
> +  */
> + if (current->flags & PF_IO_WORKER)
> + break;


As do_coredump does not call do_exit there is no reason to skip calling into
the coredump handling either.   And allowing it will remove yet another
special case from the io worker code.

>   if (sig_kernel_coredump(signr)) {
>   if (print_fatal_signals)
>   print_fatal_signal(ksig->info.si_signo);

Eric


Re: [PATCH 2/7] io_uring: handle signals for IO threads like a normal thread

2021-03-26 Thread Eric W. Biederman
Jens Axboe  writes:

> We go through various hoops to disallow signals for the IO threads, but
> there's really no reason why we cannot just allow them. The IO threads
> never return to userspace like a normal thread, and hence don't go through
> normal signal processing. Instead, just check for a pending signal as part
> of the work loop, and call get_signal() to handle it for us if anything
> is pending.
>
> With that, we can support receiving signals, including special ones like
> SIGSTOP.
>
> Signed-off-by: Jens Axboe 
> ---
>  fs/io-wq.c| 24 +---
>  fs/io_uring.c | 12 
>  2 files changed, 25 insertions(+), 11 deletions(-)
>
> diff --git a/fs/io-wq.c b/fs/io-wq.c
> index b7c1fa932cb3..3e2f059a1737 100644
> --- a/fs/io-wq.c
> +++ b/fs/io-wq.c
> @@ -16,7 +16,6 @@
>  #include 
>  #include 
>  #include 
> -#include 
>  
>  #include "../kernel/sched/sched.h"
>  #include "io-wq.h"
> @@ -503,10 +502,16 @@ static int io_wqe_worker(void *data)
>   if (io_flush_signals())
>   continue;
>   ret = schedule_timeout(WORKER_IDLE_TIMEOUT);
> - if (try_to_freeze() || ret)
> + if (signal_pending(current)) {
> + struct ksignal ksig;
> +
> + if (fatal_signal_pending(current))
> + break;
> + if (get_signal(&ksig))
> + continue;
^^

That is wrong.  You are promising to deliver a signal to signal
handler and them simply discarding it.  Perhaps:

if (!get_signal(&ksig))
continue;
WARN_ON(!sig_kernel_stop(ksig->sig));
break;


Eric


Re: [PATCH 2/8] kernel: unmask SIGSTOP for IO threads

2021-03-26 Thread Jens Axboe
On 3/26/21 12:01 PM, Stefan Metzmacher wrote:
> Am 26.03.21 um 16:29 schrieb Jens Axboe:
>> On 3/26/21 9:23 AM, Stefan Metzmacher wrote:
>>> Am 26.03.21 um 16:01 schrieb Jens Axboe:
>>>> On 3/26/21 7:48 AM, Oleg Nesterov wrote:
>>>>> Jens, sorry, I got lost :/
>>>>
>>>> Let's bring you back in :-)
>>>>
>>>>> On 03/25, Jens Axboe wrote:
>>>>>>
>>>>>> With IO threads accepting signals, including SIGSTOP,
>>>>>
>>>>> where can I find this change? Looks like I wasn't cc'ed...
>>>>
>>>> It's this very series.
>>>>
>>>>>> unmask the
>>>>>> SIGSTOP signal from the default blocked mask.
>>>>>>
>>>>>> Signed-off-by: Jens Axboe 
>>>>>> ---
>>>>>>  kernel/fork.c | 2 +-
>>>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/kernel/fork.c b/kernel/fork.c
>>>>>> index d3171e8e88e5..d5a40552910f 100644
>>>>>> --- a/kernel/fork.c
>>>>>> +++ b/kernel/fork.c
>>>>>> @@ -2435,7 +2435,7 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
>>>>>>  tsk = copy_process(NULL, 0, node, &args);
>>>>>>  if (!IS_ERR(tsk)) {
>>>>>>  sigfillset(&tsk->blocked);
>>>>>> -sigdelsetmask(&tsk->blocked, sigmask(SIGKILL));
>>>>>> +sigdelsetmask(&tsk->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP));
>>>>>
>>>>> siginitsetinv(blocked, sigmask(SIGKILL)|sigmask(SIGSTOP)) but this is 
>>>>> minor.
>>>>
>>>> Ah thanks.
>>>>
>>>>> To remind, either way this is racy and can't really help.
>>>>>
>>>>> And if "IO threads accepting signals" then I don't understand why. Sorry,
>>>>> I must have missed something.
>>>>
>>>> I do think the above is a no-op at this point, and we can probably just
>>>> kill it. Let me double check, hopefully we can just remove this blocked
>>>> part.
>>>
>>> Is this really correct to drop in your "kernel: stop masking signals in 
>>> create_io_thread()"
>>> commit?
>>>
>> I don't assume signals wanted by userspace should potentially be handled in an
>> io_thread...
>>> e.g. things set with fcntl(fd, F_SETSIG,) used together with F_SETLEASE?
>>
>> I guess we do actually need it, if we're not fiddling with
>> wants_signal() for them. To quell Oleg's concerns, we can just move it
>> to post dup_task_struct(), that should eliminate any race concerns
>> there.
> 
> If that one is racy, don' we better also want this one?
> https://lore.kernel.org/io-uring/438b738c1e4827a7fdfe43087da88bbe17eedc72.1616197787.git.me...@samba.org/T/#u
> 
> And clear tsk->pf_io_worker ?

Definitely prudent. I'll get round 2 queued up shortly.

-- 
Jens Axboe



Re: [PATCH 2/8] kernel: unmask SIGSTOP for IO threads

2021-03-26 Thread Stefan Metzmacher
Am 26.03.21 um 16:29 schrieb Jens Axboe:
> On 3/26/21 9:23 AM, Stefan Metzmacher wrote:
>> Am 26.03.21 um 16:01 schrieb Jens Axboe:
>>> On 3/26/21 7:48 AM, Oleg Nesterov wrote:
>>>> Jens, sorry, I got lost :/
>>>
>>> Let's bring you back in :-)
>>>
>>>> On 03/25, Jens Axboe wrote:
>>>>>
>>>>> With IO threads accepting signals, including SIGSTOP,
>>>>
>>>> where can I find this change? Looks like I wasn't cc'ed...
>>>
>>> It's this very series.
>>>
>>>>> unmask the
>>>>> SIGSTOP signal from the default blocked mask.
>>>>>
>>>>> Signed-off-by: Jens Axboe 
>>>>> ---
>>>>>  kernel/fork.c | 2 +-
>>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/kernel/fork.c b/kernel/fork.c
>>>>> index d3171e8e88e5..d5a40552910f 100644
>>>>> --- a/kernel/fork.c
>>>>> +++ b/kernel/fork.c
>>>>> @@ -2435,7 +2435,7 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
>>>>>   tsk = copy_process(NULL, 0, node, &args);
>>>>>   if (!IS_ERR(tsk)) {
>>>>>   sigfillset(&tsk->blocked);
>>>>> - sigdelsetmask(&tsk->blocked, sigmask(SIGKILL));
>>>>> + sigdelsetmask(&tsk->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP));
>>>>
>>>> siginitsetinv(blocked, sigmask(SIGKILL)|sigmask(SIGSTOP)) but this is minor.
>>>
>>> Ah thanks.
>>>
>>>> To remind, either way this is racy and can't really help.
>>>>
>>>> And if "IO threads accepting signals" then I don't understand why. Sorry,
>>>> I must have missed something.
>>>
>>> I do think the above is a no-op at this point, and we can probably just
>>> kill it. Let me double check, hopefully we can just remove this blocked
>>> part.
>>
>> Is this really correct to drop in your "kernel: stop masking signals in create_io_thread()" commit?
>>
>> I don't assume signals wanted by userspace should potentially be handled in an
>> io_thread...
>> e.g. things set with fcntl(fd, F_SETSIG, ...) used together with F_SETLEASE?
> 
> I guess we do actually need it, if we're not fiddling with
> wants_signal() for them. To quell Oleg's concerns, we can just move it
> to post dup_task_struct(), that should eliminate any race concerns
> there.

If that one is racy, don't we also want this one?
https://lore.kernel.org/io-uring/438b738c1e4827a7fdfe43087da88bbe17eedc72.1616197787.git.me...@samba.org/T/#u

And clear tsk->pf_io_worker ?

metze


Re: [PATCHSET v2 0/7] Allow signals for IO threads

2021-03-26 Thread Jens Axboe
On 3/26/21 9:51 AM, Jens Axboe wrote:
> Hi,
> 
> For the v1 posting, see here:

Sigh, just ignore the last 4 patches (07...10/10) in this series,
they are sitting on top of this series and I messed up the git send-email.
This patch series ends in the 4 reverts.

-- 
Jens Axboe



[PATCH 06/10] Revert "signal: don't allow STOP on PF_IO_WORKER threads"

2021-03-26 Thread Jens Axboe
This reverts commit 4db4b1a0d1779dc159f7b87feb97030ec0b12597.

The IO threads allow and handle SIGSTOP now, so don't special case them
anymore in task_set_jobctl_pending().

Signed-off-by: Jens Axboe 
---
 kernel/signal.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index 76d85830d4fa..5b75fbe3d2d6 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -288,8 +288,7 @@ bool task_set_jobctl_pending(struct task_struct *task, unsigned long mask)
 			JOBCTL_STOP_SIGMASK | JOBCTL_TRAPPING));
 	BUG_ON((mask & JOBCTL_TRAPPING) && !(mask & JOBCTL_PENDING_MASK));
 
-	if (unlikely(fatal_signal_pending(task) ||
-		     (task->flags & (PF_EXITING | PF_IO_WORKER))))
+	if (unlikely(fatal_signal_pending(task) || (task->flags & PF_EXITING)))
 		return false;
 
 	if (mask & JOBCTL_STOP_SIGMASK)
-- 
2.31.0



[PATCH 4/7] Revert "signal: don't allow sending any signals to PF_IO_WORKER threads"

2021-03-26 Thread Jens Axboe
This reverts commit 5be28c8f85ce99ed2d329d2ad8bdd18ea19473a5.

IO threads now take signals just fine, so there's no reason to limit them
specifically. Revert the change that prevented that from happening.

Signed-off-by: Jens Axboe 
---
 kernel/signal.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index e3e1b8fbfe8a..af890479921a 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -834,9 +834,6 @@ static int check_kill_permission(int sig, struct kernel_siginfo *info,
 
 	if (!valid_signal(sig))
 		return -EINVAL;
-	/* PF_IO_WORKER threads don't take any signals */
-	if (t->flags & PF_IO_WORKER)
-		return -ESRCH;
 
 	if (!si_fromuser(info))
 		return 0;
-- 
2.31.0



[PATCH 7/7] Revert "signal: don't allow STOP on PF_IO_WORKER threads"

2021-03-26 Thread Jens Axboe
This reverts commit 4db4b1a0d1779dc159f7b87feb97030ec0b12597.

The IO threads allow and handle SIGSTOP now, so don't special case them
anymore in task_set_jobctl_pending().

Signed-off-by: Jens Axboe 
---
 kernel/signal.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index 76d85830d4fa..5b75fbe3d2d6 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -288,8 +288,7 @@ bool task_set_jobctl_pending(struct task_struct *task, unsigned long mask)
 			JOBCTL_STOP_SIGMASK | JOBCTL_TRAPPING));
 	BUG_ON((mask & JOBCTL_TRAPPING) && !(mask & JOBCTL_PENDING_MASK));
 
-	if (unlikely(fatal_signal_pending(task) ||
-		     (task->flags & (PF_EXITING | PF_IO_WORKER))))
+	if (unlikely(fatal_signal_pending(task) || (task->flags & PF_EXITING)))
 		return false;
 
 	if (mask & JOBCTL_STOP_SIGMASK)
-- 
2.31.0



[PATCH 2/7] io_uring: handle signals for IO threads like a normal thread

2021-03-26 Thread Jens Axboe
We go through various hoops to disallow signals for the IO threads, but
there's really no reason why we cannot just allow them. The IO threads
never return to userspace like a normal thread, and hence don't go through
normal signal processing. Instead, just check for a pending signal as part
of the work loop, and call get_signal() to handle it for us if anything
is pending.

With that, we can support receiving signals, including special ones like
SIGSTOP.

Signed-off-by: Jens Axboe 
---
 fs/io-wq.c| 24 +---
 fs/io_uring.c | 12 
 2 files changed, 25 insertions(+), 11 deletions(-)

diff --git a/fs/io-wq.c b/fs/io-wq.c
index b7c1fa932cb3..3e2f059a1737 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -16,7 +16,6 @@
 #include 
 #include 
 #include 
-#include <linux/freezer.h>
 
 #include "../kernel/sched/sched.h"
 #include "io-wq.h"
@@ -503,10 +502,16 @@ static int io_wqe_worker(void *data)
 		if (io_flush_signals())
 			continue;
 		ret = schedule_timeout(WORKER_IDLE_TIMEOUT);
-		if (try_to_freeze() || ret)
+		if (signal_pending(current)) {
+			struct ksignal ksig;
+
+			if (fatal_signal_pending(current))
+				break;
+			if (get_signal(&ksig))
+				continue;
+		}
+		if (ret)
 			continue;
-		if (fatal_signal_pending(current))
-			break;
 		/* timed out, exit unless we're the fixed worker */
 		if (test_bit(IO_WQ_BIT_EXIT, &wq->state) ||
 		    !(worker->flags & IO_WORKER_F_FIXED))
@@ -714,9 +719,14 @@ static int io_wq_manager(void *data)
 		set_current_state(TASK_INTERRUPTIBLE);
 		io_wq_check_workers(wq);
 		schedule_timeout(HZ);
-		try_to_freeze();
-		if (fatal_signal_pending(current))
-			set_bit(IO_WQ_BIT_EXIT, &wq->state);
+		if (signal_pending(current)) {
+			struct ksignal ksig;
+
+			if (fatal_signal_pending(current))
+				set_bit(IO_WQ_BIT_EXIT, &wq->state);
+			else if (get_signal(&ksig))
+				continue;
+		}
 	} while (!test_bit(IO_WQ_BIT_EXIT, &wq->state));
 
 	io_wq_check_workers(wq);
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 54ea561db4a5..350418a88db3 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -78,7 +78,6 @@
 #include 
 #include 
 #include 
-#include <linux/freezer.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/io_uring.h>
@@ -6765,8 +6764,14 @@ static int io_sq_thread(void *data)
 			timeout = jiffies + sqd->sq_thread_idle;
 			continue;
 		}
-		if (fatal_signal_pending(current))
-			break;
+		if (signal_pending(current)) {
+			struct ksignal ksig;
+
+			if (fatal_signal_pending(current))
+				break;
+			if (get_signal(&ksig))
+				continue;
+		}
 		sqt_spin = false;
 		cap_entries = !list_is_singular(&sqd->ctx_list);
 		list_for_each_entry(ctx, &sqd->ctx_list, sqd_list) {
@@ -6809,7 +6814,6 @@ static int io_sq_thread(void *data)
 
 		mutex_unlock(&sqd->lock);
 		schedule();
-		try_to_freeze();
 		mutex_lock(&sqd->lock);
 		list_for_each_entry(ctx, &sqd->ctx_list, sqd_list)
 			io_ring_clear_wakeup_flag(ctx);
-- 
2.31.0
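
The change has the same shape in all three loops above; distilled, the
pattern this patch introduces looks roughly like the following sketch, where
should_exit() and do_work() are hypothetical stand-ins for the io-wq and
SQPOLL specifics:

static int io_thread_loop(void *data)
{
	while (!should_exit(data)) {		/* stand-in for IO_WQ_BIT_EXIT etc. */
		if (signal_pending(current)) {
			struct ksignal ksig;

			if (fatal_signal_pending(current))
				break;		/* leave the loop, clean up below */
			if (get_signal(&ksig))
				continue;	/* signal handled (e.g. stop/continue) */
		}
		do_work(data);			/* stand-in for the actual work */
	}
	/* PF_IO_WORKER cleanup runs here instead of in get_signal()/do_exit() */
	return 0;
}

Since the threads never return to userspace, get_signal() only ever processes
the signal in kernel context; anything fatal is left for the
fatal_signal_pending() check on the next pass.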



[PATCH 03/10] Revert "signal: don't allow sending any signals to PF_IO_WORKER threads"

2021-03-26 Thread Jens Axboe
This reverts commit 5be28c8f85ce99ed2d329d2ad8bdd18ea19473a5.

IO threads now take signals just fine, so there's no reason to limit them
specifically. Revert the change that prevented that from happening.

Signed-off-by: Jens Axboe 
---
 kernel/signal.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index e3e1b8fbfe8a..af890479921a 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -834,9 +834,6 @@ static int check_kill_permission(int sig, struct kernel_siginfo *info,
 
 	if (!valid_signal(sig))
 		return -EINVAL;
-	/* PF_IO_WORKER threads don't take any signals */
-	if (t->flags & PF_IO_WORKER)
-		return -ESRCH;
 
 	if (!si_fromuser(info))
 		return 0;
-- 
2.31.0



[PATCH 1/7] kernel: don't call do_exit() for PF_IO_WORKER threads

2021-03-26 Thread Jens Axboe
Right now we're never calling get_signal() from PF_IO_WORKER threads, but
in preparation for doing so, don't handle a fatal signal for them. The
workers have state they need to cleanup when exiting, and they don't do
coredumps, so just return instead of performing either a dump or calling
do_exit() on their behalf. The threads themselves will detect a fatal
signal and do proper shutdown.

Signed-off-by: Jens Axboe 
---
 kernel/signal.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/kernel/signal.c b/kernel/signal.c
index f2a1b898da29..e3e1b8fbfe8a 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2756,6 +2756,15 @@ bool get_signal(struct ksignal *ksig)
 	 */
 	current->flags |= PF_SIGNALED;
 
+	/*
+	 * PF_IO_WORKER threads will catch and exit on fatal signals
+	 * themselves. They have cleanup that must be performed, so
+	 * we cannot call do_exit() on their behalf. coredumps also
+	 * do not apply to them.
+	 */
+	if (current->flags & PF_IO_WORKER)
+		return false;
+
 	if (sig_kernel_coredump(signr)) {
 		if (print_fatal_signals)
 			print_fatal_signal(ksig->info.si_signo);
-- 
2.31.0
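
Condensed, the fatal-signal path in get_signal() then looks roughly like this
after the patch (heavily abbreviated from kernel/signal.c; a sketch, not the
verbatim function):

bool get_signal(struct ksignal *ksig)
{
	/* ... dequeue the signal, handle job control stop/continue ... */
	for (;;) {
		/* ... at this point signr is fatal ... */
		current->flags |= PF_SIGNALED;

		if (current->flags & PF_IO_WORKER)
			return false;	/* IO thread: caller cleans up and exits itself */

		if (sig_kernel_coredump(signr))
			do_coredump(&ksig->info);	/* regular threads may dump core */

		do_group_exit(ksig->info.si_signo);	/* does not return */
	}
}

The callers in this series pair that false return with a
fatal_signal_pending() check on the next loop iteration, so the worker still
notices the SIGKILL and shuts itself down.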



[PATCHSET v2 0/7] Allow signals for IO threads

2021-03-26 Thread Jens Axboe
Hi,

For the v1 posting, see here:

https://lore.kernel.org/io-uring/20210326003928.978750-1-ax...@kernel.dk/

I've run this through the usual testing, and it's running long term right
now. I've tested the cases that Stefan reported, and we seem fine now.

Changes since v1:

- Catch fatal signals in get_signal() for PF_IO_WORKER. This is only a
  problem for nested signals, like SIGSTOP followed by SIGKILL. We
  can't have get_signal() calling do_exit() on behalf of the IO threads,
  they have cleanups to do. Thanks Stefan.

- Move signal masking to when the PF_IO_WORKER thread is created, and since
  we now handle SIGSTOP, unmask that as well. Thanks Oleg.

- Remove try_to_freeze() parts in IO threads, we don't need those anymore
  with the calling of get_signal().

- Minor cleanups.

 fs/io-wq.c   | 24 +---
 fs/io_uring.c| 12 
 kernel/fork.c| 16 
 kernel/freezer.c |  2 +-
 kernel/ptrace.c  |  2 +-
 kernel/signal.c  | 19 ---
 6 files changed, 47 insertions(+), 28 deletions(-)

-- 
Jens Axboe




Re: [PATCH 2/8] kernel: unmask SIGSTOP for IO threads

2021-03-26 Thread Jens Axboe
On 3/26/21 9:23 AM, Stefan Metzmacher wrote:
> Am 26.03.21 um 16:01 schrieb Jens Axboe:
>> On 3/26/21 7:48 AM, Oleg Nesterov wrote:
>>> Jens, sorry, I got lost :/
>>
>> Let's bring you back in :-)
>>
>>> On 03/25, Jens Axboe wrote:
>>>>
>>>> With IO threads accepting signals, including SIGSTOP,
>>>
>>> where can I find this change? Looks like I wasn't cc'ed...
>>
>> It's this very series.
>>
>>>> unmask the
>>>> SIGSTOP signal from the default blocked mask.
>>>>
>>>> Signed-off-by: Jens Axboe 
>>>> ---
>>>>  kernel/fork.c | 2 +-
>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/kernel/fork.c b/kernel/fork.c
>>>> index d3171e8e88e5..d5a40552910f 100644
>>>> --- a/kernel/fork.c
>>>> +++ b/kernel/fork.c
>>>> @@ -2435,7 +2435,7 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
>>>>  	tsk = copy_process(NULL, 0, node, &args);
>>>>  	if (!IS_ERR(tsk)) {
>>>>  		sigfillset(&tsk->blocked);
>>>> -		sigdelsetmask(&tsk->blocked, sigmask(SIGKILL));
>>>> +		sigdelsetmask(&tsk->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP));
>>>
>>> siginitsetinv(blocked, sigmask(SIGKILL)|sigmask(SIGSTOP)) but this is minor.
>>
>> Ah thanks.
>>
>>> To remind, either way this is racy and can't really help.
>>>
>>> And if "IO threads accepting signals" then I don't understand why. Sorry,
>>> I must have missed something.
>>
>> I do think the above is a no-op at this point, and we can probably just
>> kill it. Let me double check, hopefully we can just remove this blocked
>> part.
> 
> Is this really correct to drop in your "kernel: stop masking signals in create_io_thread()" commit?
> 
> I don't assume signals wanted by userspace should potentially be handled in an
> io_thread...
> e.g. things set with fcntl(fd, F_SETSIG, ...) used together with F_SETLEASE?

I guess we do actually need it, if we're not fiddling with
wants_signal() for them. To quell Oleg's concerns, we can just move it
to post dup_task_struct(), that should eliminate any race concerns
there.

-- 
Jens Axboe



Re: [PATCH 2/8] kernel: unmask SIGSTOP for IO threads

2021-03-26 Thread Stefan Metzmacher
Am 26.03.21 um 16:01 schrieb Jens Axboe:
> On 3/26/21 7:48 AM, Oleg Nesterov wrote:
>> Jens, sorry, I got lost :/
> 
> Let's bring you back in :-)
> 
>> On 03/25, Jens Axboe wrote:
>>>
>>> With IO threads accepting signals, including SIGSTOP,
>>
>> where can I find this change? Looks like I wasn't cc'ed...
> 
> It's this very series.
> 
>>> unmask the
>>> SIGSTOP signal from the default blocked mask.
>>>
>>> Signed-off-by: Jens Axboe 
>>> ---
>>>  kernel/fork.c | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/kernel/fork.c b/kernel/fork.c
>>> index d3171e8e88e5..d5a40552910f 100644
>>> --- a/kernel/fork.c
>>> +++ b/kernel/fork.c
>>> @@ -2435,7 +2435,7 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
>>>  	tsk = copy_process(NULL, 0, node, &args);
>>>  	if (!IS_ERR(tsk)) {
>>>  		sigfillset(&tsk->blocked);
>>> -		sigdelsetmask(&tsk->blocked, sigmask(SIGKILL));
>>> +		sigdelsetmask(&tsk->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP));
>>
>> siginitsetinv(blocked, sigmask(SIGKILL)|sigmask(SIGSTOP)) but this is minor.
> 
> Ah thanks.
> 
>> To remind, either way this is racy and can't really help.
>>
>> And if "IO threads accepting signals" then I don't understand why. Sorry,
>> I must have missed something.
> 
> I do think the above is a no-op at this point, and we can probably just
> kill it. Let me double check, hopefully we can just remove this blocked
> part.

Is this really correct to drop in your "kernel: stop masking signals in create_io_thread()" commit?

I don't assume signals wanted by userspace should potentially be handled in an
io_thread...
e.g. things set with fcntl(fd, F_SETSIG, ...) used together with F_SETLEASE?

metze



Re: [PATCH 0/6] Allow signals for IO threads

2021-03-26 Thread Jens Axboe
On 3/26/21 9:11 AM, Stefan Metzmacher wrote:
> Am 26.03.21 um 16:10 schrieb Jens Axboe:
>> On 3/26/21 9:08 AM, Stefan Metzmacher wrote:
>>> Am 26.03.21 um 15:55 schrieb Jens Axboe:
>>>> On 3/26/21 8:53 AM, Jens Axboe wrote:
>>>>> On 3/26/21 8:45 AM, Stefan Metzmacher wrote:
>>>>>> Am 26.03.21 um 15:43 schrieb Stefan Metzmacher:
>>>>>>> Am 26.03.21 um 15:38 schrieb Jens Axboe:
>>>>>>>> On 3/26/21 7:59 AM, Jens Axboe wrote:
>>>>>>>>> On 3/26/21 7:54 AM, Jens Axboe wrote:
>>>>>>>>>>> The KILL after STOP deadlock still exists.
>>>>>>>>>>
>>>>>>>>>> In which tree? Sounds like you're still on the old one with that
>>>>>>>>>> incremental you sent, which wasn't complete.
>>>>>>>>>>
>>>>>>>>>>> Does io_wq_manager() exit without cleaning up on SIGKILL?
>>>>>>>>>>
>>>>>>>>>> No, it should clean up in all cases. I'll try your stop + kill, I just
>>>>>>>>>> tested both of them separately and didn't observe anything. I also 
>>>>>>>>>> ran
>>>>>>>>>> your io_uring-cp example (and found a bug in the example, fixed and
>>>>>>>>>> pushed), fwiw.
>>>>>>>>>
>>>>>>>>> I can reproduce this one! I'll take a closer look.
>>>>>>>>
>>>>>>>> OK, that one is actually pretty straight forward - we rely on cleaning
>>>>>>>> up on exit, but for fatal cases, get_signal() will call do_exit() for 
>>>>>>>> us
>>>>>>>> and never return. So we might need a special case in there to deal with
>>>>>>>> that, or some other way of ensuring that fatal signal gets processed
>>>>>>>> correctly for IO threads.
>>>>>>>
>>>>>>> And if (fatal_signal_pending(current)) doesn't prevent get_signal() 
>>>>>>> from being called?
>>>>>>
>>>>>> Ah, we're still in the first get_signal() from SIGSTOP, correct?
>>>>>
>>>>> Yes exactly, we're waiting in there being stopped. So we either need to
>>>>> change the check to something ala:
>>>>>
>>>>> relock:
>>>>> + if (current->flags & PF_IO_WORKER && fatal_signal_pending(current))
>>>>> + return false;
>>>>>
>>>>> to catch it upfront and from the relock case, or add:
>>>>>
>>>>>   fatal:
>>>>> + if (current->flags & PF_IO_WORKER)
>>>>> + return false;
>>>>>
>>>>> to catch it in the fatal section.
>>>>
>>>> Can you try this? Not crazy about adding a special case, but I don't
>>>> think there's any way around this one. And should be pretty cheap, as
>>>> we're already pulling in ->flags right above anyway.
>>>>
>>>> diff --git a/kernel/signal.c b/kernel/signal.c
>>>> index 5ad8566534e7..5b75fbe3d2d6 100644
>>>> --- a/kernel/signal.c
>>>> +++ b/kernel/signal.c
>>>> @@ -2752,6 +2752,15 @@ bool get_signal(struct ksignal *ksig)
>>>> */
>>>>current->flags |= PF_SIGNALED;
>>>>  
>>>> +  /*
>>>> +   * PF_IO_WORKER threads will catch and exit on fatal signals
>>>> +   * themselves. They have cleanup that must be performed, so
>>>> +   * we cannot call do_exit() on their behalf. coredumps also
>>>> +   * do not apply to them.
>>>> +   */
>>>> +  if (current->flags & PF_IO_WORKER)
>>>> +  return false;
>>>> +
>>>>if (sig_kernel_coredump(signr)) {
>>>>if (print_fatal_signals)
>>>>print_fatal_signal(ksig->info.si_signo);
>>>>
>>>
>>> I guess not before next week, but if it resolves the problem for you,
>>> I guess it would be good to get this into rc5.
>>
>> It does, I pushed out a new branch. I'll send out a v2 series in a bit.
> 
> Great, thanks!
> 
> Any chance to get the "cmdline" hiding included?

I'll take a look at your response there, haven't yet. Wanted to get this
one sorted first.

-- 
Jens Axboe



Re: [PATCH 0/6] Allow signals for IO threads

2021-03-26 Thread Stefan Metzmacher
Am 26.03.21 um 16:10 schrieb Jens Axboe:
> On 3/26/21 9:08 AM, Stefan Metzmacher wrote:
>> Am 26.03.21 um 15:55 schrieb Jens Axboe:
>>> On 3/26/21 8:53 AM, Jens Axboe wrote:
>>>> On 3/26/21 8:45 AM, Stefan Metzmacher wrote:
>>>>> Am 26.03.21 um 15:43 schrieb Stefan Metzmacher:
>>>>>> Am 26.03.21 um 15:38 schrieb Jens Axboe:
>>>>>>> On 3/26/21 7:59 AM, Jens Axboe wrote:
>>>>>>>> On 3/26/21 7:54 AM, Jens Axboe wrote:
>>>>>>>>>> The KILL after STOP deadlock still exists.
>>>>>>>>>
>>>>>>>>> In which tree? Sounds like you're still on the old one with that
>>>>>>>>> incremental you sent, which wasn't complete.
>>>>>>>>>
>>>>>>>>>> Does io_wq_manager() exit without cleaning up on SIGKILL?
>>>>>>>>>
>>>>>>>>> No, it should clean up in all cases. I'll try your stop + kill, I just
>>>>>>>>> tested both of them separately and didn't observe anything. I also ran
>>>>>>>>> your io_uring-cp example (and found a bug in the example, fixed and
>>>>>>>>> pushed), fwiw.
>>>>>>>>
>>>>>>>> I can reproduce this one! I'll take a closer look.
>>>>>>>
>>>>>>> OK, that one is actually pretty straight forward - we rely on cleaning
>>>>>>> up on exit, but for fatal cases, get_signal() will call do_exit() for us
>>>>>>> and never return. So we might need a special case in there to deal with
>>>>>>> that, or some other way of ensuring that fatal signal gets processed
>>>>>>> correctly for IO threads.
>>>>>>
>>>>>> And if (fatal_signal_pending(current)) doesn't prevent get_signal() from 
>>>>>> being called?
>>>>>
>>>>> Ah, we're still in the first get_signal() from SIGSTOP, correct?
>>>>
>>>> Yes exactly, we're waiting in there being stopped. So we either need to
>>>> change the check to something ala:
>>>>
>>>> relock:
>>>> +  if (current->flags & PF_IO_WORKER && fatal_signal_pending(current))
>>>> +  return false;
>>>>
>>>> to catch it upfront and from the relock case, or add:
>>>>
>>>>fatal:
>>>> +  if (current->flags & PF_IO_WORKER)
>>>> +  return false;
>>>>
>>>> to catch it in the fatal section.
>>>
>>> Can you try this? Not crazy about adding a special case, but I don't
>>> think there's any way around this one. And should be pretty cheap, as
>>> we're already pulling in ->flags right above anyway.
>>>
>>> diff --git a/kernel/signal.c b/kernel/signal.c
>>> index 5ad8566534e7..5b75fbe3d2d6 100644
>>> --- a/kernel/signal.c
>>> +++ b/kernel/signal.c
>>> @@ -2752,6 +2752,15 @@ bool get_signal(struct ksignal *ksig)
>>>  */
>>> current->flags |= PF_SIGNALED;
>>>  
>>> +   /*
>>> +* PF_IO_WORKER threads will catch and exit on fatal signals
>>> +* themselves. They have cleanup that must be performed, so
>>> +* we cannot call do_exit() on their behalf. coredumps also
>>> +* do not apply to them.
>>> +*/
>>> +   if (current->flags & PF_IO_WORKER)
>>> +   return false;
>>> +
>>> if (sig_kernel_coredump(signr)) {
>>> if (print_fatal_signals)
>>> print_fatal_signal(ksig->info.si_signo);
>>>
>>
>> I guess not before next week, but if it resolves the problem for you,
>> I guess it would be good to get this into rc5.
> 
> It does, I pushed out a new branch. I'll send out a v2 series in a bit.

Great, thanks!

Any chance to get the "cmdline" hiding included?

metze



Re: [PATCH 0/6] Allow signals for IO threads

2021-03-26 Thread Jens Axboe
On 3/26/21 9:08 AM, Stefan Metzmacher wrote:
> Am 26.03.21 um 15:55 schrieb Jens Axboe:
>> On 3/26/21 8:53 AM, Jens Axboe wrote:
>>> On 3/26/21 8:45 AM, Stefan Metzmacher wrote:
>>>> Am 26.03.21 um 15:43 schrieb Stefan Metzmacher:
>>>>> Am 26.03.21 um 15:38 schrieb Jens Axboe:
>>>>>> On 3/26/21 7:59 AM, Jens Axboe wrote:
>>>>>>> On 3/26/21 7:54 AM, Jens Axboe wrote:
>>>>>>>>> The KILL after STOP deadlock still exists.
>>>>>>>>
>>>>>>>> In which tree? Sounds like you're still on the old one with that
>>>>>>>> incremental you sent, which wasn't complete.
>>>>>>>>
>>>>>>>>> Does io_wq_manager() exit without cleaning up on SIGKILL?
>>>>>>>>
>>>>>>>> No, it should clean up in all cases. I'll try your stop + kill, I just
>>>>>>>> tested both of them separately and didn't observe anything. I also ran
>>>>>>>> your io_uring-cp example (and found a bug in the example, fixed and
>>>>>>>> pushed), fwiw.
>>>>>>>
>>>>>>> I can reproduce this one! I'll take a closer look.
>>>>>>
>>>>>> OK, that one is actually pretty straight forward - we rely on cleaning
>>>>>> up on exit, but for fatal cases, get_signal() will call do_exit() for us
>>>>>> and never return. So we might need a special case in there to deal with
>>>>>> that, or some other way of ensuring that fatal signal gets processed
>>>>>> correctly for IO threads.
>>>>>
>>>>> And if (fatal_signal_pending(current)) doesn't prevent get_signal() from 
>>>>> being called?
>>>>
>>>> Ah, we're still in the first get_signal() from SIGSTOP, correct?
>>>
>>> Yes exactly, we're waiting in there being stopped. So we either need to
>>> change the check to something ala:
>>>
>>> relock:
>>> +   if (current->flags & PF_IO_WORKER && fatal_signal_pending(current))
>>> +   return false;
>>>
>>> to catch it upfront and from the relock case, or add:
>>>
>>> fatal:
>>> +   if (current->flags & PF_IO_WORKER)
>>> +   return false;
>>>
>>> to catch it in the fatal section.
>>
>> Can you try this? Not crazy about adding a special case, but I don't
>> think there's any way around this one. And should be pretty cheap, as
>> we're already pulling in ->flags right above anyway.
>>
>> diff --git a/kernel/signal.c b/kernel/signal.c
>> index 5ad8566534e7..5b75fbe3d2d6 100644
>> --- a/kernel/signal.c
>> +++ b/kernel/signal.c
>> @@ -2752,6 +2752,15 @@ bool get_signal(struct ksignal *ksig)
>>   */
>>  current->flags |= PF_SIGNALED;
>>  
>> +/*
>> + * PF_IO_WORKER threads will catch and exit on fatal signals
>> + * themselves. They have cleanup that must be performed, so
>> + * we cannot call do_exit() on their behalf. coredumps also
>> + * do not apply to them.
>> + */
>> +if (current->flags & PF_IO_WORKER)
>> +return false;
>> +
>>  if (sig_kernel_coredump(signr)) {
>>  if (print_fatal_signals)
>>  print_fatal_signal(ksig->info.si_signo);
>>
> 
> I guess not before next week, but if it resolves the problem for you,
> I guess it would be good to get this into rc5.

It does, I pushed out a new branch. I'll send out a v2 series in a bit.

-- 
Jens Axboe



Re: [PATCH 0/6] Allow signals for IO threads

2021-03-26 Thread Jens Axboe
On 3/26/21 9:04 AM, Stefan Metzmacher wrote:
> 
> Am 26.03.21 um 15:53 schrieb Jens Axboe:
>> On 3/26/21 8:45 AM, Stefan Metzmacher wrote:
>>> Am 26.03.21 um 15:43 schrieb Stefan Metzmacher:
>>>> Am 26.03.21 um 15:38 schrieb Jens Axboe:
>>>>> On 3/26/21 7:59 AM, Jens Axboe wrote:
>>>>>> On 3/26/21 7:54 AM, Jens Axboe wrote:
>>>>>>>> The KILL after STOP deadlock still exists.
>>>>>>>
>>>>>>> In which tree? Sounds like you're still on the old one with that
>>>>>>> incremental you sent, which wasn't complete.
>>>>>>>
>>>>>>>> Does io_wq_manager() exit without cleaning up on SIGKILL?
>>>>>>>
>>>>>>> No, it should clean up in all cases. I'll try your stop + kill, I just
>>>>>>> tested both of them separately and didn't observe anything. I also ran
>>>>>>> your io_uring-cp example (and found a bug in the example, fixed and
>>>>>>> pushed), fwiw.
>>>>>>
>>>>>> I can reproduce this one! I'll take a closer look.
>>>>>
>>>>> OK, that one is actually pretty straight forward - we rely on cleaning
>>>>> up on exit, but for fatal cases, get_signal() will call do_exit() for us
>>>>> and never return. So we might need a special case in there to deal with
>>>>> that, or some other way of ensuring that fatal signal gets processed
>>>>> correctly for IO threads.
>>>>
>>>> And if (fatal_signal_pending(current)) doesn't prevent get_signal() from 
>>>> being called?
>>>
>>> Ah, we're still in the first get_signal() from SIGSTOP, correct?
>>
>> Yes exactly, we're waiting in there being stopped. So we either need to
>> change the check to something ala:
>>
>> relock:
>> +if (current->flags & PF_IO_WORKER && fatal_signal_pending(current))
>> +return false;
>>
>> to catch it upfront and from the relock case, or add:
>>
>>  fatal:
>> +if (current->flags & PF_IO_WORKER)
>> +return false;
>>
>> to catch it in the fatal section.
>>
> 
> Or something like io_uring_files_cancel()
> 
> Maybe replace current->pf_io_worker with a generic current->io_thread
> structure which has exit hooks, as well as
> io_wq_worker_sleeping() and io_wq_worker_running().
> 
> Maybe create_io_thread would take such a structure
> as argument instead of a single function pointer.
> 
> struct io_thread_description {
>   const char *name;
>   int (*thread_fn)(struct io_thread_description *);
>   void (*sleeping_fn)(struct io_thread_description *);
>   void (*running_fn)(struct io_thread_description *);
>   void (*exit_fn)(struct io_thread_description *);
> };
> 
> And then
> struct io_wq_manager {
>   struct io_thread_description description;
>   ... manager specific stuff...
> };

I did consider something like that, but it seems a bit over-engineered
just for catching this case. And any kind of logic for PF_EXITING
ends up being a bit tricky for cancelations.

We can look into doing that for 5.13 potentially.

-- 
Jens Axboe



Re: [PATCH 0/6] Allow signals for IO threads

2021-03-26 Thread Stefan Metzmacher
Am 26.03.21 um 15:55 schrieb Jens Axboe:
> On 3/26/21 8:53 AM, Jens Axboe wrote:
>> On 3/26/21 8:45 AM, Stefan Metzmacher wrote:
>>> Am 26.03.21 um 15:43 schrieb Stefan Metzmacher:
>>>> Am 26.03.21 um 15:38 schrieb Jens Axboe:
>>>>> On 3/26/21 7:59 AM, Jens Axboe wrote:
>>>>>> On 3/26/21 7:54 AM, Jens Axboe wrote:
>>>>>>>> The KILL after STOP deadlock still exists.
>>>>>>>
>>>>>>> In which tree? Sounds like you're still on the old one with that
>>>>>>> incremental you sent, which wasn't complete.
>>>>>>>
>>>>>>>> Does io_wq_manager() exit without cleaning up on SIGKILL?
>>>>>>>
>>>>>>> No, it should clean up in all cases. I'll try your stop + kill, I just
>>>>>>> tested both of them separately and didn't observe anything. I also ran
>>>>>>> your io_uring-cp example (and found a bug in the example, fixed and
>>>>>>> pushed), fwiw.
>>>>>>
>>>>>> I can reproduce this one! I'll take a closer look.
>>>>>
>>>>> OK, that one is actually pretty straight forward - we rely on cleaning
>>>>> up on exit, but for fatal cases, get_signal() will call do_exit() for us
>>>>> and never return. So we might need a special case in there to deal with
>>>>> that, or some other way of ensuring that fatal signal gets processed
>>>>> correctly for IO threads.
>>>>
>>>> And if (fatal_signal_pending(current)) doesn't prevent get_signal() from 
>>>> being called?
>>>
>>> Ah, we're still in the first get_signal() from SIGSTOP, correct?
>>
>> Yes exactly, we're waiting in there being stopped. So we either need to
>> change the check to something ala:
>>
>> relock:
>> +if (current->flags & PF_IO_WORKER && fatal_signal_pending(current))
>> +return false;
>>
>> to catch it upfront and from the relock case, or add:
>>
>>  fatal:
>> +if (current->flags & PF_IO_WORKER)
>> +return false;
>>
>> to catch it in the fatal section.
> 
> Can you try this? Not crazy about adding a special case, but I don't
> think there's any way around this one. And should be pretty cheap, as
> we're already pulling in ->flags right above anyway.
> 
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 5ad8566534e7..5b75fbe3d2d6 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -2752,6 +2752,15 @@ bool get_signal(struct ksignal *ksig)
>*/
>   current->flags |= PF_SIGNALED;
>  
> + /*
> +  * PF_IO_WORKER threads will catch and exit on fatal signals
> +  * themselves. They have cleanup that must be performed, so
> +  * we cannot call do_exit() on their behalf. coredumps also
> +  * do not apply to them.
> +  */
> + if (current->flags & PF_IO_WORKER)
> + return false;
> +
>   if (sig_kernel_coredump(signr)) {
>   if (print_fatal_signals)
>   print_fatal_signal(ksig->info.si_signo);
> 

I guess not before next week, but if it resolves the problem for you,
I guess it would be good to get this into rc5.

metze


Re: [PATCH 0/6] Allow signals for IO threads

2021-03-26 Thread Stefan Metzmacher


Am 26.03.21 um 15:53 schrieb Jens Axboe:
> On 3/26/21 8:45 AM, Stefan Metzmacher wrote:
>> Am 26.03.21 um 15:43 schrieb Stefan Metzmacher:
>>> Am 26.03.21 um 15:38 schrieb Jens Axboe:
>>>> On 3/26/21 7:59 AM, Jens Axboe wrote:
>>>>> On 3/26/21 7:54 AM, Jens Axboe wrote:
>>>>>>> The KILL after STOP deadlock still exists.
>>>>>>
>>>>>> In which tree? Sounds like you're still on the old one with that
>>>>>> incremental you sent, which wasn't complete.
>>>>>>
>>>>>>> Does io_wq_manager() exit without cleaning up on SIGKILL?
>>>>>>
>>>>>> No, it should clean up in all cases. I'll try your stop + kill, I just
>>>>>> tested both of them separately and didn't observe anything. I also ran
>>>>>> your io_uring-cp example (and found a bug in the example, fixed and
>>>>>> pushed), fwiw.
>>>>>
>>>>> I can reproduce this one! I'll take a closer look.
>>>>
>>>> OK, that one is actually pretty straight forward - we rely on cleaning
>>>> up on exit, but for fatal cases, get_signal() will call do_exit() for us
>>>> and never return. So we might need a special case in there to deal with
>>>> that, or some other way of ensuring that fatal signal gets processed
>>>> correctly for IO threads.
>>>
>>> And if (fatal_signal_pending(current)) doesn't prevent get_signal() from 
>>> being called?
>>
>> Ah, we're still in the first get_signal() from SIGSTOP, correct?
> 
> Yes exactly, we're waiting in there being stopped. So we either need to
> change the check to something ala:
> 
> relock:
> + if (current->flags & PF_IO_WORKER && fatal_signal_pending(current))
> + return false;
> 
> to catch it upfront and from the relock case, or add:
> 
>   fatal:
> + if (current->flags & PF_IO_WORKER)
> + return false;
> 
> to catch it in the fatal section.
> 

Or something like io_uring_files_cancel()

Maybe replace current->pf_io_worker with a generic current->io_thread
structure which has exit hooks, as well as
io_wq_worker_sleeping() and io_wq_worker_running().

Maybe create_io_thread would take such a structure
as argument instead of a single function pointer.

struct io_thread_description {
const char *name;
int (*thread_fn)(struct io_thread_description *);
void (*sleeping_fn)(struct io_thread_description *);
void (*running_fn)(struct io_thread_description *);
void (*exit_fn)(struct io_thread_description *);
};

And then
struct io_wq_manager {
struct io_thread_description description;
... manager specific stuff...
};

metze
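
Cleaned up, the proposal above amounts to something like this sketch (all of
these names are hypothetical; the API was a suggestion in this thread, not
merged code):

struct io_thread_description {
	const char *name;
	int (*thread_fn)(struct io_thread_description *);
	void (*sleeping_fn)(struct io_thread_description *);
	void (*running_fn)(struct io_thread_description *);
	void (*exit_fn)(struct io_thread_description *);
};

/* create_io_thread() would take the description instead of fn/arg ... */
struct task_struct *create_io_thread(struct io_thread_description *desc,
				     int node);

/* ... and each user would embed it, e.g. the io-wq manager: */
struct io_wq_manager {
	struct io_thread_description description;
	/* manager specific stuff */
};

The exit_fn hook is what would replace the PF_IO_WORKER special case in
get_signal() discussed above: the generic code could invoke it instead of
calling do_exit() on behalf of an IO thread.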


Re: [PATCH 2/8] kernel: unmask SIGSTOP for IO threads

2021-03-26 Thread Jens Axboe
On 3/26/21 7:48 AM, Oleg Nesterov wrote:
> Jens, sorry, I got lost :/

Let's bring you back in :-)

> On 03/25, Jens Axboe wrote:
>>
>> With IO threads accepting signals, including SIGSTOP,
> 
> where can I find this change? Looks like I wasn't cc'ed...

It's this very series.

>> unmask the
>> SIGSTOP signal from the default blocked mask.
>>
>> Signed-off-by: Jens Axboe 
>> ---
>>  kernel/fork.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index d3171e8e88e5..d5a40552910f 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -2435,7 +2435,7 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
>>  	tsk = copy_process(NULL, 0, node, &args);
>>  	if (!IS_ERR(tsk)) {
>>  		sigfillset(&tsk->blocked);
>> -		sigdelsetmask(&tsk->blocked, sigmask(SIGKILL));
>> +		sigdelsetmask(&tsk->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP));
> 
> siginitsetinv(blocked, sigmask(SIGKILL)|sigmask(SIGSTOP)) but this is minor.

Ah thanks.

> To remind, either way this is racy and can't really help.
> 
> And if "IO threads accepting signals" then I don't understand why. Sorry,
> I must have missed something.

I do think the above is a no-op at this point, and we can probably just
kill it. Let me double check, hopefully we can just remove this blocked
part.

-- 
Jens Axboe



Re: [PATCH 0/6] Allow signals for IO threads

2021-03-26 Thread Jens Axboe
On 3/26/21 8:53 AM, Jens Axboe wrote:
> On 3/26/21 8:45 AM, Stefan Metzmacher wrote:
>> Am 26.03.21 um 15:43 schrieb Stefan Metzmacher:
>>> Am 26.03.21 um 15:38 schrieb Jens Axboe:
>>>> On 3/26/21 7:59 AM, Jens Axboe wrote:
>>>>> On 3/26/21 7:54 AM, Jens Axboe wrote:
>>>>>>> The KILL after STOP deadlock still exists.
>>>>>>
>>>>>> In which tree? Sounds like you're still on the old one with that
>>>>>> incremental you sent, which wasn't complete.
>>>>>>
>>>>>>> Does io_wq_manager() exit without cleaning up on SIGKILL?
>>>>>>
>>>>>> No, it should clean up in all cases. I'll try your stop + kill, I just
>>>>>> tested both of them separately and didn't observe anything. I also ran
>>>>>> your io_uring-cp example (and found a bug in the example, fixed and
>>>>>> pushed), fwiw.
>>>>>
>>>>> I can reproduce this one! I'll take a closer look.
>>>>
>>>> OK, that one is actually pretty straight forward - we rely on cleaning
>>>> up on exit, but for fatal cases, get_signal() will call do_exit() for us
>>>> and never return. So we might need a special case in there to deal with
>>>> that, or some other way of ensuring that fatal signal gets processed
>>>> correctly for IO threads.
>>>
>>> And if (fatal_signal_pending(current)) doesn't prevent get_signal() from 
>>> being called?
>>
>> Ah, we're still in the first get_signal() from SIGSTOP, correct?
> 
> Yes exactly, we're waiting in there being stopped. So we either need to
> change the check to something ala:
> 
> relock:
> + if (current->flags & PF_IO_WORKER && fatal_signal_pending(current))
> + return false;
> 
> to catch it upfront and from the relock case, or add:
> 
>   fatal:
> + if (current->flags & PF_IO_WORKER)
> + return false;
> 
> to catch it in the fatal section.

Can you try this? Not crazy about adding a special case, but I don't
think there's any way around this one. And should be pretty cheap, as
we're already pulling in ->flags right above anyway.

diff --git a/kernel/signal.c b/kernel/signal.c
index 5ad8566534e7..5b75fbe3d2d6 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2752,6 +2752,15 @@ bool get_signal(struct ksignal *ksig)
 	 */
 	current->flags |= PF_SIGNALED;
 
+	/*
+	 * PF_IO_WORKER threads will catch and exit on fatal signals
+	 * themselves. They have cleanup that must be performed, so
+	 * we cannot call do_exit() on their behalf. coredumps also
+	 * do not apply to them.
+	 */
+	if (current->flags & PF_IO_WORKER)
+		return false;
+
 	if (sig_kernel_coredump(signr)) {
 		if (print_fatal_signals)
 			print_fatal_signal(ksig->info.si_signo);

-- 
Jens Axboe



Re: [PATCH 0/6] Allow signals for IO threads

2021-03-26 Thread Jens Axboe
On 3/26/21 8:45 AM, Stefan Metzmacher wrote:
> Am 26.03.21 um 15:43 schrieb Stefan Metzmacher:
>> Am 26.03.21 um 15:38 schrieb Jens Axboe:
>>> On 3/26/21 7:59 AM, Jens Axboe wrote:
>>>> On 3/26/21 7:54 AM, Jens Axboe wrote:
>>>>>> The KILL after STOP deadlock still exists.
>>>>>
>>>>> In which tree? Sounds like you're still on the old one with that
>>>>> incremental you sent, which wasn't complete.
>>>>>
>>>>>> Does io_wq_manager() exit without cleaning up on SIGKILL?
>>>>>
>>>>> No, it should clean up in all cases. I'll try your stop + kill, I just
>>>>> tested both of them separately and didn't observe anything. I also ran
>>>>> your io_uring-cp example (and found a bug in the example, fixed and
>>>>> pushed), fwiw.
>>>>
>>>> I can reproduce this one! I'll take a closer look.
>>>
>>> OK, that one is actually pretty straight forward - we rely on cleaning
>>> up on exit, but for fatal cases, get_signal() will call do_exit() for us
>>> and never return. So we might need a special case in there to deal with
>>> that, or some other way of ensuring that fatal signal gets processed
>>> correctly for IO threads.
>>
>> And if (fatal_signal_pending(current)) doesn't prevent get_signal() from 
>> being called?
> 
> Ah, we're still in the first get_signal() from SIGSTOP, correct?

Yes exactly, we're waiting in there being stopped. So we either need to
change the check to something ala:

relock:
+   if (current->flags & PF_IO_WORKER && fatal_signal_pending(current))
+   return false;

to catch it upfront and from the relock case, or add:

fatal:
+   if (current->flags & PF_IO_WORKER)
+   return false;

to catch it in the fatal section.

-- 
Jens Axboe



Re: [PATCH 0/6] Allow signals for IO threads

2021-03-26 Thread Jens Axboe
On 3/26/21 8:43 AM, Stefan Metzmacher wrote:
> Am 26.03.21 um 15:38 schrieb Jens Axboe:
>> On 3/26/21 7:59 AM, Jens Axboe wrote:
>>> On 3/26/21 7:54 AM, Jens Axboe wrote:
>>>>> The KILL after STOP deadlock still exists.
>>>>
>>>> In which tree? Sounds like you're still on the old one with that
>>>> incremental you sent, which wasn't complete.
>>>>
>>>>> Does io_wq_manager() exit without cleaning up on SIGKILL?
>>>>
>>>> No, it should clean up in all cases. I'll try your stop + kill, I just
>>>> tested both of them separately and didn't observe anything. I also ran
>>>> your io_uring-cp example (and found a bug in the example, fixed and
>>>> pushed), fwiw.
>>>
>>> I can reproduce this one! I'll take a closer look.
>>
>> OK, that one is actually pretty straight forward - we rely on cleaning
>> up on exit, but for fatal cases, get_signal() will call do_exit() for us
>> and never return. So we might need a special case in there to deal with
>> that, or some other way of ensuring that fatal signal gets processed
>> correctly for IO threads.
> 
> And if (fatal_signal_pending(current)) doesn't prevent get_signal()
> from being called?

Usually yes, but this case is first doing SIGSTOP, so we're waiting in
get_signal() -> do_signal_stop() when the SIGKILL arrives. Hence there's
no way to catch it in the workers themselves.

-- 
Jens Axboe
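
The failing sequence is easy to drive from userspace against a running
io_uring program; something like this minimal driver (a sketch: the tid
argument is whichever iou-wrk/iou-mgr/io_uring-cp task id you pick from
/proc, as in Stefan's report):

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	pid_t tid;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <tid>\n", argv[0]);
		return 1;
	}
	tid = (pid_t)atoi(argv[1]);

	/* parks the thread group in get_signal() -> do_signal_stop() */
	if (kill(tid, SIGSTOP) < 0) {
		perror("kill(SIGSTOP)");
		return 1;
	}
	sleep(1);	/* give the tasks time to actually stop */

	/* the fatal signal now arrives while the tasks sit stopped */
	if (kill(tid, SIGKILL) < 0) {
		perror("kill(SIGKILL)");
		return 1;
	}
	return 0;
}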



Re: [PATCH 0/6] Allow signals for IO threads

2021-03-26 Thread Stefan Metzmacher
Am 26.03.21 um 15:43 schrieb Stefan Metzmacher:
> Am 26.03.21 um 15:38 schrieb Jens Axboe:
>> On 3/26/21 7:59 AM, Jens Axboe wrote:
>>> On 3/26/21 7:54 AM, Jens Axboe wrote:
>>>>> The KILL after STOP deadlock still exists.
>>>>
>>>> In which tree? Sounds like you're still on the old one with that
>>>> incremental you sent, which wasn't complete.
>>>>
>>>>> Does io_wq_manager() exit without cleaning up on SIGKILL?
>>>>
>>>> No, it should clean up in all cases. I'll try your stop + kill, I just
>>>> tested both of them separately and didn't observe anything. I also ran
>>>> your io_uring-cp example (and found a bug in the example, fixed and
>>>> pushed), fwiw.
>>>
>>> I can reproduce this one! I'll take a closer look.
>>
>> OK, that one is actually pretty straight forward - we rely on cleaning
>> up on exit, but for fatal cases, get_signal() will call do_exit() for us
>> and never return. So we might need a special case in there to deal with
>> that, or some other way of ensuring that fatal signal gets processed
>> correctly for IO threads.
> 
> And if (fatal_signal_pending(current)) doesn't prevent get_signal() from 
> being called?

Ah, we're still in the first get_signal() from SIGSTOP, correct?

metze



Re: [PATCH 0/6] Allow signals for IO threads

2021-03-26 Thread Stefan Metzmacher
Am 26.03.21 um 15:38 schrieb Jens Axboe:
> On 3/26/21 7:59 AM, Jens Axboe wrote:
>> On 3/26/21 7:54 AM, Jens Axboe wrote:
>>>> The KILL after STOP deadlock still exists.
>>>
>>> In which tree? Sounds like you're still on the old one with that
>>> incremental you sent, which wasn't complete.
>>>
>>>> Does io_wq_manager() exit without cleaning up on SIGKILL?
>>>
>>> No, it should clean up in all cases. I'll try your stop + kill, I just
>>> tested both of them separately and didn't observe anything. I also ran
>>> your io_uring-cp example (and found a bug in the example, fixed and
>>> pushed), fwiw.
>>
>> I can reproduce this one! I'll take a closer look.
> 
> OK, that one is actually pretty straight forward - we rely on cleaning
> up on exit, but for fatal cases, get_signal() will call do_exit() for us
> and never return. So we might need a special case in there to deal with
> that, or some other way of ensuring that fatal signal gets processed
> correctly for IO threads.

And if (fatal_signal_pending(current)) doesn't prevent get_signal() from being 
called?

metze



Re: [PATCH 0/6] Allow signals for IO threads

2021-03-26 Thread Jens Axboe
On 3/26/21 7:59 AM, Jens Axboe wrote:
> On 3/26/21 7:54 AM, Jens Axboe wrote:
>>> The KILL after STOP deadlock still exists.
>>
>> In which tree? Sounds like you're still on the old one with that
>> incremental you sent, which wasn't complete.
>>
>>> Does io_wq_manager() exit without cleaning up on SIGKILL?
>>
>> No, it should clean up in all cases. I'll try your stop + kill, I just
>> tested both of them separately and didn't observe anything. I also ran
>> your io_uring-cp example (and found a bug in the example, fixed and
>> pushed), fwiw.
> 
> I can reproduce this one! I'll take a closer look.

OK, that one is actually pretty straight forward - we rely on cleaning
up on exit, but for fatal cases, get_signal() will call do_exit() for us
and never return. So we might need a special case in there to deal with
that, or some other way of ensuring that fatal signal gets processed
correctly for IO threads.

-- 
Jens Axboe



Re: [PATCH 0/6] Allow signals for IO threads

2021-03-26 Thread Jens Axboe
On 3/26/21 7:54 AM, Jens Axboe wrote:
>> The KILL after STOP deadlock still exists.
> 
> In which tree? Sounds like you're still on the old one with that
> incremental you sent, which wasn't complete.
> 
>> Does io_wq_manager() exit without cleaning up on SIGKILL?
> 
> No, it should clean up in all cases. I'll try your stop + kill, I just
> tested both of them separately and didn't observe anything. I also ran
> your io_uring-cp example (and found a bug in the example, fixed and
> pushed), fwiw.

I can reproduce this one! I'll take a closer look.

-- 
Jens Axboe



Re: [PATCH 0/6] Allow signals for IO threads

2021-03-26 Thread Jens Axboe
On 3/26/21 7:31 AM, Stefan Metzmacher wrote:
> Am 26.03.21 um 13:56 schrieb Jens Axboe:
>> On 3/26/21 5:48 AM, Stefan Metzmacher wrote:
>>>
>>> Am 26.03.21 um 01:39 schrieb Jens Axboe:
>>>> Hi,
>>>>
>>>> As discussed in a previous thread today, the seemingly much saner approach
>>>> is just to allow signals (including SIGSTOP) for the PF_IO_WORKER IO
>>>> threads. If we just have the threads call get_signal() for
>>>> signal_pending(), then everything just falls out naturally with how
>>>> we receive and handle signals.
>>>>
>>>> Patch 1 adds support for checking and calling get_signal() from the
>>>> regular IO workers, the manager, and the SQPOLL thread. Patch 2 unblocks
>>>> SIGSTOP from the default IO thread blocked mask, and the rest just revert
>>>> special cases that were put in place for PF_IO_WORKER threads.
>>>>
>>>> With this done, only two special cases remain for PF_IO_WORKER, and they
>>>> aren't related to signals so not part of this patchset. But both of them
>>>> can go away as well now that we have "real" threads as IO workers, and
>>>> then we'll have zero special cases for PF_IO_WORKER.
>>>>
>>>> This passes the usual regression testing, my other usual 24h run has been
>>>> kicked off. But I wanted to send this out early.
>>>>
>>>> Thanks to Linus for the suggestion. As with most other good ideas, it's
>>>> obvious once you hear it. The fact that we end up with _zero_ special
>>>> cases with this is a clear sign that this is the right way to do it
>>>> indeed. The fact that this series is 2/3rds revert further drives that
>>>> point home. Also thanks to Eric for diligent review on the signal side
>>>> of things for the past changes (and hopefully ditto on this series :-))
>>>
>>> Ok, I'm testing a8ff6a3b20bd16d071ef66824ae4428529d114f9 from
>>> your io_uring-5.12 branch.
>>>
>>> And using this patch:
>>> diff --git a/examples/io_uring-cp.c b/examples/io_uring-cp.c
>>> index cc7a227a5ec7..6e26a4214015 100644
>>> --- a/examples/io_uring-cp.c
>>> +++ b/examples/io_uring-cp.c
>>> @@ -116,13 +116,16 @@ static void queue_write(struct io_uring *ring, struct io_data *data)
>>> io_uring_submit(ring);
>>>  }
>>>
>>> -static int copy_file(struct io_uring *ring, off_t insize)
>>> +static int copy_file(struct io_uring *ring, off_t _insize)
>>>  {
>>> +   off_t insize = _insize;
>>> unsigned long reads, writes;
>>> struct io_uring_cqe *cqe;
>>> off_t write_left, offset;
>>> int ret;
>>>
>>> +again:
>>> +   insize = _insize;
>>> write_left = insize;
>>> writes = reads = offset = 0;
>>>
>>> @@ -221,6 +224,12 @@ static int copy_file(struct io_uring *ring, off_t insize)
>>> }
>>> }
>>>
>>> +   {
>>> +   struct timespec ts = { .tv_nsec = 99, };
>>> +   nanosleep(&ts, NULL);
>>> +   goto again;
>>> +   }
>>> +
>>> return 0;
>>>  }
>>>
>>> Running ./io_uring-cp ~/linux-image-5.12.0-rc2+-dbg_5.12.0-rc2+-5_amd64.deb file
>>> What I see is this:
>>>
>>> kill -SIGSTOP to any thread, I used a worker with pid 2061 here, results in
>>>
>>> root@ub1704-166:~# head /proc/2061/status
>>> Name:   iou-wrk-2041
>>> Umask:  0022
>>> State:  R (running)
>>> Tgid:   2041
>>> Ngid:   0
>>> Pid:2061
>>> PPid:   1857
>>> TracerPid:  0
>>> Uid:0   0   0   0
>>> Gid:0   0   0   0
>>> root@ub1704-166:~# head /proc/2041/status
>>> Name:   io_uring-cp
>>> Umask:  0022
>>> State:  T (stopped)
>>> Tgid:   2041
>>> Ngid:   0
>>> Pid:2041
>>> PPid:   1857
>>> TracerPid:  0
>>> Uid:0   0   0   0
>>> Gid:0   0   0   0
>>> root@ub1704-166:~# head /proc/2042/status
>>> Name:   iou-mgr-2041
>>> Umask:  0022
>>> State:  T (stopped)
>>> Tgid:   2041
>>> Ngid:   0
>>> Pid:2042
>>> PPid:   1857
>>> TracerPid:  0
>>> Uid:0   0

Re: [PATCH 2/8] kernel: unmask SIGSTOP for IO threads

2021-03-26 Thread Oleg Nesterov
Jens, sorry, I got lost :/

On 03/25, Jens Axboe wrote:
>
> With IO threads accepting signals, including SIGSTOP,

where can I find this change? Looks like I wasn't cc'ed...

> unmask the
> SIGSTOP signal from the default blocked mask.
> 
> Signed-off-by: Jens Axboe 
> ---
>  kernel/fork.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/fork.c b/kernel/fork.c
> index d3171e8e88e5..d5a40552910f 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2435,7 +2435,7 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
>  	tsk = copy_process(NULL, 0, node, &args);
>  	if (!IS_ERR(tsk)) {
>  		sigfillset(&tsk->blocked);
> -		sigdelsetmask(&tsk->blocked, sigmask(SIGKILL));
> +		sigdelsetmask(&tsk->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP));

siginitsetinv(blocked, sigmask(SIGKILL)|sigmask(SIGSTOP)) but this is minor.

To remind, either way this is racy and can't really help.

And if "IO threads accepting signals" then I don't understand why. Sorry,
I must have missed something.

Oleg.



Re: [PATCH 0/6] Allow signals for IO threads

2021-03-26 Thread Stefan Metzmacher
Am 26.03.21 um 13:56 schrieb Jens Axboe:
> On 3/26/21 5:48 AM, Stefan Metzmacher wrote:
>>
>> Am 26.03.21 um 01:39 schrieb Jens Axboe:
>>> Hi,
>>>
>>> As discussed in a previous thread today, the seemingly much saner approach
>>> is just to allow signals (including SIGSTOP) for the PF_IO_WORKER IO
>>> threads. If we just have the threads call get_signal() for
>>> signal_pending(), then everything just falls out naturally with how
>>> we receive and handle signals.
>>>
>>> Patch 1 adds support for checking and calling get_signal() from the
>>> regular IO workers, the manager, and the SQPOLL thread. Patch 2 unblocks
>>> SIGSTOP from the default IO thread blocked mask, and the rest just revert
>>> special cases that were put in place for PF_IO_WORKER threads.
>>>
>>> With this done, only two special cases remain for PF_IO_WORKER, and they
>>> aren't related to signals so not part of this patchset. But both of them
>>> can go away as well now that we have "real" threads as IO workers, and
>>> then we'll have zero special cases for PF_IO_WORKER.
>>>
>>> This passes the usual regression testing, my other usual 24h run has been
>>> kicked off. But I wanted to send this out early.
>>>
>>> Thanks to Linus for the suggestion. As with most other good ideas, it's
>>> obvious once you hear it. The fact that we end up with _zero_ special
>>> cases with this is a clear sign that this is the right way to do it
>>> indeed. The fact that this series is 2/3rds revert further drives that
>>> point home. Also thanks to Eric for diligent review on the signal side
>>> of things for the past changes (and hopefully ditto on this series :-))
>>
>> Ok, I'm testing a8ff6a3b20bd16d071ef66824ae4428529d114f9 from
>> your io_uring-5.12 branch.
>>
>> And using this patch:
>> diff --git a/examples/io_uring-cp.c b/examples/io_uring-cp.c
>> index cc7a227a5ec7..6e26a4214015 100644
>> --- a/examples/io_uring-cp.c
>> +++ b/examples/io_uring-cp.c
>> @@ -116,13 +116,16 @@ static void queue_write(struct io_uring *ring, struct io_data *data)
>> io_uring_submit(ring);
>>  }
>>
>> -static int copy_file(struct io_uring *ring, off_t insize)
>> +static int copy_file(struct io_uring *ring, off_t _insize)
>>  {
>> +   off_t insize = _insize;
>> unsigned long reads, writes;
>> struct io_uring_cqe *cqe;
>> off_t write_left, offset;
>> int ret;
>>
>> +again:
>> +   insize = _insize;
>> write_left = insize;
>> writes = reads = offset = 0;
>>
>> @@ -221,6 +224,12 @@ static int copy_file(struct io_uring *ring, off_t insize)
>> }
>> }
>>
>> +   {
>> +   struct timespec ts = { .tv_nsec = 99, };
>> +   nanosleep(&ts, NULL);
>> +   goto again;
>> +   }
>> +
>> return 0;
>>  }
>>
>> Running ./io_uring-cp ~/linux-image-5.12.0-rc2+-dbg_5.12.0-rc2+-5_amd64.deb file
>> What I see is this:
>>
>> kill -SIGSTOP to any thread, I used a worker with pid 2061 here, results in
>>
>> root@ub1704-166:~# head /proc/2061/status
>> Name:   iou-wrk-2041
>> Umask:  0022
>> State:  R (running)
>> Tgid:   2041
>> Ngid:   0
>> Pid:2061
>> PPid:   1857
>> TracerPid:  0
>> Uid:0   0   0   0
>> Gid:0   0   0   0
>> root@ub1704-166:~# head /proc/2041/status
>> Name:   io_uring-cp
>> Umask:  0022
>> State:  T (stopped)
>> Tgid:   2041
>> Ngid:   0
>> Pid:2041
>> PPid:   1857
>> TracerPid:  0
>> Uid:0   0   0   0
>> Gid:0   0   0   0
>> root@ub1704-166:~# head /proc/2042/status
>> Name:   iou-mgr-2041
>> Umask:  0022
>> State:  T (stopped)
>> Tgid:   2041
>> Ngid:   0
>> Pid:2042
>> PPid:   1857
>> TracerPid:  0
>> Uid:0   0   0   0
>> Gid:0   0   0   0
>>
>> So userspace and iou-mgr-2041 stop, but the workers don't.
>> 49 workers burn cpu as much as possible.
>>
>> kill -KILL 2061
>> results in this:
>> - all workers are gone
>> - iou-mgr-2041 is gone
>> - io_uring-cp waits in status D forever
>>
>> root@ub1704-166:~# head /proc/2041/status
>> Name:   io_uring-cp
>> Umask:  0022
>> State:  D (disk sleep)

Re: [PATCH 0/6] Allow signals for IO threads

2021-03-26 Thread Jens Axboe
On 3/26/21 5:48 AM, Stefan Metzmacher wrote:
> 
> Am 26.03.21 um 01:39 schrieb Jens Axboe:
>> Hi,
>>
>> As discussed in a previous thread today, the seemingly much saner approach
>> is just to allow signals (including SIGSTOP) for the PF_IO_WORKER IO
>> threads. If we just have the threads call get_signal() for
>> signal_pending(), then everything just falls out naturally with how
>> we receive and handle signals.
>>
>> Patch 1 adds support for checking and calling get_signal() from the
>> regular IO workers, the manager, and the SQPOLL thread. Patch 2 unblocks
>> SIGSTOP from the default IO thread blocked mask, and the rest just revert
>> special cases that were put in place for PF_IO_WORKER threads.
>>
>> With this done, only two special cases remain for PF_IO_WORKER, and they
>> aren't related to signals so not part of this patchset. But both of them
>> can go away as well now that we have "real" threads as IO workers, and
>> then we'll have zero special cases for PF_IO_WORKER.
>>
>> This passes the usual regression testing, my other usual 24h run has been
>> kicked off. But I wanted to send this out early.
>>
>> Thanks to Linus for the suggestion. As with most other good ideas, it's
>> obvious once you hear it. The fact that we end up with _zero_ special
>> cases with this is a clear sign that this is the right way to do it
>> indeed. The fact that this series is 2/3rds revert further drives that
>> point home. Also thanks to Eric for diligent review on the signal side
>> of things for the past changes (and hopefully ditto on this series :-))
> 
> Ok, I'm testing a8ff6a3b20bd16d071ef66824ae4428529d114f9 from
> your io_uring-5.12 branch.
> 
> And using this patch:
> diff --git a/examples/io_uring-cp.c b/examples/io_uring-cp.c
> index cc7a227a5ec7..6e26a4214015 100644
> --- a/examples/io_uring-cp.c
> +++ b/examples/io_uring-cp.c
> @@ -116,13 +116,16 @@ static void queue_write(struct io_uring *ring, struct io_data *data)
> io_uring_submit(ring);
>  }
> 
> -static int copy_file(struct io_uring *ring, off_t insize)
> +static int copy_file(struct io_uring *ring, off_t _insize)
>  {
> +   off_t insize = _insize;
> unsigned long reads, writes;
> struct io_uring_cqe *cqe;
> off_t write_left, offset;
> int ret;
> 
> +again:
> +   insize = _insize;
> write_left = insize;
> writes = reads = offset = 0;
> 
> @@ -221,6 +224,12 @@ static int copy_file(struct io_uring *ring, off_t insize)
> }
> }
> 
> +   {
> +   struct timespec ts = { .tv_nsec = 99, };
> +   nanosleep(&ts, NULL);
> +   goto again;
> +   }
> +
> return 0;
>  }
> 
> Running ./io_uring-cp ~/linux-image-5.12.0-rc2+-dbg_5.12.0-rc2+-5_amd64.deb file
> What I see is this:
> 
> kill -SIGSTOP to any thread, I used a worker with pid 2061 here, results in
> 
> root@ub1704-166:~# head /proc/2061/status
> Name:   iou-wrk-2041
> Umask:  0022
> State:  R (running)
> Tgid:   2041
> Ngid:   0
> Pid:2061
> PPid:   1857
> TracerPid:  0
> Uid:0   0   0   0
> Gid:0   0   0   0
> root@ub1704-166:~# head /proc/2041/status
> Name:   io_uring-cp
> Umask:  0022
> State:  T (stopped)
> Tgid:   2041
> Ngid:   0
> Pid:2041
> PPid:   1857
> TracerPid:  0
> Uid:0   0   0   0
> Gid:0   0   0   0
> root@ub1704-166:~# head /proc/2042/status
> Name:   iou-mgr-2041
> Umask:  0022
> State:  T (stopped)
> Tgid:   2041
> Ngid:   0
> Pid:2042
> PPid:   1857
> TracerPid:  0
> Uid:0   0   0   0
> Gid:0   0   0   0
> 
> So userspace and iou-mgr-2041 stop, but the workers don't.
> 49 workers burn cpu as much as possible.
> 
> kill -KILL 2061
> results in this:
> - all workers are gone
> - iou-mgr-2041 is gone
> - io_uring-cp waits in status D forever
> 
> root@ub1704-166:~# head /proc/2041/status
> Name:   io_uring-cp
> Umask:  0022
> State:  D (disk sleep)
> Tgid:   2041
> Ngid:   0
> Pid:2041
> PPid:   1857
> TracerPid:  0
> Uid:0   0   0   0
> Gid:0   0   0   0
> root@ub1704-166:~# cat /proc/2041/stack
> [<0>] io_wq_destroy_manager+0x36/0xa0
> [<0>] io_wq_put_and_exit+0x2b/0x40
> [<0>] io_uring_clean_tctx+0xc5/0x110
> [<0>] __io_uring_files_cancel+0x336/0x4e0
> [<0>] do_exit+0x16b/0x13b0
> [<0>] do_group_exit+0x8b/0x140
> [<0>] get_signal+0x219/0xc90

[PATCH 6/8] Revert "signal: don't allow STOP on PF_IO_WORKER threads"

2021-03-25 Thread Jens Axboe
This reverts commit 4db4b1a0d1779dc159f7b87feb97030ec0b12597.

The IO threads allow and handle SIGSTOP now, so don't special case them
anymore in task_set_jobctl_pending().

Signed-off-by: Jens Axboe 
---
 kernel/signal.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index 8ce96078cb76..5ad8566534e7 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -288,8 +288,7 @@ bool task_set_jobctl_pending(struct task_struct *task, 
unsigned long mask)
JOBCTL_STOP_SIGMASK | JOBCTL_TRAPPING));
BUG_ON((mask & JOBCTL_TRAPPING) && !(mask & JOBCTL_PENDING_MASK));
 
-   if (unlikely(fatal_signal_pending(task) ||
-(task->flags & (PF_EXITING | PF_IO_WORKER))))
+   if (unlikely(fatal_signal_pending(task) || (task->flags & PF_EXITING)))
return false;
 
if (mask & JOBCTL_STOP_SIGMASK)
-- 
2.31.0



[PATCH 3/8] Revert "signal: don't allow sending any signals to PF_IO_WORKER threads"

2021-03-25 Thread Jens Axboe
This reverts commit 5be28c8f85ce99ed2d329d2ad8bdd18ea19473a5.

IO threads now take signals just fine, so there's no reason to limit them
specifically. Revert the change that prevented that from happening.

Signed-off-by: Jens Axboe 
---
 kernel/signal.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index f2a1b898da29..cb9acdfb32fa 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -834,9 +834,6 @@ static int check_kill_permission(int sig, struct 
kernel_siginfo *info,
 
if (!valid_signal(sig))
return -EINVAL;
-   /* PF_IO_WORKER threads don't take any signals */
-   if (t->flags & PF_IO_WORKER)
-   return -ESRCH;
 
if (!si_fromuser(info))
return 0;
-- 
2.31.0



[PATCH 1/8] io_uring: handle signals for IO threads like a normal thread

2021-03-25 Thread Jens Axboe
We go through various hoops to disallow signals for the IO threads, but
there's really no reason why we cannot just allow them. The IO threads
never return to userspace like a normal thread, and hence don't go through
normal signal processing. Instead, just check for a pending signal as part
of the work loop, and call get_signal() to handle it for us if anything
is pending.

With that, we can support receiving signals, including special ones like
SIGSTOP.

Signed-off-by: Jens Axboe 
---
 fs/io-wq.c| 21 +
 fs/io_uring.c | 10 --
 2 files changed, 25 insertions(+), 6 deletions(-)

diff --git a/fs/io-wq.c b/fs/io-wq.c
index b7c1fa932cb3..2dbdc552f3ba 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -505,8 +505,14 @@ static int io_wqe_worker(void *data)
ret = schedule_timeout(WORKER_IDLE_TIMEOUT);
if (try_to_freeze() || ret)
continue;
-   if (fatal_signal_pending(current))
-   break;
+   if (signal_pending(current)) {
+   struct ksignal ksig;
+
+   if (fatal_signal_pending(current))
+   break;
+   get_signal(&ksig);
+   continue;
+   }
/* timed out, exit unless we're the fixed worker */
if (test_bit(IO_WQ_BIT_EXIT, &wq->state) ||
!(worker->flags & IO_WORKER_F_FIXED))
@@ -715,8 +721,15 @@ static int io_wq_manager(void *data)
io_wq_check_workers(wq);
schedule_timeout(HZ);
try_to_freeze();
-   if (fatal_signal_pending(current))
-   set_bit(IO_WQ_BIT_EXIT, &wq->state);
+   if (signal_pending(current)) {
+   struct ksignal ksig;
+
+   if (fatal_signal_pending(current))
+   set_bit(IO_WQ_BIT_EXIT, &wq->state);
+   else
+   get_signal(&ksig);
+   continue;
+   }
} while (!test_bit(IO_WQ_BIT_EXIT, &wq->state));
 
io_wq_check_workers(wq);
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 54ea561db4a5..3a9d021db328 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -6765,8 +6765,14 @@ static int io_sq_thread(void *data)
timeout = jiffies + sqd->sq_thread_idle;
continue;
}
-   if (fatal_signal_pending(current))
-   break;
+   if (signal_pending(current)) {
+   struct ksignal ksig;
+
+   if (fatal_signal_pending(current))
+   break;
+   get_signal(&ksig);
+   continue;
+   }
sqt_spin = false;
cap_entries = !list_is_singular(&sqd->ctx_list);
list_for_each_entry(ctx, &sqd->ctx_list, sqd_list) {
-- 
2.31.0



[PATCH 0/6] Allow signals for IO threads

2021-03-25 Thread Jens Axboe
Hi,

As discussed in a previous thread today, the seemingly much saner approach
is just to allow signals (including SIGSTOP) for the PF_IO_WORKER IO
threads. If we just have the threads call get_signal() for
signal_pending(), then everything just falls out naturally with how
we receive and handle signals.

Patch 1 adds support for checking and calling get_signal() from the
regular IO workers, the manager, and the SQPOLL thread. Patch 2 unblocks
SIGSTOP from the default IO thread blocked mask, and the rest just revert
special cases that were put in place for PF_IO_WORKER threads.

With this done, only two special cases remain for PF_IO_WORKER, and they
aren't related to signals so not part of this patchset. But both of them
can go away as well now that we have "real" threads as IO workers, and
then we'll have zero special cases for PF_IO_WORKER.

This passes the usual regression testing, my other usual 24h run has been
kicked off. But I wanted to send this out early.

Thanks to Linus for the suggestion. As with most other good ideas, it's
obvious once you hear it. The fact that we end up with _zero_ special
cases with this is a clear sign that this is the right way to do it
indeed. The fact that this series is 2/3rds revert further drives that
point home. Also thanks to Eric for diligent review on the signal side
of things for the past changes (and hopefully ditto on this series :-))

-- 
Jens Axboe
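
For reference, a minimal sketch of the loop shape this series gives the IO
threads, reassembled from patch 1/8 elsewhere in this thread. do_work() and
should_exit() are placeholders, not real kernel symbols; the real loops in
io_wqe_worker(), io_wq_manager() and io_sq_thread() each carry their own
bookkeeping around this core:

    static int io_thread_loop(void *data)
    {
        while (!should_exit()) {
            do_work(data);

            if (signal_pending(current)) {
                struct ksignal ksig;

                if (fatal_signal_pending(current))
                    break;          /* die like any other thread */
                get_signal(&ksig);  /* handles SIGSTOP, ignored signals, ... */
                continue;
            }
        }
        return 0;
    }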




[PATCH 2/8] kernel: unmask SIGSTOP for IO threads

2021-03-25 Thread Jens Axboe
With IO threads accepting signals, including SIGSTOP, unmask the
SIGSTOP signal from the default blocked mask.

Signed-off-by: Jens Axboe 
---
 kernel/fork.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index d3171e8e88e5..d5a40552910f 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2435,7 +2435,7 @@ struct task_struct *create_io_thread(int (*fn)(void *), 
void *arg, int node)
tsk = copy_process(NULL, 0, node, &args);
if (!IS_ERR(tsk)) {
sigfillset(&tsk->blocked);
-   sigdelsetmask(&tsk->blocked, sigmask(SIGKILL));
+   sigdelsetmask(&tsk->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP));
}
return tsk;
 }
-- 
2.31.0



[PATCH 2/2] proc: don't show PF_IO_WORKER threads as threads in /proc/<pid>/task/

2021-03-25 Thread Jens Axboe
We don't allow SIGSTOP and ptrace attach to these threads, and that
confuses applications like gdb that assume they can attach to any thread
listed in /proc/<pid>/task/. gdb then enters an infinite loop of retrying
attach, even though it fails with the same error (-EPERM) every time.

Skip over PF_IO_WORKER threads in the proc task setup. We can't just
terminate the iteration when we find a PF_IO_WORKER thread, as there's no real
ordering here. It's perfectly feasible to have the first thread be an
IO worker, and then a real thread after that. Hence just implement the
skip.

Reported-by: Stefan Metzmacher 
Signed-off-by: Jens Axboe 
---
 fs/proc/base.c | 22 ++
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 3851bfcdba56..abff2fe10bfa 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3723,7 +3723,7 @@ static struct task_struct *first_tid(struct pid *pid, int 
tid, loff_t f_pos,
 */
pos = task = task->group_leader;
do {
-   if (!nr--)
+   if (same_thread_group(task, pos) && !nr--)
goto found;
} while_each_thread(task, pos);
 fail:
@@ -3744,16 +3744,22 @@ static struct task_struct *first_tid(struct pid *pid, 
int tid, loff_t f_pos,
  */
 static struct task_struct *next_tid(struct task_struct *start)
 {
-   struct task_struct *pos = NULL;
+   struct task_struct *tmp, *pos = NULL;
+
rcu_read_lock();
-   if (pid_alive(start)) {
-   pos = next_thread(start);
-   if (thread_group_leader(pos))
-   pos = NULL;
-   else
-   get_task_struct(pos);
+   if (!pid_alive(start))
+   goto no_thread;
list_for_each_entry_rcu(tmp, &start->thread_group, thread_group) {
+   if (!thread_group_leader(tmp) && same_thread_group(start, tmp)) {
+   get_task_struct(tmp);
+   pos = tmp;
+   break;
+   }
}
+no_thread:
rcu_read_unlock();
+   if (!pos)
+   return NULL;
put_task_struct(start);
return pos;
 }
-- 
2.31.0
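
As a hypothetical illustration (not part of the patch): roughly how a
debugger-style tool enumerates /proc/<pid>/task/, i.e. the listing the patch
above filters. With the patch applied, the PF_IO_WORKER tids simply no longer
show up here, so gdb has nothing to retry attaching to:

    #include <dirent.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        char path[64];
        struct dirent *de;
        DIR *dir;

        if (argc != 2) {
            fprintf(stderr, "usage: %s <pid>\n", argv[0]);
            return 1;
        }
        snprintf(path, sizeof(path), "/proc/%s/task", argv[1]);
        dir = opendir(path);
        if (!dir) {
            perror("opendir");
            return 1;
        }
        /* Each directory entry is one tid the tracer may try to attach to. */
        while ((de = readdir(dir)) != NULL) {
            if (de->d_name[0] != '.')
                printf("tid %s\n", de->d_name);
        }
        closedir(dir);
        return 0;
    }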



Re: [PATCH AUTOSEL 5.11 43/44] signal: don't allow STOP on PF_IO_WORKER threads

2021-03-25 Thread Jens Axboe
On 3/25/21 8:02 AM, Jens Axboe wrote:
> On 3/25/21 7:56 AM, Stefan Metzmacher wrote:
>> Am 25.03.21 um 14:38 schrieb Jens Axboe:
>>> On 3/25/21 6:11 AM, Stefan Metzmacher wrote:
>>>>
>>>> Am 25.03.21 um 13:04 schrieb Eric W. Biederman:
>>>>> Stefan Metzmacher  writes:
>>>>>
>>>>>> Am 25.03.21 um 12:24 schrieb Sasha Levin:
>>>>>>> From: "Eric W. Biederman" 
>>>>>>>
>>>>>>> [ Upstream commit 4db4b1a0d1779dc159f7b87feb97030ec0b12597 ]
>>>>>>>
>>>>>>> Just like we don't allow normal signals to IO threads, don't deliver a
>>>>>>> STOP to a task that has PF_IO_WORKER set. The IO threads don't take
>>>>>>> signals in general, and have no means of flushing out a stop either.
>>>>>>>
>>>>>>> Longer term, we may want to look into allowing stop of these threads,
>>>>>>> as it relates to eg process freezing. For now, this prevents a spin
>>>>>>> issue if a SIGSTOP is delivered to the parent task.
>>>>>>>
>>>>>>> Reported-by: Stefan Metzmacher 
>>>>>>> Signed-off-by: Jens Axboe 
>>>>>>> Signed-off-by: "Eric W. Biederman" 
>>>>>>> Signed-off-by: Sasha Levin 
>>>>>>> ---
>>>>>>>  kernel/signal.c | 3 ++-
>>>>>>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/kernel/signal.c b/kernel/signal.c
>>>>>>> index 55526b941011..00a3840f6037 100644
>>>>>>> --- a/kernel/signal.c
>>>>>>> +++ b/kernel/signal.c
>>>>>>> @@ -288,7 +288,8 @@ bool task_set_jobctl_pending(struct task_struct 
>>>>>>> *task, unsigned long mask)
>>>>>>> JOBCTL_STOP_SIGMASK | JOBCTL_TRAPPING));
>>>>>>> BUG_ON((mask & JOBCTL_TRAPPING) && !(mask & 
>>>>>>> JOBCTL_PENDING_MASK));
>>>>>>>  
>>>>>>> -   if (unlikely(fatal_signal_pending(task) || (task->flags & 
>>>>>>> PF_EXITING)))
>>>>>>> +   if (unlikely(fatal_signal_pending(task) ||
>>>>>>> +(task->flags & (PF_EXITING | PF_IO_WORKER))))
>>>>>>> return false;
>>>>>>>  
>>>>>>> if (mask & JOBCTL_STOP_SIGMASK)
>>>>>>>
>>>>>>
>>>>>> Again, why is this proposed for 5.11 and 5.10 already?
>>>>>
>>>>> Has the bit about the io worker kthreads been backported?
>>>>> If so this isn't horrible.  If not this is nonsense.
>>>
>>> No not yet - my plan is to do that, but not until we're 100% satisfied
>>> with it.
>>
>> Do you understand why the patches were autoselected for 5.11 and 5.10?
> 
> As far as I know, selections like these (AUTOSEL) are done by some bot
> that uses whatever criteria to see if they should be applied for earlier
> revisions. I'm sure Sasha can expand on that :-)
> 
> Hence it's reasonable to expect that sometimes it'll pick patches that
> should not go into stable, at least not just yet. It's important to
> understand that this message is just a notice that it's queued up for
> stable -rc, not that it's _in_ stable just yet. There's time to object.
> 
>>>> I don't know, I hope not...
>>>>
>>>> But I just tested v5.12-rc4 and attaching to
>>>> an application with iothreads with gdb is still not possible,
>>>> it still loops forever trying to attach to the iothreads.
>>>
>>> I do see the looping, gdb apparently doesn't give up when it gets
>>> -EPERM trying to attach to the threads. Which isn't really a kernel
>>> thing, but:
>>
>> Maybe we need to remove the iothreads from /proc/pid/tasks/
> 
> Is that how it finds them? It's arguably a bug in gdb that it just
> keeps retrying, but it would be nice if we can ensure that it just
> ignores them. Because if gdb triggers something like that, probably
> others too...
> 
>>>> And I tested 'kill -9 $pidofiothread', and it froze the whole
>>>> machine...
>>>
>>> that sounds very strange, I haven't seen anything like that running
>>> the exact same scenario.
>>>
>>>> So there's 

Re: [PATCH AUTOSEL 5.11 43/44] signal: don't allow STOP on PF_IO_WORKER threads

2021-03-25 Thread Sasha Levin

On Thu, Mar 25, 2021 at 08:02:11AM -0600, Jens Axboe wrote:

On 3/25/21 7:56 AM, Stefan Metzmacher wrote:

Am 25.03.21 um 14:38 schrieb Jens Axboe:

On 3/25/21 6:11 AM, Stefan Metzmacher wrote:


Am 25.03.21 um 13:04 schrieb Eric W. Biederman:

Stefan Metzmacher  writes:


Am 25.03.21 um 12:24 schrieb Sasha Levin:

From: "Eric W. Biederman" 

[ Upstream commit 4db4b1a0d1779dc159f7b87feb97030ec0b12597 ]

Just like we don't allow normal signals to IO threads, don't deliver a
STOP to a task that has PF_IO_WORKER set. The IO threads don't take
signals in general, and have no means of flushing out a stop either.

Longer term, we may want to look into allowing stop of these threads,
as it relates to eg process freezing. For now, this prevents a spin
issue if a SIGSTOP is delivered to the parent task.

Reported-by: Stefan Metzmacher 
Signed-off-by: Jens Axboe 
Signed-off-by: "Eric W. Biederman" 
Signed-off-by: Sasha Levin 
---
 kernel/signal.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index 55526b941011..00a3840f6037 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -288,7 +288,8 @@ bool task_set_jobctl_pending(struct task_struct *task, 
unsigned long mask)
JOBCTL_STOP_SIGMASK | JOBCTL_TRAPPING));
BUG_ON((mask & JOBCTL_TRAPPING) && !(mask & JOBCTL_PENDING_MASK));

-   if (unlikely(fatal_signal_pending(task) || (task->flags & PF_EXITING)))
+   if (unlikely(fatal_signal_pending(task) ||
+(task->flags & (PF_EXITING | PF_IO_WORKER))))
return false;

if (mask & JOBCTL_STOP_SIGMASK)



Again, why is this proposed for 5.11 and 5.10 already?


Has the bit about the io worker kthreads been backported?
If so this isn't horrible.  If not this is nonsense.


No not yet - my plan is to do that, but not until we're 100% satisfied
with it.


Do you understand why the patches were autoselected for 5.11 and 5.10?


As far as I know, selections like these (AUTOSEL) are done by some bot
that uses whatever criteria to see if they should be applied for earlier
revisions. I'm sure Sasha can expand on that :-)


Right, it's just heuristics that help catch commits that don't have a
stable tag but should have one.


Hence it's reasonable to expect that sometimes it'll pick patches that
should not go into stable, at least not just yet. It's important to
understand that this message is just a notice that it's queued up for
stable -rc, not that it's _in_ stable just yet. There's time to object.


Right, it's even more than that: this mail (tagged with "AUTOSEL") is a
notification that happens at least a week before the patch will go in
the stable queue.

If you think any AUTOSEL patches don't need to be backported, it's
usually enough to just quickly nack them.

--
Thanks,
Sasha


Re: [PATCH AUTOSEL 5.11 43/44] signal: don't allow STOP on PF_IO_WORKER threads

2021-03-25 Thread Jens Axboe
On 3/25/21 7:56 AM, Stefan Metzmacher wrote:
> Am 25.03.21 um 14:38 schrieb Jens Axboe:
>> On 3/25/21 6:11 AM, Stefan Metzmacher wrote:
>>>
>>> Am 25.03.21 um 13:04 schrieb Eric W. Biederman:
>>>> Stefan Metzmacher  writes:
>>>>
>>>>> Am 25.03.21 um 12:24 schrieb Sasha Levin:
>>>>>> From: "Eric W. Biederman" 
>>>>>>
>>>>>> [ Upstream commit 4db4b1a0d1779dc159f7b87feb97030ec0b12597 ]
>>>>>>
>>>>>> Just like we don't allow normal signals to IO threads, don't deliver a
>>>>>> STOP to a task that has PF_IO_WORKER set. The IO threads don't take
>>>>>> signals in general, and have no means of flushing out a stop either.
>>>>>>
>>>>>> Longer term, we may want to look into allowing stop of these threads,
>>>>>> as it relates to eg process freezing. For now, this prevents a spin
>>>>>> issue if a SIGSTOP is delivered to the parent task.
>>>>>>
>>>>>> Reported-by: Stefan Metzmacher 
>>>>>> Signed-off-by: Jens Axboe 
>>>>>> Signed-off-by: "Eric W. Biederman" 
>>>>>> Signed-off-by: Sasha Levin 
>>>>>> ---
>>>>>>  kernel/signal.c | 3 ++-
>>>>>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/kernel/signal.c b/kernel/signal.c
>>>>>> index 55526b941011..00a3840f6037 100644
>>>>>> --- a/kernel/signal.c
>>>>>> +++ b/kernel/signal.c
>>>>>> @@ -288,7 +288,8 @@ bool task_set_jobctl_pending(struct task_struct 
>>>>>> *task, unsigned long mask)
>>>>>>  JOBCTL_STOP_SIGMASK | JOBCTL_TRAPPING));
>>>>>>  BUG_ON((mask & JOBCTL_TRAPPING) && !(mask & 
>>>>>> JOBCTL_PENDING_MASK));
>>>>>>  
>>>>>> -if (unlikely(fatal_signal_pending(task) || (task->flags & 
>>>>>> PF_EXITING)))
>>>>>> +if (unlikely(fatal_signal_pending(task) ||
>>>>>> + (task->flags & (PF_EXITING | PF_IO_WORKER))))
>>>>>>  return false;
>>>>>>  
>>>>>>  if (mask & JOBCTL_STOP_SIGMASK)
>>>>>>
>>>>>
>>>>> Again, why is this proposed for 5.11 and 5.10 already?
>>>>
>>>> Has the bit about the io worker kthreads been backported?
>>>> If so this isn't horrible.  If not this is nonsense.
>>
>> No not yet - my plan is to do that, but not until we're 100% satisfied
>> with it.
> 
> Do you understand why the patches were autoselected for 5.11 and 5.10?

As far as I know, selections like these (AUTOSEL) are done by some bot
that uses whatever criteria to see if they should be applied for earlier
revisions. I'm sure Sasha can expand on that :-)

Hence it's reasonable to expect that sometimes it'll pick patches that
should not go into stable, at least not just yet. It's important to
understand that this message is just a notice that it's queued up for
stable -rc, not that it's _in_ stable just yet. There's time to object.

>>> I don't know, I hope not...
>>>
>>> But I just tested v5.12-rc4 and attaching to
>>> an application with iothreads with gdb is still not possible,
>>> it still loops forever trying to attach to the iothreads.
>>
>> I do see the looping, gdb apparently doesn't give up when it gets
>> -EPERM trying to attach to the threads. Which isn't really a kernel
>> thing, but:
> 
> Maybe we need to remove the iothreads from /proc/pid/tasks/

Is that how it finds them? It's arguably a bug in gdb that it just
keeps retrying, but it would be nice if we can ensure that it just
ignores them. Because if gdb triggers something like that, probably
others too...

>>> And I tested 'kill -9 $pidofiothread', and it froze the whole
>>> machine...
>>
>> that sounds very strange, I haven't seen anything like that running
>> the exact same scenario.
>>
>>> So there's still work to do in order to get 5.12 stable.
>>>
>>> I'm short on time currently, but I hope to send more details soon.
>>
>> Thanks! I'll play with it this morning and see if I can provoke
>> something odd related to STOP/attach.
> 
> Thanks!
> 
> Somehow I have the impression that your same_thread_group_account patch
> may fix a lot of things...

Maybe? I'll look closer.

-- 
Jens Axboe



Re: [PATCH AUTOSEL 5.11 43/44] signal: don't allow STOP on PF_IO_WORKER threads

2021-03-25 Thread Stefan Metzmacher
Am 25.03.21 um 14:38 schrieb Jens Axboe:
> On 3/25/21 6:11 AM, Stefan Metzmacher wrote:
>>
>> Am 25.03.21 um 13:04 schrieb Eric W. Biederman:
>>> Stefan Metzmacher  writes:
>>>
>>>> Am 25.03.21 um 12:24 schrieb Sasha Levin:
>>>>> From: "Eric W. Biederman" 
>>>>>
>>>>> [ Upstream commit 4db4b1a0d1779dc159f7b87feb97030ec0b12597 ]
>>>>>
>>>>> Just like we don't allow normal signals to IO threads, don't deliver a
>>>>> STOP to a task that has PF_IO_WORKER set. The IO threads don't take
>>>>> signals in general, and have no means of flushing out a stop either.
>>>>>
>>>>> Longer term, we may want to look into allowing stop of these threads,
>>>>> as it relates to eg process freezing. For now, this prevents a spin
>>>>> issue if a SIGSTOP is delivered to the parent task.
>>>>>
>>>>> Reported-by: Stefan Metzmacher 
>>>>> Signed-off-by: Jens Axboe 
>>>>> Signed-off-by: "Eric W. Biederman" 
>>>>> Signed-off-by: Sasha Levin 
>>>>> ---
>>>>>  kernel/signal.c | 3 ++-
>>>>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/kernel/signal.c b/kernel/signal.c
>>>>> index 55526b941011..00a3840f6037 100644
>>>>> --- a/kernel/signal.c
>>>>> +++ b/kernel/signal.c
>>>>> @@ -288,7 +288,8 @@ bool task_set_jobctl_pending(struct task_struct 
>>>>> *task, unsigned long mask)
>>>>>   JOBCTL_STOP_SIGMASK | JOBCTL_TRAPPING));
>>>>>   BUG_ON((mask & JOBCTL_TRAPPING) && !(mask & JOBCTL_PENDING_MASK));
>>>>>  
>>>>> - if (unlikely(fatal_signal_pending(task) || (task->flags & PF_EXITING)))
>>>>> + if (unlikely(fatal_signal_pending(task) ||
>>>>> +  (task->flags & (PF_EXITING | PF_IO_WORKER))))
>>>>>   return false;
>>>>>  
>>>>>   if (mask & JOBCTL_STOP_SIGMASK)
>>>>>
>>>>
>>>> Again, why is this proposed for 5.11 and 5.10 already?
>>>
>>> Has the bit about the io worker kthreads been backported?
>>> If so this isn't horrible.  If not this is nonsense.
> 
> No not yet - my plan is to do that, but not until we're 100% satisfied
> with it.

Do you understand why the patches were autoselected for 5.11 and 5.10?

>> I don't know, I hope not...
>>
>> But I just tested v5.12-rc4 and attaching to
>> an application with iothreads with gdb is still not possible,
>> it still loops forever trying to attach to the iothreads.
> 
> I do see the looping, gdb apparently doesn't give up when it gets
> -EPERM trying to attach to the threads. Which isn't really a kernel
> thing, but:

Maybe we need to remove the iothreads from /proc/pid/tasks/

>> And I tested 'kill -9 $pidofiothread', and it froze the whole
>> machine...
> 
> that sounds very strange, I haven't seen anything like that running
> the exact same scenario.
> 
>> So there's still work to do in order to get 5.12 stable.
>>
>> I'm short on time currently, but I hope to send more details soon.
> 
> Thanks! I'll play with it this morning and see if I can provoke
> something odd related to STOP/attach.

Thanks!

Somehow I have the impression that your same_thread_group_account patch
may fix a lot of things...

metze



Re: [PATCH AUTOSEL 5.11 43/44] signal: don't allow STOP on PF_IO_WORKER threads

2021-03-25 Thread Jens Axboe
On 3/25/21 6:11 AM, Stefan Metzmacher wrote:
> 
> Am 25.03.21 um 13:04 schrieb Eric W. Biederman:
>> Stefan Metzmacher  writes:
>>
>>> Am 25.03.21 um 12:24 schrieb Sasha Levin:
>>>> From: "Eric W. Biederman" 
>>>>
>>>> [ Upstream commit 4db4b1a0d1779dc159f7b87feb97030ec0b12597 ]
>>>>
>>>> Just like we don't allow normal signals to IO threads, don't deliver a
>>>> STOP to a task that has PF_IO_WORKER set. The IO threads don't take
>>>> signals in general, and have no means of flushing out a stop either.
>>>>
>>>> Longer term, we may want to look into allowing stop of these threads,
>>>> as it relates to eg process freezing. For now, this prevents a spin
>>>> issue if a SIGSTOP is delivered to the parent task.
>>>>
>>>> Reported-by: Stefan Metzmacher 
>>>> Signed-off-by: Jens Axboe 
>>>> Signed-off-by: "Eric W. Biederman" 
>>>> Signed-off-by: Sasha Levin 
>>>> ---
>>>>  kernel/signal.c | 3 ++-
>>>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/kernel/signal.c b/kernel/signal.c
>>>> index 55526b941011..00a3840f6037 100644
>>>> --- a/kernel/signal.c
>>>> +++ b/kernel/signal.c
>>>> @@ -288,7 +288,8 @@ bool task_set_jobctl_pending(struct task_struct *task, 
>>>> unsigned long mask)
>>>>JOBCTL_STOP_SIGMASK | JOBCTL_TRAPPING));
>>>>BUG_ON((mask & JOBCTL_TRAPPING) && !(mask & JOBCTL_PENDING_MASK));
>>>>  
>>>> -  if (unlikely(fatal_signal_pending(task) || (task->flags & PF_EXITING)))
>>>> +  if (unlikely(fatal_signal_pending(task) ||
>>>> +   (task->flags & (PF_EXITING | PF_IO_WORKER))))
>>>>return false;
>>>>  
>>>>if (mask & JOBCTL_STOP_SIGMASK)
>>>>
>>>
>>> Again, why is this proposed for 5.11 and 5.10 already?
>>
>> Has the bit about the io worker kthreads been backported?
>> If so this isn't horrible.  If not this is nonsense.

No not yet - my plan is to do that, but not until we're 100% satisfied
with it.

> I don't know, I hope not...
> 
> But I just tested v5.12-rc4 and attaching to
> an application with iothreads with gdb is still not possible,
> it still loops forever trying to attach to the iothreads.

I do see the looping, gdb apparently doesn't give up when it gets
-EPERM trying to attach to the threads. Which isn't really a kernel
thing, but:

> And I tested 'kill -9 $pidofiothread', and it froze the whole
> machine...

that sounds very strange, I haven't seen anything like that running
the exact same scenario.

> So there's still work to do in order to get 5.12 stable.
> 
> I'm short on time currently, but I hope to send more details soon.

Thanks! I'll play with it this morning and see if I can provoke
something odd related to STOP/attach.

-- 
Jens Axboe



Re: [PATCH AUTOSEL 5.11 43/44] signal: don't allow STOP on PF_IO_WORKER threads

2021-03-25 Thread Stefan Metzmacher


Am 25.03.21 um 13:04 schrieb Eric W. Biederman:
> Stefan Metzmacher  writes:
> 
>> Am 25.03.21 um 12:24 schrieb Sasha Levin:
>>> From: "Eric W. Biederman" 
>>>
>>> [ Upstream commit 4db4b1a0d1779dc159f7b87feb97030ec0b12597 ]
>>>
>>> Just like we don't allow normal signals to IO threads, don't deliver a
>>> STOP to a task that has PF_IO_WORKER set. The IO threads don't take
>>> signals in general, and have no means of flushing out a stop either.
>>>
>>> Longer term, we may want to look into allowing stop of these threads,
>>> as it relates to eg process freezing. For now, this prevents a spin
>>> issue if a SIGSTOP is delivered to the parent task.
>>>
>>> Reported-by: Stefan Metzmacher 
>>> Signed-off-by: Jens Axboe 
>>> Signed-off-by: "Eric W. Biederman" 
>>> Signed-off-by: Sasha Levin 
>>> ---
>>>  kernel/signal.c | 3 ++-
>>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/kernel/signal.c b/kernel/signal.c
>>> index 55526b941011..00a3840f6037 100644
>>> --- a/kernel/signal.c
>>> +++ b/kernel/signal.c
>>> @@ -288,7 +288,8 @@ bool task_set_jobctl_pending(struct task_struct *task, 
>>> unsigned long mask)
>>> JOBCTL_STOP_SIGMASK | JOBCTL_TRAPPING));
>>> BUG_ON((mask & JOBCTL_TRAPPING) && !(mask & JOBCTL_PENDING_MASK));
>>>  
>>> -   if (unlikely(fatal_signal_pending(task) || (task->flags & PF_EXITING)))
>>> +   if (unlikely(fatal_signal_pending(task) ||
>>> +(task->flags & (PF_EXITING | PF_IO_WORKER))))
>>> return false;
>>>  
>>> if (mask & JOBCTL_STOP_SIGMASK)
>>>
>>
>> Again, why is this proposed for 5.11 and 5.10 already?
> 
> Has the bit about the io worker kthreads been backported?
> If so this isn't horrible.  If not this is nonsense.

I don't know, I hope not...

But I just tested v5.12-rc4 and attaching to
an application with iothreads with gdb is still not possible,
it still loops forever trying to attach to the iothreads.

And I tested 'kill -9 $pidofiothread', and it froze the whole
machine...

So there's still work to do in order to get 5.12 stable.

I'm short on time currently, but I hope to send more details soon.

metze


Re: [PATCH AUTOSEL 5.11 43/44] signal: don't allow STOP on PF_IO_WORKER threads

2021-03-25 Thread Eric W. Biederman
Stefan Metzmacher  writes:

> Am 25.03.21 um 12:24 schrieb Sasha Levin:
>> From: "Eric W. Biederman" 
>> 
>> [ Upstream commit 4db4b1a0d1779dc159f7b87feb97030ec0b12597 ]
>> 
>> Just like we don't allow normal signals to IO threads, don't deliver a
>> STOP to a task that has PF_IO_WORKER set. The IO threads don't take
>> signals in general, and have no means of flushing out a stop either.
>> 
>> Longer term, we may want to look into allowing stop of these threads,
>> as it relates to eg process freezing. For now, this prevents a spin
>> issue if a SIGSTOP is delivered to the parent task.
>> 
>> Reported-by: Stefan Metzmacher 
>> Signed-off-by: Jens Axboe 
>> Signed-off-by: "Eric W. Biederman" 
>> Signed-off-by: Sasha Levin 
>> ---
>>  kernel/signal.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>> 
>> diff --git a/kernel/signal.c b/kernel/signal.c
>> index 55526b941011..00a3840f6037 100644
>> --- a/kernel/signal.c
>> +++ b/kernel/signal.c
>> @@ -288,7 +288,8 @@ bool task_set_jobctl_pending(struct task_struct *task, 
>> unsigned long mask)
>>  JOBCTL_STOP_SIGMASK | JOBCTL_TRAPPING));
>>  BUG_ON((mask & JOBCTL_TRAPPING) && !(mask & JOBCTL_PENDING_MASK));
>>  
>> -if (unlikely(fatal_signal_pending(task) || (task->flags & PF_EXITING)))
>> +if (unlikely(fatal_signal_pending(task) ||
>> + (task->flags & (PF_EXITING | PF_IO_WORKER))))
>>  return false;
>>  
>>  if (mask & JOBCTL_STOP_SIGMASK)
>> 
>
> Again, why is this proposed for 5.11 and 5.10 already?

Has the bit about the io worker kthreads been backported?
If so this isn't horrible.  If not this is nonsense.

Eric




Re: [PATCH AUTOSEL 5.11 42/44] signal: don't allow sending any signals to PF_IO_WORKER threads

2021-03-25 Thread Stefan Metzmacher
Am 25.03.21 um 12:24 schrieb Sasha Levin:
> From: Jens Axboe 
> 
> [ Upstream commit 5be28c8f85ce99ed2d329d2ad8bdd18ea19473a5 ]
> 
> They don't take signals individually, and even if they share signals with
> the parent task, don't allow them to be delivered through the worker
> thread. Linux does allow this kind of behavior for regular threads, but
> it's really a compatibility thing that we need not care about for the IO
> threads.
> 
> Reported-by: Stefan Metzmacher 
> Signed-off-by: Jens Axboe 
> Signed-off-by: Sasha Levin 
> ---
>  kernel/signal.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 5ad8566534e7..55526b941011 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -833,6 +833,9 @@ static int check_kill_permission(int sig, struct 
> kernel_siginfo *info,
>  
>   if (!valid_signal(sig))
>   return -EINVAL;
> + /* PF_IO_WORKER threads don't take any signals */
> + if (t->flags & PF_IO_WORKER)
> + return -ESRCH;

Why is that proposed for 5.11 and 5.10 now?

Are the create_io_thread() patches already backported?

I think we should hold on with the backports until
everything is stable in v5.12 final.

I'm still about to test on top of v5.12-rc4
and have a pending mail on why I think this particular change is
wrong even in 5.12.

Jens, did you send these to stable?

metze



Re: [PATCH AUTOSEL 5.11 43/44] signal: don't allow STOP on PF_IO_WORKER threads

2021-03-25 Thread Stefan Metzmacher
Am 25.03.21 um 12:24 schrieb Sasha Levin:
> From: "Eric W. Biederman" 
> 
> [ Upstream commit 4db4b1a0d1779dc159f7b87feb97030ec0b12597 ]
> 
> Just like we don't allow normal signals to IO threads, don't deliver a
> STOP to a task that has PF_IO_WORKER set. The IO threads don't take
> signals in general, and have no means of flushing out a stop either.
> 
> Longer term, we may want to look into allowing stop of these threads,
> as it relates to eg process freezing. For now, this prevents a spin
> issue if a SIGSTOP is delivered to the parent task.
> 
> Reported-by: Stefan Metzmacher 
> Signed-off-by: Jens Axboe 
> Signed-off-by: "Eric W. Biederman" 
> Signed-off-by: Sasha Levin 
> ---
>  kernel/signal.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 55526b941011..00a3840f6037 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -288,7 +288,8 @@ bool task_set_jobctl_pending(struct task_struct *task, 
> unsigned long mask)
>   JOBCTL_STOP_SIGMASK | JOBCTL_TRAPPING));
>   BUG_ON((mask & JOBCTL_TRAPPING) && !(mask & JOBCTL_PENDING_MASK));
>  
> - if (unlikely(fatal_signal_pending(task) || (task->flags & PF_EXITING)))
> + if (unlikely(fatal_signal_pending(task) ||
> +  (task->flags & (PF_EXITING | PF_IO_WORKER))))
>   return false;
>  
>   if (mask & JOBCTL_STOP_SIGMASK)
> 

Again, why is this proposed for 5.11 and 5.10 already?

metze


[PATCH AUTOSEL 5.10 38/39] signal: don't allow STOP on PF_IO_WORKER threads

2021-03-25 Thread Sasha Levin
From: "Eric W. Biederman" 

[ Upstream commit 4db4b1a0d1779dc159f7b87feb97030ec0b12597 ]

Just like we don't allow normal signals to IO threads, don't deliver a
STOP to a task that has PF_IO_WORKER set. The IO threads don't take
signals in general, and have no means of flushing out a stop either.

Longer term, we may want to look into allowing stop of these threads,
as it relates to eg process freezing. For now, this prevents a spin
issue if a SIGSTOP is delivered to the parent task.

Reported-by: Stefan Metzmacher 
Signed-off-by: Jens Axboe 
Signed-off-by: "Eric W. Biederman" 
Signed-off-by: Sasha Levin 
---
 kernel/signal.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index 18ed1f853439..1d901a789812 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -288,7 +288,8 @@ bool task_set_jobctl_pending(struct task_struct *task, 
unsigned long mask)
JOBCTL_STOP_SIGMASK | JOBCTL_TRAPPING));
BUG_ON((mask & JOBCTL_TRAPPING) && !(mask & JOBCTL_PENDING_MASK));
 
-   if (unlikely(fatal_signal_pending(task) || (task->flags & PF_EXITING)))
+   if (unlikely(fatal_signal_pending(task) ||
+(task->flags & (PF_EXITING | PF_IO_WORKER))))
return false;
 
if (mask & JOBCTL_STOP_SIGMASK)
-- 
2.30.1



[PATCH AUTOSEL 5.10 37/39] signal: don't allow sending any signals to PF_IO_WORKER threads

2021-03-25 Thread Sasha Levin
From: Jens Axboe 

[ Upstream commit 5be28c8f85ce99ed2d329d2ad8bdd18ea19473a5 ]

They don't take signals individually, and even if they share signals with
the parent task, don't allow them to be delivered through the worker
thread. Linux does allow this kind of behavior for regular threads, but
it's really a compatibility thing that we need not care about for the IO
threads.

Reported-by: Stefan Metzmacher 
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin 
---
 kernel/signal.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/signal.c b/kernel/signal.c
index ef8f2a28d37c..18ed1f853439 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -833,6 +833,9 @@ static int check_kill_permission(int sig, struct 
kernel_siginfo *info,
 
if (!valid_signal(sig))
return -EINVAL;
+   /* PF_IO_WORKER threads don't take any signals */
+   if (t->flags & PF_IO_WORKER)
+   return -ESRCH;
 
if (!si_fromuser(info))
return 0;
-- 
2.30.1



[PATCH AUTOSEL 5.11 43/44] signal: don't allow STOP on PF_IO_WORKER threads

2021-03-25 Thread Sasha Levin
From: "Eric W. Biederman" 

[ Upstream commit 4db4b1a0d1779dc159f7b87feb97030ec0b12597 ]

Just like we don't allow normal signals to IO threads, don't deliver a
STOP to a task that has PF_IO_WORKER set. The IO threads don't take
signals in general, and have no means of flushing out a stop either.

Longer term, we may want to look into allowing stop of these threads,
as it relates to eg process freezing. For now, this prevents a spin
issue if a SIGSTOP is delivered to the parent task.

Reported-by: Stefan Metzmacher 
Signed-off-by: Jens Axboe 
Signed-off-by: "Eric W. Biederman" 
Signed-off-by: Sasha Levin 
---
 kernel/signal.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index 55526b941011..00a3840f6037 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -288,7 +288,8 @@ bool task_set_jobctl_pending(struct task_struct *task, 
unsigned long mask)
JOBCTL_STOP_SIGMASK | JOBCTL_TRAPPING));
BUG_ON((mask & JOBCTL_TRAPPING) && !(mask & JOBCTL_PENDING_MASK));
 
-   if (unlikely(fatal_signal_pending(task) || (task->flags & PF_EXITING)))
+   if (unlikely(fatal_signal_pending(task) ||
+(task->flags & (PF_EXITING | PF_IO_WORKER))))
return false;
 
if (mask & JOBCTL_STOP_SIGMASK)
-- 
2.30.1



[PATCH AUTOSEL 5.11 42/44] signal: don't allow sending any signals to PF_IO_WORKER threads

2021-03-25 Thread Sasha Levin
From: Jens Axboe 

[ Upstream commit 5be28c8f85ce99ed2d329d2ad8bdd18ea19473a5 ]

They don't take signals individually, and even if they share signals with
the parent task, don't allow them to be delivered through the worker
thread. Linux does allow this kind of behavior for regular threads, but
it's really a compatibility thing that we need not care about for the IO
threads.

Reported-by: Stefan Metzmacher 
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin 
---
 kernel/signal.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/signal.c b/kernel/signal.c
index 5ad8566534e7..55526b941011 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -833,6 +833,9 @@ static int check_kill_permission(int sig, struct 
kernel_siginfo *info,
 
if (!valid_signal(sig))
return -EINVAL;
+   /* PF_IO_WORKER threads don't take any signals */
+   if (t->flags & PF_IO_WORKER)
+   return -ESRCH;
 
if (!si_fromuser(info))
return 0;
-- 
2.30.1



Re: [PATCH 2/2] signal: don't allow STOP on PF_IO_WORKER threads

2021-03-22 Thread Oleg Nesterov
On 03/20, Eric W. Biederman wrote:
>
> But please tell me why is it not the right thing to have the io_uring
> helper threads stop?  Why is it ok for that process to go on consuming
> cpu resources in a non-stoppable way.

Yes, I have the same question ;)

Oleg.



Re: [PATCH 2/2] signal: don't allow STOP on PF_IO_WORKER threads

2021-03-22 Thread Oleg Nesterov
On 03/20, Jens Axboe wrote:
>
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -2349,6 +2349,10 @@ static bool do_signal_stop(int signr)
>
>   t = current;
>   while_each_thread(current, t) {
> + /* don't STOP PF_IO_WORKER threads */
> + if (t->flags & PF_IO_WORKER)
> + continue;
> +

This is not enough. At least task_join_group_stop() should check
PF_IO_WORKER too.

Oleg.
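
For reference, task_join_group_stop() as of the 5.11-era kernel/signal.c
(lightly abridged). Note that it funnels into task_set_jobctl_pending(),
which is where the PF_IO_WORKER check from commit 4db4b1a0d177 (quoted
elsewhere in this thread, and later reverted once IO threads learned to
handle SIGSTOP) sat, so that check gated this fork-time join path as well:

    static inline void task_join_group_stop(struct task_struct *task)
    {
        unsigned long mask = current->jobctl & JOBCTL_STOP_SIGMASK;
        struct signal_struct *sig = current->signal;

        if (sig->group_stop_count) {
            sig->group_stop_count++;
            mask |= JOBCTL_STOP_CONSUME;
        } else if (!(sig->flags & SIGNAL_STOP_STOPPED))
            return;

        /* Have the new thread join an on-going signal group stop */
        task_set_jobctl_pending(task, mask | JOBCTL_STOP_PENDING);
    }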



Re: [PATCH 1/2] signal: don't allow sending any signals to PF_IO_WORKER threads

2021-03-21 Thread Jens Axboe
On 3/21/21 8:54 AM, Eric W. Biederman wrote:
> Jens Axboe  writes:
> 
>> On 3/20/21 3:38 PM, Eric W. Biederman wrote:
>>> Linus Torvalds  writes:
>>>
>>>> On Sat, Mar 20, 2021 at 9:19 AM Eric W. Biederman  
>>>> wrote:
>>>>>
>>>>> The creds should be reasonably in-sync with the rest of the threads.
>>>>
>>>> It's not about credentials (despite the -EPERM).
>>>>
>>>> It's about the fact that kernel threads cannot handle signals, and
>>>> then get caught in endless loops of "if (sigpending()) return
>>>> -EAGAIN".
>>>>
>>>> For a normal user thread, that "return -EAGAIN" (or whatever) will end
>>>> up returning an error to user space - and before it does that, it will
>>>> go through the "oh, returning to user space, so handle signal" path.
>>>> Which will clear sigpending etc.
>>>>
>>>> A thread that never returns to user space fundamentally cannot handle
>>>> this. The sigpending() stays on forever, the signal never gets
>>>> handled, the thread can't do anything.
>>>>
>>>> So delivering a signal to a kernel thread fundamentally cannot work
>>>> (although we do have some threads that explicitly see "oh, if I was
>>>> killed, I will exit" - think things like in-kernel nfsd etc).
>>>
>>> I agree that getting a kernel thread to receive a signal is quite
>>> tricky.  But that is not what the patch affects.
>>>
>>> The patch covers the case when instead of specifying the pid of the
>>> process to kill(2) someone specifies the tid of a thread.  Which implies
>>> that type is PIDTYPE_TGID, and in turn the signal is being placed on the
>>> t->signal->shared_pending queue.  Not the thread specific t->pending
>>> queue.
>>>
>>> So my question is since the signal is delivered to the process as a
>>> whole why do we care if someone specifies the tid of a kernel thread,
>>> rather than the tid of a userspace thread?
>>
>> Right, that's what this first patch does, and in all honesty, it's not
>> required like the 2/2 patch is. I do think it makes it more consistent,
>> though - the threads don't take signals, period. Allowing delivery from
>> eg kill(2) and then passing it to the owning task of the io_uring is
>> somewhat counterintuitive, and differs from earlier kernels where there
>> was no relationship between that owning task and the async worker
>> thread.
>>
>> That's why I think the patch DOES make sense. These threads may share a
>> personality with the owning task, but I don't think we should be able to
>> manipulate them from userspace at all. That includes SIGSTOP, of course,
>> but also regular signals.
>>
>> Hence I do think we should do something like this.
> 
> I agree about signals.  Especially because being able to use kill(2)
> with the tid of thread is a linuxism and a backwards compatibility thing
> from before we had CLONE_THREAD.
> 
> I think for kill(2) we should just return -ESRCH.
> 
> Thank you for providing the reasoning that is what I really saw missing
> in the patches.  The why.  And software is difficult to maintain without
> the why.

Thanks Eric, I'll change that patch to -ESRCH and augment the commit
message a bit.

-- 
Jens Axboe



Re: [PATCH 1/2] signal: don't allow sending any signals to PF_IO_WORKER threads

2021-03-21 Thread Eric W. Biederman
Jens Axboe  writes:

> On 3/20/21 3:38 PM, Eric W. Biederman wrote:
>> Linus Torvalds  writes:
>> 
>>> On Sat, Mar 20, 2021 at 9:19 AM Eric W. Biederman  
>>> wrote:
>>>>
>>>> The creds should be reasonably in-sync with the rest of the threads.
>>>
>>> It's not about credentials (despite the -EPERM).
>>>
>>> It's about the fact that kernel threads cannot handle signals, and
>>> then get caught in endless loops of "if (sigpending()) return
>>> -EAGAIN".
>>>
>>> For a normal user thread, that "return -EAGAIN" (or whatever) will end
>>> up returning an error to user space - and before it does that, it will
>>> go through the "oh, returning to user space, so handle signal" path.
>>> Which will clear sigpending etc.
>>>
>>> A thread that never returns to user space fundamentally cannot handle
>>> this. The sigpending() stays on forever, the signal never gets
>>> handled, the thread can't do anything.
>>>
>>> So delivering a signal to a kernel thread fundamentally cannot work
>>> (although we do have some threads that explicitly see "oh, if I was
>>> killed, I will exit" - think things like in-kernel nfsd etc).
>> 
>> I agree that getting a kernel thread to receive a signal is quite
>> tricky.  But that is not what the patch affects.
>> 
>> The patch covers the case when instead of specifying the pid of the
>> process to kill(2) someone specifies the tid of a thread.  Which implies
>> that type is PIDTYPE_TGID, and in turn the signal is being placed on the
>> t->signal->shared_pending queue.  Not the thread specific t->pending
>> queue.
>> 
>> So my question is since the signal is delivered to the process as a
>> whole why do we care if someone specifies the tid of a kernel thread,
>> rather than the tid of a userspace thread?
>
> Right, that's what this first patch does, and in all honesty, it's not
> required like the 2/2 patch is. I do think it makes it more consistent,
> though - the threads don't take signals, period. Allowing delivery from
> eg kill(2) and then passing it to the owning task of the io_uring is
> somewhat counterintuitive, and differs from earlier kernels where there
> was no relationship between that owning task and the async worker
> thread.
>
> That's why I think the patch DOES make sense. These threads may share a
> personality with the owning task, but I don't think we should be able to
> manipulate them from userspace at all. That includes SIGSTOP, of course,
> but also regular signals.
>
> Hence I do think we should do something like this.

I agree about signals.  Especially because being able to use kill(2)
with the tid of thread is a linuxism and a backwards compatibility thing
from before we had CLONE_THREAD.

I think for kill(2) we should just return -ESRCH.

Thank you for providing the reasoning that is what I really saw missing
in the patches.  The why.  And software is difficult to maintain without
the why.





Eric





Re: [PATCH 1/2] signal: don't allow sending any signals to PF_IO_WORKER threads

2021-03-20 Thread Jens Axboe
On 3/20/21 3:38 PM, Eric W. Biederman wrote:
> Linus Torvalds  writes:
> 
>> On Sat, Mar 20, 2021 at 9:19 AM Eric W. Biederman  
>> wrote:
>>>
>>> The creds should be reasonably in-sync with the rest of the threads.
>>
>> It's not about credentials (despite the -EPERM).
>>
>> It's about the fact that kernel threads cannot handle signals, and
>> then get caught in endless loops of "if (sigpending()) return
>> -EAGAIN".
>>
>> For a normal user thread, that "return -EAGAIN" (or whatever) will end
>> up returning an error to user space - and before it does that, it will
>> go through the "oh, returning to user space, so handle signal" path.
>> Which will clear sigpending etc.
>>
>> A thread that never returns to user space fundamentally cannot handle
>> this. The sigpending() stays on forever, the signal never gets
>> handled, the thread can't do anything.
>>
>> So delivering a signal to a kernel thread fundamentally cannot work
>> (although we do have some threads that explicitly see "oh, if I was
>> killed, I will exit" - think things like in-kernel nfsd etc).
> 
> I agree that getting a kernel thread to receive a signal is quite
> tricky.  But that is not what the patch affects.
> 
> The patch covers the case when instead of specifying the pid of the
> process to kill(2) someone specifies the tid of a thread.  Which implies
> that type is PIDTYPE_TGID, and in turn the signal is being placed on the
> t->signal->shared_pending queue.  Not the thread specific t->pending
> queue.
> 
> So my question is since the signal is delivered to the process as a
> whole why do we care if someone specifies the tid of a kernel thread,
> rather than the tid of a userspace thread?

Right, that's what this first patch does, and in all honesty, it's not
required like the 2/2 patch is. I do think it makes it more consistent,
though - the threads don't take signals, period. Allowing delivery from
eg kill(2) and then passing it to the owning task of the io_uring is
somewhat counterintuitive, and differs from earlier kernels where there
was no relationship between that owning task and the async worker
thread.

That's why I think the patch DOES make sense. These threads may share a
personality with the owning task, but I don't think we should be able to
manipulate them from userspace at all. That includes SIGSTOP, of course,
but also regular signals.

Hence I do think we should do something like this.

-- 
Jens Axboe



Re: [PATCH 1/2] signal: don't allow sending any signals to PF_IO_WORKER threads

2021-03-20 Thread Eric W. Biederman
Linus Torvalds  writes:

> On Sat, Mar 20, 2021 at 9:19 AM Eric W. Biederman  
> wrote:
>>
>> The creds should be reasonably in-sync with the rest of the threads.
>
> It's not about credentials (despite the -EPERM).
>
> It's about the fact that kernel threads cannot handle signals, and
> then get caught in endless loops of "if (sigpending()) return
> -EAGAIN".
>
> For a normal user thread, that "return -EAGAIN" (or whatever) will end
> up returning an error to user space - and before it does that, it will
> go through the "oh, returning to user space, so handle signal" path.
> Which will clear sigpending etc.
>
> A thread that never returns to user space fundamentally cannot handle
> this. The sigpending() stays on forever, the signal never gets
> handled, the thread can't do anything.
>
> So delivering a signal to a kernel thread fundamentally cannot work
> (although we do have some threads that explicitly see "oh, if I was
> killed, I will exit" - think things like in-kernel nfsd etc).

I agree that getting a kernel thread to receive a signal is quite
tricky.  But that is not what the patch affects.

The patch covers the case when instead of specifying the pid of the
process to kill(2) someone specifies the tid of a thread.  Which implies
that type is PIDTYPE_TGID, and in turn the signal is being placed on the
t->signal->shared_pending queue.  Not the thread specific t->pending
queue.

So my question is since the signal is delivered to the process as a
whole why do we care if someone specifies the tid of a kernel thread,
rather than the tid of a userspace thread?

Eric
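
A hypothetical userspace demo of the distinction Eric draws (SIGUSR1 and the
argument handling are illustrative only): kill(2) given a tid is still a
process-directed signal, queued on the group's shared_pending, while
tgkill(2) is the thread-directed variant:

    #define _GNU_SOURCE
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        pid_t tgid, tid;

        if (argc != 3) {
            fprintf(stderr, "usage: %s <tgid> <tid>\n", argv[0]);
            return 1;
        }
        tgid = atoi(argv[1]);
        tid = atoi(argv[2]);

        /* Process-directed: any thread in the group may dequeue it,
         * even when <tid> names an io_uring worker thread. */
        if (kill(tid, SIGUSR1) < 0)
            perror("kill(tid)");

        /* Thread-directed: queued on that thread's private pending set. */
        if (syscall(SYS_tgkill, (long)tgid, (long)tid, SIGUSR1) < 0)
            perror("tgkill");

        return 0;
    }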


Re: [PATCH 1/2] signal: don't allow sending any signals to PF_IO_WORKER threads

2021-03-20 Thread Linus Torvalds
On Sat, Mar 20, 2021 at 9:19 AM Eric W. Biederman  wrote:
>
> The creds should be reasonably in-sync with the rest of the threads.

It's not about credentials (despite the -EPERM).

It's about the fact that kernel threads cannot handle signals, and
then get caught in endless loops of "if (sigpending()) return
-EAGAIN".

For a normal user thread, that "return -EAGAIN" (or whatever) will end
up returning an error to user space - and before it does that, it will
go through the "oh, returning to user space, so handle signal" path.
Which will clear sigpending etc.

A thread that never returns to user space fundamentally cannot handle
this. The sigpending() stays on forever, the signal never gets
handled, the thread can't do anything.

So delivering a signal to a kernel thread fundamentally cannot work
(although we do have some threads that explicitly see "oh, if I was
killed, I will exit" - think things like in-kernel nfsd etc).

  Linus
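
A sketch of the "oh, if I was killed, I will exit" pattern Linus mentions
(think the in-kernel nfsd case); do_unit_of_io() is a made-up placeholder,
not a real kernel symbol:

    #include <linux/kthread.h>
    #include <linux/sched/signal.h>

    static int my_kthread(void *data)
    {
        allow_signal(SIGKILL);  /* opt in to being killed */

        while (!kthread_should_stop()) {
            if (do_unit_of_io(data) < 0)
                break;
            /* A thread that never returns to user space can only poll
             * for signals; there is no return-to-userspace point where
             * the normal delivery path would run and clear them. */
            if (signal_pending(current)) {
                if (fatal_signal_pending(current))
                    break;
                flush_signals(current); /* drop anything non-fatal */
            }
        }
        return 0;
    }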

