RE: v5.10: sched_cpu_dying() hits BUG_ON during hibernation: kernel BUG at kernel/sched/core.c:7596!

2020-12-22 Thread Dexuan Cui
> From: Valentin Schneider 
> Sent: Tuesday, December 22, 2020 5:40 AM
> To: Dexuan Cui 
> Cc: mi...@redhat.com; pet...@infradead.org; juri.le...@redhat.com;
> vincent.guit...@linaro.org; dietmar.eggem...@arm.com;
> rost...@goodmis.org; bseg...@google.com; mgor...@suse.de;
> bris...@redhat.com; x...@kernel.org; linux...@vger.kernel.org;
> linux-kernel@vger.kernel.org; linux-hyp...@vger.kernel.org; Michael Kelley
> 
> Subject: Re: v5.10: sched_cpu_dying() hits BUG_ON during hibernation: kernel
> BUG at kernel/sched/core.c:7596!
> 
> 
> Hi,
> 
> On 22/12/20 09:13, Dexuan Cui wrote:
> > Hi,
> > I'm running a Linux VM with the recent mainline (48342fc07272, 12/20/2020)
> on Hyper-V.
> > When I test hibernation, the VM can easily hit the below BUG_ON during the
> resume
> > procedure (I estimate this can repro about 1/5 of the time). BTW, my VM has
> 40 vCPUs.
> >
> > I can't repro the BUG_ON with v5.9.0, so I suspect something in v5.10.0 may
> be broken?
> >
> > In v5.10.0, when the BUG_ON happens, rq->nr_running==2, and
> rq->nr_pinned==0:
> >
> > 7587 int sched_cpu_dying(unsigned int cpu)
> > 7588 {
> > 7589 struct rq *rq = cpu_rq(cpu);
> > 7590 struct rq_flags rf;
> > 7591
> > 7592 /* Handle pending wakeups and then migrate everything off
> */
> > 7593 sched_tick_stop(cpu);
> > 7594
> > 7595 rq_lock_irqsave(rq, );
> > 7596 BUG_ON(rq->nr_running != 1 || rq_has_pinned_tasks(rq));
> > 7597 rq_unlock_irqrestore(rq, );
> > 7598
> > 7599 calc_load_migrate(rq);
> > 7600 update_max_interval();
> > 7601 nohz_balance_exit_idle(rq);
> > 7602 hrtick_clear(rq);
> > 7603 return 0;
> > 7604 }
> >
> > The last commit that touches the BUG_ON line is the commit
> > 3015ef4b98f5 ("sched/core: Make migrate disable and CPU hotplug
> cooperative")
> > but the commit looks good to me.
> >
> > Any idea?
> >
> 
> I'd wager this extra task is a kworker; could you give this series a try?
> 
> 
> https ://lore.kernel.org/lkml/20201218170919.2950-1-jiangshan...@gmail.com/

Thanks, Valentin! It looks like the patchset can fix the BUG_ON, though I see
a warning, which I reported here: https://lkml.org/lkml/2020/12/22/648

Thanks,
-- Dexuan


Re: v5.10: sched_cpu_dying() hits BUG_ON during hibernation: kernel BUG at kernel/sched/core.c:7596!

2020-12-22 Thread Valentin Schneider


Hi,

On 22/12/20 09:13, Dexuan Cui wrote:
> Hi,
> I'm running a Linux VM with the recent mainline (48342fc07272, 12/20/2020) on 
> Hyper-V.
> When I test hibernation, the VM can easily hit the below BUG_ON during the 
> resume
> procedure (I estimate this can repro about 1/5 of the time). BTW, my VM has 
> 40 vCPUs.
>
> I can't repro the BUG_ON with v5.9.0, so I suspect something in v5.10.0 may 
> be broken?
>
> In v5.10.0, when the BUG_ON happens, rq->nr_running==2, and rq->nr_pinned==0:
>
> 7587 int sched_cpu_dying(unsigned int cpu)
> 7588 {
> 7589 struct rq *rq = cpu_rq(cpu);
> 7590 struct rq_flags rf;
> 7591
> 7592 /* Handle pending wakeups and then migrate everything off */
> 7593 sched_tick_stop(cpu);
> 7594
> 7595 rq_lock_irqsave(rq, );
> 7596 BUG_ON(rq->nr_running != 1 || rq_has_pinned_tasks(rq));
> 7597 rq_unlock_irqrestore(rq, );
> 7598
> 7599 calc_load_migrate(rq);
> 7600 update_max_interval();
> 7601 nohz_balance_exit_idle(rq);
> 7602 hrtick_clear(rq);
> 7603 return 0;
> 7604 }
>
> The last commit that touches the BUG_ON line is the commit
> 3015ef4b98f5 ("sched/core: Make migrate disable and CPU hotplug cooperative")
> but the commit looks good to me.
>
> Any idea?
>

I'd wager this extra task is a kworker; could you give this series a try?

  https://lore.kernel.org/r/20201218170919.2950-1-jiangshan...@gmail.com

> Thanks,
> -- Dexuan


v5.10: sched_cpu_dying() hits BUG_ON during hibernation: kernel BUG at kernel/sched/core.c:7596!

2020-12-22 Thread Dexuan Cui
Hi,
I'm running a Linux VM with the recent mainline (48342fc07272, 12/20/2020) on 
Hyper-V.
When I test hibernation, the VM can easily hit the below BUG_ON during the 
resume
procedure (I estimate this can repro about 1/5 of the time). BTW, my VM has 40 
vCPUs.

I can't repro the BUG_ON with v5.9.0, so I suspect something in v5.10.0 may be 
broken?

In v5.10.0, when the BUG_ON happens, rq->nr_running==2, and rq->nr_pinned==0:

7587 int sched_cpu_dying(unsigned int cpu)
7588 {
7589 struct rq *rq = cpu_rq(cpu);
7590 struct rq_flags rf;
7591
7592 /* Handle pending wakeups and then migrate everything off */
7593 sched_tick_stop(cpu);
7594
7595 rq_lock_irqsave(rq, );
7596 BUG_ON(rq->nr_running != 1 || rq_has_pinned_tasks(rq));
7597 rq_unlock_irqrestore(rq, );
7598
7599 calc_load_migrate(rq);
7600 update_max_interval();
7601 nohz_balance_exit_idle(rq);
7602 hrtick_clear(rq);
7603 return 0;
7604 }

The last commit that touches the BUG_ON line is the commit
3015ef4b98f5 ("sched/core: Make migrate disable and CPU hotplug cooperative")
but the commit looks good to me.

Any idea?

Thanks,
-- Dexuan

[5.383378] PM: Image signature found, resuming
[5.388027] PM: hibernation: resume from hibernation
[5.395794] Freezing user space processes ... (elapsed 0.001 seconds) done.
[5.397058] OOM killer disabled.
[5.397059] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) 
done.
[5.413013] PM: hibernation: Marking nosave pages: [mem 
0x-0x0fff]
[5.419331] PM: hibernation: Marking nosave pages: [mem 
0x0009f000-0x000f]
[5.425502] PM: hibernation: Marking nosave pages: [mem 
0xf7ff-0x]
[5.431706] PM: hibernation: Basic memory bitmaps created
[5.465205] PM: Using 3 thread(s) for decompression
[5.469590] PM: Loading and decompressing image data (505553 pages)...
[5.492790] PM: Image loading progress:   0%
[6.532641] PM: Image loading progress:  10%
[6.899683] PM: Image loading progress:  20%
[7.251672] PM: Image loading progress:  30%
[7.598808] PM: Image loading progress:  40%
[8.056153] PM: Image loading progress:  50%
[8.534077] PM: Image loading progress:  60%
[9.029886] PM: Image loading progress:  70%
[9.542015] PM: Image loading progress:  80%
[   10.025326] PM: Image loading progress:  90%
[   10.525804] PM: Image loading progress: 100%
[   10.530241] PM: Image loading done
[   10.533295] PM: hibernation: Read 2022212 kbytes in 5.05 seconds (400.43 
MB/s)
[   10.540827] PM: Image successfully loaded
[   10.599243] serial 00:04: disabled
[   10.619935] vmbus 242ff919-07db-4180-9c2e-b86cb68c8c55: parent VMBUS:01 
should not be sleeping
[   10.646542] Disabling non-boot CPUs ...
[   10.694380] smpboot: CPU 1 is now offline
[   10.729044] smpboot: CPU 2 is now offline
[   10.756819] smpboot: CPU 3 is now offline
[   10.776784] smpboot: CPU 4 is now offline
[   10.800866] smpboot: CPU 5 is now offline
[   10.828923] smpboot: CPU 6 is now offline
[   10.849013] smpboot: CPU 7 is now offline
[   10.876722] smpboot: CPU 8 is now offline
[   10.909426] smpboot: CPU 9 is now offline
[   10.929360] smpboot: CPU 10 is now offline
[   10.957059] smpboot: CPU 11 is now offline
[   10.976542] smpboot: CPU 12 is now offline
[   11.008470] smpboot: CPU 13 is now offline
[   11.036356] smpboot: CPU 14 is now offline
[   11.072396] smpboot: CPU 15 is now offline
[   11.100229] smpboot: CPU 16 is now offline
[   11.128638] smpboot: CPU 17 is now offline
[   11.148479] smpboot: CPU 18 is now offline
[   11.173537] smpboot: CPU 19 is now offline
[   11.205680] smpboot: CPU 20 is now offline
[   11.231799] smpboot: CPU 21 is now offline
[   11.259562] smpboot: CPU 22 is now offline
[   11.288672] smpboot: CPU 23 is now offline
[   11.319691] smpboot: CPU 24 is now offline
[   11.355523] smpboot: CPU 25 is now offline
[   11.399469] smpboot: CPU 26 is now offline
[   11.435438] smpboot: CPU 27 is now offline
[   11.471402] smpboot: CPU 28 is now offline
[   11.515488] smpboot: CPU 29 is now offline
[   11.551355] smpboot: CPU 30 is now offline
[   11.591326] smpboot: CPU 31 is now offline
[   11.624004] smpboot: CPU 32 is now offline
[   11.659326] smpboot: CPU 33 is now offline
[   11.687478] smpboot: CPU 34 is now offline
[   11.719243] smpboot: CPU 35 is now offline
[   11.747252] smpboot: CPU 36 is now offline
[   11.771224] smpboot: CPU 37 is now offline
[   11.795089] [ cut here ]----
[   11.798052] kernel BUG at kernel/sched/core.c:7596!
[   11.798052] invalid opcode:  [#1] PREEMPT SMP PTI
[   11.798052] CPU: 38 PID: 202 Comm: migration/38 Tainted: GE 
5.10.0+ #6
[   11.798052] Hardware name: Microsoft Corporation Virtual Machine/Virtual 
Machine, BIOS 090008  12/07/2018
[   11.798052] Stopper: multi_cpu_stop+0x0/0xf0 <- 0x0
[   11.798052] RIP: 0010:sch

Re: [arm64] kernel BUG at kernel/seccomp.c:1309!

2020-11-23 Thread Gabriel Krisman Bertazi
Jann Horn  writes:

> On Mon, Nov 23, 2020 at 2:45 PM Arnd Bergmann  wrote:
>> On Mon, Nov 23, 2020 at 12:15 PM Naresh Kamboju
>>  wrote:
>> >
>> > While booting arm64 kernel the following kernel BUG noticed on several 
>> > arm64
>> > devices running linux next 20201123 tag kernel.
>> >
>> >
>> > $ git log --oneline next-20201120..next-20201123 -- kernel/seccomp.c
>> > 5c5c5fa055ea Merge remote-tracking branch 'seccomp/for-next/seccomp'
>> > bce6a8cba7bf Merge branch 'linus'
>> > 7ef95e3dbcee Merge branch 'for-linus/seccomp' into for-next/seccomp
>> > fab686eb0307 seccomp: Remove bogus __user annotations
>> > 0d831528 seccomp/cache: Report cache data through 
>> > /proc/pid/seccomp_cache
>> > 8e01b51a31a1 seccomp/cache: Add "emulator" to check if filter is constant 
>> > allow
>> > f9d480b6ffbe seccomp/cache: Lookup syscall allowlist bitmap for fast path
>> > 23d67a54857a seccomp: Migrate to use SYSCALL_WORK flag
>> >
>> >
>> > Please find these easy steps to reproduce the kernel build and boot.
>>
>> Adding Gabriel Krisman Bertazi to Cc, as the last patch (23d67a54857a) here
>> seems suspicious: it changes
>>
>> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
>> index 02aef2844c38..47763f3999f7 100644
>> --- a/include/linux/seccomp.h
>> +++ b/include/linux/seccomp.h
>> @@ -42,7 +42,7 @@ struct seccomp {
>>  extern int __secure_computing(const struct seccomp_data *sd);
>>  static inline int secure_computing(void)
>>  {
>> -   if (unlikely(test_thread_flag(TIF_SECCOMP)))
>> +   if (unlikely(test_syscall_work(SECCOMP)))
>> return  __secure_computing(NULL);
>> return 0;
>>  }
>>
>> which is in the call chain directly before
>>
>> int __secure_computing(const struct seccomp_data *sd)
>> {
>>int mode = current->seccomp.mode;
>>
>> ...
>> switch (mode) {
>> case SECCOMP_MODE_STRICT:
>> __secure_computing_strict(this_syscall);  /* may call 
>> do_exit */
>> return 0;
>> case SECCOMP_MODE_FILTER:
>> return __seccomp_filter(this_syscall, sd, false);
>> default:
>> BUG();
>> }
>> }
>>
>> Clearly, current->seccomp.mode is set to something other
>> than SECCOMP_MODE_STRICT or SECCOMP_MODE_FILTER
>> while the test_syscall_work(SECCOMP) returns true, and this
>> must have not been the case earlier.
>
> Ah, I think the problem is actually in
> 3136b93c3fb2b7c19e853e049203ff8f2b9dd2cd ("entry: Expose helpers to
> migrate TIF to SYSCALL_WORK flag"). In the !GENERIC_ENTRY case, it
> adds this code:
>
> +#define set_syscall_work(fl)   \
> +   set_ti_thread_flag(current_thread_info(), SYSCALL_WORK_##fl)
> +#define test_syscall_work(fl) \
> +   test_ti_thread_flag(current_thread_info(), SYSCALL_WORK_##fl)
> +#define clear_syscall_work(fl) \
> +   clear_ti_thread_flag(current_thread_info(), SYSCALL_WORK_##fl)
> +
> +#define set_task_syscall_work(t, fl) \
> +   set_ti_thread_flag(task_thread_info(t), TIF_##fl)
> +#define test_task_syscall_work(t, fl) \
> +   test_ti_thread_flag(task_thread_info(t), TIF_##fl)
> +#define clear_task_syscall_work(t, fl) \
> +   clear_ti_thread_flag(task_thread_info(t), TIF_##fl)
>
> but the SYSCALL_WORK_FLAGS are not valid on !GENERIC_ENTRY, we'll mix
> up (on arm64) SYSCALL_WORK_BIT_SECCOMP (==0) and TIF_SIGPENDING (==0).
>
> As part of fixing this, it might be a good idea to put "enum
> syscall_work_bit" behind a "#ifdef CONFIG_GENERIC_ENTRY" to avoid
> future accidents like this?

Hi Jan, Arnd,

That is correct.  This is a copy pasta mistake.  My apologies.  I didn't
have a !GENERIC_ENTRY device to test, but just the ifdef would have
caught it.

-- 
Gabriel Krisman Bertazi


Re: [arm64] kernel BUG at kernel/seccomp.c:1309!

2020-11-23 Thread Jann Horn
On Mon, Nov 23, 2020 at 2:45 PM Arnd Bergmann  wrote:
> On Mon, Nov 23, 2020 at 12:15 PM Naresh Kamboju
>  wrote:
> >
> > While booting arm64 kernel the following kernel BUG noticed on several arm64
> > devices running linux next 20201123 tag kernel.
> >
> >
> > $ git log --oneline next-20201120..next-20201123 -- kernel/seccomp.c
> > 5c5c5fa055ea Merge remote-tracking branch 'seccomp/for-next/seccomp'
> > bce6a8cba7bf Merge branch 'linus'
> > 7ef95e3dbcee Merge branch 'for-linus/seccomp' into for-next/seccomp
> > fab686eb0307 seccomp: Remove bogus __user annotations
> > 0d831528 seccomp/cache: Report cache data through 
> > /proc/pid/seccomp_cache
> > 8e01b51a31a1 seccomp/cache: Add "emulator" to check if filter is constant 
> > allow
> > f9d480b6ffbe seccomp/cache: Lookup syscall allowlist bitmap for fast path
> > 23d67a54857a seccomp: Migrate to use SYSCALL_WORK flag
> >
> >
> > Please find these easy steps to reproduce the kernel build and boot.
>
> Adding Gabriel Krisman Bertazi to Cc, as the last patch (23d67a54857a) here
> seems suspicious: it changes
>
> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
> index 02aef2844c38..47763f3999f7 100644
> --- a/include/linux/seccomp.h
> +++ b/include/linux/seccomp.h
> @@ -42,7 +42,7 @@ struct seccomp {
>  extern int __secure_computing(const struct seccomp_data *sd);
>  static inline int secure_computing(void)
>  {
> -   if (unlikely(test_thread_flag(TIF_SECCOMP)))
> +   if (unlikely(test_syscall_work(SECCOMP)))
> return  __secure_computing(NULL);
> return 0;
>  }
>
> which is in the call chain directly before
>
> int __secure_computing(const struct seccomp_data *sd)
> {
>int mode = current->seccomp.mode;
>
> ...
> switch (mode) {
> case SECCOMP_MODE_STRICT:
> __secure_computing_strict(this_syscall);  /* may call do_exit 
> */
> return 0;
> case SECCOMP_MODE_FILTER:
> return __seccomp_filter(this_syscall, sd, false);
> default:
> BUG();
> }
> }
>
> Clearly, current->seccomp.mode is set to something other
> than SECCOMP_MODE_STRICT or SECCOMP_MODE_FILTER
> while the test_syscall_work(SECCOMP) returns true, and this
> must have not been the case earlier.

Ah, I think the problem is actually in
3136b93c3fb2b7c19e853e049203ff8f2b9dd2cd ("entry: Expose helpers to
migrate TIF to SYSCALL_WORK flag"). In the !GENERIC_ENTRY case, it
adds this code:

+#define set_syscall_work(fl)   \
+   set_ti_thread_flag(current_thread_info(), SYSCALL_WORK_##fl)
+#define test_syscall_work(fl) \
+   test_ti_thread_flag(current_thread_info(), SYSCALL_WORK_##fl)
+#define clear_syscall_work(fl) \
+   clear_ti_thread_flag(current_thread_info(), SYSCALL_WORK_##fl)
+
+#define set_task_syscall_work(t, fl) \
+   set_ti_thread_flag(task_thread_info(t), TIF_##fl)
+#define test_task_syscall_work(t, fl) \
+   test_ti_thread_flag(task_thread_info(t), TIF_##fl)
+#define clear_task_syscall_work(t, fl) \
+   clear_ti_thread_flag(task_thread_info(t), TIF_##fl)

but the SYSCALL_WORK_FLAGS are not valid on !GENERIC_ENTRY, we'll mix
up (on arm64) SYSCALL_WORK_BIT_SECCOMP (==0) and TIF_SIGPENDING (==0).

As part of fixing this, it might be a good idea to put "enum
syscall_work_bit" behind a "#ifdef CONFIG_GENERIC_ENTRY" to avoid
future accidents like this?


Re: [arm64] kernel BUG at kernel/seccomp.c:1309!

2020-11-23 Thread Arnd Bergmann
On Mon, Nov 23, 2020 at 12:15 PM Naresh Kamboju
 wrote:
>
> While booting arm64 kernel the following kernel BUG noticed on several arm64
> devices running linux next 20201123 tag kernel.
>
>
> $ git log --oneline next-20201120..next-20201123 -- kernel/seccomp.c
> 5c5c5fa055ea Merge remote-tracking branch 'seccomp/for-next/seccomp'
> bce6a8cba7bf Merge branch 'linus'
> 7ef95e3dbcee Merge branch 'for-linus/seccomp' into for-next/seccomp
> fab686eb0307 seccomp: Remove bogus __user annotations
> 0d831528 seccomp/cache: Report cache data through /proc/pid/seccomp_cache
> 8e01b51a31a1 seccomp/cache: Add "emulator" to check if filter is constant 
> allow
> f9d480b6ffbe seccomp/cache: Lookup syscall allowlist bitmap for fast path
> 23d67a54857a seccomp: Migrate to use SYSCALL_WORK flag
>
>
> Please find these easy steps to reproduce the kernel build and boot.

Adding Gabriel Krisman Bertazi to Cc, as the last patch (23d67a54857a) here
seems suspicious: it changes

diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 02aef2844c38..47763f3999f7 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -42,7 +42,7 @@ struct seccomp {
 extern int __secure_computing(const struct seccomp_data *sd);
 static inline int secure_computing(void)
 {
-   if (unlikely(test_thread_flag(TIF_SECCOMP)))
+   if (unlikely(test_syscall_work(SECCOMP)))
return  __secure_computing(NULL);
return 0;
 }

which is in the call chain directly before

int __secure_computing(const struct seccomp_data *sd)
{
   int mode = current->seccomp.mode;

...
switch (mode) {
case SECCOMP_MODE_STRICT:
__secure_computing_strict(this_syscall);  /* may call do_exit */
return 0;
case SECCOMP_MODE_FILTER:
return __seccomp_filter(this_syscall, sd, false);
default:
BUG();
}
}

Clearly, current->seccomp.mode is set to something other
than SECCOMP_MODE_STRICT or SECCOMP_MODE_FILTER
while the test_syscall_work(SECCOMP) returns true, and this
must have not been the case earlier.

 Arnd

>
> step to reproduce:
> # please install tuxmake
> # sudo pip3 install -U tuxmake
> # cd linux-next
> # tuxmake --runtime docker --target-arch arm --toolchain gcc-9
> --kconfig defconfig --kconfig-add
> https://builds.tuxbuild.com/1kgWN61pS5M35vjnVfDSvOOPd38/config
>
> # Boot the arm64 on any arm64 devices.
> # you will notice the below BUG
>
> crash log details:
> ---
> [6.941012] [ cut here ]--------
> Found device  /dev/ttyAMA3.
> [6.947587] lima f408.gpu: mod rate = 5
> [6.955422] kernel BUG at kernel/seccomp.c:1309!
> [6.955430] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
> [6.955437] Modules linked in: cec rfkill wlcore_sdio(+) kirin_drm
> dw_drm_dsi lima(+) drm_kms_helper gpu_sched drm fuse
> [6.955481] CPU: 2 PID: 291 Comm: systemd-udevd Not tainted
> 5.10.0-rc4-next-20201123 #2
> [6.955485] Hardware name: HiKey Development Board (DT)
> [6.955493] pstate: 8005 (Nzcv daif -PAN -UAO -TCO BTYPE=--)
> [6.955510] pc : __secure_computing+0xe0/0xe8
> [6.958171] mmc_host mmc2: Bus speed (slot 0) = 2480Hz (slot
> req 40Hz, actual 40HZ div = 31)
> [6.965975] [drm] Initialized lima 1.1.0 20191231 for f408.gpu on 
> minor 0
> [6.970176] lr : syscall_trace_enter+0x1cc/0x218
> [6.970181] sp : 800012d8be10
> [6.970185] x29: 800012d8be10 x28: 0092cb00
> [6.970195] x27:  x26: 
> [6.970203] x25:  x24: 
> [6.970210] x23: 6000 x22: 0202
> [7.011614] mmc_host mmc2: Bus speed (slot 0) = 2480Hz (slot
> req 2500Hz, actual 2480HZ div = 0)
> [7.016457]
> [7.016461] x21: 0200 x20: 0092cb00
> [7.016470] x19: 800012d8bec0 x18: 
> [7.016478] x17:  x16: 
> [7.016485] x15:  x14: 
> [7.054116] mmc_host mmc2: Bus speed (slot 0) = 2480Hz (slot
> req 40Hz, actual 40HZ div = 31)
> [7.056715]
> [7.103444] mmc_host mmc2: Bus speed (slot 0) = 2480Hz (slot
> req 2500Hz, actual 2480HZ div = 0)
> [7.105105] x13:  x12: 
> [7.125849] x11:  x10: 
> [7.125858] x9 : 80001001bcbc x8 : 
> [7.125865] x7 :  x6 : 
> [7.125871] x5 :  x4 : 
> [7.125879] x3 :  x2 : 0092cb00
> [7.125886] x1 :  

[arm64] kernel BUG at kernel/seccomp.c:1309!

2020-11-23 Thread Naresh Kamboju
While booting arm64 kernel the following kernel BUG noticed on several arm64
devices running linux next 20201123 tag kernel.


$ git log --oneline next-20201120..next-20201123 -- kernel/seccomp.c
5c5c5fa055ea Merge remote-tracking branch 'seccomp/for-next/seccomp'
bce6a8cba7bf Merge branch 'linus'
7ef95e3dbcee Merge branch 'for-linus/seccomp' into for-next/seccomp
fab686eb0307 seccomp: Remove bogus __user annotations
0d831528 seccomp/cache: Report cache data through /proc/pid/seccomp_cache
8e01b51a31a1 seccomp/cache: Add "emulator" to check if filter is constant allow
f9d480b6ffbe seccomp/cache: Lookup syscall allowlist bitmap for fast path
23d67a54857a seccomp: Migrate to use SYSCALL_WORK flag


Please find these easy steps to reproduce the kernel build and boot.

step to reproduce:
# please install tuxmake
# sudo pip3 install -U tuxmake
# cd linux-next
# tuxmake --runtime docker --target-arch arm --toolchain gcc-9
--kconfig defconfig --kconfig-add
https://builds.tuxbuild.com/1kgWN61pS5M35vjnVfDSvOOPd38/config

# Boot the arm64 on any arm64 devices.
# you will notice the below BUG

crash log details:
---
[6.941012] [ cut here ]
Found device  /dev/ttyAMA3.
[6.947587] lima f408.gpu: mod rate = 5
[6.955422] kernel BUG at kernel/seccomp.c:1309!
[6.955430] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
[6.955437] Modules linked in: cec rfkill wlcore_sdio(+) kirin_drm
dw_drm_dsi lima(+) drm_kms_helper gpu_sched drm fuse
[6.955481] CPU: 2 PID: 291 Comm: systemd-udevd Not tainted
5.10.0-rc4-next-20201123 #2
[6.955485] Hardware name: HiKey Development Board (DT)
[6.955493] pstate: 8005 (Nzcv daif -PAN -UAO -TCO BTYPE=--)
[6.955510] pc : __secure_computing+0xe0/0xe8
[6.958171] mmc_host mmc2: Bus speed (slot 0) = 2480Hz (slot
req 40Hz, actual 40HZ div = 31)
[6.965975] [drm] Initialized lima 1.1.0 20191231 for f408.gpu on minor 0
[6.970176] lr : syscall_trace_enter+0x1cc/0x218
[6.970181] sp : 800012d8be10
[6.970185] x29: 800012d8be10 x28: 0092cb00
[6.970195] x27:  x26: 
[6.970203] x25:  x24: 
[6.970210] x23: 6000 x22: 0202
[7.011614] mmc_host mmc2: Bus speed (slot 0) = 2480Hz (slot
req 2500Hz, actual 2480HZ div = 0)
[7.016457]
[7.016461] x21: 0200 x20: 0092cb00
[7.016470] x19: 800012d8bec0 x18: 
[7.016478] x17:  x16: 
[7.016485] x15:  x14: 
[7.054116] mmc_host mmc2: Bus speed (slot 0) = 2480Hz (slot
req 40Hz, actual 40HZ div = 31)
[7.056715]
[7.103444] mmc_host mmc2: Bus speed (slot 0) = 2480Hz (slot
req 2500Hz, actual 2480HZ div = 0)
[7.105105] x13:  x12: 
[7.125849] x11:  x10: 
[7.125858] x9 : 80001001bcbc x8 : 
[7.125865] x7 :  x6 : 
[7.125871] x5 :  x4 : 
[7.125879] x3 :  x2 : 0092cb00
[7.125886] x1 :  x0 : 0116
[7.125896] Call trace:
] Found device /dev/ttyAMA2.
[7.125908]  __secure_computing+0xe0/0xe8
[7.125918]  syscall_trace_enter+0x1cc/0x218
[7.125927]  el0_svc_common.constprop.0+0x19c/0x1b8
[7.125933]  do_el0_svc+0x2c/0x98
[7.125940]  el0_sync_handler+0x180/0x188
[7.125946]  el0_sync+0x174/0x180
[7.125958] Code: d2800121 97ffd9a9 d2800120 97fbf1a9 (d421)
[7.199584] ---[ end trace 463debbc21f0c7b5 ]---
[7.204205] note: systemd-udevd[291] exited with preempt_count 1
[7.210733] [ cut here ]
[7.215451] WARNING: CPU: 2 PID: #
0 at kernel/rcu/tree.c:632 rcu_eqs_enter.isra.0+0x134/0x140
[7.223927] Modules linked in: cec rfkill wlcore_sdio kirin_drm
dw_drm_dsi lima drm_kms_helper gpu_sched drm fuse
[7.234295] CPU: 2 PID: 0 Comm: swapper/2 Tainted: G  D
  5.10.0-rc4-next-20201123 #2
[7.243252] Hardware name: HiKey Development Board (DT)
[7.248561] pstate: 23c5 (nzCv DAIF -PAN -UAO -TCO BTYPE=--)
[7.254638] pc : rcu_eqs_enter.isra.0+0x134/0x140
[7.259350] lr : rcu_idle_enter+0x18/0x28
[7.263362] sp : 8000128e3e80
[7.266678] x29: 8000128e3e80 x28: 
[7.272001] x27:  x26: 01b79080
[7.277321] x25:  x24: 0001adc9b310
[7.282641] x23:  x22: 01b79080
[7.287970] x21: 77b24b00 x20: 01b79098
[7.287979] x19: 800011c7ab40 x18: 0010
[7.287986] x17:  x16: 
[7.287993] x15: 0092cf98 x14: 0720072007200720
[7.288001] x13: 0720072007200720 x12: 03c6
[7.2

Re: PPPd Bug with kernel 5.8.5 : task pppd:6009 blocked for more than 122 seconds.

2020-09-01 Thread Martin Zaharinov
With small research in ppp_generic.c may be a problem is come in this section:

static int ppp_nl_newlink(struct net *src_net, struct net_device *dev,
  struct nlattr *tb[], struct nlattr *data[],
  struct netlink_ext_ack *extack)
{
struct ppp_config conf = {
.unit = -1,
.ifname_is_set = true,
};
struct file *file;
int err;

file = fget(nla_get_s32(data[IFLA_PPP_DEV_FD]));
if (!file)
return -EBADF;

/* rtnl_lock is already held here, but ppp_create_interface() locks
 * ppp_mutex before holding rtnl_lock. Using mutex_trylock() avoids
 * possible deadlock due to lock order inversion, at the cost of
 * pushing the problem back to userspace.
 */
if (!mutex_trylock(_mutex)) {
err = -EBUSY;
goto out;
}

if (file->f_op != _device_fops || file->private_data) {
err = -EBADF;
goto out_unlock;
}

conf.file = file;

/* Don't use device name generated by the rtnetlink layer when ifname
 * isn't specified. Let ppp_dev_configure() set the device name using
 * the PPP unit identifer as suffix (i.e. ppp). This allows
 * userspace to infer the device name using to the PPPIOCGUNIT ioctl.
 */
if (!tb[IFLA_IFNAME])
conf.ifname_is_set = false;

err = ppp_dev_configure(src_net, dev, );

out_unlock:
mutex_unlock(_mutex);
out:
fput(file);

return err;
}

Need people that under stand this part of kernel code.

> On 1 Sep 2020, at 22:36, Martin Zaharinov  wrote:
> 
> HI all kernel team
> 
> We detect bug in kernel with pppd server 
> When disconnect and connect 1000+ users after many time kernel have lock  
> please see debug log :
> 
> 
> Aug 30 05:34:55 10.20.22.1 [139221.203949][  T817] INFO: task pppd:6009 
> blocked for more than 122 seconds.
> Aug 30 05:34:55 10.20.22.1 [139221.231206][  T817]   Tainted: G   
> O  5.8.5 #1
> Aug 30 05:34:55 10.20.22.1 [139221.257538][  T817] "echo 0 > 
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Aug 30 05:34:55 10.20.22.1 [139221.309257][  T817] pppdD0  
> 6009  1 0x80004006
> Aug 30 05:34:55 10.20.22.1 [139221.335289][  T817] Call Trace:
> Aug 30 05:34:55 10.20.22.1 [139221.360927][  T817]  __schedule+0x374/0x750
> Aug 30 05:34:55 10.20.22.1 [139221.386660][  T817]  
> schedule_preempt_disabled+0x50/0xc0
> Aug 30 05:34:55 10.20.22.1 [139221.412825][  T817]  
> __mutex_lock.constprop.0+0x15d/0x490
> Aug 30 05:34:55 10.20.22.1 [139221.439154][  T817]  ppp_release+0xa8/0xe0 
> [ppp_generic]
> Aug 30 05:34:55 10.20.22.1 [139221.465557][  T817]  __fput+0xc8/0x220
> Aug 30 05:34:55 10.20.22.1 [139221.491144][  T817]  task_work_run+0x57/0x90
> Aug 30 05:34:55 10.20.22.1 [139221.516457][  T817]  do_exit+0x317/0xa60
> Aug 30 05:34:55 10.20.22.1 [139221.541571][  T817]  do_group_exit+0x2e/0x90
> Aug 30 05:34:55 10.20.22.1 [139221.566159][  T817]  get_signal+0x11b/0x7a0
> Aug 30 05:34:55 10.20.22.1 [139221.590102][  T817]  do_signal+0x25/0x1f0
> Aug 30 05:34:55 10.20.22.1 [139221.613380][  T817]  ? ktime_get_ts64+0x3e/0xe0
> Aug 30 05:34:55 10.20.22.1 [139221.636164][  T817]  ? _copy_to_user+0x22/0x30
> Aug 30 05:34:55 10.20.22.1 [139221.658333][  T817]  
> __prepare_exit_to_usermode+0xd6/0x140
> Aug 30 05:34:55 10.20.22.1 [139221.680230][  T817]  do_syscall_64+0x50/0x120
> Aug 30 05:34:55 10.20.22.1 [139221.701469][  T817]  
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
> Aug 30 05:34:55 10.20.22.1 [139221.722554][  T817] RIP: 0033:0x7f8387041ae6
> Aug 30 05:34:55 10.20.22.1 [139221.743000][  T817] Code: Bad RIP value.
> Aug 30 05:34:55 10.20.22.1 [139221.762781][  T817] RSP: 002b:7ffcee1e4fa8 
> EFLAGS: 0246 ORIG_RAX: 0017
> Aug 30 05:34:55 10.20.22.1 [139221.802079][  T817] RAX: fdfe RBX: 
>  RCX: 7f8387041ae6
> Aug 30 05:34:56 10.20.22.1 [139221.841537][  T817] RDX:  RSI: 
> 7ffcee1e50b0 RDI: 000d
> Aug 30 05:34:56 10.20.22.1 [139221.882678][  T817] RBP: 7ffcee1e50b0 R08: 
> 7ffcee1e5030 R09: 0010
> Aug 30 05:34:56 10.20.22.1 [139221.924623][  T817] R10:  R11: 
> 0246 R12: 1000
> Aug 30 05:34:56 10.20.22.1 [139221.967198][  T817] R13: 000c R14: 
>  R15: 7ffcee1e7130
> Aug 30 05:34:56 10.20.22.1 [139222.011266][  T817] INFO: task pppd:6092 
> blocked for more than 123 seconds.
> Aug 30 05:34:56 10.20.22.1 [139222.036779][  T817]   Tainted: G   
> O  5.8.5 #1
> Aug 30 05:34:56 10.20.22.1 [1

PPPd Bug with kernel 5.8.5 : task pppd:6009 blocked for more than 122 seconds.

2020-09-01 Thread Martin Zaharinov
HI all kernel team

We detect bug in kernel with pppd server 
When disconnect and connect 1000+ users after many time kernel have lock  
please see debug log :


Aug 30 05:34:55 10.20.22.1 [139221.203949][  T817] INFO: task pppd:6009 blocked 
for more than 122 seconds.
Aug 30 05:34:55 10.20.22.1 [139221.231206][  T817]   Tainted: G   O 
 5.8.5 #1
Aug 30 05:34:55 10.20.22.1 [139221.257538][  T817] "echo 0 > 
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 30 05:34:55 10.20.22.1 [139221.309257][  T817] pppdD0  6009 
 1 0x80004006
Aug 30 05:34:55 10.20.22.1 [139221.335289][  T817] Call Trace:
Aug 30 05:34:55 10.20.22.1 [139221.360927][  T817]  __schedule+0x374/0x750
Aug 30 05:34:55 10.20.22.1 [139221.386660][  T817]  
schedule_preempt_disabled+0x50/0xc0
Aug 30 05:34:55 10.20.22.1 [139221.412825][  T817]  
__mutex_lock.constprop.0+0x15d/0x490
Aug 30 05:34:55 10.20.22.1 [139221.439154][  T817]  ppp_release+0xa8/0xe0 
[ppp_generic]
Aug 30 05:34:55 10.20.22.1 [139221.465557][  T817]  __fput+0xc8/0x220
Aug 30 05:34:55 10.20.22.1 [139221.491144][  T817]  task_work_run+0x57/0x90
Aug 30 05:34:55 10.20.22.1 [139221.516457][  T817]  do_exit+0x317/0xa60
Aug 30 05:34:55 10.20.22.1 [139221.541571][  T817]  do_group_exit+0x2e/0x90
Aug 30 05:34:55 10.20.22.1 [139221.566159][  T817]  get_signal+0x11b/0x7a0
Aug 30 05:34:55 10.20.22.1 [139221.590102][  T817]  do_signal+0x25/0x1f0
Aug 30 05:34:55 10.20.22.1 [139221.613380][  T817]  ? ktime_get_ts64+0x3e/0xe0
Aug 30 05:34:55 10.20.22.1 [139221.636164][  T817]  ? _copy_to_user+0x22/0x30
Aug 30 05:34:55 10.20.22.1 [139221.658333][  T817]  
__prepare_exit_to_usermode+0xd6/0x140
Aug 30 05:34:55 10.20.22.1 [139221.680230][  T817]  do_syscall_64+0x50/0x120
Aug 30 05:34:55 10.20.22.1 [139221.701469][  T817]  
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Aug 30 05:34:55 10.20.22.1 [139221.722554][  T817] RIP: 0033:0x7f8387041ae6
Aug 30 05:34:55 10.20.22.1 [139221.743000][  T817] Code: Bad RIP value.
Aug 30 05:34:55 10.20.22.1 [139221.762781][  T817] RSP: 002b:7ffcee1e4fa8 
EFLAGS: 0246 ORIG_RAX: 0017
Aug 30 05:34:55 10.20.22.1 [139221.802079][  T817] RAX: fdfe RBX: 
 RCX: 7f8387041ae6
Aug 30 05:34:56 10.20.22.1 [139221.841537][  T817] RDX:  RSI: 
7ffcee1e50b0 RDI: 000d
Aug 30 05:34:56 10.20.22.1 [139221.882678][  T817] RBP: 7ffcee1e50b0 R08: 
7ffcee1e5030 R09: 0010
Aug 30 05:34:56 10.20.22.1 [139221.924623][  T817] R10:  R11: 
0246 R12: 1000
Aug 30 05:34:56 10.20.22.1 [139221.967198][  T817] R13: 000c R14: 
 R15: 7ffcee1e7130
Aug 30 05:34:56 10.20.22.1 [139222.011266][  T817] INFO: task pppd:6092 blocked 
for more than 123 seconds.
Aug 30 05:34:56 10.20.22.1 [139222.036779][  T817]   Tainted: G   O 
 5.8.5 #1
Aug 30 05:34:56 10.20.22.1 [139222.060127][  T817] "echo 0 > 
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 30 05:34:56 10.20.22.1 [139222.107395][  T817] pppdD0  6092 
 1 0x80004006
Aug 30 05:34:56 10.20.22.1 [139222.131680][  T817] Call Trace:
Aug 30 05:34:56 10.20.22.1 [139222.155515][  T817]  __schedule+0x374/0x750
Aug 30 05:34:56 10.20.22.1 [139222.179548][  T817]  
schedule_preempt_disabled+0x50/0xc0
Aug 30 05:34:56 10.20.22.1 [139222.203987][  T817]  
__mutex_lock.constprop.0+0x15d/0x490
Aug 30 05:34:56 10.20.22.1 [139222.228677][  T817]  ppp_release+0xa8/0xe0 
[ppp_generic]
Aug 30 05:34:56 10.20.22.1 [139222.253507][  T817]  __fput+0xc8/0x220
Aug 30 05:34:56 10.20.22.1 [139222.278100][  T817]  task_work_run+0x57/0x90
Aug 30 05:34:56 10.20.22.1 [139222.302749][  T817]  do_exit+0x317/0xa60
Aug 30 05:34:56 10.20.22.1 [139222.327154][  T817]  do_group_exit+0x2e/0x90
Aug 30 05:34:56 10.20.22.1 [139222.351352][  T817]  get_signal+0x11b/0x7a0
Aug 30 05:34:56 10.20.22.1 [139222.375277][  T817]  ? set_next_entity+0xcd/0x340
Aug 30 05:34:56 10.20.22.1 [139222.399090][  T817]  do_signal+0x25/0x1f0
Aug 30 05:34:56 10.20.22.1 [139222.422846][  T817]  ? __schedule+0x1af/0x750
Aug 30 05:34:56 10.20.22.1 [139222.446246][  T817]  
__prepare_exit_to_usermode+0xd6/0x140
Aug 30 05:34:56 10.20.22.1 [139222.469384][  T817]  do_syscall_64+0x50/0x120
Aug 30 05:34:56 10.20.22.1 [139222.491919][  T817]  
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Aug 30 05:34:56 10.20.22.1 [139222.514294][  T817] RIP: 0033:0x7f5850cafae6
Aug 30 05:34:56 10.20.22.1 [139222.536053][  T817] Code: Bad RIP value.
Aug 30 05:34:56 10.20.22.1 [139222.557135][  T817] RSP: 002b:7fff80131af8 
EFLAGS: 0246 ORIG_RAX: 0017
Aug 30 05:34:56 10.20.22.1 [139222.598929][  T817] RAX: fdfe RBX: 
 RCX: 7f5850cafae6
Aug 30 05:34:56 10.20.22.1 [139222.640656][  T817] RDX:  RSI: 
7fff80131c00 RDI: 000d
Aug 30 05:34:56 10.20.22.1 [139222.682379][  T817]

Re: kernel BUG at kernel/fork.c:LINE!

2020-08-23 Thread Shakeel Butt
On Sun, Aug 23, 2020 at 10:34 AM Randy Dunlap  wrote:
>
> On 8/7/20 2:16 PM, syzbot wrote:
> > Hello,
> >
> > syzbot found the following issue on:
> >
> > HEAD commit:fffe3ae0 Merge tag 'for-linus-hmm' of git://git.kernel.org..
> > git tree:   upstream
> > console output: https://syzkaller.appspot.com/x/log.txt?x=1194d90a90
> > kernel config:  https://syzkaller.appspot.com/x/.config?x=18bb86f2e4ebfda2
> > dashboard link: https://syzkaller.appspot.com/bug?extid=3776ecd80aac504e6085
> > compiler:   gcc (GCC) 10.1.0-syz 20200507
> >
> > Unfortunately, I don't have any reproducer for this issue yet.
> >
> > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > Reported-by: syzbot+3776ecd80aac504e6...@syzkaller.appspotmail.com
>
> Is this fixed by
>
> commit 991e7673859ed41e7ba83c8c4e57afe8cfebe314
> Author: Shakeel Butt 
> Date:   Thu Aug 6 23:21:37 2020 -0700
>
> mm: memcontrol: account kernel stack per node
>
>
> since the BUG_ON() at line 390 is removed by this patch.
>

It's not really removed. The BUG_ON() has been moved to line 402. So,
this issue might still occur.

>
> > [ cut here ]
> > kernel BUG at kernel/fork.c:390!
> > invalid opcode:  [#1] PREEMPT SMP KASAN
> > CPU: 0 PID: 5239 Comm: syz-executor.1 Not tainted 5.8.0-syzkaller #0
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> > Google 01/01/2011
> > RIP: 0010:account_kernel_stack+0x297/0x320 kernel/fork.c:390
> > Code: 89 e2 be 23 00 00 00 48 89 ef c1 e2 05 e8 81 9e 75 00 48 83 c4 10 5b 
> > 5d 41 5c 41 5d 41 5e 41 5f e9 ae c9 2f 00 e8 a9 c9 2f 00 <0f> 0b e8 f2 50 
> > 6f 00 e9 d2 fd ff ff e8 98 c9 2f 00 48 c7 c6 20 24
> > RSP: 0018:c90015e4f850 EFLAGS: 00010216
> > RAX: 01f4 RBX:  RCX: c90017983000
> > RDX: 0004 RSI: 81445327 RDI: 0005
> > RBP:  R08: 0001 R09: 8880a2ef9663
> > R10: 0008 R11:  R12: 
> > R13: 8880001b2280 R14: 88809e4fa840 R15: 
> > FS:  7f7f035e5700() GS:8880ae60() knlGS:
> > CS:  0010 DS:  ES:  CR0: 80050033
> > CR2: 001b2fd3 CR3: 9b747000 CR4: 001506f0
> > DR0:  DR1:  DR2: 
> > DR3:  DR6: fffe0ff0 DR7: 0400
> > Call Trace:
> >  release_task_stack kernel/fork.c:447 [inline]
> >  put_task_stack+0xc4/0x230 kernel/fork.c:459
> >  finish_task_switch+0x52a/0x750 kernel/sched/core.c:3649
> >  context_switch kernel/sched/core.c:3781 [inline]
> >  __schedule+0x8ed/0x21e0 kernel/sched/core.c:4527
> >  schedule+0xd0/0x2a0 kernel/sched/core.c:4602
> >  freezable_schedule include/linux/freezer.h:172 [inline]
> >  futex_wait_queue_me+0x2a7/0x570 kernel/futex.c:2588
> >  futex_wait+0x1df/0x560 kernel/futex.c:2690
> >  do_futex+0x15b/0x1a60 kernel/futex.c:3749
> >  __do_sys_futex kernel/futex.c:3810 [inline]
> >  __se_sys_futex kernel/futex.c:3778 [inline]
> >  __x64_sys_futex+0x378/0x4e0 kernel/futex.c:3778
> >  do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
> >  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > RIP: 0033:0x45ccd9
> > Code: 2d b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 
> > 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff 
> > ff 0f 83 fb b5 fb ff c3 66 2e 0f 1f 84 00 00 00 00
> > RSP: 002b:7f7f035e4cf8 EFLAGS: 0246 ORIG_RAX: 00ca
> > RAX: ffda RBX: 0078bfa8 RCX: 0045ccd9
> > RDX:  RSI: 0080 RDI: 0078bfa8
> > RBP: 0078bfa0 R08:  R09: 
> > R10:  R11: 0246 R12: 0078bfac
> > R13: 7fff896cb67f R14: 7f7f035e59c0 R15: 0078bfac
> > Modules linked in:
> > ---[ end trace ff14b6c5822b8142 ]---
> > RIP: 0010:account_kernel_stack+0x297/0x320 kernel/fork.c:390
> > Code: 89 e2 be 23 00 00 00 48 89 ef c1 e2 05 e8 81 9e 75 00 48 83 c4 10 5b 
> > 5d 41 5c 41 5d 41 5e 41 5f e9 ae c9 2f 00 e8 a9 c9 2f 00 <0f> 0b e8 f2 50 
> > 6f 00 e9 d2 fd ff ff e8 98 c9 2f 00 48 c7 c6 20 24
> > RSP: 0018:c90015e4f850 EFLAGS: 00010216
> > RAX: 01f4 RBX:  RCX: c90017983000
> > RDX: 0004 RSI: 81445327 RDI: 0005
> > RBP:  R08: 00

Re: kernel BUG at kernel/fork.c:LINE!

2020-08-23 Thread Randy Dunlap
On 8/7/20 2:16 PM, syzbot wrote:
> Hello,
> 
> syzbot found the following issue on:
> 
> HEAD commit:fffe3ae0 Merge tag 'for-linus-hmm' of git://git.kernel.org..
> git tree:   upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=1194d90a90
> kernel config:  https://syzkaller.appspot.com/x/.config?x=18bb86f2e4ebfda2
> dashboard link: https://syzkaller.appspot.com/bug?extid=3776ecd80aac504e6085
> compiler:   gcc (GCC) 10.1.0-syz 20200507
> 
> Unfortunately, I don't have any reproducer for this issue yet.
> 
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+3776ecd80aac504e6...@syzkaller.appspotmail.com

Is this fixed by

commit 991e7673859ed41e7ba83c8c4e57afe8cfebe314
Author: Shakeel Butt 
Date:   Thu Aug 6 23:21:37 2020 -0700

mm: memcontrol: account kernel stack per node


since the BUG_ON() at line 390 is removed by this patch.


> ----[ cut here ]
> kernel BUG at kernel/fork.c:390!
> invalid opcode:  [#1] PREEMPT SMP KASAN
> CPU: 0 PID: 5239 Comm: syz-executor.1 Not tainted 5.8.0-syzkaller #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 01/01/2011
> RIP: 0010:account_kernel_stack+0x297/0x320 kernel/fork.c:390
> Code: 89 e2 be 23 00 00 00 48 89 ef c1 e2 05 e8 81 9e 75 00 48 83 c4 10 5b 5d 
> 41 5c 41 5d 41 5e 41 5f e9 ae c9 2f 00 e8 a9 c9 2f 00 <0f> 0b e8 f2 50 6f 00 
> e9 d2 fd ff ff e8 98 c9 2f 00 48 c7 c6 20 24
> RSP: 0018:c90015e4f850 EFLAGS: 00010216
> RAX: 01f4 RBX:  RCX: c90017983000
> RDX: 0004 RSI: 81445327 RDI: 0005
> RBP:  R08: 0001 R09: 8880a2ef9663
> R10: 0008 R11:  R12: 
> R13: 8880001b2280 R14: 88809e4fa840 R15: 
> FS:  7f7f035e5700() GS:8880ae60() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 001b2fd3 CR3: 9b747000 CR4: 001506f0
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0400
> Call Trace:
>  release_task_stack kernel/fork.c:447 [inline]
>  put_task_stack+0xc4/0x230 kernel/fork.c:459
>  finish_task_switch+0x52a/0x750 kernel/sched/core.c:3649
>  context_switch kernel/sched/core.c:3781 [inline]
>  __schedule+0x8ed/0x21e0 kernel/sched/core.c:4527
>  schedule+0xd0/0x2a0 kernel/sched/core.c:4602
>  freezable_schedule include/linux/freezer.h:172 [inline]
>  futex_wait_queue_me+0x2a7/0x570 kernel/futex.c:2588
>  futex_wait+0x1df/0x560 kernel/futex.c:2690
>  do_futex+0x15b/0x1a60 kernel/futex.c:3749
>  __do_sys_futex kernel/futex.c:3810 [inline]
>  __se_sys_futex kernel/futex.c:3778 [inline]
>  __x64_sys_futex+0x378/0x4e0 kernel/futex.c:3778
>  do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> RIP: 0033:0x45ccd9
> Code: 2d b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 
> 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 
> 83 fb b5 fb ff c3 66 2e 0f 1f 84 00 00 00 00
> RSP: 002b:7f7f035e4cf8 EFLAGS: 0246 ORIG_RAX: 00ca
> RAX: ffda RBX: 0078bfa8 RCX: 0045ccd9
> RDX:  RSI: 0080 RDI: 0078bfa8
> RBP: 0078bfa0 R08:  R09: 
> R10:  R11: 0246 R12: 0078bfac
> R13: 7fff896cb67f R14: 7f7f035e59c0 R15: 0078bfac
> Modules linked in:
> ---[ end trace ff14b6c5822b8142 ]---
> RIP: 0010:account_kernel_stack+0x297/0x320 kernel/fork.c:390
> Code: 89 e2 be 23 00 00 00 48 89 ef c1 e2 05 e8 81 9e 75 00 48 83 c4 10 5b 5d 
> 41 5c 41 5d 41 5e 41 5f e9 ae c9 2f 00 e8 a9 c9 2f 00 <0f> 0b e8 f2 50 6f 00 
> e9 d2 fd ff ff e8 98 c9 2f 00 48 c7 c6 20 24
> RSP: 0018:c90015e4f850 EFLAGS: 00010216
> RAX: 01f4 RBX:  RCX: c90017983000
> RDX: 0004 RSI: 81445327 RDI: 0005
> RBP:  R08: 0001 R09: 8880a2ef9663
> R10: 0008 R11:  R12: 
> R13: 8880001b2280 R14: 88809e4fa840 R15: 
> FS:  7f7f035e5700() GS:8880ae60() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 001b2fd3 CR3: 9b747000 CR4: 001506f0
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0400
> 
> 
> ---
> This report is generated by a bo

kernel BUG at kernel/fork.c:LINE!

2020-08-07 Thread syzbot
Hello,

syzbot found the following issue on:

HEAD commit:fffe3ae0 Merge tag 'for-linus-hmm' of git://git.kernel.org..
git tree:   upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=1194d90a90
kernel config:  https://syzkaller.appspot.com/x/.config?x=18bb86f2e4ebfda2
dashboard link: https://syzkaller.appspot.com/bug?extid=3776ecd80aac504e6085
compiler:   gcc (GCC) 10.1.0-syz 20200507

Unfortunately, I don't have any reproducer for this issue yet.

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+3776ecd80aac504e6...@syzkaller.appspotmail.com

[ cut here ]
kernel BUG at kernel/fork.c:390!
invalid opcode:  [#1] PREEMPT SMP KASAN
CPU: 0 PID: 5239 Comm: syz-executor.1 Not tainted 5.8.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 
01/01/2011
RIP: 0010:account_kernel_stack+0x297/0x320 kernel/fork.c:390
Code: 89 e2 be 23 00 00 00 48 89 ef c1 e2 05 e8 81 9e 75 00 48 83 c4 10 5b 5d 
41 5c 41 5d 41 5e 41 5f e9 ae c9 2f 00 e8 a9 c9 2f 00 <0f> 0b e8 f2 50 6f 00 e9 
d2 fd ff ff e8 98 c9 2f 00 48 c7 c6 20 24
RSP: 0018:c90015e4f850 EFLAGS: 00010216
RAX: 01f4 RBX:  RCX: c90017983000
RDX: 0004 RSI: 81445327 RDI: 0005
RBP:  R08: 0001 R09: 8880a2ef9663
R10: 0008 R11:  R12: 
R13: 8880001b2280 R14: 88809e4fa840 R15: 
FS:  7f7f035e5700() GS:8880ae60() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 001b2fd3 CR3: 9b747000 CR4: 001506f0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
Call Trace:
 release_task_stack kernel/fork.c:447 [inline]
 put_task_stack+0xc4/0x230 kernel/fork.c:459
 finish_task_switch+0x52a/0x750 kernel/sched/core.c:3649
 context_switch kernel/sched/core.c:3781 [inline]
 __schedule+0x8ed/0x21e0 kernel/sched/core.c:4527
 schedule+0xd0/0x2a0 kernel/sched/core.c:4602
 freezable_schedule include/linux/freezer.h:172 [inline]
 futex_wait_queue_me+0x2a7/0x570 kernel/futex.c:2588
 futex_wait+0x1df/0x560 kernel/futex.c:2690
 do_futex+0x15b/0x1a60 kernel/futex.c:3749
 __do_sys_futex kernel/futex.c:3810 [inline]
 __se_sys_futex kernel/futex.c:3778 [inline]
 __x64_sys_futex+0x378/0x4e0 kernel/futex.c:3778
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x45ccd9
Code: 2d b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 
89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 
fb b5 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:7f7f035e4cf8 EFLAGS: 0246 ORIG_RAX: 00ca
RAX: ffda RBX: 0078bfa8 RCX: 0045ccd9
RDX:  RSI: 0080 RDI: 0078bfa8
RBP: 0078bfa0 R08:  R09: 
R10:  R11: 0246 R12: 0078bfac
R13: 7fff896cb67f R14: 7f7f035e59c0 R15: 0078bfac
Modules linked in:
---[ end trace ff14b6c5822b8142 ]---
RIP: 0010:account_kernel_stack+0x297/0x320 kernel/fork.c:390
Code: 89 e2 be 23 00 00 00 48 89 ef c1 e2 05 e8 81 9e 75 00 48 83 c4 10 5b 5d 
41 5c 41 5d 41 5e 41 5f e9 ae c9 2f 00 e8 a9 c9 2f 00 <0f> 0b e8 f2 50 6f 00 e9 
d2 fd ff ff e8 98 c9 2f 00 48 c7 c6 20 24
RSP: 0018:c90015e4f850 EFLAGS: 00010216
RAX: 01f4 RBX:  RCX: c90017983000
RDX: 0004 RSI: 81445327 RDI: 0005
RBP:  R08: 0001 R09: 8880a2ef9663
R10: 0008 R11:  R12: 
R13: 8880001b2280 R14: 88809e4fa840 R15: 
FS:  7f7f035e5700() GS:8880ae60() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 001b2fd3 CR3: 9b747000 CR4: 001506f0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkal...@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.


Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917

2020-07-21 Thread Valentin Schneider


On 20/07/20 15:21, Peter Zijlstra wrote:
> On Mon, Jul 20, 2020 at 04:02:24PM +0200, Oleg Nesterov wrote:
>> I have to admit, I do not understand the usage of prev_state in schedule(),
>> it looks really, really subtle...
>
> Right, so commit dbfb089d360 solved a problem where schedule() re-read
> prev->state vs prev->on_rq = 0. That is, schedule()'s dequeue and
> ttwu()'s enqueue disagreed over sched_contributes_to_load. and as a
> result load-accounting went wobbly.
>
> Now, looking at that commit again, I might've solved the problem twice
> :-P
>
> So on the one hand, I provider ordering:
>
>   LOAD p->state   LOAD-ACQUIRE p->on_rq == 0
>   MB
>   STORE p->on_rq, 0   STORE p->state, TASK_WAKING
>
> such that ttwu() will only change p->state, after on_rq==0, which is
> after loading p->state in schedule().
>
> At the same time, I also had schedule() set
> p->sched_contributes_to_load once, and then consistently used that value
> throughout, without ever looking at p->state again, which too makes it
> much harder to mess load-avg up.
>
>
> Now, the ordering in schedule(), relies on doing the p->state load
> before:
>
>   spin_lock(rq->lock)
>   smp_mb__after_spinlock();
>
> and doing a re-load check after, with the assumption that if the reload
> is different, it will not block.
>
> That said, in a crossed email, I just proposed we could simplify all
> this like so.. but now I need to go ask people to re-validate that
> loadavg muck again :-/
>
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index a2a244af9a53..437fc3b241f2 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4193,9 +4193,6 @@ static void __sched notrace __schedule(bool preempt)
>   local_irq_disable();
>   rcu_note_context_switch(preempt);
>
> - /* See deactivate_task() below. */
> - prev_state = prev->state;
> -
>   /*
>* Make sure that signal_pending_state()->signal_pending() below
>* can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
> @@ -4223,7 +4220,8 @@ static void __sched notrace __schedule(bool preempt)
>* We must re-load prev->state in case ttwu_remote() changed it
>* before we acquired rq->lock.
>*/
> - if (!preempt && prev_state && prev_state == prev->state) {
> + prev_state = prev->state;

AFAICT the ttwu/schedule ordering dance + the "trick" of updating
p->sched_contributes_to_load in __schedule() under rq lock ensures loadavg
accounting won't go awry here. I'm still a bit low on coffee, but that does
LGTM.


> + if (!preempt && prev_state) {
>   if (signal_pending_state(prev_state, prev)) {
>   prev->state = TASK_RUNNING;
>   } else {
> @@ -4237,10 +4235,12 @@ static void __sched notrace __schedule(bool preempt)
>
>   /*
>* __schedule() ttwu()
> -  *   prev_state = prev->state;if 
> (READ_ONCE(p->on_rq) && ...)
> -  *   LOCK rq->lock  goto out;
> -  *   smp_mb__after_spinlock();
> smp_acquire__after_ctrl_dep();
> -  *   p->on_rq = 0;p->state = 
> TASK_WAKING;
> +  *   if (prev_state)  if (p->on_rq && ...)
> +  * p->on_rq = 0;goto out;
> +  *
> smp_acquire__after_ctrl_dep();
> +  *p->state = TASK_WAKING
> +  *
> +  * Where __schedule() and ttwu() have matching control 
> dependencies.
>*
>* After this, schedule() must not care about p->state 
> any more.
>*/


Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917

2020-07-21 Thread peterz
On Tue, Jul 21, 2020 at 12:52:52AM -0400, Paul Gortmaker wrote:
> [Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917] On 20/07/2020 (Mon 16:21) 
> Peter Zijlstra wrote:
> 
> > On Mon, Jul 20, 2020 at 04:02:24PM +0200, Oleg Nesterov wrote:
> > > I have to admit, I do not understand the usage of prev_state in 
> > > schedule(),
> > > it looks really, really subtle...
> > 
> > Right, so commit dbfb089d360 solved a problem where schedule() re-read
> > prev->state vs prev->on_rq = 0. That is, schedule()'s dequeue and
> > ttwu()'s enqueue disagreed over sched_contributes_to_load. and as a
> > result load-accounting went wobbly.
> > 
> > Now, looking at that commit again, I might've solved the problem twice
> > :-P
> 
> [...]
> 
> > That said, in a crossed email, I just proposed we could simplify all
> > this like so.. but now I need to go ask people to re-validate that
> > loadavg muck again :-/
> 
> After a two hour "quick" sanity test I then gave it a full 7h run (which
> always seemed to break before dbfb089d360) and I didn't see any stuck
> load average with master from today + this change.
> 
> Paul.
> 
> root@t5610:/home/paul/git/linux-head#
> [1]+  Donenohup 
> tools/testing/selftests/rcutorture/bin/kvm.sh --cpus 24 --duration 420 
> --configs TREE03 --trust-make > /tmp/kvm.sh.out 2>&1
> root@t5610:/home/paul/git/linux-head# cat /proc/version
> Linux version 5.8.0-rc6-1-g5714ee50bb43-dirty (paul@t5610) (gcc (Ubuntu 
> 9.3.0-10ubuntu2) 9.3.0, GNU ld (GNU Binutils for Ubuntu) 2.34) #3 SMP Mon Jul 
> 20 12:30:33 EDT 2020
> root@t5610:/home/paul/git/linux-head# uptime
>  00:49:18 up  7:41,  2 users,  load average: 0.01, 0.00, 0.63
> root@t5610:/home/paul/git/linux-head# 
> 
> --

Thanks a lot for re-testing that Paul, much appreciated!

Let me go write a Changelog for this then...


Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917

2020-07-20 Thread Paul Gortmaker
[Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917] On 20/07/2020 (Mon 16:21) 
Peter Zijlstra wrote:

> On Mon, Jul 20, 2020 at 04:02:24PM +0200, Oleg Nesterov wrote:
> > I have to admit, I do not understand the usage of prev_state in schedule(),
> > it looks really, really subtle...
> 
> Right, so commit dbfb089d360 solved a problem where schedule() re-read
> prev->state vs prev->on_rq = 0. That is, schedule()'s dequeue and
> ttwu()'s enqueue disagreed over sched_contributes_to_load. and as a
> result load-accounting went wobbly.
> 
> Now, looking at that commit again, I might've solved the problem twice
> :-P

[...]

> That said, in a crossed email, I just proposed we could simplify all
> this like so.. but now I need to go ask people to re-validate that
> loadavg muck again :-/

After a two hour "quick" sanity test I then gave it a full 7h run (which
always seemed to break before dbfb089d360) and I didn't see any stuck
load average with master from today + this change.

Paul.

root@t5610:/home/paul/git/linux-head#
[1]+  Donenohup 
tools/testing/selftests/rcutorture/bin/kvm.sh --cpus 24 --duration 420 
--configs TREE03 --trust-make > /tmp/kvm.sh.out 2>&1
root@t5610:/home/paul/git/linux-head# cat /proc/version
Linux version 5.8.0-rc6-1-g5714ee50bb43-dirty (paul@t5610) (gcc (Ubuntu 
9.3.0-10ubuntu2) 9.3.0, GNU ld (GNU Binutils for Ubuntu) 2.34) #3 SMP Mon Jul 
20 12:30:33 EDT 2020
root@t5610:/home/paul/git/linux-head# uptime
 00:49:18 up  7:41,  2 users,  load average: 0.01, 0.00, 0.63
root@t5610:/home/paul/git/linux-head# 

--

> 
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index a2a244af9a53..437fc3b241f2 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4193,9 +4193,6 @@ static void __sched notrace __schedule(bool preempt)
>   local_irq_disable();
>   rcu_note_context_switch(preempt);
>  
> - /* See deactivate_task() below. */
> - prev_state = prev->state;
> -
>   /*
>* Make sure that signal_pending_state()->signal_pending() below
>* can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
> @@ -4223,7 +4220,8 @@ static void __sched notrace __schedule(bool preempt)
>* We must re-load prev->state in case ttwu_remote() changed it
>* before we acquired rq->lock.
>*/
> - if (!preempt && prev_state && prev_state == prev->state) {
> + prev_state = prev->state;
> + if (!preempt && prev_state) {
>   if (signal_pending_state(prev_state, prev)) {
>   prev->state = TASK_RUNNING;
>   } else {
> @@ -4237,10 +4235,12 @@ static void __sched notrace __schedule(bool preempt)
>  
>   /*
>* __schedule() ttwu()
> -  *   prev_state = prev->state;if 
> (READ_ONCE(p->on_rq) && ...)
> -  *   LOCK rq->lock  goto out;
> -  *   smp_mb__after_spinlock();
> smp_acquire__after_ctrl_dep();
> -  *   p->on_rq = 0;p->state = 
> TASK_WAKING;
> +  *   if (prev_state)  if (p->on_rq && ...)
> +  * p->on_rq = 0;goto out;
> +  *
> smp_acquire__after_ctrl_dep();
> +  *p->state = TASK_WAKING
> +  *
> +  * Where __schedule() and ttwu() have matching control 
> dependencies.
>*
>* After this, schedule() must not care about p->state 
> any more.
>*/


Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917

2020-07-20 Thread Peter Zijlstra
On Mon, Jul 20, 2020 at 05:35:15PM +0200, Oleg Nesterov wrote:
> On 07/20, Oleg Nesterov wrote:
> >
> > On 07/20, Peter Zijlstra wrote:
> > >
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -4193,9 +4193,6 @@ static void __sched notrace __schedule(bool preempt)
> > >   local_irq_disable();
> > >   rcu_note_context_switch(preempt);
> > >
> > > - /* See deactivate_task() below. */
> > > - prev_state = prev->state;
> > > -
> > >   /*
> > >* Make sure that signal_pending_state()->signal_pending() below
> > >* can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
> > > @@ -4223,7 +4220,8 @@ static void __sched notrace __schedule(bool preempt)
> > >* We must re-load prev->state in case ttwu_remote() changed it
> > >* before we acquired rq->lock.
> > >*/
> > > - if (!preempt && prev_state && prev_state == prev->state) {
> > > + prev_state = prev->state;
> > > + if (!preempt && prev_state) {
> >
> > Heh ;) Peter, you know what? I did the same change and tried to understand
> > why it is wrong and what have I missed.
> >
> > Thanks, now I can relax. But my head hurts too, I'll probably try to re-read
> > this code and other emails from you tomorrow.
> 
> Yes, I can no longer read this code today ;)
> 
> but now it seems to me that (in theory) we need READ_ONCE(prev->state) here
> and probably WRITE_ONCE(on_rq) in deactivate_task() to ensure ctrl-dep?
> 
> Probably not, I got lost.
> Probably not, I got lost.
> Probably not, I got lost.

So, task_struct::state is declared volatile (we should probably 'fix'
that some day), so that doesn't require READ_ONCE() -- in fact, that
caused a bunch of re-reads in the old code which made the loadavg race
more likely.

->on_rq is only ever written 0,1,2, there's no possibe store-tearing.
But possibly, yes, WRITE_ONCE() would be nicer.


Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917

2020-07-20 Thread Oleg Nesterov
On 07/20, Oleg Nesterov wrote:
>
> On 07/20, Peter Zijlstra wrote:
> >
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -4193,9 +4193,6 @@ static void __sched notrace __schedule(bool preempt)
> > local_irq_disable();
> > rcu_note_context_switch(preempt);
> >
> > -   /* See deactivate_task() below. */
> > -   prev_state = prev->state;
> > -
> > /*
> >  * Make sure that signal_pending_state()->signal_pending() below
> >  * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
> > @@ -4223,7 +4220,8 @@ static void __sched notrace __schedule(bool preempt)
> >  * We must re-load prev->state in case ttwu_remote() changed it
> >  * before we acquired rq->lock.
> >  */
> > -   if (!preempt && prev_state && prev_state == prev->state) {
> > +   prev_state = prev->state;
> > +   if (!preempt && prev_state) {
>
> Heh ;) Peter, you know what? I did the same change and tried to understand
> why it is wrong and what have I missed.
>
> Thanks, now I can relax. But my head hurts too, I'll probably try to re-read
> this code and other emails from you tomorrow.

Yes, I can no longer read this code today ;)

but now it seems to me that (in theory) we need READ_ONCE(prev->state) here
and probably WRITE_ONCE(on_rq) in deactivate_task() to ensure ctrl-dep?

Probably not, I got lost.
Probably not, I got lost.
Probably not, I got lost.

Oleg.



Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917

2020-07-20 Thread Oleg Nesterov
On 07/20, Peter Zijlstra wrote:
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4193,9 +4193,6 @@ static void __sched notrace __schedule(bool preempt)
>   local_irq_disable();
>   rcu_note_context_switch(preempt);
>  
> - /* See deactivate_task() below. */
> - prev_state = prev->state;
> -
>   /*
>* Make sure that signal_pending_state()->signal_pending() below
>* can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
> @@ -4223,7 +4220,8 @@ static void __sched notrace __schedule(bool preempt)
>* We must re-load prev->state in case ttwu_remote() changed it
>* before we acquired rq->lock.
>*/
> - if (!preempt && prev_state && prev_state == prev->state) {
> + prev_state = prev->state;
> + if (!preempt && prev_state) {

Heh ;) Peter, you know what? I did the same change and tried to understand
why it is wrong and what have I missed.

Thanks, now I can relax. But my head hurts too, I'll probably try to re-read
this code and other emails from you tomorrow.

Oleg.



Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917

2020-07-20 Thread Valentin Schneider


On 20/07/20 14:17, pet...@infradead.org wrote:
> On Mon, Jul 20, 2020 at 01:20:26PM +0100, Valentin Schneider wrote:
>> On 20/07/20 12:26, pet...@infradead.org wrote:
>
>> > +  /*
>> > +   * We must re-load prev->state in case ttwu_remote() changed it
>> > +   * before we acquired rq->lock.
>> > +   */
>> > +  tmp_state = prev->state;
>> > +  if (unlikely(prev_state != tmp_state)) {
>> > +  /*
>> > +   * ptrace_{,un}freeze_traced() think it is cool to change
>> > +   * ->state around behind our backs between TASK_TRACED and
>> > +   *  __TASK_TRACED.
>> > +   *
>> > +   * This is safe because this, as well as any __TASK_TRACED
>> > +   * wakeups are under siglock.
>> > +   *
>> > +   * For any other case, a changed prev_state must be to
>> > +   * TASK_RUNNING, such that when it blocks, the load has
>> > +   * happened before the smp_mb().
>> > +   *
>> > +   * Also see the comment with deactivate_task().
>> > +   */
>> > +  SCHED_WARN_ON(tmp_state && (prev_state & __TASK_TRACED &&
>> > + !(tmp_state & __TASK_TRACED)));
>> > +
>>
>> IIUC if the state changed and isn't TASK_RUNNING it *has* to have
>> __TASK_TRACED, so can't that be
>>
>>   SCHED_WARN_ON(tmp_state && !(tmp_state & __TASK_TRACED));
>
> Suppose task->state == TASK_UNINTERRUPTIBLE, and task != current, and
> then someone goes and does task->state = __TASK_TRACED.
>
> That is, your statement is correct given the current code, but we also
> want to verify no new code comes along and does something 'creative'.
>

That is what I was trying to go for; AFAICT your approach only warns if
__TASK_TRACED gets removed between the two loads (IOW it was already
there). The way I was seeing it is:

- We only get here if prev_state != tmp_state; IOW we know we raced with
  something
- if (tmp_state), then it wasn't with ttwu_remote()
- thus it must only be with ptrace shenanigans, IOW __TASK_TRACED must be
  there.

Now, what I suggested still doesn't detect what you pointed out, or some
crazier thing that sets __TASK_TRACED *and* some other stuff. IIUC the
ptrace transformation does TASK_TRACED -> __TASK_TRACED, so we could have
it as:

  /* TODO: name me */
  #define foobar TASK_TRACED ^ __TASK_TRACED

  ...

  /* not TASK_RUNNING; check against allowed transformations
  SCHED_WARN_ON(tmp_state && ((prev_state ^ tmp_state) & ~foobar));


That said...

> Or is the heat getting to me?

... that may apply to me as well :-)


Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917

2020-07-20 Thread Peter Zijlstra
On Mon, Jul 20, 2020 at 04:02:24PM +0200, Oleg Nesterov wrote:
> I have to admit, I do not understand the usage of prev_state in schedule(),
> it looks really, really subtle...

Right, so commit dbfb089d360 solved a problem where schedule() re-read
prev->state vs prev->on_rq = 0. That is, schedule()'s dequeue and
ttwu()'s enqueue disagreed over sched_contributes_to_load. and as a
result load-accounting went wobbly.

Now, looking at that commit again, I might've solved the problem twice
:-P

So on the one hand, I provider ordering:

LOAD p->state   LOAD-ACQUIRE p->on_rq == 0
MB
STORE p->on_rq, 0   STORE p->state, TASK_WAKING

such that ttwu() will only change p->state, after on_rq==0, which is
after loading p->state in schedule().

At the same time, I also had schedule() set
p->sched_contributes_to_load once, and then consistently used that value
throughout, without ever looking at p->state again, which too makes it
much harder to mess load-avg up.


Now, the ordering in schedule(), relies on doing the p->state load
before:

spin_lock(rq->lock)
smp_mb__after_spinlock();

and doing a re-load check after, with the assumption that if the reload
is different, it will not block.

That said, in a crossed email, I just proposed we could simplify all
this like so.. but now I need to go ask people to re-validate that
loadavg muck again :-/


diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a2a244af9a53..437fc3b241f2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4193,9 +4193,6 @@ static void __sched notrace __schedule(bool preempt)
local_irq_disable();
rcu_note_context_switch(preempt);
 
-   /* See deactivate_task() below. */
-   prev_state = prev->state;
-
/*
 * Make sure that signal_pending_state()->signal_pending() below
 * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
@@ -4223,7 +4220,8 @@ static void __sched notrace __schedule(bool preempt)
 * We must re-load prev->state in case ttwu_remote() changed it
 * before we acquired rq->lock.
 */
-   if (!preempt && prev_state && prev_state == prev->state) {
+   prev_state = prev->state;
+   if (!preempt && prev_state) {
if (signal_pending_state(prev_state, prev)) {
prev->state = TASK_RUNNING;
} else {
@@ -4237,10 +4235,12 @@ static void __sched notrace __schedule(bool preempt)
 
/*
 * __schedule() ttwu()
-*   prev_state = prev->state;if 
(READ_ONCE(p->on_rq) && ...)
-*   LOCK rq->lock  goto out;
-*   smp_mb__after_spinlock();
smp_acquire__after_ctrl_dep();
-*   p->on_rq = 0;p->state = 
TASK_WAKING;
+*   if (prev_state)  if (p->on_rq && ...)
+* p->on_rq = 0;goto out;
+*
smp_acquire__after_ctrl_dep();
+*p->state = TASK_WAKING
+*
+* Where __schedule() and ttwu() have matching control 
dependencies.
 *
 * After this, schedule() must not care about p->state 
any more.
 */


Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917

2020-07-20 Thread peterz
On Mon, Jul 20, 2020 at 01:26:23PM +0200, pet...@infradead.org wrote:
>  kernel/sched/core.c | 34 --
>  1 file changed, 28 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index e15543cb84812..b5973d7fa521c 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4100,9 +4100,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
> struct rq_flags *rf)
>   */
>  static void __sched notrace __schedule(bool preempt)
>  {
> + unsigned long prev_state, tmp_state;
>   struct task_struct *prev, *next;
>   unsigned long *switch_count;
> - unsigned long prev_state;
>   struct rq_flags rf;
>   struct rq *rq;
>   int cpu;
> @@ -4140,16 +4140,38 @@ static void __sched notrace __schedule(bool preempt)
>   rq_lock(rq, );
>   smp_mb__after_spinlock();
>  
> + /*
> +  * We must re-load prev->state in case ttwu_remote() changed it
> +  * before we acquired rq->lock.
> +  */
> + tmp_state = prev->state;
> + if (unlikely(prev_state != tmp_state)) {
> + /*
> +  * ptrace_{,un}freeze_traced() think it is cool to change
> +  * ->state around behind our backs between TASK_TRACED and
> +  *  __TASK_TRACED.
> +  *
> +  * This is safe because this, as well as any __TASK_TRACED
> +  * wakeups are under siglock.
> +  *
> +  * For any other case, a changed prev_state must be to
> +  * TASK_RUNNING, such that when it blocks, the load has
> +  * happened before the smp_mb().
> +  *
> +  * Also see the comment with deactivate_task().
> +  */
> + SCHED_WARN_ON(tmp_state && (prev_state & __TASK_TRACED &&
> +!(tmp_state & __TASK_TRACED)));
> +
> + prev_state = tmp_state;

While trying to write a changelog for this thing, I can't convince
myself we don't need:

smp_mb();

here. Consider:

CPU0CPU1CPU2

schedule()
  prev_state = prev->state;
  spin_lock(rq->lock);
  smp_mb__after_spin_lock();
ptrace_freeze_traced()
  spin_lock(siglock)
  task->state = __TASK_TRACED;
  spin_unlock(siglock);
  tmp_state = prev->state;
  if (prev_state != tmp_state)
prev_state = tmp_state;
  /* NO SMP_MB */
  if (prev_state)
deactivate_task()
  prev->on_rq = 0;

spin_lock(siglock);
ttwu()
  if (rq->on_rq 
&& ...)
goto unlock;
  
smp_acquire__after_ctrl_dep();
  p->state = 
TASK_WAKING;

Looses the ordering we previously relied upon. That is, CPU1's
prev->state load and prev->on_rq store can get reordered vs CPU2.

OTOH, we have a control dependency on CPU1 as well, that should provide
LOAD->STORE ordering, after all we only do the ->on_rq=0 store, IFF we
see prev_state.

So that is:

if (p->state)   if (!p->on_rq)
p->on_rq = 0;   p->state = TASK_WAKING

which matches a CTRL-DEP to a CTRL-DEP ...

But this then means we can simplify dbfb089d360 as well, but now my head
hurts.

> + }
> +
>   /* Promote REQ to ACT */
>   rq->clock_update_flags <<= 1;
>   update_rq_clock(rq);
>  
>   switch_count = >nivcsw;
> - /*
> -  * We must re-load prev->state in case ttwu_remote() changed it
> -  * before we acquired rq->lock.
> -  */
> - if (!preempt && prev_state && prev_state == prev->state) {
> + if (!preempt && prev_state) {
>   if (signal_pending_state(prev_state, prev)) {
>   prev->state = TASK_RUNNING;
>   } else {


Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917

2020-07-20 Thread Oleg Nesterov
On 07/20, Peter Zijlstra wrote:
>
> Also, is there any way to not have ptrace do this?

Well, we need to ensure that even SIGKILL can't wake the tracee up while
debugger plays with its registers/etc.

> How performance
> critical is this ptrace path?

This is a slow path.

We can probably change ptrace_check_attach() to call ptrace_freeze_traced()
after wait_task_inactive(), but I would like to not do this... Because we
actually want to avoid wait_task_inactive() when possible.

Perhaps ptrace_freeze_traced() can __task_rq_lock() to avoid the race with
__schedule() ? No, it reads prev_state before rq_lock().

> Because I really hate having to add code
> to __schedule() to deal with this horrible thing.

Oh yes, I agree.

I have to admit, I do not understand the usage of prev_state in schedule(),
it looks really, really subtle...

Oleg.



Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917

2020-07-20 Thread peterz
On Mon, Jul 20, 2020 at 01:20:26PM +0100, Valentin Schneider wrote:
> On 20/07/20 12:26, pet...@infradead.org wrote:

> > +   /*
> > +* We must re-load prev->state in case ttwu_remote() changed it
> > +* before we acquired rq->lock.
> > +*/
> > +   tmp_state = prev->state;
> > +   if (unlikely(prev_state != tmp_state)) {
> > +   /*
> > +* ptrace_{,un}freeze_traced() think it is cool to change
> > +* ->state around behind our backs between TASK_TRACED and
> > +*  __TASK_TRACED.
> > +*
> > +* This is safe because this, as well as any __TASK_TRACED
> > +* wakeups are under siglock.
> > +*
> > +* For any other case, a changed prev_state must be to
> > +* TASK_RUNNING, such that when it blocks, the load has
> > +* happened before the smp_mb().
> > +*
> > +* Also see the comment with deactivate_task().
> > +*/
> > +   SCHED_WARN_ON(tmp_state && (prev_state & __TASK_TRACED &&
> > +  !(tmp_state & __TASK_TRACED)));
> > +
> 
> IIUC if the state changed and isn't TASK_RUNNING it *has* to have
> __TASK_TRACED, so can't that be
> 
>   SCHED_WARN_ON(tmp_state && !(tmp_state & __TASK_TRACED));

Suppose task->state == TASK_UNINTERRUPTIBLE, and task != current, and
then someone goes and does task->state = __TASK_TRACED.

That is, your statement is correct given the current code, but we also
want to verify no new code comes along and does something 'creative'.

Or is the heat getting to me?


Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917

2020-07-20 Thread Christian Brauner
On Mon, Jul 20, 2020 at 01:26:23PM +0200, pet...@infradead.org wrote:
> On Mon, Jul 20, 2020 at 12:59:24PM +0200, pet...@infradead.org wrote:
> > On Mon, Jul 20, 2020 at 10:41:06AM +0200, Peter Zijlstra wrote:
> > > On Mon, Jul 20, 2020 at 10:26:58AM +0200, Oleg Nesterov wrote:
> > > > Peter,
> > > > 
> > > > Let me add another note. TASK_TRACED/TASK_STOPPED was always protected 
> > > > by
> > > > ->siglock. In particular, ttwu(__TASK_TRACED) must be always called with
> > > > ->siglock held. That is why ptrace_freeze_traced() assumes it can safely
> > > > do s/TASK_TRACED/__TASK_TRACED/ under spin_lock(siglock).
> > > > 
> > > > Can this change race with
> > > > 
> > > > if (signal_pending_state(prev->state, prev)) {
> > > > prev->state = TASK_RUNNING;
> > > > }
> > > > 
> > > > in __schedule() ? Hopefully not, signal-state is protected by siglock 
> > > > too.
> > > > 
> > > > So I think this logic was correct even if it doesn't look nice. But 
> > > > "doesn't
> > > > look nice" is true for the whole ptrace code ;)
> > > 
> > > *groan*... another bit of obscure magic :-(
> > > 
> > > let me go try and wake up and figure out how best to deal with this.
> 
> This then? That seems to survive the strace thing.

Ran a mainline v5.8-rc6 -> reproduced crash
Ran a mainling v5.8-rc6 + patch below -> testsuite runs fine/not able to 
reproduce crash

So fwiw
Tested-by: Christian Brauner  
Acked-by: Christian Brauner 

> 
> ---
>  kernel/sched/core.c | 34 --
>  1 file changed, 28 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index e15543cb84812..b5973d7fa521c 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4100,9 +4100,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
> struct rq_flags *rf)
>   */
>  static void __sched notrace __schedule(bool preempt)
>  {
> + unsigned long prev_state, tmp_state;
>   struct task_struct *prev, *next;
>   unsigned long *switch_count;
> - unsigned long prev_state;
>   struct rq_flags rf;
>   struct rq *rq;
>   int cpu;
> @@ -4140,16 +4140,38 @@ static void __sched notrace __schedule(bool preempt)
>   rq_lock(rq, );
>   smp_mb__after_spinlock();
>  
> + /*
> +  * We must re-load prev->state in case ttwu_remote() changed it
> +  * before we acquired rq->lock.
> +  */
> + tmp_state = prev->state;
> + if (unlikely(prev_state != tmp_state)) {
> + /*
> +  * ptrace_{,un}freeze_traced() think it is cool to change
> +  * ->state around behind our backs between TASK_TRACED and
> +  *  __TASK_TRACED.
> +  *
> +  * This is safe because this, as well as any __TASK_TRACED
> +  * wakeups are under siglock.
> +  *
> +  * For any other case, a changed prev_state must be to
> +  * TASK_RUNNING, such that when it blocks, the load has
> +  * happened before the smp_mb().
> +  *
> +  * Also see the comment with deactivate_task().
> +  */
> + SCHED_WARN_ON(tmp_state && (prev_state & __TASK_TRACED &&
> +!(tmp_state & __TASK_TRACED)));
> +
> + prev_state = tmp_state;
> + }
> +
>   /* Promote REQ to ACT */
>   rq->clock_update_flags <<= 1;
>   update_rq_clock(rq);
>  
>   switch_count = >nivcsw;
> - /*
> -  * We must re-load prev->state in case ttwu_remote() changed it
> -  * before we acquired rq->lock.
> -  */
> - if (!preempt && prev_state && prev_state == prev->state) {
> + if (!preempt && prev_state) {
>   if (signal_pending_state(prev_state, prev)) {
>   prev->state = TASK_RUNNING;
>   } else {


Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917

2020-07-20 Thread Valentin Schneider


On 20/07/20 12:26, pet...@infradead.org wrote:
> ---
>  kernel/sched/core.c | 34 --
>  1 file changed, 28 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index e15543cb84812..b5973d7fa521c 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4100,9 +4100,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
> struct rq_flags *rf)
>   */
>  static void __sched notrace __schedule(bool preempt)
>  {
> + unsigned long prev_state, tmp_state;
>   struct task_struct *prev, *next;
>   unsigned long *switch_count;
> - unsigned long prev_state;
>   struct rq_flags rf;
>   struct rq *rq;
>   int cpu;
> @@ -4140,16 +4140,38 @@ static void __sched notrace __schedule(bool preempt)
>   rq_lock(rq, );
>   smp_mb__after_spinlock();
>
> + /*
> +  * We must re-load prev->state in case ttwu_remote() changed it
> +  * before we acquired rq->lock.
> +  */
> + tmp_state = prev->state;
> + if (unlikely(prev_state != tmp_state)) {
> + /*
> +  * ptrace_{,un}freeze_traced() think it is cool to change
> +  * ->state around behind our backs between TASK_TRACED and
> +  *  __TASK_TRACED.
> +  *
> +  * This is safe because this, as well as any __TASK_TRACED
> +  * wakeups are under siglock.
> +  *
> +  * For any other case, a changed prev_state must be to
> +  * TASK_RUNNING, such that when it blocks, the load has
> +  * happened before the smp_mb().
> +  *
> +  * Also see the comment with deactivate_task().
> +  */
> + SCHED_WARN_ON(tmp_state && (prev_state & __TASK_TRACED &&
> +!(tmp_state & __TASK_TRACED)));
> +

IIUC if the state changed and isn't TASK_RUNNING it *has* to have
__TASK_TRACED, so can't that be

  SCHED_WARN_ON(tmp_state && !(tmp_state & __TASK_TRACED));

Other than that, LGTM.

> + prev_state = tmp_state;
> + }
> +
>   /* Promote REQ to ACT */
>   rq->clock_update_flags <<= 1;
>   update_rq_clock(rq);
>
>   switch_count = >nivcsw;
> - /*
> -  * We must re-load prev->state in case ttwu_remote() changed it
> -  * before we acquired rq->lock.
> -  */
> - if (!preempt && prev_state && prev_state == prev->state) {
> + if (!preempt && prev_state) {
>   if (signal_pending_state(prev_state, prev)) {
>   prev->state = TASK_RUNNING;
>   } else {


Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917

2020-07-20 Thread Jiri Slaby
On 20. 07. 20, 13:26, pet...@infradead.org wrote:
> On Mon, Jul 20, 2020 at 12:59:24PM +0200, pet...@infradead.org wrote:
>> On Mon, Jul 20, 2020 at 10:41:06AM +0200, Peter Zijlstra wrote:
>>> On Mon, Jul 20, 2020 at 10:26:58AM +0200, Oleg Nesterov wrote:
 Peter,

 Let me add another note. TASK_TRACED/TASK_STOPPED was always protected by
 ->siglock. In particular, ttwu(__TASK_TRACED) must be always called with
 ->siglock held. That is why ptrace_freeze_traced() assumes it can safely
 do s/TASK_TRACED/__TASK_TRACED/ under spin_lock(siglock).

 Can this change race with

if (signal_pending_state(prev->state, prev)) {
prev->state = TASK_RUNNING;
}

 in __schedule() ? Hopefully not, signal-state is protected by siglock too.

 So I think this logic was correct even if it doesn't look nice. But 
 "doesn't
 look nice" is true for the whole ptrace code ;)
>>>
>>> *groan*... another bit of obscure magic :-(
>>>
>>> let me go try and wake up and figure out how best to deal with this.
> 
> This then? That seems to survive the strace thing.

FWIW for me too.

thanks,
-- 
js


Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917

2020-07-20 Thread peterz
On Mon, Jul 20, 2020 at 12:59:24PM +0200, pet...@infradead.org wrote:
> On Mon, Jul 20, 2020 at 10:41:06AM +0200, Peter Zijlstra wrote:
> > On Mon, Jul 20, 2020 at 10:26:58AM +0200, Oleg Nesterov wrote:
> > > Peter,
> > > 
> > > Let me add another note. TASK_TRACED/TASK_STOPPED was always protected by
> > > ->siglock. In particular, ttwu(__TASK_TRACED) must be always called with
> > > ->siglock held. That is why ptrace_freeze_traced() assumes it can safely
> > > do s/TASK_TRACED/__TASK_TRACED/ under spin_lock(siglock).
> > > 
> > > Can this change race with
> > > 
> > >   if (signal_pending_state(prev->state, prev)) {
> > >   prev->state = TASK_RUNNING;
> > >   }
> > > 
> > > in __schedule() ? Hopefully not, signal-state is protected by siglock too.
> > > 
> > > So I think this logic was correct even if it doesn't look nice. But 
> > > "doesn't
> > > look nice" is true for the whole ptrace code ;)
> > 
> > *groan*... another bit of obscure magic :-(
> > 
> > let me go try and wake up and figure out how best to deal with this.

This then? That seems to survive the strace thing.

---
 kernel/sched/core.c | 34 --
 1 file changed, 28 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e15543cb84812..b5973d7fa521c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4100,9 +4100,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
  */
 static void __sched notrace __schedule(bool preempt)
 {
+   unsigned long prev_state, tmp_state;
struct task_struct *prev, *next;
unsigned long *switch_count;
-   unsigned long prev_state;
struct rq_flags rf;
struct rq *rq;
int cpu;
@@ -4140,16 +4140,38 @@ static void __sched notrace __schedule(bool preempt)
rq_lock(rq, );
smp_mb__after_spinlock();
 
+   /*
+* We must re-load prev->state in case ttwu_remote() changed it
+* before we acquired rq->lock.
+*/
+   tmp_state = prev->state;
+   if (unlikely(prev_state != tmp_state)) {
+   /*
+* ptrace_{,un}freeze_traced() think it is cool to change
+* ->state around behind our backs between TASK_TRACED and
+*  __TASK_TRACED.
+*
+* This is safe because this, as well as any __TASK_TRACED
+* wakeups are under siglock.
+*
+* For any other case, a changed prev_state must be to
+* TASK_RUNNING, such that when it blocks, the load has
+* happened before the smp_mb().
+*
+* Also see the comment with deactivate_task().
+*/
+   SCHED_WARN_ON(tmp_state && (prev_state & __TASK_TRACED &&
+  !(tmp_state & __TASK_TRACED)));
+
+   prev_state = tmp_state;
+   }
+
/* Promote REQ to ACT */
rq->clock_update_flags <<= 1;
update_rq_clock(rq);
 
switch_count = >nivcsw;
-   /*
-* We must re-load prev->state in case ttwu_remote() changed it
-* before we acquired rq->lock.
-*/
-   if (!preempt && prev_state && prev_state == prev->state) {
+   if (!preempt && prev_state) {
if (signal_pending_state(prev_state, prev)) {
prev->state = TASK_RUNNING;
} else {


Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917

2020-07-20 Thread peterz
On Mon, Jul 20, 2020 at 10:41:06AM +0200, Peter Zijlstra wrote:
> On Mon, Jul 20, 2020 at 10:26:58AM +0200, Oleg Nesterov wrote:
> > Peter,
> > 
> > Let me add another note. TASK_TRACED/TASK_STOPPED was always protected by
> > ->siglock. In particular, ttwu(__TASK_TRACED) must be always called with
> > ->siglock held. That is why ptrace_freeze_traced() assumes it can safely
> > do s/TASK_TRACED/__TASK_TRACED/ under spin_lock(siglock).
> > 
> > Can this change race with
> > 
> > if (signal_pending_state(prev->state, prev)) {
> > prev->state = TASK_RUNNING;
> > }
> > 
> > in __schedule() ? Hopefully not, signal-state is protected by siglock too.
> > 
> > So I think this logic was correct even if it doesn't look nice. But "doesn't
> > look nice" is true for the whole ptrace code ;)
> 
> *groan*... another bit of obscure magic :-(
> 
> let me go try and wake up and figure out how best to deal with this.

So clearly I'm still missing something, the below results in:

[   63.760863] [ cut here ]
[   63.766019] !(tmp_state & __TASK_TRACED)
[   63.766030] WARNING: CPU: 33 PID: 33801 at kernel/sched/core.c:4158 
__schedule+0x6bd/0x8e0

Also, is there any way to not have ptrace do this? How performance
critical is this ptrace path? Because I really hate having to add code
to __schedule() to deal with this horrible thing.


---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e15543cb84812..f65a801d268b6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4100,9 +4100,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
struct rq_flags *rf)
  */
 static void __sched notrace __schedule(bool preempt)
 {
+   unsigned long prev_state, tmp_state;
struct task_struct *prev, *next;
unsigned long *switch_count;
-   unsigned long prev_state;
struct rq_flags rf;
struct rq *rq;
int cpu;
@@ -4140,16 +4140,38 @@ static void __sched notrace __schedule(bool preempt)
rq_lock(rq, );
smp_mb__after_spinlock();
 
+   /*
+* We must re-load prev->state in case ttwu_remote() changed it
+* before we acquired rq->lock.
+*/
+   tmp_state = prev->state;
+   if (unlikely(prev_state != tmp_state)) {
+   if (prev_state & __TASK_TRACED) {
+   /*
+* ptrace_{,un}freeze_traced() think it is cool
+* to change ->state around behind our backs
+* between TASK_TRACED and __TASK_TRACED.
+*
+* Luckily this transformation doesn't affect
+* sched_contributes_to_load.
+*/
+   SCHED_WARN_ON(!(tmp_state & __TASK_TRACED));
+   } else {
+   /*
+* For any other case, a changed prev_state
+* must be to TASK_RUNNING, such that...
+*/
+   SCHED_WARN_ON(tmp_state != TASK_RUNNING);
+   }
+   prev_state = tmp_state;
+   }
+
/* Promote REQ to ACT */
rq->clock_update_flags <<= 1;
update_rq_clock(rq);
 
switch_count = >nivcsw;
-   /*
-* We must re-load prev->state in case ttwu_remote() changed it
-* before we acquired rq->lock.
-*/
-   if (!preempt && prev_state && prev_state == prev->state) {
+   if (!preempt && prev_state) {
if (signal_pending_state(prev_state, prev)) {
prev->state = TASK_RUNNING;
} else {


Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917

2020-07-20 Thread Peter Zijlstra
On Mon, Jul 20, 2020 at 10:26:58AM +0200, Oleg Nesterov wrote:
> Peter,
> 
> Let me add another note. TASK_TRACED/TASK_STOPPED was always protected by
> ->siglock. In particular, ttwu(__TASK_TRACED) must be always called with
> ->siglock held. That is why ptrace_freeze_traced() assumes it can safely
> do s/TASK_TRACED/__TASK_TRACED/ under spin_lock(siglock).
> 
> Can this change race with
> 
>   if (signal_pending_state(prev->state, prev)) {
>   prev->state = TASK_RUNNING;
>   }
> 
> in __schedule() ? Hopefully not, signal-state is protected by siglock too.
> 
> So I think this logic was correct even if it doesn't look nice. But "doesn't
> look nice" is true for the whole ptrace code ;)

*groan*... another bit of obscure magic :-(

let me go try and wake up and figure out how best to deal with this.


Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917

2020-07-20 Thread Oleg Nesterov
Peter,

Let me add another note. TASK_TRACED/TASK_STOPPED was always protected by
->siglock. In particular, ttwu(__TASK_TRACED) must be always called with
->siglock held. That is why ptrace_freeze_traced() assumes it can safely
do s/TASK_TRACED/__TASK_TRACED/ under spin_lock(siglock).

Can this change race with

if (signal_pending_state(prev->state, prev)) {
prev->state = TASK_RUNNING;
}

in __schedule() ? Hopefully not, signal-state is protected by siglock too.

So I think this logic was correct even if it doesn't look nice. But "doesn't
look nice" is true for the whole ptrace code ;)

On 07/20, Oleg Nesterov wrote:
>
> On 07/20, Jiri Slaby wrote:
> >
> > You tackled it, we cherry-picked dbfb089d360 to our kernels. Ccing more
> > people.
>
> Thanks... so with this patch __schedule() does
>
>   prev_state = prev->state;
>
>   ...
>
>   if (!preempt && prev_state && prev_state == prev->state) {
>   if (signal_pending_state(prev_state, prev)) {
>   prev->state = TASK_RUNNING;
>   } else {
>
> and ptrace_freeze_traced() can change ->state in between. This means
> that this task can return from __schedule() with ->state != RUNNING,
> this can explain BUG_ON(task_is_stopped_or_traced) in do_notify_parent()
> you reported.
>
> Oleg.



Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917

2020-07-20 Thread Oleg Nesterov
On 07/20, Jiri Slaby wrote:
>
> On 18. 07. 20, 19:14, Oleg Nesterov wrote:
> >
> > This is already wrong. But
> >
> > Where does this __might_sleep() come from ??? I ses no blocking calls
> > in ptrace_stop(). Not to mention it is called with ->siglock held and
> > right after this lock is dropped we take tasklist_lock.
>
> Decoded stacktrace:
>
> > ptrace_stop (include/linux/freezer.h:57 include/linux/freezer.h:67 
> > include/linux/freezer.h:128 include/linux/freezer.h:173 
> > kernel/signal.c:2217)
> > ptrace_do_notify (kernel/signal.c:2272)
> > ptrace_notify (arch/x86/include/asm/paravirt.h:656 
> > arch/x86/include/asm/qspinlock.h:55 include/linux/spinlock.h:211 
> > include/linux/spinlock_api_smp.h:167 include/linux/spinlock.h:403 
> > kernel/signal.c:2282)
> > syscall_trace_enter (include/linux/tracehook.h:73 
> > include/linux/tracehook.h:104 arch/x86/entry/common.c:159)
> > do_syscall_64 (arch/x86/entry/common.c:380)
> > entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:124)
>
> It is try_to_freeze_unsafe in try_to_freeze in freezable_schedule in
> ptrace_stop.

Aha, thanks a lot!

try_to_freeze_unsafe() is called after schedule() which must return with
->state = RUNNING, so this matches another WARN_ON(current->state) added
by debugging patch after freezable_schedule().

Somehow I decided __might_sleep() was called before read_unlock/schedule.

> >
> >   How this connects to the debugging patch I sent? Did you see this 
> > warning
> >   without that patch?
>
> I suppose this made it appear:
> +CONFIG_PREEMPT_COUNT=y
> -# CONFIG_DEBUG_ATOMIC_SLEEP is not set
> +CONFIG_DEBUG_ATOMIC_SLEEP=y

I see,

> Please see my other e-mail, all this is with dbfb089d360b applied. Maybe
> it makes more sense now?

Yes. Thanks Jiri!

Oleg.



Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917

2020-07-20 Thread Oleg Nesterov
On 07/20, Jiri Slaby wrote:
>
> You tackled it, we cherry-picked dbfb089d360 to our kernels. Ccing more
> people.

Thanks... so with this patch __schedule() does

prev_state = prev->state;

...

if (!preempt && prev_state && prev_state == prev->state) {
if (signal_pending_state(prev_state, prev)) {
prev->state = TASK_RUNNING;
} else {

and ptrace_freeze_traced() can change ->state in between. This means
that this task can return from __schedule() with ->state != RUNNING,
this can explain BUG_ON(task_is_stopped_or_traced) in do_notify_parent()
you reported.

Oleg.



Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917

2020-07-20 Thread Jiri Slaby
On 18. 07. 20, 19:14, Oleg Nesterov wrote:
> On 07/18, Jiri Slaby wrote:
>>
>> On 17. 07. 20, 14:40, Oleg Nesterov wrote:
>>>
>>> please see the updated patch below, lets check ptrace_unfreeze() too.
>>
>> Sure, dmesg attached.
> 
> Thanks a lot!
> 
> But I am totally confused...
> 
>> [   94.513944] [ cut here ]
>> [   94.513985] do not call blocking ops when !TASK_RUNNING; state=8 set at 
>> [<2fe279e9>] ptrace_check_attach+0xbf/0x110
> 
> OK, so the ptracer has already did the TASK_TRACED -> __TASK_TRACED change in
> ptrace_freeze_traced(),
> 
>> [   94.514019] WARNING: CPU: 16 PID: 34171 at kernel/sched/core.c:6881 
>> __might_sleep+0x6c/0x70
>> [   94.514020] Modules linked in: ata_generic(E) pata_acpi(E) 
>> crc32_pclmul(E) qemu_fw_cfg(E) ata_piix(E) e1000(E) nls_iso8859_1(E) 
>> nls_cp437(E) vfat(E) fat(E) virtio_blk(E) virtio_mmio(E) xfs(E) btrfs(E) 
>> blake2b_generic(E) xor(E) raid6_pq(E) libcrc32c(E) crc32c_intel(E) 
>> reiserfs(E) squashfs(E) fuse(E) dm_snapshot(E) dm_bufio(E) dm_crypt(E) 
>> dm_mod(E) binfmt_misc(E) loop(E) sg(E) virtio_rng(E)
>> [   94.514082] CPU: 16 PID: 34171 Comm: strace Tainted: GE 
>> 5.8.0-rc5-100.g55927f9-default #1 openSUSE Tumbleweed (unreleased)
>> [   94.514084] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
>> rel-1.12.0-0-ga698c89-rebuilt.suse.com 04/01/2014
>> [   94.514087] RIP: 0010:__might_sleep+0x6c/0x70
>> [   94.514090] Code: 41 5c 41 5d e9 25 ff ff ff 48 8b 90 68 1a 00 00 48 8b 
>> 70 10 48 c7 c7 10 45 70 8f c6 05 4f a9 68 01 01 48 89 d1 e8 7a bb fc ff <0f> 
>> 0b eb c8 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 49 89 fd
>> [   94.514092] RSP: 0018:9ffa4ba1be00 EFLAGS: 00010286
>> [   94.514093] RAX:  RBX: 8dc82b503e00 RCX: 
>> 0489
>> [   94.514094] RDX: 0001 RSI: 0096 RDI: 
>> 0247
>> [   94.514095] RBP: 8f6ffa6b R08: 0004 R09: 
>> 0489
>> [   94.514095] R10:  R11: 0001 R12: 
>> 0039
>> [   94.514096] R13:  R14: 0001 R15: 
>> 8dc82b5045e4
>> [   94.514098] FS:  7fa00f1f9240() GS:8dcb0c00() 
>> knlGS:
>> [   94.514099] CS:  0010 DS:  ES:  CR0: 80050033
>> [   94.514100] CR2: 557b53d25877 CR3: 0004ca490005 CR4: 
>> 00360ee0
>> [   94.514103] DR0:  DR1:  DR2: 
>> 
>> [   94.514104] DR3:  DR6: fffe0ff0 DR7: 
>> 0400
>> [   94.514105] Call Trace:
>> [   94.514821]  ptrace_stop+0x1a9/0x300
> 
> This is already wrong. But
> 
>   Where does this __might_sleep() come from ??? I ses no blocking calls
>   in ptrace_stop(). Not to mention it is called with ->siglock held and
>   right after this lock is dropped we take tasklist_lock.

Decoded stacktrace:

> ptrace_stop (include/linux/freezer.h:57 include/linux/freezer.h:67 
> include/linux/freezer.h:128 include/linux/freezer.h:173 kernel/signal.c:2217) 
>  
> ptrace_do_notify (kernel/signal.c:2272)  
> ptrace_notify (arch/x86/include/asm/paravirt.h:656 
> arch/x86/include/asm/qspinlock.h:55 include/linux/spinlock.h:211 
> include/linux/spinlock_api_smp.h:167 include/linux/spinlock.h:403 
> kernel/signal.c:2282)  
> syscall_trace_enter (include/linux/tracehook.h:73 
> include/linux/tracehook.h:104 arch/x86/entry/common.c:159)
> do_syscall_64 (arch/x86/entry/common.c:380)  
> entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:124)  

It is try_to_freeze_unsafe in try_to_freeze in freezable_schedule in
ptrace_stop.

> 
>   How this connects to the debugging patch I sent? Did you see this 
> warning
>   without that patch?

I suppose this made it appear:
+CONFIG_PREEMPT_COUNT=y
-# CONFIG_DEBUG_ATOMIC_SLEEP is not set
+CONFIG_DEBUG_ATOMIC_SLEEP=y

>> [   94.514888] WARNING: CPU: 16 PID: 34171 at kernel/signal.c:2219 
>> ptrace_stop+0x1d8/0x300
> ...
>> [   94.514899] RIP: 0010:ptrace_stop+0x1d8/0x300
> 
> This is WARN_ON(current->state) added to ptrace_stop(), this can explain
> BUG_ON() in do_notify_parent() you reported.
> 
> So, the tracee returns from schedule() with ->state != TASK_RUNNING ???
> This must not be possible.
> 
> OK, perhaps task->state was changed by ptrace_unfreeze_traced()? this can
> only happen if it races with ttwu(__TASK_TRACED) without ->siglock held,
> nobody should do this.
> 
> Strange.

Please see my other e-mail, all this is with dbfb089d360b applied. Maybe
it makes more sense now?

thanks,
-- 
js


Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917

2020-07-19 Thread Jiri Slaby
On 18. 07. 20, 19:44, Christian Brauner wrote:
> On Sat, Jul 18, 2020 at 07:14:07PM +0200, Oleg Nesterov wrote:
>> On 07/18, Jiri Slaby wrote:
>>>
>>> On 17. 07. 20, 14:40, Oleg Nesterov wrote:

 please see the updated patch below, lets check ptrace_unfreeze() too.
>>>
>>> Sure, dmesg attached.
>>
>> Thanks a lot!
>>
>> But I am totally confused...
>>
>>> [   94.513944] [ cut here ]
>>> [   94.513985] do not call blocking ops when !TASK_RUNNING; state=8 set at 
>>> [<2fe279e9>] ptrace_check_attach+0xbf/0x110
>>
>> OK, so the ptracer has already did the TASK_TRACED -> __TASK_TRACED change in
>> ptrace_freeze_traced(),
>>
>>> [   94.514019] WARNING: CPU: 16 PID: 34171 at kernel/sched/core.c:6881 
>>> __might_sleep+0x6c/0x70
>>> [   94.514020] Modules linked in: ata_generic(E) pata_acpi(E) 
>>> crc32_pclmul(E) qemu_fw_cfg(E) ata_piix(E) e1000(E) nls_iso8859_1(E) 
>>> nls_cp437(E) vfat(E) fat(E) virtio_blk(E) virtio_mmio(E) xfs(E) btrfs(E) 
>>> blake2b_generic(E) xor(E) raid6_pq(E) libcrc32c(E) crc32c_intel(E) 
>>> reiserfs(E) squashfs(E) fuse(E) dm_snapshot(E) dm_bufio(E) dm_crypt(E) 
>>> dm_mod(E) binfmt_misc(E) loop(E) sg(E) virtio_rng(E)
>>> [   94.514082] CPU: 16 PID: 34171 Comm: strace Tainted: GE 
>>> 5.8.0-rc5-100.g55927f9-default #1 openSUSE Tumbleweed (unreleased)
>>> [   94.514084] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
>>> rel-1.12.0-0-ga698c89-rebuilt.suse.com 04/01/2014
>>> [   94.514087] RIP: 0010:__might_sleep+0x6c/0x70
>>> [   94.514090] Code: 41 5c 41 5d e9 25 ff ff ff 48 8b 90 68 1a 00 00 48 8b 
>>> 70 10 48 c7 c7 10 45 70 8f c6 05 4f a9 68 01 01 48 89 d1 e8 7a bb fc ff 
>>> <0f> 0b eb c8 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 49 89 fd
>>> [   94.514092] RSP: 0018:9ffa4ba1be00 EFLAGS: 00010286
>>> [   94.514093] RAX:  RBX: 8dc82b503e00 RCX: 
>>> 0489
>>> [   94.514094] RDX: 0001 RSI: 0096 RDI: 
>>> 0247
>>> [   94.514095] RBP: 8f6ffa6b R08: 0004 R09: 
>>> 0489
>>> [   94.514095] R10:  R11: 0001 R12: 
>>> 0039
>>> [   94.514096] R13:  R14: 0001 R15: 
>>> 8dc82b5045e4
>>> [   94.514098] FS:  7fa00f1f9240() GS:8dcb0c00() 
>>> knlGS:
>>> [   94.514099] CS:  0010 DS:  ES:  CR0: 80050033
>>> [   94.514100] CR2: 557b53d25877 CR3: 0004ca490005 CR4: 
>>> 00360ee0
>>> [   94.514103] DR0:  DR1:  DR2: 
>>> 
>>> [   94.514104] DR3:  DR6: fffe0ff0 DR7: 
>>> 0400
>>> [   94.514105] Call Trace:
>>> [   94.514821]  ptrace_stop+0x1a9/0x300
>>
>> This is already wrong. But
>>
>>  Where does this __might_sleep() come from ??? I ses no blocking calls
>>  in ptrace_stop(). Not to mention it is called with ->siglock held and
>>  right after this lock is dropped we take tasklist_lock.
>>
>>  How this connects to the debugging patch I sent? Did you see this 
>> warning
>>  without that patch?
>>  
>>
>>> [   94.514888] WARNING: CPU: 16 PID: 34171 at kernel/signal.c:2219 
>>> ptrace_stop+0x1d8/0x300
>> ...
>>> [   94.514899] RIP: 0010:ptrace_stop+0x1d8/0x300
>>
>> This is WARN_ON(current->state) added to ptrace_stop(), this can explain
>> BUG_ON() in do_notify_parent() you reported.
>>
>> So, the tracee returns from schedule() with ->state != TASK_RUNNING ???
>> This must not be possible.
>>
>> OK, perhaps task->state was changed by ptrace_unfreeze_traced()? this can
>> only happen if it races with ttwu(__TASK_TRACED) without ->siglock held,
>> nobody should do this.
>>
>> Strange.
> 
> I have tried to reproduce this with an vanilla upstream 5.8-rc4 and the
> strace test-suite with
> 
> make check -j4
> 
> and I wasn't able to after multiple retries. Jiri, just to make sure
> this is upstream 5.8-rc4 without any additional patches?

You tackled it, we cherry-picked dbfb089d360 to our kernels. Ccing more
people.

So if you try with pure vanilla 5.8-rc6 (it contains the commit), you
should be able to reproduce. I am.

OTOH 5.8-rc6 with that commit reverted -- I cannot reproduce there. So
it must be it.

> Anything special required to reproduce this in the way you run strace
> and so on?

Nothing special there. Except the HW: when I run a VM with 16+
processors, it's much more likely to be hit (usually, on the first run
of make check).

thanks,
-- 
js


Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917

2020-07-19 Thread Oleg Nesterov
Hi Hillf,

On 07/19, Hillf Danton wrote:
>
> Dunno if the wheel prior to JOBCTL_TASK_WORK helps debug the warnings.
>
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -2541,7 +2541,7 @@ bool get_signal(struct ksignal *ksig)
>
>  relock:
>   spin_lock_irq(>siglock);
> - current->jobctl &= ~JOBCTL_TASK_WORK;
> + task_clear_jobctl_pending(current, JOBCTL_TASK_WORK);
>   if (unlikely(current->task_works)) {
>   spin_unlock_irq(>siglock);
>   task_work_run();
> --- a/kernel/task_work.c
> +++ b/kernel/task_work.c
> @@ -43,8 +43,8 @@ task_work_add(struct task_struct *task,
>   break;
>   case TWA_SIGNAL:
>   if (lock_task_sighand(task, )) {
> - task->jobctl |= JOBCTL_TASK_WORK;
> - signal_wake_up(task, 0);
> + if (task_set_jobctl_pending(task, JOBCTL_TASK_WORK))
> + recalc_sigpending_and_wake(task);
>   unlock_task_sighand(task, );

To be be honest I don't understand why do you think this makes any sense...

But this doesn't matter, please note that the Jiri tested the kernel with
e91b48162332480f5 (which added JOBCTL_TASK_WORK) reverted.

Oleg.



Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917

2020-07-18 Thread Christian Brauner
On Sat, Jul 18, 2020 at 07:14:07PM +0200, Oleg Nesterov wrote:
> On 07/18, Jiri Slaby wrote:
> >
> > On 17. 07. 20, 14:40, Oleg Nesterov wrote:
> > >
> > > please see the updated patch below, lets check ptrace_unfreeze() too.
> >
> > Sure, dmesg attached.
> 
> Thanks a lot!
> 
> But I am totally confused...
> 
> > [   94.513944] [ cut here ]
> > [   94.513985] do not call blocking ops when !TASK_RUNNING; state=8 set at 
> > [<2fe279e9>] ptrace_check_attach+0xbf/0x110
> 
> OK, so the ptracer has already did the TASK_TRACED -> __TASK_TRACED change in
> ptrace_freeze_traced(),
> 
> > [   94.514019] WARNING: CPU: 16 PID: 34171 at kernel/sched/core.c:6881 
> > __might_sleep+0x6c/0x70
> > [   94.514020] Modules linked in: ata_generic(E) pata_acpi(E) 
> > crc32_pclmul(E) qemu_fw_cfg(E) ata_piix(E) e1000(E) nls_iso8859_1(E) 
> > nls_cp437(E) vfat(E) fat(E) virtio_blk(E) virtio_mmio(E) xfs(E) btrfs(E) 
> > blake2b_generic(E) xor(E) raid6_pq(E) libcrc32c(E) crc32c_intel(E) 
> > reiserfs(E) squashfs(E) fuse(E) dm_snapshot(E) dm_bufio(E) dm_crypt(E) 
> > dm_mod(E) binfmt_misc(E) loop(E) sg(E) virtio_rng(E)
> > [   94.514082] CPU: 16 PID: 34171 Comm: strace Tainted: GE 
> > 5.8.0-rc5-100.g55927f9-default #1 openSUSE Tumbleweed (unreleased)
> > [   94.514084] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> > rel-1.12.0-0-ga698c89-rebuilt.suse.com 04/01/2014
> > [   94.514087] RIP: 0010:__might_sleep+0x6c/0x70
> > [   94.514090] Code: 41 5c 41 5d e9 25 ff ff ff 48 8b 90 68 1a 00 00 48 8b 
> > 70 10 48 c7 c7 10 45 70 8f c6 05 4f a9 68 01 01 48 89 d1 e8 7a bb fc ff 
> > <0f> 0b eb c8 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 49 89 fd
> > [   94.514092] RSP: 0018:9ffa4ba1be00 EFLAGS: 00010286
> > [   94.514093] RAX:  RBX: 8dc82b503e00 RCX: 
> > 0489
> > [   94.514094] RDX: 0001 RSI: 0096 RDI: 
> > 0247
> > [   94.514095] RBP: 8f6ffa6b R08: 0004 R09: 
> > 0489
> > [   94.514095] R10:  R11: 0001 R12: 
> > 0039
> > [   94.514096] R13:  R14: 0001 R15: 
> > 8dc82b5045e4
> > [   94.514098] FS:  7fa00f1f9240() GS:8dcb0c00() 
> > knlGS:
> > [   94.514099] CS:  0010 DS:  ES:  CR0: 80050033
> > [   94.514100] CR2: 557b53d25877 CR3: 0004ca490005 CR4: 
> > 00360ee0
> > [   94.514103] DR0:  DR1:  DR2: 
> > 
> > [   94.514104] DR3:  DR6: fffe0ff0 DR7: 
> > 0400
> > [   94.514105] Call Trace:
> > [   94.514821]  ptrace_stop+0x1a9/0x300
> 
> This is already wrong. But
> 
>   Where does this __might_sleep() come from ??? I ses no blocking calls
>   in ptrace_stop(). Not to mention it is called with ->siglock held and
>   right after this lock is dropped we take tasklist_lock.
> 
>   How this connects to the debugging patch I sent? Did you see this 
> warning
>   without that patch?
>   
> 
> > [   94.514888] WARNING: CPU: 16 PID: 34171 at kernel/signal.c:2219 
> > ptrace_stop+0x1d8/0x300
> ...
> > [   94.514899] RIP: 0010:ptrace_stop+0x1d8/0x300
> 
> This is WARN_ON(current->state) added to ptrace_stop(), this can explain
> BUG_ON() in do_notify_parent() you reported.
> 
> So, the tracee returns from schedule() with ->state != TASK_RUNNING ???
> This must not be possible.
> 
> OK, perhaps task->state was changed by ptrace_unfreeze_traced()? this can
> only happen if it races with ttwu(__TASK_TRACED) without ->siglock held,
> nobody should do this.
> 
> Strange.

I have tried to reproduce this with an vanilla upstream 5.8-rc4 and the
strace test-suite with

make check -j4

and I wasn't able to after multiple retries. Jiri, just to make sure
this is upstream 5.8-rc4 without any additional patches?
Anything special required to reproduce this in the way you run strace
and so on?

Christian


Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917

2020-07-18 Thread Oleg Nesterov
On 07/18, Jiri Slaby wrote:
>
> On 17. 07. 20, 14:40, Oleg Nesterov wrote:
> >
> > please see the updated patch below, lets check ptrace_unfreeze() too.
>
> Sure, dmesg attached.

Thanks a lot!

But I am totally confused...

> [   94.513944] [ cut here ]
> [   94.513985] do not call blocking ops when !TASK_RUNNING; state=8 set at 
> [<2fe279e9>] ptrace_check_attach+0xbf/0x110

OK, so the ptracer has already did the TASK_TRACED -> __TASK_TRACED change in
ptrace_freeze_traced(),

> [   94.514019] WARNING: CPU: 16 PID: 34171 at kernel/sched/core.c:6881 
> __might_sleep+0x6c/0x70
> [   94.514020] Modules linked in: ata_generic(E) pata_acpi(E) crc32_pclmul(E) 
> qemu_fw_cfg(E) ata_piix(E) e1000(E) nls_iso8859_1(E) nls_cp437(E) vfat(E) 
> fat(E) virtio_blk(E) virtio_mmio(E) xfs(E) btrfs(E) blake2b_generic(E) xor(E) 
> raid6_pq(E) libcrc32c(E) crc32c_intel(E) reiserfs(E) squashfs(E) fuse(E) 
> dm_snapshot(E) dm_bufio(E) dm_crypt(E) dm_mod(E) binfmt_misc(E) loop(E) sg(E) 
> virtio_rng(E)
> [   94.514082] CPU: 16 PID: 34171 Comm: strace Tainted: GE 
> 5.8.0-rc5-100.g55927f9-default #1 openSUSE Tumbleweed (unreleased)
> [   94.514084] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> rel-1.12.0-0-ga698c89-rebuilt.suse.com 04/01/2014
> [   94.514087] RIP: 0010:__might_sleep+0x6c/0x70
> [   94.514090] Code: 41 5c 41 5d e9 25 ff ff ff 48 8b 90 68 1a 00 00 48 8b 70 
> 10 48 c7 c7 10 45 70 8f c6 05 4f a9 68 01 01 48 89 d1 e8 7a bb fc ff <0f> 0b 
> eb c8 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 49 89 fd
> [   94.514092] RSP: 0018:9ffa4ba1be00 EFLAGS: 00010286
> [   94.514093] RAX:  RBX: 8dc82b503e00 RCX: 
> 0489
> [   94.514094] RDX: 0001 RSI: 0096 RDI: 
> 0247
> [   94.514095] RBP: 8f6ffa6b R08: 0004 R09: 
> 0489
> [   94.514095] R10:  R11: 0001 R12: 
> 0039
> [   94.514096] R13:  R14: 0001 R15: 
> 8dc82b5045e4
> [   94.514098] FS:  7fa00f1f9240() GS:8dcb0c00() 
> knlGS:
> [   94.514099] CS:  0010 DS:  ES:  CR0: 80050033
> [   94.514100] CR2: 557b53d25877 CR3: 0004ca490005 CR4: 
> 00360ee0
> [   94.514103] DR0:  DR1:  DR2: 
> 
> [   94.514104] DR3:  DR6: fffe0ff0 DR7: 
> 0400
> [   94.514105] Call Trace:
> [   94.514821]  ptrace_stop+0x1a9/0x300

This is already wrong. But

Where does this __might_sleep() come from ??? I ses no blocking calls
in ptrace_stop(). Not to mention it is called with ->siglock held and
right after this lock is dropped we take tasklist_lock.

How this connects to the debugging patch I sent? Did you see this 
warning
without that patch?


> [   94.514888] WARNING: CPU: 16 PID: 34171 at kernel/signal.c:2219 
> ptrace_stop+0x1d8/0x300
...
> [   94.514899] RIP: 0010:ptrace_stop+0x1d8/0x300

This is WARN_ON(current->state) added to ptrace_stop(), this can explain
BUG_ON() in do_notify_parent() you reported.

So, the tracee returns from schedule() with ->state != TASK_RUNNING ???
This must not be possible.

OK, perhaps task->state was changed by ptrace_unfreeze_traced()? this can
only happen if it races with ttwu(__TASK_TRACED) without ->siglock held,
nobody should do this.

Strange.

Oleg.



Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917

2020-07-18 Thread Jiri Slaby
On 17. 07. 20, 13:12, Christian Brauner wrote:
> On Fri, Jul 17, 2020 at 01:04:38PM +0200, Jiri Slaby wrote:
>> On 17. 07. 20, 12:45, Jiri Slaby wrote:
>>> Hi,
>>>
>>> the strace testsuite triggers this on 5.8-rc4 and -rc5 both on x86_64
>>> and i586:
>>
>> make check needs -jsomething, running is sequentially (-j1) doesn't
>> trigger it. After the error, I cannot run anything. Like ps to find out
>> what test caused the crash...
>>
>> 5.7 was fine.
> 
> Can you try and reproduce with
> 
> commit e91b48162332480f5840902268108bb7fb7a44c7
> Author: Oleg Nesterov 
> Date:   Tue Jun 30 17:32:54 2020 +0200
> 
> task_work: teach task_work_add() to do signal_wake_up()
> 
> reverted, please?

I reverted these 3:
+This reverts commit b7db41c9e03b5189bc94993bd50e4506ac9e34c1.
+This reverts commit ce593a6c480a22acba08795be313c0c6d49dd35d.
+This reverts commit e91b48162332480f5840902268108bb7fb7a44c7.

But it still crashes.

thanks,
-- 
js


Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917

2020-07-18 Thread Jiri Slaby
On 17. 07. 20, 14:40, Oleg Nesterov wrote:
> On 07/17, Oleg Nesterov wrote:
>>
>> On 07/17, Jiri Slaby wrote:
>>>
>>> On 17. 07. 20, 12:45, Jiri Slaby wrote:
 Hi,

 the strace testsuite triggers this on 5.8-rc4 and -rc5 both on x86_64
 and i586:
>>>
>>> make check needs -jsomething, running is sequentially (-j1) doesn't
>>> trigger it. After the error, I cannot run anything. Like ps to find out
>>> what test caused the crash...
>>
>> Strange... I'll try to reproduce but I can't do this till Monday.
>>
>> Meanwhile, could you try the patch below? It needs CONFIG_DEBUG_ATOMIC_SLEEP.
> 
> please see the updated patch below, lets check ptrace_unfreeze() too.

Sure, dmesg attached.

thanks,
-- 
js
[0.00] Linux version 5.8.0-rc5-100.g55927f9-default (geeko@buildhost) 
(gcc (SUSE Linux) 10.1.1 20200625 [revision 
c91e43e9363bd119a695d64505f96539fa451bf2], GNU ld (GNU Binutils; openSUSE 
Tumbleweed) 2.34.0.20200325-1) #1 SMP Sat Jul 18 06:57:08 UTC 2020 (55927f9)
[0.00] Command line: root=/dev/disk/by-id/virtio-0 rootfstype=ext4 
rootflags=noatime ext4.allow_unsupported=1 kpti=off pti=off spectre_v2=off 
panic=1 quiet no-kvmclock elevator=noop nmi_watchdog=0 rw 
rd.driver.pre=binfmt_misc console=ttyS0 init=/.build/build
[0.00] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point 
registers'
[0.00] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[0.00] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
[0.00] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
[0.00] x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, 
using 'standard' format.
[0.00] BIOS-provided physical RAM map:
[0.00] BIOS-e820: [mem 0x-0x0009fbff] usable
[0.00] BIOS-e820: [mem 0x0009fc00-0x0009] reserved
[0.00] BIOS-e820: [mem 0x000f-0x000f] reserved
[0.00] BIOS-e820: [mem 0x0010-0xbffd8fff] usable
[0.00] BIOS-e820: [mem 0xbffd9000-0xbfff] reserved
[0.00] BIOS-e820: [mem 0xfeffc000-0xfeff] reserved
[0.00] BIOS-e820: [mem 0xfffc-0x] reserved
[0.00] BIOS-e820: [mem 0x0001-0x000521ff] usable
[0.00] NX (Execute Disable) protection: active
[0.00] SMBIOS 2.8 present.
[0.00] DMI: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
rel-1.12.0-0-ga698c89-rebuilt.suse.com 04/01/2014
[0.00] Hypervisor detected: KVM
[0.00] tsc: Fast TSC calibration using PIT
[0.00] tsc: Detected 2799.858 MHz processor
[0.000348] e820: update [mem 0x-0x0fff] usable ==> reserved
[0.000349] e820: remove [mem 0x000a-0x000f] usable
[0.000354] last_pfn = 0x522000 max_arch_pfn = 0x4
[0.000380] MTRR default type: write-back
[0.000380] MTRR fixed ranges enabled:
[0.000381]   0-9 write-back
[0.000382]   A-B uncachable
[0.000383]   C-F write-protect
[0.000383] MTRR variable ranges enabled:
[0.000384]   0 base 000C000 mask 3FFC000 uncachable
[0.000385]   1 disabled
[0.000385]   2 disabled
[0.000385]   3 disabled
[0.000386]   4 disabled
[0.000386]   5 disabled
[0.000387]   6 disabled
[0.000387]   7 disabled
[0.000394] x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WP  UC- WT  
[0.000408] last_pfn = 0xbffd9 max_arch_pfn = 0x4
[0.015173] check: Scanning 1 areas for low memory corruption
[0.015226] Kernel/User page tables isolation: disabled on command line.
[0.015230] Using GB pages for direct mapping
[0.015416] RAMDISK: [mem 0x7f7a2000-0x7fff]
[0.015459] ACPI: Early table checksum verification disabled
[0.015474] ACPI: RSDP 0x000F5B30 14 (v00 BOCHS )
[0.015477] ACPI: RSDT 0xBFFE2369 30 (v01 BOCHS  BXPCRSDT 
0001 BXPC 0001)
[0.015486] ACPI: FACP 0xBFFE1D4D 74 (v01 BOCHS  BXPCFACP 
0001 BXPC 0001)
[0.015493] ACPI: DSDT 0xBFFDD400 00494D (v01 BOCHS  BXPCDSDT 
0001 BXPC 0001)
[0.015497] ACPI: FACS 0xBFFDD3C0 40
[0.015499] ACPI: APIC 0xBFFE1DC1 000570 (v01 BOCHS  BXPCAPIC 
0001 BXPC 0001)
[0.015502] ACPI: HPET 0xBFFE2331 38 (v01 BOCHS  BXPCHPET 
0001 BXPC 0001)
[0.015512] ACPI: Local APIC address 0xfee0
[0.015847] No NUMA configuration found
[0.015847] Faking a node at [mem 0x-0x000521ff]
[0.015862] NODE_DATA(0) allocated [mem 0x521fea000-0x521ff]
[0.016605] Zone ranges:
[0.016606]   DMA  [mem 0x1000-0x00ff]
[0.016606]   DMA32[mem 0x0100-0x]
[0.016607]   Normal   [mem 0x0001-0x000521ff]
[0.016608]   Device   empty
[0.016608] Movable zone 

Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917

2020-07-17 Thread Oleg Nesterov
On 07/17, Oleg Nesterov wrote:
>
> On 07/17, Jiri Slaby wrote:
> >
> > On 17. 07. 20, 12:45, Jiri Slaby wrote:
> > > Hi,
> > >
> > > the strace testsuite triggers this on 5.8-rc4 and -rc5 both on x86_64
> > > and i586:
> >
> > make check needs -jsomething, running is sequentially (-j1) doesn't
> > trigger it. After the error, I cannot run anything. Like ps to find out
> > what test caused the crash...
>
> Strange... I'll try to reproduce but I can't do this till Monday.
>
> Meanwhile, could you try the patch below? It needs CONFIG_DEBUG_ATOMIC_SLEEP.

please see the updated patch below, lets check ptrace_unfreeze() too.

Oleg.

diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 43d6179508d6..71c76bc7dec6 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -180,6 +180,7 @@ static bool ptrace_freeze_traced(struct task_struct *task)
 
spin_lock_irq(>sighand->siglock);
if (task_is_traced(task) && !__fatal_signal_pending(task)) {
+   task->task_state_change = _THIS_IP_;
task->state = __TASK_TRACED;
ret = true;
}
@@ -203,8 +204,10 @@ static void ptrace_unfreeze_traced(struct task_struct 
*task)
if (task->state == __TASK_TRACED) {
if (__fatal_signal_pending(task))
wake_up_state(task, __TASK_TRACED);
-   else
+   else {
+   task->task_state_change = _THIS_IP_;
task->state = TASK_TRACED;
+   }
}
spin_unlock_irq(>sighand->siglock);
 }
diff --git a/kernel/signal.c b/kernel/signal.c
index ee22ec78fd6d..ba4c4c9ed9b5 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1914,7 +1914,9 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
BUG_ON(sig == -1);
 
/* do_notify_parent_cldstop should have been called instead.  */
-   BUG_ON(task_is_stopped_or_traced(tsk));
+   WARN(task_is_stopped_or_traced(tsk),
+   "exist with state=%ld set at %pS",
+   tsk->state, (void*)tsk->task_state_change);
 
BUG_ON(!tsk->ptrace &&
   (tsk->group_leader != tsk || !thread_group_empty(tsk)));
@@ -2214,6 +2216,7 @@ static void ptrace_stop(int exit_code, int why, int 
clear_code, kernel_siginfo_t
preempt_enable_no_resched();
freezable_schedule();
cgroup_leave_frozen(true);
+   WARN_ON(current->state);
} else {
/*
 * By the time we got the lock, our tracer went away.
@@ -2393,6 +2396,7 @@ static bool do_signal_stop(int signr)
/* Now we don't run again until woken by SIGCONT or SIGKILL */
cgroup_enter_frozen();
freezable_schedule();
+   WARN_ON(current->state);
return true;
} else {
/*



Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917

2020-07-17 Thread Oleg Nesterov
On 07/17, Jiri Slaby wrote:
>
> On 17. 07. 20, 12:45, Jiri Slaby wrote:
> > Hi,
> >
> > the strace testsuite triggers this on 5.8-rc4 and -rc5 both on x86_64
> > and i586:
>
> make check needs -jsomething, running is sequentially (-j1) doesn't
> trigger it. After the error, I cannot run anything. Like ps to find out
> what test caused the crash...

Strange... I'll try to reproduce but I can't do this till Monday.

Meanwhile, could you try the patch below? It needs CONFIG_DEBUG_ATOMIC_SLEEP.

Oleg.

diff --git a/kernel/signal.c b/kernel/signal.c
index ee22ec78fd6d..84782843eb1c 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1914,7 +1914,9 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
BUG_ON(sig == -1);
 
/* do_notify_parent_cldstop should have been called instead.  */
-   BUG_ON(task_is_stopped_or_traced(tsk));
+   WARN(task_is_stopped_or_traced(tsk),
+   "exist with state=%ld set at %pS",
+   tsk->state, (void*)tsk->task_state_change);
 
BUG_ON(!tsk->ptrace &&
   (tsk->group_leader != tsk || !thread_group_empty(tsk)));
@@ -2214,6 +2216,7 @@ static void ptrace_stop(int exit_code, int why, int 
clear_code, kernel_siginfo_t
preempt_enable_no_resched();
freezable_schedule();
cgroup_leave_frozen(true);
+   WARN_ON(current->state);
} else {
/*
 * By the time we got the lock, our tracer went away.
@@ -2393,6 +2396,7 @@ static bool do_signal_stop(int signr)
/* Now we don't run again until woken by SIGCONT or SIGKILL */
cgroup_enter_frozen();
freezable_schedule();
+   WARN_ON(current->state);
return true;
} else {
/*



Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917

2020-07-17 Thread Christian Brauner
On Fri, Jul 17, 2020 at 01:04:38PM +0200, Jiri Slaby wrote:
> On 17. 07. 20, 12:45, Jiri Slaby wrote:
> > Hi,
> > 
> > the strace testsuite triggers this on 5.8-rc4 and -rc5 both on x86_64
> > and i586:
> 
> make check needs -jsomething, running is sequentially (-j1) doesn't
> trigger it. After the error, I cannot run anything. Like ps to find out
> what test caused the crash...
> 
> 5.7 was fine.

Can you try and reproduce with

commit e91b48162332480f5840902268108bb7fb7a44c7
Author: Oleg Nesterov 
Date:   Tue Jun 30 17:32:54 2020 +0200

task_work: teach task_work_add() to do signal_wake_up()

reverted, please?

Thanks!
Christian


Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917

2020-07-17 Thread Jiri Slaby
On 17. 07. 20, 12:45, Jiri Slaby wrote:
> Hi,
> 
> the strace testsuite triggers this on 5.8-rc4 and -rc5 both on x86_64
> and i586:

make check needs -jsomething, running is sequentially (-j1) doesn't
trigger it. After the error, I cannot run anything. Like ps to find out
what test caused the crash...

5.7 was fine.

>> kernel BUG at kernel/signal.c:1917!
>> invalid opcode:  [#1] SMP NOPTI 
>> CPU: 7 PID: 18367 Comm: filter-unavaila Not tainted 
>> 5.8.0-rc4-3.g2cd7849-default #1 openSUSE Tumbleweed (unreleased)
>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
>> rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
>> RIP: 0010:do_notify_parent+0x239/0x270 
>> Code: d1 e8 41 89 c5 41 83 e5 01 48 8b 95 28 05 00 00 45 31 c0 4c 89 f6 44 
>> 89 e7 b9 01 00 00 00 e8 ce f2 ff ff e9 5b ff ff ff 0f 0b <0f> 0b 48 39 bf 50 
>> 05 00 00 75 17 48 8b 97 c0 05 00 00 48 8d 87 c0
>> RSP: 0018:a05400f93d00 EFLAGS: 00010002
>> RAX: 0008 RBX: 9024ebb6dd00 RCX: 000102f9
>> RDX:  RSI: 0011 RDI: 9024ebb6dd00
>> RBP: 9024ebb6dd00 R08:  R09: 000d
>> R10: 7f781300 R11: c2690721ae80 R12: 
>> R13:  R14: 0009 R15: 
>> FS:  7f7812ffc700() GS:902537dc() knlGS:
>> CS:  0010 DS:  ES:  CR0: 80050033
>> CR2: 7f9c246a4a10 CR3: 0001eba7c000 CR4: 000406e0
>> Call Trace:
>>  exit_notify+0x1c0/0x1f0 
>>  do_exit+0x234/0x430
>>  do_group_exit+0x33/0xa0 
>>  get_signal+0x171/0x640
>>  do_signal+0x2a/0x1c0
>>  exit_to_usermode_loop+0x89/0x110
>>  __prepare_exit_to_usermode+0x92/0xd0
>>  do_syscall_64+0x59/0xd0 
>>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>> RIP: 0033:0x7f78149db992
>> Code: Bad RIP value.
>> RSP: 002b:7f7812ffbed0 EFLAGS: 0293 ORIG_RAX: 0022
>> RAX: fdfe RBX:  RCX: 7f78149db992
>> RDX: 0001 RSI: 558a4e07e033 RDI: 
>> RBP:  R08:  R09: 7f7812ffc700
>> R10: 7f7812ffc9d0 R11: 0293 R12: 
>> R13: 7ffe4853b4bf R14: 7ffe4853b4c0 R15: 7f7812ffc700
>> Modules linked in: ata_generic ata_piix qemu_fw_cfg e1000 nls_iso8859_1 
>> nls_cp437 vfat fat virtio_blk virtio_mmio xfs btrfs blake2b_generic xor 
>> raid6_pq libcrc32c crc32c_intel reiserfs squashfs fuse dm_snapshot dm_bufio 
>> dm_crypt dm_mod binfmt_misc loop sg virtio_rng
>> ---[ end trace 364c3385f192fee3 ]---
> 
> Any ideas?
> 
> thanks,
> 


-- 
js


5.8-rc*: kernel BUG at kernel/signal.c:1917

2020-07-17 Thread Jiri Slaby
Hi,

the strace testsuite triggers this on 5.8-rc4 and -rc5 both on x86_64
and i586:
> kernel BUG at kernel/signal.c:1917!
> invalid opcode:  [#1] SMP NOPTI 
> CPU: 7 PID: 18367 Comm: filter-unavaila Not tainted 
> 5.8.0-rc4-3.g2cd7849-default #1 openSUSE Tumbleweed (unreleased)
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
> RIP: 0010:do_notify_parent+0x239/0x270 
> Code: d1 e8 41 89 c5 41 83 e5 01 48 8b 95 28 05 00 00 45 31 c0 4c 89 f6 44 89 
> e7 b9 01 00 00 00 e8 ce f2 ff ff e9 5b ff ff ff 0f 0b <0f> 0b 48 39 bf 50 05 
> 00 00 75 17 48 8b 97 c0 05 00 00 48 8d 87 c0
> RSP: 0018:a05400f93d00 EFLAGS: 00010002
> RAX: 0008 RBX: 9024ebb6dd00 RCX: 000102f9
> RDX:  RSI: 0011 RDI: 9024ebb6dd00
> RBP: 9024ebb6dd00 R08:  R09: 000d
> R10: 7f781300 R11: c2690721ae80 R12: 
> R13:  R14: 0009 R15: 
> FS:  7f7812ffc700() GS:902537dc() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 7f9c246a4a10 CR3: 0001eba7c000 CR4: 000406e0
> Call Trace:
>  exit_notify+0x1c0/0x1f0 
>  do_exit+0x234/0x430
>  do_group_exit+0x33/0xa0 
>  get_signal+0x171/0x640
>  do_signal+0x2a/0x1c0
>  exit_to_usermode_loop+0x89/0x110
>  __prepare_exit_to_usermode+0x92/0xd0
>  do_syscall_64+0x59/0xd0 
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> RIP: 0033:0x7f78149db992
> Code: Bad RIP value.
> RSP: 002b:7f7812ffbed0 EFLAGS: 0293 ORIG_RAX: 0022
> RAX: fdfe RBX:  RCX: 7f78149db992
> RDX: 0001 RSI: 558a4e07e033 RDI: 
> RBP:  R08:  R09: 7f7812ffc700
> R10: 7f7812ffc9d0 R11: 0293 R12: 
> R13: 7ffe4853b4bf R14: 7ffe4853b4c0 R15: 7f7812ffc700
> Modules linked in: ata_generic ata_piix qemu_fw_cfg e1000 nls_iso8859_1 
> nls_cp437 vfat fat virtio_blk virtio_mmio xfs btrfs blake2b_generic xor 
> raid6_pq libcrc32c crc32c_intel reiserfs squashfs fuse dm_snapshot dm_bufio 
> dm_crypt dm_mod binfmt_misc loop sg virtio_rng
> ---[ end trace 364c3385f192fee3 ]---

Any ideas?

thanks,
-- 
js
suse labs


Re: Maybe a bug in kernel/irq/chip.c unmask_irq(), device IRQ masked unexpectedly. (re-formated the mail body, sorry)

2019-10-14 Thread Yi Zheng
Hi, Thomas

I canceled that patch. In my testing, that will not fix the problem.
Why the IRQ is unexpectedly masked, that is not an easy bug for me.
More time need to hacking the driver/kernel code.

Thanks for your reply.

Thomas Gleixner  于2019年10月14日周一 下午8:34写道:
>
> On Tue, 8 Oct 2019, Yi Zheng wrote:
> >  There is some defects on IRQ processing:
> >
> >  (1) At the beginning of handle_level_irq(), the IRQ-28 is masked, and 
> > ACK
> >  action is executed: On my machine, it runs the 'else' branch:
> >
> > static inline void mask_ack_irq(struct irq_desc *desc)
> > {
> > if (desc->irq_data.chip->irq_mask_ack) {
> > desc->irq_data.chip->irq_mask_ack(>irq_data);
> > irq_state_set_masked(desc);
> > } else {
> > mask_irq(desc);
> > if (desc->irq_data.chip->irq_ack)
> > 
> > desc->irq_data.chip->irq_ack(>irq_data);
> > }
> > }
> >
> >  It is an 2-steps procedure:
> >  1. mask_irq()
> >  2. desc->irq_data.chip->irq_ack()
> >
> >  the 2nd step, the function ptr is omap_mask_ack_irq(), which
> >  _MASK_ the hardware INTC-IRQ-32 and then do the real ACK action.
>
> Sure. Where is the problem?
>
> >  (2) mask_irq()/unmask_irq() are not atomic actions: They check the
> >  IRQD_IRQ_MASKED flag firstly, and then mask/unmask the irq by 
> > calling
> >  the function ptrs which installed by irq controller drv.  Then, 
> > those 2
> >  functions set/clear the IRQD_IRQ_MASKED flag.
> >
> >  I think the sequence of the hw/sw action should be mirrored 
> > reversed:
> >  mask_irq():
> > check IRQD_IRQ_MASKED;
> > set hardware IRQ mask register;
> > set software IRQD_IRQ_MASKED flag;
> >
> >  unmask_irq():
> > check IRQD_IRQ_MASKED;
> > /* NOTE: should before the hw unmask action!! */
> > clear software IRQD_IRQ_MASKED flag;
> > clear hardware IRQ mask register;
> >
> >  The current unmask_irq(), hw-mask action runs before sw-mask 
> > action,
> >  which gives an very small time window. That cause an unexpected
> >  iterated IRQ.
>
> It's completely irrelevant because _ALL_ those operations run with
> irq_desc->lock held. So nothing can actually observe that state.
>
> >  Here is my the detail of my analyzing of handle_level_irq():
> >
> >  (1) Let record the HW-IRQ-Controller Status and the SW-Flag 
> > IRQD_IRQ_MASKED
> >  pair as following: (hw-mask, sw-mask).
> >
> >  (2) In the 1st level of IRQ-28 ISR calling, in unmask_irq(), after the 
> > HW
> >  unmask action, and before the sw-flag IRQD_IRQ_MASKED is cleared, 
> > there
> >  is a VERY SMALL TIME WINDOW, in which, another IRQ-28 may 
> > triggered.
> >
> >  In that time window, the mask status is (0, 1), which is no an 
> > valid
> >  value.
>
> Again. Irrelevant because not observable.
>
> >   My fixup is in the attachment, which remove the unexpected time 
> > window of
> >   IRQ iteration.
>
> Please don't send attachments. See 
> Documentation/process/submitting-patches.rst
>
> Thanks,
>
> tglx
>


Re: Maybe a bug in kernel/irq/chip.c unmask_irq(), device IRQ masked unexpectedly. (re-formated the mail body, sorry)

2019-10-14 Thread Thomas Gleixner
On Tue, 8 Oct 2019, Yi Zheng wrote:
>  There is some defects on IRQ processing:
> 
>  (1) At the beginning of handle_level_irq(), the IRQ-28 is masked, and ACK
>  action is executed: On my machine, it runs the 'else' branch:
> 
> static inline void mask_ack_irq(struct irq_desc *desc)
> {
> if (desc->irq_data.chip->irq_mask_ack) {
> desc->irq_data.chip->irq_mask_ack(>irq_data);
> irq_state_set_masked(desc);
> } else {
> mask_irq(desc);
> if (desc->irq_data.chip->irq_ack)
> desc->irq_data.chip->irq_ack(>irq_data);
> }
> }
> 
>  It is an 2-steps procedure:
>  1. mask_irq()
>  2. desc->irq_data.chip->irq_ack()
> 
>  the 2nd step, the function ptr is omap_mask_ack_irq(), which
>  _MASK_ the hardware INTC-IRQ-32 and then do the real ACK action.

Sure. Where is the problem?

>  (2) mask_irq()/unmask_irq() are not atomic actions: They check the
>  IRQD_IRQ_MASKED flag firstly, and then mask/unmask the irq by calling
>  the function ptrs which installed by irq controller drv.  Then, 
> those 2
>  functions set/clear the IRQD_IRQ_MASKED flag.
> 
>  I think the sequence of the hw/sw action should be mirrored reversed:
>  mask_irq():
> check IRQD_IRQ_MASKED;
> set hardware IRQ mask register;
> set software IRQD_IRQ_MASKED flag;
> 
>  unmask_irq():
> check IRQD_IRQ_MASKED;
> /* NOTE: should before the hw unmask action!! */
> clear software IRQD_IRQ_MASKED flag;
> clear hardware IRQ mask register;
> 
>  The current unmask_irq(), hw-mask action runs before sw-mask action,
>  which gives an very small time window. That cause an unexpected
>  iterated IRQ.

It's completely irrelevant because _ALL_ those operations run with
irq_desc->lock held. So nothing can actually observe that state.

>  Here is my the detail of my analyzing of handle_level_irq():
> 
>  (1) Let record the HW-IRQ-Controller Status and the SW-Flag 
> IRQD_IRQ_MASKED
>  pair as following: (hw-mask, sw-mask).
> 
>  (2) In the 1st level of IRQ-28 ISR calling, in unmask_irq(), after the HW
>  unmask action, and before the sw-flag IRQD_IRQ_MASKED is cleared, 
> there
>  is a VERY SMALL TIME WINDOW, in which, another IRQ-28 may triggered.
> 
>  In that time window, the mask status is (0, 1), which is no an valid
>  value.

Again. Irrelevant because not observable.

>   My fixup is in the attachment, which remove the unexpected time window 
> of
>   IRQ iteration.

Please don't send attachments. See Documentation/process/submitting-patches.rst

Thanks,

tglx



Re: Maybe a bug in kernel/irq/chip.c unmask_irq(), device IRQ masked unexpectedly. (re-formated the mail body, sorry)

2019-10-10 Thread Tony Lindgren
* Yi Zheng  [191010 06:22]:
> Hi
> 
>My patch is canceled. I have tested that simple adjustment, the
> device IRQ masked unexpectedly. It seems that it is more easily to
> occur with that patch.
> 
> So, the root cause is not found yet.

OK. Based on your description, it could be a missing flush of a
posted write somewhere in the gpio-omap or intc path.

Regards,

Tony


Re: Maybe a bug in kernel/irq/chip.c unmask_irq(), device IRQ masked unexpectedly. (re-formated the mail body, sorry)

2019-10-10 Thread Yi Zheng
Hi

   My patch is canceled. I have tested that simple adjustment, the
device IRQ masked unexpectedly. It seems that it is more easily to
occur with that patch.

So, the root cause is not found yet.


Re: Maybe a bug in kernel/irq/chip.c unmask_irq(), device IRQ masked unexpectedly. (re-formated the mail body, sorry)

2019-10-09 Thread Tony Lindgren
Hi,

* Yi Zheng  [191008 13:06]:
>   NOTE: (1) My SoC is a single core ARM chip: TI-AM3352, so the raw
>  spin-lock irq_desc->lock will be optimized to
>  nothing. handle_level_irq() has no spin-lock protection, right?

Well not always, With CONFIG_SMP we modify only some of the SMP code on boot,
see arch/arm/kernel/head.S for smp_on_up and then the related macro usage.

>  (2) In AM3352, INTC driver ACK the IRQ by write 0x01 into INTC 
> Control
>  Register(offset 0x48).  The chip doc seems that bit[0] of
>  INTC-Control Reg is only an enable/disable flag.  The IRQ may
>  generated even if no ACK action done. Any one can give me an
>  clarification?

The TI INTC is probably better documented in dm3630 trm, it's the same
controller but with a different revision.

>  (3) My analysis is not verified on the real machine. After some code
>  change for debug(add counter to indicates the iteration level, 
> save
>  the IRQ mask status etc.), the device IRQ wrongly masked problem
>  vanished. In fact, the original code can not re-produce the
>  phenomena easily. In tens of machine, only one can get the bug. I
>  have try my best to hacking the code, but the only verified 
> result
>  is here: when bug occur, the HW IRQ is masked, but the
>  IRQD_IRQ_MASKED flag is cleared.
> 
>   My fixup is in the attachment, which remove the unexpected time window 
> of
>   IRQ iteration.

Let's see what Thomas has to say for that. Meanwhile, please take a
look at Documentation/process/submitting-patches.rst for getting things
right for sending out patches that can be applied without manual
editing :)

Cheers,

Tony

> --- kernel/irq/chip.c 2019-07-13 09:28:23.683787367 +0800
> +++ /tmp/chip.c   2019-10-08 11:32:35.082258572 +0800
> @@ -432,8 +432,8 @@ void unmask_irq(struct irq_desc *desc)
>   return;
>  
>   if (desc->irq_data.chip->irq_unmask) {
> - desc->irq_data.chip->irq_unmask(>irq_data);
>   irq_state_clr_masked(desc);
> + desc->irq_data.chip->irq_unmask(>irq_data);
>   }
>  }
>  



Maybe a bug in kernel/irq/chip.c unmask_irq(), device IRQ masked unexpectedly. (re-formated the mail body, sorry)

2019-10-08 Thread Yi Zheng
Hi,

 I found something wrong on my AM3352 SoC machine, the GPIO triggered IRQ is
 masked unexpectedly.  That bug cause the devices using that GPIO-IRQ can
 not work. Even the latest kernel version (v5.4-rc2-20-geda57a0e4299)!

 After a long time hacking, I guess the bug is in kernel/irq/chip.c, the
 important base code for _ALL_ the Processor Platform! That is why this mail
 is sent to you.

 Shortly speaking, the bug is about wrong interrupt iteration. That cause
 the software flag "IRQD_IRQ_MASKED" not refect the real masking status
 register in the interrupt controller (INTC,
 drivers/irqchip/irq-omap-intc.c).

 Here is my hw/sw settings:
 (1) FPGA implements some UART, using a GPIO pin as INT signal,

 (2) In the SoC, 481ac000.gpio is under the INTC controller, irq=28, and
 hwirq=32 It is an level triggered IRQ.

 (3) The ISR calling stack shows that handle_level_irq() is the most
 important function for debugging.

 There is some defects on IRQ processing:

 (1) At the beginning of handle_level_irq(), the IRQ-28 is masked, and ACK
 action is executed: On my machine, it runs the 'else' branch:

static inline void mask_ack_irq(struct irq_desc *desc)
{
if (desc->irq_data.chip->irq_mask_ack) {
desc->irq_data.chip->irq_mask_ack(>irq_data);
irq_state_set_masked(desc);
} else {
mask_irq(desc);
if (desc->irq_data.chip->irq_ack)
desc->irq_data.chip->irq_ack(>irq_data);
}
}

 It is an 2-steps procedure:
 1. mask_irq()
 2. desc->irq_data.chip->irq_ack()

 the 2nd step, the function ptr is omap_mask_ack_irq(), which
 _MASK_ the hardware INTC-IRQ-32 and then do the real ACK action.

 (2) mask_irq()/unmask_irq() are not atomic actions: They check the
 IRQD_IRQ_MASKED flag firstly, and then mask/unmask the irq by calling
 the function ptrs which installed by irq controller drv.  Then, those 2
 functions set/clear the IRQD_IRQ_MASKED flag.

 I think the sequence of the hw/sw action should be mirrored reversed:
 mask_irq():
check IRQD_IRQ_MASKED;
set hardware IRQ mask register;
set software IRQD_IRQ_MASKED flag;

 unmask_irq():
check IRQD_IRQ_MASKED;
/* NOTE: should before the hw unmask action!! */
clear software IRQD_IRQ_MASKED flag;
clear hardware IRQ mask register;

 The current unmask_irq(), hw-mask action runs before sw-mask action,
 which gives an very small time window. That cause an unexpected
 iterated IRQ.

 Here is my the detail of my analyzing of handle_level_irq():

 (1) Let record the HW-IRQ-Controller Status and the SW-Flag IRQD_IRQ_MASKED
 pair as following: (hw-mask, sw-mask).

 (2) In the 1st level of IRQ-28 ISR calling, in unmask_irq(), after the HW
 unmask action, and before the sw-flag IRQD_IRQ_MASKED is cleared, there
 is a VERY SMALL TIME WINDOW, in which, another IRQ-28 may triggered.

 In that time window, the mask status is (0, 1), which is no an valid
 value.

 (3) In the 2nd level of the ISR(IRQ-28), The mask status is IRQ-28(0, 1),
 so mask_irq() do nothing, because sw-flag is "1".  That is an wrong
 status, the programmer thinks that IRQ-28 has been masked, but
 physically not!

 (4) Before the ACK func-ptr calling, there comes the 3rd level IRQ (28/32)!
 Although mask_irq() do not physically mask the IRQ, ACK acion
 (omap_mask_ack_irq) of the omap-intc drv mask the IRQ physically. The
 3rd ISR runs OK.

 When 3rd level ISR exist, the mask status is (0, 0), that is OK!

 (5) The 3rd level ISR finished, the 2nd level ISR continue.  It run ACK
 function ptr -- omap_mask_ack_irq(). The HW-IRQ-mask is set again!
 Now, the mask status is (1, 0), it is unreasonable value!

 (6) The 2nd level ISR run to cond_unmask_irq(). Due to the ill-formed
 mask-status value(1, 0), the unmask_irq() will not be called.  Even in
 unmask_irq(), another checking SW-Flag exists. The real unmask action
 will not run!

 (7) Now, the 2nd level ISR return, with the mask status (1, 0).  The 1st
 level ISR continues, in unmask_irq(), it run irq_state_clr_masked();
 And it repeatedly clear the IRQD_IRQ_MASKED. The final mask status is
 (1, 0).

 What (1, 0) value means? The CPU call will not receive IRQ any more!
 That is my bug phenomenon. If I clean the hardware interrupt
 controller's mask bit, my devices work again.


  NOTE: (1) My SoC is a single

Maybe a bug in kernel/irq/chip.c unmask_irq(), device irq is masked unexpectedly.

2019-10-08 Thread Yi Zheng
Hi,

 I found something wrong on my AM3352 SoC machine, the GPIO
triggered IRQ is masked
 unexpectedly.  That bug cause the devices using that GPIO-IRQ can
not work. Even the
 latest kernel version (v5.4-rc2-20-geda57a0e4299)!

 After a long time hacking, I guess the bug is in
kernel/irq/chip.c, the important
 base code for _ALL_ the Processor Platform! That is why this mail
is sent to you.

 Shortly speaking, the bug is about wrong interrupt iteration.
That cause the software
 flag "IRQD_IRQ_MASKED" not refect the real masking status
register in the interrupt
 controller (INTC, drivers/irqchip/irq-omap-intc.c).

 Here is my hw/sw settings:
 (1) FPGA implements some UART, using a GPIO pin as INT signal,

 (2) In the SoC, 481ac000.gpio is under the INTC controller,
irq=28, and hwirq=32
 It is an level triggered IRQ.

 (3) The ISR calling stack shows that handle_level_irq() is the
most important
 function for debugging.

 There is some defects on IRQ processing:

 (1) At the beginning of handle_level_irq(), the IRQ-28 is masked,
and ACK action is executed:
 On my machine, it runs the 'else' branch:
static inline void mask_ack_irq(struct irq_desc *desc)
{
if (desc->irq_data.chip->irq_mask_ack) {
desc->irq_data.chip->irq_mask_ack(>irq_data);
irq_state_set_masked(desc);
} else {
mask_irq(desc);
if (desc->irq_data.chip->irq_ack)
desc->irq_data.chip->irq_ack(>irq_data);
}
}

 It is an 2-steps procedure:
 1. mask_irq()
 2. desc->irq_data.chip->irq_ack()

 the 2nd step, the function ptr is omap_mask_ack_irq(), which
 _MASK_ the hardware INTC-IRQ-32 and then do the real ACK action.

 (2) mask_irq()/unmask_irq() are not atomic actions:
 They check the IRQD_IRQ_MASKED flag firstly, and then
 mask/unmask the irq by calling the function ptrs which
installed by irq controller drv.
 Then, those 2 functions set/clear the IRQD_IRQ_MASKED flag.

 I think the sequence of the hw/sw action should be mirrored reversed:
 mask_irq():
check IRQD_IRQ_MASKED;
set hardware IRQ mask register;
set software IRQD_IRQ_MASKED flag;

 unmask_irq():
check IRQD_IRQ_MASKED;
clear software IRQD_IRQ_MASKED flag; /* NOTE: should
before the hw unmask action!! */
clear hardware IRQ mask register;

 The current unmask_irq(), hw-mask action runs before sw-mask
action, which gives an very small
 time window. That cause an unexpected iterated IRQ.

 Here is my the detail of my analyzing of handle_level_irq():

 (1) Let record the HW-IRQ-Controller Status and the SW-Flag
IRQD_IRQ_MASKED pair as following:
 (hw-mask, sw-mask).

 (2) In the 1st level of IRQ-28 ISR calling, in unmask_irq(),
after the HW unmask action, and
 before the sw-flag IRQD_IRQ_MASKED is cleared, there is a
VERY SMALL TIME WINDOW,
 in which, another IRQ-28 may triggered.

 In that time window, the mask status is (0, 1), which is no
an valid value.

 (3) In the 2nd level of the ISR(IRQ-28), The mask status is
IRQ-28(0, 1), so
 mask_irq() do nothing, because sw-flag is "1".  That is an
wrong status, the
 programmer thinks that IRQ-28 has been masked, but physically not!

 (4) Before the ACK func-ptr calling, there comes the 3rd level IRQ (28/32)!
 Although mask_irq() do not physically mask the IRQ, ACK acion
(omap_mask_ack_irq)
 of the omap-intc drv mask the IRQ physically. The 3rd ISR runs OK.

 When 3rd level ISR exist, the mask status is (0, 0), that is OK!

 (5) The 3rd level ISR finished, the 2nd level ISR continue.
 It run ACK function ptr -- omap_mask_ack_irq(). The
HW-IRQ-mask is set again!
 Now, the mask status is (1, 0), it is unreasonable value!

 (6) The 2nd level ISR run to cond_unmask_irq(). Due to the
ill-formed mask-status
 value(1, 0), the unmask_irq() will not be called.  Even in
unmask_irq(), another
 checking SW-Flag exists. The real unmask action will not run!

 (7) Now, the 2nd level ISR return, with the mask status (1, 0).
 The 1st level ISR continues, in unmask_irq(), it run
irq_state_clr_masked();
 And it repeatedly clear the IRQD_IRQ_MASKED. The final mask
status is (1, 0).

 What (1, 0) value means? The CPU call will not receive IRQ any more!
 That is my bug phenomenon. If I clean the hardware interrupt
controller's mask bit, my
 devices work again.


  NOTE:
 (1) My SoC is a single core ARM chip: TI-AM3352, so the raw spin-lock
 irq_desc-&g

kernel BUG at kernel/time/timer.c:LINE! (4)

2019-10-07 Thread syzbot

Hello,

syzbot found the following crash on:

HEAD commit:58d5f26a usb-fuzzer: main usb gadget fuzzer driver
git tree:   https://github.com/google/kasan.git usb-fuzzer
console output: https://syzkaller.appspot.com/x/log.txt?x=11e5b20d60
kernel config:  https://syzkaller.appspot.com/x/.config?x=aa5dac3cda4ffd58
dashboard link: https://syzkaller.appspot.com/bug?extid=f49d12d34f2321cf4df2
compiler:   gcc (GCC) 9.0.0 20181231 (experimental)
syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=13d9c71360

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+f49d12d34f2321cf4...@syzkaller.appspotmail.com

[ cut here ]
kernel BUG at kernel/time/timer.c:956!
invalid opcode:  [#1] SMP KASAN
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.4.0-rc1+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011

RIP: 0010:__mod_timer kernel/time/timer.c:956 [inline]
RIP: 0010:__mod_timer kernel/time/timer.c:949 [inline]
RIP: 0010:mod_timer+0x5a2/0xb50 kernel/time/timer.c:1100
Code: 45 10 c7 44 24 14 ff ff ff ff 48 89 44 24 08 48 8d 45 20 48 c7 44 24  
18 00 00 00 00 48 89 04 24 e9 5a fc ff ff e8 ae ce 0e 00 <0f> 0b e8 a7 ce  
0e 00 4c 89 74 24 20 e9 37 fe ff ff e8 98 ce 0e 00

RSP: 0018:8881db209930 EFLAGS: 00010006
RAX: 86c2b200 RBX: a688 RCX: 83efc583
RDX: 0100 RSI: 812f4d82 RDI: 8881d2356200
RBP: 8881d23561e8 R08: 86c2b200 R09: ed103a46abeb
R10: ed103a46abea R11: 8881d2355f53 R12: dc00
R13: 11103b64132d R14: 8881d2355f50 R15: 0006
FS:  () GS:8881db20() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 7f75e2799000 CR3: 0001d3b07000 CR4: 001406f0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
Call Trace:
 
 imon_touch_event drivers/media/rc/imon.c:1348 [inline]
 imon_incoming_packet.isra.0+0x2546/0x2f10 drivers/media/rc/imon.c:1603
 usb_rx_callback_intf0+0x151/0x1e0 drivers/media/rc/imon.c:1734
 __usb_hcd_giveback_urb+0x1f2/0x470 drivers/usb/core/hcd.c:1654
 usb_hcd_giveback_urb+0x368/0x420 drivers/usb/core/hcd.c:1719
 dummy_timer+0x120f/0x2fa2 drivers/usb/gadget/udc/dummy_hcd.c:1965
 call_timer_fn+0x179/0x650 kernel/time/timer.c:1404
 expire_timers kernel/time/timer.c:1449 [inline]
 __run_timers kernel/time/timer.c:1773 [inline]
 __run_timers kernel/time/timer.c:1740 [inline]
 run_timer_softirq+0x5e3/0x1490 kernel/time/timer.c:1786
 __do_softirq+0x221/0x912 kernel/softirq.c:292
 invoke_softirq kernel/softirq.c:373 [inline]
 irq_exit+0x178/0x1a0 kernel/softirq.c:413
 exiting_irq arch/x86/include/asm/apic.h:536 [inline]
 smp_apic_timer_interrupt+0x12f/0x500 arch/x86/kernel/apic/apic.c:1137
 apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:830
 
RIP: 0010:default_idle+0x28/0x2e0 arch/x86/kernel/process.c:581
Code: 90 90 41 56 41 55 65 44 8b 2d 44 3a 8f 7a 41 54 55 53 0f 1f 44 00 00  
e8 36 ee d0 fb e9 07 00 00 00 0f 00 2d fa dd 4f 00 fb f4 <65> 44 8b 2d 20  
3a 8f 7a 0f 1f 44 00 00 5b 5d 41 5c 41 5d 41 5e c3

RSP: 0018:86c07da8 EFLAGS: 0246 ORIG_RAX: ff13
RAX: 0007 RBX: 86c2b200 RCX: 
RDX:  RSI: 0006 RDI: 86c2ba4c
RBP: fbfff0d85640 R08: 86c2b200 R09: 
R10:  R11:  R12: 
R13:  R14:  R15: 
 cpuidle_idle_call kernel/sched/idle.c:154 [inline]
 do_idle+0x3b6/0x500 kernel/sched/idle.c:263
 cpu_startup_entry+0x14/0x20 kernel/sched/idle.c:355
 start_kernel+0x82a/0x864 init/main.c:784
 secondary_startup_64+0xa4/0xb0 arch/x86/kernel/head_64.S:241
Modules linked in:
---[ end trace 3b8169377985e706 ]---
RIP: 0010:__mod_timer kernel/time/timer.c:956 [inline]
RIP: 0010:__mod_timer kernel/time/timer.c:949 [inline]
RIP: 0010:mod_timer+0x5a2/0xb50 kernel/time/timer.c:1100
Code: 45 10 c7 44 24 14 ff ff ff ff 48 89 44 24 08 48 8d 45 20 48 c7 44 24  
18 00 00 00 00 48 89 04 24 e9 5a fc ff ff e8 ae ce 0e 00 <0f> 0b e8 a7 ce  
0e 00 4c 89 74 24 20 e9 37 fe ff ff e8 98 ce 0e 00

RSP: 0018:8881db209930 EFLAGS: 00010006
RAX: 86c2b200 RBX: a688 RCX: 83efc583
RDX: 0100 RSI: 812f4d82 RDI: 8881d2356200
RBP: 8881d23561e8 R08: 86c2b200 R09: ed103a46abeb
R10: ed103a46abea R11: 8881d2355f53 R12: dc00
R13: 11103b64132d R14: 8881d2355f50 R15: 0006
FS:  () GS:8881db20() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 7f75e2799000 CR3: 0001d3b07000 CR4: 001406f0
DR0:  DR1:  DR2: 
DR3: 00

Re: Bug at kernel/cred.c +443

2019-09-09 Thread rishabhb

On 2019-09-09 15:47, risha...@codeaurora.org wrote:

Hi Miklos

In 4.19 kernel when we write to a file that doesn't exist we see the
following stack:
[  377.382745]  ovl_create_or_link+0xac/0x710
[  377.382745]  ovl_create_object+0xb8/0x110
[  377.382745]  ovl_create+0x34/0x40
[  377.382745]  path_openat+0xd44/0x15a8
[  377.382745]  do_filp_open+0x80/0x128
[  377.382745]  do_sys_open+0x140/0x250
[  377.382745]  __arm64_sys_openat+0x2c/0x38

If the override_cred flag = off, the ovl_override_cred and
ovl_revert_cred just returns NULL.
But there is another override_cred in between these two functions;
put_cred(override_creds(override_cred));
put_cred(override_cred);

This will override the credentials permanently as there is no
corresponding revert_cred associated.
So whenever we do commit_creds for this task, we see a BUG_ON at
kernel/cred.c +443.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/cred.c#n443

Should this override_cred be changed to ovl_override_cred to maintain
consistency and avoid this
BUG_ON?


Thanks,
Rishabh


Corrected line number in the subject.


Bug at kernel/cred.c +432

2019-09-09 Thread rishabhb

Hi Miklos

In 4.19 kernel when we write to a file that doesn't exist we see the
following stack:
[  377.382745]  ovl_create_or_link+0xac/0x710
[  377.382745]  ovl_create_object+0xb8/0x110
[  377.382745]  ovl_create+0x34/0x40
[  377.382745]  path_openat+0xd44/0x15a8
[  377.382745]  do_filp_open+0x80/0x128
[  377.382745]  do_sys_open+0x140/0x250
[  377.382745]  __arm64_sys_openat+0x2c/0x38

If the override_cred flag = off, the ovl_override_cred and 
ovl_revert_cred just returns NULL.

But there is another override_cred in between these two functions;
put_cred(override_creds(override_cred));
put_cred(override_cred);

This will override the credentials permanently as there is no 
corresponding revert_cred associated.
So whenever we do commit_creds for this task, we see a BUG_ON at 
kernel/cred.c +443.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/cred.c#n443

Should this override_cred be changed to ovl_override_cred to maintain 
consistency and avoid this

BUG_ON?


Thanks,
Rishabh


Reminder: 1 open syzbot bug in "kernel/cgroup" subsystem

2019-07-23 Thread Eric Biggers
[This email was generated by a script.  Let me know if you have any suggestions
to make it better, or if you want it re-generated with the latest status.]

Of the currently open syzbot reports against the upstream kernel, I've manually
marked 1 of them as possibly being a bug in the "kernel/cgroup" subsystem.

If you believe this bug is no longer valid, please close the syzbot report by
sending a '#syz fix', '#syz dup', or '#syz invalid' command in reply to the
original thread, as explained at https://goo.gl/tpsmEJ#status

If you believe I misattributed this bug to the "kernel/cgroup" subsystem, please
let me know, and if possible forward the report to the correct people or mailing
list.

Here is the bug:


Title:  WARNING in cgroup_apply_control_enable
Last occurred:  4 days ago
Reported:   379 days ago
Branches:   Mainline and others
Dashboard link: 
https://syzkaller.appspot.com/bug?id=4cb6430631b7567848a6736be59d0350b3ada2ea
Original thread:
https://lkml.kernel.org/lkml/09f9a305708fa...@google.com/T/#u

This bug has a C reproducer.

No one replied to the original thread for this bug.

If you fix this bug, please add the following tag to the commit:
Reported-by: syzbot+5493b2a54d31d6aea...@syzkaller.appspotmail.com

If you send any email or patch for this bug, please consider replying to the
original thread.  For the git send-email command to use, or tips on how to reply
if the thread isn't in your mailbox, see the "Reply instructions" at
https://lkml.kernel.org/r/09f9a305708fa...@google.com



Reminder: 1 open syzbot bug in "kernel/cgroup" subsystem

2019-07-09 Thread Eric Biggers
[This email was generated by a script.  Let me know if you have any suggestions
to make it better, or if you want it re-generated with the latest status.]

Of the currently open syzbot reports against the upstream kernel, I've manually
marked 1 of them as possibly being a bug in the "kernel/cgroup" subsystem.

If you believe this bug is no longer valid, please close the syzbot report by
sending a '#syz fix', '#syz dup', or '#syz invalid' command in reply to the
original thread, as explained at https://goo.gl/tpsmEJ#status

If you believe I misattributed this bug to the "kernel/cgroup" subsystem, please
let me know, and if possible forward the report to the correct people or mailing
list.

Here is the bug:


Title:  WARNING in cgroup_apply_control_enable
Last occurred:  9 days ago
Reported:   365 days ago
Branches:   Mainline and others
Dashboard link: 
https://syzkaller.appspot.com/bug?id=4cb6430631b7567848a6736be59d0350b3ada2ea
Original thread:
https://lkml.kernel.org/lkml/09f9a305708fa...@google.com/T/#u

This bug has a C reproducer.

No one replied to the original thread for this bug.

If you fix this bug, please add the following tag to the commit:
Reported-by: syzbot+5493b2a54d31d6aea...@syzkaller.appspotmail.com

If you send any email or patch for this bug, please consider replying to the
original thread.  For the git send-email command to use, or tips on how to reply
if the thread isn't in your mailbox, see the "Reply instructions" at
https://lkml.kernel.org/r/09f9a305708fa...@google.com



Re: kernel BUG at kernel/cred.c:434!

2019-04-23 Thread Paul Moore
On Tue, Apr 23, 2019 at 12:08 AM Yang Yingliang
 wrote:
> On 2019/4/23 3:48, Paul Moore wrote:
> > On Sat, Apr 20, 2019 at 3:39 AM Yang Yingliang  
> > wrote:
> >> I'm not sure you got my point.
> > I went back and looked at your previous emails again to try and
> > understand what you are talking about, and I'm a little confused by
> > some of the output ...
> >
> >> --- a/kernel/acct.c
> >> +++ b/kernel/acct.c
> >> @@ -481,6 +481,7 @@ static void do_acct_process(struct bsd_acct_struct
> >> *acct)
> >>   flim = current->signal->rlim[RLIMIT_FSIZE].rlim_cur;
> >>   current->signal->rlim[RLIMIT_FSIZE].rlim_cur = RLIM_INFINITY;
> >>   /* Perform file operations on behalf of whoever enabled
> >> accounting */
> >> +   pr_info("task:%px new cred:%px real cred:%px cred:%px\n",
> >> current, file->f_cred, current->real_cred, current->cred);
> >>   orig_cred = override_creds(file->f_cred);
> > Okay, with this patch applied we should the task/cred info when
> > do_acct_process is called.  Got it.
> >
> >> Messages:
> >> [   56.643298] task:88841a9595c0 new cred:88841ae450c0 real
> >> cred:88841ae450c0 cred:88841ae450c0//They are same.
> > Okay, it looks like do_acct_process() was called and f_cred,
> > real_cred, and cred are all the same.
>
> This is a original message, without patch applied.

The patch I am referring to is your pr_info patch (above).  I'm not
talking about any other patches at the moment; I just want to
understand the example dmesg output you copied into your email.

With that in mind, the message above seems to indicate that
do_acct_process() has been invoked with f_cred, real_cred, and cred
all pointing to the same credentials struct, yes?

> >> [   56.646609] Process accounting resumed
> > It looks like do_acct_process() has called check_free_space() now.  So
> > far so good.

...

> >> [   56.649943] task:88841a9595c0 new cred:88841ae450c0 real
> >> cred:88841c96c300 cred:88841ae450c0
> > Wait a minute ... why are we seeing this again?  Looking at the task
> > pointer and the timestamp, this is the same task exiting and trying to
> > write to the accounting file, yes?  This output is particularly
> > curious since it appears that real_cred has changed; where is this
> > happening?
>
> This is the message when the BUG_ON was triggered without applying any
> fix patch.

The only place in the code that generates this message is the bit of
code that you patches in using the pr_info() patch (above), yes?  If
so, that would seem to indicate that the same task is calling
do_acct_process() twice, yes?  I may be fundamentally misunderstanding
something about process accounting, but I though do_acct_process()
would only be called once for a given task - while it was exiting.
Yes, no?

> If we apply this patch "proc: prevent changes to overridden
> credentials", program
> runs like this:

I'd like to focus on understanding the dmesg output you shared first,
because it doesn't seem to make sense to me.

-- 
paul moore
www.paul-moore.com


Re: kernel BUG at kernel/cred.c:434!

2019-04-22 Thread Yang Yingliang




On 2019/4/23 3:48, Paul Moore wrote:

On Sat, Apr 20, 2019 at 3:39 AM Yang Yingliang  wrote:

I'm not sure you got my point.

I went back and looked at your previous emails again to try and
understand what you are talking about, and I'm a little confused by
some of the output ...


--- a/kernel/acct.c
+++ b/kernel/acct.c
@@ -481,6 +481,7 @@ static void do_acct_process(struct bsd_acct_struct
*acct)
  flim = current->signal->rlim[RLIMIT_FSIZE].rlim_cur;
  current->signal->rlim[RLIMIT_FSIZE].rlim_cur = RLIM_INFINITY;
  /* Perform file operations on behalf of whoever enabled
accounting */
+   pr_info("task:%px new cred:%px real cred:%px cred:%px\n",
current, file->f_cred, current->real_cred, current->cred);
  orig_cred = override_creds(file->f_cred);

Okay, with this patch applied we should the task/cred info when
do_acct_process is called.  Got it.


Messages:
[   56.643298] task:88841a9595c0 new cred:88841ae450c0 real
cred:88841ae450c0 cred:88841ae450c0//They are same.

Okay, it looks like do_acct_process() was called and f_cred,
real_cred, and cred are all the same.

This is a original message, without patch applied.



[   56.646609] Process accounting resumed

It looks like do_acct_process() has called check_free_space() now.  So
far so good.


[   56.649943] task:88841a9595c0 new cred:88841ae450c0 real
cred:88841c96c300 cred:88841ae450c0

Wait a minute ... why are we seeing this again?  Looking at the task
pointer and the timestamp, this is the same task exiting and trying to
write to the accounting file, yes?  This output is particularly
curious since it appears that real_cred has changed; where is this
happening?

This is the message when the BUG_ON was triggered without applying any
fix patch.


If we apply this patch "proc: prevent changes to overridden 
credentials", program

runs like this:

1. As print message shows, before overriden, the pointer has the 
following value:

real_cread=cred=0x88841ae450c0, f_cred=0x88841ae450c0
override_creds() is called in do_acct_process():
...
/* Perform file operations on behalf of whoever enabled accounting */
orig_cred = override_creds(file->f_cred);
...


2. After override_creds(), if (current_cred() != current_real_cred()) is 
not work here,

we will call commit_creds()  in security_setprocattr().
...
/* Prevent changes to overridden credentials. */
if (current_cred() != current_real_cred()) {
rcu_read_unlock();
return -EBUSY;
}
...


3. After commit_creds(), we have new cred and real_cred.
security_setprocattr()//commit_creds is called here

4. revert_creds() is called in in do_acct_process(), the cred
is reverted to the old value(0x88841ae450c0)
...
current->signal->rlim[RLIMIT_FSIZE].rlim_cur = flim;
revert_creds(orig_cred);

5. After reverting, cred and real_cred are not equal.
If it has a risk to trigger the BUG_ON, when doing another
commit_creds() ?





Re: kernel BUG at kernel/cred.c:434!

2019-04-22 Thread Paul Moore
On Sat, Apr 20, 2019 at 3:39 AM Yang Yingliang  wrote:
> I'm not sure you got my point.

I went back and looked at your previous emails again to try and
understand what you are talking about, and I'm a little confused by
some of the output ...

> --- a/kernel/acct.c
> +++ b/kernel/acct.c
> @@ -481,6 +481,7 @@ static void do_acct_process(struct bsd_acct_struct
> *acct)
>  flim = current->signal->rlim[RLIMIT_FSIZE].rlim_cur;
>  current->signal->rlim[RLIMIT_FSIZE].rlim_cur = RLIM_INFINITY;
>  /* Perform file operations on behalf of whoever enabled
> accounting */
> +   pr_info("task:%px new cred:%px real cred:%px cred:%px\n",
> current, file->f_cred, current->real_cred, current->cred);
>  orig_cred = override_creds(file->f_cred);

Okay, with this patch applied we should the task/cred info when
do_acct_process is called.  Got it.

> Messages:
> [   56.643298] task:88841a9595c0 new cred:88841ae450c0 real
> cred:88841ae450c0 cred:88841ae450c0//They are same.

Okay, it looks like do_acct_process() was called and f_cred,
real_cred, and cred are all the same.

> [   56.646609] Process accounting resumed

It looks like do_acct_process() has called check_free_space() now.  So
far so good.

> [   56.649943] task:88841a9595c0 new cred:88841ae450c0 real
> cred:88841c96c300 cred:88841ae450c0

Wait a minute ... why are we seeing this again?  Looking at the task
pointer and the timestamp, this is the same task exiting and trying to
write to the accounting file, yes?  This output is particularly
curious since it appears that real_cred has changed; where is this
happening?

> [   56.653565] [ cut here ]
> [   56.655119] kernel BUG at kernel/cred.c:434!
> [   56.656590] invalid opcode:  [#1] SMP PTI
> [   56.658033] CPU: 2 PID: 4169 Comm: syz-executor.15 Not tainted
> 5.1.0-rc4-00034-g869e3305f23d-dirty #143
> [   56.661077] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
> [   56.664895] RIP: 0010:commit_creds+0x1eb/0x230
> [   56.666344] Code: 43 1c 0f 85 08 ff ff ff e9 10 ff ff ff 8b 45 10 39
> 43 10 0f 85 18 ff ff ff 8b 43 20 39 45 20 0f 85 0c ff ff ff e9 14 ff ff
> ff <0f> 0b 48 c7 c7 d0 d2 49 82 e8 17 3b 3e 00 0f 0b 48 c7 c7 c0 d2 49
> [   56.672410] RSP: 0018:c90003a17b20 EFLAGS: 00010287
> [   56.674098] RAX: 88841a9595c0 RBX: 88841ae450c0 RCX:
> 
> [   56.676410] RDX: 0001 RSI: 0020 RDI:
> 88841c96ce40
> [   56.678691] RBP: 0001 R08: 0080 R09:
> 
> [   56.680997] R10: 88841c9265a0 R11: 810d6940 R12:
> 88841a9595c0
> [   56.681198] task:88841a9195c0 new cred:88841aeaa0c0 real
> cred:88841aeaa0c0 cred:88841aeaa0c0
> [   56.683293] R13: 0040 R14: 88841c96ce40 R15:
> 0040
> [   56.683296] FS:  7f5969a5c700() GS:88842fa8()
> knlGS:
> [   56.683297] CS:  0010 DS:  ES:  CR0: 80050033
> [   56.683299] CR2: 7f82742214f0 CR3: 00041cbc0005 CR4:
> 000206e0
> [   56.683305] Call Trace:
> [   56.683340]  selinux_setprocattr+0x17b/0x480
> [   56.686513] Process accounting resumed
> [   56.688849]  proc_pid_attr_write+0xc0/0xf0
> [   56.688857]  __kernel_write+0x4f/0xf0
> [   56.688866]  do_acct_process+0x538/0x750
> [   56.703090]  ? __schedule+0x290/0x960
> [   56.704311]  ? __queue_work+0x160/0x5c0
> [   56.705571]  acct_pin_kill+0x1e/0x70
> [   56.706743]  pin_kill+0x81/0x150
> [   56.707813]  ? finish_wait+0x80/0x80
> [   56.708985]  mnt_pin_kill+0x1e/0x30
> [   56.710127]  cleanup_mnt+0x6e/0x70
> [   56.711247]  task_work_run+0x8a/0xb0
> [   56.712453]  do_exit+0x2e0/0xc80
> [   56.713525]  do_group_exit+0x33/0xb0
> [   56.714701]  get_signal+0x143/0x810
> [   56.715865]  do_signal+0x36/0x610
> [   56.716962]  ? __x64_sys_futex+0x134/0x180
> [   56.718307]  ? _copy_to_user+0x22/0x30
> [   56.719606]  exit_to_usermode_loop+0x80/0xe0
> [   56.721003]  do_syscall_64+0x16c/0x180
> [   56.722242]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

-- 
paul moore
www.paul-moore.com


Re: kernel BUG at kernel/cred.c:434!

2019-04-20 Thread Yang Yingliang




On 2019/4/20 0:13, Paul Moore wrote:

On Fri, Apr 19, 2019 at 10:34 AM Yang Yingliang
 wrote:

On 2019/4/19 21:24, Paul Moore wrote:

On Thu, Apr 18, 2019 at 10:42 PM Yang Yingliang
 wrote:

On 2019/4/19 10:04, Paul Moore wrote:

On Wed, Apr 17, 2019 at 10:50 PM Yang Yingliang
 wrote:

On 2019/4/18 8:24, Casey Schaufler wrote:

On 4/17/2019 4:39 PM, Paul Moore wrote:

Since it looks like all three LSMs which implement the setprocattr
hook are vulnerable I'm open to the idea that proc_pid_attr_write() is
a better choice for the cred != read_cred check, but I would want to
make sure John and Casey are okay with that.

John?

Casey?

I'm fine with the change going into proc_pid_attr_write().

The cred != real_cred checking is not enough.

Consider this situation, when doing override, cred, real_cred and
new_cred are all same:

after override_creds()cred == real_cred == new1_cred

I'm sorry, you've lost me.  After override_creds() returns
current->cred and current->real_cred are not going to be the same,
yes?

It's possible the new  cred is equal to current->real_cred and
current->cred,
so after overrides_creds(), they have the same value.

Both task_struct.cred and task_struct.real_cred are pointer values,
assuming that one uses prepare_creds() to allocate/initialize a new
cred struct for use with override_creds() then the newly created cred
should never be equal to task_struct.real_cred.  Am I missing
something, or are you thinking of something else?

In do_acct_process(), file->f_cred may equal to current->real_cred, I
confirm
it by adding some debug message in do_acct_process() like this:

I would expect that; real_cred is the task's objective DAC
credentials, so using it for f_cred makes sense.

What we are now talking about is the task's subjective credentials,
which can be overridden via override_creds(), and are what the LSMs
change via proc_pid_attr_write().

I'm not sure you got my point.

I was saying cred != real_cred check is not quite right, because the 
cred can

be overridden by a same pointer as my print messages showing.

"cred != real_cred" means override_creds() is called, but "cred == 
real_cred"

doesn't mean override_creds() is not called.

When we use "cred != real_cred" check, we may lost the situation that cred
is overridden by a same pointer. In this case, we will do 
override_creds() =>

commit_creds() => revert_creds(), this make cred != real_cred, when a new
commit_creds() is called, it also will trigger a BUG_ON().



--- a/kernel/acct.c
+++ b/kernel/acct.c
@@ -481,6 +481,7 @@ static void do_acct_process(struct bsd_acct_struct
*acct)
  flim = current->signal->rlim[RLIMIT_FSIZE].rlim_cur;
  current->signal->rlim[RLIMIT_FSIZE].rlim_cur = RLIM_INFINITY;
  /* Perform file operations on behalf of whoever enabled
accounting */
+   pr_info("task:%px new cred:%px real cred:%px cred:%px\n",
current, file->f_cred, current->real_cred, current->cred);
  orig_cred = override_creds(file->f_cred);





Re: kernel BUG at kernel/cred.c:434!

2019-04-19 Thread Paul Moore
On Thu, Apr 18, 2019 at 10:42 PM Yang Yingliang
 wrote:
> On 2019/4/19 10:04, Paul Moore wrote:
> > On Wed, Apr 17, 2019 at 10:50 PM Yang Yingliang
> >  wrote:
> >> On 2019/4/18 8:24, Casey Schaufler wrote:
> >>> On 4/17/2019 4:39 PM, Paul Moore wrote:
>  Since it looks like all three LSMs which implement the setprocattr
>  hook are vulnerable I'm open to the idea that proc_pid_attr_write() is
>  a better choice for the cred != read_cred check, but I would want to
>  make sure John and Casey are okay with that.
> 
>  John?
> 
>  Casey?
> >>> I'm fine with the change going into proc_pid_attr_write().
> >> The cred != real_cred checking is not enough.
> >>
> >> Consider this situation, when doing override, cred, real_cred and
> >> new_cred are all same:
> >>
> >> after override_creds()cred == real_cred == new1_cred
> > I'm sorry, you've lost me.  After override_creds() returns
> > current->cred and current->real_cred are not going to be the same,
> > yes?
>
> It's possible the new  cred is equal to current->real_cred and
> current->cred,
> so after overrides_creds(), they have the same value.

Both task_struct.cred and task_struct.real_cred are pointer values,
assuming that one uses prepare_creds() to allocate/initialize a new
cred struct for use with override_creds() then the newly created cred
should never be equal to task_struct.real_cred.  Am I missing
something, or are you thinking of something else?

-- 
paul moore
www.paul-moore.com


Re: kernel BUG at kernel/cred.c:434!

2019-04-19 Thread Paul Moore
On Fri, Apr 19, 2019 at 10:34 AM Yang Yingliang
 wrote:
> On 2019/4/19 21:24, Paul Moore wrote:
> > On Thu, Apr 18, 2019 at 10:42 PM Yang Yingliang
> >  wrote:
> >> On 2019/4/19 10:04, Paul Moore wrote:
> >>> On Wed, Apr 17, 2019 at 10:50 PM Yang Yingliang
> >>>  wrote:
>  On 2019/4/18 8:24, Casey Schaufler wrote:
> > On 4/17/2019 4:39 PM, Paul Moore wrote:
> >> Since it looks like all three LSMs which implement the setprocattr
> >> hook are vulnerable I'm open to the idea that proc_pid_attr_write() is
> >> a better choice for the cred != read_cred check, but I would want to
> >> make sure John and Casey are okay with that.
> >>
> >> John?
> >>
> >> Casey?
> > I'm fine with the change going into proc_pid_attr_write().
>  The cred != real_cred checking is not enough.
> 
>  Consider this situation, when doing override, cred, real_cred and
>  new_cred are all same:
> 
>  after override_creds()cred == real_cred == new1_cred
> >>> I'm sorry, you've lost me.  After override_creds() returns
> >>> current->cred and current->real_cred are not going to be the same,
> >>> yes?
> >> It's possible the new  cred is equal to current->real_cred and
> >> current->cred,
> >> so after overrides_creds(), they have the same value.
> > Both task_struct.cred and task_struct.real_cred are pointer values,
> > assuming that one uses prepare_creds() to allocate/initialize a new
> > cred struct for use with override_creds() then the newly created cred
> > should never be equal to task_struct.real_cred.  Am I missing
> > something, or are you thinking of something else?
>
> In do_acct_process(), file->f_cred may equal to current->real_cred, I
> confirm
> it by adding some debug message in do_acct_process() like this:

I would expect that; real_cred is the task's objective DAC
credentials, so using it for f_cred makes sense.

What we are now talking about is the task's subjective credentials,
which can be overridden via override_creds(), and are what the LSMs
change via proc_pid_attr_write().

> --- a/kernel/acct.c
> +++ b/kernel/acct.c
> @@ -481,6 +481,7 @@ static void do_acct_process(struct bsd_acct_struct
> *acct)
>  flim = current->signal->rlim[RLIMIT_FSIZE].rlim_cur;
>  current->signal->rlim[RLIMIT_FSIZE].rlim_cur = RLIM_INFINITY;
>  /* Perform file operations on behalf of whoever enabled
> accounting */
> +   pr_info("task:%px new cred:%px real cred:%px cred:%px\n",
> current, file->f_cred, current->real_cred, current->cred);
>  orig_cred = override_creds(file->f_cred);

-- 
paul moore
www.paul-moore.com


Re: kernel BUG at kernel/cred.c:434!

2019-04-19 Thread Yang Yingliang




On 2019/4/19 21:24, Paul Moore wrote:

On Thu, Apr 18, 2019 at 10:42 PM Yang Yingliang
 wrote:

On 2019/4/19 10:04, Paul Moore wrote:

On Wed, Apr 17, 2019 at 10:50 PM Yang Yingliang
 wrote:

On 2019/4/18 8:24, Casey Schaufler wrote:

On 4/17/2019 4:39 PM, Paul Moore wrote:

Since it looks like all three LSMs which implement the setprocattr
hook are vulnerable I'm open to the idea that proc_pid_attr_write() is
a better choice for the cred != read_cred check, but I would want to
make sure John and Casey are okay with that.

John?

Casey?

I'm fine with the change going into proc_pid_attr_write().

The cred != real_cred checking is not enough.

Consider this situation, when doing override, cred, real_cred and
new_cred are all same:

after override_creds()cred == real_cred == new1_cred

I'm sorry, you've lost me.  After override_creds() returns
current->cred and current->real_cred are not going to be the same,
yes?

It's possible the new  cred is equal to current->real_cred and
current->cred,
so after overrides_creds(), they have the same value.

Both task_struct.cred and task_struct.real_cred are pointer values,
assuming that one uses prepare_creds() to allocate/initialize a new
cred struct for use with override_creds() then the newly created cred
should never be equal to task_struct.real_cred.  Am I missing
something, or are you thinking of something else?
In do_acct_process(), file->f_cred may equal to current->real_cred, I 
confirm

it by adding some debug message in do_acct_process() like this:

--- a/kernel/acct.c
+++ b/kernel/acct.c
@@ -481,6 +481,7 @@ static void do_acct_process(struct bsd_acct_struct 
*acct)

flim = current->signal->rlim[RLIMIT_FSIZE].rlim_cur;
current->signal->rlim[RLIMIT_FSIZE].rlim_cur = RLIM_INFINITY;
/* Perform file operations on behalf of whoever enabled 
accounting */
+   pr_info("task:%px new cred:%px real cred:%px cred:%px\n", 
current, file->f_cred, current->real_cred, current->cred);

orig_cred = override_creds(file->f_cred);



Messages:
[   56.643298] task:88841a9595c0 new cred:88841ae450c0 real 
cred:88841ae450c0 cred:88841ae450c0//They are same.

[   56.646609] Process accounting resumed
[   56.649943] task:88841a9595c0 new cred:88841ae450c0 real 
cred:88841c96c300 cred:88841ae450c0

[   56.653565] --------[ cut here ]
[   56.655119] kernel BUG at kernel/cred.c:434!
[   56.656590] invalid opcode:  [#1] SMP PTI
[   56.658033] CPU: 2 PID: 4169 Comm: syz-executor.15 Not tainted 
5.1.0-rc4-00034-g869e3305f23d-dirty #143
[   56.661077] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014

[   56.664895] RIP: 0010:commit_creds+0x1eb/0x230
[   56.666344] Code: 43 1c 0f 85 08 ff ff ff e9 10 ff ff ff 8b 45 10 39 
43 10 0f 85 18 ff ff ff 8b 43 20 39 45 20 0f 85 0c ff ff ff e9 14 ff ff 
ff <0f> 0b 48 c7 c7 d0 d2 49 82 e8 17 3b 3e 00 0f 0b 48 c7 c7 c0 d2 49

[   56.672410] RSP: 0018:c90003a17b20 EFLAGS: 00010287
[   56.674098] RAX: 88841a9595c0 RBX: 88841ae450c0 RCX: 

[   56.676410] RDX: 0001 RSI: 0020 RDI: 
88841c96ce40
[   56.678691] RBP: 0001 R08: 0080 R09: 

[   56.680997] R10: 88841c9265a0 R11: 810d6940 R12: 
88841a9595c0
[   56.681198] task:88841a9195c0 new cred:88841aeaa0c0 real 
cred:88841aeaa0c0 cred:88841aeaa0c0
[   56.683293] R13: 0040 R14: 88841c96ce40 R15: 
0040
[   56.683296] FS:  7f5969a5c700() GS:88842fa8() 
knlGS:

[   56.683297] CS:  0010 DS:  ES:  CR0: 80050033
[   56.683299] CR2: 7f82742214f0 CR3: 00041cbc0005 CR4: 
000206e0

[   56.683305] Call Trace:
[   56.683340]  selinux_setprocattr+0x17b/0x480
[   56.686513] Process accounting resumed
[   56.688849]  proc_pid_attr_write+0xc0/0xf0
[   56.688857]  __kernel_write+0x4f/0xf0
[   56.688866]  do_acct_process+0x538/0x750
[   56.703090]  ? __schedule+0x290/0x960
[   56.704311]  ? __queue_work+0x160/0x5c0
[   56.705571]  acct_pin_kill+0x1e/0x70
[   56.706743]  pin_kill+0x81/0x150
[   56.707813]  ? finish_wait+0x80/0x80
[   56.708985]  mnt_pin_kill+0x1e/0x30
[   56.710127]  cleanup_mnt+0x6e/0x70
[   56.711247]  task_work_run+0x8a/0xb0
[   56.712453]  do_exit+0x2e0/0xc80
[   56.713525]  do_group_exit+0x33/0xb0
[   56.714701]  get_signal+0x143/0x810
[   56.715865]  do_signal+0x36/0x610
[   56.716962]  ? __x64_sys_futex+0x134/0x180
[   56.718307]  ? _copy_to_user+0x22/0x30
[   56.719606]  exit_to_usermode_loop+0x80/0xe0
[   56.721003]  do_syscall_64+0x16c/0x180
[   56.722242]  entry_SYSCALL_64_after_hwframe+0x44/0xa9





Re: kernel BUG at kernel/cred.c:434!

2019-04-18 Thread Yang Yingliang




On 2019/4/19 10:04, Paul Moore wrote:

On Wed, Apr 17, 2019 at 10:50 PM Yang Yingliang
 wrote:

On 2019/4/18 8:24, Casey Schaufler wrote:

On 4/17/2019 4:39 PM, Paul Moore wrote:

Since it looks like all three LSMs which implement the setprocattr
hook are vulnerable I'm open to the idea that proc_pid_attr_write() is
a better choice for the cred != read_cred check, but I would want to
make sure John and Casey are okay with that.

John?

Casey?

I'm fine with the change going into proc_pid_attr_write().

The cred != real_cred checking is not enough.

Consider this situation, when doing override, cred, real_cred and
new_cred are all same:

after override_creds()cred == real_cred == new1_cred

I'm sorry, you've lost me.  After override_creds() returns
current->cred and current->real_cred are not going to be the same,
yes?
It's possible the new  cred is equal to current->real_cred and 
current->cred,

so after overrides_creds(), they have the same value.



after prepare_creds() new2_cred
after commit_creds() becasue the check is false, so cred ==
real_cred == new2_cred
after revert_creds()cred == new1_cred, real_cred == new2_cred

It will cause cred != real_cred finally.





Re: kernel BUG at kernel/cred.c:434!

2019-04-18 Thread Paul Moore
On Wed, Apr 17, 2019 at 10:50 PM Yang Yingliang
 wrote:
> On 2019/4/18 8:24, Casey Schaufler wrote:
> > On 4/17/2019 4:39 PM, Paul Moore wrote:
> >>
> >> Since it looks like all three LSMs which implement the setprocattr
> >> hook are vulnerable I'm open to the idea that proc_pid_attr_write() is
> >> a better choice for the cred != read_cred check, but I would want to
> >> make sure John and Casey are okay with that.
> >>
> >> John?
> >>
> >> Casey?
> >
> > I'm fine with the change going into proc_pid_attr_write().
>
> The cred != real_cred checking is not enough.
>
> Consider this situation, when doing override, cred, real_cred and
> new_cred are all same:
>
> after override_creds()cred == real_cred == new1_cred

I'm sorry, you've lost me.  After override_creds() returns
current->cred and current->real_cred are not going to be the same,
yes?

> after prepare_creds() new2_cred
> after commit_creds() becasue the check is false, so cred ==
> real_cred == new2_cred
> after revert_creds()cred == new1_cred, real_cred == new2_cred
>
> It will cause cred != real_cred finally.

-- 
paul moore
www.paul-moore.com


Re: kernel BUG at kernel/cred.c:434!

2019-04-18 Thread Stephen Smalley

On 4/17/19 12:42 PM, Casey Schaufler wrote:

On 4/17/2019 9:27 AM, Oleg Nesterov wrote:

On 04/17, Paul Moore wrote:

On Wed, Apr 17, 2019 at 10:57 AM Oleg Nesterov  wrote:

On 04/17, Paul Moore wrote:

I'm tempted to simply return an error in selinux_setprocattr() if
the task's credentials are not the same as its real_cred;

What about other modules? I have no idea what smack_setprocattr() is,
but it too does prepare_creds/commit creds.

it seems that the simplest workaround should simply add the additional
cred == real_cred into proc_pid_attr_write().

Yes, that is simple, but I worry about what other LSMs might want to
do.  While I believe failing if the effective creds are not the same
as the real_creds is okay for SELinux (possibly Smack too), I worry
about what other LSMs may want to do.  After all,
proc_pid_attr_write() doesn't change the the creds itself, that is
something the specific LSMs do.

Yes, but if proc_pid_attr_write() is called with cred != real_cred then
something is already wrong?

In fact, I think that something is already wrong if it is not called by
user-space directly. Too late to ask, but why is this /proc/self/attr/
magic not implemented via syscall(s) ?


Shell scripts, for one thing. It's a straightforward and appropriate
use of the /proc interface. System calls would require additional change
to existing programs, whereas using the /proc interface allows a good
deal to be done in the containing scripts.


It is fairly awkward/fragile to use these interfaces from shell scripts 
due to limitations of the interface (e.g. no partial writes, thereby 
breaking if the shell splits up the data into multiple write() calls) 
and due to the fact that most of the attributes (except for current) are 
typically reset/cleared upon exec to prevent undue caller/callee 
influence.  It works well enough for echo -n "somelabel" > 
/proc/self/attr/current but not much else, and even that doesn't work 
without the -n due to multiple write() calls.


Just for the record, this functionality was originally implemented via 
separate system calls at least for SELinux (in its original kernel 
patch), then multiplexed through the single LSM security system call 
upon porting to LSM (before merging to mainline), but viro and others 
strongly favored using /proc as the interface for getting/setting any 
new process attributes.  Understandable, but it does impose some 
limitations, including a dependency on having a writable proc mount in 
the namespace of any process that needs to set any of these attributes,


Re: kernel BUG at kernel/cred.c:434!

2019-04-17 Thread Yang Yingliang

Hi, Casey

On 2019/4/18 8:24, Casey Schaufler wrote:

On 4/17/2019 4:39 PM, Paul Moore wrote:

On Wed, Apr 17, 2019 at 12:27 PM Oleg Nesterov  wrote:

On 04/17, Paul Moore wrote:
On Wed, Apr 17, 2019 at 10:57 AM Oleg Nesterov  
wrote:

On 04/17, Paul Moore wrote:

I'm tempted to simply return an error in selinux_setprocattr() if
the task's credentials are not the same as its real_cred;

What about other modules? I have no idea what smack_setprocattr() is,
but it too does prepare_creds/commit creds.

it seems that the simplest workaround should simply add the 
additional

cred == real_cred into proc_pid_attr_write().

Yes, that is simple, but I worry about what other LSMs might want to
do.  While I believe failing if the effective creds are not the same
as the real_creds is okay for SELinux (possibly Smack too), I worry
about what other LSMs may want to do.  After all,
proc_pid_attr_write() doesn't change the the creds itself, that is
something the specific LSMs do.

Yes, but if proc_pid_attr_write() is called with cred != real_cred then
something is already wrong?

True, or at least I would think so.

Looking at the current tree there are three LSMs which implement
setprocattr hooks: SELinux, Smack, and AppArmor.  I know Casey has
already mentioned that he wasn't able to trigger the problem in Smack,
but looking at smack_setprocattr() I see the similar commit_creds()
usage so I would expect the same problem in Smack; what say you Casey?


I say that my test program runs without ill effect. I call acct()
with "/proc/self/attr/current", which succeeds and enables accounting
just like it is supposed to. I then have the program open
"/proc/self/attr/current" and read it, all of which goes swimmingly.
When Smack frees a cred it usually does not free any memory of its
own, so it is conceivable that I'm just getting lucky. Or, I may not
have sufficient debug enabled.


  Looking at apparmor_setprocattr(), it appears that it too could end
up calling commit_creds() via aa_set_current_hat().

Since it looks like all three LSMs which implement the setprocattr
hook are vulnerable I'm open to the idea that proc_pid_attr_write() is
a better choice for the cred != read_cred check, but I would want to
make sure John and Casey are okay with that.

John?

Casey?


I'm fine with the change going into proc_pid_attr_write().

The cred != real_cred checking is not enough.

Consider this situation, when doing override, cred, real_cred and 
new_cred are all same:


after override_creds()cred == real_cred == new1_cred
after prepare_creds() new2_cred
after commit_creds() becasue the check is false, so cred == 
real_cred == new2_cred

after revert_creds()cred == new1_cred, real_cred == new2_cred

It will cause cred != real_cred finally.


Regards,
Yang





Re: kernel BUG at kernel/cred.c:434!

2019-04-17 Thread Casey Schaufler

On 4/17/2019 4:39 PM, Paul Moore wrote:

On Wed, Apr 17, 2019 at 12:27 PM Oleg Nesterov  wrote:

On 04/17, Paul Moore wrote:

On Wed, Apr 17, 2019 at 10:57 AM Oleg Nesterov  wrote:

On 04/17, Paul Moore wrote:

I'm tempted to simply return an error in selinux_setprocattr() if
the task's credentials are not the same as its real_cred;

What about other modules? I have no idea what smack_setprocattr() is,
but it too does prepare_creds/commit creds.

it seems that the simplest workaround should simply add the additional
cred == real_cred into proc_pid_attr_write().

Yes, that is simple, but I worry about what other LSMs might want to
do.  While I believe failing if the effective creds are not the same
as the real_creds is okay for SELinux (possibly Smack too), I worry
about what other LSMs may want to do.  After all,
proc_pid_attr_write() doesn't change the the creds itself, that is
something the specific LSMs do.

Yes, but if proc_pid_attr_write() is called with cred != real_cred then
something is already wrong?

True, or at least I would think so.

Looking at the current tree there are three LSMs which implement
setprocattr hooks: SELinux, Smack, and AppArmor.  I know Casey has
already mentioned that he wasn't able to trigger the problem in Smack,
but looking at smack_setprocattr() I see the similar commit_creds()
usage so I would expect the same problem in Smack; what say you Casey?


I say that my test program runs without ill effect. I call acct()
with "/proc/self/attr/current", which succeeds and enables accounting
just like it is supposed to. I then have the program open
"/proc/self/attr/current" and read it, all of which goes swimmingly.
When Smack frees a cred it usually does not free any memory of its
own, so it is conceivable that I'm just getting lucky. Or, I may not
have sufficient debug enabled.


  Looking at apparmor_setprocattr(), it appears that it too could end
up calling commit_creds() via aa_set_current_hat().

Since it looks like all three LSMs which implement the setprocattr
hook are vulnerable I'm open to the idea that proc_pid_attr_write() is
a better choice for the cred != read_cred check, but I would want to
make sure John and Casey are okay with that.

John?

Casey?


I'm fine with the change going into proc_pid_attr_write().



Re: kernel BUG at kernel/cred.c:434!

2019-04-17 Thread John Johansen
On 4/17/19 4:39 PM, Paul Moore wrote:
> On Wed, Apr 17, 2019 at 12:27 PM Oleg Nesterov  wrote:
>> On 04/17, Paul Moore wrote:
>>>
>>> On Wed, Apr 17, 2019 at 10:57 AM Oleg Nesterov  wrote:
 On 04/17, Paul Moore wrote:
>
> I'm tempted to simply return an error in selinux_setprocattr() if
> the task's credentials are not the same as its real_cred;

 What about other modules? I have no idea what smack_setprocattr() is,
 but it too does prepare_creds/commit creds.

 it seems that the simplest workaround should simply add the additional
 cred == real_cred into proc_pid_attr_write().
>>>
>>> Yes, that is simple, but I worry about what other LSMs might want to
>>> do.  While I believe failing if the effective creds are not the same
>>> as the real_creds is okay for SELinux (possibly Smack too), I worry
>>> about what other LSMs may want to do.  After all,
>>> proc_pid_attr_write() doesn't change the the creds itself, that is
>>> something the specific LSMs do.
>>
>> Yes, but if proc_pid_attr_write() is called with cred != real_cred then
>> something is already wrong?
> 
> True, or at least I would think so.
> 
> Looking at the current tree there are three LSMs which implement
> setprocattr hooks: SELinux, Smack, and AppArmor.  I know Casey has
> already mentioned that he wasn't able to trigger the problem in Smack,
> but looking at smack_setprocattr() I see the similar commit_creds()
> usage so I would expect the same problem in Smack; what say you Casey?
>  Looking at apparmor_setprocattr(), it appears that it too could end
> up calling commit_creds() via aa_set_current_hat().
> 
> Since it looks like all three LSMs which implement the setprocattr
> hook are vulnerable I'm open to the idea that proc_pid_attr_write() is
> a better choice for the cred != read_cred check, but I would want to
> make sure John and Casey are okay with that.
> 
> John?

heh yeah,

seems I messed up our check, we actually have

if (current_cred() != current_real_cred())
return -EBUSY;

on the change_profile path and missed it on the change_hat
one.

It makes sense to lift the check up earlier



Re: kernel BUG at kernel/cred.c:434!

2019-04-17 Thread Paul Moore
On Wed, Apr 17, 2019 at 12:27 PM Oleg Nesterov  wrote:
> On 04/17, Paul Moore wrote:
> >
> > On Wed, Apr 17, 2019 at 10:57 AM Oleg Nesterov  wrote:
> > > On 04/17, Paul Moore wrote:
> > > >
> > > > I'm tempted to simply return an error in selinux_setprocattr() if
> > > > the task's credentials are not the same as its real_cred;
> > >
> > > What about other modules? I have no idea what smack_setprocattr() is,
> > > but it too does prepare_creds/commit creds.
> > >
> > > it seems that the simplest workaround should simply add the additional
> > > cred == real_cred into proc_pid_attr_write().
> >
> > Yes, that is simple, but I worry about what other LSMs might want to
> > do.  While I believe failing if the effective creds are not the same
> > as the real_creds is okay for SELinux (possibly Smack too), I worry
> > about what other LSMs may want to do.  After all,
> > proc_pid_attr_write() doesn't change the the creds itself, that is
> > something the specific LSMs do.
>
> Yes, but if proc_pid_attr_write() is called with cred != real_cred then
> something is already wrong?

True, or at least I would think so.

Looking at the current tree there are three LSMs which implement
setprocattr hooks: SELinux, Smack, and AppArmor.  I know Casey has
already mentioned that he wasn't able to trigger the problem in Smack,
but looking at smack_setprocattr() I see the similar commit_creds()
usage so I would expect the same problem in Smack; what say you Casey?
 Looking at apparmor_setprocattr(), it appears that it too could end
up calling commit_creds() via aa_set_current_hat().

Since it looks like all three LSMs which implement the setprocattr
hook are vulnerable I'm open to the idea that proc_pid_attr_write() is
a better choice for the cred != read_cred check, but I would want to
make sure John and Casey are okay with that.

John?

Casey?

-- 
paul moore
www.paul-moore.com


Re: kernel BUG at kernel/cred.c:434!

2019-04-17 Thread Casey Schaufler

On 4/17/2019 9:27 AM, Oleg Nesterov wrote:

On 04/17, Paul Moore wrote:

On Wed, Apr 17, 2019 at 10:57 AM Oleg Nesterov  wrote:

On 04/17, Paul Moore wrote:

I'm tempted to simply return an error in selinux_setprocattr() if
the task's credentials are not the same as its real_cred;

What about other modules? I have no idea what smack_setprocattr() is,
but it too does prepare_creds/commit creds.

it seems that the simplest workaround should simply add the additional
cred == real_cred into proc_pid_attr_write().

Yes, that is simple, but I worry about what other LSMs might want to
do.  While I believe failing if the effective creds are not the same
as the real_creds is okay for SELinux (possibly Smack too), I worry
about what other LSMs may want to do.  After all,
proc_pid_attr_write() doesn't change the the creds itself, that is
something the specific LSMs do.

Yes, but if proc_pid_attr_write() is called with cred != real_cred then
something is already wrong?

In fact, I think that something is already wrong if it is not called by
user-space directly. Too late to ask, but why is this /proc/self/attr/
magic not implemented via syscall(s) ?


Shell scripts, for one thing. It's a straightforward and appropriate
use of the /proc interface. System calls would require additional change
to existing programs, whereas using the /proc interface allows a good
deal to be done in the containing scripts.
 



Re: kernel BUG at kernel/cred.c:434!

2019-04-17 Thread Oleg Nesterov
On 04/17, Paul Moore wrote:
>
> On Wed, Apr 17, 2019 at 10:57 AM Oleg Nesterov  wrote:
> > On 04/17, Paul Moore wrote:
> > >
> > > I'm tempted to simply return an error in selinux_setprocattr() if
> > > the task's credentials are not the same as its real_cred;
> >
> > What about other modules? I have no idea what smack_setprocattr() is,
> > but it too does prepare_creds/commit creds.
> >
> > it seems that the simplest workaround should simply add the additional
> > cred == real_cred into proc_pid_attr_write().
>
> Yes, that is simple, but I worry about what other LSMs might want to
> do.  While I believe failing if the effective creds are not the same
> as the real_creds is okay for SELinux (possibly Smack too), I worry
> about what other LSMs may want to do.  After all,
> proc_pid_attr_write() doesn't change the the creds itself, that is
> something the specific LSMs do.

Yes, but if proc_pid_attr_write() is called with cred != real_cred then
something is already wrong?

In fact, I think that something is already wrong if it is not called by
user-space directly. Too late to ask, but why is this /proc/self/attr/
magic not implemented via syscall(s) ?

But, Paul, this is up to you. I don't understand this all even remotely.

Oleg.



Re: kernel BUG at kernel/cred.c:434!

2019-04-17 Thread Paul Moore
On Wed, Apr 17, 2019 at 10:57 AM Oleg Nesterov  wrote:
> On 04/17, Paul Moore wrote:
> >
> > I'm tempted to simply return an error in selinux_setprocattr() if
> > the task's credentials are not the same as its real_cred;
>
> What about other modules? I have no idea what smack_setprocattr() is,
> but it too does prepare_creds/commit creds.
>
> it seems that the simplest workaround should simply add the additional
> cred == real_cred into proc_pid_attr_write().

Yes, that is simple, but I worry about what other LSMs might want to
do.  While I believe failing if the effective creds are not the same
as the real_creds is okay for SELinux (possibly Smack too), I worry
about what other LSMs may want to do.  After all,
proc_pid_attr_write() doesn't change the the creds itself, that is
something the specific LSMs do.

-- 
paul moore
www.paul-moore.com


Re: kernel BUG at kernel/cred.c:434!

2019-04-17 Thread Casey Schaufler

On 4/17/2019 7:57 AM, Oleg Nesterov wrote:

On 04/17, Paul Moore wrote:

I'm tempted to simply return an error in selinux_setprocattr() if
the task's credentials are not the same as its real_cred;

What about other modules? I have no idea what smack_setprocattr() is,
but it too does prepare_creds/commit creds.


For what it's worth, my test for Smack does not reproduce
the problem.



it seems that the simplest workaround should simply add the additional
cred == real_cred into proc_pid_attr_write().

Oleg.



Re: kernel BUG at kernel/cred.c:434!

2019-04-17 Thread Oleg Nesterov
On 04/17, Paul Moore wrote:
>
> I'm tempted to simply return an error in selinux_setprocattr() if
> the task's credentials are not the same as its real_cred;

What about other modules? I have no idea what smack_setprocattr() is,
but it too does prepare_creds/commit creds.

it seems that the simplest workaround should simply add the additional
cred == real_cred into proc_pid_attr_write().

Oleg.



Re: kernel BUG at kernel/cred.c:434!

2019-04-17 Thread Paul Moore
On Tue, Apr 16, 2019 at 10:46 AM chengjian (D)  wrote:
> On 2019/4/16 11:40, Kees Cook wrote:
> > On Mon, Apr 15, 2019 at 11:20 AM Paul Moore  wrote:
> >> On Mon, Apr 15, 2019 at 11:05 AM Oleg Nesterov  wrote:
> >>> On 04/15, Paul Moore wrote:
>  On Mon, Apr 15, 2019 at 9:43 AM Oleg Nesterov  wrote:
> > Well, acct("/proc/self/attr/current") doesn't look like a good idea, 
> > but I do
> > not know where should we put the additional check... And probably
> > "echo /proc/self/attr/current > /proc/sys/kernel/core_pattern" can hit 
> > the
> > same problem, do_coredump() does override_creds() too.
> >
> > May be just add
> >
> >  if (current->cred != current->real_cred)
> >  return -EACCES;
> >
> > into proc_pid_attr_write(), I dunno.
>  Is the problem that do_acct_process() is calling override_creds() and
>  the returned/old credentials are being freed before do_acct_process()
>  can reinstall the creds via revert_creds()?  Presumably because the
>  process accounting is causing the credentials to be replaced?
> >>> Afaics, the problem is that do_acct_process() does override_creds() and
> >>> then __kernel_write(). Which calls proc_pid_attr_write(), which in turn 
> >>> calls
> >>> selinux_setprocattr(), which does another prepare_creds() + 
> >>> commit_creds();
> >>> and commit_creds() hits
> >>>
> >>>  BUG_ON(task->cred != old);
> >> Gotcha.  In the process of looking at the backtrace I forgot about the
> >> BUG_ON() at the top of the oops message.
> >>
> >> I wonder what terrible things would happen if we changed the BUG_ON()
> >> in commit_creds to simple returning an error an error code to the
> >> caller.  There is a warning/requirement in commit_creds() function
> >> header comment that it should always return 0.
> > Would callers be expected to call abort_creds() on failure? There are
> > a number of places where it'd need fixing up. And would likely be best
> > with a __must_check marking.
> >
> > It seems like avoiding the pathological case might be simpler?
>
> Yeah, Avoiding this pathological case is a better solution.

No arguments that this is particularly messed up scenario, I'm just
trying to arrive at a solution that isn't too ugly.

> From: Yang Yingliang 
> Date: Sat, 13 Apr 2019 21:56:01 +0800
> Subject: [PATCH] fix cred bug_on
>
> Signed-off-by: Yang Yingliang 
> ---
>   kernel/acct.c| 3 ++-
>   kernel/cred.c| 6 ++
>   security/selinux/hooks.c | 2 ++
>   3 files changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/acct.c b/kernel/acct.c
> index addf7732fb56..f2065f899eee 100644
> --- a/kernel/acct.c
> +++ b/kernel/acct.c
> @@ -522,7 +522,8 @@ static void do_acct_process(struct bsd_acct_struct
> *acct)
>   }
>   out:
>   current->signal->rlim[RLIMIT_FSIZE].rlim_cur = flim;
> -revert_creds(orig_cred);
> +if (orig_cred == current->real_cred)// [2]
> +revert_creds(orig_cred);
>   }
>
>   /**
> diff --git a/kernel/cred.c b/kernel/cred.c
> index ecf03657e71c..c4d5ba92fb9b 100644
> --- a/kernel/cred.c
> +++ b/kernel/cred.c
> @@ -522,6 +522,9 @@ const struct cred *override_creds(const struct cred
> *new)
>   {
>   const struct cred *old = current->cred;
>
> +if (old == new)//  [3]
> +return old;
> +
>   kdebug("override_creds(%p{%d,%d})", new,
>  atomic_read(>usage),
>  read_cred_subscribers(new));
> @@ -551,6 +554,9 @@ void revert_creds(const struct cred *old)
>   {
>   const struct cred *override = current->cred;
>
> +if (override == old)// [3]
> +return;
> +
>   kdebug("revert_creds(%p{%d,%d})", old,
>  atomic_read(>usage),
>  read_cred_subscribers(old));
> diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
> index b5017beb4ef7..bc8108e4e90f 100644
> --- a/security/selinux/hooks.c
> +++ b/security/selinux/hooks.c
> @@ -6590,6 +6590,8 @@ static int selinux_setprocattr(const char *name,
> void *value, size_t size)
>   goto abort_change;
>   }
>
> +if (current->cred != current->real_cred)// [1]
> +revert_creds(current->real_cred);
>   commit_creds(new);
>   return size;

Doing the revert only to then commit the creds seems really ugly to
me.  I'm tempted to simply return an error in selinux_setprocattr() if
the task's credentials are not the same as its real_cred; if we do
that I believe we should resolve this problem.  The accounting write
to the SELinux file in /proc would fail of course, but I think we can
all consider that as a positive side-effect.

While I don't think this should have a negative impact on anything
else, I haven't convinced myself of that just yet.

-- 
paul moore
www.paul-moore.com


Re: kernel BUG at kernel/cred.c:434!

2019-04-16 Thread chengjian (D)



On 2019/4/16 11:40, Kees Cook wrote:

On Mon, Apr 15, 2019 at 11:20 AM Paul Moore  wrote:

On Mon, Apr 15, 2019 at 11:05 AM Oleg Nesterov  wrote:

On 04/15, Paul Moore wrote:

On Mon, Apr 15, 2019 at 9:43 AM Oleg Nesterov  wrote:

Well, acct("/proc/self/attr/current") doesn't look like a good idea, but I do
not know where should we put the additional check... And probably
"echo /proc/self/attr/current > /proc/sys/kernel/core_pattern" can hit the
same problem, do_coredump() does override_creds() too.

May be just add

 if (current->cred != current->real_cred)
 return -EACCES;

into proc_pid_attr_write(), I dunno.

Is the problem that do_acct_process() is calling override_creds() and
the returned/old credentials are being freed before do_acct_process()
can reinstall the creds via revert_creds()?  Presumably because the
process accounting is causing the credentials to be replaced?

Afaics, the problem is that do_acct_process() does override_creds() and
then __kernel_write(). Which calls proc_pid_attr_write(), which in turn calls
selinux_setprocattr(), which does another prepare_creds() + commit_creds();
and commit_creds() hits

 BUG_ON(task->cred != old);

Gotcha.  In the process of looking at the backtrace I forgot about the
BUG_ON() at the top of the oops message.

I wonder what terrible things would happen if we changed the BUG_ON()
in commit_creds to simple returning an error an error code to the
caller.  There is a warning/requirement in commit_creds() function
header comment that it should always return 0.

Would callers be expected to call abort_creds() on failure? There are
a number of places where it'd need fixing up. And would likely be best
with a __must_check marking.

It seems like avoiding the pathological case might be simpler?



Yeah, Avoiding this pathological case is a better solution.

It seems like that we can't commit_creds() during

override_creds() and revert_creds().

So how about just put commit_creds outside !


just like:

    override_creds()   // cred  -=> new

    // may BUG_ON if commit_creds done.

    revert_creds() //  cred -=> old 
<---|


    commit_creds   //  cred = real_cred = new    |

[revert_creds]--|


[1]--Before we call commit_creds in selinux_setprocattr(),

if we find that cred != real_cred, it may have been overridden

before, we should revert it.

[2]--The same to revert_creds, when we found someone have committed,

orig_cred != current->real_cred may hits, this means that

we have reverted before(see [1]).

[3]--Sometimes new and old are the same, then we need to consider this

situation specially.


The code just like:


From: Yang Yingliang 
Date: Sat, 13 Apr 2019 21:56:01 +0800
Subject: [PATCH] fix cred bug_on

Signed-off-by: Yang Yingliang 
---
 kernel/acct.c    | 3 ++-
 kernel/cred.c    | 6 ++
 security/selinux/hooks.c | 2 ++
 3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/kernel/acct.c b/kernel/acct.c
index addf7732fb56..f2065f899eee 100644
--- a/kernel/acct.c
+++ b/kernel/acct.c
@@ -522,7 +522,8 @@ static void do_acct_process(struct bsd_acct_struct 
*acct)

 }
 out:
 current->signal->rlim[RLIMIT_FSIZE].rlim_cur = flim;
-    revert_creds(orig_cred);
+    if (orig_cred == current->real_cred)    // [2]
+        revert_creds(orig_cred);
 }

 /**
diff --git a/kernel/cred.c b/kernel/cred.c
index ecf03657e71c..c4d5ba92fb9b 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -522,6 +522,9 @@ const struct cred *override_creds(const struct cred 
*new)

 {
 const struct cred *old = current->cred;

+    if (old == new)    //  [3]
+        return old;
+
 kdebug("override_creds(%p{%d,%d})", new,
    atomic_read(>usage),
    read_cred_subscribers(new));
@@ -551,6 +554,9 @@ void revert_creds(const struct cred *old)
 {
 const struct cred *override = current->cred;

+    if (override == old)    // [3]
+        return;
+
 kdebug("revert_creds(%p{%d,%d})", old,
    atomic_read(>usage),
    read_cred_subscribers(old));
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index b5017beb4ef7..bc8108e4e90f 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -6590,6 +6590,8 @@ static int selinux_setprocattr(const char *name, 
void *value, size_t size)

     goto abort_change;
 }

+    if (current->cred != current->real_cred)    // [1]
+        revert_creds(current->real_cred);
 commit_creds(new);
 return size;

--
2.17.1


We have tested this patch for 3 days and it works well.

Are there any cases that are not covered here ?


Thanks.

    Cheng Jian




Re: kernel BUG at kernel/cred.c:434!

2019-04-15 Thread Kees Cook
On Mon, Apr 15, 2019 at 11:20 AM Paul Moore  wrote:
>
> On Mon, Apr 15, 2019 at 11:05 AM Oleg Nesterov  wrote:
> > On 04/15, Paul Moore wrote:
> > >
> > > On Mon, Apr 15, 2019 at 9:43 AM Oleg Nesterov  wrote:
> > > > Well, acct("/proc/self/attr/current") doesn't look like a good idea, 
> > > > but I do
> > > > not know where should we put the additional check... And probably
> > > > "echo /proc/self/attr/current > /proc/sys/kernel/core_pattern" can hit 
> > > > the
> > > > same problem, do_coredump() does override_creds() too.
> > > >
> > > > May be just add
> > > >
> > > > if (current->cred != current->real_cred)
> > > > return -EACCES;
> > > >
> > > > into proc_pid_attr_write(), I dunno.
> > >
> > > Is the problem that do_acct_process() is calling override_creds() and
> > > the returned/old credentials are being freed before do_acct_process()
> > > can reinstall the creds via revert_creds()?  Presumably because the
> > > process accounting is causing the credentials to be replaced?
> >
> > Afaics, the problem is that do_acct_process() does override_creds() and
> > then __kernel_write(). Which calls proc_pid_attr_write(), which in turn 
> > calls
> > selinux_setprocattr(), which does another prepare_creds() + commit_creds();
> > and commit_creds() hits
> >
> > BUG_ON(task->cred != old);
>
> Gotcha.  In the process of looking at the backtrace I forgot about the
> BUG_ON() at the top of the oops message.
>
> I wonder what terrible things would happen if we changed the BUG_ON()
> in commit_creds to simple returning an error an error code to the
> caller.  There is a warning/requirement in commit_creds() function
> header comment that it should always return 0.

Would callers be expected to call abort_creds() on failure? There are
a number of places where it'd need fixing up. And would likely be best
with a __must_check marking.

It seems like avoiding the pathological case might be simpler?

-- 
Kees Cook


Re: kernel BUG at kernel/cred.c:434!

2019-04-15 Thread Paul Moore
On Mon, Apr 15, 2019 at 11:05 AM Oleg Nesterov  wrote:
> On 04/15, Paul Moore wrote:
> >
> > On Mon, Apr 15, 2019 at 9:43 AM Oleg Nesterov  wrote:
> > > Well, acct("/proc/self/attr/current") doesn't look like a good idea, but 
> > > I do
> > > not know where should we put the additional check... And probably
> > > "echo /proc/self/attr/current > /proc/sys/kernel/core_pattern" can hit the
> > > same problem, do_coredump() does override_creds() too.
> > >
> > > May be just add
> > >
> > > if (current->cred != current->real_cred)
> > > return -EACCES;
> > >
> > > into proc_pid_attr_write(), I dunno.
> >
> > Is the problem that do_acct_process() is calling override_creds() and
> > the returned/old credentials are being freed before do_acct_process()
> > can reinstall the creds via revert_creds()?  Presumably because the
> > process accounting is causing the credentials to be replaced?
>
> Afaics, the problem is that do_acct_process() does override_creds() and
> then __kernel_write(). Which calls proc_pid_attr_write(), which in turn calls
> selinux_setprocattr(), which does another prepare_creds() + commit_creds();
> and commit_creds() hits
>
> BUG_ON(task->cred != old);

Gotcha.  In the process of looking at the backtrace I forgot about the
BUG_ON() at the top of the oops message.

I wonder what terrible things would happen if we changed the BUG_ON()
in commit_creds to simple returning an error an error code to the
caller.  There is a warning/requirement in commit_creds() function
header comment that it should always return 0.

-- 
paul moore
www.paul-moore.com


Re: kernel BUG at kernel/cred.c:434!

2019-04-15 Thread Oleg Nesterov
On 04/15, Paul Moore wrote:
>
> On Mon, Apr 15, 2019 at 9:43 AM Oleg Nesterov  wrote:
> > Well, acct("/proc/self/attr/current") doesn't look like a good idea, but I 
> > do
> > not know where should we put the additional check... And probably
> > "echo /proc/self/attr/current > /proc/sys/kernel/core_pattern" can hit the
> > same problem, do_coredump() does override_creds() too.
> >
> > May be just add
> >
> > if (current->cred != current->real_cred)
> > return -EACCES;
> >
> > into proc_pid_attr_write(), I dunno.
>
> Is the problem that do_acct_process() is calling override_creds() and
> the returned/old credentials are being freed before do_acct_process()
> can reinstall the creds via revert_creds()?  Presumably because the
> process accounting is causing the credentials to be replaced?

Afaics, the problem is that do_acct_process() does override_creds() and
then __kernel_write(). Which calls proc_pid_attr_write(), which in turn calls
selinux_setprocattr(), which does another prepare_creds() + commit_creds();
and commit_creds() hits

BUG_ON(task->cred != old);

Oleg.



Re: kernel BUG at kernel/cred.c:434!

2019-04-15 Thread Paul Moore
On Mon, Apr 15, 2019 at 9:43 AM Oleg Nesterov  wrote:
> Well, acct("/proc/self/attr/current") doesn't look like a good idea, but I do
> not know where should we put the additional check... And probably
> "echo /proc/self/attr/current > /proc/sys/kernel/core_pattern" can hit the
> same problem, do_coredump() does override_creds() too.
>
> May be just add
>
> if (current->cred != current->real_cred)
> return -EACCES;
>
> into proc_pid_attr_write(), I dunno.

Is the problem that do_acct_process() is calling override_creds() and
the returned/old credentials are being freed before do_acct_process()
can reinstall the creds via revert_creds()?  Presumably because the
process accounting is causing the credentials to be replaced?

> On 04/12, Casey Schaufler wrote:
> >
> > On 4/11/2019 11:21 PM, chengjian (D) wrote:
> >
> > Added LSM and SELinux lists.
> >
> >
> > >Hi.
> > >
> > >
> > >syzkaller reported the following BUG:
> > >
> > >[   73.146973] kernel BUG at kernel/cred.c:434!
> > >[   73.150231] invalid opcode:  [#1] SMP KASAN PTI
> > >[   73.151928] CPU: 2 PID: 4058 Comm: syz-executor.6 Not tainted
> > >5.1.0-rc4-00062-g2d06b235815e-dirty #2
> > >[   73.155174] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> > >rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
> > >[   73.159798] RIP: 0010:commit_creds+0xadb/0xe50
> > >[   73.161426] Code: 8b 5b 20 48 c1 ea 03 0f b6 04 02 84 c0 74 08 3c 03 0f
> > >8e 06 03 00 00 39 5d 20 0f 85 ff fa ff ff e9 0c fb ff ff e8 05 a2 25 00
> > ><0f> 0b 48 c7 c7 80 56 c0 83 e8 95 22 b1 00 e8 f2 a1 25 00 0f 0b 48
> > >[   73.167852] RSP: :88836e65f5d0 EFLAGS: 00010293
> > >[   73.169636] RAX: 8883767b RBX: 88837f111300 RCX:
> > >8124b5db
> > >[   73.171962] RDX:  RSI: 83c9b140 RDI:
> > >88837f111300
> > >[   73.174310] RBP: 888376610400 R08:  R09:
> > >0004
> > >[   73.176646] R10: 0001 R11: ed107c655acf R12:
> > >8883767b
> > >[   73.178527] Process accounting resumed
> > >[   73.179021] R13: 88837f111900 R14: 88837f111300 R15:
> > >8883767b0ac0
> > >[   73.179029] FS:  7f2d207f9700() GS:8883e328()
> > >knlGS:
> > >[   73.179034] CS:  0010 DS:  ES:  CR0: 80050033
> > >[   73.179039] CR2: 7f1500bd36c0 CR3: 0003df304003 CR4:
> > >000206e0
> > >[   73.179047] Call Trace:
> > >[   73.190461]  selinux_setprocattr+0x2ea/0x8f0
> > >[   73.191925]  ? ptrace_parent_sid+0x530/0x530
> > >[   73.193436]  ? proc_pid_attr_write+0x185/0x5a0
> > >[   73.194967]  security_setprocattr+0xa1/0x100
> > >[   73.196408]  proc_pid_attr_write+0x307/0x5a0
> > >[   73.197869]  ? mem_read+0x40/0x40
> > >[   73.199013]  __vfs_write+0x81/0x100
> > >[   73.200222]  __kernel_write+0xf8/0x330
> > >[   73.201562]  do_acct_process+0xca5/0x1340
> > >[   73.202969]  ? __ia32_sys_acct+0x1e0/0x1e0
> > >[   73.204498]  ? find_held_lock+0x2f/0x1e0
> > >[   73.205857]  ? rcu_irq_exit+0xec/0x2c0
> > >[   73.207160]  ? lock_downgrade+0x630/0x630
> > >[   73.208541]  acct_pin_kill+0x63/0x150
> > >[   73.209816]  pin_kill+0x16d/0x7c0
> > >[   73.210934]  ? lockdep_hardirqs_on+0x5e0/0x5e0
> > >[   73.212452]  ? xas_start+0x155/0x510
> > >[   73.213705]  ? pin_insert+0x50/0x50
> > >[   73.214903]  ? finish_wait+0x270/0x270
> > >[   73.216213]  ? cpumask_next+0x57/0x90
> > >[   73.217442]  ? mnt_pin_kill+0x68/0x1d0
> > >[   73.218851]  mnt_pin_kill+0x68/0x1d0
> > >[   73.220398]  cleanup_mnt+0x11b/0x150
> > >[   73.221970]  task_work_run+0x136/0x1b0
> > >[   73.223427]  do_exit+0x830/0x2ca0
> > >[   73.224586]  ? trace_hardirqs_off+0x3b/0x180
> > >[   73.226088]  ? mm_update_next_owner+0x6a0/0x6a0
> > >[   73.227622]  ? find_held_lock+0x2f/0x1e0
> > >[   73.228954]  ? get_signal+0x2cf/0x1c00
> > >[   73.230236]  ? lock_downgrade+0x630/0x630
> > >[   73.231628]  ? rwlock_bug.part.0+0x90/0x90
> > >[   73.233020]  do_group_exit+0x106/0x2f0
> > >[   73.234330]  get_signal+0x325/0x1c00
> > >[   73.235571]  do_signal+0x97/0x1670
> > >[   73.236739]  ? do_send_specific+0x12d/0x220
> > >[   73.238213]  ? lock_downgrade+0x630/0x630
> > >[   73.239566]  ? setup_sigcontext

Re: kernel BUG at kernel/cred.c:434!

2019-04-15 Thread Oleg Nesterov
Well, acct("/proc/self/attr/current") doesn't look like a good idea, but I do
not know where should we put the additional check... And probably
"echo /proc/self/attr/current > /proc/sys/kernel/core_pattern" can hit the
same problem, do_coredump() does override_creds() too.

May be just add

if (current->cred != current->real_cred)
return -EACCES;

into proc_pid_attr_write(), I dunno.



On 04/12, Casey Schaufler wrote:
>
> On 4/11/2019 11:21 PM, chengjian (D) wrote:
> 
> Added LSM and SELinux lists.
> 
> 
> >Hi.
> >
> >
> >syzkaller reported the following BUG:
> >
> >[   73.146973] kernel BUG at kernel/cred.c:434!
> >[   73.150231] invalid opcode:  [#1] SMP KASAN PTI
> >[   73.151928] CPU: 2 PID: 4058 Comm: syz-executor.6 Not tainted
> >5.1.0-rc4-00062-g2d06b235815e-dirty #2
> >[   73.155174] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> >rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
> >[   73.159798] RIP: 0010:commit_creds+0xadb/0xe50
> >[   73.161426] Code: 8b 5b 20 48 c1 ea 03 0f b6 04 02 84 c0 74 08 3c 03 0f
> >8e 06 03 00 00 39 5d 20 0f 85 ff fa ff ff e9 0c fb ff ff e8 05 a2 25 00
> ><0f> 0b 48 c7 c7 80 56 c0 83 e8 95 22 b1 00 e8 f2 a1 25 00 0f 0b 48
> >[   73.167852] RSP: :88836e65f5d0 EFLAGS: 00010293
> >[   73.169636] RAX: 8883767b RBX: 88837f111300 RCX:
> >8124b5db
> >[   73.171962] RDX:  RSI: 83c9b140 RDI:
> >88837f111300
> >[   73.174310] RBP: 888376610400 R08:  R09:
> >0004
> >[   73.176646] R10: 0001 R11: ed107c655acf R12:
> >8883767b
> >[   73.178527] Process accounting resumed
> >[   73.179021] R13: 88837f111900 R14: 88837f111300 R15:
> >8883767b0ac0
> >[   73.179029] FS:  7f2d207f9700() GS:8883e328()
> >knlGS:
> >[   73.179034] CS:  0010 DS:  ES:  CR0: 80050033
> >[   73.179039] CR2: 7f1500bd36c0 CR3: 0003df304003 CR4:
> >000206e0
> >[   73.179047] Call Trace:
> >[   73.190461]  selinux_setprocattr+0x2ea/0x8f0
> >[   73.191925]  ? ptrace_parent_sid+0x530/0x530
> >[   73.193436]  ? proc_pid_attr_write+0x185/0x5a0
> >[   73.194967]  security_setprocattr+0xa1/0x100
> >[   73.196408]  proc_pid_attr_write+0x307/0x5a0
> >[   73.197869]  ? mem_read+0x40/0x40
> >[   73.199013]  __vfs_write+0x81/0x100
> >[   73.200222]  __kernel_write+0xf8/0x330
> >[   73.201562]  do_acct_process+0xca5/0x1340
> >[   73.202969]  ? __ia32_sys_acct+0x1e0/0x1e0
> >[   73.204498]  ? find_held_lock+0x2f/0x1e0
> >[   73.205857]  ? rcu_irq_exit+0xec/0x2c0
> >[   73.207160]  ? lock_downgrade+0x630/0x630
> >[   73.208541]  acct_pin_kill+0x63/0x150
> >[   73.209816]  pin_kill+0x16d/0x7c0
> >[   73.210934]  ? lockdep_hardirqs_on+0x5e0/0x5e0
> >[   73.212452]  ? xas_start+0x155/0x510
> >[   73.213705]  ? pin_insert+0x50/0x50
> >[   73.214903]  ? finish_wait+0x270/0x270
> >[   73.216213]  ? cpumask_next+0x57/0x90
> >[   73.217442]  ? mnt_pin_kill+0x68/0x1d0
> >[   73.218851]  mnt_pin_kill+0x68/0x1d0
> >[   73.220398]  cleanup_mnt+0x11b/0x150
> >[   73.221970]  task_work_run+0x136/0x1b0
> >[   73.223427]  do_exit+0x830/0x2ca0
> >[   73.224586]  ? trace_hardirqs_off+0x3b/0x180
> >[   73.226088]  ? mm_update_next_owner+0x6a0/0x6a0
> >[   73.227622]  ? find_held_lock+0x2f/0x1e0
> >[   73.228954]  ? get_signal+0x2cf/0x1c00
> >[   73.230236]  ? lock_downgrade+0x630/0x630
> >[   73.231628]  ? rwlock_bug.part.0+0x90/0x90
> >[   73.233020]  do_group_exit+0x106/0x2f0
> >[   73.234330]  get_signal+0x325/0x1c00
> >[   73.235571]  do_signal+0x97/0x1670
> >[   73.236739]  ? do_send_specific+0x12d/0x220
> >[   73.238213]  ? lock_downgrade+0x630/0x630
> >[   73.239566]  ? setup_sigcontext+0x820/0x820
> >[   73.240982]  ? check_kill_permission+0x4a/0x510
> >[   73.242509]  ? do_send_specific+0x156/0x220
> >[   73.243905]  ? do_tkill+0x1c4/0x260
> >[   73.245081]  ? do_send_specific+0x220/0x220
> >[   73.246514]  ? trace_hardirqs_on_thunk+0x1a/0x1c
> >[   73.248061]  ? exit_to_usermode_loop+0x97/0x1d0
> >[   73.249619]  exit_to_usermode_loop+0x108/0x1d0
> >[   73.251129]  do_syscall_64+0x461/0x580
> >[   73.252454]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> >[   73.254219] RIP: 0033:0x462eb9
> >[   73.255327] Code: Bad RIP value.
> >[   73.256539] RSP: 002b:7f2d207f8c58 EFLAGS: 0246 ORIG_RAX:
> >00c8
> >[   73.259454] RAX: 

Re: kernel BUG at kernel/cred.c:434!

2019-04-12 Thread Casey Schaufler

On 4/11/2019 11:21 PM, chengjian (D) wrote:

Added LSM and SELinux lists.



Hi.


syzkaller reported the following BUG:

[   73.146973] kernel BUG at kernel/cred.c:434!
[   73.150231] invalid opcode:  [#1] SMP KASAN PTI
[   73.151928] CPU: 2 PID: 4058 Comm: syz-executor.6 Not tainted 
5.1.0-rc4-00062-g2d06b235815e-dirty #2
[   73.155174] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014

[   73.159798] RIP: 0010:commit_creds+0xadb/0xe50
[   73.161426] Code: 8b 5b 20 48 c1 ea 03 0f b6 04 02 84 c0 74 08 3c 
03 0f 8e 06 03 00 00 39 5d 20 0f 85 ff fa ff ff e9 0c fb ff ff e8 05 
a2 25 00 <0f> 0b 48 c7 c7 80 56 c0 83 e8 95 22 b1 00 e8 f2 a1 25 00 0f 
0b 48

[   73.167852] RSP: :88836e65f5d0 EFLAGS: 00010293
[   73.169636] RAX: 8883767b RBX: 88837f111300 RCX: 
8124b5db
[   73.171962] RDX:  RSI: 83c9b140 RDI: 
88837f111300
[   73.174310] RBP: 888376610400 R08:  R09: 
0004
[   73.176646] R10: 0001 R11: ed107c655acf R12: 
8883767b

[   73.178527] Process accounting resumed
[   73.179021] R13: 88837f111900 R14: 88837f111300 R15: 
8883767b0ac0
[   73.179029] FS:  7f2d207f9700() GS:8883e328() 
knlGS:

[   73.179034] CS:  0010 DS:  ES:  CR0: 80050033
[   73.179039] CR2: 7f1500bd36c0 CR3: 0003df304003 CR4: 
000206e0

[   73.179047] Call Trace:
[   73.190461]  selinux_setprocattr+0x2ea/0x8f0
[   73.191925]  ? ptrace_parent_sid+0x530/0x530
[   73.193436]  ? proc_pid_attr_write+0x185/0x5a0
[   73.194967]  security_setprocattr+0xa1/0x100
[   73.196408]  proc_pid_attr_write+0x307/0x5a0
[   73.197869]  ? mem_read+0x40/0x40
[   73.199013]  __vfs_write+0x81/0x100
[   73.200222]  __kernel_write+0xf8/0x330
[   73.201562]  do_acct_process+0xca5/0x1340
[   73.202969]  ? __ia32_sys_acct+0x1e0/0x1e0
[   73.204498]  ? find_held_lock+0x2f/0x1e0
[   73.205857]  ? rcu_irq_exit+0xec/0x2c0
[   73.207160]  ? lock_downgrade+0x630/0x630
[   73.208541]  acct_pin_kill+0x63/0x150
[   73.209816]  pin_kill+0x16d/0x7c0
[   73.210934]  ? lockdep_hardirqs_on+0x5e0/0x5e0
[   73.212452]  ? xas_start+0x155/0x510
[   73.213705]  ? pin_insert+0x50/0x50
[   73.214903]  ? finish_wait+0x270/0x270
[   73.216213]  ? cpumask_next+0x57/0x90
[   73.217442]  ? mnt_pin_kill+0x68/0x1d0
[   73.218851]  mnt_pin_kill+0x68/0x1d0
[   73.220398]  cleanup_mnt+0x11b/0x150
[   73.221970]  task_work_run+0x136/0x1b0
[   73.223427]  do_exit+0x830/0x2ca0
[   73.224586]  ? trace_hardirqs_off+0x3b/0x180
[   73.226088]  ? mm_update_next_owner+0x6a0/0x6a0
[   73.227622]  ? find_held_lock+0x2f/0x1e0
[   73.228954]  ? get_signal+0x2cf/0x1c00
[   73.230236]  ? lock_downgrade+0x630/0x630
[   73.231628]  ? rwlock_bug.part.0+0x90/0x90
[   73.233020]  do_group_exit+0x106/0x2f0
[   73.234330]  get_signal+0x325/0x1c00
[   73.235571]  do_signal+0x97/0x1670
[   73.236739]  ? do_send_specific+0x12d/0x220
[   73.238213]  ? lock_downgrade+0x630/0x630
[   73.239566]  ? setup_sigcontext+0x820/0x820
[   73.240982]  ? check_kill_permission+0x4a/0x510
[   73.242509]  ? do_send_specific+0x156/0x220
[   73.243905]  ? do_tkill+0x1c4/0x260
[   73.245081]  ? do_send_specific+0x220/0x220
[   73.246514]  ? trace_hardirqs_on_thunk+0x1a/0x1c
[   73.248061]  ? exit_to_usermode_loop+0x97/0x1d0
[   73.249619]  exit_to_usermode_loop+0x108/0x1d0
[   73.251129]  do_syscall_64+0x461/0x580
[   73.252454]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[   73.254219] RIP: 0033:0x462eb9
[   73.255327] Code: Bad RIP value.
[   73.256539] RSP: 002b:7f2d207f8c58 EFLAGS: 0246 ORIG_RAX: 
00c8
[   73.259454] RAX:  RBX: 0073bf00 RCX: 
00462eb9
[   73.262309] RDX:  RSI: 001e RDI: 
0005
[   73.265064] RBP: 0002 R08:  R09: 

[   73.267774] R10:  R11: 0246 R12: 
7f2d207f96bc
[   73.270546] R13: 004c5509 R14: 007042f0 R15: 


[   73.273542] Modules linked in:
[   73.274670] Dumping ftrace buffer:
[   73.275852]    (ftrace buffer empty)
[   73.277187] ---[ end trace dde36a95f458175d ]---
[   73.278834] RIP: 0010:commit_creds+0xadb/0xe50
[   73.280549] Code: 8b 5b 20 48 c1 ea 03 0f b6 04 02 84 c0 74 08 3c 
03 0f 8e 06 03 00 00 39 5d 20 0f 85 ff fa ff ff e9 0c fb ff ff e8 05 
a2 25 00 <0f> 0b 48 c7 c7 80 56 c0 83 e8 95 22 b1 00 e8 f2 a1 25 00 0f 
0b 48

[   73.287090] RSP: :88836e65f5d0 EFLAGS: 00010293
[   73.288917] RAX: 8883767b RBX: 88837f111300 RCX: 
8124b5db
[   73.291390] RDX:  RSI: 83c9b140 RDI: 
88837f111300
[   73.293864] RBP: 888376610400 R08:  R09: 
0004
[   73.296370] R10: 0001 R11: ed107c655acf R12: 
8883767b
[   73.299275] R13: 8883

kernel BUG at kernel/cred.c:434!

2019-04-12 Thread chengjian (D)

Hi.


syzkaller reported the following BUG:

[   73.146973] kernel BUG at kernel/cred.c:434!
[   73.150231] invalid opcode:  [#1] SMP KASAN PTI
[   73.151928] CPU: 2 PID: 4058 Comm: syz-executor.6 Not tainted 
5.1.0-rc4-00062-g2d06b235815e-dirty #2
[   73.155174] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014

[   73.159798] RIP: 0010:commit_creds+0xadb/0xe50
[   73.161426] Code: 8b 5b 20 48 c1 ea 03 0f b6 04 02 84 c0 74 08 3c 03 
0f 8e 06 03 00 00 39 5d 20 0f 85 ff fa ff ff e9 0c fb ff ff e8 05 a2 25 
00 <0f> 0b 48 c7 c7 80 56 c0 83 e8 95 22 b1 00 e8 f2 a1 25 00 0f 0b 48

[   73.167852] RSP: :88836e65f5d0 EFLAGS: 00010293
[   73.169636] RAX: 8883767b RBX: 88837f111300 RCX: 
8124b5db
[   73.171962] RDX:  RSI: 83c9b140 RDI: 
88837f111300
[   73.174310] RBP: 888376610400 R08:  R09: 
0004
[   73.176646] R10: 0001 R11: ed107c655acf R12: 
8883767b

[   73.178527] Process accounting resumed
[   73.179021] R13: 88837f111900 R14: 88837f111300 R15: 
8883767b0ac0
[   73.179029] FS:  7f2d207f9700() GS:8883e328() 
knlGS:

[   73.179034] CS:  0010 DS:  ES:  CR0: 80050033
[   73.179039] CR2: 7f1500bd36c0 CR3: 0003df304003 CR4: 
000206e0

[   73.179047] Call Trace:
[   73.190461]  selinux_setprocattr+0x2ea/0x8f0
[   73.191925]  ? ptrace_parent_sid+0x530/0x530
[   73.193436]  ? proc_pid_attr_write+0x185/0x5a0
[   73.194967]  security_setprocattr+0xa1/0x100
[   73.196408]  proc_pid_attr_write+0x307/0x5a0
[   73.197869]  ? mem_read+0x40/0x40
[   73.199013]  __vfs_write+0x81/0x100
[   73.200222]  __kernel_write+0xf8/0x330
[   73.201562]  do_acct_process+0xca5/0x1340
[   73.202969]  ? __ia32_sys_acct+0x1e0/0x1e0
[   73.204498]  ? find_held_lock+0x2f/0x1e0
[   73.205857]  ? rcu_irq_exit+0xec/0x2c0
[   73.207160]  ? lock_downgrade+0x630/0x630
[   73.208541]  acct_pin_kill+0x63/0x150
[   73.209816]  pin_kill+0x16d/0x7c0
[   73.210934]  ? lockdep_hardirqs_on+0x5e0/0x5e0
[   73.212452]  ? xas_start+0x155/0x510
[   73.213705]  ? pin_insert+0x50/0x50
[   73.214903]  ? finish_wait+0x270/0x270
[   73.216213]  ? cpumask_next+0x57/0x90
[   73.217442]  ? mnt_pin_kill+0x68/0x1d0
[   73.218851]  mnt_pin_kill+0x68/0x1d0
[   73.220398]  cleanup_mnt+0x11b/0x150
[   73.221970]  task_work_run+0x136/0x1b0
[   73.223427]  do_exit+0x830/0x2ca0
[   73.224586]  ? trace_hardirqs_off+0x3b/0x180
[   73.226088]  ? mm_update_next_owner+0x6a0/0x6a0
[   73.227622]  ? find_held_lock+0x2f/0x1e0
[   73.228954]  ? get_signal+0x2cf/0x1c00
[   73.230236]  ? lock_downgrade+0x630/0x630
[   73.231628]  ? rwlock_bug.part.0+0x90/0x90
[   73.233020]  do_group_exit+0x106/0x2f0
[   73.234330]  get_signal+0x325/0x1c00
[   73.235571]  do_signal+0x97/0x1670
[   73.236739]  ? do_send_specific+0x12d/0x220
[   73.238213]  ? lock_downgrade+0x630/0x630
[   73.239566]  ? setup_sigcontext+0x820/0x820
[   73.240982]  ? check_kill_permission+0x4a/0x510
[   73.242509]  ? do_send_specific+0x156/0x220
[   73.243905]  ? do_tkill+0x1c4/0x260
[   73.245081]  ? do_send_specific+0x220/0x220
[   73.246514]  ? trace_hardirqs_on_thunk+0x1a/0x1c
[   73.248061]  ? exit_to_usermode_loop+0x97/0x1d0
[   73.249619]  exit_to_usermode_loop+0x108/0x1d0
[   73.251129]  do_syscall_64+0x461/0x580
[   73.252454]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[   73.254219] RIP: 0033:0x462eb9
[   73.255327] Code: Bad RIP value.
[   73.256539] RSP: 002b:7f2d207f8c58 EFLAGS: 0246 ORIG_RAX: 
00c8
[   73.259454] RAX:  RBX: 0073bf00 RCX: 
00462eb9
[   73.262309] RDX:  RSI: 001e RDI: 
0005
[   73.265064] RBP: 0002 R08:  R09: 

[   73.267774] R10:  R11: 0246 R12: 
7f2d207f96bc
[   73.270546] R13: 004c5509 R14: 007042f0 R15: 


[   73.273542] Modules linked in:
[   73.274670] Dumping ftrace buffer:
[   73.275852]    (ftrace buffer empty)
[   73.277187] ---[ end trace dde36a95f458175d ]---
[   73.278834] RIP: 0010:commit_creds+0xadb/0xe50
[   73.280549] Code: 8b 5b 20 48 c1 ea 03 0f b6 04 02 84 c0 74 08 3c 03 
0f 8e 06 03 00 00 39 5d 20 0f 85 ff fa ff ff e9 0c fb ff ff e8 05 a2 25 
00 <0f> 0b 48 c7 c7 80 56 c0 83 e8 95 22 b1 00 e8 f2 a1 25 00 0f 0b 48

[   73.287090] RSP: :88836e65f5d0 EFLAGS: 00010293
[   73.288917] RAX: 8883767b RBX: 88837f111300 RCX: 
8124b5db
[   73.291390] RDX:  RSI: 83c9b140 RDI: 
88837f111300
[   73.293864] RBP: 888376610400 R08:  R09: 
0004
[   73.296370] R10: 0001 R11: ed107c655acf R12: 
8883767b
[   73.299275] R13: 88837f111900 R14: 88837f111300 R15: 
8883767b0ac0

[   73.301351] Process accoun

[PATCH v2 3/3] riscv: Support BUG() in kernel module

2019-03-04 Thread Vincent Chen
The kernel module is loaded into vmalloc region which is located below
to the PAGE_OFFSET. Hence the condition, pc < PAGE_OFFSET, in the
is_valid_bugaddr() will filter out all trap exceptions triggered
by kernel module. To support BUG() in kernel module, the condition is
changed to pc < VMALLOC_START.

Signed-off-by: Vincent Chen 
---
 arch/riscv/kernel/traps.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/riscv/kernel/traps.c b/arch/riscv/kernel/traps.c
index 0cd0137..9551388 100644
--- a/arch/riscv/kernel/traps.c
+++ b/arch/riscv/kernel/traps.c
@@ -156,7 +156,7 @@ int is_valid_bugaddr(unsigned long pc)
 {
bug_insn_t insn;
 
-   if (pc < PAGE_OFFSET)
+   if (pc < VMALLOC_START)
return 0;
if (probe_kernel_address((bug_insn_t *)pc, insn))
return 0;
-- 
1.7.1



[PATCH 2/3] riscv: Support BUG() in kernel module

2019-02-28 Thread Vincent Chen
The kernel module is loaded into vmalloc region which is located below
to the PAGE_OFFSET. Hence the condition, pc < PAGE_OFFSET, in the
is_valid_bugaddr() will filter out all trap exceptions triggered
by kernel module. To support BUG() in kernel module, the condition is
changed to pc < VMALLOC_START.

Signed-off-by: Vincent Chen 
---
 arch/riscv/kernel/traps.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/riscv/kernel/traps.c b/arch/riscv/kernel/traps.c
index deae0e5..dee0e5e 100644
--- a/arch/riscv/kernel/traps.c
+++ b/arch/riscv/kernel/traps.c
@@ -144,7 +144,7 @@ int is_valid_bugaddr(unsigned long pc)
 {
bug_insn_t insn;
 
-   if (pc < PAGE_OFFSET)
+   if (pc < VMALLOC_START)
return 0;
if (probe_kernel_address((bug_insn_t *)pc, insn))
return 0;
-- 
1.7.1



Re: kernel BUG at kernel/cred.c:825!

2019-01-21 Thread Santosh Kumar Pradhan
Hi Laura,

On 17/01/19, 12:39 AM, "linux-nfs-ow...@vger.kernel.org on behalf of Laura 
Abbott"  wrote:

On 1/7/19 11:18 AM, Dave Jones wrote:
> [   53.980701] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state 
recovery directory
> [   53.981216] NFSD: starting 45-second grace period (net f098)
> [   54.006802] CRED: Invalid credentials
> [   54.006880] CRED: At ./include/linux/cred.h:253
> [   54.006899] CRED: Specified credentials: 5daa4529
> [   54.006916] CRED: ->magic=0, put_addr=  (null)
> [   54.006927] CRED: ->usage=1, subscr=0
> [   54.006935] CRED: ->*uid = { 0,0,0,0 }
> [   54.006944] CRED: ->*gid = { 0,0,0,0 }
> [   54.006954] ----[ cut here ]----
> [   54.006964] kernel BUG at kernel/cred.c:825!

 > 

Fedora still seems to be hitting this as of -rc2

The FIX is made into –rc3.

Best R,
Santosh




Re: [BUG bisect] kernel BUG at block/bio.c:1833 and fail to mount disk

2019-01-17 Thread Nathan Chancellor
On Thu, Jan 17, 2019 at 06:26:58PM +0800, Ming Lei wrote:
> On Wed, Jan 16, 2019 at 09:54:05AM +0100, Krzysztof Kozlowski wrote:
> > On Wed, 16 Jan 2019 at 09:52, Krzysztof Kozlowski  wrote:
> > >
> > > Hi,
> > >
> > > On today's next-20190116 I see a bug during boot:
> > > [ 6.843308] kernel BUG at ../block/bio.c:1833!
> > > [ 6.847723] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP ARM
> > > ...
> > > [ 7.543824] [] (bio_split) from [<>] ( (null))
> > > [ 7.549881] Code: 13833b01 11c630bc e1a6 e8bd8070 (e7f001f2)
> > >
> > > (not much in the calltrace)
> > > On all my boards. Also QEMU-arm fails.
> > >
> > 
> > I forgot the bisect commit:
> > 
> > 258cfdfaf7bd729e759a0a91fd00ac9794796ad3 is the first bad commit
> > commit 258cfdfaf7bd729e759a0a91fd00ac9794796ad3
> > Author: Ming Lei 
> > Date:   Fri Jan 11 19:01:15 2019 +0800
> > 
> > block: use bio_for_each_bvec() to compute multi-page bvec count
> > 
> > :04 04 d79b2e71d308650df4764ff644f29d3a24dbab96
> > 0a5d624843b805ee0c9fd9a7e2d5163f5b15b167 M block
> 
> It should be one 32-bit arch specific issue, I guess.
> 
> The following patch should fix this issue:
> 
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index dc4877eaf9f9..4dd7183de849 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -166,6 +166,9 @@ static unsigned get_max_segment_size(struct request_queue 
> *q,
>  {
>   unsigned long mask = queue_segment_boundary(q);
>  
> + if (mask == BLK_SEG_BOUNDARY_MASK)
> + return queue_max_segment_size(q);
> +
>   return min_t(unsigned long, mask - (mask & offset) + 1,
>queue_max_segment_size(q));
>  }
> 
> Thanks,
> Ming

Hi Ming,

We saw this issue on our CI as well: 
https://travis-ci.com/ClangBuiltLinux/continuous-integration/jobs/170724607

I added your patch and tested locally, it boots fine now.

Tested-by: Nathan Chancellor 

Thanks,
Nathan


Re: [BUG bisect] kernel BUG at block/bio.c:1833 and fail to mount disk

2019-01-17 Thread Ming Lei
On Wed, Jan 16, 2019 at 09:54:05AM +0100, Krzysztof Kozlowski wrote:
> On Wed, 16 Jan 2019 at 09:52, Krzysztof Kozlowski  wrote:
> >
> > Hi,
> >
> > On today's next-20190116 I see a bug during boot:
> > [ 6.843308] kernel BUG at ../block/bio.c:1833!
> > [ 6.847723] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP ARM
> > ...
> > [ 7.543824] [] (bio_split) from [<>] ( (null))
> > [ 7.549881] Code: 13833b01 11c630bc e1a6 e8bd8070 (e7f001f2)
> >
> > (not much in the calltrace)
> > On all my boards. Also QEMU-arm fails.
> >
> 
> I forgot the bisect commit:
> 
> 258cfdfaf7bd729e759a0a91fd00ac9794796ad3 is the first bad commit
> commit 258cfdfaf7bd729e759a0a91fd00ac9794796ad3
> Author: Ming Lei 
> Date:   Fri Jan 11 19:01:15 2019 +0800
> 
> block: use bio_for_each_bvec() to compute multi-page bvec count
> 
> :04 04 d79b2e71d308650df4764ff644f29d3a24dbab96
> 0a5d624843b805ee0c9fd9a7e2d5163f5b15b167 M block

It should be one 32-bit arch specific issue, I guess.

The following patch should fix this issue:

diff --git a/block/blk-merge.c b/block/blk-merge.c
index dc4877eaf9f9..4dd7183de849 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -166,6 +166,9 @@ static unsigned get_max_segment_size(struct request_queue 
*q,
 {
unsigned long mask = queue_segment_boundary(q);
 
+   if (mask == BLK_SEG_BOUNDARY_MASK)
+   return queue_max_segment_size(q);
+
return min_t(unsigned long, mask - (mask & offset) + 1,
 queue_max_segment_size(q));
 }

Thanks,
Ming


Re: kernel BUG at kernel/cred.c:825!

2019-01-16 Thread Laura Abbott

On 1/7/19 11:18 AM, Dave Jones wrote:

[   53.980701] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery 
directory
[   53.981216] NFSD: starting 45-second grace period (net f098)
[   54.006802] CRED: Invalid credentials
[   54.006880] CRED: At ./include/linux/cred.h:253
[   54.006899] CRED: Specified credentials: 5daa4529
[   54.006916] CRED: ->magic=0, put_addr=  (null)
[   54.006927] CRED: ->usage=1, subscr=0
[   54.006935] CRED: ->*uid = { 0,0,0,0 }
[   54.006944] CRED: ->*gid = { 0,0,0,0 }
[   54.006954] [ cut here ]
[   54.006964] kernel BUG at kernel/cred.c:825!
[   54.006977] invalid opcode:  [#1] SMP RIP: __invalid_creds+0x48/0x50
[   54.006987] CPU: 2 PID: 814 Comm: mount.nfs Tainted: GW 
5.0.0-rc1-backup+ #1
[   54.006997] Hardware name: ASUS All Series/Z97-DELUXE, BIOS 2602 08/18/2015
[   54.007171] RIP: 0010:__invalid_creds+0x48/0x50
[   54.007184] Code: 44 89 e2 48 89 ee 48 c7 c7 37 3e 53 ba e8 f7 8f 03 00 48 c7 c6 
49 3e 53 ba 48 89 df 65 48 8b 14 25 80 4e 01 00 e8 48 fd ff ff <0f> 0b 66 0f 1f 
44 00 00 0f 1f 44 00 00 55 48 89 e5 41 56 49 89 fe
[   54.007207] RSP: 0018:c9e33a30 EFLAGS: 00010286
[   54.007219] RAX: 001a RBX: ba960300 RCX: 0006
[   54.007234] RDX:  RSI: 8884276f8818 RDI: 88842f895710
[   54.007246] RBP: ba5274c3 R08: 0001 R09: 
[   54.007254] R10: c9e33a50 R11:  R12: 00fd
[   54.007261] R13: 88842c1a6a08 R14: ba960300 R15: c9e33d60
[   54.007269] FS:  7f73770cb140() GS:88842f88() 
knlGS:
[   54.007277] CS:  0010 DS:  ES:  CR0: 80050033
[   54.007283] CR2: 5557d17d1000 CR3: 0004122ba006 CR4: 001606e0
[   54.007359] Call Trace:
[   54.007366]  nfs4_discover_server_trunking+0x286/0x310
[   54.007376]  nfs4_init_client+0xe8/0x260
[   54.007389]  ? nfs_get_client+0x519/0x610
[   54.007401]  ? _raw_spin_unlock+0x24/0x30
[   54.007412]  ? nfs_get_client+0x519/0x610
[   54.007424]  nfs4_set_client+0xb8/0x100
[   54.007439]  nfs4_create_server+0xfe/0x270
[   54.007451]  ? pcpu_alloc+0x611/0x8a0
[   54.007462]  nfs4_remote_mount+0x28/0x50
[   54.007474]  mount_fs+0xf/0x80
[   54.007487]  vfs_kern_mount+0x62/0x160
[   54.007498]  nfs_do_root_mount+0x7f/0xc0
[   54.007510]  nfs4_try_mount+0x3f/0xc0
[   54.007521]  ? get_nfs_version+0x11/0x50
[   54.007536]  nfs_fs_mount+0x61b/0xbd0
[   54.007548]  ? rcu_read_lock_sched_held+0x66/0x70
[   54.007560]  ? nfs_clone_super+0x70/0x70
[   54.007571]  ? nfs_destroy_inode+0x20/0x20
[   54.007585]  ? mount_fs+0xf/0x80
[   54.007595]  mount_fs+0xf/0x80
[   54.007606]  vfs_kern_mount+0x62/0x160
[   54.007618]  do_mount+0x1d1/0xd40
[   54.007631]  ? copy_mount_options+0xd2/0x170
[   54.007643]  ksys_mount+0x7e/0xd0
[   54.007654]  __x64_sys_mount+0x21/0x30
[   54.007665]  do_syscall_64+0x6d/0x660
[   54.007677]  ? trace_hardirqs_off_thunk+0x1a/0x1c
[   54.007690]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[   54.007702] RIP: 0033:0x7f7377e97a1a
[   54.007713] Code: 48 8b 0d 71 e4 0b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 
84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff 
ff 73 01 c3 48 8b 0d 3e e4 0b 00 f7 d8 64 89 01 48
[   54.007736] RSP: 002b:7ffc73d9b4a8 EFLAGS: 0202 ORIG_RAX: 
00a5
[   54.007751] RAX: ffda RBX:  RCX: 7f7377e97a1a
[   54.007764] RDX: 5632beb51b50 RSI: 5632beb51b70 RDI: 5632beb53880
[   54.007780] RBP: 7ffc73d9b600 R08: 5632beb556b0 R09: 33643a303036343a
[   54.007794] R10: 0c00 R11: 0202 R12: 7ffc73d9b600
[   54.007807] R13: 5632beb548a0 R14: 001c R15: 7ffc73d9b510



Fedora still seems to be hitting this as of -rc2


Re: kernel BUG at kernel/sched/core.c:3490!

2019-01-16 Thread Oleg Nesterov
On 01/12, Kohli, Gaurav wrote:
>
> HI Peter, Oleg,
>
> as per flag and state this seems to be possible only from below code:

Not sure I understand you,

> XXX: 0 1 0x40844c
> PF_NOFREEZE
> PF_RANDOMIZE
> PF_SIGNALED
> PF_FORKNOEXEC
> PF_EXITING
> PF_EXITPIDONE
>
> above state shows do_exit runs properely and  if somehow after parked stated
> , TASK_WAKEKILL got set and signal_pending_state returns 1 in below case:
>
>  switch_count = >nivcsw;
> if (!preempt && prev->state) {
> if (unlikely(signal_pending_state(prev->state, prev))) {
> prev->state = TASK_RUNNING;
> } else {
> deactivate_task(rq, prev, DEQUEUE_SLEEP |
> DEQUEUE_NOCLOCK);

or task->state was TASK_RUNNING when __schedule() was called, or the deactivated
dead task was woken up later...

The only problem is that every case looks "obviously impossible" ;) I have no
idea whats going on, I can only suggest more stupid debugging patches which 
might
narrow the problem.

Oleg.



Re: [BUG bisect] kernel BUG at block/bio.c:1833 and fail to mount disk

2019-01-16 Thread Krzysztof Kozlowski
On Wed, 16 Jan 2019 at 09:52, Krzysztof Kozlowski  wrote:
>
> Hi,
>
> On today's next-20190116 I see a bug during boot:
> [ 6.843308] kernel BUG at ../block/bio.c:1833!
> [ 6.847723] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP ARM
> ...
> [ 7.543824] [] (bio_split) from [<>] ( (null))
> [ 7.549881] Code: 13833b01 11c630bc e1a6 e8bd8070 (e7f001f2)
>
> (not much in the calltrace)
> On all my boards. Also QEMU-arm fails.
>

I forgot the bisect commit:

258cfdfaf7bd729e759a0a91fd00ac9794796ad3 is the first bad commit
commit 258cfdfaf7bd729e759a0a91fd00ac9794796ad3
Author: Ming Lei 
Date:   Fri Jan 11 19:01:15 2019 +0800

block: use bio_for_each_bvec() to compute multi-page bvec count

:04 04 d79b2e71d308650df4764ff644f29d3a24dbab96
0a5d624843b805ee0c9fd9a7e2d5163f5b15b167 M block

BR,
Krzysztof


[BUG bisect] kernel BUG at block/bio.c:1833 and fail to mount disk

2019-01-16 Thread Krzysztof Kozlowski
Hi,

On today's next-20190116 I see a bug during boot:
[ 6.843308] kernel BUG at ../block/bio.c:1833!
[ 6.847723] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP ARM
...
[ 7.543824] [] (bio_split) from [<>] ( (null))
[ 7.549881] Code: 13833b01 11c630bc e1a6 e8bd8070 (e7f001f2)

(not much in the calltrace)
On all my boards. Also QEMU-arm fails.

Ultimately the target fails to mount /home from SD card (although root
is from NFSv4):
Timed out waiting for device dev-disk-by\x2dlabel-home.device.

Full console logs:
https://krzk.eu/#/builders/25/builds/1079/steps/12/logs/serial0
https://krzk.eu/#/builders/1/builds/3031/steps/12/logs/serial0

1. Arch ARM Linux
2. exynos_defconfig
  - All my Odroid boards (ARMv7, Exynos)
  - Odroid HC1, U3, XU3, XU
  - Systemd
3. All boards boot from TFTP with NFS root (NFSv4)

The QEMU arm fails with more information:

[ 0.596078] loop: module loaded
[ 0.601633] virtio_blk virtio0: [vda] 10240 512-byte logical blocks
(5.24 MB/5.00 MiB)
[ 0.603320] Internal error: Oops - undefined instruction: 0 [#1] ARM
[ 0.603504] Modules linked in:
[ 0.603784] CPU: 0 PID: 1 Comm: swapper Tainted: G W
5.0.0-rc2-next-20190116-gcbeb3db #1
[ 0.603905] Hardware name: ARM-Versatile Express
[ 0.604329] PC is at bio_split+0x5c/0x64
[ 0.604461] LR is at blk_queue_split+0x47c/0x5f4
[ 0.604553] pc : [<8036e2c4>] lr : [<80376134>] psr: 6013
[ 0.604652] sp : 8681f878 ip :  fp : 8699df00
[ 0.604750] r10: 8681f914 r9 :  r8 : 007e
[ 0.604838] r7 :  r6 : 8547ac28 r5 :  r4 : 
[ 0.604937] r3 : 8547b068 r2 : 0060 r1 :  r0 : 8699df00
[ 0.605070] Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none
[ 0.605185] Control: 10c5387d Table: 60004059 DAC: 0051
[ 0.605338] Process swapper (pid: 1, stack limit = 0x(ptrval))
...
[ 0.619332] [<8036e2c4>] (bio_split) from [<80376134>]
(blk_queue_split+0x47c/0x5f4)
[ 0.619480] [<80376134>] (blk_queue_split) from [<8037bee8>]
(blk_mq_make_request+0x80/0x3bc)
[ 0.619608] [<8037bee8>] (blk_mq_make_request) from [<80371840>]
(generic_make_request+0x1b0/0x364)
[ 0.619733] [<80371840>] (generic_make_request) from [<80371a8c>]
(submit_bio+0x98/0x194)
[ 0.619850] [<80371a8c>] (submit_bio) from [<8020e1b8>]
(submit_bh_wbc.constprop.21+0x154/0x17c)
[ 0.619980] [<8020e1b8>] (submit_bh_wbc.constprop.21) from
[<80210634>] (block_read_full_page+0x3b0/0x3d0)
[ 0.620109] [<80210634>] (block_read_full_page) from [<8019139c>]
(do_read_cache_page+0x180/0x33c)
[ 0.620237] [<8019139c>] (do_read_cache_page) from [<8019156c>]
(read_cache_page+0x14/0x1c)
[ 0.620388] [<8019156c>] (read_cache_page) from [<80383be4>]
(read_dev_sector+0x2c/0xbc)
[ 0.620508] [<80383be4>] (read_dev_sector) from [<80386e7c>]
(read_lba+0x134/0x1b4)
[ 0.620623] [<80386e7c>] (read_lba) from [<80387234>] (efi_partition+0xbc/0x8e8)
[ 0.620741] [<80387234>] (efi_partition) from [<80386114>]
(check_partition+0x11c/0x1e8)
[ 0.620856] [<80386114>] (check_partition) from [<80384298>]
(rescan_partitions+0x94/0x45c)
[ 0.620971] [<80384298>] (rescan_partitions) from [<80213230>]
(__blkdev_get+0x260/0x404)
[ 0.621087] [<80213230>] (__blkdev_get) from [<802134d8>]
(blkdev_get+0x104/0x2f4)
[ 0.621206] [<802134d8>] (blkdev_get) from [<80382f9c>]
(__device_add_disk+0x498/0x4fc)
[ 0.621325] [<80382f9c>] (__device_add_disk) from [<804317f0>]
(virtblk_probe+0x444/0x744)
[ 0.621444] [<804317f0>] (virtblk_probe) from [<803d3df0>]
(virtio_dev_probe+0x258/0x330)
[ 0.621561] [<803d3df0>] (virtio_dev_probe) from [<80415864>]
(really_probe+0xf4/0x2c8)
[ 0.621675] [<80415864>] (really_probe) from [<80415b9c>]
(driver_probe_device+0x60/0x16c)
[ 0.621788] [<80415b9c>] (driver_probe_device) from [<80415d84>]
(__driver_attach+0xdc/0xe0)
[ 0.621911] [<80415d84>] (__driver_attach) from [<80413d3c>]
(bus_for_each_dev+0x68/0xb4)
[ 0.622024] [<80413d3c>] (bus_for_each_dev) from [<80414e04>]
(bus_add_driver+0x100/0x20c)
[ 0.622137] [<80414e04>] (bus_add_driver) from [<804166b0>]
(driver_register+0x78/0x10c)
[ 0.622253] [<804166b0>] (driver_register) from [<8091b5e4>] (init+0x70/0x9c)
[ 0.622403] [<8091b5e4>] (init) from [<801025ec>] (do_one_initcall+0x54/0x194)
[ 0.622525] [<801025ec>] (do_one_initcall) from [<80900e14>]
(kernel_init_freeable+0x12c/0x1c8)
[ 0.622658] [<80900e14>] (kernel_init_freeable) from [<80628c08>]
(kernel_init+0x8/0x110)
[ 0.622776] [<80628c08>] (kernel_init) from [<801010e8>]
(ret_from_fork+0x14/0x2c)
[ 0.622887] Exception stack(0x8681ffb0 to 0x8681fff8)
[ 0.623021] ffa0:    
[ 0.623242] ffc0:     
  00

Re: kernel BUG at kernel/sched/core.c:3490!

2019-01-12 Thread Kohli, Gaurav

HI Peter, Oleg,

as per flag and state this seems to be possible only from below code:
XXX: 0 1 0x40844c
PF_NOFREEZE
PF_RANDOMIZE
PF_SIGNALED
PF_FORKNOEXEC
PF_EXITING
PF_EXITPIDONE

above state shows do_exit runs properely and  if somehow after parked 
stated , TASK_WAKEKILL got set and signal_pending_state returns 1 in 
below case:


 switch_count = >nivcsw;
if (!preempt && prev->state) {
if (unlikely(signal_pending_state(prev->state, prev))) {
prev->state = TASK_RUNNING;
} else {
deactivate_task(rq, prev, DEQUEUE_SLEEP | 
DEQUEUE_NOCLOCK);


Regards
Gaurav


On 1/11/2019 9:47 PM, Qian Cai wrote:

On Fri, 2019-01-11 at 16:07 +0530, Kohli, Gaurav wrote:


On 1/7/2019 11:26 PM, Oleg Nesterov wrote:

pr_crit("XXX: %ld %d\n", current->state, current->on_rq);


Can we also add flags, this may help to know the path of problem:

   pr_crit("XXX: %ld %d 0x%x\n", current->state, current->on_rq,
current->flags);



XXX: 0 1 0x40844c



--
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, 
Inc. is a member of the Code Aurora Forum,

a Linux Foundation Collaborative Project.


Re: kernel BUG at kernel/sched/core.c:3490!

2019-01-11 Thread Qian Cai
On Fri, 2019-01-11 at 16:07 +0530, Kohli, Gaurav wrote:
> 
> On 1/7/2019 11:26 PM, Oleg Nesterov wrote:
> > pr_crit("XXX: %ld %d\n", current->state, current->on_rq);
> 
> Can we also add flags, this may help to know the path of problem:
> 
>   pr_crit("XXX: %ld %d 0x%x\n", current->state, current->on_rq,    
> current->flags);
> 

XXX: 0 1 0x40844c


Re: kernel BUG at kernel/sched/core.c:3490!

2019-01-11 Thread Kohli, Gaurav




On 1/7/2019 11:26 PM, Oleg Nesterov wrote:

pr_crit("XXX: %ld %d\n", current->state, current->on_rq);


Can we also add flags, this may help to know the path of problem:

 pr_crit("XXX: %ld %d 0x%x\n", current->state, current->on_rq, 	 
current->flags);


--
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, 
Inc. is a member of the Code Aurora Forum,

a Linux Foundation Collaborative Project.


kernel BUG at kernel/cred.c:825!

2019-01-07 Thread Dave Jones
[   53.980701] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery 
directory
[   53.981216] NFSD: starting 45-second grace period (net f098)
[   54.006802] CRED: Invalid credentials
[   54.006880] CRED: At ./include/linux/cred.h:253
[   54.006899] CRED: Specified credentials: 5daa4529 
[   54.006916] CRED: ->magic=0, put_addr=  (null)
[   54.006927] CRED: ->usage=1, subscr=0
[   54.006935] CRED: ->*uid = { 0,0,0,0 }
[   54.006944] CRED: ->*gid = { 0,0,0,0 }
[   54.006954] [ cut here ]
[   54.006964] kernel BUG at kernel/cred.c:825!
[   54.006977] invalid opcode:  [#1] SMP RIP: __invalid_creds+0x48/0x50
[   54.006987] CPU: 2 PID: 814 Comm: mount.nfs Tainted: GW 
5.0.0-rc1-backup+ #1
[   54.006997] Hardware name: ASUS All Series/Z97-DELUXE, BIOS 2602 08/18/2015
[   54.007171] RIP: 0010:__invalid_creds+0x48/0x50
[   54.007184] Code: 44 89 e2 48 89 ee 48 c7 c7 37 3e 53 ba e8 f7 8f 03 00 48 
c7 c6 49 3e 53 ba 48 89 df 65 48 8b 14 25 80 4e 01 00 e8 48 fd ff ff <0f> 0b 66 
0f 1f 44 00 00 0f 1f 44 00 00 55 48 89 e5 41 56 49 89 fe
[   54.007207] RSP: 0018:c9e33a30 EFLAGS: 00010286
[   54.007219] RAX: 001a RBX: ba960300 RCX: 0006
[   54.007234] RDX:  RSI: 8884276f8818 RDI: 88842f895710
[   54.007246] RBP: ba5274c3 R08: 0001 R09: 
[   54.007254] R10: c9e33a50 R11:  R12: 00fd
[   54.007261] R13: 88842c1a6a08 R14: ba960300 R15: c9e33d60
[   54.007269] FS:  7f73770cb140() GS:88842f88() 
knlGS:
[   54.007277] CS:  0010 DS:  ES:  CR0: 80050033
[   54.007283] CR2: 5557d17d1000 CR3: 0004122ba006 CR4: 001606e0
[   54.007359] Call Trace:
[   54.007366]  nfs4_discover_server_trunking+0x286/0x310
[   54.007376]  nfs4_init_client+0xe8/0x260
[   54.007389]  ? nfs_get_client+0x519/0x610
[   54.007401]  ? _raw_spin_unlock+0x24/0x30
[   54.007412]  ? nfs_get_client+0x519/0x610
[   54.007424]  nfs4_set_client+0xb8/0x100
[   54.007439]  nfs4_create_server+0xfe/0x270
[   54.007451]  ? pcpu_alloc+0x611/0x8a0
[   54.007462]  nfs4_remote_mount+0x28/0x50
[   54.007474]  mount_fs+0xf/0x80
[   54.007487]  vfs_kern_mount+0x62/0x160
[   54.007498]  nfs_do_root_mount+0x7f/0xc0
[   54.007510]  nfs4_try_mount+0x3f/0xc0
[   54.007521]  ? get_nfs_version+0x11/0x50
[   54.007536]  nfs_fs_mount+0x61b/0xbd0
[   54.007548]  ? rcu_read_lock_sched_held+0x66/0x70
[   54.007560]  ? nfs_clone_super+0x70/0x70
[   54.007571]  ? nfs_destroy_inode+0x20/0x20
[   54.007585]  ? mount_fs+0xf/0x80
[   54.007595]  mount_fs+0xf/0x80
[   54.007606]  vfs_kern_mount+0x62/0x160
[   54.007618]  do_mount+0x1d1/0xd40
[   54.007631]  ? copy_mount_options+0xd2/0x170
[   54.007643]  ksys_mount+0x7e/0xd0
[   54.007654]  __x64_sys_mount+0x21/0x30
[   54.007665]  do_syscall_64+0x6d/0x660
[   54.007677]  ? trace_hardirqs_off_thunk+0x1a/0x1c
[   54.007690]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[   54.007702] RIP: 0033:0x7f7377e97a1a
[   54.007713] Code: 48 8b 0d 71 e4 0b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 
0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 
f0 ff ff 73 01 c3 48 8b 0d 3e e4 0b 00 f7 d8 64 89 01 48
[   54.007736] RSP: 002b:7ffc73d9b4a8 EFLAGS: 0202 ORIG_RAX: 
00a5
[   54.007751] RAX: ffda RBX:  RCX: 7f7377e97a1a
[   54.007764] RDX: 5632beb51b50 RSI: 5632beb51b70 RDI: 5632beb53880
[   54.007780] RBP: 7ffc73d9b600 R08: 5632beb556b0 R09: 33643a303036343a
[   54.007794] R10: 0c00 R11: 0202 R12: 7ffc73d9b600
[   54.007807] R13: 5632beb548a0 R14: 001c R15: 7ffc73d9b510



Re: kernel BUG at kernel/sched/core.c:3490!

2019-01-07 Thread Oleg Nesterov
On 01/07, Qian Cai wrote:
>
>
> On 1/7/19 8:52 AM, Peter Zijlstra wrote:
> > On Tue, Jan 01, 2019 at 12:44:35AM -0500, Qian Cai wrote:
> >> Running some mmap() workloads to put the system on low memory situation 
> >> with
> >> swapping and OOM, and then it trigger this BUG(),
> >>
> >> void __noreturn do_task_dead(void)
> >> {
> >> /* Causes final put_task_struct in finish_task_switch(): */
> >> set_special_state(TASK_DEAD);
> >>
> >> /* Tell freezer to ignore us: */
> >> current->flags |= PF_NOFREEZE;
> >>
> >> __schedule(false);
> >> BUG();
> >>
> >> /* Avoid "noreturn function does return" - but don't continue if 
> >> BUG()
> >> is a NOP: */
> >> for (;;)
> >> cpu_relax();
> >> }
> >
> > This would mean that we somehow loose the TASK_DEAD state before hitting
> > schedule(), but that is something that should be avoided by
> > set_special_state(), which is supposed to serialize against concurrent
> > wake-ups.

or may be pick_next_task() somehow returns the deactivated TASK_DEAD task?

> > How readily does this reproduce?
>
> Running LTP oom01 [1] triggered it at least once in five attempts every time 
> so
> far on v4.20+. Have not tried much on v5.0-rc1 yet.

Can you add

pr_crit("XXX: %ld %d\n", current->state, current->on_rq);

before that BUG() and reproduce?

Oleg.



Re: kernel BUG at kernel/sched/core.c:3490!

2019-01-07 Thread Qian Cai



On 1/7/19 8:52 AM, Peter Zijlstra wrote:
> On Tue, Jan 01, 2019 at 12:44:35AM -0500, Qian Cai wrote:
>> Running some mmap() workloads to put the system on low memory situation with
>> swapping and OOM, and then it trigger this BUG(),
>>
>> void __noreturn do_task_dead(void)
>> {
>> /* Causes final put_task_struct in finish_task_switch(): */
>> set_special_state(TASK_DEAD);
>>
>> /* Tell freezer to ignore us: */
>> current->flags |= PF_NOFREEZE;
>>
>> __schedule(false);
>> BUG();
>>
>> /* Avoid "noreturn function does return" - but don't continue if 
>> BUG()
>> is a NOP: */
>> for (;;)
>> cpu_relax();
>> }
> 
> This would mean that we somehow loose the TASK_DEAD state before hitting
> schedule(), but that is something that should be avoided by
> set_special_state(), which is supposed to serialize against concurrent
> wake-ups.
> 
> Also see commit: b5bf9a90bbeb ("sched/core: Introduce set_special_state()")
> 
> How readily does this reproduce?

Running LTP oom01 [1] triggered it at least once in five attempts every time so
far on v4.20+. Have not tried much on v5.0-rc1 yet.

[1]
https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/oom/oom01.c


Re: kernel BUG at kernel/sched/core.c:3490!

2019-01-07 Thread Peter Zijlstra
On Tue, Jan 01, 2019 at 12:44:35AM -0500, Qian Cai wrote:
> Running some mmap() workloads to put the system on low memory situation with
> swapping and OOM, and then it trigger this BUG(),
> 
> void __noreturn do_task_dead(void)
> {
> /* Causes final put_task_struct in finish_task_switch(): */
> set_special_state(TASK_DEAD);
> 
> /* Tell freezer to ignore us: */
> current->flags |= PF_NOFREEZE;
> 
> __schedule(false);
> BUG();
> 
> /* Avoid "noreturn function does return" - but don't continue if BUG()
> is a NOP: */
> for (;;)
> cpu_relax();
> }

This would mean that we somehow loose the TASK_DEAD state before hitting
schedule(), but that is something that should be avoided by
set_special_state(), which is supposed to serialize against concurrent
wake-ups.

Also see commit: b5bf9a90bbeb ("sched/core: Introduce set_special_state()")

How readily does this reproduce?


  1   2   3   4   5   6   7   8   9   10   >