RE: v5.10: sched_cpu_dying() hits BUG_ON during hibernation: kernel BUG at kernel/sched/core.c:7596!
> From: Valentin Schneider > Sent: Tuesday, December 22, 2020 5:40 AM > To: Dexuan Cui > Cc: mi...@redhat.com; pet...@infradead.org; juri.le...@redhat.com; > vincent.guit...@linaro.org; dietmar.eggem...@arm.com; > rost...@goodmis.org; bseg...@google.com; mgor...@suse.de; > bris...@redhat.com; x...@kernel.org; linux...@vger.kernel.org; > linux-kernel@vger.kernel.org; linux-hyp...@vger.kernel.org; Michael Kelley > > Subject: Re: v5.10: sched_cpu_dying() hits BUG_ON during hibernation: kernel > BUG at kernel/sched/core.c:7596! > > > Hi, > > On 22/12/20 09:13, Dexuan Cui wrote: > > Hi, > > I'm running a Linux VM with the recent mainline (48342fc07272, 12/20/2020) > on Hyper-V. > > When I test hibernation, the VM can easily hit the below BUG_ON during the > resume > > procedure (I estimate this can repro about 1/5 of the time). BTW, my VM has > 40 vCPUs. > > > > I can't repro the BUG_ON with v5.9.0, so I suspect something in v5.10.0 may > be broken? > > > > In v5.10.0, when the BUG_ON happens, rq->nr_running==2, and > rq->nr_pinned==0: > > > > 7587 int sched_cpu_dying(unsigned int cpu) > > 7588 { > > 7589 struct rq *rq = cpu_rq(cpu); > > 7590 struct rq_flags rf; > > 7591 > > 7592 /* Handle pending wakeups and then migrate everything off > */ > > 7593 sched_tick_stop(cpu); > > 7594 > > 7595 rq_lock_irqsave(rq, ); > > 7596 BUG_ON(rq->nr_running != 1 || rq_has_pinned_tasks(rq)); > > 7597 rq_unlock_irqrestore(rq, ); > > 7598 > > 7599 calc_load_migrate(rq); > > 7600 update_max_interval(); > > 7601 nohz_balance_exit_idle(rq); > > 7602 hrtick_clear(rq); > > 7603 return 0; > > 7604 } > > > > The last commit that touches the BUG_ON line is the commit > > 3015ef4b98f5 ("sched/core: Make migrate disable and CPU hotplug > cooperative") > > but the commit looks good to me. > > > > Any idea? > > > > I'd wager this extra task is a kworker; could you give this series a try? > > > https ://lore.kernel.org/lkml/20201218170919.2950-1-jiangshan...@gmail.com/ Thanks, Valentin! It looks like the patchset can fix the BUG_ON, though I see a warning, which I reported here: https://lkml.org/lkml/2020/12/22/648 Thanks, -- Dexuan
Re: v5.10: sched_cpu_dying() hits BUG_ON during hibernation: kernel BUG at kernel/sched/core.c:7596!
Hi, On 22/12/20 09:13, Dexuan Cui wrote: > Hi, > I'm running a Linux VM with the recent mainline (48342fc07272, 12/20/2020) on > Hyper-V. > When I test hibernation, the VM can easily hit the below BUG_ON during the > resume > procedure (I estimate this can repro about 1/5 of the time). BTW, my VM has > 40 vCPUs. > > I can't repro the BUG_ON with v5.9.0, so I suspect something in v5.10.0 may > be broken? > > In v5.10.0, when the BUG_ON happens, rq->nr_running==2, and rq->nr_pinned==0: > > 7587 int sched_cpu_dying(unsigned int cpu) > 7588 { > 7589 struct rq *rq = cpu_rq(cpu); > 7590 struct rq_flags rf; > 7591 > 7592 /* Handle pending wakeups and then migrate everything off */ > 7593 sched_tick_stop(cpu); > 7594 > 7595 rq_lock_irqsave(rq, ); > 7596 BUG_ON(rq->nr_running != 1 || rq_has_pinned_tasks(rq)); > 7597 rq_unlock_irqrestore(rq, ); > 7598 > 7599 calc_load_migrate(rq); > 7600 update_max_interval(); > 7601 nohz_balance_exit_idle(rq); > 7602 hrtick_clear(rq); > 7603 return 0; > 7604 } > > The last commit that touches the BUG_ON line is the commit > 3015ef4b98f5 ("sched/core: Make migrate disable and CPU hotplug cooperative") > but the commit looks good to me. > > Any idea? > I'd wager this extra task is a kworker; could you give this series a try? https://lore.kernel.org/r/20201218170919.2950-1-jiangshan...@gmail.com > Thanks, > -- Dexuan
v5.10: sched_cpu_dying() hits BUG_ON during hibernation: kernel BUG at kernel/sched/core.c:7596!
Hi, I'm running a Linux VM with the recent mainline (48342fc07272, 12/20/2020) on Hyper-V. When I test hibernation, the VM can easily hit the below BUG_ON during the resume procedure (I estimate this can repro about 1/5 of the time). BTW, my VM has 40 vCPUs. I can't repro the BUG_ON with v5.9.0, so I suspect something in v5.10.0 may be broken? In v5.10.0, when the BUG_ON happens, rq->nr_running==2, and rq->nr_pinned==0: 7587 int sched_cpu_dying(unsigned int cpu) 7588 { 7589 struct rq *rq = cpu_rq(cpu); 7590 struct rq_flags rf; 7591 7592 /* Handle pending wakeups and then migrate everything off */ 7593 sched_tick_stop(cpu); 7594 7595 rq_lock_irqsave(rq, ); 7596 BUG_ON(rq->nr_running != 1 || rq_has_pinned_tasks(rq)); 7597 rq_unlock_irqrestore(rq, ); 7598 7599 calc_load_migrate(rq); 7600 update_max_interval(); 7601 nohz_balance_exit_idle(rq); 7602 hrtick_clear(rq); 7603 return 0; 7604 } The last commit that touches the BUG_ON line is the commit 3015ef4b98f5 ("sched/core: Make migrate disable and CPU hotplug cooperative") but the commit looks good to me. Any idea? Thanks, -- Dexuan [5.383378] PM: Image signature found, resuming [5.388027] PM: hibernation: resume from hibernation [5.395794] Freezing user space processes ... (elapsed 0.001 seconds) done. [5.397058] OOM killer disabled. [5.397059] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done. [5.413013] PM: hibernation: Marking nosave pages: [mem 0x-0x0fff] [5.419331] PM: hibernation: Marking nosave pages: [mem 0x0009f000-0x000f] [5.425502] PM: hibernation: Marking nosave pages: [mem 0xf7ff-0x] [5.431706] PM: hibernation: Basic memory bitmaps created [5.465205] PM: Using 3 thread(s) for decompression [5.469590] PM: Loading and decompressing image data (505553 pages)... [5.492790] PM: Image loading progress: 0% [6.532641] PM: Image loading progress: 10% [6.899683] PM: Image loading progress: 20% [7.251672] PM: Image loading progress: 30% [7.598808] PM: Image loading progress: 40% [8.056153] PM: Image loading progress: 50% [8.534077] PM: Image loading progress: 60% [9.029886] PM: Image loading progress: 70% [9.542015] PM: Image loading progress: 80% [ 10.025326] PM: Image loading progress: 90% [ 10.525804] PM: Image loading progress: 100% [ 10.530241] PM: Image loading done [ 10.533295] PM: hibernation: Read 2022212 kbytes in 5.05 seconds (400.43 MB/s) [ 10.540827] PM: Image successfully loaded [ 10.599243] serial 00:04: disabled [ 10.619935] vmbus 242ff919-07db-4180-9c2e-b86cb68c8c55: parent VMBUS:01 should not be sleeping [ 10.646542] Disabling non-boot CPUs ... [ 10.694380] smpboot: CPU 1 is now offline [ 10.729044] smpboot: CPU 2 is now offline [ 10.756819] smpboot: CPU 3 is now offline [ 10.776784] smpboot: CPU 4 is now offline [ 10.800866] smpboot: CPU 5 is now offline [ 10.828923] smpboot: CPU 6 is now offline [ 10.849013] smpboot: CPU 7 is now offline [ 10.876722] smpboot: CPU 8 is now offline [ 10.909426] smpboot: CPU 9 is now offline [ 10.929360] smpboot: CPU 10 is now offline [ 10.957059] smpboot: CPU 11 is now offline [ 10.976542] smpboot: CPU 12 is now offline [ 11.008470] smpboot: CPU 13 is now offline [ 11.036356] smpboot: CPU 14 is now offline [ 11.072396] smpboot: CPU 15 is now offline [ 11.100229] smpboot: CPU 16 is now offline [ 11.128638] smpboot: CPU 17 is now offline [ 11.148479] smpboot: CPU 18 is now offline [ 11.173537] smpboot: CPU 19 is now offline [ 11.205680] smpboot: CPU 20 is now offline [ 11.231799] smpboot: CPU 21 is now offline [ 11.259562] smpboot: CPU 22 is now offline [ 11.288672] smpboot: CPU 23 is now offline [ 11.319691] smpboot: CPU 24 is now offline [ 11.355523] smpboot: CPU 25 is now offline [ 11.399469] smpboot: CPU 26 is now offline [ 11.435438] smpboot: CPU 27 is now offline [ 11.471402] smpboot: CPU 28 is now offline [ 11.515488] smpboot: CPU 29 is now offline [ 11.551355] smpboot: CPU 30 is now offline [ 11.591326] smpboot: CPU 31 is now offline [ 11.624004] smpboot: CPU 32 is now offline [ 11.659326] smpboot: CPU 33 is now offline [ 11.687478] smpboot: CPU 34 is now offline [ 11.719243] smpboot: CPU 35 is now offline [ 11.747252] smpboot: CPU 36 is now offline [ 11.771224] smpboot: CPU 37 is now offline [ 11.795089] [ cut here ]---- [ 11.798052] kernel BUG at kernel/sched/core.c:7596! [ 11.798052] invalid opcode: [#1] PREEMPT SMP PTI [ 11.798052] CPU: 38 PID: 202 Comm: migration/38 Tainted: GE 5.10.0+ #6 [ 11.798052] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090008 12/07/2018 [ 11.798052] Stopper: multi_cpu_stop+0x0/0xf0 <- 0x0 [ 11.798052] RIP: 0010:sch
Re: [arm64] kernel BUG at kernel/seccomp.c:1309!
Jann Horn writes: > On Mon, Nov 23, 2020 at 2:45 PM Arnd Bergmann wrote: >> On Mon, Nov 23, 2020 at 12:15 PM Naresh Kamboju >> wrote: >> > >> > While booting arm64 kernel the following kernel BUG noticed on several >> > arm64 >> > devices running linux next 20201123 tag kernel. >> > >> > >> > $ git log --oneline next-20201120..next-20201123 -- kernel/seccomp.c >> > 5c5c5fa055ea Merge remote-tracking branch 'seccomp/for-next/seccomp' >> > bce6a8cba7bf Merge branch 'linus' >> > 7ef95e3dbcee Merge branch 'for-linus/seccomp' into for-next/seccomp >> > fab686eb0307 seccomp: Remove bogus __user annotations >> > 0d831528 seccomp/cache: Report cache data through >> > /proc/pid/seccomp_cache >> > 8e01b51a31a1 seccomp/cache: Add "emulator" to check if filter is constant >> > allow >> > f9d480b6ffbe seccomp/cache: Lookup syscall allowlist bitmap for fast path >> > 23d67a54857a seccomp: Migrate to use SYSCALL_WORK flag >> > >> > >> > Please find these easy steps to reproduce the kernel build and boot. >> >> Adding Gabriel Krisman Bertazi to Cc, as the last patch (23d67a54857a) here >> seems suspicious: it changes >> >> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h >> index 02aef2844c38..47763f3999f7 100644 >> --- a/include/linux/seccomp.h >> +++ b/include/linux/seccomp.h >> @@ -42,7 +42,7 @@ struct seccomp { >> extern int __secure_computing(const struct seccomp_data *sd); >> static inline int secure_computing(void) >> { >> - if (unlikely(test_thread_flag(TIF_SECCOMP))) >> + if (unlikely(test_syscall_work(SECCOMP))) >> return __secure_computing(NULL); >> return 0; >> } >> >> which is in the call chain directly before >> >> int __secure_computing(const struct seccomp_data *sd) >> { >>int mode = current->seccomp.mode; >> >> ... >> switch (mode) { >> case SECCOMP_MODE_STRICT: >> __secure_computing_strict(this_syscall); /* may call >> do_exit */ >> return 0; >> case SECCOMP_MODE_FILTER: >> return __seccomp_filter(this_syscall, sd, false); >> default: >> BUG(); >> } >> } >> >> Clearly, current->seccomp.mode is set to something other >> than SECCOMP_MODE_STRICT or SECCOMP_MODE_FILTER >> while the test_syscall_work(SECCOMP) returns true, and this >> must have not been the case earlier. > > Ah, I think the problem is actually in > 3136b93c3fb2b7c19e853e049203ff8f2b9dd2cd ("entry: Expose helpers to > migrate TIF to SYSCALL_WORK flag"). In the !GENERIC_ENTRY case, it > adds this code: > > +#define set_syscall_work(fl) \ > + set_ti_thread_flag(current_thread_info(), SYSCALL_WORK_##fl) > +#define test_syscall_work(fl) \ > + test_ti_thread_flag(current_thread_info(), SYSCALL_WORK_##fl) > +#define clear_syscall_work(fl) \ > + clear_ti_thread_flag(current_thread_info(), SYSCALL_WORK_##fl) > + > +#define set_task_syscall_work(t, fl) \ > + set_ti_thread_flag(task_thread_info(t), TIF_##fl) > +#define test_task_syscall_work(t, fl) \ > + test_ti_thread_flag(task_thread_info(t), TIF_##fl) > +#define clear_task_syscall_work(t, fl) \ > + clear_ti_thread_flag(task_thread_info(t), TIF_##fl) > > but the SYSCALL_WORK_FLAGS are not valid on !GENERIC_ENTRY, we'll mix > up (on arm64) SYSCALL_WORK_BIT_SECCOMP (==0) and TIF_SIGPENDING (==0). > > As part of fixing this, it might be a good idea to put "enum > syscall_work_bit" behind a "#ifdef CONFIG_GENERIC_ENTRY" to avoid > future accidents like this? Hi Jan, Arnd, That is correct. This is a copy pasta mistake. My apologies. I didn't have a !GENERIC_ENTRY device to test, but just the ifdef would have caught it. -- Gabriel Krisman Bertazi
Re: [arm64] kernel BUG at kernel/seccomp.c:1309!
On Mon, Nov 23, 2020 at 2:45 PM Arnd Bergmann wrote: > On Mon, Nov 23, 2020 at 12:15 PM Naresh Kamboju > wrote: > > > > While booting arm64 kernel the following kernel BUG noticed on several arm64 > > devices running linux next 20201123 tag kernel. > > > > > > $ git log --oneline next-20201120..next-20201123 -- kernel/seccomp.c > > 5c5c5fa055ea Merge remote-tracking branch 'seccomp/for-next/seccomp' > > bce6a8cba7bf Merge branch 'linus' > > 7ef95e3dbcee Merge branch 'for-linus/seccomp' into for-next/seccomp > > fab686eb0307 seccomp: Remove bogus __user annotations > > 0d831528 seccomp/cache: Report cache data through > > /proc/pid/seccomp_cache > > 8e01b51a31a1 seccomp/cache: Add "emulator" to check if filter is constant > > allow > > f9d480b6ffbe seccomp/cache: Lookup syscall allowlist bitmap for fast path > > 23d67a54857a seccomp: Migrate to use SYSCALL_WORK flag > > > > > > Please find these easy steps to reproduce the kernel build and boot. > > Adding Gabriel Krisman Bertazi to Cc, as the last patch (23d67a54857a) here > seems suspicious: it changes > > diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h > index 02aef2844c38..47763f3999f7 100644 > --- a/include/linux/seccomp.h > +++ b/include/linux/seccomp.h > @@ -42,7 +42,7 @@ struct seccomp { > extern int __secure_computing(const struct seccomp_data *sd); > static inline int secure_computing(void) > { > - if (unlikely(test_thread_flag(TIF_SECCOMP))) > + if (unlikely(test_syscall_work(SECCOMP))) > return __secure_computing(NULL); > return 0; > } > > which is in the call chain directly before > > int __secure_computing(const struct seccomp_data *sd) > { >int mode = current->seccomp.mode; > > ... > switch (mode) { > case SECCOMP_MODE_STRICT: > __secure_computing_strict(this_syscall); /* may call do_exit > */ > return 0; > case SECCOMP_MODE_FILTER: > return __seccomp_filter(this_syscall, sd, false); > default: > BUG(); > } > } > > Clearly, current->seccomp.mode is set to something other > than SECCOMP_MODE_STRICT or SECCOMP_MODE_FILTER > while the test_syscall_work(SECCOMP) returns true, and this > must have not been the case earlier. Ah, I think the problem is actually in 3136b93c3fb2b7c19e853e049203ff8f2b9dd2cd ("entry: Expose helpers to migrate TIF to SYSCALL_WORK flag"). In the !GENERIC_ENTRY case, it adds this code: +#define set_syscall_work(fl) \ + set_ti_thread_flag(current_thread_info(), SYSCALL_WORK_##fl) +#define test_syscall_work(fl) \ + test_ti_thread_flag(current_thread_info(), SYSCALL_WORK_##fl) +#define clear_syscall_work(fl) \ + clear_ti_thread_flag(current_thread_info(), SYSCALL_WORK_##fl) + +#define set_task_syscall_work(t, fl) \ + set_ti_thread_flag(task_thread_info(t), TIF_##fl) +#define test_task_syscall_work(t, fl) \ + test_ti_thread_flag(task_thread_info(t), TIF_##fl) +#define clear_task_syscall_work(t, fl) \ + clear_ti_thread_flag(task_thread_info(t), TIF_##fl) but the SYSCALL_WORK_FLAGS are not valid on !GENERIC_ENTRY, we'll mix up (on arm64) SYSCALL_WORK_BIT_SECCOMP (==0) and TIF_SIGPENDING (==0). As part of fixing this, it might be a good idea to put "enum syscall_work_bit" behind a "#ifdef CONFIG_GENERIC_ENTRY" to avoid future accidents like this?
Re: [arm64] kernel BUG at kernel/seccomp.c:1309!
On Mon, Nov 23, 2020 at 12:15 PM Naresh Kamboju wrote: > > While booting arm64 kernel the following kernel BUG noticed on several arm64 > devices running linux next 20201123 tag kernel. > > > $ git log --oneline next-20201120..next-20201123 -- kernel/seccomp.c > 5c5c5fa055ea Merge remote-tracking branch 'seccomp/for-next/seccomp' > bce6a8cba7bf Merge branch 'linus' > 7ef95e3dbcee Merge branch 'for-linus/seccomp' into for-next/seccomp > fab686eb0307 seccomp: Remove bogus __user annotations > 0d831528 seccomp/cache: Report cache data through /proc/pid/seccomp_cache > 8e01b51a31a1 seccomp/cache: Add "emulator" to check if filter is constant > allow > f9d480b6ffbe seccomp/cache: Lookup syscall allowlist bitmap for fast path > 23d67a54857a seccomp: Migrate to use SYSCALL_WORK flag > > > Please find these easy steps to reproduce the kernel build and boot. Adding Gabriel Krisman Bertazi to Cc, as the last patch (23d67a54857a) here seems suspicious: it changes diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h index 02aef2844c38..47763f3999f7 100644 --- a/include/linux/seccomp.h +++ b/include/linux/seccomp.h @@ -42,7 +42,7 @@ struct seccomp { extern int __secure_computing(const struct seccomp_data *sd); static inline int secure_computing(void) { - if (unlikely(test_thread_flag(TIF_SECCOMP))) + if (unlikely(test_syscall_work(SECCOMP))) return __secure_computing(NULL); return 0; } which is in the call chain directly before int __secure_computing(const struct seccomp_data *sd) { int mode = current->seccomp.mode; ... switch (mode) { case SECCOMP_MODE_STRICT: __secure_computing_strict(this_syscall); /* may call do_exit */ return 0; case SECCOMP_MODE_FILTER: return __seccomp_filter(this_syscall, sd, false); default: BUG(); } } Clearly, current->seccomp.mode is set to something other than SECCOMP_MODE_STRICT or SECCOMP_MODE_FILTER while the test_syscall_work(SECCOMP) returns true, and this must have not been the case earlier. Arnd > > step to reproduce: > # please install tuxmake > # sudo pip3 install -U tuxmake > # cd linux-next > # tuxmake --runtime docker --target-arch arm --toolchain gcc-9 > --kconfig defconfig --kconfig-add > https://builds.tuxbuild.com/1kgWN61pS5M35vjnVfDSvOOPd38/config > > # Boot the arm64 on any arm64 devices. > # you will notice the below BUG > > crash log details: > --- > [6.941012] [ cut here ]-------- > Found device /dev/ttyAMA3. > [6.947587] lima f408.gpu: mod rate = 5 > [6.955422] kernel BUG at kernel/seccomp.c:1309! > [6.955430] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP > [6.955437] Modules linked in: cec rfkill wlcore_sdio(+) kirin_drm > dw_drm_dsi lima(+) drm_kms_helper gpu_sched drm fuse > [6.955481] CPU: 2 PID: 291 Comm: systemd-udevd Not tainted > 5.10.0-rc4-next-20201123 #2 > [6.955485] Hardware name: HiKey Development Board (DT) > [6.955493] pstate: 8005 (Nzcv daif -PAN -UAO -TCO BTYPE=--) > [6.955510] pc : __secure_computing+0xe0/0xe8 > [6.958171] mmc_host mmc2: Bus speed (slot 0) = 2480Hz (slot > req 40Hz, actual 40HZ div = 31) > [6.965975] [drm] Initialized lima 1.1.0 20191231 for f408.gpu on > minor 0 > [6.970176] lr : syscall_trace_enter+0x1cc/0x218 > [6.970181] sp : 800012d8be10 > [6.970185] x29: 800012d8be10 x28: 0092cb00 > [6.970195] x27: x26: > [6.970203] x25: x24: > [6.970210] x23: 6000 x22: 0202 > [7.011614] mmc_host mmc2: Bus speed (slot 0) = 2480Hz (slot > req 2500Hz, actual 2480HZ div = 0) > [7.016457] > [7.016461] x21: 0200 x20: 0092cb00 > [7.016470] x19: 800012d8bec0 x18: > [7.016478] x17: x16: > [7.016485] x15: x14: > [7.054116] mmc_host mmc2: Bus speed (slot 0) = 2480Hz (slot > req 40Hz, actual 40HZ div = 31) > [7.056715] > [7.103444] mmc_host mmc2: Bus speed (slot 0) = 2480Hz (slot > req 2500Hz, actual 2480HZ div = 0) > [7.105105] x13: x12: > [7.125849] x11: x10: > [7.125858] x9 : 80001001bcbc x8 : > [7.125865] x7 : x6 : > [7.125871] x5 : x4 : > [7.125879] x3 : x2 : 0092cb00 > [7.125886] x1 :
[arm64] kernel BUG at kernel/seccomp.c:1309!
While booting arm64 kernel the following kernel BUG noticed on several arm64 devices running linux next 20201123 tag kernel. $ git log --oneline next-20201120..next-20201123 -- kernel/seccomp.c 5c5c5fa055ea Merge remote-tracking branch 'seccomp/for-next/seccomp' bce6a8cba7bf Merge branch 'linus' 7ef95e3dbcee Merge branch 'for-linus/seccomp' into for-next/seccomp fab686eb0307 seccomp: Remove bogus __user annotations 0d831528 seccomp/cache: Report cache data through /proc/pid/seccomp_cache 8e01b51a31a1 seccomp/cache: Add "emulator" to check if filter is constant allow f9d480b6ffbe seccomp/cache: Lookup syscall allowlist bitmap for fast path 23d67a54857a seccomp: Migrate to use SYSCALL_WORK flag Please find these easy steps to reproduce the kernel build and boot. step to reproduce: # please install tuxmake # sudo pip3 install -U tuxmake # cd linux-next # tuxmake --runtime docker --target-arch arm --toolchain gcc-9 --kconfig defconfig --kconfig-add https://builds.tuxbuild.com/1kgWN61pS5M35vjnVfDSvOOPd38/config # Boot the arm64 on any arm64 devices. # you will notice the below BUG crash log details: --- [6.941012] [ cut here ] Found device /dev/ttyAMA3. [6.947587] lima f408.gpu: mod rate = 5 [6.955422] kernel BUG at kernel/seccomp.c:1309! [6.955430] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP [6.955437] Modules linked in: cec rfkill wlcore_sdio(+) kirin_drm dw_drm_dsi lima(+) drm_kms_helper gpu_sched drm fuse [6.955481] CPU: 2 PID: 291 Comm: systemd-udevd Not tainted 5.10.0-rc4-next-20201123 #2 [6.955485] Hardware name: HiKey Development Board (DT) [6.955493] pstate: 8005 (Nzcv daif -PAN -UAO -TCO BTYPE=--) [6.955510] pc : __secure_computing+0xe0/0xe8 [6.958171] mmc_host mmc2: Bus speed (slot 0) = 2480Hz (slot req 40Hz, actual 40HZ div = 31) [6.965975] [drm] Initialized lima 1.1.0 20191231 for f408.gpu on minor 0 [6.970176] lr : syscall_trace_enter+0x1cc/0x218 [6.970181] sp : 800012d8be10 [6.970185] x29: 800012d8be10 x28: 0092cb00 [6.970195] x27: x26: [6.970203] x25: x24: [6.970210] x23: 6000 x22: 0202 [7.011614] mmc_host mmc2: Bus speed (slot 0) = 2480Hz (slot req 2500Hz, actual 2480HZ div = 0) [7.016457] [7.016461] x21: 0200 x20: 0092cb00 [7.016470] x19: 800012d8bec0 x18: [7.016478] x17: x16: [7.016485] x15: x14: [7.054116] mmc_host mmc2: Bus speed (slot 0) = 2480Hz (slot req 40Hz, actual 40HZ div = 31) [7.056715] [7.103444] mmc_host mmc2: Bus speed (slot 0) = 2480Hz (slot req 2500Hz, actual 2480HZ div = 0) [7.105105] x13: x12: [7.125849] x11: x10: [7.125858] x9 : 80001001bcbc x8 : [7.125865] x7 : x6 : [7.125871] x5 : x4 : [7.125879] x3 : x2 : 0092cb00 [7.125886] x1 : x0 : 0116 [7.125896] Call trace: ] Found device /dev/ttyAMA2. [7.125908] __secure_computing+0xe0/0xe8 [7.125918] syscall_trace_enter+0x1cc/0x218 [7.125927] el0_svc_common.constprop.0+0x19c/0x1b8 [7.125933] do_el0_svc+0x2c/0x98 [7.125940] el0_sync_handler+0x180/0x188 [7.125946] el0_sync+0x174/0x180 [7.125958] Code: d2800121 97ffd9a9 d2800120 97fbf1a9 (d421) [7.199584] ---[ end trace 463debbc21f0c7b5 ]--- [7.204205] note: systemd-udevd[291] exited with preempt_count 1 [7.210733] [ cut here ] [7.215451] WARNING: CPU: 2 PID: # 0 at kernel/rcu/tree.c:632 rcu_eqs_enter.isra.0+0x134/0x140 [7.223927] Modules linked in: cec rfkill wlcore_sdio kirin_drm dw_drm_dsi lima drm_kms_helper gpu_sched drm fuse [7.234295] CPU: 2 PID: 0 Comm: swapper/2 Tainted: G D 5.10.0-rc4-next-20201123 #2 [7.243252] Hardware name: HiKey Development Board (DT) [7.248561] pstate: 23c5 (nzCv DAIF -PAN -UAO -TCO BTYPE=--) [7.254638] pc : rcu_eqs_enter.isra.0+0x134/0x140 [7.259350] lr : rcu_idle_enter+0x18/0x28 [7.263362] sp : 8000128e3e80 [7.266678] x29: 8000128e3e80 x28: [7.272001] x27: x26: 01b79080 [7.277321] x25: x24: 0001adc9b310 [7.282641] x23: x22: 01b79080 [7.287970] x21: 77b24b00 x20: 01b79098 [7.287979] x19: 800011c7ab40 x18: 0010 [7.287986] x17: x16: [7.287993] x15: 0092cf98 x14: 0720072007200720 [7.288001] x13: 0720072007200720 x12: 03c6 [7.2
Re: PPPd Bug with kernel 5.8.5 : task pppd:6009 blocked for more than 122 seconds.
With small research in ppp_generic.c may be a problem is come in this section: static int ppp_nl_newlink(struct net *src_net, struct net_device *dev, struct nlattr *tb[], struct nlattr *data[], struct netlink_ext_ack *extack) { struct ppp_config conf = { .unit = -1, .ifname_is_set = true, }; struct file *file; int err; file = fget(nla_get_s32(data[IFLA_PPP_DEV_FD])); if (!file) return -EBADF; /* rtnl_lock is already held here, but ppp_create_interface() locks * ppp_mutex before holding rtnl_lock. Using mutex_trylock() avoids * possible deadlock due to lock order inversion, at the cost of * pushing the problem back to userspace. */ if (!mutex_trylock(_mutex)) { err = -EBUSY; goto out; } if (file->f_op != _device_fops || file->private_data) { err = -EBADF; goto out_unlock; } conf.file = file; /* Don't use device name generated by the rtnetlink layer when ifname * isn't specified. Let ppp_dev_configure() set the device name using * the PPP unit identifer as suffix (i.e. ppp). This allows * userspace to infer the device name using to the PPPIOCGUNIT ioctl. */ if (!tb[IFLA_IFNAME]) conf.ifname_is_set = false; err = ppp_dev_configure(src_net, dev, ); out_unlock: mutex_unlock(_mutex); out: fput(file); return err; } Need people that under stand this part of kernel code. > On 1 Sep 2020, at 22:36, Martin Zaharinov wrote: > > HI all kernel team > > We detect bug in kernel with pppd server > When disconnect and connect 1000+ users after many time kernel have lock > please see debug log : > > > Aug 30 05:34:55 10.20.22.1 [139221.203949][ T817] INFO: task pppd:6009 > blocked for more than 122 seconds. > Aug 30 05:34:55 10.20.22.1 [139221.231206][ T817] Tainted: G > O 5.8.5 #1 > Aug 30 05:34:55 10.20.22.1 [139221.257538][ T817] "echo 0 > > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > Aug 30 05:34:55 10.20.22.1 [139221.309257][ T817] pppdD0 > 6009 1 0x80004006 > Aug 30 05:34:55 10.20.22.1 [139221.335289][ T817] Call Trace: > Aug 30 05:34:55 10.20.22.1 [139221.360927][ T817] __schedule+0x374/0x750 > Aug 30 05:34:55 10.20.22.1 [139221.386660][ T817] > schedule_preempt_disabled+0x50/0xc0 > Aug 30 05:34:55 10.20.22.1 [139221.412825][ T817] > __mutex_lock.constprop.0+0x15d/0x490 > Aug 30 05:34:55 10.20.22.1 [139221.439154][ T817] ppp_release+0xa8/0xe0 > [ppp_generic] > Aug 30 05:34:55 10.20.22.1 [139221.465557][ T817] __fput+0xc8/0x220 > Aug 30 05:34:55 10.20.22.1 [139221.491144][ T817] task_work_run+0x57/0x90 > Aug 30 05:34:55 10.20.22.1 [139221.516457][ T817] do_exit+0x317/0xa60 > Aug 30 05:34:55 10.20.22.1 [139221.541571][ T817] do_group_exit+0x2e/0x90 > Aug 30 05:34:55 10.20.22.1 [139221.566159][ T817] get_signal+0x11b/0x7a0 > Aug 30 05:34:55 10.20.22.1 [139221.590102][ T817] do_signal+0x25/0x1f0 > Aug 30 05:34:55 10.20.22.1 [139221.613380][ T817] ? ktime_get_ts64+0x3e/0xe0 > Aug 30 05:34:55 10.20.22.1 [139221.636164][ T817] ? _copy_to_user+0x22/0x30 > Aug 30 05:34:55 10.20.22.1 [139221.658333][ T817] > __prepare_exit_to_usermode+0xd6/0x140 > Aug 30 05:34:55 10.20.22.1 [139221.680230][ T817] do_syscall_64+0x50/0x120 > Aug 30 05:34:55 10.20.22.1 [139221.701469][ T817] > entry_SYSCALL_64_after_hwframe+0x44/0xa9 > Aug 30 05:34:55 10.20.22.1 [139221.722554][ T817] RIP: 0033:0x7f8387041ae6 > Aug 30 05:34:55 10.20.22.1 [139221.743000][ T817] Code: Bad RIP value. > Aug 30 05:34:55 10.20.22.1 [139221.762781][ T817] RSP: 002b:7ffcee1e4fa8 > EFLAGS: 0246 ORIG_RAX: 0017 > Aug 30 05:34:55 10.20.22.1 [139221.802079][ T817] RAX: fdfe RBX: > RCX: 7f8387041ae6 > Aug 30 05:34:56 10.20.22.1 [139221.841537][ T817] RDX: RSI: > 7ffcee1e50b0 RDI: 000d > Aug 30 05:34:56 10.20.22.1 [139221.882678][ T817] RBP: 7ffcee1e50b0 R08: > 7ffcee1e5030 R09: 0010 > Aug 30 05:34:56 10.20.22.1 [139221.924623][ T817] R10: R11: > 0246 R12: 1000 > Aug 30 05:34:56 10.20.22.1 [139221.967198][ T817] R13: 000c R14: > R15: 7ffcee1e7130 > Aug 30 05:34:56 10.20.22.1 [139222.011266][ T817] INFO: task pppd:6092 > blocked for more than 123 seconds. > Aug 30 05:34:56 10.20.22.1 [139222.036779][ T817] Tainted: G > O 5.8.5 #1 > Aug 30 05:34:56 10.20.22.1 [1
PPPd Bug with kernel 5.8.5 : task pppd:6009 blocked for more than 122 seconds.
HI all kernel team We detect bug in kernel with pppd server When disconnect and connect 1000+ users after many time kernel have lock please see debug log : Aug 30 05:34:55 10.20.22.1 [139221.203949][ T817] INFO: task pppd:6009 blocked for more than 122 seconds. Aug 30 05:34:55 10.20.22.1 [139221.231206][ T817] Tainted: G O 5.8.5 #1 Aug 30 05:34:55 10.20.22.1 [139221.257538][ T817] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Aug 30 05:34:55 10.20.22.1 [139221.309257][ T817] pppdD0 6009 1 0x80004006 Aug 30 05:34:55 10.20.22.1 [139221.335289][ T817] Call Trace: Aug 30 05:34:55 10.20.22.1 [139221.360927][ T817] __schedule+0x374/0x750 Aug 30 05:34:55 10.20.22.1 [139221.386660][ T817] schedule_preempt_disabled+0x50/0xc0 Aug 30 05:34:55 10.20.22.1 [139221.412825][ T817] __mutex_lock.constprop.0+0x15d/0x490 Aug 30 05:34:55 10.20.22.1 [139221.439154][ T817] ppp_release+0xa8/0xe0 [ppp_generic] Aug 30 05:34:55 10.20.22.1 [139221.465557][ T817] __fput+0xc8/0x220 Aug 30 05:34:55 10.20.22.1 [139221.491144][ T817] task_work_run+0x57/0x90 Aug 30 05:34:55 10.20.22.1 [139221.516457][ T817] do_exit+0x317/0xa60 Aug 30 05:34:55 10.20.22.1 [139221.541571][ T817] do_group_exit+0x2e/0x90 Aug 30 05:34:55 10.20.22.1 [139221.566159][ T817] get_signal+0x11b/0x7a0 Aug 30 05:34:55 10.20.22.1 [139221.590102][ T817] do_signal+0x25/0x1f0 Aug 30 05:34:55 10.20.22.1 [139221.613380][ T817] ? ktime_get_ts64+0x3e/0xe0 Aug 30 05:34:55 10.20.22.1 [139221.636164][ T817] ? _copy_to_user+0x22/0x30 Aug 30 05:34:55 10.20.22.1 [139221.658333][ T817] __prepare_exit_to_usermode+0xd6/0x140 Aug 30 05:34:55 10.20.22.1 [139221.680230][ T817] do_syscall_64+0x50/0x120 Aug 30 05:34:55 10.20.22.1 [139221.701469][ T817] entry_SYSCALL_64_after_hwframe+0x44/0xa9 Aug 30 05:34:55 10.20.22.1 [139221.722554][ T817] RIP: 0033:0x7f8387041ae6 Aug 30 05:34:55 10.20.22.1 [139221.743000][ T817] Code: Bad RIP value. Aug 30 05:34:55 10.20.22.1 [139221.762781][ T817] RSP: 002b:7ffcee1e4fa8 EFLAGS: 0246 ORIG_RAX: 0017 Aug 30 05:34:55 10.20.22.1 [139221.802079][ T817] RAX: fdfe RBX: RCX: 7f8387041ae6 Aug 30 05:34:56 10.20.22.1 [139221.841537][ T817] RDX: RSI: 7ffcee1e50b0 RDI: 000d Aug 30 05:34:56 10.20.22.1 [139221.882678][ T817] RBP: 7ffcee1e50b0 R08: 7ffcee1e5030 R09: 0010 Aug 30 05:34:56 10.20.22.1 [139221.924623][ T817] R10: R11: 0246 R12: 1000 Aug 30 05:34:56 10.20.22.1 [139221.967198][ T817] R13: 000c R14: R15: 7ffcee1e7130 Aug 30 05:34:56 10.20.22.1 [139222.011266][ T817] INFO: task pppd:6092 blocked for more than 123 seconds. Aug 30 05:34:56 10.20.22.1 [139222.036779][ T817] Tainted: G O 5.8.5 #1 Aug 30 05:34:56 10.20.22.1 [139222.060127][ T817] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Aug 30 05:34:56 10.20.22.1 [139222.107395][ T817] pppdD0 6092 1 0x80004006 Aug 30 05:34:56 10.20.22.1 [139222.131680][ T817] Call Trace: Aug 30 05:34:56 10.20.22.1 [139222.155515][ T817] __schedule+0x374/0x750 Aug 30 05:34:56 10.20.22.1 [139222.179548][ T817] schedule_preempt_disabled+0x50/0xc0 Aug 30 05:34:56 10.20.22.1 [139222.203987][ T817] __mutex_lock.constprop.0+0x15d/0x490 Aug 30 05:34:56 10.20.22.1 [139222.228677][ T817] ppp_release+0xa8/0xe0 [ppp_generic] Aug 30 05:34:56 10.20.22.1 [139222.253507][ T817] __fput+0xc8/0x220 Aug 30 05:34:56 10.20.22.1 [139222.278100][ T817] task_work_run+0x57/0x90 Aug 30 05:34:56 10.20.22.1 [139222.302749][ T817] do_exit+0x317/0xa60 Aug 30 05:34:56 10.20.22.1 [139222.327154][ T817] do_group_exit+0x2e/0x90 Aug 30 05:34:56 10.20.22.1 [139222.351352][ T817] get_signal+0x11b/0x7a0 Aug 30 05:34:56 10.20.22.1 [139222.375277][ T817] ? set_next_entity+0xcd/0x340 Aug 30 05:34:56 10.20.22.1 [139222.399090][ T817] do_signal+0x25/0x1f0 Aug 30 05:34:56 10.20.22.1 [139222.422846][ T817] ? __schedule+0x1af/0x750 Aug 30 05:34:56 10.20.22.1 [139222.446246][ T817] __prepare_exit_to_usermode+0xd6/0x140 Aug 30 05:34:56 10.20.22.1 [139222.469384][ T817] do_syscall_64+0x50/0x120 Aug 30 05:34:56 10.20.22.1 [139222.491919][ T817] entry_SYSCALL_64_after_hwframe+0x44/0xa9 Aug 30 05:34:56 10.20.22.1 [139222.514294][ T817] RIP: 0033:0x7f5850cafae6 Aug 30 05:34:56 10.20.22.1 [139222.536053][ T817] Code: Bad RIP value. Aug 30 05:34:56 10.20.22.1 [139222.557135][ T817] RSP: 002b:7fff80131af8 EFLAGS: 0246 ORIG_RAX: 0017 Aug 30 05:34:56 10.20.22.1 [139222.598929][ T817] RAX: fdfe RBX: RCX: 7f5850cafae6 Aug 30 05:34:56 10.20.22.1 [139222.640656][ T817] RDX: RSI: 7fff80131c00 RDI: 000d Aug 30 05:34:56 10.20.22.1 [139222.682379][ T817]
Re: kernel BUG at kernel/fork.c:LINE!
On Sun, Aug 23, 2020 at 10:34 AM Randy Dunlap wrote: > > On 8/7/20 2:16 PM, syzbot wrote: > > Hello, > > > > syzbot found the following issue on: > > > > HEAD commit:fffe3ae0 Merge tag 'for-linus-hmm' of git://git.kernel.org.. > > git tree: upstream > > console output: https://syzkaller.appspot.com/x/log.txt?x=1194d90a90 > > kernel config: https://syzkaller.appspot.com/x/.config?x=18bb86f2e4ebfda2 > > dashboard link: https://syzkaller.appspot.com/bug?extid=3776ecd80aac504e6085 > > compiler: gcc (GCC) 10.1.0-syz 20200507 > > > > Unfortunately, I don't have any reproducer for this issue yet. > > > > IMPORTANT: if you fix the issue, please add the following tag to the commit: > > Reported-by: syzbot+3776ecd80aac504e6...@syzkaller.appspotmail.com > > Is this fixed by > > commit 991e7673859ed41e7ba83c8c4e57afe8cfebe314 > Author: Shakeel Butt > Date: Thu Aug 6 23:21:37 2020 -0700 > > mm: memcontrol: account kernel stack per node > > > since the BUG_ON() at line 390 is removed by this patch. > It's not really removed. The BUG_ON() has been moved to line 402. So, this issue might still occur. > > > [ cut here ] > > kernel BUG at kernel/fork.c:390! > > invalid opcode: [#1] PREEMPT SMP KASAN > > CPU: 0 PID: 5239 Comm: syz-executor.1 Not tainted 5.8.0-syzkaller #0 > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS > > Google 01/01/2011 > > RIP: 0010:account_kernel_stack+0x297/0x320 kernel/fork.c:390 > > Code: 89 e2 be 23 00 00 00 48 89 ef c1 e2 05 e8 81 9e 75 00 48 83 c4 10 5b > > 5d 41 5c 41 5d 41 5e 41 5f e9 ae c9 2f 00 e8 a9 c9 2f 00 <0f> 0b e8 f2 50 > > 6f 00 e9 d2 fd ff ff e8 98 c9 2f 00 48 c7 c6 20 24 > > RSP: 0018:c90015e4f850 EFLAGS: 00010216 > > RAX: 01f4 RBX: RCX: c90017983000 > > RDX: 0004 RSI: 81445327 RDI: 0005 > > RBP: R08: 0001 R09: 8880a2ef9663 > > R10: 0008 R11: R12: > > R13: 8880001b2280 R14: 88809e4fa840 R15: > > FS: 7f7f035e5700() GS:8880ae60() knlGS: > > CS: 0010 DS: ES: CR0: 80050033 > > CR2: 001b2fd3 CR3: 9b747000 CR4: 001506f0 > > DR0: DR1: DR2: > > DR3: DR6: fffe0ff0 DR7: 0400 > > Call Trace: > > release_task_stack kernel/fork.c:447 [inline] > > put_task_stack+0xc4/0x230 kernel/fork.c:459 > > finish_task_switch+0x52a/0x750 kernel/sched/core.c:3649 > > context_switch kernel/sched/core.c:3781 [inline] > > __schedule+0x8ed/0x21e0 kernel/sched/core.c:4527 > > schedule+0xd0/0x2a0 kernel/sched/core.c:4602 > > freezable_schedule include/linux/freezer.h:172 [inline] > > futex_wait_queue_me+0x2a7/0x570 kernel/futex.c:2588 > > futex_wait+0x1df/0x560 kernel/futex.c:2690 > > do_futex+0x15b/0x1a60 kernel/futex.c:3749 > > __do_sys_futex kernel/futex.c:3810 [inline] > > __se_sys_futex kernel/futex.c:3778 [inline] > > __x64_sys_futex+0x378/0x4e0 kernel/futex.c:3778 > > do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 > > entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > RIP: 0033:0x45ccd9 > > Code: 2d b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 > > 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff > > ff 0f 83 fb b5 fb ff c3 66 2e 0f 1f 84 00 00 00 00 > > RSP: 002b:7f7f035e4cf8 EFLAGS: 0246 ORIG_RAX: 00ca > > RAX: ffda RBX: 0078bfa8 RCX: 0045ccd9 > > RDX: RSI: 0080 RDI: 0078bfa8 > > RBP: 0078bfa0 R08: R09: > > R10: R11: 0246 R12: 0078bfac > > R13: 7fff896cb67f R14: 7f7f035e59c0 R15: 0078bfac > > Modules linked in: > > ---[ end trace ff14b6c5822b8142 ]--- > > RIP: 0010:account_kernel_stack+0x297/0x320 kernel/fork.c:390 > > Code: 89 e2 be 23 00 00 00 48 89 ef c1 e2 05 e8 81 9e 75 00 48 83 c4 10 5b > > 5d 41 5c 41 5d 41 5e 41 5f e9 ae c9 2f 00 e8 a9 c9 2f 00 <0f> 0b e8 f2 50 > > 6f 00 e9 d2 fd ff ff e8 98 c9 2f 00 48 c7 c6 20 24 > > RSP: 0018:c90015e4f850 EFLAGS: 00010216 > > RAX: 01f4 RBX: RCX: c90017983000 > > RDX: 0004 RSI: 81445327 RDI: 0005 > > RBP: R08: 00
Re: kernel BUG at kernel/fork.c:LINE!
On 8/7/20 2:16 PM, syzbot wrote: > Hello, > > syzbot found the following issue on: > > HEAD commit:fffe3ae0 Merge tag 'for-linus-hmm' of git://git.kernel.org.. > git tree: upstream > console output: https://syzkaller.appspot.com/x/log.txt?x=1194d90a90 > kernel config: https://syzkaller.appspot.com/x/.config?x=18bb86f2e4ebfda2 > dashboard link: https://syzkaller.appspot.com/bug?extid=3776ecd80aac504e6085 > compiler: gcc (GCC) 10.1.0-syz 20200507 > > Unfortunately, I don't have any reproducer for this issue yet. > > IMPORTANT: if you fix the issue, please add the following tag to the commit: > Reported-by: syzbot+3776ecd80aac504e6...@syzkaller.appspotmail.com Is this fixed by commit 991e7673859ed41e7ba83c8c4e57afe8cfebe314 Author: Shakeel Butt Date: Thu Aug 6 23:21:37 2020 -0700 mm: memcontrol: account kernel stack per node since the BUG_ON() at line 390 is removed by this patch. > ----[ cut here ] > kernel BUG at kernel/fork.c:390! > invalid opcode: [#1] PREEMPT SMP KASAN > CPU: 0 PID: 5239 Comm: syz-executor.1 Not tainted 5.8.0-syzkaller #0 > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS > Google 01/01/2011 > RIP: 0010:account_kernel_stack+0x297/0x320 kernel/fork.c:390 > Code: 89 e2 be 23 00 00 00 48 89 ef c1 e2 05 e8 81 9e 75 00 48 83 c4 10 5b 5d > 41 5c 41 5d 41 5e 41 5f e9 ae c9 2f 00 e8 a9 c9 2f 00 <0f> 0b e8 f2 50 6f 00 > e9 d2 fd ff ff e8 98 c9 2f 00 48 c7 c6 20 24 > RSP: 0018:c90015e4f850 EFLAGS: 00010216 > RAX: 01f4 RBX: RCX: c90017983000 > RDX: 0004 RSI: 81445327 RDI: 0005 > RBP: R08: 0001 R09: 8880a2ef9663 > R10: 0008 R11: R12: > R13: 8880001b2280 R14: 88809e4fa840 R15: > FS: 7f7f035e5700() GS:8880ae60() knlGS: > CS: 0010 DS: ES: CR0: 80050033 > CR2: 001b2fd3 CR3: 9b747000 CR4: 001506f0 > DR0: DR1: DR2: > DR3: DR6: fffe0ff0 DR7: 0400 > Call Trace: > release_task_stack kernel/fork.c:447 [inline] > put_task_stack+0xc4/0x230 kernel/fork.c:459 > finish_task_switch+0x52a/0x750 kernel/sched/core.c:3649 > context_switch kernel/sched/core.c:3781 [inline] > __schedule+0x8ed/0x21e0 kernel/sched/core.c:4527 > schedule+0xd0/0x2a0 kernel/sched/core.c:4602 > freezable_schedule include/linux/freezer.h:172 [inline] > futex_wait_queue_me+0x2a7/0x570 kernel/futex.c:2588 > futex_wait+0x1df/0x560 kernel/futex.c:2690 > do_futex+0x15b/0x1a60 kernel/futex.c:3749 > __do_sys_futex kernel/futex.c:3810 [inline] > __se_sys_futex kernel/futex.c:3778 [inline] > __x64_sys_futex+0x378/0x4e0 kernel/futex.c:3778 > do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 > entry_SYSCALL_64_after_hwframe+0x44/0xa9 > RIP: 0033:0x45ccd9 > Code: 2d b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 > 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f > 83 fb b5 fb ff c3 66 2e 0f 1f 84 00 00 00 00 > RSP: 002b:7f7f035e4cf8 EFLAGS: 0246 ORIG_RAX: 00ca > RAX: ffda RBX: 0078bfa8 RCX: 0045ccd9 > RDX: RSI: 0080 RDI: 0078bfa8 > RBP: 0078bfa0 R08: R09: > R10: R11: 0246 R12: 0078bfac > R13: 7fff896cb67f R14: 7f7f035e59c0 R15: 0078bfac > Modules linked in: > ---[ end trace ff14b6c5822b8142 ]--- > RIP: 0010:account_kernel_stack+0x297/0x320 kernel/fork.c:390 > Code: 89 e2 be 23 00 00 00 48 89 ef c1 e2 05 e8 81 9e 75 00 48 83 c4 10 5b 5d > 41 5c 41 5d 41 5e 41 5f e9 ae c9 2f 00 e8 a9 c9 2f 00 <0f> 0b e8 f2 50 6f 00 > e9 d2 fd ff ff e8 98 c9 2f 00 48 c7 c6 20 24 > RSP: 0018:c90015e4f850 EFLAGS: 00010216 > RAX: 01f4 RBX: RCX: c90017983000 > RDX: 0004 RSI: 81445327 RDI: 0005 > RBP: R08: 0001 R09: 8880a2ef9663 > R10: 0008 R11: R12: > R13: 8880001b2280 R14: 88809e4fa840 R15: > FS: 7f7f035e5700() GS:8880ae60() knlGS: > CS: 0010 DS: ES: CR0: 80050033 > CR2: 001b2fd3 CR3: 9b747000 CR4: 001506f0 > DR0: DR1: DR2: > DR3: DR6: fffe0ff0 DR7: 0400 > > > --- > This report is generated by a bo
kernel BUG at kernel/fork.c:LINE!
Hello, syzbot found the following issue on: HEAD commit:fffe3ae0 Merge tag 'for-linus-hmm' of git://git.kernel.org.. git tree: upstream console output: https://syzkaller.appspot.com/x/log.txt?x=1194d90a90 kernel config: https://syzkaller.appspot.com/x/.config?x=18bb86f2e4ebfda2 dashboard link: https://syzkaller.appspot.com/bug?extid=3776ecd80aac504e6085 compiler: gcc (GCC) 10.1.0-syz 20200507 Unfortunately, I don't have any reproducer for this issue yet. IMPORTANT: if you fix the issue, please add the following tag to the commit: Reported-by: syzbot+3776ecd80aac504e6...@syzkaller.appspotmail.com [ cut here ] kernel BUG at kernel/fork.c:390! invalid opcode: [#1] PREEMPT SMP KASAN CPU: 0 PID: 5239 Comm: syz-executor.1 Not tainted 5.8.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:account_kernel_stack+0x297/0x320 kernel/fork.c:390 Code: 89 e2 be 23 00 00 00 48 89 ef c1 e2 05 e8 81 9e 75 00 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f e9 ae c9 2f 00 e8 a9 c9 2f 00 <0f> 0b e8 f2 50 6f 00 e9 d2 fd ff ff e8 98 c9 2f 00 48 c7 c6 20 24 RSP: 0018:c90015e4f850 EFLAGS: 00010216 RAX: 01f4 RBX: RCX: c90017983000 RDX: 0004 RSI: 81445327 RDI: 0005 RBP: R08: 0001 R09: 8880a2ef9663 R10: 0008 R11: R12: R13: 8880001b2280 R14: 88809e4fa840 R15: FS: 7f7f035e5700() GS:8880ae60() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 001b2fd3 CR3: 9b747000 CR4: 001506f0 DR0: DR1: DR2: DR3: DR6: fffe0ff0 DR7: 0400 Call Trace: release_task_stack kernel/fork.c:447 [inline] put_task_stack+0xc4/0x230 kernel/fork.c:459 finish_task_switch+0x52a/0x750 kernel/sched/core.c:3649 context_switch kernel/sched/core.c:3781 [inline] __schedule+0x8ed/0x21e0 kernel/sched/core.c:4527 schedule+0xd0/0x2a0 kernel/sched/core.c:4602 freezable_schedule include/linux/freezer.h:172 [inline] futex_wait_queue_me+0x2a7/0x570 kernel/futex.c:2588 futex_wait+0x1df/0x560 kernel/futex.c:2690 do_futex+0x15b/0x1a60 kernel/futex.c:3749 __do_sys_futex kernel/futex.c:3810 [inline] __se_sys_futex kernel/futex.c:3778 [inline] __x64_sys_futex+0x378/0x4e0 kernel/futex.c:3778 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x45ccd9 Code: 2d b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 fb b5 fb ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:7f7f035e4cf8 EFLAGS: 0246 ORIG_RAX: 00ca RAX: ffda RBX: 0078bfa8 RCX: 0045ccd9 RDX: RSI: 0080 RDI: 0078bfa8 RBP: 0078bfa0 R08: R09: R10: R11: 0246 R12: 0078bfac R13: 7fff896cb67f R14: 7f7f035e59c0 R15: 0078bfac Modules linked in: ---[ end trace ff14b6c5822b8142 ]--- RIP: 0010:account_kernel_stack+0x297/0x320 kernel/fork.c:390 Code: 89 e2 be 23 00 00 00 48 89 ef c1 e2 05 e8 81 9e 75 00 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f e9 ae c9 2f 00 e8 a9 c9 2f 00 <0f> 0b e8 f2 50 6f 00 e9 d2 fd ff ff e8 98 c9 2f 00 48 c7 c6 20 24 RSP: 0018:c90015e4f850 EFLAGS: 00010216 RAX: 01f4 RBX: RCX: c90017983000 RDX: 0004 RSI: 81445327 RDI: 0005 RBP: R08: 0001 R09: 8880a2ef9663 R10: 0008 R11: R12: R13: 8880001b2280 R14: 88809e4fa840 R15: FS: 7f7f035e5700() GS:8880ae60() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 001b2fd3 CR3: 9b747000 CR4: 001506f0 DR0: DR1: DR2: DR3: DR6: fffe0ff0 DR7: 0400 --- This report is generated by a bot. It may contain errors. See https://goo.gl/tpsmEJ for more information about syzbot. syzbot engineers can be reached at syzkal...@googlegroups.com. syzbot will keep track of this issue. See: https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917
On 20/07/20 15:21, Peter Zijlstra wrote: > On Mon, Jul 20, 2020 at 04:02:24PM +0200, Oleg Nesterov wrote: >> I have to admit, I do not understand the usage of prev_state in schedule(), >> it looks really, really subtle... > > Right, so commit dbfb089d360 solved a problem where schedule() re-read > prev->state vs prev->on_rq = 0. That is, schedule()'s dequeue and > ttwu()'s enqueue disagreed over sched_contributes_to_load. and as a > result load-accounting went wobbly. > > Now, looking at that commit again, I might've solved the problem twice > :-P > > So on the one hand, I provider ordering: > > LOAD p->state LOAD-ACQUIRE p->on_rq == 0 > MB > STORE p->on_rq, 0 STORE p->state, TASK_WAKING > > such that ttwu() will only change p->state, after on_rq==0, which is > after loading p->state in schedule(). > > At the same time, I also had schedule() set > p->sched_contributes_to_load once, and then consistently used that value > throughout, without ever looking at p->state again, which too makes it > much harder to mess load-avg up. > > > Now, the ordering in schedule(), relies on doing the p->state load > before: > > spin_lock(rq->lock) > smp_mb__after_spinlock(); > > and doing a re-load check after, with the assumption that if the reload > is different, it will not block. > > That said, in a crossed email, I just proposed we could simplify all > this like so.. but now I need to go ask people to re-validate that > loadavg muck again :-/ > > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index a2a244af9a53..437fc3b241f2 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -4193,9 +4193,6 @@ static void __sched notrace __schedule(bool preempt) > local_irq_disable(); > rcu_note_context_switch(preempt); > > - /* See deactivate_task() below. */ > - prev_state = prev->state; > - > /* >* Make sure that signal_pending_state()->signal_pending() below >* can't be reordered with __set_current_state(TASK_INTERRUPTIBLE) > @@ -4223,7 +4220,8 @@ static void __sched notrace __schedule(bool preempt) >* We must re-load prev->state in case ttwu_remote() changed it >* before we acquired rq->lock. >*/ > - if (!preempt && prev_state && prev_state == prev->state) { > + prev_state = prev->state; AFAICT the ttwu/schedule ordering dance + the "trick" of updating p->sched_contributes_to_load in __schedule() under rq lock ensures loadavg accounting won't go awry here. I'm still a bit low on coffee, but that does LGTM. > + if (!preempt && prev_state) { > if (signal_pending_state(prev_state, prev)) { > prev->state = TASK_RUNNING; > } else { > @@ -4237,10 +4235,12 @@ static void __sched notrace __schedule(bool preempt) > > /* >* __schedule() ttwu() > - * prev_state = prev->state;if > (READ_ONCE(p->on_rq) && ...) > - * LOCK rq->lock goto out; > - * smp_mb__after_spinlock(); > smp_acquire__after_ctrl_dep(); > - * p->on_rq = 0;p->state = > TASK_WAKING; > + * if (prev_state) if (p->on_rq && ...) > + * p->on_rq = 0;goto out; > + * > smp_acquire__after_ctrl_dep(); > + *p->state = TASK_WAKING > + * > + * Where __schedule() and ttwu() have matching control > dependencies. >* >* After this, schedule() must not care about p->state > any more. >*/
Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917
On Tue, Jul 21, 2020 at 12:52:52AM -0400, Paul Gortmaker wrote: > [Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917] On 20/07/2020 (Mon 16:21) > Peter Zijlstra wrote: > > > On Mon, Jul 20, 2020 at 04:02:24PM +0200, Oleg Nesterov wrote: > > > I have to admit, I do not understand the usage of prev_state in > > > schedule(), > > > it looks really, really subtle... > > > > Right, so commit dbfb089d360 solved a problem where schedule() re-read > > prev->state vs prev->on_rq = 0. That is, schedule()'s dequeue and > > ttwu()'s enqueue disagreed over sched_contributes_to_load. and as a > > result load-accounting went wobbly. > > > > Now, looking at that commit again, I might've solved the problem twice > > :-P > > [...] > > > That said, in a crossed email, I just proposed we could simplify all > > this like so.. but now I need to go ask people to re-validate that > > loadavg muck again :-/ > > After a two hour "quick" sanity test I then gave it a full 7h run (which > always seemed to break before dbfb089d360) and I didn't see any stuck > load average with master from today + this change. > > Paul. > > root@t5610:/home/paul/git/linux-head# > [1]+ Donenohup > tools/testing/selftests/rcutorture/bin/kvm.sh --cpus 24 --duration 420 > --configs TREE03 --trust-make > /tmp/kvm.sh.out 2>&1 > root@t5610:/home/paul/git/linux-head# cat /proc/version > Linux version 5.8.0-rc6-1-g5714ee50bb43-dirty (paul@t5610) (gcc (Ubuntu > 9.3.0-10ubuntu2) 9.3.0, GNU ld (GNU Binutils for Ubuntu) 2.34) #3 SMP Mon Jul > 20 12:30:33 EDT 2020 > root@t5610:/home/paul/git/linux-head# uptime > 00:49:18 up 7:41, 2 users, load average: 0.01, 0.00, 0.63 > root@t5610:/home/paul/git/linux-head# > > -- Thanks a lot for re-testing that Paul, much appreciated! Let me go write a Changelog for this then...
Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917
[Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917] On 20/07/2020 (Mon 16:21) Peter Zijlstra wrote: > On Mon, Jul 20, 2020 at 04:02:24PM +0200, Oleg Nesterov wrote: > > I have to admit, I do not understand the usage of prev_state in schedule(), > > it looks really, really subtle... > > Right, so commit dbfb089d360 solved a problem where schedule() re-read > prev->state vs prev->on_rq = 0. That is, schedule()'s dequeue and > ttwu()'s enqueue disagreed over sched_contributes_to_load. and as a > result load-accounting went wobbly. > > Now, looking at that commit again, I might've solved the problem twice > :-P [...] > That said, in a crossed email, I just proposed we could simplify all > this like so.. but now I need to go ask people to re-validate that > loadavg muck again :-/ After a two hour "quick" sanity test I then gave it a full 7h run (which always seemed to break before dbfb089d360) and I didn't see any stuck load average with master from today + this change. Paul. root@t5610:/home/paul/git/linux-head# [1]+ Donenohup tools/testing/selftests/rcutorture/bin/kvm.sh --cpus 24 --duration 420 --configs TREE03 --trust-make > /tmp/kvm.sh.out 2>&1 root@t5610:/home/paul/git/linux-head# cat /proc/version Linux version 5.8.0-rc6-1-g5714ee50bb43-dirty (paul@t5610) (gcc (Ubuntu 9.3.0-10ubuntu2) 9.3.0, GNU ld (GNU Binutils for Ubuntu) 2.34) #3 SMP Mon Jul 20 12:30:33 EDT 2020 root@t5610:/home/paul/git/linux-head# uptime 00:49:18 up 7:41, 2 users, load average: 0.01, 0.00, 0.63 root@t5610:/home/paul/git/linux-head# -- > > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index a2a244af9a53..437fc3b241f2 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -4193,9 +4193,6 @@ static void __sched notrace __schedule(bool preempt) > local_irq_disable(); > rcu_note_context_switch(preempt); > > - /* See deactivate_task() below. */ > - prev_state = prev->state; > - > /* >* Make sure that signal_pending_state()->signal_pending() below >* can't be reordered with __set_current_state(TASK_INTERRUPTIBLE) > @@ -4223,7 +4220,8 @@ static void __sched notrace __schedule(bool preempt) >* We must re-load prev->state in case ttwu_remote() changed it >* before we acquired rq->lock. >*/ > - if (!preempt && prev_state && prev_state == prev->state) { > + prev_state = prev->state; > + if (!preempt && prev_state) { > if (signal_pending_state(prev_state, prev)) { > prev->state = TASK_RUNNING; > } else { > @@ -4237,10 +4235,12 @@ static void __sched notrace __schedule(bool preempt) > > /* >* __schedule() ttwu() > - * prev_state = prev->state;if > (READ_ONCE(p->on_rq) && ...) > - * LOCK rq->lock goto out; > - * smp_mb__after_spinlock(); > smp_acquire__after_ctrl_dep(); > - * p->on_rq = 0;p->state = > TASK_WAKING; > + * if (prev_state) if (p->on_rq && ...) > + * p->on_rq = 0;goto out; > + * > smp_acquire__after_ctrl_dep(); > + *p->state = TASK_WAKING > + * > + * Where __schedule() and ttwu() have matching control > dependencies. >* >* After this, schedule() must not care about p->state > any more. >*/
Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917
On Mon, Jul 20, 2020 at 05:35:15PM +0200, Oleg Nesterov wrote: > On 07/20, Oleg Nesterov wrote: > > > > On 07/20, Peter Zijlstra wrote: > > > > > > --- a/kernel/sched/core.c > > > +++ b/kernel/sched/core.c > > > @@ -4193,9 +4193,6 @@ static void __sched notrace __schedule(bool preempt) > > > local_irq_disable(); > > > rcu_note_context_switch(preempt); > > > > > > - /* See deactivate_task() below. */ > > > - prev_state = prev->state; > > > - > > > /* > > >* Make sure that signal_pending_state()->signal_pending() below > > >* can't be reordered with __set_current_state(TASK_INTERRUPTIBLE) > > > @@ -4223,7 +4220,8 @@ static void __sched notrace __schedule(bool preempt) > > >* We must re-load prev->state in case ttwu_remote() changed it > > >* before we acquired rq->lock. > > >*/ > > > - if (!preempt && prev_state && prev_state == prev->state) { > > > + prev_state = prev->state; > > > + if (!preempt && prev_state) { > > > > Heh ;) Peter, you know what? I did the same change and tried to understand > > why it is wrong and what have I missed. > > > > Thanks, now I can relax. But my head hurts too, I'll probably try to re-read > > this code and other emails from you tomorrow. > > Yes, I can no longer read this code today ;) > > but now it seems to me that (in theory) we need READ_ONCE(prev->state) here > and probably WRITE_ONCE(on_rq) in deactivate_task() to ensure ctrl-dep? > > Probably not, I got lost. > Probably not, I got lost. > Probably not, I got lost. So, task_struct::state is declared volatile (we should probably 'fix' that some day), so that doesn't require READ_ONCE() -- in fact, that caused a bunch of re-reads in the old code which made the loadavg race more likely. ->on_rq is only ever written 0,1,2, there's no possibe store-tearing. But possibly, yes, WRITE_ONCE() would be nicer.
Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917
On 07/20, Oleg Nesterov wrote: > > On 07/20, Peter Zijlstra wrote: > > > > --- a/kernel/sched/core.c > > +++ b/kernel/sched/core.c > > @@ -4193,9 +4193,6 @@ static void __sched notrace __schedule(bool preempt) > > local_irq_disable(); > > rcu_note_context_switch(preempt); > > > > - /* See deactivate_task() below. */ > > - prev_state = prev->state; > > - > > /* > > * Make sure that signal_pending_state()->signal_pending() below > > * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE) > > @@ -4223,7 +4220,8 @@ static void __sched notrace __schedule(bool preempt) > > * We must re-load prev->state in case ttwu_remote() changed it > > * before we acquired rq->lock. > > */ > > - if (!preempt && prev_state && prev_state == prev->state) { > > + prev_state = prev->state; > > + if (!preempt && prev_state) { > > Heh ;) Peter, you know what? I did the same change and tried to understand > why it is wrong and what have I missed. > > Thanks, now I can relax. But my head hurts too, I'll probably try to re-read > this code and other emails from you tomorrow. Yes, I can no longer read this code today ;) but now it seems to me that (in theory) we need READ_ONCE(prev->state) here and probably WRITE_ONCE(on_rq) in deactivate_task() to ensure ctrl-dep? Probably not, I got lost. Probably not, I got lost. Probably not, I got lost. Oleg.
Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917
On 07/20, Peter Zijlstra wrote: > > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -4193,9 +4193,6 @@ static void __sched notrace __schedule(bool preempt) > local_irq_disable(); > rcu_note_context_switch(preempt); > > - /* See deactivate_task() below. */ > - prev_state = prev->state; > - > /* >* Make sure that signal_pending_state()->signal_pending() below >* can't be reordered with __set_current_state(TASK_INTERRUPTIBLE) > @@ -4223,7 +4220,8 @@ static void __sched notrace __schedule(bool preempt) >* We must re-load prev->state in case ttwu_remote() changed it >* before we acquired rq->lock. >*/ > - if (!preempt && prev_state && prev_state == prev->state) { > + prev_state = prev->state; > + if (!preempt && prev_state) { Heh ;) Peter, you know what? I did the same change and tried to understand why it is wrong and what have I missed. Thanks, now I can relax. But my head hurts too, I'll probably try to re-read this code and other emails from you tomorrow. Oleg.
Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917
On 20/07/20 14:17, pet...@infradead.org wrote: > On Mon, Jul 20, 2020 at 01:20:26PM +0100, Valentin Schneider wrote: >> On 20/07/20 12:26, pet...@infradead.org wrote: > >> > + /* >> > + * We must re-load prev->state in case ttwu_remote() changed it >> > + * before we acquired rq->lock. >> > + */ >> > + tmp_state = prev->state; >> > + if (unlikely(prev_state != tmp_state)) { >> > + /* >> > + * ptrace_{,un}freeze_traced() think it is cool to change >> > + * ->state around behind our backs between TASK_TRACED and >> > + * __TASK_TRACED. >> > + * >> > + * This is safe because this, as well as any __TASK_TRACED >> > + * wakeups are under siglock. >> > + * >> > + * For any other case, a changed prev_state must be to >> > + * TASK_RUNNING, such that when it blocks, the load has >> > + * happened before the smp_mb(). >> > + * >> > + * Also see the comment with deactivate_task(). >> > + */ >> > + SCHED_WARN_ON(tmp_state && (prev_state & __TASK_TRACED && >> > + !(tmp_state & __TASK_TRACED))); >> > + >> >> IIUC if the state changed and isn't TASK_RUNNING it *has* to have >> __TASK_TRACED, so can't that be >> >> SCHED_WARN_ON(tmp_state && !(tmp_state & __TASK_TRACED)); > > Suppose task->state == TASK_UNINTERRUPTIBLE, and task != current, and > then someone goes and does task->state = __TASK_TRACED. > > That is, your statement is correct given the current code, but we also > want to verify no new code comes along and does something 'creative'. > That is what I was trying to go for; AFAICT your approach only warns if __TASK_TRACED gets removed between the two loads (IOW it was already there). The way I was seeing it is: - We only get here if prev_state != tmp_state; IOW we know we raced with something - if (tmp_state), then it wasn't with ttwu_remote() - thus it must only be with ptrace shenanigans, IOW __TASK_TRACED must be there. Now, what I suggested still doesn't detect what you pointed out, or some crazier thing that sets __TASK_TRACED *and* some other stuff. IIUC the ptrace transformation does TASK_TRACED -> __TASK_TRACED, so we could have it as: /* TODO: name me */ #define foobar TASK_TRACED ^ __TASK_TRACED ... /* not TASK_RUNNING; check against allowed transformations SCHED_WARN_ON(tmp_state && ((prev_state ^ tmp_state) & ~foobar)); That said... > Or is the heat getting to me? ... that may apply to me as well :-)
Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917
On Mon, Jul 20, 2020 at 04:02:24PM +0200, Oleg Nesterov wrote: > I have to admit, I do not understand the usage of prev_state in schedule(), > it looks really, really subtle... Right, so commit dbfb089d360 solved a problem where schedule() re-read prev->state vs prev->on_rq = 0. That is, schedule()'s dequeue and ttwu()'s enqueue disagreed over sched_contributes_to_load. and as a result load-accounting went wobbly. Now, looking at that commit again, I might've solved the problem twice :-P So on the one hand, I provider ordering: LOAD p->state LOAD-ACQUIRE p->on_rq == 0 MB STORE p->on_rq, 0 STORE p->state, TASK_WAKING such that ttwu() will only change p->state, after on_rq==0, which is after loading p->state in schedule(). At the same time, I also had schedule() set p->sched_contributes_to_load once, and then consistently used that value throughout, without ever looking at p->state again, which too makes it much harder to mess load-avg up. Now, the ordering in schedule(), relies on doing the p->state load before: spin_lock(rq->lock) smp_mb__after_spinlock(); and doing a re-load check after, with the assumption that if the reload is different, it will not block. That said, in a crossed email, I just proposed we could simplify all this like so.. but now I need to go ask people to re-validate that loadavg muck again :-/ diff --git a/kernel/sched/core.c b/kernel/sched/core.c index a2a244af9a53..437fc3b241f2 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4193,9 +4193,6 @@ static void __sched notrace __schedule(bool preempt) local_irq_disable(); rcu_note_context_switch(preempt); - /* See deactivate_task() below. */ - prev_state = prev->state; - /* * Make sure that signal_pending_state()->signal_pending() below * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE) @@ -4223,7 +4220,8 @@ static void __sched notrace __schedule(bool preempt) * We must re-load prev->state in case ttwu_remote() changed it * before we acquired rq->lock. */ - if (!preempt && prev_state && prev_state == prev->state) { + prev_state = prev->state; + if (!preempt && prev_state) { if (signal_pending_state(prev_state, prev)) { prev->state = TASK_RUNNING; } else { @@ -4237,10 +4235,12 @@ static void __sched notrace __schedule(bool preempt) /* * __schedule() ttwu() -* prev_state = prev->state;if (READ_ONCE(p->on_rq) && ...) -* LOCK rq->lock goto out; -* smp_mb__after_spinlock(); smp_acquire__after_ctrl_dep(); -* p->on_rq = 0;p->state = TASK_WAKING; +* if (prev_state) if (p->on_rq && ...) +* p->on_rq = 0;goto out; +* smp_acquire__after_ctrl_dep(); +*p->state = TASK_WAKING +* +* Where __schedule() and ttwu() have matching control dependencies. * * After this, schedule() must not care about p->state any more. */
Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917
On Mon, Jul 20, 2020 at 01:26:23PM +0200, pet...@infradead.org wrote: > kernel/sched/core.c | 34 -- > 1 file changed, 28 insertions(+), 6 deletions(-) > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index e15543cb84812..b5973d7fa521c 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -4100,9 +4100,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, > struct rq_flags *rf) > */ > static void __sched notrace __schedule(bool preempt) > { > + unsigned long prev_state, tmp_state; > struct task_struct *prev, *next; > unsigned long *switch_count; > - unsigned long prev_state; > struct rq_flags rf; > struct rq *rq; > int cpu; > @@ -4140,16 +4140,38 @@ static void __sched notrace __schedule(bool preempt) > rq_lock(rq, ); > smp_mb__after_spinlock(); > > + /* > + * We must re-load prev->state in case ttwu_remote() changed it > + * before we acquired rq->lock. > + */ > + tmp_state = prev->state; > + if (unlikely(prev_state != tmp_state)) { > + /* > + * ptrace_{,un}freeze_traced() think it is cool to change > + * ->state around behind our backs between TASK_TRACED and > + * __TASK_TRACED. > + * > + * This is safe because this, as well as any __TASK_TRACED > + * wakeups are under siglock. > + * > + * For any other case, a changed prev_state must be to > + * TASK_RUNNING, such that when it blocks, the load has > + * happened before the smp_mb(). > + * > + * Also see the comment with deactivate_task(). > + */ > + SCHED_WARN_ON(tmp_state && (prev_state & __TASK_TRACED && > +!(tmp_state & __TASK_TRACED))); > + > + prev_state = tmp_state; While trying to write a changelog for this thing, I can't convince myself we don't need: smp_mb(); here. Consider: CPU0CPU1CPU2 schedule() prev_state = prev->state; spin_lock(rq->lock); smp_mb__after_spin_lock(); ptrace_freeze_traced() spin_lock(siglock) task->state = __TASK_TRACED; spin_unlock(siglock); tmp_state = prev->state; if (prev_state != tmp_state) prev_state = tmp_state; /* NO SMP_MB */ if (prev_state) deactivate_task() prev->on_rq = 0; spin_lock(siglock); ttwu() if (rq->on_rq && ...) goto unlock; smp_acquire__after_ctrl_dep(); p->state = TASK_WAKING; Looses the ordering we previously relied upon. That is, CPU1's prev->state load and prev->on_rq store can get reordered vs CPU2. OTOH, we have a control dependency on CPU1 as well, that should provide LOAD->STORE ordering, after all we only do the ->on_rq=0 store, IFF we see prev_state. So that is: if (p->state) if (!p->on_rq) p->on_rq = 0; p->state = TASK_WAKING which matches a CTRL-DEP to a CTRL-DEP ... But this then means we can simplify dbfb089d360 as well, but now my head hurts. > + } > + > /* Promote REQ to ACT */ > rq->clock_update_flags <<= 1; > update_rq_clock(rq); > > switch_count = >nivcsw; > - /* > - * We must re-load prev->state in case ttwu_remote() changed it > - * before we acquired rq->lock. > - */ > - if (!preempt && prev_state && prev_state == prev->state) { > + if (!preempt && prev_state) { > if (signal_pending_state(prev_state, prev)) { > prev->state = TASK_RUNNING; > } else {
Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917
On 07/20, Peter Zijlstra wrote: > > Also, is there any way to not have ptrace do this? Well, we need to ensure that even SIGKILL can't wake the tracee up while debugger plays with its registers/etc. > How performance > critical is this ptrace path? This is a slow path. We can probably change ptrace_check_attach() to call ptrace_freeze_traced() after wait_task_inactive(), but I would like to not do this... Because we actually want to avoid wait_task_inactive() when possible. Perhaps ptrace_freeze_traced() can __task_rq_lock() to avoid the race with __schedule() ? No, it reads prev_state before rq_lock(). > Because I really hate having to add code > to __schedule() to deal with this horrible thing. Oh yes, I agree. I have to admit, I do not understand the usage of prev_state in schedule(), it looks really, really subtle... Oleg.
Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917
On Mon, Jul 20, 2020 at 01:20:26PM +0100, Valentin Schneider wrote: > On 20/07/20 12:26, pet...@infradead.org wrote: > > + /* > > +* We must re-load prev->state in case ttwu_remote() changed it > > +* before we acquired rq->lock. > > +*/ > > + tmp_state = prev->state; > > + if (unlikely(prev_state != tmp_state)) { > > + /* > > +* ptrace_{,un}freeze_traced() think it is cool to change > > +* ->state around behind our backs between TASK_TRACED and > > +* __TASK_TRACED. > > +* > > +* This is safe because this, as well as any __TASK_TRACED > > +* wakeups are under siglock. > > +* > > +* For any other case, a changed prev_state must be to > > +* TASK_RUNNING, such that when it blocks, the load has > > +* happened before the smp_mb(). > > +* > > +* Also see the comment with deactivate_task(). > > +*/ > > + SCHED_WARN_ON(tmp_state && (prev_state & __TASK_TRACED && > > + !(tmp_state & __TASK_TRACED))); > > + > > IIUC if the state changed and isn't TASK_RUNNING it *has* to have > __TASK_TRACED, so can't that be > > SCHED_WARN_ON(tmp_state && !(tmp_state & __TASK_TRACED)); Suppose task->state == TASK_UNINTERRUPTIBLE, and task != current, and then someone goes and does task->state = __TASK_TRACED. That is, your statement is correct given the current code, but we also want to verify no new code comes along and does something 'creative'. Or is the heat getting to me?
Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917
On Mon, Jul 20, 2020 at 01:26:23PM +0200, pet...@infradead.org wrote: > On Mon, Jul 20, 2020 at 12:59:24PM +0200, pet...@infradead.org wrote: > > On Mon, Jul 20, 2020 at 10:41:06AM +0200, Peter Zijlstra wrote: > > > On Mon, Jul 20, 2020 at 10:26:58AM +0200, Oleg Nesterov wrote: > > > > Peter, > > > > > > > > Let me add another note. TASK_TRACED/TASK_STOPPED was always protected > > > > by > > > > ->siglock. In particular, ttwu(__TASK_TRACED) must be always called with > > > > ->siglock held. That is why ptrace_freeze_traced() assumes it can safely > > > > do s/TASK_TRACED/__TASK_TRACED/ under spin_lock(siglock). > > > > > > > > Can this change race with > > > > > > > > if (signal_pending_state(prev->state, prev)) { > > > > prev->state = TASK_RUNNING; > > > > } > > > > > > > > in __schedule() ? Hopefully not, signal-state is protected by siglock > > > > too. > > > > > > > > So I think this logic was correct even if it doesn't look nice. But > > > > "doesn't > > > > look nice" is true for the whole ptrace code ;) > > > > > > *groan*... another bit of obscure magic :-( > > > > > > let me go try and wake up and figure out how best to deal with this. > > This then? That seems to survive the strace thing. Ran a mainline v5.8-rc6 -> reproduced crash Ran a mainling v5.8-rc6 + patch below -> testsuite runs fine/not able to reproduce crash So fwiw Tested-by: Christian Brauner Acked-by: Christian Brauner > > --- > kernel/sched/core.c | 34 -- > 1 file changed, 28 insertions(+), 6 deletions(-) > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index e15543cb84812..b5973d7fa521c 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -4100,9 +4100,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, > struct rq_flags *rf) > */ > static void __sched notrace __schedule(bool preempt) > { > + unsigned long prev_state, tmp_state; > struct task_struct *prev, *next; > unsigned long *switch_count; > - unsigned long prev_state; > struct rq_flags rf; > struct rq *rq; > int cpu; > @@ -4140,16 +4140,38 @@ static void __sched notrace __schedule(bool preempt) > rq_lock(rq, ); > smp_mb__after_spinlock(); > > + /* > + * We must re-load prev->state in case ttwu_remote() changed it > + * before we acquired rq->lock. > + */ > + tmp_state = prev->state; > + if (unlikely(prev_state != tmp_state)) { > + /* > + * ptrace_{,un}freeze_traced() think it is cool to change > + * ->state around behind our backs between TASK_TRACED and > + * __TASK_TRACED. > + * > + * This is safe because this, as well as any __TASK_TRACED > + * wakeups are under siglock. > + * > + * For any other case, a changed prev_state must be to > + * TASK_RUNNING, such that when it blocks, the load has > + * happened before the smp_mb(). > + * > + * Also see the comment with deactivate_task(). > + */ > + SCHED_WARN_ON(tmp_state && (prev_state & __TASK_TRACED && > +!(tmp_state & __TASK_TRACED))); > + > + prev_state = tmp_state; > + } > + > /* Promote REQ to ACT */ > rq->clock_update_flags <<= 1; > update_rq_clock(rq); > > switch_count = >nivcsw; > - /* > - * We must re-load prev->state in case ttwu_remote() changed it > - * before we acquired rq->lock. > - */ > - if (!preempt && prev_state && prev_state == prev->state) { > + if (!preempt && prev_state) { > if (signal_pending_state(prev_state, prev)) { > prev->state = TASK_RUNNING; > } else {
Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917
On 20/07/20 12:26, pet...@infradead.org wrote: > --- > kernel/sched/core.c | 34 -- > 1 file changed, 28 insertions(+), 6 deletions(-) > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index e15543cb84812..b5973d7fa521c 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -4100,9 +4100,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, > struct rq_flags *rf) > */ > static void __sched notrace __schedule(bool preempt) > { > + unsigned long prev_state, tmp_state; > struct task_struct *prev, *next; > unsigned long *switch_count; > - unsigned long prev_state; > struct rq_flags rf; > struct rq *rq; > int cpu; > @@ -4140,16 +4140,38 @@ static void __sched notrace __schedule(bool preempt) > rq_lock(rq, ); > smp_mb__after_spinlock(); > > + /* > + * We must re-load prev->state in case ttwu_remote() changed it > + * before we acquired rq->lock. > + */ > + tmp_state = prev->state; > + if (unlikely(prev_state != tmp_state)) { > + /* > + * ptrace_{,un}freeze_traced() think it is cool to change > + * ->state around behind our backs between TASK_TRACED and > + * __TASK_TRACED. > + * > + * This is safe because this, as well as any __TASK_TRACED > + * wakeups are under siglock. > + * > + * For any other case, a changed prev_state must be to > + * TASK_RUNNING, such that when it blocks, the load has > + * happened before the smp_mb(). > + * > + * Also see the comment with deactivate_task(). > + */ > + SCHED_WARN_ON(tmp_state && (prev_state & __TASK_TRACED && > +!(tmp_state & __TASK_TRACED))); > + IIUC if the state changed and isn't TASK_RUNNING it *has* to have __TASK_TRACED, so can't that be SCHED_WARN_ON(tmp_state && !(tmp_state & __TASK_TRACED)); Other than that, LGTM. > + prev_state = tmp_state; > + } > + > /* Promote REQ to ACT */ > rq->clock_update_flags <<= 1; > update_rq_clock(rq); > > switch_count = >nivcsw; > - /* > - * We must re-load prev->state in case ttwu_remote() changed it > - * before we acquired rq->lock. > - */ > - if (!preempt && prev_state && prev_state == prev->state) { > + if (!preempt && prev_state) { > if (signal_pending_state(prev_state, prev)) { > prev->state = TASK_RUNNING; > } else {
Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917
On 20. 07. 20, 13:26, pet...@infradead.org wrote: > On Mon, Jul 20, 2020 at 12:59:24PM +0200, pet...@infradead.org wrote: >> On Mon, Jul 20, 2020 at 10:41:06AM +0200, Peter Zijlstra wrote: >>> On Mon, Jul 20, 2020 at 10:26:58AM +0200, Oleg Nesterov wrote: Peter, Let me add another note. TASK_TRACED/TASK_STOPPED was always protected by ->siglock. In particular, ttwu(__TASK_TRACED) must be always called with ->siglock held. That is why ptrace_freeze_traced() assumes it can safely do s/TASK_TRACED/__TASK_TRACED/ under spin_lock(siglock). Can this change race with if (signal_pending_state(prev->state, prev)) { prev->state = TASK_RUNNING; } in __schedule() ? Hopefully not, signal-state is protected by siglock too. So I think this logic was correct even if it doesn't look nice. But "doesn't look nice" is true for the whole ptrace code ;) >>> >>> *groan*... another bit of obscure magic :-( >>> >>> let me go try and wake up and figure out how best to deal with this. > > This then? That seems to survive the strace thing. FWIW for me too. thanks, -- js
Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917
On Mon, Jul 20, 2020 at 12:59:24PM +0200, pet...@infradead.org wrote: > On Mon, Jul 20, 2020 at 10:41:06AM +0200, Peter Zijlstra wrote: > > On Mon, Jul 20, 2020 at 10:26:58AM +0200, Oleg Nesterov wrote: > > > Peter, > > > > > > Let me add another note. TASK_TRACED/TASK_STOPPED was always protected by > > > ->siglock. In particular, ttwu(__TASK_TRACED) must be always called with > > > ->siglock held. That is why ptrace_freeze_traced() assumes it can safely > > > do s/TASK_TRACED/__TASK_TRACED/ under spin_lock(siglock). > > > > > > Can this change race with > > > > > > if (signal_pending_state(prev->state, prev)) { > > > prev->state = TASK_RUNNING; > > > } > > > > > > in __schedule() ? Hopefully not, signal-state is protected by siglock too. > > > > > > So I think this logic was correct even if it doesn't look nice. But > > > "doesn't > > > look nice" is true for the whole ptrace code ;) > > > > *groan*... another bit of obscure magic :-( > > > > let me go try and wake up and figure out how best to deal with this. This then? That seems to survive the strace thing. --- kernel/sched/core.c | 34 -- 1 file changed, 28 insertions(+), 6 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index e15543cb84812..b5973d7fa521c 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4100,9 +4100,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) */ static void __sched notrace __schedule(bool preempt) { + unsigned long prev_state, tmp_state; struct task_struct *prev, *next; unsigned long *switch_count; - unsigned long prev_state; struct rq_flags rf; struct rq *rq; int cpu; @@ -4140,16 +4140,38 @@ static void __sched notrace __schedule(bool preempt) rq_lock(rq, ); smp_mb__after_spinlock(); + /* +* We must re-load prev->state in case ttwu_remote() changed it +* before we acquired rq->lock. +*/ + tmp_state = prev->state; + if (unlikely(prev_state != tmp_state)) { + /* +* ptrace_{,un}freeze_traced() think it is cool to change +* ->state around behind our backs between TASK_TRACED and +* __TASK_TRACED. +* +* This is safe because this, as well as any __TASK_TRACED +* wakeups are under siglock. +* +* For any other case, a changed prev_state must be to +* TASK_RUNNING, such that when it blocks, the load has +* happened before the smp_mb(). +* +* Also see the comment with deactivate_task(). +*/ + SCHED_WARN_ON(tmp_state && (prev_state & __TASK_TRACED && + !(tmp_state & __TASK_TRACED))); + + prev_state = tmp_state; + } + /* Promote REQ to ACT */ rq->clock_update_flags <<= 1; update_rq_clock(rq); switch_count = >nivcsw; - /* -* We must re-load prev->state in case ttwu_remote() changed it -* before we acquired rq->lock. -*/ - if (!preempt && prev_state && prev_state == prev->state) { + if (!preempt && prev_state) { if (signal_pending_state(prev_state, prev)) { prev->state = TASK_RUNNING; } else {
Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917
On Mon, Jul 20, 2020 at 10:41:06AM +0200, Peter Zijlstra wrote: > On Mon, Jul 20, 2020 at 10:26:58AM +0200, Oleg Nesterov wrote: > > Peter, > > > > Let me add another note. TASK_TRACED/TASK_STOPPED was always protected by > > ->siglock. In particular, ttwu(__TASK_TRACED) must be always called with > > ->siglock held. That is why ptrace_freeze_traced() assumes it can safely > > do s/TASK_TRACED/__TASK_TRACED/ under spin_lock(siglock). > > > > Can this change race with > > > > if (signal_pending_state(prev->state, prev)) { > > prev->state = TASK_RUNNING; > > } > > > > in __schedule() ? Hopefully not, signal-state is protected by siglock too. > > > > So I think this logic was correct even if it doesn't look nice. But "doesn't > > look nice" is true for the whole ptrace code ;) > > *groan*... another bit of obscure magic :-( > > let me go try and wake up and figure out how best to deal with this. So clearly I'm still missing something, the below results in: [ 63.760863] [ cut here ] [ 63.766019] !(tmp_state & __TASK_TRACED) [ 63.766030] WARNING: CPU: 33 PID: 33801 at kernel/sched/core.c:4158 __schedule+0x6bd/0x8e0 Also, is there any way to not have ptrace do this? How performance critical is this ptrace path? Because I really hate having to add code to __schedule() to deal with this horrible thing. --- diff --git a/kernel/sched/core.c b/kernel/sched/core.c index e15543cb84812..f65a801d268b6 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4100,9 +4100,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) */ static void __sched notrace __schedule(bool preempt) { + unsigned long prev_state, tmp_state; struct task_struct *prev, *next; unsigned long *switch_count; - unsigned long prev_state; struct rq_flags rf; struct rq *rq; int cpu; @@ -4140,16 +4140,38 @@ static void __sched notrace __schedule(bool preempt) rq_lock(rq, ); smp_mb__after_spinlock(); + /* +* We must re-load prev->state in case ttwu_remote() changed it +* before we acquired rq->lock. +*/ + tmp_state = prev->state; + if (unlikely(prev_state != tmp_state)) { + if (prev_state & __TASK_TRACED) { + /* +* ptrace_{,un}freeze_traced() think it is cool +* to change ->state around behind our backs +* between TASK_TRACED and __TASK_TRACED. +* +* Luckily this transformation doesn't affect +* sched_contributes_to_load. +*/ + SCHED_WARN_ON(!(tmp_state & __TASK_TRACED)); + } else { + /* +* For any other case, a changed prev_state +* must be to TASK_RUNNING, such that... +*/ + SCHED_WARN_ON(tmp_state != TASK_RUNNING); + } + prev_state = tmp_state; + } + /* Promote REQ to ACT */ rq->clock_update_flags <<= 1; update_rq_clock(rq); switch_count = >nivcsw; - /* -* We must re-load prev->state in case ttwu_remote() changed it -* before we acquired rq->lock. -*/ - if (!preempt && prev_state && prev_state == prev->state) { + if (!preempt && prev_state) { if (signal_pending_state(prev_state, prev)) { prev->state = TASK_RUNNING; } else {
Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917
On Mon, Jul 20, 2020 at 10:26:58AM +0200, Oleg Nesterov wrote: > Peter, > > Let me add another note. TASK_TRACED/TASK_STOPPED was always protected by > ->siglock. In particular, ttwu(__TASK_TRACED) must be always called with > ->siglock held. That is why ptrace_freeze_traced() assumes it can safely > do s/TASK_TRACED/__TASK_TRACED/ under spin_lock(siglock). > > Can this change race with > > if (signal_pending_state(prev->state, prev)) { > prev->state = TASK_RUNNING; > } > > in __schedule() ? Hopefully not, signal-state is protected by siglock too. > > So I think this logic was correct even if it doesn't look nice. But "doesn't > look nice" is true for the whole ptrace code ;) *groan*... another bit of obscure magic :-( let me go try and wake up and figure out how best to deal with this.
Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917
Peter, Let me add another note. TASK_TRACED/TASK_STOPPED was always protected by ->siglock. In particular, ttwu(__TASK_TRACED) must be always called with ->siglock held. That is why ptrace_freeze_traced() assumes it can safely do s/TASK_TRACED/__TASK_TRACED/ under spin_lock(siglock). Can this change race with if (signal_pending_state(prev->state, prev)) { prev->state = TASK_RUNNING; } in __schedule() ? Hopefully not, signal-state is protected by siglock too. So I think this logic was correct even if it doesn't look nice. But "doesn't look nice" is true for the whole ptrace code ;) On 07/20, Oleg Nesterov wrote: > > On 07/20, Jiri Slaby wrote: > > > > You tackled it, we cherry-picked dbfb089d360 to our kernels. Ccing more > > people. > > Thanks... so with this patch __schedule() does > > prev_state = prev->state; > > ... > > if (!preempt && prev_state && prev_state == prev->state) { > if (signal_pending_state(prev_state, prev)) { > prev->state = TASK_RUNNING; > } else { > > and ptrace_freeze_traced() can change ->state in between. This means > that this task can return from __schedule() with ->state != RUNNING, > this can explain BUG_ON(task_is_stopped_or_traced) in do_notify_parent() > you reported. > > Oleg.
Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917
On 07/20, Jiri Slaby wrote: > > On 18. 07. 20, 19:14, Oleg Nesterov wrote: > > > > This is already wrong. But > > > > Where does this __might_sleep() come from ??? I ses no blocking calls > > in ptrace_stop(). Not to mention it is called with ->siglock held and > > right after this lock is dropped we take tasklist_lock. > > Decoded stacktrace: > > > ptrace_stop (include/linux/freezer.h:57 include/linux/freezer.h:67 > > include/linux/freezer.h:128 include/linux/freezer.h:173 > > kernel/signal.c:2217) > > ptrace_do_notify (kernel/signal.c:2272) > > ptrace_notify (arch/x86/include/asm/paravirt.h:656 > > arch/x86/include/asm/qspinlock.h:55 include/linux/spinlock.h:211 > > include/linux/spinlock_api_smp.h:167 include/linux/spinlock.h:403 > > kernel/signal.c:2282) > > syscall_trace_enter (include/linux/tracehook.h:73 > > include/linux/tracehook.h:104 arch/x86/entry/common.c:159) > > do_syscall_64 (arch/x86/entry/common.c:380) > > entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:124) > > It is try_to_freeze_unsafe in try_to_freeze in freezable_schedule in > ptrace_stop. Aha, thanks a lot! try_to_freeze_unsafe() is called after schedule() which must return with ->state = RUNNING, so this matches another WARN_ON(current->state) added by debugging patch after freezable_schedule(). Somehow I decided __might_sleep() was called before read_unlock/schedule. > > > > How this connects to the debugging patch I sent? Did you see this > > warning > > without that patch? > > I suppose this made it appear: > +CONFIG_PREEMPT_COUNT=y > -# CONFIG_DEBUG_ATOMIC_SLEEP is not set > +CONFIG_DEBUG_ATOMIC_SLEEP=y I see, > Please see my other e-mail, all this is with dbfb089d360b applied. Maybe > it makes more sense now? Yes. Thanks Jiri! Oleg.
Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917
On 07/20, Jiri Slaby wrote: > > You tackled it, we cherry-picked dbfb089d360 to our kernels. Ccing more > people. Thanks... so with this patch __schedule() does prev_state = prev->state; ... if (!preempt && prev_state && prev_state == prev->state) { if (signal_pending_state(prev_state, prev)) { prev->state = TASK_RUNNING; } else { and ptrace_freeze_traced() can change ->state in between. This means that this task can return from __schedule() with ->state != RUNNING, this can explain BUG_ON(task_is_stopped_or_traced) in do_notify_parent() you reported. Oleg.
Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917
On 18. 07. 20, 19:14, Oleg Nesterov wrote: > On 07/18, Jiri Slaby wrote: >> >> On 17. 07. 20, 14:40, Oleg Nesterov wrote: >>> >>> please see the updated patch below, lets check ptrace_unfreeze() too. >> >> Sure, dmesg attached. > > Thanks a lot! > > But I am totally confused... > >> [ 94.513944] [ cut here ] >> [ 94.513985] do not call blocking ops when !TASK_RUNNING; state=8 set at >> [<2fe279e9>] ptrace_check_attach+0xbf/0x110 > > OK, so the ptracer has already did the TASK_TRACED -> __TASK_TRACED change in > ptrace_freeze_traced(), > >> [ 94.514019] WARNING: CPU: 16 PID: 34171 at kernel/sched/core.c:6881 >> __might_sleep+0x6c/0x70 >> [ 94.514020] Modules linked in: ata_generic(E) pata_acpi(E) >> crc32_pclmul(E) qemu_fw_cfg(E) ata_piix(E) e1000(E) nls_iso8859_1(E) >> nls_cp437(E) vfat(E) fat(E) virtio_blk(E) virtio_mmio(E) xfs(E) btrfs(E) >> blake2b_generic(E) xor(E) raid6_pq(E) libcrc32c(E) crc32c_intel(E) >> reiserfs(E) squashfs(E) fuse(E) dm_snapshot(E) dm_bufio(E) dm_crypt(E) >> dm_mod(E) binfmt_misc(E) loop(E) sg(E) virtio_rng(E) >> [ 94.514082] CPU: 16 PID: 34171 Comm: strace Tainted: GE >> 5.8.0-rc5-100.g55927f9-default #1 openSUSE Tumbleweed (unreleased) >> [ 94.514084] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS >> rel-1.12.0-0-ga698c89-rebuilt.suse.com 04/01/2014 >> [ 94.514087] RIP: 0010:__might_sleep+0x6c/0x70 >> [ 94.514090] Code: 41 5c 41 5d e9 25 ff ff ff 48 8b 90 68 1a 00 00 48 8b >> 70 10 48 c7 c7 10 45 70 8f c6 05 4f a9 68 01 01 48 89 d1 e8 7a bb fc ff <0f> >> 0b eb c8 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 49 89 fd >> [ 94.514092] RSP: 0018:9ffa4ba1be00 EFLAGS: 00010286 >> [ 94.514093] RAX: RBX: 8dc82b503e00 RCX: >> 0489 >> [ 94.514094] RDX: 0001 RSI: 0096 RDI: >> 0247 >> [ 94.514095] RBP: 8f6ffa6b R08: 0004 R09: >> 0489 >> [ 94.514095] R10: R11: 0001 R12: >> 0039 >> [ 94.514096] R13: R14: 0001 R15: >> 8dc82b5045e4 >> [ 94.514098] FS: 7fa00f1f9240() GS:8dcb0c00() >> knlGS: >> [ 94.514099] CS: 0010 DS: ES: CR0: 80050033 >> [ 94.514100] CR2: 557b53d25877 CR3: 0004ca490005 CR4: >> 00360ee0 >> [ 94.514103] DR0: DR1: DR2: >> >> [ 94.514104] DR3: DR6: fffe0ff0 DR7: >> 0400 >> [ 94.514105] Call Trace: >> [ 94.514821] ptrace_stop+0x1a9/0x300 > > This is already wrong. But > > Where does this __might_sleep() come from ??? I ses no blocking calls > in ptrace_stop(). Not to mention it is called with ->siglock held and > right after this lock is dropped we take tasklist_lock. Decoded stacktrace: > ptrace_stop (include/linux/freezer.h:57 include/linux/freezer.h:67 > include/linux/freezer.h:128 include/linux/freezer.h:173 kernel/signal.c:2217) > > ptrace_do_notify (kernel/signal.c:2272) > ptrace_notify (arch/x86/include/asm/paravirt.h:656 > arch/x86/include/asm/qspinlock.h:55 include/linux/spinlock.h:211 > include/linux/spinlock_api_smp.h:167 include/linux/spinlock.h:403 > kernel/signal.c:2282) > syscall_trace_enter (include/linux/tracehook.h:73 > include/linux/tracehook.h:104 arch/x86/entry/common.c:159) > do_syscall_64 (arch/x86/entry/common.c:380) > entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:124) It is try_to_freeze_unsafe in try_to_freeze in freezable_schedule in ptrace_stop. > > How this connects to the debugging patch I sent? Did you see this > warning > without that patch? I suppose this made it appear: +CONFIG_PREEMPT_COUNT=y -# CONFIG_DEBUG_ATOMIC_SLEEP is not set +CONFIG_DEBUG_ATOMIC_SLEEP=y >> [ 94.514888] WARNING: CPU: 16 PID: 34171 at kernel/signal.c:2219 >> ptrace_stop+0x1d8/0x300 > ... >> [ 94.514899] RIP: 0010:ptrace_stop+0x1d8/0x300 > > This is WARN_ON(current->state) added to ptrace_stop(), this can explain > BUG_ON() in do_notify_parent() you reported. > > So, the tracee returns from schedule() with ->state != TASK_RUNNING ??? > This must not be possible. > > OK, perhaps task->state was changed by ptrace_unfreeze_traced()? this can > only happen if it races with ttwu(__TASK_TRACED) without ->siglock held, > nobody should do this. > > Strange. Please see my other e-mail, all this is with dbfb089d360b applied. Maybe it makes more sense now? thanks, -- js
Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917
On 18. 07. 20, 19:44, Christian Brauner wrote: > On Sat, Jul 18, 2020 at 07:14:07PM +0200, Oleg Nesterov wrote: >> On 07/18, Jiri Slaby wrote: >>> >>> On 17. 07. 20, 14:40, Oleg Nesterov wrote: please see the updated patch below, lets check ptrace_unfreeze() too. >>> >>> Sure, dmesg attached. >> >> Thanks a lot! >> >> But I am totally confused... >> >>> [ 94.513944] [ cut here ] >>> [ 94.513985] do not call blocking ops when !TASK_RUNNING; state=8 set at >>> [<2fe279e9>] ptrace_check_attach+0xbf/0x110 >> >> OK, so the ptracer has already did the TASK_TRACED -> __TASK_TRACED change in >> ptrace_freeze_traced(), >> >>> [ 94.514019] WARNING: CPU: 16 PID: 34171 at kernel/sched/core.c:6881 >>> __might_sleep+0x6c/0x70 >>> [ 94.514020] Modules linked in: ata_generic(E) pata_acpi(E) >>> crc32_pclmul(E) qemu_fw_cfg(E) ata_piix(E) e1000(E) nls_iso8859_1(E) >>> nls_cp437(E) vfat(E) fat(E) virtio_blk(E) virtio_mmio(E) xfs(E) btrfs(E) >>> blake2b_generic(E) xor(E) raid6_pq(E) libcrc32c(E) crc32c_intel(E) >>> reiserfs(E) squashfs(E) fuse(E) dm_snapshot(E) dm_bufio(E) dm_crypt(E) >>> dm_mod(E) binfmt_misc(E) loop(E) sg(E) virtio_rng(E) >>> [ 94.514082] CPU: 16 PID: 34171 Comm: strace Tainted: GE >>> 5.8.0-rc5-100.g55927f9-default #1 openSUSE Tumbleweed (unreleased) >>> [ 94.514084] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS >>> rel-1.12.0-0-ga698c89-rebuilt.suse.com 04/01/2014 >>> [ 94.514087] RIP: 0010:__might_sleep+0x6c/0x70 >>> [ 94.514090] Code: 41 5c 41 5d e9 25 ff ff ff 48 8b 90 68 1a 00 00 48 8b >>> 70 10 48 c7 c7 10 45 70 8f c6 05 4f a9 68 01 01 48 89 d1 e8 7a bb fc ff >>> <0f> 0b eb c8 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 49 89 fd >>> [ 94.514092] RSP: 0018:9ffa4ba1be00 EFLAGS: 00010286 >>> [ 94.514093] RAX: RBX: 8dc82b503e00 RCX: >>> 0489 >>> [ 94.514094] RDX: 0001 RSI: 0096 RDI: >>> 0247 >>> [ 94.514095] RBP: 8f6ffa6b R08: 0004 R09: >>> 0489 >>> [ 94.514095] R10: R11: 0001 R12: >>> 0039 >>> [ 94.514096] R13: R14: 0001 R15: >>> 8dc82b5045e4 >>> [ 94.514098] FS: 7fa00f1f9240() GS:8dcb0c00() >>> knlGS: >>> [ 94.514099] CS: 0010 DS: ES: CR0: 80050033 >>> [ 94.514100] CR2: 557b53d25877 CR3: 0004ca490005 CR4: >>> 00360ee0 >>> [ 94.514103] DR0: DR1: DR2: >>> >>> [ 94.514104] DR3: DR6: fffe0ff0 DR7: >>> 0400 >>> [ 94.514105] Call Trace: >>> [ 94.514821] ptrace_stop+0x1a9/0x300 >> >> This is already wrong. But >> >> Where does this __might_sleep() come from ??? I ses no blocking calls >> in ptrace_stop(). Not to mention it is called with ->siglock held and >> right after this lock is dropped we take tasklist_lock. >> >> How this connects to the debugging patch I sent? Did you see this >> warning >> without that patch? >> >> >>> [ 94.514888] WARNING: CPU: 16 PID: 34171 at kernel/signal.c:2219 >>> ptrace_stop+0x1d8/0x300 >> ... >>> [ 94.514899] RIP: 0010:ptrace_stop+0x1d8/0x300 >> >> This is WARN_ON(current->state) added to ptrace_stop(), this can explain >> BUG_ON() in do_notify_parent() you reported. >> >> So, the tracee returns from schedule() with ->state != TASK_RUNNING ??? >> This must not be possible. >> >> OK, perhaps task->state was changed by ptrace_unfreeze_traced()? this can >> only happen if it races with ttwu(__TASK_TRACED) without ->siglock held, >> nobody should do this. >> >> Strange. > > I have tried to reproduce this with an vanilla upstream 5.8-rc4 and the > strace test-suite with > > make check -j4 > > and I wasn't able to after multiple retries. Jiri, just to make sure > this is upstream 5.8-rc4 without any additional patches? You tackled it, we cherry-picked dbfb089d360 to our kernels. Ccing more people. So if you try with pure vanilla 5.8-rc6 (it contains the commit), you should be able to reproduce. I am. OTOH 5.8-rc6 with that commit reverted -- I cannot reproduce there. So it must be it. > Anything special required to reproduce this in the way you run strace > and so on? Nothing special there. Except the HW: when I run a VM with 16+ processors, it's much more likely to be hit (usually, on the first run of make check). thanks, -- js
Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917
Hi Hillf, On 07/19, Hillf Danton wrote: > > Dunno if the wheel prior to JOBCTL_TASK_WORK helps debug the warnings. > > --- a/kernel/signal.c > +++ b/kernel/signal.c > @@ -2541,7 +2541,7 @@ bool get_signal(struct ksignal *ksig) > > relock: > spin_lock_irq(>siglock); > - current->jobctl &= ~JOBCTL_TASK_WORK; > + task_clear_jobctl_pending(current, JOBCTL_TASK_WORK); > if (unlikely(current->task_works)) { > spin_unlock_irq(>siglock); > task_work_run(); > --- a/kernel/task_work.c > +++ b/kernel/task_work.c > @@ -43,8 +43,8 @@ task_work_add(struct task_struct *task, > break; > case TWA_SIGNAL: > if (lock_task_sighand(task, )) { > - task->jobctl |= JOBCTL_TASK_WORK; > - signal_wake_up(task, 0); > + if (task_set_jobctl_pending(task, JOBCTL_TASK_WORK)) > + recalc_sigpending_and_wake(task); > unlock_task_sighand(task, ); To be be honest I don't understand why do you think this makes any sense... But this doesn't matter, please note that the Jiri tested the kernel with e91b48162332480f5 (which added JOBCTL_TASK_WORK) reverted. Oleg.
Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917
On Sat, Jul 18, 2020 at 07:14:07PM +0200, Oleg Nesterov wrote: > On 07/18, Jiri Slaby wrote: > > > > On 17. 07. 20, 14:40, Oleg Nesterov wrote: > > > > > > please see the updated patch below, lets check ptrace_unfreeze() too. > > > > Sure, dmesg attached. > > Thanks a lot! > > But I am totally confused... > > > [ 94.513944] [ cut here ] > > [ 94.513985] do not call blocking ops when !TASK_RUNNING; state=8 set at > > [<2fe279e9>] ptrace_check_attach+0xbf/0x110 > > OK, so the ptracer has already did the TASK_TRACED -> __TASK_TRACED change in > ptrace_freeze_traced(), > > > [ 94.514019] WARNING: CPU: 16 PID: 34171 at kernel/sched/core.c:6881 > > __might_sleep+0x6c/0x70 > > [ 94.514020] Modules linked in: ata_generic(E) pata_acpi(E) > > crc32_pclmul(E) qemu_fw_cfg(E) ata_piix(E) e1000(E) nls_iso8859_1(E) > > nls_cp437(E) vfat(E) fat(E) virtio_blk(E) virtio_mmio(E) xfs(E) btrfs(E) > > blake2b_generic(E) xor(E) raid6_pq(E) libcrc32c(E) crc32c_intel(E) > > reiserfs(E) squashfs(E) fuse(E) dm_snapshot(E) dm_bufio(E) dm_crypt(E) > > dm_mod(E) binfmt_misc(E) loop(E) sg(E) virtio_rng(E) > > [ 94.514082] CPU: 16 PID: 34171 Comm: strace Tainted: GE > > 5.8.0-rc5-100.g55927f9-default #1 openSUSE Tumbleweed (unreleased) > > [ 94.514084] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS > > rel-1.12.0-0-ga698c89-rebuilt.suse.com 04/01/2014 > > [ 94.514087] RIP: 0010:__might_sleep+0x6c/0x70 > > [ 94.514090] Code: 41 5c 41 5d e9 25 ff ff ff 48 8b 90 68 1a 00 00 48 8b > > 70 10 48 c7 c7 10 45 70 8f c6 05 4f a9 68 01 01 48 89 d1 e8 7a bb fc ff > > <0f> 0b eb c8 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 49 89 fd > > [ 94.514092] RSP: 0018:9ffa4ba1be00 EFLAGS: 00010286 > > [ 94.514093] RAX: RBX: 8dc82b503e00 RCX: > > 0489 > > [ 94.514094] RDX: 0001 RSI: 0096 RDI: > > 0247 > > [ 94.514095] RBP: 8f6ffa6b R08: 0004 R09: > > 0489 > > [ 94.514095] R10: R11: 0001 R12: > > 0039 > > [ 94.514096] R13: R14: 0001 R15: > > 8dc82b5045e4 > > [ 94.514098] FS: 7fa00f1f9240() GS:8dcb0c00() > > knlGS: > > [ 94.514099] CS: 0010 DS: ES: CR0: 80050033 > > [ 94.514100] CR2: 557b53d25877 CR3: 0004ca490005 CR4: > > 00360ee0 > > [ 94.514103] DR0: DR1: DR2: > > > > [ 94.514104] DR3: DR6: fffe0ff0 DR7: > > 0400 > > [ 94.514105] Call Trace: > > [ 94.514821] ptrace_stop+0x1a9/0x300 > > This is already wrong. But > > Where does this __might_sleep() come from ??? I ses no blocking calls > in ptrace_stop(). Not to mention it is called with ->siglock held and > right after this lock is dropped we take tasklist_lock. > > How this connects to the debugging patch I sent? Did you see this > warning > without that patch? > > > > [ 94.514888] WARNING: CPU: 16 PID: 34171 at kernel/signal.c:2219 > > ptrace_stop+0x1d8/0x300 > ... > > [ 94.514899] RIP: 0010:ptrace_stop+0x1d8/0x300 > > This is WARN_ON(current->state) added to ptrace_stop(), this can explain > BUG_ON() in do_notify_parent() you reported. > > So, the tracee returns from schedule() with ->state != TASK_RUNNING ??? > This must not be possible. > > OK, perhaps task->state was changed by ptrace_unfreeze_traced()? this can > only happen if it races with ttwu(__TASK_TRACED) without ->siglock held, > nobody should do this. > > Strange. I have tried to reproduce this with an vanilla upstream 5.8-rc4 and the strace test-suite with make check -j4 and I wasn't able to after multiple retries. Jiri, just to make sure this is upstream 5.8-rc4 without any additional patches? Anything special required to reproduce this in the way you run strace and so on? Christian
Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917
On 07/18, Jiri Slaby wrote: > > On 17. 07. 20, 14:40, Oleg Nesterov wrote: > > > > please see the updated patch below, lets check ptrace_unfreeze() too. > > Sure, dmesg attached. Thanks a lot! But I am totally confused... > [ 94.513944] [ cut here ] > [ 94.513985] do not call blocking ops when !TASK_RUNNING; state=8 set at > [<2fe279e9>] ptrace_check_attach+0xbf/0x110 OK, so the ptracer has already did the TASK_TRACED -> __TASK_TRACED change in ptrace_freeze_traced(), > [ 94.514019] WARNING: CPU: 16 PID: 34171 at kernel/sched/core.c:6881 > __might_sleep+0x6c/0x70 > [ 94.514020] Modules linked in: ata_generic(E) pata_acpi(E) crc32_pclmul(E) > qemu_fw_cfg(E) ata_piix(E) e1000(E) nls_iso8859_1(E) nls_cp437(E) vfat(E) > fat(E) virtio_blk(E) virtio_mmio(E) xfs(E) btrfs(E) blake2b_generic(E) xor(E) > raid6_pq(E) libcrc32c(E) crc32c_intel(E) reiserfs(E) squashfs(E) fuse(E) > dm_snapshot(E) dm_bufio(E) dm_crypt(E) dm_mod(E) binfmt_misc(E) loop(E) sg(E) > virtio_rng(E) > [ 94.514082] CPU: 16 PID: 34171 Comm: strace Tainted: GE > 5.8.0-rc5-100.g55927f9-default #1 openSUSE Tumbleweed (unreleased) > [ 94.514084] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS > rel-1.12.0-0-ga698c89-rebuilt.suse.com 04/01/2014 > [ 94.514087] RIP: 0010:__might_sleep+0x6c/0x70 > [ 94.514090] Code: 41 5c 41 5d e9 25 ff ff ff 48 8b 90 68 1a 00 00 48 8b 70 > 10 48 c7 c7 10 45 70 8f c6 05 4f a9 68 01 01 48 89 d1 e8 7a bb fc ff <0f> 0b > eb c8 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 49 89 fd > [ 94.514092] RSP: 0018:9ffa4ba1be00 EFLAGS: 00010286 > [ 94.514093] RAX: RBX: 8dc82b503e00 RCX: > 0489 > [ 94.514094] RDX: 0001 RSI: 0096 RDI: > 0247 > [ 94.514095] RBP: 8f6ffa6b R08: 0004 R09: > 0489 > [ 94.514095] R10: R11: 0001 R12: > 0039 > [ 94.514096] R13: R14: 0001 R15: > 8dc82b5045e4 > [ 94.514098] FS: 7fa00f1f9240() GS:8dcb0c00() > knlGS: > [ 94.514099] CS: 0010 DS: ES: CR0: 80050033 > [ 94.514100] CR2: 557b53d25877 CR3: 0004ca490005 CR4: > 00360ee0 > [ 94.514103] DR0: DR1: DR2: > > [ 94.514104] DR3: DR6: fffe0ff0 DR7: > 0400 > [ 94.514105] Call Trace: > [ 94.514821] ptrace_stop+0x1a9/0x300 This is already wrong. But Where does this __might_sleep() come from ??? I ses no blocking calls in ptrace_stop(). Not to mention it is called with ->siglock held and right after this lock is dropped we take tasklist_lock. How this connects to the debugging patch I sent? Did you see this warning without that patch? > [ 94.514888] WARNING: CPU: 16 PID: 34171 at kernel/signal.c:2219 > ptrace_stop+0x1d8/0x300 ... > [ 94.514899] RIP: 0010:ptrace_stop+0x1d8/0x300 This is WARN_ON(current->state) added to ptrace_stop(), this can explain BUG_ON() in do_notify_parent() you reported. So, the tracee returns from schedule() with ->state != TASK_RUNNING ??? This must not be possible. OK, perhaps task->state was changed by ptrace_unfreeze_traced()? this can only happen if it races with ttwu(__TASK_TRACED) without ->siglock held, nobody should do this. Strange. Oleg.
Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917
On 17. 07. 20, 13:12, Christian Brauner wrote: > On Fri, Jul 17, 2020 at 01:04:38PM +0200, Jiri Slaby wrote: >> On 17. 07. 20, 12:45, Jiri Slaby wrote: >>> Hi, >>> >>> the strace testsuite triggers this on 5.8-rc4 and -rc5 both on x86_64 >>> and i586: >> >> make check needs -jsomething, running is sequentially (-j1) doesn't >> trigger it. After the error, I cannot run anything. Like ps to find out >> what test caused the crash... >> >> 5.7 was fine. > > Can you try and reproduce with > > commit e91b48162332480f5840902268108bb7fb7a44c7 > Author: Oleg Nesterov > Date: Tue Jun 30 17:32:54 2020 +0200 > > task_work: teach task_work_add() to do signal_wake_up() > > reverted, please? I reverted these 3: +This reverts commit b7db41c9e03b5189bc94993bd50e4506ac9e34c1. +This reverts commit ce593a6c480a22acba08795be313c0c6d49dd35d. +This reverts commit e91b48162332480f5840902268108bb7fb7a44c7. But it still crashes. thanks, -- js
Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917
On 17. 07. 20, 14:40, Oleg Nesterov wrote: > On 07/17, Oleg Nesterov wrote: >> >> On 07/17, Jiri Slaby wrote: >>> >>> On 17. 07. 20, 12:45, Jiri Slaby wrote: Hi, the strace testsuite triggers this on 5.8-rc4 and -rc5 both on x86_64 and i586: >>> >>> make check needs -jsomething, running is sequentially (-j1) doesn't >>> trigger it. After the error, I cannot run anything. Like ps to find out >>> what test caused the crash... >> >> Strange... I'll try to reproduce but I can't do this till Monday. >> >> Meanwhile, could you try the patch below? It needs CONFIG_DEBUG_ATOMIC_SLEEP. > > please see the updated patch below, lets check ptrace_unfreeze() too. Sure, dmesg attached. thanks, -- js [0.00] Linux version 5.8.0-rc5-100.g55927f9-default (geeko@buildhost) (gcc (SUSE Linux) 10.1.1 20200625 [revision c91e43e9363bd119a695d64505f96539fa451bf2], GNU ld (GNU Binutils; openSUSE Tumbleweed) 2.34.0.20200325-1) #1 SMP Sat Jul 18 06:57:08 UTC 2020 (55927f9) [0.00] Command line: root=/dev/disk/by-id/virtio-0 rootfstype=ext4 rootflags=noatime ext4.allow_unsupported=1 kpti=off pti=off spectre_v2=off panic=1 quiet no-kvmclock elevator=noop nmi_watchdog=0 rw rd.driver.pre=binfmt_misc console=ttyS0 init=/.build/build [0.00] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers' [0.00] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers' [0.00] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers' [0.00] x86/fpu: xstate_offset[2]: 576, xstate_sizes[2]: 256 [0.00] x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'standard' format. [0.00] BIOS-provided physical RAM map: [0.00] BIOS-e820: [mem 0x-0x0009fbff] usable [0.00] BIOS-e820: [mem 0x0009fc00-0x0009] reserved [0.00] BIOS-e820: [mem 0x000f-0x000f] reserved [0.00] BIOS-e820: [mem 0x0010-0xbffd8fff] usable [0.00] BIOS-e820: [mem 0xbffd9000-0xbfff] reserved [0.00] BIOS-e820: [mem 0xfeffc000-0xfeff] reserved [0.00] BIOS-e820: [mem 0xfffc-0x] reserved [0.00] BIOS-e820: [mem 0x0001-0x000521ff] usable [0.00] NX (Execute Disable) protection: active [0.00] SMBIOS 2.8 present. [0.00] DMI: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c89-rebuilt.suse.com 04/01/2014 [0.00] Hypervisor detected: KVM [0.00] tsc: Fast TSC calibration using PIT [0.00] tsc: Detected 2799.858 MHz processor [0.000348] e820: update [mem 0x-0x0fff] usable ==> reserved [0.000349] e820: remove [mem 0x000a-0x000f] usable [0.000354] last_pfn = 0x522000 max_arch_pfn = 0x4 [0.000380] MTRR default type: write-back [0.000380] MTRR fixed ranges enabled: [0.000381] 0-9 write-back [0.000382] A-B uncachable [0.000383] C-F write-protect [0.000383] MTRR variable ranges enabled: [0.000384] 0 base 000C000 mask 3FFC000 uncachable [0.000385] 1 disabled [0.000385] 2 disabled [0.000385] 3 disabled [0.000386] 4 disabled [0.000386] 5 disabled [0.000387] 6 disabled [0.000387] 7 disabled [0.000394] x86/PAT: Configuration [0-7]: WB WC UC- UC WB WP UC- WT [0.000408] last_pfn = 0xbffd9 max_arch_pfn = 0x4 [0.015173] check: Scanning 1 areas for low memory corruption [0.015226] Kernel/User page tables isolation: disabled on command line. [0.015230] Using GB pages for direct mapping [0.015416] RAMDISK: [mem 0x7f7a2000-0x7fff] [0.015459] ACPI: Early table checksum verification disabled [0.015474] ACPI: RSDP 0x000F5B30 14 (v00 BOCHS ) [0.015477] ACPI: RSDT 0xBFFE2369 30 (v01 BOCHS BXPCRSDT 0001 BXPC 0001) [0.015486] ACPI: FACP 0xBFFE1D4D 74 (v01 BOCHS BXPCFACP 0001 BXPC 0001) [0.015493] ACPI: DSDT 0xBFFDD400 00494D (v01 BOCHS BXPCDSDT 0001 BXPC 0001) [0.015497] ACPI: FACS 0xBFFDD3C0 40 [0.015499] ACPI: APIC 0xBFFE1DC1 000570 (v01 BOCHS BXPCAPIC 0001 BXPC 0001) [0.015502] ACPI: HPET 0xBFFE2331 38 (v01 BOCHS BXPCHPET 0001 BXPC 0001) [0.015512] ACPI: Local APIC address 0xfee0 [0.015847] No NUMA configuration found [0.015847] Faking a node at [mem 0x-0x000521ff] [0.015862] NODE_DATA(0) allocated [mem 0x521fea000-0x521ff] [0.016605] Zone ranges: [0.016606] DMA [mem 0x1000-0x00ff] [0.016606] DMA32[mem 0x0100-0x] [0.016607] Normal [mem 0x0001-0x000521ff] [0.016608] Device empty [0.016608] Movable zone
Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917
On 07/17, Oleg Nesterov wrote: > > On 07/17, Jiri Slaby wrote: > > > > On 17. 07. 20, 12:45, Jiri Slaby wrote: > > > Hi, > > > > > > the strace testsuite triggers this on 5.8-rc4 and -rc5 both on x86_64 > > > and i586: > > > > make check needs -jsomething, running is sequentially (-j1) doesn't > > trigger it. After the error, I cannot run anything. Like ps to find out > > what test caused the crash... > > Strange... I'll try to reproduce but I can't do this till Monday. > > Meanwhile, could you try the patch below? It needs CONFIG_DEBUG_ATOMIC_SLEEP. please see the updated patch below, lets check ptrace_unfreeze() too. Oleg. diff --git a/kernel/ptrace.c b/kernel/ptrace.c index 43d6179508d6..71c76bc7dec6 100644 --- a/kernel/ptrace.c +++ b/kernel/ptrace.c @@ -180,6 +180,7 @@ static bool ptrace_freeze_traced(struct task_struct *task) spin_lock_irq(>sighand->siglock); if (task_is_traced(task) && !__fatal_signal_pending(task)) { + task->task_state_change = _THIS_IP_; task->state = __TASK_TRACED; ret = true; } @@ -203,8 +204,10 @@ static void ptrace_unfreeze_traced(struct task_struct *task) if (task->state == __TASK_TRACED) { if (__fatal_signal_pending(task)) wake_up_state(task, __TASK_TRACED); - else + else { + task->task_state_change = _THIS_IP_; task->state = TASK_TRACED; + } } spin_unlock_irq(>sighand->siglock); } diff --git a/kernel/signal.c b/kernel/signal.c index ee22ec78fd6d..ba4c4c9ed9b5 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -1914,7 +1914,9 @@ bool do_notify_parent(struct task_struct *tsk, int sig) BUG_ON(sig == -1); /* do_notify_parent_cldstop should have been called instead. */ - BUG_ON(task_is_stopped_or_traced(tsk)); + WARN(task_is_stopped_or_traced(tsk), + "exist with state=%ld set at %pS", + tsk->state, (void*)tsk->task_state_change); BUG_ON(!tsk->ptrace && (tsk->group_leader != tsk || !thread_group_empty(tsk))); @@ -2214,6 +2216,7 @@ static void ptrace_stop(int exit_code, int why, int clear_code, kernel_siginfo_t preempt_enable_no_resched(); freezable_schedule(); cgroup_leave_frozen(true); + WARN_ON(current->state); } else { /* * By the time we got the lock, our tracer went away. @@ -2393,6 +2396,7 @@ static bool do_signal_stop(int signr) /* Now we don't run again until woken by SIGCONT or SIGKILL */ cgroup_enter_frozen(); freezable_schedule(); + WARN_ON(current->state); return true; } else { /*
Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917
On 07/17, Jiri Slaby wrote: > > On 17. 07. 20, 12:45, Jiri Slaby wrote: > > Hi, > > > > the strace testsuite triggers this on 5.8-rc4 and -rc5 both on x86_64 > > and i586: > > make check needs -jsomething, running is sequentially (-j1) doesn't > trigger it. After the error, I cannot run anything. Like ps to find out > what test caused the crash... Strange... I'll try to reproduce but I can't do this till Monday. Meanwhile, could you try the patch below? It needs CONFIG_DEBUG_ATOMIC_SLEEP. Oleg. diff --git a/kernel/signal.c b/kernel/signal.c index ee22ec78fd6d..84782843eb1c 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -1914,7 +1914,9 @@ bool do_notify_parent(struct task_struct *tsk, int sig) BUG_ON(sig == -1); /* do_notify_parent_cldstop should have been called instead. */ - BUG_ON(task_is_stopped_or_traced(tsk)); + WARN(task_is_stopped_or_traced(tsk), + "exist with state=%ld set at %pS", + tsk->state, (void*)tsk->task_state_change); BUG_ON(!tsk->ptrace && (tsk->group_leader != tsk || !thread_group_empty(tsk))); @@ -2214,6 +2216,7 @@ static void ptrace_stop(int exit_code, int why, int clear_code, kernel_siginfo_t preempt_enable_no_resched(); freezable_schedule(); cgroup_leave_frozen(true); + WARN_ON(current->state); } else { /* * By the time we got the lock, our tracer went away. @@ -2393,6 +2396,7 @@ static bool do_signal_stop(int signr) /* Now we don't run again until woken by SIGCONT or SIGKILL */ cgroup_enter_frozen(); freezable_schedule(); + WARN_ON(current->state); return true; } else { /*
Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917
On Fri, Jul 17, 2020 at 01:04:38PM +0200, Jiri Slaby wrote: > On 17. 07. 20, 12:45, Jiri Slaby wrote: > > Hi, > > > > the strace testsuite triggers this on 5.8-rc4 and -rc5 both on x86_64 > > and i586: > > make check needs -jsomething, running is sequentially (-j1) doesn't > trigger it. After the error, I cannot run anything. Like ps to find out > what test caused the crash... > > 5.7 was fine. Can you try and reproduce with commit e91b48162332480f5840902268108bb7fb7a44c7 Author: Oleg Nesterov Date: Tue Jun 30 17:32:54 2020 +0200 task_work: teach task_work_add() to do signal_wake_up() reverted, please? Thanks! Christian
Re: 5.8-rc*: kernel BUG at kernel/signal.c:1917
On 17. 07. 20, 12:45, Jiri Slaby wrote: > Hi, > > the strace testsuite triggers this on 5.8-rc4 and -rc5 both on x86_64 > and i586: make check needs -jsomething, running is sequentially (-j1) doesn't trigger it. After the error, I cannot run anything. Like ps to find out what test caused the crash... 5.7 was fine. >> kernel BUG at kernel/signal.c:1917! >> invalid opcode: [#1] SMP NOPTI >> CPU: 7 PID: 18367 Comm: filter-unavaila Not tainted >> 5.8.0-rc4-3.g2cd7849-default #1 openSUSE Tumbleweed (unreleased) >> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS >> rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014 >> RIP: 0010:do_notify_parent+0x239/0x270 >> Code: d1 e8 41 89 c5 41 83 e5 01 48 8b 95 28 05 00 00 45 31 c0 4c 89 f6 44 >> 89 e7 b9 01 00 00 00 e8 ce f2 ff ff e9 5b ff ff ff 0f 0b <0f> 0b 48 39 bf 50 >> 05 00 00 75 17 48 8b 97 c0 05 00 00 48 8d 87 c0 >> RSP: 0018:a05400f93d00 EFLAGS: 00010002 >> RAX: 0008 RBX: 9024ebb6dd00 RCX: 000102f9 >> RDX: RSI: 0011 RDI: 9024ebb6dd00 >> RBP: 9024ebb6dd00 R08: R09: 000d >> R10: 7f781300 R11: c2690721ae80 R12: >> R13: R14: 0009 R15: >> FS: 7f7812ffc700() GS:902537dc() knlGS: >> CS: 0010 DS: ES: CR0: 80050033 >> CR2: 7f9c246a4a10 CR3: 0001eba7c000 CR4: 000406e0 >> Call Trace: >> exit_notify+0x1c0/0x1f0 >> do_exit+0x234/0x430 >> do_group_exit+0x33/0xa0 >> get_signal+0x171/0x640 >> do_signal+0x2a/0x1c0 >> exit_to_usermode_loop+0x89/0x110 >> __prepare_exit_to_usermode+0x92/0xd0 >> do_syscall_64+0x59/0xd0 >> entry_SYSCALL_64_after_hwframe+0x44/0xa9 >> RIP: 0033:0x7f78149db992 >> Code: Bad RIP value. >> RSP: 002b:7f7812ffbed0 EFLAGS: 0293 ORIG_RAX: 0022 >> RAX: fdfe RBX: RCX: 7f78149db992 >> RDX: 0001 RSI: 558a4e07e033 RDI: >> RBP: R08: R09: 7f7812ffc700 >> R10: 7f7812ffc9d0 R11: 0293 R12: >> R13: 7ffe4853b4bf R14: 7ffe4853b4c0 R15: 7f7812ffc700 >> Modules linked in: ata_generic ata_piix qemu_fw_cfg e1000 nls_iso8859_1 >> nls_cp437 vfat fat virtio_blk virtio_mmio xfs btrfs blake2b_generic xor >> raid6_pq libcrc32c crc32c_intel reiserfs squashfs fuse dm_snapshot dm_bufio >> dm_crypt dm_mod binfmt_misc loop sg virtio_rng >> ---[ end trace 364c3385f192fee3 ]--- > > Any ideas? > > thanks, > -- js
5.8-rc*: kernel BUG at kernel/signal.c:1917
Hi, the strace testsuite triggers this on 5.8-rc4 and -rc5 both on x86_64 and i586: > kernel BUG at kernel/signal.c:1917! > invalid opcode: [#1] SMP NOPTI > CPU: 7 PID: 18367 Comm: filter-unavaila Not tainted > 5.8.0-rc4-3.g2cd7849-default #1 openSUSE Tumbleweed (unreleased) > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS > rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014 > RIP: 0010:do_notify_parent+0x239/0x270 > Code: d1 e8 41 89 c5 41 83 e5 01 48 8b 95 28 05 00 00 45 31 c0 4c 89 f6 44 89 > e7 b9 01 00 00 00 e8 ce f2 ff ff e9 5b ff ff ff 0f 0b <0f> 0b 48 39 bf 50 05 > 00 00 75 17 48 8b 97 c0 05 00 00 48 8d 87 c0 > RSP: 0018:a05400f93d00 EFLAGS: 00010002 > RAX: 0008 RBX: 9024ebb6dd00 RCX: 000102f9 > RDX: RSI: 0011 RDI: 9024ebb6dd00 > RBP: 9024ebb6dd00 R08: R09: 000d > R10: 7f781300 R11: c2690721ae80 R12: > R13: R14: 0009 R15: > FS: 7f7812ffc700() GS:902537dc() knlGS: > CS: 0010 DS: ES: CR0: 80050033 > CR2: 7f9c246a4a10 CR3: 0001eba7c000 CR4: 000406e0 > Call Trace: > exit_notify+0x1c0/0x1f0 > do_exit+0x234/0x430 > do_group_exit+0x33/0xa0 > get_signal+0x171/0x640 > do_signal+0x2a/0x1c0 > exit_to_usermode_loop+0x89/0x110 > __prepare_exit_to_usermode+0x92/0xd0 > do_syscall_64+0x59/0xd0 > entry_SYSCALL_64_after_hwframe+0x44/0xa9 > RIP: 0033:0x7f78149db992 > Code: Bad RIP value. > RSP: 002b:7f7812ffbed0 EFLAGS: 0293 ORIG_RAX: 0022 > RAX: fdfe RBX: RCX: 7f78149db992 > RDX: 0001 RSI: 558a4e07e033 RDI: > RBP: R08: R09: 7f7812ffc700 > R10: 7f7812ffc9d0 R11: 0293 R12: > R13: 7ffe4853b4bf R14: 7ffe4853b4c0 R15: 7f7812ffc700 > Modules linked in: ata_generic ata_piix qemu_fw_cfg e1000 nls_iso8859_1 > nls_cp437 vfat fat virtio_blk virtio_mmio xfs btrfs blake2b_generic xor > raid6_pq libcrc32c crc32c_intel reiserfs squashfs fuse dm_snapshot dm_bufio > dm_crypt dm_mod binfmt_misc loop sg virtio_rng > ---[ end trace 364c3385f192fee3 ]--- Any ideas? thanks, -- js suse labs
Re: Maybe a bug in kernel/irq/chip.c unmask_irq(), device IRQ masked unexpectedly. (re-formated the mail body, sorry)
Hi, Thomas I canceled that patch. In my testing, that will not fix the problem. Why the IRQ is unexpectedly masked, that is not an easy bug for me. More time need to hacking the driver/kernel code. Thanks for your reply. Thomas Gleixner 于2019年10月14日周一 下午8:34写道: > > On Tue, 8 Oct 2019, Yi Zheng wrote: > > There is some defects on IRQ processing: > > > > (1) At the beginning of handle_level_irq(), the IRQ-28 is masked, and > > ACK > > action is executed: On my machine, it runs the 'else' branch: > > > > static inline void mask_ack_irq(struct irq_desc *desc) > > { > > if (desc->irq_data.chip->irq_mask_ack) { > > desc->irq_data.chip->irq_mask_ack(>irq_data); > > irq_state_set_masked(desc); > > } else { > > mask_irq(desc); > > if (desc->irq_data.chip->irq_ack) > > > > desc->irq_data.chip->irq_ack(>irq_data); > > } > > } > > > > It is an 2-steps procedure: > > 1. mask_irq() > > 2. desc->irq_data.chip->irq_ack() > > > > the 2nd step, the function ptr is omap_mask_ack_irq(), which > > _MASK_ the hardware INTC-IRQ-32 and then do the real ACK action. > > Sure. Where is the problem? > > > (2) mask_irq()/unmask_irq() are not atomic actions: They check the > > IRQD_IRQ_MASKED flag firstly, and then mask/unmask the irq by > > calling > > the function ptrs which installed by irq controller drv. Then, > > those 2 > > functions set/clear the IRQD_IRQ_MASKED flag. > > > > I think the sequence of the hw/sw action should be mirrored > > reversed: > > mask_irq(): > > check IRQD_IRQ_MASKED; > > set hardware IRQ mask register; > > set software IRQD_IRQ_MASKED flag; > > > > unmask_irq(): > > check IRQD_IRQ_MASKED; > > /* NOTE: should before the hw unmask action!! */ > > clear software IRQD_IRQ_MASKED flag; > > clear hardware IRQ mask register; > > > > The current unmask_irq(), hw-mask action runs before sw-mask > > action, > > which gives an very small time window. That cause an unexpected > > iterated IRQ. > > It's completely irrelevant because _ALL_ those operations run with > irq_desc->lock held. So nothing can actually observe that state. > > > Here is my the detail of my analyzing of handle_level_irq(): > > > > (1) Let record the HW-IRQ-Controller Status and the SW-Flag > > IRQD_IRQ_MASKED > > pair as following: (hw-mask, sw-mask). > > > > (2) In the 1st level of IRQ-28 ISR calling, in unmask_irq(), after the > > HW > > unmask action, and before the sw-flag IRQD_IRQ_MASKED is cleared, > > there > > is a VERY SMALL TIME WINDOW, in which, another IRQ-28 may > > triggered. > > > > In that time window, the mask status is (0, 1), which is no an > > valid > > value. > > Again. Irrelevant because not observable. > > > My fixup is in the attachment, which remove the unexpected time > > window of > > IRQ iteration. > > Please don't send attachments. See > Documentation/process/submitting-patches.rst > > Thanks, > > tglx >
Re: Maybe a bug in kernel/irq/chip.c unmask_irq(), device IRQ masked unexpectedly. (re-formated the mail body, sorry)
On Tue, 8 Oct 2019, Yi Zheng wrote: > There is some defects on IRQ processing: > > (1) At the beginning of handle_level_irq(), the IRQ-28 is masked, and ACK > action is executed: On my machine, it runs the 'else' branch: > > static inline void mask_ack_irq(struct irq_desc *desc) > { > if (desc->irq_data.chip->irq_mask_ack) { > desc->irq_data.chip->irq_mask_ack(>irq_data); > irq_state_set_masked(desc); > } else { > mask_irq(desc); > if (desc->irq_data.chip->irq_ack) > desc->irq_data.chip->irq_ack(>irq_data); > } > } > > It is an 2-steps procedure: > 1. mask_irq() > 2. desc->irq_data.chip->irq_ack() > > the 2nd step, the function ptr is omap_mask_ack_irq(), which > _MASK_ the hardware INTC-IRQ-32 and then do the real ACK action. Sure. Where is the problem? > (2) mask_irq()/unmask_irq() are not atomic actions: They check the > IRQD_IRQ_MASKED flag firstly, and then mask/unmask the irq by calling > the function ptrs which installed by irq controller drv. Then, > those 2 > functions set/clear the IRQD_IRQ_MASKED flag. > > I think the sequence of the hw/sw action should be mirrored reversed: > mask_irq(): > check IRQD_IRQ_MASKED; > set hardware IRQ mask register; > set software IRQD_IRQ_MASKED flag; > > unmask_irq(): > check IRQD_IRQ_MASKED; > /* NOTE: should before the hw unmask action!! */ > clear software IRQD_IRQ_MASKED flag; > clear hardware IRQ mask register; > > The current unmask_irq(), hw-mask action runs before sw-mask action, > which gives an very small time window. That cause an unexpected > iterated IRQ. It's completely irrelevant because _ALL_ those operations run with irq_desc->lock held. So nothing can actually observe that state. > Here is my the detail of my analyzing of handle_level_irq(): > > (1) Let record the HW-IRQ-Controller Status and the SW-Flag > IRQD_IRQ_MASKED > pair as following: (hw-mask, sw-mask). > > (2) In the 1st level of IRQ-28 ISR calling, in unmask_irq(), after the HW > unmask action, and before the sw-flag IRQD_IRQ_MASKED is cleared, > there > is a VERY SMALL TIME WINDOW, in which, another IRQ-28 may triggered. > > In that time window, the mask status is (0, 1), which is no an valid > value. Again. Irrelevant because not observable. > My fixup is in the attachment, which remove the unexpected time window > of > IRQ iteration. Please don't send attachments. See Documentation/process/submitting-patches.rst Thanks, tglx
Re: Maybe a bug in kernel/irq/chip.c unmask_irq(), device IRQ masked unexpectedly. (re-formated the mail body, sorry)
* Yi Zheng [191010 06:22]: > Hi > >My patch is canceled. I have tested that simple adjustment, the > device IRQ masked unexpectedly. It seems that it is more easily to > occur with that patch. > > So, the root cause is not found yet. OK. Based on your description, it could be a missing flush of a posted write somewhere in the gpio-omap or intc path. Regards, Tony
Re: Maybe a bug in kernel/irq/chip.c unmask_irq(), device IRQ masked unexpectedly. (re-formated the mail body, sorry)
Hi My patch is canceled. I have tested that simple adjustment, the device IRQ masked unexpectedly. It seems that it is more easily to occur with that patch. So, the root cause is not found yet.
Re: Maybe a bug in kernel/irq/chip.c unmask_irq(), device IRQ masked unexpectedly. (re-formated the mail body, sorry)
Hi, * Yi Zheng [191008 13:06]: > NOTE: (1) My SoC is a single core ARM chip: TI-AM3352, so the raw > spin-lock irq_desc->lock will be optimized to > nothing. handle_level_irq() has no spin-lock protection, right? Well not always, With CONFIG_SMP we modify only some of the SMP code on boot, see arch/arm/kernel/head.S for smp_on_up and then the related macro usage. > (2) In AM3352, INTC driver ACK the IRQ by write 0x01 into INTC > Control > Register(offset 0x48). The chip doc seems that bit[0] of > INTC-Control Reg is only an enable/disable flag. The IRQ may > generated even if no ACK action done. Any one can give me an > clarification? The TI INTC is probably better documented in dm3630 trm, it's the same controller but with a different revision. > (3) My analysis is not verified on the real machine. After some code > change for debug(add counter to indicates the iteration level, > save > the IRQ mask status etc.), the device IRQ wrongly masked problem > vanished. In fact, the original code can not re-produce the > phenomena easily. In tens of machine, only one can get the bug. I > have try my best to hacking the code, but the only verified > result > is here: when bug occur, the HW IRQ is masked, but the > IRQD_IRQ_MASKED flag is cleared. > > My fixup is in the attachment, which remove the unexpected time window > of > IRQ iteration. Let's see what Thomas has to say for that. Meanwhile, please take a look at Documentation/process/submitting-patches.rst for getting things right for sending out patches that can be applied without manual editing :) Cheers, Tony > --- kernel/irq/chip.c 2019-07-13 09:28:23.683787367 +0800 > +++ /tmp/chip.c 2019-10-08 11:32:35.082258572 +0800 > @@ -432,8 +432,8 @@ void unmask_irq(struct irq_desc *desc) > return; > > if (desc->irq_data.chip->irq_unmask) { > - desc->irq_data.chip->irq_unmask(>irq_data); > irq_state_clr_masked(desc); > + desc->irq_data.chip->irq_unmask(>irq_data); > } > } >
Maybe a bug in kernel/irq/chip.c unmask_irq(), device IRQ masked unexpectedly. (re-formated the mail body, sorry)
Hi, I found something wrong on my AM3352 SoC machine, the GPIO triggered IRQ is masked unexpectedly. That bug cause the devices using that GPIO-IRQ can not work. Even the latest kernel version (v5.4-rc2-20-geda57a0e4299)! After a long time hacking, I guess the bug is in kernel/irq/chip.c, the important base code for _ALL_ the Processor Platform! That is why this mail is sent to you. Shortly speaking, the bug is about wrong interrupt iteration. That cause the software flag "IRQD_IRQ_MASKED" not refect the real masking status register in the interrupt controller (INTC, drivers/irqchip/irq-omap-intc.c). Here is my hw/sw settings: (1) FPGA implements some UART, using a GPIO pin as INT signal, (2) In the SoC, 481ac000.gpio is under the INTC controller, irq=28, and hwirq=32 It is an level triggered IRQ. (3) The ISR calling stack shows that handle_level_irq() is the most important function for debugging. There is some defects on IRQ processing: (1) At the beginning of handle_level_irq(), the IRQ-28 is masked, and ACK action is executed: On my machine, it runs the 'else' branch: static inline void mask_ack_irq(struct irq_desc *desc) { if (desc->irq_data.chip->irq_mask_ack) { desc->irq_data.chip->irq_mask_ack(>irq_data); irq_state_set_masked(desc); } else { mask_irq(desc); if (desc->irq_data.chip->irq_ack) desc->irq_data.chip->irq_ack(>irq_data); } } It is an 2-steps procedure: 1. mask_irq() 2. desc->irq_data.chip->irq_ack() the 2nd step, the function ptr is omap_mask_ack_irq(), which _MASK_ the hardware INTC-IRQ-32 and then do the real ACK action. (2) mask_irq()/unmask_irq() are not atomic actions: They check the IRQD_IRQ_MASKED flag firstly, and then mask/unmask the irq by calling the function ptrs which installed by irq controller drv. Then, those 2 functions set/clear the IRQD_IRQ_MASKED flag. I think the sequence of the hw/sw action should be mirrored reversed: mask_irq(): check IRQD_IRQ_MASKED; set hardware IRQ mask register; set software IRQD_IRQ_MASKED flag; unmask_irq(): check IRQD_IRQ_MASKED; /* NOTE: should before the hw unmask action!! */ clear software IRQD_IRQ_MASKED flag; clear hardware IRQ mask register; The current unmask_irq(), hw-mask action runs before sw-mask action, which gives an very small time window. That cause an unexpected iterated IRQ. Here is my the detail of my analyzing of handle_level_irq(): (1) Let record the HW-IRQ-Controller Status and the SW-Flag IRQD_IRQ_MASKED pair as following: (hw-mask, sw-mask). (2) In the 1st level of IRQ-28 ISR calling, in unmask_irq(), after the HW unmask action, and before the sw-flag IRQD_IRQ_MASKED is cleared, there is a VERY SMALL TIME WINDOW, in which, another IRQ-28 may triggered. In that time window, the mask status is (0, 1), which is no an valid value. (3) In the 2nd level of the ISR(IRQ-28), The mask status is IRQ-28(0, 1), so mask_irq() do nothing, because sw-flag is "1". That is an wrong status, the programmer thinks that IRQ-28 has been masked, but physically not! (4) Before the ACK func-ptr calling, there comes the 3rd level IRQ (28/32)! Although mask_irq() do not physically mask the IRQ, ACK acion (omap_mask_ack_irq) of the omap-intc drv mask the IRQ physically. The 3rd ISR runs OK. When 3rd level ISR exist, the mask status is (0, 0), that is OK! (5) The 3rd level ISR finished, the 2nd level ISR continue. It run ACK function ptr -- omap_mask_ack_irq(). The HW-IRQ-mask is set again! Now, the mask status is (1, 0), it is unreasonable value! (6) The 2nd level ISR run to cond_unmask_irq(). Due to the ill-formed mask-status value(1, 0), the unmask_irq() will not be called. Even in unmask_irq(), another checking SW-Flag exists. The real unmask action will not run! (7) Now, the 2nd level ISR return, with the mask status (1, 0). The 1st level ISR continues, in unmask_irq(), it run irq_state_clr_masked(); And it repeatedly clear the IRQD_IRQ_MASKED. The final mask status is (1, 0). What (1, 0) value means? The CPU call will not receive IRQ any more! That is my bug phenomenon. If I clean the hardware interrupt controller's mask bit, my devices work again. NOTE: (1) My SoC is a single
Maybe a bug in kernel/irq/chip.c unmask_irq(), device irq is masked unexpectedly.
Hi, I found something wrong on my AM3352 SoC machine, the GPIO triggered IRQ is masked unexpectedly. That bug cause the devices using that GPIO-IRQ can not work. Even the latest kernel version (v5.4-rc2-20-geda57a0e4299)! After a long time hacking, I guess the bug is in kernel/irq/chip.c, the important base code for _ALL_ the Processor Platform! That is why this mail is sent to you. Shortly speaking, the bug is about wrong interrupt iteration. That cause the software flag "IRQD_IRQ_MASKED" not refect the real masking status register in the interrupt controller (INTC, drivers/irqchip/irq-omap-intc.c). Here is my hw/sw settings: (1) FPGA implements some UART, using a GPIO pin as INT signal, (2) In the SoC, 481ac000.gpio is under the INTC controller, irq=28, and hwirq=32 It is an level triggered IRQ. (3) The ISR calling stack shows that handle_level_irq() is the most important function for debugging. There is some defects on IRQ processing: (1) At the beginning of handle_level_irq(), the IRQ-28 is masked, and ACK action is executed: On my machine, it runs the 'else' branch: static inline void mask_ack_irq(struct irq_desc *desc) { if (desc->irq_data.chip->irq_mask_ack) { desc->irq_data.chip->irq_mask_ack(>irq_data); irq_state_set_masked(desc); } else { mask_irq(desc); if (desc->irq_data.chip->irq_ack) desc->irq_data.chip->irq_ack(>irq_data); } } It is an 2-steps procedure: 1. mask_irq() 2. desc->irq_data.chip->irq_ack() the 2nd step, the function ptr is omap_mask_ack_irq(), which _MASK_ the hardware INTC-IRQ-32 and then do the real ACK action. (2) mask_irq()/unmask_irq() are not atomic actions: They check the IRQD_IRQ_MASKED flag firstly, and then mask/unmask the irq by calling the function ptrs which installed by irq controller drv. Then, those 2 functions set/clear the IRQD_IRQ_MASKED flag. I think the sequence of the hw/sw action should be mirrored reversed: mask_irq(): check IRQD_IRQ_MASKED; set hardware IRQ mask register; set software IRQD_IRQ_MASKED flag; unmask_irq(): check IRQD_IRQ_MASKED; clear software IRQD_IRQ_MASKED flag; /* NOTE: should before the hw unmask action!! */ clear hardware IRQ mask register; The current unmask_irq(), hw-mask action runs before sw-mask action, which gives an very small time window. That cause an unexpected iterated IRQ. Here is my the detail of my analyzing of handle_level_irq(): (1) Let record the HW-IRQ-Controller Status and the SW-Flag IRQD_IRQ_MASKED pair as following: (hw-mask, sw-mask). (2) In the 1st level of IRQ-28 ISR calling, in unmask_irq(), after the HW unmask action, and before the sw-flag IRQD_IRQ_MASKED is cleared, there is a VERY SMALL TIME WINDOW, in which, another IRQ-28 may triggered. In that time window, the mask status is (0, 1), which is no an valid value. (3) In the 2nd level of the ISR(IRQ-28), The mask status is IRQ-28(0, 1), so mask_irq() do nothing, because sw-flag is "1". That is an wrong status, the programmer thinks that IRQ-28 has been masked, but physically not! (4) Before the ACK func-ptr calling, there comes the 3rd level IRQ (28/32)! Although mask_irq() do not physically mask the IRQ, ACK acion (omap_mask_ack_irq) of the omap-intc drv mask the IRQ physically. The 3rd ISR runs OK. When 3rd level ISR exist, the mask status is (0, 0), that is OK! (5) The 3rd level ISR finished, the 2nd level ISR continue. It run ACK function ptr -- omap_mask_ack_irq(). The HW-IRQ-mask is set again! Now, the mask status is (1, 0), it is unreasonable value! (6) The 2nd level ISR run to cond_unmask_irq(). Due to the ill-formed mask-status value(1, 0), the unmask_irq() will not be called. Even in unmask_irq(), another checking SW-Flag exists. The real unmask action will not run! (7) Now, the 2nd level ISR return, with the mask status (1, 0). The 1st level ISR continues, in unmask_irq(), it run irq_state_clr_masked(); And it repeatedly clear the IRQD_IRQ_MASKED. The final mask status is (1, 0). What (1, 0) value means? The CPU call will not receive IRQ any more! That is my bug phenomenon. If I clean the hardware interrupt controller's mask bit, my devices work again. NOTE: (1) My SoC is a single core ARM chip: TI-AM3352, so the raw spin-lock irq_desc-&g
kernel BUG at kernel/time/timer.c:LINE! (4)
Hello, syzbot found the following crash on: HEAD commit:58d5f26a usb-fuzzer: main usb gadget fuzzer driver git tree: https://github.com/google/kasan.git usb-fuzzer console output: https://syzkaller.appspot.com/x/log.txt?x=11e5b20d60 kernel config: https://syzkaller.appspot.com/x/.config?x=aa5dac3cda4ffd58 dashboard link: https://syzkaller.appspot.com/bug?extid=f49d12d34f2321cf4df2 compiler: gcc (GCC) 9.0.0 20181231 (experimental) syz repro: https://syzkaller.appspot.com/x/repro.syz?x=13d9c71360 IMPORTANT: if you fix the bug, please add the following tag to the commit: Reported-by: syzbot+f49d12d34f2321cf4...@syzkaller.appspotmail.com [ cut here ] kernel BUG at kernel/time/timer.c:956! invalid opcode: [#1] SMP KASAN CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.4.0-rc1+ #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__mod_timer kernel/time/timer.c:956 [inline] RIP: 0010:__mod_timer kernel/time/timer.c:949 [inline] RIP: 0010:mod_timer+0x5a2/0xb50 kernel/time/timer.c:1100 Code: 45 10 c7 44 24 14 ff ff ff ff 48 89 44 24 08 48 8d 45 20 48 c7 44 24 18 00 00 00 00 48 89 04 24 e9 5a fc ff ff e8 ae ce 0e 00 <0f> 0b e8 a7 ce 0e 00 4c 89 74 24 20 e9 37 fe ff ff e8 98 ce 0e 00 RSP: 0018:8881db209930 EFLAGS: 00010006 RAX: 86c2b200 RBX: a688 RCX: 83efc583 RDX: 0100 RSI: 812f4d82 RDI: 8881d2356200 RBP: 8881d23561e8 R08: 86c2b200 R09: ed103a46abeb R10: ed103a46abea R11: 8881d2355f53 R12: dc00 R13: 11103b64132d R14: 8881d2355f50 R15: 0006 FS: () GS:8881db20() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 7f75e2799000 CR3: 0001d3b07000 CR4: 001406f0 DR0: DR1: DR2: DR3: DR6: fffe0ff0 DR7: 0400 Call Trace: imon_touch_event drivers/media/rc/imon.c:1348 [inline] imon_incoming_packet.isra.0+0x2546/0x2f10 drivers/media/rc/imon.c:1603 usb_rx_callback_intf0+0x151/0x1e0 drivers/media/rc/imon.c:1734 __usb_hcd_giveback_urb+0x1f2/0x470 drivers/usb/core/hcd.c:1654 usb_hcd_giveback_urb+0x368/0x420 drivers/usb/core/hcd.c:1719 dummy_timer+0x120f/0x2fa2 drivers/usb/gadget/udc/dummy_hcd.c:1965 call_timer_fn+0x179/0x650 kernel/time/timer.c:1404 expire_timers kernel/time/timer.c:1449 [inline] __run_timers kernel/time/timer.c:1773 [inline] __run_timers kernel/time/timer.c:1740 [inline] run_timer_softirq+0x5e3/0x1490 kernel/time/timer.c:1786 __do_softirq+0x221/0x912 kernel/softirq.c:292 invoke_softirq kernel/softirq.c:373 [inline] irq_exit+0x178/0x1a0 kernel/softirq.c:413 exiting_irq arch/x86/include/asm/apic.h:536 [inline] smp_apic_timer_interrupt+0x12f/0x500 arch/x86/kernel/apic/apic.c:1137 apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:830 RIP: 0010:default_idle+0x28/0x2e0 arch/x86/kernel/process.c:581 Code: 90 90 41 56 41 55 65 44 8b 2d 44 3a 8f 7a 41 54 55 53 0f 1f 44 00 00 e8 36 ee d0 fb e9 07 00 00 00 0f 00 2d fa dd 4f 00 fb f4 <65> 44 8b 2d 20 3a 8f 7a 0f 1f 44 00 00 5b 5d 41 5c 41 5d 41 5e c3 RSP: 0018:86c07da8 EFLAGS: 0246 ORIG_RAX: ff13 RAX: 0007 RBX: 86c2b200 RCX: RDX: RSI: 0006 RDI: 86c2ba4c RBP: fbfff0d85640 R08: 86c2b200 R09: R10: R11: R12: R13: R14: R15: cpuidle_idle_call kernel/sched/idle.c:154 [inline] do_idle+0x3b6/0x500 kernel/sched/idle.c:263 cpu_startup_entry+0x14/0x20 kernel/sched/idle.c:355 start_kernel+0x82a/0x864 init/main.c:784 secondary_startup_64+0xa4/0xb0 arch/x86/kernel/head_64.S:241 Modules linked in: ---[ end trace 3b8169377985e706 ]--- RIP: 0010:__mod_timer kernel/time/timer.c:956 [inline] RIP: 0010:__mod_timer kernel/time/timer.c:949 [inline] RIP: 0010:mod_timer+0x5a2/0xb50 kernel/time/timer.c:1100 Code: 45 10 c7 44 24 14 ff ff ff ff 48 89 44 24 08 48 8d 45 20 48 c7 44 24 18 00 00 00 00 48 89 04 24 e9 5a fc ff ff e8 ae ce 0e 00 <0f> 0b e8 a7 ce 0e 00 4c 89 74 24 20 e9 37 fe ff ff e8 98 ce 0e 00 RSP: 0018:8881db209930 EFLAGS: 00010006 RAX: 86c2b200 RBX: a688 RCX: 83efc583 RDX: 0100 RSI: 812f4d82 RDI: 8881d2356200 RBP: 8881d23561e8 R08: 86c2b200 R09: ed103a46abeb R10: ed103a46abea R11: 8881d2355f53 R12: dc00 R13: 11103b64132d R14: 8881d2355f50 R15: 0006 FS: () GS:8881db20() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 7f75e2799000 CR3: 0001d3b07000 CR4: 001406f0 DR0: DR1: DR2: DR3: 00
Re: Bug at kernel/cred.c +443
On 2019-09-09 15:47, risha...@codeaurora.org wrote: Hi Miklos In 4.19 kernel when we write to a file that doesn't exist we see the following stack: [ 377.382745] ovl_create_or_link+0xac/0x710 [ 377.382745] ovl_create_object+0xb8/0x110 [ 377.382745] ovl_create+0x34/0x40 [ 377.382745] path_openat+0xd44/0x15a8 [ 377.382745] do_filp_open+0x80/0x128 [ 377.382745] do_sys_open+0x140/0x250 [ 377.382745] __arm64_sys_openat+0x2c/0x38 If the override_cred flag = off, the ovl_override_cred and ovl_revert_cred just returns NULL. But there is another override_cred in between these two functions; put_cred(override_creds(override_cred)); put_cred(override_cred); This will override the credentials permanently as there is no corresponding revert_cred associated. So whenever we do commit_creds for this task, we see a BUG_ON at kernel/cred.c +443. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/cred.c#n443 Should this override_cred be changed to ovl_override_cred to maintain consistency and avoid this BUG_ON? Thanks, Rishabh Corrected line number in the subject.
Bug at kernel/cred.c +432
Hi Miklos In 4.19 kernel when we write to a file that doesn't exist we see the following stack: [ 377.382745] ovl_create_or_link+0xac/0x710 [ 377.382745] ovl_create_object+0xb8/0x110 [ 377.382745] ovl_create+0x34/0x40 [ 377.382745] path_openat+0xd44/0x15a8 [ 377.382745] do_filp_open+0x80/0x128 [ 377.382745] do_sys_open+0x140/0x250 [ 377.382745] __arm64_sys_openat+0x2c/0x38 If the override_cred flag = off, the ovl_override_cred and ovl_revert_cred just returns NULL. But there is another override_cred in between these two functions; put_cred(override_creds(override_cred)); put_cred(override_cred); This will override the credentials permanently as there is no corresponding revert_cred associated. So whenever we do commit_creds for this task, we see a BUG_ON at kernel/cred.c +443. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/cred.c#n443 Should this override_cred be changed to ovl_override_cred to maintain consistency and avoid this BUG_ON? Thanks, Rishabh
Reminder: 1 open syzbot bug in "kernel/cgroup" subsystem
[This email was generated by a script. Let me know if you have any suggestions to make it better, or if you want it re-generated with the latest status.] Of the currently open syzbot reports against the upstream kernel, I've manually marked 1 of them as possibly being a bug in the "kernel/cgroup" subsystem. If you believe this bug is no longer valid, please close the syzbot report by sending a '#syz fix', '#syz dup', or '#syz invalid' command in reply to the original thread, as explained at https://goo.gl/tpsmEJ#status If you believe I misattributed this bug to the "kernel/cgroup" subsystem, please let me know, and if possible forward the report to the correct people or mailing list. Here is the bug: Title: WARNING in cgroup_apply_control_enable Last occurred: 4 days ago Reported: 379 days ago Branches: Mainline and others Dashboard link: https://syzkaller.appspot.com/bug?id=4cb6430631b7567848a6736be59d0350b3ada2ea Original thread: https://lkml.kernel.org/lkml/09f9a305708fa...@google.com/T/#u This bug has a C reproducer. No one replied to the original thread for this bug. If you fix this bug, please add the following tag to the commit: Reported-by: syzbot+5493b2a54d31d6aea...@syzkaller.appspotmail.com If you send any email or patch for this bug, please consider replying to the original thread. For the git send-email command to use, or tips on how to reply if the thread isn't in your mailbox, see the "Reply instructions" at https://lkml.kernel.org/r/09f9a305708fa...@google.com
Reminder: 1 open syzbot bug in "kernel/cgroup" subsystem
[This email was generated by a script. Let me know if you have any suggestions to make it better, or if you want it re-generated with the latest status.] Of the currently open syzbot reports against the upstream kernel, I've manually marked 1 of them as possibly being a bug in the "kernel/cgroup" subsystem. If you believe this bug is no longer valid, please close the syzbot report by sending a '#syz fix', '#syz dup', or '#syz invalid' command in reply to the original thread, as explained at https://goo.gl/tpsmEJ#status If you believe I misattributed this bug to the "kernel/cgroup" subsystem, please let me know, and if possible forward the report to the correct people or mailing list. Here is the bug: Title: WARNING in cgroup_apply_control_enable Last occurred: 9 days ago Reported: 365 days ago Branches: Mainline and others Dashboard link: https://syzkaller.appspot.com/bug?id=4cb6430631b7567848a6736be59d0350b3ada2ea Original thread: https://lkml.kernel.org/lkml/09f9a305708fa...@google.com/T/#u This bug has a C reproducer. No one replied to the original thread for this bug. If you fix this bug, please add the following tag to the commit: Reported-by: syzbot+5493b2a54d31d6aea...@syzkaller.appspotmail.com If you send any email or patch for this bug, please consider replying to the original thread. For the git send-email command to use, or tips on how to reply if the thread isn't in your mailbox, see the "Reply instructions" at https://lkml.kernel.org/r/09f9a305708fa...@google.com
Re: kernel BUG at kernel/cred.c:434!
On Tue, Apr 23, 2019 at 12:08 AM Yang Yingliang wrote: > On 2019/4/23 3:48, Paul Moore wrote: > > On Sat, Apr 20, 2019 at 3:39 AM Yang Yingliang > > wrote: > >> I'm not sure you got my point. > > I went back and looked at your previous emails again to try and > > understand what you are talking about, and I'm a little confused by > > some of the output ... > > > >> --- a/kernel/acct.c > >> +++ b/kernel/acct.c > >> @@ -481,6 +481,7 @@ static void do_acct_process(struct bsd_acct_struct > >> *acct) > >> flim = current->signal->rlim[RLIMIT_FSIZE].rlim_cur; > >> current->signal->rlim[RLIMIT_FSIZE].rlim_cur = RLIM_INFINITY; > >> /* Perform file operations on behalf of whoever enabled > >> accounting */ > >> + pr_info("task:%px new cred:%px real cred:%px cred:%px\n", > >> current, file->f_cred, current->real_cred, current->cred); > >> orig_cred = override_creds(file->f_cred); > > Okay, with this patch applied we should the task/cred info when > > do_acct_process is called. Got it. > > > >> Messages: > >> [ 56.643298] task:88841a9595c0 new cred:88841ae450c0 real > >> cred:88841ae450c0 cred:88841ae450c0//They are same. > > Okay, it looks like do_acct_process() was called and f_cred, > > real_cred, and cred are all the same. > > This is a original message, without patch applied. The patch I am referring to is your pr_info patch (above). I'm not talking about any other patches at the moment; I just want to understand the example dmesg output you copied into your email. With that in mind, the message above seems to indicate that do_acct_process() has been invoked with f_cred, real_cred, and cred all pointing to the same credentials struct, yes? > >> [ 56.646609] Process accounting resumed > > It looks like do_acct_process() has called check_free_space() now. So > > far so good. ... > >> [ 56.649943] task:88841a9595c0 new cred:88841ae450c0 real > >> cred:88841c96c300 cred:88841ae450c0 > > Wait a minute ... why are we seeing this again? Looking at the task > > pointer and the timestamp, this is the same task exiting and trying to > > write to the accounting file, yes? This output is particularly > > curious since it appears that real_cred has changed; where is this > > happening? > > This is the message when the BUG_ON was triggered without applying any > fix patch. The only place in the code that generates this message is the bit of code that you patches in using the pr_info() patch (above), yes? If so, that would seem to indicate that the same task is calling do_acct_process() twice, yes? I may be fundamentally misunderstanding something about process accounting, but I though do_acct_process() would only be called once for a given task - while it was exiting. Yes, no? > If we apply this patch "proc: prevent changes to overridden > credentials", program > runs like this: I'd like to focus on understanding the dmesg output you shared first, because it doesn't seem to make sense to me. -- paul moore www.paul-moore.com
Re: kernel BUG at kernel/cred.c:434!
On 2019/4/23 3:48, Paul Moore wrote: On Sat, Apr 20, 2019 at 3:39 AM Yang Yingliang wrote: I'm not sure you got my point. I went back and looked at your previous emails again to try and understand what you are talking about, and I'm a little confused by some of the output ... --- a/kernel/acct.c +++ b/kernel/acct.c @@ -481,6 +481,7 @@ static void do_acct_process(struct bsd_acct_struct *acct) flim = current->signal->rlim[RLIMIT_FSIZE].rlim_cur; current->signal->rlim[RLIMIT_FSIZE].rlim_cur = RLIM_INFINITY; /* Perform file operations on behalf of whoever enabled accounting */ + pr_info("task:%px new cred:%px real cred:%px cred:%px\n", current, file->f_cred, current->real_cred, current->cred); orig_cred = override_creds(file->f_cred); Okay, with this patch applied we should the task/cred info when do_acct_process is called. Got it. Messages: [ 56.643298] task:88841a9595c0 new cred:88841ae450c0 real cred:88841ae450c0 cred:88841ae450c0//They are same. Okay, it looks like do_acct_process() was called and f_cred, real_cred, and cred are all the same. This is a original message, without patch applied. [ 56.646609] Process accounting resumed It looks like do_acct_process() has called check_free_space() now. So far so good. [ 56.649943] task:88841a9595c0 new cred:88841ae450c0 real cred:88841c96c300 cred:88841ae450c0 Wait a minute ... why are we seeing this again? Looking at the task pointer and the timestamp, this is the same task exiting and trying to write to the accounting file, yes? This output is particularly curious since it appears that real_cred has changed; where is this happening? This is the message when the BUG_ON was triggered without applying any fix patch. If we apply this patch "proc: prevent changes to overridden credentials", program runs like this: 1. As print message shows, before overriden, the pointer has the following value: real_cread=cred=0x88841ae450c0, f_cred=0x88841ae450c0 override_creds() is called in do_acct_process(): ... /* Perform file operations on behalf of whoever enabled accounting */ orig_cred = override_creds(file->f_cred); ... 2. After override_creds(), if (current_cred() != current_real_cred()) is not work here, we will call commit_creds() in security_setprocattr(). ... /* Prevent changes to overridden credentials. */ if (current_cred() != current_real_cred()) { rcu_read_unlock(); return -EBUSY; } ... 3. After commit_creds(), we have new cred and real_cred. security_setprocattr()//commit_creds is called here 4. revert_creds() is called in in do_acct_process(), the cred is reverted to the old value(0x88841ae450c0) ... current->signal->rlim[RLIMIT_FSIZE].rlim_cur = flim; revert_creds(orig_cred); 5. After reverting, cred and real_cred are not equal. If it has a risk to trigger the BUG_ON, when doing another commit_creds() ?
Re: kernel BUG at kernel/cred.c:434!
On Sat, Apr 20, 2019 at 3:39 AM Yang Yingliang wrote: > I'm not sure you got my point. I went back and looked at your previous emails again to try and understand what you are talking about, and I'm a little confused by some of the output ... > --- a/kernel/acct.c > +++ b/kernel/acct.c > @@ -481,6 +481,7 @@ static void do_acct_process(struct bsd_acct_struct > *acct) > flim = current->signal->rlim[RLIMIT_FSIZE].rlim_cur; > current->signal->rlim[RLIMIT_FSIZE].rlim_cur = RLIM_INFINITY; > /* Perform file operations on behalf of whoever enabled > accounting */ > + pr_info("task:%px new cred:%px real cred:%px cred:%px\n", > current, file->f_cred, current->real_cred, current->cred); > orig_cred = override_creds(file->f_cred); Okay, with this patch applied we should the task/cred info when do_acct_process is called. Got it. > Messages: > [ 56.643298] task:88841a9595c0 new cred:88841ae450c0 real > cred:88841ae450c0 cred:88841ae450c0//They are same. Okay, it looks like do_acct_process() was called and f_cred, real_cred, and cred are all the same. > [ 56.646609] Process accounting resumed It looks like do_acct_process() has called check_free_space() now. So far so good. > [ 56.649943] task:88841a9595c0 new cred:88841ae450c0 real > cred:88841c96c300 cred:88841ae450c0 Wait a minute ... why are we seeing this again? Looking at the task pointer and the timestamp, this is the same task exiting and trying to write to the accounting file, yes? This output is particularly curious since it appears that real_cred has changed; where is this happening? > [ 56.653565] [ cut here ] > [ 56.655119] kernel BUG at kernel/cred.c:434! > [ 56.656590] invalid opcode: [#1] SMP PTI > [ 56.658033] CPU: 2 PID: 4169 Comm: syz-executor.15 Not tainted > 5.1.0-rc4-00034-g869e3305f23d-dirty #143 > [ 56.661077] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), > BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014 > [ 56.664895] RIP: 0010:commit_creds+0x1eb/0x230 > [ 56.666344] Code: 43 1c 0f 85 08 ff ff ff e9 10 ff ff ff 8b 45 10 39 > 43 10 0f 85 18 ff ff ff 8b 43 20 39 45 20 0f 85 0c ff ff ff e9 14 ff ff > ff <0f> 0b 48 c7 c7 d0 d2 49 82 e8 17 3b 3e 00 0f 0b 48 c7 c7 c0 d2 49 > [ 56.672410] RSP: 0018:c90003a17b20 EFLAGS: 00010287 > [ 56.674098] RAX: 88841a9595c0 RBX: 88841ae450c0 RCX: > > [ 56.676410] RDX: 0001 RSI: 0020 RDI: > 88841c96ce40 > [ 56.678691] RBP: 0001 R08: 0080 R09: > > [ 56.680997] R10: 88841c9265a0 R11: 810d6940 R12: > 88841a9595c0 > [ 56.681198] task:88841a9195c0 new cred:88841aeaa0c0 real > cred:88841aeaa0c0 cred:88841aeaa0c0 > [ 56.683293] R13: 0040 R14: 88841c96ce40 R15: > 0040 > [ 56.683296] FS: 7f5969a5c700() GS:88842fa8() > knlGS: > [ 56.683297] CS: 0010 DS: ES: CR0: 80050033 > [ 56.683299] CR2: 7f82742214f0 CR3: 00041cbc0005 CR4: > 000206e0 > [ 56.683305] Call Trace: > [ 56.683340] selinux_setprocattr+0x17b/0x480 > [ 56.686513] Process accounting resumed > [ 56.688849] proc_pid_attr_write+0xc0/0xf0 > [ 56.688857] __kernel_write+0x4f/0xf0 > [ 56.688866] do_acct_process+0x538/0x750 > [ 56.703090] ? __schedule+0x290/0x960 > [ 56.704311] ? __queue_work+0x160/0x5c0 > [ 56.705571] acct_pin_kill+0x1e/0x70 > [ 56.706743] pin_kill+0x81/0x150 > [ 56.707813] ? finish_wait+0x80/0x80 > [ 56.708985] mnt_pin_kill+0x1e/0x30 > [ 56.710127] cleanup_mnt+0x6e/0x70 > [ 56.711247] task_work_run+0x8a/0xb0 > [ 56.712453] do_exit+0x2e0/0xc80 > [ 56.713525] do_group_exit+0x33/0xb0 > [ 56.714701] get_signal+0x143/0x810 > [ 56.715865] do_signal+0x36/0x610 > [ 56.716962] ? __x64_sys_futex+0x134/0x180 > [ 56.718307] ? _copy_to_user+0x22/0x30 > [ 56.719606] exit_to_usermode_loop+0x80/0xe0 > [ 56.721003] do_syscall_64+0x16c/0x180 > [ 56.722242] entry_SYSCALL_64_after_hwframe+0x44/0xa9 -- paul moore www.paul-moore.com
Re: kernel BUG at kernel/cred.c:434!
On 2019/4/20 0:13, Paul Moore wrote: On Fri, Apr 19, 2019 at 10:34 AM Yang Yingliang wrote: On 2019/4/19 21:24, Paul Moore wrote: On Thu, Apr 18, 2019 at 10:42 PM Yang Yingliang wrote: On 2019/4/19 10:04, Paul Moore wrote: On Wed, Apr 17, 2019 at 10:50 PM Yang Yingliang wrote: On 2019/4/18 8:24, Casey Schaufler wrote: On 4/17/2019 4:39 PM, Paul Moore wrote: Since it looks like all three LSMs which implement the setprocattr hook are vulnerable I'm open to the idea that proc_pid_attr_write() is a better choice for the cred != read_cred check, but I would want to make sure John and Casey are okay with that. John? Casey? I'm fine with the change going into proc_pid_attr_write(). The cred != real_cred checking is not enough. Consider this situation, when doing override, cred, real_cred and new_cred are all same: after override_creds()cred == real_cred == new1_cred I'm sorry, you've lost me. After override_creds() returns current->cred and current->real_cred are not going to be the same, yes? It's possible the new cred is equal to current->real_cred and current->cred, so after overrides_creds(), they have the same value. Both task_struct.cred and task_struct.real_cred are pointer values, assuming that one uses prepare_creds() to allocate/initialize a new cred struct for use with override_creds() then the newly created cred should never be equal to task_struct.real_cred. Am I missing something, or are you thinking of something else? In do_acct_process(), file->f_cred may equal to current->real_cred, I confirm it by adding some debug message in do_acct_process() like this: I would expect that; real_cred is the task's objective DAC credentials, so using it for f_cred makes sense. What we are now talking about is the task's subjective credentials, which can be overridden via override_creds(), and are what the LSMs change via proc_pid_attr_write(). I'm not sure you got my point. I was saying cred != real_cred check is not quite right, because the cred can be overridden by a same pointer as my print messages showing. "cred != real_cred" means override_creds() is called, but "cred == real_cred" doesn't mean override_creds() is not called. When we use "cred != real_cred" check, we may lost the situation that cred is overridden by a same pointer. In this case, we will do override_creds() => commit_creds() => revert_creds(), this make cred != real_cred, when a new commit_creds() is called, it also will trigger a BUG_ON(). --- a/kernel/acct.c +++ b/kernel/acct.c @@ -481,6 +481,7 @@ static void do_acct_process(struct bsd_acct_struct *acct) flim = current->signal->rlim[RLIMIT_FSIZE].rlim_cur; current->signal->rlim[RLIMIT_FSIZE].rlim_cur = RLIM_INFINITY; /* Perform file operations on behalf of whoever enabled accounting */ + pr_info("task:%px new cred:%px real cred:%px cred:%px\n", current, file->f_cred, current->real_cred, current->cred); orig_cred = override_creds(file->f_cred);
Re: kernel BUG at kernel/cred.c:434!
On Thu, Apr 18, 2019 at 10:42 PM Yang Yingliang wrote: > On 2019/4/19 10:04, Paul Moore wrote: > > On Wed, Apr 17, 2019 at 10:50 PM Yang Yingliang > > wrote: > >> On 2019/4/18 8:24, Casey Schaufler wrote: > >>> On 4/17/2019 4:39 PM, Paul Moore wrote: > Since it looks like all three LSMs which implement the setprocattr > hook are vulnerable I'm open to the idea that proc_pid_attr_write() is > a better choice for the cred != read_cred check, but I would want to > make sure John and Casey are okay with that. > > John? > > Casey? > >>> I'm fine with the change going into proc_pid_attr_write(). > >> The cred != real_cred checking is not enough. > >> > >> Consider this situation, when doing override, cred, real_cred and > >> new_cred are all same: > >> > >> after override_creds()cred == real_cred == new1_cred > > I'm sorry, you've lost me. After override_creds() returns > > current->cred and current->real_cred are not going to be the same, > > yes? > > It's possible the new cred is equal to current->real_cred and > current->cred, > so after overrides_creds(), they have the same value. Both task_struct.cred and task_struct.real_cred are pointer values, assuming that one uses prepare_creds() to allocate/initialize a new cred struct for use with override_creds() then the newly created cred should never be equal to task_struct.real_cred. Am I missing something, or are you thinking of something else? -- paul moore www.paul-moore.com
Re: kernel BUG at kernel/cred.c:434!
On Fri, Apr 19, 2019 at 10:34 AM Yang Yingliang wrote: > On 2019/4/19 21:24, Paul Moore wrote: > > On Thu, Apr 18, 2019 at 10:42 PM Yang Yingliang > > wrote: > >> On 2019/4/19 10:04, Paul Moore wrote: > >>> On Wed, Apr 17, 2019 at 10:50 PM Yang Yingliang > >>> wrote: > On 2019/4/18 8:24, Casey Schaufler wrote: > > On 4/17/2019 4:39 PM, Paul Moore wrote: > >> Since it looks like all three LSMs which implement the setprocattr > >> hook are vulnerable I'm open to the idea that proc_pid_attr_write() is > >> a better choice for the cred != read_cred check, but I would want to > >> make sure John and Casey are okay with that. > >> > >> John? > >> > >> Casey? > > I'm fine with the change going into proc_pid_attr_write(). > The cred != real_cred checking is not enough. > > Consider this situation, when doing override, cred, real_cred and > new_cred are all same: > > after override_creds()cred == real_cred == new1_cred > >>> I'm sorry, you've lost me. After override_creds() returns > >>> current->cred and current->real_cred are not going to be the same, > >>> yes? > >> It's possible the new cred is equal to current->real_cred and > >> current->cred, > >> so after overrides_creds(), they have the same value. > > Both task_struct.cred and task_struct.real_cred are pointer values, > > assuming that one uses prepare_creds() to allocate/initialize a new > > cred struct for use with override_creds() then the newly created cred > > should never be equal to task_struct.real_cred. Am I missing > > something, or are you thinking of something else? > > In do_acct_process(), file->f_cred may equal to current->real_cred, I > confirm > it by adding some debug message in do_acct_process() like this: I would expect that; real_cred is the task's objective DAC credentials, so using it for f_cred makes sense. What we are now talking about is the task's subjective credentials, which can be overridden via override_creds(), and are what the LSMs change via proc_pid_attr_write(). > --- a/kernel/acct.c > +++ b/kernel/acct.c > @@ -481,6 +481,7 @@ static void do_acct_process(struct bsd_acct_struct > *acct) > flim = current->signal->rlim[RLIMIT_FSIZE].rlim_cur; > current->signal->rlim[RLIMIT_FSIZE].rlim_cur = RLIM_INFINITY; > /* Perform file operations on behalf of whoever enabled > accounting */ > + pr_info("task:%px new cred:%px real cred:%px cred:%px\n", > current, file->f_cred, current->real_cred, current->cred); > orig_cred = override_creds(file->f_cred); -- paul moore www.paul-moore.com
Re: kernel BUG at kernel/cred.c:434!
On 2019/4/19 21:24, Paul Moore wrote: On Thu, Apr 18, 2019 at 10:42 PM Yang Yingliang wrote: On 2019/4/19 10:04, Paul Moore wrote: On Wed, Apr 17, 2019 at 10:50 PM Yang Yingliang wrote: On 2019/4/18 8:24, Casey Schaufler wrote: On 4/17/2019 4:39 PM, Paul Moore wrote: Since it looks like all three LSMs which implement the setprocattr hook are vulnerable I'm open to the idea that proc_pid_attr_write() is a better choice for the cred != read_cred check, but I would want to make sure John and Casey are okay with that. John? Casey? I'm fine with the change going into proc_pid_attr_write(). The cred != real_cred checking is not enough. Consider this situation, when doing override, cred, real_cred and new_cred are all same: after override_creds()cred == real_cred == new1_cred I'm sorry, you've lost me. After override_creds() returns current->cred and current->real_cred are not going to be the same, yes? It's possible the new cred is equal to current->real_cred and current->cred, so after overrides_creds(), they have the same value. Both task_struct.cred and task_struct.real_cred are pointer values, assuming that one uses prepare_creds() to allocate/initialize a new cred struct for use with override_creds() then the newly created cred should never be equal to task_struct.real_cred. Am I missing something, or are you thinking of something else? In do_acct_process(), file->f_cred may equal to current->real_cred, I confirm it by adding some debug message in do_acct_process() like this: --- a/kernel/acct.c +++ b/kernel/acct.c @@ -481,6 +481,7 @@ static void do_acct_process(struct bsd_acct_struct *acct) flim = current->signal->rlim[RLIMIT_FSIZE].rlim_cur; current->signal->rlim[RLIMIT_FSIZE].rlim_cur = RLIM_INFINITY; /* Perform file operations on behalf of whoever enabled accounting */ + pr_info("task:%px new cred:%px real cred:%px cred:%px\n", current, file->f_cred, current->real_cred, current->cred); orig_cred = override_creds(file->f_cred); Messages: [ 56.643298] task:88841a9595c0 new cred:88841ae450c0 real cred:88841ae450c0 cred:88841ae450c0//They are same. [ 56.646609] Process accounting resumed [ 56.649943] task:88841a9595c0 new cred:88841ae450c0 real cred:88841c96c300 cred:88841ae450c0 [ 56.653565] --------[ cut here ] [ 56.655119] kernel BUG at kernel/cred.c:434! [ 56.656590] invalid opcode: [#1] SMP PTI [ 56.658033] CPU: 2 PID: 4169 Comm: syz-executor.15 Not tainted 5.1.0-rc4-00034-g869e3305f23d-dirty #143 [ 56.661077] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014 [ 56.664895] RIP: 0010:commit_creds+0x1eb/0x230 [ 56.666344] Code: 43 1c 0f 85 08 ff ff ff e9 10 ff ff ff 8b 45 10 39 43 10 0f 85 18 ff ff ff 8b 43 20 39 45 20 0f 85 0c ff ff ff e9 14 ff ff ff <0f> 0b 48 c7 c7 d0 d2 49 82 e8 17 3b 3e 00 0f 0b 48 c7 c7 c0 d2 49 [ 56.672410] RSP: 0018:c90003a17b20 EFLAGS: 00010287 [ 56.674098] RAX: 88841a9595c0 RBX: 88841ae450c0 RCX: [ 56.676410] RDX: 0001 RSI: 0020 RDI: 88841c96ce40 [ 56.678691] RBP: 0001 R08: 0080 R09: [ 56.680997] R10: 88841c9265a0 R11: 810d6940 R12: 88841a9595c0 [ 56.681198] task:88841a9195c0 new cred:88841aeaa0c0 real cred:88841aeaa0c0 cred:88841aeaa0c0 [ 56.683293] R13: 0040 R14: 88841c96ce40 R15: 0040 [ 56.683296] FS: 7f5969a5c700() GS:88842fa8() knlGS: [ 56.683297] CS: 0010 DS: ES: CR0: 80050033 [ 56.683299] CR2: 7f82742214f0 CR3: 00041cbc0005 CR4: 000206e0 [ 56.683305] Call Trace: [ 56.683340] selinux_setprocattr+0x17b/0x480 [ 56.686513] Process accounting resumed [ 56.688849] proc_pid_attr_write+0xc0/0xf0 [ 56.688857] __kernel_write+0x4f/0xf0 [ 56.688866] do_acct_process+0x538/0x750 [ 56.703090] ? __schedule+0x290/0x960 [ 56.704311] ? __queue_work+0x160/0x5c0 [ 56.705571] acct_pin_kill+0x1e/0x70 [ 56.706743] pin_kill+0x81/0x150 [ 56.707813] ? finish_wait+0x80/0x80 [ 56.708985] mnt_pin_kill+0x1e/0x30 [ 56.710127] cleanup_mnt+0x6e/0x70 [ 56.711247] task_work_run+0x8a/0xb0 [ 56.712453] do_exit+0x2e0/0xc80 [ 56.713525] do_group_exit+0x33/0xb0 [ 56.714701] get_signal+0x143/0x810 [ 56.715865] do_signal+0x36/0x610 [ 56.716962] ? __x64_sys_futex+0x134/0x180 [ 56.718307] ? _copy_to_user+0x22/0x30 [ 56.719606] exit_to_usermode_loop+0x80/0xe0 [ 56.721003] do_syscall_64+0x16c/0x180 [ 56.722242] entry_SYSCALL_64_after_hwframe+0x44/0xa9
Re: kernel BUG at kernel/cred.c:434!
On 2019/4/19 10:04, Paul Moore wrote: On Wed, Apr 17, 2019 at 10:50 PM Yang Yingliang wrote: On 2019/4/18 8:24, Casey Schaufler wrote: On 4/17/2019 4:39 PM, Paul Moore wrote: Since it looks like all three LSMs which implement the setprocattr hook are vulnerable I'm open to the idea that proc_pid_attr_write() is a better choice for the cred != read_cred check, but I would want to make sure John and Casey are okay with that. John? Casey? I'm fine with the change going into proc_pid_attr_write(). The cred != real_cred checking is not enough. Consider this situation, when doing override, cred, real_cred and new_cred are all same: after override_creds()cred == real_cred == new1_cred I'm sorry, you've lost me. After override_creds() returns current->cred and current->real_cred are not going to be the same, yes? It's possible the new cred is equal to current->real_cred and current->cred, so after overrides_creds(), they have the same value. after prepare_creds() new2_cred after commit_creds() becasue the check is false, so cred == real_cred == new2_cred after revert_creds()cred == new1_cred, real_cred == new2_cred It will cause cred != real_cred finally.
Re: kernel BUG at kernel/cred.c:434!
On Wed, Apr 17, 2019 at 10:50 PM Yang Yingliang wrote: > On 2019/4/18 8:24, Casey Schaufler wrote: > > On 4/17/2019 4:39 PM, Paul Moore wrote: > >> > >> Since it looks like all three LSMs which implement the setprocattr > >> hook are vulnerable I'm open to the idea that proc_pid_attr_write() is > >> a better choice for the cred != read_cred check, but I would want to > >> make sure John and Casey are okay with that. > >> > >> John? > >> > >> Casey? > > > > I'm fine with the change going into proc_pid_attr_write(). > > The cred != real_cred checking is not enough. > > Consider this situation, when doing override, cred, real_cred and > new_cred are all same: > > after override_creds()cred == real_cred == new1_cred I'm sorry, you've lost me. After override_creds() returns current->cred and current->real_cred are not going to be the same, yes? > after prepare_creds() new2_cred > after commit_creds() becasue the check is false, so cred == > real_cred == new2_cred > after revert_creds()cred == new1_cred, real_cred == new2_cred > > It will cause cred != real_cred finally. -- paul moore www.paul-moore.com
Re: kernel BUG at kernel/cred.c:434!
On 4/17/19 12:42 PM, Casey Schaufler wrote: On 4/17/2019 9:27 AM, Oleg Nesterov wrote: On 04/17, Paul Moore wrote: On Wed, Apr 17, 2019 at 10:57 AM Oleg Nesterov wrote: On 04/17, Paul Moore wrote: I'm tempted to simply return an error in selinux_setprocattr() if the task's credentials are not the same as its real_cred; What about other modules? I have no idea what smack_setprocattr() is, but it too does prepare_creds/commit creds. it seems that the simplest workaround should simply add the additional cred == real_cred into proc_pid_attr_write(). Yes, that is simple, but I worry about what other LSMs might want to do. While I believe failing if the effective creds are not the same as the real_creds is okay for SELinux (possibly Smack too), I worry about what other LSMs may want to do. After all, proc_pid_attr_write() doesn't change the the creds itself, that is something the specific LSMs do. Yes, but if proc_pid_attr_write() is called with cred != real_cred then something is already wrong? In fact, I think that something is already wrong if it is not called by user-space directly. Too late to ask, but why is this /proc/self/attr/ magic not implemented via syscall(s) ? Shell scripts, for one thing. It's a straightforward and appropriate use of the /proc interface. System calls would require additional change to existing programs, whereas using the /proc interface allows a good deal to be done in the containing scripts. It is fairly awkward/fragile to use these interfaces from shell scripts due to limitations of the interface (e.g. no partial writes, thereby breaking if the shell splits up the data into multiple write() calls) and due to the fact that most of the attributes (except for current) are typically reset/cleared upon exec to prevent undue caller/callee influence. It works well enough for echo -n "somelabel" > /proc/self/attr/current but not much else, and even that doesn't work without the -n due to multiple write() calls. Just for the record, this functionality was originally implemented via separate system calls at least for SELinux (in its original kernel patch), then multiplexed through the single LSM security system call upon porting to LSM (before merging to mainline), but viro and others strongly favored using /proc as the interface for getting/setting any new process attributes. Understandable, but it does impose some limitations, including a dependency on having a writable proc mount in the namespace of any process that needs to set any of these attributes,
Re: kernel BUG at kernel/cred.c:434!
Hi, Casey On 2019/4/18 8:24, Casey Schaufler wrote: On 4/17/2019 4:39 PM, Paul Moore wrote: On Wed, Apr 17, 2019 at 12:27 PM Oleg Nesterov wrote: On 04/17, Paul Moore wrote: On Wed, Apr 17, 2019 at 10:57 AM Oleg Nesterov wrote: On 04/17, Paul Moore wrote: I'm tempted to simply return an error in selinux_setprocattr() if the task's credentials are not the same as its real_cred; What about other modules? I have no idea what smack_setprocattr() is, but it too does prepare_creds/commit creds. it seems that the simplest workaround should simply add the additional cred == real_cred into proc_pid_attr_write(). Yes, that is simple, but I worry about what other LSMs might want to do. While I believe failing if the effective creds are not the same as the real_creds is okay for SELinux (possibly Smack too), I worry about what other LSMs may want to do. After all, proc_pid_attr_write() doesn't change the the creds itself, that is something the specific LSMs do. Yes, but if proc_pid_attr_write() is called with cred != real_cred then something is already wrong? True, or at least I would think so. Looking at the current tree there are three LSMs which implement setprocattr hooks: SELinux, Smack, and AppArmor. I know Casey has already mentioned that he wasn't able to trigger the problem in Smack, but looking at smack_setprocattr() I see the similar commit_creds() usage so I would expect the same problem in Smack; what say you Casey? I say that my test program runs without ill effect. I call acct() with "/proc/self/attr/current", which succeeds and enables accounting just like it is supposed to. I then have the program open "/proc/self/attr/current" and read it, all of which goes swimmingly. When Smack frees a cred it usually does not free any memory of its own, so it is conceivable that I'm just getting lucky. Or, I may not have sufficient debug enabled. Looking at apparmor_setprocattr(), it appears that it too could end up calling commit_creds() via aa_set_current_hat(). Since it looks like all three LSMs which implement the setprocattr hook are vulnerable I'm open to the idea that proc_pid_attr_write() is a better choice for the cred != read_cred check, but I would want to make sure John and Casey are okay with that. John? Casey? I'm fine with the change going into proc_pid_attr_write(). The cred != real_cred checking is not enough. Consider this situation, when doing override, cred, real_cred and new_cred are all same: after override_creds()cred == real_cred == new1_cred after prepare_creds() new2_cred after commit_creds() becasue the check is false, so cred == real_cred == new2_cred after revert_creds()cred == new1_cred, real_cred == new2_cred It will cause cred != real_cred finally. Regards, Yang
Re: kernel BUG at kernel/cred.c:434!
On 4/17/2019 4:39 PM, Paul Moore wrote: On Wed, Apr 17, 2019 at 12:27 PM Oleg Nesterov wrote: On 04/17, Paul Moore wrote: On Wed, Apr 17, 2019 at 10:57 AM Oleg Nesterov wrote: On 04/17, Paul Moore wrote: I'm tempted to simply return an error in selinux_setprocattr() if the task's credentials are not the same as its real_cred; What about other modules? I have no idea what smack_setprocattr() is, but it too does prepare_creds/commit creds. it seems that the simplest workaround should simply add the additional cred == real_cred into proc_pid_attr_write(). Yes, that is simple, but I worry about what other LSMs might want to do. While I believe failing if the effective creds are not the same as the real_creds is okay for SELinux (possibly Smack too), I worry about what other LSMs may want to do. After all, proc_pid_attr_write() doesn't change the the creds itself, that is something the specific LSMs do. Yes, but if proc_pid_attr_write() is called with cred != real_cred then something is already wrong? True, or at least I would think so. Looking at the current tree there are three LSMs which implement setprocattr hooks: SELinux, Smack, and AppArmor. I know Casey has already mentioned that he wasn't able to trigger the problem in Smack, but looking at smack_setprocattr() I see the similar commit_creds() usage so I would expect the same problem in Smack; what say you Casey? I say that my test program runs without ill effect. I call acct() with "/proc/self/attr/current", which succeeds and enables accounting just like it is supposed to. I then have the program open "/proc/self/attr/current" and read it, all of which goes swimmingly. When Smack frees a cred it usually does not free any memory of its own, so it is conceivable that I'm just getting lucky. Or, I may not have sufficient debug enabled. Looking at apparmor_setprocattr(), it appears that it too could end up calling commit_creds() via aa_set_current_hat(). Since it looks like all three LSMs which implement the setprocattr hook are vulnerable I'm open to the idea that proc_pid_attr_write() is a better choice for the cred != read_cred check, but I would want to make sure John and Casey are okay with that. John? Casey? I'm fine with the change going into proc_pid_attr_write().
Re: kernel BUG at kernel/cred.c:434!
On 4/17/19 4:39 PM, Paul Moore wrote: > On Wed, Apr 17, 2019 at 12:27 PM Oleg Nesterov wrote: >> On 04/17, Paul Moore wrote: >>> >>> On Wed, Apr 17, 2019 at 10:57 AM Oleg Nesterov wrote: On 04/17, Paul Moore wrote: > > I'm tempted to simply return an error in selinux_setprocattr() if > the task's credentials are not the same as its real_cred; What about other modules? I have no idea what smack_setprocattr() is, but it too does prepare_creds/commit creds. it seems that the simplest workaround should simply add the additional cred == real_cred into proc_pid_attr_write(). >>> >>> Yes, that is simple, but I worry about what other LSMs might want to >>> do. While I believe failing if the effective creds are not the same >>> as the real_creds is okay for SELinux (possibly Smack too), I worry >>> about what other LSMs may want to do. After all, >>> proc_pid_attr_write() doesn't change the the creds itself, that is >>> something the specific LSMs do. >> >> Yes, but if proc_pid_attr_write() is called with cred != real_cred then >> something is already wrong? > > True, or at least I would think so. > > Looking at the current tree there are three LSMs which implement > setprocattr hooks: SELinux, Smack, and AppArmor. I know Casey has > already mentioned that he wasn't able to trigger the problem in Smack, > but looking at smack_setprocattr() I see the similar commit_creds() > usage so I would expect the same problem in Smack; what say you Casey? > Looking at apparmor_setprocattr(), it appears that it too could end > up calling commit_creds() via aa_set_current_hat(). > > Since it looks like all three LSMs which implement the setprocattr > hook are vulnerable I'm open to the idea that proc_pid_attr_write() is > a better choice for the cred != read_cred check, but I would want to > make sure John and Casey are okay with that. > > John? heh yeah, seems I messed up our check, we actually have if (current_cred() != current_real_cred()) return -EBUSY; on the change_profile path and missed it on the change_hat one. It makes sense to lift the check up earlier
Re: kernel BUG at kernel/cred.c:434!
On Wed, Apr 17, 2019 at 12:27 PM Oleg Nesterov wrote: > On 04/17, Paul Moore wrote: > > > > On Wed, Apr 17, 2019 at 10:57 AM Oleg Nesterov wrote: > > > On 04/17, Paul Moore wrote: > > > > > > > > I'm tempted to simply return an error in selinux_setprocattr() if > > > > the task's credentials are not the same as its real_cred; > > > > > > What about other modules? I have no idea what smack_setprocattr() is, > > > but it too does prepare_creds/commit creds. > > > > > > it seems that the simplest workaround should simply add the additional > > > cred == real_cred into proc_pid_attr_write(). > > > > Yes, that is simple, but I worry about what other LSMs might want to > > do. While I believe failing if the effective creds are not the same > > as the real_creds is okay for SELinux (possibly Smack too), I worry > > about what other LSMs may want to do. After all, > > proc_pid_attr_write() doesn't change the the creds itself, that is > > something the specific LSMs do. > > Yes, but if proc_pid_attr_write() is called with cred != real_cred then > something is already wrong? True, or at least I would think so. Looking at the current tree there are three LSMs which implement setprocattr hooks: SELinux, Smack, and AppArmor. I know Casey has already mentioned that he wasn't able to trigger the problem in Smack, but looking at smack_setprocattr() I see the similar commit_creds() usage so I would expect the same problem in Smack; what say you Casey? Looking at apparmor_setprocattr(), it appears that it too could end up calling commit_creds() via aa_set_current_hat(). Since it looks like all three LSMs which implement the setprocattr hook are vulnerable I'm open to the idea that proc_pid_attr_write() is a better choice for the cred != read_cred check, but I would want to make sure John and Casey are okay with that. John? Casey? -- paul moore www.paul-moore.com
Re: kernel BUG at kernel/cred.c:434!
On 4/17/2019 9:27 AM, Oleg Nesterov wrote: On 04/17, Paul Moore wrote: On Wed, Apr 17, 2019 at 10:57 AM Oleg Nesterov wrote: On 04/17, Paul Moore wrote: I'm tempted to simply return an error in selinux_setprocattr() if the task's credentials are not the same as its real_cred; What about other modules? I have no idea what smack_setprocattr() is, but it too does prepare_creds/commit creds. it seems that the simplest workaround should simply add the additional cred == real_cred into proc_pid_attr_write(). Yes, that is simple, but I worry about what other LSMs might want to do. While I believe failing if the effective creds are not the same as the real_creds is okay for SELinux (possibly Smack too), I worry about what other LSMs may want to do. After all, proc_pid_attr_write() doesn't change the the creds itself, that is something the specific LSMs do. Yes, but if proc_pid_attr_write() is called with cred != real_cred then something is already wrong? In fact, I think that something is already wrong if it is not called by user-space directly. Too late to ask, but why is this /proc/self/attr/ magic not implemented via syscall(s) ? Shell scripts, for one thing. It's a straightforward and appropriate use of the /proc interface. System calls would require additional change to existing programs, whereas using the /proc interface allows a good deal to be done in the containing scripts.
Re: kernel BUG at kernel/cred.c:434!
On 04/17, Paul Moore wrote: > > On Wed, Apr 17, 2019 at 10:57 AM Oleg Nesterov wrote: > > On 04/17, Paul Moore wrote: > > > > > > I'm tempted to simply return an error in selinux_setprocattr() if > > > the task's credentials are not the same as its real_cred; > > > > What about other modules? I have no idea what smack_setprocattr() is, > > but it too does prepare_creds/commit creds. > > > > it seems that the simplest workaround should simply add the additional > > cred == real_cred into proc_pid_attr_write(). > > Yes, that is simple, but I worry about what other LSMs might want to > do. While I believe failing if the effective creds are not the same > as the real_creds is okay for SELinux (possibly Smack too), I worry > about what other LSMs may want to do. After all, > proc_pid_attr_write() doesn't change the the creds itself, that is > something the specific LSMs do. Yes, but if proc_pid_attr_write() is called with cred != real_cred then something is already wrong? In fact, I think that something is already wrong if it is not called by user-space directly. Too late to ask, but why is this /proc/self/attr/ magic not implemented via syscall(s) ? But, Paul, this is up to you. I don't understand this all even remotely. Oleg.
Re: kernel BUG at kernel/cred.c:434!
On Wed, Apr 17, 2019 at 10:57 AM Oleg Nesterov wrote: > On 04/17, Paul Moore wrote: > > > > I'm tempted to simply return an error in selinux_setprocattr() if > > the task's credentials are not the same as its real_cred; > > What about other modules? I have no idea what smack_setprocattr() is, > but it too does prepare_creds/commit creds. > > it seems that the simplest workaround should simply add the additional > cred == real_cred into proc_pid_attr_write(). Yes, that is simple, but I worry about what other LSMs might want to do. While I believe failing if the effective creds are not the same as the real_creds is okay for SELinux (possibly Smack too), I worry about what other LSMs may want to do. After all, proc_pid_attr_write() doesn't change the the creds itself, that is something the specific LSMs do. -- paul moore www.paul-moore.com
Re: kernel BUG at kernel/cred.c:434!
On 4/17/2019 7:57 AM, Oleg Nesterov wrote: On 04/17, Paul Moore wrote: I'm tempted to simply return an error in selinux_setprocattr() if the task's credentials are not the same as its real_cred; What about other modules? I have no idea what smack_setprocattr() is, but it too does prepare_creds/commit creds. For what it's worth, my test for Smack does not reproduce the problem. it seems that the simplest workaround should simply add the additional cred == real_cred into proc_pid_attr_write(). Oleg.
Re: kernel BUG at kernel/cred.c:434!
On 04/17, Paul Moore wrote: > > I'm tempted to simply return an error in selinux_setprocattr() if > the task's credentials are not the same as its real_cred; What about other modules? I have no idea what smack_setprocattr() is, but it too does prepare_creds/commit creds. it seems that the simplest workaround should simply add the additional cred == real_cred into proc_pid_attr_write(). Oleg.
Re: kernel BUG at kernel/cred.c:434!
On Tue, Apr 16, 2019 at 10:46 AM chengjian (D) wrote: > On 2019/4/16 11:40, Kees Cook wrote: > > On Mon, Apr 15, 2019 at 11:20 AM Paul Moore wrote: > >> On Mon, Apr 15, 2019 at 11:05 AM Oleg Nesterov wrote: > >>> On 04/15, Paul Moore wrote: > On Mon, Apr 15, 2019 at 9:43 AM Oleg Nesterov wrote: > > Well, acct("/proc/self/attr/current") doesn't look like a good idea, > > but I do > > not know where should we put the additional check... And probably > > "echo /proc/self/attr/current > /proc/sys/kernel/core_pattern" can hit > > the > > same problem, do_coredump() does override_creds() too. > > > > May be just add > > > > if (current->cred != current->real_cred) > > return -EACCES; > > > > into proc_pid_attr_write(), I dunno. > Is the problem that do_acct_process() is calling override_creds() and > the returned/old credentials are being freed before do_acct_process() > can reinstall the creds via revert_creds()? Presumably because the > process accounting is causing the credentials to be replaced? > >>> Afaics, the problem is that do_acct_process() does override_creds() and > >>> then __kernel_write(). Which calls proc_pid_attr_write(), which in turn > >>> calls > >>> selinux_setprocattr(), which does another prepare_creds() + > >>> commit_creds(); > >>> and commit_creds() hits > >>> > >>> BUG_ON(task->cred != old); > >> Gotcha. In the process of looking at the backtrace I forgot about the > >> BUG_ON() at the top of the oops message. > >> > >> I wonder what terrible things would happen if we changed the BUG_ON() > >> in commit_creds to simple returning an error an error code to the > >> caller. There is a warning/requirement in commit_creds() function > >> header comment that it should always return 0. > > Would callers be expected to call abort_creds() on failure? There are > > a number of places where it'd need fixing up. And would likely be best > > with a __must_check marking. > > > > It seems like avoiding the pathological case might be simpler? > > Yeah, Avoiding this pathological case is a better solution. No arguments that this is particularly messed up scenario, I'm just trying to arrive at a solution that isn't too ugly. > From: Yang Yingliang > Date: Sat, 13 Apr 2019 21:56:01 +0800 > Subject: [PATCH] fix cred bug_on > > Signed-off-by: Yang Yingliang > --- > kernel/acct.c| 3 ++- > kernel/cred.c| 6 ++ > security/selinux/hooks.c | 2 ++ > 3 files changed, 10 insertions(+), 1 deletion(-) > > diff --git a/kernel/acct.c b/kernel/acct.c > index addf7732fb56..f2065f899eee 100644 > --- a/kernel/acct.c > +++ b/kernel/acct.c > @@ -522,7 +522,8 @@ static void do_acct_process(struct bsd_acct_struct > *acct) > } > out: > current->signal->rlim[RLIMIT_FSIZE].rlim_cur = flim; > -revert_creds(orig_cred); > +if (orig_cred == current->real_cred)// [2] > +revert_creds(orig_cred); > } > > /** > diff --git a/kernel/cred.c b/kernel/cred.c > index ecf03657e71c..c4d5ba92fb9b 100644 > --- a/kernel/cred.c > +++ b/kernel/cred.c > @@ -522,6 +522,9 @@ const struct cred *override_creds(const struct cred > *new) > { > const struct cred *old = current->cred; > > +if (old == new)// [3] > +return old; > + > kdebug("override_creds(%p{%d,%d})", new, > atomic_read(>usage), > read_cred_subscribers(new)); > @@ -551,6 +554,9 @@ void revert_creds(const struct cred *old) > { > const struct cred *override = current->cred; > > +if (override == old)// [3] > +return; > + > kdebug("revert_creds(%p{%d,%d})", old, > atomic_read(>usage), > read_cred_subscribers(old)); > diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c > index b5017beb4ef7..bc8108e4e90f 100644 > --- a/security/selinux/hooks.c > +++ b/security/selinux/hooks.c > @@ -6590,6 +6590,8 @@ static int selinux_setprocattr(const char *name, > void *value, size_t size) > goto abort_change; > } > > +if (current->cred != current->real_cred)// [1] > +revert_creds(current->real_cred); > commit_creds(new); > return size; Doing the revert only to then commit the creds seems really ugly to me. I'm tempted to simply return an error in selinux_setprocattr() if the task's credentials are not the same as its real_cred; if we do that I believe we should resolve this problem. The accounting write to the SELinux file in /proc would fail of course, but I think we can all consider that as a positive side-effect. While I don't think this should have a negative impact on anything else, I haven't convinced myself of that just yet. -- paul moore www.paul-moore.com
Re: kernel BUG at kernel/cred.c:434!
On 2019/4/16 11:40, Kees Cook wrote: On Mon, Apr 15, 2019 at 11:20 AM Paul Moore wrote: On Mon, Apr 15, 2019 at 11:05 AM Oleg Nesterov wrote: On 04/15, Paul Moore wrote: On Mon, Apr 15, 2019 at 9:43 AM Oleg Nesterov wrote: Well, acct("/proc/self/attr/current") doesn't look like a good idea, but I do not know where should we put the additional check... And probably "echo /proc/self/attr/current > /proc/sys/kernel/core_pattern" can hit the same problem, do_coredump() does override_creds() too. May be just add if (current->cred != current->real_cred) return -EACCES; into proc_pid_attr_write(), I dunno. Is the problem that do_acct_process() is calling override_creds() and the returned/old credentials are being freed before do_acct_process() can reinstall the creds via revert_creds()? Presumably because the process accounting is causing the credentials to be replaced? Afaics, the problem is that do_acct_process() does override_creds() and then __kernel_write(). Which calls proc_pid_attr_write(), which in turn calls selinux_setprocattr(), which does another prepare_creds() + commit_creds(); and commit_creds() hits BUG_ON(task->cred != old); Gotcha. In the process of looking at the backtrace I forgot about the BUG_ON() at the top of the oops message. I wonder what terrible things would happen if we changed the BUG_ON() in commit_creds to simple returning an error an error code to the caller. There is a warning/requirement in commit_creds() function header comment that it should always return 0. Would callers be expected to call abort_creds() on failure? There are a number of places where it'd need fixing up. And would likely be best with a __must_check marking. It seems like avoiding the pathological case might be simpler? Yeah, Avoiding this pathological case is a better solution. It seems like that we can't commit_creds() during override_creds() and revert_creds(). So how about just put commit_creds outside ! just like: override_creds() // cred -=> new // may BUG_ON if commit_creds done. revert_creds() // cred -=> old <---| commit_creds // cred = real_cred = new | [revert_creds]--| [1]--Before we call commit_creds in selinux_setprocattr(), if we find that cred != real_cred, it may have been overridden before, we should revert it. [2]--The same to revert_creds, when we found someone have committed, orig_cred != current->real_cred may hits, this means that we have reverted before(see [1]). [3]--Sometimes new and old are the same, then we need to consider this situation specially. The code just like: From: Yang Yingliang Date: Sat, 13 Apr 2019 21:56:01 +0800 Subject: [PATCH] fix cred bug_on Signed-off-by: Yang Yingliang --- kernel/acct.c | 3 ++- kernel/cred.c | 6 ++ security/selinux/hooks.c | 2 ++ 3 files changed, 10 insertions(+), 1 deletion(-) diff --git a/kernel/acct.c b/kernel/acct.c index addf7732fb56..f2065f899eee 100644 --- a/kernel/acct.c +++ b/kernel/acct.c @@ -522,7 +522,8 @@ static void do_acct_process(struct bsd_acct_struct *acct) } out: current->signal->rlim[RLIMIT_FSIZE].rlim_cur = flim; - revert_creds(orig_cred); + if (orig_cred == current->real_cred) // [2] + revert_creds(orig_cred); } /** diff --git a/kernel/cred.c b/kernel/cred.c index ecf03657e71c..c4d5ba92fb9b 100644 --- a/kernel/cred.c +++ b/kernel/cred.c @@ -522,6 +522,9 @@ const struct cred *override_creds(const struct cred *new) { const struct cred *old = current->cred; + if (old == new) // [3] + return old; + kdebug("override_creds(%p{%d,%d})", new, atomic_read(>usage), read_cred_subscribers(new)); @@ -551,6 +554,9 @@ void revert_creds(const struct cred *old) { const struct cred *override = current->cred; + if (override == old) // [3] + return; + kdebug("revert_creds(%p{%d,%d})", old, atomic_read(>usage), read_cred_subscribers(old)); diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c index b5017beb4ef7..bc8108e4e90f 100644 --- a/security/selinux/hooks.c +++ b/security/selinux/hooks.c @@ -6590,6 +6590,8 @@ static int selinux_setprocattr(const char *name, void *value, size_t size) goto abort_change; } + if (current->cred != current->real_cred) // [1] + revert_creds(current->real_cred); commit_creds(new); return size; -- 2.17.1 We have tested this patch for 3 days and it works well. Are there any cases that are not covered here ? Thanks. Cheng Jian
Re: kernel BUG at kernel/cred.c:434!
On Mon, Apr 15, 2019 at 11:20 AM Paul Moore wrote: > > On Mon, Apr 15, 2019 at 11:05 AM Oleg Nesterov wrote: > > On 04/15, Paul Moore wrote: > > > > > > On Mon, Apr 15, 2019 at 9:43 AM Oleg Nesterov wrote: > > > > Well, acct("/proc/self/attr/current") doesn't look like a good idea, > > > > but I do > > > > not know where should we put the additional check... And probably > > > > "echo /proc/self/attr/current > /proc/sys/kernel/core_pattern" can hit > > > > the > > > > same problem, do_coredump() does override_creds() too. > > > > > > > > May be just add > > > > > > > > if (current->cred != current->real_cred) > > > > return -EACCES; > > > > > > > > into proc_pid_attr_write(), I dunno. > > > > > > Is the problem that do_acct_process() is calling override_creds() and > > > the returned/old credentials are being freed before do_acct_process() > > > can reinstall the creds via revert_creds()? Presumably because the > > > process accounting is causing the credentials to be replaced? > > > > Afaics, the problem is that do_acct_process() does override_creds() and > > then __kernel_write(). Which calls proc_pid_attr_write(), which in turn > > calls > > selinux_setprocattr(), which does another prepare_creds() + commit_creds(); > > and commit_creds() hits > > > > BUG_ON(task->cred != old); > > Gotcha. In the process of looking at the backtrace I forgot about the > BUG_ON() at the top of the oops message. > > I wonder what terrible things would happen if we changed the BUG_ON() > in commit_creds to simple returning an error an error code to the > caller. There is a warning/requirement in commit_creds() function > header comment that it should always return 0. Would callers be expected to call abort_creds() on failure? There are a number of places where it'd need fixing up. And would likely be best with a __must_check marking. It seems like avoiding the pathological case might be simpler? -- Kees Cook
Re: kernel BUG at kernel/cred.c:434!
On Mon, Apr 15, 2019 at 11:05 AM Oleg Nesterov wrote: > On 04/15, Paul Moore wrote: > > > > On Mon, Apr 15, 2019 at 9:43 AM Oleg Nesterov wrote: > > > Well, acct("/proc/self/attr/current") doesn't look like a good idea, but > > > I do > > > not know where should we put the additional check... And probably > > > "echo /proc/self/attr/current > /proc/sys/kernel/core_pattern" can hit the > > > same problem, do_coredump() does override_creds() too. > > > > > > May be just add > > > > > > if (current->cred != current->real_cred) > > > return -EACCES; > > > > > > into proc_pid_attr_write(), I dunno. > > > > Is the problem that do_acct_process() is calling override_creds() and > > the returned/old credentials are being freed before do_acct_process() > > can reinstall the creds via revert_creds()? Presumably because the > > process accounting is causing the credentials to be replaced? > > Afaics, the problem is that do_acct_process() does override_creds() and > then __kernel_write(). Which calls proc_pid_attr_write(), which in turn calls > selinux_setprocattr(), which does another prepare_creds() + commit_creds(); > and commit_creds() hits > > BUG_ON(task->cred != old); Gotcha. In the process of looking at the backtrace I forgot about the BUG_ON() at the top of the oops message. I wonder what terrible things would happen if we changed the BUG_ON() in commit_creds to simple returning an error an error code to the caller. There is a warning/requirement in commit_creds() function header comment that it should always return 0. -- paul moore www.paul-moore.com
Re: kernel BUG at kernel/cred.c:434!
On 04/15, Paul Moore wrote: > > On Mon, Apr 15, 2019 at 9:43 AM Oleg Nesterov wrote: > > Well, acct("/proc/self/attr/current") doesn't look like a good idea, but I > > do > > not know where should we put the additional check... And probably > > "echo /proc/self/attr/current > /proc/sys/kernel/core_pattern" can hit the > > same problem, do_coredump() does override_creds() too. > > > > May be just add > > > > if (current->cred != current->real_cred) > > return -EACCES; > > > > into proc_pid_attr_write(), I dunno. > > Is the problem that do_acct_process() is calling override_creds() and > the returned/old credentials are being freed before do_acct_process() > can reinstall the creds via revert_creds()? Presumably because the > process accounting is causing the credentials to be replaced? Afaics, the problem is that do_acct_process() does override_creds() and then __kernel_write(). Which calls proc_pid_attr_write(), which in turn calls selinux_setprocattr(), which does another prepare_creds() + commit_creds(); and commit_creds() hits BUG_ON(task->cred != old); Oleg.
Re: kernel BUG at kernel/cred.c:434!
On Mon, Apr 15, 2019 at 9:43 AM Oleg Nesterov wrote: > Well, acct("/proc/self/attr/current") doesn't look like a good idea, but I do > not know where should we put the additional check... And probably > "echo /proc/self/attr/current > /proc/sys/kernel/core_pattern" can hit the > same problem, do_coredump() does override_creds() too. > > May be just add > > if (current->cred != current->real_cred) > return -EACCES; > > into proc_pid_attr_write(), I dunno. Is the problem that do_acct_process() is calling override_creds() and the returned/old credentials are being freed before do_acct_process() can reinstall the creds via revert_creds()? Presumably because the process accounting is causing the credentials to be replaced? > On 04/12, Casey Schaufler wrote: > > > > On 4/11/2019 11:21 PM, chengjian (D) wrote: > > > > Added LSM and SELinux lists. > > > > > > >Hi. > > > > > > > > >syzkaller reported the following BUG: > > > > > >[ 73.146973] kernel BUG at kernel/cred.c:434! > > >[ 73.150231] invalid opcode: [#1] SMP KASAN PTI > > >[ 73.151928] CPU: 2 PID: 4058 Comm: syz-executor.6 Not tainted > > >5.1.0-rc4-00062-g2d06b235815e-dirty #2 > > >[ 73.155174] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS > > >rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014 > > >[ 73.159798] RIP: 0010:commit_creds+0xadb/0xe50 > > >[ 73.161426] Code: 8b 5b 20 48 c1 ea 03 0f b6 04 02 84 c0 74 08 3c 03 0f > > >8e 06 03 00 00 39 5d 20 0f 85 ff fa ff ff e9 0c fb ff ff e8 05 a2 25 00 > > ><0f> 0b 48 c7 c7 80 56 c0 83 e8 95 22 b1 00 e8 f2 a1 25 00 0f 0b 48 > > >[ 73.167852] RSP: :88836e65f5d0 EFLAGS: 00010293 > > >[ 73.169636] RAX: 8883767b RBX: 88837f111300 RCX: > > >8124b5db > > >[ 73.171962] RDX: RSI: 83c9b140 RDI: > > >88837f111300 > > >[ 73.174310] RBP: 888376610400 R08: R09: > > >0004 > > >[ 73.176646] R10: 0001 R11: ed107c655acf R12: > > >8883767b > > >[ 73.178527] Process accounting resumed > > >[ 73.179021] R13: 88837f111900 R14: 88837f111300 R15: > > >8883767b0ac0 > > >[ 73.179029] FS: 7f2d207f9700() GS:8883e328() > > >knlGS: > > >[ 73.179034] CS: 0010 DS: ES: CR0: 80050033 > > >[ 73.179039] CR2: 7f1500bd36c0 CR3: 0003df304003 CR4: > > >000206e0 > > >[ 73.179047] Call Trace: > > >[ 73.190461] selinux_setprocattr+0x2ea/0x8f0 > > >[ 73.191925] ? ptrace_parent_sid+0x530/0x530 > > >[ 73.193436] ? proc_pid_attr_write+0x185/0x5a0 > > >[ 73.194967] security_setprocattr+0xa1/0x100 > > >[ 73.196408] proc_pid_attr_write+0x307/0x5a0 > > >[ 73.197869] ? mem_read+0x40/0x40 > > >[ 73.199013] __vfs_write+0x81/0x100 > > >[ 73.200222] __kernel_write+0xf8/0x330 > > >[ 73.201562] do_acct_process+0xca5/0x1340 > > >[ 73.202969] ? __ia32_sys_acct+0x1e0/0x1e0 > > >[ 73.204498] ? find_held_lock+0x2f/0x1e0 > > >[ 73.205857] ? rcu_irq_exit+0xec/0x2c0 > > >[ 73.207160] ? lock_downgrade+0x630/0x630 > > >[ 73.208541] acct_pin_kill+0x63/0x150 > > >[ 73.209816] pin_kill+0x16d/0x7c0 > > >[ 73.210934] ? lockdep_hardirqs_on+0x5e0/0x5e0 > > >[ 73.212452] ? xas_start+0x155/0x510 > > >[ 73.213705] ? pin_insert+0x50/0x50 > > >[ 73.214903] ? finish_wait+0x270/0x270 > > >[ 73.216213] ? cpumask_next+0x57/0x90 > > >[ 73.217442] ? mnt_pin_kill+0x68/0x1d0 > > >[ 73.218851] mnt_pin_kill+0x68/0x1d0 > > >[ 73.220398] cleanup_mnt+0x11b/0x150 > > >[ 73.221970] task_work_run+0x136/0x1b0 > > >[ 73.223427] do_exit+0x830/0x2ca0 > > >[ 73.224586] ? trace_hardirqs_off+0x3b/0x180 > > >[ 73.226088] ? mm_update_next_owner+0x6a0/0x6a0 > > >[ 73.227622] ? find_held_lock+0x2f/0x1e0 > > >[ 73.228954] ? get_signal+0x2cf/0x1c00 > > >[ 73.230236] ? lock_downgrade+0x630/0x630 > > >[ 73.231628] ? rwlock_bug.part.0+0x90/0x90 > > >[ 73.233020] do_group_exit+0x106/0x2f0 > > >[ 73.234330] get_signal+0x325/0x1c00 > > >[ 73.235571] do_signal+0x97/0x1670 > > >[ 73.236739] ? do_send_specific+0x12d/0x220 > > >[ 73.238213] ? lock_downgrade+0x630/0x630 > > >[ 73.239566] ? setup_sigcontext
Re: kernel BUG at kernel/cred.c:434!
Well, acct("/proc/self/attr/current") doesn't look like a good idea, but I do not know where should we put the additional check... And probably "echo /proc/self/attr/current > /proc/sys/kernel/core_pattern" can hit the same problem, do_coredump() does override_creds() too. May be just add if (current->cred != current->real_cred) return -EACCES; into proc_pid_attr_write(), I dunno. On 04/12, Casey Schaufler wrote: > > On 4/11/2019 11:21 PM, chengjian (D) wrote: > > Added LSM and SELinux lists. > > > >Hi. > > > > > >syzkaller reported the following BUG: > > > >[ 73.146973] kernel BUG at kernel/cred.c:434! > >[ 73.150231] invalid opcode: [#1] SMP KASAN PTI > >[ 73.151928] CPU: 2 PID: 4058 Comm: syz-executor.6 Not tainted > >5.1.0-rc4-00062-g2d06b235815e-dirty #2 > >[ 73.155174] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS > >rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014 > >[ 73.159798] RIP: 0010:commit_creds+0xadb/0xe50 > >[ 73.161426] Code: 8b 5b 20 48 c1 ea 03 0f b6 04 02 84 c0 74 08 3c 03 0f > >8e 06 03 00 00 39 5d 20 0f 85 ff fa ff ff e9 0c fb ff ff e8 05 a2 25 00 > ><0f> 0b 48 c7 c7 80 56 c0 83 e8 95 22 b1 00 e8 f2 a1 25 00 0f 0b 48 > >[ 73.167852] RSP: :88836e65f5d0 EFLAGS: 00010293 > >[ 73.169636] RAX: 8883767b RBX: 88837f111300 RCX: > >8124b5db > >[ 73.171962] RDX: RSI: 83c9b140 RDI: > >88837f111300 > >[ 73.174310] RBP: 888376610400 R08: R09: > >0004 > >[ 73.176646] R10: 0001 R11: ed107c655acf R12: > >8883767b > >[ 73.178527] Process accounting resumed > >[ 73.179021] R13: 88837f111900 R14: 88837f111300 R15: > >8883767b0ac0 > >[ 73.179029] FS: 7f2d207f9700() GS:8883e328() > >knlGS: > >[ 73.179034] CS: 0010 DS: ES: CR0: 80050033 > >[ 73.179039] CR2: 7f1500bd36c0 CR3: 0003df304003 CR4: > >000206e0 > >[ 73.179047] Call Trace: > >[ 73.190461] selinux_setprocattr+0x2ea/0x8f0 > >[ 73.191925] ? ptrace_parent_sid+0x530/0x530 > >[ 73.193436] ? proc_pid_attr_write+0x185/0x5a0 > >[ 73.194967] security_setprocattr+0xa1/0x100 > >[ 73.196408] proc_pid_attr_write+0x307/0x5a0 > >[ 73.197869] ? mem_read+0x40/0x40 > >[ 73.199013] __vfs_write+0x81/0x100 > >[ 73.200222] __kernel_write+0xf8/0x330 > >[ 73.201562] do_acct_process+0xca5/0x1340 > >[ 73.202969] ? __ia32_sys_acct+0x1e0/0x1e0 > >[ 73.204498] ? find_held_lock+0x2f/0x1e0 > >[ 73.205857] ? rcu_irq_exit+0xec/0x2c0 > >[ 73.207160] ? lock_downgrade+0x630/0x630 > >[ 73.208541] acct_pin_kill+0x63/0x150 > >[ 73.209816] pin_kill+0x16d/0x7c0 > >[ 73.210934] ? lockdep_hardirqs_on+0x5e0/0x5e0 > >[ 73.212452] ? xas_start+0x155/0x510 > >[ 73.213705] ? pin_insert+0x50/0x50 > >[ 73.214903] ? finish_wait+0x270/0x270 > >[ 73.216213] ? cpumask_next+0x57/0x90 > >[ 73.217442] ? mnt_pin_kill+0x68/0x1d0 > >[ 73.218851] mnt_pin_kill+0x68/0x1d0 > >[ 73.220398] cleanup_mnt+0x11b/0x150 > >[ 73.221970] task_work_run+0x136/0x1b0 > >[ 73.223427] do_exit+0x830/0x2ca0 > >[ 73.224586] ? trace_hardirqs_off+0x3b/0x180 > >[ 73.226088] ? mm_update_next_owner+0x6a0/0x6a0 > >[ 73.227622] ? find_held_lock+0x2f/0x1e0 > >[ 73.228954] ? get_signal+0x2cf/0x1c00 > >[ 73.230236] ? lock_downgrade+0x630/0x630 > >[ 73.231628] ? rwlock_bug.part.0+0x90/0x90 > >[ 73.233020] do_group_exit+0x106/0x2f0 > >[ 73.234330] get_signal+0x325/0x1c00 > >[ 73.235571] do_signal+0x97/0x1670 > >[ 73.236739] ? do_send_specific+0x12d/0x220 > >[ 73.238213] ? lock_downgrade+0x630/0x630 > >[ 73.239566] ? setup_sigcontext+0x820/0x820 > >[ 73.240982] ? check_kill_permission+0x4a/0x510 > >[ 73.242509] ? do_send_specific+0x156/0x220 > >[ 73.243905] ? do_tkill+0x1c4/0x260 > >[ 73.245081] ? do_send_specific+0x220/0x220 > >[ 73.246514] ? trace_hardirqs_on_thunk+0x1a/0x1c > >[ 73.248061] ? exit_to_usermode_loop+0x97/0x1d0 > >[ 73.249619] exit_to_usermode_loop+0x108/0x1d0 > >[ 73.251129] do_syscall_64+0x461/0x580 > >[ 73.252454] entry_SYSCALL_64_after_hwframe+0x49/0xbe > >[ 73.254219] RIP: 0033:0x462eb9 > >[ 73.255327] Code: Bad RIP value. > >[ 73.256539] RSP: 002b:7f2d207f8c58 EFLAGS: 0246 ORIG_RAX: > >00c8 > >[ 73.259454] RAX:
Re: kernel BUG at kernel/cred.c:434!
On 4/11/2019 11:21 PM, chengjian (D) wrote: Added LSM and SELinux lists. Hi. syzkaller reported the following BUG: [ 73.146973] kernel BUG at kernel/cred.c:434! [ 73.150231] invalid opcode: [#1] SMP KASAN PTI [ 73.151928] CPU: 2 PID: 4058 Comm: syz-executor.6 Not tainted 5.1.0-rc4-00062-g2d06b235815e-dirty #2 [ 73.155174] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014 [ 73.159798] RIP: 0010:commit_creds+0xadb/0xe50 [ 73.161426] Code: 8b 5b 20 48 c1 ea 03 0f b6 04 02 84 c0 74 08 3c 03 0f 8e 06 03 00 00 39 5d 20 0f 85 ff fa ff ff e9 0c fb ff ff e8 05 a2 25 00 <0f> 0b 48 c7 c7 80 56 c0 83 e8 95 22 b1 00 e8 f2 a1 25 00 0f 0b 48 [ 73.167852] RSP: :88836e65f5d0 EFLAGS: 00010293 [ 73.169636] RAX: 8883767b RBX: 88837f111300 RCX: 8124b5db [ 73.171962] RDX: RSI: 83c9b140 RDI: 88837f111300 [ 73.174310] RBP: 888376610400 R08: R09: 0004 [ 73.176646] R10: 0001 R11: ed107c655acf R12: 8883767b [ 73.178527] Process accounting resumed [ 73.179021] R13: 88837f111900 R14: 88837f111300 R15: 8883767b0ac0 [ 73.179029] FS: 7f2d207f9700() GS:8883e328() knlGS: [ 73.179034] CS: 0010 DS: ES: CR0: 80050033 [ 73.179039] CR2: 7f1500bd36c0 CR3: 0003df304003 CR4: 000206e0 [ 73.179047] Call Trace: [ 73.190461] selinux_setprocattr+0x2ea/0x8f0 [ 73.191925] ? ptrace_parent_sid+0x530/0x530 [ 73.193436] ? proc_pid_attr_write+0x185/0x5a0 [ 73.194967] security_setprocattr+0xa1/0x100 [ 73.196408] proc_pid_attr_write+0x307/0x5a0 [ 73.197869] ? mem_read+0x40/0x40 [ 73.199013] __vfs_write+0x81/0x100 [ 73.200222] __kernel_write+0xf8/0x330 [ 73.201562] do_acct_process+0xca5/0x1340 [ 73.202969] ? __ia32_sys_acct+0x1e0/0x1e0 [ 73.204498] ? find_held_lock+0x2f/0x1e0 [ 73.205857] ? rcu_irq_exit+0xec/0x2c0 [ 73.207160] ? lock_downgrade+0x630/0x630 [ 73.208541] acct_pin_kill+0x63/0x150 [ 73.209816] pin_kill+0x16d/0x7c0 [ 73.210934] ? lockdep_hardirqs_on+0x5e0/0x5e0 [ 73.212452] ? xas_start+0x155/0x510 [ 73.213705] ? pin_insert+0x50/0x50 [ 73.214903] ? finish_wait+0x270/0x270 [ 73.216213] ? cpumask_next+0x57/0x90 [ 73.217442] ? mnt_pin_kill+0x68/0x1d0 [ 73.218851] mnt_pin_kill+0x68/0x1d0 [ 73.220398] cleanup_mnt+0x11b/0x150 [ 73.221970] task_work_run+0x136/0x1b0 [ 73.223427] do_exit+0x830/0x2ca0 [ 73.224586] ? trace_hardirqs_off+0x3b/0x180 [ 73.226088] ? mm_update_next_owner+0x6a0/0x6a0 [ 73.227622] ? find_held_lock+0x2f/0x1e0 [ 73.228954] ? get_signal+0x2cf/0x1c00 [ 73.230236] ? lock_downgrade+0x630/0x630 [ 73.231628] ? rwlock_bug.part.0+0x90/0x90 [ 73.233020] do_group_exit+0x106/0x2f0 [ 73.234330] get_signal+0x325/0x1c00 [ 73.235571] do_signal+0x97/0x1670 [ 73.236739] ? do_send_specific+0x12d/0x220 [ 73.238213] ? lock_downgrade+0x630/0x630 [ 73.239566] ? setup_sigcontext+0x820/0x820 [ 73.240982] ? check_kill_permission+0x4a/0x510 [ 73.242509] ? do_send_specific+0x156/0x220 [ 73.243905] ? do_tkill+0x1c4/0x260 [ 73.245081] ? do_send_specific+0x220/0x220 [ 73.246514] ? trace_hardirqs_on_thunk+0x1a/0x1c [ 73.248061] ? exit_to_usermode_loop+0x97/0x1d0 [ 73.249619] exit_to_usermode_loop+0x108/0x1d0 [ 73.251129] do_syscall_64+0x461/0x580 [ 73.252454] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 73.254219] RIP: 0033:0x462eb9 [ 73.255327] Code: Bad RIP value. [ 73.256539] RSP: 002b:7f2d207f8c58 EFLAGS: 0246 ORIG_RAX: 00c8 [ 73.259454] RAX: RBX: 0073bf00 RCX: 00462eb9 [ 73.262309] RDX: RSI: 001e RDI: 0005 [ 73.265064] RBP: 0002 R08: R09: [ 73.267774] R10: R11: 0246 R12: 7f2d207f96bc [ 73.270546] R13: 004c5509 R14: 007042f0 R15: [ 73.273542] Modules linked in: [ 73.274670] Dumping ftrace buffer: [ 73.275852] (ftrace buffer empty) [ 73.277187] ---[ end trace dde36a95f458175d ]--- [ 73.278834] RIP: 0010:commit_creds+0xadb/0xe50 [ 73.280549] Code: 8b 5b 20 48 c1 ea 03 0f b6 04 02 84 c0 74 08 3c 03 0f 8e 06 03 00 00 39 5d 20 0f 85 ff fa ff ff e9 0c fb ff ff e8 05 a2 25 00 <0f> 0b 48 c7 c7 80 56 c0 83 e8 95 22 b1 00 e8 f2 a1 25 00 0f 0b 48 [ 73.287090] RSP: :88836e65f5d0 EFLAGS: 00010293 [ 73.288917] RAX: 8883767b RBX: 88837f111300 RCX: 8124b5db [ 73.291390] RDX: RSI: 83c9b140 RDI: 88837f111300 [ 73.293864] RBP: 888376610400 R08: R09: 0004 [ 73.296370] R10: 0001 R11: ed107c655acf R12: 8883767b [ 73.299275] R13: 8883
kernel BUG at kernel/cred.c:434!
Hi. syzkaller reported the following BUG: [ 73.146973] kernel BUG at kernel/cred.c:434! [ 73.150231] invalid opcode: [#1] SMP KASAN PTI [ 73.151928] CPU: 2 PID: 4058 Comm: syz-executor.6 Not tainted 5.1.0-rc4-00062-g2d06b235815e-dirty #2 [ 73.155174] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014 [ 73.159798] RIP: 0010:commit_creds+0xadb/0xe50 [ 73.161426] Code: 8b 5b 20 48 c1 ea 03 0f b6 04 02 84 c0 74 08 3c 03 0f 8e 06 03 00 00 39 5d 20 0f 85 ff fa ff ff e9 0c fb ff ff e8 05 a2 25 00 <0f> 0b 48 c7 c7 80 56 c0 83 e8 95 22 b1 00 e8 f2 a1 25 00 0f 0b 48 [ 73.167852] RSP: :88836e65f5d0 EFLAGS: 00010293 [ 73.169636] RAX: 8883767b RBX: 88837f111300 RCX: 8124b5db [ 73.171962] RDX: RSI: 83c9b140 RDI: 88837f111300 [ 73.174310] RBP: 888376610400 R08: R09: 0004 [ 73.176646] R10: 0001 R11: ed107c655acf R12: 8883767b [ 73.178527] Process accounting resumed [ 73.179021] R13: 88837f111900 R14: 88837f111300 R15: 8883767b0ac0 [ 73.179029] FS: 7f2d207f9700() GS:8883e328() knlGS: [ 73.179034] CS: 0010 DS: ES: CR0: 80050033 [ 73.179039] CR2: 7f1500bd36c0 CR3: 0003df304003 CR4: 000206e0 [ 73.179047] Call Trace: [ 73.190461] selinux_setprocattr+0x2ea/0x8f0 [ 73.191925] ? ptrace_parent_sid+0x530/0x530 [ 73.193436] ? proc_pid_attr_write+0x185/0x5a0 [ 73.194967] security_setprocattr+0xa1/0x100 [ 73.196408] proc_pid_attr_write+0x307/0x5a0 [ 73.197869] ? mem_read+0x40/0x40 [ 73.199013] __vfs_write+0x81/0x100 [ 73.200222] __kernel_write+0xf8/0x330 [ 73.201562] do_acct_process+0xca5/0x1340 [ 73.202969] ? __ia32_sys_acct+0x1e0/0x1e0 [ 73.204498] ? find_held_lock+0x2f/0x1e0 [ 73.205857] ? rcu_irq_exit+0xec/0x2c0 [ 73.207160] ? lock_downgrade+0x630/0x630 [ 73.208541] acct_pin_kill+0x63/0x150 [ 73.209816] pin_kill+0x16d/0x7c0 [ 73.210934] ? lockdep_hardirqs_on+0x5e0/0x5e0 [ 73.212452] ? xas_start+0x155/0x510 [ 73.213705] ? pin_insert+0x50/0x50 [ 73.214903] ? finish_wait+0x270/0x270 [ 73.216213] ? cpumask_next+0x57/0x90 [ 73.217442] ? mnt_pin_kill+0x68/0x1d0 [ 73.218851] mnt_pin_kill+0x68/0x1d0 [ 73.220398] cleanup_mnt+0x11b/0x150 [ 73.221970] task_work_run+0x136/0x1b0 [ 73.223427] do_exit+0x830/0x2ca0 [ 73.224586] ? trace_hardirqs_off+0x3b/0x180 [ 73.226088] ? mm_update_next_owner+0x6a0/0x6a0 [ 73.227622] ? find_held_lock+0x2f/0x1e0 [ 73.228954] ? get_signal+0x2cf/0x1c00 [ 73.230236] ? lock_downgrade+0x630/0x630 [ 73.231628] ? rwlock_bug.part.0+0x90/0x90 [ 73.233020] do_group_exit+0x106/0x2f0 [ 73.234330] get_signal+0x325/0x1c00 [ 73.235571] do_signal+0x97/0x1670 [ 73.236739] ? do_send_specific+0x12d/0x220 [ 73.238213] ? lock_downgrade+0x630/0x630 [ 73.239566] ? setup_sigcontext+0x820/0x820 [ 73.240982] ? check_kill_permission+0x4a/0x510 [ 73.242509] ? do_send_specific+0x156/0x220 [ 73.243905] ? do_tkill+0x1c4/0x260 [ 73.245081] ? do_send_specific+0x220/0x220 [ 73.246514] ? trace_hardirqs_on_thunk+0x1a/0x1c [ 73.248061] ? exit_to_usermode_loop+0x97/0x1d0 [ 73.249619] exit_to_usermode_loop+0x108/0x1d0 [ 73.251129] do_syscall_64+0x461/0x580 [ 73.252454] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 73.254219] RIP: 0033:0x462eb9 [ 73.255327] Code: Bad RIP value. [ 73.256539] RSP: 002b:7f2d207f8c58 EFLAGS: 0246 ORIG_RAX: 00c8 [ 73.259454] RAX: RBX: 0073bf00 RCX: 00462eb9 [ 73.262309] RDX: RSI: 001e RDI: 0005 [ 73.265064] RBP: 0002 R08: R09: [ 73.267774] R10: R11: 0246 R12: 7f2d207f96bc [ 73.270546] R13: 004c5509 R14: 007042f0 R15: [ 73.273542] Modules linked in: [ 73.274670] Dumping ftrace buffer: [ 73.275852] (ftrace buffer empty) [ 73.277187] ---[ end trace dde36a95f458175d ]--- [ 73.278834] RIP: 0010:commit_creds+0xadb/0xe50 [ 73.280549] Code: 8b 5b 20 48 c1 ea 03 0f b6 04 02 84 c0 74 08 3c 03 0f 8e 06 03 00 00 39 5d 20 0f 85 ff fa ff ff e9 0c fb ff ff e8 05 a2 25 00 <0f> 0b 48 c7 c7 80 56 c0 83 e8 95 22 b1 00 e8 f2 a1 25 00 0f 0b 48 [ 73.287090] RSP: :88836e65f5d0 EFLAGS: 00010293 [ 73.288917] RAX: 8883767b RBX: 88837f111300 RCX: 8124b5db [ 73.291390] RDX: RSI: 83c9b140 RDI: 88837f111300 [ 73.293864] RBP: 888376610400 R08: R09: 0004 [ 73.296370] R10: 0001 R11: ed107c655acf R12: 8883767b [ 73.299275] R13: 88837f111900 R14: 88837f111300 R15: 8883767b0ac0 [ 73.301351] Process accoun
[PATCH v2 3/3] riscv: Support BUG() in kernel module
The kernel module is loaded into vmalloc region which is located below to the PAGE_OFFSET. Hence the condition, pc < PAGE_OFFSET, in the is_valid_bugaddr() will filter out all trap exceptions triggered by kernel module. To support BUG() in kernel module, the condition is changed to pc < VMALLOC_START. Signed-off-by: Vincent Chen --- arch/riscv/kernel/traps.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/arch/riscv/kernel/traps.c b/arch/riscv/kernel/traps.c index 0cd0137..9551388 100644 --- a/arch/riscv/kernel/traps.c +++ b/arch/riscv/kernel/traps.c @@ -156,7 +156,7 @@ int is_valid_bugaddr(unsigned long pc) { bug_insn_t insn; - if (pc < PAGE_OFFSET) + if (pc < VMALLOC_START) return 0; if (probe_kernel_address((bug_insn_t *)pc, insn)) return 0; -- 1.7.1
[PATCH 2/3] riscv: Support BUG() in kernel module
The kernel module is loaded into vmalloc region which is located below to the PAGE_OFFSET. Hence the condition, pc < PAGE_OFFSET, in the is_valid_bugaddr() will filter out all trap exceptions triggered by kernel module. To support BUG() in kernel module, the condition is changed to pc < VMALLOC_START. Signed-off-by: Vincent Chen --- arch/riscv/kernel/traps.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/arch/riscv/kernel/traps.c b/arch/riscv/kernel/traps.c index deae0e5..dee0e5e 100644 --- a/arch/riscv/kernel/traps.c +++ b/arch/riscv/kernel/traps.c @@ -144,7 +144,7 @@ int is_valid_bugaddr(unsigned long pc) { bug_insn_t insn; - if (pc < PAGE_OFFSET) + if (pc < VMALLOC_START) return 0; if (probe_kernel_address((bug_insn_t *)pc, insn)) return 0; -- 1.7.1
Re: kernel BUG at kernel/cred.c:825!
Hi Laura, On 17/01/19, 12:39 AM, "linux-nfs-ow...@vger.kernel.org on behalf of Laura Abbott" wrote: On 1/7/19 11:18 AM, Dave Jones wrote: > [ 53.980701] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory > [ 53.981216] NFSD: starting 45-second grace period (net f098) > [ 54.006802] CRED: Invalid credentials > [ 54.006880] CRED: At ./include/linux/cred.h:253 > [ 54.006899] CRED: Specified credentials: 5daa4529 > [ 54.006916] CRED: ->magic=0, put_addr= (null) > [ 54.006927] CRED: ->usage=1, subscr=0 > [ 54.006935] CRED: ->*uid = { 0,0,0,0 } > [ 54.006944] CRED: ->*gid = { 0,0,0,0 } > [ 54.006954] ----[ cut here ]---- > [ 54.006964] kernel BUG at kernel/cred.c:825! > Fedora still seems to be hitting this as of -rc2 The FIX is made into –rc3. Best R, Santosh
Re: [BUG bisect] kernel BUG at block/bio.c:1833 and fail to mount disk
On Thu, Jan 17, 2019 at 06:26:58PM +0800, Ming Lei wrote: > On Wed, Jan 16, 2019 at 09:54:05AM +0100, Krzysztof Kozlowski wrote: > > On Wed, 16 Jan 2019 at 09:52, Krzysztof Kozlowski wrote: > > > > > > Hi, > > > > > > On today's next-20190116 I see a bug during boot: > > > [ 6.843308] kernel BUG at ../block/bio.c:1833! > > > [ 6.847723] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP ARM > > > ... > > > [ 7.543824] [] (bio_split) from [<>] ( (null)) > > > [ 7.549881] Code: 13833b01 11c630bc e1a6 e8bd8070 (e7f001f2) > > > > > > (not much in the calltrace) > > > On all my boards. Also QEMU-arm fails. > > > > > > > I forgot the bisect commit: > > > > 258cfdfaf7bd729e759a0a91fd00ac9794796ad3 is the first bad commit > > commit 258cfdfaf7bd729e759a0a91fd00ac9794796ad3 > > Author: Ming Lei > > Date: Fri Jan 11 19:01:15 2019 +0800 > > > > block: use bio_for_each_bvec() to compute multi-page bvec count > > > > :04 04 d79b2e71d308650df4764ff644f29d3a24dbab96 > > 0a5d624843b805ee0c9fd9a7e2d5163f5b15b167 M block > > It should be one 32-bit arch specific issue, I guess. > > The following patch should fix this issue: > > diff --git a/block/blk-merge.c b/block/blk-merge.c > index dc4877eaf9f9..4dd7183de849 100644 > --- a/block/blk-merge.c > +++ b/block/blk-merge.c > @@ -166,6 +166,9 @@ static unsigned get_max_segment_size(struct request_queue > *q, > { > unsigned long mask = queue_segment_boundary(q); > > + if (mask == BLK_SEG_BOUNDARY_MASK) > + return queue_max_segment_size(q); > + > return min_t(unsigned long, mask - (mask & offset) + 1, >queue_max_segment_size(q)); > } > > Thanks, > Ming Hi Ming, We saw this issue on our CI as well: https://travis-ci.com/ClangBuiltLinux/continuous-integration/jobs/170724607 I added your patch and tested locally, it boots fine now. Tested-by: Nathan Chancellor Thanks, Nathan
Re: [BUG bisect] kernel BUG at block/bio.c:1833 and fail to mount disk
On Wed, Jan 16, 2019 at 09:54:05AM +0100, Krzysztof Kozlowski wrote: > On Wed, 16 Jan 2019 at 09:52, Krzysztof Kozlowski wrote: > > > > Hi, > > > > On today's next-20190116 I see a bug during boot: > > [ 6.843308] kernel BUG at ../block/bio.c:1833! > > [ 6.847723] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP ARM > > ... > > [ 7.543824] [] (bio_split) from [<>] ( (null)) > > [ 7.549881] Code: 13833b01 11c630bc e1a6 e8bd8070 (e7f001f2) > > > > (not much in the calltrace) > > On all my boards. Also QEMU-arm fails. > > > > I forgot the bisect commit: > > 258cfdfaf7bd729e759a0a91fd00ac9794796ad3 is the first bad commit > commit 258cfdfaf7bd729e759a0a91fd00ac9794796ad3 > Author: Ming Lei > Date: Fri Jan 11 19:01:15 2019 +0800 > > block: use bio_for_each_bvec() to compute multi-page bvec count > > :04 04 d79b2e71d308650df4764ff644f29d3a24dbab96 > 0a5d624843b805ee0c9fd9a7e2d5163f5b15b167 M block It should be one 32-bit arch specific issue, I guess. The following patch should fix this issue: diff --git a/block/blk-merge.c b/block/blk-merge.c index dc4877eaf9f9..4dd7183de849 100644 --- a/block/blk-merge.c +++ b/block/blk-merge.c @@ -166,6 +166,9 @@ static unsigned get_max_segment_size(struct request_queue *q, { unsigned long mask = queue_segment_boundary(q); + if (mask == BLK_SEG_BOUNDARY_MASK) + return queue_max_segment_size(q); + return min_t(unsigned long, mask - (mask & offset) + 1, queue_max_segment_size(q)); } Thanks, Ming
Re: kernel BUG at kernel/cred.c:825!
On 1/7/19 11:18 AM, Dave Jones wrote: [ 53.980701] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory [ 53.981216] NFSD: starting 45-second grace period (net f098) [ 54.006802] CRED: Invalid credentials [ 54.006880] CRED: At ./include/linux/cred.h:253 [ 54.006899] CRED: Specified credentials: 5daa4529 [ 54.006916] CRED: ->magic=0, put_addr= (null) [ 54.006927] CRED: ->usage=1, subscr=0 [ 54.006935] CRED: ->*uid = { 0,0,0,0 } [ 54.006944] CRED: ->*gid = { 0,0,0,0 } [ 54.006954] [ cut here ] [ 54.006964] kernel BUG at kernel/cred.c:825! [ 54.006977] invalid opcode: [#1] SMP RIP: __invalid_creds+0x48/0x50 [ 54.006987] CPU: 2 PID: 814 Comm: mount.nfs Tainted: GW 5.0.0-rc1-backup+ #1 [ 54.006997] Hardware name: ASUS All Series/Z97-DELUXE, BIOS 2602 08/18/2015 [ 54.007171] RIP: 0010:__invalid_creds+0x48/0x50 [ 54.007184] Code: 44 89 e2 48 89 ee 48 c7 c7 37 3e 53 ba e8 f7 8f 03 00 48 c7 c6 49 3e 53 ba 48 89 df 65 48 8b 14 25 80 4e 01 00 e8 48 fd ff ff <0f> 0b 66 0f 1f 44 00 00 0f 1f 44 00 00 55 48 89 e5 41 56 49 89 fe [ 54.007207] RSP: 0018:c9e33a30 EFLAGS: 00010286 [ 54.007219] RAX: 001a RBX: ba960300 RCX: 0006 [ 54.007234] RDX: RSI: 8884276f8818 RDI: 88842f895710 [ 54.007246] RBP: ba5274c3 R08: 0001 R09: [ 54.007254] R10: c9e33a50 R11: R12: 00fd [ 54.007261] R13: 88842c1a6a08 R14: ba960300 R15: c9e33d60 [ 54.007269] FS: 7f73770cb140() GS:88842f88() knlGS: [ 54.007277] CS: 0010 DS: ES: CR0: 80050033 [ 54.007283] CR2: 5557d17d1000 CR3: 0004122ba006 CR4: 001606e0 [ 54.007359] Call Trace: [ 54.007366] nfs4_discover_server_trunking+0x286/0x310 [ 54.007376] nfs4_init_client+0xe8/0x260 [ 54.007389] ? nfs_get_client+0x519/0x610 [ 54.007401] ? _raw_spin_unlock+0x24/0x30 [ 54.007412] ? nfs_get_client+0x519/0x610 [ 54.007424] nfs4_set_client+0xb8/0x100 [ 54.007439] nfs4_create_server+0xfe/0x270 [ 54.007451] ? pcpu_alloc+0x611/0x8a0 [ 54.007462] nfs4_remote_mount+0x28/0x50 [ 54.007474] mount_fs+0xf/0x80 [ 54.007487] vfs_kern_mount+0x62/0x160 [ 54.007498] nfs_do_root_mount+0x7f/0xc0 [ 54.007510] nfs4_try_mount+0x3f/0xc0 [ 54.007521] ? get_nfs_version+0x11/0x50 [ 54.007536] nfs_fs_mount+0x61b/0xbd0 [ 54.007548] ? rcu_read_lock_sched_held+0x66/0x70 [ 54.007560] ? nfs_clone_super+0x70/0x70 [ 54.007571] ? nfs_destroy_inode+0x20/0x20 [ 54.007585] ? mount_fs+0xf/0x80 [ 54.007595] mount_fs+0xf/0x80 [ 54.007606] vfs_kern_mount+0x62/0x160 [ 54.007618] do_mount+0x1d1/0xd40 [ 54.007631] ? copy_mount_options+0xd2/0x170 [ 54.007643] ksys_mount+0x7e/0xd0 [ 54.007654] __x64_sys_mount+0x21/0x30 [ 54.007665] do_syscall_64+0x6d/0x660 [ 54.007677] ? trace_hardirqs_off_thunk+0x1a/0x1c [ 54.007690] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 54.007702] RIP: 0033:0x7f7377e97a1a [ 54.007713] Code: 48 8b 0d 71 e4 0b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 3e e4 0b 00 f7 d8 64 89 01 48 [ 54.007736] RSP: 002b:7ffc73d9b4a8 EFLAGS: 0202 ORIG_RAX: 00a5 [ 54.007751] RAX: ffda RBX: RCX: 7f7377e97a1a [ 54.007764] RDX: 5632beb51b50 RSI: 5632beb51b70 RDI: 5632beb53880 [ 54.007780] RBP: 7ffc73d9b600 R08: 5632beb556b0 R09: 33643a303036343a [ 54.007794] R10: 0c00 R11: 0202 R12: 7ffc73d9b600 [ 54.007807] R13: 5632beb548a0 R14: 001c R15: 7ffc73d9b510 Fedora still seems to be hitting this as of -rc2
Re: kernel BUG at kernel/sched/core.c:3490!
On 01/12, Kohli, Gaurav wrote: > > HI Peter, Oleg, > > as per flag and state this seems to be possible only from below code: Not sure I understand you, > XXX: 0 1 0x40844c > PF_NOFREEZE > PF_RANDOMIZE > PF_SIGNALED > PF_FORKNOEXEC > PF_EXITING > PF_EXITPIDONE > > above state shows do_exit runs properely and if somehow after parked stated > , TASK_WAKEKILL got set and signal_pending_state returns 1 in below case: > > switch_count = >nivcsw; > if (!preempt && prev->state) { > if (unlikely(signal_pending_state(prev->state, prev))) { > prev->state = TASK_RUNNING; > } else { > deactivate_task(rq, prev, DEQUEUE_SLEEP | > DEQUEUE_NOCLOCK); or task->state was TASK_RUNNING when __schedule() was called, or the deactivated dead task was woken up later... The only problem is that every case looks "obviously impossible" ;) I have no idea whats going on, I can only suggest more stupid debugging patches which might narrow the problem. Oleg.
Re: [BUG bisect] kernel BUG at block/bio.c:1833 and fail to mount disk
On Wed, 16 Jan 2019 at 09:52, Krzysztof Kozlowski wrote: > > Hi, > > On today's next-20190116 I see a bug during boot: > [ 6.843308] kernel BUG at ../block/bio.c:1833! > [ 6.847723] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP ARM > ... > [ 7.543824] [] (bio_split) from [<>] ( (null)) > [ 7.549881] Code: 13833b01 11c630bc e1a6 e8bd8070 (e7f001f2) > > (not much in the calltrace) > On all my boards. Also QEMU-arm fails. > I forgot the bisect commit: 258cfdfaf7bd729e759a0a91fd00ac9794796ad3 is the first bad commit commit 258cfdfaf7bd729e759a0a91fd00ac9794796ad3 Author: Ming Lei Date: Fri Jan 11 19:01:15 2019 +0800 block: use bio_for_each_bvec() to compute multi-page bvec count :04 04 d79b2e71d308650df4764ff644f29d3a24dbab96 0a5d624843b805ee0c9fd9a7e2d5163f5b15b167 M block BR, Krzysztof
[BUG bisect] kernel BUG at block/bio.c:1833 and fail to mount disk
Hi, On today's next-20190116 I see a bug during boot: [ 6.843308] kernel BUG at ../block/bio.c:1833! [ 6.847723] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP ARM ... [ 7.543824] [] (bio_split) from [<>] ( (null)) [ 7.549881] Code: 13833b01 11c630bc e1a6 e8bd8070 (e7f001f2) (not much in the calltrace) On all my boards. Also QEMU-arm fails. Ultimately the target fails to mount /home from SD card (although root is from NFSv4): Timed out waiting for device dev-disk-by\x2dlabel-home.device. Full console logs: https://krzk.eu/#/builders/25/builds/1079/steps/12/logs/serial0 https://krzk.eu/#/builders/1/builds/3031/steps/12/logs/serial0 1. Arch ARM Linux 2. exynos_defconfig - All my Odroid boards (ARMv7, Exynos) - Odroid HC1, U3, XU3, XU - Systemd 3. All boards boot from TFTP with NFS root (NFSv4) The QEMU arm fails with more information: [ 0.596078] loop: module loaded [ 0.601633] virtio_blk virtio0: [vda] 10240 512-byte logical blocks (5.24 MB/5.00 MiB) [ 0.603320] Internal error: Oops - undefined instruction: 0 [#1] ARM [ 0.603504] Modules linked in: [ 0.603784] CPU: 0 PID: 1 Comm: swapper Tainted: G W 5.0.0-rc2-next-20190116-gcbeb3db #1 [ 0.603905] Hardware name: ARM-Versatile Express [ 0.604329] PC is at bio_split+0x5c/0x64 [ 0.604461] LR is at blk_queue_split+0x47c/0x5f4 [ 0.604553] pc : [<8036e2c4>] lr : [<80376134>] psr: 6013 [ 0.604652] sp : 8681f878 ip : fp : 8699df00 [ 0.604750] r10: 8681f914 r9 : r8 : 007e [ 0.604838] r7 : r6 : 8547ac28 r5 : r4 : [ 0.604937] r3 : 8547b068 r2 : 0060 r1 : r0 : 8699df00 [ 0.605070] Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none [ 0.605185] Control: 10c5387d Table: 60004059 DAC: 0051 [ 0.605338] Process swapper (pid: 1, stack limit = 0x(ptrval)) ... [ 0.619332] [<8036e2c4>] (bio_split) from [<80376134>] (blk_queue_split+0x47c/0x5f4) [ 0.619480] [<80376134>] (blk_queue_split) from [<8037bee8>] (blk_mq_make_request+0x80/0x3bc) [ 0.619608] [<8037bee8>] (blk_mq_make_request) from [<80371840>] (generic_make_request+0x1b0/0x364) [ 0.619733] [<80371840>] (generic_make_request) from [<80371a8c>] (submit_bio+0x98/0x194) [ 0.619850] [<80371a8c>] (submit_bio) from [<8020e1b8>] (submit_bh_wbc.constprop.21+0x154/0x17c) [ 0.619980] [<8020e1b8>] (submit_bh_wbc.constprop.21) from [<80210634>] (block_read_full_page+0x3b0/0x3d0) [ 0.620109] [<80210634>] (block_read_full_page) from [<8019139c>] (do_read_cache_page+0x180/0x33c) [ 0.620237] [<8019139c>] (do_read_cache_page) from [<8019156c>] (read_cache_page+0x14/0x1c) [ 0.620388] [<8019156c>] (read_cache_page) from [<80383be4>] (read_dev_sector+0x2c/0xbc) [ 0.620508] [<80383be4>] (read_dev_sector) from [<80386e7c>] (read_lba+0x134/0x1b4) [ 0.620623] [<80386e7c>] (read_lba) from [<80387234>] (efi_partition+0xbc/0x8e8) [ 0.620741] [<80387234>] (efi_partition) from [<80386114>] (check_partition+0x11c/0x1e8) [ 0.620856] [<80386114>] (check_partition) from [<80384298>] (rescan_partitions+0x94/0x45c) [ 0.620971] [<80384298>] (rescan_partitions) from [<80213230>] (__blkdev_get+0x260/0x404) [ 0.621087] [<80213230>] (__blkdev_get) from [<802134d8>] (blkdev_get+0x104/0x2f4) [ 0.621206] [<802134d8>] (blkdev_get) from [<80382f9c>] (__device_add_disk+0x498/0x4fc) [ 0.621325] [<80382f9c>] (__device_add_disk) from [<804317f0>] (virtblk_probe+0x444/0x744) [ 0.621444] [<804317f0>] (virtblk_probe) from [<803d3df0>] (virtio_dev_probe+0x258/0x330) [ 0.621561] [<803d3df0>] (virtio_dev_probe) from [<80415864>] (really_probe+0xf4/0x2c8) [ 0.621675] [<80415864>] (really_probe) from [<80415b9c>] (driver_probe_device+0x60/0x16c) [ 0.621788] [<80415b9c>] (driver_probe_device) from [<80415d84>] (__driver_attach+0xdc/0xe0) [ 0.621911] [<80415d84>] (__driver_attach) from [<80413d3c>] (bus_for_each_dev+0x68/0xb4) [ 0.622024] [<80413d3c>] (bus_for_each_dev) from [<80414e04>] (bus_add_driver+0x100/0x20c) [ 0.622137] [<80414e04>] (bus_add_driver) from [<804166b0>] (driver_register+0x78/0x10c) [ 0.622253] [<804166b0>] (driver_register) from [<8091b5e4>] (init+0x70/0x9c) [ 0.622403] [<8091b5e4>] (init) from [<801025ec>] (do_one_initcall+0x54/0x194) [ 0.622525] [<801025ec>] (do_one_initcall) from [<80900e14>] (kernel_init_freeable+0x12c/0x1c8) [ 0.622658] [<80900e14>] (kernel_init_freeable) from [<80628c08>] (kernel_init+0x8/0x110) [ 0.622776] [<80628c08>] (kernel_init) from [<801010e8>] (ret_from_fork+0x14/0x2c) [ 0.622887] Exception stack(0x8681ffb0 to 0x8681fff8) [ 0.623021] ffa0: [ 0.623242] ffc0: 00
Re: kernel BUG at kernel/sched/core.c:3490!
HI Peter, Oleg, as per flag and state this seems to be possible only from below code: XXX: 0 1 0x40844c PF_NOFREEZE PF_RANDOMIZE PF_SIGNALED PF_FORKNOEXEC PF_EXITING PF_EXITPIDONE above state shows do_exit runs properely and if somehow after parked stated , TASK_WAKEKILL got set and signal_pending_state returns 1 in below case: switch_count = >nivcsw; if (!preempt && prev->state) { if (unlikely(signal_pending_state(prev->state, prev))) { prev->state = TASK_RUNNING; } else { deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK); Regards Gaurav On 1/11/2019 9:47 PM, Qian Cai wrote: On Fri, 2019-01-11 at 16:07 +0530, Kohli, Gaurav wrote: On 1/7/2019 11:26 PM, Oleg Nesterov wrote: pr_crit("XXX: %ld %d\n", current->state, current->on_rq); Can we also add flags, this may help to know the path of problem: pr_crit("XXX: %ld %d 0x%x\n", current->state, current->on_rq, current->flags); XXX: 0 1 0x40844c -- Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.
Re: kernel BUG at kernel/sched/core.c:3490!
On Fri, 2019-01-11 at 16:07 +0530, Kohli, Gaurav wrote: > > On 1/7/2019 11:26 PM, Oleg Nesterov wrote: > > pr_crit("XXX: %ld %d\n", current->state, current->on_rq); > > Can we also add flags, this may help to know the path of problem: > > pr_crit("XXX: %ld %d 0x%x\n", current->state, current->on_rq, > current->flags); > XXX: 0 1 0x40844c
Re: kernel BUG at kernel/sched/core.c:3490!
On 1/7/2019 11:26 PM, Oleg Nesterov wrote: pr_crit("XXX: %ld %d\n", current->state, current->on_rq); Can we also add flags, this may help to know the path of problem: pr_crit("XXX: %ld %d 0x%x\n", current->state, current->on_rq, current->flags); -- Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.
kernel BUG at kernel/cred.c:825!
[ 53.980701] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory [ 53.981216] NFSD: starting 45-second grace period (net f098) [ 54.006802] CRED: Invalid credentials [ 54.006880] CRED: At ./include/linux/cred.h:253 [ 54.006899] CRED: Specified credentials: 5daa4529 [ 54.006916] CRED: ->magic=0, put_addr= (null) [ 54.006927] CRED: ->usage=1, subscr=0 [ 54.006935] CRED: ->*uid = { 0,0,0,0 } [ 54.006944] CRED: ->*gid = { 0,0,0,0 } [ 54.006954] [ cut here ] [ 54.006964] kernel BUG at kernel/cred.c:825! [ 54.006977] invalid opcode: [#1] SMP RIP: __invalid_creds+0x48/0x50 [ 54.006987] CPU: 2 PID: 814 Comm: mount.nfs Tainted: GW 5.0.0-rc1-backup+ #1 [ 54.006997] Hardware name: ASUS All Series/Z97-DELUXE, BIOS 2602 08/18/2015 [ 54.007171] RIP: 0010:__invalid_creds+0x48/0x50 [ 54.007184] Code: 44 89 e2 48 89 ee 48 c7 c7 37 3e 53 ba e8 f7 8f 03 00 48 c7 c6 49 3e 53 ba 48 89 df 65 48 8b 14 25 80 4e 01 00 e8 48 fd ff ff <0f> 0b 66 0f 1f 44 00 00 0f 1f 44 00 00 55 48 89 e5 41 56 49 89 fe [ 54.007207] RSP: 0018:c9e33a30 EFLAGS: 00010286 [ 54.007219] RAX: 001a RBX: ba960300 RCX: 0006 [ 54.007234] RDX: RSI: 8884276f8818 RDI: 88842f895710 [ 54.007246] RBP: ba5274c3 R08: 0001 R09: [ 54.007254] R10: c9e33a50 R11: R12: 00fd [ 54.007261] R13: 88842c1a6a08 R14: ba960300 R15: c9e33d60 [ 54.007269] FS: 7f73770cb140() GS:88842f88() knlGS: [ 54.007277] CS: 0010 DS: ES: CR0: 80050033 [ 54.007283] CR2: 5557d17d1000 CR3: 0004122ba006 CR4: 001606e0 [ 54.007359] Call Trace: [ 54.007366] nfs4_discover_server_trunking+0x286/0x310 [ 54.007376] nfs4_init_client+0xe8/0x260 [ 54.007389] ? nfs_get_client+0x519/0x610 [ 54.007401] ? _raw_spin_unlock+0x24/0x30 [ 54.007412] ? nfs_get_client+0x519/0x610 [ 54.007424] nfs4_set_client+0xb8/0x100 [ 54.007439] nfs4_create_server+0xfe/0x270 [ 54.007451] ? pcpu_alloc+0x611/0x8a0 [ 54.007462] nfs4_remote_mount+0x28/0x50 [ 54.007474] mount_fs+0xf/0x80 [ 54.007487] vfs_kern_mount+0x62/0x160 [ 54.007498] nfs_do_root_mount+0x7f/0xc0 [ 54.007510] nfs4_try_mount+0x3f/0xc0 [ 54.007521] ? get_nfs_version+0x11/0x50 [ 54.007536] nfs_fs_mount+0x61b/0xbd0 [ 54.007548] ? rcu_read_lock_sched_held+0x66/0x70 [ 54.007560] ? nfs_clone_super+0x70/0x70 [ 54.007571] ? nfs_destroy_inode+0x20/0x20 [ 54.007585] ? mount_fs+0xf/0x80 [ 54.007595] mount_fs+0xf/0x80 [ 54.007606] vfs_kern_mount+0x62/0x160 [ 54.007618] do_mount+0x1d1/0xd40 [ 54.007631] ? copy_mount_options+0xd2/0x170 [ 54.007643] ksys_mount+0x7e/0xd0 [ 54.007654] __x64_sys_mount+0x21/0x30 [ 54.007665] do_syscall_64+0x6d/0x660 [ 54.007677] ? trace_hardirqs_off_thunk+0x1a/0x1c [ 54.007690] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 54.007702] RIP: 0033:0x7f7377e97a1a [ 54.007713] Code: 48 8b 0d 71 e4 0b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 3e e4 0b 00 f7 d8 64 89 01 48 [ 54.007736] RSP: 002b:7ffc73d9b4a8 EFLAGS: 0202 ORIG_RAX: 00a5 [ 54.007751] RAX: ffda RBX: RCX: 7f7377e97a1a [ 54.007764] RDX: 5632beb51b50 RSI: 5632beb51b70 RDI: 5632beb53880 [ 54.007780] RBP: 7ffc73d9b600 R08: 5632beb556b0 R09: 33643a303036343a [ 54.007794] R10: 0c00 R11: 0202 R12: 7ffc73d9b600 [ 54.007807] R13: 5632beb548a0 R14: 001c R15: 7ffc73d9b510
Re: kernel BUG at kernel/sched/core.c:3490!
On 01/07, Qian Cai wrote: > > > On 1/7/19 8:52 AM, Peter Zijlstra wrote: > > On Tue, Jan 01, 2019 at 12:44:35AM -0500, Qian Cai wrote: > >> Running some mmap() workloads to put the system on low memory situation > >> with > >> swapping and OOM, and then it trigger this BUG(), > >> > >> void __noreturn do_task_dead(void) > >> { > >> /* Causes final put_task_struct in finish_task_switch(): */ > >> set_special_state(TASK_DEAD); > >> > >> /* Tell freezer to ignore us: */ > >> current->flags |= PF_NOFREEZE; > >> > >> __schedule(false); > >> BUG(); > >> > >> /* Avoid "noreturn function does return" - but don't continue if > >> BUG() > >> is a NOP: */ > >> for (;;) > >> cpu_relax(); > >> } > > > > This would mean that we somehow loose the TASK_DEAD state before hitting > > schedule(), but that is something that should be avoided by > > set_special_state(), which is supposed to serialize against concurrent > > wake-ups. or may be pick_next_task() somehow returns the deactivated TASK_DEAD task? > > How readily does this reproduce? > > Running LTP oom01 [1] triggered it at least once in five attempts every time > so > far on v4.20+. Have not tried much on v5.0-rc1 yet. Can you add pr_crit("XXX: %ld %d\n", current->state, current->on_rq); before that BUG() and reproduce? Oleg.
Re: kernel BUG at kernel/sched/core.c:3490!
On 1/7/19 8:52 AM, Peter Zijlstra wrote: > On Tue, Jan 01, 2019 at 12:44:35AM -0500, Qian Cai wrote: >> Running some mmap() workloads to put the system on low memory situation with >> swapping and OOM, and then it trigger this BUG(), >> >> void __noreturn do_task_dead(void) >> { >> /* Causes final put_task_struct in finish_task_switch(): */ >> set_special_state(TASK_DEAD); >> >> /* Tell freezer to ignore us: */ >> current->flags |= PF_NOFREEZE; >> >> __schedule(false); >> BUG(); >> >> /* Avoid "noreturn function does return" - but don't continue if >> BUG() >> is a NOP: */ >> for (;;) >> cpu_relax(); >> } > > This would mean that we somehow loose the TASK_DEAD state before hitting > schedule(), but that is something that should be avoided by > set_special_state(), which is supposed to serialize against concurrent > wake-ups. > > Also see commit: b5bf9a90bbeb ("sched/core: Introduce set_special_state()") > > How readily does this reproduce? Running LTP oom01 [1] triggered it at least once in five attempts every time so far on v4.20+. Have not tried much on v5.0-rc1 yet. [1] https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/oom/oom01.c
Re: kernel BUG at kernel/sched/core.c:3490!
On Tue, Jan 01, 2019 at 12:44:35AM -0500, Qian Cai wrote: > Running some mmap() workloads to put the system on low memory situation with > swapping and OOM, and then it trigger this BUG(), > > void __noreturn do_task_dead(void) > { > /* Causes final put_task_struct in finish_task_switch(): */ > set_special_state(TASK_DEAD); > > /* Tell freezer to ignore us: */ > current->flags |= PF_NOFREEZE; > > __schedule(false); > BUG(); > > /* Avoid "noreturn function does return" - but don't continue if BUG() > is a NOP: */ > for (;;) > cpu_relax(); > } This would mean that we somehow loose the TASK_DEAD state before hitting schedule(), but that is something that should be avoided by set_special_state(), which is supposed to serialize against concurrent wake-ups. Also see commit: b5bf9a90bbeb ("sched/core: Introduce set_special_state()") How readily does this reproduce?