Re: [PATCH 0/6] ravb/sh_eth: fix sleep in atomic by reusing shared ethtool handlers
On 05/25/2018 09:25 AM, Vladimir Zapolskiy wrote: For ages trivial changes to RAVB and SuperH ethernet links by means of standard 'ethtool' trigger a 'sleeping function called from invalid context' bug, to visualize it on r8a7795 ULCB: % ethtool -r eth0 BUG: sleeping function called from invalid context at kernel/locking/mutex.c:747 in_atomic(): 1, irqs_disabled(): 128, pid: 554, name: ethtool INFO: lockdep is turned off. irq event stamp: 0 hardirqs last enabled at (0): [<>] (null) hardirqs last disabled at (0): [] copy_process.isra.7.part.8+0x2cc/0x1918 softirqs last enabled at (0): [] copy_process.isra.7.part.8+0x2cc/0x1918 softirqs last disabled at (0): [<>] (null) CPU: 5 PID: 554 Comm: ethtool Not tainted 4.17.0-rc4-arm64-renesas+ #33 Hardware name: Renesas H3ULCB board based on r8a7795 ES2.0+ (DT) Call trace: dump_backtrace+0x0/0x198 show_stack+0x24/0x30 dump_stack+0xb8/0xf4 ___might_sleep+0x1c8/0x1f8 __might_sleep+0x58/0x90 __mutex_lock+0x50/0x890 mutex_lock_nested+0x3c/0x50 phy_start_aneg_priv+0x38/0x180 phy_start_aneg+0x24/0x30 ravb_nway_reset+0x3c/0x68 dev_ethtool+0x3dc/0x2338 dev_ioctl+0x19c/0x490 sock_do_ioctl+0xe0/0x238 sock_ioctl+0x254/0x460 do_vfs_ioctl+0xb0/0x918 ksys_ioctl+0x50/0x80 sys_ioctl+0x34/0x48 __sys_trace_return+0x0/0x4 The root cause is that an attempt to modify ECMR and GECMR registers only when RX/TX function is disabled was too overcomplicated in its original implementation, also processing of an optional Link Change interrupt added even more complexity, as a result the implementation was error prone. The new locking scheme is confirmed to be correct by dumping driver specific and generic PHY framework function calls with aid of ftrace while running more or less advanced tests. Please note that sh_eth patches from the series were built-tested only. On purpose I do not add Fixes tags, the reused PHY handlers were added way later than the fixed problems were firstly found in the drivers. >>> >>>I think you went one step too far with these fixes. On the first glance, >>> the real fixes are to remove grabbing/releasing the spinlock for the >>> duration >>> of the phylib calls. Am I right? If so, making use of the new phylib APIs >>> would be a further enhancement, it's not needed for fixing the splats per >>> se... >> >>Note that I hadn't looked at the patches #3/#6 at the time of writing >> this; >> those seem to be more complicated than the rest. > > Right, the simplistic approach of just removing the held spinlock does > not fit well into the overall lame locking model found in the driver. Yet you only try fixing it in the patches #3 and #6. I was talking about the patches #1 and #4 mostly (#2 and #5 turned out to be non-fixes). > The thing is that I would prefer to exhibit 'remove custom callbacks' > side of the changes as it is done now, and fixing severe 'invalid contex' > bugs is left as a valuable side effect. I may attempt to find enough > free time to follow your instructions, but frankly speaking I don't > see it beneficial to split a single good all-sufficient change into > three or more: removal of spinlocks, replacement of phy_start_aneg(), > then a non-functional clean-up. Yes, I would prefer these step-by-step changes. > Bikeshedding isn't my preference, This is not about bikeshedding. What you are trying to do clearly violates the 2 basic principles of the kernel development: "don't mix fixes and enhancements" and "do one thing per patch". > but a report about technical flaws related to the published changes > is appreciated, otherwise let me ask you to accept the changes as is, > secondary optimizations can be done on top of them. No, I'll certainly have to NAK patches #1/#3 in their current form. I'm yet to review patches #3/#6... anyway, if you lack the time to do things properly, I'll have to take this burden on my shoulders (giving you credits). Yet I'm basically is in the same situation as you -- I have to spend my copiuos free time on the large patch sets (like yours) and I'm still having some cleanups to sh_eth cooking here (which I'll most probably have to defer)... > -- > With best wishes, > Vladimir MBR, Sergei
Re: [PATCH 0/6] ravb/sh_eth: fix sleep in atomic by reusing shared ethtool handlers
Hi Vladimir, On Thu, May 24, 2018 at 1:11 PM, Vladimir Zapolskiywrote: > For ages trivial changes to RAVB and SuperH ethernet links by means of > standard 'ethtool' trigger a 'sleeping function called from invalid > context' bug, to visualize it on r8a7795 ULCB: > > % ethtool -r eth0 > BUG: sleeping function called from invalid context at > kernel/locking/mutex.c:747 [...] > Please note that sh_eth patches from the series were built-tested only. Thanks you very much! I've tested this on both R-Car M2-W (sh_eth) and R-Car H3 ES2.0 (ravb), and the BUG disappeared. Tested-by: Geert Uytterhoeven Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds
Re: [PATCH 0/6] ravb/sh_eth: fix sleep in atomic by reusing shared ethtool handlers
Hello Sergei, On 05/24/2018 08:24 PM, Sergei Shtylyov wrote: > On 05/24/2018 07:40 PM, Sergei Shtylyov wrote: > >>> For ages trivial changes to RAVB and SuperH ethernet links by means of >>> standard 'ethtool' trigger a 'sleeping function called from invalid >>> context' bug, to visualize it on r8a7795 ULCB: >>> >>> % ethtool -r eth0 >>> BUG: sleeping function called from invalid context at >>> kernel/locking/mutex.c:747 >>> in_atomic(): 1, irqs_disabled(): 128, pid: 554, name: ethtool >>> INFO: lockdep is turned off. >>> irq event stamp: 0 >>> hardirqs last enabled at (0): [<>] (null) >>> hardirqs last disabled at (0): [] >>> copy_process.isra.7.part.8+0x2cc/0x1918 >>> softirqs last enabled at (0): [] >>> copy_process.isra.7.part.8+0x2cc/0x1918 >>> softirqs last disabled at (0): [<>] (null) >>> CPU: 5 PID: 554 Comm: ethtool Not tainted 4.17.0-rc4-arm64-renesas+ #33 >>> Hardware name: Renesas H3ULCB board based on r8a7795 ES2.0+ (DT) >>> Call trace: >>>dump_backtrace+0x0/0x198 >>>show_stack+0x24/0x30 >>>dump_stack+0xb8/0xf4 >>>___might_sleep+0x1c8/0x1f8 >>>__might_sleep+0x58/0x90 >>>__mutex_lock+0x50/0x890 >>>mutex_lock_nested+0x3c/0x50 >>>phy_start_aneg_priv+0x38/0x180 >>>phy_start_aneg+0x24/0x30 >>>ravb_nway_reset+0x3c/0x68 >>>dev_ethtool+0x3dc/0x2338 >>>dev_ioctl+0x19c/0x490 >>>sock_do_ioctl+0xe0/0x238 >>>sock_ioctl+0x254/0x460 >>>do_vfs_ioctl+0xb0/0x918 >>>ksys_ioctl+0x50/0x80 >>>sys_ioctl+0x34/0x48 >>>__sys_trace_return+0x0/0x4 >>> >>> The root cause is that an attempt to modify ECMR and GECMR registers >>> only when RX/TX function is disabled was too overcomplicated in its >>> original implementation, also processing of an optional Link Change >>> interrupt added even more complexity, as a result the implementation >>> was error prone. >>> >>> The new locking scheme is confirmed to be correct by dumping driver >>> specific and generic PHY framework function calls with aid of ftrace >>> while running more or less advanced tests. >>> >>> Please note that sh_eth patches from the series were built-tested only. >>> >>> On purpose I do not add Fixes tags, the reused PHY handlers were added >>> way later than the fixed problems were firstly found in the drivers. >> >>I think you went one step too far with these fixes. On the first glance, >> the real fixes are to remove grabbing/releasing the spinlock for the duration >> of the phylib calls. Am I right? If so, making use of the new phylib APIs >> would be a further enhancement, it's not needed for fixing the splats per >> se... > >Note that I hadn't looked at the patches #3/#6 at the time of writing this; > those seem to be more complicated than the rest. Right, the simplistic approach of just removing the held spinlock does not fit well into the overall lame locking model found in the driver. The thing is that I would prefer to exhibit 'remove custom callbacks' side of the changes as it is done now, and fixing severe 'invalid contex' bugs is left as a valuable side effect. I may attempt to find enough free time to follow your instructions, but frankly speaking I don't see it beneficial to split a single good all-sufficient change into three or more: removal of spinlocks, replacement of phy_start_aneg(), then a non-functional clean-up. Bikeshedding isn't my preference, but a report about technical flaws related to the published changes is appreciated, otherwise let me ask you to accept the changes as is, secondary optimizations can be done on top of them. -- With best wishes, Vladimir
Re: [PATCH 0/6] ravb/sh_eth: fix sleep in atomic by reusing shared ethtool handlers
On 05/24/2018 07:40 PM, Sergei Shtylyov wrote: >> For ages trivial changes to RAVB and SuperH ethernet links by means of >> standard 'ethtool' trigger a 'sleeping function called from invalid >> context' bug, to visualize it on r8a7795 ULCB: >> >> % ethtool -r eth0 >> BUG: sleeping function called from invalid context at >> kernel/locking/mutex.c:747 >> in_atomic(): 1, irqs_disabled(): 128, pid: 554, name: ethtool >> INFO: lockdep is turned off. >> irq event stamp: 0 >> hardirqs last enabled at (0): [<>] (null) >> hardirqs last disabled at (0): [] >> copy_process.isra.7.part.8+0x2cc/0x1918 >> softirqs last enabled at (0): [] >> copy_process.isra.7.part.8+0x2cc/0x1918 >> softirqs last disabled at (0): [<>] (null) >> CPU: 5 PID: 554 Comm: ethtool Not tainted 4.17.0-rc4-arm64-renesas+ #33 >> Hardware name: Renesas H3ULCB board based on r8a7795 ES2.0+ (DT) >> Call trace: >>dump_backtrace+0x0/0x198 >>show_stack+0x24/0x30 >>dump_stack+0xb8/0xf4 >>___might_sleep+0x1c8/0x1f8 >>__might_sleep+0x58/0x90 >>__mutex_lock+0x50/0x890 >>mutex_lock_nested+0x3c/0x50 >>phy_start_aneg_priv+0x38/0x180 >>phy_start_aneg+0x24/0x30 >>ravb_nway_reset+0x3c/0x68 >>dev_ethtool+0x3dc/0x2338 >>dev_ioctl+0x19c/0x490 >>sock_do_ioctl+0xe0/0x238 >>sock_ioctl+0x254/0x460 >>do_vfs_ioctl+0xb0/0x918 >>ksys_ioctl+0x50/0x80 >>sys_ioctl+0x34/0x48 >>__sys_trace_return+0x0/0x4 >> >> The root cause is that an attempt to modify ECMR and GECMR registers >> only when RX/TX function is disabled was too overcomplicated in its >> original implementation, also processing of an optional Link Change >> interrupt added even more complexity, as a result the implementation >> was error prone. >> >> The new locking scheme is confirmed to be correct by dumping driver >> specific and generic PHY framework function calls with aid of ftrace >> while running more or less advanced tests. >> >> Please note that sh_eth patches from the series were built-tested only. >> >> On purpose I do not add Fixes tags, the reused PHY handlers were added >> way later than the fixed problems were firstly found in the drivers. > >I think you went one step too far with these fixes. On the first glance, > the real fixes are to remove grabbing/releasing the spinlock for the duration > of the phylib calls. Am I right? If so, making use of the new phylib APIs > would be a further enhancement, it's not needed for fixing the splats per > se... Note that I hadn't looked at the patches #3/#6 at the time of writing this; those seem to be more complicated than the rest. MBR, Sergei
Re: [PATCH 0/6] ravb/sh_eth: fix sleep in atomic by reusing shared ethtool handlers
On 05/24/2018 02:11 PM, Vladimir Zapolskiy wrote: > For ages trivial changes to RAVB and SuperH ethernet links by means of > standard 'ethtool' trigger a 'sleeping function called from invalid > context' bug, to visualize it on r8a7795 ULCB: > > % ethtool -r eth0 > BUG: sleeping function called from invalid context at > kernel/locking/mutex.c:747 > in_atomic(): 1, irqs_disabled(): 128, pid: 554, name: ethtool > INFO: lockdep is turned off. > irq event stamp: 0 > hardirqs last enabled at (0): [<>] (null) > hardirqs last disabled at (0): [] > copy_process.isra.7.part.8+0x2cc/0x1918 > softirqs last enabled at (0): [] > copy_process.isra.7.part.8+0x2cc/0x1918 > softirqs last disabled at (0): [<>] (null) > CPU: 5 PID: 554 Comm: ethtool Not tainted 4.17.0-rc4-arm64-renesas+ #33 > Hardware name: Renesas H3ULCB board based on r8a7795 ES2.0+ (DT) > Call trace: >dump_backtrace+0x0/0x198 >show_stack+0x24/0x30 >dump_stack+0xb8/0xf4 >___might_sleep+0x1c8/0x1f8 >__might_sleep+0x58/0x90 >__mutex_lock+0x50/0x890 >mutex_lock_nested+0x3c/0x50 >phy_start_aneg_priv+0x38/0x180 >phy_start_aneg+0x24/0x30 >ravb_nway_reset+0x3c/0x68 >dev_ethtool+0x3dc/0x2338 >dev_ioctl+0x19c/0x490 >sock_do_ioctl+0xe0/0x238 >sock_ioctl+0x254/0x460 >do_vfs_ioctl+0xb0/0x918 >ksys_ioctl+0x50/0x80 >sys_ioctl+0x34/0x48 >__sys_trace_return+0x0/0x4 > > The root cause is that an attempt to modify ECMR and GECMR registers > only when RX/TX function is disabled was too overcomplicated in its > original implementation, also processing of an optional Link Change > interrupt added even more complexity, as a result the implementation > was error prone. > > The new locking scheme is confirmed to be correct by dumping driver > specific and generic PHY framework function calls with aid of ftrace > while running more or less advanced tests. > > Please note that sh_eth patches from the series were built-tested only. > > On purpose I do not add Fixes tags, the reused PHY handlers were added > way later than the fixed problems were firstly found in the drivers. I think you went one step too far with these fixes. On the first glance, the real fixes are to remove grabbing/releasing the spinlock for the duration of the phylib calls. Am I right? If so, making use of the new phylib APIs would be a further enhancement, it's not needed for fixing the splats per se... MBR, Sergei