Re: [PATCH 0/6] ravb/sh_eth: fix sleep in atomic by reusing shared ethtool handlers

2018-05-26 Thread Sergei Shtylyov
On 05/25/2018 09:25 AM, Vladimir Zapolskiy wrote:

 For ages trivial changes to RAVB and SuperH ethernet links by means of
 standard 'ethtool' trigger a 'sleeping function called from invalid
 context' bug, to visualize it on r8a7795 ULCB:

   % ethtool -r eth0
   BUG: sleeping function called from invalid context at 
 kernel/locking/mutex.c:747
   in_atomic(): 1, irqs_disabled(): 128, pid: 554, name: ethtool
   INFO: lockdep is turned off.
   irq event stamp: 0
   hardirqs last  enabled at (0): [<>]   (null)
   hardirqs last disabled at (0): [] 
 copy_process.isra.7.part.8+0x2cc/0x1918
   softirqs last  enabled at (0): [] 
 copy_process.isra.7.part.8+0x2cc/0x1918
   softirqs last disabled at (0): [<>]   (null)
   CPU: 5 PID: 554 Comm: ethtool Not tainted 4.17.0-rc4-arm64-renesas+ #33
   Hardware name: Renesas H3ULCB board based on r8a7795 ES2.0+ (DT)
   Call trace:
dump_backtrace+0x0/0x198
show_stack+0x24/0x30
dump_stack+0xb8/0xf4
___might_sleep+0x1c8/0x1f8
__might_sleep+0x58/0x90
__mutex_lock+0x50/0x890
mutex_lock_nested+0x3c/0x50
phy_start_aneg_priv+0x38/0x180
phy_start_aneg+0x24/0x30
ravb_nway_reset+0x3c/0x68
dev_ethtool+0x3dc/0x2338
dev_ioctl+0x19c/0x490
sock_do_ioctl+0xe0/0x238
sock_ioctl+0x254/0x460
do_vfs_ioctl+0xb0/0x918
ksys_ioctl+0x50/0x80
sys_ioctl+0x34/0x48
__sys_trace_return+0x0/0x4

 The root cause is that an attempt to modify ECMR and GECMR registers
 only when RX/TX function is disabled was too overcomplicated in its
 original implementation, also processing of an optional Link Change
 interrupt added even more complexity, as a result the implementation
 was error prone.

 The new locking scheme is confirmed to be correct by dumping driver
 specific and generic PHY framework function calls with aid of ftrace
 while running more or less advanced tests.

 Please note that sh_eth patches from the series were built-tested only.

 On purpose I do not add Fixes tags, the reused PHY handlers were added
 way later than the fixed problems were firstly found in the drivers.
>>>
>>>I think you went one step too far with these fixes. On the first glance,
>>> the real fixes are to remove grabbing/releasing the spinlock for the 
>>> duration
>>> of the phylib calls. Am I right? If so, making use of the new phylib APIs
>>> would be a further enhancement, it's not needed for fixing the splats per 
>>> se...
>>
>>Note that I hadn't looked at the patches #3/#6 at the time of writing 
>> this;
>> those seem to be more complicated than the rest.
> 
> Right, the simplistic approach of just removing the held spinlock does
> not fit well into the overall lame locking model found in the driver.

   Yet you only try fixing it in the patches #3 and #6. I was talking about
the patches #1 and #4 mostly (#2 and #5 turned out to be non-fixes).

> The thing is that I would prefer to exhibit 'remove custom callbacks'
> side of the changes as it is done now, and fixing severe 'invalid contex'
> bugs is left as a valuable side effect. I may attempt to find enough
> free time to follow your instructions, but frankly speaking I don't
> see it beneficial to split a single good all-sufficient change into
> three or more: removal of spinlocks, replacement of phy_start_aneg(),
> then a non-functional clean-up.
   Yes, I would prefer these step-by-step changes.

> Bikeshedding isn't my preference,

   This is not about bikeshedding. What you are trying to do clearly
violates the 2 basic principles of the kernel development: "don't mix
fixes and enhancements" and "do one thing per patch". 

> but a report about technical flaws related to the published changes
> is appreciated, otherwise let me ask you to accept the changes as is,
> secondary optimizations can be done on top of them.

   No, I'll certainly have to NAK patches #1/#3 in their current form.
I'm yet to review patches #3/#6... anyway, if you lack the time to do things
properly, I'll have to take this burden on my shoulders (giving you credits).
Yet I'm basically is in the same situation as you -- I have to spend my copiuos
free time on the large patch sets (like yours) and I'm still having some 
cleanups
to sh_eth cooking here (which I'll most probably have to defer)...

> --
> With best wishes,
> Vladimir

MBR, Sergei


Re: [PATCH 0/6] ravb/sh_eth: fix sleep in atomic by reusing shared ethtool handlers

2018-05-25 Thread Geert Uytterhoeven
Hi Vladimir,

On Thu, May 24, 2018 at 1:11 PM, Vladimir Zapolskiy
 wrote:
> For ages trivial changes to RAVB and SuperH ethernet links by means of
> standard 'ethtool' trigger a 'sleeping function called from invalid
> context' bug, to visualize it on r8a7795 ULCB:
>
>   % ethtool -r eth0
>   BUG: sleeping function called from invalid context at 
> kernel/locking/mutex.c:747

[...]

> Please note that sh_eth patches from the series were built-tested only.

Thanks you very much!

I've tested this on both R-Car M2-W (sh_eth) and R-Car H3 ES2.0 (ravb),
and the BUG disappeared.

Tested-by: Geert Uytterhoeven 

Gr{oetje,eeting}s,

Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds


Re: [PATCH 0/6] ravb/sh_eth: fix sleep in atomic by reusing shared ethtool handlers

2018-05-25 Thread Vladimir Zapolskiy
Hello Sergei,

On 05/24/2018 08:24 PM, Sergei Shtylyov wrote:
> On 05/24/2018 07:40 PM, Sergei Shtylyov wrote:
> 
>>> For ages trivial changes to RAVB and SuperH ethernet links by means of
>>> standard 'ethtool' trigger a 'sleeping function called from invalid
>>> context' bug, to visualize it on r8a7795 ULCB:
>>>
>>>   % ethtool -r eth0
>>>   BUG: sleeping function called from invalid context at 
>>> kernel/locking/mutex.c:747
>>>   in_atomic(): 1, irqs_disabled(): 128, pid: 554, name: ethtool
>>>   INFO: lockdep is turned off.
>>>   irq event stamp: 0
>>>   hardirqs last  enabled at (0): [<>]   (null)
>>>   hardirqs last disabled at (0): [] 
>>> copy_process.isra.7.part.8+0x2cc/0x1918
>>>   softirqs last  enabled at (0): [] 
>>> copy_process.isra.7.part.8+0x2cc/0x1918
>>>   softirqs last disabled at (0): [<>]   (null)
>>>   CPU: 5 PID: 554 Comm: ethtool Not tainted 4.17.0-rc4-arm64-renesas+ #33
>>>   Hardware name: Renesas H3ULCB board based on r8a7795 ES2.0+ (DT)
>>>   Call trace:
>>>dump_backtrace+0x0/0x198
>>>show_stack+0x24/0x30
>>>dump_stack+0xb8/0xf4
>>>___might_sleep+0x1c8/0x1f8
>>>__might_sleep+0x58/0x90
>>>__mutex_lock+0x50/0x890
>>>mutex_lock_nested+0x3c/0x50
>>>phy_start_aneg_priv+0x38/0x180
>>>phy_start_aneg+0x24/0x30
>>>ravb_nway_reset+0x3c/0x68
>>>dev_ethtool+0x3dc/0x2338
>>>dev_ioctl+0x19c/0x490
>>>sock_do_ioctl+0xe0/0x238
>>>sock_ioctl+0x254/0x460
>>>do_vfs_ioctl+0xb0/0x918
>>>ksys_ioctl+0x50/0x80
>>>sys_ioctl+0x34/0x48
>>>__sys_trace_return+0x0/0x4
>>>
>>> The root cause is that an attempt to modify ECMR and GECMR registers
>>> only when RX/TX function is disabled was too overcomplicated in its
>>> original implementation, also processing of an optional Link Change
>>> interrupt added even more complexity, as a result the implementation
>>> was error prone.
>>>
>>> The new locking scheme is confirmed to be correct by dumping driver
>>> specific and generic PHY framework function calls with aid of ftrace
>>> while running more or less advanced tests.
>>>
>>> Please note that sh_eth patches from the series were built-tested only.
>>>
>>> On purpose I do not add Fixes tags, the reused PHY handlers were added
>>> way later than the fixed problems were firstly found in the drivers.
>>
>>I think you went one step too far with these fixes. On the first glance,
>> the real fixes are to remove grabbing/releasing the spinlock for the duration
>> of the phylib calls. Am I right? If so, making use of the new phylib APIs
>> would be a further enhancement, it's not needed for fixing the splats per 
>> se...
> 
>Note that I hadn't looked at the patches #3/#6 at the time of writing this;
> those seem to be more complicated than the rest.

Right, the simplistic approach of just removing the held spinlock does
not fit well into the overall lame locking model found in the driver.

The thing is that I would prefer to exhibit 'remove custom callbacks'
side of the changes as it is done now, and fixing severe 'invalid contex'
bugs is left as a valuable side effect. I may attempt to find enough
free time to follow your instructions, but frankly speaking I don't
see it beneficial to split a single good all-sufficient change into
three or more: removal of spinlocks, replacement of phy_start_aneg(),
then a non-functional clean-up. Bikeshedding isn't my preference,
but a report about technical flaws related to the published changes
is appreciated, otherwise let me ask you to accept the changes as is,
secondary optimizations can be done on top of them.

--
With best wishes,
Vladimir


Re: [PATCH 0/6] ravb/sh_eth: fix sleep in atomic by reusing shared ethtool handlers

2018-05-24 Thread Sergei Shtylyov
On 05/24/2018 07:40 PM, Sergei Shtylyov wrote:

>> For ages trivial changes to RAVB and SuperH ethernet links by means of
>> standard 'ethtool' trigger a 'sleeping function called from invalid
>> context' bug, to visualize it on r8a7795 ULCB:
>>
>>   % ethtool -r eth0
>>   BUG: sleeping function called from invalid context at 
>> kernel/locking/mutex.c:747
>>   in_atomic(): 1, irqs_disabled(): 128, pid: 554, name: ethtool
>>   INFO: lockdep is turned off.
>>   irq event stamp: 0
>>   hardirqs last  enabled at (0): [<>]   (null)
>>   hardirqs last disabled at (0): [] 
>> copy_process.isra.7.part.8+0x2cc/0x1918
>>   softirqs last  enabled at (0): [] 
>> copy_process.isra.7.part.8+0x2cc/0x1918
>>   softirqs last disabled at (0): [<>]   (null)
>>   CPU: 5 PID: 554 Comm: ethtool Not tainted 4.17.0-rc4-arm64-renesas+ #33
>>   Hardware name: Renesas H3ULCB board based on r8a7795 ES2.0+ (DT)
>>   Call trace:
>>dump_backtrace+0x0/0x198
>>show_stack+0x24/0x30
>>dump_stack+0xb8/0xf4
>>___might_sleep+0x1c8/0x1f8
>>__might_sleep+0x58/0x90
>>__mutex_lock+0x50/0x890
>>mutex_lock_nested+0x3c/0x50
>>phy_start_aneg_priv+0x38/0x180
>>phy_start_aneg+0x24/0x30
>>ravb_nway_reset+0x3c/0x68
>>dev_ethtool+0x3dc/0x2338
>>dev_ioctl+0x19c/0x490
>>sock_do_ioctl+0xe0/0x238
>>sock_ioctl+0x254/0x460
>>do_vfs_ioctl+0xb0/0x918
>>ksys_ioctl+0x50/0x80
>>sys_ioctl+0x34/0x48
>>__sys_trace_return+0x0/0x4
>>
>> The root cause is that an attempt to modify ECMR and GECMR registers
>> only when RX/TX function is disabled was too overcomplicated in its
>> original implementation, also processing of an optional Link Change
>> interrupt added even more complexity, as a result the implementation
>> was error prone.
>>
>> The new locking scheme is confirmed to be correct by dumping driver
>> specific and generic PHY framework function calls with aid of ftrace
>> while running more or less advanced tests.
>>
>> Please note that sh_eth patches from the series were built-tested only.
>>
>> On purpose I do not add Fixes tags, the reused PHY handlers were added
>> way later than the fixed problems were firstly found in the drivers.
> 
>I think you went one step too far with these fixes. On the first glance,
> the real fixes are to remove grabbing/releasing the spinlock for the duration
> of the phylib calls. Am I right? If so, making use of the new phylib APIs
> would be a further enhancement, it's not needed for fixing the splats per 
> se...

   Note that I hadn't looked at the patches #3/#6 at the time of writing this;
those seem to be more complicated than the rest.

MBR, Sergei


Re: [PATCH 0/6] ravb/sh_eth: fix sleep in atomic by reusing shared ethtool handlers

2018-05-24 Thread Sergei Shtylyov
On 05/24/2018 02:11 PM, Vladimir Zapolskiy wrote:

> For ages trivial changes to RAVB and SuperH ethernet links by means of
> standard 'ethtool' trigger a 'sleeping function called from invalid
> context' bug, to visualize it on r8a7795 ULCB:
> 
>   % ethtool -r eth0
>   BUG: sleeping function called from invalid context at 
> kernel/locking/mutex.c:747
>   in_atomic(): 1, irqs_disabled(): 128, pid: 554, name: ethtool
>   INFO: lockdep is turned off.
>   irq event stamp: 0
>   hardirqs last  enabled at (0): [<>]   (null)
>   hardirqs last disabled at (0): [] 
> copy_process.isra.7.part.8+0x2cc/0x1918
>   softirqs last  enabled at (0): [] 
> copy_process.isra.7.part.8+0x2cc/0x1918
>   softirqs last disabled at (0): [<>]   (null)
>   CPU: 5 PID: 554 Comm: ethtool Not tainted 4.17.0-rc4-arm64-renesas+ #33
>   Hardware name: Renesas H3ULCB board based on r8a7795 ES2.0+ (DT)
>   Call trace:
>dump_backtrace+0x0/0x198
>show_stack+0x24/0x30
>dump_stack+0xb8/0xf4
>___might_sleep+0x1c8/0x1f8
>__might_sleep+0x58/0x90
>__mutex_lock+0x50/0x890
>mutex_lock_nested+0x3c/0x50
>phy_start_aneg_priv+0x38/0x180
>phy_start_aneg+0x24/0x30
>ravb_nway_reset+0x3c/0x68
>dev_ethtool+0x3dc/0x2338
>dev_ioctl+0x19c/0x490
>sock_do_ioctl+0xe0/0x238
>sock_ioctl+0x254/0x460
>do_vfs_ioctl+0xb0/0x918
>ksys_ioctl+0x50/0x80
>sys_ioctl+0x34/0x48
>__sys_trace_return+0x0/0x4
> 
> The root cause is that an attempt to modify ECMR and GECMR registers
> only when RX/TX function is disabled was too overcomplicated in its
> original implementation, also processing of an optional Link Change
> interrupt added even more complexity, as a result the implementation
> was error prone.
> 
> The new locking scheme is confirmed to be correct by dumping driver
> specific and generic PHY framework function calls with aid of ftrace
> while running more or less advanced tests.
> 
> Please note that sh_eth patches from the series were built-tested only.
> 
> On purpose I do not add Fixes tags, the reused PHY handlers were added
> way later than the fixed problems were firstly found in the drivers.

   I think you went one step too far with these fixes. On the first glance,
the real fixes are to remove grabbing/releasing the spinlock for the duration
of the phylib calls. Am I right? If so, making use of the new phylib APIs
would be a further enhancement, it's not needed for fixing the splats per se...

MBR, Sergei