On 12/5/22 13:06, Niel Fourie wrote:
Hi,
[...]
How come nobody triggered this problem with regular ethernet in
U-Boot ?
If this is isolated to USB gadget ethernet, then please do not hack
around this in core networking code, but rather fix the USB ethernet
gadget itself. It seems that gadget code should not unregister the
gadget in drivers/usb/gadget/ether.c _usb_eth_halt() , at least not
fully.
The reason is simple, the regular ethernet drivers do not get removed
on stop() like for the gadget ethernet driver, and in their case
`priv` is still valid.
The suggestion for a proper fix is in the last paragraph above -- do
not unregister the usb ethernet gadget device in halt(), keep the
gadget device registered. Then the priv pointer would be valid. (*)
I agree on your point that reworking the ethernet gadget code would
be preferable, but this would be a much bigger effort (and if I were
to do it, I would probably introduce even more bugs). I am not
certain whether this would not also affect the non-DM gadget
implementation as well, which still contain drivers like ci_udc.c
which does not appear to have been ported to DM gadget yet? (I only
see DM USB there...)
That said, I am not certain whether this is not also bug, as I am not
certain whether the assumption that `priv` should be available after
stop() is valid or not.
The documentation at
https://source.denx.de/u-boot/u-boot/-/blob/master/doc/develop/driver-model/ethernet.rst?plain=1#L135 states:
The **stop** function should turn off / disable the hardware and
place it back in its reset state. It can be called at any time
(before any call to the related start() function), so make sure it
can handle this sort of thing.
The ->stop callback is supposed to stop the interface, turn off its
DMA, but NOT deallocate the device (and its associated data) behind it.
In such a complete reset state I am not certain whether the
assumption that `priv` should exist is still valid, at least not
without another call to dev_get_uclass_priv() and revalidating it first?
In case a device is probe()d, its private data are also allocated and
available, so yes, 'priv' pointer should still be valid.
Granted, the usb gadget driver implementation is problematic, and it
definitely belongs on the TODO list.
That being said, priv not being valid at this stage is not a new
problem, as validation for it after stop() was explicitly added in
commit c3211708 ("net: eth-uclass: Fix for DM USB ethernet support") for
this reason 4 years ago. Fortunately in that case, it just fixed a null
pointer de-reference, not corruption of freed memory.
I believe the aforementioned patch is already also wrong, since the priv
pointer would be invalid for gadget ethernet driver got removed in the
stop callback.
Lastly, this assumption that priv is still valid is rather new and it
was introduced here:
https://source.denx.de/u-boot/u-boot/-/commit/fa795f452541ce07b33be603de36cac3c5d7dfcf
I disagree, the device private data are valid during the entire
lifespan of the device. That assumption has been baked into the driver
model itself and far predates that commit.
In case of a usb ethernet, the lifespan of the device starts with the
'bind' command which triggers the ->bind callback, and first use which
triggers its ->probe callback. The lifespan ends with 'unbind' command
or OS boot, which triggers ->remove callback and ->unbind callbacks.
This commit appears to be a workaround for drivers which cannot deal
with stop() being called at any time as required in the above quoted
documentation.
This commit prevents network device ->start() from being called
multiple times, which is a valid precaution, as calling start while
the interface is already up would interfere with existing connection
(e.g. the netconsole as mentioned in the commit message). That does
not seem to be a workaround to me.
Ah yes, the commit message indeed states that some drivers (like
fec_mxc) cannot deal with multiple start() calls, not multiple stop()
calls. My mistake, good point.
But preventing multiple start()s getting called is not what the code
change in this commit is actually doing. It is instead inhibiting
calling stop() on devices for which priv is not valid or for which
priv->running is false. Therefore it is avoiding calling stop() on
un-started devices.
I wonder if such a patch is needed to fix it ?
diff --git a/net/eth-uclass.c b/net/eth-uclass.c
index f41da4b37b3..e7b5d3943e8 100644
--- a/net/eth-uclass.c
+++ b/net/eth-uclass.c
@@ -298,7 +298,7 @@ int eth_init(void)
if (current) {
debug("Trying %s\n", current->name);
- if (device_active(current)) {
+ if (!eth_is_active(current)) {
I would consider adding a workaround to a workaround in this case to
be the lesser evil, as tracking down this bug in the first place was
like looking for a needle in a haystack. This change would at least
save everybody else from strange crashes in particular configurations
without any negative impact. But this is fortunately not my decision. :)
Commit fa795f45254 ("net: eth-uclass: avoid running start() twice
without stop()") is as far as I can tell unrelated to this change and
does not seem to me like a workaround.
It does introduce writing to priv after stop() is called without
(re-)validating priv first, though.
The code previously first called stop(), then called
dev_get_uclass_priv() to get priv and then validated priv before writing
to it, which avoids exactly the problem of priv no longer being valid.
As mentioned above, this was explicitly addressed by commit c3211708
("net: eth-uclass: Fix for DM USB ethernet support").
I'm afraid this is just piling workarounds on top of workarounds.
It does however show that this patch introduces a bug -- this patch
changes the order in which priv->state = ETH_STATE_PASSIVE; is
assigned from _after_ the ->stop callback to _before_ the -> stop
callback. This breaks drivers/net/ldpaa_eth/ldpaa_eth.c which checks
the priv->state in its ->stop callback, either on its own in non-DM
case, or in eth_is_active() implementation in DM case. With this
patch, the interface would never be stopped in the ->stop callback,
because the condition (net_dev->state == ETH_STATE_PASSIVE) test in
the ldpaa stop callback implementation would always be true.
In drivers/net/ldpaa_eth/ldpaa_eth.c:ldpaa_eth_stop(), priv is of type
struct ldpaa_eth_priv*, defined in drivers/net/ldpaa_eth/ldpaa_eth.h and
is accessed using dev_get_priv().
In net/eth-uclass.c:eht_halt(), priv is of type struct eth_device_priv*
and defined in the same .c file, and is accessed using
dev_get_uclass_priv(). As the structure is local to this file, nothing
outside of this file should have any knowledge of its contents, and
changing of the order of the calls should only impact this file.
As you already found out yourself in subsequent email:
655 static void ldpaa_eth_stop(struct udevice *dev)
656 {
657 struct ldpaa_eth_priv *priv = dev_get_priv(dev);
...
666 #ifdef CONFIG_DM_ETH
667 if (!eth_is_active(dev))
...
The eth_is_active():
350 int eth_is_active(struct udevice *dev)
351 {
352 struct eth_device_priv *priv;
353
354 if (!dev || !device_active(dev))
355 return 0;
356
357 priv = dev_get_uclass_priv(dev);
358 return priv->state == ETH_STATE_ACTIVE;
359 }
does access dev_get_uclass_priv(dev)->priv .
I sincerely hope that these two are not interfering with each other,
otherwise we have much bigger problems...
The proper fix is in the usb ethernet gadget code, see (*) above,
let's not pile workarounds onto already difficult to maintain code.
[...]
Agreed. The problem is unfortunately just that I don't see that
happening simply to fix this bug.
Please, do not pile workarounds on top of workarounds, there is too many
of those already as you found out above. Each one just slaps something
on, which later on breaks and causes pain a few years down the line.
This is not manageable in the long run.
The proper fix I believe is (*) above and it should not be hard to
implement that one.