On 12/02/2011 10:04 AM, Wolfgang Grandegger wrote:
> Hello,
>
> as you might know, our handling of CAN state changes and bus-off is
> not consistent, weak or even incorrect. Therefore I'm making an effort
> to improve, consolidate and *unify* it. Most things are
> straightforward, but others need more attention and especially for the
> bus-off recovery I would appreciate some CAN expert advice (more
> below). I have already some patches implementing:
>
> - Add missing do_get_berr_counter() callbacks (for ti_hecc, etc.).
+1
> - Add error counters to the data fields 6..7 of *any* CAN error message
> automatically in alloc_can_err_skb():
>
> if (priv->do_get_berr_counter) {
> struct can_berr_counter bec;
>
> priv->do_get_berr_counter(dev, &bec);
> (*cf)->data[6] = bec.txerr;
> (*cf)->data[7] = bec.rxerr;
> }
What about devices that are not directly connected, like the mcp251x? At
least the mcp2515 driver (which is not mainline, though) needs an SPI
transfer for that.
Do we need a flag in the driver to indicate that the berr_counter should
not be read?
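
To make the question concrete, here is a minimal userspace sketch of what
such a flag could look like. It is not kernel code: the *_model structs and
the berr_counter_slow flag are hypothetical names made up for illustration,
and demo_get_berr_counter() just returns fixed values in place of a real
register or SPI read.

```c
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel types; only the fields used here. */
struct can_berr_counter {
	uint16_t txerr;
	uint16_t rxerr;
};

struct can_frame_model {
	uint8_t data[8];
};

struct can_priv_model {
	/* Hypothetical flag: reading the counters is slow or may sleep (SPI). */
	bool berr_counter_slow;
	int (*do_get_berr_counter)(const struct can_priv_model *priv,
				   struct can_berr_counter *bec);
};

/* Demo callback with fixed values instead of a real hardware read. */
static int demo_get_berr_counter(const struct can_priv_model *priv,
				 struct can_berr_counter *bec)
{
	(void)priv;
	bec->txerr = 96;
	bec->rxerr = 0;
	return 0;
}

/*
 * Fill data[6..7] only if a callback exists and the driver has not
 * marked it as slow. Returns true if the counters were written.
 */
static bool fill_berr_counter(const struct can_priv_model *priv,
			      struct can_frame_model *cf)
{
	struct can_berr_counter bec;

	if (!priv->do_get_berr_counter || priv->berr_counter_slow)
		return false;
	if (priv->do_get_berr_counter(priv, &bec) != 0)
		return false;

	cf->data[6] = (uint8_t)bec.txerr;
	cf->data[7] = (uint8_t)bec.rxerr;
	return true;
}
```

With the flag set, data[6..7] stay untouched, so userspace would need some
way to tell "counters not available" apart from "both counters are zero".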
> - Allow state changes going down including "back to error active":
>
> Therefore I added:
>
> $ cat include/linux/can/error.h
> ...
> #define CAN_ERR_STATE_CHANGE 0x00000200U /* CAN error state change / data[1] */
> ...
> #define CAN_ERR_CRTL_ACTIVE 0x40 /* recovered to error active state */
>
> For any state change the CAN_ERR_STATE_CHANGE will be set in the
> can_id. If the state gets worse, CAN_ERR_CRTL is set as usual
> also for backward compatibility. The state change management will
> be done by a common "can_change_state()" function doing all the bit
yeah! common function! +1
> settings and counter increments. For the SJA1000 "candump -e" will
> then report for recovery from the error passive state (no cable):
>
> can0 20000204 [8] 00 08 00 00 00 00 60 00 ERRORFRAME
> controller-problem{tx-error-warning}
> state-change{tx-error-warning}
> error-counter-tx-rx{{96}{0}}
> can0 20000204 [8] 00 30 00 00 00 00 80 00 ERRORFRAME
> controller-problem{tx-error-passive}
> state-change{tx-error-passive}
> error-counter-tx-rx{{128}{0}}
> can0 124 [3] 12 34 56
> ...
> can0 124 [3] 12 34 56
> can0 20000200 [8] 00 08 00 00 00 00 7F 00 ERRORFRAME
> state-change{tx-error-warning}
> error-counter-tx-rx{{127}{0}}
> can0 124 [3] 12 34 56
> ...
> can0 124 [3] 12 34 56
> can0 20000200 [8] 00 40 00 00 00 00 5F 00 ERRORFRAME
> state-change{back-to-error-active}
> error-counter-tx-rx{{95}{0}}
>
> Updating all drivers correctly is a challenge, especially because I
> do not have all hardware. Help and comments are appreciated.
I can test the at91_can, flexcan and if we're lucky we've a mcp251x in
the office.
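
For testing, a small decoder for the proposed frame layout might help. The
following is only a sketch: CAN_ERR_STATE_CHANGE and CAN_ERR_CRTL_ACTIVE use
the values proposed above, the other constants are copied from
include/linux/can/error.h, and state_name() / describe_error_frame() are
hypothetical helpers, not part of candump. The output format is made up and
does not try to reproduce the candump text.

```c
#include <stdint.h>
#include <stdio.h>

/* Existing values from include/linux/can/error.h */
#define CAN_ERR_FLAG            0x20000000U
#define CAN_ERR_BUSOFF          0x00000040U
#define CAN_ERR_RESTARTED       0x00000100U
#define CAN_ERR_CRTL_RX_WARNING 0x04
#define CAN_ERR_CRTL_TX_WARNING 0x08
#define CAN_ERR_CRTL_RX_PASSIVE 0x10
#define CAN_ERR_CRTL_TX_PASSIVE 0x20

/* Values proposed in this thread */
#define CAN_ERR_STATE_CHANGE    0x00000200U
#define CAN_ERR_CRTL_ACTIVE     0x40

/* Map the controller status byte (data[1]) to a coarse state name. */
static const char *state_name(uint8_t ctrl)
{
	if (ctrl & CAN_ERR_CRTL_ACTIVE)
		return "back-to-error-active";
	if (ctrl & (CAN_ERR_CRTL_TX_PASSIVE | CAN_ERR_CRTL_RX_PASSIVE))
		return "error-passive";
	if (ctrl & (CAN_ERR_CRTL_TX_WARNING | CAN_ERR_CRTL_RX_WARNING))
		return "error-warning";
	return "unspecified";
}

/*
 * Write a one-line summary of an error frame into buf.
 * Returns -1 if can_id is not an error frame, else the snprintf() result.
 */
static int describe_error_frame(uint32_t can_id, const uint8_t data[8],
				char *buf, size_t len)
{
	if (!(can_id & CAN_ERR_FLAG))
		return -1;

	return snprintf(buf, len, "%s%s%sstate=%s tx=%u rx=%u",
			(can_id & CAN_ERR_BUSOFF) ? "bus-off " : "",
			(can_id & CAN_ERR_RESTARTED) ? "restarted " : "",
			(can_id & CAN_ERR_STATE_CHANGE) ? "state-change " : "",
			state_name(data[1]),
			(unsigned int)data[6], (unsigned int)data[7]);
}
```

Feeding it the last frame of the dump above (can_id 0x20000200, data[1]
0x40, data[6] 0x5F) gives "state-change state=back-to-error-active tx=95
rx=0".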
> - Bus-off recovery:
>
> Currently, I think, we do not handle bus-off recovery correctly for
> most controllers. We brute-force stop and restart the controller.
> The controller will do the recovery cycle anyway and we may send
> messages too early. Instead the software should handle the bus-off
> recovery cycle as shown below:
>
> * bus-off happens
> - call netif_stop_queue() and maybe disable interrupts
>
> * automatic or manual restart is done
> - trigger bus-off recovery sequence by resetting the init bit
> (on SJA1000) and maybe re-enable the interrupts
> - await the controller going back to error-active state
> (signaled via interrupt).
I'm not sure whether all controllers correctly signal that they are back
in error active. My observation is that bus-off handling is a bit like
climbing Mount Everest: the air is quite thin and things can lock up
quite fast.
> - call netif_wake_queue()
>
> Here is a "candump -e" output for the SJA1000 (with delta times)
>
> (009.832477) can0 20000204 [8] 00 30 00 00 00 00 88 00 ERRORFRAME
> controller-problem{tx-error-passive}
> state-change{tx-error-passive}
> error-counter-tx-rx{{136}{0}}
> (000.000804) can0 20000240 [8] 00 00 00 00 00 00 7F 00 ERRORFRAME
> bus-off
> state-change{}
> error-counter-tx-rx{{127}{0}}
> (000.099795) can0 20000100 [8] 00 00 00 00 00 00 7F 00 ERRORFRAME
> restarted-after-bus-off
> error-counter-tx-rx{{127}{0}}
> (000.003061) can0 20000200 [8] 00 40 00 00 00 00 00 00 ERRORFRAME
> state-change{back-to-error-active}
>
> Before doing all the necessary code changes, which are not always
> trivial I ask: Would that be the correct bus-off handling???
However, if the hardware permits, the described steps sound reasonable
(from my non-CAN-expert point of view).
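
As I read the proposed sequence, it boils down to a small state machine.
The sketch below is a userspace model only: the enum and struct names are
hypothetical, tx_queue_running stands in for netif_stop_queue() /
netif_wake_queue(), and recovery_triggered stands in for resetting the init
bit on the controller.

```c
#include <stdbool.h>

enum ctrl_state { ST_ERROR_ACTIVE, ST_BUS_OFF, ST_RECOVERING };
enum ctrl_event { EV_BUS_OFF, EV_RESTART, EV_ERROR_ACTIVE_IRQ };

struct ctrl_model {
	enum ctrl_state state;
	bool tx_queue_running;   /* models netif_stop_queue()/netif_wake_queue() */
	bool recovery_triggered; /* models resetting the init bit */
};

static void ctrl_handle_event(struct ctrl_model *c, enum ctrl_event ev)
{
	switch (ev) {
	case EV_BUS_OFF:
		/* bus-off happens: stop the TX queue */
		c->tx_queue_running = false;
		c->recovery_triggered = false;
		c->state = ST_BUS_OFF;
		break;
	case EV_RESTART:
		/* automatic or manual restart: only start the recovery cycle */
		if (c->state == ST_BUS_OFF) {
			c->recovery_triggered = true;
			c->state = ST_RECOVERING;
		}
		break;
	case EV_ERROR_ACTIVE_IRQ:
		/* controller reports error active: now it is safe to send */
		if (c->state == ST_RECOVERING) {
			c->tx_queue_running = true;
			c->state = ST_ERROR_ACTIVE;
		}
		break;
	}
}
```

Note that in this model the queue stays stopped forever if
EV_ERROR_ACTIVE_IRQ never arrives, which is exactly the case I'm worried
about for controllers that do not signal the return to error active.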
> Thanks for feedback.
Marc
--
Pengutronix e.K. | Marc Kleine-Budde |
Industrial Linux Solutions | Phone: +49-231-2826-924 |
Vertretung West/Dortmund | Fax: +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686 | http://www.pengutronix.de |
