Hello, as you might know, our handling of CAN state changes and bus-off is not consistent, weak or even incorrect. Therefore I'm making an effort to improve, consolidate and *unify* it. Most things are straight- forward, but others need more attention and especially for the bus-off recovery I would appreciate some CAN expert advice (more below). I have already some patches implementing:
- Add missing do_get_berr_counter() callbacks (for ti_hecc, etc.). - Add error counters to the data fields 6..7 of *any* CAN error message automatically in alloc_can_err_skb(): if (priv->do_get_berr_counter) { struct can_berr_counter bec; priv->do_get_berr_counter(dev, &bec); (*cf)->data[6] = bec.txerr; (*cf)->data[7] = bec.rxerr; } - Allow state changes going down including "back to error active": Therefore I added: $ cat include/linux/can/error.h ... #define CAN_ERR_STATE_CHANGE 0x00000200U /* CAN error state change / data[1] */ ... #define CAN_ERR_CRTL_ACTIVE 0x40 /* recovered to error active state */ For any state change the CAN_ERR_STATE_CHANGE will be set in the can_id. If the state gets worse, CAN_ERR_CRTL is set as usual also for backward compatibility. The state change management will be done by a common "can_change_state()" function doing all the bit settings and counter increments. For the SJA1000 "candump -e" will then report for recovery from the error passive state (no cable): can0 20000204 [8] 00 08 00 00 00 00 60 00 ERRORFRAME controller-problem{tx-error-warning} state-change{tx-error-warning} error-counter-tx-rx{{96}{0}} can0 20000204 [8] 00 30 00 00 00 00 80 00 ERRORFRAME controller-problem{tx-error-passive} state-change{tx-error-passive} error-counter-tx-rx{{128}{0}} can0 124 [3] 12 34 56 ... can0 124 [3] 12 34 56 can0 20000200 [8] 00 08 00 00 00 00 7F 00 ERRORFRAME state-change{tx-error-warning} error-counter-tx-rx{{127}{0}} can0 124 [3] 12 34 56 ... can0 124 [3] 12 34 56 can0 20000200 [8] 00 40 00 00 00 00 5F 00 ERRORFRAME state-change{back-to-error-active} error-counter-tx-rx{{95}{0}} Updating all drivers correctly is a challenge, especially because I do not have all hardware. Help and comments are appreciated. - Bus-off recovery: Currently, I think, we do not handle bus-off recovery correctly for most controllers. We brute-force stop and restart the controller. The controller will do the recovery cycle anyway and we may send messages to early. Instead the software should handle the bus-off recovery cycle as shown below: * bus-off happens - call netif_stop_queue() and maybe disable interrupts * automatic or manual restart is done - trigger bus-off recovery sequence by resetting the init bit (on SJA1000) and maybe re-enable the interrupts - await the controller going back to error-active state (signaled via interrupt). - call netif_wake_queue() Here is a "candump -e" output for the SJA1000 (with delta times) (009.832477) can0 20000204 [8] 00 30 00 00 00 00 88 00 ERRORFRAME controller-problem{tx-error-passive} state-change{tx-error-passive} error-counter-tx-rx{{136}{0}} (000.000804) can0 20000240 [8] 00 00 00 00 00 00 7F 00 ERRORFRAME bus-off state-change{} error-counter-tx-rx{{127}{0}} (000.099795) can0 20000100 [8] 00 00 00 00 00 00 7F 00 ERRORFRAME restarted-after-bus-off error-counter-tx-rx{{127}{0}} (000.003061) can0 20000200 [8] 00 40 00 00 00 00 00 00 ERRORFRAME state-change{back-to-error-active} Before doing all the necessary code changes, which are not always trivial I ask: Would that be the correct bus-off handling??? Thanks for feedback. Wolfgang. _______________________________________________ Socketcan-core mailing list Socketcan-core@lists.berlios.de https://lists.berlios.de/mailman/listinfo/socketcan-core