Hi,
I have a problem when using codesys v3 with SocketCAN on an ARM module
running Linux kernel 3.0.2 with two SJA1000 CAN controllers.

I have two of these modules. I use CAN messages to monitor whether the
other module is alive.

This is done with a toggle bit in one message that is sent every 100 ms. If
the toggle bit has not changed within 500 ms, I raise an alarm.
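The check described above can be sketched as follows. This is a minimal Python illustration of the logic only, not the codesys implementation; the class name `ToggleWatchdog` and the caller-supplied millisecond timestamps are assumptions.

```python
class ToggleWatchdog:
    """Hypothetical sketch: alarm if the toggle bit has not changed within timeout_ms."""

    def __init__(self, timeout_ms=500, start_ms=0):
        self.timeout_ms = timeout_ms
        self.last_toggle = None          # toggle-bit value from the last message
        self.last_change_ms = start_ms   # time the toggle bit last changed

    def on_message(self, toggle_bit, now_ms):
        # Only a *change* of the toggle bit proves the sender is alive;
        # a repeated frame with the same bit value does not reset the timer.
        if toggle_bit != self.last_toggle:
            self.last_toggle = toggle_bit
            self.last_change_ms = now_ms

    def alarm(self, now_ms):
        # now_ms is assumed to come from a clock that never wraps
        # (the rollover question is discussed later in the thread).
        return now_ms - self.last_change_ms > self.timeout_ms
```

Assuming the sender flips the bit on every 100 ms frame, this check tolerates up to four consecutive lost frames; a fifth lost frame raises the alarm.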

After about 36 h of uptime I start getting alarms that one or the other
module is missing. The alarm then goes away but is triggered again,
about every 1-3 s after the first alarm has been sent. The only way to
solve the problem is to restart the module that causes the alarm, i.e. not
the one that is sending the alarm.

I have used an external CAN logger to check the CAN messages. The
messages appear to be both sent and toggled correctly, which makes me
believe that the TX side is OK.

I also made a test codesys ping-pong application. The messages are
always sent on the CAN network (verified with CANoe), but codesys does
not always receive them.

I think that the problem is in codesys v3, but how can I prove this? Any
suggestions? So far I have only been able to reproduce this error on the
entire test system, not standalone or with other test applications.
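One way to separate codesys from the Linux side would be to run an independent SocketCAN listener on the receiving module, in parallel with the codesys application, applying the same toggle-bit check. Each CAN_RAW socket receives its own copy of incoming frames, so if this listener sees every toggle on time while codesys raises the alarm, that would point at the application side rather than the driver. The sketch assumes a Python 3 interpreter is available on the module; the message ID and the byte/bit position of the toggle bit are hypothetical parameters, since they are not given above.

```python
import socket
import struct
import time

# struct can_frame layout (can_id, can_dlc, 3 pad bytes, data[8]),
# as used in the Python socket documentation.
CAN_FRAME_FMT = "=IB3x8s"
CAN_FRAME_SIZE = struct.calcsize(CAN_FRAME_FMT)   # 16 bytes

# Flag and mask values from <linux/can.h>
CAN_EFF_FLAG = 0x80000000
CAN_RTR_FLAG = 0x40000000
CAN_ERR_FLAG = 0x20000000
CAN_SFF_MASK = 0x000007FF
CAN_EFF_MASK = 0x1FFFFFFF


def parse_frame(frame):
    """Return (can_id, payload) for a data frame, or None for RTR/error frames."""
    can_id, dlc, data = struct.unpack(CAN_FRAME_FMT, frame)
    if can_id & (CAN_RTR_FLAG | CAN_ERR_FLAG):
        return None
    mask = CAN_EFF_MASK if can_id & CAN_EFF_FLAG else CAN_SFF_MASK
    return can_id & mask, data[:dlc]


def toggle_of(payload, byte_index, bit):
    """Extract the toggle bit from its (hypothetical) position in the payload."""
    return (payload[byte_index] >> bit) & 1


def monitor(ifname, msg_id, byte_index=0, bit=0, timeout_s=0.5):
    """Print a line whenever the toggle bit took longer than timeout_s to change."""
    s = socket.socket(socket.AF_CAN, socket.SOCK_RAW, socket.CAN_RAW)
    s.bind((ifname,))
    last_bit = None
    last_change = time.monotonic()
    while True:
        parsed = parse_frame(s.recv(CAN_FRAME_SIZE))
        now = time.monotonic()
        if parsed is None:
            continue
        can_id, payload = parsed
        if can_id != msg_id or len(payload) <= byte_index:
            continue
        value = toggle_of(payload, byte_index, bit)
        if value != last_bit:
            gap = now - last_change
            if last_bit is not None and gap > timeout_s:
                print(f"{time.strftime('%H:%M:%S')} toggle gap {gap * 1000:.0f} ms")
            last_bit, last_change = value, now
```

For example, `monitor("can1", 0x123)` would watch a hypothetical ID 0x123. A gap is only printed once the bit finally changes, so the timestamps can be lined up with the codesys alarm times afterwards.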

Bertil Bäck

P.S.

Another side note: I ran
ip -d -s link show can1

It reports 3702 arbit-lost and also 3702 TX errors.
Is one of these a bug? In my understanding an arbitration loss is not a TX
error. A TX error is something like not getting an ACK bit, or an active
error frame after transmission, i.e. something that causes the frame to be
resent.
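To see which error events are actually behind the TX error counter, a separate tool could subscribe to SocketCAN error frames and classify them. The sketch below decodes the error classes defined in `<linux/can/error.h>`; for lost arbitration, `data[0]` holds the bit position where arbitration was lost. It does not show which classes the SJA1000 driver in 3.0.2 reports, or how it maps them onto the `ip` statistics. `enable_error_frames` and `decode_error_frame` are hypothetical helper names.

```python
import socket
import struct

CAN_FRAME_FMT = "=IB3x8s"   # struct can_frame: can_id, can_dlc, pad, data[8]

# Values from <linux/can.h>, <linux/can/raw.h> and <linux/can/error.h>
CAN_ERR_FLAG = 0x20000000
CAN_ERR_MASK = 0x1FFFFFFF
CAN_RAW_ERR_FILTER = getattr(socket, "CAN_RAW_ERR_FILTER", 2)
SOL_CAN_RAW = getattr(socket, "SOL_CAN_RAW", 101)

ERROR_CLASSES = {
    0x001: "TX_TIMEOUT",
    0x002: "LOSTARB",    # arbitration lost, bit position in data[0]
    0x004: "CRTL",       # controller problems, details in data[1]
    0x008: "PROT",       # protocol violations, details in data[2..3]
    0x010: "TRX",        # transceiver status, details in data[4]
    0x020: "ACK",        # no ACK received on transmission
    0x040: "BUSOFF",
    0x080: "BUSERROR",
    0x100: "RESTARTED",
}


def enable_error_frames(sock, mask=CAN_ERR_MASK):
    """Ask a CAN_RAW socket to deliver error frames for the given classes."""
    sock.setsockopt(SOL_CAN_RAW, CAN_RAW_ERR_FILTER, struct.pack("=I", mask))


def decode_error_frame(frame):
    """Return (class names, lost-arbitration bit or None), or None for non-error frames."""
    can_id, dlc, data = struct.unpack(CAN_FRAME_FMT, frame)
    if not can_id & CAN_ERR_FLAG:
        return None
    classes = [name for bit, name in ERROR_CLASSES.items() if can_id & bit]
    lostarb_bit = data[0] if can_id & 0x002 else None
    return classes, lostarb_bit
```

Usage: open a `CAN_RAW` socket, bind it to `("can1",)`, call `enable_error_frames(s)` and pass each `s.recv(16)` to `decode_error_frame`. Counting LOSTARB frames separately from ACK or BUSERROR frames would show whether the 3702 TX errors are all arbitration losses.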


  After about 36h of uptime I start getting alarms
  then the alarm starts to go away but gets triggered again.
  This happens about every 1-3 sec after the first alarm

Maybe the timer that checks for "500 ms" is based on a counter-timer that 
doesn't handle rollover well.

Perhaps, once it is mangled, it stays broken in some way that re-triggers every 
1-3 seconds.
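A minimal sketch of the kind of rollover bug meant here, assuming a hypothetical free-running 16-bit millisecond counter (matching the 0xFFFF example below; the real timer behind the 500 ms check is unknown). Storing a wrapped deadline and comparing with a plain `>` gives spurious timeouts shortly before the counter wraps, while modular subtraction of the start value stays correct across one wrap.

```python
TIMER_BITS = 16                       # assumed counter width
TIMER_MASK = (1 << TIMER_BITS) - 1    # 0xFFFF


def expired_naive(start, now, timeout):
    # Bug: the deadline wraps like the counter, but the comparison does not.
    deadline = (start + timeout) & TIMER_MASK
    return now > deadline


def expired_wrap_safe(start, now, timeout):
    # Unsigned modular subtraction yields the true elapsed ticks, provided
    # fewer than 2**TIMER_BITS ticks have passed since `start`.
    return ((now - start) & TIMER_MASK) > timeout
```

With `start = 65300` and `now = 65400` only 100 ms have passed, yet the naive check already reports a timeout because the wrapped deadline is 264.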

You could narrow down the cause by logging the last 10-20 messages received 
into a circular array, with the state of the toggle bit and a timestamp.  Maybe 
also save the values that are compared against the timer, or the value of the 
base timer when the check is done.  For example:
...
64500
65000
65500
strange things
...
you might look for something that overflowed 0xFFFF.

That seems more likely than "lost messages" that always start getting lost 
after 129,600 seconds, or 0x3F480 intervals of 500 ms.  A counter that rolls over at 
0x40000?  I think that's 18 bits, a funny number for rollover.
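The arithmetic above can be checked quickly. `rollover_hours` is a hypothetical helper giving the uptime after which an n-bit counter of fixed-length ticks wraps, so other candidate widths and tick lengths can be compared with the observed 36 h as well.

```python
UPTIME_S = 36 * 3600          # "about 36 h" of uptime from the report
TICK_MS = 500                 # one 500 ms check interval per tick

intervals = UPTIME_S * 1000 // TICK_MS


def rollover_hours(bits, tick_ms):
    """Uptime in hours after which an unsigned counter of `bits` bits wraps."""
    return (1 << bits) * tick_ms / 1000 / 3600


print(UPTIME_S, intervals, hex(intervals))    # 129600 259200 0x3f480
print(hex(1 << 18))                           # 0x40000
print(round(rollover_hours(18, TICK_MS), 2))  # 36.41
```

For comparison, `rollover_hours(32, 1)` is about 1193 h (49.7 days) for a 32-bit millisecond counter.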

But make sure that no messages were lost right before each alarm.  With a circular array, 
and a breakpoint or "stop logging" on the alarm, the array would hold the last 
10-20 messages of interest.
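The circular array suggested above might look like this in Python. `RxRecord`, `RxHistory` and `find_gaps` are hypothetical names, and the 100 ms period with 50 ms slack is an assumption based on the send interval in the original report. `collections.deque(maxlen=...)` drops the oldest entry automatically, and logging stops on the first alarm so the buffer holds the messages leading up to it.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class RxRecord:
    timestamp_ms: int   # receive time of the message
    toggle_bit: int     # state of the toggle bit in that message
    timer_value: int    # base-timer value the 500 ms check compared against


class RxHistory:
    """Keep the last `size` received messages; stop logging on the first alarm."""

    def __init__(self, size=20):
        self.records = deque(maxlen=size)   # oldest records fall off automatically
        self.frozen = False

    def log(self, record):
        if not self.frozen:
            self.records.append(record)

    def on_alarm(self):
        # Freeze the buffer so it holds the messages right before the alarm.
        self.frozen = True
        return list(self.records)


def find_gaps(records, period_ms=100, slack_ms=50):
    """Return (previous, current) pairs in a list spaced further apart than expected."""
    return [(a, b) for a, b in zip(records, records[1:])
            if b.timestamp_ms - a.timestamp_ms > period_ms + slack_ms]
```

Calling `find_gaps(history.on_alarm())` lists any missing messages right before the alarm; an empty result would favour the timer theory over lost messages.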


Rick Corey  | Software Engineer | Crane Aerospace & Electronics

Thanks for the input Rick.

I forgot to mention that I was able to reproduce the alarm behavior by increasing the busload from the normal 10% to 70% at 125 kbps. This got me thinking that the problem could be CPU- or memory-dependent. We made a test application that does the same toggling as our control system and logged memory and CPU usage, but we did not find any problems.
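To reproduce the 70% load in a rest-bus simulation it helps to know roughly how many frames per second that means. The sketch below assumes classic CAN data frames and counts the nominal bits including the 3-bit interframe space (111 bits for an 8-byte frame with an 11-bit ID); stuff bits add up to about 20% more in the worst case. The helper names are hypothetical, and the actual frame mix on the bus is not known from the report.

```python
def frame_bits(data_bytes=8, extended=False, worst_case_stuffing=False):
    """Bits on the wire for one classic CAN data frame, incl. 3-bit interframe space."""
    # Bits subject to bit stuffing: SOF, arbitration, control, data and CRC fields.
    stuffed = (54 if extended else 34) + 8 * data_bytes
    # Unstuffed tail: CRC delimiter, ACK slot, ACK delimiter, 7-bit EOF, 3-bit IFS.
    tail = 1 + 1 + 1 + 7 + 3
    stuff_bits = (stuffed - 1) // 4 if worst_case_stuffing else 0
    return stuffed + stuff_bits + tail


def frames_per_second(load, bitrate, bits_per_frame):
    """Frame rate that produces the given bus load (0.0 .. 1.0)."""
    return load * bitrate / bits_per_frame


for load in (0.10, 0.70):
    nominal = frames_per_second(load, 125_000, frame_bits(8))
    worst = frames_per_second(load, 125_000, frame_bits(8, worst_case_stuffing=True))
    print(f"{load:.0%} at 125 kbit/s: {worst:.0f}-{nominal:.0f} frames/s")
```

Under these assumptions, going from 10% to 70% means going from roughly 100 to roughly 650-790 8-byte frames per second on the bus.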

It seems there might be something else in the control application that is causing the alarm message to be generated, so I need to make it possible to run the complete control system here at the office,
i.e. with a rest-bus simulation of the entire network.

Side note:
Does anybody know how I can look at the commit messages between Linux kernel 3.0.2 and 3.0.4 while kernel.org is down? Do I need to do a git checkout?


Bertil BÄCK R&D Manager Hardware
T +358 6 357 6305, M +358 50 588 6895, F +358 6 357 6320
[email protected], www.tke.fi - www.canopen.fi
_______________________________________________
Socketcan-users mailing list
[email protected]
https://lists.berlios.de/mailman/listinfo/socketcan-users
