During the past eight weeks, when I had the time, I tried following
Joe's and Phil's advice, but never solved the problem.  Since I did
make a serious effort, and have a current workaround, I figure it
could be useful to tell the story.

Immediately, I modified my application so that every ReceiveMsg.receive
gave back the same address it was passed.  No change in symptoms
whatsoever.  (I should mention at this point that the application can run
over twenty minutes before the error occurs, though sometimes it
happens within a couple of minutes; this was the first hint of something
like a Heisenbug.)
Then I started digging deeper.  I instrumented CC2420RadioM with
a debugging interface and added periodic broadcasts of debug data
to my application.  I could watch the pointer rxbufptr in CC2420RadioM
to see whether it correlated with my application getting stuck.  Lucky me:
rxbufptr got corrupted at about the same time the application
went deaf.  One would think that debugging would be easy from this
point, but my experience was otherwise.

I carefully desk-checked my code, straining to find some way it could
be returning a different pointer value;  no luck.  I added conditional
checks to CC2420RadioM to detect any new pointer being returned
from ReceiveMsg.receive. Negatory on that front.
So I started studying CC2420RadioM to find any other place where
rxbufptr is written.  There is one other spot (besides
SplitControl.init(), of course).  I added some conditional code there to
detect a new value being written.  The conditional code would write
a small history to a new variable in CC2420RadioM, which was
exfiltrated by my debugging harness.  My experiments showed that
rxbufptr was getting corrupted outside of either assignment in
CC2420RadioM (well, I can't be absolutely certain of that, modulo my
understanding of atomicity, optimizing compilation, and other opaque
refinements to the execution model).
The sickening conclusion?  Some code elsewhere was indexing out of
bounds, overwriting rxbufptr with junk.  And it _did_ look like junk:
no real pattern I could see, varying from test run to test run.  It didn't
even look like a pointer.  At this point, I tried what any experienced
debugger tries: I built a fence around rxbufptr and reported the values
of the fence in my periodic debug messages.  The fence never got hit!

This was strange: why would some stray bug hit only rxbufptr?  So I
tried moving rxbufptr and its fence around, duplicating rxbufptr,
changing all uses of the idiom "rxbufptr->" to something else,
and quite a few other strange and desperate hacks.  Yet that bug
outsmarted me every time, seemingly omniscient about the location
of rxbufptr (recall that each test cost about half an hour).
Sometimes I would work in the other direction, removing send and
receive operations module by module from my application.  I discovered
that, at a certain point, the problem would no longer arise.  However, this
was not particularly helpful, since running everything just up to the
actual send() could execute without triggering deafness in the
application.
Now, in a crazy moment, I conjectured that maybe the problem had
to do with the frequency of send() operations, or something along that
line of thinking.  So I tried the following hack: I modified
SPAdaptorGenericCommM so that it would fail any send() operation
until the previous send() had completed.  My application tolerates
send() failures by retrying after a quarter second.  I know this defeats
the new message pool stuff that everyone is raving about
(http://blog.moteiv.com/archives/2006/08/effective_link.php), but
I thought, hey, it's worth a try.
With this hack in place, the problem has not resurfaced.  Unfortunately,
I still don't know where that bug is hiding, but if this technique can
be of value to others, I have humbly related my experience.



Philip Levis wrote:

On Jun 21, 2006, at 4:23 PM, Ted Herman wrote:

Dear Radio Stack Experts (using moteiv distro, GenericComm interface):

I've got a program that enters a curious behavior. Initially, for a few
minutes, it sends and receives messages.  Sending is driven by Timer
events.  Then it stops receiving messages, though it continues to send
messages (the sending is done out of a posted task).   I watch the
messages via sf/TOSBase.

Why do I think it stops receiving messages?  The first thing in the
ReceiveMsg event is a call to Leds.greenToggle().  During the first
few minutes, this blinks as one would expect, but then stops blinking
(other behavior such as the sent message contents confirms the
hypothesis that the mote is not receiving, though messages are in
the air to be received).


What do you mean "the message contents?" It's always possible that down in the data-link layer below GenericComm it's being received, but is then being filtered out due to address, AM type, etc.

Because messages continue to be sent, I reckon the task queue
can't be such that Led blinking would be inhibited (I welcome being
corrected on this point, since I'm a T2 newbie).  In fact, I added
a Leds.redToggle() to the sending task, and I do see the red
blink once a minute, as expected, even as the receiving is deaf.


Your analysis is correct. If you're able to send (specifically, timers fire and you get sendDone events), then the task queue is OK.

What would be the efficient way to diagnose this kind of situation?


Joe's suggestion (look at your receive pointers) is a good place to start. One easy way to test if you're munging receive pointers is to just put a

return msg;

at the top of the function, right after you toggle the LED.

If you do that and the failures continue, then something is happening below. You can try instrumenting the AM implementation to see if it's receiving packets but then filtering them. If AM isn't receiving packets, then you can test to see if the receiving node is entering receive states. Basically, as you diagnose the problem deeper and deeper in the stack, you have to deal with trickier code. I haven't used moteiv's distribution for mica-family nodes, but it's my general experience that these kinds of problems tend to occur at the AM level for one reason or another.

Phil


_______________________________________________
Tinyos-help mailing list
[email protected]
https://mail.millennium.berkeley.edu/cgi-bin/mailman/listinfo/tinyos-help
