I have come across a problem debugging the WMR300 driver and I have only a 
vague idea what is going on.

The developer's notes imply there is "no complex multithreading", although 
the strategies say multiple threads are used, so I assume the problems are 
somehow related to the way thread scheduling is done.
I have no experience with how Python implements this.

*Background*:
The Oregon WMR300 communicates over USB using an unpublished protocol, so 
what we know is by reverse engineering.
Once triggered by a command, the WMR300 continually spits out data at an 
average rate of about 5 packets per second.
Provided that it receives a heartbeat packet at least once every 60 seconds 
or so, it keeps transmitting.
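
To make the timing constraint concrete, here is a minimal sketch of a packet loop that sends the heartbeat from inside the same generator that yields packets. This is not the actual driver code: `gen_packets`, `read_packet`, `send_heartbeat` and `HEARTBEAT_INTERVAL` are hypothetical names, and the 20 second interval is an assumption chosen only to stay well inside the observed 60 seconds.

```python
import time

HEARTBEAT_TIMEOUT = 60.0   # observed: station stops after about 60 s without a heartbeat
HEARTBEAT_INTERVAL = 20.0  # assumption: send well inside the timeout


def gen_packets(read_packet, send_heartbeat, clock=time.time):
    """Yield packets forever, sending a heartbeat whenever one is due.

    read_packet and send_heartbeat are hypothetical callables standing in
    for the real USB I/O. clock is injectable so the logic can be tested
    without waiting.
    """
    last_heartbeat = clock()
    while True:
        now = clock()
        if now - last_heartbeat >= HEARTBEAT_INTERVAL:
            send_heartbeat()
            last_heartbeat = now
        packet = read_packet()
        if packet is not None:
            # The generator is suspended here until the caller asks for
            # the next packet, so no heartbeat can be sent in the meantime.
            yield packet
```

If the driver is structured like this, the heartbeat is only as punctual as the code consuming the packets: a consumer that stalls for longer than `HEARTBEAT_TIMEOUT` between packets would be enough to make the station stop.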

*The main problem*:
Every so often (maybe once every one to four weeks) some users experience 
what looks like a system hang - data stops updating.
I can replicate this on my system, but only under circumstances that 
probably do not relate to the reporting users' hardware.
I suspect the same underlying cause but with a different triggering event, 
but I need to understand the threading better to see how I can diagnose 
this.
Bear in mind that I think I have a work-around, so the problem is not 
urgent. I would mainly like to understand if the work-around is the only 
solution.

My weewx system is a baby Intel Xeon system running CentOS 6 with software 
RAID, with data logged to a MySQL database on the same machine.
I get the hang when the system "raid-check" runs. It runs once a week and 
takes about 3 hours. Weewx might survive a few weeks of this, but 
eventually hangs during the scan. Logs report load average is 5 for this 
duration which is presumably dominated by processes waiting on the IO 
queue. This load remains the same, no matter what priority I assign in the 
raid-check config file.

Eventually, during this heavy load period, the WMR300 fails to receive a 
heartbeat in the required interval and just stops transmitting. 

   - Diagnostic checks have revealed that *GenLoopPackets* has not been 
   executed for a time interval of *up to one minute*. 
   - Time checks around the *USB read* call and the *yield* show that the 
   delay is not there.
   - There are debugging syslog lines, but nothing was logged in the 
   preceding 2 minutes.
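
For reference, the kind of time check described above can be sketched as a wrapper around the packet generator. This is an illustration rather than the exact instrumentation I used: `timed`, `threshold` and `report` are hypothetical names, and `report` could be any callable, for example one that writes to syslog.

```python
import time


def timed(gen, threshold, report, clock=time.time):
    """Wrap a packet generator and report two kinds of delay.

    'produce' is time spent inside the wrapped generator (e.g. the USB read).
    'consume' is time the caller took before asking for the next packet.
    Only delays longer than threshold seconds are reported.
    """
    it = iter(gen)
    while True:
        t0 = clock()
        try:
            packet = next(it)
        except StopIteration:
            return
        t1 = clock()
        if t1 - t0 > threshold:
            report('produce', t1 - t0)
        yield packet
        # Execution resumes here only when the caller requests another packet.
        t2 = clock()
        if t2 - t1 > threshold:
            report('consume', t2 - t1)
```

Separating the two measurements shows whether a gap comes from the driver side or from whatever is consuming the packets.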

There seems to be no reason this loop thread would be blocked, but it 
appears to be.  What I am left with is an assumption that this thread is 
not scheduled because some other thread, such as a report generator, is 
blocked in the IO queue.  Is this how it works?
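
One way I could test this assumption is to take a snapshot of what every Python thread is doing at the moment a long gap is detected. The sketch below uses only the standard library (`sys._current_frames`, `threading.enumerate`, `traceback.format_stack`); `dump_all_threads` is a hypothetical helper name, and where to call it from (for example a small watchdog thread) is left open.

```python
import sys
import threading
import traceback


def dump_all_threads():
    """Return a text snapshot of the current stack of every Python thread."""
    names = dict((t.ident, t.name) for t in threading.enumerate())
    lines = []
    for ident, frame in sys._current_frames().items():
        lines.append("Thread %s (%s):" % (names.get(ident, "?"), ident))
        # format_stack gives one entry per frame, innermost call last
        lines.extend(entry.rstrip() for entry in traceback.format_stack(frame))
    return "\n".join(lines)
```

Writing the returned text to syslog when the loop has been silent for, say, 30 seconds should show whether the loop thread is waiting inside a call of its own or whether some other thread is holding things up.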

