I have come across a problem debugging the WMR300 driver and I have only a vague idea what is going on.
The developer's notes imply there is "no complex multithreading", although the strategies say multiple threads are used, so I assume the problems are somehow related to the way thread scheduling is done. I have no experience with how Python implements this.

*Background*: The Oregon WMR300 communicates over USB using an unpublished protocol, so what we know comes from reverse engineering. Once triggered by a command, the WMR300 continually spits out data at an average rate of about 5 packets per second. Provided that it receives a heartbeat packet about every 60 seconds, it keeps transmitting.

*The main problem*: Every so often (maybe once every one to four weeks) some users experience what looks like a system hang: data stops updating. I can replicate this on my system, but only under circumstances that probably do not apply to the reporting users' hardware. I suspect the same underlying cause with a different triggering event, but I need to understand the threading better to see how I can diagnose this. Bear in mind that I think I have a work-around, so the problem is not urgent. I would mainly like to understand whether the work-around is the only solution.

My weewx system is a baby Intel Xeon system running CentOS 6 with software RAID, with data logged to a MySQL database on the same machine. I get the hang when the system "raid-check" runs. It runs once a week and takes about 3 hours. Weewx might survive a few weeks of this, but eventually it hangs during the scan. Logs report a load average of 5 for this duration, which is presumably dominated by processes waiting on the IO queue. This load remains the same no matter what priority I assign in the raid-check config file. Eventually, during this heavy load period, the WMR300 fails to receive a heartbeat in the required interval and just stops transmitting.

- Diagnostic checks have revealed that *GenLoopPackets* has not been executed for an interval of *up to one minute*.
- Time checks around the *USB read* call and the *yield* show that the delay is not there.
- There are debugging syslog lines, but nothing was logged in the preceding 2 minutes.

There seems to be no reason this loop thread would be blocked, but it appears to be. What I am left with is an assumption that this thread is not scheduled because some other thread, such as a report generator, is blocked in the IO queue. Is this how it works?
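
For reference, the kind of time check described above can be sketched roughly as follows. This is only an illustration, not the driver's code: `timed_packets` and its `clock` parameter are names made up here, and it assumes nothing beyond the loop being an ordinary Python generator. It separates the time spent inside the generator (the USB read, up to the yield) from the time spent outside it, in whatever consumes the packet before asking for the next one.

```python
import time


def timed_packets(source, clock=time.time):
    """Wrap a packet generator and report where the wall-clock time goes.

    Hypothetical helper, not part of weewx or the WMR300 driver.

    Yields (packet, produce_secs, consume_secs):
      produce_secs - time spent inside `source` producing this packet
      consume_secs - time the caller held the previous packet before
                     asking for this one (0.0 for the first packet)
    """
    iterator = iter(source)
    resumed_at = clock()
    consume_secs = 0.0
    while True:
        try:
            packet = next(iterator)
        except StopIteration:
            return
        yielded_at = clock()
        produce_secs = yielded_at - resumed_at
        # The generator is suspended here until the caller asks for more.
        yield packet, produce_secs, consume_secs
        resumed_at = clock()
        consume_secs = resumed_at - yielded_at


# Possible use (the 10 second threshold is an arbitrary assumption):
#
#   for packet, produce_secs, consume_secs in timed_packets(loop_generator):
#       if produce_secs > 10 or consume_secs > 10:
#           log the two values along with time.time()
#       handle(packet)
```

If `produce_secs` stays small while `consume_secs` occasionally grows towards a minute, the stall is outside the generator, in the code that handles each packet, rather than in the read loop itself.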
