I've been running for about 30 minutes with the new change and I've had four "retries exhausted" errors, all on "get frequencies". Never on "get radio". That's just with WSJT-X and HRD Logbook running (not DM780). FYI...I did finally get a timeout with my simple patch, but it took quite a while before it occurred (an hour or more).
I then put in a 50 ms sleep after "get radio" and still got one "retries exhausted" on "get frequencies", but also saw one on "get dropdowns" about 9 minutes after the first timeout, then got another on "get radio" a few minutes after that. So it behaves differently with the sleep in this rather limited test. I do see delays of about 7 seconds between commands, so I do believe the retries are being attempted and succeeding. And when I shut down HRD Logbook the trace log behaves quite normally, with 2 seconds between command sets (I have it set on 2 second polling right now).

I can't imagine why another TCP socket to the same port on HRD would cause this. It doesn't seem to affect HRD Logger at all (but I don't have a trace log on that either to be sure about it).

I'm going to let this run overnight without HRD Logbook running, since once it times out it needs attention. That will at least test the stability of running by itself. And I do need to do a packet capture so I can see that the traffic is actually being sent but not received again.

FYI...I did notice that once I click "retry" it takes quite a while (30 seconds or so) before the main window shows the radio online when HRD Logbook is running. Without HRD Logbook running it takes about 4 seconds. So startup is affected quite a bit.

Mike W9MDB

-----Original Message-----
From: Bill Somerville [mailto:g4...@classdesign.com]
Sent: Sunday, September 07, 2014 6:44 PM
To: wsjt-devel@lists.sourceforge.net
Subject: Re: [wsjt-devel] HRD tcp bug

On 07/09/2014 22:22, Michael Black wrote:

Hi Mike,

> I'm going to call this a possible fix....I've only been running it for
> a few minutes.
>
> But...before this every time I started WSJT-X with HRD Logger already
> running it would timeout immediately. And anytime I started WSJT-X
> first and then HRD Logger WSJT-X would start timing out immediately.
> I tried both of these now and there are no problems now.
>
> Just for reference I am running Windows 7 32-bit.
>
> As of now I've been running for about 20 minutes with no timeouts at
> all which is a LOT longer than ever before with both of these running.
> I've got trace turned on and I can see the frequency calls flying by
> at 1 second intervals with no delays occurring.
> All I did was move the waitForBytesWritten outside the if block. Just
> a hunch on my part but for some odd/unknown reason it works. Perhaps
> the hrd_->state was interfering somehow?

Interesting results. I think I am close to fixing this now. My analysis says that the only time that change would make a difference is when bytes_sent < bytes_to_send, since the || will short-circuit and the old version would have skipped the QTcpSocket::waitForBytesWritten() and socket state check. The code is naive in assuming that all writes succeed in one go, which it would seem is not so. That means an exception is thrown, since bytes_sent < bytes_to_send should always cause that. Not all cases of that exception propagate out as a CAT failure, so I think what was happening is that one of the commands being sent is not able to be written in one QTcpSocket::write() call and I am resuming from the exception without ever calling QTcpSocket::waitForBytesWritten() or reading any actual reply. This may well explain why I can't reproduce the issue, as my tests may well always successfully complete every command in one write.

So I need to look at putting the QTcpSocket::write() calls in a while loop until all data has been successfully sent. It may be that after that the total bytes_sent will always reach bytes_to_send. If not we have an error anyway. I also need to think about why I ignore exceptions sometimes. I probably found I had to when testing the original implementation but now I think it's a bad idea because of the socket being used after an error.

I see your update and I think I see how your change could improve matters without fixing the root problem. Let me see if I can come up with a new version that behaves better.
OK I have implemented a loop to make sure all the outgoing bytes are always sent. I have also changed the part of the code that can ignore an exception from send_command so that the exception always gets propagated right out to cause a failure and subsequent socket closure.

Can you update, build and test again please.

>
> Mike W9MDB
>
> --- HRDTransceiver.cpp (revision 4261)
> +++ HRDTransceiver.cpp (working copy)
> @@ -723,9 +723,9 @@
>        bytes_to_send = message->size_;
>        bytes_sent = hrd_->write (reinterpret_cast<char *> (message.data ()), bytes_to_send);
>      }
> -
> +  bool writeTimeout = hrd_->waitForBytesWritten (socket_wait_time);
>    if (bytes_sent < bytes_to_send
> -      || !hrd_->waitForBytesWritten (socket_wait_time)
> +      || !writeTimeout
>        || QTcpSocket::ConnectedState != hrd_->state ())
>      {
>  #if WSJT_TRACE_CAT

73
Bill
G4WJS.

_______________________________________________
wsjt-devel mailing list
wsjt-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/wsjt-devel