Re: [wsjt-devel] HRD tcp bug

Michael Black Sun, 07 Sep 2014 22:24:00 -0700

I've been running for about 30 minutes with the new change and I've had four
"retries exhausted" all on "get frequencies".  Never on "get radio".
That's just with WSJT-X and HRD Logbook running (not DM780).
FYI...I did finally get a timeout with my simple patch but it took quite a
while before it occurred (an hour or more).

I then put in a 50ms sleep after "get radio" and still got one retries
exhausted on "get frequencies" but also saw one on "get dropdowns" about 9
minutes after the first timeout...then got another on "get radio" a few
minutes after that so it behaves differently with the sleep in this rather
limited test.

I do see delays of about 7 seconds between commands so I do believe the
retries are being attempted and succeeding.  And when I shut down HRD
Logbook the trace log behaves quite normally with 2 seconds between command
sets (I have it set on 2 second polling right now).  I can't imagine why
another TCP socket to the same port on HRD would cause this....it doesn't
seem to affect HRD Logger at all (but I don't have a trace log on that either
to be sure about that).

I'm going to let this run overnight without HRD Logbook running since once
it times out it needs attention.  That will at least test the stability of
running by itself.
And I do need to do a packet capture so I can see that the traffic is
actually being sent but not received again.

FYI...I did notice that once I click "retry" it takes quite a while (30
seconds or so) before the main window shows the radio online when HRD
Logbook is running.  Without HRD Logbook running it takes about 4 seconds.
So startup is affected quite a bit.

Mike W9MDB

-----Original Message-----
From: Bill Somerville [mailto:[email protected]] 
Sent: Sunday, September 07, 2014 6:44 PM
To: [email protected]
Subject: Re: [wsjt-devel] HRD tcp bug

On 07/09/2014 22:22, Michael Black wrote:
Hi Mike,
> I'm going to call this a possible fix....I've only been running it for 
> a few minutes.
>
> But...before this every time I started WSJT-X with HRD Logger already 
> running it would timeout immediately.  And anytime I started WSJT-X 
> first and then HRD Logger WSJT-X would start timing out immediately.  
> I tried both of these now and there are no problems now.
>
> Just for reference I am running Windows 7 32-bit.
>
> As of now I've been running for about 20 minutes with no timeouts at 
> all which is a LOT longer than ever before with both of these running.  
> I've got trace turned on and I can see the frequency calls flying by 
> at 1 second intervals with no delays occurring.
> All I did was move the waitForBytesWritten outside the if block.  Just 
> a hunch on my part but for some odd/unknown reason it works.  Perhaps 
> the hrd_->state was interfering somehow?
Interesting results. I think I am close to fixing this now.

My analysis says that the only time that change would make a difference is
when bytes_sent < bytes_to_send, since the || will short-circuit and the old
version would have skipped the QTcpSocket::waitForBytesWritten() call and the
socket state check. The code naively assumes that all writes succeed in one
go, which it would seem is not so.
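
To illustrate the short-circuit point, here is a minimal sketch, not the actual HRDTransceiver code. FakeSocket, old_check_failed and patched_check_failed are hypothetical names made up for this example; the fake only counts how often the wait is invoked.

```cpp
#include <cstddef>

// Hypothetical stand-in for the socket (an assumption, not QTcpSocket):
// it only records how many times the wait was called.
struct FakeSocket
{
  int wait_calls = 0;
  bool waitForBytesWritten () { ++wait_calls; return true; }
};

// Old shape: the wait is a later operand of ||, so when the write came up
// short the first operand is already true and the wait is never evaluated.
bool old_check_failed (FakeSocket& s, std::size_t bytes_sent, std::size_t bytes_to_send)
{
  return bytes_sent < bytes_to_send || !s.waitForBytesWritten ();
}

// Patched shape: the wait is hoisted out of the condition, so it always runs,
// whether or not the write was short.
bool patched_check_failed (FakeSocket& s, std::size_t bytes_sent, std::size_t bytes_to_send)
{
  bool const written = s.waitForBytesWritten ();
  return bytes_sent < bytes_to_send || !written;
}
```

Both versions still report a failure on a short write; the only behavioural difference is whether the wait happens first.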

That means an exception is thrown, since bytes_sent < bytes_to_send should
always cause that. Not all cases of that exception propagate out as a CAT
failure, so I think what was happening is that one of the commands being sent
could not be written in one QTcpSocket::write() call and I was resuming
from the exception without ever calling
QTcpSocket::waitForBytesWritten() or reading any actual reply.

This may well explain why I can't reproduce the issue as my tests may well
always successfully complete every command in one write.

So I need to look at putting the QTcpSocket::write() calls in a while loop
until all data has been successfully sent. It may be that after that the
total bytes_sent will always reach bytes_to_send. If not we have an error
anyway.
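
A rough sketch of such a loop, under stated assumptions: Writer and write_all are hypothetical names, and the writer callback is assumed to follow the same contract as QIODevice::write() (returns the number of bytes accepted, or -1 on error). This is not the code that was committed.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <string>

// Hypothetical writer callback: returns how many bytes it accepted
// (possibly fewer than offered) or -1 on error.
using Writer = std::function<long long (char const *, long long)>;

// Keep calling the writer with the unsent tail until everything has gone out.
void write_all (Writer const& write, std::string const& message)
{
  long long const size = static_cast<long long> (message.size ());
  long long total = 0;
  while (total < size)
    {
      long long const n = write (message.data () + total, size - total);
      // Treat an error (-1) or no progress (0) as failure so the loop cannot
      // spin forever; real socket code might wait and retry on 0 instead.
      if (n <= 0) throw std::runtime_error {"write failed"};
      assert (n <= size - total); // the writer must not claim more than offered
      total += n;
    }
}
```

With a real socket the loop would typically be followed by a single wait for the bytes to be flushed before reading the reply.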

I also need to think about why I ignore exceptions sometimes. I probably
found I had to when testing the original implementation, but now I think
it's a bad idea because of the socket being used after an error.

I see your update and I think I see how your change could improve matters
without fixing the root problem.

Let me see if I can come up with a new version that behaves better.

OK I have implemented a loop to make sure all the outgoing bytes are always
sent. I have also changed the part of the code that can ignore an exception
from send_command so that the exception always gets propagated right out to
cause a failure and subsequent socket closure.
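
The difference between swallowing and propagating the exception can be sketched as follows. FakeConnection, poll_ignoring_errors and poll_propagating are hypothetical names for illustration only; the real send_command has a different signature and this is not the committed change.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical connection (an assumption for this sketch): send_command
// throws when fail_next is set, mimicking a short write.
struct FakeConnection
{
  bool open = true;
  bool fail_next = false;

  std::string send_command (std::string const& cmd)
  {
    if (fail_next) throw std::runtime_error {"short write: " + cmd};
    return "ok";
  }

  void close () { open = false; }
};

// Swallowing the exception: the caller carries on with a socket whose
// state is unknown, and the next command may time out for no visible reason.
std::string poll_ignoring_errors (FakeConnection& c)
{
  try { return c.send_command ("get frequencies"); }
  catch (std::runtime_error const&) { return {}; }
}

// Propagating the exception: close the socket and rethrow so the failure
// surfaces to the caller.
std::string poll_propagating (FakeConnection& c)
{
  try { return c.send_command ("get frequencies"); }
  catch (...) { c.close (); throw; }
}
```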

Can you update, build and test again please?
>
> Mike W9MDB
>
> --- HRDTransceiver.cpp  (revision 4261)
> +++ HRDTransceiver.cpp  (working copy)
> @@ -723,9 +723,9 @@
>         bytes_to_send = message->size_;
>         bytes_sent = hrd_->write (reinterpret_cast<char *> (message.data ()), bytes_to_send);
>       }
> -
> +  bool writeTimeout = hrd_->waitForBytesWritten (socket_wait_time);
>     if (bytes_sent < bytes_to_send
> -      || !hrd_->waitForBytesWritten (socket_wait_time)
> +      || !writeTimeout
>         || QTcpSocket::ConnectedState != hrd_->state ())
>       {
>   #if WSJT_TRACE_CAT

73
Bill
G4WJS.

_______________________________________________
wsjt-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/wsjt-devel
