On 11.09.2012 03:06, Alex Rousskov wrote:
On 09/09/2012 02:30 AM, Amos Jeffries wrote:
IIRC the race is made worse by our internal code handling the timeout event
(async) by scheduling a call (async), so doubling the async queue delay.
Which shows up worst on heavily loaded proxies.
True. Unfortunately, the alternative is to deal with ConnOpener's sync
call reentering the ConnOpener object. We all know what that eventually
leads to. I would rather have double async calls (and optimize the
general queue handling) than deal with long chains of reentrant calls,
especially when it comes to dealing with an exceptional event such as a
timeout.

I am sure this can be optimized further by writing reentrant code in
performance-critical places, but we have much bigger fish to fry at the
moment.
The patch would be this:
=== modified file 'src/comm/ConnOpener.cc'
--- src/comm/ConnOpener.cc 2012-08-28 19:12:13 +0000
+++ src/comm/ConnOpener.cc 2012-09-09 08:20:09 +0000
@@ -139,6 +139,10 @@
         // it never reached fully open, so cleanup the FD handlers
         // Note that comm_close() sequence does not happen for partially open FD
         Comm::SetSelect(temporaryFd_, COMM_SELECT_WRITE, NULL, NULL, 0);
+        if (calls_.earlyAbort_ != NULL) {
+            calls_.earlyAbort_->cancel("Comm::ConnOpener::doneConnecting");
+            calls_.earlyAbort_ = NULL;

The above line is extra because we do that immediately below:

+        }
         calls_.earlyAbort_ = NULL;
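To illustrate why the patch calls cancel() before clearing the pointer: a toy model (not Squid's actual AsyncCall API; all names here are hypothetical) showing that merely dropping our reference to a queued call does not stop it, because the dispatch queue still holds its own reference. Only an explicit cancel() prevents delivery.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for a queued async call: the dispatcher keeps
// its own reference, so the call can still fire after the owner forgets
// it, unless it was explicitly cancelled first.
struct QueuedCall {
    bool cancelled = false;
    bool fired = false;
    void cancel(const char *reason) { (void)reason; cancelled = true; }
    void deliver() { if (!cancelled) fired = true; }  // dispatcher side
};

// Mimics the patched cleanup: cancel first, then forget the reference.
void cleanupEarlyAbort(std::shared_ptr<QueuedCall> &slot)
{
    if (slot != nullptr) {
        slot->cancel("Comm::ConnOpener::doneConnecting");
        slot = nullptr;  // the same assignment also happens just below in the real code
    }
}
```

Clearing `slot` alone would leave the dispatcher free to deliver the stale call later; the cancel() is what makes the queued copy a no-op.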
I think from all this discussion the ConnOpener::DelayedConnectRetry
handler needs to be investigated as well.
Agreed. Why is that handler reusing the same failed socket? Should we
not close the failed socket descriptor upon detecting a failure and then
open a new one just before trying again?
Hmm. Not sure. This is a straight copy of the older code.

If you want a new socket FD most of what start() does needs to be
re-done for it. Except that the start timer needs to be left unchanged,

Exactly.

and the async calls need to be switched to the new FD (would that be a
reassignment, or a cancel + replace?)

Cancelled before replacing on the new FD.
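The "cancel + replace" step above can be sketched with a toy model (not Squid's AsyncCall API; `PendingCall` and `replaceCall` are illustrative names): before installing handlers against the new FD, cancel whatever call is still registered against the old one, then store a fresh call in its place.

```cpp
#include <memory>
#include <string>

// Hypothetical stand-in for an async call slot: a cancelled call must
// never fire, even if a dispatch queue still holds a reference to it.
struct PendingCall {
    std::string name;
    bool cancelled = false;
    void cancel(const std::string &reason) { (void)reason; cancelled = true; }
};

// Cancel the call registered for the old FD (if any) and install a
// replacement for the new FD. Returns the replacement for convenience.
std::shared_ptr<PendingCall>
replaceCall(std::shared_ptr<PendingCall> &slot, const std::string &newName)
{
    if (slot)
        slot->cancel("switching to a new FD");
    slot = std::make_shared<PendingCall>();
    slot->name = newName;
    return slot;
}
```

The point of cancelling rather than just reassigning is that the old call may already be queued for delivery against the dead FD; reassignment alone would let it run.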
Ideally, when we fix this, we should avoid attaching any callbacks that
may outlive a temporary FD to that temporary FD. For example, the
timeout callback should be converted into an event, which is not a
trivial change. AFAICT we do not have any other such callbacks (the
earlyAbort one is specific to the temporary FD and can stay), but I have
not checked closely.
Agreed. Using eventAdd() for the timeout is a great idea IMO.
Unfortunately eventAdd() does not accept AsyncCalls (yet) and would
require another wrapper doing an async re-schedule. On the plus side,
that means both sides of the race face the same lag handicap.
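The "same lag handicap" point can be made concrete with a toy queue model (not Squid code; all names are illustrative): when the timeout event merely schedules the real async call, the handler runs only on the second drain cycle, one hop for the event and one for the call it schedules. An eventAdd()-based retry path would pay the same two hops, so neither side of the race gets a head start.

```cpp
#include <functional>
#include <queue>
#include <utility>

// Minimal single-threaded callback queue standing in for the event and
// async-call machinery discussed above.
struct ToyQueue {
    std::queue<std::function<void()>> q;
    void schedule(std::function<void()> f) { q.push(std::move(f)); }
    void runOne() {                 // drain exactly one "cycle"
        if (q.empty())
            return;
        auto f = std::move(q.front());
        q.pop();
        f();
    }
};

// Count the drain cycles until the real handler runs when the timeout
// event (hop 1) only schedules the actual async call (hop 2).
int hopsUntilHandled(ToyQueue &q)
{
    int handledAt = -1;
    int cycle = 0;
    bool done = false;
    q.schedule([&]() {              // hop 1: the timeout event fires ...
        q.schedule([&]() {          // hop 2: ... and schedules the async call
            handledAt = cycle;
            done = true;
        });
    });
    while (!done) {
        ++cycle;
        q.runOne();
    }
    return handledAt;
}
```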
Here is the sketch for connect retries on failures, as I see it:
default:
    ++failRetries_; // TODO: rename to failedTries

    // we failed and have to start from scratch
    ... close the temporary socket ...

    // if we exhausted all retries, bail with an error
    if (failRetries_ > Config.connect_retries) {
        ... send COMM_ERR_CONNECT to the caller ...
        return;
    }

    // we can still retry after a small pause
    // we might timeout before that pause ends, but that is OK
    eventAdd("Comm::ConnOpener::DelayedConnectRetry", ...)
More changes would be needed to make this work correctly:

* the timeout handler should not be associated with the socket
  descriptor (since it may change). We can use eventAdd instead.

* the Comm::ConnOpener::DelayedConnectRetry() wrapper should call a
  method that will open a new socket.
Sounds okay to me.
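The retry flow sketched above can be modelled end to end with a toy class (not Squid code; `RetryModel` and its members are hypothetical names, with `maxRetries` standing in for Config.connect_retries): each failed attempt closes its socket, and each retry opens a brand-new one, so no handler ever needs to outlive the FD it was registered against.

```cpp
// Toy model of the connect-retry sketch: open a fresh socket per
// attempt, close it on failure, give up once retries are exhausted.
struct RetryModel {
    int failedTries = 0;
    int maxRetries;          // stands in for Config.connect_retries
    int socketsOpened = 0;
    int socketsClosed = 0;
    explicit RetryModel(int max) : maxRetries(max) {}

    // Simulate attempts that fail `failuresBeforeSuccess` times.
    // Returns true on success, false once retries run out (the real
    // code would send COMM_ERR_CONNECT to the caller at that point).
    bool run(int failuresBeforeSuccess) {
        for (;;) {
            ++socketsOpened;                    // a fresh FD per attempt
            if (socketsOpened > failuresBeforeSuccess)
                return true;                    // connected
            ++socketsClosed;                    // failed: start from scratch
            if (++failedTries > maxRetries)
                return false;                   // retries exhausted
            // (the real code would eventAdd() a delayed retry here)
        }
    }
};
```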
I'm not strongly viewed on this, but is it better to display a hard
ERR_CONNECT or a soft(er) TIMEOUT in these cases? That is, is it
actually a good idea to remove the timeout if-case?
Amos