On 11.09.2012 03:06, Alex Rousskov wrote:
On 09/09/2012 02:30 AM, Amos Jeffries wrote:
IIRC the race is made worse by our internal code handling the timeout event
(async) by scheduling a call (async), so doubling the async queue delay.
Which shows up worst on heavily loaded proxies.
True. Unfortunately, the alternative is to deal with ConnOpener's sync
call reentering the ConnOpener object. We all know what that eventually
leads to. I would rather have double async calls (and optimize the
general queue handling) than deal with long chains of reentrant calls,
especially when it comes to dealing with an exceptional event such as a
timeout.

I am sure this can be optimized further by writing reentrant code in
performance-critical places, but we have much bigger fish to fry at the
moment.
The patch would be this:
=== modified file 'src/comm/ConnOpener.cc'
--- src/comm/ConnOpener.cc 2012-08-28 19:12:13 +0000
+++ src/comm/ConnOpener.cc 2012-09-09 08:20:09 +0000
@@ -139,6 +139,10 @@
         // it never reached fully open, so cleanup the FD handlers
         // Note that comm_close() sequence does not happen for partially open FD
         Comm::SetSelect(temporaryFd_, COMM_SELECT_WRITE, NULL, NULL, 0);
+        if (calls_.earlyAbort_ != NULL) {
+            calls_.earlyAbort_->cancel("Comm::ConnOpener::doneConnecting");
+            calls_.earlyAbort_ = NULL;

The above line is extra because we do that immediately below:

+        }
         calls_.earlyAbort_ = NULL;
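To illustrate why the patch calls cancel() before clearing the pointer: a toy model (not Squid's actual AsyncCall API; all names here are hypothetical) showing that merely dropping our reference to a queued call does not stop it, because the dispatch queue still holds its own reference. Only an explicit cancel() prevents delivery.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for a queued async call: the dispatcher keeps
// its own reference, so the call can still fire after the owner forgets
// it, unless it was explicitly cancelled first.
struct QueuedCall {
    bool cancelled = false;
    bool fired = false;
    void cancel(const char *reason) { (void)reason; cancelled = true; }
    void deliver() { if (!cancelled) fired = true; }  // dispatcher side
};

// Mimics the patched cleanup: cancel first, then forget the reference.
void cleanupEarlyAbort(std::shared_ptr<QueuedCall> &slot)
{
    if (slot != nullptr) {
        slot->cancel("Comm::ConnOpener::doneConnecting");
        slot = nullptr;  // the same assignment also happens just below in the real code
    }
}
```

Clearing `slot` alone would leave the dispatcher free to deliver the stale call later; the cancel() is what makes the queued copy a no-op.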
I think from all this discussion the ConnOpener::DelayedConnectRetry
handler needs to be investigated as well.
Agreed. Why is that handler reusing the same failed socket? Should we
not close the failed socket descriptor upon detecting a failure and then
open a new one just before trying again?
Hmm. Not sure. This is a straight copy of the older code.

If you want a new socket FD most of what start() does needs to be
re-done for it. Except that the start timer needs to be left unchanged,

Exactly.

and the async calls need to be switched to the new FD (would that be a
reassignment, or a cancel + replace?)

Cancelled before replacing on the new FD.
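The "cancel + replace" step above can be sketched with a toy model (not Squid's AsyncCall API; `PendingCall` and `replaceCall` are illustrative names): before installing handlers against the new FD, cancel whatever call is still registered against the old one, then store a fresh call in its place.

```cpp
#include <memory>
#include <string>

// Hypothetical stand-in for an async call slot: a cancelled call must
// never fire, even if a dispatch queue still holds a reference to it.
struct PendingCall {
    std::string name;
    bool cancelled = false;
    void cancel(const std::string &reason) { (void)reason; cancelled = true; }
};

// Cancel the call registered for the old FD (if any) and install a
// replacement for the new FD. Returns the replacement for convenience.
std::shared_ptr<PendingCall>
replaceCall(std::shared_ptr<PendingCall> &slot, const std::string &newName)
{
    if (slot)
        slot->cancel("switching to a new FD");
    slot = std::make_shared<PendingCall>();
    slot->name = newName;
    return slot;
}
```

The point of cancelling rather than just reassigning is that the old call may already be queued for delivery against the dead FD; reassignment alone would let it run.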
Ideally, when we fix this, we should avoid attaching any callbacks that
may outlive a temporary FD to that temporary FD. For example, the
timeout callback should be converted into an event, which is not a
trivial change. AFAICT we do not have any other such callbacks (the
earlyAbort one is specific to the temporary FD and can stay), but I have
not checked closely.
Agreed. Using eventAdd() for the timeout is a great idea IMO.
Unfortunately eventAdd() does not accept AsyncCalls (yet) and would
require another wrapper doing an async re-schedule. On the plus side,
that means both sides of the race face the same lag handicap.
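The "same lag handicap" point can be made concrete with a toy queue model (not Squid code; all names are illustrative): when the timeout event merely schedules the real async call, the handler runs only on the second drain cycle, one hop for the event and one for the call it schedules. An eventAdd()-based retry path would pay the same two hops, so neither side of the race gets a head start.

```cpp
#include <functional>
#include <queue>
#include <utility>

// Minimal single-threaded callback queue standing in for the event and
// async-call machinery discussed above.
struct ToyQueue {
    std::queue<std::function<void()>> q;
    void schedule(std::function<void()> f) { q.push(std::move(f)); }
    void runOne() {                 // drain exactly one "cycle"
        if (q.empty())
            return;
        auto f = std::move(q.front());
        q.pop();
        f();
    }
};

// Count the drain cycles until the real handler runs when the timeout
// event (hop 1) only schedules the actual async call (hop 2).
int hopsUntilHandled(ToyQueue &q)
{
    int handledAt = -1;
    int cycle = 0;
    bool done = false;
    q.schedule([&]() {              // hop 1: the timeout event fires ...
        q.schedule([&]() {          // hop 2: ... and schedules the async call
            handledAt = cycle;
            done = true;
        });
    });
    while (!done) {
        ++cycle;
        q.runOne();
    }
    return handledAt;
}
```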
Here is the sketch for connect retries on failures, as I see it:
default:
    ++failRetries_; // TODO: rename to failedTries

    // we failed and have to start from scratch
    ... close the temporary socket ...

    // if we exhausted all retries, bail with an error
    if (failRetries_ > Config.connect_retries) {
        ... send COMM_ERR_CONNECT to the caller ...
        return;
    }

    // we can still retry after a small pause
    // we might timeout before that pause ends, but that is OK
    eventAdd("Comm::ConnOpener::DelayedConnectRetry", ...)
More changes would be needed to make this work correctly:

* the timeout handler should not be associated with the socket
  descriptor (since it may change). We can use eventAdd instead.

* the Comm::ConnOpener::DelayedConnectRetry() wrapper should call a
  method that will open a new socket.
Sounds okay to me.
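The retry flow sketched above can be modelled end to end with a toy class (not Squid code; `RetryModel` and its members are hypothetical names, with `maxRetries` standing in for Config.connect_retries): each failed attempt closes its socket, and each retry opens a brand-new one, so no handler ever needs to outlive the FD it was registered against.

```cpp
// Toy model of the connect-retry sketch: open a fresh socket per
// attempt, close it on failure, give up once retries are exhausted.
struct RetryModel {
    int failedTries = 0;
    int maxRetries;          // stands in for Config.connect_retries
    int socketsOpened = 0;
    int socketsClosed = 0;
    explicit RetryModel(int max) : maxRetries(max) {}

    // Simulate attempts that fail `failuresBeforeSuccess` times.
    // Returns true on success, false once retries run out (the real
    // code would send COMM_ERR_CONNECT to the caller at that point).
    bool run(int failuresBeforeSuccess) {
        for (;;) {
            ++socketsOpened;                    // a fresh FD per attempt
            if (socketsOpened > failuresBeforeSuccess)
                return true;                    // connected
            ++socketsClosed;                    // failed: start from scratch
            if (++failedTries > maxRetries)
                return false;                   // retries exhausted
            // (the real code would eventAdd() a delayed retry here)
        }
    }
};
```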
I'm not strongly viewed on this, but is it better to display a hard
ERR_CONNECT or a soft(er) TIMEOUT in these cases? That is, is it
actually a good idea to remove the timeout if-case?
Amos