On 17/01/2013 7:31 a.m., Alex Rousskov wrote:
On 01/15/2013 07:39 AM, Rainer Weikusat wrote:
An internal method named 'stillConnecting' is introduced which contains the check for an 'abort done by the parent' that is presently in ConnOpener::connect:

      // our parent Jobs signal abort by cancelling their callbacks.
      if (callback_ == NULL || callback_->canceled())
          return;

This method is called from both ::connect and ::timeout to determine
whether the state of the connect request is still undecided by the time
either method executes. It guards against the case where both
::connect and ::timeout are scheduled before either of them
actually runs, and detects when the 'other side' of a full
connection has already gone away or is in the process of going away
(in which case the already existing 'result state' should be kept).
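As I read the description, the helper amounts to something like the following. This is a condensed stand-alone sketch, not the real Squid classes: AsyncCall, ConnOpener and doneConnecting() are mocked here just to show the control flow.

```cpp
#include <cassert>
#include <memory>

// Minimal stand-in for Squid's AsyncCall (illustrative, not the real API).
struct AsyncCall {
    bool canceled_ = false;
    bool fired_ = false;
    void cancel() { canceled_ = true; }
    bool canceled() const { return canceled_; }
};
using AsyncCallPointer = std::shared_ptr<AsyncCall>;

class ConnOpener {
public:
    explicit ConnOpener(AsyncCallPointer cb) : callback_(cb) {}

    // the proposed helper: both connect() and timeout() bail out early
    // when the parent has signalled abort by cancelling callback_
    bool stillConnecting() const {
        return callback_ != nullptr && !callback_->canceled();
    }

    void connect() {
        if (!stillConnecting())
            return;          // parent aborted; keep the existing result state
        // ... attempt the TCP connect and possibly re-schedule ...
        ++attempts_;
    }

    void timeout() {
        if (!stillConnecting())
            return;
        doneConnecting();    // deliver the timeout result
    }

    void doneConnecting() {
        if (callback_) {
            callback_->fired_ = true;
            callback_ = nullptr;   // state cleared; doneAll() would now be true
        }
    }

    int attempts_ = 0;   // exposed only so the sketch is easy to exercise

private:
    AsyncCallPointer callback_;
};
```

With this shape, a cancelled callback turns both queued Calls into no-ops instead of letting them fire into a dead parent.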
The stillConnecting() mechanism should not be needed at all. The
doneAll() method tests should be what is used to determine whether
the Job is still running or not.
These two lines of code are presently at the top of
ConnOpener::connect. If it should really be ConnOpener::doneAll there,
then the new method is obviously redundant, but that's something I
don't know (since I didn't write the code containing the two slightly
different 'still something to do' checks, and if in doubt, I'd rather
assume that there was a reason for the difference).

Amos, the done() protection does not replace stillConnecting() because
the job does not know yet that it is done. The automatic done() check
happens _after_ the async call into the job, not before that call.
[Perhaps done() check should be performed before and after, but we
currently do it after because, in part, some callbacks have side effects
not covered by done() that require us to call them.]

There is a significant overlap between the ConnOpener doneAll() and the new stillConnecting().

What I was trying to point out is that we have three race conditions between timeout() and connect() Calls:

* connect() happens first and re-schedules itself while timeout() is queued.
  - this is the case we are now ignoring. I'm fine with that.

* connect() happens first and runs doneConnecting() while timeout() is queued.
 - in this case doneConnecting() will cancel the queued timeout() Call. The Dialer skips cancelled Calls, correct? So we have no need to check whether we are still connecting, because we must be for the timeout() Call to be non-cancelled. Therefore the use of stillConnecting() in timeout() is a waste of cycles.

* timeout() happens first and runs doneConnecting() while connect() is queued.
 - we are still connecting, because the connect() Call has not run.
 - we have no handle to the connect() AsyncCall stored, so we can't cancel it directly (bug?).
 - therefore the connect() Call, when it runs, must still check the doneAll() details (the Dialer did not do it beforehand, as you say). However, note that the earlier timeout() Call resulted in doneConnecting() clearing all the state, so doneAll() already returned true after that timeout().

So we have a choice of:
a) hold a pointer in the calls_ struct to the latest queued connect() Call, and cancel it in doneConnecting() or whenever an in-progress connect schedules a new one.
or
b) status quo: check for doneAll() explicitly inside the ConnOpener::connect() function and abort/return if it says the job should have exited already.
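Option (a) might look roughly like this. A hypothetical sketch only: the calls_ struct, scheduleConnect() and the mock AsyncCall below are illustrative stand-ins, not the real ConnOpener members.

```cpp
#include <cassert>
#include <memory>

// Mock of Squid's AsyncCall cancellation interface (illustrative only).
struct AsyncCall {
    bool canceled_ = false;
    void cancel() { canceled_ = true; }
    bool canceled() const { return canceled_; }
};
using CallPointer = std::shared_ptr<AsyncCall>;

class ConnOpener {
public:
    struct Calls {
        CallPointer connect_;   // latest queued connect() Call (new in option a)
        CallPointer timeout_;
    } calls_;

    // queue a (mock) connect retry, remembering the Call so it can be cancelled
    CallPointer scheduleConnect() {
        if (calls_.connect_)
            calls_.connect_->cancel();   // at most one live queued connect() Call
        calls_.connect_ = std::make_shared<AsyncCall>();
        return calls_.connect_;
    }

    // deliver the final result and cancel any still-queued Calls explicitly,
    // so the Dialer skips them and no stale Call reaches a dead job
    void doneConnecting() {
        if (calls_.connect_) {
            calls_.connect_->cancel();
            calls_.connect_ = nullptr;
        }
        if (calls_.timeout_) {
            calls_.timeout_->cancel();
            calls_.timeout_ = nullptr;
        }
    }
};
```

The point of this shape is that cancellation becomes symmetric: doneConnecting() already cancels the timeout() Call today, and with a stored pointer it can cancel the connect() Call the same way, removing the need for the in-method doneAll() re-check.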


Rainer, the stillConnecting() check should not be needed. Nothing should
break if the code requesting a connection has quit while we were
waiting. The bzr log message for adding that check is "Save some CPU
cycles scheduling useless calls". Personally, I suspect we might be
losing more cycles on this useless check than on an occasional useless
call, but the common case may be environment-dependent.

It prevents the connect() function from scheduling further connect() Calls after timeout() has already sent the callback and the related Job is dead or dying.


Amos, are you OK with us removing that optimization during this rewrite?

No. I'm not clear on what happens if the Job is supposedly exiting but one of the Calls keeps re-scheduling itself as an I/O handler. The need for actually adding the optimization in the first place was, IIRC, that we had a lot of Calls scheduled with disregard for the Job status, hogging memory the whole time.

I'm more in favour of this patch implementing option (a) above, which will make it clear that all Calls are cancelled explicitly and nothing bad can go wrong.


Amos
