Re: [PATCH] Re: pconn.cc assert index >= 0, async call queue madness

Alex Rousskov Wed, 09 Apr 2008 10:29:46 -0700

On Wed, 2008-04-09 at 08:51 +0200, Henrik Nordstrom wrote:
> tis 2008-04-08 klockan 18:58 -0600 skrev Alex Rousskov:
> > In general, the caller should forget about the call after making it
> > (i.e., after scheduling it). We can make comm different, but I am not
> > sure it is a good idea from the clarity/simplicity point of view.
> 
> Which kind of rules out using cancellation as a solution to the problem,
> as you can only cancel what you know about, not things you have forgotten.


In general, the code expecting the call should be responsible for
canceling it if it is no longer capable of handling it. In this case,
that code is the idle pconn pool code.

> I do not want to view comm differently, I want to find a predictable and
> verifiable model in how AsyncCall should be used where it's possible to
> audit that things like the pconn race won't bite us.

Right. That is why I am trying to explain how I would use the design in
a clean way.

> > In general, there should not be a problem with comm not canceling a
> > pending call unless the recipient of that call is accessing some comm
> > structures directly, assuming they are still valid when the handler is
> > called. Since nothing is immediate, such an assumption may be wrong.
> 
> The above assumes each call handles cancellation itself, or that the
> object used by the call is invalidated and not the call itself.

I do not follow. There may be a terminology misunderstanding here:

* Call: a dumb object delivered from the caller to the recipient.
* Recipient: the code destination of the call.
* Caller: the code scheduling, placing, or making the call.
* Creator: the code creating the call.
* Dialing: the final act of delivering the call to the recipient.
* Canceling: preventing the call from being dialed in the future.
* Live call: the in-progress call that has been dialed and has not
finished yet.

The call object itself cannot or should not do anything other than
maintaining call parameters and making the actual call (dialing) if it
has not been canceled and if the recipient is still there.
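To make the "dumb call" idea concrete, here is a minimal sketch. It is
not the actual AsyncCall code: the names Call, Recipient, handle(), and
dial() are hypothetical, and the weak_ptr link is just one assumed way
to notice that the recipient is gone.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical recipient: any object with a handler that can be dialed.
struct Recipient {
    std::string lastMessage;
    void handle(const std::string &msg) { lastMessage = msg; }
};

// A "dumb" call: it stores the call parameter, a cancel flag, and a weak
// link to the recipient. It makes no decisions beyond those two checks.
class Call {
public:
    Call(std::weak_ptr<Recipient> r, std::string param)
        : recipient_(std::move(r)), param_(std::move(param)) {}

    void cancel() { canceled_ = true; }
    bool canceled() const { return canceled_; }

    // Dial: deliver the call unless it was canceled or the recipient is gone.
    // Returns true if the recipient's handler actually ran.
    bool dial() {
        if (canceled_)
            return false;
        if (auto r = recipient_.lock()) {
            r->handle(param_);
            return true;
        }
        return false;
    }

private:
    std::weak_ptr<Recipient> recipient_;
    std::string param_;
    bool canceled_ = false;
};
```

All the interesting decisions (when to cancel, what to do when dialed)
stay in the recipient and the caller.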

Calls are rather dumb and, IMO, should stay that way. The smart code
should be in the recipient (usually a job; an idle pconn pool in our
case) and the caller (usually another job; comm in our case).

One cannot invalidate the call. One can cancel it, which means that it
will not be dialed.

To be honest, I had not originally considered the case of canceling a
call that has been (or is being) made. Perhaps we need more rules/code
to handle that case (more polished versions of these are at the end of
the email):

  * Cancellation by the recipient has immediate effect (i.e., the call
will not be dialed after cancel() is called), but may be "too late" if
the call has already been dialed (i.e., it has happened or is
happening).

  * Cancellation by others has immediate effect in the current
single-threaded code. We should not rely on that if possible though. SMP
code may need more rules about this.

  * It is not a bug to cancel something late, but the recipient should
be aware of that possibility. Canceling something late is a no-op.
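
The cancellation rules above could be modeled roughly as follows. This
is a sketch under assumptions, not the real queue: the Call states,
CallQueue, schedule(), and drain() are names invented for illustration,
and the code is single-threaded only.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <memory>
#include <utility>

// Hypothetical call with an explicit life cycle.
struct Call {
    enum State { Scheduled, Canceled, Live, Done };
    State state = Scheduled;
    std::function<void()> handler;

    // Returns true only if this cancellation prevented dialing.
    // Canceling a live or finished call is a no-op ("too late").
    bool cancel() {
        if (state != Scheduled)
            return false;
        state = Canceled;
        return true;
    }
};

using CallPointer = std::shared_ptr<Call>;

class CallQueue {
public:
    void schedule(CallPointer c) { calls_.push_back(std::move(c)); }

    // Dial every queued call that has not been canceled, oldest first.
    // Returns how many handlers actually ran.
    int drain() {
        int dialed = 0;
        while (!calls_.empty()) {
            CallPointer c = calls_.front();
            calls_.pop_front();
            if (c->state != Call::Scheduled)
                continue; // canceled while waiting in the queue
            c->state = Call::Live;
            c->handler();
            c->state = Call::Done;
            ++dialed;
        }
        return dialed;
    }

private:
    std::deque<CallPointer> calls_;
};
```

A cancel() that arrives while the call is waiting takes effect
immediately; a cancel() that arrives during or after the handler
changes nothing, so the recipient has to tolerate that case.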

> The problem is how to handle cancellation of interest properly without
> it racing with the event you cancel having already occurred and sitting
> in the AsyncCall queue.

> The current case with the comm code is only a scratch on the surface.
> What I am after is to find the proper model on how AsyncCall should be
> used, allowing us to audit the code against this kind of races. 

You are doing the right thing, but I think you need to look at a
higher level. The async queue does not exist as far as the creator, the
caller, and the recipient are concerned. From their point of view, the
async queue is just a random time delay, nothing else. See the
cancellation rules above.

> You proposed we should cancel the call, but that model is incompatible
> with the model where you forget about the call when scheduling it as you
> then have a race where the call may already be sitting in the call queue
> when you want to cancel it.

Sorry, I probably was not clear. The two rules I was talking about have
to be interpreted together, not in isolation:

* The caller should immediately forget about the scheduled call.
* The recipient should cancel the call it no longer wants to receive.

Other rules:

* The calls are received in the order they were placed.
* The recipient is guaranteed to receive at most one call at a time.
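
The ordering and one-at-a-time guarantees can be illustrated with a
single FIFO queue in single-threaded code. This is only a sketch: the
CallQueue class and its schedule()/drain() methods are hypothetical
names. With one queue and one thread, at most one call is live overall,
which implies at most one per recipient.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <utility>
#include <vector>

class CallQueue {
public:
    // Scheduling only appends; nothing is dialed immediately.
    void schedule(std::function<void()> call) {
        calls_.push_back(std::move(call));
    }

    // Dial queued calls in the order they were placed. A call scheduled
    // by a running handler is appended and dialed after that handler
    // returns, never from inside it.
    void drain() {
        assert(!dialing_ && "drain() must not be re-entered from a handler");
        dialing_ = true;
        while (!calls_.empty()) {
            std::function<void()> call = std::move(calls_.front());
            calls_.pop_front();
            call();
        }
        dialing_ = false;
    }

private:
    std::deque<std::function<void()>> calls_;
    bool dialing_ = false;
};
```

A handler that schedules a follow-up call therefore never sees it run
re-entrantly; the follow-up simply lands at the back of the queue.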

New rules:

* Calls canceled by the recipient will not be dialed, but may have been
dialed before and may be live.

* Calls canceled by others will not be dialed in the current
single-threaded code. They may have been dialed before and may be live. 

* Canceling a dialed call is a no-op.


I am not proposing that the caller cancels the call it placed. I am
proposing that the recipient cancels the call it has not received and
no longer wants to receive. We should think of the calls as messages or parcels.
You cannot cancel a parcel after you ship it, but the recipient can
always refuse to receive it.
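
Applied to the idle pconn pool, the "refuse the parcel" idea might look
like the sketch below. Everything here is hypothetical (IdlePool,
push(), pop(), noteClosed(), the plain canceled flag); it is not the
pconn.cc code, only an illustration of the recipient keeping a handle
on the call it expects and canceling it once it gives the descriptor
away.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <map>
#include <memory>
#include <utility>

struct Call {
    bool canceled = false;
    std::function<void()> handler;
};
using CallPointer = std::shared_ptr<Call>;

// Stand-in for the async queue: just a delay between scheduling and dialing.
struct CallQueue {
    std::deque<CallPointer> calls;
    void schedule(CallPointer c) { calls.push_back(std::move(c)); }
    void drain() {
        while (!calls.empty()) {
            CallPointer c = calls.front();
            calls.pop_front();
            if (!c->canceled)
                c->handler();
        }
    }
};

// Hypothetical idle-connection pool acting as the recipient.
class IdlePool {
public:
    // Park a descriptor and return the call that the caller should
    // schedule when the idle connection closes or times out.
    CallPointer push(int fd) {
        auto call = std::make_shared<Call>();
        call->handler = [this, fd] { noteClosed(fd); };
        pending_[fd] = call;
        return call;
    }

    // Hand the descriptor to a new user. The pool can no longer handle
    // the pending call, so it refuses delivery by canceling it.
    bool pop(int fd) {
        auto it = pending_.find(fd);
        if (it == pending_.end())
            return false;
        it->second->canceled = true;
        pending_.erase(it);
        return true;
    }

    int closedCount() const { return closed_; }

private:
    void noteClosed(int fd) {
        // Without the cancellation in pop(), this check could fail for a
        // descriptor the pool no longer owns.
        assert(pending_.count(fd) == 1);
        pending_.erase(fd);
        ++closed_;
    }

    std::map<int, CallPointer> pending_;
    int closed_ = 0;
};
```

The caller schedules the call returned by push() and forgets it. Only
the pool, which knows whether it can still handle the call, cancels it.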

> It's not as bad as the call recursion in the old model, but in the same
> family of unwanted and hard to audit race conditions.

While some race conditions are unavoidable regardless of the design, the
current code provides some strong guarantees and guidelines. We just
need to clarify, polish, and enforce the rules (virtually none of that
has happened before). 

HTH,

Alex.
