Re: API and terms: idle-time-out and heartbeat intervals.

Ken Giusti Fri, 30 Sep 2016 06:25:41 -0700

CC'ing user's list

Given the interpretations of the spec discussed below is the way Proton handles 
this functionality inconsistent with itself?


1) Proton sends an open frame with _half_ the value of the timeout as set by 
the application.
2) Proton interprets the idle-timeout in the received open as the actual time 
out interval, and pessimistically ;) generates heartbeats as 1/2 that value.

Is this a bug in the Proton implementation?

-K 

----- Original Message -----
> From: "Alan Conway" <acon...@redhat.com>
> To: d...@qpid.apache.org
> Sent: Thursday, September 29, 2016 9:11:51 AM
> Subject: Re: API and terms: idle-time-out and heartbeat intervals.
> 
> On Wed, 2016-09-28 at 22:33 +0100, Robbie Gemmell wrote:
> > The spec unfortunately says what it says, and even if there were
> > scope
> > to update it, I'm not sure there is a way to change it to address the
> > main issue (which regardless of any other clarity issues, is I think
> > just that it only says you SHOULD advertise half your 'actual
> > timeout') that wouldn't result in an 'incompatible change' of
> > behaviour.
> 
> I look at this differently. We agree on the semantics of idle-time-out
> defined by the spec. Now forget the spec wording, it is easy to explain
> clearly what the semantics are. It is not "half your actual time out",
> it is the *exact* frame rate you expect from your peer. The peer's
> *only* task is to respect that frame rate, they do not need any more
> information than that - they certainly don't need to guess at what your
> margin for error is for keeping the connection open.
> 
> The margin for error is an implementation detail that does *not* need
> to be advertised. Double the frame rate is a reasonable general
> recommendation, but the ideal margin depends on latency variability in
> the system, you can imagine systems where it might be better to have a
> larger or a smaller margin.
> 
> The name 'idle-time-out' is not an ideal choice, 'heartbeat' or 'max-
> frame-delay' might be better. But once we know what it means, the
> semantics are perfectly clear.
> 
> > Knowing the exact enforced timeout could certainly be clearer in some
> > ways, but if the period you had to send after wasnt defined (e.g
> > half)
> > then it would essentially leave you with the same problem as now:
> > deciding exactly how early you should send heartbeat frames to avoid
> > a
> > [more-precisely-known] timeout.
> 
> The only important thing is to agree on the meaning of the advertised
> value: is it the frame rate or the connection close threshold? The work
>  of adjusting at one end or the other is equivalent.
> 
> I think we agree it is currently defined as the frame rate, I see no
> benefit in trying to change the meaning. (There would be no drawback
> either if the spec was not already published and implemented by
> multiple implementors, but it is)
> 
> > 
> > With what the spec says we effectively have a lower-bound on the
> > actual-timeout (if cases the peer didnt half their actual-timeout
> > before advertising a number) and an upper bound (if they did) due to
> > the use of SHOULD as opposed to MUST or some other more fixed
> > definition of behaviour. So we currently try to send frames often
> > enough to satisfy that lower bound, i.e the number we received. That
> > essentially reduces it to the the same problem as if we were trying
> > to
> > satisfy a theoretical exactly-known actual-timeout (just using a
> > smaller number that we dont want to use, because we followed the
> > specs
> > recommendation). Currently we do this by sending frames after half
> > the
> > advertised period, i.e. what we know is actually a quarter of the
> > actual-timeout enforced by peers such as proton which are following
> > the specs recommendations. We could choose to do it less often than
> > that by reducing how conservative it is, e.g use 75+% of advertised
> > value instead for example, which should have no impact in the cases
> > where peers had followed the recommendation (as we would still be
> > sending more than twice as quickly as needed to satisfy their
> > actual-timeout), but would put us closer to spurious timeout ( if e.g
> > there was a delay in delivering/processing the heartbeat) against any
> > peers that didn't follow the recommendation.
> > 
> > On 28 September 2016 at 21:26, Justin Ross <justin.r...@gmail.com>
> > wrote:
> > > 
> > > IMO, the overall picture is simpler, and easier to explain to third
> > > parties, if we go the way Ken suggested.  When a remote peer sends
> > > you an
> > > idle timeout value, it is an expression of an (actual, not simply
> > > "advertised") guarantee - "I will expire the connection after X
> > > time
> > > without receiving a frame from you".
> > > 
> > > We could also legitimately go the direction you suggest.  *But* its
> > > name is
> > > "idle timeout".  We can't easily change the name.  I think we
> > > should take
> > > the spec text that goes with the name, and the behavior of our
> > > components,
> > > firmly in the direction Ken suggests.
> > > 
> > > Off topic: why is this on the dev list?
> > > 
> > > 
> > > On Wed, Sep 28, 2016 at 12:36 PM, Alan Conway <acon...@redhat.com>
> > > wrote:
> > > 
> > > > 
> > > > On Wed, 2016-09-28 at 10:13 -0400, Ken Giusti wrote:
> > > > > 
> > > > > I've had a hand in the way Proton/C interprets the meaning of
> > > > > 'idle-
> > > > > timeout' and I've never liked the solution.  I think Proton/C's
> > > > > behavior is not 'pessimistic' as much as it is 'conservative'
> > > > > for the
> > > > > sake of interoperability.  This, unfortunately ends up with a
> > > > > needless idle frame chattiness when both ends are Proton-based.
> > > > > 
> > > > > ----- Original Message -----
> > > > > > 
> > > > > > 
> > > > > > From: "Rob Godfrey" <rob.j.godf...@gmail.com>
> > > > > > To: "qpid" <d...@qpid.apache.org>
> > > > > > Sent: Wednesday, September 28, 2016 6:19:05 AM
> > > > > > Subject: Re: API and terms: idle-time-out and heartbeat
> > > > > > intervals.
> > > > > > 
> > > > > > I agree that specifying that the communicated figure should
> > > > > > be
> > > > > > "half"
> > > > > > the "actual" timeout was a mistake.
> > > > > > 
> > > > > > What the spec should have tried to communicate is that the
> > > > > > sender
> > > > > > should communicate a value somewhat less than the period it
> > > > > > uses to
> > > > > > determine that the connection has actually timed-out to allow
> > > > > > for
> > > > > > the
> > > > > > receiver to process and emit a heartbeat frame.
> > > > > 
> > > > > 
> > > > > Wouldn't it be much clearer to simply send the _actual_ idle
> > > > > timeout
> > > > > value?
> > > > 
> > > > My read is that is exactly what it does: It sends the max time
> > > > that the
> > > > *sender* of frames may be idle. The receiver of frames SHOULD be
> > > > more
> > > > patient than that. The wording of the "discussion" around it and
> > > > the
> > > > choice of terms is a bit cloudy but, the text that describes
> > > > idle-time-
> > > > out seems clear enough: it is the max interval between sending
> > > > frames.
> > > > The frame receiver SHOULD wait longer that that before closing,
> > > > and 2x
> > > > seems a reasonable suggestion, but that's for the impl to decide.
> > > > 
> > > > It's weird that it says "idle-time-out should be half the
> > > > threshold"
> > > > instead of "the threshold should be twice the idle-time-out" but
> > > > it's
> > > > logically equivalent.
> > > > 
> > > > 
> > > > > 
> > > > > Having the spec suggest "communicating a value *somewhat less*"
> > > > 
> > > > The wording is odd but the semantics are you communicate
> > > > *exactly* the
> > > > max frame delay you want and then you SHOULD set your connection
> > > > close
> > > > threshold to something bigger. The other end doesn't need to know
> > > > how
> > > > much bigger, they just need to know what rate to send frames.
> > > > 
> > > > > 
> > > > > [emphasis mine] leaves the implementation open for
> > > > > interpretation -
> > > > > which is exactly how we got into this mess in the first
> > > > > place.  Developers are a smart bunch - they know that keep
> > > > > alive
> > > > > traffic will have to be sent frequently enough to prevent idle
> > > > > timeout.
> > > > > 
> > > > > 
> > > > > > 
> > > > > > 
> > > > > >  Similarly the sender
> > > > > > should ensure that a frame has been emitted well within the
> > > > > > timeout
> > > > > > period to allow for any communication / processing delay.
> > > > > 
> > > > > Agreed - perfectly acceptable for the spec to point this out.
> > > > > 
> > > > > > 
> > > > > > 
> > > > > >  In practice
> > > > > > these "wiggle room" factors should not be determined by the
> > > > > > application level timeout setting but by sensible
> > > > > > calculations on
> > > > > > transport delay variance / processing time, etc...  these
> > > > > > calculation
> > > > > > may differ between different use-cases / environments (for
> > > > > > example
> > > > > > in
> > > > > > a low latency / real-time environment you may be able to make
> > > > > > hard
> > > > > > guarantees about the number of milliseconds that
> > > > > > communication /
> > > > > > processing delay will take... on the other hand if you are
> > > > > > using an
> > > > > > interpreted language with stop-the-world garbage collection
> > > > > > you may
> > > > > > not be able to say much better than the delay should be less
> > > > > > than
> > > > > > 30s
> > > > > > or whatever).
> > > > > > 
> > > > > 
> > > > > Yes - very important things to keep in mind when implementing
> > > > > this.  But the spec shouldn't be making these suggestions for
> > > > > different implementation options. The spec should be as concise
> > > > > as
> > > > > possible about the mandated behavior, and leave the
> > > > > implementation to
> > > > > the developers.
> > > > > 
> > > > > > 
> > > > > > 
> > > > > > I think application level APIs should be in terms of the
> > > > > > timeouts
> > > > > > that
> > > > > > will affect the application.  The AMQP library should be
> > > > > > massaging
> > > > > > those numbers in such a way that they can fulfil the
> > > > > > application
> > > > > > requirements.
> > > > > > 
> > > > > 
> > > > > Agreed.  Now, is there _any_ way we can suggest an update to
> > > > > the
> > > > > spec?  Perhaps an errata, etc?
> > > > > 
> > > > > > 
> > > > > > 
> > > > > > -- Rob
> > > > > > 
> > > > > > On 28 September 2016 at 10:42, Robbie Gemmell <robbie.gemmell
> > > > > > @gmail
> > > > > > .com>
> > > > > > wrote:
> > > > > > > 
> > > > > > > 
> > > > > > > On 27 September 2016 at 22:24, Alan Conway <aconway@redhat.
> > > > > > > com>
> > > > > > > wrote:
> > > > > > > > 
> > > > > > > > 
> > > > > > > > On Tue, 2016-09-27 at 15:37 -0400, Alan Conway wrote:
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > I want to clarify and document the meaning of these
> > > > > > > > > terms for
> > > > > > > > > our
> > > > > > > > > APIs,
> > > > > > > > > presently I can't find anywhere where they are
> > > > > > > > > documented
> > > > > > > > > clearly.
> > > > > > > > > 
> > > > > > > > > The AMQP spec says: "Each peer has its own
> > > > > > > > > (independent) idle
> > > > > > > > > timeout.
> > > > > > > > > At connection open each peer communicates the maximum
> > > > > > > > > period between activity (frames) on the connection that
> > > > > > > > > it
> > > > > > > > > desires
> > > > > > > > > from
> > > > > > > > > its partner.The open frame carries the idletime-out
> > > > > > > > > field for this purpose. To avoid spurious timeouts, the
> > > > > > > > > value
> > > > > > > > > in
> > > > > > > > > idle-
> > > > > > > > > time-out SHOULD be half the peer’s
> > > > > > > > > actual timeout threshold."
> > > > > > > > > 
> > > > > > > > > In other words: if I send you an "open" frame with
> > > > > > > > > idle-time-
> > > > > > > > > out=N
> > > > > > > > > that
> > > > > > > > > means *you* should not wait for longer than N
> > > > > > > > > milliseconds to
> > > > > > > > > send a
> > > > > > > > > frame to me. It does not mean *I* will close the
> > > > > > > > > connection
> > > > > > > > > after N
> > > > > > > > > milliseconds, I SHOULD be more patient and wait for N*2
> > > > > > > > > ms to
> > > > > > > > > avoid
> > > > > > > > > closing prematurely due to minor timing wobbles.
> > > > > > > > > 
> > > > > > > > > I think the choice of name is slightly ambiguous but
> > > > > > > > > the spec
> > > > > > > > > is
> > > > > > > > > clear
> > > > > > > > > on the semantics, so it's important to document it to
> > > > > > > > > remove
> > > > > > > > > the
> > > > > > > > > ambiguity.
> > > > > > > > > 
> > > > > > > > > Anybody disagree?
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > Sigh. Sadly proton-C interprets "idle-timeout"
> > > > > > > > differently
> > > > > > > > depending on
> > > > > > > > which end of the connection you are on:
> > > > > > > > 
> > > > > > > >       // as per the recommendation in the spec, advertise
> > > > > > > > half
> > > > > > > > our
> > > > > > > >       // actual timeout to the remote
> > > > > > > >       const pn_millis_t idle_timeout = transport-
> > > > > > > > > 
> > > > > > > > > local_idle_timeout
> > > > > > > >           ? (transport->local_idle_timeout/2)
> > > > > > > >           : 0;
> > > > > > > > 
> > > > > > > > So in proton, pn_set_idle_timeout does NOT mean set the
> > > > > > > > AMQP
> > > > > > > > idle-
> > > > > > > > timeout value, it means set the local "receive timeout"
> > > > > > > > value
> > > > > > > > and send
> > > > > > > > half that as the AMQP "send timeout" for the peer.
> > > > > > > > 
> > > > > > > > I'm tempted to use a new term in the Go API: "heartbeat".
> > > > > > > > To me
> > > > > > > > that
> > > > > > > > clearly means the "send timeout" (hearts beat, they don't
> > > > > > > > listen for
> > > > > > > > beats) so it coincides with the meaning of the AMQP
> > > > > > > > "idle-
> > > > > > > > timeout", but
> > > > > > > > without the ambiguity that is exacerbated by proton
> > > > > > > > interpreting it
> > > > > > > > both ways.
> > > > > > > > 
> > > > > > > > 
> > > > > > > 
> > > > > > > Proton may seem to behave differently on each end, but I
> > > > > > > don't
> > > > > > > think
> > > > > > > its necessarily a bad thing that it does, and it is also I
> > > > > > > think
> > > > > > > largely just reflecting an annoying bit in the spec around
> > > > > > > this
> > > > > > > where
> > > > > > > different behaviours are allowed for, whereas it would be
> > > > > > > easier
> > > > > > > if it
> > > > > > > had less wiggle room.
> > > > > > > 
> > > > > > > The transport setter/getter for the local timeout takes the
> > > > > > > 'actual
> > > > > > > timeout' and then sends half of it as the advertised value
> > > > > > > in the
> > > > > > > Open
> > > > > > > sent. This makes a certain amount of sense since it ensures
> > > > > > > that
> > > > > > > appropriate behaviour is actually satisfied, rather than
> > > > > > > expecting the
> > > > > > > user to ensure they only give half the value they really
> > > > > > > want for
> > > > > > > their actual timeout. The getter for the remote timeout
> > > > > > > value on
> > > > > > > the
> > > > > > > other hand returns the advertised value from the Open that
> > > > > > > is
> > > > > > > received. I expect it does that since it cant actually ever
> > > > > > > return the
> > > > > > > remotes 'actual timeout' without making an assumption, i.e
> > > > > > > that
> > > > > > > they
> > > > > > > did in fact advertise half (or less) of their actual
> > > > > > > timeout,
> > > > > > > which
> > > > > > > the spec only says that they SHOULD do.
> > > > > > > 
> > > > > > > Yes the local setter taking the advertised value may have
> > > > > > > been
> > > > > > > better
> > > > > > > for method consistency with the remote getter. On the other
> > > > > > > hand,
> > > > > > > sending of necessary heartbeats is handled directly by the
> > > > > > > transport
> > > > > > > during the tick process, so users may not necessarily even
> > > > > > > use
> > > > > > > the
> > > > > > > getter themselves, and proton uses that remote value
> > > > > > > internally
> > > > > > > by
> > > > > > > pessimistically halfing it to account for the case that
> > > > > > > folks on
> > > > > > > the
> > > > > > > other end did not advertise half their actual timeout
> > > > > > > (since the
> > > > > > > spec
> > > > > > > doesnt require that they do). Side note: proton could
> > > > > > > arguably be
> > > > > > > less
> > > > > > > pessimistic here and go for say a percentage much nearer
> > > > > > > the full
> > > > > > > advertised value, but then you'd probably need to start
> > > > > > > guaging
> > > > > > > how
> > > > > > > close is too close.
> > > > > > > 
> > > > > > > I think ensuring the doccumentation on the methods is clear
> > > > > > > what
> > > > > > > they
> > > > > > > do is sufficient enough here. I actually prefer idle-
> > > > > > > timeout as
> > > > > > > an
> > > > > > > name rather than heartbeat due to the way this all works.
> > > > > > > Since
> > > > > > > you
> > > > > > > only tell the other side [half] your timeout, you dont
> > > > > > > actually
> > > > > > > have
> > > > > > > direct control over when they send any needed empty frames
> > > > > > > to
> > > > > > > satisfy
> > > > > > > it (as the above shows, we might send them more often than
> > > > > > > they
> > > > > > > require) and 'heartbeat' might seem to imply that you do,
> > > > > > > and
> > > > > > > possibly
> > > > > > > even that they need be sent at that period all the time
> > > > > > > even
> > > > > > > despite
> > > > > > > regular traffic, which is not the case.
> > > > > > > 
> > > > > > > Robbie
> > > > > > > 
> > > > > > > ---------------------------------------------------------
> > > > > > > ------
> > > > > > > ------
> > > > > > > To unsubscribe, e-mail: dev-unsubscr...@qpid.apache.org
> > > > > > > For additional commands, e-mail: dev-h...@qpid.apache.org
> > > > > > > 
> > > > > > 
> > > > > > -----------------------------------------------------------
> > > > > > ------
> > > > > > ----
> > > > > > To unsubscribe, e-mail: dev-unsubscr...@qpid.apache.org
> > > > > > For additional commands, e-mail: dev-h...@qpid.apache.org
> > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > > 
> > > > ---------------------------------------------------------------
> > > > ------
> > > > To unsubscribe, e-mail: dev-unsubscr...@qpid.apache.org
> > > > For additional commands, e-mail: dev-h...@qpid.apache.org
> > > > 
> > > > 
> > 
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscr...@qpid.apache.org
> > For additional commands, e-mail: dev-h...@qpid.apache.org
> > 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@qpid.apache.org
> For additional commands, e-mail: dev-h...@qpid.apache.org
> 
> 

-- 
-K

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@qpid.apache.org
For additional commands, e-mail: users-h...@qpid.apache.org

Re: API and terms: idle-time-out and heartbeat intervals.

Reply via email to