Re: [Xenomai-core] latencys drifting into negative (Xenomai 2.4.2/2.4.3)

Gilles Chanteperdrix Thu, 03 Apr 2008 05:54:59 -0700

On Thu, Apr 3, 2008 at 2:50 PM, Jan Kiszka <[EMAIL PROTECTED]> wrote:
>
> Gilles Chanteperdrix wrote:
>
> > On Thu, Apr 3, 2008 at 2:17 PM, Jan Kiszka <[EMAIL PROTECTED]> wrote:
> >
> > > Sebastian Smolorz wrote:
> > >
> > >
> > > > Gilles Chanteperdrix wrote:
> > > >
> > > >
> > > > > On Wed, Apr 2, 2008 at 5:58 PM, Sebastian Smolorz
> > > > > <[EMAIL PROTECTED]> wrote:
> > > > >
> > > > >
> > > > > > Jan Kiszka wrote:
> > > > > >  > Sebastian Smolorz wrote:
> > > > > >  >> Jan Kiszka wrote:
> > > > > >  >>> Cornelius Köpp wrote:
> > > > > >  >>>> I talked with Sebastian Smolorz about this and he built his own
> > > > > >  >>>> independent kernel-config to check. He got the same drifting-effect
> > > > > >  >>>> with Xenomai 2.4.2 and Xenomai 2.4.3 running latency over several
> > > > > >  >>>> hours. His kernel-config is attached as
> > > > > >  >>>> 'config-2.6.24-xenomai-2.4.3__ssm'.
> > > > > >  >>>>
> > > > > >  >>>> Our kernel-configs are both based on a config used with Xenomai 2.3.4
> > > > > >  >>>> and Linux 2.6.20.15 without any drifting effects.
> > > > > >  >>> 2.3.x did not incorporate the new TSC-to-ns conversion. Maybe it is
> > > > > >  >>> not a PIC vs. APIC thing, but rather a rounding problem of larger TSC
> > > > > >  >>> values (that naturally show up when the system runs for a longer time).
> > > > > >  >> This hint seems to point in the right direction. I tried out a
> > > > > >  >> modified pod_32.h (xnarch_tsc_to_ns() commented out) so that the old
> > > > > >  >> implementation in include/asm-generic/bits/pod.h was used. The drifting
> > > > > >  >> bug disappeared. So there seems to be a buggy x86-specific
> > > > > >  >> implementation of this routine.
> > > > > >  >
> > > > > >  > Hmm, maybe even a conceptual issue: the multiply-shift-based
> > > > > >  > xnarch_tsc_to_ns is not as precise as the still multiply-divide-based
> > > > > >  > xnarch_ns_to_tsc. So when converting from tsc over ns back to tsc, we
> > > > > >  > may lose some bits, maybe too many bits...
> > > > > >  >
> > > > > >  > It looks like this bites us in the kernel latency tests (-t2 should
> > > > > >  > suffer as well). Those recalculate their timeouts each round based on
> > > > > >  > absolute nanoseconds. In contrast, the periodic user mode task of -t0
> > > > > >  > uses a periodic timer that is forwarded via a tsc-based interval.
> > > > > >  >
> > > > > >  > You (or Cornelius) could try to analyse the calculation path of the
> > > > > >  > involved timeouts, specifically to understand why the scheduled timeout
> > > > > >  > of the underlying task timer (which is tsc-based) tends to diverge from
> > > > > >  > the calculated one (ns-based).
> > > > > >
> > > > > >  So here comes the explanation. The error is inside the function
> > > > > >  rthal_llmulshft(). It returns wrong values which are too small - the
> > > > > >  higher the given TSC value the bigger the error. The function
> > > > > >  rtdm_clock_read_monotonic() calls rthal_llmulshft(). As
> > > > > >  rtdm_clock_read_monotonic() is called every time the latency kernel
> > > > > >  thread runs [1] the values reported by latency become smaller over time.
> > > > > >
> > > > > >  In contrast, the latency task in user space uses the conversion
> > > > > >  from TSC to ns only once when calling rt_timer_inquire [2].
> > > > > >  timer_info.date is too small, timer_info.tsc is right. So all calculated
> > > > > >  deltas in [3] are shifted to a smaller value. This value is constant
> > > > > >  during the runtime of latency in user space because no more conversion
> > > > > >  from TSC to ns occurs.
> > > > > >
> > > > > >
> > > > > latency does conversions from tsc to ns, but it converts time
> > > > > differences, so the error is small relative to the results.
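
To illustrate why the same relative error matters for absolute timestamps
but hardly for short differences, here is a minimal sketch in C. It is not
the Xenomai code: CPU_FREQ_HZ, SHIFT and the function names are made up for
the example, the shift is deliberately small so that the truncation becomes
visible, and unsigned __int128 is a GCC/Clang extension on 64-bit targets.
It does not reproduce whatever is wrong inside rthal_llmulshft(); it only
shows how an error proportional to the input behaves.

```c
#include <stdint.h>

#define CPU_FREQ_HZ 2400000000ULL /* assumed TSC frequency, example only */
#define NS_PER_SEC  1000000000ULL
#define SHIFT       16            /* assumed; kept small to expose truncation */

/* Reference conversion: multiply, then divide, using a 128-bit intermediate. */
static uint64_t tsc_to_ns_div(uint64_t tsc)
{
    return (uint64_t)(((unsigned __int128)tsc * NS_PER_SEC) / CPU_FREQ_HZ);
}

/* Approximation: multiply by a precomputed factor (rounded down), then shift. */
static uint64_t tsc_to_ns_mulshift(uint64_t tsc)
{
    const uint64_t mul = (NS_PER_SEC << SHIFT) / CPU_FREQ_HZ;

    return (uint64_t)(((unsigned __int128)tsc * mul) >> SHIFT);
}

/*
 * Shortfall of the approximation in nanoseconds. The factor is rounded
 * down, so the approximation never exceeds the reference and the
 * unsigned subtraction cannot wrap.
 */
static uint64_t conv_error(uint64_t tsc)
{
    return tsc_to_ns_div(tsc) - tsc_to_ns_mulshift(tsc);
}
```

With these assumed values, conv_error(2400000ULL) (1 ms worth of ticks) is
25 ns, while conv_error(8640000000000ULL) (1 h worth of ticks) is about
88 ms. Converting a short difference therefore stays accurate, whereas
converting an ever-growing absolute TSC value drifts further and further.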
> > > > >
> > > > >
> > > > Of course. I wasn't precise with my last statement. It should be: No more
> > > > conversions from *absolute* TSC values to ns occur.
> > >
> > >  This patch may do the trick: it uses the inverted tsc-to-ns function
> > > instead of the frequency-based one. Be warned, it is totally untested inside
> > > Xenomai, I just ran it in a user space test program. But it may give an
> > > idea.
> > >
> > >  Gilles, not sure if this is related to my quickly hacked test, but with
> > > RTHAL_CPU_FREQ = 800MHz and TSC = 0x7000000000000000 (or larger) I get an
> > > arithmetic exception with the rthal_llimd-based conversion to nanoseconds.
> > > Is there an input range we may have to exclude for rthal_llimd?
> > >
> >
> > rthal_llimd does a multiplication first, then a division. The
> > multiplication cannot overflow, but the result of the division may
> > not fit in 64 bits; you then get an exception on x86. This happens
> > only with m > d.
> >
>
>  OK, for tsc-to-ns this only bites us after a few hundred years of uptime -
> or when we have settable tsc counters (does Linux tweak them beyond aligning
> on SMP?).
>
>  But there is also the risk the other way around: ns-to-tsc with
> frequency > 1GHz will fall apart (kernel oops!) when the user provides a
> large timeout in nanoseconds that we then try to convert to tsc. Not good.
> Wrong values are one thing, but oopses are even worse.
>
>  Any idea how to fix this?


Since in the failure case the result of llimd would need 96 bits, the
only way is to make the llimd result a 96-bit variable.
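
As a rough sketch of that idea (not the actual rthal_llimd code): the
struct u96 type and the names muldiv96() and muldiv_sat() are hypothetical,
the operands are taken as unsigned for simplicity, and unsigned __int128 is
a GCC/Clang extension on 64-bit targets, so a 32-bit x86 build would need
hand-written long division instead. Saturating in the caller is only one
possible policy; it is an assumption of this example, not something decided
in this thread.

```c
#include <stdint.h>

/* 96-bit unsigned result: 32 high bits plus 64 low bits. */
struct u96 {
    uint32_t hi;
    uint64_t lo;
};

/*
 * Compute op * m / d without losing bits. The product of a 64-bit and a
 * 32-bit value needs at most 96 bits, and so does the quotient when d >= 1.
 * The caller must make sure that d is non-zero.
 */
static struct u96 muldiv96(uint64_t op, uint32_t m, uint32_t d)
{
    unsigned __int128 q = ((unsigned __int128)op * m) / d;
    struct u96 r = { (uint32_t)(q >> 64), (uint64_t)q };

    return r;
}

/*
 * Example policy for callers that can only handle 64 bits: clamp to
 * UINT64_MAX instead of raising a divide exception.
 */
static uint64_t muldiv_sat(uint64_t op, uint32_t m, uint32_t d)
{
    struct u96 r;

    if (d == 0)
        return UINT64_MAX;

    r = muldiv96(op, m, d);
    return r.hi ? UINT64_MAX : r.lo;
}
```

A caller converting a user-supplied timeout could then treat UINT64_MAX as
"infinite" or reject the value, rather than oopsing on the division.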

-- 
 Gilles

_______________________________________________
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core
