Gilles Chanteperdrix wrote:
On Thu, Apr 3, 2008 at 2:17 PM, Jan Kiszka <[EMAIL PROTECTED]> wrote:
Sebastian Smolorz wrote:

Gilles Chanteperdrix wrote:

On Wed, Apr 2, 2008 at 5:58 PM, Sebastian Smolorz
<[EMAIL PROTECTED]> wrote:

Jan Kiszka wrote:
 > Sebastian Smolorz wrote:
 >> Jan Kiszka wrote:
 >>> Cornelius Köpp wrote:
 >>>> I talked with Sebastian Smolorz about this and he built his own
 >>>> independent kernel config to check. He got the same drifting effect
 >>>> with Xenomai 2.4.2 and Xenomai 2.4.3 running latency over several
 >>>> hours. His kernel config is attached as
 >>>> 'config-2.6.24-xenomai-2.4.3__ssm'.
 >>>>
 >>>> Our kernel configs are both based on a config used with Xenomai 2.3.4
 >>>> and Linux 2.6.20.15 without any drifting effects.
 >>> 2.3.x did not incorporate the new TSC-to-ns conversion. Maybe it is
 >>> not a PIC vs. APIC thing, but rather a rounding problem of larger TSC
 >>> values (that naturally show up when the system runs for a longer time).
 >> This hint seems to point in the right direction. I tried out a
 >> modified pod_32.h (xnarch_tsc_to_ns() commented out) so that the old
 >> implementation in include/asm-generic/bits/pod.h was used. The drifting
 >> bug disappeared. So there seems to be a buggy x86-specific
 >> implementation of this routine.
 >
 > Hmm, maybe even a conceptual issue: the multiply-shift-based
 > xnarch_tsc_to_ns is not as precise as the still multiply-divide-based
 > xnarch_ns_to_tsc. So when converting from tsc over ns back to tsc, we
 > may lose some bits, maybe too many bits...
 >
 > It looks like this bites us in the kernel latency tests (-t2 should
 > suffer as well). Those recalculate their timeouts each round based on
 > absolute nanoseconds. In contrast, the periodic user mode task of -t0
 > uses a periodic timer that is forwarded via a tsc-based interval.
 >
 > You (or Cornelius) could try to analyse the calculation path of the
 > involved timeouts, specifically to understand why the scheduled timeout
 > of the underlying task timer (which is tsc-based) tends to diverge from
 > the calculated one (ns-based).

 So here comes the explanation. The error is inside the function
 rthal_llmulshft(). It returns wrong values which are too small - the
 higher the given TSC value, the bigger the error. The function
 rtdm_clock_read_monotonic() calls rthal_llmulshft(). As
 rtdm_clock_read_monotonic() is called every time the latency kernel
 thread runs [1], the values reported by latency become smaller over time.
 In contrast, the latency task in user space uses the conversion from
 TSC to ns only once, when calling rt_timer_inquire [2].
 timer_info.date is too small, timer_info.tsc is right. So all calculated
 deltas in [3] are shifted to a smaller value. This offset is constant
 during the runtime of latency in user space because no further conversion
 from TSC to ns occurs.

latency does conversions from tsc to ns, but it converts time
differences, so the error is small relative to the results.

Of course. I wasn't precise with my last statement. It should be: No more
conversions from *absolute* TSC values to ns occur.

 This patch may do the trick: it uses the inverted tsc-to-ns function
instead of the frequency-based one. Be warned, it is totally untested inside
Xenomai, I just ran it in a user space test program. But it may give an
idea.

 Gilles, not sure if this is related to my quickly hacked test, but with
RTHAL_CPU_FREQ = 800MHz and TSC = 0x7000000000000000 (or larger) I get an
arithmetic exception with the rthal_llimd-based conversion to nanoseconds.
Is there an input range we may have to exclude for rthal_llimd?

rthal_llimd does a multiplication first, then a division. The
multiplication cannot overflow, but the result of the division may
not fit in 64 bits; you then get an exception on x86. This happens
only when m > d.

OK, for tsc-to-ns this only bites us after a few hundred years of uptime - or when we have settable tsc counters (does Linux tweak them beyond aligning on SMP?).

But there is also the risk the other way around: ns-to-tsc with frequency > 1GHz will fall apart (kernel oops!) when the user provides a large timeout in nanoseconds that we then try to convert to tsc. Not good. Wrong values are one thing, but oopses are even worse.

Any idea how to fix this?

Jan

