On Thu, Apr 3, 2008 at 2:50 PM, Jan Kiszka <[EMAIL PROTECTED]> wrote:
> Gilles Chanteperdrix wrote:
>> On Thu, Apr 3, 2008 at 2:17 PM, Jan Kiszka <[EMAIL PROTECTED]> wrote:
>>> Sebastian Smolorz wrote:
>>>> Gilles Chanteperdrix wrote:
>>>>> On Wed, Apr 2, 2008 at 5:58 PM, Sebastian Smolorz <[EMAIL PROTECTED]> wrote:
>>>>>> Jan Kiszka wrote:
>>>>>>> Sebastian Smolorz wrote:
>>>>>>>> Jan Kiszka wrote:
>>>>>>>>> Cornelius Köpp wrote:
>>>>>>>>>> I talked with Sebastian Smolorz about this and he built his
>>>>>>>>>> own independent kernel-config to check. He got the same
>>>>>>>>>> drifting effect with Xenomai 2.4.2 and Xenomai 2.4.3 running
>>>>>>>>>> latency over several hours. His kernel-config is attached as
>>>>>>>>>> 'config-2.6.24-xenomai-2.4.3__ssm'.
>>>>>>>>>>
>>>>>>>>>> Our kernel-configs are both based on a config used with
>>>>>>>>>> Xenomai 2.3.4 and Linux 2.6.20.15 without any drifting effects.
>>>>>>>>>
>>>>>>>>> 2.3.x did not incorporate the new TSC-to-ns conversion. Maybe
>>>>>>>>> it is not a PIC vs. APIC thing, but rather a rounding problem
>>>>>>>>> of larger TSC values (that naturally show up when the system
>>>>>>>>> runs for a longer time).
>>>>>>>>
>>>>>>>> This hint seems to point in the right direction. I tried out a
>>>>>>>> modified pod_32.h (xnarch_tsc_to_ns() commented out) so that the
>>>>>>>> old implementation in include/asm-generic/bits/pod.h was used.
>>>>>>>> The drifting bug disappeared. So there seems to be a buggy
>>>>>>>> x86-specific implementation of this routine.
>>>>>>>
>>>>>>> Hmm, maybe even a conceptual issue: the multiply-shift-based
>>>>>>> xnarch_tsc_to_ns is not as precise as the still
>>>>>>> multiply-divide-based xnarch_ns_to_tsc. So when converting from
>>>>>>> tsc over ns back to tsc, we may lose some bits, maybe too many
>>>>>>> bits...
>>>>>>>
>>>>>>> It looks like this bites us in the kernel latency tests (-t2
>>>>>>> should suffer as well). Those recalculate their timeouts each
>>>>>>> round based on absolute nanoseconds. In contrast, the periodic
>>>>>>> user mode task of -t0 uses a periodic timer that is forwarded via
>>>>>>> a tsc-based interval.
>>>>>>>
>>>>>>> You (or Cornelius) could try to analyse the calculation path of
>>>>>>> the involved timeouts, specifically to understand why the
>>>>>>> scheduled timeout of the underlying task timer (which is
>>>>>>> tsc-based) tends to diverge from the calculated one (ns-based).
>>>>>>
>>>>>> So here comes the explanation. The error is inside the function
>>>>>> rthal_llmulshft(). It returns wrong values which are too small -
>>>>>> the higher the given TSC value the bigger the error. The function
>>>>>> rtdm_clock_read_monotonic() calls rthal_llmulshft().
>>>>>> As rtdm_clock_read_monotonic() is called every time the latency
>>>>>> kernel thread runs [1], the values reported by latency become
>>>>>> smaller over time.
>>>>>>
>>>>>> In contrast, the latency task in user space uses the conversion
>>>>>> from TSC to ns only once, when calling rt_timer_inquire [2].
>>>>>> timer_info.date is too small, timer_info.tsc is right. So all
>>>>>> calculated deltas in [3] are shifted to a smaller value. This
>>>>>> value is constant during the runtime of latency in user space
>>>>>> because no more conversion from TSC to ns occurs.
>>>>>
>>>>> latency does conversions from tsc to ns, but it converts time
>>>>> differences, so the error is small relative to the results.
>>>>
>>>> Of course. I wasn't precise with my last statement. It should be:
>>>> No more conversions from *absolute* TSC values to ns occur.
>>>
>>> This patch may do the trick: it uses the inverted tsc-to-ns function
>>> instead of the frequency-based one. Be warned, it is totally untested
>>> inside Xenomai, I just ran it in a user space test program. But it
>>> may give an idea.
>>>
>>> Gilles, not sure if this is related to my quickly hacked test, but
>>> with RTHAL_CPU_FREQ = 800MHz and TSC = 0x7000000000000000 (or larger)
>>> I get an arithmetic exception with the rthal_llimd-based conversion
>>> to nanoseconds. Is there an input range we may have to exclude for
>>> rthal_llimd?
>>
>> rthal_llimd does a multiplication first, then a division. The
>> multiplication cannot overflow, but the result of the division may
>> not fit in 64 bits; you then get an exception on x86. This happens
>> only with m > d.
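
To make the m > d case concrete, here is a minimal sketch in C. It is not the Xenomai rthal_llimd() code: llimd_checked() is a hypothetical helper that does the same multiply-then-divide through a 128-bit intermediate (unsigned __int128, a GCC/Clang extension) and reports whether the quotient still fits in a signed 64-bit value. It assumes op >= 0 and d != 0.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, NOT the Xenomai rthal_llimd(): computes
 * (op * m) / d through a 128-bit intermediate and reports whether
 * the quotient fits in a signed 64-bit value.
 * Assumptions: op >= 0, d != 0, compiler supports unsigned __int128.
 */
static bool llimd_checked(int64_t op, uint32_t m, uint32_t d, int64_t *res)
{
    /* 64-bit operand times 32-bit multiplier: at most 96 bits, no overflow. */
    unsigned __int128 prod = (unsigned __int128)(uint64_t)op * m;
    unsigned __int128 quot = prod / d;

    /* With m > d the quotient can exceed what a 64-bit result can hold. */
    if (quot > (unsigned __int128)INT64_MAX)
        return false;

    *res = (int64_t)quot;
    return true;
}
```

With the values from the report (op = 0x7000000000000000, m = 1000000000, d = 800000000) the helper returns false, because the quotient exceeds INT64_MAX. Swapping m and d gives a quotient that fits.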
>
> OK, for tsc-to-ns this only bites us after a few hundred years of
> uptime - or when we have settable tsc counters (does Linux tweak them
> beyond aligning on SMP?).
>
> But there is also the risk the other way around: ns-to-tsc with
> frequency > 1GHz will fall apart (kernel oops!) when the user provides
> a large timeout in nanoseconds that we then try to convert to tsc. Not
> good. Wrong values are one thing, but oopses are even worse.
>
> Any idea how to fix this?
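
The ns-to-tsc risk can be quantified with a rough sketch, assuming the conversion is ns * freq_hz / 1e9 with a signed 64-bit result. max_convertible_ns() is a hypothetical name, not an existing Xenomai function, and unsigned __int128 is a GCC/Clang extension.

```c
#include <stdint.h>

/*
 * Hypothetical helper: largest non-negative nanosecond value for which
 * floor(ns * freq_hz / 1e9) still fits in a signed 64-bit TSC count.
 * Assumptions: freq_hz != 0, compiler supports unsigned __int128.
 */
static int64_t max_convertible_ns(uint64_t freq_hz)
{
    const unsigned __int128 ns_per_sec = 1000000000u;
    const unsigned __int128 limit = (unsigned __int128)INT64_MAX;

    /* Need ns * freq_hz < (INT64_MAX + 1) * 1e9. */
    unsigned __int128 max_ns = ((limit + 1) * ns_per_sec - 1) / freq_hz;

    /* For freq_hz <= 1 GHz every non-negative int64_t input is safe. */
    if (max_ns > limit)
        return INT64_MAX;

    return (int64_t)max_ns;
}
```

Under these assumptions a 2 GHz clock gives a limit of 2^62 - 1 ns, roughly 146 years, so only very large timeout arguments reach the failure case, while at 800 MHz every non-negative input converts.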

Since in the failure case the result of llimd would need 96 bits, the
only way is to make the llimd result a 96-bit variable.

--
Gilles
_______________________________________________
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core