Stelian Pop wrote:
> On Sunday 08 January 2006 at 18:56 +0200, Heikki Lindholm wrote:
>
>>>>Some recent changes (*cough* RTDM benchmark driver *cough*) broke kernel
>>>>mode benchmarking for ppc64. Previously klatency worked fine, but now
>>>>latency -t 1 crashes somewhere in xnpod_schedule. Jan, any pending
>>>>patches a comin'?
>
> So it seems I'm not alone.
>
> I have done some additional debugging on this issue over the last few
> days. I still haven't found the bug, but I have narrowed it down a bit.
>
>>>Nope, it should work as it is. But as Stelian also reported problems on
>>>his fresh ARM port with the in-kernel test, I cannot exclude that there
>>>/might/ be a problem in the benchmark.
>>>
>>>As I don't have any ppc64 hanging around somewhere, we will have to go
>>>through this together. Things I would like to know:
>>
>>Dammit, I hoped you'd whip up a fix just from me noting a problem. Well,
>>all right then, I'll play along...;)
>>
>>> o When and how does it crash? At start-up immediately? Or after a
>>>   while?
>>
>>I inserted some serial debug prints and it gets two passes to
>>eval_outer_loop done (enter/exit function). After that it freezes.
>
> It freezes exactly upon the invocation of rtdm_event_pulse() which
> causes a scheduling. In xnpod_schedule, the scheduler queue has been
> corrupted and this causes the illegal accesses.

Do you mean the synch-queue inside result_event or the global run-queue?
I saw in your test that you even moved result_event out of the context
structure and turned it into a global variable. So it seems that the
timerbench itself does not overwrite it. Does it crash on the second or
already on the first invocation of rtdm_event_flush()?

>>Without the debug printing it dies with kernel access of illegal memory
>>at xnpod_schedule, which btw. has been quite a common place to die.
>
> Same for me.
>
>>> o Are there any details / backtraces available with the crash?
>>
>>Backtrace is limited to xnpod_schedule if I remember right.
>
> Same for me. But very often I don't even get a backtrace, it just hangs.
>
>>> o Does -t2 work?
>>
>>Umm. Probably not. See below.
>
> Heikki said in a later mail that it works for him, and so it does for me
> too.
>
>>> o What happens if you disable "rtdm_event_pulse(&ctx->result_event);"
>>>   in eval_outer_loop (thus no signalling of intermediate results during
>>>   the test)? Does it still crash, maybe later during cleanup now?
>
>>Doesn't freeze and can be exited with ctrl-c and even re-run.
>
> Same for me.
>
> Some additional information: I've disabled FPU handling in Xeno and it
> doesn't change anything, it still crashes.
>
> As I said before, the old klatency test does work reliably for me, with
> the latest Xenomai.
>
> I tried moving the 'display' thread into the kernel, and in this
> configuration it no longer crashes.

Hmm, the RTDM kernel entry/exit path? But what evil thing should happen
there? Weird.

> I've started simplifying the code trying to get to the simplest code
> which does have the problem. The result is at
> http://www.popies.net/tmp/xenobug/bug.tgz if somebody wants to take a
> look.

Re-checked your test on x86, no problems (this was obvious, I did a lot
of tests the last week with that setup and that particular piece of code).

> I'll be working on this again tomorrow...

Now that you know which piece of memory gets corrupted, you can add
checks or print its contents in the code to see which function is the
offender. There is also a nice switch called CONFIG_XENO_OPT_DEBUG which
may provide some hints regarding incorrect queue usage (although such
things should have been triggered on other archs as well).

Jan
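
A minimal sketch of the kind of check suggested above, assuming the
corrupted structure is a circular doubly-linked queue with a sentinel
head. The types and names (struct qnode, struct queue, queue_check) are
hypothetical and are not the Xenomai nucleus queue API; in kernel code
the fprintf() would be a printk(). Calling the checker before and after
each suspect function narrows down where the links first go bad.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical queue layout for illustration; NOT the Xenomai type. */
struct qnode {
    struct qnode *next;
    struct qnode *prev;
};

struct queue {
    struct qnode head;  /* sentinel element, never holds data */
    int count;          /* number of linked elements */
};

static void queue_init(struct queue *q)
{
    q->head.next = q->head.prev = &q->head;
    q->count = 0;
}

static void queue_append(struct queue *q, struct qnode *n)
{
    n->prev = q->head.prev;
    n->next = &q->head;
    q->head.prev->next = n;
    q->head.prev = n;
    q->count++;
}

/*
 * Walk the queue and verify that every forward link has a matching
 * backward link and that the element count is right.
 * Returns 1 if consistent, 0 otherwise; 'where' names the call site.
 * This only catches NULL or mismatched links, not arbitrary wild
 * pointers, which may still fault when dereferenced.
 */
static int queue_check(const struct queue *q, const char *where)
{
    const struct qnode *n = &q->head;
    int seen = 0;

    do {
        if (n->next == NULL || n->next->prev != n) {
            fprintf(stderr, "%s: broken link after %d element(s)\n",
                    where, seen);
            return 0;
        }
        n = n->next;
        if (n != &q->head)
            seen++;
    } while (n != &q->head && seen <= q->count);

    if (seen != q->count) {
        fprintf(stderr, "%s: walked %d element(s), expected %d\n",
                where, seen, q->count);
        return 0;
    }
    return 1;
}
```

For example, a call such as queue_check(&q, "before pulse") placed right
before the signalling call, and another right after it, would show
whether the queue is already damaged on entry.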