Re: [Xenomai] [Emc-developers] new RTOS status: Scheduler (?) lockup on ARM - SUMMARY

2013-01-22 Thread Bas Laarhoven

On 21-1-2013 22:20, Michael Haberler wrote:

On 21.01.2013 at 20:10, Gilles Chanteperdrix wrote:


On 01/21/2013 02:32 PM, Michael Haberler wrote:


On 21.01.2013 at 12:56, Gilles Chanteperdrix wrote:



question: does a RTC time warp have any possible bearing on
Xenomai operations?


No, it should not, Xenomai uses its own clock, which is set only
once upon boot, so, is unaffected by Linux wallclock time
changes... or should be.


it might not be Xenomai after all. Uhum.

the bughunt safari tribe has decided to focus on class 'duh' problems
and resolves to shut up until red hands are spotted.


I would still put the check in the timer set_next_event callback, just
in case...

I assume Bas will give the postmortem shortly - he nailed the issue; the RTC 
boot timewarp makes for a lost DHCP lease midflight and NFS freezing, making it 
look like a kernel hang.

relieved,

- Michael


Michael said it all, there's not much for me to add. I'll summarize the 
case for the record ;)


Lesson learned: Change only one variable at a time and don't assume 
anything!


I had been using an NFS-mounted filesystem with the Beaglebone for over a 
year without problems and had gotten used to its reliability (as I was 
used to in a corporate environment in the past).
Because the Xenomai software was built with libraries (eabihf) not 
compatible with my (eabi) system, I switched to the Ubuntu image Michael 
built, and everything seemed to work fine, except that the (Xenomai) 
kernel froze after around 50-60 minutes of uptime. With the JTAG 
debugger I could see the kernel still running, but all applications 
(both text and X via SSH, and console via serial/USB connection) seemed 
frozen and there was no output indicating what was going on. Of course 
the Xenomai kernel was the first suspect, but that proved to be a 
mistake. With hindsight, knowing the cause of the freeze now, I wonder 
why I never got the NFS connection time-out message on the console, 
but for some reason or another that isn't generated in this case.


The underlying problem is that the Beaglebone has no battery-backed 
real-time clock. This causes a serious problem (a freeze) only when three 
conditions coincide: (1) a network-mounted NFS root filesystem, (2) an 
initial kernel time lying in the past, and (3) a DHCP lease time shorter 
than some multiple (in this case 2x) of the required system uptime.


Ubuntu (and maybe Debian too) systems are obviously not designed to 
start with a completely wrong real-time clock value. And dhclient 
(like many other programs) is not designed to handle the large time step 
that's generated once the clock is set properly sometime during the boot 
process.
Note that if the filesystem is on local storage (e.g. flash or 
hard disk), there will only be a short disruption of the network 
connection and it's likely that the problem won't be noticed at all.
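
To illustrate why a large time step can kill a lease, here is a minimal sketch in Python. It is not a description of dhclient's internals: the Lease class, its field names and the one-hour lease time are all assumptions made for the example. It only shows that an expiry stored as an absolute wall-clock date is overtaken the moment the clock jumps from 1970 to 2013, while an expiry based on elapsed (monotonic) time is not.

```python
from dataclasses import dataclass


@dataclass
class Lease:
    """A toy DHCP lease; field names are illustrative, not dhclient's."""
    obtained_wall: float  # wall-clock time when the lease was obtained (s)
    obtained_mono: float  # monotonic time when the lease was obtained (s)
    duration: float       # lease time granted by the server (s)

    def expired_by_wall_clock(self, now_wall: float) -> bool:
        # The expiry is an absolute date, so a clock step can overtake it.
        return now_wall >= self.obtained_wall + self.duration

    def expired_by_monotonic(self, now_mono: float) -> bool:
        # Only elapsed time counts, so a clock step has no effect.
        return now_mono - self.obtained_mono >= self.duration


# The board boots with its clock near the epoch (1970) and gets a lease.
lease = Lease(obtained_wall=30.0, obtained_mono=30.0, duration=3600.0)

# One minute later the wall clock is stepped to 2013-01-22 00:00:00 UTC.
now_mono = 90.0
now_wall = 1358812800.0

print("expired according to wall clock:", lease.expired_by_wall_clock(now_wall))
print("expired according to elapsed time:", lease.expired_by_monotonic(now_mono))
```

With a local root filesystem the lost lease is only a short network hiccup; with an NFS root, every process that touches the filesystem blocks, which matches the frozen applications described above.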


A final solution hasn't been found yet: I prefer a workaround without 
changing the dhclient or some other standard program. I think it would 
suffice to acquire a new lease right after the time-step has been made. 
This has to be done without giving up the previous lease (that has 
expired because of the time-step), because that would cause the system 
to freeze again. Suggestions on how to do this are welcome. I can't 
spend much more time on this issue this week.
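
One possible shape for such a workaround, as a sketch only: a small watcher that notices the time step and then calls a hook. The step is detected by comparing the wall clock with the monotonic clock, which is not affected when the time is set. The function names, the 60 second threshold and the on_step hook are assumptions made for this example; what the hook should run to obtain a new lease without dropping the current address is exactly the open question above, so it is left as a caller-supplied callback.

```python
import time
from typing import Callable, Optional


def wall_minus_mono() -> float:
    """Offset between the wall clock and the monotonic clock, in seconds."""
    return time.time() - time.monotonic()


def step_detected(prev_offset: float, new_offset: float,
                  threshold_s: float = 60.0) -> bool:
    """True if the wall clock jumped by more than threshold_s between samples."""
    return abs(new_offset - prev_offset) > threshold_s


def watch_for_step(on_step: Callable[[], None],
                   poll_s: float = 1.0,
                   max_polls: Optional[int] = None) -> bool:
    """Poll until a clock step is seen, then call on_step() once.

    Returns True if a step was seen, False if max_polls ran out first.
    on_step is a hypothetical hook: it should request a fresh lease
    while leaving the current address configured.
    """
    prev = wall_minus_mono()
    polls = 0
    while max_polls is None or polls < max_polls:
        time.sleep(poll_s)
        cur = wall_minus_mono()
        if step_detected(prev, cur):
            on_step()
            return True
        prev = cur
        polls += 1
    return False
```

Started early in the boot sequence, before the clock is set, watch_for_step() would return right after the step, so the renewal happens once and the watcher can exit.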


-- Bas








___
Xenomai mailing list
Xenomai@xenomai.org
http://www.xenomai.org/mailman/listinfo/xenomai





Re: [Xenomai] [Emc-developers] new RTOS status: Scheduler (?) lockup on ARM

2013-01-21 Thread Michael Haberler
the suspicion now turned to the DHCP lease setting and RTC time warp issues - 
the Beaglebone doesn't have an RTC so it starts up at 1-1-1970

the first DHCP lease still has 1970 timestamps, but eventually the RTC is set 
with ntpdate and it could be this causes confusion

the thing which is hard to believe for me: loss of IP connectivity - 
conceivable; kernel hang - why?

question: does a RTC time warp have any possible bearing on Xenomai operations?

- Michael


Re: [Xenomai] [Emc-developers] new RTOS status: Scheduler (?) lockup on ARM

2013-01-21 Thread Michael Haberler

On 21.01.2013 at 12:56, Gilles Chanteperdrix wrote:

 On 01/21/2013 12:43 PM, Michael Haberler wrote:
 
 the suspicion now turned to the DHCP lease setting and RTC time warp
 issues - the Beaglebone doesn't have an RTC so it starts up at
 1-1-1970
 
 the first DHCP lease still has 1970 timestamps, but eventually the
 RTC is set with ntpdate and it could be this causes confusion
 
 the thing which is hard to believe for me: loss of IP connectivity -
 conceivable; kernel hang - why?
 
 question: does a RTC time warp have any possible bearing on Xenomai
 operations?
 
 
 No, it should not, Xenomai uses its own clock, which is set only once
 upon boot, so, is unaffected by Linux wallclock time changes... or
 should be.


it might not be Xenomai after all. Uhum.

the bughunt safari tribe has decided to focus on class 'duh' problems and 
resolves to shut up until red hands are spotted.

--

btw the upgrade to the ipipe patch in master made all xeno-regression-test 
problems go away - thanks!

-Michael





Re: [Xenomai] [Emc-developers] new RTOS status: Scheduler (?) lockup on ARM

2013-01-21 Thread Gilles Chanteperdrix
On 01/21/2013 02:32 PM, Michael Haberler wrote:

 
 On 21.01.2013 at 12:56, Gilles Chanteperdrix wrote:
 
 On 01/21/2013 12:43 PM, Michael Haberler wrote:
 
 the suspicion now turned to the DHCP lease setting and RTC time
 warp issues - the Beaglebone doesn't have an RTC so it starts up
 at 1-1-1970
 
 the first DHCP lease still has 1970 timestamps, but eventually
 the RTC is set with ntpdate and it could be this causes
 confusion
 
 the thing which is hard to believe for me: loss of IP
 connectivity - conceivable; kernel hang - why?
 
 question: does a RTC time warp have any possible bearing on
 Xenomai operations?
 
 
 No, it should not, Xenomai uses its own clock, which is set only
 once upon boot, so, is unaffected by Linux wallclock time
 changes... or should be.
 
 
 it might not be Xenomai after all. Uhum.
 
 the bughunt safari tribe has decided to focus on class 'duh' problems
 and resolves to shut up until red hands are spotted.


I would still put the check in the timer set_next_event callback, just
in case...


-- 
Gilles.



Re: [Xenomai] [Emc-developers] new RTOS status: Scheduler (?) lockup on ARM

2013-01-21 Thread Michael Haberler

On 21.01.2013 at 20:10, Gilles Chanteperdrix wrote:

 On 01/21/2013 02:32 PM, Michael Haberler wrote:
 
 
 On 21.01.2013 at 12:56, Gilles Chanteperdrix wrote:


 question: does a RTC time warp have any possible bearing on
 Xenomai operations?
 
 
 No, it should not, Xenomai uses its own clock, which is set only
 once upon boot, so, is unaffected by Linux wallclock time
 changes... or should be.
 
 
 it might not be Xenomai after all. Uhum.
 
 the bughunt safari tribe has decided to focus on class 'duh' problems
 and resolves to shut up until red hands are spotted.
 
 
 I would still put the check in the timer set_next_event callback, just
 in case...

I assume Bas will give the postmortem shortly - he nailed the issue; the RTC 
boot timewarp makes for a lost DHCP lease midflight and NFS freezing, making it 
look like a kernel hang.

relieved,

- Michael








Re: [Xenomai] [Emc-developers] new RTOS status: Scheduler (?) lockup on ARM

2013-01-19 Thread Gilles Chanteperdrix
On 01/17/2013 02:30 PM, Bas Laarhoven wrote:

 On 17-1-2013 9:53, Gilles Chanteperdrix wrote:
 On 01/17/2013 08:59 AM, Bas Laarhoven wrote:

 On 16-1-2013 20:36, Michael Haberler wrote:
 On 16.01.2013 at 17:45, Bas Laarhoven wrote:

 On 16-1-2013 15:15, Michael Haberler wrote:
 ARM work:

 Several people have been able to get the Beaglebone ubuntu/xenomai setup 
 working as outlined here: 
 http://wiki.linuxcnc.org/cgi-bin/wiki.pl?BeagleboneDevsetup
 I have updated the kernel and rootfs image a few days ago so the kernel 
 includes ext2/3/4 support compiled in, which should take care of two 
 failure reports I got.

 Again that xenomai kernel is based on 3.2.21; it has been very stable for 
 me but there have been several reports of 'sudden stops'. The BB is a 
 bit sensitive to power fluctuations but it might be more than that. As 
 for that kernel, it works, but it is based on a branch which will see no 
 further development. It supports most of the stuff needed for 
 development; there might be some patches coming from more active BB 
 users than me.
 Hi Michael,

 Are you saying you haven't seen these 'sudden stops' yourself?
 No, never, after swapping to stronger power supplies; I have two of these 
 boards running over NFS all the time. I don't have Linuxcnc running on them 
 though, I'll do that and see if that changes the picture. Maybe keeping 
 the torture test running helps trigger it.
 Beginner's error! :-P The power supply is indeed critical, but the
 stepdown converter on my BeBoPr is dimensioned for at least 2A and
 hasn't failed me yet.

 I think that running linuxcnc is mandatory for the lockup. After a dozen
 runs, it looks like I can reproduce the lockup with 100% certainty
 within one hour.
 Using the JTAG interface to attach a debugger to the Bone, I've found
 that once stalled the kernel is still running. It looks like it won't
 schedule properly and almost all time is spent in the cpu_idle thread.

 This is typical of a tsc emulation or timer issue. On a system without
 anything running, please let the tsc -w command run. It will take some
 time to run (the wrap time of the hardware timer used for tsc
 emulation), if it runs correctly, then you need to check whether the
 timer is still running when the bug happens (cat /proc/xenomai/irq
 should continue increasing when for instance the latency test is
 running). If the timer is stopped, it may have been programmed for a too
 short delay, to avoid that, you can try:
 - increasing the ipipe_timer min_delay_ticks member (by default, it uses
 a value corresponding to the min_delta_ns member in the clockevent
 structure);
 - checking after programming the timer (in the set_next_event method) if
 the timer counter is already 0, in which case you can return a negative
 value, usually -ETIME.

 
 Hi Gilles,
 
 Thanks for the swift reply.
 
 As far as I can see, tsc -w runs without an error:
 
 ARM: counter wrap time: 179 seconds
 Checking tsc for 6 minute(s)
 min: 5, max: 12, avg: 5.04168
 ...
 min: 5, max: 6, avg: 5.03771
 min: 5, max: 28, avg: 5.03989 - 0.209995 us
 
 real 6m0.284s
 
 I've also done the other regression tests and all were successful.
 
 Problem is that once the bug happens I won't be able to issue the cat 
 command.
 I've fixed my debug setup so I don't have to use the System.map to 
 manually translate the debugger addresses : /
 Now I'm waiting for another lockup to see what's happening.


You may want to have a look at the xeno-regression-test script to put
your system under pressure (and likely generate the lockup faster).

-- 
Gilles.



Re: [Xenomai] [Emc-developers] new RTOS status: Scheduler (?) lockup on ARM

2013-01-19 Thread Michael Haberler

On 19.01.2013 at 14:29, Gilles Chanteperdrix wrote:

 On 01/17/2013 02:30 PM, Bas Laarhoven wrote:
 
 On 17-1-2013 9:53, Gilles Chanteperdrix wrote:
 On 01/17/2013 08:59 AM, Bas Laarhoven wrote:
 
 On 16-1-2013 20:36, Michael Haberler wrote:
 On 16.01.2013 at 17:45, Bas Laarhoven wrote:
 
 On 16-1-2013 15:15, Michael Haberler wrote:
 ARM work:
 
 Several people have been able to get the Beaglebone ubuntu/xenomai 
 setup working as outlined here: 
 http://wiki.linuxcnc.org/cgi-bin/wiki.pl?BeagleboneDevsetup
 I have updated the kernel and rootfs image a few days ago so the kernel 
 includes ext2/3/4 support compiled in, which should take care of two 
 failure reports I got.
 
 Again that xenomai kernel is based on 3.2.21; it has been very stable for 
 me but there have been several reports of 'sudden stops'. The BB is a 
 bit sensitive to power fluctuations but it might be more than that. As 
 for that kernel, it works, but it is based on a branch which will see 
 no further development. It supports most of the stuff needed for 
 development; there might be some patches coming from more active BB 
 users than me.
 Hi Michael,
 
 Are you saying you haven't seen these 'sudden stops' yourself?
 No, never, after swapping to stronger power supplies; I have two of these 
 boards running over NFS all the time. I don't have Linuxcnc running on 
 them though, I'll do that and see if that changes the picture. Maybe 
 keeping the torture test running helps trigger it.
 Beginner's error! :-P The power supply is indeed critical, but the
 stepdown converter on my BeBoPr is dimensioned for at least 2A and
 hasn't failed me yet.
 
 I think that running linuxcnc is mandatory for the lockup. After a dozen
 runs, it looks like I can reproduce the lockup with 100% certainty
 within one hour.
 Using the JTAG interface to attach a debugger to the Bone, I've found
 that once stalled the kernel is still running. It looks like it won't
 schedule properly and almost all time is spent in the cpu_idle thread.
 
 This is typical of a tsc emulation or timer issue. On a system without
 anything running, please let the tsc -w command run. It will take some
 time to run (the wrap time of the hardware timer used for tsc
 emulation), if it runs correctly, then you need to check whether the
 timer is still running when the bug happens (cat /proc/xenomai/irq
 should continue increasing when for instance the latency test is
 running). If the timer is stopped, it may have been programmed for a too
 short delay, to avoid that, you can try:
 - increasing the ipipe_timer min_delay_ticks member (by default, it uses
 a value corresponding to the min_delta_ns member in the clockevent
 structure);
 - checking after programming the timer (in the set_next_event method) if
 the timer counter is already 0, in which case you can return a negative
 value, usually -ETIME.
 
 
 Hi Gilles,
 
 Thanks for the swift reply.
 
 As far as I can see, tsc -w runs without an error:
 
 ARM: counter wrap time: 179 seconds
 Checking tsc for 6 minute(s)
 min: 5, max: 12, avg: 5.04168
 ...
 min: 5, max: 6, avg: 5.03771
 min: 5, max: 28, avg: 5.03989 - 0.209995 us
 
 real 6m0.284s
 
 I've also done the other regression tests and all were successful.
 
 Problem is that once the bug happens I won't be able to issue the cat 
 command.
 I've fixed my debug setup so I don't have to use the System.map to 
 manually translate the debugger addresses : /
 Now I'm waiting for another lockup to see what's happening.
 
 
 You may want to have a look at the xeno-regression-test script to put
 your system under pressure (and likely generate the lockup faster).

running tsc -w and xeno-regression-test in parallel I get errors like so (not 
on every run; no lockup so far):

++ /usr/xenomai/bin/mutex-torture-native
simple_wait
recursive_wait
timed_mutex
mode_switch
pi_wait
lock_stealing
NOTE: lock_stealing mutex_trylock: not supported
deny_stealing
simple_condwait
recursive_condwait
auto_switchback
FAILURE: current prio (0) != expected prio (2)

dmesg 
[501963.390598] Xenomai: native: cleaning up mutex  (ret=0).
[502170.164984] usb 1-1: reset high-speed USB device number 2 using musb-hdrc

on another run, I got a segfault while running sigdebug:
++ /usr/xenomai/bin/regression/native/sigdebug
mayday page starting at 0x400eb000 [/dev/rtheap]
mayday code: 0c 00 9f e5 0c 70 9f e5 00 00 00 ef 00 00 a0 e3 00 00 80 e5 2b 02 
00 0a 42 00 0f 00 db d7 ee b8
mlockall
syscall
signal
relaxed mutex owner
page fault
watchdog
./xeno-regression-test: line 53:  4210 Segmentation fault  
/usr/xenomai/bin/regression/native/sigdebug

root@bb1:/usr/xenomai/bin# dmesg 
[502442.312996] Xenomai: watchdog triggered -- signaling runaway thread 
'rt_task'
[502443.054186] Xenomai: native: cleaning up mutex prio_invert (ret=0).
[502443.055730] Xenomai: native: cleaning up sem send_signal (ret=0).
[502518.134977] usb 1-1: reset high-speed USB device number 2 using musb-hdrc


unsure what to make of it - any 

Re: [Xenomai] [Emc-developers] new RTOS status: Scheduler (?) lockup on ARM

2013-01-19 Thread Gilles Chanteperdrix
On 01/19/2013 03:09 PM, Michael Haberler wrote:

 
 On 19.01.2013 at 14:29, Gilles Chanteperdrix wrote:
 
 On 01/17/2013 02:30 PM, Bas Laarhoven wrote:

 On 17-1-2013 9:53, Gilles Chanteperdrix wrote:
 On 01/17/2013 08:59 AM, Bas Laarhoven wrote:

 On 16-1-2013 20:36, Michael Haberler wrote:
 On 16.01.2013 at 17:45, Bas Laarhoven wrote:

 On 16-1-2013 15:15, Michael Haberler wrote:
 ARM work:

 Several people have been able to get the Beaglebone ubuntu/xenomai 
 setup working as outlined here: 
 http://wiki.linuxcnc.org/cgi-bin/wiki.pl?BeagleboneDevsetup
 I have updated the kernel and rootfs image a few days ago so the 
 kernel includes ext2/3/4 support compiled in, which should take care 
 of two failure reports I got.

 Again that xenomai kernel is based on 3.2.21; it has been very stable for 
 me but there have been several reports of 'sudden stops'. The BB is a 
 bit sensitive to power fluctuations but it might be more than that. As 
 for that kernel, it works, but it is based on a branch which will see 
 no further development. It supports most of the stuff needed for 
 development; there might be some patches coming from more active BB 
 users than me.
 Hi Michael,

 Are you saying you haven't seen these 'sudden stops' yourself?
 No, never, after swapping to stronger power supplies; I have two of 
 these boards running over NFS all the time. I don't have Linuxcnc running 
 on them though, I'll do that and see if that changes the picture. Maybe 
 keeping the torture test running helps trigger it.
 Beginner's error! :-P The power supply is indeed critical, but the
 stepdown converter on my BeBoPr is dimensioned for at least 2A and
 hasn't failed me yet.

 I think that running linuxcnc is mandatory for the lockup. After a dozen
 runs, it looks like I can reproduce the lockup with 100% certainty
 within one hour.
 Using the JTAG interface to attach a debugger to the Bone, I've found
 that once stalled the kernel is still running. It looks like it won't
 schedule properly and almost all time is spent in the cpu_idle thread.

 This is typical of a tsc emulation or timer issue. On a system without
 anything running, please let the tsc -w command run. It will take some
 time to run (the wrap time of the hardware timer used for tsc
 emulation), if it runs correctly, then you need to check whether the
 timer is still running when the bug happens (cat /proc/xenomai/irq
 should continue increasing when for instance the latency test is
 running). If the timer is stopped, it may have been programmed for a too
 short delay, to avoid that, you can try:
 - increasing the ipipe_timer min_delay_ticks member (by default, it uses
 a value corresponding to the min_delta_ns member in the clockevent
 structure);
 - checking after programming the timer (in the set_next_event method) if
 the timer counter is already 0, in which case you can return a negative
 value, usually -ETIME.


 Hi Gilles,

 Thanks for the swift reply.

 As far as I can see, tsc -w runs without an error:

 ARM: counter wrap time: 179 seconds
 Checking tsc for 6 minute(s)
 min: 5, max: 12, avg: 5.04168
 ...
 min: 5, max: 6, avg: 5.03771
 min: 5, max: 28, avg: 5.03989 - 0.209995 us

 real 6m0.284s

 I've also done the other regression tests and all were successful.

 Problem is that once the bug happens I won't be able to issue the cat 
 command.
 I've fixed my debug setup so I don't have to use the System.map to 
 manually translate the debugger addresses : /
 Now I'm waiting for another lockup to see what's happening.


 You may want to have a look at the xeno-regression-test script to put
 your system under pressure (and likely generate the lockup faster).
 
 running tsc -w and xeno-regression-test in parallel I get errors like so (not 
 on every run; no lockup so far):
 
 ++ /usr/xenomai/bin/mutex-torture-native
 simple_wait
 recursive_wait
 timed_mutex
 mode_switch
 pi_wait
 lock_stealing
 NOTE: lock_stealing mutex_trylock: not supported
 deny_stealing
 simple_condwait
 recursive_condwait
 auto_switchback
 FAILURE: current prio (0) != expected prio (2)
 
 dmesg 
 [501963.390598] Xenomai: native: cleaning up mutex  (ret=0).
 [502170.164984] usb 1-1: reset high-speed USB device number 2 using musb-hdrc
 
 on another run, I got a segfault while running sigdebug:
 ++ /usr/xenomai/bin/regression/native/sigdebug
 mayday page starting at 0x400eb000 [/dev/rtheap]
 mayday code: 0c 00 9f e5 0c 70 9f e5 00 00 00 ef 00 00 a0 e3 00 00 80 e5 2b 
 02 00 0a 42 00 0f 00 db d7 ee b8
 mlockall
 syscall
 signal
 relaxed mutex owner
 page fault
 watchdog
 ./xeno-regression-test: line 53:  4210 Segmentation fault  
 /usr/xenomai/bin/regression/native/sigdebug
 
 root@bb1:/usr/xenomai/bin# dmesg 
 [502442.312996] Xenomai: watchdog triggered -- signaling runaway thread 
 'rt_task'
 [502443.054186] Xenomai: native: cleaning up mutex prio_invert (ret=0).
 [502443.055730] Xenomai: native: cleaning up sem send_signal (ret=0).
 [502518.134977] usb 1-1: reset high-speed 

Re: [Xenomai] [Emc-developers] new RTOS status: Scheduler (?) lockup on ARM

2013-01-19 Thread Michael Haberler

On 19.01.2013 at 15:10, Gilles Chanteperdrix wrote:

 On 01/19/2013 03:09 PM, Michael Haberler wrote:
 
 
 On 19.01.2013 at 14:29, Gilles Chanteperdrix wrote:
 
 On 01/17/2013 02:30 PM, Bas Laarhoven wrote:
 
 On 17-1-2013 9:53, Gilles Chanteperdrix wrote:
 On 01/17/2013 08:59 AM, Bas Laarhoven wrote:
 
 On 16-1-2013 20:36, Michael Haberler wrote:
 On 16.01.2013 at 17:45, Bas Laarhoven wrote:
 
 On 16-1-2013 15:15, Michael Haberler wrote:
 ARM work:
 
 Several people have been able to get the Beaglebone ubuntu/xenomai 
 setup working as outlined here: 
 http://wiki.linuxcnc.org/cgi-bin/wiki.pl?BeagleboneDevsetup
 I have updated the kernel and rootfs image a few days ago so the 
 kernel includes ext2/3/4 support compiled in, which should take care 
 of two failure reports I got.
 
 Again that xenomai kernel is based on 3.2.21; it has been very stable 
 for me but there have been several reports of 'sudden stops'. The BB 
 is a bit sensitive to power fluctuations but it might be more than 
 that. As for that kernel, it works, but it is based on a branch which 
 will see no further development. It supports most of the stuff needed 
 for development; there might be some patches coming from more active 
 BB users than me.
 Hi Michael,
 
 Are you saying you haven't seen these 'sudden stops' yourself?
 No, never, after swapping to stronger power supplies; I have two of 
 these boards running over NFS all the time. I don't have Linuxcnc 
 running on them though, I'll do that and see if that changes the 
 picture. Maybe keeping the torture test running helps trigger it.
 Beginner's error! :-P The power supply is indeed critical, but the
 stepdown converter on my BeBoPr is dimensioned for at least 2A and
 hasn't failed me yet.
 
 I think that running linuxcnc is mandatory for the lockup. After a dozen
 runs, it looks like I can reproduce the lockup with 100% certainty
 within one hour.
 Using the JTAG interface to attach a debugger to the Bone, I've found
 that once stalled the kernel is still running. It looks like it won't
 schedule properly and almost all time is spent in the cpu_idle thread.
 
 This is typical of a tsc emulation or timer issue. On a system without
 anything running, please let the tsc -w command run. It will take some
 time to run (the wrap time of the hardware timer used for tsc
 emulation), if it runs correctly, then you need to check whether the
 timer is still running when the bug happens (cat /proc/xenomai/irq
 should continue increasing when for instance the latency test is
 running). If the timer is stopped, it may have been programmed for a too
 short delay, to avoid that, you can try:
 - increasing the ipipe_timer min_delay_ticks member (by default, it uses
 a value corresponding to the min_delta_ns member in the clockevent
 structure);
 - checking after programming the timer (in the set_next_event method) if
 the timer counter is already 0, in which case you can return a negative
 value, usually -ETIME.
 
 
 Hi Gilles,
 
 Thanks for the swift reply.
 
 As far as I can see, tsc -w runs without an error:
 
 ARM: counter wrap time: 179 seconds
 Checking tsc for 6 minute(s)
 min: 5, max: 12, avg: 5.04168
 ...
 min: 5, max: 6, avg: 5.03771
 min: 5, max: 28, avg: 5.03989 - 0.209995 us
 
 real 6m0.284s
 
 I've also done the other regression tests and all were successful.
 
 Problem is that once the bug happens I won't be able to issue the cat 
 command.
 I've fixed my debug setup so I don't have to use the System.map to 
 manually translate the debugger addresses : /
 Now I'm waiting for another lockup to see what's happening.
 
 
 You may want to have a look at the xeno-regression-test script to put
 your system under pressure (and likely generate the lockup faster).
 
 running tsc -w and xeno-regression-test in parallel I get errors like so 
 (not on every run; no lockup so far):
 
 ++ /usr/xenomai/bin/mutex-torture-native
 simple_wait
 recursive_wait
 timed_mutex
 mode_switch
 pi_wait
 lock_stealing
 NOTE: lock_stealing mutex_trylock: not supported
 deny_stealing
 simple_condwait
 recursive_condwait
 auto_switchback
 FAILURE: current prio (0) != expected prio (2)
 
 dmesg 
 [501963.390598] Xenomai: native: cleaning up mutex  (ret=0).
 [502170.164984] usb 1-1: reset high-speed USB device number 2 using musb-hdrc
 
 on another run, I got a segfault while running sigdebug:
 ++ /usr/xenomai/bin/regression/native/sigdebug
 mayday page starting at 0x400eb000 [/dev/rtheap]
 mayday code: 0c 00 9f e5 0c 70 9f e5 00 00 00 ef 00 00 a0 e3 00 00 80 e5 2b 
 02 00 0a 42 00 0f 00 db d7 ee b8
 mlockall
 syscall
 signal
 relaxed mutex owner
 page fault
 watchdog
 ./xeno-regression-test: line 53:  4210 Segmentation fault  
 /usr/xenomai/bin/regression/native/sigdebug
 
 root@bb1:/usr/xenomai/bin# dmesg 
 [502442.312996] Xenomai: watchdog triggered -- signaling runaway thread 
 'rt_task'
 [502443.054186] Xenomai: native: cleaning up mutex prio_invert (ret=0).
 [502443.055730] Xenomai: native: 

Re: [Xenomai] [Emc-developers] new RTOS status: Scheduler (?) lockup on ARM

2013-01-19 Thread Gilles Chanteperdrix
On 01/19/2013 03:14 PM, Michael Haberler wrote:

 that was xenomai 2.6.1 as per release tag in the git repo; the rest as 
 outlined here: 
 http://www.xenomai.org/pipermail/xenomai/2013-January/027164.html


Please upgrade to xenomai master. You are hitting bugs which have already
been fixed since 2.6.1.

 [502738.607343] switchtest: page allocation failure: order:4, mode:0xd0


That is an allocation failure. I am afraid you can run
xeno-regression-test only once after the system boot (it is supposed to
run for several hours anyway).
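
For readers unfamiliar with the "order:4" in that log line: the order is the base-2 logarithm of the number of physically contiguous pages requested. The arithmetic below is a small sketch assuming 4 KiB pages (an assumption about this kernel's configuration, not something stated in the thread). Blocks of that size become harder to find as memory fragments over uptime, which is presumably why the test is best run once on a freshly booted system.

```python
PAGE_SIZE = 4096  # bytes; assumed 4 KiB pages


def contiguous_bytes(order: int, page_size: int = PAGE_SIZE) -> int:
    """Size of an order-N allocation: 2**N physically contiguous pages."""
    return (1 << order) * page_size


size = contiguous_bytes(4)
print(f"order 4 = {1 << 4} pages = {size} bytes = {size // 1024} KiB")
```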


-- 
Gilles.



Re: [Xenomai] [Emc-developers] new RTOS status: Scheduler (?) lockup on ARM

2013-01-17 Thread Bas Laarhoven

On 17-1-2013 9:53, Gilles Chanteperdrix wrote:

On 01/17/2013 08:59 AM, Bas Laarhoven wrote:


On 16-1-2013 20:36, Michael Haberler wrote:

On 16.01.2013 at 17:45, Bas Laarhoven wrote:


On 16-1-2013 15:15, Michael Haberler wrote:

ARM work:

Several people have been able to get the Beaglebone ubuntu/xenomai setup 
working as outlined here: 
http://wiki.linuxcnc.org/cgi-bin/wiki.pl?BeagleboneDevsetup
I have updated the kernel and rootfs image a few days ago so the kernel 
includes ext2/3/4 support compiled in, which should take care of two failure 
reports I got.

Again that xenomai kernel is based on 3.2.21; it has been very stable for me but 
there have been several reports of 'sudden stops'. The BB is a bit sensitive to 
power fluctuations but it might be more than that. As for that kernel, it 
works, but it is based on a branch which will see no further development. It 
supports most of the stuff needed for development; there might be some patches 
coming from more active BB users than me.

Hi Michael,

Are you saying you haven't seen these 'sudden stops' yourself?

No, never, after swapping to stronger power supplies; I have two of these 
boards running over NFS all the time. I don't have Linuxcnc running on them 
though, I'll do that and see if that changes the picture. Maybe keeping the 
torture test running helps trigger it.

Beginner's error! :-P The power supply is indeed critical, but the
stepdown converter on my BeBoPr is dimensioned for at least 2A and
hasn't failed me yet.

I think that running linuxcnc is mandatory for the lockup. After a dozen
runs, it looks like I can reproduce the lockup with 100% certainty
within one hour.
Using the JTAG interface to attach a debugger to the Bone, I've found
that once stalled the kernel is still running. It looks like it won't
schedule properly and almost all time is spent in the cpu_idle thread.


This is typical of a tsc emulation or timer issue. On a system without
anything running, please let the tsc -w command run. It will take some
time to run (the wrap time of the hardware timer used for tsc
emulation), if it runs correctly, then you need to check whether the
timer is still running when the bug happens (cat /proc/xenomai/irq
should continue increasing when for instance the latency test is
running). If the timer is stopped, it may have been programmed for a too
short delay, to avoid that, you can try:
- increasing the ipipe_timer min_delay_ticks member (by default, it uses
a value corresponding to the min_delta_ns member in the clockevent
structure);
- checking after programming the timer (in the set_next_event method) if
the timer counter is already 0, in which case you can return a negative
value, usually -ETIME.
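
The two remedies listed above can be illustrated with a toy model. This is a user-space simulation in Python, not kernel code: the real check would live in the C set_next_event callback of the clockevent driver. The DownCounterTimer class and the ticks_lost parameter (standing in for the time that passes while the timer register is being written) are hypothetical; only set_next_event, min_delay_ticks and -ETIME come from the advice itself.

```python
import errno

# ETIME exists in errno on Linux; 62 is its Linux value, used as a fallback.
ETIME = getattr(errno, "ETIME", 62)


class DownCounterTimer:
    """Toy one-shot down-counter that fires when it counts down to 0."""

    def __init__(self, min_delay_ticks: int) -> None:
        self.min_delay_ticks = min_delay_ticks
        self.counter = 0

    def load(self, ticks: int, ticks_lost: int = 0) -> None:
        # Model the latency of programming the hardware: by the time the
        # write lands, the counter may already have run down.
        self.counter = max(ticks - ticks_lost, 0)


def set_next_event(timer: DownCounterTimer, delay_ticks: int,
                   ticks_lost: int = 0) -> int:
    # Remedy 1: never program a delay shorter than the configured minimum.
    delay_ticks = max(delay_ticks, timer.min_delay_ticks)
    timer.load(delay_ticks, ticks_lost)
    # Remedy 2: if the counter already reads 0 the event was missed;
    # report it so the caller can retry instead of waiting forever.
    if timer.counter == 0:
        return -ETIME
    return 0


tight = DownCounterTimer(min_delay_ticks=2)
print(set_next_event(tight, 1, ticks_lost=3) == -ETIME)  # missed: reported

roomy = DownCounterTimer(min_delay_ticks=5)
print(set_next_event(roomy, 1, ticks_lost=3) == 0)       # larger minimum: armed
```

In the model, a minimum delay that is smaller than the programming latency lets the event slip by, and only the post-programming check reveals it; raising the minimum avoids the miss in the first place.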



Hi Gilles,

Thanks for the swift reply.

As far as I can see, tsc -w runs without an error:

ARM: counter wrap time: 179 seconds
Checking tsc for 6 minute(s)
min: 5, max: 12, avg: 5.04168
...
min: 5, max: 6, avg: 5.03771
min: 5, max: 28, avg: 5.03989 - 0.209995 us

real 6m0.284s

I've also done the other regression tests and all were successful.
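
As a sanity check on those numbers, the figures can be cross-checked against each other. This sketch assumes that the two averages on the last line (5.03989 and 0.209995 us) are the same quantity expressed in counter ticks and in microseconds, and that the counter is 32 bits wide; neither is stated in the output itself.

```python
avg_ticks = 5.03989     # average from the last line of the tsc -w output
avg_us = 0.209995       # the same average, in microseconds
wrap_s_reported = 179   # "counter wrap time: 179 seconds"

# Ticks per second implied by the two averages.
freq_hz = avg_ticks / (avg_us * 1e-6)

# Time for a 32-bit counter running at that rate to wrap around.
wrap_s_32bit = 2**32 / freq_hz

print(f"implied counter frequency: {freq_hz / 1e6:.2f} MHz")
print(f"wrap time of a 32-bit counter: {wrap_s_32bit:.1f} s")
```

The implied frequency comes out very close to 24 MHz, and a 32-bit counter at that rate wraps after about 179 seconds, so the reported wrap time and the tick-to-microsecond conversion agree with each other.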

Problem is that once the bug happens I won't be able to issue the cat 
command.
I've fixed my debug setup so I don't have to use the System.map to 
manually translate the debugger addresses : /

Now I'm waiting for another lockup to see what's happening.

-- Bas


