Re: [Xenomai-core] [bug] zombie mutex owners
Dmitry Adamushko wrote:

Hi Jan,

running the attached test case for the native skin, you will get an ugly lock-up on probably all Xenomai versions. Granted, this code is a bit synthetic. I originally thought I could trigger the bug also via timeouts when waiting on mutexes, but this scenario is safe (the timeout is cleared before being able to cause harm).

Just to educate me, since I might have got something wrong at first glance :) if we take this one:

--- mutex.c	2006-02-27 15:34:58.0 +0100
+++ mutex-NEW.c	2006-05-10 11:55:25.0 +0200
@@ -391,7 +391,7 @@ int rt_mutex_lock (RT_MUTEX *mutex,
         err = -EIDRM; /* Mutex deleted while pending. */
     else if (xnthread_test_flags(&task->thread_base, XNTIMEO))
         err = -ETIMEDOUT; /* Timeout. */
-    else if (xnthread_test_flags(&task->thread_base, XNBREAK))
+    else if (xnthread_test_flags(&task->thread_base, XNBREAK) && mutex->owner != task)
         err = -EINTR; /* Unblocked. */
 
 unlock_and_exit:

As I understand it, task2 has a lower prio, and that's why

[task1] rt_mutex_unlock()
[task1] rt_task_unblock(&task2)

are called in a row. OK, task2 wakes up in rt_mutex_unlock() (while task1 is blocked on rt_mutex_lock()) and finds the XNBREAK flag, but:

[doc] -EINTR is returned if rt_task_unblock() has been called for the waiting task (1) before the mutex has become available (2).

(1) is true, task2 was still waiting at that time; (2) is wrong, task2 was already the owner. So why not just clear XNBREAK and let task2 continue, since it already holds the mutex (as shown above)?

Indeed, this solves the issue more gracefully. Looking at this again from a different perspective and running the test case with your patch in a slightly different way, I think I misinterpreted the crash. If I modify task2 like this

void task2_fnc(void *arg)
{
    printf("started task2\n");
    if (rt_mutex_lock(&mtx, 0) < 0) {
        printf("lock failed in task2\n");
        return;
    }
    //rt_mutex_unlock(&mtx);
    printf("done task2\n");
}

I'm also getting a crash.
So the problem seems to be releasing mutex ownership on task termination. Well, this needs further examination.

Looks like the issue is limited to cleanup problems and is not as widespread across the other skins as I thought. RTDM is not involved as it does not know EINTR for rtdm_mutex_lock. The POSIX skin runs in a loop on interruption and should recover from this.

Besides this, we may then want to consider whether introducing a pending ownership of synch objects is worthwhile to improve the efficiency of PIP users. Not critical, but if it comes at a reasonable price... Will try to draft something.

Jan

___
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core
[Xenomai-core] [bug] zombie mutex owners
Hi,

running the attached test case for the native skin, you will get an ugly lock-up on probably all Xenomai versions. Granted, this code is a bit synthetic. I originally thought I could trigger the bug also via timeouts when waiting on mutexes, but this scenario is safe (the timeout is cleared before being able to cause harm).

What we see here is that task1 forwards the ownership of the mutex to task2 on its first unlock invocation. Then we interrupt task2, making it drop its wish to acquire the lock - but it already has it! Now weird things happen on cleanup of task2 (likely also when trying to re-acquire the lock via task1 beforehand).

The attached fix solves at least the crash but still gives an unsatisfying result:

: fn  -158+ __rt_mutex_lock (hisyscall_event)
: fn  -156+ __copy_from_user_ll (__rt_mutex_lock)
: fn  -154+ xnregistry_fetch (__rt_mutex_lock)
:|fn  -151+ __ipipe_restore_pipeline_head (xnregistry_fetch)
: fn  -148+ rt_mutex_lock (__rt_mutex_lock)
:|fn  -144+ xnsynch_sleep_on (rt_mutex_lock)
:|fn  -134+ xnpod_resume_thread (xnsynch_sleep_on)
:|fn  -130+ xnpod_suspend_thread (xnsynch_sleep_on)
:|fn  -125+ xnpod_schedule (xnpod_suspend_thread)
:|fn  -116! __switch_to (xnpod_schedule)
:|fn  -103+ rt_mutex_unlock (rt_mutex_lock)
:|fn  -100+ xnsynch_wakeup_one_sleeper (rt_mutex_unlock)
:|fn   -98+ xnpod_resume_thread (xnsynch_wakeup_one_sleeper)
:|fn   -95+ xnsynch_clear_boost (xnsynch_wakeup_one_sleeper)
:|fn   -89+ xnpod_schedule (rt_mutex_unlock)
:|fn   -85+ __switch_to (xnpod_schedule)
:|fn   -79! __ipipe_restore_pipeline_head (rt_mutex_lock)

This means that task2 needs to be woken up in order to let task1 re-acquire the mutex. What would be more efficient for task1 is to steal the granted lock again (that's what the preempt-rt people do in their rtmutex code - and this is where I stumbled over our issues).

I haven't tried to construct test cases for other skins yet, but in theory at least POSIX and RTDM should suffer from the same issue.
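The lock-stealing idea borrowed from preempt-rt's rtmutex can be modelled in a few lines. The sketch below is purely hypothetical and uses no Xenomai APIs; all struct and function names (pmutex, pmutex_try_acquire, etc.) are invented for illustration. The point: a grant stays "pending" until the new owner actually resumes, and until then a higher-priority claimant may take the lock back instead of blocking behind it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of "pending ownership" with lock stealing. */
struct thread { int prio; };

struct pmutex {
    struct thread *owner;
    int owner_resumed;  /* 0 while the grant is still pending */
};

/* Try to take the mutex; returns 1 on success (including a steal). */
int pmutex_try_acquire(struct pmutex *m, struct thread *t)
{
    if (m->owner == NULL) {
        m->owner = t;
        m->owner_resumed = 1;  /* caller is running, grant is final */
        return 1;
    }
    /* Steal only if the current grant is still pending and we outrank
     * the grantee -- mirroring what preempt-rt's rtmutex allows. */
    if (!m->owner_resumed && t->prio > m->owner->prio) {
        m->owner = t;
        m->owner_resumed = 1;
        return 1;
    }
    return 0;
}

/* Unlock: hand the mutex to a waiter as a *pending* grant,
 * without forcing a context switch to the (lower-prio) waiter. */
void pmutex_grant_pending(struct pmutex *m, struct thread *waiter)
{
    m->owner = waiter;
    m->owner_resumed = 0;  /* waiter has not run yet */
}
```

With this scheme, the trace above would shorten: task1 would re-acquire the mutex by stealing the pending grant instead of scheduling task2 in first.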
This raises the question whether there might be some generic solution at nucleus level for this, also improving the re-acquire path. But I have nothing at hand so far.

Jan

Index: ksrc/skins/native/mutex.c
===
--- ksrc/skins/native/mutex.c (Revision 1058)
+++ ksrc/skins/native/mutex.c (Arbeitskopie)
@@ -375,12 +375,17 @@ int rt_mutex_lock(RT_MUTEX *mutex, RTIME
     xnsynch_sleep_on(&mutex->synch_base, timeout);
 
-    if (xnthread_test_flags(&task->thread_base, XNRMID))
-        err = -EIDRM; /* Mutex deleted while pending. */
-    else if (xnthread_test_flags(&task->thread_base, XNTIMEO))
-        err = -ETIMEDOUT; /* Timeout. */
-    else if (xnthread_test_flags(&task->thread_base, XNBREAK))
-        err = -EINTR; /* Unblocked. */
+    if (xnthread_test_flags(&task->thread_base, XNRMID | XNTIMEO | XNBREAK)) {
+        if (xnthread_test_flags(&task->thread_base, XNRMID))
+            err = -EIDRM; /* Mutex deleted while pending. */
+        else if (xnthread_test_flags(&task->thread_base, XNTIMEO))
+            err = -ETIMEDOUT; /* Timeout. */
+        else
+            err = -EINTR; /* Unblocked. */
+
+        if (mutex->owner == task)
+            rt_mutex_unlock(mutex);
+    }
 
 unlock_and_exit:

#include <stdio.h>
#include <sys/mman.h>
#include <native/task.h>
#include <native/mutex.h>
#include <native/timer.h>
#include <rtdm/rtbenchmark.h>

RT_TASK task1, task2;
RT_MUTEX mtx;
int fd;

void task1_fnc(void *arg)
{
    printf("started task1\n");
    rt_mutex_lock(&mtx, 0);
    rt_task_sleep(10LL);
    rt_mutex_unlock(&mtx);
    rt_task_unblock(&task2);
    rt_mutex_lock(&mtx, 0);
    rt_dev_ioctl(fd, RTBNCH_RTIOC_REFREEZE_TRACE, 0);
    rt_dev_close(fd);
    rt_mutex_unlock(&mtx);
    printf("done task1\n");
}

void task2_fnc(void *arg)
{
    printf("started task2\n");
    if (rt_mutex_lock(&mtx, 0) < 0) {
        printf("lock failed in task2\n");
        return;
    }
    rt_mutex_unlock(&mtx);
    printf("done task2\n");
}

int main()
{
    mlockall(MCL_CURRENT | MCL_FUTURE);
    fd = rt_dev_open("rtbenchmark0", 0);
    rt_mutex_create(&mtx, NULL);
    rt_task_spawn(&task1, "task1", 0, 20, T_JOINABLE, task1_fnc, 0);
    rt_task_spawn(&task2, "task2", 0, 10, T_JOINABLE, task2_fnc, 0);
    rt_task_join(&task1);
    rt_task_join(&task2);
    rt_mutex_delete(&mtx);
    return 0;
}
Re: [Xenomai-core] [bug] zombie mutex owners
Jan Kiszka wrote:
Dmitry Adamushko wrote:

Hi Jan,

running the attached test case for the native skin, you will get an ugly lock-up on probably all Xenomai versions. Granted, this code is a bit synthetic. I originally thought I could trigger the bug also via timeouts when waiting on mutexes, but this scenario is safe (the timeout is cleared before being able to cause harm).

Just to educate me, since I might have got something wrong at first glance :) if we take this one:

--- mutex.c	2006-02-27 15:34:58.0 +0100
+++ mutex-NEW.c	2006-05-10 11:55:25.0 +0200
@@ -391,7 +391,7 @@ int rt_mutex_lock (RT_MUTEX *mutex,
         err = -EIDRM; /* Mutex deleted while pending. */
     else if (xnthread_test_flags(&task->thread_base, XNTIMEO))
         err = -ETIMEDOUT; /* Timeout. */
-    else if (xnthread_test_flags(&task->thread_base, XNBREAK))
+    else if (xnthread_test_flags(&task->thread_base, XNBREAK) && mutex->owner != task)
         err = -EINTR; /* Unblocked. */
 
 unlock_and_exit:

As I understand it, task2 has a lower prio, and that's why

[task1] rt_mutex_unlock()
[task1] rt_task_unblock(&task2)

are called in a row. OK, task2 wakes up in rt_mutex_unlock() (while task1 is blocked on rt_mutex_lock()) and finds the XNBREAK flag, but:

[doc] -EINTR is returned if rt_task_unblock() has been called for the waiting task (1) before the mutex has become available (2).

(1) is true, task2 was still waiting at that time; (2) is wrong, task2 was already the owner. So why not just clear XNBREAK and let task2 continue, since it already holds the mutex (as shown above)?

Indeed, this solves the issue more gracefully. Looking at this again from a different perspective and running the test case with your patch in a slightly different way, I think I misinterpreted the crash. If I modify task2 like this

void task2_fnc(void *arg)
{
    printf("started task2\n");
    if (rt_mutex_lock(&mtx, 0) < 0) {
        printf("lock failed in task2\n");
        return;
    }
    //rt_mutex_unlock(&mtx);
    printf("done task2\n");
}

I'm also getting a crash.
So the problem seems to be releasing mutex ownership on task termination. Well, this needs further examination.

Looks like the issue is limited to cleanup problems and is not as widespread across the other skins as I thought. RTDM is not involved as it does not know EINTR for rtdm_mutex_lock. The POSIX skin runs in a loop on interruption and should recover from this.

Besides this, we may then want to consider whether introducing a pending ownership of synch objects is worthwhile to improve the efficiency of PIP users. Not critical, but if it comes at a reasonable price... Will try to draft something.

I've planned to work over the simulator asap to implement the stealing of ownership at the nucleus level, so that this kind of issue will become history.

--
Philippe.
Re: [Xenomai-core] [bug] zombie mutex owners
Jan Kiszka wrote:
Dmitry Adamushko wrote:

Indeed, this solves the issue more gracefully. Looking at this again from a different perspective and running the test case with your patch in a slightly different way, I think I misinterpreted the crash. If I modify task2 like this

void task2_fnc(void *arg)
{
    printf("started task2\n");
    if (rt_mutex_lock(&mtx, 0) < 0) {
        printf("lock failed in task2\n");
        return;
    }
    //rt_mutex_unlock(&mtx);
    printf("done task2\n");
}

I'm also getting a crash. So the problem seems to be releasing mutex ownership on task termination. Well, this needs further examination.

The native skin does not implement robust mutexes, indeed.

Looks like the issue is limited to cleanup problems and is not as widespread across the other skins as I thought. RTDM is not involved as it does not know EINTR for rtdm_mutex_lock. The POSIX skin runs in a loop on interruption and should recover from this.

I can confirm with the help of the simulator that there's no issue regarding the way the ownership is transferred back and forth between both tasks; this works. However, I agree that this should not preclude us from improving the priority inversion guard, by allowing the lock to be stolen if the new owner has not resumed execution yet after it has been granted the mutex ownership.

Besides this, we may then want to consider whether introducing a pending ownership of synch objects is worthwhile to improve the efficiency of PIP users. Not critical, but if it comes at a reasonable price... Will try to draft something.

Jan

--
Philippe.
Re: [Xenomai-core] [bug] zombie mutex owners
Philippe Gerum wrote:
Jan Kiszka wrote:
Dmitry Adamushko wrote:

Indeed, this solves the issue more gracefully. Looking at this again from a different perspective and running the test case with your patch in a slightly different way, I think I misinterpreted the crash. If I modify task2 like this

void task2_fnc(void *arg)
{
    printf("started task2\n");
    if (rt_mutex_lock(&mtx, 0) < 0) {
        printf("lock failed in task2\n");
        return;
    }
    //rt_mutex_unlock(&mtx);
    printf("done task2\n");
}

I'm also getting a crash. So the problem seems to be releasing mutex ownership on task termination. Well, this needs further examination.

The native skin does not implement robust mutexes, indeed.

Yeah, lunch opened my eyes: the skin data structure (RT_MUTEX) is not updated appropriately on task cleanup. What about some callback hook in xnsynch_t to invoke a per-skin cleanup handler when running xnsynch_release_all_ownerships?

Jan
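The hook Jan asks for can be sketched generically. This is a hypothetical model only, loosely inspired by xnsynch_t but NOT the real Xenomai API; every name below (struct synch, synch_register_cleanup, synch_release_ownership) is invented for illustration. The idea: the nucleus-level synch object carries a skin-registered callback that fires when ownerships are force-released on task exit, giving the skin a chance to fix up its own data (e.g. the RT_MUTEX owner and lock-count fields).

```c
#include <assert.h>
#include <stddef.h>

struct synch;
typedef void (*synch_cleanup_fn)(struct synch *s);

struct synch {
    void *owner;               /* owning thread, NULL if free */
    synch_cleanup_fn cleanup;  /* per-skin hook, may be NULL */
};

/* The skin registers its cleanup handler once, at object creation. */
void synch_register_cleanup(struct synch *s, synch_cleanup_fn fn)
{
    s->cleanup = fn;
}

/* Would be called from the nucleus when the owning task dies
 * without unlocking, for each synch object it still owns. */
void synch_release_ownership(struct synch *s)
{
    if (s->cleanup)
        s->cleanup(s);  /* let the skin reset its own mutex state */
    s->owner = NULL;
}

/* Example skin-side hook: clear a lock counter kept by the skin. */
static int skin_lock_count = 1;
static void skin_mutex_cleanup(struct synch *s)
{
    (void)s;
    skin_lock_count = 0;
}
```

Without such a hook, the nucleus can release its own bookkeeping but leaves the skin-level RT_MUTEX pointing at a dead owner - exactly the zombie-owner state observed in the test case.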
Re: [Xenomai-core] [bug] zombie mutex owners
Jan Kiszka wrote:
Dmitry Adamushko wrote:

Hi Jan,

running the attached test case for the native skin, you will get an ugly lock-up on probably all Xenomai versions. Granted, this code is a bit synthetic. I originally thought I could trigger the bug also via timeouts when waiting on mutexes, but this scenario is safe (the timeout is cleared before being able to cause harm).

Just to educate me, since I might have got something wrong at first glance :) if we take this one:

--- mutex.c	2006-02-27 15:34:58.0 +0100
+++ mutex-NEW.c	2006-05-10 11:55:25.0 +0200
@@ -391,7 +391,7 @@ int rt_mutex_lock (RT_MUTEX *mutex,
         err = -EIDRM; /* Mutex deleted while pending. */
     else if (xnthread_test_flags(&task->thread_base, XNTIMEO))
         err = -ETIMEDOUT; /* Timeout. */
-    else if (xnthread_test_flags(&task->thread_base, XNBREAK))
+    else if (xnthread_test_flags(&task->thread_base, XNBREAK) && mutex->owner != task)
         err = -EINTR; /* Unblocked. */
 
 unlock_and_exit:

As I understand it, task2 has a lower prio, and that's why

[task1] rt_mutex_unlock()
[task1] rt_task_unblock(&task2)

are called in a row. OK, task2 wakes up in rt_mutex_unlock() (while task1 is blocked on rt_mutex_lock()) and finds the XNBREAK flag, but:

[doc] -EINTR is returned if rt_task_unblock() has been called for the waiting task (1) before the mutex has become available (2).

(1) is true, task2 was still waiting at that time; (2) is wrong, task2 was already the owner. So why not just clear XNBREAK and let task2 continue, since it already holds the mutex (as shown above)?

Indeed, this solves the issue more gracefully.

The real issue with the XNBREAK bit is that xnpod_unblock_thread() should only raise it for threads which it does actually awake, and not for those which have already been resumed, e.g. by a call to xnsynch_wakeup_one_sleeper(). This explains why task2 gets -EINTR from rt_mutex_lock() although the syscall has actually succeeded, so that the next rt_mutex_unlock() was OK - but downstream code might clearly be confused by such behaviour.
It should get a success code instead, since it has been resumed by xnsynch_wakeup_one_sleeper() _before_ xnpod_unblock_thread() was called against it. Fixed in the repository.

--
Philippe.
Re: [Xenomai-core] [bug] zombie mutex owners
Jan Kiszka wrote:
Philippe Gerum wrote:
Jan Kiszka wrote:
Dmitry Adamushko wrote:

Indeed, this solves the issue more gracefully. Looking at this again from a different perspective and running the test case with your patch in a slightly different way, I think I misinterpreted the crash. If I modify task2 like this

void task2_fnc(void *arg)
{
    printf("started task2\n");
    if (rt_mutex_lock(&mtx, 0) < 0) {
        printf("lock failed in task2\n");
        return;
    }
    //rt_mutex_unlock(&mtx);
    printf("done task2\n");
}

I'm also getting a crash. So the problem seems to be releasing mutex ownership on task termination. Well, this needs further examination.

The native skin does not implement robust mutexes, indeed.

Yeah, lunch opened my eyes: the skin data structure (RT_MUTEX) is not updated appropriately on task cleanup. What about some callback hook in xnsynch_t to invoke a per-skin cleanup handler when running xnsynch_release_all_ownerships?

Yep, we need this. Added by commit #1067 (i.e. xnsynch_register_cleanup()) and also backported to the maintenance branch, so that new code may happily rely on it without too much portability burden.

--
Philippe.