Philippe Gerum wrote:
> Jan Kiszka wrote:
>> Jan Kiszka wrote:
>>
>>> Jan Kiszka wrote:
>>>
>>>> Hi Philippe,
>>>>
>>>> debugging is nice, but tracing is even nicer. As you suggested, I extended
>>>> the tracer with per-domain stall flags (the output still needs some
>>>> clean-up, but a preliminary patch is attached nevertheless).
>>>>
>>>> And here is the result (full trace attached):
>>>>
>>>>
>>>>> :|    (0x51) 0x000000c8 -1113+   1.112  __ipipe_sync_stage+0x142
>>>>> (ipipe_suspend_domain+0x56)
>>>>> :|    fn                -1112+   1.789  __ipipe_sync_stage+0xe
>>>>> (ipipe_suspend_domain+0x56)
>>>>> :|   *(0x50) 0x00000064 -1110+   1.293  __ipipe_sync_stage+0x97
>>>>> (ipipe_suspend_domain+0x56)
>>>>> :    *fn                -1109+   1.398  do_IRQ+0x8
>>>>> (__ipipe_sync_stage+0xcf)
>>>>> :    *fn                -1107+   2.105  __do_IRQ+0xc (do_IRQ+0x21)
>>>>> :    *fn                -1105+   1.631  handle_IRQ_event+0xd
>>>>> (__do_IRQ+0x9e)
>>>>> :    *fn                -1104+   2.413  timer_interrupt+0x9
>>>>> (handle_IRQ_event+0x40)
>>>>> :    *fn                -1101+   3.022  mark_offset_tsc+0xe
>>>>> (timer_interrupt+0x31)
>>>>> :    *fn                -1098+   2.721  do_timer+0x9
>>>>> (timer_interrupt+0x37)
>>>>> :|   *fn                -1096+   2.112  __ipipe_handle_irq+0xe
>>>>> (common_interrupt+0x18)
>>>>> :|   *fn                -1093+   1.210  __ipipe_ack_common_irq+0xc
>>>>> (__ipipe_handle_irq+0xc0)
>>>>> :|   *(0x50) 0x00000064 -1092+   1.218  __ipipe_ack_common_irq+0x31
>>>>> (__ipipe_handle_irq+0xc0)
>>>>> :|   *fn                -1091+   1.834  mask_and_ack_8259A+0xb
>>>>> (__ipipe_ack_common_irq+0x5d)
>>>>> :|   *(0x50) 0x00000064 -1089+   1.345  __ipipe_ack_common_irq+0x9d
>>>>> (__ipipe_handle_irq+0xc0)
>>>>> :|   *fn                -1088    0.924  ipipe_suspend_domain+0xb
>>>>> (__ipipe_handle_irq+0x174)
>>>>> :|   *(0x51) 0x000000c8 -1087+   1.172  ipipe_suspend_domain+0x26
>>>>> (__ipipe_handle_irq+0x174)
>>>>> :|   *fn                -1086+   2.030  __ipipe_sync_stage+0xe
>>>>> (ipipe_suspend_domain+0x56)
>>>>> :|  **(0x50) 0x000000c8 -1084+   1.330  __ipipe_sync_stage+0x97
>>>>> (ipipe_suspend_domain+0x56)
>>>>> :|  **fn                -1082    0.879  xnintr_clock_handler+0x8
>>>>> (__ipipe_sync_stage+0x12b)
>>>>> :|  **fn                -1082+   1.218  xnintr_irq_handler+0xb
>>>>> (xnintr_clock_handler+0x18)
>>>>> :|  **fn                -1080+   1.233  xnpod_announce_tick+0x9
>>>>> (xnintr_irq_handler+0x2a)
>>>>> :|  **(0x50) 0x000000c8 -1079+   1.736  xnpod_announce_tick+0x20
>>>>> (xnintr_irq_handler+0x2a)
>>>>> :|  **fn                -1077+   2.984 
>>>>> xntimer_do_tick_aperiodic+0xe (xnpod_announce_tick+0x36)
>>>>> :|  **fn                -1074+   2.751 
>>>>> xnthread_timeout_handler+0x8 (xntimer_do_tick_aperiodic+0x18d)
>>>>> :|  **fn                -1072+   1.864  xnpod_resume_thread+0xe
>>>>> (xnthread_timeout_handler+0x1d)
>>>>> :|  **(0x50) 0x000000c8 -1070+   4.699  xnpod_resume_thread+0x2b
>>>>> (xnthread_timeout_handler+0x1d)
>>>>> :|  **(0x50) 0x000000c8 -1065+   1.586  xnpod_resume_thread+0x2bf
>>>>> (xnthread_timeout_handler+0x1d)
>>>>> :|  **(0x01) 0x00001237 -1063+   2.661 
>>>>> xntimer_do_tick_aperiodic+0x309 (xnpod_announce_tick+0x36)
>>>>> :|  **(0x50) 0x000000c8 -1061+   1.263  xnpod_announce_tick+0x4f
>>>>> (xnintr_irq_handler+0x2a)
>>>>> :|  **fn                -1060+   1.255  rthal_irq_end+0x8
>>>>> (xnintr_irq_handler+0x46)
>>>>> :|  **fn                -1058+   2.007  enable_8259A_irq+0x9
>>>>> (rthal_irq_end+0x2e)
>>>>> :|  **fn                -1056+   1.924  xnpod_schedule+0xe
>>>>> (xnintr_irq_handler+0x6e)
>>>>> :|  **(0x50) 0x000000c8 -1054!  15.368  xnpod_schedule+0x53
>>>>> (xnintr_irq_handler+0x6e)
>>>>> :|  **fn                -1039!  13.300  __switch_to+0xc
>>>>> (xnpod_schedule+0x689)
>>>>> :|  **(0x50) 0x000000c8 -1026+   1.766  xnpod_schedule+0x951
>>>>> (xnpod_suspend_thread+0x27c)
>>>>> :|  **(0x50) 0x000000c8 -1024!  18.195  xnpod_suspend_thread+0x2bb
>>>>> (rt_task_sleep+0x54)
>>>>> :   **fn                -1006+   1.624  __ipipe_syscall_root+0x9
>>>>> (system_call+0x20)
>>>>> :   **fn                -1004+   1.406  __ipipe_dispatch_event+0xe
>>>>> (__ipipe_syscall_root+0x58)
>>>>> :   **fn                -1003+   4.323  hisyscall_event+0xe
>>>>> (__ipipe_dispatch_event+0x5e)
>>>>> :   **fn                 -998+   1.398  __rt_task_sleep+0xa
>>>>> (hisyscall_event+0x23c)
>>>>> :   **fn                 -997+   1.789  __copy_from_user_ll+0xa
>>>>> (__rt_task_sleep+0x1a)
>>>>> :   **fn                 -995!  15.345  rt_task_sleep+0xa
>>>>> (__rt_task_sleep+0x25)
>>>>> :   **fn                 -980    0.879  __ipipe_syscall_root+0x9
>>>>> (system_call+0x20)
>>>>> :   **fn                 -979+   1.308  __ipipe_dispatch_event+0xe
>>>>> (__ipipe_syscall_root+0x58)
>>>>> :   **fn                 -978+   2.451  hisyscall_event+0xe
>>>>> (__ipipe_dispatch_event+0x5e)
>>>>> :   **fn                 -975+   1.631  sys_rtdm_ioctl+0x8
>>>>> (hisyscall_event+0x23c)
>>>>> :   **fn                 -974+   1.255  _rtdm_ioctl+0xc
>>>>> (sys_rtdm_ioctl+0x1b)
>>>>> :   **fn                 -972+   4.180  rtdm_context_get+0xa
>>>>> (_rtdm_ioctl+0x20)
>>>>> :   **fn                 -968+   1.203  __ipipe_syscall_root+0x9
>>>>> (system_call+0x20)
>>>>> :   **fn                 -967+   1.345  __ipipe_dispatch_event+0xe
>>>>> (__ipipe_syscall_root+0x58)
>>>>
>>>> Each '*' marks a stalled domain, with the root domain on the right. It
>>>> seems to me that xnpod_suspend_thread() leaves the xeno domain stalled on
>>>> wake-up. This gets recovered somehow later.
>>>>
>>>> But here we see a different effect, which finally causes the tick to
>>>> freeze:
>>>>
>>>>
>>>>> :     fn                 -504+   2.075  sched_clock+0xd
>>>>> (schedule+0x112)
>>>>> :     fn                 -502    0.992  __ipipe_stall_root+0x8
>>>>> (schedule+0x18e)
>>>>> :    *(0x50) 0x00000064  -501+   4.022  __ipipe_stall_root+0x32
>>>>> (schedule+0x18e)
>>>>> :    *fn                 -497+   1.977  __ipipe_dispatch_event+0xe
>>>>> (schedule+0x412)
>>>>> :    *fn                 -495+   4.210  schedule_event+0xd
>>>>> (__ipipe_dispatch_event+0x5e)
>>>>> :   **(0x50) 0x000000c8  -491+   1.428  schedule_event+0x261
>>>>> (__ipipe_dispatch_event+0x5e)
>>>>> :|  **fn                 -490+   2.067  xnpod_schedule_runnable+0xe
>>>>> (schedule_event+0x28b)
>>>>> :|  **fn                 -488    0.954 
>>>>> ipipe_unstall_pipeline_from+0xc (schedule_event+0x2c1)
>>>>> :|   *(0x51) 0x000000c8  -487+   4.646 
>>>>> ipipe_unstall_pipeline_from+0x25 (schedule_event+0x2c1)
>>>>> :|   *fn                 -482+   7.248  __switch_to+0xc
>>>>> (xnpod_schedule+0x689)
>>>>> :|  **(0x50) 0x000000c8  -475+   1.421  xnpod_schedule+0x77f
>>>>> (xnpod_suspend_thread+0x27c)
>>>>> :|  **(0x50) 0x000000c8  -473+   1.481  xnpod_schedule+0x951
>>>>> (xnpod_suspend_thread+0x27c)
>>>>> :|  **(0x50) 0x000000c8  -472+   1.654  xnpod_suspend_thread+0x2bb
>>>>> (xnshadow_relax+0x136)
>>>>> :|  **(0x50) 0x000000c8  -470+   2.917  xnshadow_relax+0x154
>>>>> (hisyscall_event+0x2ec)
>>>>> :|  **fn                 -467+   1.375  ipipe_reenter_root+0xb
>>>>> (xnshadow_relax+0x204)
>>>>> :|  **fn                 -466    0.954  __ipipe_unstall_root+0x8
>>>>> (ipipe_reenter_root+0x26)
>>>>> :|  * (0x51) 0x00000064  -465+   3.789  __ipipe_unstall_root+0x25
>>>>> (ipipe_reenter_root+0x26)
>>>>> :   * fn                 -461+   1.578  send_sig+0x8
>>>>> (xnshadow_relax+0x228)
>>>>> :   * fn                 -460+   1.496  send_sig_info+0xb
>>>>> (send_sig+0x1d)
>>>>> :   * fn                 -458    0.909 
>>>>> __ipipe_test_and_stall_root+0x9 (send_sig_info+0x38)
>>>>> :   **(0x50) 0x00000064  -457+   1.699 
>>>>> __ipipe_test_and_stall_root+0x27 (send_sig_info+0x38)
>>>>> :   **fn                 -455+   1.203  specific_send_sig_info+0xb
>>>>> (send_sig_info+0x69)
>>>>> :   **fn                 -454+   1.706  __ipipe_test_root+0x8
>>>>> (specific_send_sig_info+0x16)
>>>>> :   **fn                 -453+   3.360  sig_ignored+0x9
>>>>> (specific_send_sig_info+0x29)
>>>>> :   **fn                 -449+   1.714  send_signal+0xb
>>>>> (specific_send_sig_info+0x55)
>>>>> :   **fn                 -447+   3.142  __sigqueue_alloc+0x9
>>>>> (send_signal+0x3f)
>>>>> :   **fn                 -444+   1.210  kmem_cache_alloc+0xa
>>>>> (__sigqueue_alloc+0x42)
>>>>> :   **fn                 -443    0.909 
>>>>> __ipipe_test_and_stall_root+0x9 (kmem_cache_alloc+0x12)
>>>>> :   **(0x50) 0x00000064  -442+   2.180 
>>>>> __ipipe_test_and_stall_root+0x27 (kmem_cache_alloc+0x12)
>>>>> :   **fn                 -440+   1.218  __ipipe_restore_root+0x8
>>>>> (kmem_cache_alloc+0x52)
>>>>> :   **fn                 -439+   1.315  __ipipe_stall_root+0x8
>>>>> (__ipipe_restore_root+0x11)
>>>>
>>>> No more recovery from the xeno stall, no more timer ticks.
>>>>
>>>> The special traces were generated via
>>>>
>>>> /* traced when a domain's stall bit gets set */
>>>> #define ipipe_mark_domain_stall(ipd,cpuid)     \
>>>>     ipipe_trace_special(0x50, ipd->priority)
>>>> /* traced when a domain's stall bit gets cleared */
>>>> #define ipipe_mark_domain_unstall(ipd,cpuid)   \
>>>>     ipipe_trace_special(0x51, ipd->priority)
>>>> /* the traced value is the domain priority: 0x64 = root, 0xc8 = Xenomai */
>>>>
>>>> Any ideas? I have not found anything fishy yet that we may have introduced
>>>> to cause this problem.
>>>>
>>>
>>> Hmm, what about this theory: the RT thread that got woken up in the
>>> first trace snippet suffered from a leaking IRQ lock. Its broken flags
>>> were restored on wake-up, and again when the thread went down for
>>> termination (second trace). Hence the leaking domain stall we see. A
>>> possible explanation?
>>>
>>> This would move I-pipe and Xenomai out of focus, leaving only the
>>> 16550A driver and our middleware messaging service (also an RTDM driver,
>>> and currently my top suspect). Unfortunately, my trace does not reach
>>> back far enough to answer this; I will have a look at it on Monday.
>>
>>
>> I should stop blaming others: it was a leaking lock in xeno_16550A. :D
>>
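>> For the record, the bug class looks roughly like this (a sketch, not the
>> actual 16550A code; demo_lock, demo_handler and check_failed are made-up
>> names):
>>
>> static rtdm_lock_t demo_lock;  /* initialised elsewhere via rtdm_lock_init() */
>>
>> static int demo_handler(void)
>> {
>>     rtdm_lockctx_t lock_ctx;
>>
>>     rtdm_lock_get_irqsave(&demo_lock, lock_ctx);
>>
>>     if (check_failed())
>>         /* BUG: early return skips the irqrestore, leaving the lock held
>>            and the stall flag leaked when we return to user space */
>>         return -EINVAL;
>>
>>     rtdm_lock_put_irqrestore(&demo_lock, lock_ctx);
>>     return 0;
>> }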
>>
>>> If it turned out to be the reason, we should consider putting some
>>> XENO_ASSERT guards in the return-to-user paths of RTDM services.
>>>
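>>> Roughly what I have in mind, sketched below (irqs_left_stalled() is only a
>>> placeholder for whatever predicate fits, e.g. a test of the current
>>> domain's stall bit; the recovery action is likewise just an example):
>>>
>>> /* at the end of the return-to-user path of an RTDM service */
>>> XENO_ASSERT(RTDM, !irqs_left_stalled(),
>>>             printk("RTDM: IRQ stall leaked on return to user space\n"));
>>>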
>>
>>
>> This was done yesterday, and a few minutes ago it immediately exposed the
>> offending code.
>>
>> Philippe, do you see any remaining issues, e.g. the fact that the leak
>> survived the task termination? Does this have any implications for correct
>> driver and skin code?
>>
> 
> The only way I could see this leakage surviving a switch transition is if
> it happened over the root context, not over a primary context.
> Was that the case?
> 

The task had to leave from primary mode for the problem to appear; if I
forced it into secondary mode before terminating, the problem did not show up.
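
In case it helps for reproducing: one way to force such a migration is to
drop the task to secondary mode right before it terminates, e.g. via the
native skin (a sketch; any regular Linux syscall issued from the task would
also relax it to secondary mode):

/* at the end of the task body, before deletion */
rt_task_set_mode(T_PRIMARY, 0, NULL);  /* clear T_PRIMARY -> secondary mode */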

Jan
