On 04.04.22 15:21, Arturo Laurenzi via Xenomai wrote:
> Dear Xenomai community,
> in our lab we use Xenomai + RTnet to control complex EtherCAT-based robotic
> platforms (research prototypes).
> 
> Our infrastructure is made of two multi-threaded processes, let's say A and
> B, as follows.
> 
> Process A is an EtherCAT master, wrapped to expose both an RT and an NRT
> interface to other processes:
>  - A1: ecat master (SOEM-based, uses RTnet), SCHED_FIFO
>  - A2: iddp end-point, SCHED_FIFO
>  - A3: zmq server, xddp end-point, SCHED_OTHER
> 
> Process B is our "control process" where algorithms actually run:
>  - B1: control thread, SCHED_FIFO
>  - B2: communication thread, SCHED_OTHER
> 
> The two processes interact in two ways.
> The first is zmq-based, and happens between B1 and A3 during the
> initialization phase (so, before the time-critical part of thread B1).
> The second is iddp-based. Both endpoints (A2 and B1) will bind/connect to a
> set of pipes, to realize a bi-directional communication channel that is
> RT-safe.
> 
> This usually works fine under the following setup:
> 
> CPU: Intel Core i7-7820EQ @3.00 GHz
> OS/Kernel: Ubuntu 18.04 + Linux 4.19.140-xeno-ipipe-3.1
> Xenomai: v3.1 (Cobalt + Posix API)
> Compiler: default GCC (v7.5)
> 
> Recently, we have started a transition towards Ubuntu 20.04, and things
> have started to break.
> 
> The first attempt was to install kernel 5.4.151 and stick to ipipe. Under
> this setup, we experience issues even before starting our applications. We
> have seen random crashes while compiling with GCC, sporadic "System Program
> Problem Detected" popups by Ubuntu, and others. We even tried to re-install
> OS and kernel from scratch with no luck.

A reference setup for this kernel line can be found in xenomai-images
(https://source.denx.de/Xenomai/xenomai-images). It would be good to
understand which deviation from that setup causes the problem, and in
which component (see also the further questions below).
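
One way to narrow down such deviations is to compare your kernel .config
against the one used by the reference setup. The kernel tree ships
scripts/diffconfig for this; the following is only a minimal sketch of the
same idea, assuming both configs are available as plain .config files. The
function names are made up for illustration.

```python
def parse_kconfig(text):
    """Map option name -> value for a kernel .config given as a string."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
        elif line.startswith("# CONFIG_") and line.endswith(" is not set"):
            # Disabled options appear as comments in a .config file.
            name = line[2:-len(" is not set")]
            options[name] = "n"
    return options


def diff_kconfig(reference_text, local_text):
    """Return {option: (reference_value, local_value)} for differing options.

    An option that is absent from one file is reported with value None there.
    """
    ref = parse_kconfig(reference_text)
    loc = parse_kconfig(local_text)
    return {
        name: (ref.get(name), loc.get(name))
        for name in sorted(set(ref) | set(loc))
        if ref.get(name) != loc.get(name)
    }
```

Feed it the contents of the two files (for example read via
open(path).read()) and look first at the differences in Xenomai, I-pipe or
Dovetail, and debugging options.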

> 
> The second attempt was to stick to our old kernel 4.19.140. All the weird
> issues disappear and the system is stable. However, we are unable to have
> the system pass our suite of "stress tests", which basically involve starting,
> running, and killing process B multiple times in a cyclic fashion, while
> process A runs in the background. After a short while (minutes), the whole
> system just hangs, forcing us to do a hard reset. Only once, we managed to
> get this kernel oops after rebooting (journalctl -k -b -1 --no-pager).
> 

To record crashes reliably, it is highly recommended to use a UART as the
kernel debug output.

> 
> *dic 20 17:07:10 com-exp-dev kernel: BUG: unable to handle kernel paging
> request at fffffffeee9e41b1*
> *dic 20 17:07:10 com-exp-dev kernel: PGD 42080c067 P4D 42080c067 PUD 0*
> *dic 20 17:07:10 com-exp-dev kernel: Oops: 0010 [#1] SMP PTI*
> *dic 20 17:07:10 com-exp-dev kernel: CPU: 1 PID: 134 Comm: kworker/u16:1
> Not tainted 4.19.140-xeno-ipipe-3.1 #1*
> *dic 20 17:07:10 com-exp-dev kernel: Hardware name:  /TS175, BIOS BQKLR112
> 07/04/2017*
> *dic 20 17:07:10 com-exp-dev kernel: I-pipe domain: Linux*
> *dic 20 17:07:10 com-exp-dev kernel: Workqueue: efi_rts_wq efi_call_rts*
> *dic 20 17:07:10 com-exp-dev kernel: RIP: 0010:0xfffffffeee9e41b1*
> *dic 20 17:07:10 com-exp-dev kernel: Code: Bad RIP value.*
> *dic 20 17:07:10 com-exp-dev kernel: RSP: 0018:ffffa6170334fd28 EFLAGS:
> 00010246*
> *dic 20 17:07:10 com-exp-dev kernel: RAX: 00000000000002ff RBX:
> 0000000000000000 RCX: fffffffeee9e73b8*
> *dic 20 17:07:10 com-exp-dev kernel: RDX: 00000000000000a1 RSI:
> ffff8884d8371400 RDI: ffffa61704f8fdcc*
> *dic 20 17:07:10 com-exp-dev kernel: RBP: ffff8884d8371000 R08:
> fffffffeee9e73b8 R09: ffffa61704f8fdd0*
> *dic 20 17:07:10 com-exp-dev kernel: R10: 00000000000002ff R11:
> 0000000000000018 R12: ffff8884d8371000*
> *dic 20 17:07:10 com-exp-dev kernel: R13: ffff8884d8371400 R14:
> ffffa61704f8fdcc R15: ffff8884c8331d84*
> *dic 20 17:07:10 com-exp-dev kernel: FS:  0000000000000000(0000)
> GS:ffff8884df500000(0000) knlGS:0000000000000000*
> *dic 20 17:07:10 com-exp-dev kernel: CS:  0010 DS: 0000 ES: 0000 CR0:
> 0000000080050033*
> *dic 20 17:07:10 com-exp-dev kernel: CR2: fffffffeee9e4187 CR3:
> 000000042080a005 CR4: 00000000003606e0*
> *dic 20 17:07:10 com-exp-dev kernel: DR0: 0000000000000000 DR1:
> 0000000000000000 DR2: 0000000000000000*
> *dic 20 17:07:10 com-exp-dev kernel: DR3: 0000000000000000 DR6:
> 00000000fffe0ff0 DR7: 0000000000000400*
> *dic 20 17:07:10 com-exp-dev kernel: Call Trace:*
> *dic 20 17:07:10 com-exp-dev kernel:  ? __switch_to_asm+0x35/0x70*
> *dic 20 17:07:10 com-exp-dev kernel:  ? __switch_to_asm+0x41/0x70*
> *dic 20 17:07:10 com-exp-dev kernel:  ? __switch_to_asm+0x35/0x70*
> *dic 20 17:07:10 com-exp-dev kernel:  ? __switch_to_asm+0x41/0x70*
> *dic 20 17:07:10 com-exp-dev kernel:  ? efi_call+0x58/0x90*
> *dic 20 17:07:10 com-exp-dev kernel:  ? __switch_to_asm+0x41/0x70*
> *dic 20 17:07:10 com-exp-dev kernel:  ? efi_call_rts+0x18c/0x960*
> *dic 20 17:07:10 com-exp-dev kernel:  ? process_one_work+0x1ac/0x330*
> *dic 20 17:07:10 com-exp-dev kernel:  ? worker_thread+0x48/0x3e0*
> *dic 20 17:07:10 com-exp-dev kernel:  ? kthread+0xfc/0x130*
> *dic 20 17:07:10 com-exp-dev kernel:  ? process_one_work+0x330/0x330*
> *dic 20 17:07:10 com-exp-dev kernel:  ? kthread_park+0x80/0x80*
> *dic 20 17:07:10 com-exp-dev kernel:  ? ret_from_fork+0x36/0x50*
> *dic 20 17:07:10 com-exp-dev kernel: Modules linked in: fuse rtpacket
> binfmt_misc nls_ascii nls_cp437 vfat fat evdev x86_pkg_temp_thermal
> rt_e1000e intel_powerclamp i915 crc32c_intel rtnet i2c_algo_bit
> drm_kms_helper cfbfillrect syscopyarea cfbimgblt sysfillrect sysimgblt
> fb_sys_fops cfbcopyarea fb efi_pstore font fbdev efivars intel_pch_thermal
> video button loop sch_fq_codel msr drm drm_panel_orientation_quirks sunrpc
> efivarfs autofs4 e1000e i2c_i801 xhci_pci xhci_hcd ptp pps_core ahci
> libahci usbcore libata usb_common*
> *dic 20 17:07:10 com-exp-dev kernel: CR2: fffffffeee9e41b1*
> *dic 20 17:07:10 com-exp-dev kernel: ---[ end trace d36b472eaef981c9 ]---*
> *dic 20 17:07:10 com-exp-dev kernel: RIP: 0010:0xfffffffeee9e41b1*
> *dic 20 17:07:10 com-exp-dev kernel: Code: Bad RIP value.*
> *dic 20 17:07:10 com-exp-dev kernel: RSP: 0018:ffffa6170334fd28 EFLAGS:
> 00010246*
> *dic 20 17:07:10 com-exp-dev kernel: RAX: 00000000000002ff RBX:
> 0000000000000000 RCX: fffffffeee9e73b8*
> *dic 20 17:07:10 com-exp-dev kernel: RDX: 00000000000000a1 RSI:
> ffff8884d8371400 RDI: ffffa61704f8fdcc*
> *dic 20 17:07:10 com-exp-dev kernel: RBP: ffff8884d8371000 R08:
> fffffffeee9e73b8 R09: ffffa61704f8fdd0*
> *dic 20 17:07:10 com-exp-dev kernel: R10: 00000000000002ff R11:
> 0000000000000018 R12: ffff8884d8371000*
> *dic 20 17:07:10 com-exp-dev kernel: R13: ffff8884d8371400 R14:
> ffffa61704f8fdcc R15: ffff8884c8331d84*
> *dic 20 17:07:10 com-exp-dev kernel: FS:  0000000000000000(0000)
> GS:ffff8884df500000(0000) knlGS:0000000000000000*
> *dic 20 17:07:10 com-exp-dev kernel: CS:  0010 DS: 0000 ES: 0000 CR0:
> 0000000080050033*
> *dic 20 17:07:10 com-exp-dev kernel: CR2: fffffffeee9e4187 CR3:
> 000000042080a005 CR4: 00000000003606e0*
> *dic 20 17:07:10 com-exp-dev kernel: DR0: 0000000000000000 DR1:
> 0000000000000000 DR2: 0000000000000000*
> *dic 20 17:07:10 com-exp-dev kernel: DR3: 0000000000000000 DR6:
> 00000000fffe0ff0 DR7: 0000000000000400*
> *dic 20 17:07:10 com-exp-dev kernel: general protection fault: 0000 [#2]
> SMP PTI*
> *dic 20 17:07:10 com-exp-dev kernel: CPU: 1 PID: 445 Comm: rs:main Q:Reg
> Tainted: G      D           4.19.140-xeno-ipipe-3.1 #1*
> *dic 20 17:07:10 com-exp-dev kernel: Hardware name:  /TS175, BIOS BQKLR112
> 07/04/2017*
> *dic 20 17:07:10 com-exp-dev kernel: I-pipe domain: Linux*
> *dic 20 17:07:10 com-exp-dev kernel: RIP: 0010:pgd_free+0x56/0x90*
> *dic 20 17:07:10 com-exp-dev kernel: Code: 2b 15 66 89 b4 00 48 bf 00 01 00
> 00 00 00 ad de 48 01 d0 48 c1 e8 0c 48 c1 e0 06 48 03 05 3a 89 b4 00 48 8b
> 48 08 48 8b 50 10 <48> 89 51 08 48 89 0a 48 b9 00 02 00 00 00 00 ad de 48
> 89 78 08 48*
> *dic 20 17:07:10 com-exp-dev kernel: RSP: 0018:ffffa617033f7b68 EFLAGS:
> 00010282*
> *dic 20 17:07:10 com-exp-dev kernel: RAX: fffff43d117c1b80 RBX:
> 0000000000000402 RCX: dead000000000100*
> *dic 20 17:07:10 com-exp-dev kernel: RDX: dead000000000200 RSI:
> ffff8884df06e000 RDI: dead000000000100*
> *dic 20 17:07:10 com-exp-dev kernel: RBP: ffffa617033f7b70 R08:
> 0000000000000001 R09: 0000000000000003*
> *dic 20 17:07:10 com-exp-dev kernel: R10: 00000000000acfb0 R11:
> 00000000000007ff R12: ffff8884df06e000*
> *dic 20 17:07:10 com-exp-dev kernel: R13: ffff8884d6f031c0 R14:
> ffffffffafa785a0 R15: ffff8884cd1c4080*
> *dic 20 17:07:10 com-exp-dev kernel: FS:  00007f0cc3fff700(0000)
> GS:ffff8884df500000(0000) knlGS:0000000000000000*
> *dic 20 17:07:10 com-exp-dev kernel: CS:  0010 DS: 0000 ES: 0000 CR0:
> 0000000080050033*
> *dic 20 17:07:10 com-exp-dev kernel: CR2: fffffffeee9e4187 CR3:
> 000000044da2c004 CR4: 00000000003606e0*
> *dic 20 17:07:10 com-exp-dev kernel: DR0: 0000000000000000 DR1:
> 0000000000000000 DR2: 0000000000000000*
> *dic 20 17:07:10 com-exp-dev kernel: DR3: 0000000000000000 DR6:
> 00000000fffe0ff0 DR7: 0000000000000400*
> *dic 20 17:07:10 com-exp-dev kernel: Call Trace:*
> *dic 20 17:07:10 com-exp-dev kernel:  __mmdrop+0x52/0xf0*
> *dic 20 17:07:10 com-exp-dev kernel:  finish_task_switch+0x1bf/0x240*
> *dic 20 17:07:10 com-exp-dev kernel:  __schedule+0x208/0x650*
> *dic 20 17:07:10 com-exp-dev kernel:  ? default_wake_function+0xd/0x10*
> *dic 20 17:07:10 com-exp-dev kernel:  schedule+0x31/0x80*
> *dic 20 17:07:10 com-exp-dev kernel:  futex_wait_queue_me+0xc3/0x130*
> *dic 20 17:07:10 com-exp-dev kernel:  futex_wait+0x10a/0x250*
> *dic 20 17:07:10 com-exp-dev kernel:  do_futex+0x146/0xc50*
> *dic 20 17:07:10 com-exp-dev kernel:  ? ext4_file_write_iter+0xff/0x3a0*
> *dic 20 17:07:10 com-exp-dev kernel:  ? _cond_resched+0x14/0x30*
> *dic 20 17:07:10 com-exp-dev kernel:  ? dput+0x31/0x140*
> *dic 20 17:07:10 com-exp-dev kernel:  __x64_sys_futex+0x144/0x180*
> *dic 20 17:07:10 com-exp-dev kernel:  ? __f_unlock_pos+0xd/0x10*
> *dic 20 17:07:10 com-exp-dev kernel:  ? ksys_write+0xbc/0xd0*
> *dic 20 17:07:10 com-exp-dev kernel:  do_syscall_64+0x6d/0x250*
> *dic 20 17:07:10 com-exp-dev kernel:
> entry_SYSCALL_64_after_hwframe+0x44/0xa9*
> *dic 20 17:07:10 com-exp-dev kernel: RIP: 0033:0x7f0cc945d376*
> *dic 20 17:07:10 com-exp-dev kernel: Code: 44 24 60 0f 11 44 24 68 e8 97 38
> 00 00 e8 82 3c 00 00 89 de 45 31 d2 31 d2 41 89 c0 40 80 f6 80 4c 89 ff b8
> ca 00 00 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 26 01 00 00 44 89 c7 e8 b6 3c
> 00 00 31 f6*
> *dic 20 17:07:10 com-exp-dev kernel: RSP: 002b:00007f0cc3ffeab0 EFLAGS:
> 00000282 ORIG_RAX: 00000000000000ca*
> *dic 20 17:07:10 com-exp-dev kernel: RAX: ffffffffffffffda RBX:
> 0000000000000000 RCX: 00007f0cc945d376*
> *dic 20 17:07:10 com-exp-dev kernel: RDX: 0000000000000000 RSI:
> 0000000000000080 RDI: 000055b4f8812d74*
> *dic 20 17:07:10 com-exp-dev kernel: RBP: 000055b4f8812d48 R08:
> 0000000000000001 R09: 0000000000000004*
> *dic 20 17:07:10 com-exp-dev kernel: R10: 0000000000000000 R11:
> 0000000000000282 R12: 000055b4f8812d6c*
> *dic 20 17:07:10 com-exp-dev kernel: R13: 000055b4f880ba60 R14:
> 00007f0cc3ffeaf0 R15: 000055b4f8812d74*
> *dic 20 17:07:10 com-exp-dev kernel: Modules linked in: fuse rtpacket
> binfmt_misc nls_ascii nls_cp437 vfat fat evdev x86_pkg_temp_thermal
> rt_e1000e intel_powerclamp i915 crc32c_intel rtnet i2c_algo_bit
> drm_kms_helper cfbfillrect syscopyarea cfbimgblt sysfillrect sysimgblt
> fb_sys_fops cfbcopyarea fb efi_pstore font fbdev efivars intel_pch_thermal
> video button loop sch_fq_codel msr drm drm_panel_orientation_quirks sunrpc
> efivarfs autofs4 e1000e i2c_i801 xhci_pci xhci_hcd ptp pps_core ahci
> libahci usbcore libata usb_common*
> *dic 20 17:07:10 com-exp-dev kernel: ---[ end trace d36b472eaef981ca ]---*
> *dic 20 17:07:10 com-exp-dev kernel: RIP: 0010:0xfffffffeee9e41b1*
> *dic 20 17:07:10 com-exp-dev kernel: Code: Bad RIP value.*
> *dic 20 17:07:10 com-exp-dev kernel: RSP: 0018:ffffa6170334fd28 EFLAGS:
> 00010246*
> *dic 20 17:07:10 com-exp-dev kernel: RAX: 00000000000002ff RBX:
> 0000000000000000 RCX: fffffffeee9e73b8*
> *dic 20 17:07:10 com-exp-dev kernel: RDX: 00000000000000a1 RSI:
> ffff8884d8371400 RDI: ffffa61704f8fdcc*
> *dic 20 17:07:10 com-exp-dev kernel: RBP: ffff8884d8371000 R08:
> fffffffeee9e73b8 R09: ffffa61704f8fdd0*
> *dic 20 17:07:10 com-exp-dev kernel: R10: 00000000000002ff R11:
> 0000000000000018 R12: ffff8884d8371000*
> *dic 20 17:07:10 com-exp-dev kernel: R13: ffff8884d8371400 R14:
> ffffa61704f8fdcc R15: ffff8884c8331d84*
> *dic 20 17:07:10 com-exp-dev kernel: FS:  00007f0cc3fff700(0000)
> GS:ffff8884df500000(0000) knlGS:0000000000000000*
> *dic 20 17:07:10 com-exp-dev kernel: CS:  0010 DS: 0000 ES: 0000 CR0:
> 0000000080050033*
> *dic 20 17:07:10 com-exp-dev kernel: CR2: fffffffeee9e4187 CR3:
> 000000044da2c004 CR4: 00000000003606e0*
> *dic 20 17:07:10 com-exp-dev kernel: DR0: 0000000000000000 DR1:
> 0000000000000000 DR2: 0000000000000000*
> *dic 20 17:07:10 com-exp-dev kernel: DR3: 0000000000000000 DR6:
> 00000000fffe0ff0 DR7: 0000000000000400*
> *dic 20 17:07:48 com-exp-dev kernel: rcu: INFO: rcu_sched self-detected
> stall on CPU*
> *dic 20 17:07:48 com-exp-dev kernel: rcu:         5-....: (20999 ticks this
> GP) idle=85e/1/0x4000000000000004 softirq=7766/7766 fqs=5246*
> *dic 20 17:07:48 com-exp-dev kernel: rcu:          (t=21000 jiffies g=13261
> q=147)*
> *dic 20 17:07:48 com-exp-dev kernel: NMI backtrace for cpu 5*
> *dic 20 17:07:48 com-exp-dev kernel: CPU: 5 PID: 387 Comm: NetworkManager
> Tainted: G      D           4.19.140-xeno-ipipe-3.1 #1*
> *dic 20 17:07:48 com-exp-dev kernel: Hardware name:  /TS175, BIOS BQKLR112
> 07/04/2017*
> *dic 20 17:07:48 com-exp-dev kernel: I-pipe domain: Linux*
> *dic 20 17:07:48 com-exp-dev kernel: Call Trace:*
> *dic 20 17:07:48 com-exp-dev kernel:  <IRQ>*
> *dic 20 17:07:48 com-exp-dev kernel:  dump_stack+0x98/0xbc*
> *dic 20 17:07:48 com-exp-dev kernel:  nmi_cpu_backtrace.cold+0x14/0x54*
> *dic 20 17:07:48 com-exp-dev kernel:  ? lapic_can_unplug_cpu.cold+0x39/0x39*
> *dic 20 17:07:48 com-exp-dev kernel:
> nmi_trigger_cpumask_backtrace+0xfa/0xfc*
> *dic 20 17:07:48 com-exp-dev kernel:
> arch_trigger_cpumask_backtrace+0x14/0x20*
> *dic 20 17:07:48 com-exp-dev kernel:  rcu_dump_cpu_stacks+0x96/0xca*
> *dic 20 17:07:48 com-exp-dev kernel:  rcu_check_callbacks.cold+0x20c/0x35e*
> *dic 20 17:07:48 com-exp-dev kernel:  update_process_times+0x40/0x80*
> *dic 20 17:07:48 com-exp-dev kernel:  tick_sched_handle.isra.0+0x2f/0x50*
> *dic 20 17:07:48 com-exp-dev kernel:  tick_sched_timer+0x3b/0x80*
> *dic 20 17:07:48 com-exp-dev kernel:  ? tick_sched_handle.isra.0+0x50/0x50*
> *dic 20 17:07:48 com-exp-dev kernel:  __hrtimer_run_queues+0xe6/0x190*
> *dic 20 17:07:48 com-exp-dev kernel:  hrtimer_interrupt+0x104/0x220*
> *dic 20 17:07:48 com-exp-dev kernel:  ? ___xnsched_run+0x2f5/0x4f0*
> *dic 20 17:07:48 com-exp-dev kernel:  smp_apic_timer_interrupt+0x45/0x90*
> *dic 20 17:07:48 com-exp-dev kernel:  ?
> smp_call_function_single_interrupt+0x10/0x10*
> *dic 20 17:07:48 com-exp-dev kernel:  __ipipe_do_IRQ+0x46/0x80*
> *dic 20 17:07:48 com-exp-dev kernel:  __ipipe_do_sync_stage+0x143/0x180*
> *dic 20 17:07:48 com-exp-dev kernel:  __ipipe_do_sync_pipeline+0xb1/0xc0*
> *dic 20 17:07:48 com-exp-dev kernel:  dispatch_irq_head+0xe1/0x110*
> *dic 20 17:07:48 com-exp-dev kernel:  __ipipe_dispatch_irq+0x198/0x1c0*
> *dic 20 17:07:48 com-exp-dev kernel:  __ipipe_handle_irq+0x89/0x1f0*
> *dic 20 17:07:48 com-exp-dev kernel:  apic_timer_interrupt+0x12/0x40*
> *dic 20 17:07:48 com-exp-dev kernel:  </IRQ>*
> *dic 20 17:07:48 com-exp-dev kernel: RIP:
> 0010:queued_spin_lock_slowpath+0xde/0x190*
> *dic 20 17:07:48 com-exp-dev kernel: Code: f7 41 89 c0 66 45 31 c0 41 39 c8
> 0f 84 ac 00 00 00 48 85 f6 c6 07 01 74 22 c7 46 08 01 00 00 00 65 ff 0d f1
> 5b 21 51 c3 f3 90 <8b> 07 84 c0 75 f8 b8 01 00 00 00 66 89 07 c3 f3 90 48
> 8b 02 48 85*
> *dic 20 17:07:48 com-exp-dev kernel: RSP: 0018:ffffa617036dbc48 EFLAGS:
> 00000202 ORIG_RAX: ffffffffffffff13*
> *dic 20 17:07:48 com-exp-dev kernel: RAX: 0000000000140101 RBX:
> ffffa61705548000 RCX: ffff8884d595ba40*
> *dic 20 17:07:48 com-exp-dev kernel: RDX: 0000000000000001 RSI:
> 0000000000000000 RDI: ffffffffafc9ec70*
> *dic 20 17:07:48 com-exp-dev kernel: RBP: ffffa617036dbc50 R08:
> 8000000000000063 R09: ffff8884cd8e35c0*
> *dic 20 17:07:48 com-exp-dev kernel: R10: ffffa617036dbb88 R11:
> 0000000000000004 R12: ffffff8000000000*
> *dic 20 17:07:48 com-exp-dev kernel: R13: 0000008000000000 R14:
> ffffa61705547fff R15: ffffa61705548000*
> *dic 20 17:07:48 com-exp-dev kernel:  ? kmem_cache_alloc_node+0x166/0x1d0*
> *dic 20 17:07:48 com-exp-dev kernel:  ? _raw_spin_lock+0x1b/0x20*
> *dic 20 17:07:48 com-exp-dev kernel:
> __ipipe_pin_mapping_globally+0x6a/0xa5*
> *dic 20 17:07:48 com-exp-dev kernel:  vmap_page_range_noflush+0x28c/0x320*
> *dic 20 17:07:48 com-exp-dev kernel:  map_vm_area+0x30/0x40*
> *dic 20 17:07:48 com-exp-dev kernel:  __vmalloc_node_range+0x1ca/0x260*
> *dic 20 17:07:48 com-exp-dev kernel:  copy_process.part.0+0x6c8/0x1c00*
> *dic 20 17:07:48 com-exp-dev kernel:  ? _do_fork+0xd8/0x330*
> *dic 20 17:07:48 com-exp-dev kernel:  ? __alloc_file+0x70/0xe0*
> *dic 20 17:07:48 com-exp-dev kernel:  ? alloc_empty_file+0x63/0xb0*
> *dic 20 17:07:48 com-exp-dev kernel:  _do_fork+0xd8/0x330*
> *dic 20 17:07:48 com-exp-dev kernel:  ? __sys_socketpair+0x17d/0x230*
> *dic 20 17:07:48 com-exp-dev kernel:  __x64_sys_clone+0x22/0x30*
> *dic 20 17:07:48 com-exp-dev kernel:  do_syscall_64+0x6d/0x250*
> *dic 20 17:07:48 com-exp-dev kernel:
> entry_SYSCALL_64_after_hwframe+0x44/0xa9*
> *dic 20 17:07:48 com-exp-dev kernel: RIP: 0033:0x7f4153096285*
> *dic 20 17:07:48 com-exp-dev kernel: Code: 48 85 ff 74 3d 48 85 f6 74 38 48
> 83 ee 10 48 89 4e 08 48 89 3e 48 89 d7 4c 89 c2 4d 89 c8 4c 8b 54 24 08 b8
> 38 00 00 00 0f 05 <48> 85 c0 7c 13 74 01 c3 31 ed 58 5f ff d0 48 89 c7 b8
> 3c 00 00 00*
> *dic 20 17:07:48 com-exp-dev kernel: RSP: 002b:00007ffc7a31e0f8 EFLAGS:
> 00000206 ORIG_RAX: 0000000000000038*
> *dic 20 17:07:48 com-exp-dev kernel: RAX: ffffffffffffffda RBX:
> 00007f414b7fe700 RCX: 00007f4153096285*
> *dic 20 17:07:48 com-exp-dev kernel: RDX: 00007f414b7fe9d0 RSI:
> 00007f414b7fdb30 RDI: 00000000003d0f00*
> *dic 20 17:07:48 com-exp-dev kernel: RBP: 00007ffc7a31e1b0 R08:
> 00007f414b7fe700 R09: 00007f414b7fe700*
> *dic 20 17:07:48 com-exp-dev kernel: R10: 00007f414b7fe9d0 R11:
> 0000000000000206 R12: 00007ffc7a31e1ae*
> *dic 20 17:07:48 com-exp-dev kernel: R13: 00007ffc7a31e1af R14:
> 00007ffc7a31e1b0 R15: 00007f414b7fdb40*
> 
> 
> 
> The third attempt was to try out kernel 5.10.89 plus the new dovetail
> patch, and Xenomai v3.2.1. Again, all the weird issues are gone and the
> system is stable. However, we are unable to have the system pass our suite
> of "stress tests". Unlike with 4.19-ipipe, the system survives for a
> longer time before hanging (sometimes a few hours), but this also varies a
> lot.
> 
> After some more investigation, we found out something interesting. By
> removing the code that interacts with Process A, Process B is then able to
> run "forever" (overnight at least), but *only if Process A is not running*.
> Otherwise, the system will hang. In other words, the mere presence of
> Process A is affecting Process B, even though both IDDP and ZMQ have been
> removed from B and replaced with fake data. Furthermore, the system does
> not freeze if we set B1's scheduling policy to SCHED_OTHER.

Do you have the Xenomai watchdog enabled, so that you can tell RT
application "hangs" (an infinite loop at high priority) apart from real
hangs or crashes?
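
A quick way to check this is to look at the kernel .config. The sketch
below assumes the Cobalt option names CONFIG_XENO_OPT_WATCHDOG and
CONFIG_XENO_OPT_WATCHDOG_TIMEOUT; the helper name is made up for
illustration, so please verify against your own config.

```python
import re


def watchdog_settings(config_text):
    """Return (enabled, timeout) read from a kernel .config given as a string.

    timeout is the configured integer value, or None if the watchdog is
    disabled or no timeout line is present.
    """
    enabled = re.search(r"^CONFIG_XENO_OPT_WATCHDOG=y$",
                        config_text, re.M) is not None
    match = re.search(r"^CONFIG_XENO_OPT_WATCHDOG_TIMEOUT=(\d+)$",
                      config_text, re.M)
    timeout = int(match.group(1)) if (enabled and match) else None
    return enabled, timeout
```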

> 
> We have also run one more test where we disable B's non-RT thread, so that
> B is now single-threaded, and only runs B1 (SCHED_FIFO). We could therefore
> remove all mutexes and condition variables from the system, and the system
> is then able to run indefinitely. Notice that, even in this single-thread
> mode, the system still hangs if mutexes are left in their place.
> 
> From these - rather heuristic - tests, it looks like there could be some
> coupling between unrelated processes which causes some sort of bug, that is
> probably related to some interaction with mutexes/condvars, when these are
> used from a RT context. This issue shows up (or at least we have seen it)
> only under Ubuntu 20.04 (GCC 9.x), whereas a 18.04 build (GCC 7.x) looks
> fine.

Ubuntu toolchains are known for aggressively enabling certain security
features. Maybe one that we have not checked yet flipped between 18.04 and
20.04, provided that this switch is the only difference between working and
non-working builds in your case. GCC itself should be fine: we are
testing with gcc-10 via Debian 11 in our CI.

Can you check whether the toolchain change breaks the kernel, i.e. whether
a kernel built with the old toolchain runs fine with userspace built with
the new toolchain?

> 
> The purpose of this message is twofold.
> First, to see if these symptoms might "ring a bell" to anyone in the
> community, who might be able to suggest a fix.
> Second, we'd like to ask what you would do to debug this issue. Which tool
> could we use to trace what's going on, considering that whatever the bug
> is, it leads to a state where the machine is not usable at all. We can
> share our .config files if required, and we are willing to test more
> combinations of kernel and xenomai patch or library versions upon your
> advice. Any help you can give us is greatly appreciated.
> 

Can you simplify your test case to a level that makes it shareable and
executable by third parties? Please also share your kernel .config.
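
To illustrate what a shareable harness could look like: the start, run,
kill cycle you describe could be driven by something as small as the
sketch below. It is not your test suite, only an assumption of its
shape. The function name is made up, and SIGKILL is assumed as the way
process B gets terminated.

```python
import subprocess
import sys


def stress_cycle(command, cycles, run_seconds):
    """Start `command`, let it run for `run_seconds`, kill it, and repeat.

    Returns the number of completed cycles. A process that exits on its
    own before the timeout still counts as a completed cycle.
    """
    completed = 0
    for i in range(cycles):
        proc = subprocess.Popen(command)
        try:
            # Raises TimeoutExpired if the process is still running,
            # which is the expected case here.
            proc.wait(timeout=run_seconds)
        except subprocess.TimeoutExpired:
            proc.kill()  # sends SIGKILL
            proc.wait()  # reap the child before starting the next cycle
        completed += 1
        # Flush so the last completed cycle is visible if the machine hangs.
        print(f"cycle {i + 1}/{cycles} done", file=sys.stderr, flush=True)
    return completed
```

A call such as stress_cycle(["./process_b"], cycles=1000, run_seconds=5)
would then run against a stripped-down process B while process A runs in
the background; "./process_b" is a placeholder path.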

Jan

-- 
Siemens AG, Technology
Competence Center Embedded Linux
