On 19.04.22 12:02, Arturo Laurenzi wrote:
> Sorry for the delayed answer, it took us some time to instrument our
> setup for broadcasting the kernel output over serial,
> and now we have some interesting results.
> See below.
> 
>> On 05.04.22 15:43, Arturo Laurenzi wrote:
>>>> On 04.04.22 15:21, Arturo Laurenzi via Xenomai wrote:
>>>
>>>>>
>>>>> Recently, we have started a transition towards Ubuntu 20.04, and things
>>>>> have started to break.
>>>>>
>>>>> The first attempt was to install kernel 5.4.151 and stick to ipipe. Under
>>>>> this setup, we experience issues even before starting our applications. We
>>>>> have seen random crashes while compiling with GCC, sporadic "System Program
>>>>> Problem Detected" popups by Ubuntu, and others. We even tried to re-install
>>>>> the OS and kernel from scratch, with no luck.
>>>>
>>>> A reference setup for this kernel line can be found in xenomai-images
>>>> (https://source.denx.de/Xenomai/xenomai-images). Would be good to
>>>> understand which deviation from it makes the difference for which
>>>> component (see also further questions below).
>>>
>>> I'm attaching the config we're using (from /boot/config-$(uname -r)).
>>> If that makes sense, we're going to try to configure the kernel
>>> according to this file
>>> (https://source.denx.de/Xenomai/xenomai-images/-/blob/master/recipes-kernel/linux/files/amd64_defconfig).
>>> What kernel version do you recommend to try?
>>>
>>
>> Always the latest of the individual kernel series.
> 
> We still have to test the reference .config file, as we gave higher
> priority to the kernel output over serial stuff.
> 
>>>>>
>>>>> The second attempt was to stick to our old kernel 4.19.140. All the weird
>>>>> issues disappear and the system is stable. However, we are unable to have
>>>>> the system pass our suite of "stress tests", which basically involve
>>>>> starting, running, and killing process B multiple times in a cyclic fashion,
>>>>> while process A runs in the background. After a short while (minutes), the
>>>>> whole system just hangs, forcing us to do a hard reset. Only once did we
>>>>> manage to get this kernel oops after rebooting (journalctl -k -b -1 --no-pager).
>>>>>
>>>>
>>>> For reliably recording crashes, it is highly recommended to use a UART
>>>> as kernel debug output.
>>>
>>> Will do ASAP and let you know.
> 
> Done, see below.
> 
>>>>> The third attempt was to try out kernel 5.10.89 plus the new dovetail
>>>>> patch, and Xenomai v3.2.1. Again, all the weird issues are gone and the
>>>>> system is stable. However, we are unable to have the system pass our suite
>>>>> of "stress tests". Unlike with 4.19-ipipe, the system survives longer
>>>>> before hanging (sometimes a few hours), but this also varies a lot.
>>>>>
>>>>> After some more investigation, we found out something interesting. By
>>>>> removing the code that interacts with Process A, Process B is then able to
>>>>> run "forever" (overnight at least), but *only if Process A is not running*.
>>>>> Otherwise, the system will hang. In other words, the mere presence of
>>>>> Process A is affecting Process B, even though both IDDP and ZMQ have been
>>>>> removed from B and replaced with fake data. Furthermore, the system does
>>>>> not freeze if we set B1's scheduling policy to SCHED_OTHER.
>>>>
>>>> Do you have the Xenomai watchdog enabled, so that you can tell RT
>>>> application "hangs" (infinite loop at high prio) apart from real
>>>> hangs/crashes?
>>>
>>> Yes. When we run a while(true) loop inside a RT context, we see the
>>> watchdog killing our application as expected.
>>>
>>>
>>>>>
>>>>> From these - rather heuristic - tests, it looks like there could be some
>>>>> coupling between unrelated processes which triggers some sort of bug,
>>>>> probably related to some interaction with mutexes/condvars when these are
>>>>> used from a RT context. This issue shows up (or at least we have only seen
>>>>> it) under Ubuntu 20.04 (GCC 9.x), whereas a 18.04 build (GCC 7.x) looks
>>>>> fine.
>>>>
>>>> Ubuntu toolchains are known for aggressively enabling certain security
>>>> features. Maybe one that we didn't check yet flipped between 18.04 and
>>>> 20.04 - if that switch is the only difference between working and
>>>> non-working builds in your case. GCC itself should be fine; we are
>>>> testing with gcc-10 via Debian 11 in our CI.
>>>>
>>>> Can you check whether the toolchain change breaks the kernel (kernel
>>>> with old toolchain runs fine with userspace built via new toolchain)?
>>>
>>> We have tried this, and the system still freezes after a while. We
>>> followed this procedure:
>>>  1) generate binaries for our "working" kernel 4.19.140-xeno-ipipe-3.1
>>> on a Ubuntu 18 machine (make deb-pkg)
>>>  2) copy the whole /usr/xenomai directory (compiled with the 18.04
>>> toolchain) to the test machine with Ubuntu 20.04
>>>  3) install the kernel binaries to the test machine
>>>  4) re-compile our application
>>> Is this ok?
>>>
>>
>> Wait, these are three variables: kernel, Xenomai application and Ubuntu
>> userspace. Does your system also break when using both kernel and
>> application binaries from a Ubuntu 18 build? Or will it start to break
>> once you recompile the Xenomai application with Ubuntu 20 toolchain?
> 
> Also this needs further investigation. Right now we're focusing on
> 5.10-dovetail + Xenomai 3.2 + application, all built under the default
> 20.04 toolchain.
> 
>>>>>
>>>>> The purpose of this message is twofold.
>>>>> First, to see if these symptoms might "ring a bell" to anyone in the
>>>>> community, who might be able to suggest a fix.
>>>>> Second, we'd like to ask what you would do to debug this issue. Which tool
>>>>> could we use to trace what's going on, considering that, whatever the bug
>>>>> is, it leads to a state where the machine is not usable at all? We can
>>>>> share our .config files if required, and we are willing to test more
>>>>> combinations of kernel and xenomai patch or library versions upon your
>>>>> advice. Any help you can give us is greatly appreciated.
>>>>>
>>>>
>>>> Can you simplify your test case to a level that makes it sharable,
>>>> executable by third parties? Please also share your kernel .config.
>>>
>>> Will try. It's not going to be quick though, as any trial we make
>>> needs hours of testing to understand if it causes a system freeze.
>>>
>>> What is the recommended way to trace/debug this kind of problem? Is there
>>> anything "fancier" than broadcasting kernel output over a serial port?
>>>
>>
>> Hard to say in general. Full system freezes can be tricky to debug
>> unless there are at least some hints provided by the kernel. That's why
>> the focus is first on validating that.
> 
> In this regard, we managed to produce a stack trace via serial port.
> This is obtained on 5.10-dovetail + Xenomai 3.2 + application, all
> built under the default 20.04 toolchain. This happens consistently in
> both our scenarios, i.e.
>  1) process A interacting with process B via IDDP and ZMQ (i.e. TCP/IP)
>  2) process A and a "modified" process B running at the same time, and
> not interacting in any way
> The stack trace is always the same (I am attaching a few examples).
> 
> [  594.117307] kernel tried to execute NX-protected page - exploit attempt? (uid: 1000)
> [  594.117308] BUG: unable to handle page fault for address: ffffa20908ee1b00
> [  594.117308] #PF: supervisor instruction fetch in kernel mode
> [  594.117308] #PF: error_code(0x0011) - permissions violation
> [  594.117309] PGD 44b601067 P4D 44b601067 PUD 80000001c00001e3
> [  594.117310] Oops: 0011 [#1] SMP PTI IRQ_PIPELINE
> [  594.117310] CPU: 1 PID: 34507 Comm: xbot2-core Not tainted 5.10.89-xeno-ipipe-3.1+ #1
> [  594.117311] Hardware name:  /TS175, BIOS BQKLR112 07/04/2017
> [  594.117311] IRQ stage: Linux
> [  594.117311] RIP: 0010:0xffffa20908ee1b00
> [  594.117312] Code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 <02> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0
> [  594.117312] RSP: 0018:ffffa37b8003cf80 EFLAGS: 00010202
> [  594.117313] RAX: ffffffff84cb1d29 RBX: ffffa37b89e8bd98 RCX: 00000000cd46ea8f
> [  594.117313] RDX: ffffa37b89e8bda0 RSI: ffffa20b9fc40000 RDI: ffffa37b89e8bd98
> [  594.117314] RBP: ffffffff84d10064 R08: ffffa2084005d800 R09: 0000000000000001
> [  594.117314] R10: 0000000000000001 R11: 0000000000000001 R12: 000000000000001e
> [  594.117314] R13: 000000000000001c R14: 0000000000000000 R15: 0000000000000024
> [  594.117315] FS:  00007fef2c191600(0000) GS:ffffa20b9fc40000(0000) knlGS:0000000000000000
> [  594.117315] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  594.117316] CR2: ffffa20908ee1b00 CR3: 000000010095e003 CR4: 00000000003706e0
> [  594.117316] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  594.117316] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  594.117317] Call Trace:
> [  594.117317]  <IRQ>
> [  594.117317]  ? irq_work_run_list+0x32/0x40
> [  594.117317]  ? irq_work_run+0x18/0x30
> [  594.117318]  ? inband_work_interrupt+0x9/0x10
> [  594.117318]  ? handle_synthetic_irq+0x59/0x80
> [  594.117318]  ? asm_call_irq_on_stack+0x12/0x20
> [  594.117319]  </IRQ>
> [  594.117319]  ? arch_do_IRQ_pipelined+0xc2/0x150
> [  594.117319]  ? sync_current_irq_stage+0x1ae/0x230
> [  594.117320]  ? __inband_irq_enable+0x47/0x50
> [  594.117320]  ? inband_irq_restore+0x21/0x30
> [  594.117320]  ? _raw_spin_unlock_irqrestore+0x1d/0x20
> [  594.117320]  ? __set_cpus_allowed_ptr+0xa2/0x200
> [  594.117321]  ? sched_setaffinity+0x1b7/0x2a0
> [  594.117321]  ? __x64_sys_sched_setaffinity+0x4e/0x90
> [  594.117321]  ? do_syscall_64+0x44/0xa0
> [  594.117322]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9

The call-stack is not reported as fully reliable. Are you running with
CONFIG_DEBUG_INFO=y? Do you have CONFIG_UNWINDER_ORC=y?
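If in doubt, this can be checked against the installed build config. A minimal sketch, assuming the Ubuntu convention of shipping it as /boot/config-$(uname -r) (the helper name is just for illustration; adjust the path for custom installs):

```shell
# Report whether the running kernel was built with the options in question.
check_kconfig() {
    # $1: path to a kernel build config file
    grep -E '^CONFIG_(DEBUG_INFO|UNWINDER_ORC)=' "$1"
}

# Prints the matching CONFIG_ lines, or a note if they are unset/missing.
check_kconfig "/boot/config-$(uname -r)" || echo "options not set or config missing"
```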

Assuming it is reliable, the kernel may be trying to run some irq_work
item that no longer exists. But that's speculation.

What may help here is ftrace dump on panic, see
https://www.kernel.org/doc/html/latest/admin-guide/sysctl/kernel.html#ftrace-dump-on-oops
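For reference, a sketch of enabling this at runtime (needs root; on older setups tracefs may be mounted at /sys/kernel/debug/tracing instead):

```shell
# Dump the ftrace ring buffer to the console when an oops occurs.
# This is a runtime toggle; add ftrace_dump_on_oops to the kernel
# command line to have it active from boot.
sysctl -w kernel.ftrace_dump_on_oops=1

# Select a tracer so the buffer contains something useful at crash
# time, e.g. the function tracer:
echo function > /sys/kernel/tracing/current_tracer
```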

Jan

-- 
Siemens AG, Technology
Competence Center Embedded Linux
