Re: [systemd-devel] systemctl hangs with 249.7 systemd in yocto Honister release
On 2023/1/4 8:59 PM, Michal Koutný wrote:
> On Wed, Jan 04, 2023 at 07:13:59PM +0800, Heyi Guo wrote:
>> Jan 04 16:10:57 ali2600 systemd[1]: Caught <SEGV>, dumped core as pid 7516.
>> Jan 04 16:10:57 ali2600 systemd[1]: Freezing execution.
>> Jan 04 16:10:57 ali2600 phosphor-dump-manager[7536]: Failed to list units: Transport endpoint is not connected
>>
>> Is this the reason systemctl fails to work? The log says "systemd freezing execution".
>
> Yes, see the line above, there's SIGSEGV in PID 1.
> (Given the other SIGSEGVs, it looks like a common cause across different processes, e.g. a screwed-up libc update or similar.)
>
> Also, based on the same line, you may be able to extract the coredump from /var/lib/systemd/coredump (depends on coredump.conf:Storage=) and figure out more.

The core shows something like this:

(gdb) bt
#0  0x00485848 in prioq_peek_by_index (idx=0, q=0x80808080) at ../git/src/basic/prioq.c:272
#1  prioq_peek (q=0x80808080) at ../git/src/basic/prioq.h:25
#2  process_timer (n=<optimized out>, d=d@entry=0x6f6600, e=<optimized out>) at ../git/src/libsystemd/sd-event/sd-event.c:3181
#3  0x00486594 in process_timer (d=0x6f6600, n=<optimized out>, e=0x6f65c8) at ../git/src/libsystemd/sd-event/sd-event.c:4124
#4  sd_event_wait (e=e@entry=0x6f65c8, timeout=<optimized out>) at ../git/src/libsystemd/sd-event/sd-event.c:4124
#5  0x00486bb0 in sd_event_run (timeout=18446744073709551615, e=0x6f65c8) at ../git/src/systemd/sd-event.h:172
#6  sd_event_loop (e=0x6f65c8) at ../git/src/libsystemd/sd-event/sd-event.c:4264
#7  0x0040c59c in main_loop (manager=0x0) at ../git/src/udev/udevd.c:1872
#8  run_udevd (argv=<optimized out>, argc=<optimized out>) at ../git/src/udev/udevd.c:1984
#9  run (argv=<optimized out>, argc=<optimized out>) at ../git/src/udev/udevadm.c:116
#10 main (argc=<optimized out>, argv=<optimized out>) at ../git/src/udev/udevadm.c:133

Thanks,
Heyi

> Michal
Re: [systemd-devel] systemctl hangs with 249.7 systemd in yocto Honister release
On Wed, Jan 04, 2023 at 07:13:59PM +0800, Heyi Guo wrote:
> Jan 04 16:10:57 ali2600 systemd[1]: Caught <SEGV>, dumped core as pid 7516.
> Jan 04 16:10:57 ali2600 systemd[1]: Freezing execution.
> Jan 04 16:10:57 ali2600 phosphor-dump-manager[7536]: Failed to list units:
> Transport endpoint is not connected
>
> Is this the reason systemctl fails to work? The log says "systemd
> freezing execution".

Yes, see the line above, there's SIGSEGV in PID 1.
(Given the other SIGSEGVs, it looks like a common cause across different processes, e.g. a screwed-up libc update or similar.)

Also, based on the same line, you may be able to extract the coredump from /var/lib/systemd/coredump (depends on coredump.conf:Storage=) and figure out more.

Michal
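As an illustration of the coredump hint above, here is a minimal Python sketch that lists the most recent files in the coredump directory. It assumes Storage=external, so that cores are stored as files under /var/lib/systemd/coredump; the helper name `newest_cores` and its `limit` parameter are hypothetical and not part of systemd.

```python
import os
from pathlib import Path

# Location used by systemd-coredump when coredump.conf has Storage=external.
COREDUMP_DIR = Path("/var/lib/systemd/coredump")

def newest_cores(directory=COREDUMP_DIR, limit=5):
    """Return up to `limit` regular files in `directory`, newest first.

    Hypothetical helper for illustration. Returns an empty list when the
    directory is missing or unreadable (for example with Storage=none or
    Storage=journal, or when not running as root).
    """
    try:
        files = [p for p in Path(directory).iterdir() if p.is_file()]
    except OSError:
        return []
    files.sort(key=lambda p: p.stat().st_mtime, reverse=True)
    return files[:limit]
```

Calling `newest_cores()` as root on the affected board would show which cores are available to load into gdb; the files may be compressed, depending on the configuration.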
Re: [systemd-devel] systemctl hangs with 249.7 systemd in yocto Honister release
Hi Michal,

Actually, we have upgraded systemd to version 250.5, but the issue still happens. Looking through the journal around the point where the error message is first printed, I found a segfault in systemd-udevd:

Jan 04 16:10:40 ali2600 systemd[1]: Created slice Slice /system/systemd-coredump.
Jan 04 16:10:40 ali2600 systemd[1]: Started Process Core Dump (PID 7507/UID 0).
Jan 04 16:10:42 ali2600 systemd-coredump[7508]: elfutils disabled, parsing ELF objects not supported
Jan 04 16:10:42 ali2600 systemd-coredump[7508]: [LNK] Process 173 (systemd-udevd) of user 0 dumped core.
Jan 04 16:10:42 ali2600 systemd[1]: systemd-udevd.service: Main process exited, code=dumped, status=11/SEGV
Jan 04 16:10:42 ali2600 systemd[1]: systemd-udevd.service: Killing process 7503 (systemd-udevd) with signal SIGKILL.
Jan 04 16:10:42 ali2600 systemd[1]: systemd-udevd.service: Killing process 7503 (systemd-udevd) with signal SIGKILL.
Jan 04 16:10:42 ali2600 systemd[1]: systemd-udevd.service: Failed with result 'core-dump'.
Jan 04 16:10:42 ali2600 systemd[1]: systemd-udevd.service: Scheduled restart job, restart counter is at 1.
Jan 04 16:10:42 ali2600 systemd[1]: Stopped Rule-based Manager for Device Events and Files.
Jan 04 16:10:42 ali2600 systemd[1]: Starting Rule-based Manager for Device Events and Files...
Jan 04 16:10:42 ali2600 systemd[1]: systemd-coredump@0-7507-0.service: Deactivated successfully.
Jan 04 16:10:42 ali2600 systemd-udevd[7510]: corrupted size vs. prev_size
..
Jan 04 16:10:57 ali2600 systemd-coredump[7517]: elfutils disabled, parsing ELF objects not supported
Jan 04 16:10:57 ali2600 systemd-coredump[7517]: [LNK] Process 7516 (systemd) of user 0 dumped core.
Jan 04 16:10:57 ali2600 phosphor-dump-manager[356]: *** stack smashing detected ***: terminated
Jan 04 16:10:57 ali2600 phosphor-dump-monitor[280]: Failed to create dump: sd_bus_call noreply: org.freedesktop.DBus.Error.NoReply: Remote peer disconnected
Jan 04 16:10:57 ali2600 systemd[1]: Caught <SEGV>, dumped core as pid 7516.
Jan 04 16:10:57 ali2600 systemd[1]: Freezing execution.
Jan 04 16:10:57 ali2600 phosphor-dump-manager[7536]: Failed to list units: Transport endpoint is not connected

Is this the reason systemctl fails to work? The log says "systemd freezing execution".

Thanks,
Heyi

On 2023/1/4 6:48 PM, Michal Koutný wrote:
> On Wed, Jan 04, 2023 at 04:51:22PM +0800, Heyi Guo wrote:
>> The issue happened again, but the /proc/1/stack and /proc/$pid_of_dbus-broker/stack are both empty on our platform.
>
> (You previously reported that the version was v249, which is behind the last two upstream versions, so it may be a good idea to raise the issue with your distro.)
>
>> I checked kernel config and confirmed that CONFIG_STACKTRACE is enabled:
>>
>> zcat /proc/config.gz | grep CONFIG_STACKTRACE
>> CONFIG_STACKTRACE_SUPPORT=y
>> # CONFIG_STACKTRACE_BUILD_ID is not set
>> CONFIG_STACKTRACE=y
>>
>> Is there any other config that is missing?
>
> I don't think so (the file wouldn't be present otherwise).
>
> If there are no kernel stacks, the tasks are executing in userspace and, given that they are stuck indefinitely, they're likely looping somewhere (or you were unlucky and missed a syscall), which should show up in their CPU consumption.
>
> The userspace stack may be of interest then, e.g. `gdb -ex "bt" --batch -p 1` (for PID 1; debuginfo for the involved binaries must be present to obtain useful info).
>
> Michal
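When scanning a long journal for crashes like the ones in the excerpt above, a small script can pull out every process that dumped core. The following is a minimal Python sketch; the regular expression is written against the systemd-coredump lines quoted in this message, and the names `CORE_RE` and `find_core_dumps` are illustrative assumptions.

```python
import re

# Matches the systemd-coredump summary line as it appears in the excerpt.
CORE_RE = re.compile(r"Process (\d+) \((.+?)\) of user (\d+) dumped core")

def find_core_dumps(journal_text):
    """Return a (pid, command, uid) tuple for each 'dumped core' line."""
    return [(int(pid), comm, int(uid))
            for pid, comm, uid in CORE_RE.findall(journal_text)]

# Lines copied from the journal excerpt in this thread.
SAMPLE = """\
Jan 04 16:10:42 ali2600 systemd-coredump[7508]: [LNK] Process 173 (systemd-udevd) of user 0 dumped core.
Jan 04 16:10:42 ali2600 systemd[1]: systemd-udevd.service: Main process exited, code=dumped, status=11/SEGV
Jan 04 16:10:57 ali2600 systemd-coredump[7517]: [LNK] Process 7516 (systemd) of user 0 dumped core.
"""

print(find_core_dumps(SAMPLE))
# prints [(173, 'systemd-udevd', 0), (7516, 'systemd', 0)]
```

Listing the crashes in order makes it easier to see that systemd-udevd crashed about 15 seconds before PID 1 did.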
Re: [systemd-devel] systemctl hangs with 249.7 systemd in yocto Honister release
On Wed, Jan 04, 2023 at 04:51:22PM +0800, Heyi Guo wrote:
> The issue happened again, but the /proc/1/stack and
> /proc/$pid_of_dbus-broker/stack are both empty on our platform.

(You previously reported that the version was v249, which is behind the last two upstream versions, so it may be a good idea to raise the issue with your distro.)

> I checked kernel config and confirmed that CONFIG_STACKTRACE is enabled:
>
> zcat /proc/config.gz | grep CONFIG_STACKTRACE
> CONFIG_STACKTRACE_SUPPORT=y
> # CONFIG_STACKTRACE_BUILD_ID is not set
> CONFIG_STACKTRACE=y
>
> Is there any other config that is missing?

I don't think so (the file wouldn't be present otherwise).

If there are no kernel stacks, the tasks are executing in userspace and, given that they are stuck indefinitely, they're likely looping somewhere (or you were unlucky and missed a syscall), which should show up in their CPU consumption.

The userspace stack may be of interest then, e.g. `gdb -ex "bt" --batch -p 1` (for PID 1; debuginfo for the involved binaries must be present to obtain useful info).

Michal
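One way to check the CPU-consumption hint above is to sample the utime and stime counters in /proc/<pid>/stat twice and compare them. This is a minimal Python sketch under that assumption; the function names `cpu_ticks` and `sample_cpu` are hypothetical, and the field positions follow the proc(5) description of the stat file.

```python
import time

def cpu_ticks(stat_line):
    """Return (utime, stime) in clock ticks from a /proc/<pid>/stat line."""
    # The command name (field 2) may contain spaces or parentheses, so cut
    # the line after the last ')' instead of splitting it all on whitespace.
    rest = stat_line[stat_line.rfind(")") + 1:].split()
    # rest[0] is field 3 (state); utime and stime are fields 14 and 15.
    return int(rest[11]), int(rest[12])

def sample_cpu(pid, interval=1.0):
    """Return the clock ticks consumed by `pid` during `interval` seconds."""
    def read_total():
        with open(f"/proc/{pid}/stat") as f:
            return sum(cpu_ticks(f.read()))
    before = read_total()
    time.sleep(interval)
    return read_total() - before
```

For example, `sample_cpu(1)` returning a value close to `os.sysconf("SC_CLK_TCK")` for a one-second interval would suggest that PID 1 is spinning in userspace, whereas a value near zero would suggest it is blocked or frozen.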
Re: [systemd-devel] systemctl hangs with 249.7 systemd in yocto Honister release
On 2022/12/1 9:57 PM, Michal Koutný wrote:
> Hello Heyi.
>
> On Tue, Nov 29, 2022 at 12:44:12PM +0800, Heyi Guo wrote:
>> Is there any known issue which will cause this problem? Or do you have any suggestion on how to debug?
>
> As written in the report, it looks like dbus-daemon or PID 1 itself is not responding. Some insights may be obtained by looking at /proc/1/stack and /proc/$pid_of_dbus/stack (as root).

Hi Michal,

The issue happened again, but the /proc/1/stack and /proc/$pid_of_dbus-broker/stack are both empty on our platform.

I checked kernel config and confirmed that CONFIG_STACKTRACE is enabled:

zcat /proc/config.gz | grep CONFIG_STACKTRACE
CONFIG_STACKTRACE_SUPPORT=y
# CONFIG_STACKTRACE_BUILD_ID is not set
CONFIG_STACKTRACE=y

Is there any other config that is missing?

Thanks,
Heyi

> HTH,
> Michal
Re: [systemd-devel] systemctl hangs with 249.7 systemd in yocto Honister release
Thanks very much; we'll try to get the information when the issue happens next time.

Heyi

On 2022/12/1 9:57 PM, Michal Koutný wrote:
> Hello Heyi.
>
> On Tue, Nov 29, 2022 at 12:44:12PM +0800, Heyi Guo wrote:
>> Is there any known issue which will cause this problem? Or do you have any suggestion on how to debug?
>
> As written in the report, it looks like dbus-daemon or PID 1 itself is not responding. Some insights may be obtained by looking at /proc/1/stack and /proc/$pid_of_dbus/stack (as root).
>
> HTH,
> Michal
Re: [systemd-devel] systemctl hangs with 249.7 systemd in yocto Honister release
Hello Heyi.

On Tue, Nov 29, 2022 at 12:44:12PM +0800, Heyi Guo wrote:
> Is there any known issue which will cause this problem? Or do you have any
> suggestion on how to debug?

As written in the report, it looks like dbus-daemon or PID 1 itself is not responding. Some insights may be obtained by looking at /proc/1/stack and /proc/$pid_of_dbus/stack (as root).

HTH,
Michal
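The stack files mentioned above can also be read from a script, which is handy when the shell on the target is limited. Here is a minimal Python sketch, to be run as root; the helper name `kernel_stack` and its `proc_root` parameter (which exists only so the function can be pointed at a test directory) are hypothetical.

```python
def kernel_stack(pid, proc_root="/proc"):
    """Return the kernel stack lines for `pid`, or None if unreadable."""
    try:
        with open(f"{proc_root}/{pid}/stack") as f:
            return [line.strip() for line in f if line.strip()]
    except OSError:
        # The file is missing, the PID does not exist, or we are not root.
        return None

if __name__ == "__main__":
    print(kernel_stack(1))
```

A result of `None` means the file could not be read at all, while an empty list means the file exists but shows no kernel frames.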
[systemd-devel] systemctl hangs with 249.7 systemd in yocto Honister release
Hi all,

We are running the OpenBMC 2.11.0 release, which is based on Yocto Honister and systemd 249.7, and we find that systemctl hangs occasionally. The symptoms are just like those in the issue below:

https://github.com/openbmc/openbmc/issues/1097

The systemctl command always times out, and dmesg keeps printing the message below:

systemd-journald[445]: Failed to send WATCHDOG=1 notification message: Transport endpoint is not connected

Is there any known issue which will cause this problem? Or do you have any suggestion on how to debug?

I don't see any resolution for the above bug before it was closed.

Thanks,
Heyi
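For context on the journald message above: a service with a watchdog configured periodically sends the string WATCHDOG=1 as a datagram to the Unix socket named in the NOTIFY_SOCKET environment variable, and the service manager listens on the other end. The following is a minimal Python sketch of the sending side of that protocol, not journald's actual implementation; the function name `sd_notify` mirrors the C API, but this Python version and its explicit `notify_socket` parameter are assumptions for illustration.

```python
import socket

def sd_notify(state, notify_socket):
    """Send one notification datagram such as "WATCHDOG=1"."""
    if notify_socket.startswith("@"):
        # A leading '@' denotes a Linux abstract-namespace socket.
        addr = b"\0" + notify_socket[1:].encode()
    else:
        addr = notify_socket
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state.encode(), addr)
```

In a real service this would be called as `sd_notify("WATCHDOG=1", os.environ["NOTIFY_SOCKET"])`. If the receiving side is gone or frozen, the send fails with an OSError, which is the kind of failure the journald message reports.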