Re: [systemd-devel] systemctl hangs with 249.7 systemd in yocto Honister release

2023-01-04 Thread Heyi Guo

On 2023/1/4 8:59 PM, Michal Koutný wrote:
> On Wed, Jan 04, 2023 at 07:13:59PM +0800, Heyi Guo wrote:
>> Jan 04 16:10:57 ali2600 systemd[1]: Caught <SEGV>, dumped core as pid 7516.
>> Jan 04 16:10:57 ali2600 systemd[1]: Freezing execution.
>> Jan 04 16:10:57 ali2600 phosphor-dump-manager[7536]: Failed to list units: Transport endpoint is not connected
>>
>> Is this the reason systemctl fails to work? The log says "systemd
>> freezing execution".
>
> Yes, see the line above, there's SIGSEGV in PID 1.
> (Given the other SIGSEGVs, it looks like a common cause across different
> processes, e.g. screwed libc update or similar.)
>
> Also, based on the same line, you may be able to extract the coredump
> from /var/lib/systemd/coredump (depends on coredump.conf:Storage=) and
> figure out more.


The core shows something like this:

(gdb) bt
#0  0x00485848 in prioq_peek_by_index (idx=0, q=0x80808080) at ../git/src/basic/prioq.c:272
#1  prioq_peek (q=0x80808080) at ../git/src/basic/prioq.h:25
#2  process_timer (n=<optimized out>, d=d@entry=0x6f6600, e=<optimized out>) at ../git/src/libsystemd/sd-event/sd-event.c:3181
#3  0x00486594 in process_timer (d=0x6f6600, n=<optimized out>, e=0x6f65c8) at ../git/src/libsystemd/sd-event/sd-event.c:4124
#4  sd_event_wait (e=e@entry=0x6f65c8, timeout=<optimized out>) at ../git/src/libsystemd/sd-event/sd-event.c:4124
#5  0x00486bb0 in sd_event_run (timeout=18446744073709551615, e=0x6f65c8) at ../git/src/systemd/sd-event.h:172
#6  sd_event_loop (e=0x6f65c8) at ../git/src/libsystemd/sd-event/sd-event.c:4264
#7  0x0040c59c in main_loop (manager=0x0) at ../git/src/udev/udevd.c:1872
#8  run_udevd (argv=<optimized out>, argc=<optimized out>) at ../git/src/udev/udevd.c:1984
#9  run (argv=<optimized out>, argc=<optimized out>) at ../git/src/udev/udevadm.c:116
#10 main (argc=<optimized out>, argv=<optimized out>) at ../git/src/udev/udevadm.c:133


Thanks,

Heyi




> Michal


Re: [systemd-devel] systemctl hangs with 249.7 systemd in yocto Honister release

2023-01-04 Thread Michal Koutný
On Wed, Jan 04, 2023 at 07:13:59PM +0800, Heyi Guo wrote:
> Jan 04 16:10:57 ali2600 systemd[1]: Caught <SEGV>, dumped core as pid 7516.
> Jan 04 16:10:57 ali2600 systemd[1]: Freezing execution.
> Jan 04 16:10:57 ali2600 phosphor-dump-manager[7536]: Failed to list units: Transport endpoint is not connected
> 
> Is this the reason systemctl fails to work? The log says "systemd
> freezing execution".

Yes, see the line above, there's SIGSEGV in PID 1.
(Given the other SIGSEGVs, it looks like a common cause across different
processes, e.g. screwed libc update or similar.)

Also, based on the same line, you may be able to extract the coredump
from /var/lib/systemd/coredump (depends on coredump.conf:Storage=) and
figure out more.
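
For reference, a minimal sketch of how the dump could be located, assuming
coredumpctl and systemd-analyze are installed on the target image (a Yocto
build may omit them). coredump_storage is a hypothetical helper name; it only
reports which Storage= value is in effect and falls back to the documented
default "external". PID 7516 is taken from the log line quoted above, and the
output path is an arbitrary example.

```shell
# Hypothetical helper: print the effective Storage= value from coredump.conf
# text on stdin. The last assignment wins; "external" is the documented default.
coredump_storage() {
    awk -F= '
        /^[ \t]*Storage[ \t]*=/ { gsub(/[ \t]/, "", $2); value = $2 }
        END { print (value == "" ? "external" : value) }
    '
}

# Usage on the affected system (as root):
#   systemd-analyze cat-config systemd/coredump.conf | coredump_storage
#   coredumpctl list                                  # all recorded dumps
#   coredumpctl info 7516                             # metadata for that dump
#   coredumpctl debug 7516                            # open the dump in gdb
#   coredumpctl dump 7516 --output=/tmp/systemd.core  # write the core to a file
```

With Storage=none nothing is kept, so the commands above would find no core
to extract.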

Michal



Re: [systemd-devel] systemctl hangs with 249.7 systemd in yocto Honister release

2023-01-04 Thread Heyi Guo

Hi Michal,

Actually, we have upgraded systemd to version 250.5, but the issue still
happens.


Looking through the journal around the time the error message is first
printed, I found a SEGV in systemd-udevd:


Jan 04 16:10:40 ali2600 systemd[1]: Created slice Slice /system/systemd-coredump.
Jan 04 16:10:40 ali2600 systemd[1]: Started Process Core Dump (PID 7507/UID 0).
Jan 04 16:10:42 ali2600 systemd-coredump[7508]: elfutils disabled, parsing ELF objects not supported
Jan 04 16:10:42 ali2600 systemd-coredump[7508]: [LNK] Process 173 (systemd-udevd) of user 0 dumped core.
Jan 04 16:10:42 ali2600 systemd[1]: systemd-udevd.service: Main process exited, code=dumped, status=11/SEGV
Jan 04 16:10:42 ali2600 systemd[1]: systemd-udevd.service: Killing process 7503 (systemd-udevd) with signal SIGKILL.
Jan 04 16:10:42 ali2600 systemd[1]: systemd-udevd.service: Killing process 7503 (systemd-udevd) with signal SIGKILL.
Jan 04 16:10:42 ali2600 systemd[1]: systemd-udevd.service: Failed with result 'core-dump'.
Jan 04 16:10:42 ali2600 systemd[1]: systemd-udevd.service: Scheduled restart job, restart counter is at 1.
Jan 04 16:10:42 ali2600 systemd[1]: Stopped Rule-based Manager for Device Events and Files.
Jan 04 16:10:42 ali2600 systemd[1]: Starting Rule-based Manager for Device Events and Files...
Jan 04 16:10:42 ali2600 systemd[1]: systemd-coredump@0-7507-0.service: Deactivated successfully.

Jan 04 16:10:42 ali2600 systemd-udevd[7510]: corrupted size vs. prev_size

..

Jan 04 16:10:57 ali2600 systemd-coredump[7517]: elfutils disabled, parsing ELF objects not supported
Jan 04 16:10:57 ali2600 systemd-coredump[7517]: [LNK] Process 7516 (systemd) of user 0 dumped core.
Jan 04 16:10:57 ali2600 phosphor-dump-manager[356]: *** stack smashing detected ***: terminated
Jan 04 16:10:57 ali2600 phosphor-dump-monitor[280]: Failed to create dump: sd_bus_call noreply: org.freedesktop.DBus.Error.NoReply: Remote peer disconnected

Jan 04 16:10:57 ali2600 systemd[1]: Caught <SEGV>, dumped core as pid 7516.
Jan 04 16:10:57 ali2600 systemd[1]: Freezing execution.
Jan 04 16:10:57 ali2600 phosphor-dump-manager[7536]: Failed to list units: Transport endpoint is not connected
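
To pull just these markers out of a long journal, a filter along the
following lines may help. It is only a sketch: crash_lines is a hypothetical
helper name, and the patterns are simply the strings visible in the excerpt
above, so other failure modes would need additional patterns.

```shell
# Hypothetical helper: keep only journal lines that match the crash markers
# seen in the excerpt above (core dumps, SEGV exits, glibc heap/stack aborts,
# and PID 1 freezing).
crash_lines() {
    grep -E 'dumped core|status=[0-9]+/SEGV|corrupted size vs\. prev_size|stack smashing detected|Freezing execution'
}

# Usage on the affected system:
#   journalctl -b --no-pager | crash_lines
```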


Is this the reason systemctl fails to work? The log says "systemd
freezing execution".


Thanks,

Heyi


On 2023/1/4 6:48 PM, Michal Koutný wrote:
> On Wed, Jan 04, 2023 at 04:51:22PM +0800, Heyi Guo wrote:
>> The issue happened again, but the /proc/1/stack and
>> /proc/$pid_of_dbus-broker/stack are both empty on our platform.
>
> (You reported previously the version was v249 (which is behind the last
> two upstream versions, so it may be a good idea to raise the issue with
> your distro.))
>
>> I checked the kernel config and confirmed that CONFIG_STACKTRACE is enabled:
>>
>> zcat /proc/config.gz | grep CONFIG_STACKTRACE
>> CONFIG_STACKTRACE_SUPPORT=y
>> # CONFIG_STACKTRACE_BUILD_ID is not set
>> CONFIG_STACKTRACE=y
>>
>> Is there any other config that is missing?
>
> I don't think so (the file wouldn't be present otherwise).
>
> If there are no kernel stacks, the tasks execute in userspace and given
> the indefinite stuckage, they're likely looping somewhere (or you must
> have been unlucky to miss a syscall), which should manifest in their CPU
> consumption.
>
> The userspace stack may be of interest then, e.g.
> `gdb -ex "bt" --batch -p 1`
>
> (for PID 1; debuginfo for the involved binaries must be present to obtain
> useful info).
>
> Michal


Re: [systemd-devel] systemctl hangs with 249.7 systemd in yocto Honister release

2023-01-04 Thread Michal Koutný
On Wed, Jan 04, 2023 at 04:51:22PM +0800, Heyi Guo wrote:
> The issue happened again, but the /proc/1/stack and
> /proc/$pid_of_dbus-broker/stack are both empty on our platform.

(You reported previously the version was v249 (which is behind the last
two upstream versions, so it may be a good idea to raise the issue with
your distro.))

> I checked the kernel config and confirmed that CONFIG_STACKTRACE is enabled:
> 
> zcat /proc/config.gz | grep CONFIG_STACKTRACE
> CONFIG_STACKTRACE_SUPPORT=y
> # CONFIG_STACKTRACE_BUILD_ID is not set
> CONFIG_STACKTRACE=y
> 
> Is there any other config that is missing?

I don't think so (the file wouldn't be present otherwise).

If there are no kernel stacks, the tasks execute in userspace and given
the indefinite stuckage, they're likely looping somewhere (or you must
have been unlucky to miss a syscall), which should manifest in their CPU
consumption.
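
One way to check this without extra tools is to sample the utime and stime
counters in /proc/<pid>/stat twice and compare them. The sketch below is
illustrative only: parse_stat and cpu_delta are hypothetical helper names, and
the 5 second default interval is arbitrary. The counters are in clock ticks
(see proc(5)); a delta that keeps growing while the state stays R would be
consistent with a userspace loop.

```shell
# Hypothetical helper: print "<state> <utime> <stime>" from the text of a
# /proc/<pid>/stat line. Everything up to the last ") " is skipped first,
# because the comm field may itself contain spaces or parentheses.
parse_stat() {
    rest=${1##*') '}
    set -- $rest            # $1 = state (field 3), ${12} = utime, ${13} = stime
    echo "$1 ${12} ${13}"
}

# Hypothetical helper: sample a PID twice and print its latest state and the
# number of CPU ticks (user + system) consumed in between.
cpu_delta() {
    pid=$1
    interval=${2:-5}
    first=$(cat "/proc/$pid/stat") || return 1
    sleep "$interval"
    second=$(cat "/proc/$pid/stat") || return 1
    set -- $(parse_stat "$first") $(parse_stat "$second")
    # $1..$3 = first sample, $4..$6 = second sample
    echo "pid=$pid state=$4 cpu_ticks=$(( ($5 + $6) - ($2 + $3) ))"
}

# Usage (as root on the affected system):
#   cpu_delta 1
#   cpu_delta "$(pidof -s dbus-broker)"
```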

The userspace stack may be of interest then, e.g.
`gdb -ex "bt" --batch -p 1`

(for PID 1; debuginfo for the involved binaries must be present to obtain
useful info).

Michal



Re: [systemd-devel] systemctl hangs with 249.7 systemd in yocto Honister release

2023-01-04 Thread Heyi Guo



On 2022/12/1 9:57 PM, Michal Koutný wrote:
> Hello Heyi.
>
> On Tue, Nov 29, 2022 at 12:44:12PM +0800, Heyi Guo wrote:
>> Is there any known issue which will cause this problem? Or do you have any
>> suggestion on how to debug?
>
> As written in the report, it looks like dbus-daemon or PID1 itself is not
> responding. Some insights may be obtained by looking at /proc/1/stack
> and /proc/$pid_of_dbus/stack (as root).

Hi Michal,

The issue happened again, but the /proc/1/stack and
/proc/$pid_of_dbus-broker/stack are both empty on our platform.

I checked the kernel config and confirmed that CONFIG_STACKTRACE is enabled:

zcat /proc/config.gz | grep CONFIG_STACKTRACE
CONFIG_STACKTRACE_SUPPORT=y
# CONFIG_STACKTRACE_BUILD_ID is not set
CONFIG_STACKTRACE=y

Is there any other config that is missing?

Thanks,

Heyi

> HTH,
> Michal


Re: [systemd-devel] systemctl hangs with 249.7 systemd in yocto Honister release

2022-12-07 Thread Heyi Guo
Thanks very much; we'll try to get the information when the issue 
happens next time.


Heyi

On 2022/12/1 9:57 PM, Michal Koutný wrote:
> Hello Heyi.
>
> On Tue, Nov 29, 2022 at 12:44:12PM +0800, Heyi Guo wrote:
>> Is there any known issue which will cause this problem? Or do you have any
>> suggestion on how to debug?
>
> As written in the report, it looks like dbus-daemon or PID1 itself is not
> responding. Some insights may be obtained by looking at /proc/1/stack
> and /proc/$pid_of_dbus/stack (as root).
>
> HTH,
> Michal


Re: [systemd-devel] systemctl hangs with 249.7 systemd in yocto Honister release

2022-12-01 Thread Michal Koutný
Hello Heyi.

On Tue, Nov 29, 2022 at 12:44:12PM +0800, Heyi Guo wrote:
> Is there any known issue which will cause this problem? Or do you have any
> suggestion on how to debug?

As written in the report, it looks like dbus-daemon or PID1 itself is not
responding. Some insights may be obtained by looking at /proc/1/stack
and /proc/$pid_of_dbus/stack (as root).

HTH,
Michal



[systemd-devel] systemctl hangs with 249.7 systemd in yocto Honister release

2022-11-28 Thread Heyi Guo

Hi all,

We are running the OpenBMC 2.11.0 release, which is based on Yocto Honister
and systemd 249.7, and we find that systemctl hangs occasionally. The
symptom is just like the issue below:


https://github.com/openbmc/openbmc/issues/1097

The systemctl command always times out, and dmesg keeps printing the
message below:


systemd-journald[445]: Failed to send WATCHDOG=1 notification message: Transport endpoint is not connected


Is there any known issue which will cause this problem? Or do you have 
any suggestion on how to debug?


For the above bug, I don't see any resolution before it was closed.

Thanks,

Heyi