Public bug reported:

Environment:
Ubuntu 24.04.3 on HWE kernel 6.17.0-14-generic #14~24.04.1-Ubuntu
Host is joined to an AD domain with sssd; sssd is the cifs idmap provider
root has a Kerberos TGT issued via k5start from a static keytab for a service
account, to allow the CIFS mount at boot time
krb5 ccache is stored in the keyring
CIFS share is mounted with vers=3.1.1,sec=krb5i,multiuser,cifsacl; root is
mapped to the service account
Some user home folders are stored on the CIFS mount

This setup has been working, more or less. (There are unrelated races in
cifs.upcall that I have not yet reported, but they are not relevant here.)
After I recently added a frequently running task that syncs files between the
CIFS share and a tmpfs on the Linux host (using `rclone bisync`), the system
intermittently becomes unresponsive because systemd (pid 1) hangs in an fstat
call on some file on the CIFS share. With systemd not responding and IO to the
CIFS share blocked, the vast majority of tools become unusable for several
minutes (`strace` and `lsof` do not work, and not even `ps` does).

Hung task report from the kernel log:

INFO: task systemd:1 blocked for more than 245 seconds.
      Not tainted 6.17.0-14-generic #14~24.04.1-Ubuntu
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:systemd         state:D stack:0     pid:1     tgid:1     ppid:0      
task_flags:0x400100 flags:0x00004002
Call Trace:
 <TASK>
 __schedule+0x30d/0x7a0
 schedule+0x27/0x90
 schedule_timeout+0x104/0x110
 __wait_for_common+0x98/0x180
 ? __pfx_schedule_timeout+0x10/0x10
 wait_for_completion_state+0x21/0x50
 call_usermodehelper_exec+0x181/0x1b0
 call_sbin_request_key+0x343/0x500
 construct_key_and_link+0x14e/0x1b0
 request_key_and_link+0x1d5/0x200
 ? __pfx_key_default_cmp+0x10/0x10
 ? __pfx_keyring_search_iterator+0x10/0x10
 request_key_tag+0x48/0xc0
 sid_to_id+0xe4/0x350 [cifs]
 parse_sec_desc+0x8e/0x320 [cifs]
 cifs_acl_to_fattr+0x14b/0x1f0 [cifs]
 cifs_get_fattr+0x35f/0x6d0 [cifs]
 ? rmqueue.isra.0+0x13bc/0x1a20
 cifs_get_inode_info+0x60/0x140 [cifs]
 cifs_revalidate_dentry_attr+0x1a9/0x3d0 [cifs]
 cifs_getattr+0x16c/0x250 [cifs]
 vfs_getattr_nosec+0xb9/0x110
 vfs_fstat+0x4e/0xc0
 __do_sys_newfstat+0x3d/0x80
 __x64_sys_newfstat+0x15/0x20
 x64_sys_call+0x219b/0x2680
 do_syscall_64+0x80/0xa30
 ? __x64_sys_openat+0x54/0xa0
 ? arch_exit_to_user_mode_prepare.isra.0+0xd/0xe0
 ? do_syscall_64+0xb6/0xa30
 ? __fput+0x1a2/0x2d0
 ? kmem_cache_free+0x43a/0x470
 ? __fput+0x1a2/0x2d0
 ? fput_close_sync+0x3d/0xa0
 ? __x64_sys_close+0x3e/0x90
 ? arch_exit_to_user_mode_prepare.isra.0+0xd/0xe0
 ? do_syscall_64+0xb6/0xa30
 ? do_syscall_64+0xb6/0xa30
 ? do_wp_page+0x1d4/0x640
 ? handle_pte_fault+0x1ec/0x200
 ? __handle_mm_fault+0x5ba/0x740
 ? count_memcg_events+0xf0/0x1e0
 ? handle_mm_fault+0x237/0x370
 ? do_user_addr_fault+0x1d2/0x8d0
 ? arch_exit_to_user_mode_prepare.isra.0+0xd/0x100
 ? irqentry_exit_to_user_mode+0x2d/0x1d0
 ? irqentry_exit+0x43/0x50
 ? clear_bhb_loop+0x30/0x80
 ? clear_bhb_loop+0x30/0x80
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x745f751173bb
RSP: 002b:00007fffcf1da368 EFLAGS: 00000202 ORIG_RAX: 0000000000000005
RAX: ffffffffffffffda RBX: 000000000000004a RCX: 0000745f751173bb
RDX: 0000000000000000 RSI: 00007fffcf1da490 RDI: 000000000000004a
RBP: 00007fffcf1da680 R08: 000060542ce71010 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000202 R12: 00007fffcf1da400
R13: 0000000000000000 R14: 000060542d0ec9e0 R15: 0000000000000010
 </TASK>

From the stack trace, it looks like the CIFS module is calling back to userland
to execute `/sbin/request-key`, which is normal and expected behaviour, but for
some reason the `request-key` invocation hangs for several minutes. I am unsure
whether it eventually completes or times out, but the system does return to
normal behaviour, with IO to the CIFS share working as expected. Manually
invoking `rclone bisync` does not reproduce the behaviour; I can stress the
CIFS mount fairly heavily without any problems whatsoever.
Network issues between the CIFS host and the Linux host are unlikely, as both
are colocated on the same hypervisor in this case. I suspected that sssd might
be having trouble communicating with AD, but that does not seem to be the
problem either.
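
To make the blocking path in the trace above easier to read, here is a minimal
Python sketch that drops the speculative frames (the ones prefixed with `?`)
and prints the confirmed frames from the syscall inward. `SAMPLE` is an
abridged excerpt of the trace in this report, and the function names
`reliable_frames` and `call_chain` are made up for illustration.

```python
import re

# Matches one frame line such as " cifs_getattr+0x16c/0x250 [cifs]".
# A leading "?" marks a frame the unwinder could not confirm.
FRAME_RE = re.compile(
    r"^\s*(?P<guess>\?\s+)?(?P<func>[A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>[^\]]+)\])?\s*$"
)


def reliable_frames(trace_text):
    """Return (function, module) pairs for confirmed frames, innermost first."""
    frames = []
    for line in trace_text.splitlines():
        m = FRAME_RE.match(line)
        if m and not m.group("guess"):
            frames.append((m.group("func"), m.group("module")))
    return frames


def call_chain(trace_text):
    """Render the confirmed frames outermost first as 'a -> b -> c'."""
    names = []
    for func, module in reversed(reliable_frames(trace_text)):
        names.append(f"{func} [{module}]" if module else func)
    return " -> ".join(names)


# Abridged excerpt of the trace above (several frames omitted for brevity).
SAMPLE = """\
 wait_for_completion_state+0x21/0x50
 call_usermodehelper_exec+0x181/0x1b0
 call_sbin_request_key+0x343/0x500
 ? __pfx_key_default_cmp+0x10/0x10
 request_key_tag+0x48/0xc0
 sid_to_id+0xe4/0x350 [cifs]
 cifs_getattr+0x16c/0x250 [cifs]
 vfs_fstat+0x4e/0xc0
"""

if __name__ == "__main__":
    print(call_chain(SAMPLE))
```

On the excerpt, the chain runs from `vfs_fstat` through `cifs_getattr` and
`sid_to_id` down to `call_usermodehelper_exec`, which suggests the fstat is
waiting on the usermode helper started for the SID-to-ID lookup.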

Please let me know if there is any other diagnostic information that
could be useful to figure out what's going on here. I am unfortunately
at a loss, since I cannot run any system introspection while the
hang is ongoing.
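
One piece of state that might still be observable during the hang is the kernel
key that the upcall is supposed to instantiate. The following Python sketch is
meant to be started before the hang and polls `/proc/keys` for `cifs.idmap`
keys flagged `U` (under construction). It rests on several assumptions that are
not verified here: that a process already running can still read `/proc/keys`
while pid 1 is blocked, that root is allowed to view these keys, and that the
type column is truncated to nine characters (so `cifs.idmap` appears as
`cifs.idma`). The log path and the function names are hypothetical.

```python
import time


def pending_keys(proc_keys_text, key_type="cifs.idmap"):
    """Return (serial, description) for keys of key_type under construction."""
    # Assumption: /proc/keys prints at most 9 characters of the type name.
    shown_type = key_type[:9]
    pending = []
    for line in proc_keys_text.splitlines():
        # Columns: serial, flags, usage, timeout, perms, uid, gid, type, desc.
        fields = line.split(None, 8)
        if len(fields) < 9:
            continue
        serial, flags = fields[0], fields[1]
        ktype, desc = fields[7], fields[8]
        # "U" in the flags column means the key is still being constructed
        # by a callback to user space.
        if ktype == shown_type and "U" in flags:
            pending.append((serial, desc))
    return pending


def watch(interval=1.0, log_path="/run/cifs-idmap-pending.log"):
    """Poll /proc/keys forever and append any pending keys to log_path."""
    while True:
        with open("/proc/keys") as f:
            found = pending_keys(f.read())
        if found:
            with open(log_path, "a") as log:
                for serial, desc in found:
                    log.write(f"{time.time():.0f} {serial} {desc}\n")
        time.sleep(interval)
```

Calling `watch()` as root from a process started at boot would leave a
timestamped record of which SID lookups were pending, if the assumptions above
hold. The log is kept off the CIFS share on purpose so that writing it does not
block on the same mount.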

** Affects: linux (Ubuntu)
     Importance: Undecided
         Status: New

https://bugs.launchpad.net/bugs/2142726

Title:
  systemd/pid1 hangs while calling fstat on file in cifs mount
