Not 100% sure if we're seeing the same bug (we use plain Kerberos, no AD
or Samba involved). However, since we started rolling out 14.04 and
16.04 in bigger numbers, we have been getting literally hundreds of BUG
messages (see below) and a few kernel panics a week. The console output
shows, among other things, "oops_end" and "rpc_pipe_read", so we're
pretty sure there's a direct connection between what we have in the logs
and the panics.

The kernel BUGs _all_ look like this:

NMI watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [stat:59963]
Modules linked in: cpuid 8021q garp mrp stp llc cts nfsv4 ip6t_REJECT 
nf_reject_ipv6 ip6table_filter ip6_tables ipt_REJECT nf_reject_ipv4 nf_log_ipv4 
nf_log_common xt_LOG xt_limit nf_conntrack_ipv4 nf_defrag_ipv4 xt_comment 
xt_conntrack nf_conntrack xt_multiport iptable_filter ip_tables x_tables 
autofs4 intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm 
ipmi_ssif irqbypass ipmi_devintf crct10dif_pclmul crc32_pclmul 
ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper 
cryptd sb_edac dcdbas ipmi_si wmi mei_me ipmi_msghandler edac_core mei shpchp 
8250_fintek lpc_ich acpi_power_meter mac_hid rpcsec_gss_krb5 nfsd auth_rpcgss 
nfs_acl lp nfs parport lockd grace sunrpc fscache tg3 ahci megaraid_sas ptp 
libahci pps_core fjes
CPU: 5 PID: 59963 Comm: stat Tainted: G      D W       4.4.0-53-generic 
#74~14.04.1-Ubuntu
Hardware name: Dell Inc. PowerEdge R630/0CNCJW, BIOS 1.2.10 03/09/2015
task: ffff880b2b0b6040 ti: ffff8809e1d00000 task.ti: ffff8809e1d00000
RIP: 0010:[<ffffffff810c4fa0>]  [<ffffffff810c4fa0>] 
native_queued_spin_lock_slowpath+0x160/0x170
RSP: 0018:ffff8809e1d03960  EFLAGS: 00000202
RAX: 0000000000000101 RBX: ffff8808d6319600 RCX: 0000000000000001
RDX: 0000000000000101 RSI: 0000000000000001 RDI: ffff881056f679c8
RBP: ffff8809e1d03960 R08: 0000000000000101 R09: 000000000000ffff
R10: 0000000000000000 R11: ffffea00415b4a00 R12: ffff881056f67900
R13: ffff881056f679c8 R14: ffff8808d631976b R15: ffff8808d6319600
FS:  00007fdcb0f39840(0000) GS:ffff88105e480000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fdcb0604160 CR3: 00000008dd211000 CR4: 00000000001406e0
Stack:
 ffff8809e1d03970 ffffffff81180e47 ffff8809e1d03980 ffffffff817fe4d0
 ffff8809e1d039d0 ffffffffc0114d50 ffff8808d6319740 0000000000000000
 ffffffffc0129060 ffff8810560eaf00 0000000000000001 ffffffff81ef6bc0
Call Trace:
 [<ffffffff81180e47>] queued_spin_lock_slowpath+0xb/0xf
 [<ffffffff817fe4d0>] _raw_spin_lock+0x20/0x30
 [<ffffffffc0114d50>] gss_setup_upcall+0x160/0x390 [auth_rpcgss]
 [<ffffffffc0115b8e>] gss_cred_init+0xce/0x350 [auth_rpcgss]
 [<ffffffff810bde50>] ? prepare_to_wait_event+0xf0/0xf0
 [<ffffffffc00d1473>] rpcauth_lookup_credcache+0x1e3/0x280 [sunrpc]
 [<ffffffffc011338e>] gss_lookup_cred+0xe/0x10 [auth_rpcgss]
 [<ffffffffc00d0c7c>] rpcauth_lookupcred+0x7c/0xb0 [sunrpc]
 [<ffffffffc00d1c6a>] rpcauth_refreshcred+0x12a/0x1a0 [sunrpc]
 [<ffffffffc00c1650>] ? call_bc_transmit+0x1a0/0x1a0 [sunrpc]
 [<ffffffffc00c1650>] ? call_bc_transmit+0x1a0/0x1a0 [sunrpc]
 [<ffffffffc00c1b30>] ? call_retry_reserve+0x60/0x60 [sunrpc]
 [<ffffffffc00c1b30>] ? call_retry_reserve+0x60/0x60 [sunrpc]
 [<ffffffffc00c1b6c>] call_refresh+0x3c/0x70 [sunrpc]
 [<ffffffffc00cd496>] __rpc_execute+0x86/0x440 [sunrpc]
 [<ffffffffc00d057e>] rpc_execute+0x5e/0xb0 [sunrpc]
 [<ffffffffc00c4210>] rpc_run_task+0x70/0x90 [sunrpc]
 [<ffffffffc0484176>] nfs4_call_sync_sequence+0x56/0x80 [nfsv4]
 [<ffffffffc0484d78>] _nfs4_proc_statfs+0xb8/0xd0 [nfsv4]
 [<ffffffffc048efb9>] nfs4_proc_statfs+0x49/0x70 [nfsv4]
 [<ffffffffc014a479>] nfs_statfs+0x59/0x170 [nfs]
 [<ffffffff8123189b>] statfs_by_dentry+0x9b/0x120
 [<ffffffff8123193b>] vfs_statfs+0x1b/0xb0
 [<ffffffff81231a19>] user_statfs+0x49/0x80
 [<ffffffff81231a65>] SYSC_statfs+0x15/0x30
 [<ffffffff81231b9e>] SyS_statfs+0xe/0x10
 [<ffffffff817fe876>] entry_SYSCALL_64_fastpath+0x16/0x75
Code: 8b 01 48 85 c0 75 0a f3 90 48 8b 01 48 85 c0 74 f6 c7 40 08 01 00 00 00 
e9 61 ff ff ff 83 fa 01 75 07 e9 c2 fe ff ff f3 90 8b 07 <84> c0 75 f8 b8 01 00 
00 00 66 89 07 5d c3 66 90 0f 1f 44 00 00 


This is from yesterday, Sunday. Nobody was logged in to the machine (according 
to wtmp & wtmp.1, not since at least Feb 01, but those could of course have 
been borked by the panic). 
Nevertheless, there are 676 "gss_setup_upcall" lines between 
2017-02-19T09:06:46.535855+01:00 and 2017-02-19T09:41:02.900460+01:00, 
and sure enough, after the last occurrence the server PANIC'd.
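For reference, a count like the one above can be pulled out of a syslog-style file with something along these lines (the log format with a leading RFC 3339 timestamp is what our syslog emits; the helper itself is just a sketch, not part of our tooling):

```python
from datetime import datetime

def count_upcalls(lines, start, end, needle="gss_setup_upcall"):
    """Count syslog-style lines containing `needle` whose leading
    RFC 3339 timestamp falls within [start, end]."""
    n = 0
    for line in lines:
        fields = line.split(None, 1)
        if not fields:
            continue                      # skip blank lines
        try:
            when = datetime.fromisoformat(fields[0])
        except ValueError:
            continue                      # line doesn't start with a timestamp
        if start <= when <= end and needle in line:
            n += 1
    return n
```
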

In this particular case it is 14.04.5 with kernel 4.4.0-53-generic, but we've 
seen this with basically every 4.4.0-* flavour on 14.04 and 16.04.
This config has basically been in use (on 12.04) for a few years:

*****
/etc/krb5.conf (excerpt):
[libdefaults]
        dns_lookup_realm = true
        dns_lookup_kdc = true
        kdc_timesync = 1
        ccache_type = 4
        forwardable = true
        proxiable = true
*****

*****
autofs used for NFS directories, with options
-fstype=nfs,intr,hard,fg,rsize=16384,wsize=16384,proto=tcp,timeo=600,retrans=3,port=2049,nfsvers=4,sec=krb5p,nodev,nosuid
*****

*****
# grep -v '^#' /etc/default/autofs 
MASTER_MAP_NAME=/etc/auto.master
TIMEOUT=300
BROWSE_MODE=yes
LOGGING=none
USE_MISC_DEVICE=yes
*****

*****
# grep -v '^#' /etc/default/nfs-common 
NEED_STATD=no
STATDOPTS=
NEED_GSSD=yes
NEED_IDMAPD=yes
*****

*****
/etc/idmapd.conf:
[General]

Verbosity = 0
Pipefs-Directory = /run/rpc_pipefs
# set your own domain here, if id differs from FQDN minus hostname
Domain = our.domain

[Mapping]

Nobody-User = nobody
Nobody-Group = nogroup

[Translation]
Method = nsswitch
*****


What we do have is ~23 UNIX groups assigned to most of the admin users. We're 
working on reducing those at the moment, but unfortunately we have not found a 
way to reproduce the panic, so right now we have no reliable way to tell 
whether that fixes things (the procedure in the initial report triggers 
neither the BUG nor the panic for us). Also, the servers we equipped with 
kexec have not crashed again since.
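Since the trace above ends in SyS_statfs coming from a plain `stat`, the kind of load we'd expect to matter is many concurrent statfs() calls against the kerberized mount. A minimal sketch of that load, assuming a sec=krb5p NFS mount at a hypothetical path (this is illustrative only; it has not reproduced the BUG for us):

```python
import os
import threading

def hammer_statfs(mount, threads=8, rounds=1000):
    """Issue many concurrent statfs() calls against `mount`.

    Each os.statvfs() call maps to the statfs() syscall in the trace
    above (SyS_statfs -> nfs4_proc_statfs -> gss credential lookup on a
    sec=krb5 mount). Returns the total number of calls made.
    """
    done = []

    def worker():
        for _ in range(rounds):
            os.statvfs(mount)
        done.append(rounds)

    ts = [threading.Thread(target=worker) for _ in range(threads)]
    for t in ts:
        t.start()
    for t in ts:
        t.join()
    return sum(done)

# hammer_statfs("/mnt/krb5")   # hypothetical krb5p NFS mount point
```
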

We'd appreciate having the importance of this bug raised significantly, as
our only _stable_ platform (i.e., one that doesn't randomly crash) right
now is 12.04, which will be EOL pretty soon.

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1466654

Title:
  kernel soft lockup on nfs server when using a kerberos mount

