[Kernel-packages] [Bug 2042363] Re: AIX 7.3 NFS client frequently returns an EIO error to an application when reading or writing to a file that has been locked with fcntl() on a Ubuntu 20.04 NFSV4 ser
Per comment #23, the IP addresses of the AIX 7.2 clients are:

9.20.120.127 name = adia.v6.hursley.ibm.com -- Primary
9.20.121.46 name = amberjack.v6.hursley.ibm.com -- Partner

I searched the trace again with the above IPs. It looks like socket cc6f0db2 is created between 9.20.120.127 and the NFS server; however, it can also return EAGAIN.

duckseason kernel: [13254.724411] svc: socket cc6f0db2 sendto([8485f39d 72... ], 72) = 72 (addr 9.20.120.127, port=1022)
...
duckseason kernel: [13254.724734] svc: socket cc6f0db2(inet c831762e), busy=0
duckseason kernel: [13254.724759] svc: server 728e82a2, pool 0, transport cc6f0db2, inuse=2
duckseason kernel: [13254.724761] svc: tcp_recv cc6f0db2 data 1 conn 0 close 0
duckseason kernel: [13254.724765] svc: socket cc6f0db2 recvfrom(b6708704, 4) = 4
duckseason kernel: [13254.724766] svc: TCP record, 168 bytes
duckseason kernel: [13254.724769] svc: socket cc6f0db2 recvfrom(57dbced3, 4096) = 168
duckseason kernel: [13254.724771] svc: TCP final record (168 bytes)
duckseason kernel: [13254.724775] svc: svc_authenticate (1)
duckseason kernel: [13254.724779] svc: server ee62a401, pool 0, transport cc6f0db2, inuse=3
duckseason kernel: [13254.724780] svc: tcp_recv cc6f0db2 data 1 conn 0 close 0
duckseason kernel: [13254.724783] svc: socket cc6f0db2 recvfrom(b6708704, 4) = -11

The same holds for socket 3497acd5, which is used between 9.20.121.46 and the NFS server.

duckseason kernel: [13254.802249] svc: socket 3497acd5 sendto([86e5a045 72... ], 72) = 72 (addr 9.20.121.46, port=1020)
...
duckseason kernel: [13254.802533] svc: socket 3497acd5(inet 72c9551d), busy=0
duckseason kernel: [13254.802571] svc: server 728e82a2, pool 0, transport 3497acd5, inuse=2
duckseason kernel: [13254.802573] svc: tcp_recv 3497acd5 data 1 conn 0 close 0
duckseason kernel: [13254.802578] svc: socket 3497acd5 recvfrom(77f9cf7c, 4) = 4
duckseason kernel: [13254.802579] svc: TCP record, 164 bytes
duckseason kernel: [13254.802583] svc: socket 3497acd5 recvfrom(57dbced3, 4096) = 164
duckseason kernel: [13254.802585] svc: TCP final record (164 bytes)
duckseason kernel: [13254.802590] svc: svc_authenticate (1)
duckseason kernel: [13254.802596] svc: server ee62a401, pool 0, transport 3497acd5, inuse=3
duckseason kernel: [13254.802597] svc: tcp_recv 3497acd5 data 1 conn 0 close 0
duckseason kernel: [13254.802599] svc: socket 3497acd5 recvfrom(77f9cf7c, 4) = -11

But since the AIX 7.2 client works with the same server according to the bug description, I am curious why the 7.2 client connection also returns EAGAIN, the same as the 7.3 client. What am I missing?

Some questions/suggestions:
1. Did the AIX 7.3 NFS client work with a previous kernel? If so, run "git bisect" to find which commit caused the issue.
2. Is it possible to try the latest 5.4 stable kernel as suggested in comment #1? Please also try the latest upstream kernel (6.9-rc5 at this time).
3. Does increasing the lease time make a difference?

--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2042363

Title:
  AIX 7.3 NFS client frequently returns an EIO error to an application
  when reading or writing to a file that has been locked with fcntl() on
  a Ubuntu 20.04 NFSV4 server

Status in linux package in Ubuntu:
  New

Bug description:
  ---Problem Description---
  AIX 7.3 NFS client frequently returns an EIO error to an application
  when reading or writing to a file that has been locked with fcntl().

  NFS server is Ubuntu 20.04.6 LTS, GNU/Linux 5.4.0-139-generic x86_64.

  The problem does not appear to affect other combinations of NFS client
  (including AIX 7.2) with this NFS server.

  The AIX team have indicated that the cause of the EIO is triggered by
  the NFS server returning a BAD_SEQID error, which leads to the AIX NFS
  client incorrectly zeroing the stateid. This in turn leads to the NFS
  server returning a BAD_STATEID error, and the NFS client then returns
  the EIO error.

  The AIX team would like to understand why the BAD_SEQID has been
  returned.

  ---uname output---
  Linux duckseason 5.4.0-156-generic #173-Ubuntu SMP Tue Jul 11 07:25:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

  Machine Type = VMware ESXi Server 7.0 4 x Intel(R) Xeon(R) Gold 6348H CPU @ 2.30GHz

  ---Steps to Reproduce---
  We cannot offer a simple way to recreate the problem as it involves
  IBM MQ running on two primary machines (AIX) using the Ubuntu server
  for its HA NFSv4 storage. However, we can provide any requested trace
  or dumps from any or all of the involved machines.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2042363/+subscriptions

--
Mailing list:
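The socket-to-client matching in the message above was done by hand. A minimal sketch of how it could be automated, assuming the "svc: socket ..." debug lines keep the format quoted in this thread. The function name summarize and the two regular expressions are made up for illustration and are not part of any existing tool:

```python
import re
from collections import Counter

# Patterns follow the "svc: socket ..." lines quoted in this thread.
# A sendto line carries the peer address, so it links a socket id to a client.
SENDTO_RE = re.compile(
    r"svc: socket (\w+) sendto\(.*?\) = \d+ \(addr ([\d.]+), port=\d+\)"
)
# A recvfrom line ending in "= -11" is a read that returned EAGAIN.
EAGAIN_RE = re.compile(r"svc: socket (\w+) recvfrom\(\w+, \d+\) = -11\b")


def summarize(lines):
    """Return ({socket_id: client_addr}, Counter({socket_id: eagain_count}))."""
    peers = {}
    eagain = Counter()
    for line in lines:
        m = SENDTO_RE.search(line)
        if m:
            peers[m.group(1)] = m.group(2)
            continue
        m = EAGAIN_RE.search(line)
        if m:
            eagain[m.group(1)] += 1
    return peers, eagain
```

Feeding it the lines of a syslog file (for example summarize(open(path))) gives a per-socket EAGAIN count that can be grouped by client address, which may be easier to compare between the 7.2 and 7.3 runs than a single grep total.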
[Kernel-packages] [Bug 2042363] Re: AIX 7.3 NFS client frequently returns an EIO error to an application when reading or writing to a file that has been locked with fcntl() on a Ubuntu 20.04 NFSV4 ser
** Attachment added: "RENEW packets between 9.20.32.85 (server) and 9.20.120.127 (7.2 client)"
   https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2042363/+attachment/5767206/+files/7.2nfs.png

--
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
[Kernel-packages] [Bug 2042363] Re: AIX 7.3 NFS client frequently returns an EIO error to an application when reading or writing to a file that has been locked with fcntl() on a Ubuntu 20.04 NFSV4 ser
** Attachment added: "packets for 9.20.32.85 (server) and 9.20.120.112 (7.3 client)"
   https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2042363/+attachment/5767207/+files/7.3nfs.png
[Kernel-packages] [Bug 2042363] Re: AIX 7.3 NFS client frequently returns an EIO error to an application when reading or writing to a file that has been locked with fcntl() on a Ubuntu 20.04 NFSV4 ser
Sorry, I can't tell which parts of the logs in the attachments (comments #11, #12 and #13) belong to the connection from the working 7.2 client and which to the non-working 7.3 client. All the attachments contain "TCP recvfrom got EAGAIN", which should come from the 7.3 connection.

$ grep "TCP recvfrom got EAGAIN" syslog_16042024_amaliada_primary_adamsongrunter_partner_both_aix73_part1.log -r|wc -l
213127
$ grep "TCP recvfrom got EAGAIN" syslog_16042024_amaliada_primary_adamsongrunter_partner_both_aix73_part2.log -r|wc -l
226005
$ grep "TCP recvfrom got EAGAIN" syslog_17042024_adia_primary_amberjack_partner_both_aix72.log -r|wc -l
20233

May I suggest collecting those logs in two separate files, one from 7.2 and another from 7.3, instead of mixing them together?

I am not a network expert, but I see some NFS RENEW operations between 9.20.32.85 (server) and 9.20.120.127 (7.2 client) in tcp_dump17_04_2024_09H_10M, but no such RENEW packets between 9.20.32.85 (server) and 9.20.120.112 (7.3 client) in tcpdump16_04_2024_14H_03M.

Given that NFSv4 is a stateful filesystem based on leases, if the client does not send an operation to renew the lease, it is possible for the server to return EAGAIN. Please check whether the 7.3 client behaves differently from the 7.2 client with regard to lease renewal.
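To compare lease renewal between the two clients from the server side, one option is to look at the "process_renew(...): starting" lines that appear in the server trace elsewhere in this thread. The following is only a sketch under assumptions: the function names are hypothetical, the bracketed number is taken to be the kernel timestamp in seconds, and the 90-second lease is a placeholder that should be replaced with the lease time actually configured on the server.

```python
import re
from collections import defaultdict

# Matches e.g. "[1291756.354732] process_renew(6554b87b/4ab45507): starting"
RENEW_RE = re.compile(r"\[\s*(\d+\.\d+)\] process_renew\((\w+)/(\w+)\): starting")

LEASE_SECONDS = 90.0  # assumption: replace with the server's configured lease time


def renew_gaps(lines):
    """Map each client id to the longest gap (seconds) between its RENEWs."""
    times = defaultdict(list)
    for line in lines:
        m = RENEW_RE.search(line)
        if m:
            client = f"{m.group(2)}/{m.group(3)}"
            times[client].append(float(m.group(1)))
    gaps = {}
    for client, stamps in times.items():
        stamps.sort()
        # A client seen only once has no gap to measure; report 0.0.
        gaps[client] = max((b - a for a, b in zip(stamps, stamps[1:])), default=0.0)
    return gaps


def clients_exceeding(gaps, lease=LEASE_SECONDS):
    """Return the client ids whose longest RENEW gap is above the lease time."""
    return sorted(client for client, gap in gaps.items() if gap > lease)
```

A long gap is only a hint, not proof of an expired lease: in NFSv4.0 other operations that carry a clientid or stateid also renew the lease implicitly, so a client that is busy with I/O may legitimately send few explicit RENEWs.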
[Kernel-packages] [Bug 2042363] Re: AIX 7.3 NFS client frequently returns an EIO error to an application when reading or writing to a file that has been locked with fcntl() on a Ubuntu 20.04 NFSV4 ser
From the trace file excerpt below

Nov 30 11:13:40 duckseason kernel: [1291756.354728] nfsd_dispatch: vers 4 proc 1
Nov 30 11:13:40 duckseason kernel: [1291756.354731] svc: server 7c7e7536, pool 0, transport 3fd86d34, inuse=3
Nov 30 11:13:40 duckseason kernel: [1291756.354732] process_renew(6554b87b/4ab45507): starting
Nov 30 11:13:40 duckseason kernel: [1291756.354734] svc: tcp_recv 3fd86d34 data 1 conn 0 close 0
Nov 30 11:13:40 duckseason kernel: [1291756.354736] svc: socket 3fd86d34 recvfrom(03fecffb, 4) = -11
Nov 30 11:13:40 duckseason kernel: [1291756.354737] RPC: TCP recv_record got -11
Nov 30 11:13:40 duckseason kernel: [1291756.354737] RPC: TCP recvfrom got EAGAIN

we can see that the NFS server returns -11 (EAGAIN), which can be reached through the following path:

  svc_recv -> svc_handle_xprt -> xprt->xpt_ops->xpo_recvfrom
  svc_tcp_recvfrom -> svc_recvfrom -> sock_recvmsg

which probably triggers

  sock_recvmsg_nosec -> ... -> tcp_recvmsg

As mentioned in the recvfrom manpage:

ERRORS
  The recvfrom() function shall fail if:

  EAGAIN or EWOULDBLOCK
    The socket's file descriptor is marked O_NONBLOCK and no data is
    waiting to be received; or MSG_OOB is set and no out-of-band data
    is available and either the socket's file descriptor is marked
    O_NONBLOCK or the socket does not support blocking to await
    out-of-band data.

I am not sure whether the 7.3 NFS client opened a non-blocking socket and there was no data to be read on that socket.

So I would like to check whether the 7.3 client sent something different from the 7.2 client, which caused the server to return BAD_SEQID to the AIX 7.3 client.

Please also collect the relevant trace log from the server side when connecting with the 7.2 client, so that we can investigate the difference between the good case and the bad case.

If possible, maybe you can try the latest 5.4 stable kernel (5.4.274) and the latest upstream version (6.9-rc4).
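To make the manpage wording above concrete, here is a small user-space sketch of the same EAGAIN condition, a non-blocking read on a connected socket that has nothing queued. It uses a local socket pair rather than NFS or the kernel svc code, and the function names are made up for illustration. The 4-byte value in the second function is an RPC record marker for a final fragment of 168 bytes, chosen only to mirror the "TCP record, 168 bytes" lines quoted earlier in the thread.

```python
import errno
import socket


def recv_when_empty():
    """Non-blocking 4-byte read with no data queued; returns the errno seen."""
    a, b = socket.socketpair()
    try:
        b.setblocking(False)
        try:
            b.recv(4)
        except BlockingIOError as exc:
            # On Linux this is errno 11, shown as "= -11" in the svc debug lines.
            return exc.errno
        return 0
    finally:
        a.close()
        b.close()


def recv_after_send():
    """Same read, but after the peer has queued a 4-byte record marker."""
    a, b = socket.socketpair()
    try:
        # High bit set = last fragment; low 31 bits = length 168 (0xa8).
        a.sendall(b"\x80\x00\x00\xa8")
        b.setblocking(False)
        return b.recv(4)
    finally:
        a.close()
        b.close()
```

If the server-side reads are non-blocking in the same way, then a "recvfrom(..., 4) = -11" right after a complete record may simply mean that no further request was queued on that connection at that moment, which would also explain why it shows up for the working 7.2 client. That reading is an assumption and should be confirmed against the 5.4 sunrpc source.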
[Kernel-packages] [Bug 2042363] Re: AIX 7.3 NFS client frequently returns an EIO error to an application when reading or writing to a file that has been locked with fcntl() on a Ubuntu 20.04 NFSV4 ser
I screened the traces but couldn't really find any suspicious entries.
I'm now looking for someone else's view and opinion...
[Kernel-packages] [Bug 2042363] Re: AIX 7.3 NFS client frequently returns an EIO error to an application when reading or writing to a file that has been locked with fcntl() on a Ubuntu 20.04 NFSV4 ser
Hi,
reading the bug description, one of the first things that caught my attention is that the kernel (5.4.0-156) seems to be a bit outdated, since the latest (as of today) is 5.4.0.166.

Would you mind retrying this with the latest kernel (ideally also with the latest userspace, for example after having done an 'apt update' and 'apt full-upgrade')? There will be a difference of hundreds of patches (including upstream stable ones) between these.

On top of that, I believe it would probably be very helpful to have rpcdebug enabled for the NFS server, like:

  rpcdebug -m nfsd -s all
  rpcdebug -m nlm -s all
  rpcdebug -m rpc -s all

By the way, it would also be interesting to know if this also happens with a bare-metal install of the NFS server, that is, without VMware in between (avoiding any potential flaws in VMware virtual network components, like virtual switches).

I think technically this is not a Launchpad bug for the Ubuntu on IBM Power project, since here Ubuntu runs on amd64 (and on VMware), but we may still try to figure out what's going on.