[nfs-discuss] NFS4ERR_STALE_CLIENTID and NFS4ERR_SAME with Solaris 10u8
Theory falsified, it happened again with NAT switched off. No clues in snoop, too late. -- This message posted from opensolaris.org
[nfs-discuss] NFS4ERR_STALE_CLIENTID and NFS4ERR_SAME with Solaris 10u8
Pavel Filipensky wrote: > Hi Udo, > > how long have you been on vanilla Osol 2009.06 and how long on SRU u6 > before updating to SRU u7? Is it possible that SRU u6 has the issue as > well? We apparently switched to SU5 on 21-Oct-09, to SU6 on 17-Dec-2009, and then to SU7+IDR30+IDR35 on 14-Jan-2010, after that these problems started to occur about once per 1-2 days. We had unexplained hangups before on Mondays, but with a much lower frequency, and probably related to zfs scrubs which triggered (via the ARC) swapping (which in turn was buggy as well...). I am now suspecting that maybe NFS is right when denying access through NFSV4ERR_NO_GRACE, since I found a DNS query related to nfs4v_mapid (we have not set it, all NFS mapping should be done on sys locally via /etc/hosts), and our local machines on the private net have access to dns through NAT (no ipf.conf entries), which seems ok, but a few machines with private AND public network access have an additional (unwanted) route on the private net to the public net via this NAT. This seems to result sometimes (don't know why) in a DNS query about the imksunxxx machine (which is a local entry in /etc/hosts on the private net), and the DNS query would return the short name resolved to imksunxxx.ourdomain.tld with a public IP, and, voila, NFS sees a different client under the same short name and must deny access due to ambiguity (but I suspect that this should be visible in the snoop as a FQHN?). Or, alternatively, the client is visible through NAT and the private net, and this could trigger the problem. It's still inconlusive to me. As a test I switched off NAT (we only need it for mail and updates), and the hangups are gone for 3 days now (but we will see). Maybe we need ipf.conf entries which filter out the private net traffic to prevent additional routes (or even loops?). > I was suspicious about one integration which went to SRUu2 and to S10U8 > and to snv_114, but the change is only on the nfs client side - this > does not match the set-up described earlier by Jorgen: > - clients are s10u5 > - server/s10u8 (issue) > - server/snv_117 (no issue) > > I am not able to find out more from the available data - unless the > problem is reproducible it is hard to diagnose. > I have made one observation (not sure if it is useful) > > reopen with CLAIM_PREVIOUS (CT=P) fails with NFS4ERR_NO_GRACE, > but reopen with CLAIM_NULL (CT=N) succeeds with NFS4_OK > > Pavel > > On 03/11/10 14:16, Udo Grabowski wrote: >> Hi Pavel, >> >> both clients and server were updated (we always have a consistent >> environment), and we came >> from u6, to which we updated directly from a vanilla Osol 2009.06 >> before. The Readmes don't list any NFS patches there, so I suspect >> that our IDR30 patch carries unwanted changes from Solaris u8 into >> Opensolaris U7 which trigger this problem. Since we don't use Solaris >> 10, I cannot >> confirm that 10u7 did not have that problem, I just concluded it from >> the initial post here (but >> that conclusion maybe wrong, I admit). >> We currently snoop the problem and catched some clues (maybe): Shortly >> before, we >> see CB_NULL and NULL4 exchanges, seemingly as a result of a client >> renewal (we do not catch everything before), both server and client >> seem to check their partners callback capabilities: >> . -- Dr.Udo GrabowskiInst.f.Meteorology a.Climate Research IMK-ASF-SAT www-imk.fzk.de/asf/sat/grabowski/ www.imk-asf.kit.edu/english/sat.php KIT - Karlsruhe Institute of Technologyhttp://www.kit.edu Postfach 3640,76021 Karlsruhe,Germany Tel:(+49)7247 82-6026,Fax:-7026
[nfs-discuss] NFS4ERR_STALE_CLIENTID and NFS4ERR_SAME with Solaris 10u8
I have not had any progress on this personally, but we have sent live-cores to Sun-Japan, and well as all other details they requested. However, as a work around, the NOC staff is re-mounting the NFS mounts once a day. One server at a time. This has prevented the issue from happening. We still see the log entries, the full hang of NFS mounts has not happened, so far. If more information is required, I can ask NOC to leave one server out of remounting, and run snoop or whatever is required until NFS stalls. -- Jorgen Lundman | Unix Administrator | +81 (0)3 -5456-2687 ext 1017 (work) Shibuya-ku, Tokyo| +81 (0)90-5578-8500 (cell) Japan| +81 (0)3 -3375-1767 (home)
[nfs-discuss] NFS4ERR_STALE_CLIENTID and NFS4ERR_SAME with Solaris 10u8
Hi Udo, how long have you been on vanilla Osol 2009.06 and how long on SRU u6 before updating to SRU u7? Is it possible that SRU u6 has the issue as well? I was suspicious about one integration which went to SRUu2 and to S10U8 and to snv_114, but the change is only on the nfs client side - this does not match the set-up described earlier by Jorgen: - clients are s10u5 - server/s10u8 (issue) - server/snv_117 (no issue) I am not able to find out more from the available data - unless the problem is reproducible it is hard to diagnose. I have made one observation (not sure if it is useful) reopen with CLAIM_PREVIOUS (CT=P) fails with NFS4ERR_NO_GRACE, but reopen with CLAIM_NULL (CT=N) succeeds with NFS4_OK Pavel On 03/11/10 14:16, Udo Grabowski wrote: > Hi Pavel, > > both clients and server were updated (we always have a consistent > environment), and we came > from u6, to which we updated directly from a vanilla Osol 2009.06 before. The > Readmes don't > list any NFS patches there, so I suspect that our IDR30 patch carries > unwanted changes from > Solaris u8 into Opensolaris U7 which trigger this problem. Since we don't use > Solaris 10, I cannot > confirm that 10u7 did not have that problem, I just concluded it from the > initial post here (but > that conclusion maybe wrong, I admit). > We currently snoop the problem and catched some clues (maybe): Shortly > before, we > see CB_NULL and NULL4 exchanges, seemingly as a result of a client renewal > (we do not > catch everything before), both server and client seem to check their partners > callback capabilities: > >imksunxxx -> imksunyyy NFS C 4 (setclientid ) PUTROOTFH GETATTR 400 0 > SETCLIENTID Prog=1073741826 ID=tcp Addr=192.168.2.151.156.34 CBID=1073741... >imksunyyy -> imksunxxxNFS R 4 (setclientid ) NFS4_OK PUTROOTFH NFS4_OK > GETATTR NFS4_OK SETCLIENTID NFS4_OK CL=3144b6920ee CFV=2AE10314 >imksunxxx -> imksunyyyNFS C 4 (sclntid_conf) SETCLIENTID_CONFIRM > CL=3144b6920ee CFV=2AE10314 >imksunyyy -> imksunxxxNFS R 4 (sclntid_conf) NFS4_OK > SETCLIENTID_CONFIRM NFS4_OK >imksunyyy -> imksunxxxNFS C CB_NULL >imksunxxx -> imksunyyyNFS R CB_NULL >imksunxxx -> imksunyyyTCP D=2049 S=772 Ack=2315081869 Seq=1684524332 > Len=0 Win=64074 Options= >imksunyyy -> imksunxxxTCP D=39970 S=46007 Ack=2765004476 > Seq=2079880454 Len=0 Win=64074 Options= >imksunxxx -> imksunyyyTCP D=2049 S=35143 Syn Seq=3666179320 Len=0 > Win=64057 Options= 4,nop,nop,sackOK> >imksunyyy -> imksunxxxTCP D=35143 S=2049 Syn Ack=3666179321 > Seq=3064768973 Len=0 Win=64074 Options= 449055009,mss 1460,nop,wscale 4,nop,nop,sackOK> >imksunxxx -> imksunyyyTCP D=2049 S=35143 Ack=3064768974 Seq=3666179321 > Len=0 Win=64074 Options= >imksunxxx -> imksunyyyNFS C NULL4 >imksunyyy -> imksunxxxTCP D=35143 S=2049 Ack=3666179365 Seq=3064768974 > Len=0 Win=64074 Options= >imksunyyy -> imksunxxxNFS R NULL4 >imksunxxx -> imksunyyy TCP D=2049 S=35143 Ack=3064769002 Seq=3666179365 > Len=0 Win=64074 Options= >imksunxxx -> imksunyyyTCP D=2049 S=35143 Fin Ack=3064769002 > Seq=3666179365 Len=0 Win=64074 Options= >imksunyyy -> imksunxxxTCP D=35143 S=2049 Ack=3666179366 Seq=3064769002 > Len=0 Win=64074 Options= >imksunyyy -> imksunxxxTCP D=35143 S=2049 Fin Ack=3666179366 > Seq=3064769002 Len=0 Win=64074 Options= >imksunxxx -> imksunyyyTCP D=2049 S=35143 Ack=3064769003 Seq=3666179366 > Len=0 Win=64074 Options= > ** here starts the reopen mess, clients hang ** >imksunxxx -> imksunyyyNFS C 4 (reopen ) PUTFH FH=92CA OPEN OT=NC > SQ=21955 CT=P DT=N AC=R DN=N OO=0670 GETFH GETATTR 10011a b0a23a >imksunyyy -> imksunxxxNFS R 4 (reopen ) NFS4ERR_NO_GRACE PUTFH > NFS4_OK OPEN NFS4ERR_NO_GRACE >imksunxxx -> imksunyyy NFS C 4 (reopen ) PUTFH FH=95AC OPEN > start_version OT=NC SQ=21956 CT=N AC=R DN=N OO=0670 GETFH GETATTR 10011a > b0a23a >imksunyyy -> imksunxxxNFS R 4 (reopen ) NFS4_OK PUTFH NFS4_OK > OPEN NFS4_OK ST=150F:10979 RF=PL DT=N GETFH NFS4_OK FH=92CA GETATTR NFS4_OK >imksunxxx -> imksunyyyNFS C 4 (reopen ) PUTFH FH=92CA OPEN OT=NC > SQ=21955 CT=P DT=N AC=R DN=N OO=07D1 GETFH GETATTR 10011a b0a23a >imksunyyy -> imksunxxxNFS R 4 (reopen ) NFS4ERR_NO_GRACE PUTFH > NFS4_OK OPEN NFS4ERR_NO_GRACE > > Another one here, this time after the reopen: > >imksunzzz -> local-yyyNFS C 4 (reopen ) PUTFH FH=92CA OPEN OT=NC > SQ=42131 CT=P DT=N AC=R DN=N OO=065A GETFH GETATTR 10011a b0a23a >local-yyy -> imksunzzzNFS R 4 (reopen ) NFS4ERR_NO_GRACE PUTFH > NFS4_OK OPEN NFS4ERR_NO_GRACE >imksunzzz -> local-yyyNFS C 4 (reopen ) PUTFH FH=95AC OPEN > start_version OT=NC SQ=42132 CT=N AC=R DN=N OO=065A GETFH GETATTR 10011a > b0a23a >local-yyy -> imksunzzzNFS R 4 (reopen ) NFS4_OK PUTFH NFS
[nfs-discuss] NFS4ERR_STALE_CLIENTID and NFS4ERR_SAME with Solaris 10u8
Hi Pavel, both clients and server were updated (we always have a consistent environment), and we came from u6, to which we updated directly from a vanilla Osol 2009.06 before. The Readmes don't list any NFS patches there, so I suspect that our IDR30 patch carries unwanted changes from Solaris u8 into Opensolaris U7 which trigger this problem. Since we don't use Solaris 10, I cannot confirm that 10u7 did not have that problem, I just concluded it from the initial post here (but that conclusion maybe wrong, I admit). We currently snoop the problem and catched some clues (maybe): Shortly before, we see CB_NULL and NULL4 exchanges, seemingly as a result of a client renewal (we do not catch everything before), both server and client seem to check their partners callback capabilities: imksunxxx -> imksunyyy NFS C 4 (setclientid ) PUTROOTFH GETATTR 400 0 SETCLIENTID Prog=1073741826 ID=tcp Addr=192.168.2.151.156.34 CBID=1073741... imksunyyy -> imksunxxxNFS R 4 (setclientid ) NFS4_OK PUTROOTFH NFS4_OK GETATTR NFS4_OK SETCLIENTID NFS4_OK CL=3144b6920ee CFV=2AE10314 imksunxxx -> imksunyyyNFS C 4 (sclntid_conf) SETCLIENTID_CONFIRM CL=3144b6920ee CFV=2AE10314 imksunyyy -> imksunxxxNFS R 4 (sclntid_conf) NFS4_OK SETCLIENTID_CONFIRM NFS4_OK imksunyyy -> imksunxxxNFS C CB_NULL imksunxxx -> imksunyyyNFS R CB_NULL imksunxxx -> imksunyyyTCP D=2049 S=772 Ack=2315081869 Seq=1684524332 Len=0 Win=64074 Options= imksunyyy -> imksunxxxTCP D=39970 S=46007 Ack=2765004476 Seq=2079880454 Len=0 Win=64074 Options= imksunxxx -> imksunyyyTCP D=2049 S=35143 Syn Seq=3666179320 Len=0 Win=64057 Options= imksunyyy -> imksunxxxTCP D=35143 S=2049 Syn Ack=3666179321 Seq=3064768973 Len=0 Win=64074 Options= imksunxxx -> imksunyyyTCP D=2049 S=35143 Ack=3064768974 Seq=3666179321 Len=0 Win=64074 Options= imksunxxx -> imksunyyyNFS C NULL4 imksunyyy -> imksunxxxTCP D=35143 S=2049 Ack=3666179365 Seq=3064768974 Len=0 Win=64074 Options= imksunyyy -> imksunxxxNFS R NULL4 imksunxxx -> imksunyyy TCP D=2049 S=35143 Ack=3064769002 Seq=3666179365 Len=0 Win=64074 Options= imksunxxx -> imksunyyyTCP D=2049 S=35143 Fin Ack=3064769002 Seq=3666179365 Len=0 Win=64074 Options= imksunyyy -> imksunxxxTCP D=35143 S=2049 Ack=3666179366 Seq=3064769002 Len=0 Win=64074 Options= imksunyyy -> imksunxxxTCP D=35143 S=2049 Fin Ack=3666179366 Seq=3064769002 Len=0 Win=64074 Options= imksunxxx -> imksunyyyTCP D=2049 S=35143 Ack=3064769003 Seq=3666179366 Len=0 Win=64074 Options= ** here starts the reopen mess, clients hang ** imksunxxx -> imksunyyyNFS C 4 (reopen ) PUTFH FH=92CA OPEN OT=NC SQ=21955 CT=P DT=N AC=R DN=N OO=0670 GETFH GETATTR 10011a b0a23a imksunyyy -> imksunxxxNFS R 4 (reopen ) NFS4ERR_NO_GRACE PUTFH NFS4_OK OPEN NFS4ERR_NO_GRACE imksunxxx -> imksunyyy NFS C 4 (reopen ) PUTFH FH=95AC OPEN start_version OT=NC SQ=21956 CT=N AC=R DN=N OO=0670 GETFH GETATTR 10011a b0a23a imksunyyy -> imksunxxxNFS R 4 (reopen ) NFS4_OK PUTFH NFS4_OK OPEN NFS4_OK ST=150F:10979 RF=PL DT=N GETFH NFS4_OK FH=92CA GETATTR NFS4_OK imksunxxx -> imksunyyyNFS C 4 (reopen ) PUTFH FH=92CA OPEN OT=NC SQ=21955 CT=P DT=N AC=R DN=N OO=07D1 GETFH GETATTR 10011a b0a23a imksunyyy -> imksunxxxNFS R 4 (reopen ) NFS4ERR_NO_GRACE PUTFH NFS4_OK OPEN NFS4ERR_NO_GRACE Another one here, this time after the reopen: imksunzzz -> local-yyyNFS C 4 (reopen ) PUTFH FH=92CA OPEN OT=NC SQ=42131 CT=P DT=N AC=R DN=N OO=065A GETFH GETATTR 10011a b0a23a local-yyy -> imksunzzzNFS R 4 (reopen ) NFS4ERR_NO_GRACE PUTFH NFS4_OK OPEN NFS4ERR_NO_GRACE imksunzzz -> local-yyyNFS C 4 (reopen ) PUTFH FH=95AC OPEN start_version OT=NC SQ=42132 CT=N AC=R DN=N OO=065A GETFH GETATTR 10011a b0a23a local-yyy -> imksunzzzNFS R 4 (reopen ) NFS4_OK PUTFH NFS4_OK OPEN NFS4_OK ST=1B1B:21067 RF=PL DT=N GETFH NFS4_OK FH=92CA GETATTR NFS4_OK imksunzzz -> local-yyyNFS C 4 (open) PUTFH FH=95AC OPEN start_version OT=NC SQ=1 CT=N AC=R DN=N OO=0F54 GETFH GETATTR 10011a b0a23a local-yyy -> imksunzzzNFS R 4 (open) NFS4ERR_STALE_CLIENTID PUTFH NFS4_OK OPEN NFS4ERR_STALE_CLIENTID imksunzzz -> local-yyyNFS C 4 (setclientid ) PUTROOTFH GETATTR 400 0 SETCLIENTID Prog=1073741826 ID=tcp Addr=192.168.2.157.145.153 CBID=107374... local-yyy -> imksunzzzNFS R 4 (setclientid ) NFS4_OK PUTROOTFH NFS4_OK GETATTR NFS4_OK SETCLIENTID NFS4_OK CL=3154b6920ee CFV=524A0315 imksunzzz -> local-yyyNFS C 4 (sclntid_conf) SETCLIENTID_CONFIRM CL=3154b6920ee CFV=524A0315 local-yyy -> imksunzzzNFS R 4 (sclntid_conf) NFS4_OK SETCLIENTID_CONFIRM NFS4_OK local-yyy -> imksunzzzNFS C CB_NULL imksunzzz -> local-yyyNFS R CB_NULL and a third one with a couple of NFS4ERR_STAL
[nfs-discuss] NFS4ERR_STALE_CLIENTID and NFS4ERR_SAME with Solaris 10u8
Hi, what was updated to Opensolaris 2009.06 SU7? NFS server or NFS clients or both? What was the previous version before updating to s 2009.06 SU7? You also mention that this is seen on solaris 10u8. Can you confirm that it was not seen on s10u7? It is not obvious if this is a bug on the client or on the server. The snoop and the nfs diagnostic messages from syslog do not show enough to get closer to the root cause. In s10u8 patch 141734-02 there were many changes for both nfs server and nfs client. So one of the possibilities is that the problem is related to that patch, otoh only few fixes from 141734-02 were integrated to Opensolaris 2009.06 updates. Thanks, Pavel On 02/08/10 13:56, Udo Grabowski wrote: > Hello, > > we have the same problem here after update to Opensolaris 2009.06 SU7 > (+IDR30+35) > on a 40 clients production system. > About every second day one of our servers looses state, and gives the output > below on snoop > (numerous reopen attempts with a NFS4ERR_NO_GRACE error). All clients hang, > since > they don't get their data, only a restart of the nfs/server process cures > this problem. > This seems to be a serious bug in the NFS module introduced recently in both > Opensolaris > support line and solaris 10 u8. > >sun -> local-s11NFS R 4 (open) NFS4ERR_STALE_CLIENTID PUTFH > NFS4_OK OPEN NFS4ERR_STALE_CLIENTID >local-s11 -> sun NFS C 4 (reopen ) PUTFH FH=B5F3 OPEN OT=NC > SQ=14571 CT=P DT=N AC=W DN=N OO=0955 GE > TFH GETATTR 10011a b0a23a >sun -> local-s11NFS R 4 (reopen ) NFS4ERR_NO_GRACE PUTFH NFS4_OK > OPEN NFS4ERR_NO_GRACE >? -> (multicast) ETHER Type=0001 (LLC/802.3), size=52 bytes >local-s11 -> sunNFS C 4 (reopen ) PUTFH FH=92CA OPEN OT=NC > SQ=14539 CT=P DT=N AC=R DN=N OO=0A4C GE > TFH GETATTR 10011a b0a23a >local-s11 -> sunTCP D=2049 S=1023 Ack=293438085 Seq=4069541650 Len=0 > Win=64074 Options= 25776951 188898831> >local-s11 -> sunTCP D=2049 S=1023 Ack=293438153 Seq=4069541650 Len=0 > Win=64074 Options= 25776951 188898907> >sun -> local-s11NFS R 4 (reopen ) NFS4ERR_NO_GRACE PUTFH NFS4_OK > OPEN NFS4ERR_NO_GRACE >local-s11 -> sunNFS C 4 (reopen ) PUTFH FH=AA7B OPEN > imksuns11:1.log OT=NC SQ=14572 CT=N AC=W DN=N > OO=0955 GETFH GETATTR 10011a b0a... >local-s11 -> sunNFS C 4 (reopen ) PUTFH FH=95AC OPEN > start_version OT=NC SQ=14540 CT=N AC=R DN=N O > O=0A4C GETFH GETATTR 10011a b0a23a >sun -> local-s11TCP D=1023 S=2049 Ack=4069542150 Seq=293438289 Len=0 > Win=64074 Options= 188898915 25776951> >sun -> local-s11NFS R 4 (reopen ) NFS4_OK PUTFH NFS4_OK OPEN > NFS4_OK ST=1A6F:7271 RF=PL DT=N GETFH > NFS4_OK FH=92CA GETATTR NFS4_OK >local-s11 -> sunNFS C 4 (reopen ) PUTFH FH=92CA OPEN OT=NC > SQ=14539 CT=P DT=N AC=R DN=N OO=0A0F GE > TFH GETATTR 10011a b0a23a >sun -> local-s11NFS R 4 (reopen ) NFS4ERR_NO_GRACE PUTFH NFS4_OK > OPEN NFS4ERR_NO_GRACE >local-s11 -> sunNFS C 4 (reopen ) PUTFH FH=95AC OPEN > start_version OT=NC SQ=14540 CT=N AC=R DN=N O > O=0A0F GETFH GETATTR 10011a b0a23a >sun -> local-s11NFS R 4 (reopen ) NFS4_OK PUTFH NFS4_OK OPEN > NFS4_OK ST=1B35:7271 RF=PL DT=N GETFH > NFS4_OK FH=92CA GETATTR NFS4_OK >local-s11 -> sunNFS C 4 (reopen ) PUTFH FH=92CA OPEN OT=NC > SQ=14539 CT=P DT=N AC=R DN=N OO=0AEE GE > TFH GETATTR 10011a b0a23a >sun -> local-s11NFS R 4 (reopen ) NFS4ERR_NO_GRACE PUTFH NFS4_OK > OPEN NFS4ERR_NO_GRACE >local-s11 -> sunNFS C 4 (reopen ) PUTFH FH=95AC OPEN > start_version OT=NC SQ=14540 CT=N AC=R DN=N O > > NFS4ERR_NO_GRACE > A reclaim of client state was attempted in circumstances in > which the server cannot guarantee that conflicting state has > not been provided to another client. This can occur because > the reclaim has been done outside of the grace period of the > server, after the client has done a RECLAIM_COMPLETE operation, > or because previous operations have created a situation in which > the server is not able to determine that a reclaim-interfering > edge condition does not exist. >
[nfs-discuss] NFS4ERR_STALE_CLIENTID and NFS4ERR_SAME with Solaris 10u8
Hello, we have the same problem here after update to Opensolaris 2009.06 SU7 (+IDR30+35) on a 40 clients production system. About every second day one of our servers looses state, and gives the output below on snoop (numerous reopen attempts with a NFS4ERR_NO_GRACE error). All clients hang, since they don't get their data, only a restart of the nfs/server process cures this problem. This seems to be a serious bug in the NFS module introduced recently in both Opensolaris support line and solaris 10 u8. sun -> local-s11NFS R 4 (open) NFS4ERR_STALE_CLIENTID PUTFH NFS4_OK OPEN NFS4ERR_STALE_CLIENTID local-s11 -> sun NFS C 4 (reopen ) PUTFH FH=B5F3 OPEN OT=NC SQ=14571 CT=P DT=N AC=W DN=N OO=0955 GE TFH GETATTR 10011a b0a23a sun -> local-s11NFS R 4 (reopen ) NFS4ERR_NO_GRACE PUTFH NFS4_OK OPEN NFS4ERR_NO_GRACE ? -> (multicast) ETHER Type=0001 (LLC/802.3), size=52 bytes local-s11 -> sunNFS C 4 (reopen ) PUTFH FH=92CA OPEN OT=NC SQ=14539 CT=P DT=N AC=R DN=N OO=0A4C GE TFH GETATTR 10011a b0a23a local-s11 -> sunTCP D=2049 S=1023 Ack=293438085 Seq=4069541650 Len=0 Win=64074 Options= local-s11 -> sunTCP D=2049 S=1023 Ack=293438153 Seq=4069541650 Len=0 Win=64074 Options= sun -> local-s11NFS R 4 (reopen ) NFS4ERR_NO_GRACE PUTFH NFS4_OK OPEN NFS4ERR_NO_GRACE local-s11 -> sunNFS C 4 (reopen ) PUTFH FH=AA7B OPEN imksuns11:1.log OT=NC SQ=14572 CT=N AC=W DN=N OO=0955 GETFH GETATTR 10011a b0a... local-s11 -> sunNFS C 4 (reopen ) PUTFH FH=95AC OPEN start_version OT=NC SQ=14540 CT=N AC=R DN=N O O=0A4C GETFH GETATTR 10011a b0a23a sun -> local-s11TCP D=1023 S=2049 Ack=4069542150 Seq=293438289 Len=0 Win=64074 Options= sun -> local-s11NFS R 4 (reopen ) NFS4_OK PUTFH NFS4_OK OPEN NFS4_OK ST=1A6F:7271 RF=PL DT=N GETFH NFS4_OK FH=92CA GETATTR NFS4_OK local-s11 -> sunNFS C 4 (reopen ) PUTFH FH=92CA OPEN OT=NC SQ=14539 CT=P DT=N AC=R DN=N OO=0A0F GE TFH GETATTR 10011a b0a23a sun -> local-s11NFS R 4 (reopen ) NFS4ERR_NO_GRACE PUTFH NFS4_OK OPEN NFS4ERR_NO_GRACE local-s11 -> sunNFS C 4 (reopen ) PUTFH FH=95AC OPEN start_version OT=NC SQ=14540 CT=N AC=R DN=N O O=0A0F GETFH GETATTR 10011a b0a23a sun -> local-s11NFS R 4 (reopen ) NFS4_OK PUTFH NFS4_OK OPEN NFS4_OK ST=1B35:7271 RF=PL DT=N GETFH NFS4_OK FH=92CA GETATTR NFS4_OK local-s11 -> sunNFS C 4 (reopen ) PUTFH FH=92CA OPEN OT=NC SQ=14539 CT=P DT=N AC=R DN=N OO=0AEE GE TFH GETATTR 10011a b0a23a sun -> local-s11NFS R 4 (reopen ) NFS4ERR_NO_GRACE PUTFH NFS4_OK OPEN NFS4ERR_NO_GRACE local-s11 -> sunNFS C 4 (reopen ) PUTFH FH=95AC OPEN start_version OT=NC SQ=14540 CT=N AC=R DN=N O NFS4ERR_NO_GRACE A reclaim of client state was attempted in circumstances in which the server cannot guarantee that conflicting state has not been provided to another client. This can occur because the reclaim has been done outside of the grace period of the server, after the client has done a RECLAIM_COMPLETE operation, or because previous operations have created a situation in which the server is not able to determine that a reclaim-interfering edge condition does not exist. -- This message posted from opensolaris.org
[nfs-discuss] NFS4ERR_STALE_CLIENTID and NFS4ERR_SAME with Solaris 10u8
Another client: cgi01.unix 5.10 Generic_127128-11 will issue these lines when it happens: Nov 27 14:01:55 cgi01.unix nfs: [ID 236337 kern.info] NOTICE: [NFS4][Server: x4500-07.unix][Mntpt: /export/sd03/www]NFS op OP_OPEN got error NFS4ERR_DELAY causing recovery action NR_DELAY. Nov 27 14:01:55 cgi01.unix last message repeated 98 times Nov 27 14:01:55 cgi01.unix nfs: [ID 236337 kern.info] NOTICE: [NFS4][Server: x4500-07.unix][Mntpt: /export/sd03/www]NFS op OP_GETATTR got error NFS4ERR_STALE causing recovery action NR_STALE. Nov 27 14:01:55 cgi01.unix nfs: [ID 286389 kern.info] NOTICE: [NFS4][Server: x4500-07.unix][Mntpt: /export/sd03/www]File ./$customer-path/$file.lok (rnode_pt: e06c73d0) was closed due to NFS recovery error on server x4500-07.unix(failed to recover from NFS4ERR_STALE NFS4ERR_STALE) Nov 27 14:01:55 cgi01.unix nfs: [ID 941083 kern.info] NOTICE: NFS4 FACT SHEET: Nov 27 14:01:55 cgi01.unix Action: NR_STALE Nov 27 14:01:55 cgi01.unix NFS4 error: NFS4ERR_STALE Nov 27 14:03:42 cgi01.unix nfs: [ID 273629 kern.info] NOTICE: [NFS4][Server: x4500-14.unix][Mntpt: /export/sd04/www]NFS Recovery done for mount /export/sd04/www (mi 0xd9779800) on server x4500-14.unix, rnode_pt1 ./$customer-file (0xdc7496d8), rnode_pt2 ./$customer-file (0xe00af188) Nov 27 14:03:42 cgi01.unix nfs: [ID 236337 kern.info] NOTICE: [NFS4][Server: x4500-14.unix][Mntpt: /export/sd04/www]NFS op OP_OPEN got error NFS4ERR_STALE_CLIENTID causing recovery action NR_CLIENTID. Client also suspects that the server rebooted, or experienced a network partition. This only happens against NFS-Server/X4500 running Solaris 10 u8, not those running svn_117. No machine has been rebooted, no process has been restarted. They just periodically seem to lose all state. Currently once/day. Jorgen Lundman wrote: > > Hello list, > > NFS server (x4540): > * SunOS x4500-14.unix 5.10 Generic_141445-09 i86pc i386 i86pc > > and about 20 clients hanging off it, today I was logged in on cgi01 to > do my tests: > > * SunOS cgi01.unix 5.10 Generic_127128-11 i86pc i386 i86pc > > > This system is generally running well, but every now-and-then we get > periods where it slows to a crawl, even listing "df -h" output will > pause before the x4500-14 output. It seems to be connected to just > x4500-14, which is currently the only Solaris 10 u8 server is production > (the others are running 5.11 snv_117 and appear not to be involved). > > I can umount and mount the file-system and everything goes back to normal. > > The load on the x4500-14 itself is low: > > load averages: 0.35, 0.33, 0.36 15:05:34 > 52 processes: 50 sleeping, 1 stopped, 1 on cpu > CPU states: % idle, % user, % kernel, % iowait, % swap > Memory: 32G real, 2772M free, 6616M swap in use, 700M swap free > > PID USERNAME LWP PRI NICE SIZE RES STATE TIME CPU COMMAND > 832 daemon 96 60 -20 2968K 2160K sleep 92.8H 1.96% nfsd > > The nfsv4 domains match, the groups and passwd files are identical. All > mounted with: > > "remote/read/write/setuid/devices/vers=4/hard/intr/quota/xattr/dev=494001a" > > > On the cgi server, when I snoop, I see traffic like (lines with NFS4ERR): > > 172.20.12.26 -> 172.20.12.246 NFS C 4 (reopen ) PUTFH FH=8873 OPEN OT=NC > SQ=12626 CT=P DT=N AC=R DN=N OO=0B17 GETFH GETATTR 10011a b0a23a > 172.20.12.246 -> 172.20.12.26 NFS R 4 (reopen ) NFS4ERR_NO_GRACE PUTFH > NFS4_OK OPEN NFS4ERR_NO_GRACE > > > 172.20.12.246 -> 172.20.12.26 NFS R 4 (open ) NFS4ERR_STALE_CLIENTID > PUTFH NFS4_OK OPEN NFS4ERR_STALE_CLIENTID > 172.20.12.246 -> 172.20.12.26 TCP D=1017 S=2049 Ack=1854686637 > Seq=2641046014 Len=0 Win=49640 > 172.20.12.246 -> 172.20.12.26 NFS R 4 (open ) NFS4ERR_STALE_CLIENTID > PUTFH NFS4_OK OPEN NFS4ERR_STALE_CLIENTID > 172.20.12.246 -> 172.20.12.26 NFS R 4 (open ) NFS4ERR_STALE_CLIENTID > PUTFH NFS4_OK OPEN NFS4ERR_STALE_CLIENTID > 172.20.12.246 -> 172.20.12.26 NFS R 4 (open ) NFS4ERR_STALE_CLIENTID > PUTFH NFS4_OK OPEN NFS4ERR_STALE_CLIENTID > 172.20.12.246 -> 172.20.12.26 NFS R 4 (open ) NFS4ERR_STALE_CLIENTID > PUTFH NFS4_OK OPEN NFS4ERR_STALE_CLIENTID > [lots of these] > > 172.20.12.246 -> 172.20.12.26 NFS R 4 (setclientid ) NFS4_OK SETCLIENTID > NFS4_OK CL=5684ae805ee CFV=07B10568 > 172.20.12.26 -> 172.20.12.246 NFS C 4 (sclntid_conf) SETCLIENTID_CONFIRM > CL=5684ae805ee CFV=07B10568 PUTROOTFH GETATTR 400 0 > 172.20.12.246 -> 172.20.12.26 TCP D=32772 S=54238 Ack=524071072 > Seq=1242317416 Len=0 Win=49640 > 172.20.12.246 -> 172.20.12.26 NFS R 4 (sclntid_conf) NFS4_OK > SETCLIENTID_CONFIRM NFS4_OK PUTROOTFH NFS4_OK GETATTR NFS4_OK > 172.20.12.246 -> 172.20.12.26 NFS C CB_NULL > > > 172.20.12.26 -> 172.20.12.246 NFS C 4 (lookup ) PUTFH FH=956C SAVEFH > LOOKUP ilcsugamo.com GETFH GETATTR 10011a b0a23a RESTOREFH NVERIFY > GETATTR 1... > 172.20.12.246 -> 172.20.12.26 NFS R 4 (lookup ) NFS4ERR_SAME PUTFH > NFS4_OK SAVEFH NFS4_OK LOOKUP NFS4_OK GETFH NFS4_OK FH=9454 GETATTR > NFS4_OK RESTOREFH NFS4_... > 172.20.12.26 -> 172.20.12.246 NF
[nfs-discuss] NFS4ERR_STALE_CLIENTID and NFS4ERR_SAME with Solaris 10u8
Hello list, NFS server (x4540): * SunOS x4500-14.unix 5.10 Generic_141445-09 i86pc i386 i86pc and about 20 clients hanging off it, today I was logged in on cgi01 to do my tests: * SunOS cgi01.unix 5.10 Generic_127128-11 i86pc i386 i86pc This system is generally running well, but every now-and-then we get periods where it slows to a crawl, even listing "df -h" output will pause before the x4500-14 output. It seems to be connected to just x4500-14, which is currently the only Solaris 10 u8 server is production (the others are running 5.11 snv_117 and appear not to be involved). I can umount and mount the file-system and everything goes back to normal. The load on the x4500-14 itself is low: load averages: 0.35, 0.33, 0.36 15:05:34 52 processes: 50 sleeping, 1 stopped, 1 on cpu CPU states: % idle, % user, % kernel, % iowait, % swap Memory: 32G real, 2772M free, 6616M swap in use, 700M swap free PID USERNAME LWP PRI NICE SIZE RES STATETIMECPU COMMAND 832 daemon96 60 -20 2968K 2160K sleep 92.8H 1.96% nfsd The nfsv4 domains match, the groups and passwd files are identical. All mounted with: "remote/read/write/setuid/devices/vers=4/hard/intr/quota/xattr/dev=494001a" On the cgi server, when I snoop, I see traffic like (lines with NFS4ERR): 172.20.12.26 -> 172.20.12.246 NFS C 4 (reopen ) PUTFH FH=8873 OPEN OT=NC SQ=12626 CT=P DT=N AC=R DN=N OO=0B17 GETFH GETATTR 10011a b0a23a 172.20.12.246 -> 172.20.12.26 NFS R 4 (reopen ) NFS4ERR_NO_GRACE PUTFH NFS4_OK OPEN NFS4ERR_NO_GRACE 172.20.12.246 -> 172.20.12.26 NFS R 4 (open) NFS4ERR_STALE_CLIENTID PUTFH NFS4_OK OPEN NFS4ERR_STALE_CLIENTID 172.20.12.246 -> 172.20.12.26 TCP D=1017 S=2049 Ack=1854686637 Seq=2641046014 Len=0 Win=49640 172.20.12.246 -> 172.20.12.26 NFS R 4 (open) NFS4ERR_STALE_CLIENTID PUTFH NFS4_OK OPEN NFS4ERR_STALE_CLIENTID 172.20.12.246 -> 172.20.12.26 NFS R 4 (open) NFS4ERR_STALE_CLIENTID PUTFH NFS4_OK OPEN NFS4ERR_STALE_CLIENTID 172.20.12.246 -> 172.20.12.26 NFS R 4 (open) NFS4ERR_STALE_CLIENTID PUTFH NFS4_OK OPEN NFS4ERR_STALE_CLIENTID 172.20.12.246 -> 172.20.12.26 NFS R 4 (open) NFS4ERR_STALE_CLIENTID PUTFH NFS4_OK OPEN NFS4ERR_STALE_CLIENTID [lots of these] 172.20.12.246 -> 172.20.12.26 NFS R 4 (setclientid ) NFS4_OK SETCLIENTID NFS4_OK CL=5684ae805ee CFV=07B10568 172.20.12.26 -> 172.20.12.246 NFS C 4 (sclntid_conf) SETCLIENTID_CONFIRM CL=5684ae805ee CFV=07B10568 PUTROOTFH GETATTR 400 0 172.20.12.246 -> 172.20.12.26 TCP D=32772 S=54238 Ack=524071072 Seq=1242317416 Len=0 Win=49640 172.20.12.246 -> 172.20.12.26 NFS R 4 (sclntid_conf) NFS4_OK SETCLIENTID_CONFIRM NFS4_OK PUTROOTFH NFS4_OK GETATTR NFS4_OK 172.20.12.246 -> 172.20.12.26 NFS C CB_NULL 172.20.12.26 -> 172.20.12.246 NFS C 4 (lookup ) PUTFH FH=956C SAVEFH LOOKUP ilcsugamo.com GETFH GETATTR 10011a b0a23a RESTOREFH NVERIFY GETATTR 1... 172.20.12.246 -> 172.20.12.26 NFS R 4 (lookup ) NFS4ERR_SAME PUTFH NFS4_OK SAVEFH NFS4_OK LOOKUP NFS4_OK GETFH NFS4_OK FH=9454 GETATTR NFS4_OK RESTOREFH NFS4_... 172.20.12.26 -> 172.20.12.246 NFS C 4 (lookup ) PUTFH FH=9454 SAVEFH LOOKUP ja GETFH GETATTR 10011a b0a23a RESTOREFH NVERIFY GETATTR 10011a b0a23... 172.20.12.246 -> 172.20.12.26 NFS R 4 (lookup ) NFS4ERR_SAME PUTFH NFS4_OK SAVEFH NFS4_OK LOOKUP NFS4_OK GETFH NFS4_OK FH=8CAD GETATTR NFS4_OK RESTOREFH NFS4_... [lots of these too] If I leave it alone for about 5 minutes, it does seem to sort itself out. I suspect it is more that all 20 clients get this happening at exactly the same time, so it takes near 15 minutes for it to recover (on its own). This would indicate that it is the x4500-14 server that is losing state. But the processes for NFS (nfsd, stat, lockd and mapid) have all the same date (days in the past) and nfsd/lockd are using 96/3 threads respectively. Which one of the snoop output MIGHT be related to the trouble? I'm unsure which ERRORS are part of standard NFS operations, and which are not. Thanks for any replies! Lund -- Jorgen Lundman | Unix Administrator | +81 (0)3 -5456-2687 ext 1017 (work) Shibuya-ku, Tokyo| +81 (0)90-5578-8500 (cell) Japan| +81 (0)3 -3375-1767 (home)
