[nfs-discuss] NFS4ERR_STALE_CLIENTID and NFS4ERR_SAME with Solaris 10u8

2010-03-20 Thread Udo Grabowski
Theory falsified, it happened again with NAT switched off. No clues in snoop, 
too late.
-- 
This message posted from opensolaris.org


[nfs-discuss] NFS4ERR_STALE_CLIENTID and NFS4ERR_SAME with Solaris 10u8

2010-03-18 Thread Udo Grabowski
Pavel Filipensky wrote:
> Hi Udo,
> 
> how long have you been on vanilla Osol 2009.06 and how long on SRU u6 
> before updating to SRU u7? Is it possible that SRU u6 has the issue as 
> well?

We apparently switched to SU5 on 21-Oct-09, to SU6 on 17-Dec-2009,
and then to SU7+IDR30+IDR35 on 14-Jan-2010, after that these problems
started to occur about once per 1-2 days. We had unexplained hangups
before on Mondays, but with a much lower frequency, and probably related to
zfs scrubs which triggered (via the ARC) swapping (which in turn was buggy
as well...).

I am now suspecting that maybe NFS is right when denying access through
NFSV4ERR_NO_GRACE, since I found a DNS query related to nfs4v_mapid (we
have not set it, all NFS mapping should be done on sys locally via /etc/hosts),
and our local machines on the private net have access to dns through
NAT (no ipf.conf entries), which seems ok, but a few machines with
private AND public network access have an additional (unwanted) route on the 
private net to the public net via this NAT. This seems to result sometimes
(don't know why) in a DNS query about the imksunxxx machine (which is a local
entry in /etc/hosts on the private net), and the DNS query would return
the short name resolved to imksunxxx.ourdomain.tld with a public IP, and,
voila, NFS sees a different client under the same short name and must deny
access due to ambiguity (but I suspect that this should be visible in the
snoop as a FQHN?). Or, alternatively, the client is visible through
NAT and the private net, and this could trigger the problem. It's still
inconlusive to me.

As a test I switched off NAT (we only need it for mail and updates), and
the hangups are gone for 3 days now (but we will see). Maybe we need
ipf.conf entries which filter out the private net traffic to prevent
additional routes (or even loops?).

> I was suspicious about one integration which went to SRUu2 and to S10U8 
> and to snv_114, but the change is only on the nfs client side - this 
> does not match the set-up described earlier by Jorgen:
> - clients are s10u5
> - server/s10u8 (issue)
> - server/snv_117 (no issue)
> 
> I am not able to find out more from the available data - unless the 
> problem is reproducible it is hard to diagnose.
> I have made one observation (not sure if it is useful)
> 
> reopen with CLAIM_PREVIOUS (CT=P) fails with NFS4ERR_NO_GRACE,
> but reopen with CLAIM_NULL (CT=N) succeeds with NFS4_OK
> 
> Pavel
> 
> On 03/11/10 14:16, Udo Grabowski wrote:
>> Hi Pavel,
>>
>> both clients and server were updated (we always have a consistent 
>> environment), and we came
>> from u6, to which we updated directly from a vanilla Osol 2009.06 
>> before. The Readmes don't list any NFS patches there, so I suspect 
>> that our IDR30 patch carries unwanted changes from Solaris u8 into 
>> Opensolaris U7 which trigger this problem. Since we don't use Solaris 
>> 10, I cannot
>> confirm that 10u7 did not have that problem, I just concluded it from 
>> the initial post here (but
>> that conclusion maybe wrong, I admit).
>> We currently snoop the problem and catched some clues (maybe): Shortly 
>> before, we
>> see CB_NULL and NULL4 exchanges, seemingly as a result of a client 
>> renewal (we do not catch everything before), both server and client 
>> seem to check their partners callback capabilities:
 >> .
-- 
Dr.Udo GrabowskiInst.f.Meteorology a.Climate Research IMK-ASF-SAT
www-imk.fzk.de/asf/sat/grabowski/ www.imk-asf.kit.edu/english/sat.php
KIT - Karlsruhe Institute of Technologyhttp://www.kit.edu
Postfach 3640,76021 Karlsruhe,Germany Tel:(+49)7247 82-6026,Fax:-7026


[nfs-discuss] NFS4ERR_STALE_CLIENTID and NFS4ERR_SAME with Solaris 10u8

2010-03-18 Thread Jorgen Lundman


I have not had any progress on this personally, but we have sent live-cores to 
Sun-Japan, and well as all other details they requested.

However, as a work around, the NOC staff is re-mounting the NFS mounts once a 
day. One server at a time. This has prevented the issue from happening.
We still see the log entries, the full hang of NFS mounts has not happened, so 
far.

If more information is required, I can ask NOC to leave one server out of 
remounting, and run snoop or whatever is required until NFS stalls.



-- 
Jorgen Lundman   | 
Unix Administrator   | +81 (0)3 -5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo| +81 (0)90-5578-8500  (cell)
Japan| +81 (0)3 -3375-1767  (home)


[nfs-discuss] NFS4ERR_STALE_CLIENTID and NFS4ERR_SAME with Solaris 10u8

2010-03-17 Thread Pavel Filipensky
Hi Udo,

how long have you been on vanilla Osol 2009.06 and how long on SRU u6 
before updating to SRU u7? Is it possible that SRU u6 has the issue as well?
I was suspicious about one integration which went to SRUu2 and to S10U8 
and to snv_114, but the change is only on the nfs client side - this 
does not match the set-up described earlier by Jorgen:
- clients are s10u5
- server/s10u8 (issue)
- server/snv_117 (no issue)

I am not able to find out more from the available data - unless the 
problem is reproducible it is hard to diagnose.
I have made one observation (not sure if it is useful)

reopen with CLAIM_PREVIOUS (CT=P) fails with NFS4ERR_NO_GRACE,
but reopen with CLAIM_NULL (CT=N) succeeds with NFS4_OK

Pavel

On 03/11/10 14:16, Udo Grabowski wrote:
> Hi Pavel,
>
> both clients and server were updated (we always have a consistent 
> environment), and we came
> from u6, to which we updated directly from a vanilla Osol 2009.06 before. The 
> Readmes don't 
> list any NFS patches there, so I suspect that our IDR30 patch carries 
> unwanted changes from 
> Solaris u8 into Opensolaris U7 which trigger this problem. Since we don't use 
> Solaris 10, I cannot
> confirm that 10u7 did not have that problem, I just concluded it from the 
> initial post here (but
> that conclusion maybe wrong, I admit).
> We currently snoop the problem and catched some clues (maybe): Shortly 
> before, we
> see CB_NULL and NULL4 exchanges, seemingly as a result of a client renewal 
> (we do not 
> catch everything before), both server and client seem to check their partners 
> callback capabilities:
>
>imksunxxx -> imksunyyy   NFS C 4 (setclientid ) PUTROOTFH GETATTR 400 0 
> SETCLIENTID Prog=1073741826 ID=tcp Addr=192.168.2.151.156.34 CBID=1073741...
>imksunyyy -> imksunxxxNFS R 4 (setclientid ) NFS4_OK PUTROOTFH NFS4_OK 
> GETATTR NFS4_OK SETCLIENTID NFS4_OK CL=3144b6920ee CFV=2AE10314 
>imksunxxx -> imksunyyyNFS C 4 (sclntid_conf) SETCLIENTID_CONFIRM 
> CL=3144b6920ee CFV=2AE10314 
>imksunyyy -> imksunxxxNFS R 4 (sclntid_conf) NFS4_OK 
> SETCLIENTID_CONFIRM NFS4_OK 
>imksunyyy -> imksunxxxNFS C CB_NULL
>imksunxxx -> imksunyyyNFS R CB_NULL 
>imksunxxx -> imksunyyyTCP D=2049 S=772 Ack=2315081869 Seq=1684524332 
> Len=0 Win=64074 Options=
>imksunyyy -> imksunxxxTCP D=39970 S=46007 Ack=2765004476 
> Seq=2079880454 Len=0 Win=64074 Options=
>imksunxxx -> imksunyyyTCP D=2049 S=35143 Syn Seq=3666179320 Len=0 
> Win=64057 Options= 4,nop,nop,sackOK>
>imksunyyy -> imksunxxxTCP D=35143 S=2049 Syn Ack=3666179321 
> Seq=3064768973 Len=0 Win=64074 Options= 449055009,mss 1460,nop,wscale 4,nop,nop,sackOK>
>imksunxxx -> imksunyyyTCP D=2049 S=35143 Ack=3064768974 Seq=3666179321 
> Len=0 Win=64074 Options=
>imksunxxx -> imksunyyyNFS C NULL4
>imksunyyy -> imksunxxxTCP D=35143 S=2049 Ack=3666179365 Seq=3064768974 
> Len=0 Win=64074 Options=
>imksunyyy -> imksunxxxNFS R NULL4 
>imksunxxx -> imksunyyy   TCP D=2049 S=35143 Ack=3064769002 Seq=3666179365 
> Len=0 Win=64074 Options=
>imksunxxx -> imksunyyyTCP D=2049 S=35143 Fin Ack=3064769002 
> Seq=3666179365 Len=0 Win=64074 Options=
>imksunyyy -> imksunxxxTCP D=35143 S=2049 Ack=3666179366 Seq=3064769002 
> Len=0 Win=64074 Options=
>imksunyyy -> imksunxxxTCP D=35143 S=2049 Fin Ack=3666179366 
> Seq=3064769002 Len=0 Win=64074 Options=
>imksunxxx -> imksunyyyTCP D=2049 S=35143 Ack=3064769003 Seq=3666179366 
> Len=0 Win=64074 Options=
> ** here starts the reopen mess, clients hang **
>imksunxxx -> imksunyyyNFS C 4 (reopen  ) PUTFH FH=92CA OPEN OT=NC 
> SQ=21955 CT=P DT=N AC=R DN=N OO=0670 GETFH GETATTR 10011a b0a23a 
>imksunyyy -> imksunxxxNFS R 4 (reopen  ) NFS4ERR_NO_GRACE PUTFH 
> NFS4_OK OPEN NFS4ERR_NO_GRACE 
>imksunxxx -> imksunyyy   NFS C 4 (reopen  ) PUTFH FH=95AC OPEN 
> start_version OT=NC SQ=21956 CT=N AC=R DN=N OO=0670 GETFH GETATTR 10011a 
> b0a23a 
>imksunyyy -> imksunxxxNFS R 4 (reopen  ) NFS4_OK PUTFH NFS4_OK 
> OPEN NFS4_OK ST=150F:10979 RF=PL DT=N GETFH NFS4_OK FH=92CA GETATTR NFS4_OK 
>imksunxxx -> imksunyyyNFS C 4 (reopen  ) PUTFH FH=92CA OPEN OT=NC 
> SQ=21955 CT=P DT=N AC=R DN=N OO=07D1 GETFH GETATTR 10011a b0a23a 
>imksunyyy -> imksunxxxNFS R 4 (reopen  ) NFS4ERR_NO_GRACE PUTFH 
> NFS4_OK OPEN NFS4ERR_NO_GRACE 
>  
> Another one here, this time after the reopen:
>
>imksunzzz -> local-yyyNFS C 4 (reopen  ) PUTFH FH=92CA OPEN OT=NC 
> SQ=42131 CT=P DT=N AC=R DN=N OO=065A GETFH GETATTR 10011a b0a23a 
>local-yyy -> imksunzzzNFS R 4 (reopen  ) NFS4ERR_NO_GRACE PUTFH 
> NFS4_OK OPEN NFS4ERR_NO_GRACE 
>imksunzzz -> local-yyyNFS C 4 (reopen  ) PUTFH FH=95AC OPEN 
> start_version OT=NC SQ=42132 CT=N AC=R DN=N OO=065A GETFH GETATTR 10011a 
> b0a23a 
>local-yyy -> imksunzzzNFS R 4 (reopen  ) NFS4_OK PUTFH NFS

[nfs-discuss] NFS4ERR_STALE_CLIENTID and NFS4ERR_SAME with Solaris 10u8

2010-03-11 Thread Udo Grabowski
Hi Pavel,

both clients and server were updated (we always have a consistent environment), 
and we came
from u6, to which we updated directly from a vanilla Osol 2009.06 before. The 
Readmes don't 
list any NFS patches there, so I suspect that our IDR30 patch carries unwanted 
changes from 
Solaris u8 into Opensolaris U7 which trigger this problem. Since we don't use 
Solaris 10, I cannot
confirm that 10u7 did not have that problem, I just concluded it from the 
initial post here (but
that conclusion maybe wrong, I admit).
We currently snoop the problem and catched some clues (maybe): Shortly before, 
we
see CB_NULL and NULL4 exchanges, seemingly as a result of a client renewal (we 
do not 
catch everything before), both server and client seem to check their partners 
callback capabilities:

   imksunxxx -> imksunyyy   NFS C 4 (setclientid ) PUTROOTFH GETATTR 400 0 
SETCLIENTID Prog=1073741826 ID=tcp Addr=192.168.2.151.156.34 CBID=1073741...
   imksunyyy -> imksunxxxNFS R 4 (setclientid ) NFS4_OK PUTROOTFH NFS4_OK 
GETATTR NFS4_OK SETCLIENTID NFS4_OK CL=3144b6920ee CFV=2AE10314 
   imksunxxx -> imksunyyyNFS C 4 (sclntid_conf) SETCLIENTID_CONFIRM 
CL=3144b6920ee CFV=2AE10314 
   imksunyyy -> imksunxxxNFS R 4 (sclntid_conf) NFS4_OK SETCLIENTID_CONFIRM 
NFS4_OK 
   imksunyyy -> imksunxxxNFS C CB_NULL
   imksunxxx -> imksunyyyNFS R CB_NULL 
   imksunxxx -> imksunyyyTCP D=2049 S=772 Ack=2315081869 Seq=1684524332 
Len=0 Win=64074 Options=
   imksunyyy -> imksunxxxTCP D=39970 S=46007 Ack=2765004476 Seq=2079880454 
Len=0 Win=64074 Options=
   imksunxxx -> imksunyyyTCP D=2049 S=35143 Syn Seq=3666179320 Len=0 
Win=64057 Options=
   imksunyyy -> imksunxxxTCP D=35143 S=2049 Syn Ack=3666179321 
Seq=3064768973 Len=0 Win=64074 Options=
   imksunxxx -> imksunyyyTCP D=2049 S=35143 Ack=3064768974 Seq=3666179321 
Len=0 Win=64074 Options=
   imksunxxx -> imksunyyyNFS C NULL4
   imksunyyy -> imksunxxxTCP D=35143 S=2049 Ack=3666179365 Seq=3064768974 
Len=0 Win=64074 Options=
   imksunyyy -> imksunxxxNFS R NULL4 
   imksunxxx -> imksunyyy   TCP D=2049 S=35143 Ack=3064769002 Seq=3666179365 
Len=0 Win=64074 Options=
   imksunxxx -> imksunyyyTCP D=2049 S=35143 Fin Ack=3064769002 
Seq=3666179365 Len=0 Win=64074 Options=
   imksunyyy -> imksunxxxTCP D=35143 S=2049 Ack=3666179366 Seq=3064769002 
Len=0 Win=64074 Options=
   imksunyyy -> imksunxxxTCP D=35143 S=2049 Fin Ack=3666179366 
Seq=3064769002 Len=0 Win=64074 Options=
   imksunxxx -> imksunyyyTCP D=2049 S=35143 Ack=3064769003 Seq=3666179366 
Len=0 Win=64074 Options=
** here starts the reopen mess, clients hang **
   imksunxxx -> imksunyyyNFS C 4 (reopen  ) PUTFH FH=92CA OPEN OT=NC 
SQ=21955 CT=P DT=N AC=R DN=N OO=0670 GETFH GETATTR 10011a b0a23a 
   imksunyyy -> imksunxxxNFS R 4 (reopen  ) NFS4ERR_NO_GRACE PUTFH 
NFS4_OK OPEN NFS4ERR_NO_GRACE 
   imksunxxx -> imksunyyy   NFS C 4 (reopen  ) PUTFH FH=95AC OPEN 
start_version OT=NC SQ=21956 CT=N AC=R DN=N OO=0670 GETFH GETATTR 10011a b0a23a 
   imksunyyy -> imksunxxxNFS R 4 (reopen  ) NFS4_OK PUTFH NFS4_OK OPEN 
NFS4_OK ST=150F:10979 RF=PL DT=N GETFH NFS4_OK FH=92CA GETATTR NFS4_OK 
   imksunxxx -> imksunyyyNFS C 4 (reopen  ) PUTFH FH=92CA OPEN OT=NC 
SQ=21955 CT=P DT=N AC=R DN=N OO=07D1 GETFH GETATTR 10011a b0a23a 
   imksunyyy -> imksunxxxNFS R 4 (reopen  ) NFS4ERR_NO_GRACE PUTFH 
NFS4_OK OPEN NFS4ERR_NO_GRACE 
 
Another one here, this time after the reopen:

   imksunzzz -> local-yyyNFS C 4 (reopen  ) PUTFH FH=92CA OPEN OT=NC 
SQ=42131 CT=P DT=N AC=R DN=N OO=065A GETFH GETATTR 10011a b0a23a 
   local-yyy -> imksunzzzNFS R 4 (reopen  ) NFS4ERR_NO_GRACE PUTFH 
NFS4_OK OPEN NFS4ERR_NO_GRACE 
   imksunzzz -> local-yyyNFS C 4 (reopen  ) PUTFH FH=95AC OPEN 
start_version OT=NC SQ=42132 CT=N AC=R DN=N OO=065A GETFH GETATTR 10011a b0a23a 
   local-yyy -> imksunzzzNFS R 4 (reopen  ) NFS4_OK PUTFH NFS4_OK OPEN 
NFS4_OK ST=1B1B:21067 RF=PL DT=N GETFH NFS4_OK FH=92CA GETATTR NFS4_OK 
   imksunzzz -> local-yyyNFS C 4 (open) PUTFH FH=95AC OPEN 
start_version OT=NC SQ=1 CT=N AC=R DN=N OO=0F54 GETFH GETATTR 10011a b0a23a 
   local-yyy -> imksunzzzNFS R 4 (open) NFS4ERR_STALE_CLIENTID 
PUTFH NFS4_OK OPEN NFS4ERR_STALE_CLIENTID 
   imksunzzz -> local-yyyNFS C 4 (setclientid ) PUTROOTFH GETATTR 400 0 
SETCLIENTID Prog=1073741826 ID=tcp Addr=192.168.2.157.145.153 CBID=107374...
   local-yyy -> imksunzzzNFS R 4 (setclientid ) NFS4_OK PUTROOTFH NFS4_OK 
GETATTR NFS4_OK SETCLIENTID NFS4_OK CL=3154b6920ee CFV=524A0315 
   imksunzzz -> local-yyyNFS C 4 (sclntid_conf) SETCLIENTID_CONFIRM 
CL=3154b6920ee CFV=524A0315 
   local-yyy -> imksunzzzNFS R 4 (sclntid_conf) NFS4_OK SETCLIENTID_CONFIRM 
NFS4_OK 
   local-yyy -> imksunzzzNFS C CB_NULL
   imksunzzz -> local-yyyNFS R CB_NULL 

and a third one with a couple of NFS4ERR_STAL

[nfs-discuss] NFS4ERR_STALE_CLIENTID and NFS4ERR_SAME with Solaris 10u8

2010-02-22 Thread Pavel Filipensky
Hi,

what was updated to Opensolaris 2009.06 SU7? NFS server or NFS clients 
or both? What was the previous version before updating to s 2009.06 SU7? 
You also mention that this is seen on solaris 10u8. Can you confirm that 
it was not seen on s10u7?

It is not obvious if this is a bug on the client or on the server. The 
snoop and the nfs diagnostic messages from syslog do not show enough to 
get closer to the root cause.

In  s10u8 patch 141734-02 there were many changes for both nfs server 
and nfs client. So one of the possibilities is that the problem is 
related to that patch, otoh only few  fixes  from 141734-02 were 
integrated to Opensolaris 2009.06 updates.

Thanks,
Pavel

On 02/08/10 13:56, Udo Grabowski wrote:
> Hello,
>
> we have the same problem here after update to Opensolaris 2009.06 SU7 
> (+IDR30+35)
> on a 40 clients production system.
> About every second day one of our servers looses state,  and gives the output 
> below on snoop
> (numerous reopen attempts with a  NFS4ERR_NO_GRACE error). All clients hang, 
> since 
> they don't get their data, only a restart of the nfs/server process cures 
> this problem.
> This seems to be a serious bug in the NFS module introduced recently in both 
> Opensolaris
> support line and solaris 10 u8.
>
>sun -> local-s11NFS R 4 (open) NFS4ERR_STALE_CLIENTID PUTFH 
> NFS4_OK OPEN NFS4ERR_STALE_CLIENTID
>local-s11 -> sun   NFS C 4 (reopen  ) PUTFH FH=B5F3 OPEN OT=NC 
> SQ=14571 CT=P DT=N AC=W DN=N OO=0955 GE
> TFH GETATTR 10011a b0a23a 
>sun -> local-s11NFS R 4 (reopen  ) NFS4ERR_NO_GRACE PUTFH NFS4_OK 
> OPEN NFS4ERR_NO_GRACE 
>? -> (multicast)  ETHER Type=0001 (LLC/802.3), size=52 bytes
>local-s11 -> sunNFS C 4 (reopen  ) PUTFH FH=92CA OPEN OT=NC 
> SQ=14539 CT=P DT=N AC=R DN=N OO=0A4C GE
> TFH GETATTR 10011a b0a23a 
>local-s11 -> sunTCP D=2049 S=1023 Ack=293438085 Seq=4069541650 Len=0 
> Win=64074 Options= 25776951 188898831>
>local-s11 -> sunTCP D=2049 S=1023 Ack=293438153 Seq=4069541650 Len=0 
> Win=64074 Options= 25776951 188898907>
>sun -> local-s11NFS R 4 (reopen  ) NFS4ERR_NO_GRACE PUTFH NFS4_OK 
> OPEN NFS4ERR_NO_GRACE 
>local-s11 -> sunNFS C 4 (reopen  ) PUTFH FH=AA7B OPEN 
> imksuns11:1.log OT=NC SQ=14572 CT=N AC=W DN=N
>  OO=0955 GETFH GETATTR 10011a b0a...
>local-s11 -> sunNFS C 4 (reopen  ) PUTFH FH=95AC OPEN 
> start_version OT=NC SQ=14540 CT=N AC=R DN=N O
> O=0A4C GETFH GETATTR 10011a b0a23a 
>sun -> local-s11TCP D=1023 S=2049 Ack=4069542150 Seq=293438289 Len=0 
> Win=64074 Options= 188898915 25776951>
>sun -> local-s11NFS R 4 (reopen  ) NFS4_OK PUTFH NFS4_OK OPEN 
> NFS4_OK ST=1A6F:7271 RF=PL DT=N GETFH
>  NFS4_OK FH=92CA GETATTR NFS4_OK 
>local-s11 -> sunNFS C 4 (reopen  ) PUTFH FH=92CA OPEN OT=NC 
> SQ=14539 CT=P DT=N AC=R DN=N OO=0A0F GE
> TFH GETATTR 10011a b0a23a 
>sun -> local-s11NFS R 4 (reopen  ) NFS4ERR_NO_GRACE PUTFH NFS4_OK 
> OPEN NFS4ERR_NO_GRACE 
>local-s11 -> sunNFS C 4 (reopen  ) PUTFH FH=95AC OPEN 
> start_version OT=NC SQ=14540 CT=N AC=R DN=N O
> O=0A0F GETFH GETATTR 10011a b0a23a 
>sun -> local-s11NFS R 4 (reopen  ) NFS4_OK PUTFH NFS4_OK OPEN 
> NFS4_OK ST=1B35:7271 RF=PL DT=N GETFH
>  NFS4_OK FH=92CA GETATTR NFS4_OK 
>local-s11 -> sunNFS C 4 (reopen  ) PUTFH FH=92CA OPEN OT=NC 
> SQ=14539 CT=P DT=N AC=R DN=N OO=0AEE GE
> TFH GETATTR 10011a b0a23a 
>sun -> local-s11NFS R 4 (reopen  ) NFS4ERR_NO_GRACE PUTFH NFS4_OK 
> OPEN NFS4ERR_NO_GRACE 
>local-s11 -> sunNFS C 4 (reopen  ) PUTFH FH=95AC OPEN 
> start_version OT=NC SQ=14540 CT=N AC=R DN=N O
>
> NFS4ERR_NO_GRACE
>  A reclaim of client state was attempted in circumstances in 
>   which the server cannot guarantee that conflicting state has 
>   not been provided to another client.  This can occur because 
>   the reclaim has been done outside of the grace period of the
>   server, after the client has done a RECLAIM_COMPLETE operation,
>   or because previous operations have created a situation in which
>   the server is not able to determine that a reclaim-interfering
>   edge condition does not exist.
>   



[nfs-discuss] NFS4ERR_STALE_CLIENTID and NFS4ERR_SAME with Solaris 10u8

2010-02-08 Thread Udo Grabowski
Hello,

we have the same problem here after update to Opensolaris 2009.06 SU7 
(+IDR30+35)
on a 40 clients production system.
About every second day one of our servers looses state,  and gives the output 
below on snoop
(numerous reopen attempts with a  NFS4ERR_NO_GRACE error). All clients hang, 
since 
they don't get their data, only a restart of the nfs/server process cures this 
problem.
This seems to be a serious bug in the NFS module introduced recently in both 
Opensolaris
support line and solaris 10 u8.

   sun -> local-s11NFS R 4 (open) NFS4ERR_STALE_CLIENTID PUTFH 
NFS4_OK OPEN NFS4ERR_STALE_CLIENTID
   local-s11 -> sun   NFS C 4 (reopen  ) PUTFH FH=B5F3 OPEN OT=NC SQ=14571 
CT=P DT=N AC=W DN=N OO=0955 GE
TFH GETATTR 10011a b0a23a 
   sun -> local-s11NFS R 4 (reopen  ) NFS4ERR_NO_GRACE PUTFH NFS4_OK 
OPEN NFS4ERR_NO_GRACE 
   ? -> (multicast)  ETHER Type=0001 (LLC/802.3), size=52 bytes
   local-s11 -> sunNFS C 4 (reopen  ) PUTFH FH=92CA OPEN OT=NC SQ=14539 
CT=P DT=N AC=R DN=N OO=0A4C GE
TFH GETATTR 10011a b0a23a 
   local-s11 -> sunTCP D=2049 S=1023 Ack=293438085 Seq=4069541650 Len=0 
Win=64074 Options=
   local-s11 -> sunTCP D=2049 S=1023 Ack=293438153 Seq=4069541650 Len=0 
Win=64074 Options=
   sun -> local-s11NFS R 4 (reopen  ) NFS4ERR_NO_GRACE PUTFH NFS4_OK 
OPEN NFS4ERR_NO_GRACE 
   local-s11 -> sunNFS C 4 (reopen  ) PUTFH FH=AA7B OPEN 
imksuns11:1.log OT=NC SQ=14572 CT=N AC=W DN=N
 OO=0955 GETFH GETATTR 10011a b0a...
   local-s11 -> sunNFS C 4 (reopen  ) PUTFH FH=95AC OPEN start_version 
OT=NC SQ=14540 CT=N AC=R DN=N O
O=0A4C GETFH GETATTR 10011a b0a23a 
   sun -> local-s11TCP D=1023 S=2049 Ack=4069542150 Seq=293438289 Len=0 
Win=64074 Options=
   sun -> local-s11NFS R 4 (reopen  ) NFS4_OK PUTFH NFS4_OK OPEN 
NFS4_OK ST=1A6F:7271 RF=PL DT=N GETFH
 NFS4_OK FH=92CA GETATTR NFS4_OK 
   local-s11 -> sunNFS C 4 (reopen  ) PUTFH FH=92CA OPEN OT=NC SQ=14539 
CT=P DT=N AC=R DN=N OO=0A0F GE
TFH GETATTR 10011a b0a23a 
   sun -> local-s11NFS R 4 (reopen  ) NFS4ERR_NO_GRACE PUTFH NFS4_OK 
OPEN NFS4ERR_NO_GRACE 
   local-s11 -> sunNFS C 4 (reopen  ) PUTFH FH=95AC OPEN start_version 
OT=NC SQ=14540 CT=N AC=R DN=N O
O=0A0F GETFH GETATTR 10011a b0a23a 
   sun -> local-s11NFS R 4 (reopen  ) NFS4_OK PUTFH NFS4_OK OPEN 
NFS4_OK ST=1B35:7271 RF=PL DT=N GETFH
 NFS4_OK FH=92CA GETATTR NFS4_OK 
   local-s11 -> sunNFS C 4 (reopen  ) PUTFH FH=92CA OPEN OT=NC SQ=14539 
CT=P DT=N AC=R DN=N OO=0AEE GE
TFH GETATTR 10011a b0a23a 
   sun -> local-s11NFS R 4 (reopen  ) NFS4ERR_NO_GRACE PUTFH NFS4_OK 
OPEN NFS4ERR_NO_GRACE 
   local-s11 -> sunNFS C 4 (reopen  ) PUTFH FH=95AC OPEN start_version 
OT=NC SQ=14540 CT=N AC=R DN=N O

NFS4ERR_NO_GRACE
 A reclaim of client state was attempted in circumstances in 
  which the server cannot guarantee that conflicting state has 
  not been provided to another client.  This can occur because 
  the reclaim has been done outside of the grace period of the
  server, after the client has done a RECLAIM_COMPLETE operation,
  or because previous operations have created a situation in which
  the server is not able to determine that a reclaim-interfering
  edge condition does not exist.
-- 
This message posted from opensolaris.org


[nfs-discuss] NFS4ERR_STALE_CLIENTID and NFS4ERR_SAME with Solaris 10u8

2009-12-01 Thread Jorgen Lundman
Another client:

cgi01.unix 5.10 Generic_127128-11

will issue these lines when it happens:

Nov 27 14:01:55 cgi01.unix nfs: [ID 236337 kern.info] NOTICE: [NFS4][Server: 
x4500-07.unix][Mntpt: /export/sd03/www]NFS op OP_OPEN got error NFS4ERR_DELAY 
causing recovery action NR_DELAY.

Nov 27 14:01:55 cgi01.unix last message repeated 98 times

Nov 27 14:01:55 cgi01.unix nfs: [ID 236337 kern.info] NOTICE: [NFS4][Server: 
x4500-07.unix][Mntpt: /export/sd03/www]NFS op OP_GETATTR got error 
NFS4ERR_STALE 
causing recovery action NR_STALE.

Nov 27 14:01:55 cgi01.unix nfs: [ID 286389 kern.info] NOTICE: [NFS4][Server: 
x4500-07.unix][Mntpt: /export/sd03/www]File ./$customer-path/$file.lok 
(rnode_pt: e06c73d0) was closed due to NFS recovery error on server 
x4500-07.unix(failed to recover from NFS4ERR_STALE NFS4ERR_STALE)

Nov 27 14:01:55 cgi01.unix nfs: [ID 941083 kern.info] NOTICE: NFS4 FACT SHEET:

Nov 27 14:01:55 cgi01.unix  Action: NR_STALE

Nov 27 14:01:55 cgi01.unix  NFS4 error: NFS4ERR_STALE

Nov 27 14:03:42 cgi01.unix nfs: [ID 273629 kern.info] NOTICE: [NFS4][Server: 
x4500-14.unix][Mntpt: /export/sd04/www]NFS Recovery done for mount 
/export/sd04/www (mi 0xd9779800) on server x4500-14.unix, rnode_pt1 
./$customer-file (0xdc7496d8), rnode_pt2 ./$customer-file (0xe00af188)

Nov 27 14:03:42 cgi01.unix nfs: [ID 236337 kern.info] NOTICE: [NFS4][Server: 
x4500-14.unix][Mntpt: /export/sd04/www]NFS op OP_OPEN got error 
NFS4ERR_STALE_CLIENTID causing recovery action NR_CLIENTID.  Client also 
suspects that the server rebooted, or experienced a network partition.


This only happens against NFS-Server/X4500 running Solaris 10 u8, not those 
running svn_117. No machine has been rebooted, no process has been restarted. 
They just periodically seem to lose all state. Currently once/day.




Jorgen Lundman wrote:
>
> Hello list,
>
> NFS server (x4540):
> * SunOS x4500-14.unix 5.10 Generic_141445-09 i86pc i386 i86pc
>
> and about 20 clients hanging off it, today I was logged in on cgi01 to
> do my tests:
>
> * SunOS cgi01.unix 5.10 Generic_127128-11 i86pc i386 i86pc
>
>
> This system is generally running well, but every now-and-then we get
> periods where it slows to a crawl, even listing "df -h" output will
> pause before the x4500-14 output. It seems to be connected to just
> x4500-14, which is currently the only Solaris 10 u8 server is production
> (the others are running 5.11 snv_117 and appear not to be involved).
>
> I can umount and mount the file-system and everything goes back to normal.
>
> The load on the x4500-14 itself is low:
>
> load averages: 0.35, 0.33, 0.36 15:05:34
> 52 processes: 50 sleeping, 1 stopped, 1 on cpu
> CPU states: % idle, % user, % kernel, % iowait, % swap
> Memory: 32G real, 2772M free, 6616M swap in use, 700M swap free
>
> PID USERNAME LWP PRI NICE SIZE RES STATE TIME CPU COMMAND
> 832 daemon 96 60 -20 2968K 2160K sleep 92.8H 1.96% nfsd
>
> The nfsv4 domains match, the groups and passwd files are identical. All
> mounted with:
>
> "remote/read/write/setuid/devices/vers=4/hard/intr/quota/xattr/dev=494001a"
>
>
> On the cgi server, when I snoop, I see traffic like (lines with NFS4ERR):
>
> 172.20.12.26 -> 172.20.12.246 NFS C 4 (reopen ) PUTFH FH=8873 OPEN OT=NC
> SQ=12626 CT=P DT=N AC=R DN=N OO=0B17 GETFH GETATTR 10011a b0a23a
> 172.20.12.246 -> 172.20.12.26 NFS R 4 (reopen ) NFS4ERR_NO_GRACE PUTFH
> NFS4_OK OPEN NFS4ERR_NO_GRACE
>
>
> 172.20.12.246 -> 172.20.12.26 NFS R 4 (open ) NFS4ERR_STALE_CLIENTID
> PUTFH NFS4_OK OPEN NFS4ERR_STALE_CLIENTID
> 172.20.12.246 -> 172.20.12.26 TCP D=1017 S=2049 Ack=1854686637
> Seq=2641046014 Len=0 Win=49640
> 172.20.12.246 -> 172.20.12.26 NFS R 4 (open ) NFS4ERR_STALE_CLIENTID
> PUTFH NFS4_OK OPEN NFS4ERR_STALE_CLIENTID
> 172.20.12.246 -> 172.20.12.26 NFS R 4 (open ) NFS4ERR_STALE_CLIENTID
> PUTFH NFS4_OK OPEN NFS4ERR_STALE_CLIENTID
> 172.20.12.246 -> 172.20.12.26 NFS R 4 (open ) NFS4ERR_STALE_CLIENTID
> PUTFH NFS4_OK OPEN NFS4ERR_STALE_CLIENTID
> 172.20.12.246 -> 172.20.12.26 NFS R 4 (open ) NFS4ERR_STALE_CLIENTID
> PUTFH NFS4_OK OPEN NFS4ERR_STALE_CLIENTID
> [lots of these]
>
> 172.20.12.246 -> 172.20.12.26 NFS R 4 (setclientid ) NFS4_OK SETCLIENTID
> NFS4_OK CL=5684ae805ee CFV=07B10568
> 172.20.12.26 -> 172.20.12.246 NFS C 4 (sclntid_conf) SETCLIENTID_CONFIRM
> CL=5684ae805ee CFV=07B10568 PUTROOTFH GETATTR 400 0
> 172.20.12.246 -> 172.20.12.26 TCP D=32772 S=54238 Ack=524071072
> Seq=1242317416 Len=0 Win=49640
> 172.20.12.246 -> 172.20.12.26 NFS R 4 (sclntid_conf) NFS4_OK
> SETCLIENTID_CONFIRM NFS4_OK PUTROOTFH NFS4_OK GETATTR NFS4_OK
> 172.20.12.246 -> 172.20.12.26 NFS C CB_NULL
>
>
> 172.20.12.26 -> 172.20.12.246 NFS C 4 (lookup ) PUTFH FH=956C SAVEFH
> LOOKUP ilcsugamo.com GETFH GETATTR 10011a b0a23a RESTOREFH NVERIFY
> GETATTR 1...
> 172.20.12.246 -> 172.20.12.26 NFS R 4 (lookup ) NFS4ERR_SAME PUTFH
> NFS4_OK SAVEFH NFS4_OK LOOKUP NFS4_OK GETFH NFS4_OK FH=9454 GETATTR
> NFS4_OK RESTOREFH NFS4_...
> 172.20.12.26 -> 172.20.12.246 NF

[nfs-discuss] NFS4ERR_STALE_CLIENTID and NFS4ERR_SAME with Solaris 10u8

2009-11-27 Thread Jorgen Lundman

Hello list,

NFS server (x4540):
* SunOS x4500-14.unix 5.10 Generic_141445-09 i86pc i386 i86pc

and about 20 clients hanging off it, today I was logged in on cgi01 to do my 
tests:

* SunOS cgi01.unix 5.10 Generic_127128-11 i86pc i386 i86pc


This system is generally running well, but every now-and-then we get periods 
where it slows to a crawl, even listing "df -h" output will pause before the 
x4500-14 output.  It seems to be connected to just x4500-14, which is currently 
the only Solaris 10 u8 server is production (the others are running 5.11 
snv_117 
and appear not to be involved).

I can umount and mount the file-system and everything goes back to normal.

The load on the x4500-14 itself is low:

load averages:  0.35,  0.33,  0.36 15:05:34
52 processes:  50 sleeping, 1 stopped, 1 on cpu
CPU states: % idle, % user, % kernel, % iowait, % swap
Memory: 32G real, 2772M free, 6616M swap in use, 700M swap free

PID USERNAME LWP PRI NICE  SIZE   RES STATETIMECPU COMMAND
832 daemon96  60  -20 2968K 2160K sleep   92.8H  1.96% nfsd

The nfsv4 domains match, the groups and passwd files are identical. All mounted 
with:

"remote/read/write/setuid/devices/vers=4/hard/intr/quota/xattr/dev=494001a"


On the cgi server, when I snoop, I see traffic like (lines with NFS4ERR):

172.20.12.26 -> 172.20.12.246 NFS C 4 (reopen  ) PUTFH FH=8873 OPEN OT=NC 
SQ=12626 CT=P DT=N AC=R DN=N OO=0B17 GETFH GETATTR 10011a b0a23a
172.20.12.246 -> 172.20.12.26 NFS R 4 (reopen  ) NFS4ERR_NO_GRACE PUTFH 
NFS4_OK OPEN NFS4ERR_NO_GRACE


172.20.12.246 -> 172.20.12.26 NFS R 4 (open) NFS4ERR_STALE_CLIENTID 
PUTFH NFS4_OK OPEN NFS4ERR_STALE_CLIENTID
172.20.12.246 -> 172.20.12.26 TCP D=1017 S=2049 Ack=1854686637 Seq=2641046014 
Len=0 Win=49640
172.20.12.246 -> 172.20.12.26 NFS R 4 (open) NFS4ERR_STALE_CLIENTID 
PUTFH NFS4_OK OPEN NFS4ERR_STALE_CLIENTID
172.20.12.246 -> 172.20.12.26 NFS R 4 (open) NFS4ERR_STALE_CLIENTID 
PUTFH NFS4_OK OPEN NFS4ERR_STALE_CLIENTID
172.20.12.246 -> 172.20.12.26 NFS R 4 (open) NFS4ERR_STALE_CLIENTID 
PUTFH NFS4_OK OPEN NFS4ERR_STALE_CLIENTID
172.20.12.246 -> 172.20.12.26 NFS R 4 (open) NFS4ERR_STALE_CLIENTID 
PUTFH NFS4_OK OPEN NFS4ERR_STALE_CLIENTID
[lots of these]

172.20.12.246 -> 172.20.12.26 NFS R 4 (setclientid ) NFS4_OK SETCLIENTID 
NFS4_OK 
CL=5684ae805ee CFV=07B10568
172.20.12.26 -> 172.20.12.246 NFS C 4 (sclntid_conf) SETCLIENTID_CONFIRM 
CL=5684ae805ee CFV=07B10568 PUTROOTFH GETATTR 400 0
172.20.12.246 -> 172.20.12.26 TCP D=32772 S=54238 Ack=524071072 Seq=1242317416 
Len=0 Win=49640
172.20.12.246 -> 172.20.12.26 NFS R 4 (sclntid_conf) NFS4_OK 
SETCLIENTID_CONFIRM 
NFS4_OK PUTROOTFH NFS4_OK GETATTR NFS4_OK
172.20.12.246 -> 172.20.12.26 NFS C CB_NULL


172.20.12.26 -> 172.20.12.246 NFS C 4 (lookup  ) PUTFH FH=956C SAVEFH 
LOOKUP 
ilcsugamo.com GETFH GETATTR 10011a b0a23a RESTOREFH NVERIFY GETATTR 1...
172.20.12.246 -> 172.20.12.26 NFS R 4 (lookup  ) NFS4ERR_SAME PUTFH NFS4_OK 
SAVEFH NFS4_OK LOOKUP NFS4_OK GETFH NFS4_OK FH=9454 GETATTR NFS4_OK RESTOREFH 
NFS4_...
172.20.12.26 -> 172.20.12.246 NFS C 4 (lookup  ) PUTFH FH=9454 SAVEFH 
LOOKUP 
ja GETFH GETATTR 10011a b0a23a RESTOREFH NVERIFY GETATTR 10011a b0a23...
172.20.12.246 -> 172.20.12.26 NFS R 4 (lookup  ) NFS4ERR_SAME PUTFH NFS4_OK 
SAVEFH NFS4_OK LOOKUP NFS4_OK GETFH NFS4_OK FH=8CAD GETATTR NFS4_OK RESTOREFH 
NFS4_...
[lots of these too]

If I leave it alone for about 5 minutes, it does seem to sort itself out. I 
suspect it is more that all 20 clients get this happening at exactly the same 
time, so it takes near 15 minutes for it to recover (on its own).

This would indicate that it is the x4500-14 server that is losing state. But 
the 
processes for NFS (nfsd, stat, lockd and mapid) have all the same date (days in 
the past) and nfsd/lockd are using 96/3 threads respectively.

Which one of the snoop output MIGHT be related to the trouble? I'm unsure which 
ERRORS are part of standard NFS operations, and which are not.

Thanks for any replies!

Lund



-- 
Jorgen Lundman   | 
Unix Administrator   | +81 (0)3 -5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo| +81 (0)90-5578-8500  (cell)
Japan| +81 (0)3 -3375-1767  (home)