> 
> Shot in the dark here:
> 
> What are you using for the sharenfs value on the ZFS filesystem? Something 
> like rw=.mydomain.lan ?

They are IP blocks or hosts specified as FQDNs, e.g.,

pptank/home/tcrane sharenfs rw=@192.168.101/24,rw=serverX.xx.rhul.ac.uk:serverY.xx.rhul.ac.uk
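
For reference, here is a minimal sketch of how that access list can be
inspected and (re)set with the zfs command; the dataset name and option string
are just the ones quoted above, so adjust as needed:

# Show the current NFS share options on the dataset
zfs get sharenfs pptank/home/tcrane

# Re-apply an equivalent access list ('@' marks a network, ':' separates hosts)
zfs set \
  sharenfs='rw=@192.168.101/24,rw=serverX.xx.rhul.ac.uk:serverY.xx.rhul.ac.uk' \
  pptank/home/tcrane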

> 

> I've had issues where a ZFS server loses connectivity to the primary DNS 
> server and as a result the reverse lookups used to validate the identity 

It was using our slave DNS server, but there have been no recent problems with 
that.  I have now switched it to the primary DNS server.
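
In case it is useful, a quick spot-check (run on server5 around the time of a
hang) of whether the reverse lookups that mountd relies on are actually
resolving; the client address below is only a placeholder:

# Placeholder client IP; substitute a machine whose mounts hang
getent hosts 192.168.101.50         # lookup via the configured name services
nslookup 192.168.101.50             # reverse lookup directly against DNS
nslookup serverX.xx.rhul.ac.uk      # forward lookup of a host in the access list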

> of client systems fails and the connections hang. Any chance there's a 
> planned reboot of the DNS server Sunday morning? That sounds like the kind of 

No.  The only things tied to Sunday morning are these two (Solaris 
factory-installed?) cron jobs:

root@server5:/# grep nfsfind /var/spool/cron/crontabs/root
15 3 * * 0 /usr/lib/fs/nfs/nfsfind
root@server5:/# grep 13 /var/spool/cron/crontabs/lp
#  At 03:13am on Sundays:
13 3 * * 0 cd /var/lp/logs; if [ -f requests ]; then if [ -f requests.1 ]; then /bin/mv requests.1 requests.2; fi; /usr/bin/cp requests requests.1; >requests; fi

The lp job does not access the main ZFS pool, but the nfsfind one does.  
However, AFAICT it has usually finished before the problem manifests itself.
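
To pin the timing down, the nfsfind entry could be wrapped so its start and
finish times land in a log that can be compared against when the hang appears;
a rough sketch (the log path is arbitrary):

15 3 * * 0 (echo "nfsfind start `date`"; /usr/lib/fs/nfs/nfsfind; echo "nfsfind done `date`") >>/var/tmp/nfsfind.log 2>&1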

> preventative maintenance that might be happening in that time window.


Cheers
Tom.

> 
> Cheers,
> 
> Erik
> 
> On 13 June 2012, at 12:47, tpc...@mklab.ph.rhul.ac.uk wrote:
> 
> > Dear All,
> >     I have been advised to enquire here on zfs-discuss about the
> > ZFS problem described below, following discussion on the Usenet
> > newsgroup comp.unix.solaris.  The full thread should be available here:
> > https://groups.google.com/forum/#!topic/comp.unix.solaris/uEQzz1t-G1s
> > 
> > Many thanks
> > Tom Crane
> > 
> > 
> > 
> > -- forwarded message
> > 
> > cindy.swearin...@oracle.com wrote:
> > : On Tuesday, May 29, 2012 5:39:11 AM UTC-6, (unknown) wrote:
> > : > Dear All,
> > : >    Can anyone give any tips on diagnosing the following recurring problem?
> > : > 
> > : > I have a Solaris box (server5, SunOS server5 5.10 Generic_147441-15
> > : > i86pc i386 i86pc) whose NFS service for its ZFS-exported filesystems
> > : > fails every so often, always in the early hours of Sunday morning.  I am
> > : > barely familiar with Solaris, but here is what I have managed to discern
> > : > when the problem occurs:
> > : > 
> > : > Jobs on other machines which access server5's shares (via the automounter)
> > : > hang, and attempts to manually remote-mount the shares just time out.
> > : > 
> > : > Remotely, showmount -e server5 shows all the exported FS are available.
> > : > 
> > : > On server5, the following services are running;
> > : > 
> > : > root@server5:/var/adm# svcs | grep nfs                 
> > : > online         May_25   svc:/network/nfs/status:default
> > : > online         May_25   svc:/network/nfs/nlockmgr:default
> > : > online         May_25   svc:/network/nfs/cbd:default
> > : > online         May_25   svc:/network/nfs/mapid:default
> > : > online         May_25   svc:/network/nfs/rquota:default
> > : > online         May_25   svc:/network/nfs/client:default
> > : > online         May_25   svc:/network/nfs/server:default
> > : > 
> > : > On server5, I can list and read files on the affected FSs w/o problem,
> > : > but any attempt to write to the FS (e.g. copying a file to it or
> > : > removing a file from it) just hangs the cp/rm process.
> > : > 
> > : > On server5, the command 'zfs get sharenfs pptank/local_linux'
> > : > displays the expected list of hosts/IPs with remote ro & rw access.
> > : > 
> > : > Here is the O/P from some other hopefully relevant commands;
> > : > 
> > : > root@server5:/# zpool status
> > : >   pool: pptank
> > : >  state: ONLINE
> > : > status: The pool is formatted using an older on-disk format.  The pool can
> > : >         still be used, but some features are unavailable.
> > : > action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
> > : >         pool will no longer be accessible on older software versions.
> > : >  scan: none requested
> > : > config:
> > : > 
> > : >         NAME        STATE     READ WRITE CKSUM
> > : >         pptank      ONLINE       0     0     0
> > : >           raidz1-0  ONLINE       0     0     0
> > : >             c3t0d0  ONLINE       0     0     0
> > : >             c3t1d0  ONLINE       0     0     0
> > : >             c3t2d0  ONLINE       0     0     0
> > : >             c3t3d0  ONLINE       0     0     0
> > : >             c3t4d0  ONLINE       0     0     0
> > : >             c3t5d0  ONLINE       0     0     0
> > : >             c3t6d0  ONLINE       0     0     0
> > : > 
> > : > errors: No known data errors
> > : > 
> > : > root@server5:/# zpool list
> > : > NAME     SIZE  ALLOC   FREE    CAP  HEALTH  ALTROOT
> > : > pptank  12.6T   384G  12.3T     2%  ONLINE  -
> > : > 
> > : > root@server5:/# zpool history
> > : > History for 'pptank':
> > : > <just hangs here>
> > : > 
> > : > root@server5:/# zpool iostat 5
> > : >                capacity     operations    bandwidth
> > : > pool        alloc   free   read  write   read  write
> > : > ----------  -----  -----  -----  -----  -----  -----
> > : > pptank       384G  12.3T     92    115  3.08M  1.22M
> > : > pptank       384G  12.3T  1.11K    629  35.5M  3.03M
> > : > pptank       384G  12.3T    886    889  27.1M  3.68M
> > : > pptank       384G  12.3T    837    677  24.9M  2.82M
> > : > pptank       384G  12.3T  1.19K    757  37.4M  3.69M
> > : > pptank       384G  12.3T  1.02K    759  29.6M  3.90M
> > : > pptank       384G  12.3T    952    707  32.5M  3.09M
> > : > pptank       384G  12.3T  1.02K    831  34.5M  3.72M
> > : > pptank       384G  12.3T    707    503  23.5M  1.98M
> > : > pptank       384G  12.3T    626    707  20.8M  3.58M
> > : > pptank       384G  12.3T    816    838  26.1M  4.26M
> > : > pptank       384G  12.3T    942    800  30.1M  3.48M
> > : > pptank       384G  12.3T    677    675  21.7M  2.91M
> > : > pptank       384G  12.3T    590    725  19.2M  3.06M
> > : > 
> > : > 
> > : > top shows the following runnable processes.  Nothing excessive here AFAICT?
> > : > 
> > : > last pid: 25282;  load avg:  1.98,  1.95,  1.86;       up 1+09:02:05  07:46:29
> > : > 72 processes: 67 sleeping, 1 running, 1 stopped, 3 on cpu
> > : > CPU states: 81.5% idle,  0.1% user, 18.3% kernel,  0.0% iowait,  0.0% swap
> > : > Memory: 2048M phys mem, 32M free mem, 16G total swap, 16G free swap
> > : > 
> > : >    PID USERNAME LWP PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
> > : >    748 root      18  60  -20  103M 9752K cpu/1   78:44  6.62% nfsd
> > : >  24854 root       1  54    0 1480K  792K cpu/1    0:42  0.69% cp
> > : >  25281 root       1  59    0 3584K 2152K cpu/0    0:00  0.02% top
> > : > 
> > : > The cp job above is the one mentioned earlier, attempting to copy a file
> > : > to an affected FS; I've noticed it is apparently not completely hung.
> > : > 
> > : > The only thing that appears specific to Sunday morning is a cron job to
> > : > remove old .nfs* files:
> > : > 
> > : > root@server5:/# crontab -l | grep nfsfind
> > : > 15 3 * * 0 /usr/lib/fs/nfs/nfsfind
> > : > 
> > : > Any suggestions on how to proceed?
> > : > 
> > : > Many thanks
> > : > Tom Crane
> > : > 
> > : > Ps. The email address in the header is just a spam-trap.
> > : > -- 
> > : > Tom Crane, IT support, RHUL Particle Physics.,
> > : > Dept. Physics, Royal Holloway, University of London, Egham Hill,
> > : > Egham, Surrey, TW20 0EX, England. 
> > : > Email:  T.Crane at rhul dot ac dot uk
> > 
> > : Hi Tom,
> > 
> > Hi Cindy,
> >     Thanks for the follow-up.
> > 
> > : I think SunOS server5 5.10 Generic_147441-15 is the Solaris 10 8/11
> > : release. Is this correct?
> > 
> > I think so,...
> > root@server5:/# cat /etc/release
> >                       Solaris 10 10/08 s10x_u6wos_07b X86
> >           Copyright 2008 Sun Microsystems, Inc.  All Rights Reserved.
> >                        Use is subject to license terms.
> >                            Assembled 27 October 2008
> > 
> > 
> > : We looked at your truss output briefly, and it looks like it is hanging
> > : while trying to allocate memory.  At least, that's what the "br ...."
> > : statements at the end suggest.
> > 
> > : I will see if I can find out what diagnostic info would be help in
> > : this case.
> > 
> > Thanks. That would be much appreciated.
> > 
> > : You might get a faster response on zfs-discuss as John suggested.
> > 
> > I will CC to zfs-discuss.
> > 
> > Best regards
> > Tom.
> > 
> > : Thanks,
> > 
> > : Cindy
> > 
> > Ps. The email address in the header is just a spam-trap.
> > -- 
> > Tom Crane, Dept. Physics, Royal Holloway, University of London, Egham Hill,
> > Egham, Surrey, TW20 0EX, England. 
> > Email:  T.Crane at rhul dot ac dot uk
> > -- end of forwarded message --
> > _______________________________________________
> > zfs-discuss mailing list
> > zfs-discuss@opensolaris.org
> > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
> 
> 


-- 
Tom Crane, Dept. Physics, Royal Holloway, University of London, Egham Hill,
Egham, Surrey, TW20 0EX, England. 
Email:  t.cr...@rhul.ac.uk
Fax:    +44 (0) 1784 472794
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
