[zfs-discuss] (fwd) Re: ZFS NFS service hanging on Sunday morning problem
Dear All,

I have been advised to enquire here on zfs-discuss about the ZFS problem described below, following discussion on the Usenet newsgroup comp.unix.solaris. The full thread should be available here:
https://groups.google.com/forum/#!topic/comp.unix.solaris/uEQzz1t-G1s

Many thanks,
Tom Crane

-- forwarded message --
cindy.swearin...@oracle.com wrote:
: On Tuesday, May 29, 2012 5:39:11 AM UTC-6, (unknown) wrote:
: Dear All,
: Can anyone give any tips on diagnosing the following recurring problem?
:
: I have a Solaris box (server5, SunOS server5 5.10 Generic_147441-15
: i86pc i386 i86pc) whose NFS-exported ZFS filesystems fail every so
: often, always in the early hours of Sunday morning. I am barely
: familiar with Solaris, but here is what I have managed to discern when
: the problem occurs:
:
: Jobs on other machines which access server5's shares (via automounter)
: hang, and attempts to manually remote-mount shares just time out.
:
: Remotely, 'showmount -e server5' shows all the exported FSs are available.
:
: On server5, the following services are running:
:
: root@server5:/var/adm# svcs | grep nfs
: online May_25 svc:/network/nfs/status:default
: online May_25 svc:/network/nfs/nlockmgr:default
: online May_25 svc:/network/nfs/cbd:default
: online May_25 svc:/network/nfs/mapid:default
: online May_25 svc:/network/nfs/rquota:default
: online May_25 svc:/network/nfs/client:default
: online May_25 svc:/network/nfs/server:default
:
: On server5, I can list and read files on the affected FSs without
: problem, but any attempt to write to the FS (e.g. copy a file to, or
: rm a file on, the FS) just hangs the cp/rm process.
:
: On server5, 'zfs get sharenfs pptank/local_linux' displays the
: expected list of hosts/IPs with remote ro/rw access.
:
: Here is the output from some other hopefully relevant commands:
:
: root@server5:/# zpool status
:   pool: pptank
:  state: ONLINE
: status: The pool is formatted using an older on-disk format.
:         The pool can still be used, but some features are unavailable.
: action: Upgrade the pool using 'zpool upgrade'. Once this is done, the
:         pool will no longer be accessible on older software versions.
:   scan: none requested
: config:
:
:         NAME        STATE   READ WRITE CKSUM
:         pptank      ONLINE     0     0     0
:           raidz1-0  ONLINE     0     0     0
:             c3t0d0  ONLINE     0     0     0
:             c3t1d0  ONLINE     0     0     0
:             c3t2d0  ONLINE     0     0     0
:             c3t3d0  ONLINE     0     0     0
:             c3t4d0  ONLINE     0     0     0
:             c3t5d0  ONLINE     0     0     0
:             c3t6d0  ONLINE     0     0     0
:
: errors: No known data errors
:
: root@server5:/# zpool list
: NAME     SIZE  ALLOC   FREE  CAP  HEALTH  ALTROOT
: pptank  12.6T   384G  12.3T   2%  ONLINE  -
:
: root@server5:/# zpool history
: History for 'pptank':
: just hangs here
:
: root@server5:/# zpool iostat 5
:            capacity      operations     bandwidth
: pool    alloc   free    read  write    read  write
: ------  -----  -----  -----  -----  -----  -----
: pptank   384G  12.3T     92    115  3.08M  1.22M
: pptank   384G  12.3T  1.11K    629  35.5M  3.03M
: pptank   384G  12.3T    886    889  27.1M  3.68M
: pptank   384G  12.3T    837    677  24.9M  2.82M
: pptank   384G  12.3T  1.19K    757  37.4M  3.69M
: pptank   384G  12.3T  1.02K    759  29.6M  3.90M
: pptank   384G  12.3T    952    707  32.5M  3.09M
: pptank   384G  12.3T  1.02K    831  34.5M  3.72M
: pptank   384G  12.3T    707    503  23.5M  1.98M
: pptank   384G  12.3T    626    707  20.8M  3.58M
: pptank   384G  12.3T    816    838  26.1M  4.26M
: pptank   384G  12.3T    942    800  30.1M  3.48M
: pptank   384G  12.3T    677    675  21.7M  2.91M
: pptank   384G  12.3T    590    725  19.2M  3.06M
:
: top shows the following runnable processes. Nothing excessive here, AFAICT?
:
: last pid: 25282; load avg: 1.98, 1.95, 1.86; up 1+09:02:05  07:46:29
: 72 processes: 67 sleeping, 1 running, 1 stopped, 3 on cpu
: CPU states: 81.5% idle, 0.1% user, 18.3% kernel, 0.0% iowait, 0.0% swap
: Memory: 2048M phys mem, 32M free mem, 16G total swap, 16G free swap
:
:   PID USERNAME LWP PRI NICE  SIZE    RES   STATE  TIME   CPU   COMMAND
:   748 root      18  60  -20  103M  9752K  cpu/1  78:44  6.62%  nfsd
: 24854 root       1  54    0 1480K   792K  cpu/1   0:42  0.69%  cp
: 25281 root       1  59    0 3584K  2152K  cpu/0   0:00  0.02%  top
:
: The cp job above is the one mentioned earlier, attempting to copy a
: file to an affected FS; I've noticed it is apparently not completely
: hung.
:
: The only thing that appears specific to Sunday morning is a pair of
: weekly cron jobs (quoted in full in a follow-up below).
Re: [zfs-discuss] Occasional storm of xcalls on segkmem_zio_free
Saso Kiselkov writes:

On 06/12/2012 05:37 PM, Roch Bourbonnais wrote:
> So the xcalls are a necessary part of memory reclaiming: when one needs
> to tear down the TLB entry mapping the physical memory (which can from
> here on be repurposed), the xcalls are just part of this. They should
> not cause trouble, but they do: they consume a CPU for some time, and
> that in turn can cause an infrequent latency bubble on the network. One
> root cause of these latency bubbles is that network threads are bound
> by default, and if the xcall storm ends up on the CPU that the network
> thread is bound to, the thread will wait for the storm to pass.

I understand, but the xcall storm only eats up a single core out of a total of 32, plus it's not a single specific one, it tends to change, so what are the odds of hitting the same core as the one on which the mac thread is running?

> That's easy :-) : 1/32 each time it needs to run. So depending on how
> often it runs (which depends on how much churn there is in the ARC) and
> how often you see the latency bubbles, that may or may not be it.
>
> What is zio_taskq_batch_pct on your system? That is another stormy bit
> of code which causes bubbles. Setting it down to 50 (versus an older
> default of 100) should help, if it's not done already.
> -r
>
> So try unbinding the mac threads; it may help you here.

How do I do that? All I can find on interrupt fencing and the like is to simply set certain processors to no-intr, which moves all of the interrupts, and it doesn't prevent the xcall storm choosing to affect these CPUs either...

--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
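To put a number on the "1/32 each time" estimate above: if each reclaim-driven xcall storm lands on a uniformly random core, the chance that at least one of n storms hits the CPU the mac thread is bound to grows quickly with n. A back-of-the-envelope sketch (Python, assuming independent uniform placement of each storm, which is a simplification of what the kernel actually does):

```python
# Probability that at least one of n xcall storms lands on the single
# CPU (out of ncpu) where the network (mac) thread is bound, assuming
# each storm picks a core independently and uniformly at random.
def p_hit(n_storms, ncpu=32):
    return 1.0 - (1.0 - 1.0 / ncpu) ** n_storms

if __name__ == "__main__":
    for n in (1, 10, 50, 100):
        print(f"{n:4d} storms -> P(bound CPU hit) = {p_hit(n):.3f}")
```

So even though any single storm only has a 1-in-32 chance of colliding with the bound thread, over an hour with frequent ARC churn the collision becomes close to certain, which matches seeing occasional rather than constant latency bubbles.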
Re: [zfs-discuss] ZFS NFS service hanging on Sunday morning problem
> Shot in the dark here: What are you using for the sharenfs value on the
> ZFS filesystem? Something like rw=.mydomain.lan?

They are IP blocks or hosts specified as FQDNs, e.g.

pptank/home/tcrane sharenfs rw=@192.168.101/24,rw=serverX.xx.rhul.ac.uk:serverY.xx.rhul.ac.uk

> I've had issues where a ZFS server loses connectivity to the primary DNS
> server, and as a result the reverse lookups used to validate the identity
> of client systems fail and the connections hang.

It was using our slave DNS, but there have been no recent problems with it. I've switched it to the primary DNS.

> Any chance there's a planned reboot of the DNS server Sunday morning?
> That sounds like the kind of preventative maintenance that might be
> happening in that time window.

No. The only things tied to Sunday morning are these two (Solaris factory installed?) cronjobs:

root@server5:/# grep nfsfind /var/spool/cron/crontabs/root
15 3 * * 0 /usr/lib/fs/nfs/nfsfind
root@server5:/# grep 13 /var/spool/cron/crontabs/lp
# At 03:13am on Sundays:
13 3 * * 0 cd /var/lp/logs; if [ -f requests ]; then if [ -f requests.1 ]; then /bin/mv requests.1 requests.2; fi; /usr/bin/cp requests requests.1; fi

The lp one does not access the main ZFS pool, but the nfsfind one does. However, AFAICT it has usually finished before the problem manifests itself.

Cheers,
Tom.

> Cheers,
> Erik

On 13 June 2012, at 12:47, tpc...@mklab.ph.rhul.ac.uk wrote:
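For reference, the two crontab schedules quoted above both fire early on Sunday mornings, right in the window where the hang appears. Crontab field order is minute, hour, day-of-month, month, day-of-week, with day-of-week 0 meaning Sunday. A tiny sketch decoding the two fixed schedules (Python, purely illustrative; it only handles fully numeric minute/hour/dow fields like these):

```python
# Decode the fixed numeric fields of the two Solaris crontab entries
# quoted above. Field order: minute hour day-of-month month day-of-week,
# where day-of-week 0 is Sunday.
DOW = ["Sunday", "Monday", "Tuesday", "Wednesday",
       "Thursday", "Friday", "Saturday"]

def describe(schedule):
    minute, hour, dom, month, dow = schedule.split()
    return f"{DOW[int(dow)]} at {int(hour):02d}:{int(minute):02d}"

if __name__ == "__main__":
    print(describe("15 3 * * 0"))  # the nfsfind job
    print(describe("13 3 * * 0"))  # the lp log rotation
```

Both resolve to Sunday at roughly 03:15, so a long-running nfsfind walking the whole pool is a plausible suspect even if it normally finishes before the hang is noticed.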
[zfs-discuss] NFS asynchronous writes being written to ZIL
I noticed recently that the SSDs hosting the ZIL for my pool had a large number in the SMART attribute for total LBAs written (with some calculation, it seems to be the total amount of data written to the pool so far), did some testing, and found that the ZIL is being used quite heavily (matching the writing speed) on writes that should be asynchronous.

I did a capture with wireshark during a simple copy of a large file (8GB), and both the write packets and the write responses showed the UNSTABLE I would expect in asynchronous writes. The copy didn't finish before the server started writing heavily to the ZIL, so I wouldn't expect it to tell NFS to commit then.

Here are some relevant pieces of info (with hostnames, etc. removed):

client: ubuntu 11.10, /etc/fstab entry:
server:/mainpool/storage /mnt/myelin nfs bg,retry=5,soft,proto=tcp,intr,nfsvers=3,noatime,nodiratime,async 0 0

server: OpenIndiana oi_151a4

$ zfs get sync mainpool
NAME      PROPERTY  VALUE     SOURCE
mainpool  sync      standard  default
$ zfs get sync mainpool/storage
NAME              PROPERTY  VALUE     SOURCE
mainpool/storage  sync      standard  default
$ zfs get sharenfs mainpool/storage
NAME              PROPERTY  VALUE           SOURCE
mainpool/storage  sharenfs  rw=@xxx.xxx.37  local
$ zpool get version mainpool
NAME      PROPERTY  VALUE  SOURCE
mainpool  version   28     default

The pool consists of 24 SATA disks arranged as 2 raid-z2 groups of 12, and originally had a mirrored log across 10GB slices on two Intel 320 80GB SSDs, but I have since rearranged the logs as non-mirrored (since a single SSD doesn't quite keep up with gigabit network throughput, and some testing convinced me that a pool survives a failing log device gracefully, as long as there isn't a simultaneous crash). I would like to switch them back to a mirrored configuration, but without impacting the asynchronous throughput, and obviously it would be nice to reduce the write load to them so they live longer.
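The "some calculation" mentioned above is simple unit conversion: on most drives the SMART total-LBAs-written raw value counts 512-byte sectors, though the unit is drive-specific and worth checking against the vendor datasheet (some SSDs count in much larger units). A sketch of the conversion, with a made-up raw value rather than one from the actual drives:

```python
# Convert a SMART "Total_LBAs_Written" raw value to GiB, assuming the
# common 512-byte unit per LBA. This unit is an assumption: some SSDs
# report in other units, so check the drive's datasheet.
def lbas_to_gib(total_lbas, lba_bytes=512):
    return total_lbas * lba_bytes / 2**30

if __name__ == "__main__":
    raw = 1_250_000_000  # hypothetical raw attribute value
    print(f"{raw} LBAs ~= {lbas_to_gib(raw):.1f} GiB written")
```

Comparing the result against `zpool iostat`-level write totals is a reasonable sanity check that the ZIL really is absorbing the full write stream.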
As far as I can tell, all nfs writes are being written to the ZIL when many should be cached in memory and bypass the ZIL.

Any help appreciated,
Tim
Re: [zfs-discuss] NFS asynchronous writes being written to ZIL
On Wed, Jun 13, 2012 at 05:56:56PM -0500, Timothy Coalson wrote:
> client: ubuntu 11.10, /etc/fstab entry:
> server:/mainpool/storage /mnt/myelin nfs bg,retry=5,soft,proto=tcp,intr,nfsvers=3,noatime,nodiratime,async 0 0

nfsvers=3

> NAME              PROPERTY  VALUE     SOURCE
> mainpool/storage  sync      standard  default

sync=standard

This is expected behaviour for this combination. NFS 3 semantics are for persistent writes at the server regardless - and mostly also for NFS 4. The async client mount option relates to when the writes get shipped to the server (immediately, or delayed in dirty pages), rather than to how the server should handle those writes once they arrive.

You could set sync=disabled if you're happy with the consequences, or even just as a temporary test to confirm the impact. It sounds like you would be, since that's what you're trying to achieve. There is a difference: async on the client means data is lost on a client reboot; async on the server means data may be lost on a server reboot (and the client/application confused by inconsistencies as a result).

Separate datasets (and mounts) for data with different persistence requirements can help.

--
Dan.
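The behaviour Dan describes maps onto the stable_how field of the NFSv3 WRITE call (RFC 1813, section 3.3.7): UNSTABLE writes may be buffered by the server until a later COMMIT, while DATA_SYNC and FILE_SYNC writes must reach stable storage before the server replies. Even UNSTABLE data becomes synchronous at COMMIT time, which is one way a ZFS server with sync=standard still ends up driving the ZIL. A small illustrative sketch of the enum and the persistence rule (not taken from any NFS implementation):

```python
from enum import IntEnum

# stable_how values from the NFSv3 WRITE arguments (RFC 1813, sec. 3.3.7).
class StableHow(IntEnum):
    UNSTABLE = 0   # server may buffer; client must send COMMIT later
    DATA_SYNC = 1  # data must be persistent before the server replies
    FILE_SYNC = 2  # data and metadata persistent before the reply

def must_sync_before_reply(stable_how, committing=False):
    """True when the server has to hit stable storage before replying.

    UNSTABLE writes defer persistence, but a COMMIT forces it, so the
    ZIL still sees the data eventually under sync=standard.
    """
    return committing or stable_how != StableHow.UNSTABLE

if __name__ == "__main__":
    print(must_sync_before_reply(StableHow.UNSTABLE))        # plain WRITE
    print(must_sync_before_reply(StableHow.UNSTABLE, True))  # at COMMIT
```

This is why a capture can legitimately show UNSTABLE on the wire while the pool behaves synchronously: the synchronous work happens on the COMMIT, not the WRITE.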
Re: [zfs-discuss] Recovery of RAIDZ with broken label(s)
On Tue, Jun 12, 2012 at 03:46:00PM +1000, Scott Aitken wrote:
> Hi all,

Hi Scott. :-)

> I have a 5 drive RAIDZ volume with data that I'd like to recover.

Yeah, still..

> I tried using Jeff Bonwick's labelfix binary to create new labels but
> it carps because the txg is not zero.

Can you provide details of invocation and error response? For the benefit of others, this was at my suggestion; I've been discussing this problem with Scott for.. some time.

> I can also make the solaris machine available via SSH if some wonderful
> person wants to poke around.

Will take a poke, as discussed. May well raise more discussion here as a result.

--
Dan.
Re: [zfs-discuss] NFS asynchronous writes being written to ZIL
Interesting... from what I had read about NFSv3 asynchronous writes, especially bits about "does not require the server to commit to stable storage", I was led to expect different behavior. The performance impact on large writes (which we do a lot of) wasn't severe, so sync=disabled is probably not worth the risk. The SSDs should also be able to take a fair amount of abuse, so I can live with the behavior as-is.

Here's one example of documentation that led me to expect something else: http://nfs.sourceforge.net/ , search for "unstable". It did indeed seem like the nfs client was delaying writes, then doing synchronous nfs calls, which is why I looked into the packets with wireshark (and found them to be advertised as UNSTABLE, but with the pool acting synchronous).

Thanks,
Tim

On Wed, Jun 13, 2012 at 6:51 PM, Daniel Carosone d...@geek.com.au wrote: