[zfs-discuss] (fwd) Re: ZFS NFS service hanging on Sunday morning problem

2012-06-13 Thread TPCzfs
Dear All,
I have been advised to enquire here on zfs-discuss with the
ZFS problem described below, following discussion on Usenet NG 
comp.unix.solaris.  The full thread should be available here 
https://groups.google.com/forum/#!topic/comp.unix.solaris/uEQzz1t-G1s

Many thanks
Tom Crane



-- forwarded message

cindy.swearin...@oracle.com wrote:
: On Tuesday, May 29, 2012 5:39:11 AM UTC-6, (unknown) wrote:
:  Dear All,
: Can anyone give any tips on diagnosing the following recurring problem?
:  
:  I have a Solaris box (server5, SunOS server5 5.10 Generic_147441-15
:  i86pc i386 i86pc) whose NFS service for its ZFS filesystems fails every
:  so often, always in the early hours of Sunday morning. I am barely
:  familiar with Solaris, but here is what I have managed to discern when
:  the problem occurs:
:  
:  Jobs on other machines which access server5's shares (via automounter)
:  hang and attempts to manually remote-mount shares just timeout.
:  
:  Remotely, showmount -e server5 shows all the exported FS are available.
:  
:  On server5, the following services are running;
:  
:  root@server5:/var/adm# svcs | grep nfs 
:  online May_25   svc:/network/nfs/status:default
:  online May_25   svc:/network/nfs/nlockmgr:default
:  online May_25   svc:/network/nfs/cbd:default
:  online May_25   svc:/network/nfs/mapid:default
:  online May_25   svc:/network/nfs/rquota:default
:  online May_25   svc:/network/nfs/client:default
:  online May_25   svc:/network/nfs/server:default
:  
:  On server5, I can list and read files on the affected FSs w/o problem
:  but any attempt to write to the FS (eg. copy a file to or rm a file
:  on the FS) just hangs the cp/rm process.
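
When a cp/rm wedges like this, one way to see where it is blocked in the kernel is to dump its thread stacks with mdb (a sketch only; the pid below is the cp seen in the top output further down, substitute the real one):

  # kernel stack of the hung process (pid 24854 here is illustrative)
  echo "0t24854::pid2proc | ::walk thread | ::findstack -v" | mdb -k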
:  
:  On server5, the command 'zfs get sharenfs pptank/local_linux' displays
:  the expected list of hosts/IPs with remote ro and rw access.
:  
:  Here is the O/P from some other hopefully relevant commands;
:  
:  root@server5:/# zpool status
:pool: pptank
:   state: ONLINE
:  status: The pool is formatted using an older on-disk format.  The pool can
:  still be used, but some features are unavailable.
:  action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
:  pool will no longer be accessible on older software versions.
:   scan: none requested
:  config:
:  
:  NAME        STATE     READ WRITE CKSUM
:  pptank  ONLINE   0 0 0
:raidz1-0  ONLINE   0 0 0
:  c3t0d0  ONLINE   0 0 0
:  c3t1d0  ONLINE   0 0 0
:  c3t2d0  ONLINE   0 0 0
:  c3t3d0  ONLINE   0 0 0
:  c3t4d0  ONLINE   0 0 0
:  c3t5d0  ONLINE   0 0 0
:  c3t6d0  ONLINE   0 0 0
:  
:  errors: No known data errors
:  
:  root@server5:/# zpool list
:  NAME     SIZE  ALLOC   FREE    CAP  HEALTH  ALTROOT
:  pptank  12.6T   384G  12.3T 2%  ONLINE  -
:  
:  root@server5:/# zpool history
:  History for 'pptank':
:  just hangs here
:  
:  root@server5:/# zpool iostat 5
:                 capacity     operations    bandwidth
:  pool        alloc   free   read  write   read  write
:  ----------  -----  -----  -----  -----  -----  -----
:  pptank       384G  12.3T     92    115  3.08M  1.22M
:  pptank       384G  12.3T  1.11K    629  35.5M  3.03M
:  pptank       384G  12.3T    886    889  27.1M  3.68M
:  pptank       384G  12.3T    837    677  24.9M  2.82M
:  pptank       384G  12.3T  1.19K    757  37.4M  3.69M
:  pptank       384G  12.3T  1.02K    759  29.6M  3.90M
:  pptank       384G  12.3T    952    707  32.5M  3.09M
:  pptank       384G  12.3T  1.02K    831  34.5M  3.72M
:  pptank       384G  12.3T    707    503  23.5M  1.98M
:  pptank       384G  12.3T    626    707  20.8M  3.58M
:  pptank       384G  12.3T    816    838  26.1M  4.26M
:  pptank       384G  12.3T    942    800  30.1M  3.48M
:  pptank       384G  12.3T    677    675  21.7M  2.91M
:  pptank       384G  12.3T    590    725  19.2M  3.06M
:  
:  
:  top shows the following runnable processes.  Nothing excessive here AFAICT?
:  
:  last pid: 25282;  load avg:  1.98,  1.95,  1.86;   up 1+09:02:05   07:46:29
:  72 processes: 67 sleeping, 1 running, 1 stopped, 3 on cpu
:  CPU states: 81.5% idle,  0.1% user, 18.3% kernel,  0.0% iowait,  0.0% swap
:  Memory: 2048M phys mem, 32M free mem, 16G total swap, 16G free swap
:  
: PID USERNAME LWP PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
: 748 root  18  60  -20  103M 9752K cpu/1   78:44  6.62% nfsd
:   24854 root   1  540 1480K  792K cpu/10:42  0.69% cp
:   25281 root   1  590 3584K 2152K cpu/00:00  0.02% top
:  
:  The above cp job, mentioned earlier, is the attempt to copy a file to an
:  affected FS; I've noticed it is apparently not completely hung.
:  
:  The only thing that appears specific to Sunday morning is a 

Re: [zfs-discuss] Occasional storm of xcalls on segkmem_zio_free

2012-06-13 Thread Roch

Sašo Kiselkov writes:
  On 06/12/2012 05:37 PM, Roch Bourbonnais wrote:
   
   So the xcalls are a necessary part of memory reclaiming: when one needs to
   tear down the TLB entries mapping the physical memory (which can from here
   on be repurposed), the xcalls are just part of that. They should not cause
   trouble, but they do; they consume a cpu for some time.
   
   That in turn can cause infrequent latency bubbles on the network. One root
   cause of these latency bubbles is that network threads are bound by default,
   and if the xcall storm ends up on the CPU that a network thread is bound to,
   it will wait for the storm to pass.
  
  I understand, but the xcall storm only eats up a single core out of a
  total of 32, plus it's not a single specific one, it tends to change,
  so what are the odds of hitting the same core as the one on which the
  mac thread is running?
  

That's easy :-) : 1/32 each time it needs to run. So depending on how often
it runs (which depends on how much churn there is in the ARC) and how often
you see the latency bubbles, that may or may not be it.
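
If it would help to check that empirically, a DTrace sketch (assuming DTrace
is available on the box) that counts xcalls per CPU every 10 seconds, so the
storms and the CPUs they land on become visible:

  dtrace -n 'sysinfo:::xcalls { @[cpu] = count(); }
      tick-10sec { printa("cpu %d: %@d xcalls\n", @); trunc(@); }'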

What is zio_taskq_batch_pct on your system? That is another stormy bit of
code which causes bubbles. Setting it down to 50 (versus an older default of
100) should help, if that's not done already.
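
A sketch of how one might check and lower it (assuming the tunable is present
in the zfs module on this build; verify with the first command before relying
on the second):

  # read the current value from the live kernel
  echo zio_taskq_batch_pct/D | mdb -k

  # to make 50 persistent, add this line to /etc/system and reboot,
  # since the taskqs are sized when the pool's spa is loaded
  set zfs:zio_taskq_batch_pct = 50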

-r

   So try unbinding the mac threads; it may help you here.
  
  How do I do that? All I can find on interrupt fencing and the like is to
  simply set certain processors to no-intr, which moves all of the
  interrupts away, and it doesn't prevent the xcall storm from choosing to
  affect these CPUs either...
  
  --
  Saso
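
For the unbinding itself, one knob that exists in illumos/Solaris 11 Crossbow
is the per-link 'cpus' property, which controls where a link's mac/worker
threads run; a sketch, assuming a hypothetical igb0 interface (substitute the
real link name):

  # see whether the link's threads are currently pinned anywhere
  dladm show-linkprop -p cpus igb0

  # pin them to a CPU set kept free of reclaim activity (example set)
  dladm set-linkprop -p cpus=8-15 igb0

  # or clear the property to return to the default placement
  dladm reset-linkprop -p cpus igb0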



Re: [zfs-discuss] ZFS NFS service hanging on Sunday morning problem

2012-06-13 Thread TPCzfs
 
 Shot in the dark here:
 
 What are you using for the sharenfs value on the ZFS filesystem? Something 
 like rw=.mydomain.lan ?

They are IP blocks or hosts specified as FQDNs, eg.,

pptank/home/tcrane  sharenfs  rw=@192.168.101/24,rw=serverX.xx.rhul.ac.uk:serverY.xx.rhul.ac.uk

 

 I've had issues where a ZFS server loses connectivity to the primary DNS 
 server and as a result the reverse lookups used to validate the identity 

It was using our slave DNS but there have been no recent problems with it.  
I've switched it to the primary DNS.
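
If DNS does turn out to be implicated, a quick sanity check of the lookups
mountd would do for the rw= list (the client name and address below are
placeholders):

  # forward and reverse resolution as the server sees it
  getent hosts serverX.xx.rhul.ac.uk
  getent hosts 192.168.101.23

  # and confirm which sources/servers are consulted
  grep '^hosts' /etc/nsswitch.conf
  cat /etc/resolv.conf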

 of client systems fail and the connections hang. Any chance there's a 
 planned reboot of the DNS server Sunday morning? That sounds like the kind of 

No.  The only things tied to Sunday morning are these two (Solaris factory
installed?) cron jobs:

root@server5:/# grep nfsfind /var/spool/cron/crontabs/root
15 3 * * 0 /usr/lib/fs/nfs/nfsfind
root@server5:/# grep 13 /var/spool/cron/crontabs/lp
#  At 03:13am on Sundays:
13 3 * * 0 cd /var/lp/logs; if [ -f requests ]; then if [ -f requests.1 ]; then /bin/mv requests.1 requests.2; fi; /usr/bin/cp requests requests.1; >requests; fi

The lp one does not access the main ZFS pool but the nfsfind does.  However, 
AFAICT it has usually finished before the problem manifests itself.
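
One way to make that "usually" concrete would be to log the job's start and
finish times for a few weekends, e.g. by editing root's crontab (a sketch;
it keeps the stock job and just wraps it with timestamps):

  15 3 * * 0 (echo start: `date`; /usr/lib/fs/nfs/nfsfind; echo done: `date`) >> /var/adm/nfsfind.log 2>&1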

 preventative maintenance that might be happening in that time window.


Cheers
Tom.

 
 Cheers,
 
 Erik
 
 On 13 juin 2012, at 12:47, tpc...@mklab.ph.rhul.ac.uk wrote:
 

[zfs-discuss] NFS asynchronous writes being written to ZIL

2012-06-13 Thread Timothy Coalson
I noticed recently that the SSDs hosting the ZIL for my pool had a large
number in the SMART attribute for total LBAs written (with some
calculation, it seems to be the total amount of data written to the pool so
far), did some testing, and found that the ZIL is being used quite heavily
(matching the writing speed) on writes that should be asynchronous.  I did
a capture with wireshark during a simple copy of a large file (8GB), and
both the write packets and the write responses showed the "UNSTABLE" I
would expect in asynchronous writes.  The copy didn't finish before the
server started writing heavily to the ZIL, so I wouldn't expect the client
to be telling NFS to commit yet.  Here are some relevant pieces of info
(with hostnames, etc. removed):

client: ubuntu 11.10
/etc/fstab entry: server:/mainpool/storage  /mnt/myelin  nfs  bg,retry=5,soft,proto=tcp,intr,nfsvers=3,noatime,nodiratime,async  0  0

server: OpenIndiana oi_151a4
$ zfs get sync mainpool
NAME  PROPERTY  VALUE SOURCE
mainpool  sync  standard  default
$ zfs get sync mainpool/storage
NAME  PROPERTY  VALUE SOURCE
mainpool/storage  sync  standard  default
$ zfs get sharenfs mainpool/storage
NAME  PROPERTY  VALUE   SOURCE
mainpool/storage  sharenfs  rw=@xxx.xxx.37  local
$ zpool get version mainpool
NAME  PROPERTY  VALUESOURCE
mainpool  version   28   default

The pool consists of 24 sata disks arranged as 2 raid-z2 groups of 12, and
originally had a mirrored log across 10GB slices on two intel 320 80GB
SSDs, but I have since rearranged the logs as non-mirrored (since a single
SSD doesn't quite keep up with gigabit network throughput, and some testing
convinced me that a pool survives a failing log device gracefully, as long
as there isn't a simultaneous crash).  I would like to switch them back to
a mirrored configuration, but without impacting the asynchronous
throughput, and obviously it would be nice to reduce the write load to them
so they live longer.  As far as I can tell, all nfs writes are being
written to the ZIL when many should be cached in memory and bypass the ZIL.
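
For watching this while a copy runs, one low-tech check is the per-vdev view
of zpool iostat, where the log devices are broken out separately (a sketch;
the trailing number is the interval in seconds):

  # the 'logs' section shows how much write traffic is hitting the slog SSDs
  zpool iostat -v mainpool 5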

Any help appreciated,
Tim


Re: [zfs-discuss] NFS asynchronous writes being written to ZIL

2012-06-13 Thread Daniel Carosone
On Wed, Jun 13, 2012 at 05:56:56PM -0500, Timothy Coalson wrote:
 client: ubuntu 11.10
 /etc/fstab entry: server:/mainpool/storage   /mnt/myelin nfs
 bg,retry=5,soft,proto=tcp,intr,nfsvers=3,noatime,nodiratime,async   0
 0

nfsvers=3

 NAME  PROPERTY  VALUE SOURCE
 mainpool/storage  sync  standard  default

sync=standard

This is expected behaviour for this combination. NFS 3 semantics are
for persistent writes at the server regardless - and mostly also 
for NFS 4.

The async client mount option relates to when the writes get shipped
to the server (immediately or delayed in dirty pages), rather than to
how the server should handle those writes once they arrive.

You could set sync=disabled if you're happy with the consequences, or
even just as a temporary test to confirm the impact.  It sounds like
you would be since that's what you're trying to achieve.

There is a difference: async on the client means data is lost on a
client reboot, async on the server means data may be lost on a server
reboot (and the client/application confused by inconsistencies as a
result). 

Separate datasets (and mounts) for data with different persistence
requirements can help.
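
A minimal sketch of that split, assuming a hypothetical 'scratch' dataset for
data you can afford to lose on a server crash (the sharenfs value just mirrors
the one quoted above):

  zfs create mainpool/scratch
  zfs set sync=disabled mainpool/scratch
  zfs set sharenfs=rw=@xxx.xxx.37 mainpool/scratch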

--
Dan.






Re: [zfs-discuss] Recovery of RAIDZ with broken label(s)

2012-06-13 Thread Daniel Carosone
On Tue, Jun 12, 2012 at 03:46:00PM +1000, Scott Aitken wrote:
 Hi all,

Hi Scott. :-)

 I have a 5 drive RAIDZ volume with data that I'd like to recover.

Yeah, still..

 I tried using Jeff Bonwick's labelfix binary to create new labels but it
 carps because the txg is not zero.

Can you provide details of invocation and error response?

For the benefit of others, this was at my suggestion; I've been
discussing this problem with Scott for.. some time. 

 I can also make the solaris machine available via SSH if some wonderful
 person wants to poke around. 

Will take a poke, as discussed.  May well raise more discussion here
as a result.
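
In the meantime, the on-disk labels can be dumped without attempting any
repair; a sketch, with a placeholder device name (run it against each member
slice of the old raidz):

  # prints labels 0-3; compare the txg and pool/vdev guids across members
  zdb -l /dev/rdsk/c3t0d0s0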

--
Dan.




Re: [zfs-discuss] NFS asynchronous writes being written to ZIL

2012-06-13 Thread Timothy Coalson
Interesting... what I had read about NFSv3 asynchronous writes, especially
bits about "does not require the server to commit to stable storage", led
me to expect different behavior.  The performance impact on large writes
(which we do a lot of) wasn't severe, so sync=disabled is probably not
worth the risk.  The SSDs should also be able to take a fair amount of
abuse, so I can live with the behavior as-is.

Here's one example of documentation that led me to expect something
else: http://nfs.sourceforge.net/ , search for "unstable".  It did
indeed seem like the nfs client was delaying writes, then doing
synchronous nfs calls, which is why I looked into the packets with
wireshark (and found them to be advertised as UNSTABLE, but with the
pool acting synchronous).
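
If it's worth confirming from the server side rather than from packets, a
DTrace sketch (assuming the fbt provider can see the zfs module's zil_commit
function on oi_151a4) that counts ZIL commits every 10 seconds while a copy
runs:

  dtrace -n 'fbt:zfs:zil_commit:entry { @ = count(); }
      tick-10sec { printa("zil_commit calls: %@d\n", @); trunc(@); }'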

Thanks,
Tim

On Wed, Jun 13, 2012 at 6:51 PM, Daniel Carosone d...@geek.com.au wrote:

