Hi all,

We have a couple of Thumpers here, one running Sol10u8 with the latest patches 
all applied (just done this morning). After problems with the 10 Gb/s NIC we 
disabled it and moved to the on-board e1000g0. 

Now, when users launch jobs and hit the box relatively hard, we get very 
interesting numbers:

s09:~# zpool iostat 18
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
atlashome   7.64T  12.8T    512    251  51.2M  28.1M
atlashome   7.64T  12.8T  1.28K    174   145M  8.24M
atlashome   7.64T  12.8T  1.17K    123   133M  7.26M
atlashome   7.64T  12.8T  1.60K     36   202M  3.06M
atlashome   7.64T  12.8T  1.37K    106   173M  3.94M
atlashome   7.64T  12.8T  1.31K      8   164M   956K
atlashome   7.64T  12.8T  1.14K     63   144M  2.57M
atlashome   7.64T  12.8T  1.73K     48   218M  2.02M
atlashome   7.64T  12.8T  1.45K      3   184M   429K
atlashome   7.64T  12.8T  1.66K     44   210M  1.55M
atlashome   7.64T  12.8T  1.75K      4   219M   549K
atlashome   7.64T  12.8T  1.88K     34   238M  1.12M
atlashome   7.64T  12.8T  1.62K     26   205M  1.32M
atlashome   7.64T  12.8T  1.79K      9   224M  1.15M
atlashome   7.64T  12.8T  2.12K     42   269M  2.95M
atlashome   7.64T  12.8T  2.35K     14   298M  1.71M
atlashome   7.64T  12.8T  3.12K     49   397M  4.00M
atlashome   7.64T  12.8T  3.61K     55   460M  3.91M
atlashome   7.64T  12.8T  4.32K      7   550M   902K
atlashome   7.64T  12.8T  4.12K     44   525M  3.32M
atlashome   7.64T  12.8T  5.05K      5   644M   643K
atlashome   7.64T  12.8T  4.33K     34   553M  1.70M
atlashome   7.64T  12.8T  4.52K     30   577M  1.69M
atlashome   7.64T  12.8T  4.70K      3   600M   427K
atlashome   7.64T  12.8T  4.71K     36   599M  2.43M
atlashome   7.64T  12.8T  4.49K      2   569M   314K
atlashome   7.64T  12.8T  6.56K     40   832M  2.89M
atlashome   7.64T  12.8T  5.78K     46   735M  3.58M
atlashome   7.64T  12.8T  5.97K      3   759M   345K
atlashome   7.64T  12.8T  6.03K     45   765M  3.29M
atlashome   7.64T  12.8T  5.18K      2   658M   309K
atlashome   7.64T  12.8T  5.64K     37   710M  2.42M
atlashome   7.64T  12.8T  5.44K     42   685M  2.95M
atlashome   7.64T  12.8T  4.73K      4   590M   454K
atlashome   7.64T  12.8T  3.60K     53   447M  3.46M
atlashome   7.64T  12.8T  3.76K     59   469M  2.82M
atlashome   7.64T  12.8T  2.95K     51   367M  1.52M
atlashome   7.64T  12.8T  1.52K     53   191M  1.18M
atlashome   7.64T  12.8T  3.48K     32   434M  1.11M
atlashome   7.64T  12.8T  3.41K     21   432M   533K
atlashome   7.64T  12.8T  3.58K     41   454M  1.56M
atlashome   7.64T  12.8T  2.71K     39   342M  1.36M
Read from remote host s09: Connection timed out


Here the system crashed. Can someone explain to me why the zpool is reading 
data off the disks at 5-8 times the possible bandwidth of the single Gbit 
interface?
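Just to put a number on it, a quick back-of-the-envelope check of that ratio, taking the highest read figure from the iostat output above (~832 MB/s) against the practical ceiling of a single 1 Gbit/s link (~119 MiB/s before protocol overhead):

```shell
# Rough ratio of peak zpool read bandwidth to Gbit line rate.
peak_mb=832    # highest read value in the zpool iostat output above
gbit_mb=119    # ~119 MiB/s is about the best a 1 Gbit/s NIC can do
awk -v p="$peak_mb" -v g="$gbit_mb" 'BEGIN { printf "%.1fx the NIC capacity\n", p / g }'
```

So the disks are being read roughly 7x faster than anything could leave the box over e1000g0.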

Could this just be a combination of large record sizes (128K), compression 
being on, and the users reading very tiny files?
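A minimal sketch of that hypothesis, assuming (and it is only an assumption) that each tiny-file read misses the ARC and pulls a full 128K record off disk; the file size of 4K is a made-up example, not a measured figure:

```shell
# Hypothetical worst-case read amplification: a cache-miss read of a
# small file that nevertheless pulls a whole record from the pool.
recordsize=131072   # 128K, the default ZFS recordsize
filesize=4096       # hypothetical tiny user file (4K)
echo "$((recordsize / filesize))x amplification"   # prints "32x amplification"
```

Even a fraction of that factor would be enough to explain the 5-8x gap between disk reads and NIC bandwidth, though it would depend on how often the reads actually miss the cache.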

Also, via the ILOM I'm totally stuck:

last pid:  2115;  load avg: 101.4, 104.9,  61.3;  up 0+00:32:32  09:49:08
46 processes: 43 sleeping, 2 running, 1 on cpu
CPU states: 31.2% idle,  0.1% user, 68.7% kernel,  0.0% iowait,  0.0% swap     
Memory: 16G phys mem, 62M free mem, 4001M swap, 3822M free swap
Feb 25 09:46:55 s09 nfssrv: WARNING: nfsauth: mountd not responding
   PID USERNAME LWP PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
   736 daemon   999  60  -20   14M  736K sleep   11:56 56.08% nfsd
   160 root      47  59    0 8920K 1152K sleep    0:04  0.33% nscd
  1796 noaccess  18  59    0  189M 1008K sleep    0:21  0.17% java
  2004 root       1  59    0 3396K  236K cpu      0:04  0.12% top
  2012 root       1  60    0 2960K  212K sleep    0:00  0.10% bash
  1915 root       1  59    0 8564K  256K sleep    0:00  0.10% master
     9 root      16  59    0   11M  424K sleep    0:05  0.09% svc.configd
   616 root      23  59    0   21M  868K run      0:13  0.08% fmd
   485 root       3  59    0 2816K  188K sleep    0:00  0.07% automountd
     7 root      14  59    0   15M  324K sleep    0:01  0.07% svc.startd
     1 root       1  59    0 2496K    4K sleep    0:00  0.07% init
  1962 postfix    1  60    0 8768K  264K sleep    0:00  0.06% qmgr
   357 daemon     1  60    0 4540K  288K sleep    0:01  0.06% rpcbind
  2006 root       1  60    0 6328K  164K sleep    0:00  0.06% sshd
   427 root       1  59    0 1436K  248K sleep    0:00  0.04% utmpd


zfs ^C^C^C^C^C^C^C^C
ifconfig -a
s09:~#
s09:~# ifconfig -a
-bash: fork: Resource temporarily unavailable
s09:~# free
-bash: fork: Resource temporarily unavailable
s09:~# Feb 25 09:50:03 s09 sshd[591]: error: fork: Error 0

s09:~# uptime
-bash: fork: Resource temporarily unavailable
s09:~# reboot
-bash: fork: Resource temporarily unavailable
s09:~# shutdown -i 5 -y -g 0
-bash: fork: Resource temporarily unavailable

Any idea how to find out what's going amiss here?

cheers

Carsten
_______________________________________________
Solaris-Users mailing list
[email protected]
http://www.filibeto.org/mailman/listinfo/solaris-users
