I have been having issues with Solaris-kernel-based systems "locking up"
and am wondering if anyone else has observed a similar symptom before.
The symptom has presented on two systems: an NFS server (Nexenta Core 3.01)
and a MySQL server (Solaris 11 Express).
The issue presents itself as almost total unresponsiveness: I can no longer
SSH to the host, and the local console (via the Dell Remote Access
Controller) is also unresponsive.
The only time I have seen any level of responsiveness was on the MySQL
server: I was able to connect to it and issue extremely basic commands like
SHOW PROCESSLIST, but anything else would just hang.
I suspect this is because MySQL keeps a thread cache (so no memory needs to
be allocated for a new thread on an incoming connection) and SHOW
PROCESSLIST can be served almost entirely from already-allocated memory.
The NFS server has 48G of physical memory and no ZFS-specific tuning in
/etc/system.
The MySQL server has 80G of physical memory, and I have tried a variety of
ZFS tuning settings on it; this is the system I am now primarily focused on.
The primary cache for the MySQL data zpool is set to metadata only (InnoDB
has its own buffer pool for data), and I have prefetch disabled, since
InnoDB also does its own prefetching.
Originally I had limited the ARC to 4G (to leave most of the memory for
MySQL), and that is when the lockup was first observed.
I then retuned the server, thinking I wasn't giving ZFS enough breathing
room; I hadn't realised how much memory metadata can really consume on a
20TB pool.
So I removed the ARC limit and set the InnoDB buffer pool to 54G, down from
the previous setting of 64G. On an 80G machine that should leave about 26G
for the kernel (including the ARC).
The server ran fine for a few days, but then the symptom showed up again.
I rebooted the machine and, interestingly, the system locked up yet again
while MySQL was doing crash recovery!
Hardware-wise, we are mostly using Dell gear.
The MySQL server is:
Dell R710 with 80G memory and two daisy-chained MD1220 disk arrays (22 disks
each, 600GB 10k RPM SAS drives)
Storage controller: LSI, Inc. 1068E (JBOD)
I have also seen similar symptoms on systems with MD1000 disk arrays
containing 2TB 7200RPM SATA drives.
The only thing of note that seems to show up in the /var/adm/messages file
on this MySQL server is:
Oct 31 18:24:51 mslvstdp02r scsi: [ID 243001 kern.warning] WARNING: /pci@0
Oct 31 18:24:51 mslvstdp02r mpt request inquiry page 0x89 for SATA
Oct 31 18:24:52 mslvstdp02r scsi: [ID 583861 kern.info] ses0 at mpt0: unit-address 58,0: target 58 lun 0
Oct 31 18:24:52 mslvstdp02r genunix: [ID 936769 kern.info] ses0 is /pci@0
Oct 31 18:24:52 mslvstdp02r genunix: [ID 408114 kern.info] /pci@0,0/pci8086,3410@9/pci1000,3080@0/ses@58,0 (ses0) online
Oct 31 18:24:52 mslvstdp02r scsi: [ID 243001 kern.warning] WARNING: /pci@0
Oct 31 18:24:52 mslvstdp02r mpt request inquiry page 0x89 for SATA
Oct 31 18:24:53 mslvstdp02r scsi: [ID 583861 kern.info] ses1 at mpt0: unit-address 59,0: target 59 lun 0
Oct 31 18:24:53 mslvstdp02r genunix: [ID 936769 kern.info] ses1 is /pci@0
Oct 31 18:24:53 mslvstdp02r genunix: [ID 408114 kern.info] /pci@0,0/pci8086,3410@9/pci1000,3080@0/ses@59,0 (ses1) online
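To see whether anything besides these mpt/ses messages shows up around the time of a hang, it can help to tally /var/adm/messages by severity and message ID. This is a minimal Python sketch, assuming the standard syslog layout seen in the lines above (timestamp, host, source, then `[ID n facility.level]`); continuation lines without an `[ID ...]` tag are simply skipped, and `count_levels` is a hypothetical helper name.

```python
import re
from collections import Counter

# Assumed layout, e.g.:
# "Oct 31 18:24:51 mslvstdp02r scsi: [ID 243001 kern.warning] WARNING: ..."
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+"
    r"(?P<source>\S+):\s+\[ID\s+(?P<msgid>\d+)\s+(?P<level>[\w.]+)\]"
)


def count_levels(lines):
    """Count messages per facility.level (e.g. kern.warning) and per
    (level, message ID) pair. Lines that do not match are ignored."""
    levels = Counter()
    ids = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            levels[m.group("level")] += 1
            ids[(m.group("level"), m.group("msgid"))] += 1
    return levels, ids
```

Usage: `with open("/var/adm/messages", errors="replace") as f: levels, ids = count_levels(f)`, then print `ids.most_common()` to see which message IDs dominate.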
I'm thinking that the issue is memory related, so the current test I am
running uses these /etc/system settings:
# Limit the amount of memory the ARC will use
# See this link:
# Limit to 24G (24 * 1024^3 bytes)
set zfs:zfs_arc_max = 25769803776
# Limit metadata to 20G (20 * 1024^3 bytes)
set zfs:zfs_arc_meta_limit = 21474836480
# Disable ZFS prefetch - InnoDB does its own
set zfs:zfs_prefetch_disable = 1
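The byte values above are exactly 24 and 20 binary gigabytes (GiB). When experimenting with other limits, a tiny helper avoids arithmetic slips. This is only a sketch; `etc_system_line` is a hypothetical name, not part of any Solaris tool.

```python
GIB = 1024 ** 3  # /etc/system ZFS size tunables are plain byte counts


def etc_system_line(tunable: str, gib: int) -> str:
    """Format a 'set zfs:<tunable> = <bytes>' line for a size given in GiB."""
    return f"set zfs:{tunable} = {gib * GIB}"


if __name__ == "__main__":
    print(etc_system_line("zfs_arc_max", 24))         # 25769803776 bytes
    print(etc_system_line("zfs_arc_meta_limit", 20))  # 21474836480 bytes
```

Note that /etc/system is only read at boot, so new values take effect after a reboot.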
MySQL memory: I set the InnoDB buffer pool size to 44G (down another 10G
from 54G). That should allow 44G + 24G = 68G for MySQL and the ARC, leaving
12G for anything else that I haven't considered.
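The same budget arithmetic, written out so that other combinations can be checked quickly. A minimal sketch; `memory_budget` and the reservation labels are hypothetical. One assumption worth stating: mysqld uses some memory beyond `innodb_buffer_pool_size` (per-connection buffers, for example), so the remainder is an upper bound on what is truly free.

```python
from typing import Dict


def memory_budget(physical_gib: int, reservations: Dict[str, int]) -> int:
    """Return the GiB left after the named reservations are subtracted.
    Raises ValueError if the reservations exceed physical memory."""
    left = physical_gib - sum(reservations.values())
    if left < 0:
        raise ValueError(f"over-committed by {-left} GiB: {reservations}")
    return left


if __name__ == "__main__":
    # Current test: 44G buffer pool + 24G zfs_arc_max on the 80G host.
    print(memory_budget(80, {"innodb_buffer_pool": 44, "zfs_arc_max": 24}))  # 12
    # Previous setup: 54G buffer pool with the ARC uncapped, so the
    # remainder is shared by the kernel and the ARC.
    print(memory_budget(80, {"innodb_buffer_pool": 54}))  # 26
```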
I am using arcstat.pl to write ARC statistics (size, hit ratio, requests,
etc.) to a file every 5 seconds, and I am also logging vmstat every 5 seconds.
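If it helps to also record metadata usage against the 20G `zfs_arc_meta_limit`, a small poller around `kstat -p` can run alongside arcstat.pl. This is a sketch under several assumptions: that `kstat -p` prints `module:instance:name:statistic<TAB>value` lines, that the statistic names `size`, `c`, `c_max`, `arc_meta_used` and `arc_meta_limit` exist under `zfs:0:arcstats` on this build, that Python 3.7+ is available, and that `/var/tmp/arcstats.log` is a hypothetical path.

```python
import os
import shutil
import subprocess
import time


def parse_kstat_p(output):
    """Parse `kstat -p` lines of the form
    'module:instance:name:statistic<TAB>value' into {statistic: value}."""
    stats = {}
    for line in output.splitlines():
        key, sep, value = line.partition("\t")
        if sep:  # skip blank or malformed lines without a tab
            stats[key.rsplit(":", 1)[-1]] = value.strip()
    return stats


def sample_arc(stat_names):
    """Run kstat once for the requested zfs:0:arcstats statistics."""
    args = ["kstat", "-p"] + ["zfs:0:arcstats:" + name for name in stat_names]
    result = subprocess.run(args, capture_output=True, text=True, check=True)
    return parse_kstat_p(result.stdout)


def log_loop(path, stat_names, interval=5.0):
    """Append one timestamped line per interval. Each line is flushed and
    fsync'd so samples taken shortly before a hang are more likely to be
    on disk after the reboot."""
    with open(path, "a") as log:
        while True:
            stats = sample_arc(stat_names)
            fields = " ".join(f"{n}={stats.get(n, '?')}" for n in stat_names)
            log.write(time.strftime("%Y-%m-%d %H:%M:%S") + " " + fields + "\n")
            log.flush()
            os.fsync(log.fileno())
            time.sleep(interval)


# Only start logging on a host that actually has kstat (Solaris/illumos).
if __name__ == "__main__" and shutil.which("kstat"):
    log_loop("/var/tmp/arcstats.log",
             ["size", "c", "c_max", "arc_meta_used", "arc_meta_limit"])
```

If the log file lives on the pool that is hanging, the writes may stall along with everything else; pointing it at a different pool may give a better record of the last seconds before a lockup.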
I'm hoping that, should the issue present itself again, I can find a
possible cause, but I'm really concerned about it: we want to use ZFS in
production, but these seemingly inexplicable lockups are not filling us
with confidence :(
Has anyone seen similar things before and do you have any suggestions for
what else I should consider looking at?
Thanks and Regards,
Marin Software Inc.
San Francisco, USA
zfs-discuss mailing list