Hi Folks,

I have been having issues with Solaris-kernel-based systems "locking up" and am wondering if anyone else has observed a similar symptom before.
Some information/background...

Systems the symptom has presented on: an NFS server (Nexenta Core 3.01) and a MySQL server (Sol 11 Express).

The issue presents itself as almost total unresponsiveness: I can no longer SSH to the host, and the local console (via Dell Remote Access Console) is also unresponsive. The only case where I have seen some level of responsiveness is on the MySQL server. I was able to connect to MySQL and issue extremely basic commands like SHOW PROCESSLIST; anything else would just hang. I feel like this could be explained by the fact that MySQL keeps a thread cache (no need to allocate memory for a new thread on an incoming connection) and SHOW PROCESSLIST can be served almost entirely from already-allocated memory structures.

The NFS server has 48G physical memory and no specifically tuned ZFS settings in /etc/system.

The MySQL server has 80G physical memory and I have tried a variety of ZFS tuning settings on it. This is now the system I am primarily focused on troubleshooting. The primary cache for the MySQL data zpool is set to metadata only (InnoDB has its own buffer pool for data) and I have prefetch disabled, since InnoDB also does its own prefetching.

Originally I had limited the ARC to 4G (to allow most memory to go to MySQL), and that is when the lock up was first observed. I then retuned the server, thinking I wasn't allowing ZFS enough breathing room -- I didn't realise how much memory metadata can really consume for a 20TB zpool! So I removed the ARC limit and set the InnoDB buffer pool to 54G, down from the previous setting of 64G. This should leave about 26G for the kernel and ZFS. The server ran fine for a few days, but then the symptom showed up again. I rebooted the machine and, interestingly, while MySQL was doing crash recovery, the system locked up yet again!

Hardware-wise we are using mostly Dell gear.
The MySQL server is:

Dell R710 / 80G Memory with two daisy-chained MD1220 disk arrays - 22 disks each - 600GB 10k RPM SAS drives
Storage Controller: LSI, Inc. 1068E (JBOD)

I have also seen similar symptoms on systems with MD1000 disk arrays containing 2TB 7200RPM SATA drives.

The only thing of note that seems to show up in the /var/adm/messages file on this MySQL server is:

Oct 31 18:24:51 mslvstdp02r scsi: [ID 243001 kern.warning] WARNING: /pci@0,0/pci8086,3410@9/pci1000,3080@0 (mpt0):
Oct 31 18:24:51 mslvstdp02r mpt request inquiry page 0x89 for SATA target:58 failed!
Oct 31 18:24:52 mslvstdp02r scsi: [ID 583861 kern.info] ses0 at mpt0: unit-address 58,0: target 58 lun 0
Oct 31 18:24:52 mslvstdp02r genunix: [ID 936769 kern.info] ses0 is /pci@0,0/pci8086,3410@9/pci1000,3080@0/ses@58,0
Oct 31 18:24:52 mslvstdp02r genunix: [ID 408114 kern.info] /pci@0,0/pci8086,3410@9/pci1000,3080@0/ses@58,0 (ses0) online
Oct 31 18:24:52 mslvstdp02r scsi: [ID 243001 kern.warning] WARNING: /pci@0,0/pci8086,3410@9/pci1000,3080@0 (mpt0):
Oct 31 18:24:52 mslvstdp02r mpt request inquiry page 0x89 for SATA target:59 failed!
Oct 31 18:24:53 mslvstdp02r scsi: [ID 583861 kern.info] ses1 at mpt0: unit-address 59,0: target 59 lun 0
Oct 31 18:24:53 mslvstdp02r genunix: [ID 936769 kern.info] ses1 is /pci@0,0/pci8086,3410@9/pci1000,3080@0/ses@59,0
Oct 31 18:24:53 mslvstdp02r genunix: [ID 408114 kern.info] /pci@0,0/pci8086,3410@9/pci1000,3080@0/ses@59,0 (ses1) online

I'm thinking that the issue is memory related, so the current test I am running is:

ZFS tuneables:

/etc/system:
# Limit the amount of memory the ARC cache will use
# See this link: http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Limiting_the_ARC_Cache
# Limit to 24G
set zfs:zfs_arc_max = 25769803776
# Limit meta data to 20GB
set zfs:zfs_arc_meta_limit = 21474836480
# Disable ZFS prefetch - InnoDB does its own
set zfs:zfs_prefetch_disable = 1

MySQL memory:

Set the InnoDB buffer pool size to 44G (down another 10G from 54G). That should allow 44+24=68G for ARC and MySQL, leaving 12G for anything else that I haven't considered.

I am using arcstat.pl to collect and write stats on ARC size, hit ratio, requests, etc. to a file every 5 seconds, and vmstat is also logging every 5 seconds.

I'm hoping that, should the issue present itself again, I can find a possible cause. I'm really concerned about this issue: we want to make use of ZFS in production, but these seemingly inexplicable lock ups are not filling us with confidence :(

Has anyone seen similar things before, and do you have any suggestions for what else I should consider looking at?

Thanks and Regards,
--
Lachlan Mulcahy
Senior DBA, Marin Software Inc.
San Francisco, USA
AU Mobile: +61 458 448 721
US Mobile: +1 (415) 867 2839
Office : +1 (415) 671 6080
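For anyone wanting to double-check the numbers in the test configuration above, here is a minimal Python sketch of the arithmetic. It assumes the "G" figures in the post are binary gigabytes (2^30 bytes), which is what the byte values in /etc/system correspond to. The helper names gib_to_bytes and memory_budget are made up for illustration and are not part of any ZFS or MySQL tooling.

```python
# Sanity check of the memory figures quoted in the post.
# Assumption: "G" means GiB (2**30 bytes).

GIB = 1024 ** 3  # bytes per GiB


def gib_to_bytes(gib: int) -> int:
    """Convert a size in GiB to bytes, as needed for /etc/system values."""
    return gib * GIB


def memory_budget(physical_gib: int, arc_max_gib: int, buffer_pool_gib: int):
    """Return (committed, remaining) GiB once the ARC cap and the
    InnoDB buffer pool are both counted against physical memory."""
    committed = arc_max_gib + buffer_pool_gib
    return committed, physical_gib - committed


if __name__ == "__main__":
    # Figures from the post: 80G physical, 24G ARC cap,
    # 20G ARC metadata limit, 44G InnoDB buffer pool.
    print("set zfs:zfs_arc_max =", gib_to_bytes(24))
    print("set zfs:zfs_arc_meta_limit =", gib_to_bytes(20))
    committed, remaining = memory_budget(80, 24, 44)
    print(f"ARC + InnoDB: {committed}G, left for everything else: {remaining}G")
```

This only covers the two largest consumers. Other MySQL allocations (per-connection buffers and so on) and kernel memory outside the ARC would come out of the remaining 12G.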
_______________________________________________ zfs-discuss mailing list firstname.lastname@example.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss