Re: [zfs-discuss] Solaris Based Systems "Lock Up" - Possibly ZFS/memory related?
FWIW, we recommend disabling C-states in the BIOS for NexentaStor systems. C-states are evil.
 -- richard

On Oct 31, 2011, at 9:46 PM, Lachlan Mulcahy wrote:

> Hi All,
>
> We did not have the latest firmware on the HBA - through a lot of pain I
> managed to boot into an MS-DOS disk and run the firmware update. We're now
> running the latest on this card from the LSI.com website (both HBA BIOS
> and firmware).
>
> No joy... the system seized up again within a few hours of coming back up.
>
> Now trying another suggestion sent to me by a direct poster:
>
>   * Recommendation from Sun (Oracle) to work around a bug:
>   * 6958068 - Nehalem deeper C-states cause erratic scheduling behavior
>   set idle_cpu_prefer_mwait = 0
>   set idle_cpu_no_deep_c = 1
>
> This was apparently the cause of a similar symptom for them, and we are
> using Nehalem. At this point I'm running out of options, so it can't hurt
> to try it.
>
> Regards,
> --
> Lachlan Mulcahy
> Senior DBA, Marin Software Inc.
> San Francisco, USA

--
ZFS and performance consulting
http://www.RichardElling.com
LISA '11, Boston, MA, December 4-9

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Solaris Based Systems "Lock Up" - Possibly ZFS/memory related?
Hi All,

We did not have the latest firmware on the HBA - through a lot of pain I managed to boot into an MS-DOS disk and run the firmware update. We're now running the latest on this card from the LSI.com website (both HBA BIOS and firmware).

No joy... the system seized up again within a few hours of coming back up.

Now trying another suggestion sent to me by a direct poster:

  * Recommendation from Sun (Oracle) to work around a bug:
  * 6958068 - Nehalem deeper C-states cause erratic scheduling behavior
  set idle_cpu_prefer_mwait = 0
  set idle_cpu_no_deep_c = 1

This was apparently the cause of a similar symptom for them, and we are using Nehalem. At this point I'm running out of options, so it can't hurt to try it.

Regards,
--
Lachlan Mulcahy
Senior DBA, Marin Software Inc.
San Francisco, USA

AU Mobile: +61 458 448 721
US Mobile: +1 (415) 867 2839
Office   : +1 (415) 671 6080
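For reference, those two settings belong in /etc/system; a sketch of how they would be applied (the bug ID and values are from the message above, the comment formatting and the verification step are my own assumptions):

```
* /etc/system -- workaround for bug 6958068
* (Nehalem deeper C-states cause erratic scheduling behavior)
set idle_cpu_prefer_mwait = 0
set idle_cpu_no_deep_c = 1
```

The settings take effect after a reboot. On builds where these kernel variables are visible, something like `echo "idle_cpu_no_deep_c/D" | mdb -k` should confirm the running value, though that is an assumption on my part.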
Re: [zfs-discuss] Poor relative performance of SAS over SATA drives
Thanks for the reply.

On 11/1/2011 11:03 AM, Richard Elling wrote:
> On Oct 26, 2011, at 7:56 PM, weiliam.hong wrote:
>> Questions:
>> 1. Why do the SG SAS drives degrade to <10 MB/s while the WD RE4 drives
>> remain consistent at >100 MB/s after 10-15 min?
>> 2. Why do the SG SAS drives show only 70+ MB/s when the published
>> figures are >100 MB/s (refer here)?
>
> Are the SAS drives multipathed? If so, do you have round-robin (default
> in most Solaris distros) or logical-block?

Physically, the SAS drives are not multipathed, as I connected them directly to the HBA. I also disabled multipathing via mpt_sas.conf.

Regards,

>> 3. All 4 drives are connected to a single HBA, so I assume the mpt_sas
>> driver is used. Are SAS and SATA drives handled differently?
>
> Yes. SAS disks can be multipathed, SATA disks cannot.
>  -- richard
Re: [zfs-discuss] Poor relative performance of SAS over SATA drives
On Oct 26, 2011, at 7:56 PM, weiliam.hong wrote:
>
> Questions:
> 1. Why do the SG SAS drives degrade to <10 MB/s while the WD RE4 drives
> remain consistent at >100 MB/s after 10-15 min?
> 2. Why do the SG SAS drives show only 70+ MB/s when the published figures
> are >100 MB/s (refer here)?

Are the SAS drives multipathed? If so, do you have round-robin (default in most Solaris distros) or logical-block?

> 3. All 4 drives are connected to a single HBA, so I assume the mpt_sas
> driver is used. Are SAS and SATA drives handled differently?

Yes. SAS disks can be multipathed, SATA disks cannot.
 -- richard
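For anyone unsure what round-robin vs logical-block refers to: it is the mpxio load-balance policy, set in scsi_vhci.conf. A hedged sketch (the file path and the default vary by release; check your distro's documentation before editing):

```
# /kernel/drv/scsi_vhci.conf (assumed location; varies by release)
# The default policy is typically round-robin; logical-block keeps
# sequential I/O on one path per LBA region, which can behave better
# for some SAS arrays:
load-balance="logical-block";
```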
Re: [zfs-discuss] Solaris Based Systems "Lock Up" - Possibly ZFS/memory related?
Hi All/Marion,

A small update...

>> Have you got the latest firmware on your LSI 1068E HBAs? These have been
>> known to have lockups/timeouts when used with SAS expanders (disk
>> enclosures) with incompatible firmware revisions, and/or with older mpt
>> drivers.
>
> I'll need to check that out -- I'm 90% sure that these are fresh
> out-of-box HBAs. Will try an upgrade and see if we get any joy...

We did not have the latest firmware on the HBA - through a lot of pain I managed to boot into an MS-DOS disk and run the firmware update. We're now running the latest on this card from the LSI.com website (both HBA BIOS and firmware).

>> The MD1220 is a 6Gbit/sec device. You may be better off with a matching
>> HBA -- Dell has certainly told us the MD1200-series is not intended for
>> use with the 3Gbit/sec HBAs. We're doing fine with the LSI SAS 9200-8e,
>> for example, when connecting to Dell MD1200s with the 2TB "nearline SAS"
>> disk drives.
>
> I was aware the MD1220 is a 6G device, but I figured that since our I/O
> throughput doesn't actually come close to saturating 3Gbit/sec, it would
> just operate at the lower speed and be OK. I guess it is something to
> look at if I run out of other options...

This was my mistake - this particular system has MD1120s attached to it. We have a mix of 1220s and 1120s, since we've been with Dell since the 1120s were the current model.

Just kicked off the system running with the same logging as before with this new firmware, so I'll see if this goes any better.

Regards,
--
Lachlan Mulcahy
Senior DBA, Marin Software Inc.
San Francisco, USA
Re: [zfs-discuss] Solaris Based Systems "Lock Up" - Possibly ZFS/memory related?
Hi Marion,

Thanks for your swift reply!

> Have you got the latest firmware on your LSI 1068E HBAs? These have been
> known to have lockups/timeouts when used with SAS expanders (disk
> enclosures) with incompatible firmware revisions, and/or with older mpt
> drivers.

I'll need to check that out -- I'm 90% sure that these are fresh out-of-box HBAs. Will try an upgrade there and see if we get any joy...

> The MD1220 is a 6Gbit/sec device. You may be better off with a matching
> HBA -- Dell has certainly told us the MD1200-series is not intended for
> use with the 3Gbit/sec HBAs. We're doing fine with the LSI SAS 9200-8e,
> for example, when connecting to Dell MD1200s with the 2TB "nearline SAS"
> disk drives.

I was aware the MD1220 is a 6G device, but I figured that since our I/O throughput doesn't actually come close to saturating 3Gbit/sec, it would just operate at the lower speed and be OK. I guess it is something to look at if I run out of other options...

> Last, are you sure it's memory-related? You might keep an eye on
> "arcstat.pl" output and see what the ARC sizes look like just prior to
> lockup. Also, maybe you can look up instructions on how to force a crash
> dump when the system hangs -- one of the experts around here could tell a
> lot from a crash dump file.

I'm starting to doubt that it is a memory issue now -- especially since I now have some results from my latest "test"...

Output of arcstat.pl looked like this just prior to the lock up:

    time  arcsz     c  mh%  mhit  hit%  hits  l2hit%  l2hits
19:57:36    24G   24G   94   161    61   194       1       1
19:57:41    24G   24G   96   174    62   213       0       0
19:57:46    23G   24G   94   161    62   192       1       1
19:57:51    24G   24G   96   169    63   205       0       0
19:57:56    24G   24G   95   169    61   206       0       0

^-- This is the very last line printed... I actually discovered and rebooted the machine via DRAC at around 20:44, so it had been in its bad state for around 1 hour.
Some snippets from the output some 20 minutes earlier show the point at which arcsz grew to reach the maximum:

    time  arcsz     c  mh%  mhit  hit%  hits  l2hit%  l2hits
19:36:45    21G   24G   95   152    58   177       0       0
19:37:00    22G   24G   95   156    57   182       0       0
19:37:15    22G   24G   95   159    59   185       0       0
19:37:30    23G   24G   94   153    58   178       0       0
19:37:45    23G   24G   95   169    59   195       0       0
19:38:00    24G   24G   95   160    59   187       0       0
19:38:25    24G   24G   96   151    58   177       0       0

So it seems that arcsz reaching the 24G maximum wasn't necessarily to blame, since the system operated for a good 20 minutes in this state.

I was also logging "vmstat 5" prior to the crash (though I forgot to include timestamps in my output) and these are the final lines recorded in that log:

 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr s0 s1 s2 s3   in   sy   cs us sy id
 0 0 0 25885248 18012208 71 2090 0 0 0 0 0 0 0 0 22 17008 210267 30229 1 5 94
 0 0 0 25884764 18001848 71 2044 0 0 0 0 0 0 0 0 25 14846 151228 25911 1 5 94
 0 0 0 25884208 17991876 71 2053 0 0 0 0 0 0 0 0  8 16343 185416 28946 1 5 93

So it seems there was some 17-18G free in the system when the lock up occurred. Curious...

I was also capturing some ARC info from mdb -k, and the output prior to the lock up was:

Monday, October 31, 2011 07:57:51 PM UTC
arc_no_grow     = 0
arc_tempreserve = 0 MB
arc_meta_used   = 4621 MB
arc_meta_limit  = 20480 MB
arc_meta_max    = 4732 MB

Monday, October 31, 2011 07:57:56 PM UTC
arc_no_grow     = 0
arc_tempreserve = 0 MB
arc_meta_used   = 4622 MB
arc_meta_limit  = 20480 MB
arc_meta_max    = 4732 MB

Looks like metadata was not primarily responsible for consuming all of that 24G of ARC in the arcstat.pl output...
Also, there seems to be nothing interesting in /var/adm/messages leading up to my rebooting:

Oct 31 18:42:57 mslvstdp02r ntpd[368]: [ID 702911 daemon.notice] frequency error 512 PPM exceeds tolerance 500 PPM
Oct 31 18:44:01 mslvstdp02r last message repeated 1 time
Oct 31 18:45:05 mslvstdp02r ntpd[368]: [ID 702911 daemon.notice] frequency error 512 PPM exceeds tolerance 500 PPM
Oct 31 18:46:09 mslvstdp02r last message repeated 1 time
Oct 31 18:47:23 mslvstdp02r ntpd[368]: [ID 702911 daemon.notice] frequency error 505 PPM exceeds tolerance 500 PPM
Oct 31 19:06:13 mslvstdp02r ntpd[368]: [ID 702911 daemon.notice] frequency error 505 PPM exceeds tolerance 500 PPM
Oct 31 19:09:27 mslvstdp02r last message repeated 4 times
Oct 31 19:25:04 mslvstdp02r ntpd[368]: [ID 702911 daemon.notice] frequency error 505 PPM exceeds tolerance 500 PPM
Oct 31 19:28:17 mslvstdp02r last m
Re: [zfs-discuss] Solaris Based Systems "Lock Up" - Possibly ZFS/memory related?
lmulc...@marinsoftware.com said:
> . . .
> The MySQL server is: Dell R710 / 80G memory with two daisy-chained MD1220
> disk arrays - 22 disks each - 600GB 10k RPM SAS drives. Storage
> controller: LSI, Inc. 1068E (JBOD)
>
> I have also seen similar symptoms on systems with MD1000 disk arrays
> containing 2TB 7200RPM SATA drives.
>
> The only thing of note that seems to show up in the /var/adm/messages
> file on this MySQL server is:
>
> Oct 31 18:24:51 mslvstdp02r scsi: [ID 243001 kern.warning] WARNING:
> /pci@0,0/pci8086,3410@9/pci1000,3080@0 (mpt0):
> Oct 31 18:24:51 mslvstdp02r mpt request inquiry page 0x89 for SATA
> target:58 failed!
> . . .

Have you got the latest firmware on your LSI 1068E HBAs? These have been known to have lockups/timeouts when used with SAS expanders (disk enclosures) with incompatible firmware revisions, and/or with older mpt drivers.

The MD1220 is a 6Gbit/sec device. You may be better off with a matching HBA -- Dell has certainly told us the MD1200-series is not intended for use with the 3Gbit/sec HBAs. We're doing fine with the LSI SAS 9200-8e, for example, when connecting to Dell MD1200s with the 2TB "nearline SAS" disk drives.

Last, are you sure it's memory-related? You might keep an eye on "arcstat.pl" output and see what the ARC sizes look like just prior to lockup. Also, maybe you can look up instructions on how to force a crash dump when the system hangs -- one of the experts around here could tell a lot from a crash dump file.

Regards,
Marion
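On forcing a crash dump when the box hangs: a sketch of the tunables commonly suggested for this (the module and symbol names are assumptions that vary by Solaris release and platform; verify against your system before use):

```
* /etc/system -- make a hard hang diagnosable:

* Panic (and take a crash dump) on an NMI, e.g. one injected from the
* DRAC / service processor:
set pcplusmp:apic_panic_on_nmi = 1

* Enable the kernel "deadman" timer, which panics the system if the
* clock stops advancing -- a classic hard-hang detector:
set snooping = 1
```

After the panic, savecore should write the dump under /var/crash for analysis with mdb, assuming dump devices are configured (see dumpadm).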
[zfs-discuss] Solaris Based Systems "Lock Up" - Possibly ZFS/memory related?
Hi Folks,

I have been having issues with Solaris kernel based systems "locking up" and am wondering if anyone else has observed a similar symptom before. Some information/background...

Systems the symptom has presented on: an NFS server (Nexenta Core 3.01) and a MySQL server (Sol 11 Express).

The issue presents itself as almost total unresponsiveness: I cannot SSH to the host any longer, and access on the local console (via Dell Remote Access Console) is also unresponsive. The only case where I have seen some level of responsiveness is on the MySQL server: I was able to connect to the server and issue extremely basic commands like SHOW PROCESSLIST, but anything else would just hang. I feel like this could be explained by the fact that MySQL keeps a thread cache (no need to allocate memory for a new thread on an incoming connection), and SHOW PROCESSLIST can be served almost entirely from already-allocated memory structures.

The NFS server has 48G physical memory and no specifically tuned ZFS settings in /etc/system. The MySQL server has 80G physical memory and has had a variety of ZFS tuning settings -- this is the system that I am now primarily focused on troubleshooting.

The primary cache for the MySQL data zpool is set to metadata only (InnoDB has its own buffer pool for data) and I have prefetch disabled, since InnoDB also does its own prefetching.

Originally, when the lock up was first observed, I had limited the ARC to 4G (to allow most memory to MySQL), but then I saw this lock up happen. I then tuned the server thinking I wasn't allowing ZFS enough breathing room -- I didn't realise how much memory metadata can really consume for a 20TB zpool! So I removed the ARC limit and set the InnoDB buffer pool to 54G, down from the previous setting of 64G... This should allow about 26G to the kernel and ZFS.

The server ran fine for a few days, but then the symptom showed up again...
I rebooted the machine, and interestingly, while MySQL was doing crash recovery, the system locked up yet again!

Hardware-wise we are using mostly Dell gear. The MySQL server is:

Dell R710 / 80G memory with two daisy-chained MD1220 disk arrays - 22 disks each - 600GB 10k RPM SAS drives
Storage controller: LSI, Inc. 1068E (JBOD)

I have also seen similar symptoms on systems with MD1000 disk arrays containing 2TB 7200RPM SATA drives.

The only thing of note that seems to show up in the /var/adm/messages file on this MySQL server is:

Oct 31 18:24:51 mslvstdp02r scsi: [ID 243001 kern.warning] WARNING: /pci@0,0/pci8086,3410@9/pci1000,3080@0 (mpt0):
Oct 31 18:24:51 mslvstdp02r     mpt request inquiry page 0x89 for SATA target:58 failed!
Oct 31 18:24:52 mslvstdp02r scsi: [ID 583861 kern.info] ses0 at mpt0: unit-address 58,0: target 58 lun 0
Oct 31 18:24:52 mslvstdp02r genunix: [ID 936769 kern.info] ses0 is /pci@0,0/pci8086,3410@9/pci1000,3080@0/ses@58,0
Oct 31 18:24:52 mslvstdp02r genunix: [ID 408114 kern.info] /pci@0,0/pci8086,3410@9/pci1000,3080@0/ses@58,0 (ses0) online
Oct 31 18:24:52 mslvstdp02r scsi: [ID 243001 kern.warning] WARNING: /pci@0,0/pci8086,3410@9/pci1000,3080@0 (mpt0):
Oct 31 18:24:52 mslvstdp02r     mpt request inquiry page 0x89 for SATA target:59 failed!
Oct 31 18:24:53 mslvstdp02r scsi: [ID 583861 kern.info] ses1 at mpt0: unit-address 59,0: target 59 lun 0
Oct 31 18:24:53 mslvstdp02r genunix: [ID 936769 kern.info] ses1 is /pci@0,0/pci8086,3410@9/pci1000,3080@0/ses@59,0
Oct 31 18:24:53 mslvstdp02r genunix: [ID 408114 kern.info] /pci@0,0/pci8086,3410@9/pci1000,3080@0/ses@59,0 (ses1) online

I'm thinking that the issue is memory related, so the current test I am running is:

ZFS tuneables in /etc/system:

# Limit the amount of memory the ARC cache will use
# See: http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Limiting_the_ARC_Cache
# Limit to 24G
set zfs:zfs_arc_max = 25769803776
# Limit metadata to 20G
set zfs:zfs_arc_meta_limit = 21474836480
# Disable ZFS prefetch - InnoDB does its own
set zfs:zfs_prefetch_disable = 1

MySQL memory: set the InnoDB buffer pool size to 44G (down another 10G from 54G). That should allow 44+24=68G for ARC and MySQL, and 12G for anything else that I haven't considered.

I am using arcstat.pl to collect/write stats on ARC size, hit ratio, requests, etc. to a file every 5 seconds, and vmstat also every 5 seconds. I'm hoping that should the issue present itself again, I can find a possible cause. But I'm really concerned about this issue - we want to make use of ZFS in production, and these seemingly inexplicable lock ups are not filling us with confidence :(

Has anyone seen similar things before, and do you have any suggestions for what else I should consider looking at?

Thanks and Regards,
--
Lachlan Mulcahy
Senior DBA, Marin Software Inc.
San Francisco, USA
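Since the vmstat log mentioned earlier in this thread lacked timestamps, here is a minimal sketch of a wrapper that prefixes each sampled line with a UTC timestamp so samples can be correlated after a hang (the function name and log path are my own, not from the thread):

```shell
#!/bin/sh
# tslog: read lines on stdin and write them out prefixed with a UTC
# timestamp, so vmstat/arcstat samples can be lined up after a lockup.
tslog() {
    while IFS= read -r line; do
        printf '%s %s\n' "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" "$line"
    done
}

# Demo on a fixed line; real usage would be something like:
#   vmstat 5 | tslog >> /var/tmp/vmstat.log &
printf 'r b w swap free\n' | tslog
```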
Re: [zfs-discuss] Log disk with all ssd pool?
On 10/28/2011 01:04 AM, Mark Wolek wrote:
> ... before the forum closed.

Did I miss something?

Karl
Re: [zfs-discuss] Poor relative performance of SAS over SATA drives
Thanks for the reply. Some background: the server is freshly installed, and right before running the tests the pools were newly created. Some comments below.

On 10/31/2011 10:33 PM, Paul Kraus wrote:
> A couple points in line below ...
>
> On Wed, Oct 26, 2011 at 10:56 PM, weiliam.hong wrote:
>> I have a fresh installation of OI151a:
>> - SM X8DTH, 12GB RAM, LSI 9211-8i (latest IT-mode firmware)
>> - pool_A : SG ES.2 Constellation (SAS)
>> - pool_B : WD RE4 (SATA)
>> - no settings in /etc/system
>>
>> Load generation via 2 concurrent dd streams:
>> dd if=/dev/zero of=/pool_A/bigfile bs=1024k count=100
>> dd if=/dev/zero of=/pool_B/bigfile bs=1024k count=100
>
> dd generates "straight line" data, all sequential.

Yes.

>>                              capacity     operations    bandwidth
>> pool                       alloc   free   read  write   read  write
>> -------------------------  -----  -----  -----  -----  -----  -----
>> pool_A                     15.5G  2.70T      0     50      0  6.29M
>>   mirror                   15.5G  2.70T      0     50      0  6.29M
>>     c7t5000C50035062EC1d0      -      -      0     62      0  7.76M
>>     c8t5000C50034C03759d0      -      -      0     50      0  6.29M
>> -------------------------  -----  -----  -----  -----  -----  -----
>> pool_B                     28.0G  1.79T      0  1.07K      0   123M
>>   mirror                   28.0G  1.79T      0  1.07K      0   123M
>>     c1t50014EE057FCD628d0      -      -      0  1.02K      0   123M
>>     c2t50014EE6ABB89957d0      -      -      0  1.02K      0   123M
>
> What does `iostat -xnM c7t5000C50035062EC1d0 c8t5000C50034C03759d0
> c1t50014EE057FCD628d0 c2t50014EE6ABB89957d0 1` show? That will give you
> much more insight into the OS <-> drive interface.

iostat numbers are similar. I will try to get the figures; it's a bit hard now, as the hardware has been taken off my hands.

> What does `fsstat /pool_A /pool_B 1` show? That will give you much more
> insight into the application <-> filesystem interface. In this case
> "application" == "dd". In my opinion, `zpool iostat -v` is somewhat
> limited in what you can learn from it. The only thing I use it for these
> days is to see the distribution of data and I/O between vdevs.
>
>> Questions:
>> 1. Why do the SG SAS drives degrade to <10 MB/s while the WD RE4 drives
>> remain consistent at >100 MB/s after 10-15 min?
>
> Something changes to slow them down? Sorry for the obvious retort :-)
> See what iostat has to say.
> If the %b column is climbing, then you are slowly saturating the drives
> themselves, for example.

There is no other workload or user using this system. The system is freshly installed and booted, and the pools are newly created.

>> 2. Why do the SG SAS drives show only 70+ MB/s when the published
>> figures are >100 MB/s?
>
> "published" where?

http://www.seagate.com/www/en-au/products/enterprise-hard-drives/constellation-es/constellation-es-2/#tTabContentSpecifications

> What does a "dd" to the device itself (no ZFS, no FS at all) show? For
> example, `dd if=/dev/zero of=/dev/dsk/c7t5000C50035062EC1d0s0 bs=1024k
> count=100` (after you destroy the zpool and use format to create an s0
> of the entire disk). This will test the device driver / HBA / drive with
> no FS or volume manager involved. Use iostat to watch the OS <-> drive
> interface.

Perhaps the test below is useful for understanding the observation.

dd test on slice 0:
dd if=/dev/zero of=/dev/rdsk/c1t5000C50035062EC1d0s0 bs=1024k

                        extended device statistics
  r/s    w/s   kr/s      kw/s  wait  actv  wsvc_t  asvc_t  %w  %b  device
  0.0  155.4    0.0  159129.7   0.0   1.0     0.0     6.3   0  97  c1
  0.0  155.4    0.0  159129.7   0.0   1.0     0.0     6.3   0  97  c1t5000C50035062EC1d0  <== this is the best case

dd test on slice 6:
dd if=/dev/zero of=/dev/rdsk/c1t5000C50035062EC1d0s6 bs=1024k

                        extended device statistics
  r/s    w/s   kr/s      kw/s  wait  actv  wsvc_t  asvc_t  %w  %b  device
  0.0   21.4    0.0   21913.6   0.0   1.0     0.0    46.6   0 100  c1
  0.0   21.4    0.0   21913.6   0.0   1.0     0.0    46.6   0 100  c1t5000C50035062EC1d0  <== only 20+ MB/s !!!

Partition table info:

Part        Tag   Flag    First Sector        Size    Last Sector
  0         usr    wm              256    100.00GB      209715455
  1  unassigned    wm                0           0              0
  2  unassigned    wm                0           0              0
  3  unassigned    wm                0           0              0
  4  unassigned    wm                0           0              0
  5  unassigned    wm                0           0              0
  6         usr    wm       5650801295    100.00GB     5860516749
  8    reserved    wm       5860516751      8.00MB     5860533134

Referring to pg 18 of http://www.seagate.com/staticfiles/support/docs/manual/enterprise/Constellation%203_5%20in/100628615f.pdf, the transfer rate is supposed to range from 68 to 155 MB/s. Why are the inner cylinders showing only 20+ MB/s?
Am I testing or understanding this wrongly?

3. All 4 drives are connected to a single HBA, so I assume the mpt_sas driver is used. Are SAS and SATA drives handled differently?
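To turn raw-device dd timings like the slice tests above into MB/s figures, a tiny helper can do the arithmetic (this is my own sketch; the `mbps` name and the ptime usage are assumptions, not from the thread):

```shell
#!/bin/sh
# mbps: given a byte count and elapsed seconds, print MB/s (MiB-based).
# E.g. for dd count=100 bs=1024k timed with ptime(1):
#   ptime dd if=/dev/zero of=/dev/rdsk/...s6 bs=1024k count=100
#   mbps $((100 * 1024 * 1024)) <real-seconds>
mbps() {
    awk -v b="$1" -v s="$2" 'BEGIN { printf "%.1f\n", b / s / 1048576 }'
}

# 100 MiB in 5 seconds -> 20.0 MB/s (the slice-6 ballpark above)
mbps 104857600 5
```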
Re: [zfs-discuss] (Incremental) ZFS SEND at sub-snapshot level
On Sat, Oct 29, 2011 at 1:57 PM, Jim Klimov wrote:
> I am catching up with some 500 posts that I skipped this summer, and came
> up with a new question. In short, is it possible to add "restartability"
> to ZFS SEND, for example by adding artificial snapshots (of configurable
> increment size) into already existing datasets [too large to be zfs-sent
> successfully as one chunk of stream data]?

We addressed this by decreasing our snapshot interval from 1 day to 1 hour. We rarely have a snapshot bigger than a few GB now. I keep meaning to put together a snapshot script that takes a new snapshot when the amount of changed data increases to a certain point (for example, take a snapshot whenever the snapshot would contain 250 MB of data). Not enough round tuits with all the other broken stuff to fix :-(

--
{1-2-3-4-5-6-7-}
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Coordinator, Schenectady Light Opera Company ( http://www.sloctheater.org/ )
-> Technical Advisor, RPI Players
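The "snapshot once ~250 MB has changed" idea can be sketched roughly like this. Everything here is an assumption of mine, not Paul's actual script: the dataset name is hypothetical, and it leans on a `written` property (bytes changed since the last snapshot) that not every ZFS build exposes.

```shell
#!/bin/sh
# Sketch: take a snapshot once the amount of changed data crosses a
# threshold, instead of on a fixed timer.
THRESHOLD=$((250 * 1024 * 1024))   # 250 MB of changed data

# should_snap BYTES: succeed when BYTES meets or exceeds the threshold.
should_snap() {
    [ "$1" -ge "$THRESHOLD" ]
}

# Assumed polling loop (not run here; requires a ZFS with `written`):
#   while :; do
#       w=$(zfs get -Hp -o value written tank/data)
#       should_snap "$w" &&
#           zfs snapshot "tank/data@auto-$(date -u +%Y%m%d%H%M%S)"
#       sleep 60
#   done
```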
Re: [zfs-discuss] Poor relative performance of SAS over SATA drives
A couple points in line below ...

On Wed, Oct 26, 2011 at 10:56 PM, weiliam.hong wrote:
> I have a fresh installation of OI151a:
> - SM X8DTH, 12GB RAM, LSI 9211-8i (latest IT-mode firmware)
> - pool_A : SG ES.2 Constellation (SAS)
> - pool_B : WD RE4 (SATA)
> - no settings in /etc/system
>
> Load generation via 2 concurrent dd streams:
> dd if=/dev/zero of=/pool_A/bigfile bs=1024k count=100
> dd if=/dev/zero of=/pool_B/bigfile bs=1024k count=100

dd generates "straight line" data, all sequential.

>                              capacity     operations    bandwidth
> pool                       alloc   free   read  write   read  write
> -------------------------  -----  -----  -----  -----  -----  -----
> pool_A                     15.5G  2.70T      0     50      0  6.29M
>   mirror                   15.5G  2.70T      0     50      0  6.29M
>     c7t5000C50035062EC1d0      -      -      0     62      0  7.76M
>     c8t5000C50034C03759d0      -      -      0     50      0  6.29M
> -------------------------  -----  -----  -----  -----  -----  -----
> pool_B                     28.0G  1.79T      0  1.07K      0   123M
>   mirror                   28.0G  1.79T      0  1.07K      0   123M
>     c1t50014EE057FCD628d0      -      -      0  1.02K      0   123M
>     c2t50014EE6ABB89957d0      -      -      0  1.02K      0   123M

What does `iostat -xnM c7t5000C50035062EC1d0 c8t5000C50034C03759d0 c1t50014EE057FCD628d0 c2t50014EE6ABB89957d0 1` show? That will give you much more insight into the OS <-> drive interface.

What does `fsstat /pool_A /pool_B 1` show? That will give you much more insight into the application <-> filesystem interface. In this case "application" == "dd".

In my opinion, `zpool iostat -v` is somewhat limited in what you can learn from it. The only thing I use it for these days is to see the distribution of data and I/O between vdevs.

> Questions:
> 1. Why do the SG SAS drives degrade to <10 MB/s while the WD RE4 drives
> remain consistent at >100 MB/s after 10-15 min?

Something changes to slow them down? Sorry for the obvious retort :-) See what iostat has to say. If the %b column is climbing, then you are slowly saturating the drives themselves, for example.

> 2. Why do the SG SAS drives show only 70+ MB/s when the published figures
> are >100 MB/s?

"published" where?

What does a "dd" to the device itself (no ZFS, no FS at all) show?
For example, `dd if=/dev/zero of=/dev/dsk/c7t5000C50035062EC1d0s0 bs=1024k count=100` (after you destroy the zpool and use format to create an s0 of the entire disk). This will test the device driver / HBA / drive with no FS or volume manager involved. Use iostat to watch the OS <-> drive interface.

> 3. All 4 drives are connected to a single HBA, so I assume the mpt_sas
> driver is used. Are SAS and SATA drives handled differently?

I assume there are (at least) four ports on the HBA? I assume this from the c7, c8, c1, c2 device names. That means that the drives should _not_ be affecting each other. As another poster mentioned, the behavior of the interface chip may change based on which drives are seeing I/O, but I doubt that would be this big a factor.

> This is a test server, so any ideas to try and help me understand are
> greatly appreciated.

What do real benchmarks (iozone, filebench, orion) show?
Re: [zfs-discuss] zfs destroy snapshot runs out of memory bug
On Mon, Oct 31, 2011 at 9:07 AM, Jim Klimov wrote:
> 2011-10-31 16:28, Paul Kraus wrote:
>> Oracle has provided a loaner system with 128 GB RAM, and it took 75 GB
>> of RAM to destroy the problem snapshot. I had not yet posted a summary,
>> as we are still working through the overall problem (we tripped over
>> this on the replica; now we are working on it on the production copy).
>
> Good for you ;)
> Does Oracle loan such systems free to support their own foul-ups?
> Or do you have to pay a lease anyway? ;)

If you are paying for a support contract, _demand_ what is needed to fix the problem. If you are not paying for support, well, then you are on your own (as I believe the license says). Maybe I've been in this business longer than many of the folks here, but I both expect software to have bugs and do NOT expect commercial software vendors to provide fixes for free.
Re: [zfs-discuss] zfs destroy snapshot runs out of memory bug
2011-10-31 16:28, Paul Kraus wrote:
> How big is / was the snapshot and dataset? I am dealing with a 7 TB
> dataset and a 2.5 TB snapshot on a system with 32 GB RAM.

I had a smaller-scale problem, with datasets and snapshots sized several hundred GB, but on an 8Gb RAM system. So proportionally it seems similar ;)

I have deduped data on the system, which adds to the strain of dataset removal. The plan was to save some archive data there, with few to no removals planned. But during testing of different dataset layout hierarchies, things got out of hand ;)

I've also had an approx. 4Tb dataset to destroy (a volume where I kept another pool), but armed with the knowledge of how things are expected to fail, I did its cleanup in small steps and saw very few (perhaps no?) hangs while evacuating the data to the toplevel pool (which contained this volume).

> Oracle has provided a loaner system with 128 GB RAM, and it took 75 GB of
> RAM to destroy the problem snapshot. I had not yet posted a summary, as
> we are still working through the overall problem (we tripped over this on
> the replica; now we are working on it on the production copy).

Good for you ;)
Does Oracle loan such systems free to support their own foul-ups?
Or do you have to pay a lease anyway? ;)
Re: [zfs-discuss] zfs destroy snapshot runs out of memory bug
On Sun, Oct 30, 2011 at 5:13 PM, Jim Klimov wrote:
>> I know there was (is?) a bug where a zfs destroy of a large snapshot
>> would run a system out of kernel memory, but searching the
>
> Symptoms are like what you've described, including the huge scan rate
> just before the system dies (becomes unresponsive). Also, if you try
> running with "vmstat 1" you can see that in the last few seconds of
> uptime the system goes from several hundred free MBs (or even over a GB
> of free RAM) down to under 32Mb very quickly - consuming hundreds of MBs
> per second.

That is the traditional symptom of a Solaris kernel memory bug :-)

> Unlike your system, my pool started with ZFSv28 (oi_148a), so any
> bugfixes and on-disk layout fixes relevant for ZFSv26 patches are in
> place already.

Ahhh, but jumping to the end...

> In my case I saw that between reboots and import attempts this counter
> went down by some 3 million blocks every uptime, and after a couple of
> stressful weeks the destroyed dataset was gone and the pool just worked
> on and on.

So your pool does have the fix. With zpool 22, NO PROGRESS is made at all with each boot-import-hang cycle. I have an mdb command that I got from Oracle support to determine the size of the snapshot that is being destroyed. The bug in 22 is that a snapshot destroy is committed as a single TXG. In 26 this is fixed (I assume there are on-disk checkpoints to permit a snapshot to be destroyed in multiple TXGs).

How big is / was the snapshot and dataset? I am dealing with a 7 TB dataset and a 2.5 TB snapshot on a system with 32 GB RAM. Oracle has provided a loaner system with 128 GB RAM, and it took 75 GB of RAM to destroy the problem snapshot. I had not yet posted a summary, as we are still working through the overall problem (we tripped over this on the replica; now we are working on it on the production copy).