Re: [zfs-discuss] zfs under medium load causes SMB to delay writes
This is not the appropriate group/list for this message. Crossposting to zfs-discuss (where it perhaps primarily belongs) and to cifs-discuss, which also relates.

Hi, I have an I/O load issue and after days of searching wanted to know if anyone has pointers on how to approach this. My ZFS system (raidz3, 8 x 2TB drives, all OK) has been stable for a year, but just started to cause problems when I introduced a new backup script that puts a medium I/O load on it. The script simply tars up a few filesystems and md5sums the tarball, so it can be copied to another system for backup off the OpenSolaris box. The commands are:

  tar -cf /tank/[otherfilesystem]/file.tar /tank/[filesystem]/.zfs/snapshot/[snapshot]
  md5sum -b /tank/[otherfilesystem]/file.tar > /tank/[otherfilesystem]/file.md5sum

These two commands obviously cause heavy read/write I/O, because the 8 drives are directly reading and writing a large amount of data as fast as the system can go. That by itself is fine, and OpenSolaris keeps functioning. The problem is that I host VMware images on another PC, which accesses them on this ZFS box over SMB, and during this high-I/O period the VMware guests are crashing.

What I think is happening is that during the backup, ZFS delays reads/writes to the VMware images long enough that VMware freezes the guest machines. When the backup script is not running, the VMware guests are fine, and have been fine for a year (the setup has been rock solid).

Any idea how to address this? I thought of putting the relevant filesystem (tank/vmware) at a higher priority for reads/writes, but haven't figured out how. Another option would be to deprioritize the backup somehow. Any pointers would be appreciated.

Thanks, Tom

-- This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
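One common way to deprioritize a backup like Tom's is to run it at a low scheduling priority. A minimal sketch, using /tmp stand-ins for the tank paths (the real snapshot paths in the post are placeholders, and priocntl is mentioned here as an assumption, not something from the thread):

```shell
# Minimal sketch of a deprioritized backup. The /tmp paths are stand-ins for
# /tank/[filesystem]/.zfs/snapshot/[snapshot] and /tank/[otherfilesystem].
SRC=/tmp/backup-src
DEST=/tmp/backup-dest
mkdir -p "$SRC" "$DEST"
echo "sample data" > "$SRC/data.txt"

# nice -n 19 lowers CPU priority; on Solaris, priocntl(1) could instead move
# the process into a lower scheduling class. Note this does not directly
# throttle disk I/O, so it only helps if the backup is CPU-paced at all.
nice -n 19 tar -cf "$DEST/file.tar" -C "$SRC" .
nice -n 19 md5sum -b "$DEST/file.tar" > "$DEST/file.md5sum"

ls "$DEST"
```

Whether this is enough depends on where the contention actually is; if ZFS itself is starving the SMB reads, lowering the backup's CPU priority may not help much.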
Re: [zfs-discuss] Online zpool expansion feature in Solaris 10 9/10
An update to this query. Cindy did look into this for me, and found that:

  "I have confirmation that although the ZFS autoexpand property and supporting bugs like 475340 are in the Solaris 10 9/10 release, the supporting sd driver work to update the EFI label is not. In previous Solaris 10 releases, you could work around the LUN expansion problem by either exporting/importing the pool or rebooting the system. My sense is that the integration of the autoexpand property is now restricting this workaround. We are considering how to resolve this but it might take some time."

So, at the end of the day I can make the LUN larger via my SAN management interface, but the sd driver is unaware of this change, so the EFI label never gets updated and format continues to show the old size. I realise there are tricks with labelling under the format command using autoconfigure, but we wanted to avoid that unsupported method. Also, as Cindy mentions, a zpool export/import fails to update the EFI label. Apparently this has worked in the past, although I'd never tried it.

As a result, the best method to extend a zpool in our case is to create a second, larger LUN and attach it to the zpool to form a mirror, then detach the old, smaller device, and finally run zpool online -e to actually expand the zpool.

James.
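The mirror-swap sequence James describes can be sketched as follows. The pool name (tank) and device names (c1t0d0 for the old LUN, c1t1d0 for the new, larger one) are hypothetical, and the block is guarded so it does nothing on a system without such a pool:

```shell
# Hypothetical sketch of extending a pool by mirror-swap; pool and device
# names are placeholders, not from the post.
status="skipped (no pool named tank here)"
if zpool list tank >/dev/null 2>&1; then
  zpool attach tank c1t0d0 c1t1d0   # mirror the old LUN onto the larger one
  # ...wait for the resilver to complete (watch zpool status tank)...
  zpool detach tank c1t0d0          # drop the old, smaller device
  zpool online -e tank c1t1d0       # expand the pool to the new LUN's size
  status="expanded"
fi
echo "$status"
```

The key point is that the detach only happens after the resilver onto the larger device has finished; detaching early would lose the only complete copy.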
Re: [zfs-discuss] Apparent SAS HBA failure-- now what?
Wow, sounds familiar - binderedondat. I thought it was just when using expanders... guess it's just anything 1068-based. Lost a 20TB pool to having the controller basically just hose up what it was doing and write scragged data to the disk.

1) The suggestion to use the serial number of the drive to trace back what's connected to what is good, assuming you can pull drives to look at their serial numbers.

2) One thing I've done over the years, given that I often use the same motherboards, is physically map out the PCI slot addresses:

  /dev/cfg/c2 ../../devices/p...@0,0/pci8086,3...@3/pci1000,3...@0:scsi
  /dev/cfg/c3 ../../devices/p...@7a,0/pci8086,3...@5/pci1000,3...@0/ip...@f:scsi
  /dev/cfg/c4 ../../devices/p...@7a,0/pci8086,3...@5/pci1000,3...@0/ip...@v0:scsi
  /dev/cfg/c5 ../../devices/p...@7a,0/pci8086,3...@7/pci1000,3...@0/ip...@f:scsi
  /dev/cfg/c6 ../../devices/p...@7a,0/pci8086,3...@7/pci1000,3...@0/ip...@v0:scsi
  /dev/cfg/c7 ../../devices/p...@7a,0/pci8086,3...@9/pci1000,3...@0:scsi

  ^^^ this part will correspond with a physical slot
  ^ if you have a SM dual-IOH board, these represent the two IOH-36s
  ^ on a single-IOH board, I've noted that this often corresponds to the physical slot number (it's the unit-address from DDI)

So far, it's involved sticking a card in a slot, rebooting, reconfiguring, seeing what address it's at, and noting it down - or other forms of reverse engineering. Handy to have occasionally. If you're doing a BYO, taking the time up front to figure this out is a Good Idea.

3) Get a copy of lsiutil for Solaris (available from LSI's site) - it's an easy way to check out the controller and see whether it's there and whether it sees the drives. (There is a newer version of lsiutil that supports the 2008s... strangely, it's not available from the LSI site. Their tech support didn't even know it existed when I asked. I got my copy off someone on hardforum.)

4) Things you didn't want to know: the LSI1068 actually has a very small write cache on board.
So if you manage a certain set of situations (namely, setting the device I/O timeout in the BIOS to something other than 0, then having a SATA drive blow up in a certain way such that it hangs for longer than the timeout you set), the mpt driver (it seems) can get impatient and re-initialize the controller - or that's what it looks like. Great way to scrag a volume. :(

5) Your basic plan seems sound.

> Message: 1
> Date: Sat, 06 Nov 2010 13:27:08 -0500
> From: Dave Pooser dave@alfordmedia.com
> To: zfs-discuss@opensolaris.org
> Subject: [zfs-discuss] Apparent SAS HBA failure-- now what?
> Message-ID: c8fb082c.34460%dave@alfordmedia.com
> Content-Type: text/plain; charset=US-ASCII
>
> First question-- is there an easy way to identify which controller is c10?
>
> Second question-- What is the best way to handle replacement (of either the
> bad controller, or of all three controllers if I can't identify the bad
> one)? I was thinking that I should be able to shut the server down, remove
> the controller(s), install the replacement controller(s), check to see that
> all the drives are visible, run zpool clear for each pool, and then do
> another scrub to verify the problem has been resolved. Does that sound like
> a good plan?
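Dave's replacement plan can be sketched as a short script. The pool names here are hypothetical, and each step is guarded so the commands do nothing where the pools don't exist:

```shell
# After swapping the HBA(s) and confirming all drives are visible, clear the
# logged errors and scrub each pool to verify. Pool names are hypothetical.
checked=0
for pool in tank backup; do
  if zpool list "$pool" >/dev/null 2>&1; then
    zpool clear "$pool"       # drop the error counters left by the bad HBA
    zpool scrub "$pool"       # re-read and verify everything on the pool
    zpool status -v "$pool"   # watch for the scrub to complete cleanly
  fi
  checked=$((checked + 1))
done
echo "pools considered: $checked"
```

A scrub that completes with zero new errors after the hardware swap is reasonable evidence the fault was in the controller rather than the drives.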
Re: [zfs-discuss] Apparent SAS HBA failure-- now what?
On 8/11/10 10:21 AM, Jeff Bacon wrote:
> Wow, sounds familiar - binderedondat. I thought it was just when using
> expanders... guess it's just anything 1068-based. Lost a 20TB pool to
> having the controller basically just hose up what it was doing and write
> scragged data to the disk.
>
> 1) The suggestion to use the serial number of the drive to trace back
> what's connected to what is good, assuming you can pull drives to look at
> their serial numbers.

Except that this could cause more problems if you happen to pull the wrong one in the middle of a resilver operation. Or anything, really.

> 2) One thing I've done over the years, given that I often use the same
> motherboards, is physically map out the PCI slot addresses:
>
>   /dev/cfg/c2 ../../devices/p...@0,0/pci8086,3...@3/pci1000,3...@0:scsi
>   [...]
>   /dev/cfg/c7 ../../devices/p...@7a,0/pci8086,3...@9/pci1000,3...@0:scsi
>
> So far, it's involved sticking a card in a slot, rebooting, reconfiguring,
> seeing what address it's at, and noting it down - or other forms of reverse
> engineering. Handy to have occasionally. If you're doing a BYO, taking the
> time up front to figure this out is a Good Idea.

This is what FMA's libtopo solves for you.
Fairly well, too:

# /usr/lib/fm/fmd/fmtopo -V hc://:product-id=Sun-Ultra-40-M2-Workstation:server-id=blinder:chassis-id=0802FMY00N/motherboard=0/hostbridge=1/pciexrc=1/pciexbus=2/pciexdev=0/pciexfn=0
group: protocol          version: 1   stability: Private/Private
  resource     fmri      hc://:product-id=Sun-Ultra-40-M2-Workstation:server-id=blinder:chassis-id=0802FMY00N/motherboard=0/hostbridge=1/pciexrc=1/pciexbus=2/pciexdev=0/pciexfn=0
  label        string    PCIE0 Slot
  FRU          fmri      hc://:product-id=Sun-Ultra-40-M2-Workstation:server-id=blinder:chassis-id=0802FMY00N/motherboard=0/hostbridge=1/pciexrc=1/pciexbus=2/pciexdev=0
  ASRU         fmri      dev:p...@0,0/pci10de,3...@a/pci1000,3...@0
group: authority         version: 1   stability: Private/Private
  product-id   string    Sun-Ultra-40-M2-Workstation
  chassis-id   string    0802FMY00N
  server-id    string    blinder
group: io                version: 1   stability: Private/Private
  dev          string    /p...@0,0/pci10de,3...@a/pci1000,3...@0
  driver       string    mpt
  module       fmri      mod:///mod-name=mpt/mod-id=57
group: pci               version: 1   stability: Private/Private
  device-id    string    58
  extended-capabilities  string    pciexdev
  class-code   string    1
  vendor-id    string    1000
  assigned-addresses     uint32[]  [ 2164391952 0 16384 0 256 2197946388 0 2686517248 0 16384 2197946396 0 2686451712 0 65536 ]

Note the label property.

> 3) Get a copy of lsiutil for Solaris (available from LSI's site) - it's an
> easy way to check out the controller and see whether it's there and whether
> it sees the drives. (There is a newer version of lsiutil that supports the
> 2008s... strangely, it's not available from the LSI site. Their tech
> support didn't even know it existed when I asked. I got my copy off someone
> on hardforum.)

Sigh. Sun (and now Oracle) didn't distribute lsiutil due at least in part to the likelihood of customers killing their HBAs. Which, having used lsiutil (to recover from failed operations), is depressingly easy. The replacement is sas2ircu. I believe the same caveats apply.
> 4) Things you didn't want to know: the LSI1068 actually has a very small
> write cache on board. So if you manage a certain set of situations (namely,
> setting the device I/O timeout in the BIOS to something other than 0, then
> having a SATA drive blow up in a certain way such that it hangs for longer
> than the timeout you set), the mpt driver (it seems) can get impatient and
> re-initialize the controller - or that's what it looks like. Great way to
> scrag a volume. :(

The 1068 also has a limitation of 122 devices for its logical target-id concept. But we don't talk about that in polite company :-)

Please, go and have a poke around the output from libtopo. I think you'll be pleasantly surprised at what you can discover with it.

McB
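A starting point for poking around libtopo, assuming the fmtopo path shown earlier in the thread; the block is guarded so it's a no-op on systems without FMA:

```shell
# fmtopo lives under /usr/lib/fm/fmd on (Open)Solaris. With -V it walks the
# topology and prints each node's properties, including the slot label that
# maps a controller back to a physical slot.
FMTOPO=/usr/lib/fm/fmd/fmtopo
if [ -x "$FMTOPO" ]; then
  # Count how many nodes carry a label property (i.e. map to a slot or bay).
  msg="nodes with a label property: $("$FMTOPO" -V 2>/dev/null | grep -c label || true)"
else
  msg="fmtopo not present on this system"
fi
echo "$msg"
```

From there, grepping the verbose output for a controller's device path gives you the matching label string without any pull-a-card-and-reboot archaeology.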