Re: [zfs-discuss] zfs under medium load causes SMB to delay writes

2010-11-07 Thread Richard L. Hamilton
This is not the appropriate group/list for this message.
Crossposting to zfs-discuss (where it perhaps primarily
belongs) and to cifs-discuss, which also relates.

 Hi,
 
 I have an I/O load issue and after days of searching
 wanted to know if anyone has pointers on how to
 approach this.
 
 My ZFS system (raidz3 across 8 2TB drives, all healthy)
 has been stable for a year, but started causing problems
 when I introduced a new backup script that puts a medium
 I/O load on it. The script simply tars up a few
 filesystems and md5sums the tarball, so the result can be
 copied to another machine for backup off the OpenSolaris
 box. The commands are:
 
 tar -c /tank/[filesystem]/.zfs/snapshot/[snapshot] > /tank/[otherfilesystem]/file.tar
 md5sum -b /tank/[otherfilesystem]/file.tar > file.md5sum
 
 These two commands obviously cause heavy read/write I/O,
 because all 8 drives are reading and writing a large
 amount of data as fast as the system can go. That by
 itself is fine, and OpenSolaris keeps working.
 
 The problem is that I host VMware images on this ZFS box,
 and another PC running VMware accesses them over SMB.
 During these high-I/O periods the VMware guests crash.
 
 What I think is happening is that during the high-I/O
 backup, ZFS delays reads and writes to the VMware images,
 and that delay quickly causes VMware to freeze the guest
 machines.
 
 When the backup script is not running, the VMware guests
 are fine, and have been fine for a year (the setup has
 been rock solid).
 
 Any idea how to address this? I'd thought of giving the
 relevant filesystem (tank/vmware) a higher priority for
 reads and writes, but haven't figured out how. Another
 way would be to deprioritize the backup somehow.
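 
 A minimal sketch of the deprioritizing option, assuming
 the same tar/md5sum commands as above (an illustration,
 not a tested recipe):
 
 # run the backup at the lowest CPU scheduling priority;
 # this only lowers CPU priority, so it may ease but not
 # eliminate I/O contention with the SMB workload
 nice -n 19 tar -c /tank/[filesystem]/.zfs/snapshot/[snapshot] > /tank/[otherfilesystem]/file.tar
 nice -n 19 md5sum -b /tank/[otherfilesystem]/file.tar > file.md5sum
 
 Solaris also offers finer-grained control via priocntl(1)
 (e.g. running the backup in the FX scheduling class at
 priority 0) if nice alone doesn't help enough.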
 
 Any pointers would be appreciated.
 
 Thanks,
 Tom


Re: [zfs-discuss] Online zpool expansion feature in Solaris 10 9/10

2010-11-07 Thread James Patterson
An update to this query.  Cindy did look into this for me, and found that:

[i]I have confirmation that although the ZFS autoexpand property and supporting 
bug fixes like 475340 are in the Solaris 10 9/10 release, the supporting sd driver 
work to update the EFI label is not.
In previous Solaris 10 releases, you could work around the LUN expansion problem 
by either exporting/importing the pool or rebooting the system. My sense is that 
the integration of the autoexpand property is now restricting this workaround.
We are considering how to resolve this but it might take some time.[/i]

So, at the end of the day I can make the LUN larger via my SAN management 
interface, but the sd driver is unaware of the change, so the EFI label never 
gets updated and format continues to show the old size. I realise there are 
tricks with relabelling under the format command using autoconfigure, but we 
wanted to avoid that unsupported method.

Also, as Cindy mentions, a zpool export/import fails to update the EFI label.  
Apparently this has worked in the past, although I'd never tried it.

As a result, the best method to extend a zpool in our case is to create a 
second, larger LUN and attach it to the existing device to form a mirror, then 
detach the old, smaller device and run zpool online -e to actually expand the 
pool.
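
A rough sketch of that sequence (the pool and device names here are 
placeholders, not from our setup):

# attach the new, larger LUN as a mirror of the existing one
zpool attach tank c1t0d0 c1t1d0
# wait for the resilver to finish (watch zpool status), then
# drop the old, smaller LUN out of the mirror
zpool detach tank c1t0d0
# expand the pool onto the new device's full capacity
zpool online -e tank c1t1d0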

James.


Re: [zfs-discuss] Apparent SAS HBA failure-- now what?

2010-11-07 Thread Jeff Bacon
Wow, sounds familiar - binderedondat. I thought it was just when using
expanders... guess it's just anything 1068-based. Lost a 20TB pool to
having the controller basically just hose up what it was doing and write
scragged data to the disk. 

1) The suggestion of using the serial number of the drive to trace back
what's connected to what is a good one, assuming you can pull drives to
look at their serial numbers. 
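
(If pulling drives isn't practical, the serial numbers can usually be read
in software; a sketch - the output format varies by driver:

# list each cN disk with vendor, product and serial number
iostat -En

and match the serial numbers against the devices in zpool status.)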

2) One thing I've done over the years, given that I often use the same
motherboards, is to physically map out the PCI slot addresses - 

/dev/cfg/c2 ../../devices/p...@0,0/pci8086,3...@3/pci1000,3...@0:scsi
/dev/cfg/c3
../../devices/p...@7a,0/pci8086,3...@5/pci1000,3...@0/ip...@f:scsi
/dev/cfg/c4
../../devices/p...@7a,0/pci8086,3...@5/pci1000,3...@0/ip...@v0:scsi
/dev/cfg/c5
../../devices/p...@7a,0/pci8086,3...@7/pci1000,3...@0/ip...@f:scsi
/dev/cfg/c6
../../devices/p...@7a,0/pci8086,3...@7/pci1000,3...@0/ip...@v0:scsi
/dev/cfg/c7 ../../devices/p...@7a,0/pci8086,3...@9/pci1000,3...@0:scsi
(In these paths, the pci8086,3...@N component is the part that corresponds
to a physical slot; on a SM dual-IOH board, the p...@0,0 and p...@7a,0
prefixes represent the two IOH-36s; and on a single-IOH board I've noted
that the @N unit-address (from DDI) often corresponds to the physical slot
number.)

So far, working this out has involved sticking a card in a slot, rebooting,
reconfiguring, seeing what address it comes up at, and noting it down - or
other forms of reverse engineering. Handy to have occasionally. If you're
doing a BYO build, taking the time up front to figure this out is a Good
Idea. 
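
(For reference, a listing like the one above can be pulled from a live
system with, for example:

# show which cN controller maps to which device path
ls -l /dev/cfg
# show attachment-point status for each controller and its disks
cfgadm -al

though you still need the slot-address knowledge above to turn the paths
into physical slots.)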

3) Get a copy of lsiutil for Solaris (available from LSI's site) - it's an
easy way to check out the controller and see whether it's present and
whether it sees the drives.

(There is a newer version of lsiutil that supports the 2008s...
strangely, it's not available from the LSI site. Their tech support
didn't even know it existed when I asked. I got my copy off someone on
hardforum.) 

4) Things you didn't want to know: the LSI 1068 actually has a very small
on-board write cache. So if you hit a certain combination of circumstances
(namely, setting the device I/O timeout in the BIOS to something other than
0, then having a SATA drive blow up in a way that makes it hang for longer
than that timeout), the mpt driver can apparently get impatient and
re-initialize the controller - or that's what it looks like. Great way to
scrag a volume. :(
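
A quick check for whether you've hit this (the exact message wording varies
by driver version) is to scan the system log for mpt warnings and resets:

# look for mpt driver complaints around the time of the incident
grep -i mpt /var/adm/messages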
 
5) Your basic plan seems sound.


 Message: 1
 Date: Sat, 06 Nov 2010 13:27:08 -0500
 From: Dave Pooser dave@alfordmedia.com
 To: zfs-discuss@opensolaris.org
 Subject: [zfs-discuss] Apparent SAS HBA failure-- now what?
 Message-ID: c8fb082c.34460%dave@alfordmedia.com
 Content-Type: text/plain; charset=US-ASCII
 
 First question-- is there an easy way to identify which controller is c10?
 Second question-- What is the best way to handle replacement (of either
 the bad controller or of all three controllers if I can't identify the bad
 controller)? I was thinking that I should be able to shut the server down,
 remove the controller(s), install the replacement controller(s), check to
 see that all the drives are visible, run zpool clear for each pool and
 then do another scrub to verify the problem has been resolved. Does that
 sound like a good plan?



Re: [zfs-discuss] Apparent SAS HBA failure-- now what?

2010-11-07 Thread McBofh

On  8/11/10 10:21 AM, Jeff Bacon wrote:

Wow, sounds familiar - binderedondat. I thought it was just when using
expanders... guess it's just anything 1068-based. Lost a 20TB pool to
having the controller basically just hose up what it was doing and write
scragged data to the disk.

1) The suggestion using the serial number of the drive to trace back to
what's connected to what is good, assuming you can pull drives to look
at their serial numbers.


Except that this could cause more problems if you happen to
pull the wrong one in the middle of a resilver operation. Or
anything, really.



2) One thing I've done over the years is, given that I often use the
same motherboards, is physically map out the PCI slot addresses -

/dev/cfg/c2 ../../devices/p...@0,0/pci8086,3...@3/pci1000,3...@0:scsi
/dev/cfg/c3
../../devices/p...@7a,0/pci8086,3...@5/pci1000,3...@0/ip...@f:scsi
/dev/cfg/c4
../../devices/p...@7a,0/pci8086,3...@5/pci1000,3...@0/ip...@v0:scsi
/dev/cfg/c5
../../devices/p...@7a,0/pci8086,3...@7/pci1000,3...@0/ip...@f:scsi
/dev/cfg/c6
../../devices/p...@7a,0/pci8086,3...@7/pci1000,3...@0/ip...@v0:scsi
/dev/cfg/c7 ../../devices/p...@7a,0/pci8086,3...@9/pci1000,3...@0:scsi
(In these paths, the pci8086,3...@N component is the part that corresponds
to a physical slot; on a SM dual-IOH board, the p...@0,0 and p...@7a,0
prefixes represent the two IOH-36s; and on a single-IOH board I've noted
that the @N unit-address (from DDI) often corresponds to the physical slot
number.)

So far, it's involved stick a card in a slot/reboot/reconfig/see what
address it's at/note it down or other forms of reverse engineering.
Handy to have occasionally. If you're doing a BYO, taking the time up
front to figure this out is a Good Idea.


This is what FMA's libtopo solves for you. Fairly well, too:

# /usr/lib/fm/fmd/fmtopo -V



hc://:product-id=Sun-Ultra-40-M2-Workstation:server-id=blinder:chassis-id=0802FMY00N/motherboard=0/hostbridge=1/pciexrc=1/pciexbus=2/pciexdev=0/pciexfn=0
  group: protocol   version: 1   stability: Private/Private
resource  fmri  
hc://:product-id=Sun-Ultra-40-M2-Workstation:server-id=blinder:chassis-id=0802FMY00N/motherboard=0/hostbridge=1/pciexrc=1/pciexbus=2/pciexdev=0/pciexfn=0
label stringPCIE0 Slot
FRU   fmri  
hc://:product-id=Sun-Ultra-40-M2-Workstation:server-id=blinder:chassis-id=0802FMY00N/motherboard=0/hostbridge=1/pciexrc=1/pciexbus=2/pciexdev=0
ASRU  fmri  dev:p...@0,0/pci10de,3...@a/pci1000,3...@0
  group: authority  version: 1   stability: Private/Private
product-idstringSun-Ultra-40-M2-Workstation
chassis-idstring0802FMY00N
server-id stringblinder
  group: io version: 1   stability: Private/Private
dev   string/p...@0,0/pci10de,3...@a/pci1000,3...@0
driverstringmpt
modulefmri  mod:///mod-name=mpt/mod-id=57
  group: pciversion: 1   stability: Private/Private
device-id string58
extended-capabilities stringpciexdev
class-codestring1
vendor-id string1000
assigned-addresses uint32[]  [ 2164391952 0 16384 0 256 2197946388 0 
2686517248 0 16384 2197946396 0 2686451712 0 65536 ]



Note the label property.
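
If you only want the slot labels rather than the full verbose dump,
something like this works (a sketch; the property layout can vary by
platform):

# print just the label properties from the verbose topology output
/usr/lib/fm/fmd/fmtopo -V | grep -i label

then re-run without the grep to see which fmri/dev path each label belongs
to.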






3) Get a copy of lsiutil for solaris (available from LSI's site) -
it's an easy way to check out the controller and see if it's there or
whether it sees the drives or what.
(There is a newer version of lsiutil that supports the 2008s...
strangely, it's not available from the LSI site. Their tech support
didn't even know it existed when I asked. I got my copy off someone on
hardforum.)


Sigh. Sun (and now Oracle) didn't distribute lsiutil, due at least in part
to the likelihood of customers killing their HBAs with it - which, having
used lsiutil (to recover from failed operations), is depressingly easy to
do.

The replacement is sas2ircu. I believe the same caveats apply.
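
For the newer controllers, a basic sanity check with sas2ircu looks
something like this (controller index 0 is an assumption; check the LIST
output for the real index):

# list the controllers sas2ircu can see
sas2ircu LIST
# dump controller, volume and physical-device details for controller 0
sas2ircu 0 DISPLAY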
 

4) Things you didn't want to know: the LSI1068 actually has a very small
write cache on board. So if you manage a certain set of situations
(namely, setting the device i/o timeout in the BIOS to something other
than 0, then having a SATA drive blow up in a certain way such that it
hangs for longer than the timeout you set), the mpt driver (it seems)
can get impatient and re-initialize the controller, or that's what it
looks like. Great way to scrag a volume. :(


The 1068 also has a limitation of 122 devices for its logical target-id
concept. But we don't talk about that in polite company :-)

Please, go and have a poke around the output from libtopo. I think you'll
be pleasantly surprised at what you can discover with it.


McB