Re: [zfs-discuss] WD caviar/mpt issues

2010-06-23 Thread Jeff Bacon
  Have I missed any changes/updates in the situation?
 
 I've been getting very bad performance out of an LSI 9211-4i card
 (mpt_sas) with Seagate Constellation 2TB SAS disks, SM SC846E1 and
 Intel X-25E/M SSDs. Long story short, I/O will hang for over 1 minute
 at random under heavy load.

Hm. That I haven't seen. Is this hang as in some drive hangs up with
iostat busy% at 100 and nothing else happening (can't talk to a disk) or
a hang as perceived by applications under load? 

What's your read/write mix, and what are you using for CPU/mem? How many
drives? 
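
A quick way to tell the two apart while it's wedged is to watch per-device
service times; a rough sketch (nothing here is specific to your box):

# iostat -xnz 1

A device sitting at 100 %b with 0.0 r/s and 0.0 w/s but a non-zero actv
count is a disk (or path) that has stopped answering; if ops are still
completing but asvc_t is huge, it's more likely plain saturation.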

I wonder if maybe your SSDs are flooding the channel. I have a (many)
847E2 chassis, and I'm considering putting in a second pair of
controllers and splitting the drives front/back so it's 24/12 vs all 36
on one pair. 

 Swapping the 9211-4i for a MegaRAID ELP (mega_sas) improves
 performance by 30-40% instantly and there are no hangs anymore so I'm
 guessing it's something related to the mpt_sas driver.

Well, I sorta hate to swap out all of my controllers (bother, not to
mention the cost) but it'd be nice to have raidutil/lsiutil back.



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] WD caviar/mpt issues

2010-06-23 Thread Giovanni Tirloni
On Wed, Jun 23, 2010 at 10:14 AM, Jeff Bacon ba...@walleyesoftware.com wrote:
  Have I missed any changes/updates in the situation?

 I've been getting very bad performance out of an LSI 9211-4i card
 (mpt_sas) with Seagate Constellation 2TB SAS disks, SM SC846E1 and
 Intel X-25E/M SSDs. Long story short, I/O will hang for over 1 minute
 at random under heavy load.

 Hm. That I haven't seen. Is this hang as in some drive hangs up with
 iostat busy% at 100 and nothing else happening (can't talk to a disk) or
 a hang as perceived by applications under load?

 What's your read/write mix, and what are you using for CPU/mem? How many
 drives?

I'm using iozone to get some performance numbers and I/O hangs when
it's doing the writing phase.
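
For anyone who wants to try to reproduce it, a run along these lines should
exercise the same write phase (the sizes, thread count and paths below are
just placeholders, not my exact parameters):

# iozone -i 0 -i 1 -s 4g -r 128k -t 4 -F /tank/t1 /tank/t2 /tank/t3 /tank/t4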

This pool has:

18 x 2TB SAS disks as 9 data mirrors
2 x 32GB X-25E as log mirror
1 x 160GB X-160M as cache
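
(If it helps to picture the layout, the pool is built roughly along these
lines; the device names are placeholders, not the real ones:)

# zpool create tank \
    mirror c0t0d0 c0t1d0 mirror c0t2d0 c0t3d0 [... 9 mirrors total ...] \
    log mirror c1t0d0 c1t1d0 \
    cache c1t2d0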

iostat shows 2 I/O operations active and SSDs at 100% busy when it's stuck.

There are timeout messages when this happens:

Jun 23 00:05:51 osol-x8-hba scsi: [ID 107833 kern.warning] WARNING:
/p...@0,0/pci8086,3...@3/pci1000,3...@0 (mpt_sas0):
Jun 23 00:05:51 osol-x8-hba Disconnected command timeout for Target 11
Jun 23 00:05:51 osol-x8-hba scsi: [ID 365881 kern.info]
/p...@0,0/pci8086,3...@3/pci1000,3...@0 (mpt_sas0):
Jun 23 00:05:51 osol-x8-hba Log info 0x3114 received for target 11.
Jun 23 00:05:51 osol-x8-hba scsi_status=0x0, ioc_status=0x8048,
scsi_state=0xc
Jun 23 00:05:51 osol-x8-hba scsi: [ID 365881 kern.info]
/p...@0,0/pci8086,3...@3/pci1000,3...@0 (mpt_sas0):
Jun 23 00:05:51 osol-x8-hba Log info 0x3114 received for target 11.
Jun 23 00:05:51 osol-x8-hba scsi_status=0x0, ioc_status=0x8048,
scsi_state=0xc
Jun 23 00:11:51 osol-x8-hba scsi: [ID 107833 kern.warning] WARNING:
/p...@0,0/pci8086,3...@3/pci1000,3...@0 (mpt_sas0):
Jun 23 00:11:51 osol-x8-hba Disconnected command timeout for Target 11
Jun 23 00:11:51 osol-x8-hba scsi: [ID 365881 kern.info]
/p...@0,0/pci8086,3...@3/pci1000,3...@0 (mpt_sas0):
Jun 23 00:11:51 osol-x8-hba Log info 0x3114 received for target 11.
Jun 23 00:11:51 osol-x8-hba scsi_status=0x0, ioc_status=0x8048,
scsi_state=0xc
Jun 23 00:11:51 osol-x8-hba scsi: [ID 365881 kern.info]
/p...@0,0/pci8086,3...@3/pci1000,3...@0 (mpt_sas0):
Jun 23 00:11:51 osol-x8-hba Log info 0x3114 received for target 11.
Jun 23 00:11:51 osol-x8-hba scsi_status=0x0, ioc_status=0x8048,
scsi_state=0xc
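
Side note for anyone chasing the same thing: a couple of stock commands that
help show whether a particular disk is quietly racking up transport errors
while this happens (nothing below is specific to this box):

# iostat -En     (per-device soft/hard/transport error counters)
# fmdump -eV     (FMA ereports, including SCSI transport events)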



 I wonder if maybe your SSDs are flooding the channel. I have a (many)
 847E2 chassis, and I'm considering putting in a second pair of
 controllers and splitting the drives front/back so it's 24/12 vs all 36
 on one pair.

My plan is to use the newest SC846E26 chassis with 2 cables, but right
now what I have available for testing is the SC846E1.

I like the fact that SM uses the LSI chipsets in their backplanes.
It's been a good experience so far.


 Swapping the 9211-4i for a MegaRAID ELP (mega_sas) improves
 performance by 30-40% instantly and there are no hangs anymore so I'm
 guessing it's something related to the mpt_sas driver.

 Well, I sorta hate to swap out all of my controllers (bother, not to
 mention the cost) but it'd be nice to have raidutil/lsiutil back.

As much as I would like to blame faulty hardware for this issue, I
only pointed out that using the MegaRAID doesn't show the problem
because that's what I've been using without any issues in this
particular setup.

This system will be available to me for quite some time, so if anyone
wants me to run tests to help understand what's happening, I would be
happy to do so.

-- 
Giovanni Tirloni
gtirl...@sysdroid.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] COMSTAR iSCSI and two Windows computers

2010-06-23 Thread Scott Meilicke
Look again at how XenServer does storage. I think you will find it already has 
a solution, both for iSCSI and NFS.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] raid-z - not even iops distribution

2010-06-23 Thread Scott Meilicke
Reaching into the dusty regions of my brain, I seem to recall that since RAIDz 
does not work like a traditional RAID 5, particularly because of its variably 
sized stripes, the data may not hit all of the disks, but it will always be 
redundant. 

I apologize for not having a reference for this assertion, so I may be 
completely wrong.

I assume your hardware is recent, the controllers are on PCIe x4 buses, etc.

-Scott
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] c5-c9 device name change prevents beadm activate

2010-06-23 Thread Evan Layton

On 6/23/10 4:29 AM, Brian Nitz wrote:

I saw a problem while upgrading from build 140 to 141 where beadm
activate {build141BE} failed because installgrub failed:

# BE_PRINT_ERR=true beadm activate opensolarismigi-4
be_do_installgrub: installgrub failed for device c5t0d0s0.
Unable to activate opensolarismigi-4.
Unknown external error.

The reason installgrub failed is that it is attempting to install grub
on c5t0d0s0 which is where my root pool is:
# zpool status
pool: rpool
state: ONLINE
status: The pool is formatted using an older on-disk format. The pool can
still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'. Once this is done, the
pool will no longer be accessible on older software versions.
scan: scrub repaired 0 in 5h3m with 0 errors on Tue Jun 22 22:31:08 2010
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          c5t0d0s0  ONLINE       0     0     0

errors: No known data errors

But the raw device doesn't exist:
# ls -ls /dev/rdsk/c5*
/dev/rdsk/c5*: No such file or directory

Even though zfs pool still sees it as c5, the actual device seen by
format is c9t0d0s0


Is there any workaround for this problem? Is it a bug in install, zfs or
somewhere else in ON?



In this instance beadm is a victim of the zpool configuration reporting
the wrong device. This does appear to be a ZFS issue since the device
actually being used is not what zpool status is reporting. I'm forwarding
this on to the ZFS alias to see if anyone has any thoughts there.

-evan
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] raid-z - not even iops distribution

2010-06-23 Thread Adam Leventhal
Hey Robert,

How big of a file are you making? RAID-Z does not explicitly do the parity 
distribution that RAID-5 does. Instead, it relies on non-uniform stripe widths 
to distribute IOPS.

Adam

On Jun 18, 2010, at 7:26 AM, Robert Milkowski wrote:

 Hi,
 
 
 zpool create test raidz c0t0d0 c1t0d0 c2t0d0 c3t0d0 \
  raidz c0t1d0 c1t1d0 c2t1d0 c3t1d0 \
  raidz c0t2d0 c1t2d0 c2t2d0 c3t2d0 \
  raidz c0t3d0 c1t3d0 c2t3d0 c3t3d0 \
  [...]
  raidz c0t10d0 c1t10d0 c2t10d0 c3t10d0
 
 zfs set atime=off test
 zfs set recordsize=16k test
 (I know...)
 
 now if I create one large file with filebench and simulate a random read 
 workload with 1 or more threads, then disks on the c2 and c3 controllers are 
 getting about 80% more reads. This happens both on 111b and snv_134. I would 
 rather expect all of them to get about the same number of iops.
 
 Any idea why?
 
 
 -- 
 Robert Milkowski
 http://milek.blogspot.com
 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


--
Adam Leventhal, Fishworks                      http://blogs.sun.com/ahl

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] c5-c9 device name change prevents beadm activate

2010-06-23 Thread Cindy Swearingen



On 06/23/10 10:40, Evan Layton wrote:

On 6/23/10 4:29 AM, Brian Nitz wrote:

I saw a problem while upgrading from build 140 to 141 where beadm
activate {build141BE} failed because installgrub failed:

# BE_PRINT_ERR=true beadm activate opensolarismigi-4
be_do_installgrub: installgrub failed for device c5t0d0s0.
Unable to activate opensolarismigi-4.
Unknown external error.

The reason installgrub failed is that it is attempting to install grub
on c5t0d0s0 which is where my root pool is:
# zpool status
pool: rpool
state: ONLINE
status: The pool is formatted using an older on-disk format. The pool can
still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'. Once this is done, the
pool will no longer be accessible on older software versions.
scan: scrub repaired 0 in 5h3m with 0 errors on Tue Jun 22 22:31:08 2010
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          c5t0d0s0  ONLINE       0     0     0

errors: No known data errors

But the raw device doesn't exist:
# ls -ls /dev/rdsk/c5*
/dev/rdsk/c5*: No such file or directory

Even though zfs pool still sees it as c5, the actual device seen by
format is c9t0d0s0


Is there any workaround for this problem? Is it a bug in install, zfs or
somewhere else in ON?



In this instance beadm is a victim of the zpool configuration reporting
the wrong device. This does appear to be a ZFS issue since the device
actually being used is not what zpool status is reporting. I'm forwarding
this on to the ZFS alias to see if anyone has any thoughts there.

-evan


Hi Evan,

I suspect that some kind of system, hardware, or firmware event changed
this device name. We could identify the original root pool device with
the zpool history output from this pool.

Brian, you could boot this system from the OpenSolaris LiveCD and
attempt to import this pool to see if that will update the device info
correctly.

If that doesn't help, then create /dev/rdsk/c5* symlinks to point to
the correct device.
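
Something like the following, all run from the LiveCD shell (the altroot
path is arbitrary and I haven't verified this exact sequence on your
hardware):

# zpool import                  (should list rpool via its current c9 device)
# zpool import -f -R /mnt rpool
# zpool status rpool            (check whether the device now reads c9t0d0s0)
# zpool export rpool

Then reboot back into the installed BE and retry the beadm activate.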

Thanks,

Cindy
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] c5-c9 device name change prevents beadm activate

2010-06-23 Thread Lori Alt

Cindy Swearingen wrote:



On 06/23/10 10:40, Evan Layton wrote:

On 6/23/10 4:29 AM, Brian Nitz wrote:

I saw a problem while upgrading from build 140 to 141 where beadm
activate {build141BE} failed because installgrub failed:

# BE_PRINT_ERR=true beadm activate opensolarismigi-4
be_do_installgrub: installgrub failed for device c5t0d0s0.
Unable to activate opensolarismigi-4.
Unknown external error.

The reason installgrub failed is that it is attempting to install grub
on c5t0d0s0 which is where my root pool is:
# zpool status
pool: rpool
state: ONLINE
status: The pool is formatted using an older on-disk format. The pool can
still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'. Once this is done, the
pool will no longer be accessible on older software versions.
scan: scrub repaired 0 in 5h3m with 0 errors on Tue Jun 22 22:31:08 2010

config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          c5t0d0s0  ONLINE       0     0     0

errors: No known data errors

But the raw device doesn't exist:
# ls -ls /dev/rdsk/c5*
/dev/rdsk/c5*: No such file or directory

Even though zfs pool still sees it as c5, the actual device seen by
format is c9t0d0s0


Is there any workaround for this problem? Is it a bug in install, zfs or
somewhere else in ON?



In this instance beadm is a victim of the zpool configuration reporting
the wrong device. This does appear to be a ZFS issue since the device
actually being used is not what zpool status is reporting. I'm forwarding
this on to the ZFS alias to see if anyone has any thoughts there.

-evan


Hi Evan,

I suspect that some kind of system, hardware, or firmware event changed
this device name. We could identify the original root pool device with
the zpool history output from this pool.

Brian, you could boot this system from the OpenSolaris LiveCD and
attempt to import this pool to see if that will update the device info
correctly.

If that doesn't help, then create /dev/rdsk/c5* symlinks to point to
the correct device.

I've seen this kind of device name change in a couple contexts now 
related to installs, image-updates, etc.


I think we need to understand why this is happening.  Prior to 
OpenSolaris and the new installer, we used to go to a fair amount of 
trouble to make sure that device names, once assigned, never changed.  
Various parts of the system depended on device names remaining the same 
across upgrades and other system events.


Does anyone know why these device names are changing?  Because that 
seems like the root of the problem.  Creating symlinks with the old 
names seems like a band-aid, which could cause problems down the 
road--what if some other device on the system gets assigned that name on 
a future update?


Lori

 


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] WD caviar/mpt issues

2010-06-23 Thread Jeff Bacon

Gack, that's the same message we're seeing with the mpt controller with
SATA drives. I've never seen it with a SAS drive before.

Has anyone noticed a trend of 2TB SATA drives en masse not working well
with the LSI SASx28/x36 expander chips? I can seemingly reproduce it on
demand - hook 4 2TB disks to one of my Supermicro chassis, spin up the
array, and beat on it. (The last part is optional; merely hooking up the
WD Caviar Blacks and attempting an import is sometimes sufficient.) 
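
(By "beat on it" I mean nothing fancier than a handful of parallel writers,
along these lines; the pool name is whatever you have handy:)

# for i in 1 2 3 4; do dd if=/dev/zero of=/testpool/beat.$i bs=1024k count=8192 & done; wait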

Sun guys, I've got piles of hardware; if you want a testbed, you've got it. 


  What's your read/write mix, and what are you using for CPU/mem? How many
  drives?
 
 I'm using iozone to get some performance numbers and I/O hangs
 when it's doing the writing phase.
 
 This pool has:
 
 18 x 2TB SAS disks as 9 data mirrors
 2 x 32GB X-25E as log mirror
 1 x 160GB X-160M as cache
 
 iostat shows 2 I/O operations active and SSDs at 100% busy when
 it's stuck.

 There are timeout messages when this happens:
 
 Jun 23 00:05:51 osol-x8-hba scsi: [ID 107833 kern.warning] WARNING:
 /p...@0,0/pci8086,3...@3/pci1000,3...@0 (mpt_sas0):
 Jun 23 00:05:51 osol-x8-hba Disconnected command timeout for Target 11
 Jun 23 00:05:51 osol-x8-hba scsi: [ID 365881 kern.info]
 /p...@0,0/pci8086,3...@3/pci1000,3...@0 (mpt_sas0):
 Jun 23 00:05:51 osol-x8-hba Log info 0x3114 received for target 11.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] WD caviar/mpt issues

2010-06-23 Thread Jeff Bacon
 I'm using iozone to get some performance numbers and I/O hangs when
 it's doing the writing phase.
 
 This pool has:
 
 18 x 2TB SAS disks as 9 data mirrors
 2 x 32GB X-25E as log mirror
 1 x 160GB X-160M as cache
 
 iostat shows 2 I/O operations active and SSDs at 100% busy when
 it's stuck.

Interesting.  I have an SM 847E2 chassis with 33 Constellation 2TB SAS and
3 Vertex LE 100G, dual-connected across a pair of 9211-8is, sol10u8 with
the May patchset, and it runs like a champ - left several bonnie++ processes
running on it for three days straight thrashing the pool, not even a
blip. 

(the rear and front backplanes are separately cabled to the
controllers.)

(that's with load-balance=none, in deference to Josh Simon's
observations - not really willing to lock the paths because I want the
auto-failover. I'm going to be dropping in another pair of 9211-4is  and
connecting the back 12 drives to them since I have the PCIe slots,
though it's probably not especially necessary.) 
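
(If anyone wants to see what their paths and load-balance policy look like,
mpathadm shows it per logical unit; the device name below is a placeholder,
not a paste from this box:)

# mpathadm list lu
# mpathadm show lu /dev/rdsk/c0t5000C5001234ABCDd0s2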

I wonder if the expander chassis work better if you're running with the
dual-expander-chip backplane? So far all of my testing with the 2TB SAS
drives has been with single-expander-chip backplanes. Hm, might have to
give that a try; it never came up simply because both of my
dual-expander-chip-backplane JBODs were filled and in use, which just
recently changed.

 My plan is to use the newest SC846E26 chassis with 2 cables, but right
 now what I have available for testing is the SC846E1.

Agreed. I just got my first 847E2 chassis in today - been waiting for
months for them to be available, and I'm not entirely sure there's any
real stock (sorta like SM's quad-socket Magny-Cours boards - a month
ago, they didn't even have any boards in the USA available for RMA, they
got one batch in and sold it in a week or so). 

  Swapping the 9211-4i for a MegaRAID ELP (mega_sas) improves
  performance by 30-40% instantly and there are no hangs anymore so I'm
  guessing it's something related to the mpt_sas driver.

Wait. The mpt_sas driver by default uses scsi_vhci, and scsi_vhci by
default does load-balance round-robin. Have you tried setting
load-balance=none in scsi_vhci.conf? 
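
For reference, the change I mean is just this line in
/kernel/drv/scsi_vhci.conf (followed by a reboot to pick it up):

load-balance="none";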

-bacon
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] raid-z - not even iops distribution

2010-06-23 Thread Robert Milkowski


128GB.

Does it mean that for a dataset used for databases and similar 
environments, where basically all blocks have a fixed size and there is no 
other data, all parity information will end up on one (z1) or two (z2) 
specific disks?




On 23/06/2010 17:51, Adam Leventhal wrote:

Hey Robert,

How big of a file are you making? RAID-Z does not explicitly do the parity 
distribution that RAID-5 does. Instead, it relies on non-uniform stripe widths 
to distribute IOPS.

Adam

On Jun 18, 2010, at 7:26 AM, Robert Milkowski wrote:

   

Hi,


zpool create test raidz c0t0d0 c1t0d0 c2t0d0 c3t0d0 \
  raidz c0t1d0 c1t1d0 c2t1d0 c3t1d0 \
  raidz c0t2d0 c1t2d0 c2t2d0 c3t2d0 \
  raidz c0t3d0 c1t3d0 c2t3d0 c3t3d0 \
  [...]
  raidz c0t10d0 c1t10d0 c2t10d0 c3t10d0

zfs set atime=off test
zfs set recordsize=16k test
(I know...)

now if I create one large file with filebench and simulate a random read 
workload with 1 or more threads, then disks on the c2 and c3 controllers are getting 
about 80% more reads. This happens both on 111b and snv_134. I would rather 
expect all of them to get about the same number of iops.

Any idea why?


--
Robert Milkowski
http://milek.blogspot.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
 


--
Adam Leventhal, Fishworks                      http://blogs.sun.com/ahl


   


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] raid-z - not even iops distribution

2010-06-23 Thread Adam Leventhal
 Does it mean that for a dataset used for databases and similar environments 
 where basically all blocks have a fixed size and there is no other data, all 
 parity information will end up on one (z1) or two (z2) specific disks?

No. There are always smaller writes to metadata that will distribute parity. 
What is the total width of your raidz1 stripe?
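
For a rough feel for why the width matters, here's a back-of-envelope for a
4-wide raidz1 with 16K records and 512-byte sectors (this is from memory, so
treat the padding detail with some suspicion):

16K record        = 32 data sectors
data columns      = 3 (4 disks minus 1 parity)
rows per block    = ceil(32 / 3) = 11, so 11 parity sectors, 43 sectors total
allocation        = padded to 44 sectors (a multiple of nparity + 1)
44 mod 4 disks    = 0, so successive blocks start on the same column and the
                    parity keeps landing on the same one or two disks

With other widths or record sizes the starting column rotates and the parity
spreads out more evenly.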

Adam

--
Adam Leventhal, Fishworks                      http://blogs.sun.com/ahl

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] raid-z - not even iops distribution

2010-06-23 Thread Ross Walker
On Jun 23, 2010, at 1:48 PM, Robert Milkowski mi...@task.gda.pl wrote:

 
 128GB.
 
 Does it mean that for a dataset used for databases and similar environments 
 where basically all blocks have a fixed size and there is no other data, all 
 parity information will end up on one (z1) or two (z2) specific disks?

What's the record size on those datasets?

8k?

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Open Solaris installation help for backup application

2010-06-23 Thread Albert Davis
This forum has been tremendously helpful, but I decided to get some help from a 
Solaris guru to install Solaris for a backup application.

I do not want to disturb the flow of this forum, but where can I post to find 
some paid help? We are located in the San Francisco Bay Area. Any 
help would be appreciated.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] WD caviar/mpt issues

2010-06-23 Thread Giovanni Tirloni
On Wed, Jun 23, 2010 at 2:43 PM, Jeff Bacon ba...@walleyesoftware.com wrote:
  Swapping the 9211-4i for a MegaRAID ELP (mega_sas) improves
  performance by 30-40% instantly and there are no hangs anymore so I'm
  guessing it's something related to the mpt_sas driver.

 Wait. The mpt_sas driver by default uses scsi_vhci, and scsi_vhci by
 default does load-balance round-robin. Have you tried setting
 load-balance=none in scsi_vhci.conf?

That didn't help.

-- 
Giovanni Tirloni
gtirl...@sysdroid.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss