Re: [zfs-discuss] repost - high read iops

2009-12-30 Thread Toby Thain


On 29-Dec-09, at 11:53 PM, Ross Walker wrote:

On Dec 29, 2009, at 12:36 PM, Bob Friesenhahn  
bfrie...@simple.dallas.tx.us wrote:


...
 However, zfs does not implement RAID 1 either.  This is easily  
demonstrated since you can unplug one side of the mirror and the  
writes to the zfs mirror will still succeed, catching up the  
mirror which is behind as soon as it is plugged back in.  When  
using mirrors, zfs supports logic which will catch that mirror  
back up (only sending the missing updates) when connectivity  
improves.  With RAID 1 there is no way to recover a mirror other  
than a full copy from the other drive.


That's not completely true these days as a lot of raid  
implementations use bitmaps to track changed blocks and a raid1  
continues to function when the other side disappears. The real  
difference is the mirror implementation in ZFS is in the file  
system and not at an abstracted block-io layer so it is more  
intelligent in its use and layout.


Another important difference is that ZFS has the means to know which  
side of a mirror returned valid data.


--Toby
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] repost - high read iops

2009-12-30 Thread Bob Friesenhahn

On Tue, 29 Dec 2009, Ross Walker wrote:


Some important points to consider are that every write to a raidz vdev must 
be synchronous.  In other words, the write needs to complete on all the 
drives in the stripe before the write may return as complete. This is also 
true of RAID 1 (mirrors) which specifies that the drives are perfect 
duplicates of each other.


I believe mirrored vdevs can do this in parallel though, while raidz vdevs 
need to do this serially due to the ordered nature of the transaction which 
makes the sync writes faster on the mirrors.


I don't think that the raidz needs to write the stripe serially, but 
it does need to ensure that the data is committed to the drives before 
considering the write to be completed.  This is due to the nature of 
the RAID5 stripe, which needs to be completely written.  It seems that 
mirrors are more sloppy in that writing and committing to one mirror 
is enough.


Bob, an interesting question was brought up to me about how copies may affect 
random read performance. I didn't know the answer, but if ZFS knows there are 
additional copies would it not also spread the load across those as well to 
make sure the wait queues on each spindle are as even as possible?


Previously we were told that zfs uses a semi-random algorithm to 
select which mirror side (or copy) to read from.  Presumably more 
(double) performance would be available if zfs was able to precisely 
schedule and interleave reads from the mirror devices, but perfecting 
that could be quite challenging.  With mirrors, we do see more read 
performance than one device can provide, but we don't see double.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] repost - high read iops

2009-12-30 Thread Ross Walker
On Wed, Dec 30, 2009 at 12:35 PM, Bob Friesenhahn
bfrie...@simple.dallas.tx.us wrote:
 On Tue, 29 Dec 2009, Ross Walker wrote:

 Some important points to consider are that every write to a raidz vdev
 must be synchronous.  In other words, the write needs to complete on all the
 drives in the stripe before the write may return as complete. This is also
 true of RAID 1 (mirrors) which specifies that the drives are perfect
 duplicates of each other.

 I believe mirrored vdevs can do this in parallel though, while raidz vdevs
 need to do this serially due to the ordered nature of the transaction which
 makes the sync writes faster on the mirrors.

 I don't think that the raidz needs to write the stripe serially, but it does
 need to ensure that the data is committed to the drives before considering
 the write to be completed.  This is due to the nature of the RAID5 stripe,
 which needs to be completely written.  It seems that mirrors are more sloppy
 in that writing and committing to one mirror is enough.

Ok, that makes sense, as long as the metadata is committed, which can
happen for mirrors as soon as one side is written, but not for raidz
until the whole stripe is written. So I was right about the increased
latency for raidz, but for the wrong reason.

 Bob, an interesting question was brought up to me about how copies may
 affect random read performance. I didn't know the answer, but if ZFS knows
 there are additional copies would it not also spread the load across those
 as well to make sure the wait queues on each spindle are as even as
 possible?

 Previously we were told that zfs uses a semi-random algorithm to select
 which mirror side (or copy) to read from.  Presumably more (double)
 performance would be available if zfs was able to precisely schedule and
 interleave reads from the mirror devices, but perfecting that could be quite
 challenging.  With mirrors, we do see more read performance than one device
 can provide, but we don't see double.

That isn't quite what I was getting at. Say one has a pool of mirrors,
and then sets copies=2 on the pool, which in theory should create a
second copy of the data on another vdev in the pool. When servicing a
read request, is the ZFS scheduler smart enough to read from the second
copy if the vdev holding the first copy is busy with another read/write
request? If so, this would increase the read performance of the pool of
mirrors by sacrificing some write performance.

-Ross
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] repost - high read iops

2009-12-30 Thread Richard Elling

On Dec 30, 2009, at 9:35 AM, Bob Friesenhahn wrote:


On Tue, 29 Dec 2009, Ross Walker wrote:


Some important points to consider are that every write to a raidz  
vdev must be synchronous.  In other words, the write needs to  
complete on all the drives in the stripe before the write may  
return as complete. This is also true of RAID 1 (mirrors) which  
specifies that the drives are perfect duplicates of each other.


I believe mirrored vdevs can do this in parallel though, while  
raidz vdevs need to do this serially due to the ordered nature of  
the transaction which makes the sync writes faster on the mirrors.


I don't think that the raidz needs to write the stripe serially, but  
it does need to ensure that the data is committed to the drives  
before considering the write to be completed.  This is due to the  
nature of the RAID5 stripe, which needs to be completely written.   
It seems that mirrors are more sloppy in that writing and committing  
to one mirror is enough.


Yes, though I wouldn't call it sloppy ;-)
With traditional software RAID, you have to make sure both sides of  
the mirror are written because you also assume that you can later read  
either side.  For ZFS, if only one side of the mirror is written, you  
know the bad side is bad because of the checksum. The checksum is owned  
by the parent, which is an important design decision that applies here, too.

Methinks it might be a good idea to start a comparison wiki to share  
some of the details...
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] repost - high read iops

2009-12-29 Thread przemolicc
On Mon, Dec 28, 2009 at 01:40:03PM -0800, Brad wrote:
 This doesn't make sense to me. You've got 32 GB, why not use it?
 Artificially limiting the memory use to 20 GB seems like a waste of
 good money.
 
 I'm having a hard time convincing the dbas to increase the size of the SGA to 
 20GB because their philosophy is, no matter what eventually you'll have to 
 hit disk to pick up data thats not stored in cache (arc or l2arc).  The 
 typical database server in our environment holds over 3TB of data.

Brad,

are your DBAs aware that if you increase your SGA (currently 4 GB)
- to 8  GB - you get 100 % more memory for SGA
- to 16 GB - you get 300 % more memory for SGA
- to 20 GB - you get 400 % ...

If they are not aware, well ...

But try to be patient - I had a similar situation. It took quite a long time to 
convince
our DBAs to increase the SGA from 16 GB to 20 GB. Finally they did :-)

You can always use the stronger argument that not using memory you have already 
bought is a waste of _money_.

Regards
Przemyslaw Bak (przemol)
--
http://przemol.blogspot.com/

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] repost - high read iops

2009-12-29 Thread Brad
Thanks for the suggestion!

I have heard mirrored vdev configurations are preferred for Oracle, but what's 
the difference between a raidz mirrored vdev and a raid10 setup? 

We have tested a zfs stripe configuration before with 15 disks and our tester 
was extremely happy with the performance.  After talking to our tester, she 
doesn't feel comfortable with the current raidz setup.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] repost - high read iops

2009-12-29 Thread Ross Walker

On Dec 29, 2009, at 7:55 AM, Brad bene...@yahoo.com wrote:


Thanks for the suggestion!

I have heard mirrored vdevs configuration are preferred for Oracle  
but whats the difference between a raidz mirrored vdev vs a raid10  
setup?


A mirrored raidz provides redundancy at a steep cost to performance  
and might I add a high monetary cost.


Because each write of a raidz is striped across the disks the  
effective IOPS of the vdev is equal to that of a single disk. This can  
be improved by utilizing multiple (smaller) raidz vdevs which are  
striped, but not by mirroring them.


With raid10 each mirrored pair has the IOPS of a single drive. Since  
these mirrors are typically 2 disk vdevs, you can have a lot more of  
them and thus a lot more IOPS (some people talk about using 3 disk  
mirrors, but it's probably just as good as setting copies=2 on  
a regular pool of mirrors).


We have tested a zfs stripe configuration before with 15 disks and  
our tester was extremely happy with the performance.  After talking  
to our tester, she doesn't feel comfortable with the current raidz  
setup.


How many luns are you working with now? 15?

Is the storage direct attached or is it coming from a storage server  
that may have the physical disks in a raid configuration already?


If direct attached, create a pool of mirrors. If it's coming from a  
storage server where the disks are in a raid already, just create a  
striped pool and set copies=2.
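
A rough sketch of the two layouts (pool and device names here are only  
placeholders):

   # direct attached: a pool of 2-disk mirrors
   zpool create dbpool mirror c1t1d0 c1t2d0 mirror c1t3d0 c1t4d0

   # LUNs already protected by the array: plain stripe, plus extra copies
   zpool create dbpool c2t0d0 c2t1d0 c2t2d0
   zfs set copies=2 dbpool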


-Ross



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] repost - high read iops

2009-12-29 Thread Eric D. Mudama

On Tue, Dec 29 at  4:55, Brad wrote:

Thanks for the suggestion!

I have heard mirrored vdevs configuration are preferred for Oracle
but whats the difference between a raidz mirrored vdev vs a raid10
setup?

We have tested a zfs stripe configuration before with 15 disks and
our tester was extremely happy with the performance.  After talking
to our tester, she doesn't feel comfortable with the current raidz
setup.


As a general rule of thumb, each vdev has the random performance
roughly the same as a single member of that vdev.  Having six RAIDZ
vdevs in a pool should give roughly the same performance as a stripe of six
bare drives, for random IO.

When you're in a workload that you expect to be bounded by random IO
performance, in ZFS you'd want to increase the number of VDEVs to be
as large as possible, which acts to distribute random work across all
of your disks.  Building a pool out of 2-disk mirrors, then, is the
preferred layout for random performance, since it's the highest ratio
of disks to vdevs you can achieve (short of non-fault-tolerant
configurations).

This winds up looking similar to RAID10 in layout, in that you're
striping across a lot of disks that each consists of a mirror, though
the checksumming rules are different.  Performance should also be
similar, though it's possible RAID10 may give slightly better random
read performance at the expense of some data quality guarantees, since
I don't believe RAID10 normally validates checksums on returned data
if the device didn't return an error.  In normal practice, RAID10 and
a pool of mirrored vdevs should benchmark against each other within
your margin of error.

--eric


--
Eric D. Mudama
edmud...@mail.bounceswoosh.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] repost - high read iops

2009-12-29 Thread Brad
@ross

Because each write of a raidz is striped across the disks the
effective IOPS of the vdev is equal to that of a single disk. This can
be improved by utilizing multiple (smaller) raidz vdevs which are
striped, but not by mirroring them.

So with random reads, would it perform better on a raid5 layout, since the FS 
blocks are written to individual disks instead of across a stripe?

With zfs's implementation of raid10, would we still get data protection and 
checksumming?

How many luns are you working with now? 15?  
Is the storage direct attached or is it coming from a storage server
that may have the physical disks in a raid configuration already?
If direct attached, create a pool of mirrors. If it's coming from a
storage server where the disks are in a raid already, just create a
striped pool and set copies=2.

We're not using a SAN but a Sun X4270 with sixteen SAS drives (two dedicated to 
the OS, two for SSDs, and an 11+1 raidz).
There's a total of seven datasets in a single pool.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] repost - high read iops

2009-12-29 Thread Brad
@eric

As a general rule of thumb, each vdev has the random performance
roughly the same as a single member of that vdev. Having six RAIDZ
vdevs in a pool should give roughly the performance as a stripe of six
bare drives, for random IO.

It sounds like we'll need 16 vdevs striped in a pool to at least get the 
performance of 15 drives plus another 16 mirrored for redundancy.

If we are bounded in iops by the vdev, would it make sense to go with the bare 
minimum of drives (3) per vdev?

This winds up looking similar to RAID10 in layout, in that you're
striping across a lot of disks that each consists of a mirror, though
the checksumming rules are different. Performance should also be
similar, though it's possible RAID10 may give slightly better random
read performance at the expense of some data quality guarantees, since
I don't believe RAID10 normally validates checksums on returned data
if the device didn't return an error. In normal practice, RAID10 and
a pool of mirrored vdevs should benchmark against each other within
your margin of error.

That's interesting to know that with ZFS's implementation of raid10 it doesn't 
have checksumming built-in.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] repost - high read iops

2009-12-29 Thread Bob Friesenhahn

On Tue, 29 Dec 2009, Ross Walker wrote:


A mirrored raidz provides redundancy at a steep cost to performance and might 
I add a high monetary cost.


I am not sure what a mirrored raidz is.  I have never heard of such 
a thing before.


With raid10 each mirrored pair has the IOPS of a single drive. Since these 
mirrors are typically 2 disk vdevs, you can have a lot more of them and thus 
a lot more IOPS (some people talk about using 3 disk mirrors, but it's 
probably just as good as setting copies=2 on a regular pool of 
mirrors).


This is another case where using a term like raid10 does not make 
sense when discussing zfs.  ZFS does not support raid10.  ZFS does 
not support RAID 0 or RAID 1 so it can't support RAID 1+0 (RAID 10).


Some important points to consider are that every write to a raidz vdev 
must be synchronous.  In other words, the write needs to complete on 
all the drives in the stripe before the write may return as complete. 
This is also true of RAID 1 (mirrors) which specifies that the 
drives are perfect duplicates of each other.  However, zfs does not 
implement RAID 1 either.  This is easily demonstrated since you can 
unplug one side of the mirror and the writes to the zfs mirror will 
still succeed, catching up the mirror which is behind as soon as it is 
plugged back in.  When using mirrors, zfs supports logic which will 
catch that mirror back up (only sending the missing updates) when 
connectivity improves.  With RAID 1 there is no way to recover a 
mirror other than a full copy from the other drive.


Zfs load-shares across vdevs so it will load-share across mirror vdevs 
rather than striping (as RAID 10 would require).


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] repost - high read iops

2009-12-29 Thread Mattias Pantzare
On Tue, Dec 29, 2009 at 18:16, Brad bene...@yahoo.com wrote:
 @eric

 As a general rule of thumb, each vdev has the random performance
 roughly the same as a single member of that vdev. Having six RAIDZ
 vdevs in a pool should give roughly the performance as a stripe of six
 bare drives, for random IO.

 It sounds like we'll need 16 vdevs striped in a pool to at least get the 
 performance of 15 drives plus another 16 mirrored for redundancy.

 If we are bounded in iops by the vdev, would it make sense to go with the 
 bare minimum of drives (3) per vdev?

Minimum is 1 drive per vdev. Minimum with redundancy is 2 if you use
mirroring. You should do mirroring to get the best performance.

 This winds up looking similar to RAID10 in layout, in that you're
 striping across a lot of disks that each consists of a mirror, though
 the checksumming rules are different. Performance should also be
 similar, though it's possible RAID10 may give slightly better random
 read performance at the expense of some data quality guarantees, since
 I don't believe RAID10 normally validates checksums on returned data
 if the device didn't return an error. In normal practice, RAID10 and
 a pool of mirrored vdevs should benchmark against each other within
 your margin of error.

 That's interesting to know that with ZFS's implementation of raid10 it 
 doesn't have checksumming built-in.

He was talking about RAID10, not mirroring in ZFS. ZFS will always use
checksums.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] repost - high read iops

2009-12-29 Thread Eric D. Mudama

On Tue, Dec 29 at  9:16, Brad wrote:

@eric

As a general rule of thumb, each vdev has the random performance
roughly the same as a single member of that vdev. Having six RAIDZ
vdevs in a pool should give roughly the performance as a stripe of six
bare drives, for random IO.

It sounds like we'll need 16 vdevs striped in a pool to at least get
the performance of 15 drives plus another 16 mirrored for redundancy.


If you were striping across 16 devices before, you will achieve
similar random IO performance by striping across 16 vdevs, regardless
of their type.  Sequential throughput is more a function of the number
of devices, not the number of vdevs, in that a 3-disk RAIDZ will have
the sequential write throughput (roughly) of a pair of drives.

You still get checksumming, but if a device fails or you get a
corruption in your non-redundant stripe, zfs may not have enough
information to repair your data.  For a read-only data reference,
maybe a restore from backup in these situations is okay, but for most
installations that is unacceptable.

The disk cost of a raidz pool of mirrors is identical to the disk cost
of raid10.


If we are bounded in iops by the vdev, would it make sense to go
with the bare minimum of drives (3) per vdev?


ZFS supports non-redundant vdev layouts, but they're generally not
recommended.  The smallest mirror you can build is 2 devices, and the
smallest raidz is 3 devices per vdev.


This winds up looking similar to RAID10 in layout, in that you're
striping across a lot of disks that each consists of a mirror, though
the checksumming rules are different. Performance should also be
similar, though it's possible RAID10 may give slightly better random
read performance at the expense of some data quality guarantees, since
I don't believe RAID10 normally validates checksums on returned data
if the device didn't return an error. In normal practice, RAID10 and
a pool of mirrored vdevs should benchmark against each other within
your margin of error.

That's interesting to know that with ZFS's implementation of raid10
it doesn't have checksumming built-in.


I don't believe I said this.  I am reasonably certain that all
zpool/zfs layouts validate checksums, even if built with no
redundancy.  The RAID10-similar layout in ZFS is an array of
mirrors, such that you build a bunch of 2-device mirrored vdevs, and
add them all into a single pool.  You wind up with a layout like:

Pool0
  mirror-0
disk0
disk1
  mirror-1
disk2
disk3
  mirror-2
disk4
disk5
  ...
  mirror-N
disk-2N
disk-2N+1

This will give you the best random IO performance possible with ZFS,
independent of the type of disks used.  (Obviously some of the same
rules may not apply with ramdisks or SSDs, but those are special cases
for most.)
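
Something along these lines would build that layout (pool and device  
names are only an example):

   zpool create Pool0 mirror disk0 disk1 mirror disk2 disk3 mirror disk4 disk5
   zpool add Pool0 mirror disk6 disk7    # grow later, one mirror vdev at a time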

--eric


--
Eric D. Mudama
edmud...@mail.bounceswoosh.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] repost - high read iops

2009-12-29 Thread Richard Elling

On Dec 29, 2009, at 9:16 AM, Brad wrote:


@eric

As a general rule of thumb, each vdev has the random performance
roughly the same as a single member of that vdev. Having six RAIDZ
vdevs in a pool should give roughly the performance as a stripe of six
bare drives, for random IO.


This model begins to break down with raidz2 and further breaks down
with raidz3.  Since I wrote about this simple model here:
http://blogs.sun.com/relling/entry/zfs_raid_recommendations_space_performance
we've refined it a bit, to take into account the number of parity  
devices.


For small, random read IOPS the performance of a single, top-level  
vdev is

performance = performance of a disk * (N / (N - P))

where,
N = number of disks in the vdev
P = number of parity devices in the vdev

For example, using 5 disks @ 100 IOPS we get something like:
2-disk mirror: 200 IOPS
4+1 raidz: 125 IOPS
3+2 raidz2: 167 IOPS
2+3 raidz3:  250 IOPS

Once again, it is clear that mirroring will offer the best small,  
random read IOPS.
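
To make the arithmetic concrete, the same table can be reproduced with a
throwaway awk one-liner (100 IOPS per disk is just the assumed figure
from the example above):

   awk 'BEGIN {
       d = 100                                    # assumed IOPS per disk
       printf "2-disk mirror: %.0f IOPS\n", d * 2 / (2 - 1)
       printf "4+1 raidz:     %.0f IOPS\n", d * 5 / (5 - 1)
       printf "3+2 raidz2:    %.0f IOPS\n", d * 5 / (5 - 2)
       printf "2+3 raidz3:    %.0f IOPS\n", d * 5 / (5 - 3)
   }'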

It sounds like we'll need 16 vdevs striped in a pool to at least get  
the performance of 15 drives plus another 16 mirrored for redundancy.


If we are bounded in iops by the vdev, would it make sense to go  
with the bare minimum of drives (3) per vdev?


This winds up looking similar to RAID10 in layout, in that you're
striping across a lot of disks that each consists of a mirror, though
the checksumming rules are different. Performance should also be
similar, though it's possible RAID10 may give slightly better random
read performance at the expense of some data quality guarantees, since
I don't believe RAID10 normally validates checksums on returned data
if the device didn't return an error. In normal practice, RAID10 and
a pool of mirrored vdevs should benchmark against each other within
your margin of error.

That's interesting to know that with ZFS's implementation of raid10  
it doesn't have checksumming built-in.


ZFS always checksums everything unless you explicitly disable
checksumming for data. Metadata is always checksummed.
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] repost - high read iops

2009-12-29 Thread Tim Cook
On Tue, Dec 29, 2009 at 12:07 PM, Richard Elling
richard.ell...@gmail.com wrote:

 On Dec 29, 2009, at 9:16 AM, Brad wrote:

  @eric

 As a general rule of thumb, each vdev has the random performance
 roughly the same as a single member of that vdev. Having six RAIDZ
 vdevs in a pool should give roughly the performance as a stripe of six
 bare drives, for random IO.


 This model begins to break down with raidz2 and further breaks down
 with raidz3.  Since I wrote about this simple model here:

 http://blogs.sun.com/relling/entry/zfs_raid_recommendations_space_performance
 we've refined it a bit, to take into account the number of parity devices.

 For small, random read IOPS the performance of a single, top-level vdev is
performance = performance of a disk * (N / (N - P))

 where,
N = number of disks in the vdev
P = number of parity devices in the vdev

 For example, using 5 disks @ 100 IOPS we get something like:
2-disk mirror: 200 IOPS
4+1 raidz: 125 IOPS
3+2 raidz2: 167 IOPS
2+3 raidz3:  250 IOPS

 Once again, it is clear that mirroring will offer the best small, random
 read
 IOPS.


  It sounds like we'll need 16 vdevs striped in a pool to at least get the
 performance of 15 drives plus another 16 mirrored for redundancy.

 If we are bounded in iops by the vdev, would it make sense to go with the
 bare minimum of drives (3) per vdev?

 This winds up looking similar to RAID10 in layout, in that you're
 striping across a lot of disks that each consists of a mirror, though
 the checksumming rules are different. Performance should also be
 similar, though it's possible RAID10 may give slightly better random
 read performance at the expense of some data quality guarantees, since
 I don't believe RAID10 normally validates checksums on returned data
 if the device didn't return an error. In normal practice, RAID10 and
 a pool of mirrored vdevs should benchmark against each other within
 your margin of error.

 That's interesting to know that with ZFS's implementation of raid10 it
 doesn't have checksumming built-in.


 ZFS always checksums everything unless you explicitly disable
 checksumming for data. Metadata is always checksummed.
  -- richard



I imagine he's referring to the fact that it cannot fix any checksum errors
it finds.  <flamesuit>Let me open the can of worms by saying this is nearly
as bad as not doing checksumming at all.  Knowing the data is bad when you
can't do anything to fix it doesn't really help if you have no way to
regenerate it.</flamesuit>


-- 
--Tim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] repost - high read iops

2009-12-29 Thread Erik Trimble

Eric D. Mudama wrote:

On Tue, Dec 29 at  9:16, Brad wrote:
The disk cost of a raidz pool of mirrors is identical to the disk cost
of raid10.

ZFS can't do a raidz of mirrors or a mirror of raidz.  Members of a 
mirror or raidz[123] must be a fundamental device (i.e. file or drive)





This winds up looking similar to RAID10 in layout, in that you're
striping across a lot of disks that each consists of a mirror, though
the checksumming rules are different. Performance should also be
similar, though it's possible RAID10 may give slightly better random
read performance at the expense of some data quality guarantees, since
I don't believe RAID10 normally validates checksums on returned data
if the device didn't return an error. In normal practice, RAID10 and
a pool of mirrored vdevs should benchmark against each other within
your margin of error.

That's interesting to know that with ZFS's implementation of raid10
it doesn't have checksumming built-in.


I don't believe I said this.  I am reasonably certain that all
zpool/zfs layouts validate checksums, even if built with no
redundancy.  The RAID10-similar layout in ZFS is an array of
mirrors, such that you build a bunch of 2-device mirrored vdevs, and
add them all into a single pool.  You wind up with a layout like:



Yes. PLEASE be careful - checksumming and redundancy are DIFFERENT concepts.

In ZFS, EVERYTHING is checksummed - the data blocks, and the metadata.  
This is separate from redundancy.  Regardless of the zpool layout 
(mirrors, raidz, or no redundancy), ZFS stores a checksum of all objects 
- this checksum is used to determine if the object has been corrupted.  
This check is done on any /read/.


Should the checksum determine that the object is corrupt, then there are 
two things that can happen:  if your zpool has some form of redundancy 
for that object, ZFS will then reread the object from the redundant side 
of the mirror, or reconstruct the data using parity.  It will then 
re-write the object to another place in the zpool, and eliminate the 
bad object.  Else, if there is no redundancy, then it will fail to 
return the data, and log an error message to the syslog.


In the case of metadata, even in a non-redundant zpool, some of that 
metadata is stored multiple times, so there is the possibility that you 
will be able to recover/reconstruct some metadata which fails checksumming.


In short, Checksumming is how ZFS /determines/ data corruption, and 
Redundancy is how ZFS /fixes/ it.  Checksumming is /always/ present, 
while redundancy depends on the pool layout and options (cf. copies 
property).
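
A quick way to watch both halves at work (pool name is just an example):

   zpool scrub tank        # read back and verify every checksummed block
   zpool status -v tank    # the CKSUM column counts corruption detected
                           # (and, where redundancy exists, repaired)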




--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] repost - high read iops

2009-12-29 Thread Brad
@relling
For small, random read IOPS the performance of a single, top-level
vdev is
performance = performance of a disk * (N / (N - P))  
  133 * 12/(12-1)=
  133 * 12/11

where,
N = number of disks in the vdev
P = number of parity devices in the vdev

performance of a disk = Is this a rough estimate of the disk's IOPS?


For example, using 5 disks @ 100 IOPS we get something like:
2-disk mirror: 200 IOPS
4+1 raidz: 125 IOPS
3+2 raidz2: 167 IOPS
2+3 raidz3: 250 IOPS

So if the rated iops on our disks is @133 iops
133 * 12/(12-1) = 145

11+1 raidz: 145 IOPS?

If that's the rate for a 11+1 raidz vdev, then why is iostat showing
about 700 combined IOPS (reads/writes) per disk?

r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c0t0d0
1402.2 7805.3 2.7 36.2 0.2 54.9 0.0 6.0 0 940 c1
10.8 1.0 0.1 0.0 0.0 0.1 0.0 7.0 0 7 c1t0d0
117.1 640.7 0.2 1.8 0.0 4.5 0.0 5.9 1 76 c1t1d0
116.9 638.2 0.2 1.7 0.0 4.6 0.0 6.1 1 78 c1t2d0
116.4 639.1 0.2 1.8 0.0 4.6 0.0 6.0 1 78 c1t3d0
116.6 638.1 0.2 1.7 0.0 4.6 0.0 6.1 1 77 c1t4d0
113.2 638.0 0.2 1.8 0.0 4.6 0.0 6.1 1 77 c1t5d0
116.6 635.3 0.2 1.7 0.0 4.5 0.0 6.0 1 76 c1t6d0
116.2 637.8 0.2 1.8 0.0 4.7 0.0 6.2 1 79 c1t7d0
115.3 636.7 0.2 1.8 0.0 4.4 0.0 5.8 1 77 c1t8d0
115.4 637.8 0.2 1.8 0.0 4.5 0.0 5.9 1 77 c1t9d0
114.8 635.0 0.2 1.8 0.0 4.3 0.0 5.7 1 76 c1t10d0
114.9 639.9 0.2 1.8 0.0 4.7 0.0 6.2 1 78 c1t11d0
115.1 638.7 0.2 1.8 0.0 4.4 0.0 5.9 1 77 c1t12d0
1.6 140.0 0.0 15.1 0.0 0.6 0.0 4.4 0 8 c1t13d0
1.3 9.1 0.0 0.1 0.0 0.0 0.0 1.0 0 0 c1t14d0
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] repost - high read iops

2009-12-29 Thread Richard Elling

On Dec 29, 2009, at 11:26 AM, Brad wrote:


@relling
For small, random read IOPS the performance of a single, top-level
vdev is
performance = performance of a disk * (N / (N - P))
 133 * 12/(12-1)=
 133 * 12/11

where,
N = number of disks in the vdev
P = number of parity devices in the vdev

performance of a disk = Is this a rough estimate of the disk's IOP?


For example, using 5 disks @ 100 IOPS we get something like:
2-disk mirror: 200 IOPS
4+1 raidz: 125 IOPS
3+2 raidz2: 167 IOPS
2+3 raidz3: 250 IOPS

So if the rated iops on our disks is @133 iops
133 * 12/(12-1) = 145

11+1 raidz: 145 IOPS?

If that's the rate for a 11+1 raidz vdev, then why is iostat showing
about 700 combined IOPS (reads/writes) per disk?


Because the model is for small, random read IOPS
over the full size of the disk. What you are seeing is
caching and seek optimization at work (a good thing).
But, AFAIK,  there are no decent performance models
which take caching into account. In most cases, storage
is sized based on empirical studies.
 -- richard


r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c0t0d0
1402.2 7805.3 2.7 36.2 0.2 54.9 0.0 6.0 0 940 c1
10.8 1.0 0.1 0.0 0.0 0.1 0.0 7.0 0 7 c1t0d0
117.1 640.7 0.2 1.8 0.0 4.5 0.0 5.9 1 76 c1t1d0
116.9 638.2 0.2 1.7 0.0 4.6 0.0 6.1 1 78 c1t2d0
116.4 639.1 0.2 1.8 0.0 4.6 0.0 6.0 1 78 c1t3d0
116.6 638.1 0.2 1.7 0.0 4.6 0.0 6.1 1 77 c1t4d0
113.2 638.0 0.2 1.8 0.0 4.6 0.0 6.1 1 77 c1t5d0
116.6 635.3 0.2 1.7 0.0 4.5 0.0 6.0 1 76 c1t6d0
116.2 637.8 0.2 1.8 0.0 4.7 0.0 6.2 1 79 c1t7d0
115.3 636.7 0.2 1.8 0.0 4.4 0.0 5.8 1 77 c1t8d0
115.4 637.8 0.2 1.8 0.0 4.5 0.0 5.9 1 77 c1t9d0
114.8 635.0 0.2 1.8 0.0 4.3 0.0 5.7 1 76 c1t10d0
114.9 639.9 0.2 1.8 0.0 4.7 0.0 6.2 1 78 c1t11d0
115.1 638.7 0.2 1.8 0.0 4.4 0.0 5.9 1 77 c1t12d0
1.6 140.0 0.0 15.1 0.0 0.6 0.0 4.4 0 8 c1t13d0
1.3 9.1 0.0 0.1 0.0 0.0 0.0 1.0 0 0 c1t14d0
--
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] repost - high read iops

2009-12-29 Thread Eric D. Mudama

On Tue, Dec 29 at 11:14, Erik Trimble wrote:

Eric D. Mudama wrote:

On Tue, Dec 29 at  9:16, Brad wrote:
The disk cost of a raidz pool of mirrors is identical to the disk cost
of raid10.

ZFS can't do a raidz of mirrors or a mirror of raidz.  Members of a 
mirror or raidz[123] must be a fundamental device (i.e. file or 
drive)


Sorry, typo/thinko ... I meant to say a zpool of mirrors, not a raidz
pool of mirrors.

--eric

--
Eric D. Mudama
edmud...@mail.bounceswoosh.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] repost - high read iops

2009-12-29 Thread Ross Walker
On Dec 29, 2009, at 12:36 PM, Bob Friesenhahn bfrie...@simple.dallas.tx.us 
 wrote:



On Tue, 29 Dec 2009, Ross Walker wrote:


A mirrored raidz provides redundancy at a steep cost to performance  
and might I add a high monetary cost.


I am not sure what a mirrored raidz is.  I have never heard of  
such a thing before.


With raid10 each mirrored pair has the IOPS of a single drive.  
Since these mirrors are typically 2 disk vdevs, you can have a lot  
more of them and thus a lot more IOPS (some people talk about using  
3 disk mirrors, but it's probably just as good as setting  
copies=2 on a regular pool of mirrors).


This is another case where using a term like raid10 does not make  
sense when discussing zfs.  ZFS does not support raid10.  ZFS does  
not support RAID 0 or RAID 1 so it can't support RAID 1+0 (RAID 10).


Did it again... I understand the difference. I hope it didn't confuse  
the OP by throwing that out there. What I meant to say was a zpool of  
mirror vdevs.


Some important points to consider are that every write to a raidz  
vdev must be synchronous.  In other words, the write needs to  
complete on all the drives in the stripe before the write may return  
as complete. This is also true of RAID 1 (mirrors) which specifies  
that the drives are perfect duplicates of each other.


I believe mirrored vdevs can do this in parallel though, while raidz  
vdevs need to do this serially due to the ordered nature of the  
transaction which makes the sync writes faster on the mirrors.


 However, zfs does not implement RAID 1 either.  This is easily  
demonstrated since you can unplug one side of the mirror and the  
writes to the zfs mirror will still succeed, catching up the mirror  
which is behind as soon as it is plugged back in.  When using  
mirrors, zfs supports logic which will catch that mirror back up  
(only sending the missing updates) when connectivity improves.  With  
RAID 1 there is no way to recover a mirror other than a full copy  
from the other drive.


That's not completely true these days as a lot of raid implementations  
use bitmaps to track changed blocks and a raid1 continues to function  
when the other side disappears. The real difference is the mirror  
implementation in ZFS is in the file system and not at an abstracted  
block-io layer so it is more intelligent in its use and layout.


Zfs load-shares across vdevs so it will load-share across mirror  
vdevs rather than striping (as RAID 10 would require).


Bob, an interesting question was brought up to me about how copies may  
affect random read performance. I didn't know the answer, but if ZFS  
knows there are additional copies would it not also spread the load  
across those as well to make sure the wait queues on each spindle are  
as even as possible?


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] repost - high read iops

2009-12-28 Thread Richard Elling

Hi Brad, comments below...

On Dec 27, 2009, at 10:24 PM, Brad wrote:

Richard - the l2arc is c1t13d0.  What tools can be used to show the  
l2arc stats?


 raidz1 2.68T   580G543453  4.22M  3.70M
   c1t1d0   -  -258102   689K   358K
   c1t2d0   -  -256103   684K   354K
   c1t3d0   -  -258102   690K   359K
   c1t4d0   -  -260103   687K   354K
   c1t5d0   -  -255101   686K   358K
   c1t6d0   -  -263103   685K   354K
   c1t7d0   -  -259101   689K   358K
   c1t8d0   -  -259103   687K   354K
   c1t9d0   -  -260102   689K   358K
   c1t10d0  -  -263103   686K   354K
   c1t11d0  -  -260102   687K   359K
   c1t12d0  -  -263104   684K   354K
 c1t14d0 396K  29.5G  0 65  7  3.61M
cache-  -  -  -  -  -
 c1t13d029.7G  11.1M157 84  3.93M  6.45M

We've added 16GB to the box, bringing the overall total to 32GB.


In general, this is always a good idea.


arc_max is set to 8GB:
set zfs:zfs_arc_max = 8589934592


You will be well served to give much more memory to the SGA
and reduce what you give to the ARC.  More below...


arc_summary output:
ARC Size:
Current Size: 8192 MB (arcsize)
Target Size (Adaptive):   8192 MB (c)
Min Size (Hard Limit):1024 MB (zfs_arc_min)
Max Size (Hard Limit):8192 MB (zfs_arc_max)

ARC Size Breakdown:
Most Recently Used Cache Size:  39%3243 MB (p)
Most Frequently Used Cache Size:60%4948 MB (c-p)

ARC Efficency:
Cache Access Total: 154663786
Cache Hit Ratio:  41%   64221251   [Defined  
State for buffer]
Cache Miss Ratio: 58%   90442535   [Undefined  
State for Buffer]
REAL Hit Ratio:   41%   64221251   [MRU/MFU Hits  
Only]


Data Demand   Efficiency:38%
Data Prefetch Efficiency:DISABLED (zfs_prefetch_disable)

   CACHE HITS BY CACHE LIST:
 Anon:   --%Counter Rolled.
 Most Recently Used: 17%8906  
(mru) [ Return Customer ]
 Most Frequently Used:   82%53102345  
(mfu) [ Frequent Customer ]
 Most Recently Used Ghost:   14%9427708  
(mru_ghost)[ Return Customer Evicted, Now Back ]
 Most Frequently Used Ghost:  6%4344287  
(mfu_ghost)[ Frequent Customer Evicted, Now Back ]

   CACHE HITS BY DATA TYPE:
 Demand Data:84%5108
 Prefetch Data:   0%0
 Demand Metadata:15%9777143
 Prefetch Metadata:   0%0
   CACHE MISSES BY DATA TYPE:
 Demand Data:96%87542292
 Prefetch Data:   0%0
 Demand Metadata: 3%2900243
 Prefetch Metadata:   0%0


Also disabled file-level prefetch and vdev cache max:
set zfs:zfs_prefetch_disable = 1
set zfs:zfs_vdev_cache_max = 0x1


I think this is a waste of time. The database will prefetch,
by default, so you might as well start that work ahead of time.
Note that ZFS uses an intelligent prefetch algorithm, so if it
detects that the accesses are purely random, it won't prefetch.
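
If you want to back that experiment out, the tunable just goes back to its
default (a sketch; it is read from /etc/system at the next reboot):

   set zfs:zfs_prefetch_disable = 0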

After reading about some issues with concurrent ios, I tweaked the  
setting down from 35 to 1 and it reduced the response times greatly  
(2 - 8ms):

set zfs:zfs_vdev_max_pending=1


This can be a red herring.  Judging by the number of IOPS below,
it has not improved. At this point, I will assume you are using
disks that have NCQ or CTQ (eg most SATA and all FC/SAS drives).
If you only issue one command at a time, you effectively disable
NCQ and thus cannot take advantage of its efficiencies.

It did increase the actv...I'm still unsure about the side-effects  
here:

   r/sw/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
   0.00.00.00.0  0.0  0.00.00.0   0   0 c0
   0.00.00.00.0  0.0  0.00.00.0   0   0 c0t0d0
2295.2  398.74.27.2  0.0 18.60.06.9   0 1084 c1
   0.00.80.00.0  0.0  0.00.00.1   0   0 c1t0d0
 190.3   22.90.40.0  0.0  1.50.07.0   0  87 c1t1d0
 180.9   20.60.30.0  0.0  1.70.08.5   0  95 c1t2d0
 195.0   43.00.30.2  0.0  1.60.06.8   0  93 c1t3d0
 193.2   21.70.40.0  0.0  1.50.06.8   0  88 c1t4d0
 195.7   34.80.30.1  0.0  1.70.07.5   0  97 c1t5d0
 186.8   20.60.30.0  0.0  1.50.07.3   0  88 c1t6d0
 188.4   21.00.40.0  0.0  1.60.07.7   0  91 c1t7d0
 189.6   21.20.30.0  0.0  1.60.07.4   0  91 c1t8d0
 193.8   22.60.40.0  0.0  1.50.07.1   0  91 c1t9d0
 192.6   20.8 

Re: [zfs-discuss] repost - high read iops

2009-12-28 Thread Brad
Try an SGA more like 20-25 GB. Remember, the database can cache more
effectively than any file system underneath. The best I/O is the I/O
you don't have to make.

We'll be turning up the SGA size from 4GB to 16GB.
The arc size will be set from 8GB to 4GB.
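
For reference, the new cap will just be the same /etc/system line with a
smaller value (picked up at the next reboot):

   set zfs:zfs_arc_max = 4294967296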

This can be a red herring. Judging by the number of IOPS below,
it has not improved. At this point, I will assume you are using
disks that have NCQ or CTQ (eg most SATA and all FC/SAS drives).
If you only issue one command at a time, you effectively disable
NCQ and thus cannot take advantage of its efficiencies.

Here's another sample of the data taken at another time, after the number of 
concurrent I/Os was changed from 10 to 1.  We're using Seagate Savvio 10K SAS 
drives...I could not find out whether the drives support NCQ or not.  What's the 
recommended value to set concurrent I/Os to?  

r/sw/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
0.00.00.00.0  0.0  0.00.00.0   0   0 c0
0.00.00.00.0  0.0  0.00.00.0   0   0 c0t0d0
 1402.2 7805.32.7   36.2  0.2 54.90.06.0   0 940 c1
   10.81.00.10.0  0.0  0.10.07.0   0   7 c1t0d0
  117.1  640.70.21.8  0.0  4.50.05.9   1  76 c1t1d0
  116.9  638.20.21.7  0.0  4.60.06.1   1  78 c1t2d0
  116.4  639.10.21.8  0.0  4.60.06.0   1  78 c1t3d0
  116.6  638.10.21.7  0.0  4.60.06.1   1  77 c1t4d0
  113.2  638.00.21.8  0.0  4.60.06.1   1  77 c1t5d0
  116.6  635.30.21.7  0.0  4.50.06.0   1  76 c1t6d0
  116.2  637.80.21.8  0.0  4.70.06.2   1  79 c1t7d0
  115.3  636.70.21.8  0.0  4.40.05.8   1  77 c1t8d0
  115.4  637.80.21.8  0.0  4.50.05.9   1  77 c1t9d0
  114.8  635.00.21.8  0.0  4.30.05.7   1  76 c1t10d0
  114.9  639.90.21.8  0.0  4.70.06.2   1  78 c1t11d0
  115.1  638.70.21.8  0.0  4.40.05.9   1  77 c1t12d0
1.6  140.00.0   15.1  0.0  0.60.04.4   0   8 c1t13d0
1.39.10.00.1  0.0  0.00.01.0   0   0 c1t14d0
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] repost - high read iops

2009-12-28 Thread Richard Elling

On Dec 28, 2009, at 12:40 PM, Brad wrote:


Try an SGA more like 20-25 GB. Remember, the database can cache more
effectively than any file system underneath. The best I/O is the I/O
you don't have to make.

We'll be turning up the SGA size from 4GB to 16GB.
The arc size will be set from 8GB to 4GB.


This doesn't make sense to me. You've got 32 GB, why not use it?
Artificially limiting the memory use to 20 GB seems like a waste of
good money.


This can be a red herring. Judging by the number of IOPS below,
it has not improved. At this point, I will assume you are using
disks that have NCQ or CTQ (eg most SATA and all FC/SAS drives).
If you only issue one command at a time, you effectively disable
NCQ and thus cannot take advantage of its efficiencies.

Here's another sample of the data taken at another time after the  
number of concurrent ios change from 10 to 1.  We're using Seagate  
Savio 10K SAS drives...I could not pull up info if the drives  
support NCQ or not.  What's the recommended value to set concurrent  
IOs to?


   r/sw/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
   0.00.00.00.0  0.0  0.00.00.0   0   0 c0
   0.00.00.00.0  0.0  0.00.00.0   0   0 c0t0d0
1402.2 7805.32.7   36.2  0.2 54.90.06.0   0 940 c1
  10.81.00.10.0  0.0  0.10.07.0   0   7 c1t0d0
 117.1  640.70.21.8  0.0  4.50.05.9   1  76 c1t1d0
 116.9  638.20.21.7  0.0  4.60.06.1   1  78 c1t2d0
 116.4  639.10.21.8  0.0  4.60.06.0   1  78 c1t3d0
 116.6  638.10.21.7  0.0  4.60.06.1   1  77 c1t4d0
 113.2  638.00.21.8  0.0  4.60.06.1   1  77 c1t5d0
 116.6  635.30.21.7  0.0  4.50.06.0   1  76 c1t6d0
 116.2  637.80.21.8  0.0  4.70.06.2   1  79 c1t7d0
 115.3  636.70.21.8  0.0  4.40.05.8   1  77 c1t8d0
 115.4  637.80.21.8  0.0  4.50.05.9   1  77 c1t9d0
 114.8  635.00.21.8  0.0  4.30.05.7   1  76 c1t10d0
 114.9  639.90.21.8  0.0  4.70.06.2   1  78 c1t11d0
 115.1  638.70.21.8  0.0  4.40.05.9   1  77 c1t12d0
   1.6  140.00.0   15.1  0.0  0.60.04.4   0   8 c1t13d0
   1.39.10.00.1  0.0  0.00.01.0   0   0 c1t14d0


SAS will be CTQ, basically the same thing as NCQ for SATA disks.
You can see here that you're averaging 4.6 I/Os queued at the
disks (actv column) and the response time is quite good.
Meanwhile, the disks are handling more than 700 IOPS with
less than 10 ms response time.  Not bad at all for HDDs, but
not a level that can be expected, either. Here we see more
than 600 small write IOPS. These will be sequential (as in
contiguous blocks, not sequential as in large blocks) so
they get buffered and efficiently written by the disk.  When
your workload returns to the read-mostly random activity,
then the IOPS will go down.

As for what the magic number is, it is hard to say.  In this case,
more than 4 is good.  Remember, the default of 35 is as much
of a guess as anything.  For HDDs, 35 might be a little bit too
much, but for a RAID array, something more like 1,000 might
be optimal.  Keeping an eye on the actv column of iostat can
help you make that decision.
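
For example, something like this lets you experiment without a reboot (a
sketch only; the usual caveats about poking kernel variables with mdb apply):

   iostat -xn 10                                 # watch actv and asvc_t settle
   echo "zfs_vdev_max_pending/W0t10" | mdb -kw   # try 10 outstanding I/Os per device
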
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] repost - high read iops

2009-12-28 Thread Brad
This doesn't make sense to me. You've got 32 GB, why not use it?
Artificially limiting the memory use to 20 GB seems like a waste of
good money.

I'm having a hard time convincing the DBAs to increase the size of the SGA to 
20GB because their philosophy is that, no matter what, eventually you'll have to 
hit disk to pick up data that's not stored in cache (arc or l2arc).  The typical 
database server in our environment holds over 3TB of data.

If the performance does not improve then we'll possibly have to change the raid 
layout from raidz to a raid10.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] repost - high read iops

2009-12-28 Thread Bob Friesenhahn

On Mon, 28 Dec 2009, Brad wrote:

I'm having a hard time convincing the dbas to increase the size of 
the SGA to 20GB because their philosophy is, no matter what 
eventually you'll have to hit disk to pick up data thats not stored 
in cache (arc or l2arc).  The typical database server in our 
environment holds over 3TB of data.


But if the working set is 25GB, then things will be magically better. 
If it is 50GB or 500GB, then performance may still suck.


If the performance does not improve then we'll possibly have to 
change the raid layout from raidz to a raid10.


Mirror vdevs are what is definitely recommended for use with 
databases.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] repost - high read iops

2009-12-28 Thread Richard Elling

On Dec 28, 2009, at 1:40 PM, Brad wrote:


This doesn't make sense to me. You've got 32 GB, why not use it?
Artificially limiting the memory use to 20 GB seems like a waste of
good money.

I'm having a hard time convincing the dbas to increase the size of  
the SGA to 20GB because their philosophy is, no matter what  
eventually you'll have to hit disk to pick up data thats not stored  
in cache (arc or l2arc).  The typical database server in our  
environment holds over 3TB of data.


Wow!  Where did you find DBAs who didn't want more resources? :-)
If that is the case, then you might need many more disks to keep the
(hungry) database fed.

If the performance does not improve then we'll possibly have to  
change the raid layout from raidz to a raid10.


Yes, the notions of adding more disks and of using them as mirrors
are closely aligned. However, you know that the data in the ARC
is more than 50% frequently used, which makes the argument that
a larger SGA (or ARC) should benefit the workload.
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] repost - high read iops

2009-12-27 Thread Richard Elling

OK, I'll take a stab at it...

On Dec 26, 2009, at 9:52 PM, Brad wrote:


repost - Sorry for ccing the other forums.

I'm running into an issue where there seems to be a high number of  
read iops hitting disks and physical free memory is fluctuating  
between 200MB - 450MB out of 16GB total. We have the l2arc  
configured on a 32GB Intel X25-E ssd and slog on another 32GB X25-E  
ssd.


OK, this shows that memory is being used... a good thing.

According to our tester, Oracle writes are extremely slow (high  
latency).


OK, this is a workable problem statement... another good thing.


Below is a snippet of iostat:

r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c0t0d0
4898.3 34.2 23.2 1.4 0.1 385.3 0.0 78.1 0 1246 c1
0.0 0.8 0.0 0.0 0.0 0.0 0.0 16.0 0 1 c1t0d0
401.7 0.0 1.9 0.0 0.0 31.5 0.0 78.5 1 100 c1t1d0
421.2 0.0 2.0 0.0 0.0 30.4 0.0 72.3 1 98 c1t2d0
403.9 0.0 1.9 0.0 0.0 32.0 0.0 79.2 1 100 c1t3d0
406.7 0.0 2.0 0.0 0.0 33.0 0.0 81.3 1 100 c1t4d0
414.2 0.0 1.9 0.0 0.0 28.6 0.0 69.1 1 98 c1t5d0
406.3 0.0 1.8 0.0 0.0 32.1 0.0 79.0 1 100 c1t6d0
404.3 0.0 1.9 0.0 0.0 31.9 0.0 78.8 1 100 c1t7d0
404.1 0.0 1.9 0.0 0.0 34.0 0.0 84.1 1 100 c1t8d0
407.1 0.0 1.9 0.0 0.0 31.2 0.0 76.6 1 100 c1t9d0
407.5 0.0 2.0 0.0 0.0 33.2 0.0 81.4 1 100 c1t10d0
402.8 0.0 2.0 0.0 0.0 33.5 0.0 83.2 1 100 c1t11d0
408.9 0.0 2.0 0.0 0.0 32.8 0.0 80.3 1 100 c1t12d0
9.6 10.8 0.1 0.9 0.0 0.4 0.0 20.1 0 17 c1t13d0
0.0 22.7 0.0 0.5 0.0 0.5 0.0 22.8 0 33 c1t14d0


You are getting 400+ IOPS @ 4 KB out of HDDs.  Count your lucky stars.
Don't expect that kind of performance as normal, it is much better than
normal.

Is this an indicator that we need more physical memory? From http://blogs.sun.com/brendan/entry/test 
, the order that a read request is satisfied is:


   0) Oracle SGA

1) ARC
2) vdev cache of L2ARC devices
3) L2ARC devices
4) vdev cache of disks
5) disks

Using arc_summary.pl, we determined that prefetch was not helping  
much, so we disabled it.


CACHE HITS BY DATA TYPE:
Demand Data: 22% 158853174
Prefetch Data: 17% 123009991 ---not helping???
Demand Metadata: 60% 437439104
Prefetch Metadata: 0% 2446824

The write iops started to kick in more and latency reduced on  
spinning disks:


0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c0t0d0
1629.0 968.0 17.4 7.3 0.0 35.9 0.0 13.8 0 1088 c1
0.0 1.9 0.0 0.0 0.0 0.0 0.0 1.7 0 0 c1t0d0
126.7 67.3 1.4 0.2 0.0 2.9 0.0 14.8 0 90 c1t1d0
129.7 76.1 1.4 0.2 0.0 2.8 0.0 13.7 0 90 c1t2d0
128.0 73.9 1.4 0.2 0.0 3.2 0.0 16.0 0 91 c1t3d0
128.3 79.1 1.3 0.2 0.0 3.6 0.0 17.2 0 92 c1t4d0
125.8 69.7 1.3 0.2 0.0 2.9 0.0 14.9 0 89 c1t5d0
128.3 81.9 1.4 0.2 0.0 2.8 0.0 13.1 0 89 c1t6d0
128.1 69.2 1.4 0.2 0.0 3.1 0.0 15.7 0 93 c1t7d0
128.3 80.3 1.4 0.2 0.0 3.1 0.0 14.7 0 91 c1t8d0
129.2 69.3 1.4 0.2 0.0 3.0 0.0 15.2 0 90 c1t9d0
130.1 80.0 1.4 0.2 0.0 2.9 0.0 13.6 0 89 c1t10d0
126.2 72.6 1.3 0.2 0.0 2.8 0.0 14.2 0 89 c1t11d0
129.7 81.0 1.4 0.2 0.0 2.7 0.0 12.9 0 88 c1t12d0
90.4 41.3 1.0 4.0 0.0 0.2 0.0 1.2 0 6 c1t13d0
0.0 24.3 0.0 1.2 0.0 0.0 0.0 0.2 0 0 c1t14d0


latency is reduced, but you are also now only seeing 200 IOPS,
not 400+ IOPS.  This is closer to what you would see as a max
for HDDs.

I cannot tell which device is the cache device.  I would expect
to see one disk with significantly more reads than the others.
What do the l2arc stats show?
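
On builds with L2ARC support they are part of the normal ARC kstats, so
something like this should show them:

   kstat -p zfs:0:arcstats | grep l2_    # l2_hits, l2_misses, l2_size, ...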

Is it true if your MFU stats start to go over 50% then more memory  
is needed?


That is a good indicator. It means that most of the cache entries are
frequently used. Grow your SGA and you should see this go down.


CACHE HITS BY CACHE LIST:
Anon: 10% 74845266 [ New Customer, First Cache Hit ]
Most Recently Used: 19% 140478087 (mru) [ Return Customer ]
Most Frequently Used: 65% 475719362 (mfu) [ Frequent Customer ]
Most Recently Used Ghost: 2% 20785604 (mru_ghost) [ Return Customer  
Evicted, Now Back ]
Most Frequently Used Ghost: 1% 9920089 (mfu_ghost) [ Frequent  
Customer Evicted, Now Back ]

CACHE HITS BY DATA TYPE:
Demand Data: 22% 158852935
Prefetch Data: 17% 123009991
Demand Metadata: 60% 437438658
Prefetch Metadata: 0% 2446824

My theory is that since there's not enough memory for the arc to cache  
data, it hits the l2arc, where it can't find the data and has to query  
the disk for the request. This causes contention between reads and  
writes, causing the service times to inflate.


If you have a choice of where to use memory, always choose closer to
the application. Try a larger SGA first.  Be aware of large page  
stealing -- consider increasing the SGA immediately after a reboot and
before the database or applications are started.
 -- richard


uname: 5.10 Generic_141445-09 i86pc i386 i86pc
Sun Fire X4270: 11+1 raidz (SAS)
   l2arc Intel X25-E
   slog Intel X25-E
Thoughts?
--
This message posted from opensolaris.org
___
zfs-discuss mailing list

Re: [zfs-discuss] repost - high read iops

2009-12-27 Thread Brad
Richard - the l2arc is c1t13d0.  What tools can be used to show the l2arc stats?

  raidz1 2.68T   580G543453  4.22M  3.70M
c1t1d0   -  -258102   689K   358K
c1t2d0   -  -256103   684K   354K
c1t3d0   -  -258102   690K   359K
c1t4d0   -  -260103   687K   354K
c1t5d0   -  -255101   686K   358K
c1t6d0   -  -263103   685K   354K
c1t7d0   -  -259101   689K   358K
c1t8d0   -  -259103   687K   354K
c1t9d0   -  -260102   689K   358K
c1t10d0  -  -263103   686K   354K
c1t11d0  -  -260102   687K   359K
c1t12d0  -  -263104   684K   354K
  c1t14d0 396K  29.5G  0 65  7  3.61M
cache-  -  -  -  -  -
  c1t13d029.7G  11.1M157 84  3.93M  6.45M

We've added 16GB to the box, bringing the overall total to 32GB.
arc_max is set to 8GB:
set zfs:zfs_arc_max = 8589934592

arc_summary output:
ARC Size:
 Current Size: 8192 MB (arcsize)
 Target Size (Adaptive):   8192 MB (c)
 Min Size (Hard Limit):1024 MB (zfs_arc_min)
 Max Size (Hard Limit):8192 MB (zfs_arc_max)

ARC Size Breakdown:
 Most Recently Used Cache Size:  39%3243 MB (p)
 Most Frequently Used Cache Size:60%4948 MB (c-p)

ARC Efficency:
 Cache Access Total: 154663786
 Cache Hit Ratio:  41%   64221251   [Defined State for 
buffer]
 Cache Miss Ratio: 58%   90442535   [Undefined State for 
Buffer]
 REAL Hit Ratio:   41%   64221251   [MRU/MFU Hits Only]

 Data Demand   Efficiency:38%
 Data Prefetch Efficiency:DISABLED (zfs_prefetch_disable)

CACHE HITS BY CACHE LIST:
  Anon:   --%Counter Rolled.
  Most Recently Used: 17%8906 (mru) [ 
Return Customer ]
  Most Frequently Used:   82%53102345 (mfu) [ 
Frequent Customer ]
  Most Recently Used Ghost:   14%9427708 (mru_ghost)[ 
Return Customer Evicted, Now Back ]
  Most Frequently Used Ghost:  6%4344287 (mfu_ghost)[ 
Frequent Customer Evicted, Now Back ]
CACHE HITS BY DATA TYPE:
  Demand Data:84%5108
  Prefetch Data:   0%0
  Demand Metadata:15%9777143
  Prefetch Metadata:   0%0
CACHE MISSES BY DATA TYPE:
  Demand Data:96%87542292
  Prefetch Data:   0%0
  Demand Metadata: 3%2900243
  Prefetch Metadata:   0%0


Also disabled file-level prefetch and vdev cache max:
set zfs:zfs_prefetch_disable = 1
set zfs:zfs_vdev_cache_max = 0x1

After reading about some issues with concurrent ios, I tweaked the setting down 
from 35 to 1 and it reduced the response times greatly (2 - 8ms):
set zfs:zfs_vdev_max_pending=1

It did increase the actv...I'm still unsure about the side-effects here:
r/sw/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
0.00.00.00.0  0.0  0.00.00.0   0   0 c0
0.00.00.00.0  0.0  0.00.00.0   0   0 c0t0d0
 2295.2  398.74.27.2  0.0 18.60.06.9   0 1084 c1
0.00.80.00.0  0.0  0.00.00.1   0   0 c1t0d0
  190.3   22.90.40.0  0.0  1.50.07.0   0  87 c1t1d0
  180.9   20.60.30.0  0.0  1.70.08.5   0  95 c1t2d0
  195.0   43.00.30.2  0.0  1.60.06.8   0  93 c1t3d0
  193.2   21.70.40.0  0.0  1.50.06.8   0  88 c1t4d0
  195.7   34.80.30.1  0.0  1.70.07.5   0  97 c1t5d0
  186.8   20.60.30.0  0.0  1.50.07.3   0  88 c1t6d0
  188.4   21.00.40.0  0.0  1.60.07.7   0  91 c1t7d0
  189.6   21.20.30.0  0.0  1.60.07.4   0  91 c1t8d0
  193.8   22.60.40.0  0.0  1.50.07.1   0  91 c1t9d0
  192.6   20.80.30.0  0.0  1.40.06.8   0  88 c1t10d0
  195.7   22.20.30.0  0.0  1.50.06.7   0  88 c1t11d0
  184.7   20.30.30.0  0.0  1.40.06.8   0  84 c1t12d0
7.3   82.40.15.5  0.0  0.00.00.2   0   1 c1t13d0
1.3   23.90.01.3  0.0  0.00.00.2   0   0 c1t14d0

I'm still in talks with the DBAs about raising the SGA from 4GB to 6GB to see 
if it'll help.

The changes that showed a lot of improvement are disabling file/device-level 
prefetch and reducing concurrent I/Os from 35 to 1 (tried 10 but it didn't 
help much).  Is there anything else that could be tweaked to increase write 
performance?  Record sizes are set to 8K, and to 128K for the redo logs.
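
Roughly, the per-dataset settings look like this (dataset names here are  
illustrative):

   zfs set recordsize=8k dbpool/oradata    # 8K for the datafile datasets
   zfs set recordsize=128k dbpool/redo     # 128K for the redo logs
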
-- 
This 

[zfs-discuss] repost - high read iops

2009-12-26 Thread Brad
repost - Sorry for ccing the other forums.

I'm running into an issue where there seems to be a high number of read iops 
hitting disks and physical free memory is fluctuating between 200MB - 450MB 
out of 16GB total. We have the l2arc configured on a 32GB Intel X25-E ssd and 
slog on another 32GB X25-E ssd.

According to our tester, Oracle writes are extremely slow (high latency).

Below is a snippet of iostat:

r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c0t0d0
4898.3 34.2 23.2 1.4 0.1 385.3 0.0 78.1 0 1246 c1
0.0 0.8 0.0 0.0 0.0 0.0 0.0 16.0 0 1 c1t0d0
401.7 0.0 1.9 0.0 0.0 31.5 0.0 78.5 1 100 c1t1d0
421.2 0.0 2.0 0.0 0.0 30.4 0.0 72.3 1 98 c1t2d0
403.9 0.0 1.9 0.0 0.0 32.0 0.0 79.2 1 100 c1t3d0
406.7 0.0 2.0 0.0 0.0 33.0 0.0 81.3 1 100 c1t4d0
414.2 0.0 1.9 0.0 0.0 28.6 0.0 69.1 1 98 c1t5d0
406.3 0.0 1.8 0.0 0.0 32.1 0.0 79.0 1 100 c1t6d0
404.3 0.0 1.9 0.0 0.0 31.9 0.0 78.8 1 100 c1t7d0
404.1 0.0 1.9 0.0 0.0 34.0 0.0 84.1 1 100 c1t8d0
407.1 0.0 1.9 0.0 0.0 31.2 0.0 76.6 1 100 c1t9d0
407.5 0.0 2.0 0.0 0.0 33.2 0.0 81.4 1 100 c1t10d0
402.8 0.0 2.0 0.0 0.0 33.5 0.0 83.2 1 100 c1t11d0
408.9 0.0 2.0 0.0 0.0 32.8 0.0 80.3 1 100 c1t12d0
9.6 10.8 0.1 0.9 0.0 0.4 0.0 20.1 0 17 c1t13d0
0.0 22.7 0.0 0.5 0.0 0.5 0.0 22.8 0 33 c1t14d0

Is this an indicator that we need more physical memory? From 
http://blogs.sun.com/brendan/entry/test, the order that a read request is 
satisfied is:

1) ARC
2) vdev cache of L2ARC devices
3) L2ARC devices
4) vdev cache of disks
5) disks

Using arc_summary.pl, we determined that prefetch was not helping much, so we 
disabled it.

CACHE HITS BY DATA TYPE:
Demand Data: 22% 158853174
Prefetch Data: 17% 123009991 ---not helping???
Demand Metadata: 60% 437439104
Prefetch Metadata: 0% 2446824

The write iops started to kick in more and latency reduced on spinning disks:

0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c0t0d0
1629.0 968.0 17.4 7.3 0.0 35.9 0.0 13.8 0 1088 c1
0.0 1.9 0.0 0.0 0.0 0.0 0.0 1.7 0 0 c1t0d0
126.7 67.3 1.4 0.2 0.0 2.9 0.0 14.8 0 90 c1t1d0
129.7 76.1 1.4 0.2 0.0 2.8 0.0 13.7 0 90 c1t2d0
128.0 73.9 1.4 0.2 0.0 3.2 0.0 16.0 0 91 c1t3d0
128.3 79.1 1.3 0.2 0.0 3.6 0.0 17.2 0 92 c1t4d0
125.8 69.7 1.3 0.2 0.0 2.9 0.0 14.9 0 89 c1t5d0
128.3 81.9 1.4 0.2 0.0 2.8 0.0 13.1 0 89 c1t6d0
128.1 69.2 1.4 0.2 0.0 3.1 0.0 15.7 0 93 c1t7d0
128.3 80.3 1.4 0.2 0.0 3.1 0.0 14.7 0 91 c1t8d0
129.2 69.3 1.4 0.2 0.0 3.0 0.0 15.2 0 90 c1t9d0
130.1 80.0 1.4 0.2 0.0 2.9 0.0 13.6 0 89 c1t10d0
126.2 72.6 1.3 0.2 0.0 2.8 0.0 14.2 0 89 c1t11d0
129.7 81.0 1.4 0.2 0.0 2.7 0.0 12.9 0 88 c1t12d0
90.4 41.3 1.0 4.0 0.0 0.2 0.0 1.2 0 6 c1t13d0
0.0 24.3 0.0 1.2 0.0 0.0 0.0 0.2 0 0 c1t14d0


Is it true if your MFU stats start to go over 50% then more memory is needed?
CACHE HITS BY CACHE LIST:
Anon: 10% 74845266 [ New Customer, First Cache Hit ]
Most Recently Used: 19% 140478087 (mru) [ Return Customer ]
Most Frequently Used: 65% 475719362 (mfu) [ Frequent Customer ]
Most Recently Used Ghost: 2% 20785604 (mru_ghost) [ Return Customer Evicted, 
Now Back ]
Most Frequently Used Ghost: 1% 9920089 (mfu_ghost) [ Frequent Customer Evicted, 
Now Back ]
CACHE HITS BY DATA TYPE:
Demand Data: 22% 158852935
Prefetch Data: 17% 123009991
Demand Metadata: 60% 437438658
Prefetch Metadata: 0% 2446824

My theory is that since there's not enough memory for the arc to cache data, it 
hits the l2arc, where it can't find the data and has to query the disk for the 
request. This causes contention between reads and writes, causing the service 
times to inflate.

uname: 5.10 Generic_141445-09 i86pc i386 i86pc
Sun Fire X4270: 11+1 raidz (SAS)
l2arc Intel X25-E
slog Intel X25-E
Thoughts?
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss