Re: [zfs-discuss] RaidzN blocksize ... or blocksize in general ... and resilver

2010-10-20 Thread Edward Ned Harvey
 From: Edward Ned Harvey [mailto:sh...@nedharvey.com]
  
 Let's crunch some really quick numbers here.  Suppose a 6Gbit/sec
 sas/sata bus, with 6 disks in a raid-5.  Each disk is 1TB, 1000G, and
 each disk is capable of sustaining 1 Gbit/sec sequential operations.
 These are typical measurements for systems I use.  Then 1000G =
 8000Gbit.  It will take 8000 sec to resilver = 133min.  So whenever
 people have resilver times longer than that ... It's because ZFS
 resilver code for raidzN is inefficient.

I hate to be the unfortunate one verifying my own point here, but:

One of the above mentioned disks needed to be resilvered yesterday.
(Actually a 2T disk.)  It has now resilvered 1.12T in 18.5 hrs, and has 10.5
hrs remaining.  This is a mirror.  The problem would be several times worse
if it were a raidz.

So I guess it's unfair to say raidz is inefficient at resilvering.  The
truth is, ZFS in general is inefficient at resilvering, but the problem is
several times worse on raidz than it is for mirrors.  The more disks in the
vdev, the worse the problem.  The fewer vdevs in the pool, the worse the
problem.  So you're able to minimize the problem by using a bunch of mirrors
instead of raidzN.

Although the problem exists on mirrors too, it's nothing so dramatic that I
would destroy & recreate my pool because of it.  People with raidzN often
do.



Re: [zfs-discuss] RaidzN blocksize ... or blocksize in general ... and resilver

2010-10-20 Thread Trond Michelsen
On Wed, Oct 20, 2010 at 2:50 PM, Edward Ned Harvey sh...@nedharvey.com wrote:
 One of the above mentioned disks needed to be resilvered yesterday.
 (Actually a 2T disk.)  It has now resilvered 1.12T in 18.5 hrs, and has 10.5
 hrs remaining.  This is a mirror.  The problem would be several times worse
 if it were a raidz.

Is this one of those Advanced Format drives (Western Digital EARS or
Samsung F4) which emulate 512-byte sectors? Or is that only a
problem with raidz anyway?

-- 
Trond Michelsen


Re: [zfs-discuss] RaidzN blocksize ... or blocksize in general ... and resilver

2010-10-20 Thread Erik Trimble
On Mon, 2010-10-18 at 17:32 -0400, Edward Ned Harvey wrote:
  From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
  boun...@opensolaris.org] On Behalf Of Marty Scholes
  
  Would it make sense for scrub/resilver to be more aware of operating in
  disk order instead of zfs order?
 
 It would certainly make sense.  As mentioned, even if you do the entire disk
 this way, including unused space, it is faster than making the poor little
 disks randomly seek all over the place for tiny little fragments that
 eventually add up to a significant portion of the whole disk.
 
 The main question is:  How difficult would it be to implement?
 


Ideally, you want the best of both worlds:  ZFS is currently *much*
faster when doing partial resyncs (i.e. updating stale drives) by using
the walk-the-metadata-tree method.   However, it would be nice to have
it recognize when a full disk rebuild is required, and switch to some
form of a full disk sequential copy.
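As a conceptual sketch only (this is not how ZFS is actually implemented,
and every name below is made up for illustration), the hybrid logic could
look something like:

    # Sketch: choose a resilver strategy based on why the disk needs data.
    def resilver(vdev, target_disk):
        if target_disk.is_blank_replacement():
            # Whole-disk rebuild: stream allocated regions in LBA order so
            # every disk stays in sequential-transfer mode.
            for region in vdev.allocated_regions_in_lba_order():
                target_disk.write(region.offset, vdev.reconstruct(region))
        else:
            # Stale disk rejoining: walk the block-pointer tree and copy
            # only blocks written since the disk dropped out.
            for bp in vdev.blocks_newer_than(target_disk.last_known_txg):
                target_disk.write(bp.offset, vdev.reconstruct(bp))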


The problem with a full sequential copy is threefold, however:

(a) you (often) copy a whole lot of bits that aren't actually holding
any valuable info;

(b) it can get a little tricky distinguishing between the case of an
interrupted full-disk resilver and a freshen-the-stale-drive resilver;

(c) you generally punt on any advantage of knowing how the pool is
structured.


Frankly, if I could ever figure out when the mythical BP rewrite (or
equivalent feature) will appear, I'd be able to implement a defragger
(or, maybe, a compactor is a better term). Having a defrag util keep
the zpool relatively compacted would seriously reduce the work in a
resilver.


-- 
Erik Trimble
Java System Support
Mailstop:  usca22-317
Phone:  x67195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)



Re: [zfs-discuss] RaidzN blocksize ... or blocksize in general ... and resilver

2010-10-18 Thread Edward Ned Harvey
 From: Richard Elling [mailto:richard.ell...@gmail.com]
 
  This is one of the reasons the raidzN resilver code is inefficient.
  Since you end up waiting for the slowest seek time of any one disk in
  the vdev, and when that's done, the amount of data you were able to
  process was at most 128K.  Rinse and repeat.
 
 How is this different than all other RAID implementations?

Hardware raid has the disadvantage that it must resilver the whole disk
regardless of how much of the disk is used.  Hardware raid has the advantage
that it will resilver sequentially, so despite the fact that it resilvers
unused space, it is limited by sustainable throughput instead of random seek
time.  The resilver time for hardware raid is a constant regardless of what
the OS has done with the disks over time (neglecting system usage during
resilver).

If your ZFS vdev is significantly full, with data that was written, and
snapshotted, and rewritten, and snapshots destroyed, etc etc etc ... typical
usage for a system that has been in production for a while ... then the time
to resilver the whole disk block-by-block will be lower than the time to
resilver the used portions in order of allocation time.  This is why
sometimes the ZFS resilver time for a raidzN can be higher than the time to
resilver a similar hardware raid.  As evidenced by the frequent comments and
complaints on this list about raidzN resilver time.

Let's crunch some really quick numbers here.  Suppose a 6Gbit/sec sas/sata
bus, with 6 disks in a raid-5.  Each disk is 1TB, 1000G, and each disk is
capable of sustaining 1 Gbit/sec sequential operations.  These are typical
measurements for systems I use.  Then 1000G = 8000 Gbit.  It will take 8000
sec to resilver = 133 min.  So whenever people have resilver times longer
than that ... it's because the ZFS resilver code for raidzN is inefficient.
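The same arithmetic in a couple of lines of Python, for anyone who wants to
plug in their own disk size and sustained rate (it models nothing but a
straight sequential copy):

    disk_bytes = 1000e9       # 1 TB = 1000 GB
    seq_bits_per_sec = 1e9    # 1 Gbit/sec sustained per disk
    seconds = disk_bytes * 8 / seq_bits_per_sec
    print("%.0f sec = %.0f min" % (seconds, seconds / 60))   # 8000 sec = 133 min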



Re: [zfs-discuss] RaidzN blocksize ... or blocksize in general ... and resilver

2010-10-18 Thread Marty Scholes
 Richard wrote:
 Yep, it depends entirely on how you use the pool.  As soon as you
 come up with a credible model to predict that, then we can optimize
 accordingly :-)

You say that somewhat tongue-in-cheek, but Edward's right.  If the resilver 
code progresses in slab/transaction-group/whatever-the-correct-term-is order, 
then a pool with any significant use will have the resilver code seeking all 
over the disk.

If instead, resilver blindly moved in block number order, then it would have 
very little seek activity and the effective throughput would be close to that 
of pure sequential i/o for both the new disk and the remaining disks in the 
vdev.

Would it make sense for scrub/resilver to be more aware of operating in disk 
order instead of zfs order?


Re: [zfs-discuss] RaidzN blocksize ... or blocksize in general ... and resilver

2010-10-18 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Marty Scholes
 
 Would it make sense for scrub/resilver to be more aware of operating in
 disk order instead of zfs order?

It would certainly make sense.  As mentioned, even if you do the entire disk
this way, including unused space, it is faster than making the poor little
disks randomly seek all over the place for tiny little fragments that
eventually add up to a significant portion of the whole disk.

The main question is:  How difficult would it be to implement?



Re: [zfs-discuss] RaidzN blocksize ... or blocksize in general ... and resilver

2010-10-17 Thread Bob Friesenhahn

On Sun, 17 Oct 2010, Edward Ned Harvey wrote:



The default blocksize is 128K.  If you are using mirrors, then each 
block on disk will be 128K whenever possible.  But if you're using 
raidzN with a capacity of M disks (M disks useful capacity + N disks 
redundancy) then the block size on each individual disk will be 128K 
/ M.  Right?  This is one of the reasons the raidzN resilver code is 
inefficient.  Since you end up waiting for the slowest seek time of 
any one disk in the vdev, and when that's done, the amount of data 
you were able to process was at most 128K.  Rinse and repeat.


Your idea about what it means for code to be inefficient is clearly 
vastly different from my own.  Regardless, the physical layout 
issues (impacting IOPS requirements) are a reality.


Would it not be wise, when creating raidzN vdev's, to increase the 
blocksize to 128K * M?  Then, the on-disk blocksize for each disk 
could be the same as the mirror on-disk blocksize of 128K.  It still 
won't resilver as fast as a mirror, but the raidzN resilver would be 
accelerated by as much as M times.  Right?


This might work for HPC applications with huge files and huge 
sequential streaming data rate requirements.  It would be detrimental 
for the case of small files, or applications which issue many small 
writes, and particularly bad for many random synchronous writes.


The only disadvantage that I know of would be wasted space.  Every 
4K file in a mirror can waste up to 124K of disk space, right?  And 
in the above described scenario, every 4K file in the raidzN can 
waste up to 128K * M of disk space, right?  Also, if you have a lot 
of these sparse 4K blocks, then the resilver time doesn't actually 
improve either.  Because you perform one seek, and regardless if you 
fetch 128K or 128K*M, you still paid one maximum seek time to fetch 
4K of useful data.


The tally of disadvantages is quite large.  Note that ZFS needs to 
write each ZFS block, and you are dramatically increasing the level 
of write amplification.  Also, ZFS needs to checksum each whole block, 
and the checksum adds to the latency.  The risk of block corruption is 
increased.  128K is already quite large for a block.
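To make the write amplification point concrete (a sketch; the 4K write is
just an example workload, and M=5 is an arbitrary raidz width):

    # Data rewritten when a small write lands inside one large record.
    small_write = 4 * 1024                   # a 4K application write
    for recordsize_k in (128, 128 * 5):      # default vs. proposed 128K*M, M=5
        record = recordsize_k * 1024
        print("recordsize %4dK -> %3dx amplification"
              % (recordsize_k, record // small_write))   # 32x vs. 160x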


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/


Re: [zfs-discuss] RaidzN blocksize ... or blocksize in general ... and resilver

2010-10-17 Thread Kyle McDonald



On 10/17/2010 9:38 AM, Edward Ned Harvey wrote:

 The default blocksize is 128K. If you are using mirrors, then
 each block on disk will be 128K whenever possible. But if you're
 using raidzN with a capacity of M disks (M disks useful capacity +
 N disks redundancy) then the block size on each individual disk
 will be 128K / M. Right?


If I understand things correctly, I think this is why it is
recommended that you pick an M that divides into 128K evenly. I
believe powers of 2 are recommended.

I think increasing the block size to 128K*M would be overkill, but
that idea does make me wonder:

In cases where M can't be a power of 2, would it make sense to adjust
the block size so that M still divides evenly?

If M were 4, then the data written to each drive would be 32K. So if
you really wanted M to be 5 drives, is there an advantage to making
the block size 160K, or if that's too big, how about 80K?

Likewise, if you really wanted M to be 3 drives, would adjusting the
block size to 96K make sense?
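Working those per-disk numbers out (just arithmetic on the recordsize
divided across M data disks, ignoring parity and padding):

    for recordsize_k, m in [(128, 4), (128, 5), (160, 5), (80, 5),
                            (128, 3), (96, 3)]:
        per_disk = recordsize_k / m
        note = "" if per_disk == int(per_disk) else "  <- uneven"
        print("recordsize %3dK, M=%d: %5.1fK per disk%s"
              % (recordsize_k, m, per_disk, note))

which gives 32K, 25.6K, 32K, 16K, 42.7K, and 32K per disk respectively.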

  -Kyle




Re: [zfs-discuss] RaidzN blocksize ... or blocksize in general ... and resilver

2010-10-17 Thread Richard Elling
On Oct 17, 2010, at 6:38 AM, Edward Ned Harvey wrote:

 The default blocksize is 128K.  If you are using mirrors, then each block on 
 disk will be 128K whenever possible.  But if you're using raidzN with a 
 capacity of M disks (M disks useful capacity + N disks redundancy) then the 
 block size on each individual disk will be 128K / M.  Right? 

Yes, but it is worse for RAID-5, where you will likely have to do an RMW if
your stripe size is not perfectly matched to the blocksize.  This is the case
where raidz shines over the alternatives.

 This is one of the reasons the raidzN resilver code is inefficient.  Since 
 you end up waiting for the slowest seek time of any one disk in the vdev, and 
 when that's done, the amount of data you were able to process was at most 
 128K.  Rinse and repeat.

How is this different than all other RAID implementations?

 Would it not be wise, when creating raidzN vdev's, to increase the blocksize 
 to 128K * M?  Then, the on-disk blocksize for each disk could be the same as 
 the mirror on-disk blocksize of 128K.  It still won't resilver as fast as a 
 mirror, but the raidzN resilver would be accelerated by as much as M times.  
 Right?

We had this discussion in 2007, IIRC. The bottom line was that if you have a
fixed record size workload, then set the appropriate recordsize and it will
make sense to adjust your raidz1 configuration to avoid gaps. For raidz2/3 or
mixed record length workloads, it is not clear that matching the number of
data/parity disks offers any advantage.

 The only disadvantage that I know of would be wasted space.  Every 4K file in 
 a mirror can waste up to 124K of disk space, right? 

No.  4K files have a recordsize of 4K.  This is why we refer to this case as
a mixed record size workload.  Remember, the recordsize parameter is a
maximum limit, not a minimum limit.

 And in the above described scenario, every 4K file in the raidzN can waste up 
 to 128K * M of disk space, right? 

No.

 Also, if you have a lot of these sparse 4K blocks, then the resilver time 
 doesn't actually improve either.  Because you perform one seek, and 
 regardless if you fetch 128K or 128K*M, you still paid one maximum seek time 
 to fetch 4K of useful data.

Seek penalties are hard to predict or model. Modern drives have efficient
algorithms and large buffer caches.  It cannot be predicted whether the next
read will be in the buffer cache already.  Indeed, it is not even possible to
predict the read order.  The only sure-fire way to prevent seeks is to use
SSDs.

 Point is:  If the goal is to reduce the number of on-disk slabs, and 
 therefore reduce the number of seeks necessary to resilver, one thing you 
 could do is increase the pool blocksize, right? 

Not the pool block size, the application's block size.  Applications which
make lots of itty bitty I/Os will tend to take more time to resilver.
Applications that make lots of large I/Os will resilver faster.
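As a crude illustration of the seek-bound case (a sketch; the 8 ms figure
for an average random read is an assumption, and a real resilver overlaps
I/Os, prefetches, and so on):

    # If every block costs roughly one random I/O, time ~= blocks * io_time.
    used_bytes = 1.0e12          # 1 TB of live data to resilver
    block_size = 128 * 1024      # 128K records
    io_time = 0.008              # assumed ~8 ms per random read

    blocks = used_bytes / block_size
    hours = blocks * io_time / 3600
    print("%.0f blocks -> about %.0f hours" % (blocks, hours))   # ~17 hours

versus roughly 2 hours for a purely sequential 1 TB copy at 1 Gbit/sec.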

 YMMV, and YM will depend on how you use your pool.  Hopefully you're able to 
 bias your usage in favor of large block writes.

Yep, it depends entirely on how you use the pool.  As soon as you come up
with a credible model to predict that, then we can optimize accordingly :-)
 -- richard

-- 
OpenStorage Summit, October 25-27, Palo Alto, CA
http://nexenta-summit2010.eventbrite.com
USENIX LISA '10 Conference, November 7-12, San Jose, CA
ZFS and performance consulting
http://www.RichardElling.com
