Re: [zfs-discuss] Thinking about splitting a zpool in system and data

2012-01-07 Thread Jim Klimov

Hello, Jesus,

  I have transitioned a number of systems using roughly the
same procedure as you've outlined. Sadly, my notes are
not in English, so they wouldn't be of much help directly;
but I can report that I had success with similar in-place
manual transitions from mirrored SVM (pre-Solaris 10u4)
to new ZFS root pools, as well as with various transitions
of ZFS root pools from one layout to another, on systems
with a limited number of disk drives (2-4 overall).

  As I've recently reported on the list, I've also done
such a migration for my faulty single-disk rpool at home,
via the data pool and back, changing the "copies"
setting en route.

  Overall, your plan seems okay and has more failsafes
than ours did - we could afford longer downtimes ;)
However, when doing such low-level work, make sure
that you have remote access to your systems (ILOM,
KVM, etc.; remotely-controlled PDUs for externally enforced
poweroff-poweron are welcome), and that you can boot the
systems over ILOM/rKVM from a LiveUSB/LiveCD image
in case of bigger trouble.

  In steps 6-7, where you reboot the system to test
that the new rpool works, you might want to keep the zones
down, e.g. by disabling the zones service in the old BE
just before the reboot and zfs-sending this update to the
new small rpool. Also, it is likely that in the new BE
(small rpool) your old data from the big rpool won't
be imported automatically, so the zones (or their services)
wouldn't start correctly anyway before steps 7-8.
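
  A hedged sketch of what I mean (the BE and snapshot names
below are hypothetical):

# svcadm disable svc:/system/zones:default
# zfs snapshot -r rpool/ROOT/oldBE@pre-switch
# zfs send -R -i @prev-sync rpool/ROOT/oldBE@pre-switch | zfs receive -u -F newrpool/ROOT/oldBE

  The first command keeps the zones from auto-starting in the
copied BE; the send/receive pushes just the last increment of
the now-quiesced old BE over to the small rpool before the
test reboot.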

---

Below I'll outline our experience from my notes, as it
successfully applied to an even more complicated situation
than yours:

  On many Sol10/SXCE systems with ZFS roots we've also
created a hierarchical layout (separate /var, /usr, /opt
with compression enabled), but this procedure HAS FAILED
for newer OpenIndiana systems. So for OI we have to use
the default single-root layout and only separate some of
/var/* subdirs (adm, log, mail, crash, cores, ...) in
order to set quotas and higher compression on them.
Such datasets are also kept separate from OS upgrades
and are used in all boot environments without cloning.
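
  For example, on OI such datasets could be created along these
lines (the dataset names, quota sizes and compression levels
below are just an illustration):

# zfs create -o mountpoint=/var/log -o compression=gzip-6 -o quota=4g rpool/varlog
# zfs create -o mountpoint=/var/crash -o compression=gzip-9 -o quota=8g rpool/varcrash
# zfs create -o mountpoint=/var/cores -o compression=gzip-9 -o quota=8g rpool/varcores

  Since these live outside the rpool/ROOT/* hierarchy, beadm
clones of the boot environments leave them alone.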

  To simplify things, most of the transitions were done
during off-hours, so it was okay to shut down all the
zones and other services. In some cases for Sol10/SXCE
the procedure involved booting into Failsafe Boot
mode; on any system this can be done with the BootCD.

  For routine Solaris 10 and OpenSolaris SXCE maintenance
we did use LiveUpgrade, but at that time its ZFS support
was immature, so we circumvented LU and transitioned
manually. In those cases we used LU to update systems
to the base level supporting ZFS roots (Sol10u4+) while
still running from SVM mirrors (one mirror for the main root,
another mirror for the LU root holding the new/old OS image).
After the transition to the ZFS rpool, we cleared out the
LU settings (/etc/lu/, /etc/lutab), reverting to the defaults
from the most recent SUNWlu* packages, and once booted
from ZFS we created the current LU BE based on the
current ZFS rpool.

  Once the OS was capable of booting from ZFS (Sol10u4+,
snv_100 approx.), we broke the SVM mirrors, repartitioned
the second disk to our liking (about 4-10Gb for rpool,
the rest for data), created the new rpool and the dataset
hierarchy we needed, and had it mounted under /zfsroot.
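
  Roughly like this (the metadevice names, slice names and BE
name below are from memory and will differ on your system):

# metadetach d10 d12
# metaclear d12
  ... repartition the freed disk with format(1M), then ...
# zpool create -R /zfsroot rpool c1t1d0s0
# zfs create -o mountpoint=legacy rpool/ROOT
# zfs create -o mountpoint=/ rpool/ROOT/sol10

  Don't forget to put boot blocks on the new disk (installgrub
on x86, installboot on SPARC) before trying to boot from it.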

  Note that in our case we used a minimized install
of Solaris which fit in 1-2Gb per BE; we did not use
a separate dump device, and the swap volume was located
in the ZFS data pool (a mirror or raidz for 4-disk systems).
Zone roots were also kept separate from the system rpool
and were stored in the data pool. This DID cause problems
for LiveUpgrade, so zones were detached before LU and
reattached-with-upgrade after the OS upgrade and disk
migrations.
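
  For reference, a swap zvol in the data pool is set up along
these lines (the pool name, the 4Gb size and the 8k block size
- matching the SPARC pagesize, use 4k on x86 - are only an
example):

# zfs create -V 4g -b 8k data/swap
# swap -a /dev/zvol/dsk/data/swap

  To make it permanent, also list /dev/zvol/dsk/data/swap as
a swap entry in /etc/vfstab.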

  Then we copied the root FS data like this:

# cd /zfsroot && ( ufsdump 0f - / | ufsrestore -rf - )

  If the source (SVM) paths like /var, /usr or /boot are
separate UFS filesystems, repeat likewise, changing the
paths in the command above.
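
  For instance, for a separate /var it would look roughly
like this:

# cd /zfsroot/var && ( ufsdump 0f - /var | ufsrestore -rf - )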

  For non-UFS sources, such as a migration from VxFS or
even from ZFS (if you need a different layout, compression,
etc., so ZFS send/recv is not applicable), you can use
Sun cpio (it should carry over extended attributes and
ACLs). For example, if you're booted from the LiveCD,
the old root is mounted at /ufsroot and the new
ZFS rpool hierarchy is at /zfsroot, you'd do this:

# cd /ufsroot && ( find . -xdev -depth -print | cpio -pvdm /zfsroot )

  The example above also copies only the data from the
current FS (because of -xdev), so you need to repeat it
for each sub-FS like /var, etc.
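
  E.g. for a separate /var (paths follow the example above):

# cd /ufsroot/var && ( find . -xdev -depth -print | cpio -pvdm /zfsroot/var )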

  Another problem we've encountered while cpio'ing live
systems (when not running from failsafe/livecd) is that
find skips the mountpoints of sub-filesystems. While your
new ZFS hierarchy would provide usr, var, opt under /zfsroot,
you might need to manually create some others - see the
list in your current df output. Example:

# cd /zfsroot
# mkdir -p tmp proc devices var/run system/contract system/object etc/svc/volatile


[zfs-discuss] ZIL on a dedicated HDD slice (1-2 disk systems)

2012-01-07 Thread Jim Klimov

Hello all,

  For smaller systems such as laptops or low-end servers,
which can house 1-2 disks, would it make sense to dedicate
a 2-4Gb slice to the ZIL for the data pool, separate from
rpool? Example layout (single-disk or mirrored):

s0 - 16Gb - rpool
s1 - 4Gb  - data-zil
s3 - *Gb  - data pool

  The idea would be to decrease fragmentation (committed
writes to the data pool would be more coalesced) and to keep
the ZIL on the faster tracks of the HDD.

  I'm actually more interested in the former: would the
dedicated ZIL decrease fragmentation of the pool?

  Likewise, for larger pools (such as my 6-disk raidz2),
can fragmentation and/or performance benefit from some
dedicated ZIL slices (e.g. s0 = 1-2Gb ZIL per 2Tb disk,
with 3 mirrored ZIL sets overall)?

  Can several ZIL devices (or mirrors) be concatenated for a
single data pool, or can only one dedicated ZIL vdev be used?
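
  For concreteness, I imagine the commands would look something
like this (device names are hypothetical):

# zpool create data c0t0d0s3
# zpool add data log c0t0d0s1

or, for a multi-disk pool with mirrored ZIL slices:

# zpool add data log mirror c0t0d0s0 c0t1d0s0 mirror c0t2d0s0 c0t3d0s0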

Thanks,
//Jim Klimov



Re: [zfs-discuss] ZFS Upgrade

2012-01-07 Thread Jim Klimov

2012-01-06 17:49, Edward Ned Harvey wrote:

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
boun...@opensolaris.org] On Behalf Of Ivan Rodriguez

Dear list,

I'm about to upgrade a zpool from version 10 to version 29. I suppose that
this upgrade will address several performance issues that are present
on 10. However, inside that pool we have several zfs filesystems, all
of them at version 1. My first question is: is there a problem with
performance, or any other problem, if you operate a version-29 zpool
with version-1 zfs filesystems?

Is it better to upgrade zfs to the latest version?

Can we jump from zfs version 1 to 5?

Are there any implications for zfs send/receive between filesystems
and pools of different versions?


You can, and definitely should, upgrade all your zpools and zfs
filesystems.  The only exception to think about is rpool.  You definitely
DON'T want to upgrade rpool higher than what's supported by the boot CD.  So
I suggest you create a test system, boot from the boot CD, create some
filesystems, and check which zpool and zfs versions they are.  Then,
upgrade rpool only to that level (just in case you ever need to boot from CD
to perform a rescue).  And upgrade all your other filesystems to the latest.


I believe in this case it might make sense to boot the
target system from this BootCD and use zpool upgrade
from this OS image. This way you can be more sure that
your recovery software (Solaris BootCD) would be helpful :)

But this is only applicable if you can afford the downtime...
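
A hedged sketch of the commands I have in mind ("data" and the
version number are placeholders - use whatever maximum the
BootCD's "zpool upgrade -v" reports):

# zpool upgrade -v
# zpool upgrade -V 29 rpool
# zpool upgrade data
# zfs upgrade -r data

The first command lists the pool versions the running (BootCD)
OS supports; the second takes rpool up only to a version the
rescue media can still import, and the last two bring the data
pool and its filesystems to the newest versions it supports.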

//Jim



Re: [zfs-discuss] Stress test zfs

2012-01-07 Thread Thomas Nau
Hi Grant

On 01/06/2012 04:50 PM, Richard Elling wrote:
 Hi Grant,
 
 On Jan 4, 2012, at 2:59 PM, grant lowe wrote:
 
 Hi all,

 I've got Solaris 10 9/10 running on a T3. It's an Oracle box with 128GB 
 memory. I've been trying to load test the box with 
 bonnie++. I can seem to get 80 to 90K writes, but can't seem to get more 
 than a couple K for writes. Any suggestions? Or should I take this to a 
 bonnie++ mailing list? Any help is appreciated. I'm kinda new to load 
 testing.
 
 I was hoping Roch (from Oracle) would respond, but perhaps he's not hanging 
 out on 
 zfs-discuss anymore?
 
 Bonnie++ sux as a benchmark. The best analysis of this was done by Roch and 
 published
 online in the seminal blog post:
   http://137.254.16.27/roch/entry/decoding_bonnie
 
 I suggest you find a benchmark that more closely resembles your expected 
 workload and
 do not rely on benchmarks that provide a summary metric.
  -- richard



I had good experience with filebench. It resembles your workload as
well as you are able to describe it, but it takes some time to get things
set up if you cannot find your workload among the many provided
examples.
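
For example, loading one of the bundled workload profiles looks
roughly like this (the profile name and test directory below are
just an illustration):

# filebench
filebench> load oltp
filebench> set $dir=/pool/fbtest
filebench> run 60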

Thomas


[zfs-discuss] zfs defragmentation via resilvering?

2012-01-07 Thread Jim Klimov

Hello all,

  I understand that relatively high fragmentation is inherent
to ZFS due to its COW and possible intermixing of metadata
and data blocks (of which metadata path blocks are likely
to expire and get freed relatively quickly).

  I believe it was sometimes implied on this list that such
fragmentation for static data can currently be combated
only by zfs send-ing the existing pools' data to other pools
on some reserved hardware, and then clearing the original pools
and sending the data back. This is time-consuming, disruptive
and requires lots of extra storage idling for this task (or
at best - for backup purposes).

  I wonder how resilvering works, namely - does it write
blocks as they were, or in an optimized (defragmented)
fashion, in two use cases:
1) Resilvering from a healthy array (vdev) onto a spare drive
   in order to replace one of the healthy drives in the vdev;
2) Resilvering a degraded array from existing drives onto a
   new drive in order to repair the array and make it redundant
   again.

Also, are these two modes different at all?
I.e. if I were to ask ZFS to replace a working drive with
a spare as in case (1), can I do it at all, and would its
data simply be copied over, or reconstructed from other
drives, or some mix of these two operations?
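
  The command I have in mind for case (1) would be something
like this, with hypothetical device names and c1t9d0 being
the spare - assuming zpool replace is indeed the right tool:

# zpool replace tank c1t3d0 c1t9d0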

  Finally, what would the gurus say - does fragmentation
pose a heavy problem on nearly-filled-up pools made of
spinning HDDs (I believe so, at least judging from the
performance degradation problems when writing to 80+%-filled
pools), and can fragmentation be effectively combated
on ZFS at all (with or without BP rewrite)?

  For example, can (or does) metadata live separately
from data in some dedicated disk areas, while data
blocks are written as contiguously as possible?

  Many Windows defrag programs group files into several
zones on the disk based on their last-modify times, so
that old WORM files remain defragmented for a long time.
There are thus some empty areas reserved for new writes
as well as for moving newly discovered WORM files to
the WORM zones (free space permitting)...

  I wonder if this would be viable for ZFS (COW and snapshots
involved) once BP rewrite is implemented? Perhaps such
zoned defragmentation could be done based on block creation
date (TXG number) and the knowledge that some blocks in
a certain order comprise at least one single file (maybe
more, due to clones and dedup) ;)

What do you think? Thanks,
//Jim Klimov


Re: [zfs-discuss] ZIL on a dedicated HDD slice (1-2 disk systems)

2012-01-07 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Jim Klimov
 
For smaller systems such as laptops or low-end servers,
 which can house 1-2 disks, would it make sense to dedicate
 a 2-4Gb slice to the ZIL for the data pool, separate from
 rpool? Example layout (single-disk or mirrored):

The idea would be to decrease fragmentation (committed
 writes to data pool would be more coalesced) and to keep
 the ZIL at faster tracks of the HDD drive.

I'm not authoritative, I'm speaking from memory of former discussions on
this list and various sources of documentation.

No, it won't help you.

First of all, all your writes to the storage pool are aggregated, so you're
already minimizing fragmentation of writes in your main pool.  However, over
time, as snapshots are created & destroyed, small changes are made to files,
and file contents are overwritten incrementally and internally...  The only
fragmentation you get creeps in as a result of COW.  This fragmentation only
impacts sequential reads of files which were previously written in random
order.  This type of fragmentation has no relation to ZIL or writes.

If you don't split out your ZIL separate from the storage pool, zfs already
chooses disk blocks that it believes to be optimized for minimal access
time.  In fact, I believe, zfs will dedicate a few sectors at the low end, a
few at the high end, and various other locations scattered throughout the
pool, so whatever the current head position, it tries to go to the closest
landing zone that's available for ZIL writes.  If anything, splitting out
your ZIL to a different partition might actually hurt your performance.

Also, the concept of "faster tracks" of the HDD is incorrect.  Yes,
there was a time when HDD speeds were limited by rotational speed and
magnetic density, so the outer tracks of the disk could serve up more data
because more magnetic material passed over the head in each rotation.  But
nowadays, the hard drive sequential speed is limited by the head speed,
which is invariably right around 1Gbps.  So the inner and outer sectors of
the HDD are equally fast - the outer sectors are actually less magnetically
dense because the head can't handle it.  And the random IO speed is limited
by head seek + rotational latency, where seek is typically several times
longer than latency.  

So basically, the only thing that matters, to optimize the performance of
any modern typical HDD, is to minimize the head travel.  You want to be
seeking sectors which are on tracks that are nearby to the present head
position.

Of course, if you want to test & benchmark the performance of splitting
apart the ZIL to a different partition, I encourage that.  I'm only speaking
my beliefs based on my understanding of the architectures and limitations
involved.  This is my best prediction.  And I've certainly been wrong
before.  ;-)  Sometimes, being wrong is my favorite thing, because you learn
so much from it.  ;-)



Re: [zfs-discuss] zfs defragmentation via resilvering?

2012-01-07 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Jim Klimov
 
I understand that relatively high fragmentation is inherent
 to ZFS due to its COW and possible intermixing of metadata
 and data blocks (of which metadata path blocks are likely
 to expire and get freed relatively quickly).
 
I believe it was sometimes implied on this list that such
 fragmentation for static data can be currently combatted
 only by zfs send-ing existing pools data to other pools at
 some reserved hardware, and then clearing the original pools
 and sending the data back. This is time-consuming, disruptive
 and requires lots of extra storage idling for this task (or
 at best - for backup purposes).

Can be combated by sending & receiving.  But that's not the only way.  You
can defrag, (or apply/remove dedup and/or compression, or any of the other
stuff that's dependent on BP rewrite) by doing any technique which
sequentially reads the existing data, and writes it back to disk again.  For
example, if you "cp -p file1 file2 && mv file2 file1" then you have
effectively defragged file1 (or added/removed dedup or compression).  But of
course it's requisite that file1 is sufficiently not being used right now.
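
For example, something along these lines (the path and file pattern are
hypothetical, it's entirely untested, and it's only safe while nothing
else has the files open):

# cd /tank/data
# for f in *.dat ; do cp -p "$f" "$f.new" && mv "$f.new" "$f" ; done

Keep in mind that any snapshots still reference the old blocks, so the
space isn't freed and the defrag benefit is limited until those
snapshots are destroyed.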


I wonder how resilvering works, namely - does it write
 blocks as they were or in an optimized (defragmented)
 fashion, in two usecases:

Resilvering goes according to temporal order.  While this might sometimes yield
a slightly better organization (if a whole bunch of small writes were
previously spread out over a large period of time on a largely idle system,
they will now be write-aggregated into sequential blocks), usually resilvering
recreates fragmentation similar to the pre-existing fragmentation.

In fact, even if you zfs send | zfs receive while preserving snapshots,
you're still recreating the data in a loosely temporal order,
because it will do all the blocks of the oldest snapshot, and then all the
blocks of the second-oldest snapshot, etc.  So by preserving the old
snapshots, you might sometimes be recreating a significant amount of
fragmentation anyway.


 1) Resilvering from a healthy array (vdev) onto a spare drive
 in order to replace one of the healthy drives in the vdev;
 2) Resilvering a degraded array from existing drives onto a
 new drive in order to repair the array and make it redundant
 again.

Same behavior either way.  Unless...  If your old disks are small and very
full, and your new disks are bigger, then sometimes in the past you may have
suffered fragmentation due to lack of available sequential unused blocks.
So resilvering onto new *larger* disks might make a difference.


Finally, what would the gurus say - does fragmentation
 pose a heavy problem on nearly-filled-up pools made of
 spinning HDDs 

Yes.  But that's not unique to ZFS or COW.  No matter what your system, if
your disk is nearly full, you will suffer from fragmentation.


 and can fragmentation be effectively combatted
 on ZFS at all (with or without BP rewrite)?

With BP rewrite, yes you can effectively combat fragmentation.
Unfortunately it doesn't exist.  :-/

Without BP rewrite...  Define "effectively."  ;-)  I have successfully
defragged, compressed, and enabled/disabled dedup on pools before, by using
zfs send | zfs receive...  Or by asking users, "OK, we're all in agreement,
this weekend nobody will be using the 'a' directory.  Right?"  So then I
"sudo rm -rf a" and restore from the latest snapshot.  Or something along
those lines.  Next weekend, we'll do the 'b' directory...
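
The send/receive variant goes roughly like this (pool and snapshot names
are hypothetical; verify the copy before destroying anything):

# zfs snapshot -r tank/data@migrate
# zfs send -R tank/data@migrate | zfs receive -F spare/data

...then destroy and recreate tank/data and send the data back the same
way.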



Re: [zfs-discuss] zfs defragmentation via resilvering?

2012-01-07 Thread Hung-Sheng Tsao (Lao Tsao 老曹) Ph.D.

It seems that S11 shadow migration can help :-)


On 1/7/2012 9:50 AM, Jim Klimov wrote:

  I believe it was sometimes implied on this list that such
fragmentation for static data can currently be combated
only by zfs send-ing the existing pools' data to other pools
on some reserved hardware, and then clearing the original pools
and sending the data back. This is time-consuming, disruptive
and requires lots of extra storage idling for this task (or
at best - for backup purposes).


--
Hung-Sheng Tsao, Ph.D.
Founder & Principal
HopBit GridComputing LLC
cell: 9734950840

http://laotsao.blogspot.com/
http://laotsao.wordpress.com/
http://blogs.oracle.com/hstsao/



[zfs-discuss] zfs read-ahead and L2ARC

2012-01-07 Thread Jim Klimov

I wonder if it is possible (currently or in the future as an RFE)
to tell ZFS to automatically read-ahead some files and cache them
in RAM and/or L2ARC?

One use-case would be for Home-NAS setups where multimedia (video
files or catalogs of images/music) are viewed from a ZFS box. For
example, if a user wants to watch a film, or listen to a playlist
of MP3's, or push photos to a wall display (photo frame, etc.),
the storage box should read-ahead all required data from HDDs
and save it in ARC/L2ARC. Then the HDDs can spin down for hours
while the pre-fetched gigabytes of data are used by consumers
from the cache. End-users get peace, quiet and less electricity
used while they enjoy their multimedia entertainment ;)

Is it possible? If not, how hard would it be to implement?

In terms of scripting, would it suffice to detect reads (e.g.
with DTrace) and then read the whole files to /dev/null to get
them cached along with all the required metadata (so that the
mechanical HDDs are not needed for reads afterwards)?
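
Something as simple as this is what I have in mind (the paths
are just an example):

# find /pool/media/video/SomeShow -type f -exec cat {} + > /dev/null

or, per detected file:

# dd if=/pool/media/video/episode01.avi of=/dev/null bs=1024k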

Thanks,
//Jim Klimov


Re: [zfs-discuss] ZFS Upgrade

2012-01-07 Thread Bob Friesenhahn

On Sat, 7 Jan 2012, Jim Klimov wrote:

I believe in this case it might make sense to boot the
target system from this BootCD and use zpool upgrade
from this OS image. This way you can be more sure that
your recovery software (Solaris BootCD) would be helpful :)


Also keep in mind that it would be a grievous error if the zpool 
version supported by the BootCD were newer than what the installed GRUB 
and OS can support.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/


Re: [zfs-discuss] ZFS and spread-spares (kinda like GPFS declustered RAID)?

2012-01-07 Thread Bob Friesenhahn

On Sat, 7 Jan 2012, Jim Klimov wrote:


Several RAID systems have implemented spread spare drives
in the sense that there is not an idling disk waiting to
receive a burst of resilver data filling it up, but the
capacity of the spare disk is spread among all drives in
the array. As a result, the healthy array gets one more
spindle and works a little faster, and rebuild times are
often decreased since more spindles can participate in
repairs at the same time.


I think that I would also be interested in a system which uses the 
so-called spare disks for more protective redundancy but then reduces 
that protective redundancy in order to use that disk to replace a 
failed disk or to automatically enlarge the pool.


For example, a pool could start out with four-way mirroring when there 
is little data in the pool.  When the pool becomes more full, mirror 
devices are automatically removed (from existing vdevs), and used to 
add more vdevs.  Eventually a limit would be hit so that no more 
mirrors are allowed to be removed.


Obviously this approach works with simple mirrors but not for raidz.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/


Re: [zfs-discuss] ZFS and spread-spares (kinda like GPFS declustered RAID)?

2012-01-07 Thread Richard Elling
Hi Jim,

On Jan 6, 2012, at 3:33 PM, Jim Klimov wrote:

 Hello all,
 
 I have a new idea up for discussion.
 
 Several RAID systems have implemented spread spare drives
 in the sense that there is not an idling disk waiting to
 receive a burst of resilver data filling it up, but the
 capacity of the spare disk is spread among all drives in
 the array. As a result, the healthy array gets one more
 spindle and works a little faster, and rebuild times are
 often decreased since more spindles can participate in
 repairs at the same time.

Xiotech has a distributed, relocatable model, but the FRU is the whole ISE.
There have been other implementations of more distributed RAIDness in the
past (RAID-1E, etc). 

The big question is whether they are worth the effort. Spares solve a 
serviceability
problem and only impact availability in an indirect manner. For single-parity 
solutions, spares can make a big difference in MTTDL, but have almost no impact
on MTTDL for double-parity solutions (e.g. raidz2).

 I don't think I've seen such an idea proposed for ZFS, and
 I do wonder if it is at all possible with variable-width
 stripes? Although if the disk is sliced in 200 metaslabs
 or so, implementing a spread-spare is a no-brainer as well.

Put some thoughts down on paper and work through the math. If it all works
out, let's implement it!
 -- richard

 
 To be honest, I've seen this a long time ago in (Falcon?)
 RAID controllers, and recently - in a USENIX presentation
 of IBM GPFS on YouTube. In the latter, the speaker goes into
 greater depth describing their declustered RAID
 approach (as they call it): all blocks - spare, redundancy
 and data - are intermixed evenly on all drives, and not kept in
 a single group or a mid-level VDEV as would be the case for ZFS.
 
 http://www.youtube.com/watch?v=2g5rx4gP6yU&feature=related
 
 GPFS with declustered RAID not only decreases rebuild
 times and/or impact of rebuilds on end-user operations,
 but it also happens to increase reliability - there is
 a smaller time window in case of multiple-disk failure
 in a large RAID-6 or RAID-7 array (in the example they
 use 47-disk sets) that the data is left in a critical
 state due to lack of redundancy, and there is less data
 overall in such state - so the system goes from critical
 to simply degraded (with some redundancy) in a few minutes.
 
 Another thing they have in GPFS is temporary offlining
 of disks so that they can catch up when reattached - only
 newer writes (bigger TXG numbers in ZFS terms) are added to
 reinserted disks. I am not sure this exists in ZFS today,
 either. This might simplify physical systems maintenance
 (as it does for IBM boxes - see presentation if interested)
 and quick recovery from temporarily unavailable disks, such
 as when a disk gets a bus reset and is unavailable for writes
 for a few seconds (or more) while the array keeps on writing.
 
 I find these ideas cool. I do believe that IBM might get
 angry if ZFS development copy-pasted them as-is, but it
 might nonetheless get us inventing a similar wheel
 that would be a bit different ;)
 There are already several vendors doing this in some way,
 so perhaps there is no (patent) monopoly in place already...
 
 And I think all the magic of spread spares and/or declustered
 RAID would go into just making another write-block allocator
 in the same league as raidz or mirror are nowadays...
 BTW, are such allocators pluggable (as software modules)?
 
 What do you think - can and should such ideas find their
 way into ZFS? Or why not? Perhaps from theoretical or
 real-life experience with such storage approaches?
 
 //Jim Klimov
 

-- 

ZFS and performance consulting
http://www.RichardElling.com
illumos meetup, Jan 10, 2012, Menlo Park, CA
http://www.meetup.com/illumos-User-Group/events/41665962/ 


Re: [zfs-discuss] ZIL on a dedicated HDD slice (1-2 disk systems)

2012-01-07 Thread Richard Elling
On Jan 7, 2012, at 7:12 AM, Edward Ned Harvey wrote:

 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Jim Klimov
 
   For smaller systems such as laptops or low-end servers,
 which can house 1-2 disks, would it make sense to dedicate
 a 2-4Gb slice to the ZIL for the data pool, separate from
 rpool? Example layout (single-disk or mirrored):
 
   The idea would be to decrease fragmentation (committed
 writes to data pool would be more coalesced) and to keep
 the ZIL at faster tracks of the HDD drive.
 
 I'm not authoritative, I'm speaking from memory of former discussions on
 this list and various sources of documentation.
 
 No, it won't help you.

Correct :-)

 First of all, all your writes to the storage pool are aggregated, so you're
 already minimizing fragmentation of writes in your main pool.  However, over
 time, as snapshots are created & destroyed, small changes are made to files,
 and file contents are overwritten incrementally and internally...  The only
 fragmentation you get creeps in as a result of COW.  This fragmentation only
 impacts sequential reads of files which were previously written in random
 order.  This type of fragmentation has no relation to ZIL or writes.
 
 If you don't split out your ZIL separate from the storage pool, zfs already
 chooses disk blocks that it believes to be optimized for minimal access
 time.  In fact, I believe, zfs will dedicate a few sectors at the low end, a
 few at the high end, and various other locations scattered throughout the
 pool, so whatever the current head position, it tries to go to the closest
 landing zone that's available for ZIL writes.  If anything, splitting out
 your ZIL to a different partition might actually hurt your performance.
 
 Also, the concept of "faster tracks" of the HDD is incorrect.  Yes,
 there was a time when HDD speeds were limited by rotational speed and
 magnetic density, so the outer tracks of the disk could serve up more data
 because more magnetic material passed over the head in each rotation.  But
 nowadays, the hard drive sequential speed is limited by the head speed,
 which is invariably right around 1Gbps.  So the inner and outer sectors of
 the HDD are equally fast - the outer sectors are actually less magnetically
 dense because the head can't handle it.  And the random IO speed is limited
 by head seek + rotational latency, where seek is typically several times
 longer than latency.  

Disagree. My data, and the vendor specs, continue to show different sequential
media bandwidth speeds for inner vs. outer cylinders.

 
 So basically, the only thing that matters, to optimize the performance of
 any modern typical HDD, is to minimize the head travel.  You want to be
 seeking sectors which are on tracks that are nearby to the present head
 position.
 
 Of course, if you want to test & benchmark the performance of splitting
 apart the ZIL to a different partition, I encourage that.  I'm only speaking
 my beliefs based on my understanding of the architectures and limitations
 involved.  This is my best prediction.  And I've certainly been wrong
 before.  ;-)  Sometimes, being wrong is my favorite thing, because you learn
 so much from it.  ;-)

Good idea.

I think you will see a tradeoff on the read side of the mixed read/write 
workload.
Sync writes have higher priority than reads so the order of I/O sent to the disk
will appear to be very random and not significantly coalesced. This is the 
pathological worst case workload for a HDD.

OTOH, you're not trying to get high performance from an HDD are you?  That
game is over.
 -- richard

-- 

ZFS and performance consulting
http://www.RichardElling.com
illumos meetup, Jan 10, 2012, Menlo Park, CA
http://www.meetup.com/illumos-User-Group/events/41665962/ 



Re: [zfs-discuss] ZFS and spread-spares (kinda like GPFS declustered RAID)?

2012-01-07 Thread Jim Klimov

2012-01-08 5:37, Richard Elling wrote:

The big question is whether they are worth the effort. Spares solve a 
serviceability
problem and only impact availability in an indirect manner. For single-parity
solutions, spares can make a big difference in MTTDL, but have almost no impact
on MTTDL for double-parity solutions (eg. raidz2).


Well, regarding this part: in the presentation linked in my OP,
the IBM presenter suggests that for a 6-disk raid10 (3 mirrors)
with one spare drive - a 7-disk set overall - the options for
critical hits to data redundancy when one of the drives dies
are as follows:

1) Traditional RAID - one full disk is a mirror of another
   full disk; 100% of a disk's size is critical and has to
   be replicated onto a spare drive ASAP;

2) Declustered RAID - all 7 disks are used for 2 unique data
   blocks from the original setup plus one spare block (I am not
   sure I described it well in words; his diagram shows it
   better); if a single disk dies, only 1/7 of a disk's worth
   of data is critical (not redundant) and can be fixed faster.

   For their typical 47-disk sets of RAID-7-like redundancy,
   under 1% of data becomes critical when 3 disks die at once,
   which is (deemed) unlikely as is.

Apparently, in the GPFS layout, MTTDL is much higher than
in raid10+spare with all other stats being similar.

I am not sure I'm ready (or qualified) to sit down and present
the math right now - I just heard some ideas that I considered
worth sharing and discussing ;)

Thanks for the input,
//Jim



Re: [zfs-discuss] ZFS and spread-spares (kinda like GPFS declustered RAID)?

2012-01-07 Thread Tim Cook
On Sat, Jan 7, 2012 at 7:37 PM, Richard Elling richard.ell...@gmail.com wrote:

 Hi Jim,

 On Jan 6, 2012, at 3:33 PM, Jim Klimov wrote:

  Hello all,
 
  I have a new idea up for discussion.
 
  Several RAID systems have implemented spread spare drives
  in the sense that there is not an idling disk waiting to
  receive a burst of resilver data filling it up, but the
  capacity of the spare disk is spread among all drives in
  the array. As a result, the healthy array gets one more
  spindle and works a little faster, and rebuild times are
  often decreased since more spindles can participate in
  repairs at the same time.

 Xiotech has a distributed, relocatable model, but the FRU is the whole ISE.
 There have been other implementations of more distributed RAIDness in the
 past (RAID-1E, etc).

 The big question is whether they are worth the effort. Spares solve a
 serviceability
 problem and only impact availability in an indirect manner. For
 single-parity
 solutions, spares can make a big difference in MTTDL, but have almost no
 impact
 on MTTDL for double-parity solutions (eg. raidz2).



I disagree.  Dedicated spares impact far more than availability.  During a
rebuild, performance is, in general, abysmal.  ZIL and L2ARC will obviously
help (L2ARC more than ZIL), but at the end of the day, if we've got a 12
hour rebuild (fairly conservative in the days of 2TB SATA drives), the
performance degradation is going to be very real for end-users.  With
distributed parity and spares, you should in theory be able to cut this
down by an order of magnitude.  I feel as though you're brushing this off as
not a big deal when it's an EXTREMELY big deal (in my mind).  In my opinion
you can't just approach this from an MTTDL perspective; you also need to
take into account user experience.  Just because I haven't lost data
doesn't mean the system isn't (essentially) unavailable (sorry for the
double negative and repeated parentheses).  If I can't use the system due
to performance being a fraction of what it is during normal production, it
might as well be an outage.





  I don't think I've seen such an idea proposed for ZFS, and
  I do wonder if it is at all possible with variable-width
  stripes? Although if the disk is sliced in 200 metaslabs
  or so, implementing a spread-spare is a no-brainer as well.

 Put some thoughts down on paper and work through the math. If it all works
 out, let's implement it!
  -- richard


I realize it's not intentional, Richard, but that response is more than a
bit condescending.  If he could just put it down on paper and code
something up, I strongly doubt he would be posting his thoughts here.  He
would be posting results.  The intention of his post, as far as I can tell,
is to perhaps inspire someone who CAN just write down the math and write up
the code to do so.  Or at least to have them review his thoughts and give
him a dev's perspective on how viable bringing something like this to ZFS
is.  I fear responses like "the code is there, figure it out" make the
*aris community no better than the Linux one.




 