Re: [zfs-discuss] Thinking about splitting a zpool in "system" and "data"

2012-01-07 Thread Jim Klimov

Hello, Jesus,

  I have transitioned a number of systems roughly by the
same procedure as you've outlined. Sadly, my notes are
not in English so they wouldn't be of much help directly;
but I can report that I had success with similar "in-place"
manual transitions from mirrored SVM (pre-Solaris 10u4)
to new ZFS root pools, as well as various transitions
of ZFS root pools from one layout to another, on systems
with limited numbers of disk drives (2-4 overall).

  As I've recently reported on the list, I've also done
such a "migration" for my faulty single-disk rpool at home
via the data pool and back, changing the "copies"
setting en route.

  Overall, your plan seems okay and has more failsafes
than we've had - because longer downtimes were affordable ;)
However, when doing such low-level stuff, you should make
sure that you have remote access to your systems (ILOM,
KVM, etc.; remotely-controlled PDUs for externally enforced
poweroff-poweron are welcome), and that you can boot the
systems over ILOM/rKVM with an image of a LiveUSB/LiveCD/etc
in case of bigger trouble.

  In steps 6-7, where you reboot the system to test
that the new rpool works, you might want to keep the zones
down, e.g. by disabling the zones service in the old BE
just before the reboot and zfs-sending this update to the
new small rpool. Also, it is likely that in the new BE
(small rpool) your old "data" from the big rpool won't
be imported automatically, so the zones (or their services)
wouldn't start correctly anyway before steps 7-8.
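
  In shell terms that would be roughly as follows (BE, pool
and snapshot names are hypothetical, and I assume a full
send of an earlier @base snapshot was already done per your
plan); re-enable the zones service after the switch:

# svcadm disable svc:/system/zones:default
# zfs snapshot -r rpool/ROOT/oldBE@pre-switch
# zfs send -R -i @base rpool/ROOT/oldBE@pre-switch | \
      zfs recv -F newrpool/ROOT/oldBE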

---

Below I'll outline our experience from my notes, as it
successfully applied to an even more complicated situation
than yours:

  On many Sol10/SXCE systems with ZFS roots we've also
created a hierarchical layout (separate /var, /usr, /opt
with compression enabled), but this procedure HAS FAILED
for newer OpenIndiana systems. So for OI we have to use
the default single-root layout and only separate some of
/var/* subdirs (adm, log, mail, crash, cores, ...) in
order to set quotas and higher compression on them.
Such datasets are also kept separate from OS upgrades
and are used in all boot environments without cloning.
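
  For illustration, such a carve-out looks roughly like
this (dataset names and limits are just an example):

# zfs create -o mountpoint=none -o compression=gzip-9 rpool/SHARED
# zfs create -o mountpoint=/var/log   -o quota=2g rpool/SHARED/var-log
# zfs create -o mountpoint=/var/crash -o quota=8g rpool/SHARED/var-crash
# zfs create -o mountpoint=/var/cores -o quota=8g rpool/SHARED/var-cores

Move the old contents of each directory into its new dataset
before switching the mountpoints; keeping these datasets
outside rpool/ROOT is what keeps them out of BE cloning.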

  To simplify things, most of the transitions were done
in off-hours time so it was okay to shut down all the
zones and other services. In some cases for Sol10/SXCE
the procedure involved booting in the "Failsafe Boot"
mode; for all systems this can be done with the BootCD.

  For routine Solaris 10 and OpenSolaris SXCE maintenance
we did use LiveUpgrade, but at that time its ZFS support
was immature, so we circumvented LU and transitioned
manually. In those cases we used LU to update systems
to the base level supporting ZFS roots (Sol10u4+) while
still running from SVM mirrors (one mirror for the main
root, another for the LU alternate root holding the
new/old OS image). After the transition to the ZFS rpool,
we cleared out the LU settings (/etc/lu/, /etc/lutab)
by restoring the defaults from the most recent SUNWlu*
packages, and once booted from ZFS we created the
"current" LU BE based on the current ZFS rpool.

  When the OS was capable of booting from ZFS (Sol10u4+,
snv_100 approx), we broke the SVM mirrors, repartitioned
the second disk to our liking (about 4-10Gb for rpool,
the rest for data), created the new rpool and the dataset
hierarchy we needed, and had it mounted under "/zfsroot".
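
  The disk juggling at that point looked approximately like
this (metadevice and disk names here are made up):

# metadetach d10 d12
# metaclear d12
# format c1t1d0
# zpool create -R /zfsroot rpool c1t1d0s0
# zfs create -o mountpoint=none rpool/ROOT
# zfs create -o mountpoint=/ -o compression=on rpool/ROOT/s10be
# zpool set bootfs=rpool/ROOT/s10be rpool

Thanks to the "-R /zfsroot" altroot the new root dataset ends
up mounted under /zfsroot; don't forget installgrub (x86) or
installboot (SPARC) on the new disk before trying to boot it.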

  Note that in our case we used a "minimized" install
of Solaris which fit under 1-2Gb per BE, we did not use
a separate /dump device and the swap volume was located
in the ZFS data pool (mirror or raidz for 4-disk systems).
Zoneroots were also separate from the system rpool and
were stored in the data pool. This DID yield problems
for LiveUpgrade, so zones were detached before LU and
reattached-with-upgrade after the OS upgrade and disk
migrations.

  Then we copied the root FS data like this:

# cd /zfsroot && ( ufsdump 0f - / | ufsrestore -rf - )

  If the source (SVM) paths like /var, /usr or /boot are
separate UFS filesystems - repeat likewise, changing the
current paths in the command above.
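
  For example, with a separate /var the same pattern becomes:

# cd /zfsroot/var && ( ufsdump 0f - /var | ufsrestore -rf - )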

  For non-UFS systems, such as migration from VxFS or
even ZFS (if you need a different layout, compression,
etc. - so ZFS send/recv is not applicable), you can use
Sun cpio (it should carry over extended attributes and
ACLs). For example, if you're booted from the LiveCD
and the old UFS root is mounted in "/ufsroot" and the new
ZFS rpool hierarchy is in "/zfsroot", you'd do this:

# cd /ufsroot && ( find . -xdev -depth -print | cpio -pvdm /zfsroot )

  The example above also copies only the data from
current FS, so you need to repeat it for each UFS
sub-fs like /var, etc.

  Another problem we've encountered while cpio'ing live
systems (when not running from failsafe/livecd) is that
"find" skips mountpoints of sub-fses. While your new ZFS
hierarchy would provide usr, var and opt under /zfsroot,
you might need to manually create some others - see the
list in your current "df" output. Example:

# cd /zfsroot
# mkdir -p tmp proc devices var/run system/contract syste

[zfs-discuss] ZIL on a dedicated HDD slice (1-2 disk systems)

2012-01-07 Thread Jim Klimov

Hello all,

  For smaller systems such as laptops or low-end servers,
which can house 1-2 disks, would it make sense to dedicate
a 2-4Gb slice to the ZIL for the data pool, separate from
rpool? Example layout (single-disk or mirrored):

s0 - 16Gb - rpool
s1 - 4Gb  - data-zil
s3 - *Gb  - data pool

  The idea would be to decrease fragmentation (committed
writes to data pool would be more coalesced) and to keep
the ZIL at faster tracks of the HDD drive.

  I'm actually more interested in the former: would the
dedicated ZIL decrease fragmentation of the pool?

  Likewise, for larger pools (such as my 6-disk raidz2)
can fragmentation and/or performance benefit from some
dedicated ZIL slices (i.e. s0 = 1-2Gb ZIL per 2Tb disk,
with 3 mirrored ZIL sets overall)?

  Can several ZIL (mirror) vdevs be concatenated for a single
data pool, or can only one dedicated ZIL vdev be used?
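
  (In command terms I mean something like the following,
using the slices from the example layout above:)

# zpool create data c0t0d0s3
# zpool add data log c0t0d0s1
  ...or, for a two-disk mirrored setup:
# zpool add data log mirror c0t0d0s1 c0t1d0s1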

Thanks,
//Jim Klimov

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Upgrade

2012-01-07 Thread Jim Klimov

2012-01-06 17:49, Edward Ned Harvey wrote:

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
boun...@opensolaris.org] On Behalf Of Ivan Rodriguez

Dear list,

  I'm about to upgrade a zpool from version 10 to version 29. I suppose
that this upgrade will fix several performance issues that are present
in version 10. However, inside that pool we have several zfs filesystems,
all of them version 1. My first question is: is there a performance
problem, or any other problem, if you operate a version-29 zpool with
version-1 zfs filesystems?

Is it better to upgrade zfs to the latest version ?

Can we jump from zfs version 1 to 5 ?

Are there any implications for zfs send/receive with filesystems and
pools of different versions ?


You can, and definitely should, upgrade all your zpools and zfs
filesystems.  The only exception to think about is rpool.  You definitely
DON'T want to upgrade rpool higher than what's supported on the boot CD.  So
I suggest you create a test system, boot from the boot CD, create some
filesystems, and check which zpool and zfs versions they are.  Then
upgrade rpool only to that level (just in case you ever need to boot from CD
to perform a rescue).  And upgrade all your other filesystems to the latest.


I believe in this case it might make sense to boot the
target system from this BootCD and use "zpool upgrade"
from this OS image. This way you can be more sure that
your recovery software (Solaris BootCD) would be helpful :)

But this is only applicable if you can afford the downtime...
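
Roughly, when booted from the BootCD (version numbers and
pool names depend on what that CD and your system support):

# zpool import -f -R /a rpool
# zpool upgrade -v
# zpool upgrade -V 29 rpool
# zfs upgrade -v
# zfs upgrade -a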

//Jim

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Stress test zfs

2012-01-07 Thread Thomas Nau
Hi Grant

On 01/06/2012 04:50 PM, Richard Elling wrote:
> Hi Grant,
> 
> On Jan 4, 2012, at 2:59 PM, grant lowe wrote:
> 
>> Hi all,
>>
>> I've got a Solaris 10 running 9/10 on a T3. It's an Oracle box with 128GB
>> memory. Right now I've been trying to load test the box with bonnie++.
>> I can seem to get 80 to 90 K writes, but can't seem to get more than a
>> couple K for writes. Any suggestions? Or should I take this to a
>> bonnie++ mailing list? Any help is appreciated. I'm kinda new to load
>> testing.
> 
> I was hoping Roch (from Oracle) would respond, but perhaps he's not hanging 
> out on 
> zfs-discuss anymore?
> 
> Bonnie++ sux as a benchmark. The best analysis of this was done by Roch and 
> published
> online in the seminal blog post:
>   http://137.254.16.27/roch/entry/decoding_bonnie
> 
> I suggest you find a benchmark that more closely resembles your expected 
> workload and
> do not rely on benchmarks that provide a summary metric.
>  -- richard



I had good experience with "filebench". It resembles your workload as
well as you are able to describe it, but it takes some time to get
things set up if you cannot find your workload in one of the many
provided examples.
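
A session with one of the canned workloads looks roughly like
this (the target path and run time are just examples):

# filebench
filebench> load oltp
filebench> set $dir=/testpool/fbtest
filebench> run 60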

Thomas
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] zfs defragmentation via resilvering?

2012-01-07 Thread Jim Klimov

Hello all,

  I understand that relatively high fragmentation is inherent
to ZFS due to its COW and possible intermixing of metadata
and data blocks (of which metadata path blocks are likely
to expire and get freed relatively quickly).

  I believe it was sometimes implied on this list that such
fragmentation for "static" data can be currently combatted
only by zfs send-ing existing pools data to other pools at
some reserved hardware, and then clearing the original pools
and sending the data back. This is time-consuming, disruptive
and requires lots of extra storage idling for this task (or
at best - for backup purposes).

  I wonder how resilvering works, namely - does it write
blocks "as they were" or in an optimized (defragmented)
fashion, in two usecases:
1) Resilvering from a healthy array (vdev) onto a spare drive
   in order to replace one of the healthy drives in the vdev;
2) Resilvering a degraded array from existing drives onto a
   new drive in order to repair the array and make it redundant
   again.

Also, are these two modes different at all?
I.e. if I were to ask ZFS to replace a working drive with
a spare in case (1), can I do it at all, and would its
data simply be copied over, or reconstructed from other
drives, or some mix of these two operations?
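
  (In command terms I mean something like the following,
with hypothetical device names:)

# zpool replace tank c0t2d0 c0t6d0
# zpool status tank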

  Finally, what would the gurus say - does fragmentation
pose a heavy problem on nearly-filled-up pools made of
spinning HDDs (I believe so, at least judging from those
performance degradation problems writing to 80+%-filled
pools), and can fragmentation be effectively combatted
on ZFS at all (with or without BP rewrite)?

  For example, can(does?) metadata live "separately"
from data in some "dedicated" disk areas, while data
blocks are written as contiguously as they can?

  Many Windows defrag programs group files into several
"zones" on the disk based on their last-modify times, so
that old WORM files remain defragmented for a long time.
There are thus some empty areas reserved for new writes
as well as for moving newly discovered WORM files to
the WORM zones (free space permitting)...

  I wonder if this is viable with ZFS (COW and snapshots
involved) when BP-rewrites are implemented? Perhaps such
zoned defragmentation can be done based on block creation
date (TXG number) and the knowledge that some blocks in
certain order comprise at least one single file (maybe
more due to clones and dedup) ;)

What do you think? Thanks,
//Jim Klimov
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZIL on a dedicated HDD slice (1-2 disk systems)

2012-01-07 Thread Edward Ned Harvey
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Jim Klimov
> 
>For smaller systems such as laptops or low-end servers,
> which can house 1-2 disks, would it make sense to dedicate
> a 2-4Gb slice to the ZIL for the data pool, separate from
> rpool? Example layout (single-disk or mirrored):
>
>The idea would be to decrease fragmentation (committed
> writes to data pool would be more coalesced) and to keep
> the ZIL at faster tracks of the HDD drive.

I'm not authoritative, I'm speaking from memory of former discussions on
this list and various sources of documentation.

No, it won't help you.

First of all, all your writes to the storage pool are aggregated, so you're
already minimizing fragmentation of writes in your main pool.  However, over
time, as snapshots are created & destroyed, small changes are made to files,
and file contents are overwritten incrementally and internally...  The only
fragmentation you get creeps in as a result of COW.  This fragmentation only
impacts sequential reads of files which were previously written in random
order.  This type of fragmentation has no relation to ZIL or writes.

If you don't split out your ZIL separate from the storage pool, zfs already
chooses disk blocks that it believes to be optimized for minimal access
time.  In fact, I believe, zfs will dedicate a few sectors at the low end, a
few at the high end, and various other locations scattered throughout the
pool, so whatever the current head position, it tries to go to the closest
"landing zone" that's available for ZIL writes.  If anything, splitting out
your ZIL to a different partition might actually hurt your performance.

Also, the concept of "faster tracks of the HDD" is also incorrect.  Yes,
there was a time when HDD speeds were limited by rotational speed and
magnetic density, so the outer tracks of the disk could serve up more data
because more magnetic material passed over the head in each rotation.  But
nowadays, the hard drive sequential speed is limited by the head speed,
which is invariably right around 1Gbps.  So the inner and outer sectors of
the HDD are equally fast - the outer sectors are actually less magnetically
dense because the head can't handle it.  And the random IO speed is limited
by head seek + rotational latency, where seek is typically several times
longer than latency.  

So basically, the only thing that matters, to optimize the performance of
any modern typical HDD, is to minimize the head travel.  You want to be
seeking sectors which are on tracks that are nearby to the present head
position.

Of course, if you want to test & benchmark the performance of splitting
apart the ZIL to a different partition, I encourage that.  I'm only speaking
my beliefs based on my understanding of the architectures and limitations
involved.  This is my best prediction.  And I've certainly been wrong
before.  ;-)  Sometimes, being wrong is my favorite thing, because you learn
so much from it.  ;-)

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs defragmentation via resilvering?

2012-01-07 Thread Edward Ned Harvey
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Jim Klimov
> 
>I understand that relatively high fragmentation is inherent
> to ZFS due to its COW and possible intermixing of metadata
> and data blocks (of which metadata path blocks are likely
> to expire and get freed relatively quickly).
> 
>I believe it was sometimes implied on this list that such
> fragmentation for "static" data can be currently combatted
> only by zfs send-ing existing pools data to other pools at
> some reserved hardware, and then clearing the original pools
> and sending the data back. This is time-consuming, disruptive
> and requires lots of extra storage idling for this task (or
> at best - for backup purposes).

Can be combated by sending & receiving.  But that's not the only way.  You
can defrag, (or apply/remove dedup and/or compression, or any of the other
stuff that's dependent on BP rewrite) by doing any technique which
sequentially reads the existing data, and writes it back to disk again.  For
example, if you "cp -p file1 file2 && mv file2 file1" then you have
effectively defragged file1 (or added/removed dedup or compression).  But of
course it's requisite that file1 is sufficiently "not being used" right now.
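
For example, the send/receive round trip mentioned above would be
roughly this (pool and snapshot names are hypothetical):

# zfs snapshot -r tank/data@move
# zfs send -R tank/data@move | zfs receive -F spare/data
# zfs destroy -r tank/data
# zfs send -R spare/data@move | zfs receive tank/data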


>I wonder how resilvering works, namely - does it write
> blocks "as they were" or in an optimized (defragmented)
> fashion, in two usecases:

resilver goes according to temporal order.  While this might sometimes yield
a slightly better organization (If a whole bunch of small writes were
previously spread out over a large period of time on a largely idle system,
they will now be write-aggregated to sequential blocks) usually resilvering
recreates fragmentation similar to the pre-existing fragmentation.  

In fact, even if you zfs send | zfs receive while preserving snapshots,
you're still recreating the data in a loosely temporal order.
Because it will do all the blocks of the oldest snapshot, and then all the
blocks of the second oldest snapshot, etc.  So by preserving the old
snapshots, you might sometimes be recreating significant amount of
fragmentation anyway.


> 1) Resilvering from a healthy array (vdev) onto a spare drive
> in order to replace one of the healthy drives in the vdev;
> 2) Resilvering a degraded array from existing drives onto a
> new drive in order to repair the array and make it redundant
> again.

Same behavior either way.  Unless...  If your old disks are small and very
full, and your new disks are bigger, then sometimes in the past you may have
suffered fragmentation due to lack of available sequential unused blocks.
So resilvering onto new *larger* disks might make a difference.


>Finally, what would the gurus say - does fragmentation
> pose a heavy problem on nearly-filled-up pools made of
> spinning HDDs 

Yes.  But that's not unique to ZFS or COW.  No matter what your system, if
your disk is nearly full, you will suffer from fragmentation.


> and can fragmentation be effectively combatted
> on ZFS at all (with or without BP rewrite)?

With BP rewrite, yes you can effectively combat fragmentation.
Unfortunately it doesn't exist.  :-/

Without BP rewrite...  Define "effectively."  ;-)  I have successfully
defragged, compressed, enabled/disabled dedup on pools before, by using zfs
send | zfs receive...  Or by asking users, "Ok, we're all in agreement, this
weekend, nobody will be using the "a" directory.  Right?"  So then I sudo rm
-rf a, and restore from the latest snapshot.  Or something along those
lines.  Next weekend, we'll do the "b" directory...
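
In shell terms that weekend exercise is roughly (dataset and
snapshot names are made up; use cpio or another ZFS-aware copy
if you need to carry over ACLs and extended attributes):

# zfs snapshot tank/proj@latest
# rm -rf /tank/proj/a
# cp -pr /tank/proj/.zfs/snapshot/latest/a /tank/proj/a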

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs defragmentation via resilvering?

2012-01-07 Thread Hung-Sheng Tsao (Lao Tsao 老曹) Ph.D.

it seems that s11 shadow migration can help:-)


On 1/7/2012 9:50 AM, Jim Klimov wrote:

Hello all,

  I understand that relatively high fragmentation is inherent
to ZFS due to its COW and possible intermixing of metadata
and data blocks (of which metadata path blocks are likely
to expire and get freed relatively quickly).

[... rest of the quoted message snipped - see the original post above ...]
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


--
Hung-Sheng Tsao Ph D.
Founder&  Principal
HopBit GridComputing LLC
cell: 9734950840

http://laotsao.blogspot.com/
http://laotsao.wordpress.com/
http://blogs.oracle.com/hstsao/

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] zfs read-ahead and L2ARC

2012-01-07 Thread Jim Klimov

I wonder if it is possible (currently or in the future as an RFE)
to tell ZFS to automatically read-ahead some files and cache them
in RAM and/or L2ARC?

One use-case would be for Home-NAS setups where multimedia (video
files or catalogs of images/music) are viewed from a ZFS box. For
example, if a user wants to watch a film, or listen to a playlist
of MP3's, or push photos to a wall display (photo frame, etc.),
the storage box "should" read-ahead all required data from HDDs
and save it in ARC/L2ARC. Then the HDDs can spin down for hours
while the pre-fetched gigabytes of data are used by consumers
from the cache. End-users get peace, quiet and less electricity
used while they enjoy their multimedia entertainment ;)

Is it possible? If not, how hard would it be to implement?

In terms of scripting, would it suffice to detect reads (i.e.
with DTrace) and read the files to /dev/null to get them cached
along with all required metadata (so that mechanical HDDs are
not required for reads afterwards)?
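
I mean something as crude as this, for example (paths are
just an illustration):

# find /pool/media/film1 -type f | while read f; do
      cat "$f" > /dev/null
  done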

Thanks,
//Jim Klimov
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Upgrade

2012-01-07 Thread Bob Friesenhahn

On Sat, 7 Jan 2012, Jim Klimov wrote:

I believe in this case it might make sense to boot the
target system from this BootCD and use "zpool upgrade"
from this OS image. This way you can be more sure that
your recovery software (Solaris BootCD) would be helpful :)


Also keep in mind that it would be a grievous error if the zpool 
version supported by the BootCD were newer than what the installed GRUB 
and OS can support.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS and spread-spares (kinda like GPFS declustered RAID)?

2012-01-07 Thread Bob Friesenhahn

On Sat, 7 Jan 2012, Jim Klimov wrote:


Several RAID systems have implemented "spread" spare drives
in the sense that there is not an idling disk waiting to
receive a burst of resilver data filling it up, but the
capacity of the spare disk is spread among all drives in
the array. As a result, the healthy array gets one more
spindle and works a little faster, and rebuild times are
often decreased since more spindles can participate in
repairs at the same time.


I think that I would also be interested in a system which uses the 
so-called spare disks for more protective redundancy but then reduces 
that protective redundancy in order to use that disk to replace a 
failed disk or to automatically enlarge the pool.


For example, a pool could start out with four-way mirroring when there 
is little data in the pool.  When the pool becomes more full, mirror 
devices are automatically removed (from existing vdevs), and used to 
add more vdevs.  Eventually a limit would be hit so that no more 
mirrors are allowed to be removed.

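The manual equivalent of that shrink-and-grow step is already 
possible today, e.g. something like this (made-up device names), 
though not as an automatic policy:

# zpool detach tank c0t2d0
# zpool detach tank c0t3d0
# zpool add tank mirror c0t2d0 c0t3d0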

Obviously this approach works with simple mirrors but not for raidz.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS and spread-spares (kinda like GPFS declustered RAID)?

2012-01-07 Thread Richard Elling
Hi Jim,

On Jan 6, 2012, at 3:33 PM, Jim Klimov wrote:

> Hello all,
> 
> I have a new idea up for discussion.
> 
> Several RAID systems have implemented "spread" spare drives
> in the sense that there is not an idling disk waiting to
> receive a burst of resilver data filling it up, but the
> capacity of the spare disk is spread among all drives in
> the array. As a result, the healthy array gets one more
> spindle and works a little faster, and rebuild times are
> often decreased since more spindles can participate in
> repairs at the same time.

Xiotech has a distributed, relocatable model, but the FRU is the whole ISE.
There have been other implementations of more distributed RAIDness in the
past (RAID-1E, etc). 

The big question is whether they are worth the effort. Spares solve a 
serviceability
problem and only impact availability in an indirect manner. For single-parity 
solutions, spares can make a big difference in MTTDL, but have almost no impact
on MTTDL for double-parity solutions (eg. raidz2).

> I don't think I've seen such idea proposed for ZFS, and
> I do wonder if it is at all possible with variable-width
> stripes? Although if the disk is sliced in 200 metaslabs
> or so, implementing a spread-spare is a no-brainer as well.

Put some thoughts down on paper and work through the math. If it all works
out, let's implement it!
 -- richard

> 
> To be honest, I've seen this a long time ago in (Falcon?)
> RAID controllers, and recently - in a USEnix presentation
> of IBM GPFS on YouTube. In the latter the speaker goes into
> greater depth describing how their "declustered RAID"
> approach (as they call it: all blocks - spare, redundancy
> and data are intermixed evenly on all drives and not in
> a single "group" or a mid-level VDEV as would be for ZFS).
> 
> http://www.youtube.com/watch?v=2g5rx4gP6yU&feature=related
> 
> GPFS with declustered RAID not only decreases rebuild
> times and/or impact of rebuilds on end-user operations,
> but it also happens to increase reliability - there is
> a smaller time window in case of multiple-disk failure
> in a large RAID-6 or RAID-7 array (in the example they
> use 47-disk sets) that the data is left in a "critical
> state" due to lack of redundancy, and there is less data
> overall in such state - so the system goes from critical
> to simply degraded (with some redundancy) in a few minutes.
> 
> Another thing they have in GPFS is temporary offlining
> of disks so that they can catch up when reattached - only
> newer writes (bigger TXG numbers in ZFS terms) are added to
> reinserted disks. I am not sure this exists in ZFS today,
> either. This might simplify physical systems maintenance
> (as it does for IBM boxes - see presentation if interested)
> and quick recovery from temporarily unavailable disks, such
> as when a disk gets a bus reset and is unavailable for writes
> for a few seconds (or more) while the array keeps on writing.
> 
> I find these ideas cool. I do believe that IBM might get
> angry if ZFS development copy-pasted them "as is", but it
> might get nonetheless get us inventing a similar wheel
> that would be a bit different ;)
> There are already several vendors doing this in some way,
> so perhaps there is no (patent) monopoly in place already...
> 
> And I think all the magic of spread spares and/or "declustered
> RAID" would go into just making another write-block allocator
> in the same league "raidz" or "mirror" are nowadays...
> BTW, are such allocators pluggable (as software modules)?
> 
> What do you think - can and should such ideas find their
> way into ZFS? Or why not? Perhaps from theoretical or
> real-life experience with such storage approaches?
> 
> //Jim Klimov
> 
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

-- 

ZFS and performance consulting
http://www.RichardElling.com
illumos meetup, Jan 10, 2012, Menlo Park, CA
http://www.meetup.com/illumos-User-Group/events/41665962/ 

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZIL on a dedicated HDD slice (1-2 disk systems)

2012-01-07 Thread Richard Elling
On Jan 7, 2012, at 7:12 AM, Edward Ned Harvey wrote:

>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>> boun...@opensolaris.org] On Behalf Of Jim Klimov
>> 
>>   For smaller systems such as laptops or low-end servers,
>> which can house 1-2 disks, would it make sense to dedicate
>> a 2-4Gb slice to the ZIL for the data pool, separate from
>> rpool? Example layout (single-disk or mirrored):
>> 
>>   The idea would be to decrease fragmentation (committed
>> writes to data pool would be more coalesced) and to keep
>> the ZIL at faster tracks of the HDD drive.
> 
> I'm not authoritative, I'm speaking from memory of former discussions on
> this list and various sources of documentation.
> 
> No, it won't help you.

Correct :-)

> First of all, all your writes to the storage pool are aggregated, so you're
> already minimizing fragmentation of writes in your main pool.  However, over
> time, as snapshots are created & destroyed, small changes are made to files,
> and file contents are overwritten incrementally and internally...  The only
> fragmentation you get creeps in as a result of COW.  This fragmentation only
> impacts sequential reads of files which were previously written in random
> order.  This type of fragmentation has no relation to ZIL or writes.
> 
> If you don't split out your ZIL separate from the storage pool, zfs already
> chooses disk blocks that it believes to be optimized for minimal access
> time.  In fact, I believe, zfs will dedicate a few sectors at the low end, a
> few at the high end, and various other locations scattered throughout the
> pool, so whatever the current head position, it tries to go to the closest
> "landing zone" that's available for ZIL writes.  If anything, splitting out
> your ZIL to a different partition might actually hurt your performance.
> 
> Also, the concept of "faster tracks of the HDD" is also incorrect.  Yes,
> there was a time when HDD speeds were limited by rotational speed and
> magnetic density, so the outer tracks of the disk could serve up more data
> because more magnetic material passed over the head in each rotation.  But
> nowadays, the hard drive sequential speed is limited by the head speed,
> which is invariably right around 1Gbps.  So the inner and outer sectors of
> the HDD are equally fast - the outer sectors are actually less magnetically
> dense because the head can't handle it.  And the random IO speed is limited
> by head seek + rotational latency, where seek is typically several times
> longer than latency.  

Disagree. My data, and the vendor specs, continue to show different sequential
media bandwidth speed for inner vs outer cylinders.

> 
> So basically, the only thing that matters, to optimize the performance of
> any modern typical HDD, is to minimize the head travel.  You want to be
> seeking sectors which are on tracks that are nearby to the present head
> position.
> 
> Of course, if you want to test & benchmark the performance of splitting
> apart the ZIL to a different partition, I encourage that.  I'm only speaking
> my beliefs based on my understanding of the architectures and limitations
> involved.  This is my best prediction.  And I've certainly been wrong
> before.  ;-)  Sometimes, being wrong is my favorite thing, because you learn
> so much from it.  ;-)

Good idea.

I think you will see a tradeoff on the read side of the mixed read/write 
workload.
Sync writes have higher priority than reads so the order of I/O sent to the disk
will appear to be very random and not significantly coalesced. This is the 
pathological worst case workload for a HDD.
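
Watching the per-vdev and per-device I/O during such a test
should make the effect visible, e.g.:

# zpool iostat -v data 5
# iostat -xnz 5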

OTOH, you're not trying to get high performance from an HDD are you?  That
game is over.
 -- richard

-- 

ZFS and performance consulting
http://www.RichardElling.com
illumos meetup, Jan 10, 2012, Menlo Park, CA
http://www.meetup.com/illumos-User-Group/events/41665962/ 

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS and spread-spares (kinda like GPFS declustered RAID)?

2012-01-07 Thread Jim Klimov

2012-01-08 5:37, Richard Elling wrote:

The big question is whether they are worth the effort. Spares solve a 
serviceability
problem and only impact availability in an indirect manner. For single-parity
solutions, spares can make a big difference in MTTDL, but have almost no impact
on MTTDL for double-parity solutions (eg. raidz2).


Well, regarding this part: in the presentation linked in my OP,
the IBM presenter suggests that for a 6-disk raid10 (3 mirrors)
with one spare drive - a 7-disk set overall - there are these
options for "critical" hits to data redundancy when one of the
drives dies:

1) Traditional RAID - one full disk is a mirror of another
   full disk; 100% of a disk's size is "critical" and has to
   be replicated onto the spare drive ASAP;

2) Declustered RAID - all 7 disks are used for the 2 unique
   data blocks from the "original" setup plus one spare block
   (I am not sure I described it well in words, his diagram
   shows it better); if a single disk dies, only 1/7 worth
   of a disk's size is critical (not redundant) and can be
   fixed faster.

   For their typical 47-disk sets of RAID-7-like redundancy,
   under 1% of data becomes critical when 3 disks die at once,
   which is (deemed) unlikely as is.

Apparently, in the GPFS layout, MTTDL is much higher than
in raid10+spare with all other stats being similar.

I am not sure I'm ready (or qualified) to sit down and present
the math right now - I just heard some ideas that I considered
worth sharing and discussing ;)

Thanks for the input,
//Jim

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS and spread-spares (kinda like GPFS declustered RAID)?

2012-01-07 Thread Tim Cook
On Sat, Jan 7, 2012 at 7:37 PM, Richard Elling wrote:

> Hi Jim,
>
> On Jan 6, 2012, at 3:33 PM, Jim Klimov wrote:
>
> > Hello all,
> >
> > I have a new idea up for discussion.
> >
> > Several RAID systems have implemented "spread" spare drives
> > in the sense that there is not an idling disk waiting to
> > receive a burst of resilver data filling it up, but the
> > capacity of the spare disk is spread among all drives in
> > the array. As a result, the healthy array gets one more
> > spindle and works a little faster, and rebuild times are
> > often decreased since more spindles can participate in
> > repairs at the same time.
>
> Xiotech has a distributed, relocatable model, but the FRU is the whole ISE.
> There have been other implementations of more distributed RAIDness in the
> past (RAID-1E, etc).
>
> The big question is whether they are worth the effort. Spares solve a
> serviceability
> problem and only impact availability in an indirect manner. For
> single-parity
> solutions, spares can make a big difference in MTTDL, but have almost no
> impact
> on MTTDL for double-parity solutions (eg. raidz2).
>


I disagree.  Dedicated spares impact far more than availability.  During a
rebuild performance is, in general, abysmal.  ZIL and L2ARC will obviously
help (L2ARC more than ZIL), but at the end of the day, if we've got a 12
hour rebuild (fairly conservative in the days of 2TB SATA drives), the
performance degradation is going to be very real for end-users.  With
distributed parity and spares, you should in theory be able to cut this
down an order of magnitude.  I feel as though you're brushing this off as
not a big deal when it's an EXTREMELY big deal (in my mind).  In my opinion
you can't just approach this from an MTTDL perspective, you also need to
take into account user experience.  Just because I haven't lost data,
doesn't mean the system isn't (essentially) unavailable (sorry for the
double negative and repeated parenthesis).  If I can't use the system due
to performance being a fraction of what it is during normal production, it
might as well be an outage.




>
> > I don't think I've seen such idea proposed for ZFS, and
> > I do wonder if it is at all possible with variable-width
> > stripes? Although if the disk is sliced in 200 metaslabs
> > or so, implementing a spread-spare is a no-brainer as well.
>
> Put some thoughts down on paper and work through the math. If it all works
> out, let's implement it!
>  -- richard
>
>
I realize it's not intentional Richard, but that response is more than a
bit condescending.  If he could just put it down on paper and code
something up, I strongly doubt he would be posting his thoughts here.  He
would be posting results.  The intention of his post, as far as I can tell,
is to perhaps inspire someone who CAN just write down the math and write up
the code to do so.  Or at least to have them review his thoughts and give
him a dev's perspective on how viable bringing something like this to ZFS
is.  I fear responses like "the code is there, figure it out" make the
*aris community no better than the linux one.




> >
> > To be honest, I've seen this a long time ago in (Falcon?)
> > RAID controllers, and recently - in a USEnix presentation
> > of IBM GPFS on YouTube. In the latter the speaker goes into
> > greater depth describing how their "declustered RAID"
> > approach (as they call it: all blocks - spare, redundancy
> > and data are intermixed evenly on all drives and not in
> > a single "group" or a mid-level VDEV as would be for ZFS).
> >
> > http://www.youtube.com/watch?v=2g5rx4gP6yU&feature=related
> >
> > GPFS with declustered RAID not only decreases rebuild
> > times and/or impact of rebuilds on end-user operations,
> > but it also happens to increase reliability - there is
> > a smaller time window in case of multiple-disk failure
> > in a large RAID-6 or RAID-7 array (in the example they
> > use 47-disk sets) that the data is left in a "critical
> > state" due to lack of redundancy, and there is less data
> > overall in such state - so the system goes from critical
> > to simply degraded (with some redundancy) in a few minutes.
> >
> > Another thing they have in GPFS is temporary offlining
> > of disks so that they can catch up when reattached - only
> > newer writes (bigger TXG numbers in ZFS terms) are added to
> > reinserted disks. I am not sure this exists in ZFS today,
> > either. This might simplify physical systems maintenance
> > (as it does for IBM boxes - see presentation if interested)
> > and quick recovery from temporarily unavailable disks, such
> > as when a disk gets a bus reset and is unavailable for writes
> > for a few seconds (or more) while the array keeps on writing.
> >
> > I find these ideas cool. I do believe that IBM might get
> > angry if ZFS development copy-pasted them "as is", but it
> > might nonetheless get us inventing a similar wheel
> > that would be a bit different ;)
> > There are already several vendors doin