Re: [zfs-discuss] ZFS and spread-spares (kinda like GPFS declustered RAID)?

2012-01-08 Thread Jim Klimov

First of all, I would like to thank Bob, Richard and Tim for
at least taking time to look at this proposal and responding ;)

It is also encouraging to see that 2 of 3 responders consider
this idea at least worth pondering and discussing, as it appeals
to their direct interest. Even Richard was not dismissive of it ;)

Finally, as Tim was right to note, I am not a kernel developer
(and won't become one as good as those present on this list).
Of course, I could pull the blanket onto my side and say
that I'd try to write that code myself... but it would
probably be a long wait, like that for BP rewrite - because
I already have quite a few commitments and responsibilities
as an admin and recently as a parent (yay!)

So, I guess, my piece of the pie is currently limited to RFEs
and bug reports... and working in IT for a software development
company, I believe (or hope) that's not a useless part of the
process ;)

I do believe that ZFS technology is amazing - despite some
shortcomings that are still present - and I do want to see
it flourish... ASAP! :^)

//Jim


2012-01-08 7:15, Tim Cook wrote:



On Sat, Jan 7, 2012 at 7:37 PM, Richard Elling <richard.ell...@gmail.com> wrote:

Hi Jim,

On Jan 6, 2012, at 3:33 PM, Jim Klimov wrote:

  Hello all,
 
  I have a new idea up for discussion.
 

...



I disagree.  Dedicated spares impact far more than availability.  During
a rebuild performance is, in general, abysmal.  ...
  If I can't use the system due to performance being a fraction of what
it is during normal production, it might as well be an outage.



  I don't think I've seen such idea proposed for ZFS, and
  I do wonder if it is at all possible with variable-width
  stripes? Although if the disk is sliced in 200 metaslabs
  or so, implementing a spread-spare is a no-brainer as well.

Put some thoughts down on paper and work through the math. If it all
works
out, let's implement it!
  -- richard


I realize it's not intentional Richard, but that response is more than a
bit condescending.  If he could just put it down on paper and code
something up, I strongly doubt he would be posting his thoughts here.
  He would be posting results.  The intention of his post, as far as I
can tell, is to perhaps inspire someone who CAN just write down the math
and write up the code to do so.  Or at least to have them review his
thoughts and give him a dev's perspective on how viable bringing
something like this to ZFS is.  I fear responses like "the code is
there, figure it out" make the *aris community no better than the linux
one.

 
  What do you think - can and should such ideas find their
  way into ZFS? Or why not? Perhaps from theoretical or
  real-life experience with such storage approaches?
 
  //Jim Klimov

As always, feel free to tell me why my rant is completely off base ;)

--Tim



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs read-ahead and L2ARC

2012-01-08 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Jim Klimov
 
 I wonder if it is possible (currently or in the future as an RFE)
 to tell ZFS to automatically read-ahead some files and cache them
 in RAM and/or L2ARC?
 
 One use-case would be for Home-NAS setups where multimedia (video
 files or catalogs of images/music) are viewed from a ZFS box. For
 example, if a user wants to watch a film, or listen to a playlist
 of MP3's, or push photos to a wall display (photo frame, etc.),
 the storage box should read-ahead all required data from HDDs
 and save it in ARC/L2ARC. Then the HDDs can spin down for hours
 while the pre-fetched gigabytes of data are used by consumers
 from the cache. End-users get peace, quiet and less electricity
 used while they enjoy their multimedia entertainment ;)

This whole subject is important and useful - and not unique to ZFS.  The
whole question is, how can the system predict which things are going to be
requested next?

In the case of a video - there's a big file which is likely to be read
sequentially.  I don't know how far readahead currently will read ahead, but
it is surely only smart enough to stay within a single file.  If the
readahead buffer starts to get low, and the disks have been spun down, I
don't know how low the buffer gets before it will trigger more readahead.
But at least in the case of streaming video files, there's a very realistic
possibility that something like the existing readahead can do what you want.

In the case of your MP3 collection...  Probably the only thing you can do is
to write a script which will simply go read all the files you predict will
be read soon.  The key here is the prediction - There's no way ZFS or
solaris, or any other OS in the present day is going to intelligently
predict which files you'll be requesting soon.  But you, the user, who knows
your usage patterns, might be able to make these predictions and request to
cache them.  The request is simply - telling the system to start reading
those files now.  So it's very easy to cache, as long as you know what to
cache.
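
For what it's worth, a minimal sketch of such a pre-read script (the paths
and file patterns are made-up examples, nothing ZFS-specific):

  #!/bin/sh
  # Warm the ARC/L2ARC by simply reading tonight's predicted files once;
  # the caching happens as a side effect of the normal reads.
  for f in /tank/media/playlists/tonight/*.mp3; do
      cat "$f" > /dev/null
  done
  # or, for a whole directory tree of photos:
  find /tank/media/photos/2011-vacation -type f -exec cat {} + > /dev/null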

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZIL on a dedicated HDD slice (1-2 disk systems)

2012-01-08 Thread Edward Ned Harvey
 From: Richard Elling [mailto:richard.ell...@gmail.com]
 
  Also, the concept of faster tracks of the HDD is also incorrect.  Yes,
  there was a time when HDD speeds were limited by rotational speed and
  magnetic density, so the outer tracks of the disk could serve up more
data
  because more magnetic material passed over the head in each rotation.
 But
  nowadays, the hard drive sequential speed is limited by the head speed,
  which is invariably right around 1Gbps.  So the inner and outer sectors
of
  the HDD are equally fast - the outer sectors are actually less
magnetically
  dense because the head can't handle it.  And the random IO speed is
 limited
  by head seek + rotational latency, where seek is typically several times
  longer than latency.
 
 Disagree. My data, and the vendor specs, continue to show different
 sequential
 media bandwidth speed for inner vs outer cylinders.

Any reference?  I know, as I sit and dd from some disk | pv > /dev/null, it
will tell me something like 1.0Gbps...  I check its progress periodically
while it runs, and while it varies a little (say, sometimes 1.0,
1.1, 1.2) it goes up and down throughout the process.  There is no
noticeable difference between the early, mid, and late behavior,
sequentially reading the whole disk.

If the performance of the outer tracks were better than the performance of the
inner tracks due to limitations of magnetic density or rotation speed (not
being limited by the head speed or bus speed), then the sequential
performance of the drive should increase roughly linearly with track radius,
going toward the outer tracks, since each track holds data in proportion to
its circumference, c = 2 * pi * r (very roughly a 2x spread between the
innermost and outermost tracks of a 3.5" platter).

It is my belief, based on specs I've previously looked at, that mfgrs break
the drive down into zones.  So, something like the inner 20% of the tracks
will have magnetic layout pattern A, and the next 20% will have magnetic
layout pattern B, and so forth...  Within a single magnetic layout pattern,
jumping from individual track to individual track can yield a difference of
performance, but it's not a huge step from one to the next.  And when you
transition from layout pattern to layout pattern, the pattern just repeats
itself again.  They're trying to optimize so that, to a first order, the
performance limitations are mostly caused by head and/or bus speed.  If
those are the bottlenecks, let them be the bottlenecks, and at least solve
all the other problems that are solvable.

So, small variations of sequential performance are possible, jumping from
track to track, but based on what I've seen, the maximum performance
difference from the absolute slowest track to the absolute fastest track
(which may or may not have any relation to inner vs outer) ... maximum
variation on the order of a 10% performance difference.  Nowhere near the
roughly 2x spread that track geometry alone would predict.


 OTOH, you're not trying to get high performance from an HDD are you?  That
 game is over.

Lots of us still have to live with HDD's, due to capacity and cost
requirements.  We accept a relative definition of high performance, and
still want to get all the performance we can out of whatever device we're
using.  Even if there exists a faster device somewhere in the world.

Also, for sequential performance, HDD's are on-par with, and often better
than SSD's.  (For now.)  While many SSD's publish specs including something
like 220 MB/s which is higher than HDD's can reach...  SSD's publish their
maximum performance, which is not typical performance.  After you use them
for a month, they slow down.  Often to half or worse, of the speed they
originally were able to run.  Which is... as I say...  on-par with, or worse
than, the sequential speed of an HDD.

Even crappy SSD's can have random IO worse than HDD's.  Just benchmark any
high-cost top-tier USB3 flash memory stick, and you'll see what I mean.  ;-)
The only SSD's that are faster than HDD's in any way are *actual* internal
sas/sata/etc SSD's, which are faster than HDD in terms of random IOPS and
maybe sequential.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs read-ahead and L2ARC

2012-01-08 Thread John Martin

On 01/08/12 09:30, Edward Ned Harvey wrote:


In the case of your MP3 collection...  Probably the only thing you can do is
to write a script which will simply go read all the files you predict will
be read soon.  The key here is the prediction - There's no way ZFS or
solaris, or any other OS in the present day is going to intelligently
predict which files you'll be requesting soon.



The other prediction is whether the blocks will be reused.
If the blocks of a streaming read are only used once, then
it may be wasteful for a file system to allow these blocks
to be placed in the cache.  If a file system purposely
chooses to not cache streaming reads, manually scheduling a
pre-read of particular files may simply cause the file to be read
from disk twice: on the manual pre-read and when it is read again
by the actual application.

I believe Joerg Moellenkamp published a discussion
several years ago on how the L1ARC attempts to deal with the pollution
of the cache by large streaming reads, but I don't have
a bookmark handy (nor the knowledge of whether the
behavior is still accurate).
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs read-ahead and L2ARC

2012-01-08 Thread Jim Klimov

2012-01-08 19:15, John Martin wrote:

On 01/08/12 09:30, Edward Ned Harvey wrote:


In the case of your MP3 collection... Probably the only thing you can
do is
to write a script which will simply go read all the files you predict
will
be read soon. The key here is the prediction - There's no way ZFS or
solaris, or any other OS in the present day is going to intelligently
predict which files you'll be requesting soon.



The other prediction is whether the blocks will be reused.
If the blocks of a streaming read are only used once, then
it may be wasteful for a file system to allow these blocks
to be placed in the cache. If a file system purposely
chooses to not cache streaming reads, manually scheduling a
pre-read of particular files may simply cause the file to be read
from disk twice: on the manual pre-read and when it is read again
by the actual application.

I believe Joerg Moellenkamp published a discussion
several years ago on how the L1ARC attempts to deal with the pollution
of the cache by large streaming reads, but I don't have
a bookmark handy (nor the knowledge of whether the
behavior is still accurate).


Well, this point is valid for intensively-used servers - but
then such blocks might just get evicted from the caches by
newer and/or more-frequently-used blocks.

However for smaller servers, such as home NASes which have
about one user overall, pre-reading and caching files even
for a single use might be an objective per se - just to let
the hard-disks spin down. Say, if I sit down to watch a
movie from my NAS, it is likely that for 90 or 120 minutes
there will be no other IO initiated by me. The movie file
can be pre-read in a few seconds, and then most of the
storage system can go to sleep.

//Jim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZIL on a dedicated HDD slice (1-2 disk systems)

2012-01-08 Thread Jim Klimov

2012-01-08 18:56, Edward Ned Harvey wrote:

From: Richard Elling [mailto:richard.ell...@gmail.com]
Disagree. My data, and the vendor specs, continue to show different
sequential
media bandwidth speed for inner vs outer cylinders.


Any reference?


Well, Richard's data matches mine with tests of my HDDs
at home: I read in some 10-gb blocks at different offsets
(dd > /dev/null), and linear speeds dropped from about
150MBps to about 80-100MBps.

This was tested on a relatively modern 2TB Seagate drive.
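
For reference, the test was roughly along these lines (the device name
and offsets are only examples):

  #!/bin/ksh
  # read 10 GB chunks at several offsets across a 2 TB disk and time them
  DISK=/dev/rdsk/c5t0d0s2            # example device, adjust as needed
  for offset_gb in 0 500 1000 1500 1900; do
      echo "offset ${offset_gb} GB:"
      # bs=1024k, so skip is in 1 MB units; ptime reports the elapsed time
      ptime dd if=$DISK of=/dev/null bs=1024k skip=$((offset_gb * 1024)) count=10240
  done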

Random IOs are still crappy on mechanical drives, often
under 10MBps ;)

//Jim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZIL on a dedicated HDD slice (1-2 disk systems)

2012-01-08 Thread Bob Friesenhahn

On Sat, 7 Jan 2012, Edward Ned Harvey wrote:


If you don't split out your ZIL separate from the storage pool, zfs already
chooses disk blocks that it believes to be optimized for minimal access
time.  In fact, I believe, zfs will dedicate a few sectors at the low end, a
few at the high end, and various other locations scattered throughout the
pool, so whatever the current head position, it tries to go to the closest
landing zone that's available for ZIL writes.  If anything, splitting out
your ZIL to a different partition might actually hurt your performance.


Something else to be aware of is that even if you don't have a 
dedicated ZIL device, zfs will create a ZIL using devices in the main 
pool so there is always a ZIL, even if you don't see it.  Also, the 
ZIL is only used to record pending small writes.  Larger writes (I 
think 128K or more) are written to their pre-allocated final location 
in the main pool.  This choice is made since the purpose of the ZIL is 
to minimize random I/O to disk, and writing large amounts of data to 
the ZIL would create a bandwidth bottleneck.
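
If you want to influence this per dataset, the logbias property (where 
available) tells ZFS whether to favor the log device for latency or to 
stream the data blocks straight to the main pool; the dataset name below 
is just an example:

  # check the current setting
  zfs get logbias tank/backups
  # bypass the slog and stream synchronous writes to the main pool
  zfs set logbias=throughput tank/backups
  # default behavior: use the log device for low latency
  zfs set logbias=latency tank/backups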


There are postings by Matt Ahrens to this list (and elsewhere) which 
provide an accurate description of how the ZIL works.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZIL on a dedicated HDD slice (1-2 disk systems)

2012-01-08 Thread Casper . Dik


If the performance of the outer tracks were better than the performance of the
inner tracks due to limitations of magnetic density or rotation speed (not
being limited by the head speed or bus speed), then the sequential
performance of the drive should increase roughly linearly with track radius,
going toward the outer tracks, since each track holds data in proportion to
its circumference, c = 2 * pi * r.

Decrease because the outer tracks are the lower numbered tracks; they
have the same density but they are larger.

So, small variations of sequential performance are possible, jumping from
track to track, but based on what I've seen, the maximum performance
difference from the absolute slowest track to the absolute fastest track
(which may or may not have any relation to inner vs outer) ... maximum
variation on the order of a 10% performance difference.  Nowhere near the
roughly 2x spread that track geometry alone would predict.

I've noticed a change of 50% in speed or more between the lower and the
higher numbers (60 MB/s down to 30 MB/s).

In benchmark land, they short-stroke disks for better performance;
I believe the Pillar boxes do similar tricks under the covers (if you want 
more performance, it gives you the faster tracks)

Casper

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs defragmentation via resilvering?

2012-01-08 Thread Bob Friesenhahn

On Sat, 7 Jan 2012, Jim Klimov wrote:


 I understand that relatively high fragmentation is inherent
to ZFS due to its COW and possible intermixing of metadata
and data blocks (of which metadata path blocks are likely
to expire and get freed relatively quickly).


To put things in proper perspective, with 128K filesystem blocks, the 
worst case file fragmentation as a percentage is 0.39% 
(100*1/((128*1024)/512)).  On a Microsoft Windows system, the 
defragger might suggest that defragmentation is not warranted for this 
percentage level.



 Finally, what would the gurus say - does fragmentation
pose a heavy problem on nearly-filled-up pools made of
spinning HDDs (I believe so, at least judging from those
performance degradation problems writing to 80+%-filled
pools), and can fragmentation be effectively combatted
on ZFS at all (with or without BP rewrite)?


There are different types of fragmentation.  The fragmentation which 
causes a slowdown when writing to an almost full pool is fragmentation 
of the free-list/area (causing zfs to take longer to find free space 
to write to) as opposed to fragmentation of the files themselves. 
The files themselves will still not be fragmented into pieces smaller 
than the zfs blocksize.  However, there are seeks and there are 
*seeks* and some seeks take longer than others, so some forms of 
fragmentation are worse than others.  When the free space is 
fragmented into smaller blocks, there is necessarily more file 
fragmentation when the file is written.
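
One rough way to look at free-space fragmentation on a live pool is zdb's 
metaslab report (output format varies between releases; the pool name is 
just an example):

  # summary of free space per metaslab
  zdb -m tank
  # more verbose: dump the space maps and see how chopped-up the free space is
  zdb -mm tank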


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS and spread-spares (kinda like GPFS declustered RAID)?

2012-01-08 Thread Pasi Kärkkäinen
On Sun, Jan 08, 2012 at 06:59:57AM +0400, Jim Klimov wrote:
 2012-01-08 5:37, Richard Elling wrote:
 The big question is whether they are worth the effort. Spares solve a 
 serviceability
 problem and only impact availability in an indirect manner. For single-parity
 solutions, spares can make a big difference in MTTDL, but have almost no 
 impact
 on MTTDL for double-parity solutions (eg. raidz2).

 Well, regarding this part: in the presentation linked in my OP,
 the IBM presenter suggests that for a 6-disk raid10 (3 mirrors)
 with one spare drive, overall a 7-disk set, there are such
 options for critical hits to data redundancy when one of the
 drives dies:

 1) Traditional RAID - one full disk is a mirror of another
full disk; 100% of a disk's size is critical and has to
 be replicated onto a spare drive ASAP;

 2) Declustered RAID - all 7 disks are used for 2 unique data
blocks from original setup and one spare block (I am not
sure I described it well in words, his diagram shows it
better); if a single disk dies, only 1/7 worth of disk
size is critical (not redundant) and can be fixed faster.

For their typical 47-disk sets of RAID-7-like redundancy,
under 1% of data becomes critical when 3 disks die at once,
which is (deemed) unlikely as is.

 Apparently, in the GPFS layout, MTTDL is much higher than
 in raid10+spare with all other stats being similar.

 I am not sure I'm ready (or qualified) to sit down and present
 the math right now - I just heard some ideas that I considered
 worth sharing and discussing ;)


Thanks for the video link (http://www.youtube.com/watch?v=2g5rx4gP6yU). 
It's very interesting!

GPFS Native RAID seems to be more advanced than current ZFS,
and it even has rebalancing implemented (the infamous missing zfs bp-rewrite).

It'd definitely be interesting to have something like this implemented in ZFS.

-- Pasi

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs read-ahead and L2ARC

2012-01-08 Thread John Martin

On 01/08/12 11:30, Jim Klimov wrote:


However for smaller servers, such as home NASes which have
about one user overall, pre-reading and caching files even
for a single use might be an objective per se - just to let
the hard-disks spin down. Say, if I sit down to watch a
movie from my NAS, it is likely that for 90 or 120 minutes
there will be no other IO initiated by me. The movie file
can be pre-read in a few seconds, and then most of the
storage system can go to sleep.


Isn't this just a more extreme case of prediction?
In addition to the file system knowing there will only
be one client reading 90-120 minutes of (HD?) video
that will fit in the memory of a small(er) server,
now the hard drive power management code also knows there
won't be another access for 90-120 minutes so it is OK
to spin down the hard drive(s).
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs read-ahead and L2ARC

2012-01-08 Thread Jim Klimov

2012-01-09 0:29, John Martin wrote:

On 01/08/12 11:30, Jim Klimov wrote:


However for smaller servers, such as home NASes which have
about one user overall, pre-reading and caching files even
for a single use might be an objective per se - just to let
the hard-disks spin down. Say, if I sit down to watch a
movie from my NAS, it is likely that for 90 or 120 minutes
there will be no other IO initiated by me. The movie file
can be pre-read in a few seconds, and then most of the
storage system can go to sleep.


I don't find such home-NAS usage uncommon, because I am
my own example user - so I see this pattern often ;)




Isn't this just a more extreme case of prediction?


Probably is, and this is probably not a task for only ZFS,
but for logic outside it. There are some requirements
that ZFS should meet, in order for this to work, though.
Details follow...


In addition to the file system knowing there will only
be one client reading 90-120 minutes of (HD?) video
that will fit in the memory of a small(er) server,
now the hard drive power management code also knows there
won't be another access for 90-120 minutes so it is OK
to spin down the hard drive(s).


Well, in the original post I did suggest that the prediction
logic might go into scripting or some other user-level tool.
And it should, really, to keep the kernel clean and slim.

The predictor might be as simple as a DTrace file access
monitor, which would cat or tar files into /dev/null.
I.e. if it detected access to *.(avi|mkv|wmv), then it
should cat the file. If it detected *.(mp3|ogg|jpg) it
should tar the parent directory. Might be dumb and still
sufficiently efficient ;)
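
To illustrate, a very rough and untested sketch of such a watcher (the
probe choice, file extensions and backgrounded cat are only for
illustration):

  #!/bin/ksh
  # Watch file opens and pre-read recognized media files in the background,
  # so that the whole file lands in ARC/L2ARC right after the first access.
  dtrace -q -n 'syscall::open:entry, syscall::open64:entry
      { printf("%s\n", copyinstr(arg0)); }' |
  egrep -i '\.(avi|mkv|wmv|mp3|ogg)$' |
  while read f; do
      [ -f "$f" ] && cat "$f" > /dev/null &
  done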

However, for such use-cases this tool would need some
guarantees from ZFS. One would be that the read-ahead
data will find its way into caches and won't be evicted
for no reason (when there's no other RAM pressure).
This means that the tool should be able to read all the
data and metadata required by ZFS, so that no more disk
access is required if it's all in cache.
It might require a tunable in ZFS for home-NAS users
which would disable current no-caching for detected
streaming reads: we need the opposite of that behavior.

Another part is HDD power-management, which reportedly
works in Solaris, allowing disks to spin down when there
was no access for some time. Probably there is a syscall
to do this on-demand as well...

On a side note, for home-NASes or other not-heavily-used
storage servers, it would be wonderful to be able to cache
small writes into ZIL devices, if present, and not flush
them onto the main pool until some megabyte limit is
reached (i.e. ZIL is full), or a pool export/import event
occurs. This would allow main disk arrays to remain idle
for a long time while small sporadic writes which are
initiated by the OS (logs, atimes, web-browser cache
files, whatever), and have these writes persistently
stored in ZIL. Essentially, this would be like setting
TXG-commit times to practical infinity, and actually
commit based on bytecount limits. One possible difference
would be not-streaming larger writes to pool disks at once,
but also storing them in dedicated ZIL.
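
As far as I know nothing like that exists today; the closest crude
approximation is stretching the TXG commit interval, which only batches
writes in RAM (not persistently), for example:

  # live change; the value is in seconds, 30 is just an example
  echo zfs_txg_timeout/W0t30 | mdb -kw
  # or persistently via /etc/system:
  #   set zfs:zfs_txg_timeout = 30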


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Pool faulted in a bad way

2012-01-08 Thread Henrik Johansson
Hello,

I have been asked to take a look at a pool on an old OSOL 2009.06 host. It has 
been left unattended for a long time and was found in a FAULTED state. Two 
of the disks in the raidz2 pool seem to have failed; one has been replaced 
by a spare, the other one is UNAVAIL. The machine was restarted and the damaged 
disks were removed to make it possible to access the pool without it hanging on 
I/O errors.

Now, I have no indication that more than two disks have failed, and 
one of them seems to have been replaced by the spare. I would therefore have 
expected the pool to be in a working state even with two failed disks and some 
bad data on the remaining disks, since metadata has additional replication.

This is the current state of the pool, unable to be imported (at least with 
2009.06):

  pool: tank
 state: FAULTED
status: One or more devices could not be opened.  There are insufficient
replicas for the pool to continue functioning.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-3C
 scrub: none requested
config:

NAME   STATE READ WRITE CKSUM
tank   FAULTED  0 0 1  corrupted data
  raidz2   DEGRADED 0 0 6
c12t0d0ONLINE   0 0 0
c12t1d0ONLINE   0 0 0
spare  ONLINE   0 0 0
  c12t2d0  ONLINE   0 0 0
  c12t7d0  ONLINE   0 0 0
c12t3d0ONLINE   0 0 0
c12t4d0ONLINE   0 0 0
c12t5d0ONLINE   0 0 0
c12t6d0UNAVAIL  0 0 0  cannot open

If we look at the status, there is a mismatch between the status message, which 
states that insufficient replicas are available, and the status of the individual 
disks. More troublesome is the corrupted data status for the whole pool. I also 
get 'bad config type 16 for stats' from zdb.

What could possibly cause something like this? A faulty controller? Is there any 
way to recover (uberblock rollback with OI perhaps?) The server has ECC memory and 
another pool that is still working fine. The controller is an Areca 1280.
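
If I end up trying a newer live environment (OpenIndiana etc.), I guess the
cautious sequence would be something like the following (exact flags depend
on the zpool version in that environment):

  # dry-run: check whether a recovery-mode (uberblock rollback) import would succeed
  zpool import -nfF tank
  # if that looks sane, do it for real under an alternate root
  zpool import -fF -R /a tank
  # where supported, a read-only import is the safest first step:
  #   zpool import -o readonly=on -R /a tank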

And some output from zdb:

# zdb tank | more   
zdb: can't open tank: I/O error
version=14
name='tank'
state=0
txg=0
pool_guid=17315487329998392945
hostid=8783846
hostname='storage'
vdev_tree
type='root'
id=0
guid=17315487329998392945
bad config type 16 for stats
children[0]
type='raidz'
id=0
guid=14250359679717261360
nparity=2
metaslab_array=24
metaslab_shift=37
ashift=9
asize=14002698321920
is_log=0
root@storage:~# zdb tank  
version=14
name='tank'
state=0
txg=0
pool_guid=17315487329998392945
hostid=8783846
hostname='storage'
vdev_tree
type='root'
id=0
guid=17315487329998392945
bad config type 16 for stats
children[0]
type='raidz'
id=0
guid=14250359679717261360
nparity=2
metaslab_array=24
metaslab_shift=37
ashift=9
asize=14002698321920
is_log=0
bad config type 16 for stats
children[0]
type='disk'
id=0
guid=5644370057710608379
path='/dev/dsk/c12t0d0s0'
devid='id1,sd@x001b4d23002bb800/a'

phys_path='/pci@0,0/pci8086,25f8@4/pci8086,370@0/pci17d3,1260@e/disk@0,0:a'
whole_disk=1
DTL=154
bad config type 16 for stats
children[1]
type='disk'
id=1
guid=7134885674951774601
path='/dev/dsk/c12t1d0s0'
devid='id1,sd@x001b4d23002bb810/a'

phys_path='/pci@0,0/pci8086,25f8@4/pci8086,370@0/pci17d3,1260@e/disk@1,0:a'
whole_disk=1
DTL=153
bad config type 16 for stats
children[2]
type='spare'
id=2
guid=7434068041432431375
whole_disk=0
bad config type 16 for stats
children[0]
type='disk'
id=0
guid=5913529661608977121
path='/dev/dsk/c12t2d0s0'
devid='id1,sd@x001b4d23002bb820/a'


Re: [zfs-discuss] zfs read-ahead and L2ARC

2012-01-08 Thread Richard Elling
On Jan 7, 2012, at 8:59 AM, Jim Klimov wrote:

 I wonder if it is possible (currently or in the future as an RFE)
 to tell ZFS to automatically read-ahead some files and cache them
 in RAM and/or L2ARC?

See discussions on the ZFS intelligent prefetch algorithm. I think Ben 
Rockwood's
description is the best general description:
http://www.cuddletech.com/blog/pivot/entry.php?id=1040

And a more engineer-focused description is at:
http://www.solarisinternals.com/wiki/index.php/ZFS_Performance#Intelligent_prefetch
 -- richard


 
 One use-case would be for Home-NAS setups where multimedia (video
 files or catalogs of images/music) are viewed from a ZFS box. For
 example, if a user wants to watch a film, or listen to a playlist
 of MP3's, or push photos to a wall display (photo frame, etc.),
 the storage box should read-ahead all required data from HDDs
 and save it in ARC/L2ARC. Then the HDDs can spin down for hours
 while the pre-fetched gigabytes of data are used by consumers
 from the cache. End-users get peace, quiet and less electricity
 used while they enjoy their multimedia entertainment ;)
 
 Is it possible? If not, how hard would it be to implement?
 
 In terms of scripting, would it suffice to detect reads (i.e.
 with DTrace) and read the files to /dev/null to get them cached
 along with all required metadata (so that mechanical HDDs are
 not required for reads afterwards)?
 
 Thanks,
 //Jim Klimov
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

-- 

ZFS and performance consulting
http://www.RichardElling.com
illumos meetup, Jan 10, 2012, Menlo Park, CA
http://www.meetup.com/illumos-User-Group/events/41665962/ 

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs read-ahead and L2ARC

2012-01-08 Thread Jim Klimov

2012-01-09 4:14, Richard Elling wrote:

On Jan 7, 2012, at 8:59 AM, Jim Klimov wrote:


I wonder if it is possible (currently or in the future as an RFE)
to tell ZFS to automatically read-ahead some files and cache them
in RAM and/or L2ARC?


See discussions on the ZFS intelligent prefetch algorithm. I think Ben 
Rockwood's
description is the best general description:
http://www.cuddletech.com/blog/pivot/entry.php?id=1040

And a more engineer-focused description is at:
http://www.solarisinternals.com/wiki/index.php/ZFS_Performance#Intelligent_prefetch
  -- richard


Thanks for the pointers. While I've seen those articles
(in fact, one of the two non-spam comments in Ben's
blog was mine), rehashing the basics is always useful ;)

Still, how does VDEV prefetch play along with File-level
Prefetch? For example, if ZFS prefetched 64K from disk
at the SPA level, and those sectors luckily happen to
contain next blocks of a streaming-read file, would
the file-level prefetch take the data from RAM cache
or still request them from the disk?

In what cases would it make sense to increase the
zfs_vdev_cache_size? Does it apply to all disks
combined, or to each disk (or even slice/partition)
separately?

In fact, this reading got me thinking that I might have
a fundamental misunderstanding lately; hence a couple
of new yes-no questions arose:

Is it true or false that: ZFS might skip the cache and
go to disks for streaming reads? (The more I think
about it, the more senseless this sentence seems, and
I might have just mistaken it with ZIL writes of bulk
data).

Is it true or false that: ARC might evict cached blocks
based on age (without new reads or other processes
requiring the RAM space)?

And I guess the generic answer to my original question
regarding intelligent pre-fetching of whole files is
that this should be done by scripts outside ZFS itself,
and that the read-prefetch as well as ARC/L2ARC is all
in place already. So if no other IOs occur, the disks
may spin down... if only not for those nasty writes
that may sporadically occur and which I'd love to see
pushed out to dedicated ZILs ;)

Thanks,
//Jim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs read-ahead and L2ARC

2012-01-08 Thread Richard Elling
On Jan 8, 2012, at 5:10 PM, Jim Klimov wrote:
 2012-01-09 4:14, Richard Elling wrote:
 On Jan 7, 2012, at 8:59 AM, Jim Klimov wrote:
 
 I wonder if it is possible (currently or in the future as an RFE)
 to tell ZFS to automatically read-ahead some files and cache them
 in RAM and/or L2ARC?
 
 See discussions on the ZFS intelligent prefetch algorithm. I think Ben 
 Rockwood's
 description is the best general description:
 http://www.cuddletech.com/blog/pivot/entry.php?id=1040
 
 And a more engineer-focused description is at:
 http://www.solarisinternals.com/wiki/index.php/ZFS_Performance#Intelligent_prefetch
  -- richard
 
 Thanks for the pointers. While I've seen those articles
 (in fact, one of the two non-spam comments in Ben's
 blog was mine), rehashing the basics is always useful ;)
 
 Still, how does VDEV prefetch play along with File-level
 Prefetch?

Trick question… it doesn't. vdev prefetching is disabled in opensolaris b148, 
illumos,
and Solaris 11 releases. The benefits of having the vdev cache for large 
numbers of 
disks does not appear to justify the cost. See
http://wesunsolve.net/bugid/id/6684116
https://www.illumos.org/issues/175

 For example, if ZFS prefetched 64K from disk
 at the SPA level, and those sectors luckily happen to
 contain next blocks of a streaming-read file, would
 the file-level prefetch take the data from RAM cache
 or still request them from the disk?

As of b70, vdev_cache only contains metadata. See 
http://wesunsolve.net/bugid/id/6437054

 In what cases would it make sense to increase the
 zfs_vdev_cache_size? Does it apply to all disks
 combined, or to each disk (or even slice/partition)
 separately?

It applies to each leaf vdev.
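
If someone really wants to experiment with it anyway, the knobs live in
/etc/system; the sizes below are examples, not recommendations:

  * /etc/system fragment: re-enable and size the per-leaf-vdev cache
  * (16 MB of cache per leaf vdev, inflate reads smaller than 16 KB)
  set zfs:zfs_vdev_cache_size = 0x1000000
  set zfs:zfs_vdev_cache_max = 0x4000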

 
 In fact, this reading got me thinking that I might have
 a fundamental misunderstanding lately; hence a couple
 of new yes-no questions arose:
 
 Is it true or false that: ZFS might skip the cache and
 go to disks for streaming reads? (The more I think
 about it, the more senseless this sentence seems, and
 I might have just mistaken it with ZIL writes of bulk
 data).

Unless the primarycache parameter is set to none, reads 
will look in the ARC first.
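
For reference, the per-dataset cache policy knobs (the dataset name is only
an example):

  # what gets cached in the ARC (RAM) and L2ARC (cache devices) for this dataset
  zfs get primarycache,secondarycache tank/media
  # valid values are all | metadata | none
  zfs set primarycache=all tank/media
  zfs set secondarycache=all tank/media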

 
 Is it true or false that: ARC might evict cached blocks
 based on age (without new reads or other processes
 requiring the RAM space)?

False. Evictions occur when needed.

NB, I'm not sure of the status of the Solaris 11 ARC no-grow issue.
As that code is not open sourced, and we know that Oracle rewrote
some of the ARC code, all bets are off.

 And I guess the generic answer to my original question
 regarding intelligent pre-fetching of whole files is
 that this should be done by scripts outside ZFS itself,
 and that the read-prefetch as well as ARC/L2ARC is all
 in place already. So if no other IOs occur, the disks
 may spin down... if only not for those nasty writes
 that may sporadically occur and which I'd love to see
 pushed out to dedicated ZILs ;)

I've set up external prefetching for specific use cases.  Spin-down 
is another can of worms…
 -- richard

-- 

ZFS and performance consulting
http://www.RichardElling.com
illumos meetup, Jan 10, 2012, Menlo Park, CA
http://www.meetup.com/illumos-User-Group/events/41665962/ 

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS and spread-spares (kinda like GPFS declustered RAID)?

2012-01-08 Thread Richard Elling
Note: more analysis of the GPFS implementations is needed, but that will take 
more
time than I'll spend this evening :-) Quick hits below...

On Jan 7, 2012, at 7:15 PM, Tim Cook wrote:
 On Sat, Jan 7, 2012 at 7:37 PM, Richard Elling richard.ell...@gmail.com 
 wrote:
 Hi Jim,
 
 On Jan 6, 2012, at 3:33 PM, Jim Klimov wrote:
 
  Hello all,
 
  I have a new idea up for discussion.
 
  Several RAID systems have implemented spread spare drives
  in the sense that there is not an idling disk waiting to
  receive a burst of resilver data filling it up, but the
  capacity of the spare disk is spread among all drives in
  the array. As a result, the healthy array gets one more
  spindle and works a little faster, and rebuild times are
  often decreased since more spindles can participate in
  repairs at the same time.
 
 Xiotech has a distributed, relocatable model, but the FRU is the whole ISE.
 There have been other implementations of more distributed RAIDness in the
 past (RAID-1E, etc).
 
 The big question is whether they are worth the effort. Spares solve a 
 serviceability
 problem and only impact availability in an indirect manner. For single-parity
 solutions, spares can make a big difference in MTTDL, but have almost no 
 impact
 on MTTDL for double-parity solutions (eg. raidz2).
 
 
 I disagree.  Dedicated spares impact far more than availability.  During a 
 rebuild performance is, in general, abysmal.

In ZFS, there is a resilver throttle that is designed to ensure that 
resilvering activity
does not impact interactive performance. Do you have data that suggests 
otherwise?
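
For those who want to poke at it, the throttle is driven by a few kernel
tunables in the open-source bits (names from memory, defaults vary by
release):

  # extra ticks of delay injected per resilver I/O when the pool is busy
  echo "zfs_resilver_delay/D" | mdb -k
  # minimum milliseconds per txg spent on resilver work
  echo "zfs_resilver_min_time_ms/D" | mdb -k
  # how long (in ticks) the pool must be idle before the throttle backs off
  echo "zfs_scan_idle/D" | mdb -k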

  ZIL and L2ARC will obviously help (L2ARC more than ZIL),

ZIL makes zero impact on resilver.  I'll have to check to see if L2ARC is still 
used, but
due to the nature of the ARC design, read-once workloads like backup or 
resilver do 
not tend to negatively impact frequently used data.

 but at the end of the day, if we've got a 12 hour rebuild (fairly 
 conservative in the days of 2TB
 SATA drives), the performance degradation is going to be very real for 
 end-users.  

I'd like to see some data on this for modern ZFS implementations (post Summer 
2010)

 With distributed parity and spares, you should in theory be able to cut this 
 down an order of magnitude.  
 I feel as though you're brushing this off as not a big deal when it's an 
 EXTREMELY big deal (in my mind).  In my opinion you can't just approach this 
 from an MTTDL perspective, you also need to take into account user 
 experience.  Just because I haven't lost data, doesn't mean the system isn't 
 (essentially) unavailable (sorry for the double negative and repeated 
 parenthesis).  If I can't use the system due to performance being a fraction 
 of what it is during normal production, it might as well be an outage.

So we have a method to analyze the ability of a system to perform during 
degradation:
performability. This can be applied to computer systems and we've done some 
analysis
specifically on RAID arrays. See also
http://www.springerlink.com/content/267851748348k382/
http://blogs.oracle.com/relling/tags/performability

Hence my comment about doing some math :-)

  I don't think I've seen such idea proposed for ZFS, and
  I do wonder if it is at all possible with variable-width
  stripes? Although if the disk is sliced in 200 metaslabs
  or so, implementing a spread-spare is a no-brainer as well.
 
 Put some thoughts down on paper and work through the math. If it all works
 out, let's implement it!
  -- richard
 
 
 I realize it's not intentional Richard, but that response is more than a bit 
 condescending.  If he could just put it down on paper and code something up, 
 I strongly doubt he would be posting his thoughts here.  He would be posting 
 results.  The intention of his post, as far as I can tell, is to perhaps 
 inspire someone who CAN just write down the math and write up the code to do 
 so.  Or at least to have them review his thoughts and give him a dev's 
 perspective on how viable bringing something like this to ZFS is.  I fear 
responses like "the code is there, figure it out" make the *aris community 
 no better than the linux one.

When I talk about spares in tutorials, we discuss various tradeoffs and how to 
analyse
the systems. Interestingly, for the GPFS case, the mirrors example clearly 
shows the
benefit of declustered RAID. However, the triple-parity example (similar to 
raidz3) is
not so persuasive. If you have raidz3 + spares, then why not go ahead and do 
raidz4?
In the tutorial we work through a raidz2 + spare vs raidz3 case, and the raidz3 
case
is better in both performance and dependability without sacrificing space (an 
unusual
condition!)

It is not very difficult to add a raidz4 or indeed any number of additional 
parity, but 
there is a point of diminishing returns, usually when some other system 
component
becomes more critical than the RAID protection. So, raidz4 + spare is less 
dependable
than raidz5, and so on.
 -- 

Re: [zfs-discuss] ZFS and spread-spares (kinda like GPFS declustered RAID)?

2012-01-08 Thread Jim Klimov

2012-01-09 6:25, Richard Elling wrote:

Note: more analysis of the GPFS implementations is needed, but that will take 
more
time than I'll spend this evening :-) Quick hits below...


Good to hear you might look into it after all ;)


but at the end of the day, if we've got a 12 hour rebuild (fairly conservative 
in the days of 2TB
SATA drives), the performance degradation is going to be very real for 
end-users.


I'd like to see some data on this for modern ZFS implementations (post Summer 
2010)



Is scrubbing performance irrelevant in this discussion?
I think that in general, scrubbing is the read-half of
a larger rebuild process, at least for a single-vdev pool,
so rebuilds are about as long or worse. Am I wrong?

In my home-NAS case, a raidz2 pool of six 2TB drives, which
is filled to 76%, consistently takes 85 hours to scrub.
No SSDs involved, no L2ARC, no ZILs. According to iostat,
the HDDs are often utilized at 100% with a random IO load,
yielding from 500KBps to 2-3MBps at about 80-100 IOPS per
disk (I have a scrub going on at this moment).
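
(The numbers above come from watching it roughly like this:)

  # scrub progress
  zpool status tank | egrep 'scan:|scrub'
  # per-disk utilization and throughput, 10-second samples
  iostat -xn 10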

This system variably runs oi_148a (LiveUSB recovery) and
oi_151a when alive ;)

HTH,
//Jim Klimov
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss