Re: [zfs-discuss] Optimal raidz3 configuration
From: David Magda [mailto:dma...@ee.ryerson.ca] On Wed, October 13, 2010 21:26, Edward Ned Harvey wrote: I highly endorse mirrors for nearly all purposes. Are you a member of BAARF? http://www.miracleas.com/BAARF/BAARF2.html Never heard of it. I don't quite get it ... They want people to stop talking about pros/cons of various types of raid? That's definitely not me. I think there are lots of pros/cons, and many of them have nuances, and vary by implementation... I think it's important to keep talking about it, and all us experts in the field can keep current on all this ... Take, for example, the number of people discussing things in this mailing list, who say they still use hardware raid. That alone demonstrates misinformation (in most cases) and warrants more discussion. ;-) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Finding corrupted files
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Toby Thain I don't want to heat up the discussion about ZFS managed discs vs. HW raids, but if RAID5/6 would be that bad, no one would use it anymore. It is. And there's no reason not to point it out. The world has

Well, neither one of the above statements is really fair. The truth is: raid5/6 are generally not that bad. Data integrity failures are not terribly common (maybe one bit per year out of 20 large disks, or something like that). And in order to reach the conclusion that nobody would use it, the people using it would have to first *notice* the failure. Which they don't. That's kind of the point. Since I started using ZFS in production, about a year ago, on three servers totaling approx 1.5TB used, I have had precisely one checksum error, which ZFS corrected. I have every reason to believe that if that had been on a raid5/6, the error would have gone undetected and nobody would have noticed.
Re: [zfs-discuss] Performance issues with iSCSI under Linux
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Ian D ok... we're making progress. After swapping the LSI HBA for a Dell H800 the issue disappeared. Now, I'd rather not use those controllers because they don't have a JBOD mode. We have no choice but to make individual RAID0 volumes for each disk, which means we need to reboot the server every time we replace a failed drive. That's not good...

I believe those are rebranded LSI controllers. I know the PERC controllers are. I use MegaCLI on Perc systems for this purpose. You should be able to find a utility which allows you to do this sort of thing while the OS is running. If you happen to find that MegaCLI is the right tool for your hardware, let me know, and I'll paste a few commands here, which will simplify your life. When I first started using it, I found it terribly cumbersome. But now I've gotten used to it, and MegaCLI commands just roll off the tongue.

To summarize the issue: when we copy files from/to the JBODs connected to that HBA using NFS/iSCSI, we get slow transfer rates (20MB/s) and a 1-2 second pause between each file. When we do the same experiment locally, using the external drives as a local volume (no NFS/iSCSI involved), it goes upward of 350MB/sec with no delay between files. Baffling.
Re: [zfs-discuss] Performance issues with iSCSI under Linux [SEC=UNCLASSIFIED]
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Wilkinson, Alex can you paste them anyway ?

Note: If you have more than one adapter, I believe you can specify -aALL in the commands below, instead of -a0. I have 2 disks (slots 4 and 5) that are removable and rotate offsite for backups.

To remove disks safely:
  zpool export removable-pool
  EnclosureID=`MegaCli -PDList -a0 | grep 'Enclosure Device ID' | uniq | sed 's/.* //'`
  for DriveNum in 4 5 ; do MegaCli -PDOffline PhysDrv[${EnclosureID}:${DriveNum}] -a0 ; done
The disks blink alternating orange/green. Safe to remove.

To insert disks safely: insert the disks, then:
  MegaCli -CfgForeign -Clear -a0
  MegaCli -CfgEachDskRaid0 -a0
  devfsadm -Cv
  zpool import -a

To clear foreign config off drives:
  MegaCli -CfgForeign -Clear -a0

To create a one-disk raid0 for each disk that's not currently part of another group:
  MegaCli -CfgEachDskRaid0 -a0

To configure all logical drives WriteThrough:
  MegaCli -LDSetProp WT -Lall -aALL

To configure all logical drives WriteBack:
  MegaCli -LDSetProp WB -Lall -aALL
Re: [zfs-discuss] adding new disks and setting up a raidz2
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Derek G Nokes r...@dnokes.homeip.net:~# zpool create marketData raidz2 c0t5000C5001A6B9C5Ed0 c0t5000C5001A81E100d0 c0t5000C500268C0576d0 c0t5000C500268C5414d0 c0t5000C500268CFA6Bd0 c0t5000C500268D0821d0 cannot label 'c0t5000C500268CFA6Bd0': try using fdisk(1M) and then provide a specific slice Any idea what this means?

I think it means there is something pre-existing on that drive. Maybe ZFS related, maybe not. You should probably double-check everything, to make sure there's no valuable data on that device... And then either zero the drive the long way via dd, or use your raid controller to initialize the device, which will virtually zero it the short way. In some cases you have no choice, and you need to do it the long way:

  time dd if=/dev/zero of=/dev/rdsk/c0t5000C500268CFA6Bd0 bs=1024k
[zfs-discuss] Running on Dell hardware?
I have a Dell R710 which has been flaky for some time. It crashes about once per week. I have literally replaced every piece of hardware in it, and reinstalled Sol 10u9 fresh and clean. I am wondering if other people out there are using Dell hardware, with what degree of success, and in what configuration? The failure seems to be related to the perc 6i. For some period around the time of crash, the system still responds to ping, and anything currently in memory or running from remote storage continues to function fine. But new processes that require the local storage ... Such as inbound ssh etc, or even physical login at the console ... those are all hosed. And eventually the system stops responding to ping. As soon as the problem starts, the only recourse is power cycle. I can't seem to reproduce the problem reliably, but it does happen regularly. Yesterday it happened several times in one day, but sometimes it will go 2 weeks without a problem. Again, just wondering what other people are using, and experiencing. To see if any more clues can be found to identify the cause.
Re: [zfs-discuss] Running on Dell hardware?
From: Markus Kovero [mailto:markus.kov...@nebula.fi] Sent: Wednesday, October 13, 2010 10:43 AM Hi, we've been running opensolaris on Dell R710 with mixed results, some work better than others and we've been struggling with the same issue as you are with the latest servers. I suspect some kind of power-saving issue gone wrong; system disks go to sleep and never wake up, or something similar. Personally, I cannot recommend using them with solaris; support is not even close to what it should be.

How consistent are your problems? If you change something and things get better or worse, will you be able to notice? Right now, I think I have improved matters by changing the Perc to WriteThrough instead of WriteBack. Yesterday the system crashed several times before I changed that, and afterward, I can't get it to crash at all. But as I said before ... Sometimes the system goes 2 weeks without a problem. Do you have all your disks configured as individual disks? Do you have any SSD? WriteBack or WriteThrough?
Re: [zfs-discuss] Running on Dell hardware?
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Steve Radich, BitShop, Inc. Do you have dedup on? Remove large files, zfs destroy a snapshot or a zvol, and you'll see hangs like you are describing.

Thank you, but no. I'm running Sol 10u9, which does not have dedup yet, because dedup is not yet considered stable, for reasons like you mentioned. I will admit, when dedup is available in Sol 11, I do want it. ;-)
Re: [zfs-discuss] Running on Dell hardware?
From: edmud...@mail.bounceswoosh.org [mailto:edmud...@mail.bounceswoosh.org] On Behalf Of Eric D. Mudama Out of curiosity, did you run into this: http://blogs.everycity.co.uk/alasdair/2010/06/broadcom-nics-dropping-out-on-solaris-10/

I personally haven't had the broadcom problem. When my system crashes, surprisingly, it continues responding to ping, answers on port 22 (but you can't ssh in), and if there are any cron jobs that run from NFS, they're able to continue. For some period of time, anyway; eventually the whole thing crashes.
Re: [zfs-discuss] Running on Dell hardware?
Dell R710 ... Solaris 10u9 ... With stability problems ... Notice that I have several CPUs whose current_cstate is higher than the supported_max_cstate. Logically, that sounds like a bad thing. But I can't seem to find documentation that defines the meaning of supported_max_cstates, to verify that this is a bad thing. I'm looking for other people out there ... with and without problems ... to try this too, and see if a current_cstate higher than the supported_max_cstate might be a simple indicator of system instability.

kstat | grep current_cstate ; kstat | grep supported_max_cstate
  current_cstate (16 CPUs):        1 3 3 3 1 3 3 3 0 3 3 3 1 3 3 3
  supported_max_cstates (16 CPUs): 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
Re: [zfs-discuss] Running on Dell hardware?
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Edward Ned Harvey Dell R710 ... Solaris 10u9 ... With stability problems ... Notice that I have several CPUs whose current_cstate is higher than the supported_max_cstate.

One more data point: Sun x4275 ... Solaris 10u6 fully updated (equivalent of 10u9??) ... No problems ... There are no current_cstates higher than supported_max_cstates.

kstat | grep current_cstate ; kstat | grep supported_max_cstate
  current_cstate (16 CPUs):        2 1 1 1 1 1 0 1 1 1 1 1 1 1 2 2
  supported_max_cstates (16 CPUs): 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
Re: [zfs-discuss] Running on Dell hardware?
From: Henrik Johansen [mailto:hen...@scannet.dk] The 10g models are stable - especially the R905's are real workhorses.

You would generally consider all your machines stable now? Can you easily pdsh to all those machines?

kstat | grep current_cstate ; kstat | grep supported_max_cstates

I'd really love to see whether a current_cstate higher than supported_max_cstates is an accurate indicator of system instability. So far, the two data points I have support this theory.
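For anyone collecting data points in bulk, here's a small sketch that compares the two kstat lists automatically. It assumes kstat reports the counters in matching CPU order, as in the listings earlier in the thread; the sample input at the bottom is made up for illustration.

```shell
# Sketch: flag CPUs whose current_cstate exceeds supported_max_cstates.
# Assumes kstat emits the counters in matching CPU order, as in the
# listings above. For live use:  kstat | check_cstates
check_cstates() {
  awk '/current_cstate/        { cur[nc++] = $2 }
       /supported_max_cstates/ { max[nm++] = $2 }
       END {
         bad = 0
         for (i = 0; i < nc && i < nm; i++) if (cur[i] > max[i]) bad++
         print bad " of " nc " CPUs above supported max cstate"
       }'
}

# Demo on a made-up two-CPU sample:
printf 'current_cstate\t3\ncurrent_cstate\t1\nsupported_max_cstates\t2\nsupported_max_cstates\t2\n' | check_cstates
# -> 1 of 2 CPUs above supported max cstate
```

Pipe the real `kstat` output through it on each machine (e.g. via pdsh) and compare against whether the machine has been stable.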
Re: [zfs-discuss] zfs diff cannot stat shares
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of dirk schelfhout Wanted to test the zfs diff command and ran into this. What's zfs diff? I know it's been requested, but AFAIK, not implemented yet. Is that new feature being developed now or something?
Re: [zfs-discuss] Optimal raidz3 configuration
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Peter Taps If I have 20 disks to build a raidz3 pool, do I create one big raidz vdev or do I create multiple raidz3 vdevs? Is there any advantage of having multiple raidz3 vdevs in a single pool?

Whatever you do, *don't* configure one huge raidz3. Consider either 3 vdevs of 7-disk raidz1, or 3 vdevs of 7-disk raidz2, or something along those lines. Perhaps 3 vdevs of 6-disk raidz1, plus two hot spares. raidzN takes a really long time to resilver (the code is written inefficiently; it's a known problem). If you had a huge raidz3, it would literally never finish, because it couldn't resilver as fast as new data appears. A week later you'd destroy and rebuild your whole pool.

If you can afford mirrors, your risk is much lower. Because although it's physically possible for 2 disks to fail simultaneously and ruin the pool, the probability of that happening is smaller than the probability of 3 simultaneous disk failures on the raidz3, due to the smaller resilver window. I highly endorse mirrors for nearly all purposes.
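To make the capacity side of the tradeoff concrete, here's a back-of-envelope comparison for those 20 disks, counting only data disks (parity and hot spares excluded). The layouts are the illustrative ones from the discussion; real usable space also depends on disk sizes and overhead.

```shell
# Back-of-envelope usable capacity (in data disks) for a few 20-disk layouts.
awk 'BEGIN {
  printf "1 x 20-disk raidz3:            %d data disks\n", 20 - 3
  printf "3 x 6-disk raidz1 + 2 spares:  %d data disks\n", 3 * (6 - 1)
  printf "10 x 2-way mirrors:            %d data disks\n", 20 / 2
}'
```

So mirrors cost the most capacity; the argument above is that they buy that back in resilver speed and lower risk during the resilver window.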
Re: [zfs-discuss] Finding corrupted files
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Stephan Budach You are implying that the issues resulted from the H/W raid(s) and I don't think that this is appropriate.

Please quote originals when you reply. If you don't, it's easy to follow the thread on the web forum, but not in email. So if you don't quote, you'll be losing a lot of the people following the thread.

I think it's entirely appropriate to imply that your problem this time stems from hardware. I'll say it outright: You have a hardware problem. Because if there is a repeatable checksum failure (bad disk), then if anything can find it, scrub can. And scrub is the best way to find it. If you have a nonrepeatable checksum failure (such as you have), then there is only one possibility: You are experiencing a hardware problem.

One possibility is that there's a failing disk in your hardware raid set, and your hardware raid controller is unable to detect it, because hardware raid doesn't do checksumming. Sometimes ZFS reads the device and gets an error. Sometimes the hardware raid controller reads the other side of the mirror, and there is no error. This is not the only possibility. There could be some other piece of hardware yielding your intermittent checksum errors. But there's one absolute conclusion: Your intermittent checksum errors are caused by hardware.

If scrub didn't find an error, then there was no error at the time of the scrub. If scrub didn't find an error, and then something else *did* find an error, it means one of two things: (a) the error only occurred after the scrub, or (b) the hardware raid controller, or some other piece of hardware, didn't produce corrupted data during the scrub, but will produce corrupted data at some other time.
Re: [zfs-discuss] Finding corrupted files
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Stephan Budach c3t211378AC0253d0 ONLINE 0 0 0

How many disks are there inside of c3t211378AC0253d0? How are they configured? Hardware raid 5? A mirror of two hardware raid 5's? The point is: This device, as seen by ZFS, is not a pure storage device. It is a high-level device representing some LUN or something, which is configured and controlled by hardware raid.

If there's zero redundancy in that device, then scrub would probably find the checksum errors consistently and repeatably. If there's some redundancy in that device, then all bets are off. Sometimes scrub might read the good half of the data, and other times, the bad half. But then again, the error might not be in the physical disks themselves. The error might be somewhere in the raid controller(s) or the interconnect. Or even some weird unsupported driver or something.
Re: [zfs-discuss] Finding corrupted files
From: Stephan Budach [mailto:stephan.bud...@jvm.de] I now also got what you meant by good half but I don't dare to say whether or not this is also the case in a raid6 setup.

The same concept applies to raid5 or raid6. When you read the device, you never know if you're actually reading the data or the parity; in fact, they're mixed together in order to fully utilize all the hardware available. (Assuming you have some decently smart hardware.)

But all of that is mostly irrelevant. One fact remains: You have checksum errors. There is only one cause for checksum errors: hardware failure. It may be the physical disks themselves, or the raid card, or RAM, or CPU, or any of the interconnect in between. I suppose it could be a driver problem, but that's less likely.
Re: [zfs-discuss] Multiple SLOG devices per pool
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Ray Van Dolson I have a pool with a single SLOG device rated at Y iops. If I add a second (non-mirrored) SLOG device also rated at Y iops will my zpool now theoretically be able to handle 2Y iops? Or close to that?

Yes. But we're specifically talking about sync mode writes. Not async, and not read. And we're not measuring an actual number of IOPS, because of aggregation and so forth. But I don't think that's what you were asking. I don't think you are trying to quantify the number of IOPS. I think you're trying to confirm the qualitative characteristic, "If I have N slogs, I will write N times faster than a single slog." And that's a simple answer: Yes.
Re: [zfs-discuss] Moving camp, lock stock and barrel
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Harry Putnam beep beep beep beep beep beep I'm kind of having a brain freeze about this: So what are the standard tests or cmds to run to collect enough data to try to make a determination of what the problem is? Definitely hardware. To diagnose hardware, no standard test. Start replacing hardware. You'll know you fixed it when the problem stops.
Re: [zfs-discuss] Finding corrupted files
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of David Dyer-Bennet I must say that this concept of scrub running w/o error when corrupted files, detectable to zfs send, apparently exist, is very disturbing.

As previously mentioned, the OP is using a hardware raid system. It is impossible for ZFS to read both sides of the mirror, which means it's pure chance. The hardware raid may fetch data from a bad disk one time, and fetch good data from another disk the next time. Or vice versa. You should always configure JBOD and allow ZFS to manage the raid. Don't do it in hardware; the OP of this thread is soundly demonstrating why.
[zfs-discuss] ZFS equivalent of inotify
Is there a ZFS equivalent (or alternative) of inotify? You have some thing, which wants to be notified whenever a specific file or directory changes. For example, a live sync application of some kind...
Re: [zfs-discuss] [RFC] Backup solution
From: Peter Jeremy [mailto:peter.jer...@alcatel-lucent.com] Sent: Thursday, October 07, 2010 10:02 PM On 2010-Oct-08 09:07:34 +0800, Edward Ned Harvey sh...@nedharvey.com wrote: If you're going raidz3, with 7 disks, then you might as well just make mirrors instead, and eliminate the slow resilver. There is a difference in reliability: raidzN means _any_ N disks can fail, whereas mirror means one disk in each mirror pair can fail. With a mirror, Murphy's Law says that the second disk to fail will be the pair of the first disk :-).

Maybe. But in reality, you're just guessing the probability of a single failure, the probability of multiple failures, and the probability of multiple failures within the critical time window and critical redundancy set. The probability of a 2nd failure within the critical time window is smaller whenever the critical time window is decreased, and the probability of that failure being within the critical redundancy set is smaller whenever your critical redundancy set is smaller. So if raidz2 takes twice as long to resilver as a mirror, and has a larger critical redundancy set, then you haven't gained any probable resiliency over a mirror. Although it's true that with mirrors it's possible for 2 disks to fail and result in loss of the pool, I think the probability of that happening is smaller than the probability of a 3-disk failure in the raidz2, due to the smaller resilver window. How much longer does a 7-disk raidz2 take to resilver, compared to a mirror? According to my calculations, it's in the vicinity of 10x longer.
Re: [zfs-discuss] ZFS equivalent of inotify
From: cas...@holland.sun.com [mailto:cas...@holland.sun.com] On Behalf Of casper@sun.com Is there a ZFS equivalent (or alternative) of inotify? Have you looked at port_associate and ilk? port_associate looks promising. But google is less than useful on ilk. Got any pointers, or additional search terms to narrow the context? Thanks...
Re: [zfs-discuss] [RFC] Backup solution
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Roy Sigurd Karlsbakk In addition to this comes another aspect. What if one drive fails and you find bad data on another in the same VDEV while resilvering? This is quite common these days, and for mirrors, that will mean data loss unless you mirror 3-way or more, which will be rather costly.

Like the resilver, scrub goes faster with mirrors. Scrub regularly.
Re: [zfs-discuss] Performance issues with iSCSI under Linux
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Ian D the help to community can provide. We're running the latest version of Nexenta on a pretty powerful machine (4x Xeon 7550, 256GB RAM, 12x 100GB Samsung SSDs for the cache, 50GB Samsung SSD for the ZIL, 10GbE on a dedicated switch, 11x pairs of 15K HDDs for the pool). We're

If you have a single SSD for dedicated log, that will surely be a bottleneck for you. All sync writes (which are all writes in the case of iscsi) will hit the log device before the main pool. But you should still be able to read fast... Also, with so much cache ram, it wouldn't surprise me a lot to see really low disk usage just because it's already cached. But that doesn't explain the ridiculously slow performance...

I'll suggest trying something completely different, just to verify there isn't something horribly wrong with your hardware (network):

  dd if=/dev/zero bs=1024k | pv | ssh othermachine 'cat > /dev/null'

In linux, run ifconfig ... You should see errors:0 Make sure each machine has an entry for the other in the hosts file. I haven't seen that cause a problem for iscsi, but certainly for ssh.
Re: [zfs-discuss] Finding corrupted files
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Stephan Budach I conducted a couple of tests, where I configured my raids as jbods and mapped each drive out as a separate LUN, and I couldn't notice a difference in performance in any way.

Not sure if my original points were communicated clearly. Giving JBODs to ZFS is not for the sake of performance. The reason for JBOD is reliability. Hardware raid cannot detect or correct checksum errors. ZFS can. So it's better to skip the hardware raid and use JBOD, to give ZFS access to each separate side of the redundant data.
Re: [zfs-discuss] Finding corrupted files
From: edmud...@mail.bounceswoosh.org [mailto:edmud...@mail.bounceswoosh.org] On Behalf Of Eric D. Mudama On Wed, Oct 6 at 22:04, Edward Ned Harvey wrote: * Because ZFS automatically buffers writes in ram in order to aggregate as previously mentioned, the hardware WB cache is not beneficial. There is one exception. If you are doing sync writes to spindle disks, and you don't have a dedicated log device, then the WB cache will benefit you, approx half as much as you would benefit by adding dedicated log device. The sync write sort-of by-passes the ram buffer, and that's the reason why the WB is able to do some good in the case of sync writes. All of your comments made sense except for this one. (etc)

Your point about long-term fragmentation and significant drive emptiness is well received. I never let a pool get over 90% full, for several reasons, including this one. My target is 70%, which seems to be sufficiently empty. Also, as you indicated, blocks of 128K are not sufficiently large for reordering to benefit. There's another thread here where I calculated that you need blocks approx 40MB in size in order to reduce random seek time below 1% of total operation time. So all that I said will only be relevant or accurate if, within 30sec (or 5 sec in the future), there exists at least 40M of aggregatable sequential writes.

It's really easy to measure and quantify what I was saying. Just create a pool, and benchmark it in each configuration. Results that I measured (stripe of 2 mirrors) were:
  721 IOPS without WB or slog
  2114 IOPS with WB
  2722 IOPS with WB and slog
  2927 IOPS with slog, and no WB

There's a whole spreadsheet full of results that I can't publish, but the trend of WB versus slog was clear and consistent. I will admit the above were performed on relatively new, relatively empty pools. It would be interesting to see if any of that changes, if the test is run on a system that has been in production for a long time, with real user data in it.
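Normalizing those four measurements against the no-WB/no-slog baseline makes the trend easier to see (same numbers as above, just divided out):

```shell
# Speedup of each configuration relative to the 721-IOPS baseline above.
awk 'BEGIN {
  base = 721
  printf "WB only:   %.2fx\n", 2114 / base
  printf "WB + slog: %.2fx\n", 2722 / base
  printf "slog only: %.2fx\n", 2927 / base
}'
```

The slog alone beats WB alone by a wide margin, which is the trend described.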
Re: [zfs-discuss] Swapping disks in pool to facilitate pool growth
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Kevin Walker We are running a Solaris 10 production server being used for backup services within our DC. We have 8 500GB drives in a zpool and we wish to swap them out 1 by 1 for 1TB drives. I would like to know if it is viable to add larger disks to a zfs pool to grow the pool size and then remove the smaller disks? I would assume this would degrade the pool and require it to resilver?

Because it's a raidz, yes, it will be degraded each time you remove one disk. You will not be using "attach" and "detach." You will be using "zpool replace." Because it's a raidz, each resilver time will be unnaturally long. Raidz resilver code is inefficient. Just be patient and let it finish each time before you replace the next disk. Performance during resilver will be exceptionally poor. Exceptionally. Because of the inefficient raidz resilver code, do everything within your power to reduce IO on the system during the resilver. Of particular importance: Don't create snapshots while the system is resilvering. This will exponentially increase the resilver time. (I'm exaggerating by saying exponentially, so don't take it literally. But in reality, it *is* significant.) Because you're going to be degrading your redundancy, you *really* want to ensure all the disks are good before you do any degrading. This means: don't begin your replace until after you've completed a scrub.
Re: [zfs-discuss] Finding corrupted files
From: Cindy Swearingen [mailto:cindy.swearin...@oracle.com] I would not discount the performance issue... Depending on your workload, you might find that performance increases with ZFS on your hardware RAID in JBOD mode.

Depends on the raid card you're comparing to. I've certainly seen some raid cards that were too dumb to read from 2 disks in a mirror simultaneously for the sake of read performance enhancement. And many other similar situations. But I would not say that's generally true anymore. In the last several years, all the hardware raid cards that I've bothered to test were able to utilize all the hardware available. Just like ZFS. There are performance differences... like ... the hardware raid might be able to read 15% faster in raid5, while ZFS is able to write 15% faster in raidz, and so forth. Differences that roughly balance each other out. For example, here's one data point I can share (2 mirrors striped, results normalized):

                     ZFS    HW
  8 initial writers  1.43   2.00
  8 rewriters        2.99   2.54
  8 readers          5.05   2.96
  8 re-readers       4.19   3.02
  8 reverse readers  3.59   2.80
  8 stride readers   3.93   2.90
  8 random readers   2.57   1.99
  8 random mix       2.40   1.70
  8 random writers   1.69   1.73
  average            3.09   2.40

There were some categories where ZFS was faster. Some where HW was faster. On average, ZFS was faster, but they were all in the same ballpark, and the results were highly dependent on specific details and tunables. AKA, not a place you should explore, unless you have a highly specialized use case that you wish to optimize.
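The quoted averages can be reproduced from the nine normalized per-test results above:

```shell
# Recompute the overall averages from the nine per-test results above.
awk 'BEGIN {
  split("1.43 2.99 5.05 4.19 3.59 3.93 2.57 2.40 1.69", zfs, " ")
  split("2.00 2.54 2.96 3.02 2.80 2.90 1.99 1.70 1.73", hw,  " ")
  for (i = 1; i <= 9; i++) { sz += zfs[i]; sh += hw[i] }
  printf "average: ZFS %.2f  HW %.2f\n", sz / 9, sh / 9
}'
# -> average: ZFS 3.09  HW 2.40
```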
Re: [zfs-discuss] [RFC] Backup solution
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Ian Collins I would seriously consider raidz3, given I typically see 80-100 hour resilver times for 500G drives in raidz2 vdevs. If you haven't already, If you're going raidz3 with 7 disks, then you might as well just make mirrors instead, and eliminate the slow resilver. Mirrors resilver enormously faster than raidzN. At least for now, until maybe one day the raidz resilver code is rewritten.
Re: [zfs-discuss] Increase size of 2-way mirror
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Tony MacDoodle Is it possible to add 2 disks to increase the size of the pool below?

        NAME        STATE     READ WRITE CKSUM
        testpool    ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            c1t2d0  ONLINE       0     0     0
            c1t3d0  ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            c1t4d0  ONLINE       0     0     0
            c1t5d0  ONLINE       0     0     0

It's important that you know the difference between the add and attach methods for increasing this size... If you add another mirror, then you'll have mirror-0, mirror-1, and mirror-2. You cannot remove any of the existing devices. If you attach a larger disk to mirror-0, and possibly fiddle with the autoexpand property plus a little bit of additional futzing (pretty basic: resilver, then detach the old devices), then you can effectively replace the existing devices with larger devices. No need to consume extra disk bays. It's all a matter of which is the more desirable outcome for you.
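A sketch of the attach route for the pool above, assuming hypothetical new larger disks c1t6d0 and c1t7d0:

```shell
zpool set autoexpand=on testpool     # if the property exists on your build

# Grow mirror-0 in place, one side at a time:
zpool attach testpool c1t2d0 c1t6d0  # attach a large disk alongside c1t2d0
zpool status testpool                # wait for the resilver to complete
zpool detach testpool c1t2d0         # then drop the old small disk
# ...repeat attach/resilver/detach for c1t3d0, and for mirror-1 if desired...

# By contrast, the add route consumes two more bays permanently:
zpool add testpool mirror c1t6d0 c1t7d0
```

Note the attach route never runs without redundancy: the mirror is three-wide during each resilver.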
Re: [zfs-discuss] Finding corrupted files
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Stephan Budach Ian, yes, although these vdevs are FC raids themselves, so the risk is… uhm… calculated. Whenever possible, you should always JBOD the storage and let ZFS manage the raid, for several reasons (see below). Also, as counter-intuitive as this sounds (see below), you should disable the hardware write-back cache (even with BBU), because it hurts performance in any of these situations: (a) disable WB if you have access to SSD or another nonvolatile dedicated log device; (b) disable WB if you know all of your writes to be async mode and not sync mode; (c) disable WB if you've opted to disable the ZIL. * Hardware raid blindly assumes the redundant data written to disk is written correctly. So later, if you experience a checksum error (such as you have), it's impossible for ZFS to correct it. The hardware raid doesn't know a checksum error has occurred, and there is no way for the OS to read the other side of the mirror to attempt correcting the checksum via redundant data. * ZFS has knowledge of both the filesystem and the block level devices, while hardware raid has knowledge only of block level devices. Which means ZFS is able to optimize performance in ways that hardware raid cannot. For example, whenever there are many small writes taking place concurrently, ZFS is able to remap the physical disk blocks of those writes, to aggregate them into a single sequential write. Depending on your metric, this yields 1-2 orders of magnitude higher IOPS. * Because ZFS automatically buffers writes in ram in order to aggregate as previously mentioned, the hardware WB cache is not beneficial. There is one exception: if you are doing sync writes to spindle disks, and you don't have a dedicated log device, then the WB cache will benefit you, approx half as much as you would benefit by adding a dedicated log device. 
The sync write sort-of bypasses the ram buffer, and that's the reason the WB is able to do some good in the case of sync writes. Ironically, if you have WB enabled and you have a SSD log device, then the WB hurts you. You get the best performance with a SSD log and no WB. Because the WB lies to the OS, saying some tiny chunk of data has been written... the OS will happily write another tiny chunk, and another, and another. The WB is only buffering a lot of tiny random writes, and in aggregate, it will only go as fast as the random writes. It undermines ZFS's ability to aggregate small writes into sequential writes.
Re: [zfs-discuss] Finding corrupted files
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Stephan Budach Now, scrub would reveal corrupted blocks on the devices, but is there a way to identify damaged files as well? I saw a lot of people offering the same knee-jerk reaction that I had: scrub. And that is the only correct answer, to make a best effort at salvaging data. But I think there is a valid question here which was neglected. *Does* scrub produce a list of the names of all the corrupted files? And if so, how does it do that? If scrub is operating at a block level (and I think it is), then how can checksum failures be mapped to file names? For example, this is a long-requested feature of zfs send which is fundamentally difficult or impossible to implement. Zfs send operates at a block level, and there is a desire to produce a list of all the incrementally changed files in a zfs incremental send, but no capability of doing that. It seems, if scrub is able to list the names of files that correspond to corrupted blocks, then zfs send should be able to list the names of files that correspond to changed blocks, right? I am reaching the opposite conclusion of what's already been said. I think you should scrub, but don't expect file names as a result. I think if you want file names, then tar to /dev/null will be your best friend. I didn't answer anything at first, because I was hoping somebody would have that answer. I only know that I don't know, and the above is my best guess.
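A sketch of the tar-to-/dev/null approach, assuming a hypothetical filesystem mounted at /tank/fs: reading every byte of every file forces ZFS to verify every checksum on the read path, so any file whose blocks are corrupt beyond repair surfaces as a read error with its name attached:

```shell
# Read everything, discard the archive; tar prints the name of
# any file it fails to read:
tar cf /dev/null /tank/fs

# Afterward, check the pool; -v prints details of the errors that
# were found during the reads:
zpool status -v tank
```

This is brute force (it reads the whole dataset), but unlike scrub it operates through the filesystem layer, which is exactly where names live.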
Re: [zfs-discuss] When is it okay to turn off the verify option.
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Peter Taps As I understand, the hash generated by sha256 is almost guaranteed not to collide. I am thinking it is okay to turn off the verify property on the zpool. However, if there is indeed a collision, we lose data. Scrub cannot recover such lost data. I am wondering, in real life, when is it okay to turn off the verify option? I guess for storing business critical data (HR, finance, etc.), you cannot afford to turn this option off. Right on all points. It's a calculated risk. If you have a hash collision, you will lose data undetected, and backups won't save you unless *you* are the backup. That is, if the good data, before it got corrupted by your system, happens to be saved somewhere else before it reached your system.
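To put a number on "almost guaranteed not to collide": a back-of-envelope birthday bound, assuming a hypothetical pool holding 2^32 unique blocks (roughly half a petabyte at 128K blocks):

```shell
# P(any two distinct blocks share a sha256 hash) ~ n^2 / 2^257.
# Work in log2 to avoid floating-point underflow:
#   log2(p) = 2*log2(n) - 257
awk 'BEGIN {
  n_log2 = 32                    # log2 of the number of unique blocks
  p_log2 = 2 * n_log2 - 257      # log2 of the collision probability
  printf "p ~ 2^%d\n", p_log2
}'
# prints: p ~ 2^-193
```

2^-193 is so far below the error rates of the hardware itself that, mathematically, the risk is dominated by everything *except* the hash. The decision to verify anyway is about policy, not probability.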
Re: [zfs-discuss] When is it okay to turn off the verify option.
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Scott Meilicke Why do you want to turn verify off? If performance is the reason, is it significant, on and off? Under most circumstances, verify won't hurt performance. It won't hurt reads of any kind, and it won't hurt writes when you're writing unique data, or when you're writing duplicate data that is warm in the read cache. It will basically only hurt write performance if you are writing duplicate data which was not read recently. This might be the case, for example, if this machine is the target for some remote machine to back up onto. The problem doesn't exist if you're copying local data, because you first read the data (now it's warm in cache) before writing it, so the verify operation takes essentially zero time in that case.
Re: [zfs-discuss] drive speeds etc
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Roy Sigurd Karlsbakk

                     extended device statistics
  device  r/s    w/s   kr/s    kw/s  wait  actv  svc_t  %w  %b
  sd1     0.5  140.3    0.3  2426.3   0.0   1.0    7.2   0  14
  sd2     0.0  138.3    0.0  2476.3   0.0   1.5   10.6   0  18
  sd3     0.0  303.9    0.0  2633.8   0.0   0.4    1.3   0   7
  sd4     0.5  306.9    0.3  2555.8   0.0   0.4    1.2   0   7
  sd5     1.0  308.5    0.5  2579.7   0.0   0.3    1.0   0   7
  sd6     1.0  304.9    0.5  2352.1   0.0   0.3    1.1   1   7
  sd7     1.0  298.9    0.5  2764.5   0.0   0.6    2.0   0  13
  sd8     1.0  304.9    0.5  2400.8   0.0   0.3    0.9   0   6

Unless I'm misunderstanding this output... It looks like all disks are doing approx the same data throughput, while sd1 and sd2 are doing half the IOPS. So sd1 and sd2 must be doing larger chunks. How are these drives configured? One vdev of raidz2? No cache/log devices, etc.? It would be easy to explain if you're striping mirrors. Difficult (at least for me) to explain if you're using raidzN.
Re: [zfs-discuss] [osol-discuss] zfs send/receive?
From: Richard Elling [mailto:richard.ell...@gmail.com] It is relatively easy to find the latest, common snapshot on two file systems. Once you know the latest, common snapshot, you can send the incrementals up to the latest. I've always relied on the snapshot names matching. Is there a way to find the latest common snapshot if the names don't match?
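One name-independent approach, sketched here under the assumption that your zfs version exposes the per-snapshot guid property (which is preserved across send/receive, so matching guids identify the same snapshot whatever either side calls it). Datasets tank/fs and backup/fs are hypothetical:

```shell
# List snapshots oldest-to-newest as "guid name" on each side:
zfs list -H -t snapshot -r -s creation -o guid,name tank/fs   > /tmp/src.list
zfs list -H -t snapshot -r -s creation -o guid,name backup/fs > /tmp/dst.list

# The last source snapshot whose guid also appears on the destination
# is the latest common snapshot:
awk 'NR == FNR { have[$1] = 1; next }    # first file: destination guids
     ($1 in have) { latest = $2 }        # second file: track last match
     END { print latest }' /tmp/dst.list /tmp/src.list
```

If your build doesn't expose guid through zfs list, the creation property is a weaker fallback, but guids are the only thing that actually proves the snapshots are the same.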
Re: [zfs-discuss] Long resilver time
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Jason J. W. Williams I just witnessed a resilver that took 4h for 27gb of data. Setup is 3x raid-z2 stripes with 6 disks per raid-z2. Disks are 500gb in size. No checksum errors. 27G on a 6-disk raidz2 means approx 6.75G per disk at most. Ideally, a disk could write 7G = 56 Gbit sequentially in a couple of minutes if there were no other activity in the system. So you're right to suspect something is suboptimal, but the root cause is inefficient resilvering code in zfs, specifically for raidzN. The resilver code spends a *lot* of time seeking, because it's not optimized for the disk layout. This may change some day, but not in the near future. Mirrors don't suffer the same effect; at least, if they do, it's far less dramatic. For now, all you can do is: (a) factor this into your decision to use mirrors versus raidz, (b) ensure no snapshots and minimal IO during the resilver, and (c) if you opt for raidz, keep the number of disks in a raidz to a minimum. It is preferable to use 3 vdevs each of 7-disk raidz instead of a 21-disk raidz3. Your setup of 3x raidz2 is pretty reasonable, and a 4h resilver, although slow, is successful. Which is more than you could say if you had a 21-disk raidz3.
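The arithmetic above, made explicit. The 60 MB/s sustained sequential write rate is my assumed figure for a 500gb sata disk of that era, and the 6.75 GB is the worst case of all 27 GB landing on one 6-disk raidz2 (4 data disks):

```shell
# Upper bound on data to rewrite on the replaced disk, and how long a
# purely sequential rewrite of it would take:
awk 'BEGIN {
  gb   = 27 / 4           # GB on the resilvering disk, worst case
  secs = gb * 1024 / 60   # at an assumed 60 MB/s sequential write
  printf "%.2f GB, ~%.0f seconds sequential\n", gb, secs
}'
# prints: 6.75 GB, ~115 seconds sequential
```

Two minutes of sequential writing versus four hours observed: the gap is almost entirely seek overhead paid by the raidzN resilver code.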
Re: [zfs-discuss] Dedup relationship between pool and filesystem
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Brad Stone For de-duplication to perform well you need to be able to fit the de-dup table in memory. Is a good rule-of-thumb for needed RAM Size=(pool capacity/avg block size)*270 bytes? Or perhaps it's Size/expected_dedup_ratio? For now, the rule of thumb is 3G ram for every 1TB of unique data, including snapshots and vdevs. After a system is running, I don't know how/if you can measure current mem usage, to gauge the results of your own predictions.
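The poster's formula, evaluated for two assumed average block sizes, shows why the rule of thumb is so fuzzy: the table size scales inversely with block size. The 270 bytes/entry comes from the question; both block sizes here are illustrative assumptions, not measurements:

```shell
ddt_gb() {  # usage: ddt_gb <unique_data_tb> <avg_block_kb>
  awk -v tb="$1" -v kb="$2" 'BEGIN {
    entries = tb * 1024 * 1024 * 1024 / kb     # one DDT entry per block
    printf "%.1f GB\n", entries * 270 / (1024 ^ 3)
  }'
}
ddt_gb 1 128   # 1 TB of 128K blocks -> prints: 2.1 GB
ddt_gb 1 64    # 1 TB of 64K blocks  -> prints: 4.2 GB
```

So the same formula spans the whole "a bit over 1G" to "3G" range debated below, purely on the average-block-size assumption.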
Re: [zfs-discuss] [osol-discuss] zfs send/receive?
From: opensolaris-discuss-boun...@opensolaris.org [mailto:opensolaris-discuss-boun...@opensolaris.org] On Behalf Of Roy Sigurd Karlsbakk I'm using a custom snapshot scheme which snapshots every hour, day, week and month, rotating 24h, 7d, 4w and so on. What would be the best way to zfs send/receive these things? I'm a little confused about how this works for delta updates... Out of curiosity, why custom? It sounds like a default config. Anyway, as long as the present destination filesystem matches a snapshot from the source system, you can incrementally send any newer snapshot. Generally speaking, you don't want to send anything that's extremely volatile, such as hourlies... because if the snap on the source disappears, then you have nothing to send incrementally from anymore. Make sense? I personally send incrementals once a day, and only send the daily incrementals.
Re: [zfs-discuss] Dedup relationship between pool and filesystem
From: Roy Sigurd Karlsbakk [mailto:r...@karlsbakk.net] For now, the rule of thumb is 3G ram for every 1TB of unique data, including snapshots and vdev's. 3 gigs? Last I checked it was a little more than 1GB, perhaps 2 if you have small files. http://opensolaris.org/jive/thread.jspa?threadID=131761 The true answer is that it varies depending on things like block size, so whether you say 1G or 3G, despite sounding like a big difference, the difference is in the noise. We're only talking rule of thumb here, based on vague (very vague) and widely variable estimates of your personal usage characteristics. It's just a rule of thumb, and slightly over 1G ~= slightly under 3G in this context. Hence the comment: after a system is running, I don't know how/if you can measure current mem usage, to gauge the results of your own predictions.
Re: [zfs-discuss] Any zfs fault injection tools?
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Freddie Cash The following works well: dd if=/dev/random of=/dev/disk-node bs=1M count=1 seek=whatever If you have long enough cables, you can move a disk outside the case and run a magnet over it to cause random errors. Plugging/unplugging the SATA/SAS cable from a disk while doing normal reads/writes is also fun. Using the controller software (if a RAID controller) to delete LUNs/disks is also fun. You don't have any friends that are computers anymore, do you? ;-) The words cruel and unusual come to mind.
Re: [zfs-discuss] Dedup relationship between pool and filesystem
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Peter Taps The dedup property is set on a filesystem, not on the pool. However, the dedup ratio is reported on the pool and not on the filesystem. As with most other ZFS concepts, the core functionality of ZFS is implemented in zpool. Hence, zpool is up to what ... version 25 or so now? Think of ZFS (the posix filesystem) as just an interface which tightly integrates the zpool features. ZFS is only up to what, version 4 now? Perfect example: if you create a zvol in linux, and format it ext3/4 instead of zfs, then you can still snapshot it, and I believe you can even zfs send and receive it. And so on. The core functionality is mostly present. But if you want to access the snapshot, you have to create some mountpoint, and mount the snapshot zvol read-only on that mountpoint. It's not automatic. It's barely any better than the crappy snapshot concept linux has in LVM. If you want good automatic snapshot creation and seamless automatic mounting, then you need the ZFS filesystem on top of the zpool. Because the ZFS filesystem knows about the underlying zpool features, and makes them a convenient, easy, good experience. ;-)
Re: [zfs-discuss] ZFS checksum errors (ZFS-8000-8A)
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Bob Friesenhahn It is very unusual to obtain the same number of errors (probably same errors) from two devices in a pair. This should indicate a common symptom such as a memory error (does your system have ECC?), controller glitch, or a shared power supply issue. Bob's right. I didn't notice that both sides of the mirror have precisely 56 checksum errors. Ignore what I said about adding a 3rd disk to the mirror; it won't help. The 3rd mirror would have been useful only if the block corruption on these 2 disks weren't the same blocks. I think you have to acknowledge that you have corrupt data, and you should run some memory diagnostics on your system to see if you can detect some failing memory. The cause is not necessarily memory, as Bob pointed out, but a typical way to produce the result you're seeing is: ZFS calculates a checksum of a block it's about to write to disk, and of course that checksum is stored in ram. Unfortunately, if it's stored in corrupt ram, then when it's written to disk, the checksum will mismatch. And the faulty checksum gets written to both sides of the mirror. It is discovered later during your scrub. There is no un-corrupt copy of the data that ZFS thought it wrote. At least it's detected by ZFS. Without checksumming, that error would pass undetected.
Re: [zfs-discuss] resilver = defrag?
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of David Dyer-Bennet For example, if you start with an empty drive, and you write a large amount of data to it, you will have no fragmentation. (At least, no significant fragmentation; you may get a little bit based on random factors.) As life goes on, as long as you keep plenty of empty space on the drive, there's never any reason for anything to become significantly fragmented. Sure, if only a single thread is ever writing to the disk store at a time. This has already been discussed in this thread. The threading model doesn't affect the outcome of files being fragmented or unfragmented on disk. The OS is smart enough to know that these blocks written by process A are all sequential, and those blocks written by process B are also sequential, but separate.
Re: [zfs-discuss] resilver = defrag?
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Marty Scholes What appears to be missing from this discussion is any shred of scientific evidence that fragmentation is good or bad, and by how much. We also lack any detail on how much fragmentation does take place. Agreed. I've been rather lazily asserting a few things here and there that I expected to be challenged, so I've been thinking up tests to verify/dispute my claims, but then nobody challenged. Specifically: the blocks on disk are not interleaved just because multiple threads were writing at the same time. So there's at least one thing which is testable, if anyone cares. But there's also no way that I know of to measure fragmentation in a real system that's been in production for a year.
Re: [zfs-discuss] Best practice for Sol10U9 ZIL -- mirrored or not?
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Bryan Horstmann-Allen The ability to remove the slogs isn't really the win here, it's import -F. The Disagree. Although I agree the -F is important and good, I think log device removal is the main win. Prior to log device removal, if you lost your slog, then you lost your whole pool, and probably your system halted (or did something equally bad, which isn't strictly halting). Therefore you wanted your slog to be as redundant as the rest of your pool. With log device removal, if you lose a slog while the system is up, the worst case is performance degradation. With log device removal, there's only one thing you have to worry about: your slog goes bad, undetected. The system keeps writing to it, unaware that it will never be able to read it back, and therefore when you get a system crash, and for the first time your system tries to read that device, you lose information. Not your whole pool: you lose up to 30 sec of writes that the system thought it wrote but never did, and you require the -F to import. Historically, people have always recommended mirroring your log device, even with log device removal, to protect against the above situation. But in a recent conversation including Neil, it seems there might be a bug which causes the log device mirror to be ignored during import, thus rendering the mirror useless in the above situation. Neil, or anyone, is there any confirmation or development on that bug? Given all of this, I would say it's recommended to forget about mirroring log devices for now. In the past, the recommendation was yes, mirror. Right now, it's no, don't mirror, and after the bug is fixed, the recommendation will again become yes, mirror.
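The operations under discussion, sketched for a hypothetical pool tank with hypothetical SSD devices c2t0d0/c2t1d0 (log device removal requires a sufficiently recent zpool version):

```shell
# Mirrored slog (the historical recommendation):
zpool add tank log mirror c2t0d0 c2t1d0

# Or an unmirrored slog, relying on log device removal as the safety net:
zpool add tank log c2t0d0

# If an unmirrored slog misbehaves while the pool is imported, it can
# simply be removed; ZFS falls back to keeping the ZIL in the main pool:
zpool remove tank c2t0d0
```

The failure window discussed above is precisely the case where removal can't help: the slog dies *and* the system crashes before anyone notices.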
Re: [zfs-discuss] resilver that never finishes
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Tom Bird We recently had a long discussion on this list about resilver times versus raid types. In the end, the conclusion was: the resilver code is very inefficient for raidzN. Someday it may be better optimized, but until that day comes, you really need to break your giant raidzN into smaller vdevs. 3 vdevs of 7-disk raidz are preferable over a 21-disk raidz3. If you want this resilver to complete, you should do anything you can to (a) stop taking snapshots, (b) not scrub, and (c) stop all IO possible. And be patient. Most people in your situation find it faster to zfs send to some other storage, and then destroy and recreate the pool. I know it stinks, but that's what you're facing.
Re: [zfs-discuss] ZFS file system without pool
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Ramesh Babu I would like to know if I can create a ZFS file system without a ZFS storage pool. Also I would like to know if I can create a ZFS pool on a Veritas Volume. Unless I'm mistaken, you seem to be confused, thinking zpools can only be created from physical devices. You can make zpools from files, sparse files, physical devices, remote devices, in memory ... basically any type of storage you can communicate with. You create the zpool, and the zfs filesystem optionally comes along with it. All the magic is done in the pool - snapshots, dedup, etc. The only reason you would want a zfs filesystem is because it's specifically designed to leverage the magic of a zpool natively. If it were possible to create a zfs filesystem without a zpool, you might as well just use ufs.
Re: [zfs-discuss] resilver = defrag?
From: Richard Elling [mailto:rich...@nexenta.com] Suppose you want to ensure at least 99% efficiency of the drive. At most 1% time wasted by seeking. This is practically impossible on a HDD. If you need this, use SSD. Lately, Richard, you're saying some of the craziest illogical things I've ever heard about fragmentation and/or raid. It is absolutely not difficult to avoid fragmentation on a spindle drive at the level I described. Just keep plenty of empty space on your drive, and you won't have a fragmentation problem. (Except as required by COW.) How on earth do you conclude this is practically impossible? For example, if you start with an empty drive, and you write a large amount of data to it, you will have no fragmentation. (At least, no significant fragmentation; you may get a little bit based on random factors.) As life goes on, as long as you keep plenty of empty space on the drive, there's never any reason for anything to become significantly fragmented. Again, except for COW. It is known that COW will cause fragmentation if you write randomly in the middle of a file that is protected by snapshots.
Re: [zfs-discuss] resilver = defrag?
From: Richard Elling [mailto:rich...@nexenta.com] It is practically impossible to keep a drive from seeking. It is also The first time somebody (Richard) said you can't prevent a drive from seeking, I just decided to ignore it. But then it was said twice (Ian). I don't get why anybody is saying this; did anybody claim drives don't seek? I said you can quantify how much fragmentation is acceptable, given drive speed characteristics and a percentage of time you consider acceptable for seeking. I suggested acceptable was 99% efficiency and 1% time wasted seeking. Roughly calculated, I came up with 40 MB of sequential data per random seek to yield 99% efficiency. For some situations, that's entirely possible and likely to be the norm. For other cases, it may be unrealistic, and you may suffer badly from fragmentation. Is there some point we're talking about here? I don't get why the conversation has taken such a tangent.
Re: [zfs-discuss] resilver = defrag?
From: Haudy Kazemi [mailto:kaze0...@umn.edu] With regard to multiuser systems and how that negates the need to defragment, I think that is only partially true. As long as the files are defragmented enough so that each particular read request only requires one seek before it is time to service the next read request, further defragmentation may offer only marginal benefit. On the other Here's a great way to quantify how much fragmentation is acceptable: suppose you want to ensure at least 99% efficiency of the drive. At most 1% time wasted by seeking. Suppose you're talking about 7200rpm sata drives, which sustain 500Mbit/s transfer and have an average seek time of 8ms. 8ms is 1% of 800ms. In 800ms, the drive could read 400 Mbit of sequential data. That's 50 MB (call it 40-50 MB). So as long as the fragment sizes of your files are approx 40 MB or larger, fragmentation has a negligible effect on performance. One seek per every 40MB read/written will yield less than 1% performance impact. For the heck of it, let's see how that would have computed with 15krpm SAS drives: sustained transfer 1Gbit/s, and average seek 3.5ms. 3.5ms is 1% of 350ms. In 350ms, the drive could read 350 Mbit (call it 43MB). That's certainly in the same ballpark.
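The same arithmetic as code, using the drive figures assumed above (7200rpm SATA: 500 Mbit/s sustained, 8 ms average seek; 15k SAS: 1 Gbit/s, 3.5 ms). Note the SATA case comes out at 50 MB rather than 40, the same ballpark:

```shell
run_mb() {  # usage: run_mb <avg_seek_ms> <sustained_mbit_per_s>
  awk -v seek="$1" -v rate="$2" 'BEGIN {
    window_ms = seek * 100             # window in which one seek costs 1%
    mbit = rate * window_ms / 1000     # sequential data moved in that window
    printf "%.2f MB per seek\n", mbit / 8
  }'
}
run_mb 8 500      # 7200rpm SATA -> prints: 50.00 MB per seek
run_mb 3.5 1000   # 15k SAS      -> prints: 43.75 MB per seek
```

Faster seeks and faster sustained transfer roughly cancel, which is why both drive classes land near the same ~40-50 MB fragment-size threshold.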
Re: [zfs-discuss] resilver = defrag?
From: Richard Elling [mailto:rich...@nexenta.com] With appropriate write caching and grouping or re-ordering of writes algorithms, it should be possible to minimize the amount of file interleaving and fragmentation on write that takes place. To some degree, ZFS already does this. The dynamic block sizing tries to ensure that a file is written into the largest block[1] Yes, but the block sizes in question are typically at most 128K. As computed in my email a minute ago, the fragment size needs to be on the order of 40 MB in order to effectively eliminate the performance loss of fragmentation. Also, ZFS has an intelligent prefetch algorithm that can hide some performance aspects of fragmentation on HDDs. Unfortunately, prefetch can only hide fragmentation on systems that have idle disk time. Prefetch isn't going to help you if you actually need to transfer a whole file as fast as possible.
Re: [zfs-discuss] dedicated ZIL/L2ARC
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Wolfraider We are looking into the possibility of adding dedicated ZIL and/or L2ARC devices to our pool. We are looking into getting 4 – 32GB Intel X25-E SSD drives. Would this be a good solution to slow write speeds? If you have slow write speeds, a dedicated log device might help. (Log devices are for writes, not for reads.) It sounds like your machine is an iscsi target, in which case you're certainly doing a lot of sync writes, and therefore hitting your ZIL hard. So it's all but certain that adding dedicated log devices will help. One thing to be aware of: once you add a dedicated log, *all* of your sync writes will hit that log device. While a single SSD or pair of SSD's will have fast IOPS, they can easily become a new bottleneck with worse performance than what you had before ... If you've got 80 spindle disks now, and by any chance you perform sequential sync writes, then a single pair of SSD's won't compete. I'd suggest adding several SSD's for log devices, and no mirroring. Perhaps one SSD for every raidz2 vdev, or every other, or every third, depending on what you can afford. If you have slow reads, an l2arc cache might help. (Cache devices are for reads, not writes.) We are currently sharing out different slices of the pool to windows servers using comstar and fibrechannel. We are currently getting around 300MB/sec performance with 70-100% disk busy. You may be facing some other problem, aside from just lacking cache/log devices. I suggest giving us some more detail here. Such as ... Large sequential operations are good on raidz2, but random IO performs pretty poorly on raidz2. What sort of network are you using? I know you said comstar and fibrechannel, and sharing slices to windows ... I assume this means you're doing iscsi, right? Dual 4Gbit links per server? You're getting 2.4 Gbit and you expect what? 
You have a pool made up of 18 raidz2 vdevs with 5 drives each (capacity of 3 disks each) ... Is each vdev on its own bus? What type of bus is it? (Generally speaking, it is preferable to spread vdevs across buses, instead of putting 1 vdev on 1 bus, for reliability purposes.) ... How many disks, of what type, on each bus? What type of bus, at what speed? What are the usage characteristics, and how are you making your measurement?
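The several-unmirrored-slogs suggestion, sketched with hypothetical device names; when a pool has multiple log devices, ZFS spreads log writes across them, so more devices means more aggregate slog throughput:

```shell
# Four X25-E's as four independent (unmirrored) log devices:
zpool add tank log c3t0d0 c3t1d0 c3t2d0 c3t3d0

# On zpool versions with log device removal, any of them can later be
# pulled back out if it misbehaves or is needed elsewhere:
zpool remove tank c3t3d0
```

An L2ARC device is added similarly but with the cache keyword (zpool add tank cache <dev>); cache devices are always safe to lose, so they never need mirroring.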
Re: [zfs-discuss] resilver = defrag?
From: Richard Elling [mailto:rich...@nexenta.com] This operational definition of fragmentation comes from the single-user, single-tasking world (PeeCees). In that world, only one thread writes files from one application at one time. In those cases, there is a reasonable expectation that a single file's blocks might be contiguous on a single disk. That isn't the world we live in, where we have RAID, multi-user, and multi-threaded environments. I don't know what you're saying, but I'm quite sure I disagree with it. Regardless of multithreading or multiprocessing, it's absolutely possible to have contiguous files, and/or file fragmentation. That's not a characteristic which depends on the threading model. Also regardless of raid, it's possible to have contiguous or fragmented files. The same concept applies to multiple disks.
Re: [zfs-discuss] resilver = defrag?
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Orvar Korvar I was thinking to delete all zfs snapshots before zfs send receive to another new zpool. Then everything would be defragmented, I thought. You don't need to delete snaps before zfs send, if your goal is to defragment your filesystem. Just perform a single zfs send, and don't do any incrementals afterward. The receiving filesystem will lay out the filesystem as it wishes. (I assume snapshots work this way: I snapshot once and do some changes, say delete file A and edit file B. When I delete the snapshot, the file A is still deleted and file B is still edited. In other words, deletion of a snapshot does not revert back the changes.) You are correct. A snapshot is a read-only image of the filesystem, as it was, at some time in the past. If you destroy the snapshot, you've only destroyed the snapshot. You haven't destroyed the most recent live version of the filesystem. If you wanted to, you could rollback, which destroys the live version of the filesystem, and restores you back to some snapshot. But that is a very different operation. Rollback is not at all similar to destroying a snapshot. These two operations are basically opposites of each other. All of this is discussed in the man pages. I suggest man zpool and man zfs. Everything you need to know is written there.
Re: [zfs-discuss] resilver = defrag?
From: Richard Elling [mailto:rich...@nexenta.com] Regardless of multithreading, multiprocessing, it's absolutely possible to have contiguous files, and/or file fragmentation. That's not a characteristic which depends on the threading model. Possible, yes. Probable, no. Consider that a file system is allocating space for multiple, concurrent file writers. Process A is writing. Suppose it starts writing at block 10,000 out of my 1,000,000 block device. Process B is also writing. Suppose it starts writing at block 50,000. These two processes write simultaneously, and no fragmentation occurs, unless Process A writes more than 40,000 blocks. In that case, A's file gets fragmented, and the 2nd fragment might begin at block 300,000. The concept which causes fragmentation (not counting COW) is the size of the span of unallocated blocks. Most filesystems will allocate blocks from the largest unallocated contiguous area of the physical device, so as to minimize fragmentation. I can't say authoritatively how ZFS behaves, but I'd be extremely surprised if two processes writing different files as fast as possible ended up with all their blocks interleaved with each other on physical disk. I think this is possible if you have multiple processes lazily writing at less-than-full speed, because then ZFS might remap a bunch of small writes into a single contiguous write. Also regardless of raid, it's possible to have contiguous or fragmented files. The same concept applies to multiple disks. RAID works against the efforts to gain performance by contiguous access because the access becomes non-contiguous. These might as well have been words randomly selected from the dictionary to me - I recognize that it's a complete sentence, but you might as well have said processors aren't needed in computers anymore, or something equally illogical. Suppose you have a 3-disk raid stripe set, using traditional simple striping, because it's very easy to explain.
Suppose a process is writing as fast as it can, and suppose it's going to write block 0 through block 99 of a virtual device.
virtual block 0 = block 0 of disk 0
virtual block 1 = block 0 of disk 1
virtual block 2 = block 0 of disk 2
virtual block 3 = block 1 of disk 0
virtual block 4 = block 1 of disk 1
virtual block 5 = block 1 of disk 2
virtual block 6 = block 2 of disk 0
virtual block 7 = block 2 of disk 1
virtual block 8 = block 2 of disk 2
virtual block 9 = block 3 of disk 0
...
virtual block 96 = block 32 of disk 0
virtual block 97 = block 32 of disk 1
virtual block 98 = block 32 of disk 2
virtual block 99 = block 33 of disk 0
Thanks to buffering and command queueing, the OS tells the RAID controller to write blocks 0-8, and the raid controller tells disk 0 to write blocks 0-2, tells disk 1 to write blocks 0-2, and tells disk 2 to write blocks 0-2, simultaneously. So the total throughput is the sum of all 3 disks writing continuously and contiguously to sequential blocks. This accelerates performance for continuous sequential writes. It does not work against efforts to gain performance by contiguous access. The same concept is true for raid-5 or raidz, but it's more complicated. The filesystem or raid controller does in fact know how to write sequential filesystem blocks to sequential physical blocks on the physical devices, for the sake of performance enhancement on contiguous read/write. If you don't believe me, there's a very easy test to prove it: create a zpool with 1 disk in it, and time writing 100G (or some amount of data larger than RAM). Create a zpool with several disks in a raidz set, and time writing 100G. The speed scales up linearly with the number of disks, until you reach some other hardware bottleneck, such as bus speed or something like that.
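The block map above is easy to state as arithmetic. Here is a minimal sketch (plain Python, not any real controller's code): virtual block i lands on disk i mod N, at physical block i div N.

```python
# Simple striping across ndisks: a sketch of the block map listed above.
# Not any real RAID implementation -- just the arithmetic.

def stripe_map(vblock, ndisks=3):
    """Map a virtual block number to (disk, physical block)."""
    return (vblock % ndisks, vblock // ndisks)

assert stripe_map(0) == (0, 0)    # virtual block 0 = block 0 of disk 0
assert stripe_map(5) == (2, 1)    # virtual block 5 = block 1 of disk 2
assert stripe_map(99) == (0, 33)  # virtual block 99 = block 33 of disk 0
```

Grouping by disk shows why sequential throughput sums: virtual blocks 0-8 become three contiguous runs of physical blocks 0-2, one run per disk, all written simultaneously.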
Re: [zfs-discuss] zfs compression with Oracle - anyone implemented?
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Brad Hi! I'd been scouring the forums and web for admins/users who deployed zfs with compression enabled on Oracle backed by storage array luns. Any problems with cpu/memory overhead? I don't think your question is clear. What do you mean by "oracle backed by storage luns"? Do you mean on oracle hardware? Do you mean you plan to run an oracle database on the server, with ZFS under the database? Generally speaking, you can enable compression on any zfs filesystem, and the cpu overhead is not very big, and the compression level is not very strong by default. However, if the data you have is generally incompressible, any overhead is a waste.
Re: [zfs-discuss] resilver = defrag?
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Orvar Korvar I am not really worried about fragmentation. I was just wondering if I attach new drives and zfs send receive to a new zpool, would count as defrag. But apparently, not. "Apparently not in all situations" would be more appropriate. The understanding I had was: If you do a single zfs send | receive, then it does effectively get defragmented, because the receiving filesystem is going to re-layout the received filesystem, and there is nothing pre-existing to make the receiving filesystem dance around... But if you're sending some initial, plus incrementals, then you're actually repeating the same operations that probably caused the original filesystem to become fragmented in the first place. And in fact, it seems unavoidable... Suppose you have a large file, which is all sequential on disk. You make a snapshot of it, which means all the individual blocks must not be overwritten. And then you overwrite a few bytes scattered randomly in the middle of the file. The nature of copy on write is such that, of course, it is impossible for the latest version of the file to remain contiguous. Your only choices are: to read and rewrite copies of the whole file, including multiple copies of what didn't change, or to leave the existing data in place where it is on disk, and instead write your new random bytes to other non-contiguous locations on disk. Hence fragmentation.
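The copy-on-write dilemma described above can be sketched with a toy block map (hypothetical, not ZFS's actual allocator): once a snapshot pins the original blocks, an overwrite must land somewhere else, so the live file stops being contiguous.

```python
# Toy copy-on-write model: a snapshot pins the original physical blocks,
# so an overwrite of one logical block goes to a new location.

def cow_overwrite(block_map, index, new_physical):
    """Redirect logical block `index` to a newly allocated physical block;
    the old physical block stays in place for the snapshot."""
    updated = list(block_map)
    updated[index] = new_physical
    return updated

live = list(range(10))               # file laid out on physical blocks 0..9
snap = list(live)                    # snapshot references the same blocks
live = cow_overwrite(live, 5, 100)   # overwrite a few bytes mid-file

assert snap == [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]    # snapshot: still contiguous
assert live == [0, 1, 2, 3, 4, 100, 6, 7, 8, 9]  # live file: now fragmented
```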
Re: [zfs-discuss] Suggested RaidZ configuration...
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Freddie Cash No, it (21-disk raidz3 vdev) most certainly will not resilver in the same amount of time. In fact, I highly doubt it would resilver at all. My first foray into ZFS resulted in a 24-disk raidz2 vdev using 500 GB Seagate ES.2 and WD RE3 drives connected to 3Ware 9550SXU and 9650SE multilane controllers. Nice 10 TB storage pool. Worked beautifully as we filled it with data. Had less than 50% usage when a disk died. No problem, it's ZFS, it's meant to be easy to replace a drive, just offline, swap, replace, wait for it to resilver. Well, 3 days later, it was still under 10%, and every disk light was still solid green. SNMP showed over 100 MB/s of disk I/O continuously ... I don't believe your situation is typical. I think you either encountered a bug, or you had something happening that you weren't aware of (scrub, autosnapshots, etc) ... because the only time I've ever seen anything remotely similar to the behavior you described was the bug I've mentioned in other emails, which occurs when the disk is 100% full and a scrub is taking place. I know it's not the same bug for you, because you said your pool was only 50% full. But I don't believe that what you saw was normal or typical.
Re: [zfs-discuss] Suggested RaidZ configuration...
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Erik Trimble the thing that folks tend to forget is that RaidZ is IOPS limited. For the most part, if I want to reconstruct a single slab (stripe) of data, I have to issue a read to EACH disk in the vdev, and wait for that disk to return the value, before I can write the computed parity value out to the disk under reconstruction. If I'm trying to interpret your whole message, Erik, and condense it, I think I get the following. Please tell me if and where I'm wrong. In any given zpool, some number of slabs are used in the whole pool. In raidzN, a portion of each slab is written on each disk. Therefore, during resilver, if there are a total of 1 million slabs used in the zpool, it means each good disk will need to read 1 million partial slabs, and the replaced disk will need to write 1 million partial slabs. Each good disk receives a read request in parallel, and all of them must complete before a write is given to the new disk. Each read/write cycle is completed before the next cycle begins. (It seems this could be accelerated by allowing all the good disks to continue reading in parallel instead of waiting, right?) The conclusion I would reach is: Given no bus bottleneck: It is true that resilvering a raidz will be slower with many disks in the vdev, because the average latency for the worst of N disks will increase as N increases. But that effect is only marginal, and bounded between the average latency of a single disk, and the worst case latency of a single disk. The characteristic that *really* makes a big difference is the number of slabs in the pool. i.e. if your filesystem is composed of mostly small files or fragments, versus mostly large unfragmented files.
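The latency claim in that last paragraph can be checked with a quick simulation (the numbers are invented: per-disk latency uniform between 5 ms and 15 ms). The expected worst-of-N latency grows with N, but it is bounded by a single disk's worst case, so the growth is marginal.

```python
# Estimate the average read/write cycle time when each cycle must wait
# for the slowest of N disks. Latencies are a made-up uniform(5, 15) ms.
import random

random.seed(1)

def cycle_time(ndisks, trials=10_000):
    """Mean of max-of-N latencies over many simulated cycles, in ms."""
    return sum(max(random.uniform(5, 15) for _ in range(ndisks))
               for _ in range(trials)) / trials

few, many = cycle_time(3), cycle_time(12)
assert few < many < 15    # grows with N, but bounded by the 15 ms worst case
assert many - few < 5     # ...and the growth is marginal
```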
Re: [zfs-discuss] Suggested RaidZ configuration...
From: Hatish Narotam [mailto:hat...@gmail.com] PCI-E 8X 4-port ESata Raid Controller. 4 x ESata to 5Sata Port multipliers (each connected to a ESata port on the controller). 20 x Samsung 1TB HDD's. (each connected to a Port Multiplier). Assuming your disks can all sustain 500Mbit/sec, which I find to be typical for 7200rpm sata disks, and you have groups of 5 that all have a 3Gbit upstream bottleneck, it means each of your groups of 5 should be fine in a raidz1 configuration. You think that your sata card can do 32Gbit because it's on a PCIe x8 bus. I highly doubt it unless you paid a grand or two for your sata controller, but please prove me wrong. ;-) I think the backplane of the sata controller is more likely either 3G or 6G. If it's 3G, then you should use 4 groups of raidz1. If it's 6G, then you can use 2 groups of raidz2 (because 10 drives of 500Mbit can only sustain 5Gbit). If it's 12G or higher, then you can make all of your drives one big vdev of raidz3. According to Samsung's site, max read speed is 250MBps, which translates to 2Gbps. Multiply by 5 drives gives you 10Gbps. I guarantee you this is not a sustainable speed for 7.2krpm sata disks. You can get a decent measure of sustainable speed by doing something like:
(write 1G byte)
time dd if=/dev/zero of=/some/file bs=1024k count=1024
(beware: you might get an inaccurate speed measurement here due to ram buffering. See below.)
(reboot to ensure nothing is in cache)
(read 1G byte)
time dd if=/some/file of=/dev/null bs=1024k
(Now you're certain you have a good measurement. If it matches the measurement you had before, that means your original measurement was also accurate. ;-) )
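The arithmetic behind those recommendations, as a quick sketch using the thread's numbers (500 Mbit/s sustained per disk is my assumption; 250 MB/s is Samsung's claimed maximum):

```python
# Aggregate bandwidth a group of disks can demand from its uplink.

def group_demand_gbit(per_disk_mbit, ndisks):
    """Total demand of `ndisks` disks, each at per_disk_mbit, in Gbit/s."""
    return per_disk_mbit * ndisks / 1000.0

# 5 disks at a realistic ~500 Mbit/s each fit under a 3 Gbit/s link...
assert group_demand_gbit(500, 5) == 2.5

# ...but at Samsung's claimed 250 MB/s (2000 Mbit/s), the same 5 disks
# would need 10 Gbit/s, far beyond a 3 Gbit/s port-multiplier uplink.
assert group_demand_gbit(2000, 5) == 10.0
```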
Re: [zfs-discuss] Suggested RaidZ configuration...
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Edward Ned Harvey The characteristic that *really* makes a big difference is the number of slabs in the pool. i.e. if your filesystem is composed of mostly small files or fragments, versus mostly large unfragmented files. Oh, if at least some of my reasoning was correct, there is one valuable take-away point for hatish: Given some number X total slabs used in the whole pool. If you use a single vdev for the whole pool, you will have X partial slabs written on each disk. If you have 2 vdev's, you'll have approx X/2 partial slabs written on each disk. 3 vdevs ~ X/3 partial slabs on each disk. Therefore, the resilver time approximately divides by the number of separate vdev's you are using in your pool. So the largest factor affecting resilver time of a single large vdev versus many smaller vdev's is NOT the quantity of data written on each disk, but just the fact that fewer slabs are used on each disk when using smaller vdev's. If you want to choose between (a) a 21-disk raidz3 versus (b) 3 vdevs of 7-disk raidz1 each, then: The raidz3 provides better redundancy, but has the disadvantage that every slab must be partially written on every disk.
Re: [zfs-discuss] NetApp/Oracle-Sun lawsuit done
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Bob Friesenhahn There should be little doubt that NetApp's goal was to make money by suing Sun. Nexenta does not have enough income/assets to make a risky lawsuit worthwhile. But in all likelihood, Apple still won't touch ZFS. Apple would be worth suing. A big fat juicy... One interesting take-away point, however: Oracle is now in a solid position to negotiate with Apple. If Apple wants to pay for ZFS and indemnification against a netapp lawsuit, Oracle can grant it.
Re: [zfs-discuss] resilver = defrag?
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Orvar Korvar A) Resilver = Defrag. True/false? I think everyone will agree false on this question. However, more detail may be appropriate. See below. B) If I buy larger drives and resilver, does defrag happen? Scores so far: 2 No, 1 Yes. C) Does zfs send | zfs receive mean it will defrag? Scores so far: 1 No, 2 Yes. ... Does anybody here know what they're talking about? I'd feel good if perhaps Erik ... or Neil ... perhaps ... answered the question with actual knowledge. Thanks...
Re: [zfs-discuss] Suggested RaidZ configuration...
From: Haudy Kazemi [mailto:kaze0...@umn.edu] There is another optimization in the Best Practices Guide that says the number of devices in a vdev should be (N+P) with P = 1 (raidz), 2 (raidz2), or 3 (raidz3) and N equals 2, 4, or 8. I.e. 2^n + P where n is 1, 2, or 3 and P is the RAIDZ level. I.e. optimal sizes:
RAIDZ1 vdevs should have 3, 5, or 9 devices in each vdev
RAIDZ2 vdevs should have 4, 6, or 10 devices in each vdev
RAIDZ3 vdevs should have 5, 7, or 11 devices in each vdev
This sounds logical, although I don't know how real it is. The logic seems to be ... Assuming slab sizes of 128K, the amount of data written to each disk within the vdev gets divided into something which is a multiple of 512b or 4K (newer drives supposedly starting to use 4K block sizes instead of 512b). But I have doubts about the real-ness here, because ... An awful lot of times, your actual slabs are smaller than 128K just because you're not performing sustained sequential writes very often. But it seems to make sense, whenever you *do* have some sequential writes, you would want the data written to each disk to be a multiple of 512b or 4K. If you had a 128K slab, divided into 5, then each disk would write 25.6K, and even for sustained sequential writes, some degree of fragmentation would be impossible to avoid. Actually, I don't think fragmentation is technically the correct term for that behavior. It might be more appropriate to simply say it forces a less-than-100% duty cycle. And another thing ... Doesn't the checksum take up some space anyway? Even if you obeyed the BPG and used ... let's say ... 4 disks for N ... then each disk has 32K of data to write, which is a multiple of 4K and 512b ... but each disk also needs to write the checksum. So each disk writes 32K + a few bytes. Which defeats the whole purpose anyway, doesn't it? The effect, if real at all, might be negligible. I don't know how small it is, but I'm quite certain it's not huge.
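The quoted rule reduces to simple arithmetic (a sketch only; whether the alignment effect matters in practice is exactly what's being questioned above):

```python
# The BPG rule quoted above: vdev width = 2**n + parity.

def optimal_widths(parity, max_n=3):
    """Vdev sizes the rule considers optimal for a given raidz level."""
    return [2**n + parity for n in range(1, max_n + 1)]

assert optimal_widths(1) == [3, 5, 9]    # raidz1
assert optimal_widths(2) == [4, 6, 10]   # raidz2
assert optimal_widths(3) == [5, 7, 11]   # raidz3

# Why it might matter: a 128K slab split across the data disks should
# come out as a multiple of the 512-byte (or 4K) sector size.
assert 128 * 1024 / 4 == 32768.0   # 4 data disks -> 32K each, aligned
assert 128 * 1024 / 5 == 26214.4   # 5 data disks -> 25.6K each, unaligned
```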
Re: [zfs-discuss] Suggested RaidZ configuration...
From: pantz...@gmail.com [mailto:pantz...@gmail.com] On Behalf Of Mattias Pantzare It is about 1 vdev with 12 disks or 2 vdevs with 6 disks. If you have 2 vdevs you have to read half the data compared to 1 vdev to resilver a disk. Let's suppose you have 1T of data. You have a 12-disk raidz2. So you have approx 100G on each disk, and you replace one disk. Then 11 disks will each read 100G, and the new disk will write 100G. Let's suppose you have 1T of data. You have 2 vdev's that are each 6-disk raidz1. Then we'll estimate 500G is on each vdev, so each disk has approx 100G. You replace a disk. Then 5 disks will each read 100G, and 1 disk will write 100G. Both of the above situations resilver in equal time, unless there is a bus bottleneck. 21 disks in a single raidz3 will resilver just as fast as 7 disks in a raidz1, as long as you are avoiding the bus bottleneck. But 21 disks in a single raidz3 provides better redundancy than 3 vdev's each containing a 7-disk raidz1. In my personal experience, approx 5 disks can max out approx 1 bus. (It actually ranges from 2 to 7 disks, if you have an imbalance of cheap disks on a good bus, or good disks on a crap bus, but generally speaking people don't do that. Generally people get a good bus for good disks, and a cheap bus for cheap disks, so approx 5 disks max out approx 1 bus.) In my personal experience, servers are generally built with a separate bus for approx every 5-7 disk slots. So what it really comes down to is ... Instead of the Best Practices Guide saying "Don't put more than ___ disks into a single vdev," the BPG should say "Avoid the bus bandwidth bottleneck by constructing your vdev's using physical disks which are distributed across multiple buses, as necessary per the speed of your disks and buses."
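A sketch of that arithmetic, with the thread's round numbers (parity counted the way the examples above count it):

```python
# Per-disk resilver traffic for a pool: each vdev's share of the data,
# spread across that vdev's data (non-parity) disks.

def per_disk_gb(total_gb, disks_per_vdev, parity, nvdevs):
    return (total_gb / nvdevs) / (disks_per_vdev - parity)

# 1T of data: one 12-disk raidz2 vs. two 6-disk raidz1 vdevs.
assert per_disk_gb(1000, 12, 2, 1) == 100.0  # ~100G on each of 12 disks
assert per_disk_gb(1000, 6, 1, 2) == 100.0   # ~100G on each disk per vdev

# Equal per-disk traffic -> equal resilver time, absent a bus bottleneck.
```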
Re: [zfs-discuss] Solaris 10u9
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of David Magda The 9/10 Update appears to have been released. Some of the more noticeable ZFS stuff that made it in: More at: http://docs.sun.com/app/docs/doc/821-1840/gijtg Awesome! Thank you. :-) Log device removal in particular, I feel is very important. (Got bit by that one.) Now when is dedup going to be ready? ;-)
Re: [zfs-discuss] Suggested RaidZ configuration...
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of hatish I have just read the Best Practices guide, and it says your group shouldn't have 9 disks. I think the value you can take from this is: Why does the BPG say that? What is the reasoning behind it? Anything that is a rule of thumb either has reasoning behind it (you should know the reasoning) or it doesn't (you should ignore the rule of thumb and dismiss it as myth).
Re: [zfs-discuss] Suggested RaidZ configuration...
On Tue, Sep 7, 2010 at 4:59 PM, Edward Ned Harvey sh...@nedharvey.com wrote: I think the value you can take from this is: Why does the BPG say that? What is the reasoning behind it? Anything that is a rule of thumb either has reasoning behind it (you should know the reasoning) or it doesn't (you should ignore the rule of thumb, dismiss it as myth.) Let's examine the myth that you should limit the number of drives in a vdev because of resilver time. The myth goes something like this: You shouldn't use more than ___ drives in a vdev raidz_ configuration, because all the drives need to read during a resilver, so the more drives are present, the longer the resilver time. The truth of the matter is: Only the size of used data is read. Because this is ZFS, it's smarter than a hardware solution which would have to read all disks in their entirety. In ZFS, if you have a 6-disk raidz1 with capacity of 5 disks, and a total of 50G of data, then each disk has roughly 10G of data in it. During resilver, 5 disks will each read 10G of data, and 10G of data will be written to the new disk. If you have a 11-disk raidz1 with capacity of 10 disks, then each disk has roughly 5G of data. 10 disks will each read 5G of data, and 5G of data will be written to the new disk. If anything, more disks means a faster resilver, because you're more easily able to saturate the bus, and you have a smaller amount of data that needs to be written to the replaced disk. Let's examine the myth that you should limit the number of disks for the sake of redundancy. It is true that a carefully crafted system can survive things like SCSI controller or tray failure. Suppose you have 3 scsi cards. Suppose you construct a raidz2 device using 2 disks from controller 0, 2 disks from controller 1, and 2 disks from controller 2. Then if a controller dies, you have only lost 2 disks, and you are degraded but still functional as long as you don't lose another disk. 
But you said you have 20 disks all connected to a single controller. So none of that matters in your case. Personally, I can't imagine any good reason to generalize don't use more than ___ devices in a vdev. To me, a 12-disk raidz2 is just as likely to fail as a 6-disk raidz1. But a 12-disk raidz2 is slightly more reliable than having two 6-disk raidz1's. Perhaps, maybe, a 64bit processor is able to calculate parity on an 8-disk raidz set in a single operation, but requires additional operations to calculate parity if your raidz has 9 or more disks in it ... But I am highly skeptical of this line of reasoning, and AFAIK, nobody has ever suggested this before me. I made it up just now. I'm grasping at straws and stretching my imagination to find *any* merit in the statement, don't use more than ___ disks in a vdev. I see no reasoning behind it, and unless somebody can say anything to support it, I think it's bunk.
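The resilver arithmetic from earlier in this message, as a quick sketch: with a fixed amount of data, a wider raidz leaves *less* data per disk, not more.

```python
# Rough data stored per disk: total data spread over the data disks
# (vdev width minus parity disks).

def gb_per_disk(total_gb, ndisks, parity):
    return total_gb / (ndisks - parity)

assert gb_per_disk(50, 6, 1) == 10.0   # 6-disk raidz1: ~10G on each disk
assert gb_per_disk(50, 11, 1) == 5.0   # 11-disk raidz1: ~5G on each disk
```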
Re: [zfs-discuss] zpool question
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of bear [b]Short Version[/b] I used zpool add instead of zpool replace while trying to move drives from an si3124 controller card. I can back up the data to other drives and destroy the pool, but would prefer not to since it involved around 4 TB of data and will take forever. [b]zpool add mypool c4t2d0[/b] instead of [b]zpool replace mypool c2t1d0 c4t2d0[/b] Yeah ... Unfortunately, you cannot remove a vdev from a pool once it's been added. So ... Temporarily, in order to get c4t2d0 back into your control for other purposes, you could create a sparse file somewhere, and replace this device with the sparse file. This should be very fast, and should not hurt performance, as long as you haven't written any significant amount of data to the pool since adding that device, and won't be writing anything significant until after all is said and done. Don't create the sparse file inside the pool. Create the sparse file somewhere in rpool, so you don't have a gridlock mount order problem. Rather than replacing each device one-by-one, I might suggest creating a new raidz2 on the new hardware, and then use zfs send | zfs receive to replicate the contents of the first raid set to the 2nd raid set... Then, just destroy (or export, or unmount) the first raid set, while changing the mountpoint of the 2nd raid set. (And export/import or unmount/mount.) Since you have data that's mostly not changing, the send/receive method should be extremely efficient. You do one send/receive, and you don't even have to follow up with any incrementals later...
Re: [zfs-discuss] zfs set readonly=on does not entirely go into read-only mode
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Ian Collins However writes to already opened files are allowed. Think of this from the perspective of an application. How would write failure be reported? Both very good points. But I agree with Robert. write() has a known failure mode when the disk is full. I agree bad things can happen to applications that attempt write() when the disk is full ... however ... Only a user with root privs is able to set the readonly property. I expect the root user is doing this for a reason: willing and able to take responsibility for the consequences. The intuitive (generally expected) thing, when you're root and you make a filesystem readonly, is that it becomes readonly. If that is not the behavior ... Well, I can think of at least one really specific, important example problem. Suppose an application writes to a file infinitely. Fills up the filesystem. This is a known bad thing for ZFS, sometimes causing unrecoverable infinite IO and forcing a power-cycle (I don't have a bug # but see here: http://opensolaris.org/jive/thread.jspa?threadID=132383&tstart=0 ) ... If you find yourself in the infinite IO, would-be-forced to power cycle situation, the workaround is to reduce some reservation to free up space. Then you should be able to rm, destroy, and stop scrub. But if the application is still infinitely writing to the open file handle that it already owns ... then any space you can free up will just get consumed again immediately by the bad application. Another specific example ... Suppose you zfs send from a primary server to a backup server. You want the filesystems to be readonly on the backup fileserver, in order to receive incrementals. If you make a mistake, and start writing to the backup server filesystem, you want to be able to correct your mistake.
Make it readonly, stop anything from writing to it, rollback to the unmodified snapshot, so you're able to receive incrementals again. If setting readonly doesn't stop open filehandles from writing ... What can you do? You either have to flex your brain muscle to figure out some technique to find which application is performing writes (not always easy to do), or you basically have to unmount and remount the filesystem to force writes to stop, which might not be easy to do, because filehandles are in use. You might feel the need to simply reboot, instead of figuring out a way to do all this. You just complain to your colleagues and say "yeah, the stupid thing made me reboot in order to make the filesystem readonly."
Re: [zfs-discuss] zfs set readonly=on does not entirely go into read-only mode
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Ian Collins so it should behave in the same way as an unmount in the presence of open files. +1 You can unmount lazy, or force, or by default, the unmount fails in the presence of open files. (I think.) So to keep everybody happy, let people do whatever they want. ;-) Setting the readonly property should fail in the presence of open files, or you can force it, which would truly sweep the rug out from under the writing processes. And if the developer(s) are feeling ambitious, implement lazy too. ;-)
Re: [zfs-discuss] zfs set readonly=on does not entirely go into read-only mode
From: Ian Collins [mailto:i...@ianshome.com] On 08/28/10 12:45 PM, Edward Ned Harvey wrote: Another specific example ... Suppose you zfs send from a primary server to a backup server. You want the filesystems to be readonly on the backup fileserver, in order to receive incrementals. If you make a mistake, and start writing to the backup server filesystem, you want to be able to correct your mistake. Make it readonly, stop anything from writing to it, rollback to the unmodified snapshot, so you're able to receive incrementals again. I think you have lost a not in there somewhere! Didn't miss any not, but it may not have been written clearly. If you *intended* to set the destination filesystem readonly before, and you only discovered it's not readonly later, evident by the fact that something wrote to it and now you can't receive incremental zfs snapshots... Then you want to correct your mistake. Whatever was writing to the backup fileserver, it shouldn't have been. So set the filesystem readonly, rollback to the latest snapshot that corresponds to the primary server, so you can again start receiving incrementals.
Re: [zfs-discuss] ZFS offline ZIL corruption not detected
From: Neil Perrin [mailto:neil.per...@oracle.com] Hmm, I need to check, but if we get a checksum mismatch then I don't think we try other mirror(s). This is automatic for the 'main pool', but of course the ZIL code is different by necessity. This problem can of course be fixed. (It will be a week and a bit before I can report back on this, as I'm on vacation). Thanks... If indeed that is the behavior, then I would conclude:
* Call it a bug. It needs a bug fix.
* Prior to log device removal (zpool version 19), it is critical to mirror log devices.
* After the introduction of log device removal, but before this bug fix is available, it is pointless to mirror log devices.
* After this bug fix is introduced, it is again recommended to mirror slogs.
Re: [zfs-discuss] ZFS offline ZIL corruption not detected
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of StorageConcepts So I would say there are 2 bugs / missing features in this: 1) the ZIL needs to report truncated transactions on ZIL corruption 2) the ZIL should use a mirrored counterpart to recover from bad block checksums Add to that: During scrubs, perform some reads on log devices (even if there's nothing to read). In fact, during scrubs, perform some reads on every device (even if it's actually empty.) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS offline ZIL corruption not detected
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Neil Perrin This is a consequence of the design for performance of the ZIL code. Intent log blocks are dynamically allocated and chained together. When reading the intent log we read each block and checksum it with the embedded checksum within the same block. If we can't read a block due to an IO error then that is reported, but if the checksum does not match then we assume it's the end of the intent log chain. Using this design means we the minimum number of writes to add write an intent log record is just one write. So corruption of an intent log is not going to generate any errors. I didn't know that. Very interesting. This raises another question ... It's commonly stated, that even with log device removal supported, the most common failure mode for an SSD is to blindly write without reporting any errors, and only detect that the device is failed upon read. So ... If an SSD is in this failure mode, you won't detect it? At bootup, the checksum will simply mismatch, and we'll chug along forward, having lost the data ... (nothing can prevent that) ... but we don't know that we've lost data? Worse yet ... In preparation for the above SSD failure mode, it's commonly recommended to still mirror your log device, even if you have log device removal. If you have a mirror, and the data on each half of the mirror doesn't match each other (one device failed, and the other device is good) ... Do you read the data from *both* sides of the mirror, in order to discover the corrupted log device, and correctly move forward without data loss? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
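The design Neil describes can be modeled in a few lines. This is an illustrative sketch, not ZFS code: each record carries an embedded checksum, and replay treats the first checksum mismatch as end-of-log, so a corrupted record is silently indistinguishable from the true end of the chain:

```python
# Toy model of a chained intent log with embedded per-record checksums.
# Replay walks the chain and stops at the first checksum mismatch, raising
# no error -- exactly why log corruption "is not going to generate any errors".
import zlib

def make_record(payload: bytes) -> bytes:
    # 4-byte CRC32 checksum prepended to the payload
    return zlib.crc32(payload).to_bytes(4, "big") + payload

def replay(log: list) -> list:
    recovered = []
    for rec in log:
        cksum, payload = int.from_bytes(rec[:4], "big"), rec[4:]
        if zlib.crc32(payload) != cksum:
            break  # assumed to be the end of the chain -- no error is raised
        recovered.append(payload)
    return recovered

log = [make_record(b"write-1"), make_record(b"write-2"), make_record(b"write-3")]
log[1] = log[1][:4] + b"XXXXXXX"  # corrupt the middle record's payload

print(replay(log))  # only write-1 survives; write-2 and write-3 are silently lost
```

The trade-off is clear from the sketch: the embedded checksum lets each record be committed in a single write, at the cost of making corruption and end-of-log indistinguishable at replay time.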
Re: [zfs-discuss] ZFS Storage server hardware
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Dr. Martin Mundschenk devices attached. Unfortunately the USB and sometimes the FW devices just die, causing the whole system to stall, forcing me to do a hard reboot. Well, I wonder what components one would use to build a stable system without an enterprise solution: eSATA, USB, FireWire, FibreChannel? There is no such thing as a reliable external disk. Not unless you want to pay $1000 each, which is dumb. You have to scrap your mini, and use internal (or hotswappable) disks. Never expect a mini to be reliable. They're designed to be small and cute. Not reliable. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] dedup and handling corruptions - impossible?
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of devsk If dedup is ON and the pool develops a corruption in a file, I can never fix it because when I try to copy the correct file on top of the corrupt file, the block hash will match with the existing blocks and only reference count will be updated. The only way to fix it is to delete all snapshots (to remove all references) and then delete the file and then copy the valid file. This is a pretty high cost if it is so (empirical evidence so far, I don't know internal details). Um ... If dedup is on, and a file develops corruption, the original has developed corruption too. It was probably corrupt before it was copied. This is what zfs checksumming and mirrors/redundancy are for. If you have ZFS, and redundancy, this won't happen. (Unless you have failing ram/cpu/etc) If you have *any* filesystem without redundancy, and this happens, you should stop trying to re-copy the file, and instead throw away your disk and restore from backup. If you run without redundancy, and without backup, you got what you asked for. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
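The mechanics behind the original question are easy to see in a toy model. This sketch is not ZFS internals, just an illustration of content-hash dedup: writing data whose hash already exists only bumps a reference count, so re-copying a "good" identical file cannot replace the single stored (possibly bad) copy of its blocks:

```python
# Toy model of block-level dedup (illustrative, not ZFS internals):
# blocks are stored by content hash. A write whose hash is already present
# only increments a reference count -- the on-disk block is not rewritten.
# This is why overwriting a corrupt file with an identical fresh copy does
# not heal anything: the incoming write dedups against the existing block.
import hashlib

store = {}    # content hash -> block data
refcnt = {}   # content hash -> reference count

def write_block(data: bytes) -> str:
    h = hashlib.sha256(data).hexdigest()
    if h in store:
        refcnt[h] += 1       # dedup hit: no new data is written
    else:
        store[h] = data
        refcnt[h] = 1
    return h

h1 = write_block(b"file contents")
h2 = write_block(b"file contents")   # "re-copying the file": just a refcount bump
print(h1 == h2, refcnt[h1])          # True 2
```

Which is exactly why the reply points at redundancy: the protection against a bad deduped block has to come from checksums plus a second physical copy, not from rewriting the data.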
Re: [zfs-discuss] dedup and handling corruptions - impossible?
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of devsk What do you mean original? dedup creates only one copy of the file blocks. The file was not corrupt when it was copied 3 months ago. Please describe the problem. If you copied the file 3 months ago, and the new and old copies are both referencing the same blocks on disk thanks to dedup, and the new copy has become corrupt, then the original has also become corrupt. In the OP, you seem to imply that the original is not corrupt, but the new copy is corrupt, and you can't fix the new copy by overwriting it with a fresh copy of the original. This makes no sense. If you have ZFS, and redundancy, this won't happen. (Unless you have failing ram/cpu/etc) You are saying ZFS will detect and rectify this kind of corruption in a deduped pool automatically if enough redundancy is present? Can that fail sometimes? Under what conditions? I'm saying ZFS checksums every block on disk, read or written, and if any checksum mismatches, then ZFS automatically checks the other copy ... from the other disk in the mirror, or reconstructed from the redundancy in raid, or whatever. By having redundancy, ZFS will automatically correct any checksum mismatches it encounters. If a checksum is mismatched on *both* sides of the mirror, it means either (a) both disks went bad at the same time, which is unlikely but has nonzero probability, or (b) there's faulty ram or cpu or some other single point of failure in the system. I raised a technical question and you are going all personal on me. Woah. Where did that come from??? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
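The self-healing read being described can be sketched concretely. Again this is an illustration of the idea, not ZFS code: the block's expected checksum is known independently, a mismatch on one side of the mirror triggers a read of the other side, and the bad copy is repaired from the good one:

```python
# Sketch of a self-healing mirrored read: verify each copy against a
# known-good checksum, fall back to the other side on a mismatch, and
# repair the bad copy in place. Only if *both* copies fail is the read lost.
import zlib

def read_with_repair(mirror_a: dict, mirror_b: dict, addr: int, cksum: int) -> bytes:
    for primary, other in ((mirror_a, mirror_b), (mirror_b, mirror_a)):
        data = primary[addr]
        if zlib.crc32(data) == cksum:
            if zlib.crc32(other[addr]) != cksum:
                other[addr] = data  # heal the corrupt copy from the good one
            return data
    raise IOError("both copies failed checksum")  # case (a)/(b) in the text above

good = b"block payload"
cksum = zlib.crc32(good)
side_a = {0: b"garbage data!"}   # side A has silently gone bad
side_b = {0: good}               # side B is still good

print(read_with_repair(side_a, side_b, 0, cksum))  # returns the good payload
print(side_a[0] == good)                           # True: side A was repaired
```

The double-failure branch at the end corresponds to the reply's cases (a) and (b): both disks bad at once, or a single point of failure such as bad RAM corrupting data before it was ever checksummed.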
Re: [zfs-discuss] ZFS development moving behind closed doors
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Eric D. Mudama On Sat, Aug 21 at 4:13, Orvar Korvar wrote: And by the way: Wasn't there a comment from Linus Torvalds recently that people should move their low-quality code into the codebase ??? ;) Does anyone know the link? Good against the Linux fanboys. :o) Can't find the original reference, but I believe he was arguing that by moving code into the kernel and marking it as experimental, it's more likely to be tested and have the bugs worked out, than if it forever lives as patchsets. Given the test environment, can't say I can argue against that point of view. Besides defending the point of view (checking in experimental changes to an experimental area, to accelerate code review) ... which seems like a fair point of view ... Who finds it necessary to have ammunition against Linux fanboys? Linux is good in its own way. You got something against Linux? Just converse on the points of merit, and both you and they will reach the best conclusions you can, rather than pushing an agenda or encouraging unnecessary bias. Each OS is better in its own way. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS development moving behind closed doors
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Linder, Doug there are an awful lot of places that actively DO NOT want the latest and greatest, and for good reason. Agreed. Latest-greatest has its place, which is not 24/7 must-stay-up core servers. Each OS - sol10 vs osol (or more appropriately now ... something like fedora vs rhel) Each OS has its place. Each one satisfies different requirements. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 64-bit vs 32-bit applications
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Peter Jeremy My interpretation of those results is that you can't generalise: The only way to determine whether your application is faster in 32-bit or 64-bit mode is to test it. And your choice of algorithm is at least as important as whether it's 32-bit or 64-bit. Not just your choice of algorithm, but architecture. Consider the dramatic architecture difference between Intel and AMD. Though they may have the same instruction set (within reason), the internal circuits to process those instructions are dramatically different, and hence, performance is dramatically different. Intel might be 4x faster at some instruction, while AMD is 4x faster at some other instruction. The same dramatic difference is present for 32 vs 64. As soon as you change the mode of your CPU, the architecture of the chip might as well be totally different. If you want to optimize performance, you have to first be able to classify your work load. If you cannot create a job which is truly typical of your work load, all bets are off. Don't even bother. For general computing, the more you spend, the faster it goes. Only if you have some task which will be repeated for long periods of time ... Then you can benefit by trying this CPU, or that CPU, or this mode, or that mode, or this chipset, or tweaking the compile flags, etc. If you have one task which is faster in 32-bit mode, it's not representative of 32 vs 64 in general. And vice-versa. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS in Linux (was Opensolaris is apparently dead)
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Joerg Schilling 1) The OpenSource definition http://www.opensource.org/docs/definition.php section 9 makes it very clear that an OSS license must not restrict other software and must not prevent bundling different works under different licenses on one medium. 2) Given the fact that the GPL is an approved OSS license, it obviously complies with the OSS definition. Even if there is a compatibility problem between GPL and ZFS, it's all but irrelevant. Because the Linux kernel can load modules which aren't required to be GPL. If they're compiled as modules, separately from the kernel, then there's no argument over derived work or anything like that ... All you would need is a /boot partition, where the kernel is able to load the ZFS modules, and then you're home free. Much as we do today, with grub loading the solaris kernel, and then the solaris kernel using the bootfs property to determine which ZFS filesystem to mount as /. So even if there is a license compatibility problem, I think it's all but irrelevant. Because it's easily legally solvable, or avoidable. The reasons for ZFS not being in Linux must be more than just the license issue. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Solaris startup script location
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Alxen4 Disabling ZIL converts all synchronous calls to asynchronous, which makes ZFS acknowledge data before it has actually been written to stable storage, which in turn improves performance but might cause data corruption in case of a server crash. Is that correct? It is partially correct. With the ZIL disabled, you could lose up to 30 sec of writes, but it won't cause an inconsistent filesystem, or corrupt data. If you make a distinction between corrupt and lost data, then this is valuable for you to know: Disabling the ZIL can result in up to 30 sec of lost data, but not corrupt data. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Solaris startup script location
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Alxen4 For example I'm trying to use a ramdisk as a ZIL device (ramdiskadm ) Other people have already corrected you about ramdisk for log. It's already been said: use an SSD, or disable the ZIL completely. But this was not said: In many cases, you can gain a large performance increase by enabling the write-back cache of your ZFS server's raid controller card. You only want to do this if the card has a battery backup unit (BBU). The performance gain is *not* quite as good as using a nonvolatile log device, but certainly worth checking anyway. Because it's low cost, and doesn't consume slots... Also, if you get a log device, you want two of them, and mirror them. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Opensolaris is apparently dead
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Ethan Erchinger We've had a failed disk in a fully supported Sun system for over 3 weeks, Explorer data turned in, and been given the runaround forever. The 7000 series support is no better, possibly worse. That is really weird. What are you calling failed? If you're getting either a red blinking light, or a checksum failure on a device in a zpool... You should get your replacement with no trouble. I have had wonderful support, up to and including recently, on my Sun hardware. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Opensolaris is apparently dead
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Garrett D'Amore interpretation. Since it no longer is relevant to the topic of the list, can we please either take the discussion offline, or agree to just let the topic die (on the basis that there cannot be an authoritative answer until there is some case law upon which to base it)? Compatibility of ZFS and Linux, as well as the future development of ZFS, and the health and future of OpenSolaris / Solaris, Oracle, Sun ... are definitely relevant to this list. People are allowed to conjecture. If you don't have interest in a thread, just ignore the thread. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 64-bit vs 32-bit applications
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Will Murnane I am surprised by the performance of some 64-bit multi-threaded applications on my AMD Opteron machine. For most of the applications, the performance of the 32-bit version is almost the same as the performance of the 64-bit version. However, for a couple of applications, 32-bit versions This list discusses the ZFS filesystem. Perhaps you'd be better off posting to perf-discuss or tools-gcc? That said, you need to provide more information. What compiler and flags did you use? What does your program (broadly speaking) do? What did you measure to conclude that it's slower in 64-bit mode? Not only that, for most things the 32- vs 64-bit architectures are expected to perform about the same. The 64-bit architecture exists mostly for wider memory addressing, not for twice the performance. YMMV. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Opensolaris is apparently dead
From: Garrett D'Amore [mailto:garr...@nexenta.com] Sent: Sunday, August 15, 2010 8:17 PM (The only way I could see this changing would be if there was a sudden license change which would permit either ZFS to overtake btrfs in the Linux kernel, or permit btrfs to overtake zfs in the Solaris kernel. I Of course this has been discussed extensively, but I believe the reasons for ZFS not being in the Linux kernel go beyond just the license incompatibility. ZFS does raid, and mirroring, and resilvering, and partitioning, and NFS, and CIFS, and iSCSI, and device management via vdevs, and so on. So ZFS steps on a lot of Linux people's toes. They already have code to do this, or that; why should they kill off all these other projects, and turn the world upside down, and bow down and acknowledge that anyone else did anything better than what they did? No, they just want a copy-on-write filesystem, and nothing more. Something which more closely complies with the architecture model that they're already using. Something which doesn't hurt their ego when they accept it... And of course by "they" I'm mostly referring to Linus. And all the people who work on the kernel, ext fs, software raid, and all these other things which already exist in a more Linuxy way... ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
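The breadth being described is visible from the command line. Device names here are hypothetical, and `shareiscsi` reflects the OpenSolaris-era interface (later Solaris releases moved iSCSI to COMSTAR), but the sketch shows how many separate Linux subsystems ZFS covers in a handful of commands:

```shell
# Hypothetical device names; a sketch of the territory ZFS covers, each piece
# of which is a separate subsystem (md, lvm, nfsd, samba, ietd, ...) on Linux.
zpool create tank raidz2 c0t1d0 c0t2d0 c0t3d0 c0t4d0   # RAID + volume management
zfs create tank/home
zfs set sharenfs=on tank/home                          # NFS export
zfs set sharesmb=on tank/home                          # CIFS share
zfs create -V 100G tank/lun0                           # block volume (zvol)
zfs set shareiscsi=on tank/lun0                        # iSCSI target (pre-COMSTAR)
```

Each of those lines replaces a component that already has a maintainer and a community in the Linux world, which is the point of the paragraph above.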
Re: [zfs-discuss] Opensolaris is apparently dead
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of David Dyer-Bennet However, if Oracle makes a binary release of BTRFS-derived code, they must release the source as well; BTRFS is under the GPL. When a copyright holder releases something under GPL, it only means they've granted you and the rest of the world permission to use it according to the terms of GPL. The copyright holder always retains permission for themselves to redistribute in any form, under a different license if they want to. If you (Microsoft) are a developer of a proprietary product, and you want to link in some GPL library and keep it private and proprietary, you can attempt negotiations with the copyright holder, to get that code released to you for your purposes, under terms which are not GPL. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Opensolaris is apparently dead
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- Can someone provide a link to the requisite source files so that we can see the copyright statements? It may well be that Oracle assigned the copyright to some other party. BTRFS is inside the Linux kernel. Copyright (C) 1989, 1991 Free Software Foundation, Inc. There is no other copyright written in there (that I can find with grep), but the GPL does say something to contributors, which could blur the question of copyright ownership for contributions added by somebody outside the FSF: "it is not the intent of this section to claim rights or contest your rights to work written entirely by you" So maybe the contributor retains some rights to reproduce their work in other situations, under a different license. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Opensolaris is apparently dead
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Jerome Warnier Do not forget Btrfs is mainly developed by ... Oracle. Will it survive better than Free Solaris/ZFS? It's GPL, just as ZFS is CDDL. They cannot undo or revoke the free license they've granted to use and develop upon whatever they've released. ZFS is not dead, although it is yet to be seen if future development will be closed source. BTRFS is not dead, and cannot be any more dead than ZFS. So honestly ... your comment above ... really has no bearing in reality. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Opensolaris is apparently dead
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Bob Friesenhahn The $400 number is bogus since the amount that Oracle quotes now depends on the value of the hardware that the OS will run on. For my Using the same logic, if I said MS Office costs $140, that's a bogus number, because different vendors sell it at different prices. It's $450 for 1yr, or $1200 for 3yrs to buy solaris 10 with basic support on a dell server. It costs more with a higher level of support, and it costs less if you have a good relationship with Dell with a strong corporate discount, or if you buy it at the end of Dell's quarter, when they have the best sales going on. I don't know how much it costs at other vendors. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Opensolaris is apparently dead
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Tim Cook The cost discussion is ridiculous, period. $400 is a steal for support. You'll pay 3x or more for the same thing from Redhat or Novell. Actually, as a comparison with the message I sent 1 minute ago... in order to compare apples to apples ... [Solaris is] $450 for 1yr, or $1200 for 3yrs to buy solaris 10 with basic support on a dell server. It costs more with a higher level of support, and it costs less if you have a good relationship with Dell with a strong corporate discount, or if you buy it at the end of Dell's quarter, when they have the best sales going on. If you buy RHEL ES support with the same dell servers, the cost would be $350/yr for basic support. Plus or minus, based on AS and level of support and your relationship with Dell. Solaris costs more, but the ballpark is certainly the same. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS and VMware
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Edward Ned Harvey #3 I previously believed that vmfs3 was able to handle sparse files amazingly well, like, when you create a new vmdk, it appears almost instantly regardless of size, and I believed you could copy sparse vmdk's efficiently, not needing to read all the sparse consecutive zeroes. I was wrong. Correction: I was originally right. ;-) In ESXi, if you go to command line (which is busybox) then sparse copies are not efficient. If you go into vSphere, and browse the datastore, and copy vmdk files via gui, then it DOES copy efficiently. The behavior is the same, regardless of NFS vs iSCSI. You should always copy files via GUI. That's the lesson here. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Compress ratio???
From: cyril.pli...@gmail.com [mailto:cyril.pli...@gmail.com] On Behalf Of Cyril Plisko The compressratio shows you how much *real* data was compressed. The file in question, however, can be sparse file and have its size vastly different from what du says, even without compression. Ahhh. Thank you. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
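The sparse-file effect Cyril describes is easy to demonstrate. This sketch assumes a filesystem that supports holes (e.g. ext4, ZFS, tmpfs); the apparent size (what `ls -l` reports) can be vastly larger than the on-disk usage (what `du` reports), entirely independent of compression:

```python
# Create a 100 MB sparse file without writing any data, then compare its
# apparent size (st_size, what ls shows) to its on-disk usage (st_blocks,
# what du measures). On a hole-supporting filesystem these differ wildly.
import os
import tempfile

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.truncate(100 * 1024 * 1024)   # 100 MB apparent size, zero data written
    path = f.name

st = os.stat(path)
apparent = st.st_size               # what ls -l reports
on_disk = st.st_blocks * 512        # what du measures
print(apparent, on_disk)            # e.g. 104857600 0
os.unlink(path)
```

So a file can show a huge `ls` size and a tiny `du` size with compression turned off entirely, which is why compressratio only accounts for *real* data that was compressed.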
Re: [zfs-discuss] Opensolaris is apparently dead
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Russ Price For me, Solaris had zero mindshare since its beginning, on account of being prohibitively expensive. I hear that a lot, and I don't get it. $400/yr does generally keep it out of people's basements, and keeps sol10 out of enormous clustering facilities that don't have special purposes or free alternatives. But I wouldn't call it prohibitively expensive, for a whole lot of purposes. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Opensolaris is apparently dead
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Andrej Podzimek Or Btrfs. It may not be ready for production now, but it could become a serious alternative to ZFS in one year's time or so. (I have been using I will much sooner pay for sol11 than use btrfs. Stability, speed, and maturity greatly outweigh a few hundred dollars a year, if you run your business on it. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] one ZIL SLOG per zpool?
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Chris Twa My plan now is to buy the ssd's and do extensive testing. I want to focus my performance efforts on two zpools (7x146GB 15K U320 + 7x73GB 10k U320). I'd really like two ssd's for L2ARC (one ssd per zpool) and then slice the other two ssd's and then mirror the slices for SLOG (one mirrored slice per zpool). I'm worried that the ZILs won't be significantly faster than writing to disk. But I guess that's what testing is for. If the ZIL in this arrangement isn't beneficial then I can have four disks for L2ARC instead of two (or my wife and I get ssd's for our laptops). Remember that the ZIL is only for sync writes. So if you're not doing sync writes, there is no benefit to a dedicated log device. Also, for a lot of purposes, disabling the ZIL is actually viable. It's zero cost, and it guarantees absolute optimal performance on spindle disks. Nothing is faster. To quantify the risk, here's what you need to know: In the event of an ungraceful crash, up to 30 sec of async writes are lost. Period. But as long as you have not disabled the ZIL, the sync writes were not lost. If you have the ZIL disabled, then sync=async. Up to 30 sec of all writes are lost. Period. But there is no corruption or data written out-of-order. The end result is as if you halted the server suddenly, flushed all the buffers to disk, and then powered off. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
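Since the poster plans to test whether a slog helps, the workload that matters is easy to isolate. This is a hedged micro-benchmark sketch, not a rigorous tool: it contrasts buffered (async) writes with writes followed by `fsync()`, which is the synchronous path that the ZIL, a slog, or disabling the ZIL all affect. Absolute numbers depend entirely on the device and filesystem:

```python
# Feel the difference between async and sync writes: each "sync" iteration
# calls fsync(), forcing data to stable storage before the write is
# acknowledged -- the operation a slog exists to make fast. Async writes
# just land in the page cache and return immediately.
import os
import tempfile
import time

def timed_writes(n: int, sync: bool) -> float:
    fd, path = tempfile.mkstemp()
    start = time.perf_counter()
    for _ in range(n):
        os.write(fd, b"x" * 4096)
        if sync:
            os.fsync(fd)            # this is the cost the ZIL/slog absorbs
    elapsed = time.perf_counter() - start
    os.close(fd)
    os.unlink(path)
    return elapsed

print("async:", timed_writes(200, sync=False))
print("sync: ", timed_writes(200, sync=True))
```

On spinning disks the sync case is typically orders of magnitude slower, which is exactly the gap a fast slog (or a disabled ZIL, with the 30 sec risk described above) closes. If the two numbers are close on the target workload, a dedicated log device won't help.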