[zfs-discuss] How to disable ZIL and benchmark disk speed irresponsibly

2010-03-02 Thread Edward Ned Harvey
I have a system with a bunch of disks, and I'd like to know how much faster
it would be if I had an SSD for the ZIL; however, I don't have the SSD and I
don't want to buy one right now.  The reasons are complicated, but it's not
a cost barrier.  Naturally I can't do the benchmark right now...

But if I could create a RAM device, and use that for ZIL, of course it would
be irresponsible, but I have no data on the system yet, and this is all just
to establish the upper bound of what performance could be if I had the SSD.
After doing the benchmark, I would reformat the machine anyway.

Can I create a ram device and use it for the ZIL?
Can I somehow disable the ZIL in an irresponsible way, to establish the
upper bound for performance on my system?

Thanks...
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Any way to fix ZFS "sparse file bug" #6792701

2010-03-03 Thread Edward Ned Harvey
I don't know the answer to your question, but I am running the same version
of OS you are, and this bug could affect us.  Do you have any link to any
documentation about this bug?  I'd like to forward something to inform the
other admins at work.

 

 

 

From: zfs-discuss-boun...@opensolaris.org
[mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Robert Loper
Sent: Tuesday, March 02, 2010 12:09 PM
To: zfs-discuss@opensolaris.org
Subject: [zfs-discuss] Any way to fix ZFS "sparse file bug" #6792701

 

I have a Solaris x86 server running update 6 (Solaris 10 10/08
s10x_u6wos_07b X86).  I recently hit this "sparse file bug" when I deleted a
512GB sparse file from a 1.2TB filesystem and the space was never freed up.
What I am asking is would there be any way to recover the space in the
filesystem without having to destroy and recreate it?  I am assuming before
trying anything I would need to update the server to U8.

Thanks in advance...

-- 
Robert Loper
rlo...@gmail.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [osol-help] ZFS two way replication

2010-03-03 Thread Edward Ned Harvey
Sorry for double-post.  This thread was posted separately to
opensolaris-help and zfs-discuss.  So I'm replying to both lists.


> I'm wondering what the possibilities of two-way replication are for a
> ZFS storage pool.

Based on all the description you gave, I wouldn't call this two-way
replication.  Because two-way replication implies changes are happening at
both sides, that need to be propagated and merged both ways.  What you're
talking about is to have a one-way replication, and then later, reverse the
direction, and then later reverse the direction.  At all points, your
changes are only happening at one side.  This is infinitely easier than what
your subject implies.

Unless somebody knows more than I do ...  As for VM's, I think it's
important that you know you shouldn't expect live sync.  If you send a
snapshot of a local pool to the remote side, then 1 second later the remote
pool is already 1 second behind, and in a possibly different state than the
local pool.  If you fail the local pool, you'll have to connect to the
remote pool, and tell your VM to revert to whatever the latest snapshot was
on the remote pool.

Thinking about how to migrate back to the local storage ... 

Suppose you have a local snapshot called "t...@12345" and you send it to the
remote side.  That means these two filesystems are now in sync at that
snapshot.  If you can, make the local filesystem read-only as long as you're
making modifications on the remote side.  Then you can send an incremental
snap from remote "t...@12345" to "t...@23456" back to the local side.  Make
the remote side read-only while you're allowing modifications to the local
side, and vice-versa.
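
In command form, the round trip might look something like this.  (This is an
untested sketch; the pool name "tank", the remote target "backup/tank", and
the host name "remote" are all placeholders.)

# initial sync: full send of the baseline snapshot
zfs snapshot tank@12345
zfs send tank@12345 | ssh remote zfs receive backup/tank

# later, after working on the remote copy, send the changes back
ssh remote zfs snapshot backup/tank@23456
ssh remote zfs send -i backup/tank@12345 backup/tank@23456 | zfs receive tank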

The complication that could arise is when you want to send an incremental
snap to a filesystem which has been modified after the "baseline" snap was
taken.  This would happen if you synced the two sides, made modifications to
the local side, and then had an unintentional failure of the local pool.  You
would then connect your VM to the remote pool, and make modifications to the
remote pool.  When your local pool comes back online ... these two pools were
synced some time ago, but they've both been modified since.  You would have
to find a way to revert the local pool back to the snap where they were in
sync (I don't know how to do that; I don't know for sure it's even possible).
You would have to acknowledge risking the loss of local changes that happened
after the sync.  And you would send the remote incremental snap @23456 to the
local side.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [osol-discuss] Moving Storage to opensolaris+zfs. What about backup?

2010-03-04 Thread Edward Ned Harvey
> Is there any work on an upgrade of zfs send/receive to handle resuming
> on next media?

Please see Darren's post, pasted below.


> -Original Message-
> From: opensolaris-discuss-boun...@opensolaris.org [mailto:opensolaris-
> discuss-boun...@opensolaris.org] On Behalf Of Darren Mackay
>
> mkfifo /tmp/some_pipe
> zfs snapshot ...
> zfs send -I startsnap endsnap > /tmp/some_pipe
> 
> then something that is able to backup to tape from a pipe, and manage
> your tapes, etc - use what you are comfortable with, but not all backup
> apps are the same
> 
> note - if changing tapes takes too long, the pipe *may* actually time
> out on the send side if you are not careful...

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [osol-discuss] Moving Storage to opensolaris+zfs. What about backup?

2010-03-04 Thread Edward Ned Harvey
> Is there any work on an upgrade of zfs send/receive to handle resuming
> on next media?

See Darren's post, regarding mkfifo.  The purpose is to enable you to use
"normal" backup tools that support changing tapes, to backup your "zfs send"
to multiple split tapes.  I wonder though - During a restore, does the
backup tool support writing to another fifo?  Would you have to restore to a
file first, and then do a "cat somefile | zfs receive?"

Please also be aware, "zfs send" was never meant to be stored, on tape or
any other media.  It is meant to stream directly into a "zfs receive."
Please consider this alternative:

You can create a file container (can be sparse) and create a ZFS pool
inside it.  Do a "zfs receive" into that pool.  Then you export the
pool, and you can use whatever "normal" file backup tool you want.
This method has the disadvantage that it requires extra staging disk space,
but it has the advantage that it's far more reliable as a backup/restore
technique.  If there are bit errors inside the tape, ZFS will simply
checksum and correct them.
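
A rough sketch of that approach, with hypothetical names and sizes (the
container file must be given by absolute path):

# create a sparse 500 GB container file and build a pool inside it
mkfile -n 500g /stage/container.img
zpool create stagepool /stage/container.img

# receive the backup into the staging pool, then export it
zfs send tank/home@backup | zfs receive stagepool/home
zpool export stagepool

# /stage/container.img is now an ordinary file; back it up with any tool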

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] WriteBack versus SSD-ZIL

2010-03-05 Thread Edward Ned Harvey
In this email, when I say PERC, I really mean either a PERC, or any other
hardware WriteBack buffered raid controller with BBU.

 

For future server purchases, I want to know which is faster:  (a) A bunch of
hard disks with PERC and WriteBack enabled, or (b) A bunch of hard disks,
plus one SSD for ZIL.  Unfortunately, I don't have an SSD available for
testing.  So here is what I was able to do:

 

I measured the write speed of the naked disks (PERC set to WriteThrough).
Results were around 350 ops/sec.

I measured the write speed with the WriteBack enabled.  Results were around
1250 ops/sec.

 

So right from the start, we can see there's a huge performance boost by
enabling the WriteBack.  Even for large sequential writes, buffering allows
the disks to operate much more continuously.  The next question is how it
compares against the SSD ZIL.

 

Since I don't have an SSD available, I created a ram device and put the ZIL
there.  This is not a measure of the speed if I had an SSD; rather, it is a
measure which the SSD cannot possibly achieve.  So it serves to establish an
upper bound.  If the upper and lower bounds are near each other, then we
have a good estimate of the speed with SSD ZIL ... but if the upper bound and
lower bound are far apart, we haven't gotten much information.
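
For reference, a ramdisk slog can be attached roughly like this (this is a
sketch, not the exact commands I used; the name and size are placeholders):

# carve out a ramdisk and add it to the pool as a separate log device
ramdiskadm -a zilram 2g
zpool add tank log /dev/ramdisk/zilram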

 

With the ZIL in RAM, results were around 1700 ops/sec.

 

This measure is very far from the lower bound.  But it still serves to
provide some useful knowledge.  The take-home knowledge is:

- There's a lot of performance to be gained by accelerating the ZIL.
Potentially up to 6x or 7x faster than naked disks.

- The WriteBack raid controller achieves a lot of this performance
increase.  About 3x or 4x acceleration.

- I don't know how much an SSD would help.  I don't know if it's
better, the same, or worse than the PERC.  I don't know if the combination
of PERC and SSD together would go faster than either one individually.

 

I have a hypothesis.  I think the best configuration will be to use a PERC,
with WriteBack enabled on all the spindle hard drives, but include an SSD
for ZIL, and set the PERC for WriteThrough on the SSD.  This has yet to be
proven or disproven.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Monitoring my disk activity

2010-03-06 Thread Edward Ned Harvey
Recently, I'm benchmarking all kinds of stuff on my systems.  And one
question I can't intelligently answer is what blocksize I should use in
these tests.

 

I assume there is something which monitors present disk activity, that I
could run on my production servers, to give me some statistics of the block
sizes that the users are actually performing on the production server.  And
then I could use that information to make an informed decision about block
size to use while benchmarking.

 

Is there a man page I should read, to figure out how to monitor and get
statistics on my real life users' disk activity?

 

Thanks.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [osol-discuss] WriteBack versus SSD-ZIL

2010-03-06 Thread Edward Ned Harvey
>  From everything I've seen, an SSD wins simply because it's 20-100x the
> size. HBAs almost never have more than 512MB of cache, and even fancy
> SAN boxes generally have 1-2GB max. So, HBAs are subject to being
> overwhelmed with heavy I/O. The SSD ZIL has a much better chance of
> being able to weather a heavy I/O period without being filled. Thus,
> SSDs are better at "average" performance - they provide a relatively
> steady performance profile, whereas HBA cache is very spiky.

This is a really good point.  So you think I may actually get better
performance by disabling the WriteBack on all the spindle disks, and
enabling it on the SSD instead.  This is precisely the opposite of what I
was thinking.

I'm planning to publish some more results soon, but haven't gathered it all
yet.  But see these:
Just naked disks, no acceleration.
http://nedharvey.com/iozone/iozone-DellPE2970-32G-3-mirrors-striped-WriteThrough.txt
Same configuration as above, but WriteBack enabled.
http://nedharvey.com/iozone/iozone-DellPE2970-32G-3-mirrors-striped-WriteBack.txt
Same configuration as the naked disks, but a ramdrive was created for ZIL
http://nedharvey.com/iozone/iozone-DellPE2970-32G-3-mirrors-striped-ramZIL.txt
Using the ramdrive for ZIL, and also WriteBack enabled on PERC
http://nedharvey.com/iozone/iozone-DellPE2970-32G-3-mirrors-striped-WriteBack_and_ramZIL.txt

These results show that enabling WriteBack makes a huge performance
difference (3-4x higher) for writes, compared to the naked disks.  I don't
think it's
because an entire write operation fits into the HBA DRAM, or the HBA is
remaining un-saturated.  The PERC has 256M, but the test includes 8 threads
all simultaneously writing separate 4G files in various sized chunks and
patterns.  I think when the PERC ram is full of stuff queued for write to
disk, it's simply able to order and organize and optimize the write
operations to leverage the disks as much as possible.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool on sparse files

2010-03-06 Thread Edward Ned Harvey
> You are running into this bug:
> http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6929751
> Currently, building a pool from files is not fully supported.

I think Cindy and I interpreted the question differently.  If you want the
zpool inside a file to stay mounted while the system is running, and come up
again after reboot, then I think she's right.  You're running into that bug.

If you want to dismount your zpool, for the sake of backing it up to tape or
something like that ... and then you're seeing this error on reboot, I think
you need to export your filesystem before you do your backups or reboot.
Then when you want to mount it again, you just import it.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Monitoring my disk activity

2010-03-08 Thread Edward Ned Harvey
> It all depends on how they are connecting to the storage.  iSCSI, CIFS,
> NFS,
> database, rsync, ...?
> 
> The reason I say this is because ZFS will coalesce writes, so just
> looking at
> iostat data (ops versus size) will not be appropriate.  You need to
> look at the
> data flowing between ZFS and the users. fsstat works for file systems,
> but
> won't work for zvols, as an example.
>  -- richard

Actually, maybe that is right.  Since the users are connecting via CIFS and
NFS and ssh to use a ZFS volume, it stands to reason that ZFS is ultimately
the thing which is performing all the read/write operations on the physical
disks, right?  So if I use iostat, and I see coalesced data ... I get
statistics on ops and size ... which is truly the real world usage scenario
for my system, right?  Thus, when I am trying to optimize my disk
configuration, benchmarking with iozone or whatever, those statistics will
be the best measurement for me to use, when I tell iozone the blocksize it
should test.  Right?
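
A few starting points, assuming a pool named "tank" (see fsstat(1M),
iostat(1M), and zpool(1M) for the details):

# per-filesystem-type activity, as seen at the VFS layer
fsstat zfs 5

# per-device activity, as seen at the disks
iostat -xn 5

# per-vdev activity within the pool
zpool iostat -v tank 5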

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] terrible ZFS performance compared to UFS on ramdisk (70% drop)

2010-03-08 Thread Edward Ned Harvey
I don't have an answer to this question, but I can say, I've seen a similar
surprising result.  I ran iozone on various raid configurations of spindle
disks ... and on a ramdisk.  I was surprised to see the ramdisk is only about
50% to 200% faster than the next best competitor in each category ... I don't
have any good explanation for that, but I didn't question it too hard.  I
accepted the results for what they are ... the ramdisk performs surprisingly
poorly for some unknown reason.

 

 

 

From: zfs-discuss-boun...@opensolaris.org
[mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Matt Cowger
Sent: Monday, March 08, 2010 8:58 PM
To: zfs-discuss@opensolaris.org
Subject: [zfs-discuss] terrible ZFS performance compared to UFS on ramdisk
(70% drop)

 

Hi Everyone,

 

It looks like I've got something weird going on with zfs performance on a
ramdisk ... ZFS is performing at not even a third of what UFS is doing.

 

Short version:

 

Create 80+ GB ramdisk (ramdiskadm), system has 96GB, so we aren't swapping

Create zpool on it (zpool create ram..)

Change zfs options to turn off checksumming (don't want it or need it),
atime, compression, 4K block size (this is the application's native
blocksize) etc.

Run a simple iozone benchmark (seq. write, seq. read, rndm write, rndm
read).
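
Roughly, the ZFS side is something like this (names and sizes approximate):

# 80 GB ramdisk, pool on top of it, then tune for the 4K workload
ramdiskadm -a rd1 80g
zpool create ram /dev/ramdisk/rd1
zfs set checksum=off ram
zfs set atime=off ram
zfs set compression=off ram
zfs set recordsize=4k ram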

 

Same deal for UFS, replacing the ZFS stuff with newfs stuff and mounting the
UFS forcedirectio (no point in using a buffer cache memory for something
that's already in memory)

 

Measure IOPs performance using iozone:

 

iozone  -e -i 0 -i 1 -i 2 -n 5120 -O -q 4k -r 4k -s 5g

 

With the ZFS filesystem I get around:

        seq write   seq read   random read   random write
ZFS     42360       31010      20953         32525

Not SOO bad, but here's UFS:

        seq write   seq read   random read   random write
UFS     42853       100761     100471        101141

 

For all tests besides the seq write, UFS utterly destroys ZFS.

 

I'm curious if anyone has any clever ideas on why this huge disparity in
performance exists.  At the end of the day, my application will run on
either filesystem, it just surprises me how much worse ZFS performs in this
(admittedly edge case) scenario.

 

--M

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] backup zpool to tape

2010-03-10 Thread Edward Ned Harvey
> In my case where I reboot the server I cannot get the pool to come
> back up. It shows UNAVAIL, I have tried to export before reboot and
> reimport it and have not been successful and I dont like this in the
> case a power issue of some sort happens. My other option was to mount
> using lofiadm however I cannot get it to mount on boot, so the same
> thing happens. Does anyone have any experience with backing up zpools
> to tape? Please any ideas would be greatly beneficial.

I have a similar setup.  "zfs send | ssh somehost 'zfs receive'" works
perfectly, and the 2nd host is attached to a tape library.  I'm running
Netbackup on the 2nd host because we could afford it, and then I have an
honest-to-goodness support channel.

But if you don't want to spend the money for netbackup, I've heard good
things about using Amanda or Bacula to get this stuff onto tape.

FWIW, since I hate tapes so much, there's one more thing I'm doing.  I use
external hard drives, attached to the 2nd server, and periodically "zfs send
| zfs receive" from the 2nd server main disks to the 2nd server removable
disks.  Then export the removable disks, and take 'em offsite in a backup
rotation with the tapes.
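
That step is essentially just the following (hypothetical pool and snapshot
names; an incremental send is shown, which is the usual case):

# copy the latest snapshot from the backup server's main pool to the
# removable pool
zfs send -i tank/data@lastweek tank/data@today | zfs receive offsite/data

# detach the removable pool so the disks can go offsite
zpool export offsite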

The advantage of the tapes is an official support channel, and much greater
archive life.  The advantage of the removable disks is that you need no
special software to do a restore, and you could just as easily restore a
single file or the whole filesystem.

So let's see...  I have two different types of offline backup for the backup
server, which itself is just a backup of the main server.  So I'm feeling
pretty well covered.  ;-)

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs send and receive ... any ideas for FEC?

2010-03-12 Thread Edward Ned Harvey
> I don't think retransmissions of b0rken packets is a problem anymore,
> most people
> use ssh which provides good error detection at a fine grain.  It is
> rare that one would
> need to resend an entire ZFS dump stream when using ssh (or TLS or ...)
> 
> Archival tape systems are already designed to wrap the payload with
> ECC.  For
> example, the T1b is rated at 30 year shelf life and 10^-19 BER -- 4
> orders of
> magnitude better than enterprise class HDDs and 5 orders of magnitude
> better
> than consumer class HDDs.
> 
> Also, many enterprise backup solutions have long-term tape management
> as part of
> their core feature set.
> 
> A well designed, enterprise class backup system should be easily
> capable of
> storing ZFS dump images reliably, safely, and securely.

Actually, this is a really good point.  We have some FEC experts at work,
and I asked them this question.  They couldn't name any software package
that does this, because it's always done in hardware.  After that
conversation, I was shocked to learn how ridiculously ubiquitous FEC is in
every device you use.  It's required in your hard drive, your flash drive,
your RAM, your motherboard, your Ethernet.  Everywhere.  And how strong you
build it is absolutely a design decision.  So for a commodity hard drive you
can expect a lower grade, while with an enterprise tape drive and tape media
you can rest assured it is much stronger.

Whenever a data stream gets corrupted, you can pretty well count on the idea
that another layer of FEC wasn't going to save you; if you get any errors at
all in your data, you're safe assuming the data is *dramatically* altered.

There was some good info about it in an article I happened to read
yesterday.  Read the section that says "Hard disks are unreliable":
http://arstechnica.com/microsoft/news/2010/03/why-new-hard-disks-might-not-be-much-fun-for-xp-users.ars



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] When to Scrub..... ZFS That Is

2010-03-13 Thread Edward Ned Harvey
In addition to backups on tape, I like to backup my ZFS to removable hard
disk.  (Created a ZFS filesystem on removable disk, and "zfs send | zfs
receive" onto the removable disk).  But since a single hard disk is so prone
to failure, I like to scrub my external disk regularly, just to verify the
drive has not begun failing.

 

 

 

From: zfs-discuss-boun...@opensolaris.org
[mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Tony MacDoodle
Sent: Saturday, March 13, 2010 3:30 PM
To: zfs-discuss
Subject: [zfs-discuss] When to Scrub..... ZFS That Is

 

When would it be necessary to scrub a ZFS filesystem?

We have many "rpool" and "datapool" pools, and a NAS 7130.  Would you
recommend scheduling monthly scrubs at off-peak hours, or is it really
necessary at all?

 

Thanks

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thoughts on ZFS Pool Backup Strategies

2010-03-17 Thread Edward Ned Harvey
> The one thing that I keep thinking, and which I have yet to see
> discredited, is that
> ZFS file systems use POSIX semantics.  So, unless you are using
> specific features
> (notably ACLs, as Paul Henson is), you should be able to backup those
> file systems
> using well known tools.  

This is correct.  Many people do backup using tar, star, rsync, etc.


> The Best Practices Guide is also very clear about send and receive NOT
> being
> designed explicitly for backup purposes.  I find it odd that so many
> people seem to
> want to force this point.  ZFS appears to have been designed to allow
> the use of
> well known tools that are available today to perform backups and
> restores.  I'm not
> sure how many people are actually using NFS v4 style ACLs, but those
> people have
> the most to worry about when it comes to using tar or NetBackup or
> Networker or
> Amanda or Bacula or star to backup ZFS file systems.  Everyone else,
> which appears
> to be the majority of people, have many tools to choose from, tools
> they've used
> for a long time in various environments on various platforms.  The
> learning curve
> doesn't appear to be as steep as most people seem to make it out to
> be.  I honestly
> think many people may be making this issue more complex than it needs
> to be.

I think what you're saying is:  Why bother trying to backup with "zfs send"
when the recommended practice, fully supportable, is to use other tools for
backup, such as tar, star, Amanda, bacula, etc.   Right?

The answer to this is very simple.
#1  "zfs send" is much faster.  Particularly for incrementals on large
numbers of files.
#2  "zfs send" will support every feature of the filesystem, including
things like filesystem properties, hard links, symlinks, and objects which
are not files, such as character special objects, fifo pipes, and so on.
Not to mention ACL's.  If you're considering some other tool (rsync, star,
etc), you have to read the man pages very carefully to formulate the exact
backup command, and there's no guarantee you'll find a perfect backup
command.  There is a certain amount of comfort knowing that the people who
wrote "zfs send" are the same people who wrote the filesystem.  It's simple,
and with no arguments, and no messing around with man page research, it's
guaranteed to make a perfect copy of the whole filesystem.

Did I mention fast?  ;-)  Prior to zfs, I backed up my file server via
rsync.  It's 1TB of mostly tiny files, and it ran for 10 hours every night,
plus 30 hours every weekend.  Now, I use zfs send, and it runs for an
average 7 minutes every night, depending on how much data changed that day,
and I don't know - 20 hours I guess - every month.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thoughts on ZFS Pool Backup Strategies

2010-03-17 Thread Edward Ned Harvey
> I think what you're saying is:  Why bother trying to backup with "zfs
> send"
> when the recommended practice, fully supportable, is to use other tools
> for
> backup, such as tar, star, Amanda, bacula, etc.   Right?
> 
> The answer to this is very simple.
> #1  ...
> #2  ...

Oh, one more thing.  "zfs send" is only discouraged if you plan to store the
data stream and do "zfs receive" at a later date.  

If instead, you are doing "zfs send | zfs receive" onto removable media, or
another server, where the data is immediately fed through "zfs receive" then
it's an entirely viable backup technique.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thoughts on ZFS Pool Backup Strategies

2010-03-17 Thread Edward Ned Harvey
> Why do we want to adapt "zfs send" to do something it was never
> intended
> to do, and probably won't be adapted to do (well, if at all) anytime
> soon instead of
> optimizing existing technologies for this use case?

The only time I see or hear of anyone using "zfs send" in a way it wasn't
intended is when people store the datastream on tape or a filesystem,
instead of feeding it directly into "zfs receive."

Although it's officially discouraged for this purpose, there is value in
doing so, and I can understand why some people sometimes (including myself)
would have interest in doing this.

So let's explore the reasons it's discouraged to store a "zfs send"
datastream:
#1  If a single bit goes bad, the whole dataset is bad.
#2  You can only receive the whole filesystem.  You cannot granularly
restore a single file or directory.

Now, if you acknowledge these two points, let's explore why somebody might
want to do it anyway:

To counter #1:
Let's acknowledge that storage media is pretty reliable.  We've all seen
tapes and disks go bad, but usually they don't.  If you've got a new tape
archive every week or every month...  The probability of *all* of those
tapes having one or more bad bits is astronomically low.  Nonzero risk, but
a calculated risk.

To counter #2:
There are two basic goals for backups.  (a) to restore some stuff upon
request, or (b) for the purposes of DR, to guarantee your manager that
you're able to get the company back into production quickly after a
disaster.  Such as the building burning down.

ZFS send to tape does not help you in situation (a).  So we can conclude
that "zfs send" to tape is not sufficient as an *only* backup technique.
You need something else, and at most, you might consider "zfs send" to tape
as an augmentation to your other backup technique.

Still ... If you're in situation (b) then you want as many options available
to you as possible.  I've helped many people and/or companies before, who:

- Had backup media, but didn't have the application that wrote the backup
media and therefore couldn't figure out how to restore.

- Had a backup system that was live synchronizing the master file server to
a slave file server, and then when something blew up the master, it
propagated and deleted the slave too.  In this case, the only thing that
saved them was an engineer who had copied the whole directory a week ago
onto his iPod, if you can believe that.

- Had backup tapes but no tape drive.

- Had archives on DVD, and the DVD's were nearly all bad.

- Looked through the backups only to discover something critical had been
accidentally excluded.

Point is, having as many options available as possible is worthwhile in the
disaster situation.

Please see below for some more info, as it ties into some more of what
you've said ...


> But I got it.  "zfs send" is fast.  Let me ask you this, Ed...where do
> you "zfs send"
> your data to? Another pool?  Does it go to tape eventually?  If so,
> what is the setup
> such that it goes to tape?  I apologize for asking here, as I'm sure
> you described it
> in one of the other threads I mentioned, but I'm not able to go digging
> in those
> threads at the moment.

Here is my backup strategy:

I use "zfs send | ssh somehost 'zfs receive'" to send nightly incrementals
to a secondary backup server.  
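
Concretely, each night this amounts to something like the following
(hypothetical dataset names, snapshot labels, and host name):

# snapshot tonight, then send the delta since last night to the backup server
zfs snapshot tank/home@2010-03-17
zfs send -i tank/home@2010-03-16 tank/home@2010-03-17 | \
    ssh backuphost zfs receive tank/home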

This way, if something goes wrong with the primary fileserver, I can simply
change the IP address of the secondary, and let it assume the role of the
primary.  With the unfortunate loss of all of today's data ... going back to
last night.  I have had to do this once before, in the face of primary
fileserver disaster and service contract SLA failure by Netapp...  All the
users were very pleased that I was able to get them back into production
using last night's data in less than a few minutes.

From the secondary server, I "zfs send | zfs receive" onto removable hard
disks.  This is ideal to restore either individual files, or the whole
filesystem.  No special tools would be necessary to restore on any random
ZFS server in the future, and nothing could be faster.  In fact, you
wouldn't even need to restore if you wanted to in a pinch, you could work
directly on the external disks.  However, removable disks are not very
reliable compared to tapes, and the disks are higher cost per GB, and
require more volume in the safe deposit box, so the external disk usage is
limited...  Only going back for 2-4 weeks of archive...

So there is also a need for tapes.  Once every so often, from the secondary
server, I "zfs send" the whole filesystem onto tape for archival purposes.
This would only be needed after a disaster, and also the failure or
overwriting of the removable disks.  We have so many levels of backups, this
is really unnecessary, but it makes me feel good.

And finally just because the data is worth millions of dollars, I also use
NetBackup to write tapes from the secondary server.  This way, nobody could
ever blame me if 

Re: [zfs-discuss] Thoughts on ZFS Pool Backup Strategies

2010-03-18 Thread Edward Ned Harvey
> My own stuff is intended to be backed up by a short-cut combination --
> zfs send/receive to an external drive, which I then rotate off-site (I
> have three of a suitable size).  However, the only way that actually
> works so far is to destroy the pool (not just the filesystem) and
> recreate it from scratch, and then do a full replication stream.  That
> works most of the time, hangs about 1/5.  Anything else I've tried is
> much worse, with hangs approaching 100%.

Interesting, that's precisely what we do at work, and it works 100% of the
time.  Solaris 10u8

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thoughts on ZFS Pool Backup Strategies

2010-03-18 Thread Edward Ned Harvey
> From what I've read so far, zfs send is a block level api and thus
> cannot be
> used for real backups. As a result of being block level oriented, the

Weirdo.  The above "cannot be used for real backups" is obviously
subjective, is incorrect and widely discussed here, so I just say "weirdo."
I'm tired of correcting this constantly.


> I invite erybody to join star development at:

We know, you have an axe to grind.  Don't insult some other product just
because it's not the one you personally work on.  Yours is better in some
ways, and "zfs send" is better in some ways.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thoughts on ZFS Pool Backup Strategies

2010-03-18 Thread Edward Ned Harvey
> > From what I've read so far, zfs send is a block level api and thus
> > cannot be
> > used for real backups. As a result of being block level oriented, the
> 
> Weirdo.  The above "cannot be used for real backups" is obviously
> subjective, is incorrect and widely discussed here, so I just say
> "weirdo."
> I'm tired of correcting this constantly.

I apologize if I was insulting, and it's clear that I was.  Seriously, I
apologize.  I should have thought about that more before I sent it, and I
should have been more considerate.

To clarify, more accurately, from a technical standpoint, what I meant:

There are circumstances, such as backup to removable disks or time-critical
incremental data streams, where incremental "zfs send" clearly outperforms
star, rsync, or any other file-based backup mechanism ... there are
circumstances where zfs send is enormously the winner.

There are other circumstances, such as writing to tape, where star or tar may
be the winner, and still other circumstances where rsync or other tools may
be ...  And I don't claim to know all the circumstances where something else
beats "zfs send."  There probably are many circumstances where some other
tool beats "zfs send" in some way.

The only point which I wish to emphasize is that it's not fair to say
unilaterally that one technique is always better than another technique.
Each one has their own pros/cons.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thoughts on ZFS Pool Backup Strategies

2010-03-19 Thread Edward Ned Harvey
> > ZFS+CIFS even provides
> > Windows Volume Shadow Services so that Windows users can do this on
> > their own.
> 
> I'll need to look into that, when I get a moment.  Not familiar with
> Windows Volume Shadow Services, but having people at home able to do
> this
> directly seems useful.

I'd like to spin off this discussion into a new thread.  Any replies to this
one will surely just get buried in the (many messages) in this very long
thread...

New thread:
ZFS+CIFS:  Volume Shadow Services, or Simple Symlink?

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS+CIFS: Volume Shadow Services, or Simple Symlink?

2010-03-19 Thread Edward Ned Harvey
> > ZFS+CIFS even provides
> > Windows Volume Shadow Services so that Windows users can do this on
> > their own.
> 
> I'll need to look into that, when I get a moment.  Not familiar with
> Windows Volume Shadow Services, but having people at home able to do
> this
> directly seems useful.

 

Even in a fully supported, all-MS environment, I've found the support for
"Previous Versions" is spotty and sort of unreliable at best.  Not to
mention, I think the user interface is just simply non-intuitive.

 

As an alternative, here's what I do:

ln -s .zfs/snapshot snapshots

 

Voila.  All Windows or Mac or Linux or whatever users are able to easily
access snapshots.

 

It's worth noting that in the default config of zfs-auto-snapshot, the snaps
are created with non-CIFS-compatible characters in the snapshot name (the ":"
colon character in the time).  So I also make it a habit during installation
to modify the zfs-auto-snapshot scripts and substitute that character.
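
The same substitution can also be done by hand on an existing snapshot, for
example (snapshot names are made up):

# rename a snapshot so the name contains no ":" characters
zfs rename tank/home@2010-03-19-14:15 tank/home@2010-03-19-14.15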

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thoughts on ZFS Pool Backup Strategies

2010-03-19 Thread Edward Ned Harvey
> > I'll say it again: neither 'zfs send' or (s)tar is an enterprise (or
> > even home) backup system on their own one or both can be components
> of
> > the full solution.

I would be pretty comfortable with a solution thusly designed:

#1  A small number of external disks, "zfs send" onto the disks and rotate
offsite.  Then, you're satisfying the ability to restore individual files,
but you're not satisfying the archivability and longevity of tapes.

#2  Also, "zfs send" onto tapes.  So if ever you needed something older than
your removable disks, it's someplace reliable, just not readily accessible
if you only want a subset of files.


> I'm in the fortunate position of having my backups less than the size
> of a
> large single drive; so I'm rotating three backup drives, and intend to

It's of course convenient if your backup fits entirely inside a single
removable disk, but that's not a requirement.  You could always use
removable stripesets, or raidz, or whatever you wanted.  For example, you
could build a raidz removable volume out of 5 removable disks if you wanted.
Just be sure you attach all 5 disks before you "zpool import"
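
For example (hypothetical device names):

# build the removable backup pool as a 5-disk raidz
zpool create backuppool raidz c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0

# export before taking the disks offsite, import when they come back
zpool export backuppool
zpool import backuppool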

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thoughts on ZFS Pool Backup Strategies

2010-03-19 Thread Edward Ned Harvey
> 1. NDMP for putting "zfs send" streams on tape over the network.  So

Tell me if I missed something here.  I don't think I did.  I think this
sounds like crazy talk.

I used NDMP up till November, when we replaced our NetApp with a Solaris Sun
box.  In NDMP, to choose the source files, we had the ability to browse the
fileserver, select files, and specify file matching patterns.  My point is:
NDMP is file based.  It doesn't allow you to spawn a process and backup a
data stream.

Unless I missed something.  Which I doubt.  ;-)


> To Ed Harvey:
> 
> Some questions about your use of NetBackup on your secondary server:
> 
> 1. Do you successfully backup ZVOLs?  We know NetBackup should be able
> to capture datasets (ZFS file systems) using straight POSIX semantics.

I wonder if I'm confused by that question.  "backup zvols" to me, would
imply something at a lower level than the filesystem.  No, we're not doing
that.  We just specify "backup the following directory and all of its
subdirectories."  Just like any other typical backup tool.

The reason we bought NetBackup is because it intelligently supports all the
permissions, ACL's, weird (non-file) file types, and so on.  And it
officially supports ZFS, and you can pay for an enterprise support contract.

Basically, I consider the purchase cost of NetBackup to be insurance.
Although I never plan to actually use it for anything, because all our bases
are covered by "zfs send" to hard disks and tapes.  I actually trust the
"zfs send" solution more, but I can't claim that I, or anything I've ever
done, is 100% infallible.  So I need a commercial solution too, just so I
can point my finger somewhere if needed.


> 2. What version of NetBackup are you using?

I could look it up, but I'd have to VPN in and open up a console, etc etc.
We bought it in November, so it's whatever was current 4-5 months ago.


> 3. You simply run the NetBackup agent locally on the (Open)Solaris
> server?

Yup.  We're doing no rocket science with it.  Ours is the absolute most
basic NetBackup setup you could possibly have.  We're not using 90% of the
features of NetBackup.  It's installed on a Solaris 10 server, with locally
attached tape library, and it does backups directly from local disk to local
tape.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS/OSOL/Firewire...

2010-03-19 Thread Edward Ned Harvey
> It would appear that the bus bandwidth is limited to about 10MB/sec
> (~80Mbps) which is well below the theoretical 400Mbps that 1394 is
> supposed to be able to handle.  I know that these two disks can go
> significantly higher since I was seeing 30MB/sec when they were used on
> Macs previously in the same daisy-chain configuration.

I have not done 1394 in Solaris or OpenSolaris.  But I have used it in
Windows, Mac, and Linux, many times for each one.  I have never had even the
remotest problem with it on any of these other platforms.  I consider it more
universally reliable than even USB, because occasionally I see a bad USB
driver on some boot CD or something, which can only drive USB at around
11 Mbit.  Again, I've never had anything but decent performance out of 1394.

Generally speaking, I use 1394 on:
Dell laptops
Lenovo laptops
Apple laptops
Apple XServe
HP laptops
... and maybe some dell servers...

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Q : recommendations for zpool configuration

2010-03-19 Thread Edward Ned Harvey
> I have noted there is now raidz2 and been thinking which would be
> better.
> A pool with 2 mirrors or one pool with 4 disks raidz2

If you use raidz2, made of 4 disks, you will have usable capacity of 2
disks, and you can tolerate any 2 disks failing.

If you use 2 mirrors, you will have a total of 4 disks and usable capacity
of 2 disks.  Your redundancy is not quite as good as above ... You could
survive a failed disk in the first mirror, and a failed disk in the second
mirror, but you could not survive two failed disks that are in the same
mirror.

If you use raidz2, your reliability might be a little bit higher.
If you use 2 mirrors, your performance will certainly be higher for random
IO operations.

So you must choose what you care about more:  Performance or reliability.
Both ways are good ways.
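
For the record, the two layouts being compared are (hypothetical device
names):

# 4-disk raidz2: capacity of 2 disks, survives any 2 disk failures
zpool create tank raidz2 c0t0d0 c0t1d0 c0t2d0 c0t3d0

# 2 striped mirrors: capacity of 2 disks, better random IOPS
zpool create tank mirror c0t0d0 c0t1d0 mirror c0t2d0 c0t3d0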

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Q : recommendations for zpool configuration

2010-03-19 Thread Edward Ned Harvey
> A pool with a 4-wide raidz2 is a completely nonsensical idea. It has
> the same amount of accessible storage as two striped mirrors. And would
> be slower in terms of IOPS, and be harder to upgrade in the future
> (you'd need to keep adding four drives for every expansion with raidz2
> - with mirrors you only need to add another two drives to the pool).
> 
> Just my $0.02

Here's my $0.04:

Suppose you had 4 disks, configured as 2 mirrors.  And you want to expand by
adding another mirror.  No problem.

Suppose you had 4 disks, configured as raidz2.  And you want to expand by
adding a mirror.  No problem.
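
Either way, the expansion itself is just one command (hypothetical devices):

# add another mirrored pair to the existing pool
zpool add tank mirror c0t4d0 c0t5d0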

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thoughts on ZFS Pool Backup Strategies

2010-03-20 Thread Edward Ned Harvey
> > I'll say it again: neither 'zfs send' or (s)tar is an
> > enterprise (or
> > even home) backup system on their own one or both can
> > be components of
> > the full solution.
> >
> 
> Up to a point. zfs send | zfs receive does make a very good back up
> scheme for the home user with a moderate amount of storage. Especially
> when the entire back up will fit on a single drive which I think  would
> cover the majority of home users.

I'll repeat:  There is nothing preventing you from creating an external
zpool using more than one disk.  Sure it's convenient when your whole backup
fits onto a single external disk, but not necessary.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thoughts on ZFS Pool Backup Strategies

2010-03-20 Thread Edward Ned Harvey
> 5+ years ago the variety of NDMP that was available with the
> combination of NetApp's OnTap and Veritas NetBackup did backups at the
> volume level.  When I needed to go to tape to recover a file that was
> no longer in snapshots, we had to find space on a NetApp to restore
> the volume.  It could not restore the volume to a Sun box, presumably
> because the contents of the backup used a data stream format that was
> proprietary to NetApp.
> 
> An expired Internet Draft for NDMPv4 says:
> 
>   butype_name
>      Specifies the name of the backup method to be used for the
>      transfer (dump, tar, cpio, etc). Backup types are NDMP Server
>      implementation dependent and MUST match one of the Data
>      Server implementation specific butype_name strings accessible
>      via the NDMP_CONFIG_GET_BUTYPE_INFO request.
> 
> http://www.ndmp.org/download/sdk_v4/draft-skardal-ndmp4-04.txt
> 
> It seems pretty clear from this that an NDMP data stream can contain
> most anything and is dependent on the device being backed up.

So it's clear that at least the folks at ndmp.org were/are thinking about
doing backups using techniques not necessarily based on filesystem.  But ...

Where's the implementation?  It doesn't do any good if there's just an RFC
written somewhere that all the backup tools ignore.  I was using BackupExec
11d with NDMP Option to backup my Netapp.  This setup certainly couldn't do
anything other than file selection.  I can't generalize and say "nothing
does," but ...  Does anything?  Does anything out there support
non-file-based backup via NDMP?

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Proposition of a new zpool property.

2010-03-21 Thread Edward Ned Harvey
> That would add unnecessary code to the ZFS layer for something that
> cron can handle in one line.

Actually ... Why should there be a ZFS property to share NFS, when you can
already do that with "share" and "dfstab?"  And still the zfs property
exists.

I think the proposed existence of a ZFS scrub period property makes just as
much sense.  Which is to say, I don't see any point to either one.  ;-)
Personally, I use dfstab, and cron.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Proposition of a new zpool property.

2010-03-21 Thread Edward Ned Harvey
> Most software introduced in Linux clearly violates the "UNIX
> philosophy".  

Hehehe, don't get me started on OSX.  ;-)  And for the love of all things
sacred, never say OSX is not UNIX.  I made that mistake once.  Which is not
to say I was proven wrong or anything - but it's apparently a subject that
people are oversensitive and emotional about.  Seriously out of control.
Avoid the subject.  Please.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thoughts on ZFS Pool Backup Strategies

2010-03-21 Thread Edward Ned Harvey
> > The only tool I'm aware of today that provides a copy of the data,
> and all of the ZPL metadata and all the ZFS dataset properties is 'zfs
> send'.
> 
> AFAIK, this is correct.
> Further, the only type of tool that can backup a pool is a tool like
> dd.

How is it different to backup a pool, versus to backup all the zfs
filesystems in a pool?  Obviously, dd is not viable for most situations as a
backup tool...  Is there any reason anyone would ever do it for ZFS, aside
from forensics?

Oh, I know one difference.  "dd" to backup the pool would also preserve all
the zfs snaps.  But if you want, you could have done that with "zfs send"
too.

So my question still stands:  What can you backup in a zpool, using dd, that
you can't backup via "zfs send?"

(As pointless as this may be, it's academic.)  ;-)

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS+CIFS: Volume Shadow Services, or Simple Symlink?

2010-03-21 Thread Edward Ned Harvey
> > ln -s .zfs/snapshot snapshots
> >
> > Voila.  All Windows or Mac or Linux or whatever users are able to
> easily access snapshots.
> 
> Clever.
> 
> Just one minor problem though, you've circumvented the reason why the
> "snapdir"
> property defaults to "hidden."  This probably won't affect clients that
> understand
> symlinks, but IIRC, Windows doesn't.  Hopefully, the Windows clients
> will not try
> to run backup programs themselves and will rely on server-side backups
> :-)

There's no problem with the symlink being followed by CIFS clients,
including windows.  It just appears as if it's a normal directory that you
can look inside.

If anybody were trying to backup the ZFS root directory via CIFS, and didn't
have the ability (or intelligence) to exclude the snapshots directory, sure
it could be a problem.  But I certainly know I don't have that problem.  I
promise you will never catch me creating backups of ZFS via CIFS.  ;-)

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Proposition of a new zpool property.

2010-03-21 Thread Edward Ned Harvey
> >Actually ... Why should there be a ZFS property to share NFS, when you
> can
> >already do that with "share" and "dfstab?"  And still the zfs property
> >exists.
> 
> Probably because it is easy to create new filesystems and clone them;
> as
> NFS only works per filesystem you need to edit dfstab every time when
> you
> add a filesystem.  With the nfs property, zfs create the NFS export,
> etc.

Either I'm missing something, or you are.

If I export /somedir and then I create a new zfs filesystem /somedir/foo/bar
then I don't have to mess around with dfstab, because it's a subdirectory of
an exported directory, it's already accessible via NFS.  So unless I
misunderstand what you're saying, you're wrong.  

This is the only situation I can imagine, where you would want to create a
ZFS filesystem and have it default to NFS exported.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Proposition of a new zpool property.

2010-03-22 Thread Edward Ned Harvey
> Does cron happen to know how many other scrubs are running, bogging
> down
> your IO system? If the scrub scheduling was integrated into zfs itself,

It doesn't need to.

Crontab entry:  /root/bin/scruball.sh

/root/bin/scruball.sh:
#!/usr/bin/bash
for filesystem in filesystem1 filesystem2 filesystem3 ; do 
  zfs scrub $filesystem
done


If you were talking about something else, for example, multiple machines all
scrubbing a SAN at the same time, then ZFS can't solve that any better than
cron, because it would require inter-machine communication to coordinate.  I
contend a shell script could actually handle that better than a built-in zfs
property anyway.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Proposition of a new zpool property.

2010-03-22 Thread Edward Ned Harvey
> no, it is not a subdirectory it is a filesystem mounted on top of the
> subdirectory.
> So unless you use NFSv4 with mirror mounts or an automounter other NFS
> version will show you contents of a directory and not a filesystem. It
> doesn't matter if it is a zfs or not.

Ok, I learned something here, that I want to share:

If you create a new zfs filesystem as a subdir of a zfs filesystem which is
exported via nfs and shared via cifs ...

The cifs clients see the contents of the child zfs filesystems.
But, as Robert said above, nfs clients do not see the contents of the child
zfs filesystem.

So, if you nest zfs filesystems inside each other (I don't) then the
sharenfs property of a parent can be inherited by a child, and if that's
your desired behavior, it's a cool feature.

For that matter, even if you do set the property, and you create a new child
filesystem with inheritance, that only means the server will auto-export the
filesystem.  It doesn't mean the client will auto-mount it, right?  So
what's the 2nd half of the solution?  Assuming you want the clients to see
the subdirectories as the server does.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Proposition of a new zpool property.

2010-03-22 Thread Edward Ned Harvey
> IIRC it's "zpool scrub", and last time I checked, the zpool command
> exited (with status 0) as soon as it had started the scrub. Your
> command
> would start _ALL_ scrubs in parallel as a result.

You're right.  I did that wrong.  Sorry 'bout that.

So either way, if there's a zfs property for scrub, that still doesn't
prevent multiple scrubs from running simultaneously.  So ...  Presently
there's no way to avoid the simultaneous scrubs either way, right?  You have
to home-cook scripts to detect which scrubs are running on which
filesystems, and serialize the scrubs.  With, or without the property.
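
A home-cooked serializer could look something like this sketch (pool names
are placeholders; it simply polls "zpool status" between scrubs):

#!/usr/bin/bash
# scrub pools one at a time, waiting for each scrub to finish
for pool in pool1 pool2 pool3 ; do
  zpool scrub $pool
  while zpool status $pool | grep "scrub in progress" > /dev/null ; do
    sleep 300
  done
done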

Don't get me wrong - I'm not discouraging the creation of the property.  But
if you want to avoid simul-scrub, you'd first have to create a mechanism for
that, and then you could create the autoscrub.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS+CIFS: Volume Shadow Services, or Simple Symlink?

2010-03-22 Thread Edward Ned Harvey
> Not being a CIFS user, could you clarify/confirm for me..  is this
> just a "presentation" issue, ie making a directory icon appear in a
> gooey windows explorer (or mac or whatever equivalent) view for people
> to click on?  The windows client could access the .zfs/snapshot dir
> via typed pathname if it knows to look, or if it's made visible, yes?

You are correct.  A CIFS client by default will not show the "hidden" .zfs
directory, but if you either click the checkbox "show hidden files" then
you'll see it, or if you type it into the addressbar, then you can access
it.

However, my users were used to having a hidden ".snapshots" directory in
every directory.  I didn't want to tell them "You have to go to the parent
of all directories, and type in .zfs" mostly because they can't remember
"zfs" ... So the softlink just makes it visible and easy to remember.

> > I
> > promise you will never catch me creating backups of ZFS via CIFS.  ;-
> )
> 
> Never say never..

Hehehehe.  Given the alternatives, I think this is a safe one.  I will never
backup a ZFS filesystem via CIFS client.  ;-)  

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] snapshots as versioning tool

2010-03-22 Thread Edward Ned Harvey
> This may be a bit dimwitted since I don't really understand how
> snapshots work.  I mean the part concerning COW (copy on write) and
> how it takes so little room.

COW and snapshots are very simple to explain.  Suppose you're chugging along
using your filesystem, and then one moment, you tell the filesystem to
"freeze."  Well, suppose a minute later you tell the FS to overwrite some
block that's in use already.  Instead of overwriting the actual block on
disk, the FS will overwrite some unused space, and report back to you that
the operation is completed.  So now there's a "copy" of the block as it was
at the moment of the "freeze," and there's another "copy" of the block as it
looks later in time.  The FS only needs to freeze the FS tables, to remember
which blocks belonged to which files in each of the snapshots.  Hence, Copy
On Write.

That being said, it's an inaccurate description to say "COW takes so little
room."  If anything, it takes more room than a filesystem which can't do
COW, because the FS must not delete any of the old blocks belonging to any
of the old "snapshots" of the filesystem.  The more frequently you take
snapshots, and the older your oldest snap is, and the more volatile your
data is, changing large sequences of blocks rapidly ... The more disk space
will be consumed.  No block is free, as long as any one of the snaps
references it.

But suppose you have n snapshots.  In a non-COW filesystem, you would have
n-times the data.  While in COW, you still have 1x the total used data size,
plus the byte differentials necessary to resurrect any/all of the old
snapshots.
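
You can watch this accounting directly; for example (hypothetical dataset):

# the snapshot's USED column shows only the blocks that exist solely
# because the snapshot is still holding on to them
zfs snapshot tank/proj@freeze
zfs list -t snapshot -o name,used,referenced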


> I'm thinking of scripting something like 10 minute snapshots during
> the time I'm working on a project, then just turn it off when not
> working on it.  When project is done... zap all those snapshots.

Yup, that's absolutely easy.  Just set up a cron job to snap every 10
minutes, using a unique string in the snapname, like "@myprojectsnap" ...
and when you're all done, you "zfs destroy" anything which matches
"@myprojectsnap"

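As a sketch (one crontab entry plus a cleanup one-liner; dataset names are
placeholders, and note that "%" must be escaped inside a crontab):

# crontab entry: snapshot every 10 minutes with a recognizable name
0,10,20,30,40,50 * * * * /usr/sbin/zfs snapshot tank/proj@myprojectsnap-`date +\%Y\%m\%d\%H\%M`

# when the project is done, destroy everything matching that name
zfs list -H -t snapshot -o name | grep '@myprojectsnap' | xargs -n 1 zfs destroy
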

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Proposition of a new zpool property.

2010-03-22 Thread Edward Ned Harvey
> In other words, there
> is no
> case where multiple scrubs compete for the resources of a single disk
> because
> a single disk only participates in one pool. 

Excellent point.  However, the problem scenario was described as SAN.  I can
easily imagine a scenario where some SAN administrator created a pool of
raid 5+1 or raid 0+1, and the pool is divided up into 3 LUNs which are
presented to 3 different machines.  Hence, when Machine A is hammering on
the disks, it could also affect Machine B or C.

The "catch" that I keep repeating, is that even a zfs property couldn't
possibly solve that problem.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] snapshots as versioning tool

2010-03-22 Thread Edward Ned Harvey
> > You can easily determine if the snapshot has changed by checking the
> > output of zfs list for the snapshot.
> 
> Do you mean to just grep it out of the output of
> 
>   zfs list -t snapshot

I think the point is:  You can easily tell how many MB changed in a
snapshot, and therefore you can easily tell "yes the snapshot changed."  But
unfortunately, no you can't easily tell which files changed.  Yet. 
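For example, with a hypothetical dataset name, the USED column per snapshot is
what tells you "something changed":

zfs list -r -t snapshot -o name,used,referenced tank/home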

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] To reserve space

2010-03-24 Thread Edward Ned Harvey
Is there a way to reserve space for a particular user or group?  Or perhaps
to set a quota for a group which includes everyone else?

 

I have one big pool, which holds users' home directories, and also the
backend files for the svn repositories etc.  I would like to ensure the svn
server process will always have some empty space to work with, even if some
users go hog wild and consume everything they can.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS on a 11TB HW RAID-5 controller

2010-03-24 Thread Edward Ned Harvey
> Thank you all for your valuable experience and fast replies. I see your
> point and will create one virtual disk for the system and one for the
> storage pool. My RAID controller is battery backed up, so I'll leave
> write caching on.

I think the point is to say:  ZFS software raid is both faster and more
reliable than your hardware raid.  Surprising though it may be for a
newcomer, I have statistics to back that up, and an explanation of how it's
possible, if you want to know.

You will do best if you configure the raid controller to JBOD.  Yes it's ok
to enable WriteBack on all those disks, but just use the raid card for write
buffering, not raid.

The above suggestion might be great ideally.  But how do you boot from some
disk which isn't attached to the raid controller?  Most servers don't have
any other option ...  So you might just make a 2-disk mirror, use that as a
boot volume, and then JBOD all the other disks.  That's somewhat a waste of
disk space, but it might be your best solution.  This is in fact, what I do.
I have 2x 1TB disks dedicated to nothing but the OS.  That's tremendous
overkill.  And all the other disks are a data pool.  All of the disks are
1TB, because it greatly simplifies the usage of a hotspare...  And I'm
wasting nearly 1TB on the OS disks.
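As a sketch, with hypothetical device names, the data pool on the remaining
JBOD disks might then look like:

zpool create tank raidz2 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0 spare c1t7d0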

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] To reserve space

2010-03-24 Thread Edward Ned Harvey
> zfs set reservation=100GB dataset/name
> 
> That will reserve 100 GB of space for the dataset, and will make that
> space unavailable to the rest of the pool.

That doesn't make any sense to me ... 

How does that allow "subversionuser" to use the space, and block "joeuser" from 
using it?

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] To reserve space

2010-03-24 Thread Edward Ned Harvey
> > zfs set reservation=100GB dataset/name
> >
> > That will reserve 100 GB of space for the dataset, and will make that
> > space unavailable to the rest of the pool.
> 
> That doesn't make any sense to me ...
> 
> How does that allow "subversionuser" to use the space, and block
> "joeuser" from using it?

Oh - I get it - In the case of a subversion server, it's pretty safe to assume
all the "svnuser" files live under a specific subdirectory (or a manageably
finite number of directories), so they can be kept in a separate zfs
filesystem within the same pool, and that filesystem can carry a space
reservation.
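A sketch of that approach, with hypothetical pool/dataset names and a 100G
floor for the svn backend:

zfs create tank/svn
zfs set mountpoint=/var/svn tank/svn
zfs set reservation=100G tank/svn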

I think that will be sufficient for our immediate needs.  Thanks for the
suggestion.

Out of curiosity, the more general solution would be the ability to create a
reservation on a per-user or per-group basis (just like you create quotas on
a per-user or per-group basis).  Is this possible?

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS on a 11TB HW RAID-5 controller

2010-03-24 Thread Edward Ned Harvey
> > Which is why ZFS isn't a replacement for proper array controllers
> (defining proper as those with sufficient battery to leave you with a
> seemingly intact filesystem), but a very nice augmentation for them. ;)
> 
> Nothing prevents a clever chap from building a ZFS-based array
> controller
> which includes nonvolatile write cache. However, the economics suggest
> that the hybrid storage pool model can provide a highly dependable
> service
> at a lower price-point than the traditional array designs.

I don't have finished results that are suitable for sharing yet, but I'm
doing a bunch of benchmarks right now that suggest:

-1-  WriteBack enabled is much faster for writing than WriteThrough.  (duh.)
-2-  Ditching the WriteBack, and using a ZIL instead, is even faster than
that.

Oddly, the best performance seems to be using ZIL, with all the disks
WriteThrough.  You actually get slightly lower performance if you enable the
ZIL together with WriteBack.  My theory to explain the results I'm seeing
is:  Since the ZIL performs best for zillions of tiny write operations and
the spindle disks perform best for large sequential writes, I suspect the
ZIL accumulates tiny writes until they add up to a large sequential write,
and then they're flushed to spindle disks.  In this configuration, the HBA
writeback cannot add any benefit, because the datastreams are already
optimized for the device they're writing to.  Yet, by enabling the
WriteBack, you introduce a small delay before writes begin to hit the
spindle.  By switching to WriteThrough, you actually get better performance.
As counter-intuitive as that may seem.  :-)

So, if you've got access to a pair of decent ZIL devices, you're actually
faster and more reliable to run all your raid and caching and buffering via
ZFS instead of using a fancy HBA.
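For reference, attaching a mirrored pair of SSDs as a dedicated log device
looks something like this (hypothetical pool and device names):

zpool add tank log mirror c2t0d0 c2t1d0
zpool status tank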

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] To reserve space

2010-03-24 Thread Edward Ned Harvey
The question is not how to create quotas for users.

The question is how to create reservations for users.

 

One way to create a reservation for a user is to create a quota for everyone
else, but that's a little less manageable, so a reservation per-user would
be cleaner and more desirable.
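For reference, per-user and per-group quotas look like this (hypothetical pool
and user names); as far as I know there is no matching per-user reservation
property, which is why a workaround is needed at all:

zfs set userquota@joeuser=50G tank/home
zfs set groupquota@staff=500G tank/home
zfs userspace tank/home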

 

 

 

 

From: Brandon High [mailto:bh...@freaks.com] 
Sent: Wednesday, March 24, 2010 2:33 PM
To: Edward Ned Harvey
Cc: Freddie Cash; zfs-discuss
Subject: Re: [zfs-discuss] To reserve space

 

On Wed, Mar 24, 2010 at 11:18 AM, Edward Ned Harvey 
wrote:

Out of curiosity, the more general solution would be the ability to create a
reservation on a per-user or per-group basis (just like you create quotas on
a per-user or per-group basis).  Is this possible?


OpenSolaris's zfs has supported quotas for a little while, so make sure
you're using a recent build. I'm not sure if it's in Solaris 10, but I
believe it is.

Before quotas were supported, the answer was to create a new dataset per
user, eg: tank/home/user1, tank/home/user2, etc. It's easy to do in zfs, but
it doesn't always work for storage that is shared between users.

-B


-- 
Brandon High : bh...@freaks.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS on a 11TB HW RAID-5 controller

2010-03-25 Thread Edward Ned Harvey
> > I think the point is to say:  ZFS software raid is both faster and
> more
> > reliable than your hardware raid.  Surprising though it may be for a
> > newcomer, I have statistics to back that up,
> 
> Can you share it?

Sure.  Just go to http://nedharvey.com and you'll see four links on the left
side, "RAID Benchmarks."
The simplest comparison is to look at Bob's Method Summary.  You'll see
statistics for example:
raidz-5disks
raid5-5disks-hardware

You'll see that it's not a 100% victory for either one - the hardware raid
is able to do sequential writes faster, and stride reads faster, but all the
other categories, the raidz is faster, and by a larger margin.  If all
categories are equally important in your usage scenario, then the average is
3.53 vs 2.47 in favor of zfs raidz.  But if your usage characteristics don't
weight the various operations equally ... Most people care about random
reads and random writes more than they care about other operations ... then
the results are 2.18 vs 1.52 in favor of zfs raidz.

As for "more reliable" ... Here's my justification for saying that.  For
starters, one fewer single point of failure.  If you're doing RAID with your
HBA, and if it dies, then you risk losing your whole data set, regardless of
raid and the fact that the disks are still good.  Also, you can't attach
your disks to some other system to recover your data; you need to replace
your HBA with an identical HBA to even stand a chance.  But if you're doing
ZFS raid, there is no raid controller whose death can take the pool with it.
If you needed to, you could attach your disks to another system and simply
import the pool.


> > You will do best if you configure the raid controller to JBOD.
> 
> Problem: HP's storage controller doesn't support that mode.

Sure it does.  You just make a RAID0 or RAID1 with a single disk in it.  And
make another.  And make another.  Etc.

This is how I do it on a Dell PERC.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS on a 11TB HW RAID-5 controller

2010-03-25 Thread Edward Ned Harvey
> The bigger problem is that you have to script around a disk failure, as
> the array won't bring a non-redundant logicaldrive back online after a
> disk failure without being kicked (which is a good thing in general,
> but
> annoying for ZFS).

I'd like to follow up on that point.  Because until recently, I always said
the same thing about the Dell PERC, and I learned there's a gray area.  I
bet there is for you too.

If you create a hardware mirror, and one disk dies, you can simply toss
another disk into the failed slot, and it will auto-resilver.  No need to
kick it.

If you create a raid-5 or raid-6, same is true.  No need to kick it.

If you have a hotspare ... and the hotspare gets consumed ... then you slap
a new disk into its place, and it will not automatically become a new
hotspare.

If you have a single disk raid-0 or raid-1, and it dies, you replace it, and
it will not automatically become a new raid-0 or raid-1.  You need to kick
it.

I don't know what the HP equivalent is, but in a Dell, using PERC, you use
the MegaCLI utility to monitor the health of your HBA.  Yes, it is a pain.
In a Sun, it varies based on which specific system, but they have a similar
thing.  Some GUI utility to monitor and reconfigure the HBA.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS backup configuration

2010-03-25 Thread Edward Ned Harvey
> Sorry if this has been dicussed before. I tried searching but I
> couldn't find any info about it. We would like to export our ZFS
> configurations in case we need to import the pool onto another box. We
> do not want to backup the actual data in the zfs pool, that is already
> handled through another program.

What's the question?  It seems like the answer is probably "zpool export"
and "man zpool"

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] RAIDZ2 configuration

2010-03-26 Thread Edward Ned Harvey
> Using fewer than 4 disks in a raidz2 defeats the purpose of raidz2, as

> you will always be in a degraded mode.

 

Freddie, are you nuts?  This is false.

 

Sure you can use raidz2 with 3 disks in it.  But it does seem pointless to do 
that instead of a 3-way mirror.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] RAIDZ2 configuration

2010-03-26 Thread Edward Ned Harvey
> Coolio.  Learn something new everyday.  One more way that raidz is
> different from RAID5/6/etc.

Freddie, again, you're wrong.  Yes, it's perfectly acceptable to create either 
raid-5 or raidz using 2 disks.  It's not degraded, but it does seem pointless 
to do this instead of a mirror.

Likewise, it's perfectly acceptable to create a raid-6 or raid-dp or raidz2 
using 3 disks.  It's not degraded, but seems pointless to do this instead of a 
3-way mirror.

Since it's pointless, some hardware vendors may not implement it in their raid 
controllers.  They might only give you the option of creating a mirror instead. 
 But that doesn't mean it's invalid raid configuration.


> So, is it just a "standard" that hardware/software RAID setups require
> 3 drives for a RAID5 array?  And 4 drives for RAID6?

It is just "standard" not to create a silly 2-disk raid5 or raidz.  But don't 
use the word "require."

It is common practice to create raidz2 only with 4 disks or more, but again, 
don't use the word "require."

Some people do in fact create these silly configurations just because they're 
unfamiliar with what it all means.  Take Bruno's original post as example, and 
that article he referenced on sun.com.  How these things get started, I'll 
never know.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] RAIDZ2 configuration

2010-03-26 Thread Edward Ned Harvey
Just because most people are probably too lazy to click the link, I’ll paste a 
phrase from that sun.com webpage below:

“Creating a single-parity RAID-Z pool is identical to creating a mirrored pool, 
except that the ‘raidz’ or ‘raidz1’ keyword is used instead of ‘mirror’.”

And

“zpool create tank raidz2 c1t0d0 c2t0d0 c3t0d0”

 

So … Shame on you, Sun, for doing this to your poor unfortunate readers.  It 
would be nice if the page were a wiki, or somehow able to have feedback 
submitted…

 

 

 

From: zfs-discuss-boun...@opensolaris.org 
[mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Bruno Sousa
Sent: Thursday, March 25, 2010 3:28 PM
To: Freddie Cash
Cc: ZFS filesystem discussion list
Subject: Re: [zfs-discuss] RAIDZ2 configuration

 

Hmm...it might be completely wrong , but the idea of raidz2 vdev with 3 disks 
came from the reading of http://docs.sun.com/app/docs/doc/819-5461/gcvjg?a=view 
.

This particular page has the following example :

# zpool create tank raidz2 c1t0d0 c2t0d0 c3t0d0
# zpool status -v tank
  pool: tank
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz2    ONLINE       0     0     0
            c1t0d0  ONLINE       0     0     0
            c2t0d0  ONLINE       0     0     0
            c3t0d0  ONLINE       0     0     0
 

So...what am i missing here? Just a bad example in the sun documentation 
regarding zfs?

Bruno

On 25-3-2010 20:10, Freddie Cash wrote: 

On Thu, Mar 25, 2010 at 11:47 AM, Bruno Sousa  wrote:

What do you mean by "Using fewer than 4 disks in a raidz2 defeats the purpose 
of raidz2, as you will always be in a degraded mode" ? Does it means that 
having 2 vdevs with 3 disks it won't be redundant in the advent of a drive 
failure?

 

raidz1 is similar to raid5 in that it is single-parity, and requires a minimum 
of 3 drives (2 data + 1 parity)

raidz2 is similar to raid6 in that it is double-parity, and requires a minimum 
of 4 drives (2 data + 2 parity)

 

IOW, a raidz2 vdev made up of 3 drives will always be running in degraded mode 
(it's missing a drive).

 

-- 

Freddie Cash
fjwc...@gmail.com


 
 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
  

 

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS where to go!

2010-03-26 Thread Edward Ned Harvey
> OK, I have 3Ware looking into a driver for my cards (3ware 9500S-8) as
> I dont see an OpenSolaris driver for them.
> 
> But this leads me that they do have a FreeBSD Driver, so I could still
> use ZFS.
> 
> What does everyone think about that? I bet it is not as mature as on
> OpenSolaris.

"mature" is not the right term in this case.  FreeBSD has been around much
longer than opensolaris, and it's equally if not more mature.  FreeBSD is
probably somewhat less featureful, because their focus is heavily on the
reliability and stability side rather than early adoption.  It's also less
popular, so there's less package availability.

And FreeBSD in general will be built using older versions of packages than
what's in OpenSolaris.

Both are good OSes.  If you can use FreeBSD but OpenSolaris doesn't have the
driver for your hardware, go for it.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS backup configuration

2010-03-26 Thread Edward Ned Harvey
> It seems like the zpool export will quiesce the drives and mark the pool
> as exported. This would be good if we wanted to move the pool at that
> time but we are thinking of a disaster recovery scenario. It would be
> nice to export just the config to where if our controller dies, we can
> use the zpool import on another box to get back up and running.

Correct, "zpool export" will offline your disks so you can remove them and
bring them somewhere else.

I don't think you need to do anything in preparation for possible server
failure.  Am I wrong about this?  I believe once your first server is down,
you just move your disks to another system, and then "zpool import."  I
don't believe the export is necessary in order to do an import.  You would
only export if you wanted to disconnect while the system is still powered
on.

You just "export" to tell the running OS "I'm about to remove those disks,
so don't freak out."  But if there is no running OS, you don't worry about
it.

Again, I'm only 98% sure of the above.  So it might be wise to test on a
sandbox system.
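On the replacement box, the recovery would then look something like this
(hypothetical pool name):

zpool import           # scan attached disks and list any importable pools
zpool import -f tank   # -f because the dead server never exported the pool
zpool status tank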

One thing that is worth mention:  If you have an HBA such as 3ware, or Perc,
or whatever ... it might be impossible to move the disks to a different HBA,
such as Perc or 3ware (swapped one for the other).  If your original system
is using Perc 6/i, only move them to another system with Perc 6/i (and if
possible, ensure the controller is using the same rev of firmware.)

If you're using a simple unintelligent non-raid sas or sata controller, you
should be good.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs send and ARC

2010-03-26 Thread Edward Ned Harvey
> In the "Thoughts on ZFS Pool Backup Strategies thread" it was stated
> that zfs send, sends uncompressed data and uses the ARC.
> 
> If "zfs send" sends uncompressed data which has already been compressed,
> this is not very efficient, and it would be *nice* to see it send the
> original compressed data. (or an option to do it)

You've got 2 questions in your post.  The one above first ...

It's true that "zfs send" sends uncompressed data.  So I've heard.  I haven't 
tested it personally.

I seem to remember there's some work to improve this, but not available yet.  
Because it was easier to implement the uncompressed send, and that already is 
super-fast compared to all the alternatives.


> I thought I would ask a true or false type questions mainly for
> curiosity sake.
> 
> If "zfs send" uses standard ARC cache (when something is not already in
> the ARC) I would expect this to hurt (to some degree??) the performance
> of the system. (ie I assume it has the effect of replacing
> current/useful data in the cache with not very useful/old data

And this is a separate question.

I can't say first-hand what ZFS does, but I have an educated guess.  I would 
say, for every block the "zfs send" needs to read ... if the block is in ARC or 
L2ARC, then it won't fetch again from disk.  But it is not obliterating the ARC 
or L2ARC with old data.  Because it's smart enough to work at a lower level 
than a user-space process, and tell the kernel (or whatever) something like 
"I'm only reading this block once; don't bother caching it for my sake."

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS where to go!

2010-03-26 Thread Edward Ned Harvey
> While I use zfs with FreeBSD (FreeNAS appliance with 4x SATA 1 TByte
> drives)
> it is trailing OpenSolaris by at least a year if not longer and hence
> lacks
> many key features people pick zfs over other file systems. The
> performance,
> especially CIFS is quite lacking. Purportedly (I have never seen the
> source
> nor am I a developer), such crucial features are nontrivial to backport
> because
> FreeBSD doesn't practice layer separation. Inasmuch this is still true
> for the future we'll see once the Oracle/Sun dust settles.

I'm not sure if it's a version thing, or something else ... I am running
solaris 10u6 (at least a year or two old) and the performance of that is not
just fine ... it's super awesome.

An important note, though, is that I'm using samba and not the zfs built-in
cifs.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SSD As ARC

2010-03-28 Thread Edward Ned Harvey
> >> You can't share a device (either as ZIL or L2ARC) between multiple
> pools.
> >
> > Discussion here some weeks ago reached suggested that an L2ARC device
> > was used for all ARC evictions, regardless of the pool.
> >
> > I'd very much like an authoritative statement (and corresponding
> > documentation updates) if this was correct.
> 
> I'm responsible for propagating this false rumor. It is the result of
> some
> testing in which my analysis was flawed.  The authoritative response
> is,
> as usual, in the source.

Can't you slice the SSD in two, and then give each slice to the two zpools?
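Something like this ought to work, with hypothetical pool and device names,
after slicing the SSD with format(1M):

zpool add pool1 cache c4t0d0s0   # slice 0 as L2ARC for the first pool
zpool add pool2 log c4t0d0s1     # slice 1 as ZIL log for the second pool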

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Mixed ZFS vdev in same pool.

2010-03-28 Thread Edward Ned Harvey
> when I tried to create a pool (called group) with four 1TB disk in
> raidz and two 500GB disk in mirror configuration to the same pool ZFS
> complained and said if I wanted to do it I had to add a -f 

Honestly, I'm surprised by that.  I would think it's ok.  I am surprised by
the -f, just as you are.
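For reference, the sort of command in question, with hypothetical device
names; zpool warns because the two vdevs have mismatched replication levels,
and -f overrides the warning:

zpool create -f group raidz c7t0d0 c7t1d0 c7t2d0 c7t3d0 mirror c10d0 c10d1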


> If one 500GB disk dies (c10dX) in the mirror and I choose not to
> replace it, 

Don't do that.  You cannot remove vdev's from your pool.  If one half of the
mirror dies, you're now degraded, and if the 2nd half dies, you lose your
pool.  You cannot migrate data out of the failing mirror onto the raidz,
even if there's empty space.

I'm sure the ability to remove vdev's will be created some day ... but not
yet.


> Would this configuration survive a two disk failure if the disk are in

Yup, no problem.  Lose 1 disk in the raidz, and 1 disk in the mirror is ok.
Just don't lose 2 disks from the same vdev at the same time.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs diff

2010-03-30 Thread Edward Ned Harvey
> On Mon, Mar 29, 2010 at 5:39 PM, Nicolas Williams
>  wrote:
> > One really good use for zfs diff would be: as a way to index zfs send
> > backups by contents.
> 
> Or to generate the list of files for incremental backups via NetBackup
> or similar.  This is especially important for file systems will
> millions of files with relatively few changes.

+1

The reason "zfs send" is so fast, is not because it's so fast.  It's because
it does not need any time to index and compare and analyze which files have
changed since the last snapshot or increment.

If the "zfs diff" command could generate the list of changed files, and you
feed that into tar or whatever, then these 3rd party backup tools become
suddenly much more effective.  Able to rival the performance of "zfs send."
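At the time of writing no such command ships, but assuming it eventually looks
like the other zfs subcommands, the usage would presumably be as simple as
(hypothetical dataset and snapshot names):

zfs diff tank/home@monday tank/home@tuesday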

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs recreate questions

2010-03-30 Thread Edward Ned Harvey
If you "zfs export" it will offline your pool.  This is what you do when
you're going to intentionally remove disks from the live system.

 

If you suffered a hardware problem, and you're migrating your
uncleanly-unmounted disks to another system, then as Brandon described
below, you'll need the "-f" to force the import.

 

When you "zfs import" it does not matter if you've moved the disks around.
What used to be connected to SATA port 0 can move to port 6 or whatever.
Irrelevant.  The data on disks says not only which pool each disk belongs
to, but which position within the pool.  This makes sense and is
particularly important, because, suppose you have a pool in operation for
some years, with hotspare.  A disk fails, the hotspare is consumed, another
disk fails, another hotspare consumed, and so on.  Now you've got all your
disks jumbled around in random order.  And then your CPU dies so you need to
move your disks to another system, and there's no way for you to know which
order the disks were in the pool.  It's important to be able to import the
volume, with the disks all jumbled around in random order.

 

 

 

From: zfs-discuss-boun...@opensolaris.org
[mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Brandon High
Sent: Monday, March 29, 2010 6:54 PM
To: JD Trout
Cc: zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] zfs recreate questions

 

On Mon, Mar 29, 2010 at 3:49 PM, JD Trout  wrote:

That is great to hear.  What is the command to do this?  I setup a test
situation and I would like to give it a try.

 

If you can plan the removal, simply 'zpool export' your pool, then 'zpool
import' it on the new controller / host.

 

If you don't do an export, use 'zpool import -f' to force it.

 

-B


-- 
Brandon High : bh...@freaks.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-30 Thread Edward Ned Harvey
> But the speedup of disabling the ZIL altogether is
> appealing (and would
> probably be acceptable in this environment).

Just to make sure you know ... if you disable the ZIL altogether, and you
have a power interruption, failed cpu, or kernel halt, then you're likely to
have a corrupt unusable zpool, or at least data corruption.  If that is
indeed acceptable to you, go nuts.  ;-)

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-30 Thread Edward Ned Harvey
> standard ZIL:   7m40s  (ZFS default)
> 1x SSD ZIL:  4m07s  (Flash Accelerator F20)
> 2x SSD ZIL:  2m42s  (Flash Accelerator F20)
> 2x SSD mirrored ZIL:   3m59s  (Flash Accelerator F20)
> 3x SSD ZIL:  2m47s  (Flash Accelerator F20)
> 4x SSD ZIL:  2m57s  (Flash Accelerator F20)
> disabled ZIL:   0m15s
> (local extraction0m0.269s)
> 
> I was not so much interested in the absolute numbers but rather in the
> relative
> performance differences between the standard ZIL, the SSD ZIL and the
> disabled
> ZIL cases.

Oh, one more comment.  If you don't mirror your ZIL, and your unmirrored SSD
goes bad, you lose your whole pool.  Or at least suffer data corruption.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] VMware client solaris 10, RAW physical disk and zfs snapshots problem - all created snapshots are equal to zero.

2010-03-30 Thread Edward Ned Harvey
> The problem that I have now is that each created snapshot is always
> equal to zero... zfs just not storing changes that I have made to the
> file system before making a snapshot.
> 
>  root@sl-node01:~# zfs list
> NAME                            USED  AVAIL  REFER  MOUNTPOINT
> mypool01                       91.9G   137G    23K  /mypool01
> mypool01/storage01             91.9G   137G  91.7G  /mypool01/storage01
> mypool01/storage01@30032010-1      0      -  91.9G  -
> mypool01/storage01@30032010-2      0      -  91.9G  -
> mypool01/storage01@30032010-3      0      -  91.7G  -
> mypool02                       91.9G   137G    24K  /mypool02
> mypool02/copies                  23K   137G    23K  /mypool02/copies
> mypool02/storage01             91.9G   137G  91.9G  /mypool02/storage01
> mypool02/storage01@30032010-1      0      -  91.9G  -
> mypool02/storage01@30032010-2      0      -  91.9G  -

Try this:
zfs snapshot mypool01/storage01@30032010-4
dd if=/dev/urandom of=/mypool01/storage01/randomfile bs=1024k count=1024
zfs snapshot mypool01/storage01@30032010-5
rm /mypool01/storage01/randomfile
zfs snapshot mypool01/storage01@30032010-6
zfs list

And see what happens.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-30 Thread Edward Ned Harvey
> Again, we can't get a straight answer on this one..
> (or at least not 1 straight answer...)
> 
> Since the ZIL logs are committed atomically they are either committed
> in FULL, or NOT at all (by way of rollback of incomplete ZIL applies at
> zpool mount time / or transaction rollbacks if things go exceptionally
> bad), the only LOST data would be what hasn't been transferred from ZIL
> to the primary pool..
> 
> But the pool should be "sane".

If this is true ...  Suppose you shutdown a system, remove the ZIL device,
and power back on again.  What will happen?  I'm informed that with current
versions of solaris, you simply can't remove a zil device once it's added to
a pool.  (That's changed in recent versions of opensolaris) ... but in any
system where removing the zil isn't allowed, what happens if the zil is
removed?

I have to assume something which isn't quite sane happens.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs recreate questions

2010-03-30 Thread Edward Ned Harvey
> Anyway, my question is, [...]
> as expected I can't import it because the pool was created
> with a newer version of ZFS. What options are there to import?

I'm quite sure there is no option to import or receive or downgrade a zfs
filesystem from a later version.  I'm pretty sure your only option is
something like "tar"

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-30 Thread Edward Ned Harvey
> If the ZIL device goes away then zfs might refuse to use the pool
> without user affirmation (due to potential loss of uncommitted
> transactions), but if the dedicated ZIL device is gone, zfs will use
> disks in the main pool for the ZIL.
> 
> This has been clarified before on the list by top zfs developers.

Here's a snippet from man zpool.  (Latest version available today in
solaris)

zpool remove pool device ...
Removes the specified device from the pool. This command
currently  only  supports  removing hot spares and cache
devices. Devices that are part of a mirrored  configura-
tion  can  be  removed  using  the zpool detach command.
Non-redundant and raidz devices cannot be removed from a
pool.

So you think it would be ok to shutdown, physically remove the log device,
and then power back on again, and force import the pool?  So although there
may be no "live" way to remove a log device from a pool, it might still be
possible if you offline the pool to ensure writes are all completed before
removing the device?

If it were really just that simple ... if zfs only needed to stop writing to
the log device and ensure the cache were flushed, and then you could safely
remove the log device ... doesn't it seem silly that there was ever a time
when that wasn't implemented?  Like ... Today.  (Still not implemented in
solaris, only opensolaris.)

I know I am not going to put the health of my pool on the line, assuming
this line of thought.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-30 Thread Edward Ned Harvey
> So you think it would be ok to shutdown, physically remove the log
> device,
> and then power back on again, and force import the pool?  So although
> there
> may be no "live" way to remove a log device from a pool, it might still
> be
> possible if you offline the pool to ensure writes are all completed
> before
> removing the device?
> 
> If it were really just that simple ... if zfs only needed to stop
> writing to
> the log device and ensure the cache were flushed, and then you could
> safely
> remove the log device ... doesn't it seem silly that there was ever a
> time
> when that wasn't implemented?  Like ... Today.  (Still not implemented
> in
> solaris, only opensolaris.)

Allow me to clarify a little further, why I care about this so much.  I have
a solaris file server, with all the company jewels on it.  I had a pair of
intel X.25 SSD mirrored log devices.  One of them failed.  The replacement
device came with a newer version of firmware on it.  Now, instead of
appearing as 29.802 Gb, it appears as 29.801 Gb.  I cannot zpool attach;
the new device is too small.

So apparently I'm the first guy this happened to.  Oracle is caught totally
off guard.  They're pulling their inventory of X25's from dispatch
warehouses, and inventorying all the firmware versions, and trying to figure
it all out.  Meanwhile, I'm still degraded.  Or at least, I think I am.

Nobody knows any way for me to remove my unmirrored log device.  Nobody
knows any way for me to add a mirror to it (until they can locate a drive
with the correct firmware.)  All the support people I have on the phone are
just as scared as I am.  "Well we could upgrade the firmware of your
existing drive, but that'll reduce it by 0.001 Gb, and that might just
create a time bomb to destroy your pool at a later date."  So we don't do
it.

Nobody has suggested that I simply shutdown and remove my unmirrored SSD,
and power back on.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-30 Thread Edward Ned Harvey
> > Just to make sure you know ... if you disable the ZIL altogether, and
> you
> > have a power interruption, failed cpu, or kernel halt, then you're
> likely to
> > have a corrupt unusable zpool, or at least data corruption.  If that
> is
> > indeed acceptable to you, go nuts.  ;-)
> 
> I believe that the above is wrong information as long as the devices
> involved do flush their caches when requested to.  Zfs still writes
> data in order (at the TXG level) and advances to the next transaction
> group when the devices written to affirm that they have flushed their
> cache.  Without the ZIL, data claimed to be synchronously written
> since the previous transaction group may be entirely lost.
> 
> If the devices don't flush their caches appropriately, the ZIL is
> irrelevant to pool corruption.

I stand corrected.  You don't lose your pool.  You don't have corrupted
filesystem.  But you lose whatever writes were not yet completed, so if
those writes happen to be things like database transactions, you could have
corrupted databases or files, or missing files if you were creating them at
the time, and stuff like that.  AKA, data corruption.

But not pool corruption, and not filesystem corruption.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Edward Ned Harvey
> Use something other than Open/Solaris with ZFS as an NFS server?  :)
> 
> I don't think you'll find the performance you paid for with ZFS and
> Solaris at this time. I've been trying to more than a year, and
> watching dozens, if not hundreds of threads.
> Getting half-ways decent performance from NFS and ZFS is impossible
> unless you disable the ZIL.
> 
> You'd be better off getting NetApp

Hah hah.  I have a Sun X4275 server exporting NFS.  We have clients on all 4
of the Gb ethers, and the Gb ethers are the bottleneck, not the disks or
filesystem.

I suggest you either enable the WriteBack cache on your HBA, or add SSD's
for ZIL.  Performance is 5-10x higher this way than using "naked" disks.
But of course, not as high as it is with a disabled ZIL.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Edward Ned Harvey
> > Nobody knows any way for me to remove my unmirrored
> > log device.  Nobody knows any way for me to add a mirror to it (until
> 
> Since snv_125 you can remove log devices. See
> http://bugs.opensolaris.org/view_bug.do?bug_id=6574286
> 
> I've used this all the time during my testing and was able to remove
> both
> mirrored and unmirrored log devices without any problems (and without
> reboot). I'm using snv_134.

I'm aware.  OpenSolaris can remove log devices.  Solaris cannot.  Yet.  But if
you want your server in production, you can get a support contract for
Solaris; for OpenSolaris you cannot.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Edward Ned Harvey
> >Oh, one more comment. If you don't mirror your ZIL, and your
> unmirrored SSD
> >goes bad, you lose your whole pool. Or at least suffer data
> corruption.
> 
> Hmmm, I thought that in that case ZFS reverts to the "regular on disks"
> ZIL?

I see the source for some confusion.  On the ZFS Best Practices page:
http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide

It says:
Failure of the log device may cause the storage pool to be inaccessible if
you are running the Solaris Nevada release prior to build 96 and a release
prior to the Solaris 10 10/09 release.

It also says:
If a separate log device is not mirrored and the device that contains the
log fails, storing log blocks reverts to the storage pool.

...  At the time when I built my system (Oct 2009) this is what it said:
At present, until [http://bugs.opensolaris.org/view_bug.do?bug_id=6707530 CR
6707530] is integrated, failure of the log device may cause the storage pool
to be inaccessible. Protecting the log device by mirroring will allow you to
access the storage pool even if a log device has failed.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Edward Ned Harvey
> Would your users be concerned if there was a possibility that
> after extracting a 50 MB tarball that files are incomplete, whole
> subdirectories are missing, or file permissions are incorrect?

Correction:  "Would your users be concerned if there was a possibility that
after extracting a 50MB tarball *and having a server crash* then files could
be corrupted as described above."

If you disable the ZIL, the filesystem still stays correct in RAM, and the
only way you lose any data such as you've described, is to have an
ungraceful power down or reboot.

The advice I would give is:  Do zfs autosnapshots frequently (say ... every
5 minutes, keeping the most recent 2 hours of snaps) and then run with no
ZIL.  If you have an ungraceful shutdown or reboot, rollback to the latest
snapshot ... and rollback once more for good measure.  As long as you can
afford to risk 5-10 minutes of the most recent work after a crash, then you
can get a 10x performance boost most of the time, and no risk of the
aforementioned data corruption.
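The recovery after a crash would then be something like this (hypothetical
dataset and snapshot names; -r destroys any snapshots newer than the one you
roll back to):

zfs rollback -r tank/vmstore@2010-03-31-1405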

Obviously, if you cannot accept 5-10 minutes of data loss, such as credit
card transactions, this would not be acceptable.  You'd need to keep your
ZIL enabled.  Also, if you have a svn server on the ZFS server, and you have
svn clients on other systems ... You should never allow your clients to
advance beyond the current rev of the server.  So again, you'd have to keep
the ZIL enabled on the server.

It all depends on your workload.  For some, the disabled ZIL is worth the
risk.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] VMware client solaris 10, RAW physical disk and zfs snapshots problem - all created snapshots are equal to zero.

2010-03-31 Thread Edward Ned Harvey
> I did those test and here are results:
> 
> root@sl-node01:~# zfs list
> NAME                            USED  AVAIL  REFER  MOUNTPOINT
> mypool01                       91.9G   136G    23K  /mypool01
> mypool01/storage01             91.9G   136G  91.7G  /mypool01/storage01
> mypool01/storage01@30032010-1      0      -  91.9G  -
> mypool01/storage01@30032010-2      0      -  91.9G  -
> mypool01/storage01@30032010-3  2.15M      -  91.7G  -
> mypool01/storage01@30032010-4    41K      -  91.7G  -
> mypool01/storage01@30032010-5  1.17M      -  91.7G  -
> mypool01/storage01@30032010-6      0      -  91.7G  -
> mypool02                       91.9G   137G    24K  /mypool02
> mypool02/copies                  23K   137G    23K  /mypool02/copies
> mypool02/storage01             91.9G   137G  91.9G  /mypool02/storage01
> mypool02/storage01@30032010-1      0      -  91.9G  -
> mypool02/storage01@30032010-2      0      -  91.9G  -
> 
> As you can see I have differences for snapshot 4,5 and 6 as you
> suggested to make a test. But I can see also changes on snapshot no. 3
> - I complain about this snapshot because I could not see differences
> on it last night! Now it shows.

Well, the first thing you should know is this:  Suppose you take a snapshot,
and create some files.  Then the snapshot still occupies no disk space.
Everything is in the current filesystem.  The only time a snapshot occupies
disk space is when the snapshot contains data that is missing from the
current filesystem.  That is - If you "rm" or overwrite some files in the
current filesystem, then you will see the size of the snapshot growing.
Make sense?

That brings up a question though.  If you did the commands as I wrote them,
it would mean you created a 1G file, took a snapshot, and rm'd the file.
Therefore your snapshot should contain at least 1G.  I am confused by the
fact that you only have 1-2M in your snapshot.  Maybe I messed up the
command I told you, or you messed up entering it on the system, and you only
created a 1M file, instead of a 1G file?


> What is still strange: snapshots 1 and 2 are the oldest but they are
> still equal to zero! After changes and snapshots 3,4,5 and 6 I would
> expect that snapshots 1 and 2 are "recording" also changes on the
> storage01 file system, but not... could it be possible that snapshots
> 1 and 2 are somehow "broken?"

If some file existed during all of the old snapshots, and you destroy your
later snapshots, then the data occupied by the later snapshots will start to
fall onto the older snapshots.  Until you destroy the oldest snapshot that
contained that data.  At which time, the data is truly gone from all of the
snapshots.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Edward Ned Harvey
> I see the source for some confusion.  On the ZFS Best Practices page:
> http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide
> 
> It says:
> Failure of the log device may cause the storage pool to be inaccessible
> if
> you are running the Solaris Nevada release prior to build 96 and a
> release
> prior to the Solaris 10 10/09 release.
> 
> It also says:
> If a separate log device is not mirrored and the device that contains
> the
> log fails, storing log blocks reverts to the storage pool.

I have some more concrete data on this now.  Running Solaris 10u8 (which is
10/09), fully updated last weekend.  We want to explore the consequences of
adding or failing a non-mirrored log device.  We created a pool with a
non-mirrored ZIL log device.  And experimented with it:

(a)  Simply yank out the non-mirrored log device while the system is live.
The result was:  Any zfs or zpool command would hang permanently.  Even "zfs
list" hangs permanently.  The system cannot shutdown, cannot reboot, cannot
"zfs send" or "zfs snapshot" or anything ... It's a bad state.  You're
basically hosed.  Power cycle is the only option.

(b)  After power cycling, the system won't boot.  It gets part way through
the boot process, and eventually just hangs there, infinitely cycling error
messages about services that couldn't start.  Random services, such as
inetd, which seem unrelated to some random data pool that failed.  So we
power cycle again, and go into failsafe mode, to clean up and destroy the
old messed up pool ... Boot up totally clean again, and create a new totally
clean pool with a non-mirrored log device.  Just to ensure we really are
clean, we simply "zpool export" and "zpool import" with no trouble, and
reboot once for good measure.  "zfs list" and everything are all working
great...

(c)  Do a "zpool export."  Obviously, the ZIL log device is clean and
flushed at this point, not being used.  We simply yank out the log device,
and do "zpool import."  Well ... Without that log device, I forget the
terminology, it said something like "missing disk."  Plain and simple, you
*can* *not* import the pool without the log device.  It does not say "to
force use -f" and even if you specify the -f, it still just throws the same
error message, missing disk or whatever.  Won't import.  Period.

...  So, to anybody who said the failed log device will simply fail over to
blocks within the main pool:  Sorry.  That may be true in some later
version, but it is not the slightest bit true in the absolute latest solaris
(proper) available today.

I'm going to venture a guess this is no longer a problem, after zpool
version 19.  This is when "ZFS log device removal" was introduced.

Unfortunately, the latest version of solaris only goes up to zpool version
15.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Edward Ned Harvey
> A MegaRAID card with write-back cache? It should also be cheaper than
> the F20.

I haven't posted results yet, but I just finished a few weeks of extensive
benchmarking various configurations.  I can say this:

WriteBack cache is much faster than "naked" disks, but if you can buy an SSD
or two for ZIL log device, the dedicated ZIL is yet again much faster than
WriteBack.

It doesn't have to be F20.  You could use the Intel X25 for example.  If
you're running solaris proper, you better mirror your ZIL log device.  If
you're running opensolaris ... I don't know if that's important.  I'll
probably test it, just to be sure, but I might never get around to it
because I don't have a justifiable business reason to build the opensolaris
machine just for this one little test.

Seriously, all disks configured WriteThrough (spindle and SSD disks alike)
using the dedicated ZIL SSD device, very noticeably faster than enabling the
WriteBack.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Edward Ned Harvey
> We ran into something similar with these drives in an X4170 that turned
> out to
> be  an issue of the preconfigured logical volumes on the drives. Once
> we made
> sure all of our Sun PCI HBAs where running the exact same version of
> firmware
> and recreated the volumes on new drives arriving from Sun we got back
> into sync
> on the X25-E devices sizes.

Can you elaborate?  Just today, we got the replacement drive that has
precisely the right version of firmware and everything.  Still, when we
plugged in that drive, and "create simple volume" in the storagetek raid
utility, the new drive is 0.001 Gb smaller than the old drive.  I'm still
hosed.

Are you saying I might benefit by sticking the SSD into some laptop, and
zero'ing the disk?  And then attach to the sun server?

Are you saying I might benefit by finding some other way to make the drive
available, instead of using the storagetek raid utility?

Thanks for the suggestions...

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-01 Thread Edward Ned Harvey
> >If you disable the ZIL, the filesystem still stays correct in RAM, and
> the
> >only way you lose any data such as you've described, is to have an
> >ungraceful power down or reboot.
> 
> >The advice I would give is:  Do zfs autosnapshots frequently (say ...
> every
> >5 minutes, keeping the most recent 2 hours of snaps) and then run with
> no
> >ZIL.  If you have an ungraceful shutdown or reboot, rollback to the
> latest
> >snapshot ... and rollback once more for good measure.  As long as you
> can
> >afford to risk 5-10 minutes of the most recent work after a crash,
> then you
> >can get a 10x performance boost most of the time, and no risk of the
> >aforementioned data corruption.
> 
> Why do you need the rollback? The current filesystems have correct and
> consistent data; not different from the last two snapshots.
> (Snapshots can happen in the middle of untarring)
> 
> The difference between running with or without ZIL is whether the
> client has lost data when the server reboots; not different from using
> Linux as an NFS server.

If you have an ungraceful shutdown in the middle of writing stuff, while the
ZIL is disabled, then you have corrupt data.  Could be files that are
partially written.  Could be wrong permissions or attributes on files.
Could be missing files or directories.  Or some other problem.

Some changes from the last 1 second of operation before crash might be
written, while some changes from the last 4 seconds might be still
unwritten.  This is data corruption, which could be worse than losing a few
minutes of changes.  At least, if you rollback, you know the data is
consistent, and you know what you lost.  You won't continue having more
losses afterward caused by inconsistent data on disk.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-01 Thread Edward Ned Harvey
> > Can you elaborate?  Just today, we got the replacement drive that has
> > precisely the right version of firmware and everything.  Still, when
> we
> > plugged in that drive, and "create simple volume" in the storagetek
> raid
> > utility, the new drive is 0.001 Gb smaller than the old drive.  I'm
> still
> > hosed.
> >
> > Are you saying I might benefit by sticking the SSD into some laptop,
> and
> > zero'ing the disk?  And then attach to the sun server?
> >
> > Are you saying I might benefit by finding some other way to make the
> drive
> > available, instead of using the storagetek raid utility?
> 
> Assuming you are also using a PCI LSI HBA from Sun that is managed with
> a utility called /opt/StorMan/arcconf and reports itself as the
> amazingly
> informative model number "Sun STK RAID INT" what worked for me was to
> run,
> arcconf delete (to delete the pre-configured volume shipped on the
> drive)
> arcconf create (to create a new volume)
> 
> What I observed was that
> arcconf getconfig 1
> would show the same physical device size for our existing drives and
> new
> ones from Sun, but they reported a slightly different logical volume
> size.
> I am fairly sure that was due to the Sun factory creating the initial
> volume
> with a different version of the HBA controller firmware then we where
> using
> to create our own volumes.
> 
> If I remember the sign correctly, the newer firmware creates larger
> logical
> volumes, and you really want to upgrade the firmware if you are going
> to
> be running multiple X25-E drives from the same controller.
> 
> I hope that helps.

Uggh.  This is totally different than my system.  But thanks for writing.
I'll take this knowledge, and see if we can find some analogous situation
with the StorageTek controller.  It still may be helpful, so again, thanks.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-01 Thread Edward Ned Harvey
> >If you have an ungraceful shutdown in the middle of writing stuff,
> while the
> >ZIL is disabled, then you have corrupt data.  Could be files that are
> >partially written.  Could be wrong permissions or attributes on files.
> >Could be missing files or directories.  Or some other problem.
> >
> >Some changes from the last 1 second of operation before crash might be
> >written, while some changes from the last 4 seconds might be still
> >unwritten.  This is data corruption, which could be worse than losing
> a few
> >minutes of changes.  At least, if you rollback, you know the data is
> >consistent, and you know what you lost.  You won't continue having
> more
> >losses afterward caused by inconsistent data on disk.
> 
> How exactly is this different from "rolling back to some other point of
> time?".
> 
> I think you don't quite understand how ZFS works; all operations are
> grouped in transaction groups; all the transactions in a particular
> group
> are commit in one operation.  I don't know what partial ordering ZFS

Dude, don't be so arrogant.  Acting like you know what I'm talking about
better than I do.  Face it that you have something to learn here.

Yes, all the transactions in a transaction group are either committed
entirely to disk, or not at all.  But they're not necessarily committed to
disk in the same order that the user level applications requested.  Meaning:
If I have an application that writes to disk in "sync" mode intentionally
... perhaps because my internal file format consistency would be corrupt if
I wrote out-of-order ... If the sysadmin has disabled ZIL, my "sync" write
will not block, and I will happily issue more write operations.  As long as
the OS remains operational, no problem.  The OS keeps the filesystem
consistent in RAM, and correctly manages all the open file handles.  But if
the OS dies for some reason, some of my later writes may have been committed
to disk while some of my earlier writes could be lost, which were still
being buffered in system RAM for a later transaction group.

This is particularly likely to happen, if my application issues a very small
sync write, followed by a larger async write, followed by a very small sync
write, and so on.  Then the OS will buffer my small sync writes and attempt
to aggregate them into a larger sequential block for the sake of accelerated
performance.  The end result is:  My larger async writes are sometimes
committed to disk before my small sync writes.  But the only reason I would
ever know or care about that would be if the ZIL were disabled, and the OS
crashed.  Afterward, my file has internal inconsistency.

Perfect examples of applications behaving this way would be databases and
virtual machines.


> Why do you think that a "Snapshot" has a "better quality" than the last
> snapshot available?

If you rollback to a snapshot from several minutes ago, you can rest assured
all the transaction groups that belonged to that snapshot have been
committed.  So although you're losing the most recent few minutes of data,
you can rest assured you haven't got file corruption in any of the existing
files.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-01 Thread Edward Ned Harvey
> This approach does not solve the problem.  When you do a snapshot,
> the txg is committed.  If you wish to reduce the exposure to loss of
> sync data and run with ZIL disabled, then you can change the txg commit
> interval -- however changing the txg commit interval will not eliminate
> the
> possibility of data loss.

The default commit interval is what, 30 seconds?  Doesn't that guarantee
that any snapshot taken more than 30 seconds ago will have been fully
committed to disk?

Therefore, any snapshot more than 30 seconds old is guaranteed to be
consistent on disk, while anything newer than that could possibly have some
later writes committed to disk ahead of some earlier writes from a few
seconds before.

If I'm wrong about this, please explain.

I am envisioning a database, which issues a small sync write, followed by a
larger async write.  Since the sync write is small, the OS would prefer to
defer the write and aggregate it into a larger block.  So the possibility of
the later async write being committed to disk before the older sync write is
a real risk.  The end result would be inconsistency in my database file.

If you rollback to a snapshot that's at least 30 seconds old, then all the
writes for that snapshot are guaranteed to be committed to disk already, and
in the right order.  You're acknowledging the loss of some known time worth
of data.  But you're gaining a guarantee of internal file consistency.
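
To make that concrete, the kind of recovery I have in mind looks roughly
like this (a sketch only; the pool, filesystem, and snapshot names are made
up):

  # Take periodic snapshots while running with the ZIL disabled:
  zfs snapshot tank/vmstore@2010-04-01-1200
  # After an ungraceful reboot, discard everything newer than a snapshot
  # that is comfortably older than the txg commit interval:
  zfs rollback -r tank/vmstore@2010-04-01-1200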

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-01 Thread Edward Ned Harvey
> Is that what "sync" means in Linux?  

A sync write is one in which the application blocks until the OS acks that
the write has been committed to disk.  An async write is given to the OS,
and the OS is permitted to buffer the write to disk at its own discretion.
Meaning the async write function call returns sooner, and the application is
free to continue doing other stuff, including issuing more writes.

Async writes are faster from the point of view of the application.  But sync
writes are done by applications that need to enforce write ordering for the
sake of internal consistency: applications that need to know the next
operation will not begin until the previous sync write has been committed to
disk.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] how can I remove files when the fiile system is full?

2010-04-02 Thread Edward Ned Harvey
> On opensolaris?  Did you try deleting any old BEs?

Don't forget to "zfs destroy rp...@snapshot"
In fact, you might start with destroying snapshots ... if there are any
occupying space.
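
For example (dataset and snapshot names here are hypothetical):

  # See which snapshots are holding space, sorted by how much they hold:
  zfs list -t snapshot -o name,used -s used
  # Then destroy the ones you no longer need:
  zfs destroy rpool/export/home@old-backup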

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Edward Ned Harvey
> > Seriously, all disks configured WriteThrough (spindle and SSD disks
> > alike)
> > using the dedicated ZIL SSD device, very noticeably faster than
> > enabling the
> > WriteBack.
> 
> What do you get with both SSD ZIL and WriteBack disks enabled?
> 
> I mean if you have both why not use both? Then both async and sync IO
> benefits.

Interesting, but unfortunately false.  Soon I'll post the results here.  I
just need to package them in a form suitable for the public and stick them
on a website.  But I'm fighting IT fires for now and haven't had the time
yet.

Roughly speaking, the following are approximately representative.  Of course
it varies based on tweaks of the benchmark and stuff like that.
Stripe 3 mirrors write through:  450-780 IOPS
Stripe 3 mirrors write back:  1030-2130 IOPS
Stripe 3 mirrors write back + SSD ZIL:  1220-2480 IOPS
Stripe 3 mirrors write through + SSD ZIL:  1840-2490 IOPS

Overall, I would say WriteBack is 2-3 times faster than naked disks, and the
SSD ZIL is 3-4 times faster than naked disks.  And for some reason, having
WriteBack enabled while you have the SSD ZIL actually hurts performance by
approximately 10%.  You're better off using the SSD ZIL with the disks in
WriteThrough mode.

That result is surprising to me.  But I have a theory to explain it.  When
you have WriteBack enabled, the OS issues a small write, and the HBA
immediately returns to the OS:  "Yes, it's on nonvolatile storage."  So the
OS quickly gives it another, and another, until the HBA write cache is full.
Now the HBA faces the task of writing all those tiny writes to disk, and the
HBA must simply follow orders, writing a tiny chunk to the sector it said it
would write, and so on.  The HBA cannot effectively consolidate the small
writes into a larger sequential block write.  But if you have the WriteBack
disabled, and you have a SSD for ZIL, then ZFS can log the tiny operation on
SSD, and immediately return to the process:  "Yes, it's on nonvolatile
storage."  So the application can issue another, and another, and another.
ZFS is smart enough to aggregate all these tiny write operations into a
single larger sequential write before sending it to the spindle disks.  

Long story short, the evidence suggests if you have SSD ZIL, you're better
off without WriteBack on the HBA.  And I conjecture the reasoning behind it
is because ZFS can write buffer better than the HBA can.
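
For anyone who wants to reproduce numbers in the same ballpark, the general
shape of a small sync-write benchmark is below.  This is only a sketch: the
iozone flags are taken from its man page, the sizes are illustrative, and
the target path is made up; it is not necessarily the exact invocation
behind the figures above.

  # -i 0 / -i 2 select the write and random-write tests, -O reports results
  # in operations per second, -o forces O_SYNC writes, and -r / -s set the
  # record and file sizes.
  iozone -i 0 -i 2 -O -o -r 4k -s 512m -f /tank/bench/iozone.tmp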

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Edward Ned Harvey
> I know it is way after the fact, but I find it best to coerce each
> drive down to the whole GB boundary using format (create Solaris
> partition just up to the boundary). Then if you ever get a drive a
> little smaller it still should fit.

It seems like it should be unnecessary.  It seems like extra work.  But
based on my present experience, I reached the same conclusion.

If my new replacement SSD with identical part number and firmware is 0.001
GB smaller than the original and hence unable to mirror, what's to prevent
the same thing from happening to one of my 1TB spindle disk mirrors?
Nothing.  That's what.

I take it back.  Me.  I am to prevent it from happening.  And the technique
to do so is precisely as you've said.  First slice every drive to be a
little smaller than actual.  Then later if I get a replacement device for
the mirror, that's slightly smaller than the others, I have no reason to
care.
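
A sketch of what that looks like in practice (device and pool names are
examples, and the exact undersizing is up to you):

  # In format(1M), label each disk and create slice 0 a little smaller than
  # the device, e.g. roughly 28 GB on a 29 GB disk:
  #   format c1t0d0  ->  partition  ->  0  ->  label
  # Then build the pool out of the slices rather than the whole disks:
  zpool create tank mirror c1t0d0s0 c1t1d0s0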

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Edward Ned Harvey
> > http://nfs.sourceforge.net/
> 
> I think B4 is the answer to Casper's question:

We were talking about ZFS, and under what circumstances data is flushed to
disk, in what way "sync" and "async" writes are handled by the OS, and what
happens if you disable ZIL and lose power to your system.

We were talking about C/C++ sync and async.  Not NFS sync and async.

I don't think anything relating to NFS is the answer to Casper's question,
or else Casper was simply switching context by asking it.  Don't get me
wrong, I have no objection to his question or anything; it's just that the
conversation has derailed, and now people are talking about NFS sync/async
instead of what happens when a C/C++ application issues sync/async writes
while the ZIL is disabled.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Edward Ned Harvey
> > I am envisioning a database, which issues a small sync write,
> followed by a
> > larger async write.  Since the sync write is small, the OS would
> prefer to
> > defer the write and aggregate into a larger block.  So the
> possibility of
> > the later async write being committed to disk before the older sync
> write is
> > a real risk.  The end result would be inconsistency in my database
> file.
> 
> Zfs writes data in transaction groups and each bunch of data which
> gets written is bounded by a transaction group.  The current state of
> the data at the time the TXG starts will be the state of the data once
> the TXG completes.  If the system spontaneously reboots then it will
> restart at the last completed TXG so any residual writes which might
> have occured while a TXG write was in progress will be discarded.
> Based on this, I think that your ordering concerns (sync writes
> getting to disk "faster" than async writes) are unfounded for normal
> file I/O.

So you're saying that while the OS is building txg's to write to disk, the
OS will never reorder the sequence in which individual write operations land
in the txg's.  That is, an application performing a small sync
write, followed by a large async write, will never have the second operation
flushed to disk before the first.  Can you support this belief in any way?

If that's true, if there's no increased risk of data corruption, then why
doesn't everybody just disable their ZIL all the time on every system?

The reason to have sync() and fsync() calls in C/C++ is so you can ensure
data is written to disk before you move on.  They are blocking calls that
don't return until the write has been committed.  The only reason you would
ever do this
is if order matters.  If you cannot allow the next command to begin until
after the previous one was completed.  Such is the situation with databases
and sometimes virtual machines.  

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Edward Ned Harvey
> hello
> 
> i have had this problem this week. our zil ssd died (apt slc ssd 16gb).
> because we had no spare drive in stock, we ignored it.
> 
> then we decided to update our nexenta 3 alpha to beta, exported the
> pool and made a fresh install to have a clean system and tried to
> import the pool. we only got a error message about a missing drive.
> 
> we googled about this and it seems there is no way to acces the pool
> !!!
> (hope this will be fixed in future)
> 
> we had a backup and the data are not so important, but that could be a
> real problem.
> you have  a valid zfs3 pool and you cannot access your data due to
> missing zil.

If you have zpool less than version 19 (when ability to remove log device
was introduced) and you have a non-mirrored log device that failed, you had
better treat the situation as an emergency.  Normally you can find your
current zpool version by running "zpool upgrade," but you cannot do that now if you're
in this failure state.  Do not attempt "zfs send" or "zfs list" or any other
zpool or zfs command.  Instead, do "man zpool" and look for "zpool remove."
If it says "supports removing log devices" then you had better use it to
remove your log device.  If it says "only supports removing hotspares or
cache" then your zpool is lost permanently.

If you are running Solaris, take it as a given that you do not have zpool
version 19.  If you are running OpenSolaris, I don't know at which build
zpool version 19 was introduced.  Your only hope is to "zpool remove" the
log device.  Use tar or cp or something to try and salvage your data out of
there.  Your
zpool is lost and if it's functional at all right now, it won't stay that
way for long.  Your system will soon hang, and then you will not be able to
import your pool.

Ask me how I know.
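
Roughly, the emergency sequence I am describing is the following; the pool
and device names are only examples:

  # Read zpool(1M) and check whether "zpool remove" supports log devices on
  # your release.  If it does, get the dead slog out of the pool right away:
  zpool remove tank c3t0d0
  # Then salvage what you can with tar or cp while the pool still imports.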

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Edward Ned Harvey
> ZFS recovers to a crash-consistent state, even without the slog,
> meaning it recovers to some state through which the filesystem passed
> in the seconds leading up to the crash.  This isn't what UFS or XFS
> do.
> 
> The on-disk log (slog or otherwise), if I understand right, can
> actually make the filesystem recover to a crash-INconsistent state (a

You're speaking the opposite of common sense.  If disabling the ZIL makes
the system faster *and* less prone to data corruption, please explain why we
don't all disable the ZIL?

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Edward Ned Harvey
> If you have zpool less than version 19 (when ability to remove log
> device
> was introduced) and you have a non-mirrored log device that failed, you
> had
> better treat the situation as an emergency.  

> Instead, do "man zpool" and look for "zpool
> remove."
> If it says "supports removing log devices" then you had better use it
> to
> remove your log device.  If it says "only supports removing hotspares
> or
> cache" then your zpool is lost permanently.

I take it back.  If you lost your log device on a zpool which is less than
version 19, then you *might* have a possible hope if you migrate your disks
to a later system.  You *might* be able to "zpool import" it on a later
version of the OS.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Edward Ned Harvey
> >Dude, don't be so arrogant.  Acting like you know what I'm talking
> about
> >better than I do.  Face it that you have something to learn here.
> 
> You may say that, but then you post this:

Acknowledged.  I read something arrogant, and I replied even more arrogant.
That was dumb of me.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Edward Ned Harvey
> Only a broken application uses sync writes
> sometimes, and async writes at other times.

Suppose there is a virtual machine, with virtual processes inside it.  Some
virtual process issues a sync write to the virtual OS, meanwhile another
virtual process issues an async write.  Then the virtual OS will sometimes
issue sync writes and sometimes async writes to the host OS.

Are you saying this makes qemu, and vbox, and vmware "broken applications?"

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Edward Ned Harvey
> The purpose of the ZIL is to act like a fast "log" for synchronous
> writes.  It allows the system to quickly confirm a synchronous write
> request with the minimum amount of work.  

Bob and Casper and some others clearly know a lot here.  But I'm hearing
conflicting information, and don't know what to believe.  Does anyone here
work on ZFS as an actual ZFS developer for Sun/Oracle?  Someone who can
claim, "I can answer this question; I wrote that code, or at least I have
read it"?

Questions to answer would be:

Is a ZIL log device used only by sync() and fsync() system calls?  Is it
ever used to accelerate async writes?

Suppose there is an application which sometimes does sync writes, and
sometimes async writes.  In fact, to make it easier, suppose two processes
open two files, one of which always writes asynchronously, and one of which
always writes synchronously.  Suppose the ZIL is disabled.  Is it possible
for writes to be committed to disk out-of-order?  Meaning, can a large block
async write be put into a TXG and committed to disk before a small sync
write to a different file is committed to disk, even though the small sync
write was issued by the application before the large async write?  Remember,
the point is:  ZIL is disabled.  Question is whether the async could
possibly be committed to disk before the sync.

I make the assumption that an uberblock is the term for a TXG after it is
committed to disk.  Correct?

At boot time, or "zpool import" time, what is taken to be "the current
filesystem?"  The latest uberblock?  Something else?

My understanding is that enabling a dedicated ZIL device guarantees sync()
and fsync() system calls block until the write has been committed to
nonvolatile storage, and attempts to accelerate by using a physical device
which is faster or more idle than the main storage pool.  My understanding
is that this provides two implicit guarantees:  (1) sync writes are always
guaranteed to be committed to disk in order, relative to other sync writes.
(2) In the event of OS halting or ungraceful shutdown, sync writes committed
to disk are guaranteed to be equal or greater than the async writes that
were taking place at the same time.  That is, if two processes both complete
a write operation at the same time, one in sync mode and the other in async
mode, then it is guaranteed the data on disk will never have the async data
committed before the sync data.

Based on this understanding, if you disable ZIL, then there is no guarantee
about order of writes being committed to disk.  Neither of the above
guarantees is valid anymore.  Sync writes may be completed out of order.
Async writes that supposedly happened after sync writes may be committed to
disk before the sync writes.

Somebody (Casper?) said it before, and now I'm starting to realize ... This
is also true of the snapshots.  If you disable your ZIL, then there is no
guarantee your snapshots are consistent either.  Rolling back doesn't
necessarily gain you anything.

The only way to guarantee consistency in the snapshot is to always
(regardless of ZIL enabled/disabled) give priority for sync writes to get
into the TXG before async writes.

If the OS does give priority for sync writes going into TXG's before async
writes (even with ZIL disabled), then after spontaneous ungraceful reboot,
the latest uberblock is guaranteed to be consistent.
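
One empirical check I can think of, short of an answer from someone who has
read the code (the pool name and workloads are only examples):

  # Watch per-vdev activity, including the "logs" vdev, once per second:
  zpool iostat -v tank 1
  # In another shell, run a purely async workload (e.g. a plain cp of a
  # large file), then a sync-heavy one (e.g. iozone -o), and compare whether
  # the log device shows any writes at all in the async case.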

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] To slice, or not to slice

2010-04-02 Thread Edward Ned Harvey
Momentarily, I will begin scouring the omniscient interweb for information,
but I'd like to know a little bit of what people would say here.  The
question is to slice, or not to slice, disks before using them in a zpool.

 

One reason to slice comes from recent personal experience.  One disk of a
mirror dies.  Replaced under contract with an identical disk.  Same model
number, same firmware.  Yet when it's plugged into the system, for an
unknown reason, it appears 0.001 GB smaller than the old disk, and therefore
unable to attach and un-degrade the mirror.  It seems logical this problem
could have been avoided if the device added to the pool originally had been
a slice somewhat smaller than the whole physical device.  Say, a slice of
28G out of the 29G physical disk.  Because later when I get the
infinitesimally smaller disk, I can always slice 28G out of it to use as the
mirror device.

 

There is some question about performance.  Is there any additional overhead
caused by using a slice instead of the whole physical device?

 

There is another question about performance.  One of my colleagues said he
saw some literature on the internet somewhere, saying ZFS behaves
differently for slices than it does on physical devices, because it doesn't
assume it has exclusive access to that physical device, and therefore caches
or buffers differently ... or something like that.

 

Any other pros/cons people can think of?

 

And finally, if anyone has experience doing this, and process
recommendations?  That is ... My next task is to go read documentation again,
to refresh my memory from years ago, about the difference between "format,"
"partition," "label," "fdisk," because those terms don't have the same
meaning that they do in other OSes.  And I don't know clearly right now,
which one(s) I want to do, in order to create the large slice of my disks.

 

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] To slice, or not to slice

2010-04-02 Thread Edward Ned Harvey
This might be unrelated, but along similar lines ...

 

I've also heard that the risk for unexpected failure of your pool is higher
if/when you reach 100% capacity.  I've heard that you should always create a
small ZFS filesystem within a pool, and give it some reserved space, along
with the filesystem that you actually plan to use in your pool.  Anyone care
to offer any comments on that?
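
For concreteness, the suggestion I've heard amounts to something like this
(a sketch; the dataset names and sizes are made up):

  # Reserve a little space you can release in an emergency:
  zfs create -o reservation=1G tank/spare-room
  # And/or cap the working filesystem so the pool itself never hits 100%:
  zfs set quota=900G tank/data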

 

 

 

From: zfs-discuss-boun...@opensolaris.org
[mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Edward Ned Harvey
Sent: Friday, April 02, 2010 5:23 PM
To: zfs-discuss@opensolaris.org
Subject: [zfs-discuss] To slice, or not to slice

 

Momentarily, I will begin scouring the omniscient interweb for information,
but I'd like to know a little bit of what people would say here.  The
question is to slice, or not to slice, disks before using them in a zpool.

 

One reason to slice comes from recent personal experience.  One disk of a
mirror dies.  Replaced under contract with an identical disk.  Same model
number, same firmware.  Yet when it's plugged into the system, for an
unknown reason, it appears 0.001 GB smaller than the old disk, and therefore
unable to attach and un-degrade the mirror.  It seems logical this problem
could have been avoided if the device added to the pool originally had been
a slice somewhat smaller than the whole physical device.  Say, a slice of
28G out of the 29G physical disk.  Because later when I get the
infinitesimally smaller disk, I can always slice 28G out of it to use as the
mirror device.

 

There is some question about performance.  Is there any additional overhead
caused by using a slice instead of the whole physical device?

 

There is another question about performance.  One of my colleagues said he
saw some literature on the internet somewhere, saying ZFS behaves
differently for slices than it does on physical devices, because it doesn't
assume it has exclusive access to that physical device, and therefore caches
or buffers differently ... or something like that.

 

Any other pros/cons people can think of?

 

And finally, if anyone has experience doing this, and process
recommendations?  That is ... My next task is to go read documentation again,
to refresh my memory from years ago, about the difference between "format,"
"partition," "label," "fdisk," because those terms don't have the same
meaning that they do in other OSes.  And I don't know clearly right now,
which one(s) I want to do, in order to create the large slice of my disks.

 

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] To slice, or not to slice

2010-04-03 Thread Edward Ned Harvey
Momentarily, I will begin scouring the omniscient interweb for information, but 
I'd like to know a little bit of what people would say here.  The question is 
to slice, or not to slice, disks before using them in a zpool.

One reason to slice comes from recent personal experience.  One disk of a 
mirror dies.  Replaced under contract with an identical disk.  Same model 
number, same firmware.  Yet when it's plugged into the system, for an unknown 
reason, it appears 0.001 GB smaller than the old disk, and therefore unable to 
attach and un-degrade the mirror.  It seems logical this problem could have 
been avoided if the device added to the pool originally had been a slice 
somewhat smaller than the whole physical device.  Say, a slice of 28G out of 
the 29G physical disk.  Because later when I get the infinitesimally smaller 
disk, I can always slice 28G out of it to use as the mirror device.

There is some question about performance.  Is there any additional overhead 
caused by using a slice instead of the whole physical device?

There is another question about performance.  One of my colleagues said he saw 
some literature on the internet somewhere, saying ZFS behaves differently for 
slices than it does on physical devices, because it doesn't assume it has 
exclusive access to that physical device, and therefore caches or buffers 
differently ... or something like that.

Any other pros/cons people can think of?

And finally, if anyone has experience doing this, and process recommendations?  
That is ... My next task is to go read documentation again, to refresh my 
memory from years ago, about the difference between "format," "partition," 
"label," "fdisk," because those terms don't have the same meaning that they do 
in other OSes...  And I don't know clearly right now, which one(s) I want to 
do, in order to create the large slice of my disks.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] To slice, or not to slice

2010-04-03 Thread Edward Ned Harvey
> > One reason to slice comes from recent personal experience. One disk
> of
> > a mirror dies. Replaced under contract with an identical disk. Same
> > model number, same firmware. Yet when it's plugged into the system,
> > for an unknown reason, it appears 0.001 Gb smaller than the old disk,
> > and therefore unable to attach and un-degrade the mirror. It seems
> > logical this problem could have been avoided if the device added to
> > the pool originally had been a slice somewhat smaller than the whole
> > physical device. Say, a slice of 28G out of the 29G physical disk.
> > Because later when I get the infinitesimally smaller disk, I can
> > always slice 28G out of it to use as the mirror device.
> >
> 
> What build were you running? The should have been addressed by
> CR6844090
> that went into build 117.

I'm running Solaris, but that's irrelevant.  The StorageTek array controller
itself reports the new disk as infinitesimally smaller than the one I want
to mirror.  Even before the drive is given to the OS, that's the way it is.
This is a Sun X4275 server.

BTW, I'm still degraded.  I haven't found an answer yet, and I am
considering breaking all my mirrors, creating a new pool on the freed disks
using partitions, and then rebuilding my pool with partitions on all disks.
The aforementioned performance concern is not as scary to me as running
with degraded redundancy.


> it's well documented. ZFS won't attempt to enable the drive's cache
> unless it has the physical device. See
> 
> http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#Storage_Pools

Nice.  Thank you.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] To slice, or not to slice

2010-04-03 Thread Edward Ned Harvey
>> And finally, if anyone has experience doing this, and process
>> recommendations?  That is … My next task is to go read documentation
>> again, to refresh my memory from years ago, about the difference
>> between “format,” “partition,” “label,” “fdisk,” because those terms
>> don’t have the same meaning that they do in other OSes…  And I don’t
>> know clearly right now, which one(s) I want to do, in order to create
>> the large slice of my disks.
> 
> The whole partition vs. slice thing is a bit fuzzy to me, so take this
> with a grain of salt. You can create partitions using fdisk, or slices
> using format. The BIOS and other operating systems (windows, linux,
> etc) will be able to recognize partitions, while they won't be able to
> make sense of slices. If you need to boot from the drive or share it
> with another OS, then partitions are the way to go. If it's exclusive
> to solaris, then you can use slices. You can (but shouldn't) use slices
> and partitions from the same device (eg: c5t0d0s0 and c5t0d0p0).

Oh, I managed to find a really good answer to this question.  Several
sources all say to do precisely the same procedure, and when I did it on a
test system, it worked perfectly.  Simple and easy to repeat.  So I think
this is the gospel method to create the slices, if you're going to create
slices:

http://docs.sun.com/app/docs/doc/806-4073/6jd67r9hu
and
http://www.solarisinternals.com/wiki/index.php/ZFS_Troubleshooting_Guide#Replacing.2FRelabeling_the_Root_Pool_Disk
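
In short, the procedure from those links boils down to something like this
(a sketch; the device and pool names are examples, not from my system):

  # Put an SMI label on the new disk and create slice 0 a little smaller
  # than the whole device (interactive):
  #   format -e c1t1d0  ->  partition  ->  modify slice 0  ->  label
  # Then attach the slice, not the whole disk, to the degraded mirror:
  zpool attach tank c1t0d0s0 c1t1d0s0
  zpool status tank        # watch the resilver complete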


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] To slice, or not to slice

2010-04-03 Thread Edward Ned Harvey
> On Apr 2, 2010, at 2:29 PM, Edward Ned Harvey wrote:
> > I've also heard that the risk for unexpected failure of your pool is
> higher if/when you reach 100% capacity.  I've heard that you should
> always create a small ZFS filesystem within a pool, and give it some
> reserved space, along with the filesystem that you actually plan to use
> in your pool.  Anyone care to offer any comments on that?
> 
> Define "failure" in this context?
> 
> I am not aware of a data loss failure when near full.  However, all
> file systems
> will experience performance degradation for write operations as they
> become
> full.

To tell the truth, I'm not exactly sure.  Because I've never lost any ZFS
pool or filesystem.  I only have it deployed on 3 servers, and only one of
those gets heavy use.  It only filled up once, and it didn't have any
problem.  So I'm only trying to understand "the great beyond," that which I
have never known myself.  Learn from other peoples' experience,
preventively.  Yes, I do embrace a lot of voodoo and superstition in doing
sysadmin, but that's just cuz stuff ain't perfect, and I've seen so many
things happen that were supposedly not possible.  (Not talking about ZFS in
that regard...  yet.)  Well, unless you count the issue I'm having right
now, with two identical disks appearing as different sizes...  But I don't
think that's a zfs problem.

I recall some discussion either here or on opensolaris-discuss or
opensolaris-help, where at least one or a few people said they had some sort
of problem or problems, and they were suspicious about the correlation
between it happening, and the disk being full.  I also recall talking to
some random guy at a conference who said something similar.  But it's all
vague.  I really don't know.

And I have nothing concrete.  Hence the post asking for people's comments.
Somebody might relate something they experienced less vague than what I
know.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] To slice, or not to slice

2010-04-03 Thread Edward Ned Harvey
> I would return the drive to get a bigger one before doing something as
> drastic as that. There might have been a hiccup in the production line,
> and that's not your fault.

Yeah, but I already have 2 of the replacement disks, both doing the same
thing.  One has a firmware newer than my old disk (so originally I thought
that was the cause, and requested another replacement disk).  But then we
got a replacement disk which is identical in every way to the failed disk
... but it still appears smaller for some reason.

So this happened on my SSD.  What's to prevent it from happening on one of
the spindle disks in the future?  Nothing that I know of ...  

So far, the idea of slicing seems to be the only preventive or corrective
measure.  Hence, wondering what pros/cons people would describe, beyond what
I've already thought up myself.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

