Re: [zfs-discuss] Server upgrade

2012-02-20 Thread David Dyer-Bennet

On Thu, February 16, 2012 11:18, Paul Kraus wrote:
> On Thu, Feb 16, 2012 at 11:42 AM, David Dyer-Bennet  wrote:
>
>> I'm seriously thinking of going Nexenta, as I think it would let me be a
>> little less of a sysadmin.  Solaris 11 express is tempting in its own
>> way
>> though, if I decide the price is tolerable.
>
> I looked at the Nexenta route, and while it is _very_ attractive,
> I need my home server to function as DHCP and DNS server as well (and
> a couple other services would be nice as well). Since Nexenta is a
> storage appliance, I could not go that route and get what I needed
> without hacking into it.

Ah, that might be a problem.  Not those specific services currently, but I
do now and then run things.  MRTG, and maybe Nagios, are on my to-do list
(though it's so much harder to get anything like that going on Solaris that
I'm tempted to run a Linux virtual server; that would be on the same box,
though, so it's still a problem).

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Server upgrade

2012-02-16 Thread David Dyer-Bennet

On Wed, February 15, 2012 18:06, Brandon High wrote:
> On Wed, Feb 15, 2012 at 9:16 AM, David Dyer-Bennet  wrote:
>> Is there an upgrade path from (I think I'm running Solaris Express) to
>> something modern?  (That could be an Oracle distribution, or the free
>
> There *was* an upgrade path from snv_134 to snv_151a (Solaris 11
> Express) but I don't know if Oracle still supports it. There was an
> intermediate step or two along the way (snv_134b I think?) to move
> from OpenSolaris to Oracle Solaris.
>
> As others mentioned, you could jump to OpenIndiana from your current
> version. You may not be able to move between OI and S11 in the future,
> so it's a somewhat important decision.

Thanks.  Given the pricing for commercial Solaris versions, I don't think
moving to them is likely to ever be important to me.  It looks like OI and
Nexenta are the viable choices I have to look at.
-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Server upgrade

2012-02-16 Thread David Dyer-Bennet

On Thu, February 16, 2012 13:31, Edward Ned Harvey wrote:
>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>> boun...@opensolaris.org] On Behalf Of David Dyer-Bennet
>>
>> This is already getting useful; "which has never worked for me" for
>> example is the sort of observation I find informative, since I've been
>> seeing your name around here for some time and have the general
>> impression
>> that you're not stupid or incompetent.
>
> Just because I talk a lot doesn't mean I'm not stupid or incompetent.  ;-)

I resemble that remark!

But, slightly more seriously, I've read what you said, not just noticed
the volume :-).

> "Never worked for me," in this case, basically means I tried upgrading
> from
> one opensolaris to another... which went horribly wrong...  And even when
> applying system updates (paid commercial solaris 10 support, applying
> security patches etc) those often cause problems too.  But I wouldn't call
> them "horribly wrong."

I've gotten at least that to work a few times.  But for me, keeping up
with OS upgrades is one of the most important sysadmin tasks.  Otherwise,
you're leaving unpatched vulnerabilities sitting around.

>> I was going to say the commercial version wasn't an option -- but on
>> consideration, I haven't done the research to determine that.  So that's
>> a
>> task (how hard can it be to find out how much they want?).
>
> You mean, how much it costs?  http://oracle.com  click on "Store," and
> "Solaris."  Looks like $1,000 per socket per year for 1-4 sockets.

You beat me to it.  And if that's the order of magnitude, then I was right
the first time: the commercial versions are completely out of the question.
I might, if I felt really friendly towards Oracle, consider a one-shot
payment of a tenth of that, or maybe a little more :-).
-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Server upgrade

2012-02-16 Thread David Dyer-Bennet

On Thu, February 16, 2012 08:54, Edward Ned Harvey wrote:
>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>> boun...@opensolaris.org] On Behalf Of David Dyer-Bennet
>>
>> While I'm not in need of upgrading my server at an emergency level, I'm
>> starting to think about it -- to be prepared (and an upgrade could be
>> triggered by a failure at this point; my server dates to 2006).
>
> There are only a few options for you to consider.  I don't know which ones
> support encryption, or which ones offer an upgrade path from your version
> of
> opensolaris, but I figure you can probably easily evaluate each of the
> options for your own purposes.
>
> No matter which you use, I assume you will be exporting the data pool, and
> later importing it.  But the OS will either need to be wiped and
> reinstalled
> from scratch, or obviously, follow your upgrade path (which has never
> worked
> for me; I invariably end up wiping the OS and reinstalling.  Good thing I
> keep documentation about how I configure my OS.)

This is already getting useful; "which has never worked for me" for
example is the sort of observation I find informative, since I've been
seeing your name around here for some time and have the general impression
that you're not stupid or incompetent.

Yeah, I'll try to export and import the pool.  AND I'll have three current
backups on external drives, at least one out of the house and at least one
in the house :-).  I'm kind of fond of this data, and wouldn't like
anything to happen to it (I could recover some of the last decade of
photography from optical disks, with a lot of work, and the online copies
would remain, but those aren't high-res).

> Nexenta, OpenIndiana, Solaris 11 Express (free version only permitted for
> certain uses, no regular updates available), or commercial Solaris.
>
> If you consider paying for solaris - at Oracle, you just pay them for "An
> OS" and they don't care which one you use.  Could be oracle linux,
> solaris,
> or solaris express.  I would recommend solaris 11 express based on
> personal
> experience.  It gets bugfixes and new features sooner than commercial
> solaris.

I was going to say the commercial version wasn't an option -- but on
consideration, I haven't done the research to determine that.  So that's a
task (how hard can it be to find out how much they want?).

Listing the options is extremely useful, in fact.  Even though I've heard
of all of them, seeing how you group things helps me too.

I'm seriously thinking of going Nexenta, as I think it would let me be a
little less of a sysadmin.  Solaris 11 express is tempting in its own way
though, if I decide the price is tolerable.
-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Server upgrade

2012-02-15 Thread David Dyer-Bennet
While I'm not in need of upgrading my server at an emergency level, I'm
starting to think about it -- to be prepared (and an upgrade could be
triggered by a failure at this point; my server dates to 2006).

I'm actually more concerned with software than hardware.  My load is
small, the current hardware is handling it no problem.  I don't see myself
as a candidate for dedup, so I don't need to add huge quantities of RAM. 
I'm handling compression on backups just fine (the USB external disks are
the choke-point, so compression actually speeds up the backups).

I'd like to be on a current software stream that I can easily update with
bug-fixes and new features.  The way I used to do that got broken in the
Oracle takeover.

I'm interested in encryption for my backups, if that's functional (and
safe) in current software versions.  I take copies off-site, so that's a
useful precaution.

Whatever I do, I'll of course make sure my backups are ALL up-to-date and
at least one is back off-site before I do anything drastic.

Is there an upgrade path from (I think I'm running Solaris Express) to
something modern?  (That could be an Oracle distribution, or the free
software fork, or some Nexenta distribution; my current data pool is 1.8T,
and I don't expect it to grow terribly fast, so the fully-featured free
version fits my needs, for example.)  Upgrading might save me from having to
reset all the user passwords (half a dozen, not a huge problem) and reinstall
the software packages I've added.

(uname -a says "SunOS fsfs 5.11 snv_134 i86pc i386 i86pc").

Or should I just export my pool and do a from-scratch install of
something?  (Then recreate the users and install any missing software. 
I've got some cron jobs, too.)

AND, what "something" should I upgrade to or install?  I've tried a couple
of times to figure out the alternatives and it's never really clear to me
what my good options are.

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] slow zfs send/recv speed

2011-11-16 Thread David Dyer-Bennet

On Tue, November 15, 2011 17:05, Anatoly wrote:
> Good day,
>
> The speed of send/recv is around 30-60 MBytes/s for initial send and
> 17-25 MBytes/s for incremental. I have seen lots of setups with 1 disk
> to 100+ disks in pool. But the speed doesn't vary in any degree. As I
> understand 'zfs send' is a limiting factor. I did tests by sending to
> /dev/null. It worked out too slow and absolutely not scalable.
> None of the cpu/memory/disk activity was at peak load, so there is room
> for improvement.

What you're probably seeing with incremental sends is that the disks being
read are hitting their IOPS limits.  zfs send does random reads all over
the place -- every block that's changed since the last incremental send is
read, in TXG order.  So that's essentially random reads all over the disk.
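
If you want to confirm that, it's worth watching the physical side while the
send runs; the data disks should be pegged on small reads.  Something like
this (pool name is just a placeholder, interval in seconds):

  iostat -xn 5              # per-device physical statistics
  zpool iostat -v mypool 5  # per-vdev view of the same traffic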

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Remove corrupt files from snapshot

2011-11-16 Thread David Dyer-Bennet

On Tue, November 15, 2011 10:07, sbre...@hotmail.com wrote:


> Would it make sense to do "zfs scrub" regularly and have a report sent,
> i.e. once a day, so discrepancy would be noticed beforehand? Is there
> anything readily available in the Freebsd ZFS package for this?

If you're not scrubbing regularly, you're losing out on one of the key
benefits of ZFS.  In nearly all fileserver situations, a good amount of
the content is essentially archival, infrequently accessed but important
now and then.  (In my case it's my collection of digital and digitized
photos.)

A weekly scrub combined with a decent backup plan will detect bit-rot
before the backups with the correct data cycle into the trash (and, with
redundant storage like mirroring or RAID, the scrub will probably be able
to fix the error without resorting to restoring files from backup).
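
As for automating it, a root crontab entry plus a status check would do it;
roughly like this (pool name and schedule are made up, and the zpool path may
differ on FreeBSD):

  # start a scrub every Sunday at 03:00
  0 3 * * 0 /usr/sbin/zpool scrub tank
  # mail a report the next morning; "zpool status -x" only complains
  # about unhealthy pools
  0 3 * * 1 /usr/sbin/zpool status -x | mailx -s "weekly zpool report" root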
-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] slow zfs send/recv speed

2011-11-16 Thread David Dyer-Bennet

On Tue, November 15, 2011 20:08, Edward Ned Harvey wrote:
>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>> boun...@opensolaris.org] On Behalf Of Anatoly
>>
>> The speed of send/recv is around 30-60 MBytes/s for initial send and
>> 17-25 MBytes/s for incremental. I have seen lots of setups with 1 disk
>
> I suggest watching zpool iostat before, during, and after the send to
> /dev/null.  Actually, I take that back - zpool iostat seems to measure
> virtual IOPS, as I just did this on my laptop a minute ago, I saw 1.2k
> ops,
> which is at least 5-6x higher than my hard drive can handle, which can
> only
> mean it's reading a lot of previously aggregated small blocks from disk,
> which are now sequentially organized on disk.  How do you measure physical
> iops?  Is it just regular iostat?  I have seriously put zero effort into
> answering this question (sorry.)
>
> I have certainly noticed a delay in the beginning, while the system thinks
> about stuff for a little while to kick off an incremental... And it's
> acknowledged and normal that incrementals are likely fragmented all over
> the
> place so you could be IOPS limited (hence watching the iostat).
>
> Also, whenever I sit and watch it for long times, I see that it varies
> enormously.  For 5 minutes it will be (some speed), and for 5 minutes it
> will be 5x higher...
>
> Whatever it is, it's something we likely are all seeing, but probably just
> ignoring.  If you can find it in your heart to just ignore it too, then
> great, no problem.  ;-)  Otherwise, it's a matter of digging in and
> characterizing to learn more about it.

I see rather variable I/O stats while sending incremental backups.  The
receiver is a USB disk, so fairly slow, but I get 30MB/s in a good
stretch.  I'm compressing the ZFS filesystem on the receiving end, but
much of my content is already-compressed photo files, so it doesn't make a
huge difference.  It helps some, though, and at 30MB/s there's no shortage
of CPU horsepower to handle the compression.

The raw files are around 12MB each, probably not fragmented much (they're
just copied over from memory cards).  For a small number of the files,
there's a Photoshop file that's much bigger (sometimes more than 1GB, if
it's a stitched panorama with layers of changes).  And then there are
sidecar XMP files (mostly two per image), and, for most of them,
web-resolution images of around 100kB.

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Adding mirrors to an existing zfs-pool]

2011-07-28 Thread David Dyer-Bennet

On Tue, July 26, 2011 09:55, Cindy Swearingen wrote:
>
> Subject: Re: [zfs-discuss] Adding mirrors to an existing zfs-pool
> Date: Tue, 26 Jul 2011 08:54:38 -0600
> From: Cindy Swearingen 
> To: Bernd W. Hennig 
> References: <342994905.11311662049567.JavaMail.Twebapp@sf-app1>
>
> Hi Bernd,
>
> If you are talking about attaching 4 new disks to a non redundant pool
> with 4 disks, and then you want to detach the previous disks then yes,
> this is possible and a good way to migrate to new disks.
>
> The new disks must be the equivalent size or larger than the original
> disks.
>
> See the hypothetical example below.
>
> If you mean something else, then please provide your zpool status
> output.
>
> Thanks,
>
> Cindy
>
>
> # zpool status tank
>   pool: tank
>   state: ONLINE
>   scan: resilvered 1018K in 0h0m with 0 errors on Fri Jul 22 15:54:52 2011
> config:
>
>  NAMESTATE READ WRITE CKSUM
>  tankONLINE   0 0 0
>  c4t1d0  ONLINE   0 0 0
>  c4t2d0  ONLINE   0 0 0
>  c4t3d0  ONLINE   0 0 0
>  c4t4d0  ONLINE   0 0 0
>
>
> # zpool attach tank c4t1d0 c6t1d0
> # zpool attach tank c4t2d0 c6t2d0
> # zpool attach tank c4t3d0 c6t3d0
> # zpool attach tank c4t4d0 c6t4d0
>
> The above syntax will create 4 mirrored pairs of disks.

I was somewhat surprised when I first learned of this.  I now think of it
as "a single disk in ZFS is treated as a one-disk mirror"; previously,
single disks seemed to me to be very different objects from mirrors!

I'm still impressed by the ability to attach and detach arbitrary numbers
of disks to mirrors.  It makes upgrading mirrored disks very very safe,
since I can perform the entire procedure without ever reducing redundancy
below my starting point (using the classic attach new, resilver, detach
old sequence, repeated for however many disks were in the original
mirror).
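
Spelled out, the classic cycle for one mirror looks roughly like this (device
names borrowed from the example above):

  zpool attach tank c4t1d0 c6t1d0   # new disk joins alongside the old one
  zpool status tank                 # wait here until the resilver completes
  zpool detach tank c4t1d0          # then drop the old disk
  zpool attach tank c4t2d0 c6t2d0   # repeat for the other side of the mirror
  zpool detach tank c4t2d0          # (again, only after the resilver finishes)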

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SSD vs "hybrid" drive - any advice?

2011-07-26 Thread David Dyer-Bennet

On Mon, July 25, 2011 10:03, Orvar Korvar wrote:
> "There is at least a common perception (misperception?) that devices
> cannot process TRIM requests while they are 100% busy processing other
> tasks."
>
> Just to confirm; SSD disks can do TRIM while processing other tasks?

"Processing" the request just means flagging the blocks, though, right? 
And the actual benefits only acrue if the garbage collection / block
reshuffling background tasks get a chance to run?

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quick zfs send -i performance questions

2011-05-04 Thread David Dyer-Bennet

On Tue, May 3, 2011 19:39, Rich Teer wrote:

> I'm playing around with nearline backups using zfs send | zfs recv.
> A full backup made this way takes quite a lot of time, so I was
> wondering: after the initial copy, would using an incremental send
> (zfs send -i) make the process much quicker, because only the stuff that
> had changed between the previous snapshot and the current one would be
> copied?  Is my understanding of incremental zfs send correct?

Yes, that works.  In my setup, a full backup takes 6 hours (about 800GB of
data to an external USB 2 drive), the incremental maybe 20 minutes even if
I've added several gigabytes of images.
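
The incremental form is just the two snapshot names; roughly (dataset and
snapshot names are made up):

  zfs snapshot -r tank/data@2011-05-04
  zfs send -i tank/data@2011-05-03 tank/data@2011-05-04 | \
      zfs receive -F backup/data    # -F rolls the target back to its
                                    # newest snapshot before applying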

> Also related to this is a performance question.  My initial test involved
> copying a 50 MB zfs file system to a new disk, which took 2.5 minutes
> to complete.  The strikes me as being a bit high for a mere 50 MB;
> are my expectation realistic or is it just because of my very budget
> concious set up?  If so, where's the bottleneck?

In addition to issues others have mentioned, incremental send follows the
order the blocks were written in rather than disk order, so that can
sometimes be bad.
-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Network video streaming [Was: Re: X4540 no next-gen product?]

2011-04-11 Thread David Dyer-Bennet
On 04/08/2011 07:22 PM, J.P. King wrote:

> No, I haven't tried a S7000, but I've tried other kinds of network
> storage and from a design perspective, for my applications, it doesn't
> even make a single bit of sense. I'm talking about high-volume
> real-time
> video streaming, where you stream 500-1000 (x 8Mbit/s) live streams
> from
> a machine over UDP. Having to go over the network to fetch the data
> from
> a different machine is kind of like building a proxy which doesn't
> really do anything - if the data is available from a different machine
> over the network, then why the heck should I just put another machine
> in
> the processing path? For my applications, I need a machine with as few
> processing components between the disks and network as possible, to
> maximize throughput, maximize IOPS and minimize latency and jitter.

Amusing history here -- the "Thumper" was developed at Kealia specifically
for their streaming video server.  Sun then bought them, and continued the
video server project until Oracle ate them (the Sun Streaming Video
Server).  That product supported 80,000 (not a typo) 4 megabit/sec video
streams if fully configured.  (Not off a single thumper, though, I don't
believe.)

However, there was a custom hardware board handling streaming, feeding into
multiple line cards with multiple 10G optical Ethernet interfaces.  And a
LOT of buffer memory; the card could support 2TB of RAM, though I believe
real installations were using 512GB.

Data got from the Thumpers to the streaming board over Ethernet, though. 
In big chunks -- 10MB maybe?  (Been a while; I worked on the user
interface level, but had little to do with the streaming hardware.)

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS send/receive to Solaris/FBSD/OpenIndiana/Nexenta VM guest?

2011-04-06 Thread David Dyer-Bennet

On Tue, April 5, 2011 14:38, Joe Auty wrote:

> Migrating to a new machine I understand is a simple matter of ZFS
> send/receive, but reformatting the existing drives to host my existing
> data is an area I'd like to learn a little more about. In the past I've
> asked about this and was told that it is possible to do a send/receive
> to accommodate this, and IIRC this doesn't have to be to a ZFS server
> with the same number of physical drives?

The internal structure of the pool (how many vdevs, and what kind) is
irrelevant to zfs send / receive.  So I routinely send from a pool of 3
mirrored pairs of disks to a pool of one large drive, for example (it's
how I do my backups).   I've also gone the other way once :-( (It's good
to have backups).
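
In other words, something like this works regardless of how either pool is
laid out (pool and snapshot names are made up):

  zfs snapshot -r tank@migrate
  zfs send -R tank@migrate | zfs receive -F -d newpool
      # -R carries child filesystems, properties, and snapshots along;
      # -d re-roots them under the receiving pool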

I'm not 100.00% sure I understand what you're asking; does that answer it?

Mind you, this can be slow.  On my little server (under 1TB filled) the
full backup takes about 7 hours (largely because the single large external
drive is a USB drive; the bottleneck is the USB).  Luckily an incremental
backup is rather faster.

> How about getting a little more crazy... What if this entire server
> temporarily hosting this data was a VM guest running ZFS? I don't
> foresee this being a problem either, but with so much at stake I thought
> I would double check :) When I say temporary I mean simply using this
> machine as a place to store the data long enough to wipe the original
> server, install the new OS to the original server, and restore the data
> using this VM as the data source.

I haven't run ZFS extensively in VMs (mostly just short-lived small test
setups).  From my limited experience, and what I've heard on the list,
it's solid and reliable, though, which is what you need for that
application.

> Also, more generally, is ZFS send/receive mature enough that when you do
> data migrations you don't stress about this? Piece of cake? The
> difficulty of this whole undertaking will influence my decision and the
> whole timing of all of this.

A full send / receive has been reliable for a long time.  With a real
(large) data set, it's often a long run.  It's often done over a network,
and any network outage can break the run, and at that point you start
over, which can be annoying.  If the servers themselves can't stay up for
10 or 20 hours you presumably aren't ready to put them into production
anyway :-).
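
The usual arrangement when it does go over a network is an ssh pipe,
sometimes with a buffering tool like mbuffer in the middle; a rough sketch
(host and dataset names are made up):

  zfs send -i tank/fs@monday tank/fs@tuesday | \
      ssh backuphost zfs receive -F backup/fs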

> I'm also thinking that a ZFS VM guest might be a nice way to maintain a
> remote backup of this data, if I can install the VM image on a
> drive/partition large enough to house my data. This seems like it would
> be a little less taxing than rsync cronjobs?

I'm a big fan of rsync, in cronjobs or wherever.  What it won't do is
properly preserve ZFS ACLs, and ZFS snapshots, though.  I moved from using
rsync to using zfs send/receive for my backup scheme at home, and had
considerable trouble getting that all working (using incremental
send/receive when there are dozens of snapshots new since last time).  But
I did eventually get up to recent enough code that it's working reliably
now.

If you can provision big enough data stores for your VM to hold what you
need, that seems a reasonable approach to me, but I haven't tried anything
much like it, so my opinion is, if you're very lucky, maybe worth what you
paid for it.
-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [illumos-Developer] zfs incremental send?

2011-03-30 Thread David Dyer-Bennet

On Tue, March 29, 2011 07:39, Richard Elling wrote:
> On Mar 29, 2011, at 3:10 AM, Roy Sigurd Karlsbakk 
> wrote:
>
>> - Original Message -
>>> On 2011-Mar-29 02:19:30 +0800, Roy Sigurd Karlsbakk
>>>  wrote:
>>>> Is it (or will it) be possible to do a partial/resumable zfs
>>>> send/receive? If having 30TB of data and only a gigabit link, such
>>>> transfers takes a while, and if interrupted, will require a
>>>> re-transmit of all the data.
>>>
>>> zfs send/receive works on snapshots: The smallest chunk of data that
>>> can be sent/received is the delta between two snapshots. There's no
>>> way to do a partial delta - defining the endpoint of a partial
>>> transfer or the starting point for resumption is effectively a
>>> snapshot.
>>
>> I know that's how it works, I'm merely pointing out that changing this
>> to something resumable would be rather nice, since an initial transfer
>> or 30 or 300 terabytes may easily be interrupted.
>
> In the UNIX tradition, the output and input are pipes. This allows you to
> add whatever transport you'd like for moving the bits. There are many that
> offer protection against network interruptions. Look for more, interesting
> developments in this area soon...

Name three :-).  I don't happen to have run into any that I can remember.

And in any case, that doesn't actually help my situation, where I'm
running both processes on the same box (the receive is talking to an
external USB disk that I disconnect and take off-site after the receive is
complete).  A system crash (or power shutdown, or whatever) during this
process seems to make the receiving pool unimportable.  Possibly I could
use recovery tricks to step back a TXG or two until I get something valid,
and then manually remove the snapshots added to get back to the initial
state, and then I could start the incremental again; in practice, I
haven't made that work, and just do another full send to start over (7
hours, not too bad really).
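
For the record, the recovery trick I mean is the roll-back-a-few-transactions
import mode; I haven't made it work here, but the incantation is roughly this
(made-up pool name):

  zpool import -nF usbbackup   # dry run: report what would be discarded
  zpool import -F usbbackup    # actually discard the last few transactions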

Anyway, the incremental send/receive seems to be the fragile point in my
backup scheme as well.
-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Good SLOG devices?

2011-03-02 Thread David Dyer-Bennet

On Tue, March 1, 2011 10:35, Garrett D'Amore wrote:

>   a) do you need an SLOG at all?  Some workloads (asynchronous ones) will
> never benefit from an SLOG.

I've been fighting the urge to maybe do something about ZIL (which is what
we're talking about here, right?).  My load is CIFS, not NFS (so not
synchronous, right?), but there are a couple of areas that are significant
to me where I do decent-size (100MB to 1GB) sequential writes (to
newly-created files).  On the other hand, when those writes seem to me to
be going slowly, the disk access lights are mostly off, suggesting that
the disks may not be what's holding me up.  I can test that by saving to
local disk and comparing times, and also maybe running zpool iostat.

This is a home system, lightly used; the performance issue is me sitting
waiting while big Photoshop files save.  So of some interest to me
personally, and not at ALL like what performance issues on NAS usually
look like.  It's on a UPS, so I'm not terribly worried about losses on
power failure; and I'd just lose my work since the last save, generally,
at worst.

I might not believe the disk access lights on the box (Chenbro chassis,
with two 4-drive hot-swap bays for the data disks; driven off the
motherboard SATA plus a Supermicro 8-port SAS controller with SAS-to-SATA
cables).  In doing a drive upgrade just recently, I got rather confusing
results with the lights; perhaps the controller or the drive model made a
difference in when the activity lights came on.

The VDEVs in the pool are mirror pairs.  It's been expanded twice by
adding VDEVs and once by replacing devices in one VDEV.  So the load is
probably fairly unevenly spread across them just now.  My desktop connects
to this server over gigabit ethernet (through one switch; the boxes sit
next to each other on a shelf over my desk).

I'll do more research before spending money.  But as a question of general
theory, should a decent separate intent log device help for a single-user
sequential write sequence in the 100MB to 1GB size range?

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Good SLOG devices?

2011-03-02 Thread David Dyer-Bennet

On Tue, March 1, 2011 16:32, Rocky Shek wrote:
> David,
>
> STEC/DataON ZeusRAM(Z4RZF3D-8UC-DNS) SSD now available for users in
> channel.
>
> It is 8GB DDR3 RAM based SAS SSD protected by supercapacitor and NVRAM
> 16GB.
>
> It is designed for ZFS ZIL with low latency
>
> http://dataonstorage.com/zeusram

Says "call for price".  I know what that means, it means "If you have to
ask, you can't afford it."

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Drive i/o anomaly

2011-02-09 Thread David Dyer-Bennet

On Wed, February 9, 2011 04:51, Matt Connolly wrote:

> Nonetheless,  I still find it odd that the whole io system effectively
> hangs up when one drive's queue fills up. Since the purpose of a mirror is
> to continue operating in the case of one drive's failure, I find it
> frustrating that the system slows right down so much because one drive's
> i/o queue is full.

I see what you're saying.  But I don't think mirror systems really try to
handle asymmetric performance.  They either treat the drives equivalently,
or else they decide one of them is "broken" and don't use it at all.

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS/Drobo (Newbie) Question

2011-02-08 Thread David Dyer-Bennet

On 2011-02-08 21:39, Brandon High wrote:

> On Tue, Feb 8, 2011 at 12:53 PM, David Dyer-Bennet  wrote:
>> Wait, are you saying that the handling of errors in RAIDZ and mirrors is
>> completely different?  That it dumps the mirror disk immediately, but
>> keeps trying to get what it can from the RAIDZ disk?  Because otherwise,
>> your assertion doesn't seem to hold up.
>
> I think he meant that if one drive in a mirror dies completely, then
> any single read error on the remaining drive is not recoverable.
>
> With raidz2 (or a 3-way mirror for that matter), if one drive dies
> completely, you still have redundancy.


Sure, a 2-way mirror has only 100% redundancy; if one disk dies, there's no 
more redundancy.  Same for a RAIDZ1 -- if one disk dies, no more redundancy.
But a 4-drive RAIDZ1 has roughly twice the odds of a 2-drive mirror of
having a drive die.  And sure, RAIDZ2 has more redundancy -- as does a 3-way 
mirror.


Or a 48-way mirror (I read a report from somebody who mirrored all the 
drives in a Thumper box, just to see if he could).


--
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS/Drobo (Newbie) Question

2011-02-08 Thread David Dyer-Bennet

On Tue, February 8, 2011 13:03, Roy Sigurd Karlsbakk wrote:
>> Or you could stick strictly to mirrors; 4 pools 2x2T, 2x2T, 2x750G,
>> 2x1.5T. Mirrors are more flexible, give you more redundancy, and are
>> much easier to work with.
>
> Easier to work with, yes, but a RAIDz2 will statistically be safer than a
> set of mirrors, since in many cases, you lose a drive and during
> resilver, you find bad sectors on another drive in the same VDEV,
> resulting in data corruption. With RAIDz2 (or 3), the chance of these
> errors to be on the same place on all drives is quite minimal. With a
> (striped?) mirror, a single bitflip on the 'healthy' drive will involve
> data corruption.

Wait, are you saying that the handling of errors in RAIDZ and mirrors is
completely different?  That it dumps the mirror disk immediately, but
keeps trying to get what it can from the RAIDZ disk?  Because otherwise,
your assertion doesn't seem to hold up.

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS/Drobo (Newbie) Question

2011-02-08 Thread David Dyer-Bennet

On Mon, February 7, 2011 14:59, David Dyer-Bennet wrote:
>
> On Sat, February 5, 2011 11:54, Gaikokujin Kyofusho wrote:
>> Thank you kebabber. I will try out indiana and virtual box to play
>> around
>> with it a bit.
>>
>> Just to make sure I understand your example, if I say had a 4x2tb
>> drives,
>> 2x750gb, 2x1.5tb drives etc then i could make 3 groups (perhaps 1 raidz1
>> +
>> 1 mirrored + 1 mirrored), in terms of accessing them would they just be
>> mounted like 3 partitions or could it all be accessed like one big
>> partition?
>
> A ZFS pool can contain many vdevs; you could put the three groups you
> describe into one pool, and then assign one (or more) file-systems to that
> pool.  Putting them all in one pool seems to me the natural way to handle
> it; they're all similar levels of redundancy.  It's more flexible to have
> everything in one pool, generally.
>
> (You could also make separate pools; my experience, for what it's worth,
> argues for making pools based on redundancy and performance (and only
> worry about BIG differences), and assign file-systems to pools based on
> needs for redundancy and performance.  And for my home system I just have
> one big data pool, currently consisting of 1x1TB, 2x400GB, 2x400GB, plus
> 1TB hot spare.)

Typo; I don't in fact have a non-redundant vdev in my main data pool! 
It's *2*x1TB at the start of that list.

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS

2011-02-07 Thread David Dyer-Bennet

On Mon, February 7, 2011 14:49, Yi Zhang wrote:
> On Mon, Feb 7, 2011 at 3:14 PM, Bill Sommerfeld 
> wrote:
>> On 02/07/11 11:49, Yi Zhang wrote:
>>>
>>> The reason why I
>>> tried that is to get the side effect of no buffering, which is my
>>> ultimate goal.
>>
>> ultimate = "final".  you must have a goal beyond the elimination of
>> buffering in the filesystem.
>>
>> if the writes are made durable by zfs when you need them to be durable,
>> why
>> does it matter that it may buffer data while it is doing so?
>>
>>                                                -
>> Bill
>
> If buffering is on, the running time of my app doesn't reflect the
> actual I/O cost. My goal is to accurately measure the time of I/O.
> With buffering on, ZFS would batch up a bunch of writes and change
> both the original I/O activity and the time.

I'm not sure I understand what you're trying to measure (which seems to be
your top priority).  Achievable performance with ZFS would be better using
suitable caching; normally that's the benchmark statistic people would
care about.

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS/Drobo (Newbie) Question

2011-02-07 Thread David Dyer-Bennet

On Sat, February 5, 2011 11:54, Gaikokujin Kyofusho wrote:
> Thank you kebabber. I will try out indiana and virtual box to play around
> with it a bit.
>
> Just to make sure I understand your example, if I say had a 4x2tb drives,
> 2x750gb, 2x1.5tb drives etc then i could make 3 groups (perhaps 1 raidz1 +
> 1 mirrored + 1 mirrored), in terms of accessing them would they just be
> mounted like 3 partitions or could it all be accessed like one big
> partition?

A ZFS pool can contain many vdevs; you could put the three groups you
describe into one pool, and then assign one (or more) file-systems to that
pool.  Putting them all in one pool seems to me the natural way to handle
it; they all have similar levels of redundancy.  It's more flexible to have
everything in one pool, generally.
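
For the drive mix you describe, creating such a pool in one step would look
roughly like this (device names are made up; raidz1 on the four 2TB drives,
mirrors for the other two pairs):

  zpool create tank \
      raidz  c1t0d0 c1t1d0 c1t2d0 c1t3d0 \
      mirror c2t0d0 c2t1d0 \
      mirror c3t0d0 c3t1d0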

(You could also make separate pools; my experience, for what it's worth,
argues for making pools based on redundancy and performance (and only
worry about BIG differences), and assign file-systems to pools based on
needs for redundancy and performance.  And for my home system I just have
one big data pool, currently consisting of 1x1TB, 2x400GB, 2x400GB, plus
1TB hot spare.)

Or you could stick strictly to mirrors; 4 pools 2x2T, 2x2T, 2x750G,
2x1.5T.  Mirrors are more flexible, give you more redundancy, and are much
easier to work with.
-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Newbie question

2011-02-07 Thread David Dyer-Bennet

On Sat, February 5, 2011 03:54, Gaikokujin Kyofusho wrote:

> From what I understand using ZFS one could setup something like RAID 6
> (RAID-Z2?) but with the ability to use drives of varying
> sizes/speeds/brands and able to add additional drives later. Am I about
> right? If so I will continue studying up on this if not then I guess I
> need to continue exploring different options. Thanks!!

IMHO, your best bet for this kind of configuration is to use mirror pairs,
not RAIDZ*.  Because...

Things you can't do with RAIDZ*:

You cannot remove a vdev from a pool.

You cannot make a RAIDZ* vdev smaller (fewer disks).

You cannot make a RAIDZ* vdev larger (more disks).

To increase the storage capacity of a RAIDZ* vdev you need to replace all
the drives, one at a time, waiting for resilver between replacements
(resilver times can be VERY long with big modern drives).  And during each
resilver, your redundancy will be reduced by 1 -- meaning a RAIDZ array
would have NO redundancy during the resilver.  (And activity in the pool
is high during the resilver -- meaning the chances of any marginal drive
crapping out are higher than normal during the resilver.)

With mirrors, you can add new space by simply adding two drives (add a new
mirror vdev).

You can upgrade an existing mirror by replacing only two drives.

You can upgrade an existing mirror without reducing redundancy below your
starting point ever -- you attach a new drive, wait for the resilver to
complete (at this point you have a three-way mirror), then detach one of
the original drives; repeat for another new drive and the other original
drive.

Obviously, using mirrors requires you to buy more drives for any given
amount of usable space.

I must admit that my 8-bay hot-swap ZFS server cost me a LOT more than a
Drobo (but then I bought in 2006, too).

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs-discuss Digest, Vol 64, Issue 13

2011-02-07 Thread David Dyer-Bennet

On Sun, February 6, 2011 13:01, Michael Armstrong wrote:
> Additionally, the way I do it is to draw a diagram of the drives in the
> system, labelled with the drive serial numbers. Then when a drive fails, I
> can find out from smartctl which drive it is and remove/replace without
> trial and error.

Having managed to muddle through this weekend without loss (though with a
certain amount of angst and duplication of efforts), I'm in the mood to
label things a bit more clearly on my system :-).

smartctl doesn't seem to be on my system, though.  I'm running
snv_134.  I'm still pretty badly lost in the whole repository /
package thing with Solaris; most of my brain cells were already
occupied with Red Hat, Debian, and Perl package information :-(.
Where do I look?
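
My guess is that something like the following would tell me, assuming
smartmontools is actually published in the repository I'm pointed at, which
I haven't verified:

  pkg search -r smartctl      # ask the repository which package delivers it
  pkg install smartmontools   # package name is a guess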

Are the controller port IDs, the "C9T3D0" things that ZFS likes,
reasonably stable?  They won't change just because I add or remove
drives, right; only maybe if I change controller cards?

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Replace block devices to increase pool size

2011-02-07 Thread David Dyer-Bennet

On Sun, February 6, 2011 08:41, Achim Wolpers wrote:

> I have a zpool built up from two vdevs (one mirror and one raidz). The
> raidz is built up from 4x1TB HDs. When I successively replace each 1TB
> drive with a 2TB drive will the capacity of the raidz double after the
> last block device is replaced?

You may have to manually set the autoexpand=on pool property; I found
yesterday that I had to (in my case on a mirror that I was upgrading).  It
probably depends on what pool version you created things at and/or what
version you're running now.

I replaced the drives in one of the three mirror vdevs in my main pool
over this last weekend, and it all went quite smoothly, but I did have to
turn on autoexpand at the end of the process to see the new space.
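
Concretely, that was just this (pool name made up):

  zpool set autoexpand=on tank   # let the pool grow into larger replacement disks
  zpool get autoexpand tank      # verify the setting
  zpool list tank                # the extra space should show up here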
-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Identifying drives (SATA), question about hot spare allocation

2011-02-06 Thread David Dyer-Bennet

Following up to myself, I think I've got things sorted, mostly.

1.  The thing I was most sure of, I was wrong about.  Some years back, I 
must have split the mirrors so that they used different brand disks.  I 
probably did this, maybe even accidentally, when I had to restore from 
backups at one point.   I suppose I could have physically labeled the 
carriers...no, that's crazy talk!


2.  The dd trick doesn't produce reliable activity light activation in 
my system.  I think some of the drives and/or controllers only turn on 
the activity light for writes.


3.  However, in spite of all this, I have replaced the disks in mirror-0 
with the bigger disks (via attach-new-resilver-detach-old), and added 
the third drive I bought as a hot spare.  All without having to restore 
from backups.


4.  AND I know which physical drive the detached 400GB drive is.  It 
occurs to me I could make that a second hot spare -- there are 4 
remaining 400GB drives in the pool, so it's useful for 2/3 of the 
failures by drive count.


Leading to a new question -- is ZFS smart about hot spare sizes?  Will 
it skip over too-small drives?  Will it, even better, prefer smaller 
drives to larger so long as they are big enough (thus leaving the big 
drives for bigger failures)?
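
For reference, what I'd run to add it is just the following (device name is
a placeholder); whether ZFS then picks spares intelligently is the part I'm
asking about:

  zpool add zp1 spare c5t1d0   # offer the detached 400GB drive as a spare
  zpool remove zp1 c5t1d0      # spares can be removed again if this turns
                               # out to be a bad idea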


--
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Identifying drives (SATA)

2011-02-06 Thread David Dyer-Bennet

On 2011-02-06 05:58, Orvar Korvar wrote:

> Will this not ruin the zpool?  If you overwrite one of the discs in the
> zpool, won't the zpool break, so that you need to repair it?


Without quoting I can't tell what you think you're responding to, but 
from my memory of this thread, I THINK you're forgetting how dd works. 
The dd commands being proposed to create drive traffic are all read-only 
accesses, so they shouldn't damage anything.


--
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Drive id confusion

2011-02-05 Thread David Dyer-Bennet
Solaris and/or ZFS are badly confused about drive IDs.  The "c5t0d0" 
names are very far removed from the real world, and possibly they've 
gotten screwed up somehow.  Is devfsadm supposed to fix those, or does 
it only delete excess?


Reason I believe it's confused:

zpool status shows mirror-0 on c9t3d0, c9t2d0, and c9t5d0.  But format 
shows the one remaining Seagate 400GB drive at c5t0d0 (my initial pool 
was two of those; I replaced one with a Samsung 1TB earlier today).  Now 
the mirror with three drives in it is my very first mirror, which has to 
have the one remaining Seagate drive in it (given that I removed one 
Seagate drive; otherwise I could be confused about order of creation vs. 
mirror numbering).


I'm thinking either Solaris' appalling mess of device files is somehow 
scrod, or else ZFS is confused in its reporting (perhaps because of 
cache file contents?).  Is there anything I can do about either of 
these?  Does devfsadm really create the apporpirate /dev/dsk and etc. 
files based on what's present?  Would deleting the cache file while the 
pool is exported, and then searching for and importing the pool, help?
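
What I have in mind is roughly this (cache file path from memory; I'd move
it aside rather than delete it):

  zpool export zp1
  mv /etc/zfs/zpool.cache /etc/zfs/zpool.cache.old   # force a fresh device scan
  zpool import -d /dev/dsk zp1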


How worried should I be?  (I've got current backups).
--
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] /dev/dsk files missing

2011-02-05 Thread David Dyer-Bennet
And devfsadm doesn't create them.  Am I looking at the wrong program, or 
what?

--
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Identifying drives (SATA)

2011-02-05 Thread David Dyer-Bennet
I've got a small home fileserver, a Chenbro case with 8 hot-swap bays. 
Of course, at this level, I don't have cute little lights next to each 
drive that the OS knows about and can control to indicate things to me.


The configuration I think I have is three mirror pairs.  I've got 
motherboard SATA connections, and an add-in SAS card with SAS-to-SATA 
cabling (all drives are SATA), and I've tried to wire it so each mirror 
is split across the two controllers.  However -- the old disks were 
already a pool before.  So if I put them in the "wrong" physical slots, 
when I imported the pool it would have still found them.  So I could 
have the disks in slots that aren't what I expected, without knowing it.


I'm planning to upgrade the first mirror by attaching new, larger, 
drives, letting the resilver finish, and eventually detaching the old 
drives.  I just installed the first new drive, located what controller 
it was on, and typed an attach command that did what I wanted:


bash-4.0$ zpool status zp1
  pool: zp1
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress for 0h4m, 3.13% done, 2h5m to go
config:

NAMESTATE READ WRITE CKSUM
zp1 ONLINE   0 0 0
  mirror-0  ONLINE   0 0 0
c9t3d0  ONLINE   0 0 0
c5t1d0  ONLINE   0 0 0
c9t5d0  ONLINE   0 0 0  14.0G resilvered
  mirror-1  ONLINE   0 0 0
c9t4d0  ONLINE   0 0 0
c6t1d0  ONLINE   0 0 0
  mirror-2  ONLINE   0 0 0
c9t2d0  ONLINE   0 0 0
c5t0d0  ONLINE   0 0 0

errors: No known data errors

As you can see, the new drive being resilvered is in fact associated 
with the first mirror, as I had intended.  (The old drives in the first 
mirror are older than in the second two, and all three are the same 
size, so that's definitely the one to replace first.)


HOWEVER...the activity lights on the drives aren't doing what I expect. 
 The activity light on the new drive is on pretty solidly (that I 
expected), but the OTHER activity puzzles me.  (User activity is so 
close to nil that I'm quite confident that's not confusing me; 95% + of 
the access right now is the resilver.  Besides, usage could light up 
other drives, but it couldn't turn off the lights on the ones being 
resilvered.)


At first, I saw the second drive in the rack light up.  I believe that 
to be c5t1d0, the second disk in mirror-0, and it's the drive I 
specified for the old drive in the attach command.


However, soon I started seeing the fourth drive in the rack light up.  I 
believe that to be c6t1d0; part of mirror-1, and thus having no place in 
this resilver.  It remained active.  And after a while, the second drive 
activity light went off.  For some minutes now, I've been seeing 
activity ONLY on the new drive, and on drive 4 (the one I don't think is 
part of mirror 0).


The activity lights aren't connected by separate cables, so I don't see 
how I could have them hooked up differently from the disks.


It's clear from zpool status that I have attached the new drive to the 
right mirror.  So things are fine for now, I can let the resilver run to 
completion.   I can detach one of the old drives fine, because that's 
done with logical names, and those are shown in zpool status, so I have 
no doubt which logical names are the old drives in mirror 0.


However, eventually it will be time to physically remove the old drives. 
 If I remove only one at a time, I "shouldn't" cause a disaster even if 
I pull the wrong one, and I can tell by checking spool status right away 
whether I pulled the right or wrong one.  But this gets me into what I 
regard as risky territory -- if I pull a live drive, I'm going to 
suddenly need to know the commands needed to reattach it.  Can somebody 
point me at clear examples of that (or post them)?
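
My best guess from the man page, for the case where I've pulled a healthy
drive and plugged it back in, is something like this (device name is
whichever one I pulled), but I'd appreciate confirmation:

  zpool online zp1 c5t1d0   # tell ZFS the device is back; it should resilver
  zpool clear zp1           # clear the error counters afterwards
  zpool status zp1          # confirm everything is ONLINE again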


I just found zpool iostat -v; now that I'm seeing traffic on the 
individual drives in the pool, it's clearly reading from both the old 
drives, and writing to the new drive, exactly as expected.  But only one 
activity light is lit on any of the old drives.


Is there a clever way to figure out which drive is which?  And if I have 
to fall back on removing a drive I think is right, and seeing if that's 
true, what admin actions will I have to perform to get the pool back to 
safety?  (I've got backups, but it's a pain to restore of course.) 
(Hmmm; in single-user mode, use dd to read huge chunks of one disk, and 
see which lights come on?  Do I even need to be in single-user mode for that?)
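
The sort of read-only command I have in mind, one disk at a time (device
name is just an example):

  dd if=/dev/rdsk/c5t0d0p0 of=/dev/null bs=1024k count=2048   # read ~2GB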

Re: [zfs-discuss] BOOT, ZIL, L2ARC one one SSD?

2011-01-06 Thread David Dyer-Bennet

On Thu, December 23, 2010 22:45, Edward Ned Harvey wrote:
>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>> boun...@opensolaris.org] On Behalf Of Bill Werner
>>
>> on a single 60GB SSD drive, use FDISK to create 3 physical partitions, a
> 20GB
>> for boot, a 30GB for L2ARC and a 10GB for ZIL?   Or is 3 physical
>> Solaris
>> partitions on a disk not considered the entire disk as far as ZFS is
>> concerned?
>
> You can do that.  Other people have before.  But IMHO, it demonstrates a
> faulty way of thinking.
>
> "SSD's are big and cheap now, so I can buy one of these high performance
> things, and slice it up!"  In all honesty, GB availability is not your
> limiting factor.  Speed is your limiting factor.  That's the whole point
> of
> buying the thing in the first place.  If you have 3 SSD's, they're each
> able
> to talk 3Gbit/sec at the same time.  But if you buy one SSD which is 3x
> larger, you save money but you get 1/3 the speed.

Boot, at least, largely doesn't overlap with any significant traffic to
ZIL, for example.

And where I come from, even at work, money doesn't grow on trees.  Sure,
three separate SSDs will clearly perform better.  They will also cost 3x
as much.  (Or more, if you don't have three free bays and controller
ports.)

The question we often have to address is, "what's the biggest performance
increase we can get for $500".  I considered multiple rotating disks vs.
one SSD for that reason, for example.

Yeah, anybody quibbling about $500 isn't building top-performance
enterprise-grade storage.  We do know this.  It's still where a whole lot
of us live -- especially those running a home NAS.

> That's not to say there's never a situation where it makes sense.  Other
> people have done it, and maybe it makes sense for you.  But probably not.

Yeah, okay, maybe we're not completely disagreeing.
-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] file-inherit and dir-inherit at toplevel of ZFS CIFS share

2010-10-25 Thread David Dyer-Bennet
It looks like permissions don't descend properly from the top-level share
in CIFS; I had to set them on the next level down to get the intended
results (including on lower levels; they seem to inherit properly from the
second level, just not from the top).  Is this a known behavior, or am I
confused and setting myself up for trouble later?

More broadly, is there any good guidance on "best practices" for using ACLs
with ZFS and CIFS shares?  For example, there are so many defined
attributes, some of them with the same short-form letter (I think one is
for directories and one is for files in that case, but that's not
documented that I can find), that I find myself wondering what "standard
bundles" of permissions would be useful.   Is it generally better to have
separate permissions to inherit for files and directories, or can most
things you want be accomplished with just one?
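
To make that concrete, the sort of "standard bundle" I have in mind looks
roughly like this; the dataset path and group name are invented, and the
syntax should be double-checked against chmod(1) before trusting it:

  # grant the group full control at the share root, inherited by both
  # new files and new directories created anywhere beneath it
  chmod A+group:staff:full_set:file_inherit/dir_inherit:allow /zp1/share

  # verify what actually got set, including the inheritance flags
  ls -dv /zp1/share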

Back to specifics again -- I was running into a problem where a user on
the Solaris box could rename a file or directory, but an XP box
authenticating as the same user could not.  This was the one that seemed
to be solved by setting the permissions again one level down (dunno what
happens with new top-level items yet).  Is this normal behavior, or
something that makes sense?  It's terribly weird.  (In Windows, I could
right-click and create the "new directory" or whatever, but when I then
filled in the name I wanted and hit enter, I got a permission error.  I
could just leave it named "new directory", though.  And I could rename it
on the Linux side as the same user that failed to rename it from the
Windows side.)

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Balancing LVOL fill?

2010-10-20 Thread David Dyer-Bennet

On Wed, October 20, 2010 04:24, Tuomas Leikola wrote:

> I wished for a more aggressive write balancer but that may be too much
> to ask for.

I don't think it can be too much to ask for.  Storage servers have long
enough lives that adding disks to them is a routine operation; to the
extent that that's a problem, that really needs to be fixed.

However, it's not the sort of thing one should hold one's breath waiting for!

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Finding corrupted files

2010-10-11 Thread David Dyer-Bennet

On Fri, October 8, 2010 04:47, Stephan Budach wrote:
> So, I decided to give tar a whirl, after zfs send encountered the next
> corrupted file, resulting in an I/O error, even though scrub ran
> successfully w/o any errors.

I must say that the idea of scrub running w/o error while corrupted
files, detectable by zfs send, apparently exist is very disturbing.
Background scrubbing, with the block checksums that make it more
meaningful than just reading the disk blocks, was the key thing that
drew me into ZFS, and this seems to suggest that it doesn't work.

Does your sequence of tests happen to provide evidence that the problem
isn't new errors appearing, sometimes after a scrub and before the send? 
For example, have you done 1) scrub finds no error, 2) send finds error,
3) scrub finds no error?  (with nothing in between that could have cleared
or fixed the error).
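
For what it's worth, the only tool I know of for listing the affected
files after the fact is along these lines (pool name is a placeholder):

  # -v lists permanent errors together with the file names they map to,
  # once the damage has been recorded by a scrub or by ordinary reads
  zpool status -v poolname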

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Increase size of 2-way mirror

2010-10-06 Thread David Dyer-Bennet

On Wed, October 6, 2010 14:14, Tony MacDoodle wrote:
> Is it possible to add 2 disks to increase the size of the pool below?
>
> NAME STATE READ WRITE CKSUM
>   testpool ONLINE 0 0 0
> mirror-0 ONLINE 0 0 0
> c1t2d0 ONLINE 0 0 0
> c1t3d0 ONLINE 0 0 0
> mirror-1 ONLINE 0 0 0
> c1t4d0 ONLINE 0 0 0
> c1t5d0 ONLINE 0 0 0

You have two ways to increase the size of this pool (sanely).

First, you can add a third mirror vdev.  I think that's what you're
specifically asking about.  You do this with the "zpool add ..." command,
see man page.

Second, you can add (zpool attach) two larger disks to one of the existing
mirror vdevs, wait until the resilvers have finished, and then detach the
two original (smaller) disks.  At that point (with recent versions; with
older versions you have to set a property) the vdev will expand to use the
full capacity of the new larger disks, and that space will become
available in the pool.
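
Roughly, the two options look like this; the device names are made up, so
check them against your own zpool status output, and see the man page
before trusting my syntax:

  # Option 1: add a third mirror vdev
  zpool add testpool mirror c1t6d0 c1t7d0

  # Option 2: grow mirror-0 in place with two larger disks
  zpool attach testpool c1t2d0 c2t0d0
  zpool attach testpool c1t3d0 c2t1d0
  # wait for the resilvers to finish (watch zpool status), then
  zpool detach testpool c1t2d0
  zpool detach testpool c1t3d0
  # on older versions you also need the property mentioned above:
  zpool set autoexpand=on testpool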

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] TLER and ZFS

2010-10-06 Thread David Dyer-Bennet

On Tue, October 5, 2010 16:47, casper@sun.com wrote:
>
>
>>My immediate reaction to this is "time to avoid WD drives for a while";
>>until things shake out and we know what's what reliably.
>>
>>But, um, what do we know about say the Seagate Barracuda 7200.12 ($70),
>>the SAMSUNG Spinpoint F3 1TB ($75), or the HITACHI Deskstar 1TB 3.5"
>>($70)?
>
>
> I've seen several important features when selecting a drive for
> a mirror:
>
>   TLER (the ability of the drive to timeout a command)

I went and got what detailed documentation I could on a couple of the
Seagate drives last night, and I couldn't find anything on how they
behaved in that sort of error cases.  (I believe TLER is a WD-specific
term, but I didn't just search, I read them through.)

So that's inconvenient.  How do we find out about that sort of thing?

>   sector size (native vs virtual)

Richard Elling said ZFS now handles drives with 4k physical sectors that
report 512-byte sectors okay in default setups; but somebody immediately
asked for version info, so I'm still watching this one.

>   power use (specifically at home)

Hadn't thought about that.  But when I'm upgrading drives, I figure I'm
always going to come out better on power than when I started.

>   performance (mostly for work)

I can't bring myself to buy below 7200RPM, but it's probably foolish
(except that other obnoxious features tend to come in the "green" drives).

>   price

Yeah, well.  I'm cheap.

> I've heard scary stories about a mismatch of the native sector size and
> unaligned Solaris partitions (4K sectors, unaligned cylinder).

So have I.  Sounds like you get read-modify-write actions for non-aligned
accesses.
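
One sanity check I know of, assuming a reasonably recent zdb, is to look
at the ashift the pool was created with:

  # ashift=9 means 512-byte alignment, ashift=12 means 4K alignment
  zdb -C zp1 | grep ashift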

I hope the next generation of drives admit to being 4k sectors, and that
ZFS will be prepared to use them sensibly.  But I'm not sure I'm willing
to wait for that; the oldest drives in my box are now 4 years old, and I'm
about ready for the next capacity upgrade.

> I was pretty happy with the WD drives (except for the one with a
> seriously broken cache) but I see the reasons not to pick WD drives
> over the 1TB range.

And the big ones are what pretty much everybody is using at home. 
Capacity and price are vastly more important than performance for most of
us.

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] TLER and ZFS

2010-10-06 Thread David Dyer-Bennet

On Tue, October 5, 2010 17:20, Richard Elling wrote:
> On Oct 5, 2010, at 2:06 PM, Michael DeMan wrote:
>>
>> On Oct 5, 2010, at 1:47 PM, Roy Sigurd Karlsbakk wrote:

>>> Well, here it's about 60% up and for 150 drives, that makes a wee
>>> difference...

>> Understood on 1.6  times cost, especially for quantity 150 drives.

> One service outage will consume far more in person-hours and downtime than
> this little bit of money.  Penny-wise == Pound-foolish?

That looks to be true, yes (going back to the actual prices, 150 drives
would cost $6000 extra for the enterprise versions).

It's still quite annoying to be jerked around by people charging 60% extra
for changing a timeout in the firmware, and carefully making it NOT
user-alterable.

Also, the non-TLER versions are a constant threat to anybody running home
systems, who might quite reasonably think they could put those in a home
server.

(Yeah, I know the enterprise versions have other differences.  I'm not
nearly so sure I CARE about the other differences, in the size servers I'm
working with.)
-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] TLER and ZFS

2010-10-05 Thread David Dyer-Bennet

On Tue, October 5, 2010 15:30, Roy Sigurd Karlsbakk wrote:


> I just discovered WD Black drives are rumored not to be set to allow TLER.
> Does anyone know how much performance impact the lack of TLER might have
> on a large pool? Choosing Enterprise drives will cost about 60% more, and
> on a large install, that means a lot of money...

My immediate reaction to this is "time to avoid WD drives for a while";
until things shake out and we know what's what reliably.

But, um, what do we know about say the Seagate Barracuda 7200.12 ($70),
the SAMSUNG Spinpoint F3 1TB ($75), or the HITACHI Deskstar 1TB 3.5"
($70)?

This is not a completely theoretical question to me; it's getting on
towards time to at least consider replacing my oldest mirrored pair,
which are 400GB Seagates, I think, dating from 2006.  I'd want something
at least twice as big (to make the space upgrade worthwhile), and I'm
expecting to buy three of them rather than just two, because I think
it's time to add a hot spare to the system.  (I currently have 3 pairs
of data disks and two more bays; I think a hot spare is a better use
for them than a fourth pair.  Safety of the data is very important,
performance is adequate, and I need a modest capacity upgrade, but the
whole pool is currently only 1.2TB usable, not large.)

On the third hand, there's the Barracuda 7200.11 1.5TB for only $75, which
is a really small price increment for a big space increment.

The WD RE3 1TB is $130 (all these prices are from Newegg just now). 
That's very close to TWICE the price of the competing 1TB drives.

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] When Zpool has no space left and no snapshots

2010-09-29 Thread David Dyer-Bennet

On Wed, September 29, 2010 15:17, Matt Cowger wrote:
> You can truncate a file:
>
> Echo "" > bigfile
>
> That will free up space without the 'rm'

Copy-on-write; the new version gets written to the disk before the old
version is released, it doesn't just overwrite.  AND, if it's in any
snapshots, the old version doesn't get released.
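
Before deciding what to free, it's worth seeing where the space actually
went; something like this shows the split between live data and
snapshots (the exact column set varies a bit by build):

  zfs list -o space -r zp1
  zfs list -t snapshot -r -o name,used zp1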

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] When Zpool has no space left and no snapshots

2010-09-29 Thread David Dyer-Bennet

On Wed, September 22, 2010 21:25, Aleksandr Levchuk wrote:

> I ran out of space, consequently could not rm or truncate files. (It
> make sense because it's a copy-on-write and any transaction needs to
> be written to disk. It worked out really well - all I had to do is
> destroy some snapshots.)
>
> If there are no snapshots to destroy, how to prepare for a situation
> when a ZFS pool loses its last free byte?

Add some more space somewhere around 90%, or earlier :-).

If you do get stuck,  you can add another vdev when full, too. Just
remember that you're stuck with whatever you add "forever", since there's
no way to remove a vdev from a pool.
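
One preparation I've seen suggested, and this is only a sketch (the
dataset name is invented): keep a small reservation around that you can
release in an emergency instead of scrambling for hardware.

  # set aside space that ordinary writes can't consume
  zfs create zp1/slack
  zfs set reservation=2G zp1/slack
  # when the pool wedges at 100%, give the space back
  zfs set reservation=none zp1/slack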
-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] non-ECC Systems and ZFS for home users (was: Please warn a home user against OpenSolaris under VirtualBox under WinXP ; ))

2010-09-23 Thread David Dyer-Bennet

On Thu, September 23, 2010 01:33, Alexander Skwar wrote:
> Hi.
>
> 2010/9/19 R.G. Keen 
>
>> and last-generation hardware is very, very cheap.
>
> Yes, of course, it is. But, actually, is that a true statement? I've read
> that it's *NOT* advisable to run ZFS on systems which do NOT have ECC
> RAM. And those cheapo last-gen hardware boxes quite often don't have
> ECC, do they?

Last-generation server hardware supports ECC, and was usually populated
with ECC.  Last-generation desktop hardware rarely supports ECC, and was
even more rarely populated with ECC.

The thing is, last-generation server hardware is, um, marvelously adequate
for most home setups (the problem *I* see with it, for many home setups,
is that it's *noisy*).  So, if you can get it cheap in a sound-level that
fits your needs, that's not at all a bad choice.

I'm running a box I bought new as a home server, but it's NOW at least
last-generation hardware (2006), and it's still running fine; in
particular the CPU load remains trivial compared to what the box supports
(not doing compression or dedup on the main data pool, though I do
compress the backup pools on external USB disks).  (It does have ECC; even
before some of the cases leading to that recommendation were explained on
that list, I just didn't see the percentage in not protecting the memory.)

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] resilver = defrag?

2010-09-17 Thread David Dyer-Bennet

On Thu, September 16, 2010 14:04, Miles Nordin wrote:
>>>>>> "dd" == David Dyer-Bennet  writes:
>
> dd> Sure, if only a single thread is ever writing to the disk
> dd> store at a time.
>
> video warehousing is a reasonable use case that will have small
> numbers of sequential readers and writers to large files.  virtual
> tape library is another obviously similar one.  basically, things
> which used to be stored on tape.  which are not uncommon.

Haven't encountered those kinds of things first-hand, so I didn't think of
them.  Yes, those sound like they'd have lower numbers of simultaneous
users by a lot for one reason or another.

> AIUI ZFS does not have a fragmentation problem for these cases unless
> you fill past 96%, though I've been trying to keep my pool below 80%
> because .

As various people have said recently, we have no way to measure it that we
know of.  I don't feel I have a problem in my own setup, but it's so
low-stress that if ZFS doesn't work there, it wouldn't work anywhere.

> dd> This situation doesn't exist with any kind of enterprise disk
> dd> appliance, though; there are always multiple users doing
> dd> stuff.
>
> the point's relevant, but I'm starting to tune out every time I hear
> the word ``enterprise.''  seems it often decodes to:

Picked the phrase out of an orifice; trying to distinguish between storage
for key corporate data assets, and other uses.

>  (1) ``fat sacks and no clue,'' or
>
>  (2) ``i can't hear you i can't hear you i have one big hammer in my
>  toolchest and one quick answer to all questions, and everything's
>  perfect! perfect, I say.  unless you're offering an even bigger
>  hammer I can swap for this one, I don't want to hear it,'' or
>
>  (3) ``However of course I agree that hammers come in different
>  colors, and a wise and experienced craftsman will always choose
>  the color of his hammer based on the color of the nail he's
>  hitting, because the interface between hammers and nails doesn't
>  work well otherwise.  We all know here how to match hammer and
>  nail colors, but I don't want to discuss that at all because it's
>  a private decision to make between you and your salesdroid.
>
>  ``However, in this forum here we talk about GREEN NAILS ONLY.  If
>  you are hitting green nails with red hammers and finding they go
>  into the wood anyway then you are being very unprofessional
>  because that nail might have been a bank transaction. --posted
>  from opensolaris.org''

#3 is particularly amusing!
-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] resilver = defrag?

2010-09-16 Thread David Dyer-Bennet

On Wed, September 15, 2010 16:18, Edward Ned Harvey wrote:

> For example, if you start with an empty drive, and you write a large
> amount
> of data to it, you will have no fragmentation.  (At least, no significant
> fragmentation; you may get a little bit based on random factors.)  As life
> goes on, as long as you keep plenty of empty space on the drive, there's
> never any reason for anything to become significantly fragmented.

Sure, if only a single thread is ever writing to the disk store at a time.

This situation doesn't exist with any kind of enterprise disk appliance,
though; there are always multiple users doing stuff.
-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] resilver = defrag?

2010-09-14 Thread David Dyer-Bennet
The difference between multi-user thinking and single-user thinking is
really quite dramatic in this area.  I came up the time-sharing side
(PDP-8, PDP-11, DECSYSTEM-20); TOPS-20 didn't have any sort of disk
defragmenter, and nobody thought one was particularly desirable, because
the normal access pattern of a busy system was spread all across the disk
packs anyway.

On a desktop workstation, it makes some sense to think about loading big
executable files fast -- that's something the user is sitting there
waiting for, and there's often nothing else going on at that exact moment.
 (There *could* be significant things happening in the background, but
quite often there aren't.)  Similarly, loading a big "document"
(single-file book manuscript, bitmap image, or whatever) happens at a
point where the user has requested it and is waiting for it right then,
and there's mostly nothing else going on.

But on really shared disk space (either on a timesharing system, or a
network file server serving a good-sized user base), the user is competing
for disk activity (either bandwidth or IOPs, depending on the access
pattern of the users).  Generally you don't get to load your big DLL in
one read -- and to the extent that you don't, it doesn't matter much how
it's spread around the disk, because the head won't be in the same spot
when you get your turn again.
-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Configuration questions for Home File Server (CPU cores, dedup, checksum)?

2010-09-13 Thread David Dyer-Bennet

On Tue, September 7, 2010 15:58, Craig Stevenson wrote:

> 3.  Should I consider using dedup if my server has only 8Gb of RAM?  Or,
> will that not be enough to hold the DDT?  In which case, should I add
> L2ARC / ZIL or am I better to just skip using dedup on a home file server?

I would not consider using dedup in the current state of the code.  I hear
too many horror stories.

Also, why do you think you'd get much benefit?  It takes pretty big blocks
of exact bit-for-bit duplication to actually trigger the code, and you're
not going to find them in compressed image (including motion picture /
video) or audio files, for example (the main things that take up much
space on most home servers).
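
If you do want a rough idea of what dedup would buy you, and how big the
DDT would get, I believe recent builds can simulate it without turning
anything on:

  # simulated dedup statistics for an existing pool; the table size it
  # reports, times a few hundred bytes per entry, is a crude RAM estimate
  zdb -S zp1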
-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] resilver = defrag?

2010-09-13 Thread David Dyer-Bennet

On Mon, September 13, 2010 07:14, Edward Ned Harvey wrote:
>> From: Richard Elling [mailto:rich...@nexenta.com]
>>
>> This operational definition of "fragmentation" comes from the
>> single-user, single-tasking world (PeeCees). In that world, only one
>> thread writes files from one application at one time. In those cases,
>> there is a reasonable expectation that a single file's blocks might be
>> contiguous on a single disk. That isn't the world we live in, where we
>> have RAID, multi-user, or multi-threaded environments.
>
> I don't know what you're saying, but I'm quite sure I disagree with it.
>
> Regardless of multithreading, multiprocessing, it's absolutely possible to
> have contiguous files, and/or file fragmentation.  That's not a
> characteristic which depends on the threading model.
>
> Also regardless of raid, it's possible to have contiguous or fragmented
> files.  The same concept applies to multiple disks.

The attitude that it *matters* seems to me to have developed, and be
relevant only to, single-user computers.

Regardless of whether a file is contiguous or not, by the time you read
the next chunk of it, in the multi-user world some other user is going to
have moved the access arm of that drive.  Hence, it doesn't matter if the
file is contiguous or not.
-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Storage server hardwae

2010-08-26 Thread David Dyer-Bennet

On Thu, August 26, 2010 13:58, Tom Buskey wrote:

> I usually see 17 MB/s max on an external USB 2.0 drive.

Interesting; I routinely see 27 MB/s peaking to 30 MB/s on the cheap WD 1TB
external drives I use for backups.  (Backup is probably best case, the
only user of that drive is a zfs receive process.)

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Opensolaris is apparently dead

2010-08-16 Thread David Dyer-Bennet

On Mon, August 16, 2010 15:35, Joerg Schilling wrote:

> I know of ext* performance checks where people did run gtar to unpack
> a linux kernel archive and these people did nothing but metering the
> wall clock time for gtar.
>
> I repeated this test and it turned out, that Linux did not even start
> to write to the disk when gtar finished.

As a test of ext? performance, that does seem to be lacking something!

I guess it's a consequence of the low sound levels of modern disk drives;
you go back enough years, that error couldn't have passed unnoticed :-) .

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Opensolaris is apparently dead

2010-08-16 Thread David Dyer-Bennet

On Mon, August 16, 2010 12:36, Bob Friesenhahn wrote:

> Can someone provide a link to the requisite source files so that we
> can see the copyright statements?  It may well be that Oracle assigned
> the copyright to some other party.


2  * Copyright (C) 2007 Oracle.  All rights reserved.
3  *
4  * This program is free software; you can redistribute it and/or
5  * modify it under the terms of the GNU General Public
6  * License v2 as published by the Free Software Foundation.

<http://git.kernel.org/?p=linux/kernel/git/mason/btrfs-unstable.git;a=blob;f=fs/btrfs/root-tree.c;h=2d958be761c84556b39c60afa3b0f3fd75d6;hb=HEAD>

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Opensolaris is apparently dead

2010-08-16 Thread David Dyer-Bennet

On Sat, August 14, 2010 16:26, Andrej Podzimek wrote:

> Well, a typical conversation about speed and stability usually boils down
> to this:
>
> A: I've heard that XYZ is unstable and slow.
> B: Are you sure? Have you tested XYZ? What are your benchmark results?
> Have you had any issues?
> A: No. I *have* *not* *tested* XYZ. I think XYZ is so unstable and slow
> that it's not worth testing.

Yes indeed!

I can't afford to test everything carefully.  Like most people, I read
published reports and listen to conversations places like this, and form
an impression of what performs how.

Then I do some testing to verify that something I'm seriously considering
produces satisfactory performance.  The key there is "satisfactory"; I'm
not looking for the "best", I'm looking for something that fits in and is
satisfactory.

The more unusual my requirements, and the better defined, the less I can
gain from studying outside test reports.

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Opensolaris is apparently dead

2010-08-16 Thread David Dyer-Bennet

On Sun, August 15, 2010 09:19, David Magda wrote:
> On Aug 14, 2010, at 14:54, Edward Ned Harvey wrote:
>
>> From:  Russ Price
>
>>
>>> For me, Solaris had zero mindshare since its beginning, on account of
>>> being prohibitively expensive.
>>
>> I hear that a lot, and I don't get it.  $400/yr does move it out of
>> peoples' basements generally, and keeps sol10 out of enormous
>> clustering facilities that don't have special purposes or free
>> alternatives.  But I wouldn't call it prohibitively expensive, for a
>> whole lot of purposes.
>
> But that US$ 400 was only if you wanted support. For the last little
> while you could run Solaris 10 legally without a support contract
> without issues.

Looks like there are prices for "service" for things that could
legitimately be called RedHat Enterprise Linux from $80/year up into at
least the mid thousands; this may account for the range of impressions
people have.

The 24/7 Premium subscription for a two-socket server is $1299/year.  The
business-hours plan is $799.

<https://www.redhat.com/wapps/store/catalog.html>

Your point that "free" has been important is very true.  I'm not sure that
what Oracle says they're doing with Solaris 11 Express won't cover that at
least for business customers, though.  (I do think that they'll lose out
on the extensive testing we've been providing.)

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Opensolaris is apparently dead

2010-08-16 Thread David Dyer-Bennet

On Mon, August 16, 2010 11:01, Joerg Schilling wrote:
> "David Dyer-Bennet"  wrote:
>
>> >> As such, they'll need to continue to comply with GPLv2 requirements.
>> >
>> > No, there is definitely no need for Oracle to comply with the GPL as
>> > they own the code.
>>
>> Ray's point is, how long would BTRFS remain in the Linux kernel in
>> that case?
>
> Such a license change can happen at any time. The Linux folks have no
> grant that it would not happen.

And they have every right to stop including BTRFS in the kernel whenever
they wish.

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Opensolaris is apparently dead

2010-08-16 Thread David Dyer-Bennet

On Mon, August 16, 2010 10:43, Joerg Schilling wrote:
> "David Dyer-Bennet"  wrote:
>
>>
>> On Sun, August 15, 2010 20:44, Peter Jeremy wrote:
>>
>> > Irrespective of the above, there is nothing requiring Oracle to
>> > release any future btrfs or ZFS improvements (or even bugfixes).
>> > They can't retrospectively change the license on already released
>> > code but they can put a different (non-OSS) license on any new code.
>>
>> That's true.
>>
>> However, if Oracle makes a binary release of BTRFS-derived code, they
>> must release the source as well; BTRFS is under the GPL.
>
> This claim would only be true in case that Oracle does not own the
> copyright on its code...

Oops, yeah, you're right there; the copyright holder can grant additional
licenses and do things itself.
-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Opensolaris is apparently dead

2010-08-16 Thread David Dyer-Bennet

On Mon, August 16, 2010 10:48, Joerg Schilling wrote:
> Ray Van Dolson  wrote:
>
>> > I absolutely guarantee Oracle can and likely already has
>> > dual-licensed BTRFS.
>>
>> Well, Oracle obviously would want btrfs to stay as part of the Linux
>> kernel rather than die a death of anonymity outside of it...
>>
>> As such, they'll need to continue to comply with GPLv2 requirements.
>
> No, there is definitely no need for Oracle to comply with the GPL as they
> own the code.

Ray's point is, how long would BTRFS remain in the Linux kernel in that case?
-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Opensolaris is apparently dead

2010-08-16 Thread David Dyer-Bennet

On Sun, August 15, 2010 20:44, Peter Jeremy wrote:

> Irrespective of the above, there is nothing requiring Oracle to release
> any future btrfs or ZFS improvements (or even bugfixes).  They can't
> retrospectively change the license on already released code but they
> can put a different (non-OSS) license on any new code.

That's true.

However, if Oracle makes a binary release of BTRFS-derived code, they must
release the source as well; BTRFS is under the GPL.

So, if they're going to use it in any way as a product, they have to
release the source.  If they want to use it just internally they can do
anything they want, of course.

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New Supermicro SAS/SATA controller: AOC-USAS2-L8e in SOHO NAS and HD HT

2010-08-13 Thread David Dyer-Bennet

On Aug 12, 2010, at 7:03 PM, valrh...@gmail.com wrote:

> Has anyone bought one of these cards recently? It seems to list for
> around $170 at various places, which seems like quite a decent deal. But
> no well-known reputable vendor I know seems to sell these, and I want to
> be able to have someone backing the sale if something isn't perfect.
> Where do you all recommend buying this card from?

I put something very similar in -- same number with an 'i' suffix instead
of the 'e'.  I remember seeing both existed at the time, and that the i
was what I needed.  I'm using SATA cables, and no expanders (each cable
goes directly to a drive), maybe the 'e' has more advanced features (that
I knew I didn't need).

I can't imagine the retailer would be of any value for support on such a
card; perhaps, in the worst case, they  might possibly take it back. 
Selling it on Ebay is often more profitable, since the buyer pays shipping
:-).
-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Problems with big ZFS send/receive in b134

2010-08-12 Thread David Dyer-Bennet

On Wed, August 11, 2010 15:11, Paul Kraus wrote:
> On Wed, Aug 11, 2010 at 10:36 AM, David Dyer-Bennet  wrote:

>>>> Am I looking for too much here?  I *thought* I was doing something
>>>> that should be simple and basic and frequently used nearly
>>>> everywhere, and hence certain to work.  "What could go wrong?", I
>>>> thought :-).  If I'm doing something inherently dicey I can try to
>>>> find a way to back off; as my primary backup process, this needs to
>>>> be rock-solid.
>
> It looks like you are trying to do a full send every time, what about
> a first full then incremental (which should be much faster) ? The
> first full might run afoul of the 2 hour snapshots (and deletions),
> but I would not expect the incremental to. I am syncing about 20 TB of
> data between sites this way every 4 hours over a 100 Mb link. I put
> the snapshot management and the site to site replication in the same
> script to keep them from fighting :-)

What I'm working on is, in fact, the first backup.  I intended from the
start to use incrementals; they just didn't work in earlier versions, and
I was reduced to doing full backups only.  And I need a successful full
backup to start the series, and to initialize any new backup media, and so
forth.  So I think I have to solve this problem, even if most of the
backups will be incrementals.

Mostly the incrementals should be quite fast -- but I can come home from a
weekend away with 30 GB or so of photos, which would appear on the server
all at once.  Still, that's well under 2 hours.
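
For the record, the incremental step I have in mind once a full stream
is in place looks roughly like this (the snapshot names are placeholders,
not real ones from my scripts):

  # -R -I sends everything between the two snapshots, including the
  # intermediate snapshots; -F on the receive rolls the backup side back
  # to the last common snapshot first if it has drifted
  zfs send -R -I @bup-previous zp1@bup-current | \
      zfs recv -Fdu bup-wrack/fsfs/zp1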

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Problems with big ZFS send/receive in b134

2010-08-11 Thread David Dyer-Bennet

On Tue, August 10, 2010 16:41, Dave Pacheco wrote:
> David Dyer-Bennet wrote:

>> If that turns out to be the problem, that'll be annoying to work around
>> (I'm making snapshots every two hours and deleting them after a couple
>> of weeks).  Locks between admin scripts rarely end well, in my
>> experience.
>> But at least I'd know what I had to work around.
>>
>> Am I looking for too much here?  I *thought* I was doing something that
>> should be simple and basic and frequently used nearly everywhere, and
>> hence certain to work.  "What could go wrong?", I thought :-).  If I'm
>> doing something inherently dicey I can try to find a way to back off; as
>> my primary backup process, this needs to be rock-solid.
>
>
> It's certainly a reasonable thing to do and it should work.  There have
> been a few problems around deleting and renaming snapshots as they're
> being sent, but the delete issues were fixed in build 123 by having
> zfs_send hold snapshots being sent (as long as you've upgraded your pool
> past version 18), and it sounds like you're not doing renames, so your
> problem may be unrelated.

AHA!  You may have nailed the issue -- I've upgraded from 111b to 134, but
have not yet upgraded my pool.  Checking...yes, the pool I'm sending from
is V14.  (I don't instantly upgrade pools; I need to preserve the option
of falling back to older software for a while after an upgrade.)

So, I should try either turning off my snapshot creator/deleter during the
backup, or upgrade the pool.  Will do!  (I will eventually upgrade the
pool of course, but I think I'll try the more reversible option first.  I
can have the deleter check for the pid file the backup already creates to
avoid two backups running at once.)
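
The interlock I have in mind is nothing fancier than this sort of check
at the top of the snapshot deleter (the pid-file name is invented):

  # skip the cleanup pass if a backup is in progress
  LOCK=/var/run/zfs-backup.pid
  if [ -f "$LOCK" ] && kill -0 "$(cat "$LOCK")" 2>/dev/null; then
      echo "backup running (pid $(cat "$LOCK")), skipping snapshot cleanup"
      exit 0
  fi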

Thank you very much!  This is extremely encouraging.

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Problems with big ZFS send/receive in b134

2010-08-11 Thread David Dyer-Bennet

On Tue, August 10, 2010 23:13, Ian Collins wrote:
> On 08/11/10 03:45 PM, David Dyer-Bennet wrote:

>> cannot receive incremental stream: most recent snapshot of
>> bup-wrack/fsfs/zp1/ddb does not
>> match incremental source

> That last error occurs if the snapshot exists, but has changed, it has
> been deleted and a new one with the same name created.

So for testing purposes at least, I need to shut down everything I have
that creates or deletes snapshots.  (I don't, though, have anything that
would delete one and create one with the same name.  I create snapshots
with various names (2hr, daily, weekly, monthly, yearly) and a current
timestamp, and I delete old ones (many days old at a minimum).)

And I think I'll abstract the commands from my backup script into a
simpler dedicated test script, so I'm sure I'm doing exactly the same
thing each time (that should cause me to hit on a combination that works
right away :-) ).

Is there anything stock in b134 that messes with snapshots that I should
shut down to keep things stable, or am I only worried about my own stuff?

Are other people out there not using send/receive for backups?  Or not
trying to preserve snapshots while doing it?  Or, are you doing what I'm
doing, and not having the problems I'm having?
-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Problems with big ZFS send/receive in b134

2010-08-10 Thread David Dyer-Bennet

On 10-Aug-10 13:46, David Dyer-Bennet wrote:


On Tue, August 10, 2010 13:23, Dave Pacheco wrote:

David Dyer-Bennet wrote:

My full backup still doesn't complete.  However, instead of hanging the
entire disk subsystem as it did on 111b, it now issues error messages.
Errors at the end.

[...]

cannot receive incremental stream: most recent snapshot of
bup-wrack/fsfs/zp1/ddb does not
match incremental source
bash-4.0$

The bup-wrack pool was newly-created, empty, before this backup started.

The backup commands were:

zfs send -Rv "$srcsnap" | zfs recv -Fudv "$BUPPOOL/$HOSTNAME/$FS"

I don't see how anything could be creating snapshots on bup-wrack while
this was running.  That pool is not normally mounted (it's on a single
external USB drive, I plug it in for backups).  My script for doing
regular snapshots of zp1 and rpool doesn't reference any of the bup-*
pools.

I don't see how this snapshot mismatch can be coming from anything but
the
send/receive process.

There are quite a lot of snapshots; dailys for some months, 2-hour ones
for a couple of weeks.  Most of them are empty or tiny.

Next time I will try WITHOUT -v on both ends, and arrange to capture the
expanded version of the command with all the variables filled in, but I
don't expect any different outcome.

Any other ideas?



Is it possible that snapshots were renamed on the sending pool during
the send operation?


I don't have any scripts that rename a snapshot (in fact I didn't know it
was possible until just now), and I don't have other users with permission
to make snapshots (either delegated or by root access).  I'm not using the
Sun auto-snapshot thing, I've got a much-simpler script of my own (hence I
know what it does).  So I don't at the moment see how one would be getting
renamed.

It's possible that a snapshot was *deleted* on the sending pool during the
send operation, however.  Also that snapshots were created (however, a
newly created one would be after the one specified in the zfs send -R, and
hence should be irrelevant).  (In fact it's certain that snapshots were
created and I'm nearly certain of deleted.)


More information.  The test I started this morning errored out somewhat
similarly, and one set of errors is clearly from deleted snapshots
(they're 2hr snapshots, some of which get deleted every 2 hours).  There
are also errors relating to "incremental streams", which is strange
since I'm not using -I or -i at all.


Here are the commands again, and all the output.

+ zfs create -p bup-wrack/fsfs/zp1
+ zfs send -Rp z...@bup-20100810-154542gmt
+ zfs recv -Fud bup-wrack/fsfs/zp1
warning: cannot send 'zp1/d...@bup-2hr-20100731-12cdt': no such pool 
or dataset
warning: cannot send 'zp1/d...@bup-2hr-20100731-14cdt': no such pool 
or dataset
warning: cannot send 'zp1/d...@bup-2hr-20100731-16cdt': no such pool 
or dataset
warning: cannot send 'zp1/d...@bup-20100731-213303gmt': incremental 
source (@bup-2hr-20100731-16CDT) does not exist
warning: cannot send 'zp1/d...@bup-2hr-20100731-18cdt': no such pool 
or dataset
warning: cannot send 'zp1/d...@bup-2hr-20100731-20cdt': incremental 
source (@bup-2hr-20100731-18CDT) does not exist
cannot receive incremental stream: most recent snapshot of
bup-wrack/fsfs/zp1/ddb does not match incremental source

Afterward,

bash-4.0$ zpool list
NAMESIZE  ALLOC   FREECAP  DEDUP  HEALTH  ALTROOT
bup-wrack   928G   687G   241G73%  1.00x  ONLINE  /backups/bup-wrack
rpool   149G  10.0G   139G 6%  1.00x  ONLINE  -
zp11.09T   743G   373G66%  1.00x  ONLINE  -

So quite a lot did get transferred; but not all.

So, it appears clear that snapshots being deleted during the zfs send -R 
causes a warning.  A warning is fine, since they're not there it can't 
send them, and they were there when the command was given so it makes 
sense for it to try.


That last message, which is not tagged as either warning or error,
worries me though.  And I'm wondering how complete the transfer is; I
believe the backup copy is compressed whereas the zp1 copy isn't, so the
ALLOC being that different isn't clear-cut evidence of anything.


I'll try to guess a few things that should be recent and see if they in 
fact got into the backup.
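
A cruder completeness check that occurs to me is just diffing the
snapshot lists on the two sides (pool names as in the run above;
snapshots deleted on the source since the send started will of course
show up as differences):

  zfs list -H -t snapshot -r -o name zp1 | sed 's|^|bup-wrack/fsfs/|' > /tmp/want
  zfs list -H -t snapshot -r -o name bup-wrack/fsfs/zp1 > /tmp/have
  diff /tmp/want /tmp/have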


--
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Problems with big ZFS send/receive in b134

2010-08-10 Thread David Dyer-Bennet

On Tue, August 10, 2010 13:23, Dave Pacheco wrote:
> David Dyer-Bennet wrote:
>> My full backup still doesn't complete.  However, instead of hanging the
>> entire disk subsystem as it did on 111b, it now issues error messages.
>> Errors at the end.
> [...]
>> cannot receive incremental stream: most recent snapshot of
>> bup-wrack/fsfs/zp1/ddb does not
>> match incremental source
>> bash-4.0$
>>
>> The bup-wrack pool was newly-created, empty, before this backup started.
>>
>> The backup commands were:
>>
>> zfs send -Rv "$srcsnap" | zfs recv -Fudv "$BUPPOOL/$HOSTNAME/$FS"
>>
>> I don't see how anything could be creating snapshots on bup-wrack while
>> this was running.  That pool is not normally mounted (it's on a single
>> external USB drive, I plug it in for backups).  My script for doing
>> regular snapshots of zp1 and rpool doesn't reference any of the bup-*
>> pools.
>>
>> I don't see how this snapshot mismatch can be coming from anything
>> but the send/receive process.
>>
>> There are quite a lot of snapshots; dailys for some months, 2-hour ones
>> for a couple of weeks.  Most of them are empty or tiny.
>>
>> Next time I will try WITHOUT -v on both ends, and arrange to capture the
>> expanded version of the command with all the variables filled in, but I
>> don't expect any different outcome.
>>
>> Any other ideas?
>
>
> Is it possible that snapshots were renamed on the sending pool during
> the send operation?

I don't have any scripts that rename a snapshot (in fact I didn't know it
was possible until just now), and I don't have other users with permission
to make snapshots (either delegated or by root access).  I'm not using the
Sun auto-snapshot thing, I've got a much-simpler script of my own (hence I
know what it does).  So I don't at the moment see how one would be getting
renamed.

It's possible that a snapshot was *deleted* on the sending pool during the
send operation, however.  Also that snapshots were created (however, a
newly created one would be after the one specified in the zfs send -R, and
hence should be irrelevant).  (In fact it's certain that snapshots were
created and I'm nearly certain of deleted.)

If that turns out to be the problem, that'll be annoying to work around
(I'm making snapshots every two hours and deleting them after a couple of
weeks).  Locks between admin scripts rarely end well, in my experience. 
But at least I'd know what I had to work around.

Am I looking for too much here?  I *thought* I was doing something that
should be simple and basic and frequently used nearly everywhere, and
hence certain to work.  "What could go wrong?", I thought :-).  If I'm
doing something inherently dicey I can try to find a way to back off; as
my primary backup process, this needs to be rock-solid.
-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Problems with big ZFS send/receive in b134

2010-08-10 Thread David Dyer-Bennet
Additional information.  I started another run, and captured the exact
expanded commands.  These SHOULD BE the exact commands used in the last
run except for the snapshot name (this script makes a recursive snapshot
just before it starts a backup).  In any case they ARE the exact commands
used in this new run, and we'll see what happens at the end of this run.

(These are from a bash trace as produced by "set -x")

+ zfs create -p bup-wrack/fsfs/zp1
+ zfs send -Rp z...@bup-20100810-154542gmt
+ zfs recv -Fud bup-wrack/fsfs/zp1

(The send and the receive are source and sink in a pipeline).  As you can
see, the destination filesystem is new in the bup-wrack pool.  The "-R" on
the send should, as I understand it, create a replication stream which
will "replicate  the specified filesystem, and all descendent file
systems, up to the  named  snapshot.  When received, all properties,
snapshots, descendent file systems, and clones are preserved."  This
should send the full state of zp1 up to the snapshot.  And the receive
should receive it into bup-wrack/fsfs/zp1.)

Isn't this how a "full backup" should be made using zfs send/receive? 
(Once this is working, I think intend to use -I to send incremental
streams to update it regularly.)

bash-4.0$ zpool list
NAMESIZE  ALLOC   FREECAP  DEDUP  HEALTH  ALTROOT
bup-wrack   928G  4.62G   923G 0%  1.00x  ONLINE  /backups/bup-wrack
rpool   149G  10.0G   139G 6%  1.00x  ONLINE  -
zp11.09T   743G   373G66%  1.00x  ONLINE  -

zp1 is my primary data pool.  It's not very big (physically it's 3 2-way
mirrors of 400GB drives).  It has 743G of data in it.  bup-wrack is the
backup pool, it's a single 1TB external USB drive.  This was taken shortly
after starting the second try at a full backup (since the b134 upgrade),
so bup-wrack is still mostly empty.

None of the pools have shown any errors of any sort in months.  zp1 and
rpool are scrubbed weekly.





-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Problems with big ZFS send/receive in b134

2010-08-10 Thread David Dyer-Bennet
My full backup still doesn't complete.  However, instead of hanging the
entire disk subsystem as it did on 111b, it now issues error messages. 
Errors at the end.

sending from @bup-daily-20100726-10CDT to
zp1/d...@bup-daily-20100727-10cdt
received 3.80GB stream in 136 seconds (28.6MB/sec)
receiving incremental stream of zp1/d...@bup-daily-20100727-10cdt into
bup-wrack/fsfs/zp1/d...@bup-daily-20100727-10cdt
sending from @bup-daily-20100727-10CDT to
zp1/d...@bup-daily-20100728-11cdt
received 192MB stream in 10 seconds (19.2MB/sec)
receiving incremental stream of zp1/d...@bup-daily-20100728-11cdt into
bup-wrack/fsfs/zp1/d...@bup-daily-20100728-11cdt
sending from @bup-daily-20100728-11CDT to
zp1/d...@bup-daily-20100729-10cdt
received 170MB stream in 9 seconds (18.9MB/sec)
receiving incremental stream of zp1/d...@bup-daily-20100729-10cdt into
bup-wrack/fsfs/zp1/d...@bup-daily-20100729-10cdt
sending from @bup-daily-20100729-10CDT to
zp1/d...@bup-2hr-20100729-22cdt
warning: cannot send 'zp1/d...@bup-2hr-20100729-22cdt': no such pool or
dataset
sending from @bup-2hr-20100729-22CDT to
zp1/d...@bup-2hr-20100730-00cdt
warning: cannot send 'zp1/d...@bup-2hr-20100730-00cdt': no such pool or
dataset
sending from @bup-2hr-20100730-00CDT to
zp1/d...@bup-2hr-20100730-02cdt
warning: cannot send 'zp1/d...@bup-2hr-20100730-02cdt': no such pool or
dataset
sending from @bup-2hr-20100730-02CDT to
zp1/d...@bup-2hr-20100730-04cdt
warning: cannot send 'zp1/d...@bup-2hr-20100730-04cdt': incremental
source (@bup-2hr-20100730-02CDT) does not exist
sending from @bup-2hr-20100730-04CDT to
zp1/d...@bup-2hr-20100730-06cdt
sending from @bup-2hr-20100730-06CDT to
zp1/d...@bup-2hr-20100730-08cdt
sending from @bup-2hr-20100730-08CDT to
zp1/d...@bup-daily-20100730-10cdt
sending from @bup-daily-20100730-10CDT to
zp1/d...@bup-2hr-20100730-10cdt
sending from @bup-2hr-20100730-10CDT to
zp1/d...@bup-2hr-20100730-12cdt
sending from @bup-2hr-20100730-12CDT to
zp1/d...@bup-2hr-20100730-14cdt
sending from @bup-2hr-20100730-14CDT to
zp1/d...@bup-2hr-20100730-16cdt
sending from @bup-2hr-20100730-16CDT to
zp1/d...@bup-2hr-20100730-18cdt
sending from @bup-2hr-20100730-18CDT to
zp1/d...@bup-2hr-20100730-20cdt
sending from @bup-2hr-20100730-20CDT to
zp1/d...@bup-2hr-20100730-22cdt
received 162MB stream in 9 seconds (18.0MB/sec)
receiving incremental stream of zp1/d...@bup-2hr-20100730-06cdt into
bup-wrack/fsfs/zp1/d...@bup-2hr-20100730-06cdt
cannot receive incremental stream: most recent snapshot of
bup-wrack/fsfs/zp1/ddb does not
match incremental source
bash-4.0$

The bup-wrack pool was newly-created, empty, before this backup started.

The backup commands were:

zfs send -Rv "$srcsnap" | zfs recv -Fudv "$BUPPOOL/$HOSTNAME/$FS"

I don't see how anything could be creating snapshots on bup-wrack while
this was running.  That pool is not normally mounted (it's on a single
external USB drive, I plug it in for backups).  My script for doing
regular snapshots of zp1 and rpool doesn't reference any of the bup-*
pools.

I don't see how this snapshot mismatch can be coming from anything but the
send/receive process.

There are quite a lot of snapshots; dailys for some months, 2-hour ones
for a couple of weeks.  Most of them are empty or tiny.

Next time I will try WITHOUT -v on both ends, and arrange to capture the
expanded version of the command with all the variables filled in, but I
don't expect any different outcome.

Any other ideas?







-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Directory tree renaming -- disk usage

2010-08-09 Thread David Dyer-Bennet
If I have a directory with a bazillion files in it (or, let's say, a
directory subtree full of raw camera images, about 15MB each, totalling
say 50GB) on a ZFS filesystem, and take daily snapshots of it (without
altering it), the snapshots use almost no extra space, I know.

If I now rename that directory, and take another snapshot, what happens? 
Do I get two copies of the unchanged data now, or does everything still
reference the same original data (file content)?  Seems like the new
directory tree contains the "same old files", same inodes and so forth, so
it shouldn't be duplicating the data as I understand it; is that correct?

This would, obviously, be fairly easy to test; and, if I removed the
snapshots afterward, wouldn't take space permanently (have to make sure
that the scheduler doesn't do one of my permanent snapshots during the
test).  But I'm interested in the theoretical answer in any case.
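
The quick test I have in mind would go roughly like this (dataset,
mountpoint, and directory names are invented), checking usedbysnapshots
before and after the rename:

  zfs snapshot zp1/ddb@before-rename
  mv /zp1/ddb/raw-2010 /zp1/ddb/raw-archive   # rename the big tree
  zfs snapshot zp1/ddb@after-rename
  zfs get -r usedbysnapshots,used zp1/ddb
  # if usedbysnapshots stays tiny, the renamed tree still shares the
  # original file data; then clean up:
  zfs destroy zp1/ddb@after-rename
  zfs destroy zp1/ddb@before-rename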

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Errors upgrading 2009.06 to dev build 134

2010-08-05 Thread David Dyer-Bennet

Last night I upgraded from 2009.6 to b134 from the dev branch. I
haven't tried to boot the resulting BE yet, because I got the
following errors:

PHASE                                        ACTIONS
Removal Phase                            16199/21806
Warning - directory etc/sma/snmp/mibs not empty - contents preserved in
/tmp/tmp3udGdj/var/pkg/lost+found/etc/sma/snmp/mibs-20100805T075635Z
Removal Phase                            21806/21806
Install Phase                            79042/80017
The 'pcieb' driver shares the alias 'pciexclass,060400' with the
'pcie_pci' driver, but the system cannot determine how the latter was
delivered.
Its entry on line 2 in /etc/driver_aliases has been commented
out.  If this driver is no longer needed, it may be removed by booting
into the 'opensolaris-3' boot environment and invoking 'rem_drv pcie_pci'
as well as removing line 2 from /etc/driver_aliases or, before
rebooting, mounting the 'opensolaris-3' boot environment and running
'rem_drv -b  pcie_pci' and removing line 2 from
/etc/driver_aliases.
The 'pcieb' driver shares the alias 'pciexclass,060401' with the 'pcie_pci'
driver, but the system cannot determine how the latter was delivered.
Its entry on line 3 in /etc/driver_aliases has been commented
out.  If this driver is no longer needed, it may be removed by booting
into the 'opensolaris-3' boot environment and invoking 'rem_drv pcie_pci'
as well as removing line 3 from /etc/driver_aliases or, before
rebooting, mounting the 'opensolaris-3' boot environment and running
'rem_drv -b  pcie_pci' and removing line 3 from
/etc/driver_aliases.
Install Phase                            80017/80017
Update Phase                             27721/27760
driver (aggr) upgrade (removal of policy'read_priv_set=net_rawaccess
write_priv_set=net_rawaccess) failed: minor node spec required.
Update Phase                             27725/27760
driver (softmac) upgrade (removal of policy'read_priv_set=net_rawaccess
write_priv_set=net_rawaccess) failed: minor node spec required.
Update Phase                             27726/27760
driver (vnic) upgrade (removal of policy'read_priv_set=net_rawaccess
write_priv_set=net_rawaccess) failed: minor node spec required.
Update Phase                             27736/27760
driver (ibd) upgrade (removal of policy'read_priv_set=net_rawaccess
write_priv_set=net_rawaccess) failed: minor node spec required.
Update Phase                             27743/27760
driver (dnet) upgrade (removal of policy'read_priv_set=net_rawaccess
write_priv_set=net_rawaccess) failed: minor node spec required.
Update Phase                             27744/27760
driver (elxl) upgrade (removal of policy'read_priv_set=net_rawaccess
write_priv_set=net_rawaccess) failed: minor node spec required.
Update Phase                             27745/27760
driver (iprb) upgrade (removal of policy'read_priv_set=net_rawaccess
write_priv_set=net_rawaccess) failed: minor node spec required.
U

Do these look familiar to anybody?  Can they, I hope, be ignored?  Or
does anybody have any ideas what needs to be fixed?

I didn't install any drivers beyond what the earlier installers
figured out for themselves I needed, and I didn't mess with driver
config that I recall.

I know I can probably fall back to what I'm running now if this new
install fails to run, and I'll eventually just try it.  I've got a
couple of bootable CDs with "recovery consoles" that at least get me
single user, and one with full LiveCD capability, so I should be able
to unwind the mess if necessary.

I guess technically this has no business on zfs-discuss; apologies for
that, but all the prior discussion of this upgrade, and the motivation
for it, is that I need a more current ZFS, and everybody I know is in
this list, not over in the install-discuss list.




-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Upgrading 2009.06 to something current

2010-08-01 Thread David Dyer-Bennet
What's a good choice for a decently stable upgrade?  I'm unable to run 
backups because ZFS send/receive won't do full-pool replication 
reliably, it hangs better than 2/3 of the time, and people here have 
told me later versions (later than 111b) fix this.  I was originally 
waiting for the "spring" release, but okay, I've kind of given up on 
that.  This is a home "production" server; it's got all my photos on it. 
 And the backup isn't as current as I'd like, and I'm having trouble 
getting a better backup.  (I'll do *something* before I risk the 
upgrade; maybe brute force, rsync to an external drive, to at least give 
me a clean copy of the current state; I can live without ACLs.)
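(For that brute-force pass I'm picturing something like the following, with
made-up mount points; -a keeps ordinary ownership and permissions but, as I
said, not the ACLs:

    rsync -a --delete /tank/ /mnt/external/tank/

run against an external drive formatted with whatever filesystem is handy.)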


I find various blogs with instructions for how to do such an upgrade, 
and they don't agree, and each one has posts from people for whom it 
didn't work, too.  Is there any kind of consensus on what the best way 
to do this is?


--
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Legality and the future of zfs...

2010-07-16 Thread David Dyer-Bennet

On Fri, July 16, 2010 14:07, Frank Cusack wrote:
> On 7/16/10 12:02 PM -0500 David Dyer-Bennet wrote:
>>> It would be nice to have applications request to be notified
>>> before a snapshot is taken, and when that have requested
>>> notification have acknowledged that they're ready, the snapshot
>>> would be taken; and then another notification sent that it was
>>> taken.  Prior to indicating they were ready, the apps could
>>> have achieved a logically consistent on disk state.  That
>>> would eliminate the need for (for example) separate database
>>> backups, if you could have a snapshot with the database on it
>>> in a consistent state.
>>
>> Any software dependent on cooperating with the filesystem to ensure that
>> the files are consistent in a snapshot fails the cord-yank test (which
>> is
>> equivalent to the "processor explodes" test and the "power supply bursts
>> into flames" test and the "disk drive shatters" test and so forth).  It
>> can't survive unavoidable physical-world events.
>
> It can, if said software can roll back to the last consistent state.
> That may or may not be "recent" wrt a snapshot.  If an application is
> very active, it's possible that many snapshots may be taken, none of
> which are actually in a state the application can use to recover from.
> Rendering snapshots much less effective.

Wait, if the application can in fact survive the "cord pull" test then by
definition of "survive", all the snapshots are useful.  They'll be
everything consistent that was committed to disk by the time of the yank
(or snapshot); which, it seems to me, is the very best that anybody could
hope for.

> Also, just administratively, and perhaps legally, it's highly desirable
> to know that the time of a snapshot is the actual time that application
> state can be recovered to or referenced to.

Maybe, but since that's not achievable for your core corporate asset (the
database), I think of it as a pipe dream rather than a goal.

> Also, if an application cannot survive a cord-yank test, it might be
> even more highly desirable that snapshots be a stable that from which
> the application can be restarted.

If it cannot survive a cord-yank test, it should not be run, ever, by
anybody, for any purpose more important than playing a game.

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Legality and the future of zfs...

2010-07-16 Thread David Dyer-Bennet

On Fri, July 16, 2010 08:39, Richard L. Hamilton wrote:
>> > It'd be handy to have a mechanism where
>> applications could register for
>> > snapshot notifications. When one is about to
>> happen, they could be told
>> > about it and do what they need to do. Once all the
>> applications have
>> > acknowledged the snapshot alert--and/or after a
>> pre-set timeout--the file
>> > system would create the snapshot, and then notify
>> the applications that
>> > it's done.
>> >
>> Why would an application need to be notified? I think
>> you're under the
>> misconception that something happens when a ZFS
>> snapshot is taken.
>> NOTHING happens when a snapshot is taken (OK, well,
>> there is the
>> snapshot reference name created). Blocks aren't moved
>> around, we don't
>> copy anything, etc. Applications have no need to "do
>> anything" before a
>> snapshot it taken.
>
> It would be nice to have applications request to be notified
> before a snapshot is taken, and when that have requested
> notification have acknowledged that they're ready, the snapshot
> would be taken; and then another notification sent that it was
> taken.  Prior to indicating they were ready, the apps could
> have achieved a logically consistent on disk state.  That
> would eliminate the need for (for example) separate database
> backups, if you could have a snapshot with the database on it
> in a consistent state.

Any software dependent on cooperating with the filesystem to ensure that
the files are consistent in a snapshot fails the cord-yank test (which is
equivalent to the "processor explodes" test and the "power supply bursts
into flames" test and the "disk drive shatters" test and so forth).  It
can't survive unavoidable physical-world events.

Conversely, any scheme for a program writing to its files that PASSES
those tests will be fine with arbitrary snapshots, too.

For that matter, remember that the "snapshot" may be taken on a zfs server
on another continent which is making the storage available via iScsi;
there's currently no notification channel to tell the software the
snapshot is happening.

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Legality and the future of zfs...

2010-07-15 Thread David Dyer-Bennet

On Thu, July 15, 2010 09:29, Tim Cook wrote:
> On Thu, Jul 15, 2010 at 9:09 AM, David Dyer-Bennet  wrote:
>
>>
>> On Wed, July 14, 2010 23:51, Tim Cook wrote:

>> > You're clearly talking about something completely different than
>> everyone
>> > else.  Whitebox works GREAT if you've got 20 servers.  Try scaling it
>> to
>> > 10,000.  "A couple extras" ends up being an entire climate controlled
>> > warehouse full of parts that may or may not be in the right city.  Not
>> to
>> > mention you've then got full-time staff on-hand to constantly be
>> replacing
>> > parts.  Your model doesn't scale for 99% of businesses out there.
>> Unless
>> > they're google, and they can leave a dead server in a rack for years,
>> it's
>> > an unsustainable plan.  Out of the fortune 500, I'd be willing to bet
>> > there's exactly zero companies that use whitebox systems, and for a
>> > reason.
>>
>> You might want to talk to Google about that; as I understand it they
>> decided that buying expensive servers was a waste of money precisely
>> because of the high numbers they needed.  Even with the good ones, some
>> will fail, so they had to plan to work very well through server
>> failures,
>> so they can save huge amounts of money on hardware by buying cheap
>> servers rather than expensive ones.

> Obviously someone was going to bring up google, whose business model is
> unique, and doesn't really apply to anyone else.  Google makes it work
> because they order so many thousands of servers at a time, they can demand
> custom made parts for the servers, that are built to their specifications.

Certainly they're one of the most unusual setups out there, in several
ways (size, plus details of what they do with their computers).

>  Furthermore, the clustering and filesystem they use wouldn't function at
> all for 99% of the workloads out there.  Their core application: search,
> is
> what makes the hardware they use possible.  If they were serving up a
> highly
> transactional database that required millisecond latency it would be a
> different story.

Again, I'm not at all convinced of that "99%" bit.

Obviously low-latency transactional database applications are about the
polar opposite of what Google does.  However, transactional database
applications are nearer 1% than 99% of the workloads out there, at every
shop I've worked at or seen detailed descriptions of.

Big email farms, for example, don't generally have that kind of database
at all.  Big web farms probably do have some databases used that way --
but not for that high a percentage of their traffic, and generally running
on one big server while the web is spread across hundreds of servers.
Akamai is more like Google in a bunch of ways than most places.  Wikipedia
and ebay and amazon have huge web front-ends, while also needing
transactional database support.

Um, maybe I'm getting really too far afield from ZFS.  I'll shut up now :-).
-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Legality and the future of zfs...

2010-07-15 Thread David Dyer-Bennet

On Wed, July 14, 2010 23:51, Tim Cook wrote:
> On Wed, Jul 14, 2010 at 9:27 PM, BM  wrote:
>
>> On Thu, Jul 15, 2010 at 12:49 AM, Edward Ned Harvey
>>  wrote:
>> > I'll second that.  And I think this is how you can tell the
>> difference:
>> > With supermicro, do you have a single support number to call and a
>> 4hour
>> > onsite service response time?
>>
>> Yes.
>>
>> BTW, just for the record, people potentially have a bunch of other
>> supermicros in a stock, that they've bought for the rest of the money
>> that left from a budget that was initially estimated to get shiny
>> Sun/Oracle hardware. :) So normally you put them online in a cluster
>> and don't really worry that one of them gone — just power that thing
>> down and disconnect from the whole grid.
>>
>> > When you pay for the higher prices for OEM hardware, you're paying for
>> the
>> > knowledge of parts availability and compatibility. And a single point
>> > vendor who supports the system as a whole, not just one component.
>>
>> What exactly kind of compatibility you're talking about? For example,
>> if I remove my broken mylar air shroud for X8 DP with a
>> MCP-310-18008-0N number because I step on it accidentally :-D, pretty
>> much I think I am gonna ask them to replace exactly THAT thing back.
>> Or you want to let me tell you real stories how OEM hardware is
>> supported and how many emails/phonecalls it involves? One of the very
>> latest (just a week ago): Apple Support reported me that their
>> engineers in US has no green idea why Darwin kernel panics on their
>> XServe, so they suggested me replace mother board TWICE and keep OLDER
>> firmware and never upgrade, since it will cause crash again (although
>> identical server works just fine with newest firmware)! I told them
>> NNN times that traceback of Darwin kernel was yelling about ACPI
>> problem and gave them logs/tracebacks/transcripts etc, but they still
>> have no idea where is the problem. Do I need such "support"? No. Not
>> at all.
>>
>> --
>> Kind regards, BM
>>
>> Things, that are stupid at the beginning, rarely ends up wisely.
>> ___
>>
>>
>
> You're clearly talking about something completely different than everyone
> else.  Whitebox works GREAT if you've got 20 servers.  Try scaling it to
> 10,000.  "A couple extras" ends up being an entire climate controlled
> warehouse full of parts that may or may not be in the right city.  Not to
> mention you've then got full-time staff on-hand to constantly be replacing
> parts.  Your model doesn't scale for 99% of businesses out there.  Unless
> they're google, and they can leave a dead server in a rack for years, it's
> an unsustainable plan.  Out of the fortune 500, I'd be willing to bet
> there's exactly zero companies that use whitebox systems, and for a
> reason.

You might want to talk to Google about that; as I understand it they
decided that buying expensive servers was a waste of money precisely
because of the high numbers they needed.  Even with the good ones, some
will fail, so they had to plan to work very well through server failures,
so they can save huge amounts of money on hardware by buying cheap servers
rather than expensive ones.

And your juxtaposition of "fortune 500" and "99% of businesses" is
significant; possibly the Fortune 500, other than Google, use expensive
proprietary hardware; but 99% of businesses out there are NOT in the
Fortune 500, and mostly use whitebox systems (and not rackmount at all;
they'll have one or at most two tower servers).
-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] preparing for future drive additions

2010-07-14 Thread David Dyer-Bennet

On Wed, July 14, 2010 14:58, Daniel Taylor wrote:

> I'm about the build a opensolaris NAS system, currently we have two drives
> and are planning on adding two more at a later date (2TB enterprise level
> HDD are a bit expensive!).

Do you really need them?  Now?  Maybe 1TB drives are good now, and then
add a pair of 2TB in a year?

> Whats the best configuration for setting up these drives bearing in mind I
> want to expand in the future?

Mirror now (pool consisting of one two-way mirror vdev).  Add second
mirror vdev to the pool when you need to expand.

> I was thinking of mirroring the drives and then converting to raidz some
> how?

No way to convert to raidz.  (That is, no magic simple way; you can of
course put in new drives for the raidz and copy the data across.)

> It will only be a max of 4 drives, the second two of which will be bought
> later.

5 drives would be a lot better.  You could keep a hot spare -- and you
could expand mirror vdevs safely (never dropping below your normal
redundancy level), too.

You can add new vdevs to a pool.  This is very useful for a growing system
(until you run out of drive slots).

You can expand an existing vdev by replacing all the drives (one at a
time).  It's a lot cleaner and safer with mirror vdevs than with raidz[23]
vdevs.

In a raidz vdev, you can replace drives individually and wait for them to
resilver.  When each drive is done, replace the next.  When you have
replaced all of the drives, the vdev will then make the new space
available.  HOWEVER, doing this takes away a level of redundancy -- you
take away a live drive.  For a RAIDZ, that means no redundancy during the
resilver (which takes a while on a 2TB drive, if you haven't noticed). 
And the resilver is stressing the drives, so if there's any incipient
failure, it's more likely to show up during the resilver.  Scary!  (RAIDZ2
is better in that you still have one layer of redundancy when you take one
drive out; but in a 4-drive chassis forget it!).

In a mirror vdev,  you can be much cleverer, IF you can connect the new
drive while the old drives are all still present.  Attach the new bigger
drive as a THIRD drive to the mirror vdev, and wait for the resilver.  You
now have a three-way mirror, and you never dropped below a two-way mirror
at any time during the process.  Detach one small drive and attach a new
big drive, and wait again.  And detach the last small drive, and you have
now expanded your mirror vdev without ever dropping below your normal
redundancy.  (There are variants on this; the key point is that a mirror
vdev can be an n-way mirror for any value of n your hardware can support.)
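A rough sketch of that sequence, with made-up device names (c0t0d0 and
c0t1d0 the old small drives, c0t2d0 and c0t3d0 the new big ones):

    zpool attach tank c0t0d0 c0t2d0   # now a 3-way mirror; wait for resilver
    zpool detach tank c0t0d0
    zpool attach tank c0t1d0 c0t3d0   # 3-way again; wait for resilver
    zpool detach tank c0t1d0

Once every drive left in the vdev is the larger size, the extra capacity
becomes available (on newer builds that may take autoexpand=on or a
'zpool online -e' to show up).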

If your backups are good and your uptime requirements aren't really
strict, of course the risks can be tolerated better.
-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs send/recv hanging in 2009.06

2010-07-13 Thread David Dyer-Bennet

On Fri, July 9, 2010 18:42, Giovanni Tirloni wrote:
> On Fri, Jul 9, 2010 at 6:49 PM, BJ Quinn  wrote:
>> I have a couple of systems running 2009.06 that hang on relatively large
>> zfs send/recv jobs.  With the -v option, I see the snapshots coming
>> across, and at some point the process just pauses, IO and CPU usage go
>> to zero, and it takes a hard reboot to get back to normal.  The same
>> script running against the same data doesn't hang on 2008.05.
>
> There are issues running concurrent zfs receive in 2009.6. Try to run
> just one at a time.

He's doing the same thing I'm doing -- one send, one receive.  (But
incremental replication.)

> Switching to a development build (b134) is probably the answer until
> we've a new release.

Given that the "spring" stable release was my planned solution, I'm
starting to think about doing something else myself.

Does anybody have any idea what's up with the stable release, though?  Has
anything been said about the plans that I've maybe missed?

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs send/recv hanging in 2009.06

2010-07-13 Thread David Dyer-Bennet

On Fri, July 9, 2010 16:49, BJ Quinn wrote:
> I have a couple of systems running 2009.06 that hang on relatively large
> zfs send/recv jobs.  With the -v option, I see the snapshots coming
> across, and at some point the process just pauses, IO and CPU usage go to
> zero, and it takes a hard reboot to get back to normal.  The same script
> running against the same data doesn't hang on 2008.05.
>
> There are maybe 100 snapshots, 200GB of data total.  Just trying to send
> to a blank external USB drive in one case, and in the other, I'm restoring
> from a USB drive to a local drive, but the behavior is the same.
>
> I see that others have had a similar problem, but there doesn't seem to be
> any answers -
>
> https://opensolaris.org/jive/thread.jspa?messageID=384540
> http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg34493.html
> http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg37158.html
>
> I'd like to stick with a "released" version of OpenSolaris, so I'm hoping
> that the answer isn't to switch to the dev repository and pull down b134.

I still have this problem (I was msg34493 there).

My original plan was to wait for the Spring release, to get me to a stable
release on more recent code.  I'm still following that plan, i.e. haven't
done anything else yet.  At the time the "March" release was expected to
actually appear by April.

Other than trying more recent code, I don't recall any useful ideas coming
through the list.

It seems like the thing people recommend as the backup scheme for ZFS
simply doesn't work yet.
-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Crucial RealSSD C300 and cache flush?

2010-06-24 Thread David Dyer-Bennet

On Thu, June 24, 2010 08:58, Arne Jansen wrote:

> Cross check: we pulled also while writing with cache enabled, and it lost
> 8 writes.

I'm SO pleased to see somebody paranoid enough to do that kind of
cross-check doing this benchmarking!

"Benchmarking is hard!"

> So I'd say, yes, it flushes its cache on request.

Starting to sound pretty convincing,  yes.
-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Complete Linux Noob

2010-06-15 Thread David Dyer-Bennet

On Tue, June 15, 2010 14:13, CarlPalmer wrote:
> I have been researching different types of raids, and I happened across
> raidz, and I am blown away.  I have been trying to find resources to
> answer some of my questions, but many of them are either over my head in
> terms of details, or foreign to me as I am a linux noob, and I have to
> admit I have never even looked at Solaris.

Heh; caught another one :-).

> Are the Parity drives just that, a drive assigned to parity, or is the
> parity shared over several drives?

No drives are formally designated for "parity"; all n drives in the RAIDZ
vdev are used together in such a way that you can lose one drive without
loss of data, but exactly which bits are "data" and which bits are
"parity" and where they are stored is not something the admin has to think
about or know (and in fact cannot know).

> I understand that you can build a raidz2 that will have 2 parity disks.
> So in theory I could lose 2 disks and still rebuild my array so long as
> they are not both the parity disks correct?

Any two disks out of a raidz2 vdev can be lost.  Lose a third before the
recovery completes and your data is toast.

> I understand that you can have Spares assigned to the raid, so that if a
> drive fails, it will immediately grab the spare and rebuild the damaged
> drive.  Is this correct?

Yes, RAIDZ (including z2 and z3) and mirror vdevs will grab a "hot spare"
if one is assigned and needed, and start the resilvering operation
immediately.
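Assigning one is a single command, something like this (hypothetical
device name):

    zpool add tank spare c0t7d0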

> Now I can not find anything on how much space is taken up in the raidz1 or
> raidz2.  If all the drives are the same size, does a raidz2 take up the
> space of 2 of the drives for parity, or is the space calculation
> different?

That's the right calculation.

> I get that you can not expand a raidz as you would a normal raid, by
> simply slapping on a drive.  Instead it seems that the preferred method is
> to create a new raidz.  Now Lets say that I want to add another raidz1 to
> my system, can I get the OS to present this as one big drive with the
> space from both raid pools?

You can't expand a normal RAID, either, anywhere I've ever seen.

A "pool" can contain multiple "vdevs".  You can add additional vdevs to a
pool and the new space become immediately available to the pool, and hence
to anything (like a filesystem) drawing from that pool.

(The zpool command will attempt to stop you from mixing vdevs of different
redundancy in the same pool, but you can force it to let you.  Mixing a
RAIDZ vdev and a RAIDZ3 vdev in the same pool is a silly thing to do,
since you don't control where in the pool any new data goes, and it's
likely to be striped across the vdevs in the pool.)

You can also replace all the drives in a vdev, serially (and waiting for
the resilver to complete at each step before continuing to the next
drive), and if the new drives are larger than the old drives, when  you've
replaced all of them the new space will be usable in that vdev.  This is
particularly useful with mirrors, where there are only two drives to
replace.

(Well, actually, ZFS mirrors can have any number of drives.  To avoid the
risk of loss when upgrading the drives in a mirror, attach the new bigger
drive FIRST, wait for the resilver, and THEN detach one of the smaller
original drives, repeat for the second drive, and you will never go to a
redundancy lower than 2.  You can even attach BOTH new disks at once, if
you have the slots and controller space, and have a 4-way  mirror for a
while.  Somebody reported configuring ALL the drives in a 'Thumper' as a
mirror, a 48-way mirror, just to see if it worked.  It did.)

> How do I share these types of raid pools across the network.  Or more
> specifically, how do I access them from Windows based systems?  Is there
> any special trick?

Nothing special.  In-kernel CIFS is better than SAMBA, and supports full
NTFS ACLs.  I hear it also attaches to AD cleanly, but I haven't done
that, don't run AD at home.
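Roughly, and from memory, so treat this as a sketch and check the docs for
your build:

    svcadm enable -r smb/server
    zfs set sharesmb=name=photos tank/photos

and the dataset shows up as \\yourserver\photos from the Windows side.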

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Please trim posts

2010-06-10 Thread David Dyer-Bennet

On Thu, June 10, 2010 12:26, patto...@yahoo.com wrote:
> It's getting downright ridiculous. The digest people will kiss you.

But those reading via individual message email quite possibly will not. 
Quoting at least what you're actually responding to is crucial to making
sense out here.
-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Depth of Scrub

2010-06-04 Thread David Dyer-Bennet

On Fri, June 4, 2010 03:29, sensille wrote:
> Hi,
>
> I have a small question about the depth of scrub in a raidz/2/3
> configuration.
> I'm quite sure scrub does not check spares or unused areas of the disks
> (it
> could check if the disks detects any errors there).
> But what about the parity? Obviously it has to be checked, but I can't
> find
> any indications for it in the literature. The man page only states that
> the
> data is being checksummed and only if that fails the redundancy is being
> used.
> Please tell me I'm wrong ;)

I believe you're wrong.  Scrub checks all the blocks used by ZFS,
regardless of what's in them.  (It doesn't check free blocks.)

> But what I'm really targeting with my question: How much coverage can be
> reached with a find | xargs wc in contrast to scrub? It misses the
> snapshots, but anything beyond that?

Your find script misses the redundant data; scrub checks it all.

It may well miss some of the metadata as well, and probably misses the
redundant copies of metadata.
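(Kicking a scrub off and checking on it is just:

    zpool scrub tank
    zpool status -v tank

and any checksum errors it finds, in data or metadata, show up in the
status output.)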

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] one more time: pool size changes

2010-06-03 Thread David Dyer-Bennet

On Thu, June 3, 2010 12:03, Bob Friesenhahn wrote:
> On Thu, 3 Jun 2010, David Dyer-Bennet wrote:
>>
>> In an 8-bay chassis, there are other concerns, too.  Do I keep space
>> open
>> for a hot spare?  There's no real point in a hot spare if you have only
>> one vdev; that is, 8-drive RAIDZ3 is clearly better than 7-drive RAIDZ2
>> plus a hot spare.  And putting everything into one vdev means that for
>> any
>> upgrade I have to replace all 8 drives at once, a financial problem for
>> a
>> home server.
>
> It is not so clear to me that an 8-drive raidz3 is clearly better than
> 7-drive raidz2 plus a hot spare.  From a maintenance standpoint, I
> think that it is useful to have a spare drive or even an empty spare
> slot so that it is easy to replace a drive without needing to
> physically remove it from the system.  A true hot spare allows
> replacement to start automatically right away if a failure is
> detected.

But is having a RAIDZ2 drop to single redundancy, with replacement
starting instantly, actually as good or better than having a RAIDZ3 drop
to double redundancy, with actual replacement happening later?  The
"degraded" state of the RAIDZ3 has the same redundancy as the "healthy"
state of the RAIDZ2.

Certainly having a spare drive bay to play with is often helpful; though
the scenarios that most immediately spring to mind are all mirror-related
and hence don't apply here.

> With only 8-drives, the reliability improvement from raidz3 is
> unlikely to be borne out in practice.  Other potential failures modes
> will completely drown out the on-paper reliability improvement
> provided by raidz3.

I wouldn't give up much of anything to add Z3 on 8 drives, no.
-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] one more time: pool size changes

2010-06-03 Thread David Dyer-Bennet

On Thu, June 3, 2010 13:04, Garrett D'Amore wrote:
> On Thu, 2010-06-03 at 11:49 -0500, David Dyer-Bennet wrote:
>> hot spares in place, but I have the bays reserved for that use.
>>
>> In the latest upgrade, I added 4 2.5" hot-swap bays (which got the
>> system
>> disks out of the 3.5" hot-swap bays).  I have two free, and that's the
>> form-factor SSDs come in these days, so if I thought it would help I
>> could
>> add an SSD there.  Have to do quite a bit of research to see which uses
>> would actually benefit me, and how much.  It's not obvious that either
>> l2arc or zil on SSD would help my program loading, image file loading,
>> or
>> image file saving cases that much.  There may be more other stuff than I
>> really think of though.
>
> It really depends on the working sets these programs deal with.
>
> zil is useful primarily when doing lots of writes, especially lots of
> writes to small files or to data scattered throughout a file.  I view it
> as a great solution for database acceleration, and for accelerating the
> filesystems I use for hosting compilation workspaces.  (In retrospect,
> since by definition the results of compilation are reproducible, maybe I
> should just turn off synchronous writes for build workspaces... provided
> that they do not contain any modifications to the sources themselves.
> I'm going to have to play with this.)

I suspect there are more cases here than I immediately think of.  For
example, sitting here thinking, I wonder if the web cache would benefit a
lot?  And all those email files?

RAW files from my camera are 12-15MB, and the resulting Photoshop files
are around 50MB (depending on compression, and they get bigger fast if I
add layers).  Those aren't small, and I don't read the same thing over and
over lots.

For build spaces, definitely should be reproducible from source.  A
classic production build starts with checking out a tagged version from
source control, and builds from there.

> l2arc is useful for data that is read back frequently but is too large
> to fit in buffer cache.  I can imagine that it would be useful for
> hosting storage associated with lots of  programs that are called
> frequently. You can think of it as a logical extension of the buffer
> cache in this regard... if your working set doesn't fit in RAM, then
> l2arc can prevent going back to rotating media.

I don't think I'm going to benefit much from this.

> All other things being equal, I'd increase RAM before I'd worry too much
> about l2arc.  The exception to that would be if I knew I had working
> sets that couldn't possibly fit in RAM... 160GB of SSD is a *lot*
> cheaper than 160GB of RAM. :-)

I just did increase RAM, same upgrade as the 2.5" bays and the additional
controller and the third mirrored vdev.  I increased it all the way to
4GB!  And I can't increase it further feasibly (4GB sticks of ECC RAM
being hard to find and extremely pricey; plus I'd have to displace some of
my existing memory).

Since this is a 2006 system, in another couple of years it'll be time to
replace MB and processor and memory, and I'm sure it'll have a lot more
memory next time.

I'm desperately waiting for OpenSolaris 2010.$Q2 ("Q2" since it was pointed
out last time that "Spring" was wrong on half the Earth), since I hope it
will resolve my backup problems so I can get incremental backups happening
nightly (intention is to use zfs send/receive with incremental replication
streams, to keep external drives up-to-date with data and all snapshots). 
The oldness of the system and especially the drives makes this more
urgent, though of course it's important in general.  I do manage a full
backup that completes now and then, anyway, and they'll complete overnight
if they don't hang. Problem is, if they hang, have to reboot the Solaris
box and every Windows box using it.
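For reference, the shape of what I want to run nightly is roughly this
(snapshot and pool names made up):

    zfs snapshot -r tank@nightly-20100603
    zfs send -R -I tank@nightly-20100602 tank@nightly-20100603 | \
        zfs receive -Fd backup

i.e. a recursive replication stream of everything since the previous
night's snapshot, received into the pool on the external drive.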

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] one more time: pool size changes

2010-06-03 Thread David Dyer-Bennet

On Thu, June 3, 2010 10:50, Garrett D'Amore wrote:
> On Thu, 2010-06-03 at 10:35 -0500, David Dyer-Bennet wrote:
>> On Thu, June 3, 2010 10:15, Garrett D'Amore wrote:
>> > Using a stripe of mirrors (RAID0) you can get the benefits of multiple
>> > spindle performance, easy expansion support (just add new mirrors to
>> the
>> > end of the raid0 stripe), and 100% data redundancy.   If you can
>> afford
>> > to pay double for your storage (the cost of mirroring), this is IMO
>> the
>> > best solution.
>>
>> Referencing "RAID0" here in the context of ZFS is confusing, though.
>> Are
>> you suggesting using underlying RAID hardware to create virtual volumes
>> to
>> then present to ZFS, or what?
>
> RAID0 is basically the default configuration of a ZFS pool -- its a
> concatenation of the underlying vdevs.  In this case the vdevs should
> themselves be two-drive mirrors.
>
> This of course has to be done in the ZFS layer, and ZFS doesn't call it
> RAID0, any more than it calls a mirror RAID1, but effectively that's
> what they are.

Kinda mostly, anyway.  I thought we recently had this discussion, and
people were pointing out things like the striping wasn't physically the
same on each drive and such.

>> > Note that this solution is not quite as resilient against hardware
>> > failure as raidz2 or raidz3.  While the RAID1+0 solution can tolerate
>> > multiple drive failures, if both both drives in a mirror fail, you
>> lose
>> > data.
>>
>> In a RAIDZ solution, two or more drive failures lose your data.  In a
>> mirrored solution, losing the WRONG two drives will still lose your
>> data,
>> but you have some chance of surviving losing a random two drives.  So I
>> would describe the mirror solution as more resilient.
>>
>> So going to RAIDZ2 or even RAIDZ3 would be better, I agree.
>
> From a data resiliency point, yes, raidz2 or raidz3 offers better
> protection.  At a significant performance cost.

The place I care about performance is almost entirely sequential
read/write -- loading programs, and loading and saving large image files. 
I don't know a lot of home users that actually need high IOPS.

> Given enough drives, one could probably imagine using raidz3 underlying
> vdevs, with RAID0 striping to spread I/O across multiple spindles.  I'm
> not sure how well this would perform, but I suspect it would perform
> better than straight raidz2/raidz3, but at a significant expense (you'd
> need a lot of drives).

Might well work that way; it does sound about right.

>> In an 8-bay chassis, there are other concerns, too.  Do I keep space
>> open
>> for a hot spare?  There's no real point in a hot spare if you have only
>> one vdev; that is, 8-drive RAIDZ3 is clearly better than 7-drive RAIDZ2
>> plus a hot spare.  And putting everything into one vdev means that for
>> any
>> upgrade I have to replace all 8 drives at once, a financial problem for
>> a
>> home server.
>
> This is one of the reasons I don't advocate using raidz (any version)
> for home use, unless you can't afford the cost in space represented by
> mirroring and a hot spare or two.  (The other reason ... for my use at
> least... is the performance cost.  I want to use my array to host
> compilation workspaces, and for that I would prefer to get the most
> performance out of my solution.  I suppose I could add some SSDs... but
> I still think multiple spindles are a good option when you can do it.)
>
> In an 8 drive chassis, without any SSDs involved, I'd configure 6 of the
> drives as a 3 vdev stripe consisting of mirrors of 2 drives, and I'd
> leave the remaining two bays as hot spares.  Btw, using the hot spares
> in this way potentially means you can use those bays later to upgrade to
> larger drives in the future, without offlining anything and without
> taking too much of a performance penalty when you do so.

And the three 2-way mirrors is exactly where I am right now.  I don't have
hot spares in place, but I have the bays reserved for that use.

In the latest upgrade, I added 4 2.5" hot-swap bays (which got the system
disks out of the 3.5" hot-swap bays).  I have two free, and that's the
form-factor SSDs come in these days, so if I thought it would help I could
add an SSD there.  Have to do quite a bit of research to see which uses
would actually benefit me, and how much.  It's not obvious that either
l2arc or zil on SSD would help my program loading, image file loading, or
image file saving cases that much.  There may be more other stuff than I
really think of though.
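(If I do experiment, the plumbing side at least is trivial; with a
hypothetical device name:

    zpool add tank cache c2t0d0   # l2arc
    zpool add tank log c2t1d0     # separate ZIL device

The hard part is knowing whether my workload would actually notice.)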
-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] one more time: pool size changes

2010-06-03 Thread David Dyer-Bennet

On Thu, June 3, 2010 10:50, Marty Scholes wrote:
> David Dyer-Bennet wrote:
>> My choice of mirrors rather than RAIDZ is based on
>> the fact that I have
>> only 8 hot-swap bays (I still think of this as LARGE
>> for a home server;
>> the competition, things like the Drobo, tends to have
>> 4 or 5), that I
>> don't need really large amounts of storage (after my
>> latest upgrade I'm
>> running with 1.2TB of available data space), and that
>> I expected to need
>> to expand storage over the life of the system.  With
>> mirror vdevs, I can
>> expand them without compromising redundancy even
>> temporarily, by attaching
>> the new drives before I detach the old drives; I
>> couldn't do that with
>> RAIDZ.  Also, the fact that disk is now so cheap
>> means that 100%
>> redundancy is affordable, I don't have to compromise
>> on RAIDZ.
>
> Maybe I have been unlucky too many times doing storage admin in the 90s,
> but simple mirroring still scares me.  Even with a hot spare (you do have
> one, right?) the rebuild window leaves the entire pool exposed to a single
> failure.

No hot spare currently.  And now running on 4-year-old disks, too.

For me, mirroring is a big step UP from bare single drives.  That's my
"default state".

Of course, I'm a big fan of multiple levels of backup.

> One of the nice things about zfs is that allows, "to each his own."  My
> home server's main pool is 22x 73GB disks in a Sun A5000 configured as
> RAIDZ3.  Even without a hot spare, it takes several failures to get the
> pool into trouble.

Yes, it's very flexible, and while there are no doubt useless degenerate
cases here and there, lots of the cases are useful for some environment or
other.

That does seem like rather an extreme configuration.

> At the same time, there are several downsides to a wide stripe like that,
> including relatively poor iops and longer rebuild windows.  As noted
> above, until bp_rewrite arrives, I cannot change the geometry of a vdev,
> which kind of limits the flexibility.

There are a LOT of reasons to want bp_rewrite, certainly.

> As a side rant, I still find myself baffled that Oracle/Sun correctly
> touts the benefits of zfs in the enterprise, including tremendous
> flexibility and simplicity of filesystem provisioning and nondisruptive
> changes to filesystems via properties.
>
> These forums are filled with people stating that the enterprise demands
> simple, flexibile and nondisruptive filesystem changes, but no enterprise
> cares about simple, flexibile and nondisruptive pool/vdev changes, e.g.
> changing a vdev geometry or evacuating a vdev.  I can't accept that zfs
> flexibility is critical and zpool flexibility is unwanted.

We could certainly use that level of pool-equivalent flexibility at work;
we don't currently have it (not ZFS, not high-end enterprise storage
units).

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] one more time: pool size changes

2010-06-03 Thread David Dyer-Bennet

On Thu, June 3, 2010 10:15, Garrett D'Amore wrote:
> Using a stripe of mirrors (RAID0) you can get the benefits of multiple
> spindle performance, easy expansion support (just add new mirrors to the
> end of the raid0 stripe), and 100% data redundancy.   If you can afford
> to pay double for your storage (the cost of mirroring), this is IMO the
> best solution.

Referencing "RAID0" here in the context of ZFS is confusing, though.  Are
you suggesting using underlying RAID hardware to create virtual volumes to
then present to ZFS, or what?

> Note that this solution is not quite as resilient against hardware
> failure as raidz2 or raidz3.  While the RAID1+0 solution can tolerate
> multiple drive failures, if both both drives in a mirror fail, you lose
> data.

In a RAIDZ solution, two or more drive failures lose your data.  In a
mirrored solution, losing the WRONG two drives will still lose your data,
but you have some chance of surviving losing a random two drives.  So I
would describe the mirror solution as more resilient.

So going to RAIDZ2 or even RAIDZ3 would be better, I agree.

In an 8-bay chassis, there are other concerns, too.  Do I keep space open
for a hot spare?  There's no real point in a hot spare if you have only
one vdev; that is, 8-drive RAIDZ3 is clearly better than 7-drive RAIDZ2
plus a hot spare.  And putting everything into one vdev means that for any
upgrade I have to replace all 8 drives at once, a financial problem for a
home server.

> If you're clever, you'll also try to make sure each side of the mirror
> is on a different controller, and if you have enough controllers
> available, you'll also try to balance the controllers across stripes.

I did manage to split the mirrors accross controllers (I have 6 SATA on
the motherboard and I added an 8-port SAS card with SAS-SATA cabling).

> One way to help with that is to leave a drive or two available as a hot
> spare.
>
> Btw, the above recommendation mirrors what Jeff Bonwick himself (the
> creator of ZFS) has advised on his blog.

I believe that article directly influenced my choice, in fact.
-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] one more time: pool size changes

2010-06-03 Thread David Dyer-Bennet

On Wed, June 2, 2010 17:54, Roman Naumenko wrote:
> Recently I talked to a co-worker who manages NetApp storages. We discussed
> size changes for pools in zfs and aggregates in NetApp.
>
> And some time before I had suggested to a my buddy zfs for his new home
> storage server, but he turned it down since there is no expansion
> available for a pool.

I set up my home fileserver with ZFS (in 2006) BECAUSE zfs could expand
the pool for me, and nothing else I had access to could do that (home
fileserver, little budget).

My server is currently running with one data pool, three vdevs.  Each of
the data vdevs is a two-way mirror.  I started with one, expanded to two,
then expanded to three.  Rather than expanding to four when this fills up,
I'm going to attach a larger drive to the first mirror vdev, and then a
second one, and then remove the two current drives, thus expanding the
vdev without ever compromising the redundancy.
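In command terms that history looks roughly like this (device names
invented):

    zpool create tank mirror c0t0d0 c0t1d0   # initial pool, one mirror vdev
    zpool add tank mirror c0t2d0 c0t3d0      # second mirror vdev
    zpool add tank mirror c0t4d0 c0t5d0      # third mirror vdev

with the attach-new-then-detach-old step for the first vdev still to come.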

My choice of mirrors rather than RAIDZ is based on the fact that I have
only 8 hot-swap bays (I still think of this as LARGE for a home server;
the competition, things like the Drobo, tends to have 4 or 5), that I
don't need really large amounts of storage (after my latest upgrade I'm
running with 1.2TB of available data space), and that I expected to need
to expand storage over the life of the system.  With mirror vdevs, I can
expand them without compromising redundancy even temporarily, by attaching
the new drives before I detach the old drives; I couldn't do that with
RAIDZ.  Also, the fact that disk is now so cheap means that 100%
redundancy is affordable, I don't have to compromise on RAIDZ.
-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Can I change copies to 2 *after* I have copied a bunch of files?

2010-06-01 Thread David Dyer-Bennet

On Fri, May 28, 2010 11:04, Thanassis Tsiodras wrote:
> I've read on the web that copies=2 affects only the files copied *after* I
> have changed the setting

That is correct.

Rewriting datasets is a feature desired for future versions (it would make
a LOT of things, including shrinking pools and adding compression or extra
redundancy later work).  Nobody has promised a date for it that I recall.
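In the meantime, one blunt way to get the extra copy onto existing files is
to rewrite them yourself after changing the property; a crude sketch with
made-up names:

    zfs set copies=2 tank/data
    cp -rp /tank/data /tank/data.rewritten   # newly written blocks get 2 copies

which of course costs the I/O and, at least temporarily, the space of a
full extra copy (more if snapshots are holding the old blocks).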

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] send/recv over ssh

2010-05-21 Thread David Dyer-Bennet

On Fri, May 21, 2010 12:59, Brandon High wrote:
> On Fri, May 21, 2010 at 7:12 AM, David Dyer-Bennet  wrote:
>>
>> On Thu, May 20, 2010 19:44, Freddie Cash wrote:
>>> And you can always patch OpenSSH with HPN, thus enabling the NONE
>>> cipher,
>>> which disable encryption for the data transfer (authentication is
>>> always
>>> encrypted).  And twiddle the internal buffers that OpenSSH uses to
>>> improve
>>> transfer rates, especially on 100 Mbps or faster links.
>>
>> Ah!  I've been wanting that for YEARS.  Very glad to hear somebody has
>> done it.
>
> ssh-1 has had the 'none' cipher from day one, though it looks like
> openssh has removed it at some point. Fixing the buffers seems to be a
> nice tweak though.

I thought I remembered a "none" cipher, but couldn't find it the other
year and decided I must have been wrong.  I did use ssh-1, so maybe I
really WAS remembering after all.

>> With the common use of SSH for moving bulk data (under rsync as
>> well),
>> this is a really useful idea.  Of course one should think about where
>> one
>
> I think there's a certain assumption that using ssh = safe, and by
> enabling a none cipher you break that assumption. All of us know
> better, but less experienced admins may not.

Seems a high price to pay to try to protect idiots from being idiots. 
Anybody who doesn't understand that "encryption = none" means it's not
encrypted and hence not safe isn't safe as an admin anyway.
-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Interesting experience with Nexenta - anyone seen it?

2010-05-21 Thread David Dyer-Bennet

On Fri, May 21, 2010 10:19, Bob Friesenhahn wrote:
> On Fri, 21 May 2010, Miika Vesti wrote:
>
>> AFAIK OCZ Vertex 2 does not use volatile DRAM cache but non-volatile
>> NAND
>> grid. Whether it respects or ignores the cache flush seems irrelevant.
>>
>> There has been previous discussion about this:
>> http://comments.gmane.org/gmane.os.solaris.opensolaris.zfs/35702
>>
>> "I'm pretty sure that all SandForce-based SSDs don't use DRAM as their
>> cache, but take a hunk of flash to use as scratch space instead. Which
>> means that they'll be OK for ZIL use."
>>
>> So, OCZ Vertex 2 seems to be a good choice for ZIL.
>
> There seem to be quite a lot of blind assumptions in the above.  The
> only good choice for ZIL is when you know for a certainty and not
> assumptions based on 3rd party articles and blog postings.  Otherwise
> it is like assuming that if you jump through an open window that there
> will be firemen down below to catch you.

Just how DOES one know something for a certainty, anyway?  I've seen LOTS
of people mess up performance testing in ways that gave them very wrong
answers; relying solely on your own testing is as foolish as relying on a
couple of random blog posts.

To be comfortable (I don't ask for "know for a certainty"; I'm not sure
that exists outside of "faith"), I want a claim by the manufacturer and
multiple outside tests in "significant" journals -- which could be the
blog of somebody I trusted, as well as actual magazines and such. 
Ideally, certainly if it's important, I'd then verify the tests myself.

There aren't enough hours in the day, so I often get by with less.

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] send/recv over ssh

2010-05-21 Thread David Dyer-Bennet

On Thu, May 20, 2010 19:44, Freddie Cash wrote:
> And you can always patch OpenSSH with HPN, thus enabling the NONE
> cipher,
> which disable encryption for the data transfer (authentication is always
> encrypted).  And twiddle the internal buffers that OpenSSH uses to improve
> transfer rates, especially on 100 Mbps or faster links.

Ah!  I've been wanting that for YEARS.  Very glad to hear somebody has
done it.

With the common use of SSH for moving bulk data (under rsync as well),
this is a really useful idea.  Of course one should think about where one
is moving one's data unencrypted; but the precise cases where the
performance hit of encryption will show are the safe ones, such as between
my desktop and server which are plugged into the same switch; no data
would leave that small LAN segment.
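For the record, the bulk case I care about is just the stock pipeline
(hostname and dataset names invented); the HPN/NONE variant would be the
same thing with the extra -o options the patch adds, which I haven't tried
yet:

    zfs send tank/photos@backup | ssh fileserver zfs receive -F backup/photos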
-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Performance drop during scrub?

2010-05-03 Thread David Dyer-Bennet

On Mon, May 3, 2010 17:02, Richard Elling wrote:
> On May 3, 2010, at 2:38 PM, David Dyer-Bennet wrote:
>> On Sun, May 2, 2010 14:12, Richard Elling wrote:
>>> On May 1, 2010, at 1:56 PM, Bob Friesenhahn wrote:
>>>> On Fri, 30 Apr 2010, Freddie Cash wrote:
>>>>> Without a periodic scrub that touches every single bit of data in the
>>>>> pool, how can you be sure
>>>>> that 10-year files that haven't been opened in 5 years are still
>>>>> intact?
>>>>
>>>> You don't.  But it seems that having two or three extra copies of the
>>>> data on different disks should instill considerable confidence.  With
>>>> sufficient redundancy, chances are that the computer will explode
>>>> before
>>>> it loses data due to media corruption.  The calculated time before
>>>> data
>>>> loss becomes longer than even the pyramids in Egypt could withstand.
>>>
>>> These calculations are based on fixed MTBF.  But disk MTBF decreases
>>> with
>>> age. Most disks are only rated at 3-5 years of expected lifetime.
>>> Hence,
>>> archivists
>>> use solutions with longer lifetimes (high quality tape = 30 years) and
>>> plans for
>>> migrating the data to newer media before the expected media lifetime is
>>> reached.
>>> In short, if you don't expect to read your 5-year lifetime rated disk
>>> for
>>> another 5 years,
>>> then your solution is uhmm... shall we say... in need of improvement.
>>
>> Are they giving tape that long an estimated life these days?  They
>> certainly weren't last time I looked.
>
> Yes.
> http://www.oracle.com/us/products/servers-storage/storage/tape-storage/036556.pdf
> http://www.sunstarco.com/PDF%20Files/Quantum%20LTO3.pdf

Yep, they say 30 years.  That's probably in the same "years" where the MAM
gold archival DVDs are good for 200, I imagine.  (i.e. based on
accelerated testing, with the lab knowing what answer the client wants). 
Although we may know more about tape aging, the accelerated tests may be
more valid for tapes?

But LTO-3 is a 400GB tape that costs, hmmm, maybe $40 each (maybe less
with better shopping, that's a quick Amazon price rounded down).  (I don't
factor in compression in my own analysis because my data is overwhelmingly
image files and MP3 files, which don't compress further very well.)

Plus a $1000 drive, or $2000 for a 3-tape changer (and that's barely big
enough to back up my small server without manual intervention, might not
be by the end of the  year).

Tape is a LOT more expensive than my current hard-drive based backup
scheme, even if I use the backup drives only three years (and since they
spin less than 10% of the time, they should last pretty well).

Also, I lose my snapshots in a tape backup, whereas I keep them on my hard
drive backups.  (Or else I'm storing a ZFS send stream on tape and hoping
it will actually restore.)
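
The tape variant would be something along these lines (pool and snapshot
names invented; /dev/rmt/0n is the usual Solaris no-rewind tape device),
and the restore is exactly the part I'd be nervous about:

   # dump a full replication stream straight to tape
   zfs send -R tank@weekly | dd of=/dev/rmt/0n obs=256k

   # restore: read it back into a freshly created pool and hope it receives
   dd if=/dev/rmt/0n ibs=256k | zfs receive -Fd restored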

>> And I basically don't trust tape; too many bad experiences (ever since I
>> moved off of DECTape, I've been having bad experiences with tape).  The
>> drives are terribly expensive and I can't afford redundancy, and in
>> thirty years I very probably could not buy a new drive for my old tapes.
>>
>> I started out a big fan of tape, but the economics have been very much
>> against it in the range I'm working (small; 1.2 terabytes usable on my
>> server currently).
>>
>> I don't expect I'll keep my hard disks for 30 years; I expect I'll
>> upgrade them periodically, probably even within their MTBF.  (Although
>> note that, though tests haven't been run, the MTBF of a 5-year disk
>> after 4 years is nearly certainly greater than 1 year.)
>
> Yes, but MTBF != expected lifetime.  MTBF is defined as Mean Time Between
> Failures (a rate), not Time Until Death (a lifetime).  If your MTBF was
> 1 year, then the probability of failing within 1 year would be
> approximately 63%, assuming an exponential distribution.

Yeah, sorry, I stumbled into using the same wrong figures lots of people
were using.
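
For the record, the arithmetic behind that 63%, assuming the exponential
failure distribution Richard describes: the probability of a failure
within time t is

   P(t) = 1 - e^(-t/MTBF)

so at t = MTBF that comes to 1 - e^(-1) = 0.632..., i.e. roughly a 63%
chance the disk has failed somewhere along the way.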
-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Performance drop during scrub?

2010-05-03 Thread David Dyer-Bennet

On Sun, May 2, 2010 14:12, Richard Elling wrote:
> On May 1, 2010, at 1:56 PM, Bob Friesenhahn wrote:
>> On Fri, 30 Apr 2010, Freddie Cash wrote:
>>> Without a periodic scrub that touches every single bit of data in the
>>> pool, how can you be sure
>>> that 10-year files that haven't been opened in 5 years are still
>>> intact?
>>
>> You don't.  But it seems that having two or three extra copies of the
>> data on different disks should instill considerable confidence.  With
>> sufficient redundancy, chances are that the computer will explode before
>> it loses data due to media corruption.  The calculated time before data
>> loss becomes longer than even the pyramids in Egypt could withstand.
>
> These calculations are based on fixed MTBF.  But disk MTBF decreases with
> age. Most disks are only rated at 3-5 years of expected lifetime. Hence,
> archivists use solutions with longer lifetimes (high quality tape = 30
> years) and plans for migrating the data to newer media before the expected
> media lifetime is reached.  In short, if you don't expect to read your
> 5-year lifetime rated disk for another 5 years, then your solution is
> uhmm... shall we say... in need of improvement.

Are they giving tape that long an estimated life these days?  They
certainly weren't last time I looked.

And I basically don't trust tape; too many bad experiences (ever since I
moved off of DECTape, I've been having bad experiences with tape).  The
drives are terribly expensive and I can't afford redundancy, and in thirty
years I very probably could not buy a new drive for my old tapes.

I started out a big fan of tape, but the economics have been very much
against it in the range I'm working (small; 1.2 terabytes usable on my
server currently).

I don't expect I'll keep my hard disks for 30 years; I expect I'll upgrade
them periodically, probably even within their MTBF.  (Although note that,
though tests haven't been run, the MTBF of a 5-year disk after 4 years is
nearly certainly greater than 1 year.)

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Performance drop during scrub?

2010-04-30 Thread David Dyer-Bennet

On Fri, April 30, 2010 13:44, Freddie Cash wrote:
> On Fri, Apr 30, 2010 at 11:35 AM, Bob Friesenhahn <
> bfrie...@simple.dallas.tx.us> wrote:
>
>> On Thu, 29 Apr 2010, Tonmaus wrote:
>>
>>> Recommending to not using scrub doesn't even qualify as a workaround,
>>> in my regard.
>>>
>>
>> As a devoted believer in the power of scrub, I believe that after the OS,
>> power supplies, and controller have been verified to function with a good
>> scrubbing, if there is more than one level of redundancy, scrubs are not
>> really warranted.  With just one level of redundancy it becomes much more
>> important to verify that both copies were written to disk correctly.
>
> Without a periodic scrub that touches every single bit of data in the
> pool, how can you be sure that 10-year files that haven't been opened in
> 5 years are still intact?
>
> Self-healing only comes into play when the file is read.  If you don't
> read a file for years, how can you be sure that all copies of that file
> haven't succumbed to bit-rot?

Yes, that's precisely my point.  That's why it's especially relevant to
archival data -- it's important (to me), but not frequently accessed.

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Performance drop during scrub?

2010-04-30 Thread David Dyer-Bennet

On Thu, April 29, 2010 17:35, Bob Friesenhahn wrote:

> In my opinion periodic scrubs are most useful for pools based on
> mirrors, or raidz1, and much less useful for pools based on raidz2 or
> raidz3.  It is useful to run a scrub at least once on a well-populated
> new pool in order to validate the hardware and OS, but otherwise, the
> scrub is most useful for discovering bit-rot in singly-redundant
> pools.

I've got 10 years of photos on my disk now, and it's growing faster than
one year per year (since I'm scanning backwards slowly through the
negatives).  Many of them don't get accessed very often; they're archival,
not current use.
fileserver they live on -- I want some assurance, 20 years from now, that
they're still valid.  I needed something to check them periodically, and
something to check *against*, and block checksums and scrub seemed to fill
the bill.

So, yes, I want to catch bit rot -- on a pool of mirrored VDEVs.
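
For what it's worth, the "check them periodically" part can be as simple
as a scrub out of root's crontab; the pool name and schedule here are only
an example:

   # crontab entry: scrub the pool every Sunday at 02:00
   0 2 * * 0 /usr/sbin/zpool scrub tank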
-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Performance drop during scrub?

2010-04-28 Thread David Dyer-Bennet

On Wed, April 28, 2010 10:16, Eric D. Mudama wrote:
> On Wed, Apr 28 at  1:34, Tonmaus wrote:
>>> Zfs scrub needs to access all written data on all disks and is usually
>>> disk-seek or disk I/O bound so it is difficult to keep it from hogging
>>> the disk resources.  A pool based on mirror devices will behave much
>>> more nicely while being scrubbed than one based on RAIDz2.
>>
>> Experience seconded entirely. I'd like to repeat that I think we
>> need more efficient load balancing functions in order to keep
>> housekeeping payload manageable. Detrimental side effects of scrub
>> should not be a decision point for choosing certain hardware or
>> redundancy concepts in my opinion.
>
> While there may be some possible optimizations, i'm sure everyone
> would love the random performance of mirror vdevs, combined with the
> redundancy of raidz3 and the space of a raidz1.  However, as in all
> systems, there are tradeoffs.

The situations being mentioned are much worse than what seem reasonable
tradeoffs to me.  Maybe that's because my intuition is misleading me about
what's available.  But if the normal workload of a system uses 25% of its
sustained IOPS, and a scrub is run at "low priority", I'd like to think
that during a scrub I'd see a little degradation in performance, and that
the scrub would take 25% or so longer than it would on an idle system. 
There's presumably some inefficiency, so the two loads don't just add
perfectly; so maybe another 5% lost to that?  That's the big uncertainty. 
I have a hard time believing in 20% lost to that.
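
(Back-of-the-envelope, assuming the two loads share the disks ideally: if
the foreground work takes 25% of the IOPS, a low-priority scrub gets the
remaining 75%, so

   scrub time under load ~= idle scrub time / 0.75 ~= 1.33 x idle time

call it a third longer rather than a quarter, before whatever inefficiency
adds on top -- same ballpark.)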

Do you think that's a reasonable outcome to hope for?  Do you think ZFS is
close to meeting it?

People with systems that live at 75% all day are obviously going to have
more problems than people who live at 25%!

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SAS vs SATA: Same size, same speed, why SAS?

2010-04-27 Thread David Dyer-Bennet

On Tue, April 27, 2010 11:17, Bob Friesenhahn wrote:
> On Tue, 27 Apr 2010, David Dyer-Bennet wrote:
>>
>> I don't think I understand your scenario here.  The docs online at
>> <http://docs.sun.com/app/docs/doc/819-5461/gazgd?a=view> describe uses
>> of
>> zpool replace that DO run the array degraded for a while, and don't seem
>> to mention any other.
>>
>> Could you be more detailed?
>
> If a disk has failed, then it makes sense to physically remove the old
> disk, insert a new one, and do 'zpool replace tank c1t1d0'.  However
> if the disk has not failed, then you can install a new disk in another
> location and use the two argument form of replace like 'zpool replace
> tank c1t1d0 c1t1d7'.  If I understand things correctly, this allows
> you to replace one good disk with another without risking the data in
> your pool.

I don't see any reason to think the old device remains in use until the
new device is resilvered, and if it doesn't, then you're down one level of
redundancy the instant the old device goes out of service.

I don't have a RAIDZ group, but for anyone trying this while there's
significant load on the group, it should be easy to see whether there's
traffic on the old drive after the resilver starts.  If there is, that
would seem to be evidence that it's continuing to use the old drive while
resilvering to the new one, which would be good.
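
Concretely, using the device names from Bob's example, I'd expect the
experiment to look roughly like this:

   # two-argument replace: the new disk is already installed as c1t1d7
   zpool replace tank c1t1d0 c1t1d7

   # watch per-device traffic while the resilver runs; reads still hitting
   # c1t1d0 would suggest the old disk stays in service until it finishes
   zpool iostat -v tank 5
   zpool status tank        # shows resilver progress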

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SAS vs SATA: Same size, same speed, why SAS?

2010-04-27 Thread David Dyer-Bennet

On Tue, April 27, 2010 10:38, Bob Friesenhahn wrote:
> On Tue, 27 Apr 2010, David Dyer-Bennet wrote:
>>
>> Hey, you know what might be helpful?  Being able to add redundancy to a
>> raid vdev.  Being able to go from RAIDZ2 to RAIDZ3 by adding another
>> drive of suitable size.  Also being able to go the other way.  This lets
>> you do the trick of temporarily adding redundancy to a vdev while
>> swapping out devices one at a time to eventually upgrade the size (since
>> you're deliberately creating a fault situation, increasing redundancy
>> before you do it makes loads of sense!).
>
> You can already replace one drive with another (zpool replace) so as
> long as there is space for the new drive, it is not necessary to
> degrade the array and lose redundancy while replacing a device.  As
> long as you can physically add a drive to the system (even
> temporarily) it is not necessary to deliberately create a fault
> situation.

I don't think I understand your scenario here.  The docs online at
<http://docs.sun.com/app/docs/doc/819-5461/gazgd?a=view> describe uses of
zpool replace that DO run the array degraded for a while, and don't seem
to mention any other.

Could you be more detailed?
-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SAS vs SATA: Same size, same speed, why SAS?

2010-04-27 Thread David Dyer-Bennet

On Mon, April 26, 2010 17:21, Edward Ned Harvey wrote:

> Also, if you've got all those disks in an array, and they're MTBF is ...
> let's say 25,000 hours ... then 3 yrs later when they begin to fail, they
> have a tendency to all fail around the same time, which increases the
> probability of exceeding your designed level of redundancy.

It's useful to consider this when doing mid-life upgrades.  Unfortunately
there's not too much useful to be done right now with RAID setups.

With mirrors, when adding some disks mid-life (a common, though by no means
universal, scenario: don't fully populate the chassis at first, and add
more 1/3 to 1/2 of the way through the projected life), with some extra
trouble one can attach a new disk as an n+1st disk in an existing mirror,
wait for the resilver, and detach an old disk.  That mirror is now one new
disk and one old disk, rather than two disks of the same age.  Then build
a new mirror out of the freed disk plus another new disk.  Now you've got
both mirrors consisting of disks of different ages, less prone to failing
at the same time.  (Of course this doesn't work when you're using bigger
drives for the mid-life kicker, and most of the time it would make sense
to do so.)
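
In zpool terms the rotation is roughly the following sketch -- device
names are invented, with c1t2d0 as the old disk being freed and
c1t5d0/c1t6d0 as the new pair:

   # grow the existing mirror to a three-way with one new disk
   zpool attach tank c1t2d0 c1t5d0
   zpool status tank            # wait for the resilver to finish

   # drop the old disk back out, leaving one old and one new disk mirrored
   zpool detach tank c1t2d0

   # build a second mirror from the freed disk plus the other new disk
   zpool add tank mirror c1t2d0 c1t6d0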

Even buying different (mixed) brands initially doesn't help against aging;
only against batch or design problems.

Hey, you know what might be helpful?  Being able to add redundancy to a
raid vdev.  Being able to go from RAIDZ2 to RAIDZ3 by adding another drive
of suitable size.  Also being able to go the other way.  This lets you do
the trick of temporarily adding redundancy to a vdev while swapping out
devices one at a time to eventually upgrade the size (since you're
deliberately creating a fault situation, increasing redundancy before you
do it makes loads of sense!).

> I recently bought 2x 1Tb disks for my sun server, for $650 each.  This was
> enough to make me do the analysis, "why am I buying sun branded overpriced
> disks?"  Here is the abridged version:

No argument that, in the existing market, with various levels of need,
this is often the right choice.

I find it deeply frustrating and annoying that this dilemma exists
entirely due to bad behavior by the disk companies, though.  First they
sell deliberately-defective drives (lie about cache flush, for example)
and then they (in conspiracy with an accomplice company) charge us many
times the cost of the physical hardware for fixed versions.  This MUST be
stopped.  This is EXACTLY what standards exist for -- so we can buy
known-quantity products in a competitive market.

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Which build is the most stable, mainly for NAS (zfs)?

2010-04-14 Thread David Dyer-Bennet

On 14-Apr-10 22:44, Ian Collins wrote:

> On 04/15/10 06:16 AM, David Dyer-Bennet wrote:
>> Because 132 was the most current last time I paid much attention :-). As
>> I say, I'm currently holding out for 2010.$Spring, but knowing how to get
>> to a particular build via package would be potentially interesting for
>> the future still.
>
> I hope it's 2010.$Autumn, I don't fancy waiting until October.
>
> Hint: the southern hemisphere does exist!


I've even been there.

But the month/season relationship is too deeply built into too many 
things I follow (like the Christmas books coming out of the publisher's 
fall list; for that matter, like the fact that Christmas is in the 
winter) to go away at all easily.


California doesn't have seasons anyway.

--
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Which build is the most stable, mainly for NAS (zfs)?

2010-04-14 Thread David Dyer-Bennet

On Wed, April 14, 2010 15:28, Miles Nordin wrote:
>>>>>> "dd" == David Dyer-Bennet  writes:
>
> dd> Is it possible to switch to b132 now, for example?
>
> yeah, this is not so bad.  I know of two approaches:

Thanks, I've filed and flagged this for reference.

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Suggestions about current ZFS setup

2010-04-14 Thread David Dyer-Bennet

On Wed, April 14, 2010 12:29, Bob Friesenhahn wrote:
> On Wed, 14 Apr 2010, David Dyer-Bennet wrote:
>>>>
>>>> Not necessarily for a home server.  While mine so far is all mirrored
>>>> pairs of 400GB disks, I don't even think about "performance" issues, I
>>>> never come anywhere near the limits of the hardware.
>>>
>>> I don't see how the location of the server has any bearing on required
>>> performance.  If these 2TB drives are the new 4K sector variety, even
>>> you might notice.
>>
>> The location does not, directly, of course; but the amount and type of
>> work being supported does, and most home servers see request streams
>> very
>> different from commercial servers.
>
> If it was not clear, the performance concern is primarily for writes
> since zfs will load-share the writes across the available vdevs using
> an algorithm which also considers the write queue/backlog for each
> vdev.  If a vdev is slow, then it may be filled more slowly than the
> other vdevs.  This is also the reason why zfs encourages that all
> vdevs use the same organization.

As I said, I don't think of performance issues on mine.  So I wasn't
thinking of that particular detail, and it's good to call it out
explicitly.  If the performance of the new drives isn't adequate, then the
performance of the entire pool will become inadequate, it looks like.

I expect it's routine to have disks of different generations in the same
pool at this point (and if it isn't now, it will be in 5 years), just due
to what's available, replacing bad drives, and so forth.
-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Which build is the most stable, mainly for NAS (zfs)?

2010-04-14 Thread David Dyer-Bennet

On Wed, April 14, 2010 11:51, Tonmaus wrote:
>>
>> On Wed, April 14, 2010 08:52, Tonmaus wrote:
>> > safe to say: 2009.06 (b111) is unusable for the purpose, ans CIFS is
>> > dead in this build.
>>
>> That's strange; I run it every day (my home Windows "My Documents" folder
>> and all my photos are on 2009.06).
>>
>> -bash-3.2$ cat /etc/release
>>                        OpenSolaris 2009.06 snv_111b X86
>>            Copyright 2009 Sun Microsystems, Inc.  All Rights Reserved.
>>                         Use is subject to license terms.
>>                              Assembled 07 May 2009
>
>
> I would be really interested how you got past this
> http://defect.opensolaris.org/bz/show_bug.cgi?id=11371
> which I was so badly bitten by that I considered giving up on OpenSolaris.


I don't get random hangs in normal use; so I haven't done anything to "get
past" this.

I DO get hangs when funny stuff goes on, which may well be related to that
problem (at least they require a reboot).  Hmmm; I get hangs sometimes
when trying to send a full replication stream to an external backup drive,
and I have to reboot to recover from them.  I can live with this, in the
short term.  But now I'm feeling hopeful that they're fixed in what I'm
likely to be upgrading to next.
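
For context, the kind of replication send I mean is along these lines
('tank' and 'backup' standing in for my pool and the pool on the external
drive, snapshot name invented):

   zfs snapshot -r tank@backup-20100414
   zfs send -R tank@backup-20100414 | zfs receive -Fd backup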

>> > not sure if this is best choice. I'd like to hear from others as well.
>>
>> Well, it's technically not a stable build.
>>
>> I'm holding off to see what 2010.$Spring ends up being; I'll convert to
>> that unless it turns into a disaster.
>>
>> Is it possible to switch to b132 now, for example?  I don't think the old
>> builds are available after the next one comes out; I haven't been able to
>> find them.
>
> There are methods to upgrade to any dev build by pkg. Can't tell you from
> the top of my head, but I have done it with success.
>
> I wouldn't know why to go to 132 instead of 133, though. 129 seems to be
> an option.

Because 132 was the most current last time I paid much attention :-).  As
I say, I'm currently holding out for 2010.$Spring, but knowing how to get
to a particular build via package would be potentially interesting for the
future still.  Having been told it's possible helps; it makes it worth
looking harder.
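
For my own notes, I believe the recipe is roughly the following -- though
the repository URI and the 'entire' version string are from memory, so
treat them as assumptions rather than gospel:

   # point the image at the dev repository
   pkg set-publisher -O http://pkg.opensolaris.org/dev opensolaris.org

   # then pull the image to a particular build via the 'entire' incorporation
   pkg install entire@0.5.11-0.133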

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Suggestions about current ZFS setup

2010-04-14 Thread David Dyer-Bennet

On Wed, April 14, 2010 12:06, Bob Friesenhahn wrote:
> On Wed, 14 Apr 2010, David Dyer-Bennet wrote:
>>> It should be "safe" but chances are that your new 2TB disks are
>>> considerably slower than the 1TB disks you already have.  This should
>>> be as much cause for concern (or more so) than the difference in raidz
>>> topology.
>>
>> Not necessarily for a home server.  While mine so far is all mirrored
>> pairs of 400GB disks, I don't even think about "performance" issues, I
>> never come anywhere near the limits of the hardware.
>
> I don't see how the location of the server has any bearing on required
> performance.  If these 2TB drives are the new 4K sector variety, even
> you might notice.

The location does not, directly, of course; but the amount and type of
work being supported does, and most home servers see request streams very
different from commercial servers.

The last server software I worked on was able to support 80,000
simultaneous HD video streams.  Coming off Thumpers, in fact (well, coming
out of a truly obscene amount of DRAM buffer on the streaming board, which
was in turn loaded from Thumpers); this was the thing that Thumper was
originally designed for, known when I worked there as the Sun Streaming
System I believe.  You don't see loads like that on home servers :-).  And
a big database server would have an equally extreme but totally different
access pattern.

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

