RE: [zfs-discuss] Automounting ? (idea ?)

2006-09-27 Thread Bennett, Steve
 
> So recently, I decided to test out some of the ideas I've been toying
> with, and created 50,000 and then 100,000 filesystems. The test machine
> was a nice V20Z with dual 1.8 GHz Opterons and 4 GB of RAM, connected
> to a SCSI 3310 RAID array via two SCSI controllers.

I did a similar test a couple of months ago, albeit on a smaller system
and with 'only' 10,000 users. I saw a similar delay at boot time, but
also saw a large amount of memory utilisation.

> So ... how about an automounter? Is this even possible? Does it exist?

Around the same time, Casper Dik mentioned the possibility of
automounting ZFS datasets, as well as cool stuff like *creating* ZFS
datasets with the automounter.

One thing that hasn't been touched on is how one would back up a system
when some (or most) filesystems are unmounted most of the time.

Is it possible to make a backup and/or take a snapshot of an unmounted
dataset (and if not, is that a future possibility)?
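For what it's worth, I'd expect something like this to work even while
the dataset is unmounted, since snapshots and send operate on the
dataset rather than the mount (untested, and all the names are made up):

  # dataset and paths are made up
  zfs unmount tank/home/user123
  zfs snapshot tank/home/user123@nightly
  zfs send tank/home/user123@nightly > /backup/user123.nightly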

Steve.


RE: [zfs-discuss] Proposal: multiple copies of user data

2006-09-12 Thread Bennett, Steve
Darren said:
> Right, that is a very important issue.  Would a
> ZFS "scrub" framework do copy on write ?
> As you point out if it doesn't then we still need
> to do something about the old clear text blocks
> because strings(1) over the raw disk will show them.
> 
> I see the desire to have a knob that says "make this 
> encrypted now" but I personally believe that it is
> actually better if you can make this choice at the
> time you create the ZFS data set.

I'm not sure that that gets rid of the problem at all.

If I have an existing filesystem that I want to encrypt, but I need to
create a new dataset to do so, I'm going to create my new, encrypted
dataset, then copy my data onto it, then (maybe) delete the old one.

If both datasets are in the same pool (which is likely), I'll still not
be able to securely erase the blocks that have all my cleartext data on
them. The only way to do the job properly would be to overwrite the entire
pool, which is likely to be pretty inconvenient in most cases.
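To make that concrete, the migration I have in mind would look something
like this (dataset names made up, and 'encryption=on' is just a guess at
whatever knob the crypto project ends up providing):

  zfs create -o encryption=on tank/secure   # hypothetical crypto property
  (cd /tank/plain && tar cf - .) | (cd /tank/secure && tar xf -)
  zfs destroy tank/plain      # old cleartext blocks are freed, not erased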

So, how about some way to securely erase freed blocks?

It could be implemented as a one-off operation that acts on an entire
pool, e.g.
zfs shred tank
which would walk the free block list and overwrite each free block with
random data some number of times.
Or it might be more useful to have it as a per-dataset option:
zfs set shred=32 tank/secure
which could overwrite blocks with random data as they are freed.
I have no idea how expensive this might be (both in development time
and in performance hit), but its use might be a bit wider than just
dealing with encryption and/or rekeying.

I guess that deletion of a snapshot might get a bit expensive, but maybe
there's some way that blocks awaiting shredding could be queued up and
dealt with at a lower priority...

Steve.


RE: [zfs-discuss] Re: Recommendation ZFS on StorEdge 3320 - offtopic

2006-09-08 Thread Bennett, Steve

> Dunno about eSATA jbods, but eSATA host ports have
> appeared on at least two HDTV-capable DVRs for storage
> expansion (looks like one model of the Scientific Atlanta
> cable box DVR's as well as on the shipping-any-day-now
> Tivo Series 3).  
> 
> It's strange that they didn't go with firewire since it's 
> already widely used for digital video.

Cost? If you use eSATA it's pretty much just a physical connector on
the board, whereas I guess FireWire needs a 1394 interface (a couple of
dollars?) plus a royalty to all the patent holders.

It's probably not much, but I can't see how there can be *any* margin in
consumer electronics these days...

Steve.


RE: [zfs-discuss] ZFS Boot Disk

2006-08-18 Thread Bennett, Steve
Lori said:
> The limitation is mainly about the *number* of disks
> that can be accessed at one time.
> ...
> But with straight mirroring, there's no such problem
> because any disk in the mirror can supply all of the
> disk blocks needed to boot.

Does that mean that these restrictions will go away once replication can
be varied on a per-dataset (or per-file) basis? You could have all your
'essential to boot' files mirrored across all disks, then raidz2 the
rest...
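In the meantime I assume the split has to happen at the pool level,
along these lines (device names made up, and assuming zfs boot gets far
enough to allow a simple mirrored root pool):

  # device names made up; assumes zfs boot eventually supports this split
  zpool create rootpool mirror c0t0d0s0 c0t1d0s0
  zpool create data raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0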

Steve.


RE: [zfs-discuss] How to best layout our filesystems

2006-07-27 Thread Bennett, Steve
Eric said:
> For U3, these are the performance fixes:
> 6424554 full block re-writes need not read data in
> 6440499 zil should avoid txg_wait_synced() and use dmu_sync() to issue
> parallel IOs when fsyncing
> 6447377 ZFS prefetch is inconsistent
> 6373978 want to take lots of snapshots quickly ('zfs snapshot -r')
> 
> you could perhaps include these two as well:
> 4034947 anon_swap_adjust() should call kmem_reap() if availrmem is low.
> 6416482 filebench oltp workload hangs in zfs
> 
> There won't be anything in U3 that isn't already in nevada...

Hi Eric,

Do S10U2 users have to wait for U3 to get these fixes, or are they going
to be released as patches before then?
I'm presuming that U3 is scheduled for early 2007...

Steve.


RE: [zfs-discuss] Expanding raidz2

2006-07-13 Thread Bennett, Steve
Jeff Bonwick said:

> RAID-Z takes a different approach.  We were designing a filesystem
> as well, so we could make the block pointers as semantically rich
> as we wanted.  To that end, the block pointers in ZFS contains data
> layout information.  One nice side effect of this is that we don't
> need fixed-width RAID stripes.  If you have 4+1 RAID-Z, we'll store
> 128k as 4x32k plus 32k of parity, just like any RAID system would.
> But if you only need to store 3 sectors, we won't do a partial-stripe
> update of an existing 5-wide stripe; instead, we'll just allocate
> four sectors, and store the data and its parity.  The stripe width
> is variable on a per-block basis.  And, although we don't support it
> yet, so is the replication model.  The rule for how to reconstruct
> a given block is described explicitly in the block pointer, not
> implicitly by the device configuration.

Thanks for the explanation - a great help in understanding how all this
stuff fits together.

Unfortunately I'm now less sure about why you cannot 'just' add another
disk to a RAID-Z pool. Is this just a policy decision for the sake of
keeping it simple, rather than a technical restriction?
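As far as I can tell, the only expansion that works today is adding a
whole new RAID-Z vdev alongside the existing one - e.g., with made-up
device names:

  zpool add tank raidz c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0   # made-up devices

which grows the pool but doesn't widen the original stripe.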

> If your free disk space might be used for single-copy data,
> or might be used for mirrored data, then how much free space
> do you have?  Questions like that need to be answered, and
> answered in ways that make sense.

They need to be answered, but as the storage scales up we don't need
any extra accuracy: knowing that a filesystem is somewhere around 80%
full is just fine. I really don't need to care precisely how many
blocks are free, and the exact figure actually hinders me (I have to
scale it into GB, or into a percentage of the space used).

The fact that we pretty much ignore exact block counts leads me to
think that we don't actually need to care exactly how many blocks are
free on a disk - so if I store N blocks of data it's acceptable for the
number of free blocks to change by something different from N. And once
data starts to be compressed, the direct correlation between the size
of a file and the amount of disk space it uses goes away in any case.

All pretty exciting - how long are we going to have to wait for this?

Steve.


RE: [zfs-discuss] Expanding raidz2

2006-07-13 Thread Bennett, Steve
 
> > I guess that could be made to work, but then the data on 
> > the disk becomes much (much much) more difficult to
> > interpret because you have some rows which are effectively
> > one width and others which are another (ad infinitum).
> 
> How do rows come into it?  I was just assuming that each
> (existing) in-use disk block was pointed to by a FS block,
> which was tracked by other structures.  I was guessing that
> adding space (effectively extending the "rows") wasn't
> going to be noticed for accessing old data.

That's what my assumption was too. I had the impression from the
initial information (I nearly said hype) about ZFS that the distinctions
between RAID levels were to become less clear, i.e. that you could have
some files stored with higher resilience than others.

Maybe this is a dumb question, but I've never written a filesystem: is
there a fundamental reason why you cannot have some files mirrored,
others as raidz, and others with no resilience? This would allow a pool
to initially exist on one disk, then gracefully change between different
resilience strategies as you add disks and the requirements change.

Apologies if this is pie in the sky.

Steve.


RE: [zfs-discuss] ZFS needs a viable backup mechanism

2006-07-07 Thread Bennett, Steve

Mike said:
> 3) ZFS ability to recognize duplicate blocks and store only one copy.
> I'm not sure the best way to do this, but my thought was to have ZFS
> remember what the checksums of every block are.  As new blocks are
> written, the checksum of the new block is compared to known checksums.
>  If there is a match, a full comparison of the block is performed.  If
> it really is a match, the data is not really stored a second time.  In
> this case, you are still backing up and restoring 50 TB.

I've done a limited version of this on a disk-to-disk backup system
that we use - I use rsync with --link-dest to preserve multiple copies
in a space-efficient way, but I found that glitches occasionally caused
the links to be lost, so I have a job I run from time to time that
looks for identical files and hard-links them to each other.
The ability to get this done in ZFS would be pretty neat, and presumably
COW would ensure that there was no danger of a change to one copy
affecting any others.
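A minimal sketch of that kind of pass (the /backup path is made up, and
it assumes GNU md5sum is available and no whitespace in filenames):

  # checksum everything, then hard-link files whose checksums (and
  # contents) match; the /backup path is made up
  find /backup -type f -exec md5sum {} + | sort |
  nawk 'seen[$1] { printf "%s\n%s\n", seen[$1], $2; next } { seen[$1] = $2 }' |
  xargs -n2 sh -c 'cmp -s "$0" "$1" && ln -f "$0" "$1"'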

Even if there were severe restrictions on how it worked - e.g. only
files with the same relative paths would be considered, or it was
batch-only instead of live and continuous - it would still be pretty
powerful.

Steve.


RE: [zfs-discuss] ZFS needs a viable backup mechanism

2006-07-07 Thread Bennett, Steve
 
> If you are going to use Veritas NetBackup why not use the 
> native Solaris client ?

I don't suppose anyone knows if Networker will become ZFS-aware at any
point?
e.g.
  backing up properties
  backing up an entire pool as a single save set
  efficient incrementals (something similar to "zfs send -i")
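For the incrementals I'm thinking of something along these lines (host,
pool and snapshot names made up):

  # pool, host and snapshot names made up; backup/home on the far side
  # must already hold the @monday snapshot
  zfs snapshot tank/home@tuesday
  zfs send -i tank/home@monday tank/home@tuesday | \
      ssh backuphost zfs receive backup/home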

The ability to back stuff up well would make widespread adoption easier,
especially if Thumper lives up to expectations.

Steve.


[zfs-discuss] what to put on zfs

2006-06-30 Thread Bennett, Steve
A slightly different tack now...

What filesystems is it a good (or bad) idea to put on ZFS?
root - NO (not yet anyway)
home - YES (although the huge number of mounts still scares me a bit)
/usr - possible?
/var - possible?
swap - no?

Is there any advantage in having multiple zpools over just having one
and allocating all filesystems out of it?
Obviously if you wanted (for example) /export/home to be raidz and /usr
to be mirrored you would have to have more than one pool, but are there
other considerations beyond that?
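By 'just having one' I mean something like this (device names made up):

  # device names made up
  zpool create tank mirror c1t0d0 c1t1d0
  zfs create tank/usr
  zfs create tank/var
  zfs create tank/export
  zfs create tank/export/home

with every filesystem carved out of tank.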

I'm thinking that ZFS frees me up from getting the sizing 'right' at
install time, i.e. big enough that I don't have to resize later, which
inevitably means at least one filesystem ending up far bigger than it
needs to be.

Steve.


RE: [zfs-discuss] Re: Re: Supporting ~10K users on ZFS

2006-06-30 Thread Bennett, Steve
Casper said:
> You can have composite mounts (multiple nested mounts)
> but that is essentially a single automount entry so it
> can't be overly long, I believe.

I've seen that in the man page, but I've never managed to
find a use for it!

What I'd *like* to be able to do is have a map that amounts to:

00 -ro \
  / keck:/export/home/00
  /* -rw /export/home/00/&
01 -ro \
  / keck:/export/home/01
  /* -rw /export/home/01/&
...

This doesn't work - I think it's beyond the capabilities of automountd.
I don't even think an executable map would help.

I can see that I could use an executable map to preserve the
/export/home/NN/username layout on the server, but present
/home/username on the client - we were considering this on a different
system here (where we're encountering similar problems with a Panasas
fileserver).
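A minimal sketch of that kind of executable map (the lookup file and the
bucket scheme are made up - automountd just runs the map with the key as
its argument and uses whatever it prints as the entry):

  #!/bin/ksh
  # hypothetical executable map: automountd runs it with the key (the
  # username) as $1; /etc/home_buckets is a made-up "username NN" file.
  user=$1
  nn=$(nawk -v u="$user" '$1 == u { print $2 }' /etc/home_buckets)
  [ -n "$nn" ] && echo "-rw keck:/export/home/$nn/$user"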

Thanks

Steve.