Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-22 Thread Nico Williams
IIRC dump is special.

As for swap... really, you don't want to swap.  If you're swapping you
have problems.  Any swap space you have is to help you detect those
problems and correct them before apps start getting ENOMEM.  There
*are* exceptions to this, such as Varnish.  For Varnish and any other
apps like it I'd dedicate an entire flash drive to it, no ZFS, no
nothing.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-20 Thread Nico Williams
Bloom filters are very small, that's the difference.  You might only need a
few bits per block for a Bloom filter.  Compare to the size of a DDT entry.
 A Bloom filter could be cached entirely in main memory.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] RFE: Un-dedup for unique blocks

2013-01-19 Thread Nico Williams
I've wanted a system where dedup applies only to blocks being written
that have a good chance of being dups of others.

I think one way to do this would be to keep a scalable Bloom filter
(on disk) into which one inserts block hashes.

To decide if a block needs dedup one would first check the Bloom
filter; if the block's hash is in it, use the dedup code path, else use
the non-dedup code path and insert the hash into the Bloom filter.  This
means that the filesystem would store *two* copies of any
deduplicatious block, with one of those not being in the DDT.
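
Roughly, as a self-contained toy in C (the names, filter size, and the
FNV-based hashing are all made up for illustration; this is not ZFS
code):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define BF_BITS (1u << 20)             /* ~128 KB of filter for the toy */
    static uint8_t bf[BF_BITS / 8];

    /* FNV-1a with different seeds stands in for k independent hashes. */
    static uint32_t
    bf_hash(const void *buf, size_t len, uint32_t seed)
    {
        const uint8_t *p = buf;
        uint32_t h = 2166136261u ^ seed;
        while (len--)
            h = (h ^ *p++) * 16777619u;
        return h % BF_BITS;
    }

    /* Returns 1 if the key was probably already present, 0 if definitely
     * new; the key is always left inserted afterwards. */
    static int
    bf_test_and_set(const void *key, size_t len)
    {
        int present = 1;
        for (uint32_t k = 0; k < 4; k++) {
            uint32_t bit = bf_hash(key, len, k);
            if (!(bf[bit / 8] & (1u << (bit % 8))))
                present = 0;
            bf[bit / 8] |= (uint8_t)(1u << (bit % 8));
        }
        return present;
    }

    int
    main(void)
    {
        const char *blocks[] = { "blockA", "blockB", "blockA" };
        for (int i = 0; i < 3; i++) {
            /* In the real thing the key would be the block's checksum. */
            if (bf_test_and_set(blocks[i], strlen(blocks[i])))
                printf("%s: probably a dup, take the dedup (DDT) path\n",
                    blocks[i]);
            else
                printf("%s: new, ordinary write, no DDT entry\n", blocks[i]);
        }
        return 0;
    }

The point being that only blocks whose checksum hits the filter ever
touch the DDT; a false positive just costs one unnecessary DDT lookup.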

This would allow most writes of non-duplicate blocks to be faster than
normal dedup writes, but still slower than normal non-dedup writes:
the Bloom filter will add some cost.

The nice thing about this is that Bloom filters can be sized to fit in
main memory, and will be much smaller than the DDT.

It's very likely that this is a bit too obvious to just work.

Of course, it is easier to just use flash.  It's also easier to just
not dedup: the most highly deduplicatious data (VM images) is
relatively easy to manage using clones and snapshots, to a point
anyways.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Solaris 11 System Reboots Continuously Because of a ZFS-Related Panic (7191375)

2013-01-14 Thread Nico Williams
On Mon, Jan 14, 2013 at 1:48 PM, Tomas Forsman st...@acc.umu.se wrote:
 https://bug.oraclecorp.com/pls/bug/webbug_print.show?c_rptno=15852599

 Host oraclecorp.com not found: 3(NXDOMAIN)

 Would oracle.internal be a better domain name?

Things like that cannot be changed easily.  They (Oracle) are stuck
with that domain name for the foreseeable future.  Also, whoever thought
it up probably didn't consider leakage of internal URIs to the
outside.  *shrug*
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Can the ZFS copies attribute substitute HW disk redundancy?

2012-07-30 Thread Nico Williams
The copies property is really only for laptops, where the likelihood of
redundancy is very low (there are some high-end laptops with multiple
drives, but those are relatively rare) and where this idea is better
than nothing.  It's also nice that copies can be set on a per-dataset
basis (whereas RAID-Zn and mirroring provide pool-wide redundancy,
not per-dataset), so you could set it > 1 on home directories but not
on /.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Nico Williams
On Wed, Jul 11, 2012 at 9:48 AM,  casper@oracle.com wrote:
Huge space, but still finite…

 Dan Brown seems to think so in Digital Fortress but it just means he
 has no grasp on big numbers.

I couldn't get past that.  I had to put the book down.  I'm guessing
it was as awful as it threatened to be.

IMO, FWIW, yes, do add SHA-512 (truncated to 256 bits, of course).

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Nico Williams
On Wed, Jul 11, 2012 at 3:45 AM, Sašo Kiselkov skiselkov...@gmail.com wrote:
 It's also possible to set dedup=verify with checksum=sha256,
 however, that makes little sense (as the chances of getting a random
 hash collision are essentially nil).

IMO dedup should always verify.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New fast hash algorithm - is it needed?

2012-07-11 Thread Nico Williams
You can treat whatever hash function as an idealized one, but actual
hash functions aren't.  There may well be as-yet-undiscovered input
bit pattern ranges where there's a large density of collisions in some
hash function, and indeed, since our hash functions aren't ideal,
there must be.  We just don't know where these potential collisions
are -- for cryptographically secure hash functions that's enough (plus
2nd pre-image and 1st pre-image resistance, but allow me to handwave),
but for dedup?  *shudder*.

Now, for some content types collisions may not be a problem at all.
Think of security camera recordings: collisions will show up as bad
frames in a video stream that no one is ever going to look at, and if
they should need it, well, too bad.

And for other content types collisions can be horrible.  Us ZFS lovers
love to talk about how silent bit rot means you may never know about
serious corruption in other filesystems until it's too late.  Now, if
you disable verification in dedup, what do you get?  The same
situation as other filesystems are in relative to bit rot, only with
different likelihoods.

Disabling verification is something to do after careful deliberation,
not something to do by default.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Interaction between ZFS intent log and mmap'd files

2012-07-04 Thread Nico Williams
On Wed, Jul 4, 2012 at 11:14 AM, Bob Friesenhahn
bfrie...@simple.dallas.tx.us wrote:
 On Tue, 3 Jul 2012, James Litchfield wrote:
 Agreed - msync/munmap is the only guarantee.

 I don't see that the munmap definition assures that anything is written to
 disk.  The system is free to buffer the data in RAM as long as it likes
 without writing anything at all.

Oddly enough the manpages at the Open Group don't make this clear.  So
I think it may well be advisable to use msync(3C) before munmap() on
MAP_SHARED mappings.  However, I think all implementors should, and
probably all do (Linux even documents that it does) have an implied
msync(2) when doing a munmap(2).  It really makes no sense at all to
have munmap(2) not imply msync(3C).

(That's another thing, I don't see where the standard requires that
munmap(2) be synchronous.  I think it'd be nice to have an mmap(2)
option for requesting whether munmap(2) of the same mapping be
synchronous or asynchronous.  Async munmap(2) means no need to issue
cross-calls right away, instead allowing the mapping to be torn down over time.
Doing a synchronous msync(3C), then a munmap(2) is a recipe for going
real slow, but if munmap(2) does not portably guarantee an implied
msync(3C), then would it be safe to do an async msync(2) then
munmap(2)??)
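
Concretely, the conservative portable pattern would be something like
this (a minimal sketch, made-up function name, most error handling
elided):

    #include <sys/mman.h>
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    /* Update the first few bytes of a file through a MAP_SHARED mapping,
     * flushing explicitly rather than relying on munmap() implying msync(). */
    int
    update_mapped_file(const char *path, size_t len)
    {
        int fd = open(path, O_RDWR);
        if (fd == -1)
            return -1;

        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) {
            (void) close(fd);
            return -1;
        }

        memcpy(p, "hello", 5);              /* dirty the mapping */

        (void) msync(p, len, MS_SYNC);      /* don't count on munmap() for this */
        (void) munmap(p, len);
        return close(fd);
    }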

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Interaction between ZFS intent log and mmap'd files

2012-07-03 Thread Nico Williams
On Tue, Jul 3, 2012 at 9:48 AM, James Litchfield
jim.litchfi...@oracle.com wrote:
 On 07/02/12 15:00, Nico Williams wrote:
 You can't count on any writes to mmap(2)ed files hitting disk until
 you msync(2) with MS_SYNC.  The system should want to wait as long as
 possible before committing any mmap(2)ed file writes to disk.
 Conversely you can't expect that no writes will hit disk until you
 msync(2) or munmap(2).

 Driven by fsflush which will scan memory (in chunks) looking for dirty,
 unlocked, non-kernel pages to flush to disk.

Right, but one just cannot count on that -- it's not part of the API
specification.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Interaction between ZFS intent log and mmap'd files

2012-07-02 Thread Nico Williams
On Mon, Jul 2, 2012 at 3:32 PM, Bob Friesenhahn
bfrie...@simple.dallas.tx.us wrote:
 On Mon, 2 Jul 2012, Iwan Aucamp wrote:
 I'm interested in some more detail on how ZFS intent log behaves for
 updates done via a memory mapped file - i.e. will the ZIL log updates done
 to an mmap'd file or not ?


 I would expect these writes to go into the intent log unless msync(2) is
 used on the mapping with the MS_SYNC option.

You can't count on any writes to mmap(2)ed files hitting disk until
you msync(2) with MS_SYNC.  The system should want to wait as long as
possible before committing any mmap(2)ed file writes to disk.
Conversely you can't expect that no writes will hit disk until you
msync(2) or munmap(2).

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [developer] Re: History of EPERM for unlink() of directories on ZFS?

2012-06-26 Thread Nico Williams
On Tue, Jun 26, 2012 at 9:44 AM, Alan Coopersmith
alan.coopersm...@oracle.com wrote:
 On 06/26/12 05:46 AM, Lionel Cons wrote:
 On 25 June 2012 11:33,  casper@oracle.com wrote:
 To be honest, I think we should also remove this from all other
 filesystems and I think ZFS was created this way because all modern
 filesystems do it that way.

 This may be wrong way to go if it breaks existing applications which
 rely on this feature. It does break applications in our case.

 Existing applications rely on the ability to corrupt UFS filesystems?
 Sounds horrible.

My guess is that the OP just wants unlink() of an empty directory to
be the same as rmdir() of the same.  Or perhaps they want unlink() of
a non-empty directory to result in a recursive rm...  But if they
really want hardlinks to directories, then yeah, that's horrible.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] History of EPERM for unlink() of directories on ZFS?

2012-06-26 Thread Nico Williams
On Tue, Jun 26, 2012 at 8:12 AM, Lionel Cons
lionelcons1...@googlemail.com wrote:
 On 26 June 2012 14:51,  casper@oracle.com wrote:
 We've already asked our Netapp representative. She said it's not hard
 to add that.

Did NetApp tell you that they'll add support for using the NFSv4 LINK
operation on source objects that are directories?!  I'd be extremely
surprised!  Or did they only tell you that they'll add support for
using the NFSv4 REMOVE operation on non-empty directories?  The latter
is definitely feasible (although it could fail due to share deny OPENs
of files below, say, but hey).  The former is... not sane.

 I'd suggest whether you can restructure your code and work without this.

 It would require touching code for which we don't have sources anymore
 (people gone, too). It would also require to create hard links to the
 results files directly, which means linking 15000+ files per directory
 with a minimum of 3 directories. Each day (this is CERN after
 all).

Oh, I see.  But you still don't want hardlinks to directories!
Instead you might be able to use LD_PRELOAD to emulate the behavior
that the application wants.  The app is probably implementing
rename(), so just detect the sequence and map it to an actual
rename(2).
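
For illustration only, the skeleton of such an interposer could look
like the sketch below (untested, with made-up build flags; it merely
maps unlink() of a directory to rmdir(2), whereas a real shim for this
app would detect the link()/unlink() sequence and turn it into
rename(2)):

    /* Build (roughly): cc -shared -fPIC -o shim.so shim.c -ldl
     * Use:             LD_PRELOAD=./shim.so legacy-app ...
     * _GNU_SOURCE (Linux) or __EXTENSIONS__ (Solaris) exposes RTLD_NEXT. */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int
    unlink(const char *path)
    {
        static int (*real_unlink)(const char *);
        struct stat st;

        if (real_unlink == NULL)
            real_unlink = (int (*)(const char *))dlsym(RTLD_NEXT, "unlink");

        /* lstat(2) so we never follow a symlink; if the target is a
         * directory, do what the app apparently expects: rmdir(2) it. */
        if (lstat(path, &st) == 0 && S_ISDIR(st.st_mode))
            return rmdir(path);

        return real_unlink(path);
    }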

 The other way around would be to throw the SPARC machines away and go
 with Netapp.

So Solaris is just a fileserver here?

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Is there an actual newsgroup for zfs-discuss?

2012-06-11 Thread Nico Williams
On Mon, Jun 11, 2012 at 5:05 PM, Tomas Forsman st...@acc.umu.se wrote:
 .. or use a mail reader that doesn't suck.

Or the mailman thread view.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Terminology question on ZFS COW

2012-06-05 Thread Nico Williams
COW goes back at least to the early days of virtual memory and fork().
 On fork() the kernel would arrange for writable pages in the parent
process to be made read-only so that writes to them could be caught
and then the page fault handler would copy the page (and restore write
access) so the parent and child each have their own private copies.
COW as used in ZFS is not the same, but the concept was introduced
very early also, IIRC in the mid-80s -- certainly no later than
4.4BSD's log-structured filesystem (which ZFS resembles in many ways).

So, is COW a misnomer?  Yes and no, and anyways, it's irrelevant.  The
important thing is that when you say COW people understand that you're
not saving a copy of the old thing but rather writing the new thing to
a new location.  (The old version of whatever was copied-on-write is
stranded, unless -of course- you have references left to it from
things like snapshots.)

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] current status of SAM-QFS?

2012-05-03 Thread Nico Williams
On Wed, May 2, 2012 at 7:59 AM, Paul Kraus p...@kraus-haus.org wrote:
 On Wed, May 2, 2012 at 7:46 AM, Darren J Moffat darr...@opensolaris.org 
 wrote:
 If Oracle is only willing to share (public) information about the
 roadmap for products via official sales channels then there will be
 lots of FUD in the market. Now, as to sharing futures and NDA
 material, that _should_ only be available via direct Oracle channels
 (as it was under Sun as well).

Sun was tight lipped too, yes, but information leaked through the open
or semi-open software development practices in Solaris.  If you saw
some feature pushed to some gate you had no guarantee that it would
remain there or be supported, but you had a pretty good inkling as to
whether the engineers working on it intended it to remain there.

If you can't get something out of your rep, you might try reading the
tea leaves (sketchy business).  But ultimately you need to be prepared
for any product's EOL.  You can expect some amount of warning time
about EOLs, but legacy has a way of sticking around, so write a plan for
how to migrate data and where to, then put the plan in a drawer
somewhere (and update it as necessary).

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] cluster vs nfs

2012-04-26 Thread Nico Williams
On Thu, Apr 26, 2012 at 12:10 AM, Richard Elling
richard.ell...@gmail.com wrote:
 On Apr 25, 2012, at 8:30 PM, Carson Gaspar wrote:
 Reboot requirement is a lame client implementation.

And lame protocol design.  You could possibly migrate read-write NFSv3
on the fly by preserving FHs and somehow updating the clients to go to
the new server (with a hiccup in between, no doubt), but only entire
shares at a time -- you could not migrate only part of a volume with
NFSv3.

Of course, having migration support in the protocol does not equate to
getting it in the implementation, but it's certainly a good step in
that direction.

 You are correct, a ZFS send/receive will result in different file handles on
 the receiver, just like
 rsync, tar, ufsdump+ufsrestore, etc.

That's understandable for NFSv2 and v3, but for v4 there's no reason
that an NFSv4 server stack and ZFS could not arrange to preserve FHs
(if, perhaps, at the price of making the v4 FHs rather large).
Although even for v3 it should be possible for servers in a cluster to
arrange to preserve devids...

Bottom line: live migration needs to be built right into the protocol.

For me one of the exciting things about Lustre was/is the idea that
you could just have a single volume where all new data (and metadata)
is distributed evenly as you go.  Need more storage?  Plug it in,
either to an existing head or via a new head, then flip a switch and
there it is.  No need to manage allocation.  Migration may still be
needed, both within a cluster and between clusters, but that's much
more manageable when you have a protocol where data locations can be
all over the place in a completely transparent manner.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] cluster vs nfs

2012-04-26 Thread Nico Williams
On Thu, Apr 26, 2012 at 5:45 PM, Carson Gaspar car...@taltos.org wrote:
 On 4/26/12 2:17 PM, J.P. King wrote:
 I don't know SnapMirror, so I may be mistaken, but I don't see how you
 can have non-synchronous replication which can allow for seamless client
 failover (in the general case). Technically this doesn't have to be
 block based, but I've not seen anything which wasn't. Synchronous
 replication pretty much precludes DR (again, I can think of theoretical
 ways around this, but have never come across anything in practice).

 seamless is an over-statement, I agree. NetApp has synchronous SnapMirror
 (which is only mostly synchronous...). Worst case, clients may see a
 filesystem go backwards in time, but to a point-in-time consistent state.

Sure, if we assume apps make proper use of O_EXCL, O_APPEND,
link(2)/unlink(2)/rename(2), sync(2), fsync(2), and fdatasync(3C) and
can roll their own state back on their own.  Databases typically know
how to do that (e.g., SQLite3).  Most apps?  Doubtful.
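
By proper use I mean patterns like the one below (a minimal sketch with
made-up names): create a temporary file with O_EXCL, fsync it, then
rename it into place, so after a rollback the app sees either the old or
the new contents, never a torn one.

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Classic crash-consistent update: write a temp file, fsync it,
     * then atomically rename it over the old one. */
    int
    save_atomically(const char *path, const char *tmp,
        const void *buf, size_t len)
    {
        int fd = open(tmp, O_WRONLY | O_CREAT | O_EXCL, 0644);
        if (fd == -1)
            return -1;

        if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
            (void) close(fd);
            (void) unlink(tmp);
            return -1;
        }
        if (close(fd) != 0) {
            (void) unlink(tmp);
            return -1;
        }

        return rename(tmp, path);   /* old or new contents, never garbage */
    }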

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] cluster vs nfs

2012-04-26 Thread Nico Williams
On Thu, Apr 26, 2012 at 12:37 PM, Richard Elling
richard.ell...@gmail.com wrote:
 [...]

NFSv4 had migration in the protocol (excluding protocols between
servers) from the get-go, but it was missing a lot (FedFS) and was not
implemented until recently.  I've no idea what clients and servers
support it adequately besides Solaris 11, though that's just my fault
(not being informed).  It's taken over a decade to get to where we
have any implementations of NFSv4 migration.

 For me one of the exciting things about Lustre was/is the idea that
 you could just have a single volume where all new data (and metadata)
 is distributed evenly as you go.  Need more storage?  Plug it in,
 either to an existing head or via a new head, then flip a switch and
 there it is.  No need to manage allocation.  Migration may still be
 needed, both within a cluster and between clusters, but that's much
 more manageable when you have a protocol where data locations can be
 all over the place in a completely transparent manner.


 Many distributed file systems do this, at the cost of being not quite
 POSIX-ish.

Well, Lustre does POSIX semantics just fine, including cache coherency
(as opposed to NFS' close-to-open coherency, which is decidedly
non-POSIX).

 In the brave new world of storage vmotion, nosql, and distributed object
 stores,
 it is not clear to me that coding to a POSIX file system is a strong
 requirement.

Well, I don't quite agree.  I'm very suspicious of
eventually-consistent.  I'm not saying that the enormous DBs that eBay
and such run should sport SQL and ACID semantics -- I'm saying that I
think we can do much better than eventually-consistent (and
no-language) while not paying the steep price that ACID requires.  I'm
not alone in this either.

The trick is to find the right compromise.  Close-to-open semantics
works out fine for NFS, but O_APPEND is too wonderful not to have
(ditto O_EXCL, which NFSv2 did not have; v4 has O_EXCL, but not
O_APPEND).

Whoever first delivers the right compromise in distributed DB
semantics stands to make a fortune.

 Perhaps people are so tainted by experiences with v2 and v3 that we can
 explain
 the non-migration to v4 as being due to poor marketing? As a leader of NFS,
 Sun
 had unimpressive marketing.

Sun did not do too much to improve NFS in the 90s, not compared to the
v4 work that only really started paying off recently.  And
then since Sun had lost the client space by then it doesn't mean all
that much to have the best server if the clients aren't able to take
advantage of the server's best features for lack of client
implementation.  Basically, Sun's ZFS, DTrace, SMF, NFSv4, Zones, and
other amazing innovations came a few years too late to make up for the
awful management that Sun was saddled with.  But for all the decidedly
awful things Sun management did (or didn't do), the worst was
terminating Sun PS (yes, worse than all the non-marketing, poor
marketing, poor acquisitions, poor strategy, and all the rest
including truly epic mistakes like icing Solaris on x86 a decade ago).
 One of the worst outcomes of the Sun debacle is that now there's a
bevy of senior execs who think the worst thing Sun did was to open
source Solaris and Java -- which isn't to say that Sun should have
open sourced as much as it did, or that open source is an end in
itself, but that open sourcing these things was a legitimate business
tool with very specific goals in mind in each case, and which had
nothing to do with the sinking of the company.  Or maybe that's one of
the best outcomes, because the good news about it is that those who
learn the right lessons (in that case: that open source is a
legitimate business tool that is sometimes, often even, a great
mind-share building tool) will be in the minority, and thus will have
a huge advantage over their competition.  That's another thing Sun did
not learn until it was too late: mind-share matters enormously to a
software company.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS on Linux vs FreeBSD

2012-04-25 Thread Nico Williams
As I understand it LLNL has very large datasets on ZFS on Linux.  You
could inquire with them, as well as
http://groups.google.com/a/zfsonlinux.org/group/zfs-discuss/topics?pli=1
.  My guess is that it's quite stable for at least some use cases
(most likely: LLNL's!), but that may not be yours.  You could
always... test it, but if you do then please tell us how it went :)

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] cluster vs nfs (was: Re: ZFS on Linux vs FreeBSD)

2012-04-25 Thread Nico Williams
I agree, you need something like AFS, Lustre, or pNFS.  And/or an NFS
proxy to those.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] cluster vs nfs (was: Re: ZFS on Linux vs FreeBSD)

2012-04-25 Thread Nico Williams
On Wed, Apr 25, 2012 at 4:26 PM, Paul Archer p...@paularcher.org wrote:
 2:20pm, Richard Elling wrote:
 Ignoring lame NFS clients, how is that architecture different than what
 you would have
 with any other distributed file system? If all nodes share data to all
 other nodes, then...?

 Simple. With a distributed FS, all nodes mount from a single DFS. With NFS,
 each node would have to mount from each other node. With 16 nodes, that's
 what, 240 mounts? Not to mention your data is in 16 different
 mounts/directory structures, instead of being in a unified filespace.

To be fair NFSv4 now has a distributed namespace scheme so you could
still have a single mount on the client.  That said, some DFSes have
better properties, such as striping of data across sets of servers,
aggressive caching, and various choices of semantics (e.g., Lustre
tries hard to give you POSIX cache coherency semantics).

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] cluster vs nfs (was: Re: ZFS on Linux vs FreeBSD)

2012-04-25 Thread Nico Williams
On Wed, Apr 25, 2012 at 5:22 PM, Richard Elling
richard.ell...@gmail.com wrote:
 Unified namespace doesn't relieve you of 240 cross-mounts (or equivalents).
 FWIW,
 automounters were invented 20+ years ago to handle this in a nearly seamless
 manner.
 Today, we have DFS from Microsoft and NFS referrals that almost eliminate
 the need
 for automounter-like solutions.

I disagree vehemently.  automount is a disaster because you need to
synchronize changes with all those clients.  That's not realistic.
I've built a large automount-based namespace, replete with a
distributed configuration system for setting the environment variables
available to the automounter.  I can tell you this: the automounter
does not scale, and it certainly does not avoid the need for outages
when storage migrates.

With server-side, referral-based namespace construction that problem
goes away, and the whole thing can be transparent w.r.t. migrations.

For my money the key features a DFS must have are:

 - server-driven namespace construction
 - data migration without having to restart clients,
   reconfigure them, or do anything at all to them
 - aggressive caching

 - striping of file data for HPC and media environments

 - semantics that ultimately allow multiple processes
   on disparate clients to cooperate (i.e., byte range
   locking), but I don't think full POSIX semantics are
   needed

   (that said, I think O_EXCL is necessary, and it'd be
   very nice to have O_APPEND, though the latter is
   particularly difficult to implement and painful when
   there's contention if you stripe file data across
   multiple servers)

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] cluster vs nfs

2012-04-25 Thread Nico Williams
On Wed, Apr 25, 2012 at 5:42 PM, Ian Collins i...@ianshome.com wrote:
 Aren't those general considerations when specifying a file server?

There are Lustre clusters with thousands of nodes, hundreds of them
being servers, and high utilization rates.  Whatever specs you might
have for one server head, it will not meet the demand that hundreds of
the same can.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] cluster vs nfs (was: Re: ZFS on Linux vs FreeBSD)

2012-04-25 Thread Nico Williams
On Wed, Apr 25, 2012 at 7:37 PM, Richard Elling
richard.ell...@gmail.com wrote:
 On Apr 25, 2012, at 3:36 PM, Nico Williams wrote:
  I disagree vehemently.  automount is a disaster because you need to
  synchronize changes with all those clients.  That's not realistic.

 Really?  I did it with NIS automount maps and 600+ clients back in 1991.
 Other than the obvious problems with open files, has it gotten worse since
 then?

Nothing's changed.  Automounter + data migration -> rebooting clients
(or close enough to rebooting).  I.e., outage.

 Storage migration is much more difficult with NFSv2, NFSv3, NetWare, etc.

But not with AFS.  And spec-wise not with NFSv4 (though I don't know
if/when all NFSv4 clients will properly support migration, just that
the protocol and some servers do).

 With server-side, referral-based namespace construction that problem
 goes away, and the whole thing can be transparent w.r.t. migrations.

Yes.

 Agree, but we didn't have NFSv4 back in 1991 :-)  Today, of course, this
 is how one would design it if you had to design a new DFS today.

Indeed, that's why I built an automounter solution in 1996 (that's
still in use, I'm told).  Although to be fair AFS already existed back
then, had a global namespace and data migration, and was mature.
 It's taken NFS that long to catch up...

 [...]

 Almost any of the popular nosql databases offer this and more.
 The movement away from POSIX-ish DFS and storing data in
 traditional files is inevitable. Even ZFS is a object store at its core.

I agree.  Except that there are applications where large octet streams
are needed.  HPC, media come to mind.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] cluster vs nfs (was: Re: ZFS on Linux vs FreeBSD)

2012-04-25 Thread Nico Williams
On Wed, Apr 25, 2012 at 8:57 PM, Paul Kraus pk1...@gmail.com wrote:
 On Wed, Apr 25, 2012 at 9:07 PM, Nico Williams n...@cryptonector.com wrote:
 Nothing's changed.  Automounter + data migration -> rebooting clients
 (or close enough to rebooting).  I.e., outage.

    Uhhh, not if you design your automounter architecture correctly
 and (as Richard said) have NFS clients that are not lame to which I'll
 add, automounters that actually work as advertised. I was designing
 automount architectures that permitted dynamic changes with minimal to
 no outages in the late 1990's. I only had a little over 100 clients
 (most of which were also servers) and NIS+ (NIS ver. 3) to distribute
 the indirect automount maps.

Further below you admit that you're talking about read-only data,
effectively.  But the world is not static.  Sure, *code* is by and
large static, and indeed, we segregated data by whether it was
read-only (code, historical data) or not (application data, home
directories).  We were able to migrate *read-only* data with no
outages.  But for the rest?  Yeah, there were always outages.  Of
course, we had a periodic maintenance window, with all systems
rebooting within a short period, and this meant that some data
migration outages were not noticeable, but they were real.

    I also had to _redesign_ a number of automount strategies that
 were built by people who thought that using direct maps for everything
 was a good idea. That _was_ a pain in the a** due to the changes
 needed at the applications to point at a different hierarchy.

We used indirect maps almost exclusively.  Moreover, we used
hierarchical automount entries, and even -autofs mounts.  We also used
environment variables to control various things, such as which servers
to mount what from (this was particularly useful for spreading the
load on read-only static data).  We used practically every feature of
the automounter except for executable maps (and direct maps, when we
eventually stopped using those).

    It all depends on _what_ the application is doing. Something that
 opens and locks a file and never releases the lock or closes the file
 until the application exits will require a restart of the application
 with an automounter / NFS approach.

No kidding!  In the real world such applications exist and get used.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Data loss by memory corruption?

2012-01-18 Thread Nico Williams
On Wed, Jan 18, 2012 at 4:53 AM, Jim Klimov jimkli...@cos.ru wrote:
 2012-01-18 1:20, Stefan Ring wrote:
 I don’t care too much if a single document gets corrupted – there’ll
 always be a good copy in a snapshot. I do care however if a whole
 directory branch or old snapshots were to disappear.

 Well, as far as this problem relies on random memory corruptions,
 you don't get to choose whether your document gets broken or some
 low-level part of metadata tree ;)

Other filesystems tend to be much more tolerant of bit rot of all
types precisely because they have no block checksums.

But I'd rather have ZFS -- *with* redundancy, of course, and with ECC.

It might be useful to have a way to recover from checksum mismatches
by involving a human.  I'm imagining a tool that tests whether
accepting a block's actual contents results in making data available
that the human thinks checks out, and if so, then rewriting that
block.  Some bit errors might simply result in meaningless metadata,
but in some cases this can be corrected (e.g., ridiculous block
addresses).  But if ECC takes care of the problem then why waste the
effort?  (Partial answer: because it'd be a very neat GSoC type
project!)

 Besides, what if that document you don't care about is your account's
 entry in a banking system (as if they had no other redundancy and
 double-checks)? And suddenly you don't exist because of some EIOIO,
 or your balance is zeroed (or worse, highly negative)? ;)

This is why we have paper trails, logs, backups, redundancy at various
levels, ...

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Idea: ZFS and on-disk ECC for blocks

2012-01-11 Thread Nico Williams
On Wed, Jan 11, 2012 at 9:16 AM, Jim Klimov jimkli...@cos.ru wrote:
 I've recently had a sort of an opposite thought: yes,
 ZFS redundancy is good - but also expensive in terms
 of raw disk space. This is especially bad for hardware
 space-constrained systems like laptops and home-NASes,
 where doubling the number of HDDs (for mirrors) or
 adding tens of percent of storage for raidZ is often
 not practical for whatever reason.

Redundancy through RAID-Z and mirroring is expensive for home systems
and laptops, but mostly due to the cost of SATA/SAS ports, not the
cost of the drives.  The drives are cheap, but getting an extra disk
in a laptop is either impossible or expensive.  But that doesn't mean
you can't mirror slices or use ditto blocks.  For laptops just use
ditto blocks and either zfs send backups or an external mirror that
you attach/detach.

 Current ZFS checksums allow us to detect errors, but
 in order for recovery to actually work, there should be
 a redundant copy and/or parity block available and valid.

 Hence the question: why not put ECC info into ZFS blocks?

RAID-Zn *is* an error correction system.  But what you are asking for
is a same-device error correction method that costs less than ditto
blocks, with error correction data baked into the blkptr_t.  Are there
enough free bits left in the block pointer for error correction codes
for large blocks?  (128KB blocks, but eventually ZFS needs to support
even larger blocks, so keep that in mind.)  My guess is: no.  Error
correction data might have to get stored elsewhere.

I don't find this terribly attractive, but maybe I'm just not looking
at it the right way.  Perhaps there is a killer enterprise feature for
ECC here: stretching MTTDL in the face of a device failure in a mirror
or raid-z configuration (but if failures are typically of whole drives
rather than individual blocks, then this wouldn't help).  But without
a good answer for where to store the ECC for the largest blocks, I
don't see this happening.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] S11 vs illumos zfs compatiblity

2012-01-05 Thread Nico Williams
On Thu, Jan 5, 2012 at 8:53 AM, sol a...@yahoo.com wrote:
 if a bug fixed in Illumos is never reported to Oracle by a customer,
 it would likely never get fixed in Solaris either

 :-(

 I would have liked to think that there was some good-will between the ex- and 
 current-members of the zfs team, in the sense that the people who created zfs 
 but then left Oracle still care about it enough to want the Oracle version to 
 be as bug-free as possible.

My intention was to encourage users to report bugs to both, Oracle and
Illumos.  It's possible that Oracle engineers pay attention to the
Illumos bug database, but I expect that for legal reasons they will not
look at Illumos code that has any new copyright notices relative to
Oracle code.  The simplest way for Oracle engineers to avoid all
possible legal problems is to simply ignore at least the Illumos
source repositories, possibly more.  I'm speculating, sure; I might be
wrong.

As for good will, I'm certain that there is, at least at the engineer
level, and probably beyond.  But that doesn't mean that there will be
bug parity, much less feature parity.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

2011-12-29 Thread Nico Williams
On Thu, Dec 29, 2011 at 9:53 AM, Brad Diggs brad.di...@oracle.com wrote:
 Jim,

 You are spot on.  I was hoping that the writes would be close enough to 
 identical that
 there would be a high ratio of duplicate data since I use the same record 
 size, page size,
 compression algorithm, … etc.  However, that was not the case.  The main 
 thing that I
 wanted to prove though was that if the data was the same the L1 ARC only 
 caches the
 data that was actually written to storage.  That is a really cool thing!  I 
 am sure there will
 be future study on this topic as it applies to other scenarios.

 With regards to directory engineering investing any energy into optimizing 
 ODSEE DS
 to more effectively leverage this caching potential, that won't happen.  OUD 
 far out
 performs ODSEE.  That said OUD may get some focus in this area.  However, 
 time will
 tell on that one.

Databases are not as likely to benefit from dedup as virtual machines,
indeed, DBs are likely to not benefit at all from dedup.  The VM use
case benefits from dedup for the obvious reason that many VMs will
have the same exact software installed most of the time, using the
same filesystems, and the same patch/update installation order, so if
you keep data out of their root filesystems then you can expect
enormous deduplicatiousness.  But databases, not so much.  The unit of
deduplicable data in a VM use case is the guest's preferred block
size, while in a DB the unit of deduplicable data might be a
variable-sized table row, or even smaller: a single row/column value
-- and you have no way to ensure alignment of individual deduplicable
units nor ordering of sets of deduplicable units into larger ones.

When it comes to databases your best bets will be: a) database-level
compression or dedup features (e.g., Oracle's column-level compression
feature) or b) ZFS compression.

(Dedup makes VM management easier, because the alternative is to patch
one master guest VM [per-guest type] then re-clone and re-configure
all instances of that guest type, in the process possibly losing any
customizations in those guests.  But even before dedup, the ability to
snapshot and clone datasets was an impressive dedup-like tool for the
VM use-case, just not as convenient as dedup.)

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] S11 vs illumos zfs compatiblity

2011-12-29 Thread Nico Williams
On Thu, Dec 29, 2011 at 2:06 PM, sol a...@yahoo.com wrote:
 Richard Elling wrote:
 many of the former Sun ZFS team
 regularly contribute to ZFS through the illumos developer community.

 Does this mean that if they provide a bug fix via illumos then the fix won't
 make it into the Oracle code?

If you're an Oracle customer you should report any ZFS bugs you find
to Oracle if you want fixes in Solaris.  You may want to (and I
encourage you to) report such bugs to Illumos if at all possible
(i.e., unless your agreement with Oracle or your employer's policies
somehow prevent you from doing so).

The following is complete speculation.  Take it with salt.

With reference to your question, it may mean that Oracle's ZFS team
would have to come up with their own fixes to the same bugs.  Oracle's
legal department would almost certainly have to clear the copying of
any non-trivial/obvious fix from Illumos into Oracle's ON tree.  And
if taking a fix from Illumos were to require opening the affected
files (because they are under CDDL in Illumos) then executive
management approval would also be required.  But the most likely case
is that the issue simply wouldn't come up in the first place because
Oracle's ZFS team would almost certainly ignore the Illumos repository
(perhaps not the Illumos bug tracker, but probably that too) as that's
simply the easiest way for them to avoid legal messes.  Think about
it.  Besides, I suspect that from Oracle's point of view what matters
are bug reports by Oracle customers to Oracle, so if a bug fixed in
Illumos is never reported to Oracle by a customer, it would likely
never get fixed in Solaris either except by accident, as a result of
another change.

Also, the Oracle ZFS team is not exactly devoid of clue, even with the
departures from it to date.  I suspect they will be able to fix bugs
in Oracle's ZFS completely independently of the open ZFS
community, even if it means duplicating effort.

That said, Illumos is a fork of OpenSolaris, and as such it and
Solaris will necessarily diverge as at least one of the two (and
probably both, for a while) gets plenty of bug fixes and enhancements.
 This is a good thing, not a bad thing, at least for now.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

2011-12-29 Thread Nico Williams
On Thu, Dec 29, 2011 at 6:44 PM, Matthew Ahrens mahr...@delphix.com wrote:
 On Mon, Dec 12, 2011 at 11:04 PM, Erik Trimble tr...@netdemons.com wrote:
 (1) when constructing the stream, every time a block is read from a fileset
 (or volume), its checksum is sent to the receiving machine. The receiving
 machine then looks up that checksum in its DDT, and sends back a needed or
 not-needed reply to the sender. While this lookup is being done, the
 sender must hold the original block in RAM, and cannot write it out to the
 to-be-sent-stream.
 ...
 you produce a huge amount of small network packet
 traffic, which trashes network throughput

 This seems like a valid approach to me.  When constructing the stream,
 the sender need not read the actual data, just the checksum in the
 indirect block.  So there is nothing that the sender must hold in
 RAM.  There is no need to create small (or synchronous) network
 packets, because sender need not wait for the receiver to determine if
 it needs the block or not.  There can be multiple asynchronous
 communication streams:  one where the sender sends all the checksums
 to the receiver; another where the receiver requests blocks that it
 does not have from the sender; and another where the sender sends
 requested blocks back to the receiver.  Implementing this may not be
 trivial, and in some cases it will not improve on the current
 implementation.  But in others it would be a considerable improvement.

Right, you'd want to let the socket/transport buffer and flow-control
the writes of "I have this new block checksum" messages from the zfs
sender and "I need the block with this checksum" messages from the zfs
receiver.

I like this.
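
Something like these three message types would suffice on the wire
(purely hypothetical; nothing like this exists in today's zfs send
stream format):

    #include <stdint.h>

    /* Hypothetical wire messages for a dedup-aware send protocol. */
    typedef enum {
        MSG_HAVE_CKSUM,     /* sender -> receiver: stream has a block
                               with this checksum */
        MSG_NEED_BLOCK,     /* receiver -> sender: I don't have it, send it */
        MSG_BLOCK_DATA      /* sender -> receiver: checksum + block contents */
    } msg_type_t;

    typedef struct {
        uint32_t type;      /* one of msg_type_t */
        uint32_t len;       /* payload bytes following the header
                               (0 except for MSG_BLOCK_DATA) */
        uint8_t  cksum[32]; /* the block checksum, e.g. SHA-256 */
    } msg_hdr_t;

Each direction is just a sequence of these, so no per-message round trip
is required; the receiver's MSG_NEED_BLOCK requests flow back
asynchronously.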

A separate channel for bulk data definitely comes recommended for flow
control reasons, but if you do that then securing the transport gets
complicated: you couldn't just zfs send .. | ssh ... zfs receive.  You
could use SSH channel multiplexing, but that will net you lousy
performance (well, no lousier than one already gets with SSH
anyways)[*].  (And SunSSH lacks this feature anyways)  It'd then begin
to pay to have have a bonafide zfs send network protocol, and now
we're talking about significantly more work.  Another option would be
to have send/receive options to create the two separate channels, so
one would do something like:

% zfs send --dedup-control-channel ... | ssh-or-netcat-or... zfs
receive --dedup-control-channel ... &
% zfs send --dedup-bulk-channel ... | ssh-or-netcat-or... zfs receive
--dedup-bulk-channel ... &
% wait

The second zfs receive would rendezvous with the first and go from there.

[*] The problem with SSHv2 is that it has flow-controlled channels
layered over a flow- and congestion-controlled transport (TCP), and there's
not enough information flowing from TCP to SSHv2 to make this work
well, but also, the SSHv2 channels cannot have their window shrink
except by the sender consuming it, which makes it impossible to mix
high-bandwidth bulk and small control data over a congested link.
This means that in practice SSHv2 channels have to have relatively
small windows, which then forces the protocol to work very
synchronously (i.e., with effectively synchronous ACKs of bulk data).
I now believe the idea of mixing bulk and non-bulk data over a single
TCP connection in SSHv2 is a failure.  SSHv2 over SCTP, or over
multiple TCP connections, would be much better.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] S11 vs illumos zfs compatiblity

2011-12-27 Thread Nico Williams
On Tue, Dec 27, 2011 at 2:20 PM, Frank Cusack fr...@linetwo.net wrote:
 http://sparcv9.blogspot.com/2011/12/solaris-11-illumos-and-source.html

 If I upgrade ZFS to use the new features in Solaris 11 I will be unable
 to import my pool using the free ZFS implementation that is available in
 illumos based distributions


 Is that accurate?  I understand if the S11 version is ahead of illumos, of
 course I can't use the same pools in both places, but that is the same
 problem as using an S11 pool on S10.  The author is implying a much worse
 situation, that there are zfs tracks in addition to versions and that S11
 is now on a different track and an S11 pool will not be usable elsewhere,
 ever.  I hope it's just a misrepresentation.

Hard to say.  Suppose Oracle releases no details on any additions to
the on-disk ZFS format since build 147...  then either the rest of the
ZFS developer community forks for good, or they have to reverse
engineer Oracle's additions.  Even if Oracle does release details on
their additions, what if the external ZFS developer community
disagrees vehemently with any of those?  And what if the open source
community adds extensions that Oracle never adopts?  A fork is not yet
a reality, but IMO it sure looks likely.

Of course, you can still manage to have pools that will work on all
implementations -- until the day that implementations start removing
older formats anyways, which not only could happen, but I think will
happen, though probably not until S10 is EOLed, and in any case
probably not for a few years yet, likely not even within the next half
decade.  It's hard to predict such things though, so take the above
with some (or lots!) of salt.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] S11 vs illumos zfs compatiblity

2011-12-27 Thread Nico Williams
On Tue, Dec 27, 2011 at 8:44 PM, Frank Cusack fr...@linetwo.net wrote:
 So with a de facto fork (illumos) now in place, is it possible that two
 zpools will report the same version yet be incompatible across
 implementations?

Not likely: the Illumos community has developed a method (feature
flags) for managing ZFS extensions in a way other than a linear
chronology of version numbers.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup

2011-12-13 Thread Nico Williams
On Dec 11, 2011 5:12 AM, Nathan Kroenert nat...@tuneunix.com wrote:

  On 12/11/11 01:05 AM, Pawel Jakub Dawidek wrote:

 On Wed, Dec 07, 2011 at 10:48:43PM +0200, Mertol Ozyoney wrote:

 Unfortunetly the answer is no. Neither l1 nor l2 cache is dedup aware.

 The only vendor i know that can do this is Netapp

 And you really work at Oracle?:)

 The answer is definiately yes. ARC caches on-disk blocks and dedup just
 reference those blocks. When you read dedup code is not involved at all.
 Let me show it to you with simple test:

 Create a file (dedup is on):

# dd if=/dev/random of=/foo/a bs=1m count=1024

 Copy this file so that it is deduped:

# dd if=/foo/a of=/foo/b bs=1m

 Export the pool so all cache is removed and reimport it:

# zpool export foo
# zpool import foo

 Now let's read one file:

# dd if=/foo/a of=/dev/null bs=1m
1073741824 bytes transferred in 10.855750 secs (98909962
bytes/sec)

 We read file 'a' and all its blocks are in cache now. The 'b' file
 shares all the same blocks, so if ARC caches blocks only once, reading
 'b' should be much faster:

# dd if=/foo/b of=/dev/null bs=1m
1073741824 bytes transferred in 0.870501 secs (1233475634
bytes/sec)

 Now look at it, 'b' was read 12.5 times faster than 'a' with no disk
 activity. Magic?:)


 Hey all,

 That reminds me of something I have been wondering about... Why only 12x
faster? If we are effectively reading from memory - as compared to a disk
reading at approximately 100MB/s (which is about an average PC HDD reading
sequentially), I'd have thought it should be a lot faster than 12x.

 Can we really only pull stuff from cache at only a little over one
gigabyte per second if it's dedup data?

The second file may have the same data, but not the same metadata (the
inode number at least must be different), so the znode for it must get read
in, and that will slow reading the copy down a bit.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] bug moving files between two zfs filesystems (too many open files)

2011-11-29 Thread Nico Williams
On Tue, Nov 29, 2011 at 12:17 PM, Cindy Swearingen
cindy.swearin...@oracle.com wrote:
 I think the too many open files is a generic error message about running
 out of file descriptors. You should check your shell ulimit
 information.

Also, see how many open files you have: echo /proc/self/fd/*

It'd be quite weird though to have a very low fd limit or a very large
number of file descriptors open in the shell.

That said, as Casper says, utilities like mv(1) should be able to cope
with reasonably small fd limits (i.e., not as small as 3, but perhaps
as small as 10).

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] grrr, How to get rid of mis-touched file named `-c'

2011-11-28 Thread Nico Williams
On Mon, Nov 28, 2011 at 11:28 AM, Smith, David W. smith...@llnl.gov wrote:
 You could list by inode, then use find with rm.

 # ls -i
 7223 -O

 # find . -inum 7223 -exec rm {} \;

This is the one solution I'd recommend against, since it would remove
hardlinks that you might care about.

Also, this thread is getting long, repetitive, tiring.  Please stop.
This is a standard-issue Unix beginner question, just like "my test
program does nothing".

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] virtualbox rawdisk discrepancy

2011-11-21 Thread Nico Williams
Moving boot disks from one machine to another used to work as long as
the machines were of the same architecture.  I don't recall if it was
*supported* (and wouldn't want to pretend to speak for Oracle now),
but it was meant to work (unless you minimized the install and removed
drivers not needed on the first system that are needed on the other
system).  You did have to do a reconfigure boot though!

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] about btrfs and zfs

2011-11-14 Thread Nico Williams
On Mon, Nov 14, 2011 at 8:33 AM, Edward Ned Harvey
opensolarisisdeadlongliveopensola...@nedharvey.com wrote:
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Paul Kraus

 Is it really B-Tree based? Apple's HFS+ is B-Tree based and falls
 apart (in terms of performance) when you get too many objects in one
 FS, which is specifically what drove us to ZFS. We had 4.5 TB of data

 According to wikipedia, btrfs is a b-tree.
 I know in ZFS, the DDT is an AVL tree, but what about the rest of the
 filesystem?

ZFS directories are hashed.  Aside from this, the filesystem (and
volume) have a tree structure, but that's not what's interesting here
-- what's interesting is how directories are indexed.

 B-trees should be logarithmic time, which is the best O() you can possibly
 achieve.  So if HFS+ is dog slow, it's an implementation detail and not a
 general fault of b-trees.

Hash tables can do much better than O(log N) for searching: O(1) for
best case, and O(n) for the worst case.

Also, b-trees are O(log_b N), where b is the number of entries
per-node.  6e7 entries/directory probably works out to 2-5 reads
(assuming 0% cache hit rate) depending on the size of each directory
entry and the size of the b-tree blocks.
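
To put rough numbers on that (a toy calculation with assumed fanouts,
not actual HFS+ or btrfs geometry; link with -lm):

    #include <math.h>
    #include <stdio.h>

    /* Worst-case node reads for a b-tree of N entries: ceil(log_b(N)). */
    int
    main(void)
    {
        const double N = 6e7;                       /* 60M directory entries */
        const int fanouts[] = { 128, 512, 2048, 8192 };

        for (int i = 0; i < 4; i++)
            printf("fanout %4d -> %2.0f node reads\n",
                fanouts[i], ceil(log(N) / log(fanouts[i])));
        return 0;
    }

With plausible fanouts that works out to 2-4 node reads, squarely in the
range above.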

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] aclmode=mask

2011-11-14 Thread Nico Williams
I see, with great pleasure, that ZFS in Solaris 11 has a new
aclmode=mask property.

http://download.oracle.com/docs/cd/E23824_01/html/821-1448/gbscy.html#gkkkp
http://download.oracle.com/docs/cd/E23824_01/html/821-1448/gbchf.html#gljyz
http://download.oracle.com/docs/cd/E23824_01/html/821-1462/zfs-1m.html#scrolltoc
(search for aclmode)

May this be the last word in ACL/chmod interactions (knocks on wood,
crosses fingers, ...).

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] aclmode=mask

2011-11-14 Thread Nico Williams
On Mon, Nov 14, 2011 at 6:20 PM, Nico Williams n...@cryptonector.com wrote:
 I see, with great pleasure, that ZFS in Solaris 11 has a new
 aclmode=mask property.

Also, congratulations on shipping.  And thank you for implementing aclmode=mask.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] about btrfs and zfs

2011-11-11 Thread Nico Williams
On Fri, Nov 11, 2011 at 4:27 PM, Paul Kraus p...@kraus-haus.org wrote:
 The command syntax paradigm of zfs (command sub-command object
 parameters) is not unique to zfs, but seems to have been the way of
 doing things in Solaris 10. The _new_ functions of Solaris 10 were
 all this way (to the best of my knowledge)...

 zonecfg
 zoneadm
 svcadm
 svccfg
 ... and many others are written this way. To boot the zone named foo
 you use the command "zoneadm -z foo boot"; to disable the service
 named sendmail, "svcadm disable sendmail", etc. Someone at Sun was
 thinking :-)

I'd have preferred "zoneadm boot foo".  The "-z zone command" thing is a
bit of a sore point, IMO.

But yes, all these new *adm(1M) and *cfg(1M) commands in S10 are
wonderful, especially when compared to past and present alternatives
in the Unix/Linux world.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Wanted: sanity check for a clustered ZFS idea

2011-11-08 Thread Nico Williams
To some people "active-active" means all cluster members serve the
same filesystems.

To others "active-active" means all cluster members serve some
filesystems and can serve all filesystems ultimately by taking over
failed cluster members.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] about btrfs and zfs

2011-10-19 Thread Nico Williams
On Wed, Oct 19, 2011 at 7:24 AM, Garrett D'Amore
garrett.dam...@nexenta.com wrote:
 I'd argue that from a *developer* point of view, an fsck tool for ZFS might 
 well be useful.  Isn't that what zdb is for? :-)

 But ordinary administrative users should never need something like this, 
 unless they have encountered a bug in ZFS itself.  (And bugs are as likely to 
 exist in the checker tool as in the filesystem. ;-)

zdb can be useful for admins -- say, to gather stats not reported by
the system, to explore the fs/vol layout, for educational purposes,
and so on.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] about btrfs and zfs

2011-10-18 Thread Nico Williams
On Tue, Oct 18, 2011 at 9:35 AM, Brian Wilson bfwil...@doit.wisc.edu wrote:
 I just wanted to add something on fsck on ZFS - because for me that used to
 make ZFS 'not ready for prime-time' in 24x7 5+ 9s uptime environments.
 Where ZFS doesn't have an fsck command - and that really used to bug me - it
 does now have a -F option on zpool import.  To me it's the same
 functionality for my environment - the ability to try to roll back to a
 'hopefully' good state and get the filesystem mounted up, leaving the
 corrupted data objects corrupted.  [...]

Yes, that's exactly what it is.  There's no point calling it fsck
because fsck fixes individual filesystems, while ZFS fixups need to
happen at the volume level (at volume import time).

It's true that this should have been in ZFS from the word go.  But
it's there now, and that's what matters, IMO.
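
For reference, the recovery import looks like this -- pool name made up:

    zpool import -F tank     # discard the last few txgs if that's what it
                             # takes to get a consistent pool back
    zpool import -nF tank    # just report whether -F would succeed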

It's also true that this was never necessary with hardware that
doesn't lie, but it's good to have it anyways, and is critical for
personal systems such as laptops.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Wanted: sanity check for a clustered ZFS idea

2011-10-14 Thread Nico Williams
On Thu, Oct 13, 2011 at 9:13 PM, Jim Klimov jimkli...@cos.ru wrote:
 Thanks to Nico for concerns about POSIX locking. However,
 hopefully, in the usecase I described - serving images of
 VMs in a manner where storage, access and migration are
 efficient - whole datasets (be it volumes or FS datasets)
 can be dedicated to one VM host server at a time, just like
 whole pools are dedicated to one host nowadays. In this
 case POSIX compliance can be disregarded - access
 is locked by one host, not available to others, period.
 Of course, there is a problem of capturing storage from
 hosts which died, and avoiding corruptions - but this is
 hopefully solved in the past decades of clustering tech's.

It sounds to me like you need horizontal scaling more than anything
else.  In that case, why not use pNFS or Lustre?  Even if you want
snapshots, a VM should be able to handle that on its own, and though
probably not as nicely as ZFS in some respects, having the application
be in control of the exact snapshot boundaries does mean that you
don't have to quiesce your VMs just to snapshot safely.

 Nico also confirmed that one node has to be a master of
 all TXGs - which is conveyed in both ideas of my original
 post.

Well, at any one time one node would have to be the master of the next
TXG, but it doesn't mean that you couldn't have some cooperation.
There are lots of other much more interesting questions.  I think the
biggest problem lies in requiring full connectivity from every server
to every LUN.  I'd much rather take the Lustre / pNFS model (which,
incidentally, doesn't preclude having snapshots).

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Wanted: sanity check for a clustered ZFS idea

2011-10-14 Thread Nico Williams
Also, it's not worth doing a clustered ZFS thing that is too
application-specific.  You really want to nail down your choices of
semantics, explore what design options those yield (or approach from
the other direction, or both), and so on.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Wanted: sanity check for a clustered ZFS idea

2011-10-11 Thread Nico Williams
On Tue, Oct 11, 2011 at 11:15 PM, Richard Elling
richard.ell...@gmail.com wrote:
 On Oct 9, 2011, at 10:28 AM, Jim Klimov wrote:
 ZFS developers have for a long time stated that ZFS is not intended,
 at least not in near term, for clustered environments (that is, having
 a pool safely imported by several nodes simultaneously). However,
 many people on forums have wished having ZFS features in clusters.

 ...and UFS before ZFS… I'd wager that every file system has this RFE in its
 wish list :-)

Except the ones that already have it!  :)

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Wanted: sanity check for a clustered ZFS idea

2011-10-11 Thread Nico Williams
On Sun, Oct 9, 2011 at 12:28 PM, Jim Klimov jimkli...@cos.ru wrote:
 So, one version of the solution would be to have a single host
 which imports the pool in read-write mode (i.e. the first one
 which boots), and other hosts would write thru it (like iSCSI
 or whatever; maybe using SAS or FC to connect between
 reader and writer hosts). However they would read directly
 from the ZFS pool using the full SAN bandwidth.

You need to do more than simply assign a node for writes.  You need to
send write and lock requests to one node.  And then you need to figure
out what to do about POSIX write visibility rules (i.e., when a write
should be visible to other readers).  I think you'd basically end up
not meeting POSIX in this regard, just like NFS, though perhaps not
with close-to-open semantics.

I don't think ZFS is the beast you're looking for.  You want something
more like Lustre, GPFS, and so on.  I suppose someone might surprise
us one day with properly clustered ZFS, but I think it'd be more
likely that the filesystem would be ZFS-like, not ZFS proper.

 Second version of the solution is more or less the same, except
 that all nodes can write to the pool hardware directly using some
 dedicated block ranges owned by one node at a time. This
 would work like much a ZIL containing both data and metadata.
 Perhaps these ranges would be whole metaslabs or some other
 ranges as agreed between the master node and other nodes.

This is much hairier.  You need consistency.  If two processes on
different nodes are writing to the same file, then you need to
*internally* lock around all those writes so that the on-disk
structure ends up being sane.  There's a number of things you could do
here, such as, for example, having a per-node log and one node
coalescing them (possibly one node per-file, but even then one node
has to be the master of every txg).

And still you need to be careful about POSIX semantics.  That does not
come for free in any design -- you will need something like the Lustre
DLM (distributed lock manager).  Or else you'll have to give up on
POSIX.

There's a hefty price to be paid for POSIX semantics in a clustered
environment.  You'd do well to read up on Lustre's experience in
detail.  And not just Lustre -- that would be just to start.  I
caution you that this is not a simple project.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs diff performance disappointing

2011-09-26 Thread Nico Williams
On Mon, Sep 26, 2011 at 1:55 PM, Jesus Cea j...@jcea.es wrote:
 I just upgraded to Solaris 10 Update 10, and one of the improvements
 is zfs diff.

 Using the birthtime of the sectors, I would expect very high
 performance. The actual performance doesn't seem better than a
 standard rdiff, though. Quite disappointing...

 Should I disable atime to improve zfs diff performance? (most data
 doesn't change, but atime of most files would change).

atime has nothing to do with it.

How much work zfs diff has to do depends on how much has changed
between snapshots.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs diff performance disappointing

2011-09-26 Thread Nico Williams
Ah yes, of course.  I'd misread your original post.  Yes, disabling
atime updates will reduce the number of superfluous transactions.
It's *all* transactions that count, not just the ones the app
explicitly caused, and atime implies lots of transactions.
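
That is, something like this on the datasets being diffed -- names made up:

    zfs set atime=off tank/fs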

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs scripts

2011-09-09 Thread Nico Williams
On Fri, Sep 9, 2011 at 5:33 AM, Sriram Narayanan sri...@belenix.org wrote:
 Plus, you'll need an & character at the end of each command.

And a wait command, if you want the script to wait for the sends to
finish (which you should).
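
i.e., roughly -- dataset, snapshot and host names made up:

    #!/bin/sh
    zfs send tank/a@today | ssh backuphost zfs recv -d backuppool &
    zfs send tank/b@today | ssh backuphost zfs recv -d backuppool &
    wait    # don't exit, or destroy the snapshots, until both sends finish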

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SSD vs hybrid drive - any advice?

2011-07-28 Thread Nico Williams
On Wed, Jul 27, 2011 at 9:22 PM, Daniel Carosone d...@geek.com.au wrote:
 Absent TRIM support, there's another way to do this, too.  It's pretty
 easy to dd /dev/zero to a file now and then.  Just make sure zfs
 doesn't prevent these being written to the SSD (compress and dedup are
 off).  I have a separate fill dataset for this purpose, to avoid
 keeping these zeros in auto-snapshots too.

Nice.
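
Roughly, I imagine, something like this -- pool/dataset names made up,
untested, and the auto-snapshot property only matters if you run the
snapshot service:

    zfs create -o compression=off -o dedup=off \
        -o com.sun:auto-snapshot=false tank/fill
    dd if=/dev/zero of=/tank/fill/zeros bs=1024k   # run until enough free
                                                   # space has been zeroed
    rm /tank/fill/zeros                            # then give it back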

Seems to me that it'd be nicer to have an interface to raw flash (no
wear leveling, direct access to erasure, read, write,
read-modify-write [as an optimization]).  Then the filesystem could do
a much better job of using flash efficiently.  But a raw interface
wouldn't be a disk-compatible interface.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Summary: Dedup memory and performance (again, again)

2011-07-24 Thread Nico Williams
On Jul 9, 2011 1:56 PM, Edward Ned Harvey 
opensolarisisdeadlongliveopensola...@nedharvey.com wrote:

 Given the abysmal performance, I have to assume there is a significant
 number of overhead reads or writes in order to maintain the DDT for each
 actual block write operation.  Something I didn't mention in the other
 email is that I also tracked iostat throughout the whole operation.  It's
 all writes (or at least 99.9% writes.)  So I am forced to conclude it's a
 bunch of small DDT maintenance writes taking place and incurring access
time
 penalties in addition to each intended single block access time penalty.

 The nature of the DDT is that it's a bunch of small blocks, that tend to
be
 scattered randomly, and require maintenance in order to do anything else.
 This sounds like precisely the usage pattern that benefits from low
latency
 devices such as SSD's.

The DDT should be written to in COW fashion, and asynchronously, so there
should be no access time penalty.  Or so ISTM it should be.

Dedup is necessarily slower for writing because of the deduplication table
lookups.  Those are synchronous lookups, but for async writes you'd think
that total write throughput would only be affected by a) the additional read
load (which is zero in your case) and b) any inability to put together large
transactions due to the high latency of each logical write, but (b)
shouldn't happen, particularly if the DDT fits in RAM or L2ARC, as it does
in your case.
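
You can sanity-check the "DDT fits" assumption from the entry counts and
per-entry in-core sizes in the dedup stats -- pool name made up:

    zdb -DD tank
    zpool status -D tank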

So, at first glance my guess is ZFS is leaving dedup write performance on
the table most likely due to implementation reasons, not design reasons.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Encryption accelerator card recommendations.

2011-06-27 Thread Nico Williams
IMO a faster processor with built-in AES and other crypto support is
most likely to give you the most bang for your buck, particularly if
you're using closed Solaris 11, as Solaris engineering is likely to
add support for new crypto instructions faster than Illumos (but I
don't really know enough about Illumos to say for sure).

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Encryption accelerator card recommendations.

2011-06-27 Thread Nico Williams
On Jun 27, 2011 9:24 PM, David Magda dma...@ee.ryerson.ca wrote:
 AESNI is certainly better than nothing, but RSA, SHA, and the RNG would be
nice as well. It'd also be handy for ZFS crypto in addition to all the
network IO stuff.

The most important reason for AES-NI might be not performance but defeating
side-channel attacks.

Also, really fast AES HW makes AES-based hash functions quite tempting.

No, AES-NI is nothing to sneeze at.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Encryption accelerator card recommendations.

2011-06-27 Thread Nico Williams
On Jun 27, 2011 4:15 PM, David Magda dma...@ee.ryerson.ca wrote:
 The (Ultra)SPARC T-series processors do, but to a certain extent it goes
  against a CPU manufacturer's best (financial) interest to provide this:
 crypto is very CPU intensive using 'regular' instructions, so if you need
 to do a lot of it, it would force you to purchase a manufacturer's
 top-of-the-line CPUs, and to have as many sockets as you can to handle a
 load (and presumably you need to do useful work besides just
 en/decrypting traffic).

I hope no CPU vendor thinks about the economics of chip making that way.  I
actually doubt any do.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Versioning FS was: question about COW and snapshots

2011-06-16 Thread Nico Williams
As Casper pointed out, the right thing to do is to build applications
such that they can detect mid-transaction state and roll it back (or
forward, if there's enough data).  Then mid-transaction snapshots are
fine, and the lack of APIs by which to inform the filesystem of
application transaction boundaries becomes much less of an issue
(adding such APIs is not a good solution, since it'd take many years
for apps to take advantage of them and more years still for legacy
apps to be upgraded or decommissioned).

The existing FS interfaces provide enough that one can build
applications this way.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] question about COW and snapshots

2011-06-16 Thread Nico Williams
On Thu, Jun 16, 2011 at 8:51 AM,  casper@oracle.com wrote:
 If a database engine or another application keeps both the data and the
 log in the same filesystem, a snapshot wouldn't create inconsistent data
 (I think this would be true with vim and a large number of database
 engines; vim will detect the swap file and the database should be able to
 detect the inconsistency and rollback and re-apply the log file.)

Correct.  SQLite3 will be able to recover automatically from restores
of mid-transaction snapshots.

VIM does not recover automatically, but it does notice the swap file
and warns the user and gives them a way to handle the problem.

(When you save a file, VIM renames the old one out of the way, creates
a new file with the original name, writes the new contents to it,
closes it, then unlinks the swap file.  On recovery VIM notices the
swap file and gives the user a menu of choices.)

I believe this is the best solution: write applications so they can
recover from being restarted with data restored from a mid-transaction
snapshot.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] question about COW and snapshots

2011-06-16 Thread Nico Williams
That said, losing committed transactions when you needed and thought
you had ACID semantics... is bad.  But that's implied in any
restore-from-backups situation.  So you replicate/distribute
transactions so that restore from backups (or snapshots) is absolutely a
last-resort matter, and if you ever have to restore from
backups you also spend time manually tracking down (from
counterparties, paper trails kept elsewhere, ...) any missing
transactions.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Hard link space savings

2011-06-13 Thread Nico Williams
On Mon, Jun 13, 2011 at 5:50 AM, Roy Sigurd Karlsbakk r...@karlsbakk.net 
wrote:
 If anyone has any ideas be it ZFS based or any useful scripts that
 could help here, I am all ears.

 Something like this one-liner will show what would be allocated by everything 
 if hardlinks weren't used:

 # size=0; for i in `find . -type f -exec du {} \; | awk '{ print $1 }'`; do 
 size=$(( $size + $i )); done; echo $size

Oh, you don't want to do that: you'll run into max argument list size issues.

Try this instead:

(echo 0; find . -type f \! -links 1 | xargs stat -c ' %b %B *+' $p; echo p) | dc

;)

xargs is your friend (and so is dc... RPN FTW!).  Note that I'm not
printing the number of links because find will print a name for every
link (well, if you do the find from the root of the relevant
filesystem), so we'd be counting too much space.

You'll need the GNU stat(1).  Or you could do something like this
using the ksh stat builtin:

(
echo 0
find . -type f \! -links 1 | while read p; do
stat -c ' %b %B *+' "$p"
done
echo p
) | dc

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Hard link space savings

2011-06-13 Thread Nico Williams
On Mon, Jun 13, 2011 at 12:59 PM, Nico Williams n...@cryptonector.com wrote:
 Try this instead:

 (echo 0; find . -type f \! -links 1 | xargs stat -c ' %b %B *+' $p; echo p) |
 dc

s/\$p//
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Hard link space savings

2011-06-13 Thread Nico Williams
And, without a sub-shell:

find . -type f \! -links 1 | xargs stat -c ' %b %B *+p' /dev/null | dc
2>/dev/null | tail -1

(The stderr redirection is because otherwise dc whines once that the
stack is empty, and the tail is because we print interim totals as we
go.)

Also, this doesn't quite work, since it counts every link, when we want
to count all but one link.  This, then, is what will tell you how
much space you saved due to hardlinks:

find . -type f \! -links 1 | xargs stat -c ' 8k %b %B * %h 1 - * %h
/+p' /dev/null 2>/dev/null | dc

Excuse my earlier brainfarts :)

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Hard link space savings

2011-06-12 Thread Nico Williams
On Sun, Jun 12, 2011 at 4:14 PM, Scott Lawson
scott.law...@manukau.ac.nz wrote:
 I have an interesting question that may or may not be answerable from some
 internal
 ZFS semantics.

This is really standard Unix filesystem semantics.

 [...]

 So total storage used is around ~7.5MB due to the hard linking taking place
 on each store.

 If hard linking capability had been turned off, this same message would have
 used 1500 x 2MB =3GB
 worth of storage.

 My question is there any simple ways of determining the space savings on
 each of the stores from the usage of hard links?  [...]

But... you just did!  :)  It's: number of hard links * (file size +
sum(size of link names and/or directory slot size)).  For sufficiently
large files (say, larger than one disk block) you could approximate
that as: number of hard links * file size.  The key is the number of
hard links, which will typically vary, but for e-mails that go to all
users, well, you know the number of links then is the number of users.

You could write a script to do this -- just look at the size and
hard-link count of every file in the store, apply the above formula,
add up the inflated sizes, and you're done.
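
A readable sketch of such a script -- assuming GNU find (gfind on Solaris)
and using apparent sizes rather than allocated blocks:

    find /path/to/store -type f -links +1 -printf '%i %n %s\n' |
        sort -u |
        awk '{ inflated += $2 * $3; actual += $3 }
             END { print "without hardlinks:", inflated,
                         "with:", actual, "saved:", inflated - actual }'

The sort -u collapses the one-line-per-link output down to one line per
inode, so each file is only counted once.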

Nico

PS: Is it really the case that Exchange still doesn't deduplicate
e-mails?  Really?  It's much simpler to implement dedup in a mail
store than in a filesystem...
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS, Oracle and Nexenta

2011-05-26 Thread Nico Williams
On May 25, 2011 7:15 AM, Garrett D'Amore garr...@nexenta.com wrote:

 You are welcome to your beliefs.   There are many groups that do standards
that do not meet in public.  [...]

True.

 [...] In fact, I can't think of any standards bodies that *do* hold open
meetings.

I can: the IETF, for example.  All business of the IETF is transacted or
confirmed on open participation mailing lists, and IETF meetings are known
as NOTE WELL meetings because of the notice given at their opening regarding
the fact that meeting is public and resulting considerations regarding,
e.g., trade secrets.

Mind you, there are many more standards setting organizations that don't
have open participation, such as OASIS, ISO, and so on.  I don't begrudge
you starting closed, or even staying closed, though I would prefer that at
least the output of any ZFS standards org be open.  Also, I would recommend
that you eventually consider creating a new open participation list for
non-members (separate from any members-only list).

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [cryptography] rolling hashes, EDC/ECC vs MAC/MIC, etc.

2011-05-22 Thread Nico Williams
On Sun, May 22, 2011 at 10:20 AM, Richard Elling
richard.ell...@gmail.com wrote:
 ZFS already tracks the blocks that have been written, and the time that
 they were written. So we already know when something was written, though
 that does not answer the question of whether the data was changed. I think
 it is a pretty good bet that newly written data is different :-)

Not really.  There's bp rewrite (assuming that ever ships, or gets
implemented elsewhere), for example.

 Then, the filesystem should make this Merkle Tree available to
 applications through a simple query.

 Something like zfs diff ?

That works within a filesystem.  And zfs send/recv works when you have
one filesystem faithfully tracking another.

When you have two filesystems with similar contents, and the history
of each is useless in deciding how to do a bi-directional
synchronization, then you need a way to diff files that is not based
on intra-filesystem history.  The rsync algorithm is the best
high-performance algorithm that we have for determining differences
between files separated by a network.  My proposal (back then, and
Zooko's now) is to leverage work that the filesystem does anyways to
implement a high-performance remote diff that is faster than rsync for
the simple reason that some of the rsync algorithm essentially gets
pre-computed.

 This would enable applications—without needing any further
 in-filesystem code—to perform a Merkle Tree sync, which would range
 from noticeably more efficient to dramatically more efficient than
 rsync or zfs send. :-)

 Since ZFS send already has an option to only send the changed blocks,
 I disagree with your assertion that your solution will be dramatically
 more efficient than zsf send. We already know zfs send is much more
 efficient than rsync for large file systems.

You missed Zooko's point completely.  It might help to know that Zooko
works on a project called Tahoe Least-Authority Filesystem, which is
by nature distributed.  Once you lose the constraints of not having a
network or having uni-directional replication only, I think you'll get
it.  Or perhaps you'll argue that no one should ever need bi-di
replication, that if one finds oneself wanting that then one has taken
a wrong turn somewhere.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [cryptography] rolling hashes, EDC/ECC vs MAC/MIC, etc.

2011-05-22 Thread Nico Williams
On Sun, May 22, 2011 at 1:52 PM, Nico Williams n...@cryptonector.com wrote:
 [...] Or perhaps you'll argue that no one should ever need bi-di
 replication, that if one finds oneself wanting that then one has taken
 a wrong turn somewhere.

You could also grant the premise and argue instead that nothing the
filesystem can do to speed up remote bi-di sync is worth the cost --
an argument that requires a lot more analysis.  For example, if the
filesystem were to compute and store rsync rolling CRC signatures,
well, that would require significant compute and storage resources,
and it might not speed up synchronization enough to ever be
worthwhile.  Similarly, a Merkle hash tree based on rolling hash
functions (and excluding physical block pointer details) might require
each hash output to grow linearly with block size in order to retain
the rolling hash property (I'm not sure about this; I know very little
about rolling hash functions), in which case the added complexity
would be a complete non-starter.  Whereas a Merkle hash tree built
with regular hash functions would not be able to resolve
insertions/deletions of data chunks of size that is not a whole
multiple of block size.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ls reports incorrect file size

2011-05-02 Thread Nico Williams
Also, sparseness need not be apparent to applications.  Until recent
improvements to lseek(2) to expose hole/non-hole offsets, the only way
to know about sparseness was to notice that a file's reported size is
more than the file's reported filesystem blocks times the block size.
Sparse files in Unix go back at least to the early 80s.
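
The classic demonstration -- file name made up:

    dd if=/dev/zero of=sparse bs=1 count=1 seek=1073741823
    ls -l sparse    # reports a file of about 1GB
    du -h sparse    # reports the block or so actually allocated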

If a filesystem protocol, such as CIFS (I've no idea if it supports
sparse files), were to not support sparse files, all that would mean
is that the server must report a number of blocks that matches a
file's size (assuming the protocol in question even supports any
notion of reporting a file's size in blocks).

There's really two ways in which a filesystem protocol could support
sparse files: a) by reporting file size in bytes and blocks, b) by
reporting lists of file offsets demarcating holes from non-holes.  (b)
is a very new idea; Lustre may be the only filesystem I know of that
supports this (see the Linux FIEMAP APIs), though work is in progress
to add this to NFSv4.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ls reports incorrect file size

2011-05-02 Thread Nico Williams
On Mon, May 2, 2011 at 3:56 PM, Eric D. Mudama
edmud...@bounceswoosh.org wrote:
 Yea, kept googling and it makes sense.  I guess I am simply surprised
 that the application would have done the seek+write combination, since
 on NTFS (which doesn't support sparse) these would have been real
 1.5GB files, and there would be hundreds or thousands of them in
 normal usage.

It could have been smbd compressing long runs of zeros.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ls reports incorrect file size

2011-05-02 Thread Nico Williams
Then again, Windows apps may be doing seek+write to pre-allocate storage.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] disable zfs/zpool destroy for root user

2011-02-17 Thread Nico Williams
On Thu, Feb 17, 2011 at 3:07 PM, Richard Elling
richard.ell...@gmail.com wrote:
 On Feb 17, 2011, at 12:44 PM, Stefan Dormayer wrote:

 Hi all,

 is there a way to disable the subcommand destroy of zpool/zfs for the root 
 user?

 Which OS?

Heheh.  Great answer.  The real answer depends also on what the OP
meant by root.

root in Solaris isn't the all-powerful thing it used to be, or, rather, it is,
but its power can be limited.  And not just on Solaris either.

The OP's question is difficult to answer because the question isn't the one
the OP really wants to ask -- we must tease out that real question, or guess.
I'd start with: just what is it that you want to accomplish?

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] RAID Failure Calculator (for 8x 2TB RAIDZ)

2011-02-14 Thread Nico Williams
On Feb 14, 2011 6:56 AM, Paul Kraus p...@kraus-haus.org wrote:
 P.S. I am measuring number of objects via `zdb -d` as that is faster
 than trying to count files and directories and I expect is a much
 better measure of what the underlying zfs code is dealing with (a
 particular dataset may have lots of snapshot data that does not
 (easily) show up).

It's faster because: a) no atime updates, b) no ZPL overhead.
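
i.e., the difference between something like -- dataset name made up:

    zdb -d tank/fs           # object counts straight from the DMU
    find /tank/fs | wc -l    # a comparable count through the ZPL, with
                             # directory atime updates and the rest

(The two won't match exactly -- zdb also counts metadata objects -- but the
relative speed is the point.)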

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS

2011-02-07 Thread Nico Williams
On Mon, Feb 7, 2011 at 1:17 PM, Yi Zhang yizhan...@gmail.com wrote:
 On Mon, Feb 7, 2011 at 1:51 PM, Brandon High bh...@freaks.com wrote:
 Maybe I didn't make my intention clear. UFS with directio is
 reasonably close to a raw disk from my application's perspective: when
 the app writes to a file location, no buffering happens. My goal is to
 find a way to duplicate this on ZFS.

You're still mixing directio and O_DSYNC.

O_DSYNC is like calling fsync(2) after every write(2).  fsync(2) is useful to
obtain some limited transactional semantics, as well as for durability
semantics.  In ZFS you don't need to call fsync(2) to get those transactional
semantics, but you do need to call fsync(2) to get those durability semantics.

Now, in ZFS fsync(2) implies a synchronous I/O operation involving significantly
more than just the data blocks you wrote to.  Which means that O_DSYNC on ZFS
is significantly slower than on UFS.  You can address this in one of two ways:
a) you might realize that you don't need every write(2) to be durable, then stop
using O_DSYNC, b) you might get a fast ZIL device.
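
For (b), adding a dedicated log device is a one-liner -- pool and device
names hypothetical:

    zpool add tank log c3t0d0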

I'm betting that if you look carefully at your application's requirements you'll
probably conclude that you don't need O_DSYNC at all.  Perhaps you can tell us
more about your application.

 Setting primarycache didn't eliminate the buffering, using O_DSYNC
 (whose side effects include elimination of buffering) made it
 ridiculously slow: none of the things I tried eliminated buffering,
 and just buffering, on ZFS.

 From the discussion so far my feeling is that ZFS is too different
 from UFS that there's simply no way to achieve this goal...

You've not really stated your application's requirements.  You may be convinced
that you need O_DSYNC, but chances are that you don't.  And yes, it's possible
that you'd need O_DSYNC on UFS but not on ZFS.

Nico
--
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss