Re: [zfs-discuss] RFE: Un-dedup for unique blocks
IIRC dump is special. As for swap... really, you don't want to swap. If you're swapping you have problems. Any swap space you have is to help you detect those problems and correct them before apps start getting ENOMEM. There *are* exceptions to this, such as Varnish. For Varnish and any other apps like it I'd dedicate an entire flash drive to it, no ZFS, no nothing. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] RFE: Un-dedup for unique blocks
Bloom filters are very small, that's the difference. You might only need a few bits per block for a Bloom filter. Compare to the size of a DDT entry. A Bloom filter could be cached entirely in main memory. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
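A rough sizing check supports this (hedged: the ~320 bytes per in-core DDT entry is the commonly quoted ballpark, not an exact figure). For a Bloom filter with false-positive rate $p$, the standard formula for bits per inserted key is

    m/n = -\ln p / (\ln 2)^2 \approx 9.6 \text{ bits at } p = 0.01

versus roughly $320 \times 8 = 2560$ bits per cached DDT entry: over 250x smaller per block.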
Re: [zfs-discuss] RFE: Un-dedup for unique blocks
I've wanted a system where dedup applies only to blocks being written that have a good chance of being dups of others. I think one way to do this would be to keep a scalable Bloom filter (on disk) into which one inserts block hashes. To decide if a block needs dedup one would first check the Bloom filter, then if the block is in it, use the dedup code path, else the non-dedup codepath and insert the block in the Bloom filter. This means that the filesystem would store *two* copies of any deduplicatious block, with one of those not being in the DDT. This would allow most writes of non-duplicate blocks to be faster than normal dedup writes, but still slower than normal non-dedup writes: the Bloom filter will add some cost. The nice thing about this is that Bloom filters can be sized to fit in main memory, and will be much smaller than the DDT. It's very likely that this is a bit too obvious to just work. Of course, it is easier to just use flash. It's also easier to just not dedup: the most highly deduplicatious data (VM images) is relatively easy to manage using clones and snapshots, to a point anyways. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
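To make that proposed write path concrete, here is a minimal sketch in C. Everything below is illustrative: the filter size, the choice of four hash slices, and the dedup_write()/nondedup_write() helpers are all made up, and a real implementation would need the scalable, persistent, concurrency-safe filter described above.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define FILTER_BITS (1u << 27)            /* 16 MB of filter; size to taste */

    static uint8_t filter[FILTER_BITS / 8];   /* in-core Bloom filter */

    /* Assumed helpers standing in for the two existing write paths. */
    extern void dedup_write(const uint8_t *cksum, const void *data, size_t len);
    extern void nondedup_write(const uint8_t *cksum, const void *data, size_t len);

    /*
     * Derive k=4 bit positions from the block's checksum.  ZFS already
     * computes a strong 256-bit checksum per block, so slices of it can
     * serve as the k independent hashes.
     */
    static void
    bloom_positions(const uint8_t cksum[32], uint32_t pos[4])
    {
        for (int i = 0; i < 4; i++) {
            uint64_t h;
            memcpy(&h, cksum + i * 8, sizeof (h));
            pos[i] = (uint32_t)(h % FILTER_BITS);
        }
    }

    /* Returns 1 if the checksum was possibly seen before (or is a false
     * positive); inserts it either way. */
    static int
    bloom_test_and_insert(const uint8_t cksum[32])
    {
        uint32_t pos[4];
        int present = 1;

        bloom_positions(cksum, pos);
        for (int i = 0; i < 4; i++) {
            uint8_t bit = 1u << (pos[i] & 7);
            if (!(filter[pos[i] >> 3] & bit))
                present = 0;
            filter[pos[i] >> 3] |= bit;
        }
        return (present);
    }

    /*
     * Gated write path: the first sighting of a block takes the cheap
     * non-dedup path (so one copy lives outside the DDT, as noted above);
     * a repeat sighting, or a false positive, pays the DDT cost.
     */
    void
    write_block(const uint8_t cksum[32], const void *data, size_t len)
    {
        if (bloom_test_and_insert(cksum))
            dedup_write(cksum, data, len);
        else
            nondedup_write(cksum, data, len);
    }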
Re: [zfs-discuss] Solaris 11 System Reboots Continuously Because of a ZFS-Related Panic (7191375)
On Mon, Jan 14, 2013 at 1:48 PM, Tomas Forsman st...@acc.umu.se wrote: https://bug.oraclecorp.com/pls/bug/webbug_print.show?c_rptno=15852599 Host oraclecorp.com not found: 3(NXDOMAIN) Would oracle.internal be a better domain name? Things like that cannot be changed easily. They (Oracle) are stuck with that domain name for the foreseeable future. Also, whoever thought it up probably didn't consider leakage of internal URIs to the outside. *shrug* ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Can the ZFS copies attribute substitute HW disk redundancy?
The copies thing is really only for laptops, where the likelihood of redundancy is very low (there are some high-end laptops with multiple drives, but those are relatively rare) and where this idea is better than nothing. It's also nice that copies can be set on a per-dataset basis (whereas RAID-Zn and mirroring are for pool-wide redundancy, not per-dataset), so you could set it to >1 on home directories but not on /. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] New fast hash algorithm - is it needed?
On Wed, Jul 11, 2012 at 9:48 AM, casper@oracle.com wrote: Huge space, but still finite... Dan Brown seems to think so in Digital Fortress but it just means he has no grasp on big numbers. I couldn't get past that. I had to put the book down. I'm guessing it was as awful as it threatened to be. IMO, FWIW, yes, do add SHA-512 (truncated to 256 bits, of course). Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] New fast hash algorithm - is it needed?
On Wed, Jul 11, 2012 at 3:45 AM, Sašo Kiselkov skiselkov...@gmail.com wrote: It's also possible to set dedup=verify with checksum=sha256, however, that makes little sense (as the chances of getting a random hash collision are essentially nil). IMO dedup should always verify. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] New fast hash algorithm - is it needed?
You can treat whatever hash function you like as an idealized one, but actual hash functions aren't ideal. There may well be as-yet-undiscovered input bit pattern ranges where there's a large density of collisions in some hash function, and indeed, since our hash functions aren't ideal, there must be. We just don't know where these potential collisions are -- for cryptographically secure hash functions that's enough (plus 2nd pre-image and 1st pre-image resistance, but allow me to handwave), but for dedup? *shudder*. Now, for some content types collisions may not be a problem at all. Think of security camera recordings: collisions will show up as bad frames in a video stream that no one is ever going to look at, and if they should need it, well, too bad. And for other content types collisions can be horrible. We ZFS lovers love to talk about how silent bit rot means you may never know about serious corruption in other filesystems until it's too late. Now, if you disable verification in dedup, what do you get? The same situation as other filesystems are in relative to bit rot, only with different likelihoods. Disabling verification is something to do after careful deliberation, not something to do by default. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
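For reference, the idealized arithmetic that the no-verify position leans on: with an ideal 256-bit hash and $n$ distinct blocks, the birthday approximation gives

    p \approx \binom{n}{2} \cdot 2^{-256} \approx n^2 / 2^{257}

so even $n = 2^{48}$ blocks (tens of exabytes at 128K records) yields $p \approx 2^{-161}$. The caveat above is the whole point, though: that bound holds for an *ideal* hash, which is exactly the property real hash functions are not known to have.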
Re: [zfs-discuss] Interaction between ZFS intent log and mmap'd files
On Wed, Jul 4, 2012 at 11:14 AM, Bob Friesenhahn bfrie...@simple.dallas.tx.us wrote: On Tue, 3 Jul 2012, James Litchfield wrote: Agreed - msync/munmap is the only guarantee. I don't see that the munmap definition assures that anything is written to disk. The system is free to buffer the data in RAM as long as it likes without writing anything at all. Oddly enough the manpages at the Open Group don't make this clear. So I think it may well be advisable to use msync(3C) before munmap() on MAP_SHARED mappings. However, I think all implementors should, and probably all do (Linux even documents that it does), have an implied msync(3C) when doing a munmap(2). It really makes no sense at all to have munmap(2) not imply msync(3C). (That's another thing: I don't see where the standard requires that munmap(2) be synchronous. I think it'd be nice to have an mmap(2) option for requesting whether munmap(2) of the same mapping be synchronous or asynchronous. Async munmap(2) -> no need for cross-calls, instead allowing the mapping to be torn down over time. Doing a synchronous msync(3C), then a munmap(2), is a recipe for going real slow, but if munmap(2) does not portably guarantee an implied msync(3C), then would it be safe to do an async msync(3C) then munmap(2)??) Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
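A sketch of the conservative pattern being recommended, i.e. not relying on munmap(2) implying a flush (the filename and size are made up; error handling is minimal):

    #include <sys/mman.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <err.h>

    int
    main(void)
    {
        size_t len = 1024 * 1024;           /* assumes data.bin is >= 1 MB */
        int fd = open("data.bin", O_RDWR);

        if (fd == -1)
            err(1, "open");

        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
            err(1, "mmap");

        ((char *)p)[0] = 'x';               /* ... modify the mapping ... */

        /*
         * MS_SYNC waits until the data is on stable storage; MS_ASYNC
         * merely schedules the writes.  The open question in the post
         * above is whether MS_ASYNC followed by munmap() is portably safe.
         */
        if (msync(p, len, MS_SYNC) == -1)
            err(1, "msync");
        if (munmap(p, len) == -1)
            err(1, "munmap");
        (void) close(fd);
        return (0);
    }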
Re: [zfs-discuss] Interaction between ZFS intent log and mmap'd files
On Tue, Jul 3, 2012 at 9:48 AM, James Litchfield jim.litchfi...@oracle.com wrote: On 07/02/12 15:00, Nico Williams wrote: You can't count on any writes to mmap(2)ed files hitting disk until you msync(2) with MS_SYNC. The system should want to wait as long as possible before committing any mmap(2)ed file writes to disk. Conversely you can't expect that no writes will hit disk until you msync(2) or munmap(2). Driven by fsflush which will scan memory (in chunks) looking for dirty, unlocked, non-kernel pages to flush to disk. Right, but one just cannot count on that -- it's not part of the API specification. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Interaction between ZFS intent log and mmap'd files
On Mon, Jul 2, 2012 at 3:32 PM, Bob Friesenhahn bfrie...@simple.dallas.tx.us wrote: On Mon, 2 Jul 2012, Iwan Aucamp wrote: I'm interested in some more detail on how the ZFS intent log behaves for updates done via a memory-mapped file - i.e. will the ZIL log updates done to an mmap'd file or not? I would expect these writes to go into the intent log unless msync(2) is used on the mapping with the MS_SYNC option. You can't count on any writes to mmap(2)ed files hitting disk until you msync(2) with MS_SYNC. The system should want to wait as long as possible before committing any mmap(2)ed file writes to disk. Conversely you can't expect that no writes will hit disk until you msync(2) or munmap(2). Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [developer] Re: History of EPERM for unlink() of directories on ZFS?
On Tue, Jun 26, 2012 at 9:44 AM, Alan Coopersmith alan.coopersm...@oracle.com wrote: On 06/26/12 05:46 AM, Lionel Cons wrote: On 25 June 2012 11:33, casper@oracle.com wrote: To be honest, I think we should also remove this from all other filesystems and I think ZFS was created this way because all modern filesystems do it that way. This may be wrong way to go if it breaks existing applications which rely on this feature. It does break applications in our case. Existing applications rely on the ability to corrupt UFS filesystems? Sounds horrible. My guess is that the OP just wants unlink() of an empty directory to be the same as rmdir() of the same. Or perhaps they want unlink() of a non-empty directory to result in a recursive rm... But if they really want hardlinks to directories, then yeah, that's horrible. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] History of EPERM for unlink() of directories on ZFS?
On Tue, Jun 26, 2012 at 8:12 AM, Lionel Cons lionelcons1...@googlemail.com wrote: On 26 June 2012 14:51, casper@oracle.com wrote: We've already asked our Netapp representative. She said it's not hard to add that. Did NetApp tell you that they'll add support for using the NFSv4 LINK operation on source objects that are directories?! I'd be extremely surprised! Or did they only tell you that they'll add support for using the NFSv4 REMOVE operation on non-empty directories? The latter is definitely feasible (although it could fail due to share deny OPENs of files below, say, but hey). The former is... not sane. I'd suggest looking into whether you can restructure your code to work without this. It would require touching code for which we don't have sources anymore (people gone, too). It would also require creating hard links to the results files directly, which means linking 15000+ files per directory with a minimum of 3 directories. Each day (this is CERN after all). Oh, I see. But you still don't want hardlinks to directories! Instead you might be able to use LD_PRELOAD to emulate the behavior that the application wants. The app is probably implementing rename(), so just detect the sequence and map it to an actual rename(2). The other way around would be to throw the SPARC machines away and go with Netapp. So Solaris is just a fileserver here? Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
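A sketch of the LD_PRELOAD idea (hypothetical; it assumes the app's "rename" is a link() of a directory followed by an unlink()): interpose link(2) and, when the source is a directory, do a real rename(2) instead. The follow-up unlink() of the old name would then fail with ENOENT, which the shim or the app would have to tolerate.

    /* shim.c: build with something like
     *   cc -shared -fPIC -o shim.so shim.c   (add -ldl on Linux)
     * and run the app with LD_PRELOAD=./shim.so in its environment. */
    #define _GNU_SOURCE     /* for RTLD_NEXT on Linux; Solaris defines it by default */
    #include <dlfcn.h>
    #include <sys/stat.h>
    #include <stdio.h>

    int
    link(const char *oldpath, const char *newpath)
    {
        static int (*real_link)(const char *, const char *);
        struct stat st;

        if (real_link == NULL)
            real_link = (int (*)(const char *, const char *))
                dlsym(RTLD_NEXT, "link");

        /* Map the directory-hardlink half of the sequence to rename(2). */
        if (stat(oldpath, &st) == 0 && S_ISDIR(st.st_mode))
            return (rename(oldpath, newpath));

        return (real_link(oldpath, newpath));
    }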
Re: [zfs-discuss] Is there an actual newsgroup for zfs-discuss?
On Mon, Jun 11, 2012 at 5:05 PM, Tomas Forsman st...@acc.umu.se wrote: .. or use a mail reader that doesn't suck. Or the mailman thread view. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Terminology question on ZFS COW
COW goes back at least to the early days of virtual memory and fork(). On fork() the kernel would arrange for writable pages in the parent process to be made read-only so that writes to them could be caught, and then the page fault handler would copy the page (and restore write access) so the parent and child each have their own private copies. COW as used in ZFS is not the same, but the concept was introduced very early also, IIRC in the mid-80s -- certainly no later than 4.4BSD's log-structured filesystem (which ZFS resembles in many ways). So, is COW a misnomer? Yes and no, and anyways, it's irrelevant. The important thing is that when you say COW people understand that you're not saving a copy of the old thing but rather writing the new thing to a new location. (The old version of whatever was copied-on-write is stranded, unless, of course, you have references left to it from things like snapshots.) Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] current status of SAM-QFS?
On Wed, May 2, 2012 at 7:59 AM, Paul Kraus p...@kraus-haus.org wrote: On Wed, May 2, 2012 at 7:46 AM, Darren J Moffat darr...@opensolaris.org wrote: If Oracle is only willing to share (public) information about the roadmap for products via official sales channels then there will be lots of FUD in the market. Now, as to sharing futures and NDA material, that _should_ only be available via direct Oracle channels (as it was under Sun as well). Sun was tight-lipped too, yes, but information leaked through the open or semi-open software development practices in Solaris. If you saw some feature pushed to some gate you had no guarantee that it would remain there or be supported, but you had a pretty good inkling as to whether the engineers working on it intended it to remain there. If you can't get something out of your rep, you might try reading the tea leaves (sketchy business). But ultimately you need to be prepared for any product's EOL. You can expect some amount of warning time about EOLs, but legacy has a way of sticking around, so write a plan for how to migrate data and where to, then put the plan in a drawer somewhere (and update it as necessary). Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] cluster vs nfs
On Thu, Apr 26, 2012 at 12:10 AM, Richard Elling richard.ell...@gmail.com wrote: On Apr 25, 2012, at 8:30 PM, Carson Gaspar wrote: Reboot requirement is a lame client implementation. And lame protocol design. You could possibly migrate read-write NFSv3 on the fly by preserving FHs and somehow updating the clients to go to the new server (with a hiccup in between, no doubt), but only entire shares at a time -- you could not migrate only part of a volume with NFSv3. Of course, having migration support in the protocol does not equate to getting it in the implementation, but it's certainly a good step in that direction. You are correct, a ZFS send/receive will result in different file handles on the receiver, just like rsync, tar, ufsdump+ufsrestore, etc. That's understandable for NFSv2 and v3, but for v4 there's no reason that an NFSv4 server stack and ZFS could not arrange to preserve FHs (if, perhaps, at the price of making the v4 FHs rather large). Although even for v3 it should be possible for servers in a cluster to arrange to preserve devids... Bottom line: live migration needs to be built right into the protocol. For me one of the exciting things about Lustre was/is the idea that you could just have a single volume where all new data (and metadata) is distributed evenly as you go. Need more storage? Plug it in, either to an existing head or via a new head, then flip a switch and there it is. No need to manage allocation. Migration may still be needed, both within a cluster and between clusters, but that's much more manageable when you have a protocol where data locations can be all over the place in a completely transparent manner. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] cluster vs nfs
On Thu, Apr 26, 2012 at 5:45 PM, Carson Gaspar car...@taltos.org wrote: On 4/26/12 2:17 PM, J.P. King wrote: I don't know SnapMirror, so I may be mistaken, but I don't see how you can have non-synchronous replication which can allow for seamless client failover (in the general case). Technically this doesn't have to be block based, but I've not seen anything which wasn't. Synchronous replication pretty much precludes DR (again, I can think of theoretical ways around this, but have never come across anything in practice). seamless is an over-statement, I agree. NetApp has synchronous SnapMirror (which is only mostly synchronous...). Worst case, clients may see a filesystem go backwards in time, but to a point-in-time consistent state. Sure, if we assume apps make proper use of O_EXCL, O_APPEND, link(2)/unlink(2)/rename(2), sync(2), fsync(2), and fdatasync(3C) and can roll their state back on their own. Databases typically know how to do that (e.g., SQLite3). Most apps? Doubtful. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
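For concreteness, the kind of discipline being assumed of applications is the classic atomic-replace sequence (a sketch; the filenames are hypothetical and error paths abbreviated). After a crash, or a replica rolling back to a point-in-time consistent state, a reader sees either the old file or the new one, never a torn mix:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <err.h>

    void
    save_state(const void *buf, size_t len)
    {
        int fd = open("state.tmp", O_WRONLY | O_CREAT | O_EXCL, 0600);

        if (fd == -1)
            err(1, "open");
        if (write(fd, buf, len) != (ssize_t)len)
            err(1, "write");
        if (fsync(fd) == -1)                /* data durable before the name flips */
            err(1, "fsync");
        (void) close(fd);
        if (rename("state.tmp", "state") == -1)  /* atomic name switch */
            err(1, "rename");
    }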
Re: [zfs-discuss] cluster vs nfs
On Thu, Apr 26, 2012 at 12:37 PM, Richard Elling richard.ell...@gmail.com wrote: [...] NFSv4 had migration in the protocol (excluding protocols between servers) from the get-go, but it was missing a lot (FedFS) and was not implemented until recently. I've no idea what clients and servers support it adequately besides Solaris 11, though that's just my fault (not being informed). It's taken over a decade to get to where we have any implementations of NFSv4 migration. For me one of the exciting things about Lustre was/is the idea that you could just have a single volume where all new data (and metadata) is distributed evenly as you go. Need more storage? Plug it in, either to an existing head or via a new head, then flip a switch and there it is. No need to manage allocation. Migration may still be needed, both within a cluster and between clusters, but that's much more manageable when you have a protocol where data locations can be all over the place in a completely transparent manner. Many distributed file systems do this, at the cost of being not quite POSIX-ish. Well, Lustre does POSIX semantics just fine, including cache coherency (as opposed to NFS' close-to-open coherency, which is decidedly non-POSIX). In the brave new world of storage vmotion, nosql, and distributed object stores, it is not clear to me that coding to a POSIX file system is a strong requirement. Well, I don't quite agree. I'm very suspicious of eventually-consistent. I'm not saying that the enormous DBs that eBay and such run should sport SQL and ACID semantics -- I'm saying that I think we can do much better than eventually-consistent (and no-language) while not paying the steep price that ACID requires. I'm not alone in this either. The trick is to find the right compromise. Close-to-open semantics works out fine for NFS, but O_APPEND is too wonderful not to have (ditto O_EXCL, which NFSv2 did not have; v4 has O_EXCL, but not O_APPEND). Whoever first delivers the right compromise in distributed DB semantics stands to make a fortune. Perhaps people are so tainted by experiences with v2 and v3 that we can explain the non-migration to v4 as being due to poor marketing? As a leader of NFS, Sun had unimpressive marketing. Sun did not do too much to improve NFS in the 90s, not compared to the v4 work that only really started paying off recently. And since Sun had lost the client space by then, it doesn't mean all that much to have the best server if the clients aren't able to take advantage of the server's best features for lack of client implementation. Basically, Sun's ZFS, DTrace, SMF, NFSv4, Zones, and other amazing innovations came a few years too late to make up for the awful management that Sun was saddled with. But for all the decidedly awful things Sun management did (or didn't do), the worst was terminating Sun PS (yes, worse than all the non-marketing, poor marketing, poor acquisitions, poor strategy, and all the rest including truly epic mistakes like icing Solaris on x86 a decade ago). One of the worst outcomes of the Sun debacle is that now there's a bevy of senior execs who think the worst thing Sun did was to open source Solaris and Java -- which isn't to say that Sun should have open sourced as much as it did, or that open source is an end in itself, but that open sourcing these things was a legitimate business tool with very specific goals in mind in each case, and which had nothing to do with the sinking of the company. 
Or maybe that's one of the best outcomes, because the good news about it is that those who learn the right lessons (in that case: that open source is a legitimate business tool that is sometimes, often even, a great mind-share building tool) will be in the minority, and thus will have a huge advantage over their competition. That's another thing Sun did not learn until it was too late: mind-share matters enormously to a software company. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS on Linux vs FreeBSD
As I understand it LLNL has very large datasets on ZFS on Linux. You could inquire with them, as well as http://groups.google.com/a/zfsonlinux.org/group/zfs-discuss/topics?pli=1 . My guess is that it's quite stable for at least some use cases (most likely: LLNL's!), but that may not be yours. You could always... test it, but if you do then please tell us how it went :) Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] cluster vs nfs (was: Re: ZFS on Linux vs FreeBSD)
I agree, you need something like AFS, Lustre, or pNFS. And/or an NFS proxy to those. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] cluster vs nfs (was: Re: ZFS on Linux vs FreeBSD)
On Wed, Apr 25, 2012 at 4:26 PM, Paul Archer p...@paularcher.org wrote: 2:20pm, Richard Elling wrote: Ignoring lame NFS clients, how is that architecture different than what you would have with any other distributed file system? If all nodes share data to all other nodes, then...? Simple. With a distributed FS, all nodes mount from a single DFS. With NFS, each node would have to mount from each other node. With 16 nodes, that's what, 240 mounts? Not to mention your data is in 16 different mounts/directory structures, instead of being in a unified filespace. To be fair NFSv4 now has a distributed namespace scheme so you could still have a single mount on the client. That said, some DFSes have better properties, such as striping of data across sets of servers, aggressive caching, and various choices of semantics (e.g., Lustre tries hard to give you POSIX cache coherency semantics). Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] cluster vs nfs (was: Re: ZFS on Linux vs FreeBSD)
On Wed, Apr 25, 2012 at 5:22 PM, Richard Elling richard.ell...@gmail.com wrote: Unified namespace doesn't relieve you of 240 cross-mounts (or equivalents). FWIW, automounters were invented 20+ years ago to handle this in a nearly seamless manner. Today, we have DFS from Microsoft and NFS referrals that almost eliminate the need for automounter-like solutions. I disagree vehemently. automount is a disaster because you need to synchronize changes with all those clients. That's not realistic. I've built a large automount-based namespace, replete with a distributed configuration system for setting the environment variables available to the automounter. I can tell you this: the automounter does not scale, and it certainly does not avoid the need for outages when storage migrates. With server-side, referral-based namespace construction that problem goes away, and the whole thing can be transparent w.r.t. migrations. For my money the key features a DFS must have are:
- server-driven namespace construction
- data migration without having to restart clients, reconfigure them, or do anything at all to them
- aggressive caching
- striping of file data for HPC and media environments
- semantics that ultimately allow multiple processes on disparate clients to cooperate (i.e., byte range locking), but I don't think full POSIX semantics are needed (that said, I think O_EXCL is necessary, and it'd be very nice to have O_APPEND, though the latter is particularly difficult to implement and painful when there's contention if you stripe file data across multiple servers)
Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] cluster vs nfs
On Wed, Apr 25, 2012 at 5:42 PM, Ian Collins i...@ianshome.com wrote: Aren't those general considerations when specifying a file server? There are Lustre clusters with thousands of nodes, hundreds of them being servers, and high utilization rates. Whatever specs you might have for one server head, it will not meet the demand that hundreds of the same can. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] cluster vs nfs (was: Re: ZFS on Linux vs FreeBSD)
On Wed, Apr 25, 2012 at 7:37 PM, Richard Elling richard.ell...@gmail.com wrote: On Apr 25, 2012, at 3:36 PM, Nico Williams wrote: I disagree vehemently. automount is a disaster because you need to synchronize changes with all those clients. That's not realistic. Really? I did it with NIS automount maps and 600+ clients back in 1991. Other than the obvious problems with open files, has it gotten worse since then? Nothing's changed. Automounter + data migration -> rebooting clients (or close enough to rebooting). I.e., an outage. Storage migration is much more difficult with NFSv2, NFSv3, NetWare, etc. But not with AFS. And spec-wise not with NFSv4 (though I don't know if/when all NFSv4 clients will properly support migration, just that the protocol and some servers do). With server-side, referral-based namespace construction that problem goes away, and the whole thing can be transparent w.r.t. migrations. Yes. Agree, but we didn't have NFSv4 back in 1991 :-) Of course, this is how one would design it if you had to design a new DFS today. Indeed, that's why I built an automounter solution in 1996 (that's still in use, I'm told). Although to be fair AFS existed back then, already had a global namespace and data migration, and was mature. It's taken NFS that long to catch up... [...] Almost any of the popular nosql databases offer this and more. The movement away from POSIX-ish DFS and storing data in traditional files is inevitable. Even ZFS is an object store at its core. I agree. Except that there are applications where large octet streams are needed. HPC, media come to mind. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] cluster vs nfs (was: Re: ZFS on Linux vs FreeBSD)
On Wed, Apr 25, 2012 at 8:57 PM, Paul Kraus pk1...@gmail.com wrote: On Wed, Apr 25, 2012 at 9:07 PM, Nico Williams n...@cryptonector.com wrote: Nothing's changed. Automounter + data migration -> rebooting clients (or close enough to rebooting). I.e., an outage. Uhhh, not if you design your automounter architecture correctly and (as Richard said) have NFS clients that are not lame to which I'll add, automounters that actually work as advertised. I was designing automount architectures that permitted dynamic changes with minimal to no outages in the late 1990's. I only had a little over 100 clients (most of which were also servers) and NIS+ (NIS ver. 3) to distribute the indirect automount maps. Further below you admit that you're talking about read-only data, effectively. But the world is not static. Sure, *code* is by and large static, and indeed, we segregated data by whether it was read-only (code, historical data) or not (application data, home directories). We were able to migrate *read-only* data with no outages. But for the rest? Yeah, there were always outages. Of course, we had a periodic maintenance window, with all systems rebooting within a short period, and this meant that some data migration outages were not noticeable, but they were real. I also had to _redesign_ a number of automount strategies that were built by people who thought that using direct maps for everything was a good idea. That _was_ a pain in the a** due to the changes needed at the applications to point at a different hierarchy. We used indirect maps almost exclusively. Moreover, we used hierarchical automount entries, and even -autofs mounts. We also used environment variables to control various things, such as which servers to mount what from (this was particularly useful for spreading the load on read-only static data). We used practically every feature of the automounter except for executable maps (and direct maps, when we eventually stopped using those). It all depends on _what_ the application is doing. Something that opens and locks a file and never releases the lock or closes the file until the application exits will require a restart of the application with an automounter / NFS approach. No kidding! In the real world such applications exist and get used. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
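As an illustration of the style described (all server and map names here are hypothetical), an indirect map plus automounter variables is what lets one map serve many hosts; $SITE would be set per host, e.g. via automountd -D SITE=east:

    # /etc/auto_master fragment: one mount point, one indirect map
    /data     auto_data

    # auto_data (indirect map): read-only replicas selected by $SITE,
    # read-write data pinned to a single server
    tools     -ro     tools-$SITE:/export/tools
    scratch           scratch-$SITE:/export/scratch
    home      -rw     homesrv:/export/home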
Re: [zfs-discuss] Data loss by memory corruption?
On Wed, Jan 18, 2012 at 4:53 AM, Jim Klimov jimkli...@cos.ru wrote: 2012-01-18 1:20, Stefan Ring wrote: I don’t care too much if a single document gets corrupted – there’ll always be a good copy in a snapshot. I do care however if a whole directory branch or old snapshots were to disappear. Well, as far as this problem relies on random memory corruptions, you don't get to choose whether your document gets broken or some low-level part of metadata tree ;) Other filesystems tend to be much more tolerant of bit rot of all types precisely because they have no block checksums. But I'd rather have ZFS -- *with* redundancy, of course, and with ECC. It might be useful to have a way to recover from checksum mismatches by involving a human. I'm imagining a tool that tests whether accepting a block's actual contents results in making data available that the human thinks checks out, and if so, then rewriting that block. Some bit errors might simply result in meaningless metadata, but in some cases this can be corrected (e.g., ridiculous block addresses). But if ECC takes care of the problem then why waste the effort? (Partial answer: because it'd be a very neat GSoC type project!) Besides, what if that document you don't care about is your account's entry in a banking system (as if they had no other redundancy and double-checks)? And suddenly you don't exist because of some EIOIO, or your balance is zeroed (or worse, highly negative)? ;) This is why we have paper trails, logs, backups, redundancy at various levels, ... Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Idea: ZFS and on-disk ECC for blocks
On Wed, Jan 11, 2012 at 9:16 AM, Jim Klimov jimkli...@cos.ru wrote: I've recently had a sort of an opposite thought: yes, ZFS redundancy is good - but also expensive in terms of raw disk space. This is especially bad for hardware space-constrained systems like laptops and home-NASes, where doubling the number of HDDs (for mirrors) or adding tens of percent of storage for raidZ is often not practical for whatever reason. Redundancy through RAID-Z and mirroring is expensive for home systems and laptops, but mostly due to the cost of SATA/SAS ports, not the cost of the drives. The drives are cheap, but getting an extra disk in a laptop is either impossible or expensive. But that doesn't mean you can't mirror slices or use ditto blocks. For laptops just use ditto blocks and either zfs send or an external mirror that you attach/detach. Current ZFS checksums allow us to detect errors, but in order for recovery to actually work, there should be a redundant copy and/or parity block available and valid. Hence the question: why not put ECC info into ZFS blocks? RAID-Zn *is* an error correction system. But what you are asking for is a same-device error correction method that costs less than ditto blocks, with error correction data baked into the blkptr_t. Are there enough free bits left in the block pointer for error correction codes for large blocks? (128KB blocks, but eventually ZFS needs to support even larger blocks, so keep that in mind.) My guess is: no. Error correction data might have to get stored elsewhere. I don't find this terribly attractive, but maybe I'm just not looking at it the right way. Perhaps there is a killer enterprise feature for ECC here: stretching MTTDL in the face of a device failure in a mirror or raid-z configuration (but if failures are typically of whole drives rather than individual blocks, then this wouldn't help). But without a good answer for where to store the ECC for the largest blocks, I don't see this happening. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
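A quick check on the "enough free bits" question, assuming a plain Hamming/SECDED code: correcting one flipped bit in an m-bit block needs r check bits satisfying

    2^r \ge m + r + 1

so for a 128KB block, $m = 2^{20}$ gives $r = 21$ (22 with double-error detection), and a couple dozen bits would indeed fit in a 128-byte blkptr_t. But single-bit correction is a poor match for real corruption (torn or scribbled sectors, bursts), and correcting those needs parity comparable in size to the damage, which clearly cannot live in the block pointer.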
Re: [zfs-discuss] S11 vs illumos zfs compatiblity
On Thu, Jan 5, 2012 at 8:53 AM, sol a...@yahoo.com wrote: if a bug fixed in Illumos is never reported to Oracle by a customer, it would likely never get fixed in Solaris either :-( I would have liked to think that there was some good-will between the ex- and current-members of the zfs team, in the sense that the people who created zfs but then left Oracle still care about it enough to want the Oracle version to be as bug-free as possible. My intention was to encourage users to report bugs to both, Oracle and Illumos. It's possible that Oracle engineers pay attention to the Illumos bug database, but I expect that for legal reasons they will not look at Illumos code that has any new copyright notices relative to Oracle code. The simplest way for Oracle engineers to avoid all possible legal problems is to simply ignore at least the Illumos source repositories, possibly more. I'm speculating, sure; I might be wrong. As for good will, I'm certain that there is, at least at the engineer level, and probably beyond. But that doesn't mean that there will be bug parity, much less feature parity. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup
On Thu, Dec 29, 2011 at 9:53 AM, Brad Diggs brad.di...@oracle.com wrote: Jim, You are spot on. I was hoping that the writes would be close enough to identical that there would be a high ratio of duplicate data since I use the same record size, page size, compression algorithm, … etc. However, that was not the case. The main thing that I wanted to prove though was that if the data was the same the L1 ARC only caches the data that was actually written to storage. That is a really cool thing! I am sure there will be future study on this topic as it applies to other scenarios. With regards to directory engineering investing any energy into optimizing ODSEE DS to more effectively leverage this caching potential, that won't happen. OUD far outperforms ODSEE. That said OUD may get some focus in this area. However, time will tell on that one. Databases are not as likely to benefit from dedup as virtual machines, indeed, DBs are likely to not benefit at all from dedup. The VM use case benefits from dedup for the obvious reason that many VMs will have the same exact software installed most of the time, using the same filesystems, and the same patch/update installation order, so if you keep data out of their root filesystems then you can expect enormous deduplicatiousness. But databases, not so much. The unit of deduplicable data in a VM use case is the guest's preferred block size, while in a DB the unit of deduplicable data might be a variable-sized table row, or even smaller: a single row/column value -- and you have no way to ensure alignment of individual deduplicable units nor ordering of sets of deduplicable units into larger ones. When it comes to databases your best bets will be: a) database-level compression or dedup features (e.g., Oracle's column-level compression feature) or b) ZFS compression. (Dedup makes VM management easier, because the alternative is to patch one master guest VM [per-guest type] then re-clone and re-configure all instances of that guest type, in the process possibly losing any customizations in those guests. But even before dedup, the ability to snapshot and clone datasets was an impressive dedup-like tool for the VM use-case, just not as convenient as dedup.) Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] S11 vs illumos zfs compatiblity
On Thu, Dec 29, 2011 at 2:06 PM, sol a...@yahoo.com wrote: Richard Elling wrote: many of the former Sun ZFS team regularly contribute to ZFS through the illumos developer community. Does this mean that if they provide a bug fix via illumos then the fix won't make it into the Oracle code? If you're an Oracle customer you should report any ZFS bugs you find to Oracle if you want fixes in Solaris. You may want to (and I encourage you to) report such bugs to Illumos if at all possible (i.e., unless your agreement with Oracle or your employer's policies somehow prevent you from doing so). The following is complete speculation. Take it with salt. With reference to your question, it may mean that Oracle's ZFS team would have to come up with their own fixes to the same bugs. Oracle's legal department would almost certainly have to clear the copying of any non-trivial/obvious fix from Illumos into Oracle's ON tree. And if taking a fix from Illumos were to require opening the affected files (because they are under CDDL in Illumos) then executive management approval would also be required. But the most likely case is that the issue simply wouldn't come up in the first place because Oracle's ZFS team would almost certainly ignore the Illumos repository (perhaps not the Illumos bug tracker, but probably that too) as that's simply the easiest way for them to avoid legal messes. Think about it. Besides, I suspect that from Oracle's point of view what matters are bug reports by Oracle customers to Oracle, so if a bug fixed in Illumos is never reported to Oracle by a customer, it would likely never get fixed in Solaris either except by accident, as a result of another change. Also, the Oracle ZFS team is not exactly devoid of clue, even with the departures from it to date. I suspect they will be able to fix bugs in Oracle's ZFS and completely independently of the open ZFS community, even if it means duplicating effort. That said, Illumos is a fork of OpenSolaris, and as such it and Solaris will necessarily diverge as at least one of the two (and probably both, for a while) gets plenty of bug fixes and enhancements. This is a good thing, not a bad thing, at least for now. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup
On Thu, Dec 29, 2011 at 6:44 PM, Matthew Ahrens mahr...@delphix.com wrote: On Mon, Dec 12, 2011 at 11:04 PM, Erik Trimble tr...@netdemons.com wrote: (1) when constructing the stream, every time a block is read from a fileset (or volume), its checksum is sent to the receiving machine. The receiving machine then looks up that checksum in its DDT, and sends back a needed or not-needed reply to the sender. While this lookup is being done, the sender must hold the original block in RAM, and cannot write it out to the to-be-sent-stream. ... you produce a huge amount of small network packet traffic, which trashes network throughput This seems like a valid approach to me. When constructing the stream, the sender need not read the actual data, just the checksum in the indirect block. So there is nothing that the sender must hold in RAM. There is no need to create small (or synchronous) network packets, because the sender need not wait for the receiver to determine if it needs the block or not. There can be multiple asynchronous communication streams: one where the sender sends all the checksums to the receiver; another where the receiver requests blocks that it does not have from the sender; and another where the sender sends requested blocks back to the receiver. Implementing this may not be trivial, and in some cases it will not improve on the current implementation. But in others it would be a considerable improvement. Right, you'd want to let the socket/transport buffer and flow-control the "I have this new block checksum" messages from the zfs sender and the "I need the block with this checksum" messages from the zfs receiver. I like this. A separate channel for bulk data definitely comes recommended for flow control reasons, but if you do that then securing the transport gets complicated: you couldn't just zfs send .. | ssh ... zfs receive. You could use SSH channel multiplexing, but that will net you lousy performance (well, no lousier than one already gets with SSH anyways)[*]. (And SunSSH lacks this feature anyways) It'd then begin to pay to have a bona fide zfs send network protocol, and now we're talking about significantly more work. Another option would be to have send/receive options to create the two separate channels, so one would do something like: % zfs send --dedup-control-channel ... | ssh-or-netcat-or... zfs receive --dedup-control-channel ... % zfs send --dedup-bulk-channel ... | ssh-or-netcat-or... zfs receive --dedup-bulk-channel % wait The second zfs receive would rendezvous with the first and go from there. [*] The problem with SSHv2 is that it has flow controlled channels layered over a flow controlled congestion channel (TCP), and there's not enough information flowing from TCP to SSHv2 to make this work well, but also, the SSHv2 channels cannot have their window shrink except by the sender consuming it, which makes it impossible to mix high-bandwidth bulk and small control data over a congested link. This means that in practice SSHv2 channels have to have relatively small windows, which then forces the protocol to work very synchronously (i.e., with effectively synchronous ACKs of bulk data). I now believe the idea of mixing bulk and non-bulk data over a single TCP connection in SSHv2 is a failure. SSHv2 over SCTP, or over multiple TCP connections, would be much better. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] S11 vs illumos zfs compatiblity
On Tue, Dec 27, 2011 at 2:20 PM, Frank Cusack fr...@linetwo.net wrote: http://sparcv9.blogspot.com/2011/12/solaris-11-illumos-and-source.html If I upgrade ZFS to use the new features in Solaris 11 I will be unable to import my pool using the free ZFS implementation that is available in illumos based distributions Is that accurate? I understand if the S11 version is ahead of illumos, of course I can't use the same pools in both places, but that is the same problem as using an S11 pool on S10. The author is implying a much worse situation, that there are zfs tracks in addition to versions and that S11 is now on a different track and an S11 pool will not be usable elsewhere, ever. I hope it's just a misrepresentation. Hard to say. Suppose Oracle releases no details on any additions to the on-disk ZFS format since build 147... then either the rest of the ZFS developer community forks for good, or they have to reverse engineer Oracle's additions. Even if Oracle does release details on their additions, what if the external ZFS developer community disagrees vehemently with any of those? And what if the open source community adds extensions that Oracle never adopts? A fork is not yet a reality, but IMO it sure looks likely. Of course, you can still manage to have pools that will work on all implementations -- until the day that implementations start removing older formats anyways, which not only could happen, but I think will happen, though probably not until S10 is EOLed, and in any case probably not for a few years yet, likely not even within the next half decade. It's hard to predict such things though, so take the above with a grain (or lots!) of salt. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] S11 vs illumos zfs compatiblity
On Tue, Dec 27, 2011 at 8:44 PM, Frank Cusack fr...@linetwo.net wrote: So with a de facto fork (illumos) now in place, is it possible that two zpools will report the same version yet be incompatible across implementations? Not likely: the Illumos community has developed a method (feature flags) for managing ZFS extensions other than by linear version numbering. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup
On Dec 11, 2011 5:12 AM, Nathan Kroenert nat...@tuneunix.com wrote: On 12/11/11 01:05 AM, Pawel Jakub Dawidek wrote: On Wed, Dec 07, 2011 at 10:48:43PM +0200, Mertol Ozyoney wrote: Unfortunately the answer is no. Neither l1 nor l2 cache is dedup aware. The only vendor i know that can do this is Netapp And you really work at Oracle?:) The answer is definitely yes. ARC caches on-disk blocks and dedup just references those blocks. When you read, the dedup code is not involved at all. Let me show it to you with simple test: Create a file (dedup is on): # dd if=/dev/random of=/foo/a bs=1m count=1024 Copy this file so that it is deduped: # dd if=/foo/a of=/foo/b bs=1m Export the pool so all cache is removed and reimport it: # zpool export foo # zpool import foo Now let's read one file: # dd if=/foo/a of=/dev/null bs=1m 1073741824 bytes transferred in 10.855750 secs (98909962 bytes/sec) We read file 'a' and all its blocks are in cache now. The 'b' file shares all the same blocks, so if ARC caches blocks only once, reading 'b' should be much faster: # dd if=/foo/b of=/dev/null bs=1m 1073741824 bytes transferred in 0.870501 secs (1233475634 bytes/sec) Now look at it, 'b' was read 12.5 times faster than 'a' with no disk activity. Magic?:) Hey all, That reminds me of something I have been wondering about... Why only 12x faster? If we are effectively reading from memory - as compared to a disk reading at approximately 100MB/s (which is about an average PC HDD reading sequentially), I'd have thought it should be a lot faster than 12x. Can we really pull stuff from cache at only a little over one gigabyte per second if it's dedup data? The second file may have the same data, but not the same metadata (the inode number at least must be different), so the znode for it must get read in, and that will slow reading the copy down a bit. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] bug moving files between two zfs filesystems (too many open files)
On Tue, Nov 29, 2011 at 12:17 PM, Cindy Swearingen cindy.swearin...@oracle.com wrote: I think the too many open files is a generic error message about running out of file descriptors. You should check your shell ulimit information. Also, see how many open files you have: echo /proc/self/fd/* It'd be quite weird though to have a very low fd limit or a very large number of file descriptors open in the shell. That said, as Casper says, utilities like mv(1) should be able to cope with reasonably small fd limits (i.e., not as small as 3, but perhaps as small as 10). Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] grrr, How to get rid of mis-touched file named `-c'
On Mon, Nov 28, 2011 at 11:28 AM, Smith, David W. smith...@llnl.gov wrote: You could list by inode, then use find with rm. # ls -i 7223 -c # find . -inum 7223 -exec rm {} \; This is the one solution I'd recommend against, since it would remove hardlinks that you might care about. Also, this thread is getting long, repetitive, tiring. Please stop. This is a standard issue Unix beginner question, just like my test program does nothing. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] virtualbox rawdisk discrepancy
Moving boot disks from one machine to another used to work as long as the machines were of the same architecture. I don't recall if it was *supported* (and wouldn't want to pretend to speak for Oracle now), but it was meant to work (unless you minimized the install and removed drivers not needed on the first system that are needed on the other system). You did have to do a reconfigure boot though! Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] about btrfs and zfs
On Mon, Nov 14, 2011 at 8:33 AM, Edward Ned Harvey opensolarisisdeadlongliveopensola...@nedharvey.com wrote: From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Paul Kraus Is it really B-Tree based? Apple's HFS+ is B-Tree based and falls apart (in terms of performance) when you get too many objects in one FS, which is specifically what drove us to ZFS. We had 4.5 TB of data According to wikipedia, btrfs is a b-tree. I know in ZFS, the DDT is an AVL tree, but what about the rest of the filesystem? ZFS directories are hashed. Aside from this, the filesystem (and volume) have a tree structure, but that's not what's interesting here -- what's interesting is how directories are indexed. B-trees should be logarithmic time, which is the best O() you can possibly achieve. So if HFS+ is dog slow, it's an implementation detail and not a general fault of b-trees. Hash tables can do much better than O(log N) for searching: O(1) for best case, and O(n) for the worst case. Also, b-trees are O(log_b N), where b is the number of entries per-node. 6e7 entries/directory probably works out to 2-5 reads (assuming 0% cache hit rate) depending on the size of each directory entry and the size of the b-tree blocks. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
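The arithmetic behind that 2-5 reads estimate, assuming (hypothetically) on the order of a few hundred entries per b-tree node: a b-ary tree over N entries has height about

    h \approx \log_b N = \ln N / \ln b

and for $N = 6 \times 10^7$ that gives $h \approx 3.9$ at $b = 100$ and $h \approx 2.6$ at $b = 1000$, i.e. roughly 3-4 node reads at a 0% cache hit rate, consistent with the range above.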
[zfs-discuss] aclmode=mask
I see, with great pleasure, that ZFS in Solaris 11 has a new aclmode=mask property. http://download.oracle.com/docs/cd/E23824_01/html/821-1448/gbscy.html#gkkkp http://download.oracle.com/docs/cd/E23824_01/html/821-1448/gbchf.html#gljyz http://download.oracle.com/docs/cd/E23824_01/html/821-1462/zfs-1m.html#scrolltoc (search for aclmode) May this be the last word in ACL/chmod interactions (knocks on wood, crosses fingers, ...). Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] aclmode=mask
On Mon, Nov 14, 2011 at 6:20 PM, Nico Williams n...@cryptonector.com wrote: I see, with great pleasure, that ZFS in Solaris 11 has a new aclmode=mask property. Also, congratulations on shipping. And thank you for implementing aclmode=mask. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] about btrfs and zfs
On Fri, Nov 11, 2011 at 4:27 PM, Paul Kraus p...@kraus-haus.org wrote: The command syntax paradigm of zfs (command sub-command object parameters) is not unique to zfs, but seems to have been the way of doing things in Solaris 10. The _new_ functions of Solaris 10 were all this way (to the best of my knowledge)... zonecfg zoneadm svcadm svccfg ... and many others are written this way. To boot the zone named foo you use the command zoneadm -z foo boot, to disable the service named sendmail, svcadm disable sendmail, etc. Someone at Sun was thinking :-) I'd have preferred zoneadm boot foo. The -z zone command thing is a bit of a sore point, IMO. But yes, all these new *adm(1M) and *cfg(1M) commands in S10 are wonderful, especially when compared to past and present alternatives in the Unix/Linux world. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Wanted: sanity check for a clustered ZFS idea
To some people active-active means all cluster members serve the same filesystems. To others active-active means all cluster members serve some filesystems and can serve all filesystems ultimately by taking over failed cluster members. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] about btrfs and zfs
On Wed, Oct 19, 2011 at 7:24 AM, Garrett D'Amore garrett.dam...@nexenta.com wrote: I'd argue that from a *developer* point of view, an fsck tool for ZFS might well be useful. Isn't that what zdb is for? :-) But ordinary administrative users should never need something like this, unless they have encountered a bug in ZFS itself. (And bugs are as likely to exist in the checker tool as in the filesystem. ;-) zdb can be useful for admins -- say, to gather stats not reported by the system, to explore the fs/vol layout, for educational purposes, and so on. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
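A few examples of the kind of read-only poking meant here (real zdb options, though output details vary by release; "tank" is a placeholder pool name):

    # zdb -C tank          # show the pool's cached configuration
    # zdb -b tank          # traverse the pool and gather block statistics
    # zdb -d tank/home     # summarize the objects in a dataset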
Re: [zfs-discuss] about btrfs and zfs
On Tue, Oct 18, 2011 at 9:35 AM, Brian Wilson bfwil...@doit.wisc.edu wrote: I just wanted to add something on fsck on ZFS - because for me that used to make ZFS 'not ready for prime-time' in 24x7 5+ 9s uptime environments. Where ZFS doesn't have an fsck command - and that really used to bug me - it does now have a -F option on zpool import. To me it's the same functionality for my environment - the ability to try to roll back to a 'hopefully' good state and get the filesystem mounted up, leaving the corrupted data objects corrupted. [...] Yes, that's exactly what it is. There's no point calling it fsck because fsck fixes individual filesystems, while ZFS fixups need to happen at the pool level (at pool import time). It's true that this should have been in ZFS from the word go. But it's there now, and that's what matters, IMO. It's also true that this was never necessary with hardware that doesn't lie, but it's good to have it anyways, and is critical for personal systems such as laptops. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Wanted: sanity check for a clustered ZFS idea
On Thu, Oct 13, 2011 at 9:13 PM, Jim Klimov jimkli...@cos.ru wrote: Thanks to Nico for concerns about POSIX locking. However, hopefully, in the use case I described - serving images of VMs in a manner where storage, access and migration are efficient - whole datasets (be it volumes or FS datasets) can be dedicated to one VM host server at a time, just like whole pools are dedicated to one host nowadays. In this case POSIX compliance can be disregarded - access is locked by one host, not available to others, period. Of course, there is a problem of capturing storage from hosts which died, and avoiding corruptions - but this is hopefully solved in the past decades of clustering tech's. It sounds to me like you need horizontal scaling more than anything else. In that case, why not use pNFS or Lustre? Even if you want snapshots, a VM should be able to handle that on its own, and though probably not as nicely as ZFS in some respects, having the application be in control of the exact snapshot boundaries does mean that you don't have to quiesce your VMs just to snapshot safely. Nico also confirmed that one node has to be a master of all TXGs - which is conveyed in both ideas of my original post. Well, at any one time one node would have to be the master of the next TXG, but it doesn't mean that you couldn't have some cooperation. There are lots of other much more interesting questions. I think the biggest problem lies in requiring full connectivity from every server to every LUN. I'd much rather take the Lustre / pNFS model (which, incidentally, don't preclude having snapshots). Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Wanted: sanity check for a clustered ZFS idea
Also, it's not worth doing a clustered ZFS thing that is too application-specific. You really want to nail down your choices of semantics, explore what design options those yield (or approach from the other direction, or both), and so on. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Wanted: sanity check for a clustered ZFS idea
On Tue, Oct 11, 2011 at 11:15 PM, Richard Elling richard.ell...@gmail.com wrote: On Oct 9, 2011, at 10:28 AM, Jim Klimov wrote: ZFS developers have for a long time stated that ZFS is not intended, at least not in near term, for clustered environments (that is, having a pool safely imported by several nodes simultaneously). However, many people on forums have wished having ZFS features in clusters. ...and UFS before ZFS… I'd wager that every file system has this RFE in its wish list :-) Except the ones that already have it! :) Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Wanted: sanity check for a clustered ZFS idea
On Sun, Oct 9, 2011 at 12:28 PM, Jim Klimov jimkli...@cos.ru wrote: So, one version of the solution would be to have a single host which imports the pool in read-write mode (i.e. the first one which boots), and other hosts would write thru it (like iSCSI or whatever; maybe using SAS or FC to connect between reader and writer hosts). However they would read directly from the ZFS pool using the full SAN bandwidth. You need to do more than simply assign a node for writes. You need to send write and lock requests to one node. And then you need to figure out what to do about POSIX write visibility rules (i.e., when a write should be visible to other readers). I think you'd basically end up not meeting POSIX in this regard, just like NFS, though perhaps not with close-to-open semantics. I don't think ZFS is the beast you're looking for. You want something more like Lustre, GPFS, and so on. I suppose someone might surprise us one day with properly clustered ZFS, but I think it'd be more likely that the filesystem would be ZFS-like, not ZFS proper. Second version of the solution is more or less the same, except that all nodes can write to the pool hardware directly using some dedicated block ranges owned by one node at a time. This would work much like a ZIL containing both data and metadata. Perhaps these ranges would be whole metaslabs or some other ranges as agreed between the master node and other nodes. This is much hairier. You need consistency. If two processes on different nodes are writing to the same file, then you need to *internally* lock around all those writes so that the on-disk structure ends up being sane. There are a number of things you could do here, such as, for example, having a per-node log and one node coalescing them (possibly one node per file, but even then one node has to be the master of every txg). And still you need to be careful about POSIX semantics. That does not come for free in any design -- you will need something like the Lustre DLM (distributed lock manager). Or else you'll have to give up on POSIX. There's a hefty price to be paid for POSIX semantics in a clustered environment. You'd do well to read up on Lustre's experience in detail. And not just Lustre -- that would be just to start. I caution you that this is not a simple project. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs diff performance disappointing
On Mon, Sep 26, 2011 at 1:55 PM, Jesus Cea j...@jcea.es wrote: I just upgraded to Solaris 10 Update 10, and one of the improvements is zfs diff. Using the birthtime of the sectors, I would expect very high performance. The actual performance doesn't seem better than a standard rdiff, though. Quite disappointing... Should I disable atime to improve zfs diff performance? (most data doesn't change, but atime of most files would change). atime has nothing to do with it. How much work zfs diff has to do depends on how much has changed between snapshots. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs diff performance disappointing
Ah yes, of course. I'd misread your original post. Yes, disabling atime updates will reduce the number of superfluous transactions. It's *all* transactions that count, not just the ones the app explicitly caused, and atime implies lots of transactions.
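The knob is per-dataset (dataset name made up):

    # Stop recording access times; reads then dirty no metadata:
    zfs set atime=off tank/fs
    zfs get atime tank/fs    # verify

Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss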
Re: [zfs-discuss] zfs scripts
On Fri, Sep 9, 2011 at 5:33 AM, Sriram Narayanan sri...@belenix.org wrote: Plus, you'll need an '&' character at the end of each command. And a wait command, if you want the script to wait for the sends to finish (which you should).
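Something like this sketch, with made-up dataset, snapshot, and host names:

    #!/bin/sh
    # Start several sends in parallel...
    for fs in tank/a tank/b tank/c; do
        zfs send "$fs@today" | ssh backuphost "zfs recv -d backup" &
    done
    # ...and don't exit (or destroy snapshots) until they all finish.
    wait

Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss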
Re: [zfs-discuss] SSD vs hybrid drive - any advice?
On Wed, Jul 27, 2011 at 9:22 PM, Daniel Carosone d...@geek.com.au wrote: Absent TRIM support, there's another way to do this, too. It's pretty easy to dd /dev/zero to a file now and then. Just make sure zfs doesn't prevent these being written to the SSD (compress and dedup are off). I have a separate fill dataset for this purpose, to avoid keeping these zeros in auto-snapshots too. Nice. Seems to me that it'd be nicer to have an interface to raw flash (no wear leveling, direct access to erasure, read, write, read-modify-write [as an optimization]). Then the filesystem could do a much better job of using flash efficiently. But a raw interface wouldn't be a disk-compatible interface.
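A sketch of the fill trick (names made up; compression and dedup must be off on that dataset or the zeros never reach the SSD, and the auto-snapshot property shown is the time-slider one):

    zfs create -o compression=off -o dedup=off \
        -o com.sun:auto-snapshot=false tank/fill
    # Now and then: hand the free space to the SSD as zeros, then free it.
    # (Consider a quota on the dataset so you don't run the pool to 100%.)
    dd if=/dev/zero of=/tank/fill/zeros bs=1024k
    rm /tank/fill/zeros

Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss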
Re: [zfs-discuss] Summary: Dedup memory and performance (again, again)
On Jul 9, 2011 1:56 PM, Edward Ned Harvey opensolarisisdeadlongliveopensola...@nedharvey.com wrote: Given the abysmal performance, I have to assume there is a significant number of overhead reads or writes in order to maintain the DDT for each actual block write operation. Something I didn't mention in the other email is that I also tracked iostat throughout the whole operation. It's all writes (or at least 99.9% writes.) So I am forced to conclude it's a bunch of small DDT maintenance writes taking place and incurring access time penalties in addition to each intended single block access time penalty. The nature of the DDT is that it's a bunch of small blocks, that tend to be scattered randomly, and require maintenance in order to do anything else. This sounds like precisely the usage pattern that benefits from low latency devices such as SSDs. The DDT should be written to in COW fashion, and asynchronously, so there should be no access time penalty. Or so ISTM it should be. Dedup is necessarily slower for writing because of the deduplication table lookups. Those are synchronous lookups, but for async writes you'd think that total write throughput would only be affected by a) the additional read load (which is zero in your case) and b) any inability to put together large transactions due to the high latency of each logical write, but (b) shouldn't happen, particularly if the DDT fits in RAM or L2ARC, as it does in your case. So, at first glance my guess is ZFS is leaving dedup write performance on the table, most likely due to implementation reasons, not design reasons.
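For what it's worth, zdb will show how big the DDT has actually gotten and how references are distributed (pool name made up):

    # DDT statistics, including a histogram of reference counts:
    zdb -DD tank
    # Simulate dedup on existing data to estimate the would-be ratio:
    zdb -S tank

Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss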
Re: [zfs-discuss] Encryption accelerator card recommendations.
IMO a faster processor with built-in AES and other crypto support is most likely to give you the most bang for your buck, particularly if you're using closed Solaris 11, as Solaris engineering is likely to add support for new crypto instructions faster than Illumos (but I don't really know enough about Illumos to say for sure).
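On Solaris you can check what the CPU already offers before spending money (output varies by machine):

    # Look for aes (and friends) among the instruction-set extensions:
    isainfo -v
    # What the kernel crypto framework sees as providers:
    cryptoadm list

Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss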
Re: [zfs-discuss] Encryption accelerator card recommendations.
On Jun 27, 2011 9:24 PM, David Magda dma...@ee.ryerson.ca wrote: AES-NI is certainly better than nothing, but RSA, SHA, and the RNG would be nice as well. It'd also be handy for ZFS crypto in addition to all the network IO stuff. The most important reason for AES-NI might be not performance but defeating side-channel attacks. Also, really fast AES HW makes AES-based hash functions quite tempting. No, AES-NI is nothing to sneeze at. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Encryption accelerator card recommendations.
On Jun 27, 2011 4:15 PM, David Magda dma...@ee.ryerson.ca wrote: The (Ultra)SPARC T-series processors do, but to a certain extent it goes against a CPU manufacturer's best (financial) interest to provide this: crypto is very CPU intensive using 'regular' instructions, so if you need to do a lot of it, it would force you to purchase a manufacturer's top-of-the-line CPUs, and to have as many sockets as you can to handle a load (and presumably you need to do useful work besides just en/decrypting traffic). I hope no CPU vendor thinks about the economics of chip making that way. I actually doubt any do. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Versioning FS was: question about COW and snapshots
As Casper pointed out, the right thing to do is to build applications such that they can detect mid-transaction state and roll it back (or forward, if there's enough data). Then mid-transaction snapshots are fine, and the lack of APIs by which to inform the filesystem of application transaction boundaries becomes much less of an issue (adding such APIs is not a good solution, since it'd take many years for apps to take advantage of them and more years still for legacy apps to be upgraded or decommissioned). The existing FS interfaces provide enough that one can build applications this way. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] question about COW and snapshots
On Thu, Jun 16, 2011 at 8:51 AM, casper@oracle.com wrote: If a database engine or another application keeps both the data and the log in the same filesystem, a snapshot wouldn't create inconsistent data (I think this would be true with vim and a large number of database engines; vim will detect the swap file and the database should be able to detect the inconsistency and roll back and re-apply the log file.) Correct. SQLite3 will be able to recover automatically from restores of mid-transaction snapshots. VIM does not recover automatically, but it does notice the swap file and warns the user and gives them a way to handle the problem. (When you save a file, VIM renames the old one out of the way, creates a new file with the original name, writes the new contents to it, closes it, then unlinks the swap file. On recovery VIM notices the swap file and gives the user a menu of choices.) I believe this is the best solution: write applications so they can recover from being restarted with data restored from a mid-transaction snapshot.
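The same rename-based trick works for any program that rewrites whole files; a minimal sketch (paths and the generator command are invented):

    #!/bin/sh
    # Atomically replace $file: write a temp file in the same
    # filesystem, then rename over the original.  A snapshot taken
    # at any instant sees the old file or the new one, never a
    # half-written mix.
    file=/tank/data/config
    tmp="$file.tmp.$$"
    generate_new_contents > "$tmp"    # hypothetical generator
    mv "$tmp" "$file"                 # rename(2) is atomic

Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss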
Re: [zfs-discuss] question about COW and snapshots
That said, losing committed transactions when you needed and thought you had ACID semantics... is bad. But that's implied in any restore-from-backups situation. So you replicate/distribute transactions so that restore from backups (or snapshots) is an absolute last resort, and if you ever have to restore from backups you also spend time manually tracking down (from counterparties, paper trails kept elsewhere, ...) any missing transactions. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Hard link space savings
On Mon, Jun 13, 2011 at 5:50 AM, Roy Sigurd Karlsbakk r...@karlsbakk.net wrote: If anyone has any ideas be it ZFS based or any useful scripts that could help here, I am all ears. Something like this one-liner will show what would be allocated by everything if hardlinks weren't used: # size=0; for i in `find . -type f -exec du {} \; | awk '{ print $1 }'`; do size=$(( $size + $i )); done; echo $size Oh, you don't want to do that: you'll run into max argument list size issues. Try this instead:

    (echo 0; find . -type f \! -links 1 | xargs stat -c '%b %B *+' $p; echo p) | dc

;) xargs is your friend (and so is dc... RPN FTW!). Note that I'm not printing the number of links because find will print a name for every link (well, if you do the find from the root of the relevant filesystem), so we'd be counting too much space. You'll need the GNU stat(1). Or you could do something like this using the ksh stat builtin:

    ( echo 0
      find . -type f \! -links 1 | while read p; do
          stat -c '%b %B *+' "$p"
      done
      echo p ) | dc

Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Hard link space savings
On Mon, Jun 13, 2011 at 12:59 PM, Nico Williams n...@cryptonector.com wrote: Try this instead: (echo 0; find . -type f \! -links 1 | xargs stat -c %b %B *+ $p; echo p) | dc s/\$p// ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Hard link space savings
And, without a sub-shell:

    find . -type f \! -links 1 | xargs stat -c '%b %B *+p' | dc 2>/dev/null | tail -1

(The stderr redirection is because otherwise dc whines once that the stack is empty, and the tail is because we print interim totals as we go.) Also, this doesn't quite work, since it counts every link, when we want to count all but one link. This, then, is what will tell you how much space you saved due to hardlinks:

    find . -type f \! -links 1 | xargs stat -c '8k %b %B * %h 1 - * %h / +p' | dc 2>/dev/null | tail -1

Excuse my earlier brainfarts :) Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Hard link space savings
On Sun, Jun 12, 2011 at 4:14 PM, Scott Lawson scott.law...@manukau.ac.nz wrote: I have an interesting question that may or may not be answerable from some internal ZFS semantics. This is really standard Unix filesystem semantics. [...] So total storage used is around ~7.5MB due to the hard linking taking place on each store. If hard linking capability had been turned off, this same message would have used 1500 x 2MB = 3GB worth of storage. My question is: are there any simple ways of determining the space savings on each of the stores from the usage of hard links? [...] But... you just did! :) It's: number of hard links * (file size + sum(size of link names and/or directory slot size)). For sufficiently large files (say, larger than one disk block) you could approximate that as: number of hard links * file size. The key is the number of hard links, which will typically vary, but for e-mails that go to all users, well, you know the number of links then is the number of users. You could write a script to do this -- just look at the size and hard-link count of every file in the store, apply the above formula, add up the inflated sizes, and you're done.
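The script really is small; a sketch with GNU find (store path made up), using the links * size approximation from above:

    # Print nlinks, size, inode for every multiply-linked file; count
    # each inode once; space saved is (nlinks - 1) * size, summed.
    find /store -type f \! -links 1 -printf '%n %s %i\n' |
        sort -u -k3,3 |
        awk '{ saved += ($1 - 1) * $2 } END { print saved, "bytes saved" }'

Nico PS: Is it really the case that Exchange still doesn't deduplicate e-mails? Really? It's much simpler to implement dedup in a mail store than in a filesystem... ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss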
Re: [zfs-discuss] ZFS, Oracle and Nexenta
On May 25, 2011 7:15 AM, Garrett D'Amore garr...@nexenta.com wrote: You are welcome to your beliefs. There are many groups that do standards that do not meet in public. [...] True. [...] In fact, I can't think of any standards bodies that *do* hold open meetings. I can: the IETF, for example. All business of the IETF is transacted or confirmed on open participation mailing lists, and IETF meetings are known as NOTE WELL meetings because of the notice given at their opening regarding the fact that the meeting is public and resulting considerations regarding, e.g., trade secrets. Mind you, there are many more standards-setting organizations that don't have open participation, such as OASIS, ISO, and so on. I don't begrudge you starting closed, or even staying closed, though I would prefer that at least the output of any ZFS standards org be open. Also, I would recommend that you eventually consider creating a new open participation list for non-members (separate from any members-only list). Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [cryptography] rolling hashes, EDC/ECC vs MAC/MIC, etc.
On Sun, May 22, 2011 at 10:20 AM, Richard Elling richard.ell...@gmail.com wrote: ZFS already tracks the blocks that have been written, and the time that they were written. So we already know when something was written, though that does not answer the question of whether the data was changed. I think it is a pretty good bet that newly written data is different :-) Not really. There's bp rewrite (assuming that ever ships, or gets implemented elsewhere), for example. Then, the filesystem should make this Merkle Tree available to applications through a simple query. Something like zfs diff ? That works within a filesystem. And zfs send/recv works when you have one filesystem faithfully tracking another. When you have two filesystems with similar contents, and the history of each is useless in deciding how to do a bi-directional synchronization, then you need a way to diff files that is not based on intra-filesystem history. The rsync algorithm is the best high-performance algorithm that we have for determining differences between files separated by a network. My proposal (back then, and Zooko's now) is to leverage work that the filesystem does anyways to implement a high-performance remote diff that is faster than rsync for the simple reason that some of the rsync algorithm essentially gets pre-computed. This would enable applications -- without needing any further in-filesystem code -- to perform a Merkle Tree sync, which would range from noticeably more efficient to dramatically more efficient than rsync or zfs send. :-) Since ZFS send already has an option to only send the changed blocks, I disagree with your assertion that your solution will be dramatically more efficient than zfs send. We already know zfs send is much more efficient than rsync for large file systems. You missed Zooko's point completely. It might help to know that Zooko works on a project called Tahoe Least-Authority Filesystem, which is by nature distributed. Once you lose the constraints of not having a network or having uni-directional replication only, I think you'll get it. Or perhaps you'll argue that no one should ever need bi-di replication, that if one finds oneself wanting that then one has taken a wrong turn somewhere. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [cryptography] rolling hashes, EDC/ECC vs MAC/MIC, etc.
On Sun, May 22, 2011 at 1:52 PM, Nico Williams n...@cryptonector.com wrote: [...] Or perhaps you'll argue that no one should ever need bi-di replication, that if one finds oneself wanting that then one has taken a wrong turn somewhere. You could also grant the premise and argue instead that nothing the filesystem can do to speed up remote bi-di sync is worth the cost -- an argument that requires a lot more analysis. For example, if the filesystem were to compute and store rsync rolling CRC signatures, well, that would require significant compute and storage resources, and it might not speed up synchronization enough to ever be worthwhile. Similarly, a Merkle hash tree based on rolling hash functions (and excluding physical block pointer details) might require each hash output to grow linearly with block size in order to retain the rolling hash property (I'm not sure about this; I know very little about rolling hash functions), in which case the added complexity would be a complete non-starter. Whereas a Merkle hash tree built with regular hash functions would not be able to resolve insertions/deletions of data chunks of size that is not a whole multiple of block size. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ls reports incorrect file size
Also, sparseness need not be apparent to applications. Until recent improvements to lseek(2) to expose hole/non-hole offsets, the only way to know about sparseness was to notice that a file's reported size is more than the file's reported filesystem blocks times the block size. Sparse files in Unix go back at least to the early 80s. If a filesystem protocol, such as CIFS (I've no idea if it supports sparse files), were to not support sparse files, all that would mean is that the server must report a number of blocks that matches a file's size (assuming the protocol in question even supports any notion of reporting a file's size in blocks). There are really two ways in which a filesystem protocol could support sparse files: a) by reporting file size in bytes and blocks, b) by reporting lists of file offsets demarcating holes from non-holes. (b) is a very new idea; Lustre may be the only filesystem that I know of that supports this (see the Linux FIEMAP APIs), though work is in progress to add this to NFSv4.
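The old size-versus-blocks check is easy to demonstrate (path made up):

    # Make a ~1GB file that allocates almost nothing: seek far, write one byte.
    dd if=/dev/zero of=/tank/fs/sparse bs=1 count=1 seek=1073741823
    ls -l /tank/fs/sparse    # apparent size: ~1GB
    du -k /tank/fs/sparse    # blocks actually allocated: a few KB at most

Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss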
Re: [zfs-discuss] ls reports incorrect file size
On Mon, May 2, 2011 at 3:56 PM, Eric D. Mudama edmud...@bounceswoosh.org wrote: Yea, kept googling and it makes sense. I guess I am simply surprised that the application would have done the seek+write combination, since on NTFS (which doesn't support sparse) these would have been real 1.5GB files, and there would be hundreds or thousands of them in normal usage. It could have been smbd compressing long runs of zeros. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ls reports incorrect file size
Then again, Windows apps may be doing seek+write to pre-allocate storage. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] disable zfs/zpool destroy for root user
On Thu, Feb 17, 2011 at 3:07 PM, Richard Elling richard.ell...@gmail.com wrote: On Feb 17, 2011, at 12:44 PM, Stefan Dormayer wrote: Hi all, is there a way to disable the subcommand destroy of zpool/zfs for the root user? Which OS? Heheh. Great answer. The real answer depends also on what the OP meant by root. root in Solaris isn't the all-powerful thing it used to be, or, rather, it is, but its power can be limited. And not just on Solaris either. The OP's question is difficult to answer because the question isn't the one the OP really wants to ask -- we must tease out that real question, or guess. I'd start with: just what is it that you want to accomplish? Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] RAID Failure Calculator (for 8x 2TB RAIDZ)
On Feb 14, 2011 6:56 AM, Paul Kraus p...@kraus-haus.org wrote: P.S. I am measuring number of objects via `zdb -d` as that is faster than trying to count files and directories and I expect is a much better measure of what the underlying zfs code is dealing with (a particular dataset may have lots of snapshot data that does not (easily) show up). It's faster because: a) no atime updates, b) no ZPL overhead. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS
On Mon, Feb 7, 2011 at 1:17 PM, Yi Zhang yizhan...@gmail.com wrote: On Mon, Feb 7, 2011 at 1:51 PM, Brandon High bh...@freaks.com wrote: Maybe I didn't make my intention clear. UFS with directio is reasonably close to a raw disk from my application's perspective: when the app writes to a file location, no buffering happens. My goal is to find a way to duplicate this on ZFS. You're still mixing directio and O_DSYNC. O_DSYNC is like calling fsync(2) after every write(2). fsync(2) is useful to obtain some limited transactional semantics, as well as for durability semantics. In ZFS you don't need to call fsync(2) to get those transactional semantics, but you do need to call fsync(2) to get those durability semantics. Now, in ZFS fsync(2) implies a synchronous I/O operation involving significantly more than just the data blocks you wrote to. Which means that O_DSYNC on ZFS is significantly slower than on UFS. You can address this in one of two ways: a) you might realize that you don't need every write(2) to be durable, then stop using O_DSYNC, b) you might get a fast ZIL device. I'm betting that if you look carefully at your application's requirements you'll probably conclude that you don't need O_DSYNC at all. Perhaps you can tell us more about your application. Setting primarycache didn't eliminate the buffering, and using O_DSYNC (whose side effects include elimination of buffering) made it ridiculously slow: none of the things I tried eliminated buffering, and just buffering, on ZFS. From the discussion so far my feeling is that ZFS is so different from UFS that there's simply no way to achieve this goal... You've not really stated your application's requirements. You may be convinced that you need O_DSYNC, but chances are that you don't. And yes, it's possible that you'd need O_DSYNC on UFS but not on ZFS.
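If the durability requirement survives that scrutiny, option (b) looks like this (pool and device names made up):

    # Add a dedicated low-latency log device (slog) to absorb the
    # synchronous writes that O_DSYNC/fsync(2) generate:
    zpool add tank log c4t1d0
    # Watch how it's being used:
    zpool iostat -v tank 5

Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss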