Sorry for resurrecting this interesting discussion so late: I'm skimming backwards through the forum.
One comment about segregating database logs is that people who take their data seriously often want a 'belt plus suspenders' approach to recovery. Conventional RAID, even supplemented with ZFS's self-healing scrubbing, isn't sufficient (though RAID-6 might be): they want at least the redo logs separate, so that in the extremely unlikely event that they lose something in the (already replicated) database, the failure is guaranteed not to have affected the redo logs as well - and from those logs plus a backup they can reconstruct the current database state. True, this means you can't aggregate redo log activity with other transaction bulk-writes, but that's at least partly good as well: databases are often extremely sensitive to redo log write latency and would not want such writes delayed by combination with other updates, let alone by up to a 5-second delay. ZFS's synchronous write intent log could help here (if you replicate it: serious database people would consider even the very temporary exposure to a single failure inherent in an unmirrored log completely unacceptable), but that could also be slowed by other small synchronous writes; conversely, databases often couldn't care less about the latency of many of their other writes, because their own (replicated) redo log has already established the persistence that they need.
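To make the latency-sensitivity concrete: each commit typically issues a synchronous append to the redo log, and nothing else the database does can proceed past the commit until that write is on stable storage. A minimal sketch (Python; the path and record format are made up for illustration, and O_DSYNC availability varies by platform, hence the fallback):

```python
import os

# Hypothetical redo-log append. With O_DSYNC (where the platform provides
# it), each write is durable before the call returns - which is exactly
# why anything that delays these writes delays every commit.
O_DSYNC = getattr(os, "O_DSYNC", 0)  # fall back to buffered I/O if unsupported

fd = os.open("/tmp/redo.log",
             os.O_WRONLY | os.O_CREAT | os.O_APPEND | O_DSYNC, 0o600)
try:
    record = b"txn-42:COMMIT\n"  # hypothetical redo record
    os.write(fd, record)         # with O_DSYNC, returns only once durable
finally:
    os.close(fd)
```

Batching such an append behind other bulk writes (or a multi-second transaction-group flush) directly stretches commit latency, which is the point above.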
As for direct I/O, it's not clear why ZFS couldn't support it: it could verify each read in user memory against its internal checksum and perform its self-healing magic if necessary before returning completion status (the same status it would return if the same situation arose during its normal mode of operation: either unconditional success, or success-after-recovery if the application might care to know that). It could handle each synchronous write analogously, and if direct I/O mechanisms support lazy writes then presumably they tie up the user buffer until the write completes, so you could use your normal mechanisms there as well (just operating on the user buffer instead of your cache). In this I'm assuming that 'direct I/O' refers not to raw device access but to file-oriented access that simply avoids any internal cache use, such that you could still use your no-overwrite approach. Of course, this also assumes that the application always performs its direct I/O in aligned, integral multiples of checksum units; if not, you'd either have to bag the checksum facility (not an entirely unreasonable option to offer, given that some sophisticated applications might want to use their own even higher-level integrity mechanisms, e.g., across geographically-separated sites, and would not need yours) or run everything through cache as you normally do. In suitably-aligned cases where you do validate the data, you could avoid half the copy overhead (an issue of memory bandwidth as well as operation latency: TPC-C submissions can be affected by this, though it may be rare in real-world use) by integrating the checksum calculation with the copy, but would still have multiple copies of the data taking up memory in a situation (direct I/O) where the application *by definition* does not expect you to be caching the data (quite likely because it is doing any desirable caching itself).
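The verify-on-read idea above can be sketched in miniature. This is a toy model, not ZFS: checksums live in a side file rather than in block-pointer metadata, SHA-256 stands in for whatever checksum the filesystem uses, and the 4 KB block size is an assumption - but it shows the shape of "read into the caller's buffer, then verify against the stored checksum before returning", including the alignment requirement:

```python
import hashlib

BLOCK = 4096  # assumed checksum unit; direct I/O must be block-aligned

def checksummed_write(path, data):
    """Toy model: store each aligned block's checksum in a side file.
    (A real filesystem keeps checksums in its own metadata.)"""
    assert len(data) % BLOCK == 0, "direct I/O requires integral, aligned blocks"
    with open(path, "wb") as f:
        f.write(data)
    sums = [hashlib.sha256(data[i:i + BLOCK]).digest()
            for i in range(0, len(data), BLOCK)]
    with open(path + ".sums", "wb") as f:
        f.write(b"".join(sums))

def verified_read(path, offset, length):
    """Read, then verify each block against its stored checksum before
    returning - the step a filesystem would add to a direct-I/O read."""
    assert offset % BLOCK == 0 and length % BLOCK == 0
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read(length)
    with open(path + ".sums", "rb") as f:
        sums = f.read()
    for i in range(0, length, BLOCK):
        blk = (offset + i) // BLOCK
        expect = sums[blk * 32:(blk + 1) * 32]  # SHA-256 digests are 32 bytes
        if hashlib.sha256(data[i:i + BLOCK]).digest() != expect:
            # Here a self-healing filesystem would retry from a redundant copy.
            raise IOError(f"checksum mismatch in block {blk}")
    return data
```

The unaligned case is visible in the asserts: a read that isn't an integral multiple of the checksum unit can't be verified in place, which is why the choices reduce to "no checksum" or "go through cache".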
Tablespace contiguity may, however, be a deal-breaker for some users: it is common for tablespaces to be scanned sequentially (when selection criteria don't mesh with existing indexes, perhaps especially in joins where the smaller tablespace - still too large to be retained in cache, though - is scanned repeatedly in an inner loop), and a DBMS often goes to some effort to keep them defragmented. Until ZFS provides some effective continuous defragmenting mechanism of its own, its no-overwrite policy may do more harm than good in such cases (since the database's own logs keep persistence latency low, while the backing tablespaces can then be updated at leisure).

I do want to comment on the observation that "enough concurrent 128K I/O can saturate a disk" - the apparent implication being that one could therefore do no better with larger accesses, which is an incorrect conclusion. Current disks can stream out 128 KB in 1.5 - 3 ms, while taking 5.5 - 12.5 ms for the average-seek-plus-partial-rotation required to get to that 128 KB in the first place. Thus on a full drive, serial random accesses to 128 KB chunks will yield only about 20% of the drive's streaming capability (by contrast, serial random accesses in 4 MB contiguous chunks achieve around 90% of it). One can do better on disks that support queuing if one allows queues to form, but this trades significantly increased average operation latency for the increase in throughput (and the increase still falls far short of the 90% utilization one could achieve using 4 MB chunking). Enough concurrent 0.5 KB I/O can also saturate a disk, after all - but that says little about effective utilization.

Others have touched on several of these points as well - apologies for any repetition arising from writing while I read.
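The 20% and 90% figures follow directly from the arithmetic. A back-of-envelope sketch (Python; the streaming rate and positioning cost are midpoints of the ranges quoted above, not measurements of any particular drive):

```python
# Utilization of a disk's streaming bandwidth under serial random
# accesses of a given chunk size. Assumed figures, taken as midpoints
# of the ranges in the discussion: 128 KB streams out in ~2.25 ms
# (i.e. ~55.6 MB/s), and each random access first pays ~9 ms of
# average seek plus partial rotation.
STREAM_MBPS = 55.6   # sustained streaming rate, MB/s (assumption)
SEEK_MS = 9.0        # positioning cost per random access, ms (assumption)

def utilization(chunk_kb):
    """Fraction of streaming bandwidth achieved when every chunk_kb
    of transfer also pays one full positioning delay."""
    transfer_ms = (chunk_kb / 1024.0) / STREAM_MBPS * 1000.0
    return transfer_ms / (transfer_ms + SEEK_MS)

for kb in (0.5, 128, 4096):
    print(f"{kb:>6} KB chunks: {utilization(kb):5.1%} of streaming rate")
```

With these numbers, 128 KB chunks land right around 20% and 4 MB chunks around 89% - and 0.5 KB chunks "saturate" the disk at a small fraction of a percent of its streaming rate, which is the point about saturation saying little about utilization.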
- bill

This message posted from opensolaris.org
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss