Sorry for resurrecting this interesting discussion so late:  I'm skimming 
backwards through the forum.

One comment about segregating database logs is that people who take their data 
seriously often want a 'belt plus suspenders' approach to recovery.  
Conventional RAID, even supplemented with ZFS's self-healing scrubbing, isn't 
sufficient (though RAID-6 might be):  they want at least the redo logs kept 
separate so that, in the extremely unlikely event that they lose something in 
the (already replicated) database, the failure is guaranteed not to have 
affected the redo logs as well, which can then be replayed against a backup to 
reconstruct the current database state.

True, this will mean that you can't aggregate redo log activity with other 
transaction bulk-writes, but that's at least partly good as well:  databases 
are often extremely sensitive to redo log write latency and would not want such 
writes delayed by combination with other updates, let alone by up to a 5-second 
delay.

ZFS's synchronous write intent log could help here (if you replicate it:  
serious database people would consider even the very temporary exposure to a 
single failure inherent in an unmirrored log completely unacceptable), but the 
intent log could also be slowed by other small synchronous-write activity; 
conversely, databases often couldn't care less about the latency of many of 
their other writes, because their own (replicated) redo log has already 
established the persistence that they need.

As for direct I/O, it's not clear why ZFS couldn't support it:  it could verify 
each read in user memory against its internal checksum and perform its 
self-healing magic if necessary before returning completion status (the same 
status it would return if the same situation occurred during its normal mode 
of operation:  either unconditional success, or success-after-recovery if the 
application cares to know the difference).  It could handle each synchronous 
write analogously, and if direct I/O mechanisms support lazy writes then 
presumably they tie up the user buffer until the write completes, so you could 
use your normal mechanisms there as well (just operating on the user buffer 
instead of your cache).  In this I'm assuming that 'direct I/O' refers not to 
raw device access but to file-oriented access that simply avoids any internal 
cache use, such that you could still use your no-overwrite approach.
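
To make the read side concrete, here's a minimal user-level sketch (my own toy 
code, not ZFS code and not a claim about its internals):  read one block with 
direct I/O into a suitably aligned buffer and verify a checksum over it in user 
memory before treating the read as complete.  The 128 KB block size, the simple 
Fletcher-style sum, and the Linux-spelled O_DIRECT flag (Solaris would hint 
with directio(3C) instead) are all illustrative assumptions; inside the 
filesystem the expected checksum would come from its own metadata, and a 
mismatch would trigger self-healing from a redundant copy before completion 
status was returned:

    /* Sketch only: verify a direct-I/O read in the user's buffer before
     * treating it as complete.  Illustrative assumptions throughout. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define BLK (128 * 1024)            /* one 128 KB checksum unit */

    /* Simple Fletcher-style sum over the block (placeholder algorithm). */
    static uint64_t fletcher_sum(const uint64_t *p, size_t bytes)
    {
        uint64_t a = 0, b = 0;
        for (size_t i = 0; i < bytes / sizeof(*p); i++) {
            a += p[i];
            b += a;
        }
        return a ^ b;
    }

    int main(int argc, char **argv)
    {
        if (argc < 2) {
            fprintf(stderr, "usage: %s <file> [expected-checksum-hex]\n",
                    argv[0]);
            return 2;
        }
        int fd = open(argv[1], O_RDONLY | O_DIRECT);
        void *buf;
        if (fd < 0 || posix_memalign(&buf, 4096, BLK) != 0) {
            perror("setup");
            return 1;
        }
        if (pread(fd, buf, BLK, 0) != BLK) {       /* read block 0 */
            perror("pread");
            return 1;
        }
        uint64_t sum = fletcher_sum(buf, BLK);
        printf("block 0 checksum: %016llx\n", (unsigned long long)sum);
        if (argc > 2 && sum != strtoull(argv[2], NULL, 16))
            /* Here the filesystem would re-read a redundant copy into the
             * user buffer (self-healing) before returning completion. */
            fprintf(stderr, "checksum mismatch: recovery would be attempted\n");
        free(buf);
        close(fd);
        return 0;
    }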

Of course, this also assumes that the direct I/O is always being performed in 
aligned integral multiples of checksum units by the application.  If not, you'd 
either have to bag the checksum facility (not an entirely unreasonable option 
to offer, given that some sophisticated applications might want to use their 
own even higher-level integrity mechanisms, e.g., across 
geographically-separated sites, and would not need yours) or run everything 
through cache as you normally do.  In suitably-aligned cases where you do 
validate the data, you could avoid half the copy overhead (an issue of memory 
bandwidth as well as of operation latency:  TPC-C submissions can be affected 
by this, though it may be rare in real-world use) by integrating the checksum 
calculation with the copy, but you'd still have multiple copies of the data 
taking up memory in a situation (direct I/O) where the application *by 
definition* does not expect you to be caching the data (quite likely because it 
is doing any desirable caching itself).
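
That 'checksum while you copy' trick is easy to illustrate.  The sketch below 
(again my own toy code, nothing from ZFS) fuses the cache-to-user-buffer copy 
with a Fletcher-style sum so the data is pulled through memory once rather 
than in separate copy and verify passes; the block size and checksum choice 
are just illustrative assumptions:

    /* Fuse the validation pass with the cache-to-user copy: one trip
     * through memory instead of two. */
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    static uint64_t copy_and_checksum(uint64_t *dst, const uint64_t *src,
                                      size_t words)
    {
        uint64_t a = 0, b = 0;
        for (size_t i = 0; i < words; i++) {
            uint64_t w = src[i];   /* one load services both the copy ... */
            dst[i] = w;            /* ... and the running checksum        */
            a += w;
            b += a;
        }
        return a ^ b;
    }

    int main(void)
    {
        enum { WORDS = 128 * 1024 / 8 };          /* one 128 KB block */
        static uint64_t cache_block[WORDS], user_buf[WORDS];
        for (size_t i = 0; i < WORDS; i++)
            cache_block[i] = i * 2654435761u;     /* arbitrary test pattern */

        uint64_t sum = copy_and_checksum(user_buf, cache_block, WORDS);
        printf("copied 128 KB, checksum %016llx, buffers match: %d\n",
               (unsigned long long)sum,
               memcmp(user_buf, cache_block, sizeof user_buf) == 0);
        return 0;
    }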

Tablespace contiguity may, however, be a deal-breaker for some users:  it is 
common for tablespaces to be scanned sequentially (when selection criteria 
don't mesh with existing indexes, perhaps especially in joins where the smaller 
tablespace, still too large to be retained in cache, is scanned repeatedly in 
an inner loop), and a DBMS often goes to some effort to keep them 
defragmented.  Until ZFS provides some effective continuous defragmentation 
mechanism of its own, its no-overwrite policy may do more harm than good in 
such cases (since the database's own logs keep persistence latency low, the 
backing tablespaces can be updated at leisure anyway).

I do want to comment on the observation that "enough concurrent 128K I/O can 
saturate a disk" - the apparent implication being that one could therefore do 
no better with larger accesses, which is an incorrect conclusion.  Current 
disks can stream out 128 KB in 1.5 - 3 ms, while taking 5.5 - 12.5 ms for the 
average-seek-plus-partial-rotation required to get to that 128 KB in the first 
place.  Thus on a full drive, serial random accesses to 128 KB chunks will 
yield only about 20% of the drive's streaming capability (by contrast, serial 
random accesses in 4 MB contiguous chunks achieve around 90% of it).  One can 
do better on disks that support queuing if one allows queues to form, but that 
trades significantly increased average operation latency for the increase in 
throughput (and the increase still falls far short of the 90% utilization 
achievable with 4 MB chunking).
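
If anyone wants to check the arithmetic, a back-of-the-envelope calculation is 
below.  The 65 MB/s streaming rate and 9 ms average positioning time are just 
mid-range figures picked from the numbers above, not measurements of any 
particular drive:

    /* Effective disk utilization vs. access chunk size: transfer time
     * divided by transfer-plus-positioning time. */
    #include <stdio.h>

    int main(void)
    {
        const double stream_mb_s = 65.0;   /* ~128 KB in ~2 ms */
        const double position_ms = 9.0;    /* avg seek + partial rotation */
        const double chunks_kb[] = { 0.5, 128, 4096 };

        for (int i = 0; i < 3; i++) {
            double xfer_ms = chunks_kb[i] / 1024.0 / stream_mb_s * 1000.0;
            double util = xfer_ms / (xfer_ms + position_ms);
            printf("%7.1f KB chunks: %6.2f ms transfer + %.1f ms positioning"
                   " -> %4.1f%% of streaming bandwidth\n",
                   chunks_kb[i], xfer_ms, position_ms, util * 100.0);
        }
        return 0;
    }

With those figures the 128 KB case comes out near 18% and the 4 MB case near 
87%, which is where the rough 20% and 90% numbers above come from.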

Enough concurrent 0.5 KB I/O can also saturate a disk, after all - but this 
says little about effective utilization.

Others have touched on several of these points as well - apologies for any 
repetition arising from writing while I read.

- bill
 
 