Re: [zfs-discuss] Re: ZFS and databases
billtodd wrote:
> I do want to comment on the observation that enough concurrent 128K I/O can saturate a disk - the apparent implication being that one could therefore do no better with larger accesses, an incorrect conclusion. Current disks can stream out 128 KB in 1.5 - 3 ms, while taking 5.5 - 12.5 ms for the average-seek-plus-partial-rotation required to get to that 128 KB in the first place. Thus on a full drive, serial random accesses to 128 KB chunks will yield only about 20% of the drive's streaming capability (by contrast, accessing data using serial random accesses in 4 MB contiguous chunks achieves around 90% of a drive's streaming capability). One can do better on disks that support queuing if one allows queues to form, but this trades significantly increased average operation latency for the increase in throughput (and said increase still falls far short of the 90% utilization one could achieve using 4 MB chunking). Enough concurrent 0.5 KB I/O can also saturate a disk, after all - but this says little about effective utilization.

I think I can summarize where we are at on this. This is the classic big-{packet|block|$-line|bikini} versus small-{packet|block|$-line|bikini} argument. One size won't fit all. The jury is still out on what all of this means for any given application.

-- richard
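For concreteness, here is a back-of-the-envelope check of those figures as a small C program; the 65 MB/s streaming rate and 9 ms average positioning time are assumed round numbers inside the ranges quoted above, not measurements:

/* Back-of-envelope check of billtodd's numbers: effective throughput
 * of serial random access at 128 KB vs. 4 MB chunk sizes.
 * Assumed figures: ~65 MB/s streaming rate (128 KB in ~2 ms) and
 * ~9 ms average seek plus partial rotation. */
#include <stdio.h>

int main(void)
{
    double stream_mb_s = 65.0;   /* assumed streaming rate, MB/s       */
    double seek_ms     = 9.0;    /* assumed avg seek + partial rotation */
    double chunks_kb[] = { 128.0, 4096.0 };

    for (int i = 0; i < 2; i++) {
        double xfer_ms = chunks_kb[i] / 1024.0 / stream_mb_s * 1000.0;
        double eff = xfer_ms / (xfer_ms + seek_ms);
        printf("%6.0f KB chunks: transfer %5.1f ms, utilization %4.1f%%\n",
               chunks_kb[i], xfer_ms, eff * 100.0);
    }
    return 0;
}

With those inputs it prints roughly 18% utilization for 128 KB chunks and 87% for 4 MB chunks, in line with the ~20% and ~90% quoted above.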
Re: [zfs-discuss] Re: ZFS and databases
For output ops, ZFS could set up a 10 MB I/O transfer to disk starting at sector X, or chunk it up into 128K operations while still assigning the same range of disk blocks to them. Yes, there will be more control information going around and a little more CPU consumed, but the disk will be streaming all right, I would guess. Most heavy output loads will behave this way with ZFS, random or not. The throughput will depend more on the availability of contiguous chunks of disk blocks than on the actual record size in use.

As for random input, the issue is that ZFS does not get a say as to what the application is requesting in terms of size and location. Yes, a 4 MB input of contiguous disk blocks will be faster than randomly reading 128K chunks spread out. But if the application is manipulating 4 MB objects, those will stream out and land on contiguous disk blocks (if available), and those should stream in as well (if our read codepath is clever enough).

The fundamental question is really: is there something that ZFS does that causes data representing an application's logical unit of information (likely to be read as one chunk, and thus data we would like to have contiguous on disk) to actually spread out all over the platter?

-r
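A minimal sketch of that chunking idea in C, assuming plain POSIX pwrite(2) and a caller-supplied file descriptor: the same contiguous byte range is covered whether the write is issued whole or in 128K pieces, so the disk can stream either way and the difference is per-request overhead:

/* Issue one logical write as a series of 128 KB I/Os covering the
 * same contiguous range that a single large write would use. */
#include <unistd.h>
#include <sys/types.h>

#define CHUNK (128 * 1024)

ssize_t chunked_write(int fd, const char *buf, size_t len, off_t base)
{
    size_t done = 0;
    while (done < len) {
        size_t n = (len - done < CHUNK) ? len - done : CHUNK;
        /* Same contiguous block range as one big write would use. */
        ssize_t w = pwrite(fd, buf + done, n, base + (off_t)done);
        if (w < 0)
            return -1;
        done += (size_t)w;
    }
    return (ssize_t)done;
}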
[zfs-discuss] Re: ZFS and databases
Sorry for resurrecting this interesting discussion so late: I'm skimming backwards through the forum.

One comment about segregating database logs is that people who take their data seriously often want a 'belt plus suspenders' approach to recovery. Conventional RAID, even supplemented with ZFS's self-healing scrubbing, isn't sufficient (though RAID-6 might be): they want at least the redo logs separate, so that in the extremely unlikely event that they lose something in the (already replicated) database, the failure is guaranteed not to have affected the redo logs as well, from which they can reconstruct the current database state from a backup.

True, this means that you can't aggregate redo log activity with other transaction bulk-writes, but that's at least partly good as well: databases are often extremely sensitive to redo log write latency and would not want such writes delayed by combination with other updates, let alone by up to a 5-second delay. ZFS's synchronous write intent log could help here (if you replicate it: serious database people would consider even the very temporary exposure to a single failure inherent in an unmirrored log completely unacceptable), but that could also be slowed by other synchronous small-write activity; conversely, databases often couldn't care less about the latency of many of their other writes, because their own (replicated) redo log has already established the persistence that they need.

As for direct I/O, it's not clear why ZFS couldn't support it: it could verify each read in user memory against its internal checksum and perform its self-healing magic if necessary before returning completion status (which would be the same status it would return if the same situation occurred during its normal mode of operation: either unconditional success, or success-after-recovery if the application might care to know that). It could handle each synchronous write analogously, and if direct I/O mechanisms support lazy writes then presumably they tie up the user buffer until the write completes, such that you could use your normal mechanisms there as well (just operating on the user buffer instead of your cache). In this I'm assuming that 'direct I/O' refers not to raw device access but to file-oriented access that simply avoids any internal cache use, such that you could still use your no-overwrite approach.

Of course, this also assumes that the direct I/O is always being performed in aligned integral multiples of checksum units by the application; if not, you'd either have to bag the checksum facility (not an entirely unreasonable option to offer, given that some sophisticated applications might want to use their own even higher-level integrity mechanisms, e.g., across geographically-separated sites, and would not need yours) or run everything through cache as you normally do. In suitably-aligned cases where you do validate the data, you could avoid half the copy overhead (an issue of memory bandwidth as well as simply operation latency: TPC-C submissions can be affected by this, though it may be rare in real-world use) by integrating the checksum calculation with the copy, but you would still have multiple copies of the data taking up memory in a situation (direct I/O) where the application *by definition* does not expect you to be caching the data (quite likely because it is doing any desirable caching itself).
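A hypothetical sketch of the direct-I/O read path described above; checksum_of(), stored_checksum_for(), and recover_from_redundancy() are invented placeholders standing in for ZFS internals, not real interfaces:

/* Sketch: DMA straight into the caller's buffer, verify against the
 * block's stored checksum, and self-heal before returning status.
 * The three extern functions below are placeholders, not ZFS APIs. */
#include <unistd.h>
#include <errno.h>
#include <stdint.h>
#include <sys/types.h>

extern uint64_t checksum_of(const void *buf, size_t len);       /* placeholder */
extern uint64_t stored_checksum_for(int fd, off_t off);         /* placeholder */
extern int      recover_from_redundancy(int fd, off_t off,      /* placeholder */
                                        void *buf, size_t len);

int dio_read(int fd, void *buf, size_t len, off_t off)
{
    ssize_t n = pread(fd, buf, len, off);   /* DMA into user memory */
    if (n != (ssize_t)len)
        return -1;
    if (checksum_of(buf, len) == stored_checksum_for(fd, off))
        return 0;                           /* unconditional success */
    /* Bad checksum: try mirror/parity, rewriting buf in place. */
    if (recover_from_redundancy(fd, off, buf, len) == 0)
        return 1;                           /* success-after-recovery */
    errno = EIO;
    return -1;
}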
Tablespace contiguity may, however, be a deal-breaker for some users: it is common for tablespaces to be scanned sequentially when selection criteria don't mesh with existing indexes, perhaps especially in joins where the smaller tablespace (still too large to be retained in cache, though) is scanned repeatedly in an inner loop, and a DBMS often goes to some effort to keep them defragmented. Until ZFS provides some effective continuous defragmenting mechanisms of its own, its no-overwrite policy may do more harm than good in such cases (since the database's own logs keep persistence latency low, while the backing tablespaces can then be updated at leisure).

I do want to comment on the observation that enough concurrent 128K I/O can saturate a disk - the apparent implication being that one could therefore do no better with larger accesses, an incorrect conclusion. Current disks can stream out 128 KB in 1.5 - 3 ms, while taking 5.5 - 12.5 ms for the average-seek-plus-partial-rotation required to get to that 128 KB in the first place. Thus on a full drive, serial random accesses to 128 KB chunks will yield only about 20% of the drive's streaming capability (by contrast, accessing data using serial random accesses in 4 MB contiguous chunks achieves around 90% of a drive's streaming capability). One can do better on disks that support queuing if one allows queues to form, but this trades significantly increased average operation latency for the increase in throughput (and said increase still falls far short of the 90% utilization one could achieve using 4 MB chunking). Enough concurrent 0.5 KB I/O can also saturate a disk, after all - but this says little about effective utilization.
Re: [zfs-discuss] Re: ZFS and databases
Nicolas Williams wrote:
> On Sat, May 13, 2006 at 08:23:55AM +0200, Franz Haberhauer wrote:
> > Given that ISV apps can only be changed by the ISV, who may or may not be willing to use such a new interface, having a no-cache property for the file - or, given that filesystems are now really cheap with ZFS, for the filesystem - would be important as well, like the forcedirectio mount option for UFS. No caching at the filesystem level is always appropriate if the application itself maintains a buffer of application data and does its own application-specific buffer management, like DBMSes or large matrix solvers. Double-caching these typically huge amounts of data in the filesystem is always a waste of RAM.
>
> Yes, but remember, DB vendors have adopted new features before -- they want to have the fastest DB. Same with open source web servers. So I'm a bit optimistic.

Yes, but they usually adopt it only in their latest releases, and it takes time until those are adopted by customers. And it's not just DB vendors; there are other apps around which could benefit, and there are always some who may not adopt a new feature in Solaris at all. Remember when UFS directio was introduced - forcedirectio was in much wider use than apps which used the API directly.

> Also, an LD_PRELOAD library could be provided to enable direct I/O as necessary.

This would work technically, but whether ISVs are willing to support such usage is a different topic (there may be startup scripts involved, making it a little tricky to pass a library path to the app).

So while having the app request no caching may be the architecturally cleaner approach, having it as a property on a file or filesystem is a pragmatic approach, with faster time-to-market and a potential for much broader use.

- Franz
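A minimal sketch of the LD_PRELOAD idea, assuming Solaris's existing directio(3C) advisory call; a production library would also interpose open64(), handle the varargs mode argument more carefully, and filter by path rather than forcing direct I/O on every file:

/* Interpose open(2): call the real open, then enable direct I/O. */
#include <dlfcn.h>
#include <fcntl.h>
#include <stdarg.h>
#include <sys/types.h>
#include <sys/fcntl.h>

int open(const char *path, int oflag, ...)
{
    static int (*real_open)(const char *, int, ...);
    mode_t mode = 0;
    int fd;

    if (real_open == NULL)
        real_open = (int (*)(const char *, int, ...))
            dlsym(RTLD_NEXT, "open");

    if (oflag & O_CREAT) {          /* mode only present with O_CREAT */
        va_list ap;
        va_start(ap, oflag);
        mode = va_arg(ap, mode_t);
        va_end(ap);
    }
    fd = real_open(path, oflag, mode);
    if (fd >= 0)
        (void) directio(fd, DIRECTIO_ON);   /* advisory; see directio(3C) */
    return fd;
}

Usage would then be something like LD_PRELOAD=./libdio.so ./dbapp, with the object built as a shared library (e.g. gcc -shared -fPIC ... -ldl, or the cc -G equivalent).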
Re: [zfs-discuss] Re: ZFS and databases
Franz Haberhauer wrote:
> This would work technically, but whether ISVs are willing to support such usage is a different topic (there may be startup scripts involved, making it a little tricky to pass a library path to the app).

Yet another reason to start the applications from SMF.

-- Darren J Moffat
Re: [zfs-discuss] Re: ZFS and databases
On Mon, May 15, 2006 at 07:16:38PM +0200, Franz Haberhauer wrote:
> Nicolas Williams wrote:
> > Yes, but remember, DB vendors have adopted new features before -- they want to have the fastest DB. Same with open source web servers. So I'm a bit optimistic.
>
> Yes, but they usually adopt it only in their latest releases, and it takes time until those are adopted by customers. And it's not just DB vendors; there are other apps around which could benefit, and there are always some who may not adopt a new feature in Solaris at all. Remember when UFS directio was introduced - forcedirectio was in much wider use than apps which used the API directly.

I don't oppose (though I'm not on the ZFS team) a file attribute of some sort to provide hints to the FS about the utility of direct I/O to processes that open such files. Ideally the OS could just figure it out every time with enough accuracy that no interface would be necessary at all, but I'm not sure that this is possible.

But really, the right interface is for the application to tell the OS. I don't know what others here (marketing particularly -- you may well be right about time to market) think of it, but if we could just stick to proper interfaces, that would be best.

Nico
Re: [zfs-discuss] Re: ZFS and databases
On Mon, May 15, 2006 at 11:17:17AM -0700, Bart Smaalders wrote:
> Perhaps an fadvise call is in order?

We already have directio(3C). (That was a surprise for me also.)
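For reference, basic use of the existing directio(3C) interface looks like this; the path is illustrative, and the call is advisory, so a filesystem is free to ignore it:

/* Open a file and request unbuffered (direct) transfers on Solaris. */
#include <sys/types.h>
#include <sys/fcntl.h>
#include <fcntl.h>
#include <stdio.h>

int main(void)
{
    int fd = open("/tank/db/datafile", O_RDWR);  /* illustrative path */
    if (fd < 0) {
        perror("open");
        return 1;
    }
    if (directio(fd, DIRECTIO_ON) < 0)   /* advisory; FS may ignore it */
        perror("directio");
    /* ... pread/pwrite as usual; the FS is advised not to cache ... */
    return 0;
}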
Re: [zfs-discuss] Re: ZFS and databases
On Sat, May 13, 2006 at 08:23:55AM +0200, Franz Haberhauer wrote:
> Given that ISV apps can only be changed by the ISV, who may or may not be willing to use such a new interface, having a no-cache property for the file - or, given that filesystems are now really cheap with ZFS, for the filesystem - would be important as well, like the forcedirectio mount option for UFS. No caching at the filesystem level is always appropriate if the application itself maintains a buffer of application data and does its own application-specific buffer management, like DBMSes or large matrix solvers. Double-caching these typically huge amounts of data in the filesystem is always a waste of RAM.

Yes, but remember, DB vendors have adopted new features before -- they want to have the fastest DB. Same with open source web servers. So I'm a bit optimistic.

Also, an LD_PRELOAD library could be provided to enable direct I/O as necessary.

Nico
Re: [zfs-discuss] Re: ZFS and databases
On Fri, May 12, 2006 at 05:23:53PM +0200, Roch Bourbonnais - Performance Engineering wrote:
> For read it is an interesting concept, since reading into cache, then copying into user space, then keeping the data around but never using it is not optimal. So there are 2 issues: there is the cost of the copy, and there is the memory. Now, could we detect the pattern that makes holding on to the cached block non-optimal, and do a quick freebehind after the copyout? Something like random access + very large file + poor cache hit ratio?

An interface to request no caching on a per-file basis would be good (madvise(2) should do for mmap'ed files; an fcntl(2) or open(2) flag would be better).

> Now, about avoiding the copy: that would mean DMA straight into user space? But if the checksum does not validate the data, what do we do?

Who cares? You DMA into user space, check the checksum, and if there's a problem return an error; so there's [corrupted] data in the user-space buffer... but the app knows it, so what's the problem (see below)?

> If storage is not raid-protected and we have to return EIO, I don't think we can do this _and_ corrupt the user buffer also; not sure what POSIX says for this situation.

If POSIX compliance is an issue, just add new interfaces (possibly as simple as an open(2) flag).

> Now, latency-wise, the cost of the copy is small compared to the I/O, right? So it now turns into an issue of saving some CPU cycles.

Can you build a system where the cost of the copy adds significantly to the latency numbers? (Think RAM disks.)

Nico
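A sketch of what the madvise(2) route might look like for an mmap'ed file; the path is illustrative, and whether a given filesystem honors the advice is up to its VM integration:

/* Map a file, hint random access up front, then tell the VM the
 * pages can be dropped once the application has consumed them. */
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

int main(void)
{
    int fd = open("/tank/db/datafile", O_RDONLY);  /* illustrative path */
    struct stat st;

    if (fd < 0 || fstat(fd, &st) < 0) {
        perror("open/fstat");
        return 1;
    }
    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    /* Random access, poor reuse: don't bother with read-ahead. */
    (void) madvise(p, st.st_size, MADV_RANDOM);

    volatile char sink = 0;
    for (off_t i = 0; i < st.st_size; i += 8192)
        sink ^= p[i];                  /* simulate consuming the data */

    /* Done with it: the pages need not be kept in memory. */
    (void) madvise(p, st.st_size, MADV_DONTNEED);
    return 0;
}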
Re: [zfs-discuss] Re: ZFS and databases
Finally I can agree with somebody today. Directio is non-POSIX anyway, and given that people have been trained to inform the system that the cache won't be useful, and that it's a hard problem to detect automatically, let's avoid the copy and save memory all at once for the read path. We could use the directio() call for that ...

-r
Re: [zfs-discuss] Re: ZFS and databases
On Fri, May 12, 2006 at 06:33:00PM +0200, Roch Bourbonnais - Performance Engineering wrote:
> Directio is non-POSIX anyway, and given that people have been trained to inform the system that the cache won't be useful, and that it's a hard problem to detect automatically, let's avoid the copy and save memory all at once for the read path. We could use the directio() call for that ...

I had no idea about directio(3C)!

We might want an interface for the app to know what the natural block size of the file is, so it can read at proper file offsets. Of course, if that block size is smaller than the ZFS filesystem record size then ZFS may yet grow it. How to deal with this? (One option: don't grow it as long as an app has turned direct I/O on for a fildes and the fildes remains open.)

Nico
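One approximation already available to applications is st_blksize from fstat(2), which is documented as the preferred I/O block size; a sketch, assuming (plausibly, but unverified here) that ZFS reports the file's block size there:

/* Query the preferred I/O size and issue a read aligned to it. */
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    struct stat st;
    const char *path = argc > 1 ? argv[1] : "/tank/db/datafile"; /* illustrative */
    int fd = open(path, O_RDONLY);

    if (fd < 0 || fstat(fd, &st) < 0) {
        perror(path);
        return 1;
    }
    printf("preferred I/O block size: %ld bytes\n", (long)st.st_blksize);

    /* Offsets and lengths in multiples of st_blksize should map onto
     * whole filesystem records. */
    char *buf = malloc((size_t)st.st_blksize);
    if (buf != NULL) {
        (void) pread(fd, buf, (size_t)st.st_blksize, (off_t)0);
        free(buf);
    }
    return 0;
}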
Re: [zfs-discuss] Re: ZFS and databases
On Fri, 2006-05-12 at 10:42 -0500, Anton Rang wrote:
> > Now, latency-wise, the cost of the copy is small compared to the I/O, right? So it now turns into an issue of saving some CPU cycles.
>
> CPU cycles and memory bandwidth (which both can be in short supply on a database server).

We can throw hardware at that :-) Imagine a machine with lots of extra CPU cycles and lots of parallel access to multiple memory banks. This is the strategy behind CMT. In the future, you will have many more CPU cycles and even better memory bandwidth than you do now, perhaps by an order of magnitude in the next few years.

-- richard
Re: [zfs-discuss] Re: ZFS and databases
On Fri, May 12, 2006 at 09:59:56AM -0700, Richard Elling wrote:
> > CPU cycles and memory bandwidth (which both can be in short supply on a database server).
>
> We can throw hardware at that :-) Imagine a machine with lots of extra CPU cycles and lots of parallel access to multiple memory banks. This is the strategy behind CMT. In the future, you will have many more CPU cycles and even better memory bandwidth than you do now, perhaps by an order of magnitude in the next few years.

Well, yes, of course, but I think the arguments for direct I/O are excellent. Another thing that I see an argument for is limiting the size of various caches, to avoid paging (even having no swap isn't enough, as you don't want memory pressure evicting hot text pages).

Nico
Re: [zfs-discuss] Re: ZFS and databases
On May 12, 2006, at 11:59 AM, Richard Elling wrote:
> > CPU cycles and memory bandwidth (which both can be in short supply on a database server).
>
> We can throw hardware at that :-) Imagine a machine with lots of extra CPU cycles [ ... ]

Yes, I've heard this story before, and I won't believe it this time. ;-)

Seriously, I believe a database can perform very well on a CMT system, but there won't be any extra CPU cycles or memory bandwidth, because the demand for transaction rates will always exceed what we can supply.

Anton