Re: O_DIRECT question
Hi!

> > > Which shouldn't be true. There is no fundamental reason why
> > > ordinary writes should be slower than O_DIRECT.
> >
> > Again, there IS a reason: O_DIRECT eliminates the cpu overhead of the
> > kernel-user copy,
>
> You assume that ordinary read()/write() is *required* to do the copying.
> It doesn't. Kernel is allowed to do direct DMAing in this case too.

The kernel is allowed to, but it is practically impossible to code. It
would require slow MMU magic.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: O_DIRECT question
Phillip Susi wrote:
[]
> You seem to have missed the point of this thread. Denis Vlasenko's
> message that you replied to simply pointed out that they are
> semantically equivalent, so O_DIRECT can be dropped provided that O_SYNC
> + madvise could be fixed to perform as well. Several people including
> Linus seem to like this idea and think it is quite possible.

By the way, IF O_SYNC+madvise could be "fixed", couldn't O_DIRECT be
implemented internally using them? I mean, during open(O_DIRECT), do
open(O_SYNC) instead and call madvise() appropriately.

/mjt
Re: O_DIRECT question
On Tue, Jan 30, 2007 at 06:07:14PM -0500, Phillip Susi wrote:
> It most certainly matters where the error happened because "you are
> screwed" is not an acceptable outcome in a mission critical application.

An I/O error is not an acceptable outcome in a mission critical app
either: all mission critical setups should be fault tolerant, so if the
raid cannot recover at the first sign of error, the whole system should
instantly go down and let the secondary take over from it. See slony etc.

Trying to recover the recoverable by mucking about with the data, making
even _more_ writes on a failing disk before taking a physical mirror
image of the disk (the readable part), isn't a good idea IMHO. At best
you could retry writing to the same sector, hoping somebody disconnected
the scsi cable by mistake.

> A well engineered solution will deal with errors as best as possible,
> not simply give up and tell the user they are screwed because the
> designer was lazy. There is a reason that read and write return the
> number of bytes _actually_ transferred, and the application is supposed
> to check that result to verify proper operation.

You can track the range where it happened with fsync too, like I said in
a previous email, and you can take the big database lock and then
read-write every single block in that range until you find the failing
place, if you really want to. A read-write in place should be safe.

> No, there is a slight difference. An fsync() flushes all dirty buffers
> in an undefined order. Using O_DIRECT or O_SYNC, you can control the
> flush order because you can simply wait for one set of writes to
> complete before starting another set that must not be written until
> after the first are on the disk. You can emulate that by placing an
> fsync between both sets of writes, but that will flush any other dirty

Doing an fsync after every write will provide the same ordering
guarantee as O_SYNC; I thought it was obvious that this is what I meant
here. The whole point is that most of the time you don't need it: you
need an fsync after a couple of writes. All smtp servers use fsync for
the same reason; they also have to journal their writes to avoid losing
email when there is a power loss. If you use writev or aio pwrite you
can do well with O_SYNC too, though.

> buffers whose ordering you do not care about. Also there is no aio
> version of fsync.

Please have a second look at aio_abi.h:

	IOCB_CMD_FSYNC = 2,
	IOCB_CMD_FDSYNC = 3,

There must be a reason why they exist, right?

> sync has no effect on reading, so that test is pointless. direct saves
> the cpu overhead of the buffer copy, but isn't good if the cache isn't
> entirely cold. The large buffer size really has little to do with it,

direct bypasses the cache, so the cache is freezing, not just cold.

> rather it is the fact that the writes to null do not block dd from
> making the next read for any length of time. If dd were blocking on an
> actual output device, that would leave the input device idle for the
> portion of the time that dd were blocked.

The objective was to measure the pipeline stall; if you stall it for
other reasons anyway, what's the point?

> In any case, this is a totally different example than your previous one
> which had dd _writing_ to a disk, where it would block for long periods
> of time due to O_SYNC, thereby preventing it from reading from the input
> buffer in a timely manner. By not reading the input pipe frequently, it
> becomes full and thus, tar blocks. In that case the large buffer size
> is actually a detriment because with a smaller buffer size, dd would not
> be blocked as long and so it could empty the pipe more frequently
> allowing tar to block less.

It would run slower with a smaller buffer size, because it would block
too and it would read and write slower too. For my backup usage,
keeping tar blocked is actually a feature, so the load of the backup
decreases. To me what's important is the MB/sec of the writes and the
MB/sec of the reads (to lower the load); I don't care too much about how
long it takes, as long as things run as efficiently as possible while
they run. The rate limiting effect of the blocking isn't a problem for
me.

> You seem to have missed the point of this thread. Denis Vlasenko's
> message that you replied to simply pointed out that they are
> semantically equivalent, so O_DIRECT can be dropped provided that O_SYNC
> + madvise could be fixed to perform as well. Several people including
> Linus seem to like this idea and think it is quite possible.

I answered that email to point out the fundamental differences between
O_SYNC and O_DIRECT. If you don't like what I said I'm sorry, but that's
how things are running today, and I don't see it as quite possible to
change (unless of course we remove performance from the equation; then
indeed they'll be much the same). Perhaps a IOCB_CMD_PREADAHEAD plus
MAP_SHARED backed by largepages loaded with a new syscall that reads a
piece at ti
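[Editor's note: the two aio fsync opcodes Andrea points at do exist in the Linux uapi header with exactly the values quoted. A tiny Linux-only check (assumes kernel headers are installed); the function name is made up:]

```c
/* Confirm that include/linux/aio_abi.h really declares aio fsync opcodes,
 * i.e. that an aio version of fsync exists at the kernel ABI level. */
#include <linux/aio_abi.h>
#include <assert.h>

int aio_fsync_opcodes_present(void)
{
    /* IOCB_CMD_* is an enum in the uapi header */
    return IOCB_CMD_FSYNC == 2 && IOCB_CMD_FDSYNC == 3;
}
```

Whether glibc exposes a convenient wrapper is a separate question; these opcodes are submitted via io_submit(2) on a native aio context.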
Re: O_DIRECT question
Andrea Arcangeli wrote:
> When you have I/O errors during _writes_ (not reads!!) the raid must
> kick the disk out of the array before the OS ever notices. And if it's
> software raid that you're using, the OS should kick out the disk before
> your app ever notices any I/O error. When the write I/O error happens,
> it's not a problem for the application to solve.

I thought it obvious that we were talking about non recoverable errors
that then DO make it to the application. And any kind of mission
critical app most definitely does care about write errors. You don't
need your db completing the transaction when it was only half recorded.
It needs to know it failed so it can back out and/or recover the data
and record it elsewhere. You certainly don't want the users to think
everything is fine, walk away, and have the system continue to limp on,
making things worse by the second.

> When the I/O error reaches the filesystem, you're lucky if the OS won't
> crash (ext3 claims to handle it); if your app receives the I/O error,
> all you should be doing is to shut things down gracefully, sending all
> the errors you can to the admin.

If the OS crashes due to an IO error reading user data, then there is
something seriously wrong and beyond the scope of this discussion. It
suffices to say that due to the semantics of write() and sound
engineering practice, the application expects to be notified of errors
so it can try to recover, or fail gracefully. Whether it chooses to fail
gracefully as you say it should, or recovers from the error, it needs to
know that an error happened, and where it was.

> It doesn't matter much where the error happened; all that matters is
> that you didn't have a fault tolerant raid setup (your fault) and your
> primary disk just died and you're now screwed(tm). If you could trust
> that part of the disk is still sane you could perhaps attempt to avoid
> a restore from the last backup; otherwise all you can do is the
> equivalent of an e2fsck -f on the db metadata after copying what you
> can still read to the new device.

It most certainly matters where the error happened because "you are
screwed" is not an acceptable outcome in a mission critical application.
A well engineered solution will deal with errors as best as possible,
not simply give up and tell the user they are screwed because the
designer was lazy. There is a reason that read and write return the
number of bytes _actually_ transferred, and the application is supposed
to check that result to verify proper operation.

> Sorry, but as far as ordering is concerned, O_DIRECT, fsync and O_SYNC
> offer exactly the same guarantees. Feel free to check the real life db
> code. Even bdb uses fsync.

No, there is a slight difference. An fsync() flushes all dirty buffers
in an undefined order. Using O_DIRECT or O_SYNC, you can control the
flush order because you can simply wait for one set of writes to
complete before starting another set that must not be written until
after the first are on the disk. You can emulate that by placing an
fsync between both sets of writes, but that will flush any other dirty
buffers whose ordering you do not care about. Also there is no aio
version of fsync.

> Please try yourself, it's simple enough:
>
>	time dd if=/dev/hda of=/dev/null bs=16M count=100
>	time dd if=/dev/hda of=/dev/null bs=16M count=100 iflag=sync
>	time dd if=/dev/hda of=/dev/null bs=16M count=100 iflag=direct
>
> if you can measure any slowdown in the sync/direct you're welcome (it
> runs faster here... as it should). The pipeline stall is not measurable
> when it's so infrequent, and actually the pipeline stall is not a big
> issue when the I/O is contiguous and the dma commands are always large.
> aio is mandatory only while dealing with small buffers, especially
> while seeking, to take advantage of the elevator.

sync has no effect on reading, so that test is pointless. direct saves
the cpu overhead of the buffer copy, but isn't good if the cache isn't
entirely cold. The large buffer size really has little to do with it;
rather it is the fact that the writes to null do not block dd from
making the next read for any length of time. If dd were blocking on an
actual output device, that would leave the input device idle for the
portion of the time that dd were blocked.

In any case, this is a totally different example than your previous one,
which had dd _writing_ to a disk, where it would block for long periods
of time due to O_SYNC, thereby preventing it from reading from the input
buffer in a timely manner. By not reading the input pipe frequently, it
becomes full and thus, tar blocks. In that case the large buffer size is
actually a detriment because with a smaller buffer size, dd would not be
blocked as long and so it could empty the pipe more frequently, allowing
tar to block less.

This whole thing is about performance; if you remove performance factors
from the equation, you can stick to your O_SYNC 512
Re: O_DIRECT question
On Tue, Jan 30, 2007 at 08:57:20PM +0100, Andrea Arcangeli wrote:
> Please try yourself, it's simple enough:
>
>	time dd if=/dev/hda of=/dev/null bs=16M count=100
>	time dd if=/dev/hda of=/dev/null bs=16M count=100 iflag=sync

Sorry, reading won't help much to exercise sync ;). But the direct line
is enough to show the effect of an I/O pipeline stall. To effectively
test sync you of course want to write to a file instead (unless you want
to wipe out /dev/hda ;)
Re: O_DIRECT question
On Tue, Jan 30, 2007 at 01:50:41PM -0500, Phillip Susi wrote:
> It should return the number of bytes successfully written before the
> error, giving you the location of the first error. Also using smaller
> individual writes ( preferably issued in parallel ) allows the problem
> spot to be isolated.

When you have I/O errors during _writes_ (not reads!!) the raid must
kick the disk out of the array before the OS ever notices. And if it's
software raid that you're using, the OS should kick out the disk before
your app ever notices any I/O error. When the write I/O error happens,
it's not a problem for the application to solve.

When the I/O error reaches the filesystem, you're lucky if the OS won't
crash (ext3 claims to handle it); if your app receives the I/O error,
all you should be doing is to shut things down gracefully, sending all
the errors you can to the admin.

It doesn't matter much where the error happened; all that matters is
that you didn't have a fault tolerant raid setup (your fault) and your
primary disk just died and you're now screwed(tm). If you could trust
that part of the disk is still sane you could perhaps attempt to avoid a
restore from the last backup; otherwise all you can do is the equivalent
of an e2fsck -f on the db metadata after copying what you can still read
to the new device.

The only time I got an I/O error on writes, about 1G of the disk was
gone; not very useful to know the first 512byte region that failed...
unreadable and unwriteable. Every other time, writing to the disk
actually solved the read I/O error (they weren't write I/O errors of
course).

Now if you're careful enough you can track down which data generated the
I/O error by queuing the blocks that you write in between every fsync.
So you can still know if perhaps only the journal has generated write
I/O errors; in such a case you could tell the user that he can copy the
data files and let the journal be regenerated on the new disk. I doubt
it will help much in practice though (in such a case, I would always
restore the last backup just in case).

> >> Typically you only want one sector of data to be written before you
> >> continue. In the cases where you don't, this might be nice, but as I
> >> said above, you can't handle errors properly.
> >
> > Sorry, but you're dreaming if you're thinking anything in real life
> > writes at 512 bytes at a time with O_SYNC. Try that with any modern
> > harddisk.
>
> When you are writing a transaction log, you do; you don't need much
> data, but you do need to be sure it has hit the disk before continuing.
> You certainly aren't writing many mb across a dozen write() calls and
> only then care to make sure it is all flushed in an unknown order. When
> order matters, you can not use fsync, which is one of the reasons why
> databases use O_DIRECT; they care about the ordering.

Sorry, but as far as ordering is concerned, O_DIRECT, fsync and O_SYNC
offer exactly the same guarantees. Feel free to check the real life db
code. Even bdb uses fsync.

> >>> Just grep for fsync in the db code of your choice (try postgresql)
> >>> and then explain to me why they ever call fsync in their code, if
> >>> you know how to do better with O_SYNC ;).
> >> Doesn't sound like a very good idea to me.
> >
> > Why not a good idea to check any real life app?
>
> I meant it is not a good idea to use fsync as you can't properly handle
> errors. See above.

> >> The stalling is caused by cache pollution. Since you did not specify
> >> a block size dd uses the base block size of the output disk. When
> >> combined with sync, only one block is written at a time, and no more
> >> until the first block has been flushed. Only then does dd send down
> >> another block to write. Without dd the kernel is likely allowing
> >> many mb to be queued in the buffer cache. Limiting output to one
> >> block at a time is not good for throughput, but allowing half of ram
> >> to be used by dirty pages is not good either.
> >
> > Throughput is perfect. I forgot to tell I combine it with ibs=4k
> > obs=16M. Like it would be perfect with odirect too for the same
> > reason. Stalling the I/O pipeline once every 16M isn't measurable in
>
> Throughput is nowhere near perfect, as the pipeline is stalled for
> quite some time. The pipe fills up quickly while dd is blocked on the
> sync write, which then blocks tar until all 16 MB have hit the disk.
> Only then does dd go back to reading from the tar pipe, allowing it to
> continue. During the time it takes tar to archive another 16 MB of
> data, the write queue is empty. The only time that the tar process
> gets to continue running while data is written to disk is in the small
> time it takes for the pipe ( 4 KB isn't it? ) to fill up.

Please try yourself, it's simple enough:

	time dd if=/dev/hda of=/dev/null bs=16M count=100
	time dd if=/dev/hda of=/dev/null bs=16M count=100 iflag=sync
	time dd if=/dev/hda of=/dev/null bs=16M count=100 iflag=dire
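[Editor's note: Phillip's "smaller individual writes" idea from the quoted text can be sketched as a plain chunked-write loop that pins a failure to a file offset, which is exactly the information the rest of the thread argues a recovery path needs. Hypothetical helper; names are made up:]

```c
/* Issue one large logical write as smaller sequential writes, so that on
 * failure the caller learns the offset of the first byte that did not make
 * it.  write() may also return short without error, so the loop resumes
 * rather than treating a short write as a failure. */
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>

/* Returns 0 on success; on error returns -1 and sets *bad_off to the file
 * offset (relative to the start of buf) of the first unwritten byte. */
int write_chunked(int fd, const char *buf, size_t len, size_t chunk,
                  off_t *bad_off)
{
    size_t done = 0;
    while (done < len) {
        size_t n = (len - done < chunk) ? len - done : chunk;
        ssize_t w = write(fd, buf + done, n);
        if (w < 0) {
            *bad_off = (off_t)done;   /* first failing offset */
            return -1;
        }
        done += (size_t)w;
    }
    return 0;
}
```

Opening the fd with O_SYNC makes each chunk a durability point as well, at the throughput cost discussed above.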
Re: O_DIRECT question
Andrea Arcangeli wrote:
> On Tue, Jan 30, 2007 at 10:36:03AM -0500, Phillip Susi wrote:
> > Did you intentionally drop this reply off list?
>
> No.

Then I'll restore the lkml to the cc list.

> > No, it doesn't... or at least can't report WHERE the error is.
>
> O_SYNC doesn't report where the error is either, try a write(fd, buf,
> 10*1024*1024).

It should return the number of bytes successfully written before the
error, giving you the location of the first error. Also using smaller
individual writes ( preferably issued in parallel ) allows the problem
spot to be isolated.

> > Typically you only want one sector of data to be written before you
> > continue. In the cases where you don't, this might be nice, but as I
> > said above, you can't handle errors properly.
>
> Sorry, but you're dreaming if you're thinking anything in real life
> writes at 512 bytes at a time with O_SYNC. Try that with any modern
> harddisk.

When you are writing a transaction log, you do; you don't need much
data, but you do need to be sure it has hit the disk before continuing.
You certainly aren't writing many mb across a dozen write() calls and
only then care to make sure it is all flushed in an unknown order. When
order matters, you can not use fsync, which is one of the reasons why
databases use O_DIRECT; they care about the ordering.

> > > Just grep for fsync in the db code of your choice (try postgresql)
> > > and then explain to me why they ever call fsync in their code, if
> > > you know how to do better with O_SYNC ;).
> > Doesn't sound like a very good idea to me.
>
> Why not a good idea to check any real life app?

I meant it is not a good idea to use fsync as you can't properly handle
errors.

> > The stalling is caused by cache pollution. Since you did not specify
> > a block size dd uses the base block size of the output disk. When
> > combined with sync, only one block is written at a time, and no more
> > until the first block has been flushed. Only then does dd send down
> > another block to write. Without dd the kernel is likely allowing many
> > mb to be queued in the buffer cache. Limiting output to one block at
> > a time is not good for throughput, but allowing half of ram to be
> > used by dirty pages is not good either.
>
> Throughput is perfect. I forgot to tell I combine it with ibs=4k
> obs=16M. Like it would be perfect with odirect too for the same
> reason. Stalling the I/O pipeline once every 16M isn't measurable in

Throughput is nowhere near perfect, as the pipeline is stalled for quite
some time. The pipe fills up quickly while dd is blocked on the sync
write, which then blocks tar until all 16 MB have hit the disk. Only
then does dd go back to reading from the tar pipe, allowing it to
continue. During the time it takes tar to archive another 16 MB of data,
the write queue is empty. The only time that the tar process gets to
continue running while data is written to disk is in the small time it
takes for the pipe ( 4 KB isn't it? ) to fill up.

> > The semantics of the two are very much the same; they only differ in
> > the internal implementation. As far as the caller is concerned, in
> > both cases he is sure that writes are safely on the disk when they
> > return, and reads semantically are no different with either flag.
> > The internal implementations lead to different performance
> > characteristics, and the other post was simply commenting that the
> > performance characteristics of O_SYNC + madvise() are almost the
> > same as O_DIRECT, or even better in some cases ( since the data read
> > may already be in cache ).
>
> The semantics mandate the implementation, because the semantics make up
> the performance expectations. For the same reason you shouldn't write
> 512 bytes at a time with O_SYNC, you also shouldn't use O_SYNC if your
> device risks creating a bottleneck in the CPU and memory.

No, semantics have nothing to do with performance. Semantics deal with
the state of the machine after the call, not how quickly it got there.
Semantics is a question of correct operation, not optimal. With both
O_DIRECT and O_SYNC, the machine state is essentially the same after the
call: the data has hit the disk. Aside from the performance difference,
the application can not tell the difference between O_DIRECT and O_SYNC,
so if that performance difference can be resolved by changing the
implementation, Linus can be happy and get rid of O_DIRECT.
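[Editor's note: the fsync-as-ordering-barrier emulation both posters describe can be made concrete. A minimal sketch under the assumption that both descriptors refer to plain buffered files; the function and the journal/data roles are illustrative only:]

```c
/* Ordering barrier via fsync: the journal record must be durable before
 * the data write begins.  With buffered I/O this is the fsync between the
 * two write sets; the thread's caveat applies -- fsync also flushes other
 * dirty buffers on that file whose ordering you do not care about. */
#include <unistd.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>

int journal_then_data(int journal_fd, int data_fd,
                      const char *rec, size_t rlen,
                      const char *data, size_t dlen)
{
    if (write(journal_fd, rec, rlen) != (ssize_t)rlen)
        return -1;
    if (fsync(journal_fd) != 0)   /* barrier: journal is on disk first */
        return -1;
    if (write(data_fd, data, dlen) != (ssize_t)dlen)
        return -1;
    return fsync(data_fd);        /* then make the data itself durable */
}
```

With O_SYNC or O_DIRECT the explicit fsync calls disappear because each write() only returns once its data is on disk, which is the ordering property being debated.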
Re: O_DIRECT question
On Monday 29 January 2007 18:00, Andrea Arcangeli wrote:
> On Sun, Jan 28, 2007 at 06:03:08PM +0100, Denis Vlasenko wrote:
> > I still don't see much difference between O_SYNC and O_DIRECT write
> > semantics.
>
> O_DIRECT is about avoiding the copy_user between cache and userland,
> when working with devices that run faster than ram (think >=100M/sec,
> quite standard hardware unless you've only a desktop or you cannot
> afford raid).

Yes, I know that, but O_DIRECT is also "overloaded" with O_SYNC-like
semantics ("write doesn't return until data hits physical media"). To
have two orthogonal things "mixed together" in one flag feels "not
Unixy" to me. So I am trying to formulate a saner semantic. So far I
think that this looks good:

O_SYNC - usual meaning

O_STREAM - do not try hard to cache me. This includes "if you can
(buffer is sufficiently aligned, yadda, yadda), do not copy_user into
pagecache but just DMA from userspace pages" - exactly because the user
told us that he is not interested in caching!

Then O_DIRECT is approximately = O_SYNC + O_STREAM, and I think maybe
Linus will not hate this "new" O_DIRECT - it doesn't bypass pagecache.

> O_SYNC is about working around buggy or underperforming VM growing the
> dirty levels beyond optimal levels, or to open logfiles that you want
> to save to disk ASAP (most other journaling usages are better done
> with fsync instead).

I've got a feeling that db people use O_DIRECT (its O_SYNCy behaviour)
as a poor man's write barrier, when they must be sure that their redo
logs have hit storage before they start to modify datafiles. Another
reason why they want sync writes is write error detection. They cannot
afford delaying it.
--
vda
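[Editor's note: for contrast with the O_SYNC sketches, this is roughly what O_DIRECT demands of the caller today, i.e. the "peculiar requirements" the thread keeps returning to: buffer address, length and file offset must all be aligned. The 4096-byte alignment is an assumption (the real requirement is the device's logical block size, historically 512), and some filesystems (e.g. tmpfs) reject O_DIRECT entirely. Linux-only sketch with made-up names:]

```c
/* Minimal O_DIRECT write: one aligned block from an aligned buffer. */
#define _GNU_SOURCE               /* O_DIRECT is a Linux extension */
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>

ssize_t direct_write_block(const char *path, const char *src, size_t len)
{
    enum { ALIGN = 4096 };        /* assumed logical block size */
    void *buf;
    if (posix_memalign(&buf, ALIGN, ALIGN) != 0)  /* aligned buffer */
        return -1;
    memset(buf, 0, ALIGN);
    memcpy(buf, src, len < ALIGN ? len : ALIGN);
    int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) {
        free(buf);
        return -1;
    }
    ssize_t n = write(fd, buf, ALIGN);   /* aligned length: one block */
    close(fd);
    free(buf);
    return n;
}
```

The padding-to-a-block and the posix_memalign() call are exactly the burden Denis's O_STREAM proposal would turn from a hard requirement into a best-effort hint.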
Re: O_DIRECT question
On Sun, Jan 28, 2007 at 06:03:08PM +0100, Denis Vlasenko wrote:
> I still don't see much difference between O_SYNC and O_DIRECT write
> semantics.

O_DIRECT is about avoiding the copy_user between cache and userland,
when working with devices that run faster than ram (think >=100M/sec,
quite standard hardware unless you've only a desktop or you cannot
afford raid).

O_SYNC is about working around a buggy or underperforming VM growing the
dirty levels beyond optimal levels, or to open logfiles that you want to
save to disk ASAP (most other journaling usages are better done with
fsync instead). Or you can mount the fs in sync mode when you deal with
users not capable of unmounting devices before unplugging them.

Ideally you should never need O_SYNC; when you need O_SYNC it's usually
a very bad sign. If you need O_DIRECT it's not a bad sign (needing
O_DIRECT is mostly a sign you've got very fast storage).

The only case where I ever used O_SYNC myself is during backups (when
run on standard or mainline kernels that dirty half of ram during the
backup). For the logfiles I don't find it very useful; if anything I log
them remotely (when the system crashes the logs usually won't hit the
disk anyway, so it's just slower).

I use "tar | dd oflag=sync" and that generates a huge speedup for the
rest of the system (not necessarily for the backup itself). Yes, I could
even use oflag=direct, but I'm fine passing through the cache (the
backup device runs at 10M/sec through USB, so the copy_user is _sure_
worth it; if anything it will help, and it will never be a measurable
slowdown). What is not fine is to see half of the ram dirty the whole
time... (hence the need for O_SYNC).

O_SYNC and O_DIRECT are useful for different scenarios.
Re: O_DIRECT question
Denis Vlasenko wrote:
> I still don't see much difference between O_SYNC and O_DIRECT write
> semantics.

Yes, if you change the normal io paths to properly support playing
vmsplice games ( which have a number of corner cases ) to get the zero
copy, and support madvise() and O_SYNC to control caching behavior, and
fix all the error handling corner cases, then you may be able to do away
with O_DIRECT. I believe that doing all that will be much more complex
than O_DIRECT however.
Re: O_DIRECT question
On Sunday 28 January 2007 16:30, Bill Davidsen wrote:
> Denis Vlasenko wrote:
> > On Saturday 27 January 2007 15:01, Bodo Eggert wrote:
> > > Denis Vlasenko <[EMAIL PROTECTED]> wrote:
> > > > On Friday 26 January 2007 19:23, Bill Davidsen wrote:
> > > > > Denis Vlasenko wrote:
> > > > > > On Thursday 25 January 2007 21:45, Michael Tokarev wrote:
> > > > > > > But even single-threaded I/O but in large quantities
> > > > > > > benefits from O_DIRECT significantly, and I pointed this
> > > > > > > out before.
> > > > > > Which shouldn't be true. There is no fundamental reason why
> > > > > > ordinary writes should be slower than O_DIRECT.
> > > > > Other than the copy to buffer taking CPU and memory resources.
> > > > It is not required by any standard that I know. Kernel can be
> > > > smarter and avoid that if it can.
> > > The kernel can also solve the halting problem if it can.
> > >
> > > Do you really think an entropy estimation code on all access
> > > patterns in the system will be free as in beer,
> >
> > Actually I think we need this heuristic:
> >
> > if (opened_with_O_STREAM && buffer_is_aligned
> >     && io_size_is_a_multiple_of_sectorsize)
> >         do_IO_directly_to_user_buffer_without_memcpy
> >
> > is not *that* complicated.
> >
> > I think that we can get rid of O_DIRECT's peculiar requirements
> > "you *must* not cache me" + "you *must* write me directly to bare
> > metal" by replacing it with O_STREAM ("*advice* to not cache me") +
> > O_SYNC ("write() should return only when data is written to storage,
> > not sooner"). Why? Because these O_DIRECT "musts" are rather unusual
> > and overkill. Apps should not have that much control over what the
> > kernel does internally; and also O_DIRECT was mixing shampoo and
> > conditioner in one bottle (no-cache and sync writes) - bad API.
>
> What a shame that other operating systems can manage to really support
> O_DIRECT, and that major application software can use this api to
> write portable code that works even on Windows.
>
> You overlooked the problem that applications using this api assume
> that reads are on bare metal as well. How do you address the case
> where thread A does a write, thread B does a read? If you give thread
> B data from a buffer and it then does a write to another file (which
> completes before the write from thread A), and then the system
> crashes, you have just put the files out of sync.

Applications which synchronize their data integrity by keeping data on
the hard drive and relying on "read goes to bare metal, so it can't see
written data before it gets written to bare metal"? Wow, this is slow.
Are you talking about this scenario:

Bad:
    fd = open(..., O_SYNC);
    fork()
    write(fd, buf);  [1]
                          read(fd, buf2);  [starts after write 1 started]
                          write(somewhere_else, buf2);
                          (write returns)
    < crash point
    (write returns)

This will be *very* slow - if you use O_DIRECT and do what is depicted
above, you write data, then you read it back, which is slow. Why do you
want that? Isn't it much faster to just wait for the write to complete,
and allow the read to fetch (potentially) cached data?

Better:
    fd = open(..., O_SYNC);
    fork()
    write(fd, buf);  [1]
                          (wait for write to finish)
    < crash point
    (write returns)
                          read(fd, buf2);  [starts after write 1 finished]
                          write(somewhere_else, buf2);
                          (write returns)

> So you may have to block all i/o for all threads of the application to
> be sure that doesn't happen.

Not all, only related i/o.
--
vda
Re: O_DIRECT question
On Sunday 28 January 2007 16:18, Bill Davidsen wrote:
> Denis Vlasenko wrote:
> > On Friday 26 January 2007 19:23, Bill Davidsen wrote:
> > > Denis Vlasenko wrote:
> > > > On Thursday 25 January 2007 21:45, Michael Tokarev wrote:
> > > > > Phillip Susi wrote:
> > > > > [...]
> > > > > But even single-threaded I/O but in large quantities benefits
> > > > > from O_DIRECT significantly, and I pointed this out before.
> > > > Which shouldn't be true. There is no fundamental reason why
> > > > ordinary writes should be slower than O_DIRECT.
> > > Other than the copy to buffer taking CPU and memory resources.
> > It is not required by any standard that I know. Kernel can be
> > smarter and avoid that if it can.
>
> Actually, no, the whole idea of the page cache is that overall system
> i/o can be faster if data sits in the page cache for a while. But the
> real problem is that the application write is now disconnected from
> the physical write, both in time and order.

Not in the O_SYNC case.

> No standard says the kernel couldn't do direct DMA, but since having
> that required is needed to guarantee write order and error status
> linked to the actual application i/o, what a kernel "might do" is
> irrelevant.
>
> It's much easier to do O_DIRECT by actually doing the direct i/o than
> to try to catch all the corner cases which arise in faking it.

I still don't see much difference between O_SYNC and O_DIRECT write
semantics.
--
vda
Re: O_DIRECT question
Denis Vlasenko wrote:
> On Saturday 27 January 2007 15:01, Bodo Eggert wrote:
> > Denis Vlasenko <[EMAIL PROTECTED]> wrote:
> > > On Friday 26 January 2007 19:23, Bill Davidsen wrote:
> > > > Denis Vlasenko wrote:
> > > > > On Thursday 25 January 2007 21:45, Michael Tokarev wrote:
> > > > > > But even single-threaded I/O but in large quantities benefits
> > > > > > from O_DIRECT significantly, and I pointed this out before.
> > > > > Which shouldn't be true. There is no fundamental reason why
> > > > > ordinary writes should be slower than O_DIRECT.
> > > > Other than the copy to buffer taking CPU and memory resources.
> > > It is not required by any standard that I know. Kernel can be
> > > smarter and avoid that if it can.
> > The kernel can also solve the halting problem if it can.
> >
> > Do you really think an entropy estimation code on all access patterns
> > in the system will be free as in beer,
>
> Actually I think we need this heuristic:
>
> if (opened_with_O_STREAM && buffer_is_aligned
>     && io_size_is_a_multiple_of_sectorsize)
>         do_IO_directly_to_user_buffer_without_memcpy
>
> is not *that* complicated.
>
> I think that we can get rid of O_DIRECT's peculiar requirements
> "you *must* not cache me" + "you *must* write me directly to bare
> metal" by replacing it with O_STREAM ("*advice* to not cache me") +
> O_SYNC ("write() should return only when data is written to storage,
> not sooner"). Why? Because these O_DIRECT "musts" are rather unusual
> and overkill. Apps should not have that much control over what the
> kernel does internally; and also O_DIRECT was mixing shampoo and
> conditioner in one bottle (no-cache and sync writes) - bad API.

What a shame that other operating systems can manage to really support
O_DIRECT, and that major application software can use this api to write
portable code that works even on Windows.

You overlooked the problem that applications using this api assume that
reads are on bare metal as well. How do you address the case where
thread A does a write and thread B does a read? If you give thread B
data from a buffer and it then does a write to another file (which
completes before the write from thread A), and then the system crashes,
you have just put the files out of sync. So you may have to block all
i/o for all threads of the application to be sure that doesn't happen.
Or introduce some complex way to assure that all writes are physically
done in order... that sounds like a lock infested mess to me, assuming
that you could ever do it right.

Oracle has their own version of Linux now; do you think that they would
fork the application or the kernel?

--
Bill Davidsen <[EMAIL PROTECTED]>
"We have more to fear from the bungling of the incompetent than from
the machinations of the wicked." - from Slashdot
Re: O_DIRECT question
Denis Vlasenko wrote:
> On Friday 26 January 2007 19:23, Bill Davidsen wrote:
>> Denis Vlasenko wrote:
>>> On Thursday 25 January 2007 21:45, Michael Tokarev wrote:
>>>> Phillip Susi wrote:
>>>> [...]
>>>> But even single-threaded I/O but in large quantities benefits from
>>>> O_DIRECT significantly, and I pointed this out before.
>>> Which shouldn't be true. There is no fundamental reason why
>>> ordinary writes should be slower than O_DIRECT.
>> Other than the copy to buffer taking CPU and memory resources.
> It is not required by any standard that I know. Kernel can be smarter
> and avoid that if it can.

Actually, no, the whole idea of page cache is that overall system i/o
can be faster if data sit in the page cache for a while. But the real
problem is that the application write is now disconnected from the
physical write, both in time and order.

No standard says the kernel couldn't do direct DMA, but since having
that required is needed to guarantee write order and error status linked
to the actual application i/o, what a kernel "might do" is irrelevant.

It's much easier to do O_DIRECT by actually doing the direct i/o than to
try to catch all the corner cases which arise in faking it.

--
Bill Davidsen <[EMAIL PROTECTED]>
"We have more to fear from the bungling of the incompetent than from
the machinations of the wicked." - from Slashdot
Re: O_DIRECT question
On Saturday 27 January 2007 15:01, Bodo Eggert wrote:
> Denis Vlasenko <[EMAIL PROTECTED]> wrote:
> > On Friday 26 January 2007 19:23, Bill Davidsen wrote:
> >> Denis Vlasenko wrote:
> >> > On Thursday 25 January 2007 21:45, Michael Tokarev wrote:
> >> >> But even single-threaded I/O but in large quantities benefits from
> >> >> O_DIRECT significantly, and I pointed this out before.
> >> >
> >> > Which shouldn't be true. There is no fundamental reason why
> >> > ordinary writes should be slower than O_DIRECT.
> >> >
> >> Other than the copy to buffer taking CPU and memory resources.
> >
> > It is not required by any standard that I know. Kernel can be smarter
> > and avoid that if it can.
>
> The kernel can also solve the halting problem if it can.
>
> Do you really think an entropy estimation code on all access patterns
> in the system will be free as in beer,

Actually I think the heuristic we need:

if (opened_with_O_STREAM && buffer_is_aligned
    && io_size_is_a_multiple_of_sectorsize)
        do_IO_directly_to_user_buffer_without_memcpy

is not *that* complicated.

I think that we can get rid of O_DIRECT's peculiar requirements "you
*must* not cache me" + "you *must* write me directly to bare metal" by
replacing it with O_STREAM ("*advice* to not cache me") + O_SYNC
("write() should return only when data is written to storage, not
sooner").

Why? Because these O_DIRECT "musts" are rather unusual and overkill.
Apps should not have that much control over what kernel does internally;
and also O_DIRECT was mixing shampoo and conditioner in one bottle
(no-cache and sync writes) - bad API.
--
vda
Re: O_DIRECT question
Denis Vlasenko <[EMAIL PROTECTED]> wrote:
> On Friday 26 January 2007 19:23, Bill Davidsen wrote:
>> Denis Vlasenko wrote:
>> > On Thursday 25 January 2007 21:45, Michael Tokarev wrote:
>> >> But even single-threaded I/O but in large quantities benefits from
>> >> O_DIRECT significantly, and I pointed this out before.
>> >
>> > Which shouldn't be true. There is no fundamental reason why
>> > ordinary writes should be slower than O_DIRECT.
>> >
>> Other than the copy to buffer taking CPU and memory resources.
>
> It is not required by any standard that I know. Kernel can be smarter
> and avoid that if it can.

The kernel can also solve the halting problem if it can.

Do you really think an entropy estimation code on all access patterns
in the system will be free as in beer, or be able to predict the access
pattern of random applications?
--
Top 100 things you don't want the sysadmin to say:
86. What do you mean that wasn't a copy?

Friß, Spammer: [EMAIL PROTECTED] [EMAIL PROTECTED]
Re: O_DIRECT question
On Friday 26 January 2007 19:23, Bill Davidsen wrote:
> Denis Vlasenko wrote:
> > On Thursday 25 January 2007 21:45, Michael Tokarev wrote:
> >> Phillip Susi wrote:
> >>> Denis Vlasenko wrote:
> >>>> You mean "You can use aio_write" ?
> >>> Exactly. You generally don't use O_DIRECT without aio. Combining
> >>> the two is what gives the big win.
> >> Well, it's not only aio. Multithreaded I/O also helps a lot -- all
> >> this, say, to utilize a raid array with many spindles.
> >>
> >> But even single-threaded I/O but in large quantities benefits from
> >> O_DIRECT significantly, and I pointed this out before.
> >
> > Which shouldn't be true. There is no fundamental reason why
> > ordinary writes should be slower than O_DIRECT.
> >
> Other than the copy to buffer taking CPU and memory resources.

It is not required by any standard that I know. Kernel can be smarter
and avoid that if it can.
--
vda
Re: O_DIRECT question
On Friday 26 January 2007 18:05, Phillip Susi wrote:
> Denis Vlasenko wrote:
> > Which shouldn't be true. There is no fundamental reason why
> > ordinary writes should be slower than O_DIRECT.
>
> Again, there IS a reason: O_DIRECT eliminates the cpu overhead of the
> kernel-user copy,

You assume that ordinary read()/write() is *required* to do the copying.
It isn't. The kernel is allowed to do direct DMAing in this case too.

> and when coupled with multithreading or aio, allows
> the IO queues to be kept full with useful transfers at all times.

Again, ordinary I/O is no different. Especially on fds opened with
O_SYNC, write() will behave very similarly to an O_DIRECT one - data is
guaranteed to hit the disk before write() returns.

> Normal read/write requires the kernel to buffer and guess access

No, it doesn't *require* that.

> patterns correctly to perform read ahead and write behind perfectly to
> keep the queues full. In practice, this does not happen perfectly all
> of the time, or even most of the time, so it slows things down.

So let's fix the kernel for everyone's benefit instead of "give us an
API specifically for our needs".
--
vda
Re: O_DIRECT question
Denis Vlasenko wrote:
> On Thursday 25 January 2007 21:45, Michael Tokarev wrote:
>> Phillip Susi wrote:
>>> Denis Vlasenko wrote:
>>>> You mean "You can use aio_write" ?
>>> Exactly. You generally don't use O_DIRECT without aio. Combining the
>>> two is what gives the big win.
>> Well, it's not only aio. Multithreaded I/O also helps a lot -- all
>> this, say, to utilize a raid array with many spindles.
>>
>> But even single-threaded I/O but in large quantities benefits from
>> O_DIRECT significantly, and I pointed this out before.
>
> Which shouldn't be true. There is no fundamental reason why
> ordinary writes should be slower than O_DIRECT.

Other than the copy to buffer taking CPU and memory resources.

--
Bill Davidsen <[EMAIL PROTECTED]>
"We have more to fear from the bungling of the incompetent than from
the machinations of the wicked." - from Slashdot
Re: O_DIRECT question
Denis Vlasenko wrote:
> Which shouldn't be true. There is no fundamental reason why
> ordinary writes should be slower than O_DIRECT.

Again, there IS a reason: O_DIRECT eliminates the cpu overhead of the
kernel-user copy, and when coupled with multithreading or aio, allows
the IO queues to be kept full with useful transfers at all times.
Normal read/write requires the kernel to buffer and guess access
patterns correctly to perform read ahead and write behind perfectly to
keep the queues full. In practice, this does not happen perfectly all
of the time, or even most of the time, so it slows things down.
Re: O_DIRECT question
Mark Lord wrote:
> You guys need to backup in this thread.
>
> Every example of O_DIRECT here could be replaced with calls to mmap(),
> msync(), and madvise() (or posix_fadvise).
>
> In addition to being at least as fast as O_DIRECT, these have the
> added benefit of using the page cache (avoiding reads for data already
> present, handling multiple users of the same data, etc..).

Please actually _read_ the thread. In every one of my posts I have
shown why this is not the case. To briefly rehash the core of the
argument: there is no way to asynchronously manage IO with mmap, msync,
madvise -- instead you take page faults or otherwise block, thus
stalling the pipeline.
Re: O_DIRECT question
Mark Lord wrote:
> You guys need to backup in this thread.
>
> Every example of O_DIRECT here could be replaced with
> calls to mmap(), msync(), and madvise() (or posix_fadvise).

No. How about handling IO errors? There is no practical way to do that
with mmap().

> In addition to being at least as fast as O_DIRECT,
> these have the added benefit of using the page cache
> (avoiding reads for data already present, handling multiple
> users of the same data, etc..).
Re: O_DIRECT question
You guys need to backup in this thread.

Every example of O_DIRECT here could be replaced with
calls to mmap(), msync(), and madvise() (or posix_fadvise).

In addition to being at least as fast as O_DIRECT,
these have the added benefit of using the page cache
(avoiding reads for data already present, handling multiple
users of the same data, etc..).
Re: O_DIRECT question
Denis Vlasenko wrote:
> Well, I too currently work with Oracle. Apparently people who wrote
> damn thing have very, eh, Oracle-centric world-view. "We want direct
> writes to the disk. Period." Why? Does it make sense? Are there
> better ways? - nothing. They think they know better.

I fear you are taking the Windows approach, that the computer is there
to serve the o/s and applications have to do things the way the o/s
wants. As opposed to the UNIX way, where you can either be clever or
stupid; the o/s is there to allow you to use the hardware, not be your
mother.

Currently applications have the option of letting the o/s make decisions
via open/read/write, or letting the o/s make decisions and tell the
application via aio, or using O_DIRECT and having full control over the
process. And that's exactly as it should be. It's not up to the o/s to
be mother.

> (And let's not even start on why oracle ignores SIGTERM. Apparently
> Unix rules aren't for them. They're too big to play by rules.)

Any process can ignore SIGTERM, or do a significant amount of cleanup
before exit()ing. Complex operations need to be completed or unwound.
Why single out Oracle? Other applications may also do that, with more
or less valid reasons.

--
Bill Davidsen <[EMAIL PROTECTED]>
"We have more to fear from the bungling of the incompetent than from
the machinations of the wicked." - from Slashdot
Re: O_DIRECT question
On Thursday 25 January 2007 21:45, Michael Tokarev wrote:
> Phillip Susi wrote:
> > Denis Vlasenko wrote:
> >> You mean "You can use aio_write" ?
> >
> > Exactly. You generally don't use O_DIRECT without aio. Combining the
> > two is what gives the big win.
>
> Well, it's not only aio. Multithreaded I/O also helps a lot -- all
> this, say, to utilize a raid array with many spindles.
>
> But even single-threaded I/O but in large quantities benefits from
> O_DIRECT significantly, and I pointed this out before.

Which shouldn't be true. There is no fundamental reason why
ordinary writes should be slower than O_DIRECT.
--
vda
Re: O_DIRECT question
Phillip Susi wrote:
> Denis Vlasenko wrote:
>> You mean "You can use aio_write" ?
>
> Exactly. You generally don't use O_DIRECT without aio. Combining the
> two is what gives the big win.

Well, it's not only aio. Multithreaded I/O also helps a lot -- all this,
say, to utilize a raid array with many spindles.

But even single-threaded I/O but in large quantities benefits from
O_DIRECT significantly, and I pointed this out before. It's like
enabling a write cache on disk AND doing intensive random writes: the
cache, surprisingly, slows the whole thing down by 5..10%.

/mjt
Re: O_DIRECT question
Denis Vlasenko wrote:
> You mean "You can use aio_write" ?

Exactly. You generally don't use O_DIRECT without aio. Combining the
two is what gives the big win.
Re: O_DIRECT question
On Thursday 25 January 2007 20:28, Phillip Susi wrote:
> > Ahhh shit, are you saying that fdatasync will wait until writes
> > *by all other processes* to this file will hit the disk?
> > Is that true?
>
> I think all processes yes, but certainly all writes to this file by
> this process. That means you have to sync for every write, which
> means you block. Blocking stalls the pipeline.

I don't understand you here. Suppose fdatasync() is "do not return
until all cached writes to this file *done by current process* hit the
disk (i.e. cached write data from other concurrent processes is not
waited for), report success or error code". Then

write(fd_O_DIRECT, buf, sz) - will wait until buf's data hit the disk
write(fd, buf, sz)          - potentially will return sooner, but
fdatasync(fd)               - will wait until buf's data hit the disk

Looks the same to me.

> > If you opened a file and are doing only O_DIRECT writes, you
> > *always* have your written data flushed, by each write().
> > How is it different from writes done using
> > "normal" write() + fdatasync() pairs?
>
> Because you can do writes async, but not fdatasync ( unless there is
> an async version I don't know about ).

You mean "You can use aio_write" ?
--
vda
Re: O_DIRECT question
Denis Vlasenko wrote:
> If you opened a file and are doing only O_DIRECT writes, you
> *always* have your written data flushed, by each write().
> How is it different from writes done using
> "normal" write() + fdatasync() pairs?

Because you can do writes async, but not fdatasync ( unless there is an
async version I don't know about ).

> Ahhh shit, are you saying that fdatasync will wait until writes
> *by all other processes* to this file will hit the disk?
> Is that true?

I think all processes yes, but certainly all writes to this file by this
process. That means you have to sync for every write, which means you
block. Blocking stalls the pipeline.
Re: O_DIRECT question
On Thursday 25 January 2007 16:44, Phillip Susi wrote:
> Denis Vlasenko wrote:
> > I will still disagree on this point (on point "use O_DIRECT, it's
> > faster"). There is no reason why O_DIRECT should be faster than
> > "normal" read/write to large, aligned buffer. If O_DIRECT is faster
> > on today's kernel, then Linux' read()/write() can be optimized more.
>
> Ahh but there IS a reason for it to be faster: the application knows
> what data it will require, so it should tell the kernel rather than
> ask it to guess. Even if you had the kernel playing vmsplice games to
> avoid the copy to user space ( which still has a fair amount of
> overhead ), then you still have the problem of the kernel having to
> guess what data the application will require next, and try to fetch it
> early. Then when the application requests the data, if it is not
> already in memory, the application blocks until it is, and blocking
> stalls the pipeline.
>
> > (I hoped that they can be made even *faster* than O_DIRECT, but as I
> > said, you convinced me with your "error reporting" argument that
> > reads must still block until entire buffer is read. Writes can avoid
> > that - apps can do fdatasync/whatever to make sync writes & error
> > checks if they want).
>
> fdatasync() is not acceptable either because it flushes the entire
> file.

If you opened a file and are doing only O_DIRECT writes, you *always*
have your written data flushed, by each write(). How is it different
from writes done using "normal" write() + fdatasync() pairs?

> This does not allow the application to control the ordering of various
> writes unless it limits itself to a single write/fdatasync pair at a
> time. Further, fdatasync again blocks the application.

Ahhh shit, are you saying that fdatasync will wait until writes *by all
other processes* to this file will hit the disk? Is that true?
--
vda
Re: O_DIRECT question
Denis Vlasenko wrote:
> I will still disagree on this point (on point "use O_DIRECT, it's
> faster"). There is no reason why O_DIRECT should be faster than
> "normal" read/write to large, aligned buffer. If O_DIRECT is faster
> on today's kernel, then Linux' read()/write() can be optimized more.

Ahh but there IS a reason for it to be faster: the application knows
what data it will require, so it should tell the kernel rather than ask
it to guess. Even if you had the kernel playing vmsplice games to avoid
the copy to user space ( which still has a fair amount of overhead ),
then you still have the problem of the kernel having to guess what data
the application will require next, and try to fetch it early. Then when
the application requests the data, if it is not already in memory, the
application blocks until it is, and blocking stalls the pipeline.

> (I hoped that they can be made even *faster* than O_DIRECT, but as I
> said, you convinced me with your "error reporting" argument that reads
> must still block until entire buffer is read. Writes can avoid that -
> apps can do fdatasync/whatever to make sync writes & error checks if
> they want).

fdatasync() is not acceptable either because it flushes the entire file.
This does not allow the application to control the ordering of various
writes unless it limits itself to a single write/fdatasync pair at a
time. Further, fdatasync again blocks the application. With aio, the
application can keep several read/writes going in parallel, thus keeping
the pipeline full. Even if the io were not O_DIRECT, and the kernel
played vmsplice games to avoid the copy, it would still have more
overhead, complexity and, I think, very little gain in most cases.
Re: O_DIRECT question
On Monday 22 January 2007 17:17, Phillip Susi wrote:
> > You do not need to know which read() exactly failed due to bad disk.
> > Filename and offset from the start is enough. Right?
> >
> > So, SIGIO/SIGBUS can provide that, and if your handler is of
> > void (*sa_sigaction)(int, siginfo_t *, void *);
> > style, you can get fd, memory address of the fault, etc.
> > Probably kernel can even pass file offset somewhere in siginfo_t...
>
> Sure... now what does your signal handler have to do in order to
> handle this error in such a way as to allow the one request to be
> failed and the task to continue handling other requests? I don't
> think this is even possible, let alone clean.

Actually, you have convinced me on this. While it is possible to report
the error to userspace, it will be highly nontrivial (read: bug-prone)
for userspace to catch and act on the errors.

> > You think "Oracle". But this application may very well be
> > not Oracle, but diff, or dd, or KMail. I don't want to care.
> > I want all big writes to be efficient, not just those done by Oracle.
> > *Including* single threaded ones.
>
> Then redesign those applications to use aio and O_DIRECT. Incidentally
> I have hacked up dd to do just that and have some very nice performance
> numbers as a result.

I will still disagree on this point (on point "use O_DIRECT, it's
faster"). There is no reason why O_DIRECT should be faster than
"normal" read/write to large, aligned buffer. If O_DIRECT is faster on
today's kernel, then Linux' read()/write() can be optimized more.

(I hoped that they can be made even *faster* than O_DIRECT, but as I
said, you convinced me with your "error reporting" argument that reads
must still block until entire buffer is read. Writes can avoid that -
apps can do fdatasync/whatever to make sync writes & error checks if
they want).
--
vda
Re: O_DIRECT question
Denis Vlasenko wrote:
> The difference is that you block exactly when you try to access data
> which is not there yet, not sooner (potentially much sooner). If
> application (e.g. database) needs to know whether data is _really_
> there, it should use aio_read (or something better, something which
> doesn't use signals. Do we have this 'something'? I honestly don't
> know).

The application _IS_ using aio, which is why it can go and perform other
work while it waits to be told that the read has completed. This is not
possible with mmap because the task is blocked while faulting in pages,
and unless it tries to access the pages, they won't be faulted in.

> In some cases, even this is not needed because you don't have any
> other things to do, so you just do read() (which returns early), and
> chew on data. If your CPU is fast enough and processing of data is
> light enough so that it outruns disk - big deal, you block in page
> fault handler whenever a page is not read for you in time. If CPU
> isn't fast enough, your CPU and disk subsystem are nicely working in
> parallel.

Being blocked in the page fault handler means the cpu is now idle
because you can't go chew on data that _IS_ in core. The aio + O_DIRECT
combination allows you to control when IO is started rather than rely on
the kernel to decide when is a good time for readahead, and to KNOW when
that IO is done so you can chew on the data.

> With O_DIRECT, you alternate: "CPU is idle, disk is working" / "CPU is
> working, disk is idle".

You have this completely backwards. With mmap this is what you get,
because you chew data, page fault... chew data... page fault...

> What do you want to do on I/O error? I guess you cannot do much - any
> sensible db will shutdown itself. When your data storage starts to
> fail, it's pointless to continue running.

Ever hear of error recovery? A good db will be able to cope with one or
two bad blocks, or at the very least continue operating the other tables
or databases it is hosting, or flush transactions and switch to read
only mode, or any number of things other than abort().

> You do not need to know which read() exactly failed due to bad disk.
> Filename and offset from the start is enough. Right?
>
> So, SIGIO/SIGBUS can provide that, and if your handler is of
> void (*sa_sigaction)(int, siginfo_t *, void *);
> style, you can get fd, memory address of the fault, etc.
> Probably kernel can even pass file offset somewhere in siginfo_t...

Sure... now what does your signal handler have to do in order to handle
this error in such a way as to allow the one request to be failed and
the task to continue handling other requests? I don't think this is
even possible, let alone clean.

> You can still be multithreaded. The point is, with O_DIRECT you _are_
> _forced_ to be multithreaded, or else performance will suck.

Or use aio. Simple read/write with the kernel trying to outsmart the
application is nice for very simple applications, but it does not
provide very good performance. This is why we have aio and O_DIRECT:
because the application can manage the IO better than the kernel,
because it actually knows what it needs and when. Yes, the application
ends up being more complex, but that is the price you pay. You simply
can't get it perfect in a general purpose kernel that has to guess what
the application is really trying to do.

> You think "Oracle". But this application may very well be not Oracle,
> but diff, or dd, or KMail. I don't want to care. I want all big
> writes to be efficient, not just those done by Oracle. *Including*
> single threaded ones.

Then redesign those applications to use aio and O_DIRECT. Incidentally
I have hacked up dd to do just that and have some very nice performance
numbers as a result.

> Well, I too currently work with Oracle. Apparently people who wrote
> damn thing have very, eh, Oracle-centric world-view. "We want direct
> writes to the disk. Period." Why? Does it make sense? Are there
> better ways? - nothing. They think they know better.

Nobody has shown otherwise to date.
Re: O_DIRECT question
Andrea Arcangeli wrote:
> Linus may be right that perhaps one day the CPU will be so much faster
> than disk that such a copy will not be measurable and then O_DIRECT
> could be downgraded to O_STREAMING or an fadvise. If such a day will
> come by, probably that same day Dr. Tanenbaum will be finally right
> about his OS design too.

Dr. T. is probably right with his OS design; it's just that people
aren't ready for it yet.

Thanks!

--
Al
Re: O_DIRECT question
Denis Vlasenko wrote:
> What will happen if we just make open ignore O_DIRECT? ;)
> And then anyone who feels sad about it is advised to do it like
> described here: http://lkml.org/lkml/2002/5/11/58

Then databases and other high performance IO users will be broken.
Most of Linus's rant there is being rehashed now in this thread, and it
has been pointed out that using mmap instead is unacceptable because it
is inherently _synchronous_, the app can not tolerate the page faults
on read, and handling IO errors during the page fault is
impossible/highly problematic.
Re: O_DIRECT question
Hello everyone,

This is a long thread about O_DIRECT surprisingly without a single
bugreport in it; that's a good sign that O_DIRECT is starting to work
well in 2.6 too ;)

On Fri, Jan 12, 2007 at 02:47:48PM -0800, Andrew Morton wrote:
> On Fri, 12 Jan 2007 15:35:09 -0700
> Erik Andersen <[EMAIL PROTECTED]> wrote:
>
> > On Fri Jan 12, 2007 at 05:09:09PM -0500, Linus Torvalds wrote:
> > > I suspect a lot of people actually have other reasons to avoid
> > > caches.
> > >
> > > For example, the reason to do O_DIRECT may well not be that you
> > > want to avoid caching per se, but simply because you want to limit
> > > page cache activity. In which case O_DIRECT "works", but it's
> > > really the wrong thing to do. We could export other ways to do
> > > what people ACTUALLY want, that doesn't have the downsides.
> >
> > I was rather fond of the old O_STREAMING patch by Robert Love,
>
> That was an akpm patch which I did for the Digeo kernel. Robert
> picked it up to dehackify it and get it into mainline, but we ended up
> deciding that posix_fadvise() was the way to go because it's
> standards-based.
>
> It's a bit more work in the app to use posix_fadvise() well. But the
> results will be better. The app should also use sync_file_range()
> intelligently to control its pagecache use.
>
> The problem with all of these things is that the application needs to
> be changed, and people often cannot do that. If we want a general way
> of

And if the application needs to be changed then IMHO it sounds better to
go the last mile and to use O_DIRECT instead of O_STREAMING to run in
zerocopy. Benchmarks have been posted here as well to show what kind of
difference O_DIRECT can make. O_STREAMING really shouldn't exist and
all O_STREAMING users should be converted to O_DIRECT.
The only reason O_DIRECT exists is to bypass the pagecache and run in zerocopy: to avoid all pagecache lookups and locking, to preserve cpu caches, to avoid losing smp scalability on the memory bus in non-NUMA systems, and to avoid the general cpu overhead of copying the data with the cpu for no good reason. The cache-pollution avoidance that O_STREAMING and fadvise can also provide is an almost uninteresting feature by comparison. I'm afraid databases aren't totally stupid here in using O_DIRECT: the caches they keep in ram aren't necessarily a 1:1 mapping of the on-disk data, so replacing O_DIRECT with a MAP_SHARED of the source file wouldn't be the best even if they could be convinced to trust the OS instead of insisting on bypassing it (and if they could combine MAP_SHARED with asyncio somehow). They don't have problems trusting the OS when they map tmpfs as MAP_SHARED, after all... Why waste time copying the data through the pagecache if the pagecache itself won't be useful when the db is properly tuned?

Linus may be right that perhaps one day the CPU will be so much faster than disk that such a copy will not be measurable, and then O_DIRECT could be downgraded to O_STREAMING or an fadvise. If such a day comes, probably that same day Dr. Tanenbaum will finally be right about his OS design too. Storage speed is growing along with cpu speeds, especially with contiguous I/O and fast raid storage, so I don't see it as very likely that we can ignore those memory copies any time soon. Perhaps an average amd64 desktop system with a single sata disk may never get a real benefit from O_DIRECT compared to O_STREAMING, but that's not the point, as linux doesn't only run on desktops with a single SATA disk running at only 50M/sec (and abysmal performance while seeking).

With regard to the locking mess, O_DIRECT already falls back to buffered mode while creating new blocks and uses proper locking to serialize against i_size changes (by sct).
Filling holes and i_size changes are the forbidden sins of O_DIRECT. The rest is just a matter of cache invalidates or cache flushes run at the right time. With more recent 2.6 changes, even further complexity has been introduced to allow mapped cache to see O_DIRECT writes; I've never been convinced that this was really useful. There was nothing wrong in having a not-uptodate page mapped in userland (except to work around an artificial BUG_ON that tried to enforce that artificial invariant for no apparent required reason), but it should work ok and it can be seen as a new feature.
Re: O_DIRECT question
On Sunday 21 January 2007 13:09, Michael Tokarev wrote:
> Denis Vlasenko wrote:
> > On Saturday 20 January 2007 21:55, Michael Tokarev wrote:
> >> Denis Vlasenko wrote:
> >>> On Thursday 11 January 2007 18:13, Michael Tokarev wrote:
> example, which isn't quite possible now from userspace. But as long as
> O_DIRECT actually writes data before returning from write() call (as it
> seems to be the case at least with a normal filesystem on a real block
> device - I don't touch corner cases like nfs here), it's pretty much
> THE ideal solution, at least from the application (developer) standpoint.
> >>> Why do you want to wait while 100 megs of data are being written?
> >>> You _have to_ have threaded db code in order to not waste
> >>> gobs of CPU time on UP + even with that you eat context switch
> >>> penalty anyway.
> >> Usually it's done using aio ;)
> >>
> >> It's not that simple really.
> >>
> >> For reads, you have to wait for the data anyway before doing something
> >> with it. Omitting reads for now.
> >
> > Really? All 100 megs _at once_? Linus described a fairly simple (conceptually)
> > idea here: http://lkml.org/lkml/2002/5/11/58
> > In short, a page-aligned read buffer can be just unmapped,
> > with the page fault handler catching accesses to yet-unread data.
> > As data comes from disk, it gets mapped back into the process'
> > address space.
> > This way read() returns almost immediately and the CPU is free to do
> > something useful.
>
> And what does the application do during that page fault? Wait for the read
> to actually complete? How is it different from a regular (direct or not)
> read?

The difference is that you block exactly when you try to access data which is not there yet, not sooner (potentially much sooner). If the application (e.g. a database) needs to know whether data is _really_ there, it should use aio_read (or something better, something which doesn't use signals. Do we have this 'something'? I honestly don't know).
In some cases even this is not needed, because you don't have any other things to do, so you just do read() (which returns early) and chew on the data. If your CPU is fast enough and processing of the data is light enough that it outruns the disk - big deal, you block in the page fault handler whenever a page hasn't been read for you in time. If the CPU isn't fast enough, your CPU and disk subsystem are nicely working in parallel. With O_DIRECT, you alternate: "CPU is idle, disk is working" / "CPU is working, disk is idle".

> Well, it IS different: now we can't predict *when* exactly we'll sleep waiting
> for the read to complete. And also, now we're in an unknown-corner-case when
> an I/O error occurs, too (I/O errors interact badly with things like mmap, and
> this looks more like mmap than like an actual read).
>
> Yes, this way we'll fix the problems in the current O_DIRECT way of doing things -
> all those races and design stupidity etc. Yes it may work, provided those
> "corner cases" like the I/O error problems will be fixed.

What do you want to do on I/O error? I guess you cannot do much - any sensible db will shut itself down. When your data storage starts to fail, it's pointless to continue running. You do not need to know which read() exactly failed due to a bad disk. Filename and offset from the start is enough. Right? So, SIGIO/SIGBUS can provide that, and if your handler is of the void (*sa_sigaction)(int, siginfo_t *, void *); style, you can get the fd, the memory address of the fault, etc. Probably the kernel can even pass the file offset somewhere in siginfo_t...

> And yes, sometimes
> it's not really that interesting to know when exactly we'll sleep actually
> waiting for the I/O - during read or during some memory access...

It differs from a performance perspective, as discussed above.

> There may be other reasons to "want" those extra context switches.
> I mentioned above that oracle doesn't use threads, but processes.

You can still be multithreaded.
The point is, with O_DIRECT you _are forced to be_ multithreaded, or else performance will suck.

> > Assume that we have "clever writes" like Linus described.
> >
> > /* something like "caching i/o over this fd is mostly useless" */
> > /* (looks like this API is easier to transition to
> >  * than fadvise etc. - it's "looks like" O_DIRECT) */
> > fd = open(..., flags|O_STREAM);
> > ...
> > /* Starts writeout immediately due to O_STREAM,
> >  * marks buf100meg's pages R/O to catch modifications,
> >  * but doesn't block! */
> > write(fd, buf100meg, 100*1024*1024);
>
> And how do we know when the write completes?
>
> > /* We are free to do something useful in parallel */
> > sort();
>
> .. which is done in another process, already started.

You think "Oracle". But this application may very well be not Oracle, but diff, or dd, or KMail. I don't want to care. I want all big writes to be efficient, not just those done by Oracle. *Including* single-threaded ones.

> > Why we bothered to write Linux at
Re: O_DIRECT question
Denis Vlasenko wrote:
> On Saturday 20 January 2007 21:55, Michael Tokarev wrote:
>> Denis Vlasenko wrote:
>>> On Thursday 11 January 2007 18:13, Michael Tokarev wrote:
>>>> example, which isn't quite possible now from userspace. But as long as
>>>> O_DIRECT actually writes data before returning from write() call (as it
>>>> seems to be the case at least with a normal filesystem on a real block
>>>> device - I don't touch corner cases like nfs here), it's pretty much
>>>> THE ideal solution, at least from the application (developer) standpoint.
>>> Why do you want to wait while 100 megs of data are being written?
>>> You _have to_ have threaded db code in order to not waste
>>> gobs of CPU time on UP + even with that you eat context switch
>>> penalty anyway.
>> Usually it's done using aio ;)
>>
>> It's not that simple really.
>>
>> For reads, you have to wait for the data anyway before doing something
>> with it. Omitting reads for now.
>
> Really? All 100 megs _at once_? Linus described a fairly simple (conceptually)
> idea here: http://lkml.org/lkml/2002/5/11/58
> In short, a page-aligned read buffer can be just unmapped,
> with the page fault handler catching accesses to yet-unread data.
> As data comes from disk, it gets mapped back into the process'
> address space.
> This way read() returns almost immediately and the CPU is free to do
> something useful.

And what does the application do during that page fault? Wait for the read to actually complete? How is that different from a regular (direct or not) read?

Well, it IS different: now we can't predict *when* exactly we'll sleep waiting for the read to complete. And also, now we're in an unknown-corner-case when an I/O error occurs, too (I/O errors interact badly with things like mmap, and this looks more like mmap than like an actual read).

Yes, this way we'll fix the problems in the current O_DIRECT way of doing things - all those races and design stupidity etc. Yes it may work, provided those "corner cases" like the I/O error problems will be fixed.
And yes, sometimes it's not really that interesting to know when exactly we'll sleep actually waiting for the I/O - during read or during some memory access...

Now I wonder how it should look from an application's standpoint. It has its "smart" cache. A worker thread (a process in the case of oracle - there's a very good reason why they don't use threads, and this architecture saved our data several times already - but that's an entirely different topic and not really relevant here) -- so, a worker process which executes requests coming from a user application wants to have (read) access to a db block (usually 8Kb in size, but can be 4..32Kb - definitely not 100megs), where the requested data is located. It checks whether this block is in the cache, and if it's not, it is read from the disk and added to the cache. The cache resides in shared memory (so that other processes will be able to access it too). With the proposed solution, it looks even better - that `read()' operation returns immediately, so all other processes which want the same page at the same time will start "using" it immediately. Provided they all can access the memory. This is how a (large) index access or table-access-by-rowid (after an index lookup, for example) is done - requesting usually just a single block in some random place of a file.

There's another access pattern - like full table scans, where a lot of data is read sequentially. It's done in chunks, say, 64 blocks (8Kb each) at a time. We read a chunk of data, do something on it, and discard it (caching it isn't a very good idea). For this access pattern, the proposal should work fairly well. Except for the I/O error handling, maybe.

By the way - the *whole* cache thing may be implemented in the application *using the in-kernel page cache*, with clever usage of mmap() and friends.
Provided the whole database fits into an address space, or something like that ;)

>> For writes, it's not that problematic - even 10-15 threads is nothing
>> compared with the I/O (O in this case) itself -- that context switch
>> penalty.
>
> Well, if you have some CPU intensive thing to do (e.g. sort),
> why not benefit from lack of extra context switch?

There may be other reasons to "want" those extra context switches. I mentioned above that oracle doesn't use threads, but processes. I don't know why exactly it's done this way, but I know how it saved our data. The short answer is this: bugs ;) A process doing something with the data and generating write requests to the db goes crazy - some memory corruption, doing some bad things... But that process does not do any writes directly - instead, it generates those write requests in shared memory, and ANOTHER process actually does the writing. AND verifies that the requests actually look sane. And detects the "bad" writes, and immediately prevents data corruption. That other (dbwr) process does much simpler things, and has its own address space which isn
Re: O_DIRECT question
On Saturday 20 January 2007 21:55, Michael Tokarev wrote:
> Denis Vlasenko wrote:
> > On Thursday 11 January 2007 18:13, Michael Tokarev wrote:
> >> example, which isn't quite possible now from userspace. But as long as
> >> O_DIRECT actually writes data before returning from write() call (as it
> >> seems to be the case at least with a normal filesystem on a real block
> >> device - I don't touch corner cases like nfs here), it's pretty much
> >> THE ideal solution, at least from the application (developer) standpoint.
> >
> > Why do you want to wait while 100 megs of data are being written?
> > You _have to_ have threaded db code in order to not waste
> > gobs of CPU time on UP + even with that you eat context switch
> > penalty anyway.
>
> Usually it's done using aio ;)
>
> It's not that simple really.
>
> For reads, you have to wait for the data anyway before doing something
> with it. Omitting reads for now.

Really? All 100 megs _at once_? Linus described a fairly simple (conceptually) idea here: http://lkml.org/lkml/2002/5/11/58
In short, a page-aligned read buffer can be just unmapped, with the page fault handler catching accesses to yet-unread data. As data comes from disk, it gets mapped back into the process' address space. This way read() returns almost immediately and the CPU is free to do something useful.

> For writes, it's not that problematic - even 10-15 threads is nothing
> compared with the I/O (O in this case) itself -- that context switch
> penalty.

Well, if you have some CPU intensive thing to do (e.g. sort), why not benefit from the lack of an extra context switch?

Assume that we have "clever writes" like Linus described.

/* something like "caching i/o over this fd is mostly useless" */
/* (looks like this API is easier to transition to
 * than fadvise etc. - it's "looks like" O_DIRECT) */
fd = open(..., flags|O_STREAM);
...
/* Starts writeout immediately due to O_STREAM,
 * marks buf100meg's pages R/O to catch modifications,
 * but doesn't block!
 */
write(fd, buf100meg, 100*1024*1024);
/* We are free to do something useful in parallel */
sort();

> > I hope you agree that threaded code is not ideal performance-wise
> > - async IO is better. O_DIRECT is strictly sync IO.
>
> Hmm.. Now I'm confused.
>
> For example, oracle uses aio + O_DIRECT. It seems to be working... ;)
> As an alternative, there are multiple single-threaded db_writer processes.
> Why do you say O_DIRECT is strictly sync?

I mean that an O_DIRECT write() blocks until the I/O really is done. A normal write can block for much less, or not at all.

> In either case - I provided some real numbers in this thread before.
> Yes, O_DIRECT has its problems, even security problems. But the thing
> is - it is working, and working WAY better - from the performance point
> of view - than "indirect" I/O, and currently there's no alternative that
> works as good as O_DIRECT.

Why did we bother to write Linux at all? There were other Unixes which worked ok.
--
vda
Re: O_DIRECT question
Denis Vlasenko wrote:
> On Thursday 11 January 2007 18:13, Michael Tokarev wrote:
>> example, which isn't quite possible now from userspace. But as long as
>> O_DIRECT actually writes data before returning from write() call (as it
>> seems to be the case at least with a normal filesystem on a real block
>> device - I don't touch corner cases like nfs here), it's pretty much
>> THE ideal solution, at least from the application (developer) standpoint.
>
> Why do you want to wait while 100 megs of data are being written?
> You _have to_ have threaded db code in order to not waste
> gobs of CPU time on UP + even with that you eat context switch
> penalty anyway.

Usually it's done using aio ;)

It's not that simple really.

For reads, you have to wait for the data anyway before doing something with it. Omitting reads for now.

For writes, it's not that problematic - even 10-15 threads is nothing compared with the I/O (O in this case) itself -- that context switch penalty.

> I hope you agree that threaded code is not ideal performance-wise
> - async IO is better. O_DIRECT is strictly sync IO.

Hmm.. Now I'm confused.

For example, oracle uses aio + O_DIRECT. It seems to be working... ;) As an alternative, there are multiple single-threaded db_writer processes. Why do you say O_DIRECT is strictly sync?

In either case - I provided some real numbers in this thread before. Yes, O_DIRECT has its problems, even security problems. But the thing is - it is working, and working WAY better - from the performance point of view - than "indirect" I/O, and currently there's no alternative that works as well as O_DIRECT.

Thanks.

/mjt
Re: O_DIRECT question
On Sunday 14 January 2007 10:11, Nate Diller wrote:
> On 1/12/07, Andrew Morton <[EMAIL PROTECTED]> wrote:
> Most applications don't get the kind of performance analysis that
> Digeo was doing, and even then, it's rather lucky that we caught that.
> So I personally think it'd be best for libc or something to simulate
> the O_STREAM behavior if you ask for it. That would simplify things
> for the most common case, and have the side benefit of reducing the
> amount of extra code an application would need in order to take
> advantage of that feature.

Sounds like you are saying that making O_DIRECT really mean O_STREAM will work for everybody (including db people, except that they will moan a lot about "it isn't _real_ O_DIRECT!!! Linux suxxx"). I don't care about that.
--
vda
Re: O_DIRECT question
On Thursday 11 January 2007 18:13, Michael Tokarev wrote:
> example, which isn't quite possible now from userspace. But as long as
> O_DIRECT actually writes data before returning from write() call (as it
> seems to be the case at least with a normal filesystem on a real block
> device - I don't touch corner cases like nfs here), it's pretty much
> THE ideal solution, at least from the application (developer) standpoint.

Why do you want to wait while 100 megs of data are being written? You _have to_ have threaded db code in order to not waste gobs of CPU time on UP + even with that you eat context switch penalty anyway.

I hope you agree that threaded code is not ideal performance-wise - async IO is better. O_DIRECT is strictly sync IO.
--
vda
Re: O_DIRECT question
On Thursday 11 January 2007 16:50, Linus Torvalds wrote:
> On Thu, 11 Jan 2007, Nick Piggin wrote:
> >
> > Speaking of which, why did we obsolete raw devices? And/or why not just
> > go with a minimal O_DIRECT on block device support? Not a rhetorical
> > question -- I wasn't involved in the discussions when they happened, so
> > I would be interested.
>
> Lots of people want to put their databases in a file. Partitions really
> weren't nearly flexible enough. So the whole raw device or O_DIRECT just
> to the block device thing isn't really helping any.
>
> > O_DIRECT is still crazily racy versus pagecache operations.
>
> Yes. O_DIRECT is really fundamentally broken. There's just no way to fix
> it sanely. Except by teaching people not to use it, and making the normal
> paths fast enough (and that _includes_ doing things like dropping caches
> more aggressively, but it probably would include more work on the device
> queue merging stuff etc etc).

What will happen if we just make open ignore O_DIRECT? ;) And then anyone who feels sad about it is advised to do it as described here: http://lkml.org/lkml/2002/5/11/58
--
vda
Re: O_DIRECT question
On Tue, 16 Jan 2007, Arjan van de Ven wrote:
> On Tue, 2007-01-16 at 21:26 +0100, Bodo Eggert wrote:
> > Helge Hafting <[EMAIL PROTECTED]> wrote:
> > > Michael Tokarev wrote:
> > >> But seriously - what about just disallowing non-O_DIRECT opens together
> > >> with O_DIRECT ones ?
> > >
> > > Please do not create a new local DOS attack.
> > > I open some important file, say /etc/resolv.conf
> > > with O_DIRECT and just sit on the open handle.
> > > Now nobody else can open that file because
> > > it is "busy" with O_DIRECT ?
> >
> > Suspend O_DIRECT access while non-O_DIRECT-fds are open, fdatasync on close?
>
> .. then any user can impact the operation, performance and reliability
> of the database application of another user... sounds like plugging one
> hole by making a bigger hole ;)

Don't allow other users to access your raw database files then, and if backup kicks in, pausing the database would DTRT for the integrity of the backup. For other applications, paused O_DIRECT may very well be a problem, but I can't think of one right now.
--
Logic: The art of being wrong with confidence...
Re: O_DIRECT question
I think one problem with mmap/msync is that they can't maintain i_size atomically the way a regular write does. So one needs to implement one's own i_size management in userspace.

thanks, Alex

> Side note: the only reason O_DIRECT exists is because database people are
> too used to it, because other OS's haven't had enough taste to tell them
> to do it right, so they've historically hacked their OS to get out of the
> way.
> As a result, our madvise and/or posix_fadvise interfaces may not be all
> that strong, because people sadly don't use them that much. It's a sad
> example of a totally broken interface (O_DIRECT) resulting in better
> interfaces not getting used, and then not getting as much development
> effort put into them.
> So O_DIRECT not only is a total disaster from a design standpoint (just
> look at all the crap it results in), it also indirectly has hurt better
> interfaces. For example, POSIX_FADV_NOREUSE (which _could_ be a useful and
> clean interface to make sure we don't pollute memory unnecessarily with
> cached pages after they are all done) ends up being a no-op ;/
> Sad. And it's one of those self-fulfilling prophecies. Still, I hope some
> day we can just rip the damn disaster out.
Re: O_DIRECT question
On Tue, 2007-01-16 at 21:26 +0100, Bodo Eggert wrote:
> Helge Hafting <[EMAIL PROTECTED]> wrote:
> > Michael Tokarev wrote:
> > >> But seriously - what about just disallowing non-O_DIRECT opens together
> > >> with O_DIRECT ones ?
> >
> > Please do not create a new local DOS attack.
> > I open some important file, say /etc/resolv.conf
> > with O_DIRECT and just sit on the open handle.
> > Now nobody else can open that file because
> > it is "busy" with O_DIRECT ?
>
> Suspend O_DIRECT access while non-O_DIRECT-fds are open, fdatasync on close?

.. then any user can impact the operation, performance and reliability of the database application of another user... sounds like plugging one hole by making a bigger hole ;)
Re: O_DIRECT question
On 1/12/07, Linus Torvalds <[EMAIL PROTECTED]> wrote:
> On Thu, 11 Jan 2007, Roy Huang wrote:
> >
> > On an embedded system, limiting page cache can relieve memory
> > fragmentation. There is a patch against 2.6.19, which limits every
> > opened file's page cache and the total pagecache. When the limit is
> > reached, it will release the page cache over the limit.
>
> I do think that something like this is probably a good idea, even on
> non-embedded setups. We historically couldn't do this, because mapped
> pages were too damn hard to remove, but that's obviously not much of a
> problem any more.
>
> However, the page-cache limit should NOT be some compile-time constant.
> It should work the same way the "dirty page" limit works, and probably
> just default to "feel free to use 90% of memory for page cache".
>
> Linus

The attached patch limits the page cache in a simple way:

1) If requesting memory for the page cache, set a flag to mark this kind of allocation:

static inline struct page *page_cache_alloc(struct address_space *x)
{
-	return __page_cache_alloc(mapping_gfp_mask(x));
+	return __page_cache_alloc(mapping_gfp_mask(x)|__GFP_PAGECACHE);
}

2) Have zone_watermark_ok enforce this limit:

+	if (alloc_flags & ALLOC_PAGECACHE){
+		min = min + VFS_CACHE_LIMIT;
+	}
+
	if (free_pages <= min + z->lowmem_reserve[classzone_idx])
		return 0;

3) So, when __alloc_pages is called by the page cache, pass ALLOC_PAGECACHE into get_page_from_freelist to trigger the pagecache-limit branch in zone_watermark_ok.

This approach works on my side. I'll make a new patch to make the limit tunable in procfs soon.
The following is the patch:
===================================================================
Index: mm/page_alloc.c
===================================================================
--- mm/page_alloc.c	(revision 2645)
+++ mm/page_alloc.c	(working copy)
@@ -892,6 +892,9 @@ failed:
 #define ALLOC_HARDER	0x10 /* try to alloc harder */
 #define ALLOC_HIGH	0x20 /* __GFP_HIGH set */
 #define ALLOC_CPUSET	0x40 /* check for correct cpuset */
+#define ALLOC_PAGECACHE	0x80 /* __GFP_PAGECACHE set */
+
+#define VFS_CACHE_LIMIT	0x400 /* limit VFS cache page */
 
 /*
  * Return 1 if free pages are above 'mark'. This takes into account the order
@@ -910,6 +913,10 @@ int zone_watermark_ok(struct zone *z, in
 	if (alloc_flags & ALLOC_HARDER)
 		min -= min / 4;
 
+	if (alloc_flags & ALLOC_PAGECACHE){
+		min = min + VFS_CACHE_LIMIT;
+	}
+
 	if (free_pages <= min + z->lowmem_reserve[classzone_idx])
 		return 0;
 	for (o = 0; o < order; o++) {
@@ -1000,8 +1007,12 @@ restart:
 		return NULL;
 	}
 
-	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order,
-				zonelist, ALLOC_WMARK_LOW|ALLOC_CPUSET);
+	if (gfp_mask & __GFP_PAGECACHE)
+		page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order,
+				zonelist, ALLOC_WMARK_LOW|ALLOC_CPUSET|ALLOC_PAGECACHE);
+	else
+		page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order,
+				zonelist, ALLOC_WMARK_LOW|ALLOC_CPUSET);
 	if (page)
 		goto got_pg;
@@ -1027,6 +1038,9 @@ restart:
 	if (wait)
 		alloc_flags |= ALLOC_CPUSET;
 
+	if (gfp_mask & __GFP_PAGECACHE)
+		alloc_flags |= ALLOC_PAGECACHE;
+
 	/*
 	 * Go through the zonelist again. Let __GFP_HIGH and allocations
 	 * coming from realtime tasks go deeper into reserves.
Index: include/linux/gfp.h
===================================================================
--- include/linux/gfp.h	(revision 2645)
+++ include/linux/gfp.h	(working copy)
@@ -46,6 +46,7 @@ struct vm_area_struct;
 #define __GFP_NOMEMALLOC ((__force gfp_t)0x1u) /* Don't use emergency reserves */
 #define __GFP_HARDWALL	((__force gfp_t)0x2u) /* Enforce hardwall cpuset memory allocs */
 #define __GFP_THISNODE	((__force gfp_t)0x4u) /* No fallback, no policies */
+#define __GFP_PAGECACHE	((__force gfp_t)0x8u) /* Is page cache allocation ? */
 
 #define __GFP_BITS_SHIFT 20	/* Room for 20 __GFP_FOO bits */
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
Index: include/linux/pagemap.h
===================================================================
--- include/linux/pagemap.h	(revision 2645)
+++ include/linux/pagemap.h	(working copy)
@@ -62,7 +62,7 @@ static inline struct page *__page_cache_
 static inline struct page *page_cache_alloc(struct address_space *x)
 {
-	return __page_cache_alloc(mapping_gfp_mask(x));
+	return __page_cache_alloc(mapping_gfp_mask(x)|__GFP_PAGECACHE);
 }
 
 static inline struct page *page_cache_alloc_cold(struct address_space *x)
===================================================================
Welcome any comm
Re: O_DIRECT question
Helge Hafting <[EMAIL PROTECTED]> wrote:
> Michael Tokarev wrote:
>> But seriously - what about just disallowing non-O_DIRECT opens together
>> with O_DIRECT ones ?
>
> Please do not create a new local DOS attack.
> I open some important file, say /etc/resolv.conf
> with O_DIRECT and just sit on the open handle.
> Now nobody else can open that file because
> it is "busy" with O_DIRECT ?

Suspend O_DIRECT access while non-O_DIRECT-fds are open, fdatasync on close?
--
"Unix policy is to not stop root from doing stupid things because
 that would also stop him from doing clever things." - Andi Kleen
"It's such a fine line between stupid and clever" - Derek Smalls
Re: O_DIRECT question
On Fri, 12 January 2007 00:19:45 +0800, Aubrey wrote:
> Yes for desktop, server, but maybe not for embedded system, specially
> for no-mmu linux. In many embedded system cases, the whole system is
> running in the ram, including file system. So it's not necessary using
> page cache anymore. Page cache can't improve performance on these
> cases, but only fragment memory.

You were not very specific, so I have to guess that you're referring to the problem of having two copies of the same file in RAM - one in the page cache and one in the "backing store", which is just RAM.

There are two solutions to this problem. One is tmpfs, which doesn't use a backing store and keeps all data in the page cache. The other is xip, which doesn't use the page cache and goes directly to backing store. Unlike O_DIRECT, xip only works with a RAM or de-facto RAM backing store (NOR flash works read-only).

So if you really care about memory waste in embedded systems, you should have a look at mm/filemap_xip.c and continue Carsten Otte's work.

Jörn
--
Fantasy is more important than knowledge. Knowledge is limited, while fantasy embraces the whole world.
-- Albert Einstein
Re: O_DIRECT question
Michael Tokarev wrote:
> Chris Mason wrote:
> []
>> I recently spent some time trying to integrate O_DIRECT locking with
>> page cache locking. The basic theory is that instead of using semaphores
>> for solving O_DIRECT vs buffered races, you put something into the radix
>> tree (I call it a placeholder) to keep the page cache users out, and
>> lock any existing pages that are present.
>
> But seriously - what about just disallowing non-O_DIRECT opens together
> with O_DIRECT ones ?

Please do not create a new local DOS attack. I open some important file, say /etc/resolv.conf, with O_DIRECT and just sit on the open handle. Now nobody else can open that file because it is "busy" with O_DIRECT ?

Helge Hafting
Re: O_DIRECT question
Bill Davidsen <[EMAIL PROTECTED]> wrote: > My point is, that there is code to handle sparse data now, without > O_DIRECT involved, and if O_DIRECT bypasses that, it's not a problem > with the idea of O_DIRECT, the kernel has a security problem. The idea of O_DIRECT is to bypass the pagecache, and the pagecache is what provides the security against reading someone else's data using sparse files or partial-block-IO. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: O_DIRECT question
On Sat, 13 Jan 2007, Bill Davidsen wrote:
> Bodo Eggert wrote:
>
>> (*) This would allow fadvise_size(), too, which could reduce fragmentation
>> (and give an early warning on full disks) without forcing e.g. fat to
>> zero all blocks. OTOH, fadvise_size() would allow users to reserve the
>> complete disk space without his filesizes reflecting this.
>
> Please clarify how this would interact with quota, and why it wouldn't
> allow someone to run me out of disk.

I fell into the "write-will-never-fail"-pit. Therefore I have to talk about the original purpose, write with O_DIRECT, too.

- Reserved blocks should be taken out of the quota, since they are about to be written right now. If you emptied your quota doing this, it's your fault. If it was the group's quota, just run fast enough.-)

- If one write failed that extended the reserved range, the reserved area should be shrunk again. Obviously you'll need something clever here.
  * You can't shrink carelessly while there are O_DIRECT writes.
  * You can't just try to grab the semaphore[0] for writing; this will deadlock with other write()s.
  * If you drop the read lock, it will work out, because you aren't writing anymore, and if you get the write lock, there won't be anybody else writing. Therefore you can clear the reservation for the not-written blocks. You may unreserve blocks that should stay reserved, but that won't harm much. At worst, you'll get fragmentation, loss of speed and an aborted (because of no free space) write command. Document this, it's a feature.-)

- If you fadvise_size on a non-quota disk, you can possibly reserve it completely, without being the easy-to-spot offender. You can do the same by actually writing these files, keeping them open and unlinking them. The new quality is: you can't just look at the file sizes in /proc in order to spot the offender.
However, if you reflect the reserved blocks in the used-blocks-field of struct stat, du will work as expected and the BOFH will know whom to LART. BTW: If the fs supports holes, using du would be the right thing to do anyway. BTW2: I don't know if reserving without actually assigning blocks is supported or easy to support at all. These reservations are the result of "These blocks are not yet written, therefore they contain possibly secret data that would leak on failed writes, therefore they may not be actually assigned to the file before write finishes. They may not be on the free list either. And hey, if we support pre-reserving blocks to the file, we may additionally use it for fadvise_size. I'll mention that briefly." [0] r/w semaphore, read={r,w}_odirect, write=ftruncate -- Fun things to slip into your budget Paradigm pro-activator (a whole pack) (you mean beer?) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: O_DIRECT question
Michael Tokarev wrote:
> Bill Davidsen wrote:
>> [...]
> If I got it right (and please someone tell me if I *really* got it
> right!), the problem is elsewhere.
>
> Suppose you have a filesystem, not at all related to databases and stuff.
> Your usual root filesystem, with your /etc/ /var and so on directories.
> Some time ago you edited /etc/shadow, updating it by writing new file and
> renaming it to proper place. So you have that old content of your shadow
> file (now deleted) somewhere on the disk, but not accessible from the
> filesystem.
>
> Now, a bad guy deliberately tries to open some file on this filesystem,
> using O_DIRECT flag, ftruncates() it to some huge size (or does
> seek+write), and at the same time tries to use O_DIRECT read of the data.

Which should be identified and zeros returned. Consider: I open a file for database use, and legitimately seek to a location out at, say, 250MB, and then write at the location my hash says I should. That's all legitimate. Now when some backup program accesses the file sequentially, it gets a boatload of zeros, because Linux "knows" that is sparse data. Yes, the backup program should detect this as well, so what?

My point is, that there is code to handle sparse data now, without O_DIRECT involved, and if O_DIRECT bypasses that, it's not a problem with the idea of O_DIRECT; the kernel has a security problem.

> Due to all the races etc, it is possible for him to read that old content
> of /etc/shadow file you've deleted before.

>> I do have one thought, WRT reading uninitialized disk data. I would hope
>> that sparse files are handled right, and that when doing a write with
>> O_DIRECT the metadata is not updated until the write is done.
>
> "hope that sparse files are handled right" is a high hope. Exactly
> because this very place IS racy.

Other than assuring that a program can't read where no program has written, I don't see a problem. Anyone accessing the same file with multiple processes had better be doing user space coordination, and gets no sympathy from me if they don't. In this case, "works right" does not mean "works as expected," because the program has no right to assume the kernel will sort out poor implementations.

Without O_DIRECT the problem of doing ordered i/o in user space becomes very difficult, if not impossible, so "get rid of O_DIRECT" is the wrong direction. When the program can be sure the i/o is done, then cleverness in user space can see that it's done RIGHT.

--
bill davidsen <[EMAIL PROTECTED]> CTO TMR Associates, Inc Doing interesting things with small computers since 1979
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: O_DIRECT question
On 1/12/07, Andrew Morton <[EMAIL PROTECTED]> wrote:
> On Fri, 12 Jan 2007 15:35:09 -0700 Erik Andersen <[EMAIL PROTECTED]> wrote:
> > On Fri Jan 12, 2007 at 05:09:09PM -0500, Linus Torvalds wrote:
> > > I suspect a lot of people actually have other reasons to avoid caches.
> > >
> > > For example, the reason to do O_DIRECT may well not be that you want to
> > > avoid caching per se, but simply because you want to limit page cache
> > > activity. In which case O_DIRECT "works", but it's really the wrong thing
> > > to do. We could export other ways to do what people ACTUALLY want, that
> > > doesn't have the downsides.
> >
> > I was rather fond of the old O_STREAMING patch by Robert Love,
>
> That was an akpm patch which I did for the Digeo kernel. Robert picked it
> up to dehackify it and get it into mainline, but we ended up deciding that
> posix_fadvise() was the way to go because it's standards-based.
>
> It's a bit more work in the app to use posix_fadvise() well. But the
> results will be better. The app should also use sync_file_range()
> intelligently to control its pagecache use.

And there's an interesting note that I should add here, because there's a downside to using fadvise() instead of O_STREAMING when the programmer is not careful.

I spent at least a month doing some complex blktrace analysis to try to figure out why Digeo's new platform (which used the fadvise() call) didn't have the kind of streaming performance that it should have. One symptom I found was that even on the media partition, where I/O should have always been happening in nice 512K chunks (ra_pages == 128), it seemed to be happening in random sizes between 32K and 512K.

It turns out that the code pulls in some size chunk, maybe 32K, then does an fadvise DONTNEED on the fd, *with zero offset and zero length*, meaning that it wipes out *all* the pagecache for the file.

That means that the rest of the 512K from the readahead would get discarded before it got used, and later the remaining pages in the ra window would get faulted in again.

Most applications don't get the kind of performance analysis that Digeo was doing, and even then, it's rather lucky that we caught that. So I personally think it'd be best for libc or something to simulate the O_STREAMING behavior if you ask for it. That would simplify things for the most common case, and have the side benefit of reducing the amount of extra code an application would need in order to take advantage of that feature.

NATE
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: O_DIRECT question
Bill Davidsen wrote: > Linus Torvalds wrote: >> [] >> But what O_DIRECT does right now is _not_ really sensible, and the >> O_DIRECT propeller-heads seem to have some problem even admitting that >> there _is_ a problem, because they don't care. > > You say that as if it were a failing. Currently if you mix access via > O_DIRECT and non-DIRECT you can get unexpected results. You can screw > yourself, mangle your data, or have no problems at all if you avoid > trying to access the same bytes in multiple ways. There are lots of ways > to get or write stale data, not all involve O_DIRECT in any way, and the > people actually using O_DIRECT now are managing very well. > > I don't regard it as a system failing that I am allowed to shoot myself > in the foot, it's one of the benefits of Linux over Windows. Using > O_DIRECT now is like being your own lawyer, room for both creativity and > serious error. But what's there appears portable, which is important as > well. If I got it right (and please someone tell me if I *really* got it right!), the problem is elsewhere. Suppose you have a filesystem, not at all related to databases and stuff. Your usual root filesystem, with your /etc/ /var and so on directories. Some time ago you edited /etc/shadow, updating it by writing new file and renaming it to proper place. So you have that old content of your shadow file (now deleted) somewhere on the disk, but not accessible from the filesystem. Now, a bad guy deliberately tries to open some file on this filesystem, using O_DIRECT flag, ftruncates() it to some huge size (or does seek+write), and at the same time tries to use O_DIRECT read of the data. Due to all the races etc, it is possible for him to read that old content of /etc/shadow file you've deleted before. > I do have one thought, WRT reading uninitialized disk data. I would hope > that sparse files are handled right, and that when doing a write with > O_DIRECT the metadata is not updated until the write is done. 
"hope that sparse files are handled right" is a high hope. Exactly because this very place IS racy. Again, *IF* I got it correctly. /mjt - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: O_DIRECT question
Linus Torvalds wrote:
> On Sat, 13 Jan 2007, Michael Tokarev wrote:
>> (No, really - this load isn't entirely synthetic. It's a typical database
>> workload - random I/O all over, on a large file. If it can, it combines
>> several I/Os into one, by requesting more than a single block at a time,
>> but overall it is random.)
>
> My point is that you can get basically ALL THE SAME GOOD BEHAVIOUR without
> having all the BAD behaviour that O_DIRECT adds.
>
> For example, just the requirement that O_DIRECT can never create a file
> mapping, and can never interact with ftruncate would actually make
> O_DIRECT a lot more palatable to me. Together with just the requirement
> that an O_DIRECT open would literally disallow any non-O_DIRECT accesses,
> and flush the page cache entirely, would make all the aliases go away.
>
> At that point, O_DIRECT would be a way of saying "we're going to do
> uncached accesses to this pre-allocated file". Which is a half-way
> sensible thing to do.

But it's not necessary, it would break existing programs, and it would be incompatible with other OSes like AIX, BSD, and Solaris. And it doesn't provide the legitimate use for O_DIRECT in avoiding cache pollution when writing a LARGE file.

> But what O_DIRECT does right now is _not_ really sensible, and the
> O_DIRECT propeller-heads seem to have some problem even admitting that
> there _is_ a problem, because they don't care.

You say that as if it were a failing. Currently if you mix access via O_DIRECT and non-DIRECT you can get unexpected results. You can screw yourself, mangle your data, or have no problems at all if you avoid trying to access the same bytes in multiple ways. There are lots of ways to get or write stale data, not all involving O_DIRECT in any way, and the people actually using O_DIRECT now are managing very well.

I don't regard it as a system failing that I am allowed to shoot myself in the foot; it's one of the benefits of Linux over Windows. Using O_DIRECT now is like being your own lawyer: room for both creativity and serious error. But what's there appears portable, which is important as well.

I do have one thought, WRT reading uninitialized disk data. I would hope that sparse files are handled right, and that when doing a write with O_DIRECT the metadata is not updated until the write is done.

> A lot of DB people seem to simply not care about security or anything
> else. I'm trying to tell you that quoting numbers is pointless, when
> simply the CORRECTNESS of O_DIRECT is very much in doubt.

The guiding POSIX standard appears dead, and major DB programs which work on Linux run on AIX, Solaris, and BSD. That sounds like a good level of compatibility. I'm not sure what more correctness you would want beyond a proposed standard and common practice. It's tricky to use, like many other neat features.

I confess I have abused O_DIRECT by opening a file with O_DIRECT, fdopen()ing it for C, supplying my own large aligned buffer, and using that with an otherwise unmodified large program which uses fprintf(). That worked on all of the major UNIX variants as well.

--
bill davidsen <[EMAIL PROTECTED]> CTO TMR Associates, Inc Doing interesting things with small computers since 1979
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: O_DIRECT question
Bodo Eggert wrote: (*) This would allow fadvise_size(), too, which could reduce fragmentation (and give an early warning on full disks) without forcing e.g. fat to zero all blocks. OTOH, fadvise_size() would allow users to reserve the complete disk space without his filesizes reflecting this. Please clarify how this would interact with quota, and why it wouldn't allow someone to run me out of disk. -- bill davidsen <[EMAIL PROTECTED]> CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: O_DIRECT question
Linus Torvalds <[EMAIL PROTECTED]> wrote:
> On Sat, 13 Jan 2007, Michael Tokarev wrote:
>> (No, really - this load isn't entirely synthetic. It's a typical database
>> workload - random I/O all over, on a large file. If it can, it combines
>> several I/Os into one, by requesting more than a single block at a time,
>> but overall it is random.)
>
> My point is that you can get basically ALL THE SAME GOOD BEHAVIOUR without
> having all the BAD behaviour that O_DIRECT adds.
>
> For example, just the requirement that O_DIRECT can never create a file
> mapping,

That sounds sane, but the video streaming folks will be unhappy. Maybe you could do:

  reserve_space(); (*)
  do_write_odirect();
  update_filesize();

and only allow reads up to the current filesize? Of course, if you do ftruncate first and then write O_DIRECT, the holes will need to be filled before the corresponding blocks are assigned to the file. Either you'll zero them, or you can insert them into the file after the write.

Races:
- against other reads: May happen in any order; to-be-written pages are beyond the filesize (inaccessible), zeroed, or not yet assigned to the file.
- against other writes: No bad effect, since you don't unreserve mappings, and update_filesize won't shrink the file. You must, however, not reserve two chunks for the same location in the file unless you can handle replacing blocks of files. open(O_WRITE) without O_DIRECT is not allowed, therefore that can't race.
- against truncate: Yes, see below.

(*) This would allow fadvise_size(), too, which could reduce fragmentation (and give an early warning on full disks) without forcing e.g. fat to zero all blocks. OTOH, fadvise_size() would allow users to reserve the complete disk space without his filesizes reflecting this.

> and can never interact with ftruncate

ACK, r/w semaphore, read={r,w}_odirect, write=ftruncate?

> would actually make
> O_DIRECT a lot more palatable to me.
> Together with just the requirement
> that an O_DIRECT open would literally disallow any non-O_DIRECT accesses,
> and flush the page cache entirely, would make all the aliases go away.

That's probably the best semantics. Maybe you should allow O_READ for the backup people, maybe forcing O_DIRECT|O_ALLOWDOUBLEBUFFER (doing the extra copy in the kernel).

> At that point, O_DIRECT would be a way of saying "we're going to do
> uncached accesses to this pre-allocated file". Which is a half-way
> sensible thing to do.

And I'd bet nobody would notice these changes unless they try inherently stupid things.

> But what O_DIRECT does right now is _not_ really sensible, and the
> O_DIRECT propeller-heads seem to have some problem even admitting that
> there _is_ a problem, because they don't care.

It's a hammer - having it will make anything look like a nail, and there is nothing wrong with hammering a nail!!! .-)

> A lot of DB people seem to simply not care about security or anything
> else. I'm trying to tell you that quoting numbers is pointless, when
> simply the CORRECTNESS of O_DIRECT is very much in doubt.

The only thing you'll need for correct database behaviour is: if one process has completed its write and the next process opens that file, it must read the current contents. Races with normal reads and writes, races with truncate - don't do that then. You wouldn't expect "cat somefile > database.dat" on a running db to be a good thing either, no matter whether O_DIRECT is used or not.

--
Funny quotes: 3. On the other hand, you have different fingers.
Friß, Spammer: [EMAIL PROTECTED]
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: O_DIRECT question
Bill Davidsen wrote:
> The point is that if you want to be able to allocate at all, sometimes you
> will have to write dirty pages, garbage collect, and move or swap programs.
> The hardware is just too limited to do something less painful, and the user
> can't see memory to do things better. Linus is right, 'Claiming that there
> is a "proper solution" is usually a total red herring. Quite often there
> isn't, and the "paper over" is actually not papering over, it's quite
> possibly the best solution there is.'
>
> I think any solution is going to be ugly, unfortunately.

It seems quite robust and clean to me, actually. Any userspace memory that absolutely must be in large contiguous regions has to be allocated at boot, or from a pool reserved at boot. All other allocations can be broken into smaller ones.

Write dirty pages, garbage collect, move or swap programs isn't going to be robust, because there is lots of vital kernel memory that cannot be moved and will cause fragmentation.

The reclaimable-zone work that went on a while ago for hugepages is exactly how you would also fix this problem and still have a reasonable degree of flexibility at runtime. It isn't really ugly or hard, compared with some of the non-working "solutions" that have been proposed. The other good thing is that the core mm already has practically everything required, so the functionality is unintrusive.

--
SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: O_DIRECT question
On Fri, 12 Jan 2007 15:35:09 -0700 Erik Andersen <[EMAIL PROTECTED]> wrote:
> On Fri Jan 12, 2007 at 05:09:09PM -0500, Linus Torvalds wrote:
> > I suspect a lot of people actually have other reasons to avoid caches.
> >
> > For example, the reason to do O_DIRECT may well not be that you want to
> > avoid caching per se, but simply because you want to limit page cache
> > activity. In which case O_DIRECT "works", but it's really the wrong thing
> > to do. We could export other ways to do what people ACTUALLY want, that
> > doesn't have the downsides.
>
> I was rather fond of the old O_STREAMING patch by Robert Love,

That was an akpm patch which I did for the Digeo kernel. Robert picked it up to dehackify it and get it into mainline, but we ended up deciding that posix_fadvise() was the way to go because it's standards-based.

It's a bit more work in the app to use posix_fadvise() well. But the results will be better. The app should also use sync_file_range() intelligently to control its pagecache use.

The problem with all of these things is that the application needs to be changed, and people often cannot do that. If we want a general way of stopping particular apps from swamping pagecache then it'd really need to be an externally-imposed thing - probably via additional accounting and a new rlimit.
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: O_DIRECT question
On Fri Jan 12, 2007 at 05:09:09PM -0500, Linus Torvalds wrote:
> I suspect a lot of people actually have other reasons to avoid caches.
>
> For example, the reason to do O_DIRECT may well not be that you want to
> avoid caching per se, but simply because you want to limit page cache
> activity. In which case O_DIRECT "works", but it's really the wrong thing
> to do. We could export other ways to do what people ACTUALLY want, that
> doesn't have the downsides.

I was rather fond of the old O_STREAMING patch by Robert Love, which added an open() flag telling the kernel to not keep data from the current file in cache, by dropping pages from the pagecache before the current index.

O_STREAMING was very nice for when you know you want to read a large file sequentially, without polluting the rest of the cache with GB of data that you plan to read only once and then discard. It worked nicely at doing what many people want to use O_DIRECT for. Using O_STREAMING you would get normal read/write semantics, since you still had the pagecache caching your data - but only the not-yet-written write-behind data and the not-yet-read read-ahead data. With the additional hint, the kernel would drop free-able pages from the pagecache behind the current position, because we know we will never want them again. I thought that was a very nice way of handling things.

-Erik
--
Erik B. Andersen http://codepoet-consulting.com/ --This message was written using 73% post-consumer electrons--
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: O_DIRECT question
Linus Torvalds wrote:
> On Sat, 13 Jan 2007, Michael Tokarev wrote:
>>> At that point, O_DIRECT would be a way of saying "we're going to do
>>> uncached accesses to this pre-allocated file". Which is a half-way
>>> sensible thing to do.
>> Half-way?
>
> I suspect a lot of people actually have other reasons to avoid caches.
>
> For example, the reason to do O_DIRECT may well not be that you want to
> avoid caching per se, but simply because you want to limit page cache
> activity. In which case O_DIRECT "works", but it's really the wrong thing
> to do. We could export other ways to do what people ACTUALLY want, that
> doesn't have the downsides.
>
> For example, the page cache is absolutely required if you want to mmap.
> There's no way you can do O_DIRECT and mmap at the same time and expect
> any kind of sane behaviour. It may not be what a DB wants to use, but it's
> an example of where O_DIRECT really falls down.

That's only when the two are used on the same part of a file. If not, and if the file is "divided" on a proper boundary (sector/page/whatever-aligned), there are no issues - at least not if all the blocks of the file have been allocated (no gaps, that is). What I was referring to in my last email - and said it's a corner case - is: mmap() the start of a file, say the first megabyte of it, where some index/bitmap is located, and use direct I/O on the rest. So the two don't overlap. Still problematic?

>>> But what O_DIRECT does right now is _not_ really sensible, and the
>>> O_DIRECT propeller-heads seem to have some problem even admitting that
>>> there _is_ a problem, because they don't care.
>> Well. In fact, there's NO problems to admit.
>>
>> Yes, yes, yes yes - when you think about it from a general point of
>> view, and think how non-O_DIRECT and O_DIRECT access fits together,
>> it's a complete mess, and you're 100% right it's a mess.
>
> You can't admit that even O_DIRECT _without_ any non-O_DIRECT actually
> fails in many ways right now.
>
> I've already mentioned ftruncate and block allocation. You don't seem to
> understand that those are ALSO a problem.

I do understand this. And this, too, is solved right now in userspace. For example, when Oracle allocates a file for its data, or when it extends the file, it writes something to every block of the new space (using O_DIRECT while at it, but that's a different story). The thing is: while it is doing that, no process tries to do anything with that (part of a) file (not counting some external processes run by evil hackers ;) So there's still no race or fundamental brokenness *in usage*.

It uses ftruncate() to create or extend a file, *and* does O_DIRECT writes to force block allocations. That's probably not right, and that alone is probably difficult to implement in kernel (I just don't know; what I know for sure is that this way is very slow on ext3). Maybe because there's no way to tell the kernel something like "set the file size to this and actually *allocate* space for it" (short of writing something to the file).

What I dislike very much is half-solutions. And current O_DIRECT indeed looks like half a solution, because sometimes it works, and sometimes, in a *wrong* usage scenario, it doesn't, or is racy, etc. - and the kernel *allows* such a wrong scenario. A software should either work correctly, or disallow a usage where it can't guarantee correctness. Currently the kernel allows incorrect usage, and that, plus all the ugly things done in code in attempts to fix that, sucks. But the whole thing is not (fundamentally) broken.

/mjt
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: O_DIRECT question
On Sat, 13 Jan 2007, Michael Tokarev wrote: > > > > At that point, O_DIRECT would be a way of saying "we're going to do > > uncached accesses to this pre-allocated file". Which is a half-way > > sensible thing to do. > > Half-way? I suspect a lot of people actually have other reasons to avoid caches. For example, the reason to do O_DIRECT may well not be that you want to avoid caching per se, but simply because you want to limit page cache activity. In which case O_DIRECT "works", but it's really the wrong thing to do. We could export other ways to do what people ACTUALLY want, that doesn't have the downsides. For example, the page cache is absolutely required if you want to mmap. There's no way you can do O_DIRECT and mmap at the same time and expect any kind of sane behaviour. It may not be what a DB wants to use, but it's an example of where O_DIRECT really falls down. > > But what O_DIRECT does right now is _not_ really sensible, and the > > O_DIRECT propeller-heads seem to have some problem even admitting that > > there _is_ a problem, because they don't care. > > Well. In fact, there's NO problems to admit. > > Yes, yes, yes yes - when you think about it from a general point of > view, and think how non-O_DIRECT and O_DIRECT access fits together, > it's a complete mess, and you're 100% right it's a mess. You can't admit that even O_DIRECT _without_ any non-O_DIRECT actually fails in many ways right now. I've already mentioned ftruncate and block allocation. You don't seem to understand that those are ALSO a problem. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Disk Cache, Was: O_DIRECT question
Zan Lynx wrote:
> On Sat, 2007-01-13 at 00:03 +0300, Michael Tokarev wrote:
> [snip]
>> And sure thing, withOUT O_DIRECT, the whole system is almost dead under
>> this load - because everything is thrown away from the cache, even caches
>> of /bin /usr/bin etc... ;) (For that, fadvise() seems to help a bit, but
>> not a lot.)
>
> One thing that I've been using, and seems to work well, is a customized
> version of the readahead program several distros use during boot up.
[idea to lock some (commonly-used) cache pages in memory]
> Something like that could keep your system responsive no matter what the
> disk cache is doing otherwise.

Unfortunately it's not. Sure, things like libc.so etc. will be force-cached and will start fast. But not my data files and other stuff (what an unfortunate thing: memory usually is smaller in size than disks ;)

I can do usual work without noticing something's working with the disks intensively, doing O_DIRECT I/O. For example, I can run a large report on a database, which requires a lot of disk I/O, and run a kernel compile at the same time. Sure, disk access is a lot slower, but the disk cache helps a lot, too. My kernel compile will not be much slower than usual. But if I turn O_DIRECT off, the compile will take ages to finish. *And* the report running, too! Because the system tries hard to cache the WRONG pages!

(Yes, I remember fadvise & Co - which aren't used by the database(s) currently, and quite a lot of words has been said about that, too; I also noticed it's slower as well, at least currently.)

/mjt
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Disk Cache, Was: O_DIRECT question
On Sat, 2007-01-13 at 00:03 +0300, Michael Tokarev wrote: [snip] > And sure thing, withOUT O_DIRECT, the whole system is almost dead under this > load - because everything is thrown away from the cache, even caches of /bin > /usr/bin etc... ;) (For that, fadvise() seems to help a bit, but not a lot). One thing that I've been using, and seems to work well, is a customized version of the readahead program several distros use during boot up. Mine starts off doing: mlockall(MCL_CURRENT|MCL_FUTURE); ...yadda, yadda... and for each file listed: ...open, stat stuff... if( MAP_FAILED == mmap( NULL, stat_buf.st_size, PROT_READ, MAP_SHARED|MAP_LOCKED|MAP_POPULATE, fd, 0) ) { fprintf(stderr, "'%s' ", file); perror("mmap"); } ...more stuff... and then ends with: pause(); and it sits there forever. As far as I can tell, this makes the program and library code stay in RAM. At least, after a drop_caches nautilus doesn't load 12 MB off disk, it just starts. It has to be reloaded after software updates and after prelinking. I find the 250 MB used to be worthwhile, even if it's kinda Windowsey. Something like that could keep your system responsive no matter what the disk cache is doing otherwise. -- Zan Lynx <[EMAIL PROTECTED]>
Re: O_DIRECT question
Linus Torvalds wrote: [] > My point is that you can get basically ALL THE SAME GOOD BEHAVIOUR without > having all the BAD behaviour that O_DIRECT adds. *This* point I got from the beginning, once I tried to think how it all is done internally (I never thought about that, because I'm not a kernel hacker to start with) -- currently, linux has ugly/racy places which are either difficult or impossible to fix, all due to this O_DIRECT thing which interacts badly with other access "methods". > For example, just the requirement that O_DIRECT can never create a file > mapping, and can never interact with ftruncate would actually make > O_DIRECT a lot more palatable to me. Together with just the requirement > that an O_DIRECT open would literally disallow any non-O_DIRECT accesses, > and flush the page cache entirely, would make all the aliases go away. > > At that point, O_DIRECT would be a way of saying "we're going to do > uncached accesses to this pre-allocated file". Which is a half-way > sensible thing to do. Half-way? > But what O_DIRECT does right now is _not_ really sensible, and the > O_DIRECT propeller-heads seem to have some problem even admitting that > there _is_ a problem, because they don't care. Well. In fact, there's NO problem to admit. Yes, yes, yes yes - when you think about it from a general point of view, and think how non-O_DIRECT and O_DIRECT access fit together, it's a complete mess, and you're 100% right it's a mess. But. Those damn "database people" don't mix and match the two accesses together (I'm not one of them, either - I'm just trying to use a DB product on linux). So there's just no issue. The solution to in-kernel races and problems in this case is the usage scenario, and in following simple usage rules. Basically, the above requirement - "don't mix&match the two together" - is implemented in userspace (yes, there's no guarantee that someone/thing will not do some evil thing, but that's controlled by file permissions). 
That is, database software itself will not try to use the thing in a wrong way. Simple as that. > A lot of DB people seem to simply not care about security or anything > else. I'm trying to tell you that quoting numbers is > pointless, when simply the CORRECTNESS of O_DIRECT is very much in doubt. When done properly - be it in user- or kernel-space, it IS correct. No database people are ftruncating a file *and* reading past the end of it at the same time for example, and don't mix-n-match cached and direct io, at least not for the same part of a file (if there are, they're really braindead, or it's just a plain bug). > I can calculate PI to a billion decimal places in my head in .1 seconds. > If you don't care about the CORRECTNESS of the result, that is. > > See? It's not about performance. It's about O_DIRECT being fundamentally > broken as it behaves right now. I recall again the above: the actual USAGE of O_DIRECT, as implemented in database software, tries to ensure there's no brokenness, especially fundamental brokenness, just by not performing parallel direct/non-direct reads/writes/truncates. This way, the thing Just Works, works *correctly* (provided there's no bugs all the way down to a device), *and* works *fast*. By the way, I can think of some useful cases where *parts* of a file are mmap()ed (even for RW access), and parts are being read/written with O_DIRECT. But that's probably some corner cases. /mjt
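[Editor's note: the usage discipline described above - O_DIRECT only, aligned buffers, aligned offsets, nothing else touching the file - can be sketched as below. This is illustrative, not from the thread: `direct_write_block()` is a made-up helper, 512 bytes is a conservative alignment assumption (the real granularity depends on device and filesystem), and the open may simply fail with EINVAL on filesystems that don't support O_DIRECT.]

```c
#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLK 512                 /* assumed alignment granularity */

/* Write one BLK-sized block at a BLK-aligned offset; 0 on success.
 * Buffer, length and offset are all BLK-aligned, as O_DIRECT needs. */
static int direct_write_block(const char *path, off_t off, const void *data)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0)
        return -1;              /* e.g. O_DIRECT unsupported on this fs */

    void *buf;
    if (posix_memalign(&buf, BLK, BLK) != 0) { close(fd); return -1; }
    memcpy(buf, data, BLK);

    ssize_t n = pwrite(fd, buf, BLK, off);  /* off must be BLK-aligned */
    free(buf);
    close(fd);
    return n == (ssize_t)BLK ? 0 : -1;
}
```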
Re: O_DIRECT question
On Sat, 13 Jan 2007, Michael Tokarev wrote: > > (No, really - this load isn't entirely synthetic. It's a typical database > workload - random I/O all over, on a large file. If it can, it combines > several I/Os into one, by requesting more than a single block at a time, > but overall it is random.) My point is that you can get basically ALL THE SAME GOOD BEHAVIOUR without having all the BAD behaviour that O_DIRECT adds. For example, just the requirement that O_DIRECT can never create a file mapping, and can never interact with ftruncate would actually make O_DIRECT a lot more palatable to me. Together with just the requirement that an O_DIRECT open would literally disallow any non-O_DIRECT accesses, and flush the page cache entirely, would make all the aliases go away. At that point, O_DIRECT would be a way of saying "we're going to do uncached accesses to this pre-allocated file". Which is a half-way sensible thing to do. But what O_DIRECT does right now is _not_ really sensible, and the O_DIRECT propeller-heads seem to have some problem even admitting that there _is_ a problem, because they don't care. A lot of DB people seem to simply not care about security or anything else. I'm trying to tell you that quoting numbers is pointless, when simply the CORRECTNESS of O_DIRECT is very much in doubt. I can calculate PI to a billion decimal places in my head in .1 seconds. If you don't care about the CORRECTNESS of the result, that is. See? It's not about performance. It's about O_DIRECT being fundamentally broken as it behaves right now. Linus
Re: O_DIRECT question
Michael Tokarev wrote: > Michael Tokarev wrote: > By the way. I just ran - for fun - a read test of a raid array. > > Reading blocks of size 512kbytes, starting at random places on a 400Gb > array, doing 64threads. > > O_DIRECT: 336.73 MB/sec. > !O_DIRECT: 146.00 MB/sec. And when turning off read-ahead, the speed dropped to 30 MB/sec. Read-ahead should not help here, I think... But after analyzing the "randomness" a bit, it turned out a lot of requests were coming to places "near" the ones which had been read recently. After switching to another random number generator, speed in the case WITH readahead enabled dropped to almost 5 MB/sec ;) And sure thing, withOUT O_DIRECT, the whole system is almost dead under this load - because everything is thrown away from the cache, even caches of /bin /usr/bin etc... ;) (For that, fadvise() seems to help a bit, but not a lot). (No, really - this load isn't entirely synthetic. It's a typical database workload - random I/O all over, on a large file. If it can, it combines several I/Os into one, by requesting more than a single block at a time, but overall it is random.) /mjt
Re: O_DIRECT question
Michael Tokarev wrote: [] > After all the explanations, I still don't see anything wrong with the > interface itself. O_DIRECT isn't "different semantics" - we're still > writing and reading some data. Yes, O_DIRECT and non-O_DIRECT usages > somewhat contradict each other, but there are other ways to make > the two happy, instead of introducing a lot of stupid, complex, and racy > code all over. By the way. I just ran - for fun - a read test of a raid array. Reading blocks of size 512kbytes, starting at random places on a 400Gb array, doing 64threads. O_DIRECT: 336.73 MB/sec. !O_DIRECT: 146.00 MB/sec. Quite a... difference here. Using posix_fadvise() does not improve it. /mjt
Re: O_DIRECT question
Chris Mason wrote: [] > I recently spent some time trying to integrate O_DIRECT locking with > page cache locking. The basic theory is that instead of using > semaphores for solving O_DIRECT vs buffered races, you put something > into the radix tree (I call it a placeholder) to keep the page cache > users out, and lock any existing pages that are present. But seriously - what about just disallowing non-O_DIRECT opens together with O_DIRECT ones ? If the thing will allow non-DIRECT READ-ONLY opens, I personally see no problems whatsoever, at all. If non-DIRECT READ-ONLY opens will be disallowed too, -- well, a bit less nice, but still workable (allowing online backup of database files opened in O_DIRECT mode using other tools such as `cp' -- if non-direct opens aren't allowed, I'll switch to using dd or some such). Yes there may still be a race between ftruncate() and reads (either direct or not), or when filling gaps by writing into places which were skipped by using ftruncate. I don't know how serious those races are. That is to say - if the whole thing will be a bit more strict wrt the allowed set of operations, races (or some of them, anyway) will just go away (and maybe it will work even better due to quite some code and lock contention removal), and maybe after that, Linus will like the whole thing a bit better... ;) After all the explanations, I still don't see anything wrong with the interface itself. O_DIRECT isn't "different semantics" - we're still writing and reading some data. Yes, O_DIRECT and non-O_DIRECT usages somewhat contradict each other, but there are other ways to make the two happy, instead of introducing a lot of stupid, complex, and racy code all over. /mjt
Re: O_DIRECT question
On Fri, Jan 12, 2007 at 10:06:22AM -0800, Linus Torvalds wrote: > > looking at the splice(2) api it seems like it'll be difficult to implement > > O_DIRECT pread/pwrite from userland using splice... so there'd need to be > > some help there. > > You'd use vmsplice() to put the write buffers into kernel space (user > space sees it's a pipe file descriptor, but you should just ignore that: > it's really just a kernel buffer). And then splice the resulting kernel > buffers to the destination. I recently spent some time trying to integrate O_DIRECT locking with page cache locking. The basic theory is that instead of using semaphores for solving O_DIRECT vs buffered races, you put something into the radix tree (I call it a placeholder) to keep the page cache users out, and lock any existing pages that are present. O_DIRECT does save cpu from avoiding copies, but it also saves cpu from fewer radix tree operations during massive IOs. The cost of radix tree insertion/deletion on 1MB O_DIRECT ios added ~10% system time on my tiny little dual core box. I'm sure it would be much worse if there was lock contention on a big numa machine, and it grows as the io grows (SGI does massive O_DIRECT ios). To help reduce radix churn, I made it possible for a single placeholder entry to lock down a range in the radix: http://thread.gmane.org/gmane.linux.file-systems/12263 It looks to me as though vmsplice is going to have the same issues as my early patches. The current splice code can avoid the copy but is still working in page sized chunks. Also, splice doesn't support zero copy on things smaller than page sized chunks. The compromise my patch makes is to hide placeholders from almost everything except the DIO code. It may be worthwhile to turn the placeholders into an IO marker that can be useful to filemap_fdatawrite and friends. 
It should be able to:
- record the userland/kernel pages involved in a given io
- map blocks from the FS for making a bio
- start the io
- wake people up when the io is done
This would allow splice to operate without stealing the userland page (stealing would still be an option of course), and could get rid of big chunks of fs/direct-io.c. -chris
Re: O_DIRECT question
On Thu, 11 Jan 2007, dean gaudet wrote: > > it seems to me that if splice and fadvise and related things are > sufficient for userland to take care of things "properly" then O_DIRECT > could be changed into splice/fadvise calls either by a library or in the > kernel directly... The problem is two-fold: - the fact that databases use O_DIRECT and all the commercial people are perfectly happy to use a totally idiotic interface (and they don't care about the problems) means that things like fadvise() don't actually get the TLC. For example, the USEONCE thing isn't actually _implemented_, even though from a design standpoint, it would in many ways be preferable over O_DIRECT. It's not just fadvise. It's a general problem for any new interfaces where the old interfaces "just work" - never mind if they are nasty. And O_DIRECT isn't actually all that nasty for users (although the alignment restrictions are obviously irritating, but they are mostly fundamental _hardware_ alignment restrictions, so..). It's only nasty from a kernel internal security/serialization standpoint. So in many ways, apps don't want to change, because they don't really see the problems. (And, as seen in this thread: uses like NFS don't see the problems either, because there the serialization is done entirely somewhere *else*, so the NFS people don't even understand why the whole interface sucks in the first place) - a lot of the reasons for problems for O_DIRECT are the semantics. If we could easily implement the O_DIRECT semantics using something else, we would. But it's semantically not allowed to steal the user page, and it has to wait for it to be all done with, because those are the semantics of "write()". 
So one of the advantages of vmsplice() and friends is literally that it could allow page stealing, and allow the semantics where any changes to the page (in user space) might make it to disk _after_ vmsplice() has actually already returned, because we literally re-use the page (ie it's fundamentally an async interface). But again, fadvise and vmsplice etc aren't even getting the attention, because right now they are only used by small programs (and generally not done by people who also work on the kernel, and can see that it really would be better to use more natural interfaces). > looking at the splice(2) api it seems like it'll be difficult to implement > O_DIRECT pread/pwrite from userland using splice... so there'd need to be > some help there. You'd use vmsplice() to put the write buffers into kernel space (user space sees it's a pipe file descriptor, but you should just ignore that: it's really just a kernel buffer). And then splice the resulting kernel buffers to the destination. Linus
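[Editor's note: the vmsplice()-then-splice() write path Linus describes looks roughly like this. A minimal sketch, not from the thread: `splice_write()` is an illustrative name, the page-stealing gift semantics (SPLICE_F_GIFT) are deliberately left out, and as written it only handles payloads that fit in the pipe buffer (64 KB by default) in one shot.]

```c
#define _GNU_SOURCE             /* for vmsplice()/splice() */
#include <fcntl.h>
#include <sys/uio.h>
#include <unistd.h>

/* Push len bytes (len <= the pipe capacity) from a user buffer into
 * fd at offset off, via a pipe: user pages -> pipe -> file. */
static ssize_t splice_write(int fd, const void *buf, size_t len, off_t off)
{
    int p[2];
    if (pipe(p) < 0)
        return -1;

    struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
    loff_t o = off;
    ssize_t out = -1;

    ssize_t in = vmsplice(p[1], &iov, 1, 0);    /* user pages -> pipe */
    close(p[1]);
    if (in == (ssize_t)len)
        out = splice(p[0], NULL, fd, &o,        /* pipe -> file */
                     len, SPLICE_F_MOVE);
    close(p[0]);
    return out;
}
```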
Re: O_DIRECT question
Linus Torvalds wrote: O_DIRECT is still crazily racy versus pagecache operations. >>> >>>Yes. O_DIRECT is really fundamentally broken. There's just no way to fix >>>it sanely. >> >>How about aliasing O_DIRECT to POSIX_FADV_NOREUSE (sortof) ? > > > That is what I think some users could do. If the main issue with O_DIRECT > is the page cache allocations, if we instead had better (read: "any") > support for POSIX_FADV_NOREUSE, one class of reasons for O_DIRECT usage would > just go away. > > See also the patch that Roy Huang posted about another approach to the > same problem: just limiting page cache usage explicitly. > > That's not the _only_ issue with O_DIRECT, though. It's one big one, but > people like to think that the memory copy makes a difference when you do > IO too (I think it's likely pretty debatable in real life, but I'm totally > certain you can benchmark it, probably even pretty easily especially if > you have fairly studly IO capabilities and a CPU that isn't quite as > studly). > > So POSIX_FADV_NOREUSE kind of support is one _part_ of the O_DIRECT > picture, and depending on your problems (in this case, the embedded world) > it may even be the *biggest* part. But it's not the whole picture. From 2.6.19 sources it looks like POSIX_FADV_NOREUSE is a no-op there > Linus
Re: O_DIRECT question
Linus Torvalds wrote: >>OK, madvise() used with mmap'ed file allows to have reads from a file >>with zero-copy between kernel/user buffers and don't pollute cache >>memory unnecessarily. But how about writes? How is to do zero-copy >>writes to a file and don't pollute cache memory without using O_DIRECT? >>Do I miss the appropriate interface? > > > mmap()+msync() can do that too. Sorry, I wasn't sufficiently clear. Mmap()+msync() can't be used for that if the data to be written comes from some external source, like video capturing hardware, which DMAs data directly into user space buffers. Using the mmap'ed area for those DMA buffers doesn't look like a good idea, because, e.g., it will involve unneeded disk reads on the first page faults. So, some O_DIRECT-like interface should exist in the system. Also, as Michael Tokarev noted, operations over mmap'ed areas don't provide good ways for error handling, which effectively makes them unusable for something serious. > Also, regular user-space page-aligned data could easily just be moved into > the page cache. We actually have a lot of the infrastructure for it. See > the "splice()" system call. It's just not very widely used, and the > "drop-behind" behaviour (to then release the data) isn't there. And I bet > that there's lots of work needed to make it work well in practice, but > from a conceptual standpoint the O_DIRECT method really is just about the > *worst* way to do things. splice() needs 2 file descriptors, but looking at it I've found the vmsplice() syscall, which, it seems, can do the needed actions, although I'm not sure it can work with files and zero-copy. Thanks for pointing out those interfaces.
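[Editor's note: for reference, the mmap()+msync() write path being debated looks like this. A minimal sketch with an illustrative name (`mmap_write`): extend the file with ftruncate(), map it shared, write into the mapping, msync() it out. As the mail above notes, error reporting on this path is weaker than write() - msync() can succeed while a later background writeback fails.]

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Write len bytes to path via a shared mapping; 0 on success. */
static int mmap_write(const char *path, const void *data, size_t len)
{
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, len) < 0) { close(fd); return -1; }

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { close(fd); return -1; }

    memcpy(p, data, len);              /* dirty the pages */
    int rc = msync(p, len, MS_SYNC);   /* synchronous writeback */
    munmap(p, len);
    close(fd);
    return rc;
}
```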
Re: O_DIRECT question
dean gaudet wrote: it seems to me that if splice and fadvise and related things are sufficient for userland to take care of things "properly" then O_DIRECT could be changed into splice/fadvise calls either by a library or in the kernel directly... No, because the semantics are entirely different. An application using read/write with O_DIRECT expects read() to block until data is physically fetched from the device. fadvise() does not FORCE the kernel to discard cache, it only hints that it should, so a read() or mmap() very well may reuse a cached page instead of fetching from the disk again. The application also expects write() to block until the data is on the disk. In the case of a blocking write, you could splice/msync, but what about aio?
Re: O_DIRECT question
Hua Zhong wrote: The other problem besides the inability to handle IO errors is that mmap()+msync() is synchronous. You need to go async to keep the pipelines full. msync(addr, len, MS_ASYNC); doesn't do what you want? No, because there is no notification of completion. In fact, does this call actually even avoid blocking in the current code, while asking the kernel to flush the pages in the background? Even if it performs the sync in the background, what about faulting in the pages to be synced? For instance, if you splice pages from a source mmaped file into the destination mmap, then msync on the destination, doesn't the process still block to fault in the source pages?
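[Editor's note: one Linux-specific way to get the "start writeback now, wait for completion later" behaviour Phillip is asking for - which msync(MS_ASYNC) alone does not give - is sync_file_range(). This is a sketch of that swapped-in technique, not something proposed in the thread; the helper names are illustrative, and sync_file_range() is explicitly not a durability barrier (it flushes data pages only, no metadata or journal).]

```c
#define _GNU_SOURCE             /* for sync_file_range() */
#include <fcntl.h>

/* Kick off writeback of [off, off+len) without waiting for it. */
static int start_writeback(int fd, off_t off, off_t len)
{
    return sync_file_range(fd, off, len, SYNC_FILE_RANGE_WRITE);
}

/* Block until all dirty pages in the range have been written. */
static int wait_writeback(int fd, off_t off, off_t len)
{
    return sync_file_range(fd, off, len,
                           SYNC_FILE_RANGE_WAIT_BEFORE |
                           SYNC_FILE_RANGE_WRITE |
                           SYNC_FILE_RANGE_WAIT_AFTER);
}
```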
Re: O_DIRECT question
Aubrey wrote: On 1/12/07, Nick Piggin <[EMAIL PROTECTED]> wrote: Linus Torvalds wrote: > > On Fri, 12 Jan 2007, Nick Piggin wrote: > >>We are talking about about fragmentation. And limiting pagecache to try to >>avoid fragmentation is a bandaid, especially when the problem can be solved >>(not just papered over, but solved) in userspace. > > > It's not clear that the problem _can_ be solved in user space. > > It's easy enough to say "never allocate more than a page". But it's often > not REALISTIC. > > Very basic issue: the perfect is the enemy of the good. Claiming that > there is a "proper solution" is usually a total red herring. Quite often > there isn't, and the "paper over" is actually not papering over, it's > quite possibly the best solution there is. Yeah *smallish* higher order allocations are fine, and we use them all the time for things like stacks or networking. But Aubrey (who somehow got removed from the cc list) wants to do order 9 allocations from userspace in his nommu environment. I'm just trying to be realistic when I say that this isn't going to be robust and a userspace solution is needed. Hmm..., aside from big order allocations from user space, if there is a large application we need to run, it should be loaded into the memory, so we have to allocate a big block to accommodate it. kernel fun like load_elf_fdpic_binary() etc will request contiguous memory, then if vfs eat up free memory, loading fails. Before we had virtual memory we had only a base address register, start at this location and go thus far, and user program memory had to be contiguous. To change a program size, all other programs might be moved, either by memory copy or actual swap to disk if total memory became a problem. To minimize the pain, programs were loaded at one end of memory, and system buffers and such were allocated at the other. That allowed the most recently loaded program the best chance of being able to grow without thrashing. 
The point is that if you want to be able to allocate at all, sometimes you will have to write dirty pages, garbage collect, and move or swap programs. The hardware is just too limited to do something less painful, and the user can't see memory to do things better. Linus is right, 'Claiming that there is a "proper solution" is usually a total red herring. Quite often there isn't, and the "paper over" is actually not papering over, it's quite possibly the best solution there is.' I think any solution is going to be ugly, unfortunately. -- bill davidsen <[EMAIL PROTECTED]> CTO TMR Associates, Inc Doing interesting things with small computers since 1979
Re: O_DIRECT question
On Thu, 11 Jan 2007, Linus Torvalds wrote: > On Thu, 11 Jan 2007, Viktor wrote: > > > > OK, madvise() used with mmap'ed file allows to have reads from a file > > with zero-copy between kernel/user buffers and don't pollute cache > > memory unnecessarily. But how about writes? How is to do zero-copy > > writes to a file and don't pollute cache memory without using O_DIRECT? > > Do I miss the appropriate interface? > > mmap()+msync() can do that too. > > Also, regular user-space page-aligned data could easily just be moved into > the page cache. We actually have a lot of the infrastructure for it. See > the "splice()" system call. it seems to me that if splice and fadvise and related things are sufficient for userland to take care of things "properly" then O_DIRECT could be changed into splice/fadvise calls either by a library or in the kernel directly... looking at the splice(2) api it seems like it'll be difficult to implement O_DIRECT pread/pwrite from userland using splice... so there'd need to be some help there. i'm probably missing something. -dean
Re: O_DIRECT question
On 1/12/07, Nick Piggin <[EMAIL PROTECTED]> wrote: Linus Torvalds wrote: > > On Fri, 12 Jan 2007, Nick Piggin wrote: > >>We are talking about fragmentation. And limiting pagecache to try to >>avoid fragmentation is a bandaid, especially when the problem can be solved >>(not just papered over, but solved) in userspace. > > > It's not clear that the problem _can_ be solved in user space. > > It's easy enough to say "never allocate more than a page". But it's often > not REALISTIC. > > Very basic issue: the perfect is the enemy of the good. Claiming that > there is a "proper solution" is usually a total red herring. Quite often > there isn't, and the "paper over" is actually not papering over, it's > quite possibly the best solution there is. Yeah *smallish* higher order allocations are fine, and we use them all the time for things like stacks or networking. But Aubrey (who somehow got removed from the cc list) wants to do order 9 allocations from userspace in his nommu environment. I'm just trying to be realistic when I say that this isn't going to be robust and a userspace solution is needed. Hmm..., aside from big order allocations from user space, if there is a large application we need to run, it should be loaded into memory, so we have to allocate a big block to accommodate it. Kernel functions like load_elf_fdpic_binary() etc. will request contiguous memory; then if the vfs eats up free memory, loading fails. -Aubrey
Re: O_DIRECT question
On Fri, 12 Jan 2007, Nick Piggin wrote: > > Yeah *smallish* higher order allocations are fine, and we use them all the > time for things like stacks or networking. > > But Aubrey (who somehow got removed from the cc list) wants to do order 9 > allocations from userspace in his nommu environment. I'm just trying to be > realistic when I say that this isn't going to be robust and a userspace > solution is needed. I do agree that order-9 allocations are simply unlikely to work without some pre-allocation notion or some serious work at active de-fragmentation (and the page cache is likely to be the _least_ of the problems people will hit - slab and other kernel allocations are likely to be much much harder to handle, since you can't free them in quite as directed a manner). But for smallish-order (eg perhaps 3-4 possibly even more if you are careful in other places), the page cache limiter may well be a "good enough" solution in practice, especially if other allocations can be controlled by strict usage patterns (which is not realistic in a general-purpose kind of situation, but might be realistic in embedded). Linus
Re: O_DIRECT question
Nick Piggin wrote: Linus Torvalds wrote: Very basic issue: the perfect is the enemy of the good. Claiming that there is a "proper solution" is usually a total red herring. Quite often there isn't, and the "paper over" is actually not papering over, it's quite possibly the best solution there is. Yeah *smallish* higher order allocations are fine, and we use them all the time for things like stacks or networking. But Aubrey (who somehow got removed from the cc list) wants to do order 9 allocations from userspace in his nommu environment. I'm just trying to be realistic when I say that this isn't going to be robust and a userspace solution is needed. Oh, and also: I don't disagree that limiting pagecache to some % might be useful for other reasons. -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com
Re: O_DIRECT question
Linus Torvalds wrote: On Fri, 12 Jan 2007, Nick Piggin wrote: We are talking about fragmentation. And limiting pagecache to try to avoid fragmentation is a bandaid, especially when the problem can be solved (not just papered over, but solved) in userspace. It's not clear that the problem _can_ be solved in user space. It's easy enough to say "never allocate more than a page". But it's often not REALISTIC. > Very basic issue: the perfect is the enemy of the good. Claiming that there is a "proper solution" is usually a total red herring. Quite often there isn't, and the "paper over" is actually not papering over, it's quite possibly the best solution there is. Yeah *smallish* higher order allocations are fine, and we use them all the time for things like stacks or networking. But Aubrey (who somehow got removed from the cc list) wants to do order 9 allocations from userspace in his nommu environment. I'm just trying to be realistic when I say that this isn't going to be robust and a userspace solution is needed. -- SUSE Labs, Novell Inc.
Re: O_DIRECT question
On Fri, 12 Jan 2007, Nick Piggin wrote: > > We are talking about fragmentation. And limiting pagecache to try to > avoid fragmentation is a bandaid, especially when the problem can be solved > (not just papered over, but solved) in userspace. It's not clear that the problem _can_ be solved in user space. It's easy enough to say "never allocate more than a page". But it's often not REALISTIC. Very basic issue: the perfect is the enemy of the good. Claiming that there is a "proper solution" is usually a total red herring. Quite often there isn't, and the "paper over" is actually not papering over, it's quite possibly the best solution there is. Linus
Re: O_DIRECT question
Bill Davidsen wrote:
> Nick Piggin wrote:
> > Aubrey wrote:
> > [...]
> >
> > Exactly, and the *real* fix is to modify userspace not to make
> > mallocs of more than PAGE_SIZE[*] if it is to be nommu friendly. It
> > is the kernel hacks to do things like limit cache size that are the
> > bandaids.
>
> Tuning the system to work appropriately for a given load is not a
> band-aid.

We are talking about fragmentation. And limiting pagecache to try to
avoid fragmentation is a bandaid, especially when the problem can be
solved (not just papered over, but solved) in userspace.

--
SUSE Labs, Novell Inc.
Re: O_DIRECT question
Limiting total page cache can be considered first. Only if the total
page cache overruns its limit do we check whether the file overruns its
per-file limit. If it does, release part of the page cache and wake up
kswapd at the same time.

On 1/12/07, Aubrey <[EMAIL PROTECTED]> wrote:
> On 1/11/07, Roy Huang <[EMAIL PROTECTED]> wrote:
> > On an embedded system, limiting page cache can relieve memory
> > fragmentation. There is a patch against 2.6.19 which limits the page
> > cache of every opened file and the total pagecache. When the limit is
> > reached, it will release the page cache that overruns the limit.
>
> The patch seems to work for me. But some suggestions in my mind:
>
> 1) Can we limit the total page cache, not the page cache per file?
> Think about it: if total memory is 128M, 10% of it is 12.8M. If one
> application is running, it can use 12.8M of vfs cache, and performance
> will probably not be impacted. However, the current patch limits the
> page cache per file, which means if only one application runs it can
> only use CONFIG_PAGE_LIMIT pages of cache. That may be too small for
> the application.
>
> --snip---
> if (mapping->nrpages >= mapping->pages_limit)
>         balance_cache(mapping);
> --snip---
>
> 2) A percentage would be a better way to control the value. Can we add
> a proc interface to make the value tunable?
>
> Thanks,
> -Aubrey
Re: O_DIRECT question
Aubrey wrote:
> On 1/11/07, Roy Huang <[EMAIL PROTECTED]> wrote:
> > On an embedded system, limiting page cache can relieve memory
> > fragmentation. There is a patch against 2.6.19 which limits the page
> > cache of every opened file and the total pagecache. When the limit is
> > reached, it will release the page cache that overruns the limit.
>
> The patch seems to work for me. But some suggestions in my mind:
>
> 1) Can we limit the total page cache, not the page cache per file?
> Think about it: if total memory is 128M, 10% of it is 12.8M. If one
> application is running, it can use 12.8M of vfs cache, and performance
> will probably not be impacted. However, the current patch limits the
> page cache per file, which means if only one application runs it can
> only use CONFIG_PAGE_LIMIT pages of cache. That may be too small for
> the application.
>
> --snip---
> if (mapping->nrpages >= mapping->pages_limit)
>         balance_cache(mapping);
> --snip---
>
> 2) A percentage would be a better way to control the value. Can we add
> a proc interface to make the value tunable?

Even a global value isn't completely straightforward, and a per-file
value would be yet more work. You see, it is hard to do any sort of
directed reclaim on these pages.

--
SUSE Labs, Novell Inc.
Re: O_DIRECT question
Nick Piggin wrote:
> Aubrey wrote:
> > On 1/11/07, Nick Piggin <[EMAIL PROTECTED]> wrote:
> > > What you _really_ want to do is avoid large mallocs after boot, or
> > > use a CPU with an mmu. I don't think nommu linux was ever intended
> > > to be a simple drop-in replacement for a normal unix kernel.
> >
> > Is there a position available working on an mmu CPU? Joking, :)
> > Yes, some problems are serious on nommu linux. But I think we should
> > try to fix them, not avoid them.
>
> Exactly, and the *real* fix is to modify userspace not to make mallocs
> of more than PAGE_SIZE[*] if it is to be nommu friendly. It is the
> kernel hacks to do things like limit cache size that are the bandaids.

Tuning the system to work appropriately for a given load is not a
band-aid. I have been saying since 2.5.x times that filling memory with
cached writes was a bad thing, and filling with writes to a single file
was a doubly bad thing. Back in 2.4.NN-aa kernels there were some
tunables to address that, but other than adding your own, 2.6 just
behaves VERY badly for some loads. Of course, being an embedded system,
if they work for you then that's really fine and you can obviously ship
with them. But they don't need to go upstream.

Anyone who has a few processes which write a lot of data and many
processes with more modest i/o needs will see the overfilling of cache
with data from one process or even for one file, and the resulting
impact on the performance of all other processes, particularly if the
kernel decides to write all the data for one file at once, because it
avoids seeks, even if it uses the drive for seconds. The code has gone
too far in the direction of throughput, at the expense of response to
other processes, given the (common) behavior noted.
--
bill davidsen <[EMAIL PROTECTED]>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979
Re: O_DIRECT question
On 1/11/07, Roy Huang <[EMAIL PROTECTED]> wrote:
> On an embedded system, limiting page cache can relieve memory
> fragmentation. There is a patch against 2.6.19 which limits the page
> cache of every opened file and the total pagecache. When the limit is
> reached, it will release the page cache that overruns the limit.

The patch seems to work for me. But some suggestions in my mind:

1) Can we limit the total page cache, not the page cache per file?
Think about it: if total memory is 128M, 10% of it is 12.8M. If one
application is running, it can use 12.8M of vfs cache, and performance
will probably not be impacted. However, the current patch limits the
page cache per file, which means if only one application runs it can
only use CONFIG_PAGE_LIMIT pages of cache. That may be too small for the
application.

--snip---
if (mapping->nrpages >= mapping->pages_limit)
        balance_cache(mapping);
--snip---

2) A percentage would be a better way to control the value. Can we add a
proc interface to make the value tunable?

Thanks,
-Aubrey
Re: O_DIRECT question
linux-os (Dick Johnson) wrote:
> On Wed, 10 Jan 2007, Aubrey wrote:
> > Hi all,
> >
> > Opening a file with the O_DIRECT flag can do un-buffered read/write
> > access. So if I need un-buffered access, I have to change all of my
> > applications to add this flag. What's more, some scripts like "cp
> > oldfile newfile" still use pagecache and buffer. Now, my question is,
> > is there an existing way to mount a filesystem with the O_DIRECT
> > flag, so that I don't need to change anything in my system? If there
> > is no such option so far, what is the right way to achieve my
> > purpose?
> >
> > Thanks a lot.
> > -Aubrey
>
> I don't think O_DIRECT ever did what a lot of folks expect, i.e., write
> this buffer of data to the physical device _now_. All I/O ends up being
> buffered. The `man` page states that the I/O will be synchronous, that
> at the conclusion of the call, data will have been transferred.
> However, the data written probably will not be in the physical device,
> perhaps only in a DMA-able buffer with a promise to get it to the SCSI
> device, soon.

No one (who read the specs) ever thought the write was "right now," just
that it was direct from user buffers. So it is not buffered, but it is
queued through the elevator.

> Maybe you need to say why you want to use O_DIRECT with its terrible
> performance?

Because it doesn't have terrible performance, because the user knows
better than the o/s what is "right," etc. I used it to eliminate cache
impact from large but non-essential operations; others use it on slow
machines to avoid the CPU impact and bus bandwidth impact of extra
copies. Please don't assume that users are unable to understand how it
works because you believe some other feature which does something else
would be just as good. There is no other option which causes the writes
to be queued right now and not use any cache, and that is sometimes just
what you want.
I do like the patch to limit per-file and per-system cache, though; in
some cases I really would like the system to slow gradually rather than
fill 12GB of RAM with backlogged writes, then queue them and have other
i/o crawl or stop.

--
bill davidsen <[EMAIL PROTECTED]>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979
RE: O_DIRECT question
> The other problem besides the inability to handle IO errors is that
> mmap()+msync() is synchronous. You need to go async to keep
> the pipelines full.

msync(addr, len, MS_ASYNC); doesn't do what you want?

> Now if someone wants to implement an aio version of msync and
> mlock, that might do the trick. At least for MMU systems.
> Non MMU systems just can't play mmap type games.
Re: O_DIRECT question
Michael Tokarev wrote:
> Linus Torvalds wrote:
> > On Thu, 11 Jan 2007, Viktor wrote:
> > > OK, madvise() used with an mmap'ed file allows reads from a file
> > > with zero-copy between kernel/user buffers and doesn't pollute
> > > cache memory unnecessarily. But how about writes? How does one do
> > > zero-copy writes to a file without polluting cache memory and
> > > without using O_DIRECT? Do I miss the appropriate interface?
> >
> > mmap()+msync() can do that too.
>
> It can, somehow... until there's an I/O error. And *that* is just
> terrible.

The other problem besides the inability to handle IO errors is that
mmap()+msync() is synchronous. You need to go async to keep the
pipelines full.

Now if someone wants to implement an aio version of msync and mlock,
that might do the trick. At least for MMU systems. Non-MMU systems just
can't play mmap type games.
Re: O_DIRECT question
On Thu, 2007-01-11 at 11:00 -0800, Linus Torvalds wrote:
> On Thu, 11 Jan 2007, Trond Myklebust wrote:
> >
> > For NFS, the main feature of interest when it comes to O_DIRECT is
> > strictly uncached I/O. Replacing it with POSIX_FADV_NOREUSE won't help
> > because it can't guarantee that the page will be thrown out of the page
> > cache before some second process tries to read it. That is particularly
> > true if some dopey third party process has mmapped the file.
>
> You'd still be MUCH better off using the page cache, and just forcing the
> IO (but _with_ all the page cache synchronization still active). Which is
> trivial to do on the filesystem level, especially for something like NFS.
>
> If you bypass the page cache, you just make that "dopey third party
> process" problem worse. You now _guarantee_ that there are aliases with
> different data.

Quite, but that is sometimes an admissible state of affairs. One of the
things that was infuriating when we were trying to do shared databases
over the page cache was that someone would start some unsynchronised
process that had nothing to do with the database itself (it would
typically be a process that was backing up the rest of the disk or
something like that). Said process would end up pinning pages in memory,
and prevent the database itself from getting updated data from the
server.

IOW: the problem was not that of unsynchronised I/O per se. It was
rather that of allowing the application to set up its own
synchronisation barriers and to ensure that no pages are cached across
these barriers. POSIX_FADV_NOREUSE can't offer that guarantee.

> Of course, with NFS, the _server_ will resolve any aliases anyway, so at
> least you don't get file corruption, but you can get some really strange
> things (like the write of one process actually happening before, but being
> flushed _after_ and overriding the later write of the O_DIRECT process).
Writes are not the real problem here, since shared databases typically
do implement sufficient synchronisation, and NFS can guarantee that only
the dirty data will be written out. However, reading back the data is
problematic when you have insufficient control over the page cache.

The other issue is, of course, that databases don't _want_ to cache the
data in this situation, so the extra copy to the page cache is just a
bother. As you pointed out, that becomes less of an issue as processor
caches and memory speeds increase, but it is still apparently a
measurable effect.

Cheers
  Trond
Re: O_DIRECT question
On Thu, 11 Jan 2007, Trond Myklebust wrote:
>
> For NFS, the main feature of interest when it comes to O_DIRECT is
> strictly uncached I/O. Replacing it with POSIX_FADV_NOREUSE won't help
> because it can't guarantee that the page will be thrown out of the page
> cache before some second process tries to read it. That is particularly
> true if some dopey third party process has mmapped the file.

You'd still be MUCH better off using the page cache, and just forcing
the IO (but _with_ all the page cache synchronization still active).
Which is trivial to do on the filesystem level, especially for something
like NFS.

If you bypass the page cache, you just make that "dopey third party
process" problem worse. You now _guarantee_ that there are aliases with
different data.

Of course, with NFS, the _server_ will resolve any aliases anyway, so at
least you don't get file corruption, but you can get some really strange
things (like the write of one process actually happening before, but
being flushed _after_ and overriding the later write of the O_DIRECT
process).

And sure, the filesystem can have its own alias avoidance too (by just
probing the page cache all the time), but the fundamental fact remains:
the problem is that O_DIRECT as a page-cache-bypassing mechanism is
BROKEN. If you have issues with caching (but still have to allow it for
other things), the way to fix them is not to make uncached accesses,
it's to force the cache to be serialized. That's very fundamentally
true.

Linus
Re: O_DIRECT question
On Thu, 2007-01-11 at 09:04 -0800, Linus Torvalds wrote:
> That is what I think some users could do. If the main issue with
> O_DIRECT is the page cache allocations, and if we instead had better
> (read: "any") support for POSIX_FADV_NOREUSE, then one class of reasons
> for O_DIRECT usage would just go away.

For NFS, the main feature of interest when it comes to O_DIRECT is
strictly uncached I/O. Replacing it with POSIX_FADV_NOREUSE won't help,
because it can't guarantee that the page will be thrown out of the page
cache before some second process tries to read it. That is particularly
true if some dopey third party process has mmapped the file.

Trond
Re: O_DIRECT question
On Thu, 11 Jan 2007, Alan wrote:
>
> Well you can - it's called SG_IO and that really does get the OS out of
> the way. O_DIRECT gets crazy when you stop using it on devices directly
> and use it on files.

Well, on a raw disk, O_DIRECT is fine too, but yeah, you might as well
use SG_IO at that point. All of my issues are about filesystems. And
filesystems are where people use O_DIRECT most. Almost nobody puts their
database on a partition of its own these days, afaik. Perhaps for
benchmarking or some really high-end stuff. Not "normal users".

Linus
Re: O_DIRECT question
> space, just as an example) is wrong in the first place, but the really
> subtle problems come when you realize that you can't really just "bypass"
> the OS.

Well you can - it's called SG_IO, and that really does get the OS out of
the way. O_DIRECT gets crazy when you stop using it on devices directly
and use it on files.

You do need some way to avoid the copy cost of caches and get data
direct to user space. It also needs to be a way that works without MMU
tricks, because many of those that need it are embedded platforms.

Alan
Re: O_DIRECT question
Linus Torvalds wrote:
> On Thu, 11 Jan 2007, Viktor wrote:
> > OK, madvise() used with an mmap'ed file allows reads from a file with
> > zero-copy between kernel/user buffers and doesn't pollute cache
> > memory unnecessarily. But how about writes? How does one do zero-copy
> > writes to a file without polluting cache memory and without using
> > O_DIRECT? Do I miss the appropriate interface?
>
> mmap()+msync() can do that too.

It can, somehow... until there's an I/O error. And *that* is just
terrible.

Granted, I didn't check 2.6.x kernels, especially the latest ones. But
in 2.4, if an I/O space behind mmap becomes unREADable, the process gets
stuck in some unkillable state forever. I don't know what happens with
write errors, but that behaviour with read errors is just unacceptable.

Sure, it's not something like posix_madvise() (which is for reads
anyway, not writes). But I'd very strongly disagree about usage of mmap
for anything more-or-less serious, because of, umm... difficulties with
error recovery (if it's at all possible).

Note also that anything but O_DIRECT isn't... portable. O_DIRECT, with
all its shortcomings and ugliness, works, and works on quite some
systems. Having something else, especially with a very different usage
model, would be somewhat problematic -- I mean, if the whole I/O
subsystem in an application has to be redesigned and rewritten in order
to use that advanced (or just "right") mechanism. O_DIRECT is no
different from basic read()/write() -- just one extra bit at open() time
-- and all your code, which evolved during years and got years of
testing too, just works, at least in theory, if the O_DIRECT interface
is working (ok, ok, I know about alignment issues, but those are also
handled easily). *Unless* there's a very noticeable gain, redesigning
all that isn't worth it.

From my experience with databases (mostly Oracle, and some with Postgres
and Mysql), O_DIRECT has a *dramatic* impact on performance. You don't
use O_DIRECT, and you lose a lot.
O_DIRECT is *already* the fastest way possible, I think - for example,
it gives maximum speed when writing to or reading from a raw device
(/dev/sdb etc). I don't think there's a way to improve that
performance... Yes, there ARE, it seems, some ways for improvement, in
other areas - like utilizing write barriers, for example, which isn't
quite possible now from userspace. But as long as O_DIRECT actually
writes data before returning from the write() call (as seems to be the
case at least with a normal filesystem on a real block device - I don't
touch corner cases like nfs here), it's pretty much THE ideal solution,
at least from the application (developer) standpoint.

By the way, ext[23]fs is terribly slow with O_DIRECT writes - it gives
about 1/4 of the speed of the raw device when multiple concurrent direct
readers and writers are running. Xfs gives full raw device speed here. I
think that MAY be related to locking issues in ext[23], but I don't know
for sure. And another "btw" - when creating files, O_DIRECT is quite a
killer - each write takes a lot more time than "necessary". But once a
file has been written, re-writes are pretty fast.

Also, and it's quite... funny (to me at least): being curious, I
compared the write speed (random small-block I/O scattered all around
the disk) of modern disk drives with and without write cache (the
WCE=[01] bit in the SCSI "Cache control" page of every disk drive). The
fun is: with write cache turned on, actual speed is LOWER than without
cache. I don't remember the exact numbers, something like 120mb/sec vs
90mb/sec. And I think it's to be expected, as well - at first all writes
go to the cache, but since the data stream goes on and on, the cache
fills up quickly, and in order to accept the next data, the drive has to
free some place in its cache. So instead of just doing its work, it
spends its time bouncing data to/from the cache... Sure, it's not about
the linux pagecache or something like that, but it's still somehow
related.
:)

[]
> O_DIRECT - by bypassing the "real" kernel - very fundamentally breaks
> the whole _point_ of the kernel. There's tons of races where an
> O_DIRECT user (or other users that expect to see the O_DIRECT data)
> will now see the wrong data - including seeing uninitialized portions
> of the disk etc etc.

Huh? Well, I plug a shiny new harddisk into my computer and do an
O_DIRECT read of it - will I see uninitialized data? Sure I will (well,
in most cases the whole disk is filled with zeros anyway, so it isn't
uninitialized). The same applies to a regular read, too.

If what you're saying applies to an O_DIRECT read of a file on a
filesystem -- well, that's definitely a kernel bug. It should either not
allow reading the last part which isn't a whole sector (or whatever)
when the file size isn't sector-aligned, or it should ensure the "extra"
data is initialized. Yes, that's difficult to implement in the kernel.
But that's not an excuse not to do it. AND I think just failing the
Re: O_DIRECT question
On Thu, 11 Jan 2007, Xavier Bestel wrote:
> On Thursday 11 January 2007 at 07:50 -0800, Linus Torvalds wrote:
> >
> > > O_DIRECT is still crazily racy versus pagecache operations.
> >
> > Yes. O_DIRECT is really fundamentally broken. There's just no way to
> > fix it sanely.
>
> How about aliasing O_DIRECT to POSIX_FADV_NOREUSE (sortof)?

That is what I think some users could do. If the main issue with
O_DIRECT is the page cache allocations, and if we instead had better
(read: "any") support for POSIX_FADV_NOREUSE, then one class of reasons
for O_DIRECT usage would just go away.

See also the patch that Roy Huang posted about another approach to the
same problem: just limiting page cache usage explicitly.

That's not the _only_ issue with O_DIRECT, though. It's one big one, but
people like to think that the memory copy makes a difference when you do
IO too (I think it's likely pretty debatable in real life, but I'm
totally certain you can benchmark it, probably even pretty easily,
especially if you have fairly studly IO capabilities and a CPU that
isn't quite as studly).

So POSIX_FADV_NOREUSE kind of support is one _part_ of the O_DIRECT
picture, and depending on your problems (in this case, the embedded
world) it may even be the *biggest* part. But it's not the whole
picture.

Linus
Re: O_DIRECT question
On Thursday 11 January 2007 at 07:50 -0800, Linus Torvalds wrote:
>
> > O_DIRECT is still crazily racy versus pagecache operations.
>
> Yes. O_DIRECT is really fundamentally broken. There's just no way to
> fix it sanely.

How about aliasing O_DIRECT to POSIX_FADV_NOREUSE (sortof)?

Xav
Re: O_DIRECT question
On Thu, 11 Jan 2007, Roy Huang wrote:
>
> On an embedded system, limiting page cache can relieve memory
> fragmentation. There is a patch against 2.6.19 which limits the page
> cache of every opened file and the total pagecache. When the limit is
> reached, it will release the page cache that overruns the limit.

I do think that something like this is probably a good idea, even on
non-embedded setups. We historically couldn't do this, because mapped
pages were too damn hard to remove, but that's obviously not much of a
problem any more.

However, the page-cache limit should NOT be some compile-time constant.
It should work the same way the "dirty page" limit works, and probably
just default to "feel free to use 90% of memory for page cache".

Linus