Re: O_DIRECT question

2007-02-06 Thread Pavel Machek
Hi! > > > Which shouldn't be true. There is no fundamental reason why > > > ordinary writes should be slower than O_DIRECT. > > > > Again, there IS a reason: O_DIRECT eliminates the cpu overhead of the > > kernel-user copy, > > You assume that ordinary read()/write() is *required* to do the co

Re: O_DIRECT question

2007-01-31 Thread Michael Tokarev
Phillip Susi wrote: [] > You seem to have missed the point of this thread. Denis Vlasenko's > message that you replied to simply pointed out that they are > semantically equivalent, so O_DIRECT can be dropped provided that O_SYNC > + madvise could be fixed to perform as well. Several people inclu

Re: O_DIRECT question

2007-01-30 Thread Andrea Arcangeli
On Tue, Jan 30, 2007 at 06:07:14PM -0500, Phillip Susi wrote: > It most certainly matters where the error happened because "you are > screwd" is not an acceptable outcome in a mission critical application. An I/O error is not an acceptable outcome in a mission critical app, all mission critical

Re: O_DIRECT question

2007-01-30 Thread Phillip Susi
Andrea Arcangeli wrote: When you have I/O errors during _writes_ (not Read!!) the raid must kick the disk out of the array before the OS ever notices. And if it's software raid that you're using, the OS should kick out the disk before your app ever notices any I/O error. when the write I/O error

Re: O_DIRECT question

2007-01-30 Thread Andrea Arcangeli
On Tue, Jan 30, 2007 at 08:57:20PM +0100, Andrea Arcangeli wrote: > Please try yourself, it's simple enough: > >time dd if=/dev/hda of=/dev/null bs=16M count=100 >time dd if=/dev/hda of=/dev/null bs=16M count=100 iflag=sync sorry, reading won't help much to exercise sync ;). But t

Re: O_DIRECT question

2007-01-30 Thread Andrea Arcangeli
On Tue, Jan 30, 2007 at 01:50:41PM -0500, Phillip Susi wrote: > It should return the number of bytes successfully written before the > error, giving you the location of the first error. Also using smaller > individual writes ( preferably issued in parallel ) also allows the > problem spot to be

Re: O_DIRECT question

2007-01-30 Thread Phillip Susi
Andrea Arcangeli wrote: On Tue, Jan 30, 2007 at 10:36:03AM -0500, Phillip Susi wrote: Did you intentionally drop this reply off list? No. Then I'll restore the lkml to the cc list. No, it doesn't... or at least can't report WHERE the error is. O_SYNC doesn't report where the error is eit

Re: O_DIRECT question

2007-01-29 Thread Denis Vlasenko
On Monday 29 January 2007 18:00, Andrea Arcangeli wrote: > On Sun, Jan 28, 2007 at 06:03:08PM +0100, Denis Vlasenko wrote: > > I still don't see much difference between O_SYNC and O_DIRECT write > > semantic. > > O_DIRECT is about avoiding the copy_user between cache and userland, > when working w

Re: O_DIRECT question

2007-01-29 Thread Andrea Arcangeli
On Sun, Jan 28, 2007 at 06:03:08PM +0100, Denis Vlasenko wrote: > I still don't see much difference between O_SYNC and O_DIRECT write > semantic. O_DIRECT is about avoiding the copy_user between cache and userland, when working with devices that run faster than ram (think >=100M/sec, quite standa

Re: O_DIRECT question

2007-01-29 Thread Phillip Susi
Denis Vlasenko wrote: I still don't see much difference between O_SYNC and O_DIRECT write semantic. Yes, if you change the normal io paths to properly support playing vmsplice games ( which have a number of corner cases ) to get the zero copy, and support madvise() and O_SYNC to control cachi

Re: O_DIRECT question

2007-01-28 Thread Denis Vlasenko
On Sunday 28 January 2007 16:30, Bill Davidsen wrote: > Denis Vlasenko wrote: > > On Saturday 27 January 2007 15:01, Bodo Eggert wrote: > >> Denis Vlasenko <[EMAIL PROTECTED]> wrote: > >>> On Friday 26 January 2007 19:23, Bill Davidsen wrote: > Denis Vlasenko wrote: > > On Thursday 25 Janu

Re: O_DIRECT question

2007-01-28 Thread Denis Vlasenko
On Sunday 28 January 2007 16:18, Bill Davidsen wrote: > Denis Vlasenko wrote: > > On Friday 26 January 2007 19:23, Bill Davidsen wrote: > >> Denis Vlasenko wrote: > >>> On Thursday 25 January 2007 21:45, Michael Tokarev wrote: > Phillip Susi wrote: > > [...] > > But even single-th

Re: O_DIRECT question

2007-01-28 Thread Bill Davidsen
Denis Vlasenko wrote: On Saturday 27 January 2007 15:01, Bodo Eggert wrote: Denis Vlasenko <[EMAIL PROTECTED]> wrote: On Friday 26 January 2007 19:23, Bill Davidsen wrote: Denis Vlasenko wrote: On Thursday 25 January 2007 21:45, Michael Tokarev wrote: But even single-threaded I/O but in larg

Re: O_DIRECT question

2007-01-28 Thread Bill Davidsen
Denis Vlasenko wrote: On Friday 26 January 2007 19:23, Bill Davidsen wrote: Denis Vlasenko wrote: On Thursday 25 January 2007 21:45, Michael Tokarev wrote: Phillip Susi wrote: [...] But even single-threaded I/O but in large quantities benefits from O_DIRECT significantly, and I poi

Re: O_DIRECT question

2007-01-27 Thread Denis Vlasenko
On Saturday 27 January 2007 15:01, Bodo Eggert wrote: > Denis Vlasenko <[EMAIL PROTECTED]> wrote: > > On Friday 26 January 2007 19:23, Bill Davidsen wrote: > >> Denis Vlasenko wrote: > >> > On Thursday 25 January 2007 21:45, Michael Tokarev wrote: > > >> >> But even single-threaded I/O but in larg

Re: O_DIRECT question

2007-01-27 Thread Bodo Eggert
Denis Vlasenko <[EMAIL PROTECTED]> wrote: > On Friday 26 January 2007 19:23, Bill Davidsen wrote: >> Denis Vlasenko wrote: >> > On Thursday 25 January 2007 21:45, Michael Tokarev wrote: >> >> But even single-threaded I/O but in large quantities benefits from >> >> O_DIRECT significantly, and I poi

Re: O_DIRECT question

2007-01-26 Thread Denis Vlasenko
On Friday 26 January 2007 19:23, Bill Davidsen wrote: > Denis Vlasenko wrote: > > On Thursday 25 January 2007 21:45, Michael Tokarev wrote: > >> Phillip Susi wrote: > >>> Denis Vlasenko wrote: > You mean "You can use aio_write" ? > >>> Exactly. You generally don't use O_DIRECT without aio. C

Re: O_DIRECT question

2007-01-26 Thread Denis Vlasenko
On Friday 26 January 2007 18:05, Phillip Susi wrote: > Denis Vlasenko wrote: > > Which shouldn't be true. There is no fundamental reason why > > ordinary writes should be slower than O_DIRECT. > > Again, there IS a reason: O_DIRECT eliminates the cpu overhead of the > kernel-user copy, You assu

Re: O_DIRECT question

2007-01-26 Thread Bill Davidsen
Denis Vlasenko wrote: On Thursday 25 January 2007 21:45, Michael Tokarev wrote: Phillip Susi wrote: Denis Vlasenko wrote: You mean "You can use aio_write" ? Exactly. You generally don't use O_DIRECT without aio. Combining the two is what gives the big win. Well, it's not only aio. Multith

Re: O_DIRECT question

2007-01-26 Thread Phillip Susi
Denis Vlasenko wrote: Which shouldn't be true. There is no fundamental reason why ordinary writes should be slower than O_DIRECT. Again, there IS a reason: O_DIRECT eliminates the cpu overhead of the kernel-user copy, and when coupled with multithreading or aio, allows the IO queues to be ke

Re: O_DIRECT question

2007-01-26 Thread Phillip Susi
Mark Lord wrote: You guys need to back up in this thread. Every example of O_DIRECT here could be replaced with calls to mmap(), msync(), and madvise() (or posix_fadvise). In addition to being at least as fast as O_DIRECT, these have the added benefit of using the page cache (avoiding reads for

Re: O_DIRECT question

2007-01-26 Thread Viktor
Mark Lord wrote: > You guys need to back up in this thread. > > Every example of O_DIRECT here could be replaced with > calls to mmap(), msync(), and madvise() (or posix_fadvise). No. How about handling IO errors? There is no practical way for it with mmap(). > In addition to being at least as fa

Re: O_DIRECT question

2007-01-26 Thread Mark Lord
You guys need to back up in this thread. Every example of O_DIRECT here could be replaced with calls to mmap(), msync(), and madvise() (or posix_fadvise). In addition to being at least as fast as O_DIRECT, these have the added benefit of using the page cache (avoiding reads for data already pres
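For readers unfamiliar with the mmap()/msync()/madvise() combination proposed in this message, here is a minimal sketch in Python, not taken from the thread. The function name `update_via_mmap` is hypothetical, and `MADV_SEQUENTIAL` is only an illustrative hint. Python's `mmap.flush()` is used as the msync() step. Note that the reply from Viktor in this listing objects that I/O errors are hard to handle with mmap(); this sketch does not address that.

```python
import mmap
import os


def update_via_mmap(path: str, offset: int, payload: bytes) -> None:
    """Overwrite part of an existing, non-empty file through a shared mapping."""
    fd = os.open(path, os.O_RDWR)
    try:
        size = os.fstat(fd).st_size
        with mmap.mmap(fd, size, access=mmap.ACCESS_WRITE) as m:
            # madvise() is advisory; guard it because older Pythons lack it.
            if hasattr(m, "madvise") and hasattr(mmap, "MADV_SEQUENTIAL"):
                m.madvise(mmap.MADV_SEQUENTIAL)
            # Writing into the mapping dirties pages in the page cache.
            m[offset:offset + len(payload)] = payload
            # flush() performs the msync() step for the whole mapping.
            m.flush()
    finally:
        os.close(fd)
```

The payload must fit inside the existing file, since a mapping cannot grow the file. That is related to the i_size concern Alex Tomas raises elsewhere in this listing.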

Re: O_DIRECT question

2007-01-26 Thread Bill Davidsen
Denis Vlasenko wrote: Well, I too currently work with Oracle. Apparently people who wrote the damn thing have very, eh, Oracle-centric world-view. "We want direct writes to the disk. Period." Why? Does it make sense? Are there better ways? - nothing. They think they know better. I fear you are tak

Re: O_DIRECT question

2007-01-25 Thread Denis Vlasenko
On Thursday 25 January 2007 21:45, Michael Tokarev wrote: > Phillip Susi wrote: > > Denis Vlasenko wrote: > >> You mean "You can use aio_write" ? > > > > Exactly. You generally don't use O_DIRECT without aio. Combining the > > two is what gives the big win. > > Well, it's not only aio. Multith

Re: O_DIRECT question

2007-01-25 Thread Michael Tokarev
Phillip Susi wrote: > Denis Vlasenko wrote: >> You mean "You can use aio_write" ? > > Exactly. You generally don't use O_DIRECT without aio. Combining the > two is what gives the big win. Well, it's not only aio. Multithreaded I/O also helps a lot -- all this, say, to utilize a raid array with

Re: O_DIRECT question

2007-01-25 Thread Phillip Susi
Denis Vlasenko wrote: You mean "You can use aio_write" ? Exactly. You generally don't use O_DIRECT without aio. Combining the two is what gives the big win.

Re: O_DIRECT question

2007-01-25 Thread Denis Vlasenko
On Thursday 25 January 2007 20:28, Phillip Susi wrote: > > Ahhh shit, are you saying that fdatasync will wait until writes > > *by all other processes* to this file will hit the disk? > > Is that true? > > I think all processes yes, but certainly all writes to this file by this > process. That

Re: O_DIRECT question

2007-01-25 Thread Phillip Susi
Denis Vlasenko wrote: If you opened a file and are doing only O_DIRECT writes, you *always* have your written data flushed, by each write(). How is it different from writes done using "normal" write() + fdatasync() pairs? Because you can do writes async, but not fdatasync ( unless there is an
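The two buffered alternatives being compared here can be sketched as follows. This is an illustration only, assuming Linux and Python's `os` module; the function names are hypothetical. Both variants block until the data has been handed to the device, which is the limitation Phillip Susi points at: the write can be issued asynchronously, but the fdatasync() call itself waits.

```python
import os


def _write_all(fd: int, data: bytes) -> None:
    """Write the whole buffer, looping in case of short writes."""
    view = memoryview(data)
    while len(view):
        view = view[os.write(fd, view):]


def write_then_fdatasync(path: str, chunks) -> None:
    """Ordinary buffered write() followed by fdatasync() after each chunk."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        for chunk in chunks:
            _write_all(fd, chunk)
            os.fdatasync(fd)  # blocks until the file's data is written back
    finally:
        os.close(fd)


def write_with_osync(path: str, chunks) -> None:
    """Same data, but each write() is synchronous because of O_SYNC."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_SYNC
    fd = os.open(path, flags, 0o644)
    try:
        for chunk in chunks:
            _write_all(fd, chunk)
    finally:
        os.close(fd)
```

Neither variant bypasses the page cache, so neither removes the kernel-user copy that the O_DIRECT proponents in this thread care about.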

Re: O_DIRECT question

2007-01-25 Thread Denis Vlasenko
On Thursday 25 January 2007 16:44, Phillip Susi wrote: > Denis Vlasenko wrote: > > I will still disagree on this point (on point "use O_DIRECT, it's faster"). > > There is no reason why O_DIRECT should be faster than "normal" read/write > > to large, aligned buffer. If O_DIRECT is faster on today's

Re: O_DIRECT question

2007-01-25 Thread Phillip Susi
Denis Vlasenko wrote: I will still disagree on this point (on point "use O_DIRECT, it's faster"). There is no reason why O_DIRECT should be faster than "normal" read/write to large, aligned buffer. If O_DIRECT is faster on today's kernel, then Linux' read()/write() can be optimized more. Ahh bu

Re: O_DIRECT question

2007-01-24 Thread Denis Vlasenko
On Monday 22 January 2007 17:17, Phillip Susi wrote: > > You do not need to know which read() exactly failed due to bad disk. > > Filename and offset from the start is enough. Right? > > > > So, SIGIO/SIGBUS can provide that, and if your handler is of > > void (*sa_sigaction)(int, siginfo_t *,

Re: O_DIRECT question

2007-01-22 Thread Phillip Susi
Denis Vlasenko wrote: The difference is that you block exactly when you try to access data which is not there yet, not sooner (potentially much sooner). If application (e.g. database) needs to know whether data is _really_ there, it should use aio_read (or something better, something which doesn

Re: O_DIRECT question

2007-01-22 Thread Al Boldi
Andrea Arcangeli wrote: > Linus may be right that perhaps one day the CPU will be so much faster > than disk that such a copy will not be measurable and then O_DIRECT > could be downgraded to O_STREAMING or an fadvise. If such a day will > come by, probably that same day Dr. Tanenbaum will be final

Re: O_DIRECT question

2007-01-22 Thread Phillip Susi
Denis Vlasenko wrote: What will happen if we just make open ignore O_DIRECT? ;) And then anyone who feels sad about is advised to do it like described here: http://lkml.org/lkml/2002/5/11/58 Then database and other high performance IO users will be broken. Most of Linus's rant there is bein

Re: O_DIRECT question

2007-01-21 Thread Andrea Arcangeli
Hello everyone, This is a long thread about O_DIRECT surprisingly without a single bugreport in it, that's a good sign that O_DIRECT is starting to work well in 2.6 too ;) On Fri, Jan 12, 2007 at 02:47:48PM -0800, Andrew Morton wrote: > On Fri, 12 Jan 2007 15:35:09 -0700 > Erik Andersen <[EMAIL P

Re: O_DIRECT question

2007-01-21 Thread Denis Vlasenko
On Sunday 21 January 2007 13:09, Michael Tokarev wrote: > Denis Vlasenko wrote: > > On Saturday 20 January 2007 21:55, Michael Tokarev wrote: > >> Denis Vlasenko wrote: > >>> On Thursday 11 January 2007 18:13, Michael Tokarev wrote: > example, which isn't quite possible now from userspace. Bu

Re: O_DIRECT question

2007-01-21 Thread Michael Tokarev
Denis Vlasenko wrote: > On Saturday 20 January 2007 21:55, Michael Tokarev wrote: >> Denis Vlasenko wrote: >>> On Thursday 11 January 2007 18:13, Michael Tokarev wrote: example, which isn't quite possible now from userspace. But as long as O_DIRECT actually writes data before returning f

Re: O_DIRECT question

2007-01-20 Thread Denis Vlasenko
On Saturday 20 January 2007 21:55, Michael Tokarev wrote: > Denis Vlasenko wrote: > > On Thursday 11 January 2007 18:13, Michael Tokarev wrote: > >> example, which isn't quite possible now from userspace. But as long as > >> O_DIRECT actually writes data before returning from write() call (as it >

Re: O_DIRECT question

2007-01-20 Thread Michael Tokarev
Denis Vlasenko wrote: > On Thursday 11 January 2007 18:13, Michael Tokarev wrote: >> example, which isn't quite possible now from userspace. But as long as >> O_DIRECT actually writes data before returning from write() call (as it >> seems to be the case at least with a normal filesystem on a real

Re: O_DIRECT question

2007-01-20 Thread Denis Vlasenko
On Sunday 14 January 2007 10:11, Nate Diller wrote: > On 1/12/07, Andrew Morton <[EMAIL PROTECTED]> wrote: > Most applications don't get the kind of performance analysis that > Digeo was doing, and even then, it's rather lucky that we caught that. > So I personally think it'd be best for libc or s

Re: O_DIRECT question

2007-01-20 Thread Denis Vlasenko
On Thursday 11 January 2007 18:13, Michael Tokarev wrote: > example, which isn't quite possible now from userspace. But as long as > O_DIRECT actually writes data before returning from write() call (as it > seems to be the case at least with a normal filesystem on a real block > device - I don't t

Re: O_DIRECT question

2007-01-20 Thread Denis Vlasenko
On Thursday 11 January 2007 16:50, Linus Torvalds wrote: > > On Thu, 11 Jan 2007, Nick Piggin wrote: > > > > Speaking of which, why did we obsolete raw devices? And/or why not just > > go with a minimal O_DIRECT on block device support? Not a rhetorical > > question -- I wasn't involved in the di

Re: O_DIRECT question

2007-01-17 Thread Bodo Eggert
On Tue, 16 Jan 2007, Arjan van de Ven wrote: > On Tue, 2007-01-16 at 21:26 +0100, Bodo Eggert wrote: > > Helge Hafting <[EMAIL PROTECTED]> wrote: > > > Michael Tokarev wrote: > > >> But seriously - what about just disallowing non-O_DIRECT opens together > > >> with O_DIRECT ones ? > > >> > > >

Re: O_DIRECT question

2007-01-17 Thread Alex Tomas
I think one problem with mmap/msync is that they can't maintain i_size atomically like regular write does. so, one needs to implement own i_size management in userspace. thanks, Alex > Side note: the only reason O_DIRECT exists is because database people are > too used to it, because other OS's

Re: O_DIRECT question

2007-01-16 Thread Arjan van de Ven
On Tue, 2007-01-16 at 21:26 +0100, Bodo Eggert wrote: > Helge Hafting <[EMAIL PROTECTED]> wrote: > > Michael Tokarev wrote: > > >> But seriously - what about just disallowing non-O_DIRECT opens together > >> with O_DIRECT ones ? > >> > > Please do not create a new local DOS attack. > > I open s

Re: O_DIRECT question

2007-01-16 Thread Aubrey Li
On 1/12/07, Linus Torvalds <[EMAIL PROTECTED]> wrote: On Thu, 11 Jan 2007, Roy Huang wrote: > > On an embedded system, limiting page cache can relieve memory > fragmentation. There is a patch against 2.6.19, which limits every > opened file's page cache and the total pagecache. When the limit is reached, i

Re: O_DIRECT question

2007-01-16 Thread Bodo Eggert
Helge Hafting <[EMAIL PROTECTED]> wrote: > Michael Tokarev wrote: >> But seriously - what about just disallowing non-O_DIRECT opens together >> with O_DIRECT ones ? >> > Please do not create a new local DOS attack. > I open some important file, say /etc/resolv.conf > with O_DIRECT and just sit

Re: O_DIRECT question

2007-01-15 Thread Jörn Engel
On Fri, 12 January 2007 00:19:45 +0800, Aubrey wrote: > > Yes for desktop, server, but maybe not for embedded system, specially > for no-mmu linux. In many embedded system cases, the whole system is > running in the ram, including file system. So it's not necessary using > page cache anymore. Page

Re: O_DIRECT question

2007-01-15 Thread Helge Hafting
Michael Tokarev wrote: Chris Mason wrote: [] I recently spent some time trying to integrate O_DIRECT locking with page cache locking. The basic theory is that instead of using semaphores for solving O_DIRECT vs buffered races, you put something into the radix tree (I call it a placeholder) t

Re: O_DIRECT question

2007-01-14 Thread Bodo Eggert
Bill Davidsen <[EMAIL PROTECTED]> wrote: > My point is, that there is code to handle sparse data now, without > O_DIRECT involved, and if O_DIRECT bypasses that, it's not a problem > with the idea of O_DIRECT, the kernel has a security problem. The idea of O_DIRECT is to bypass the pagecache, and

Re: O_DIRECT question

2007-01-14 Thread Bodo Eggert
On Sat, 13 Jan 2007, Bill Davidsen wrote: > Bodo Eggert wrote: > > > (*) This would allow fadvise_size(), too, which could reduce fragmentation > > (and give an early warning on full disks) without forcing e.g. fat to > > zero all blocks. OTOH, fadvise_size() would allow users to reserve

Re: O_DIRECT question

2007-01-14 Thread Bill Davidsen
Michael Tokarev wrote: Bill Davidsen wrote: If I got it right (and please someone tell me if I *really* got it right!), the problem is elsewhere. Suppose you have a filesystem, not at all related to databases and stuff. Your usual root filesystem, with your /etc/ /var and so on directories.

Re: O_DIRECT question

2007-01-14 Thread Nate Diller
On 1/12/07, Andrew Morton <[EMAIL PROTECTED]> wrote: On Fri, 12 Jan 2007 15:35:09 -0700 Erik Andersen <[EMAIL PROTECTED]> wrote: > On Fri Jan 12, 2007 at 05:09:09PM -0500, Linus Torvalds wrote: > > I suspect a lot of people actually have other reasons to avoid caches. > > > > For example, the re

Re: O_DIRECT question

2007-01-13 Thread Michael Tokarev
Bill Davidsen wrote: > Linus Torvalds wrote: >> [] >> But what O_DIRECT does right now is _not_ really sensible, and the >> O_DIRECT propeller-heads seem to have some problem even admitting that >> there _is_ a problem, because they don't care. > > You say that as if it were a failing. Currently

Re: O_DIRECT question

2007-01-13 Thread Bill Davidsen
Linus Torvalds wrote: On Sat, 13 Jan 2007, Michael Tokarev wrote: (No, really - this load isn't entirely synthetic. It's a typical database workload - random I/O all over, on a large file. If it can, it combines several I/Os into one, by requesting more than a single block at a time, but over

Re: O_DIRECT question

2007-01-13 Thread Bill Davidsen
Bodo Eggert wrote: (*) This would allow fadvise_size(), too, which could reduce fragmentation (and give an early warning on full disks) without forcing e.g. fat to zero all blocks. OTOH, fadvise_size() would allow users to reserve the complete disk space without his filesizes reflect

Re: O_DIRECT question

2007-01-13 Thread Bodo Eggert
Linus Torvalds <[EMAIL PROTECTED]> wrote: > On Sat, 13 Jan 2007, Michael Tokarev wrote: >> (No, really - this load isn't entirely synthetic. It's a typical database >> workload - random I/O all over, on a large file. If it can, it combines >> several I/Os into one, by requesting more than a sing

Re: O_DIRECT question

2007-01-12 Thread Nick Piggin
Bill Davidsen wrote: The point is that if you want to be able to allocate at all, sometimes you will have to write dirty pages, garbage collect, and move or swap programs. The hardware is just too limited to do something less painful, and the user can't see memory to do things better. Linus is

Re: O_DIRECT question

2007-01-12 Thread Andrew Morton
On Fri, 12 Jan 2007 15:35:09 -0700 Erik Andersen <[EMAIL PROTECTED]> wrote: > On Fri Jan 12, 2007 at 05:09:09PM -0500, Linus Torvalds wrote: > > I suspect a lot of people actually have other reasons to avoid caches. > > > > For example, the reason to do O_DIRECT may well not be that you want to

Re: O_DIRECT question

2007-01-12 Thread Erik Andersen
On Fri Jan 12, 2007 at 05:09:09PM -0500, Linus Torvalds wrote: > I suspect a lot of people actually have other reasons to avoid caches. > > For example, the reason to do O_DIRECT may well not be that you want to > avoid caching per se, but simply because you want to limit page cache > activity.

Re: O_DIRECT question

2007-01-12 Thread Michael Tokarev
Linus Torvalds wrote: > > On Sat, 13 Jan 2007, Michael Tokarev wrote: >>> At that point, O_DIRECT would be a way of saying "we're going to do >>> uncached accesses to this pre-allocated file". Which is a half-way >>> sensible thing to do. >> Half-way? > > I suspect a lot of people actually have

Re: O_DIRECT question

2007-01-12 Thread Linus Torvalds
On Sat, 13 Jan 2007, Michael Tokarev wrote: > > > > At that point, O_DIRECT would be a way of saying "we're going to do > > uncached accesses to this pre-allocated file". Which is a half-way > > sensible thing to do. > > Half-way? I suspect a lot of people actually have other reasons to avoi

Re: Disk Cache, Was: O_DIRECT question

2007-01-12 Thread Michael Tokarev
Zan Lynx wrote: > On Sat, 2007-01-13 at 00:03 +0300, Michael Tokarev wrote: > [snip] >> And sure thing, withOUT O_DIRECT, the whole system is almost dead under this >> load - because everything is thrown away from the cache, even caches of /bin >> /usr/bin etc... ;) (For that, fadvise() seems to h

Disk Cache, Was: O_DIRECT question

2007-01-12 Thread Zan Lynx
On Sat, 2007-01-13 at 00:03 +0300, Michael Tokarev wrote: [snip] > And sure thing, withOUT O_DIRECT, the whole system is almost dead under this > load - because everything is thrown away from the cache, even caches of /bin > /usr/bin etc... ;) (For that, fadvise() seems to help a bit, but not alot

Re: O_DIRECT question

2007-01-12 Thread Michael Tokarev
Linus Torvalds wrote: [] > My point is that you can get basically ALL THE SAME GOOD BEHAVIOUR without > having all the BAD behaviour that O_DIRECT adds. *This* point I got from the beginning, once I tried to think how it all is done internally (I never thought about that, because I'm not a kernel

Re: O_DIRECT question

2007-01-12 Thread Linus Torvalds
On Sat, 13 Jan 2007, Michael Tokarev wrote: > > (No, really - this load isn't entirely synthetic. It's a typical database > workload - random I/O all over, on a large file. If it can, it combines > several I/Os into one, by requesting more than a single block at a time, > but overall it is ran

Re: O_DIRECT question

2007-01-12 Thread Michael Tokarev
Michael Tokarev wrote: > Michael Tokarev wrote: > By the way. I just ran - for fun - a read test of a raid array. > > Reading blocks of size 512kbytes, starting at random places on a 400Gb > array, doing 64threads. > > O_DIRECT: 336.73 MB/sec. > !O_DIRECT: 146.00 MB/sec. And when turning off r

Re: O_DIRECT question

2007-01-12 Thread Michael Tokarev
Michael Tokarev wrote: [] > After all the explanations, I still don't see anything wrong with the > interface itself. O_DIRECT isn't "different semantics" - we're still > writing and reading some data. Yes, O_DIRECT and non-O_DIRECT usages > somewhat contradicts with each other, but there are oth

Re: O_DIRECT question

2007-01-12 Thread Michael Tokarev
Chris Mason wrote: [] > I recently spent some time trying to integrate O_DIRECT locking with > page cache locking. The basic theory is that instead of using > semaphores for solving O_DIRECT vs buffered races, you put something > into the radix tree (I call it a placeholder) to keep the page cache

Re: O_DIRECT question

2007-01-12 Thread Chris Mason
On Fri, Jan 12, 2007 at 10:06:22AM -0800, Linus Torvalds wrote: > > looking at the splice(2) api it seems like it'll be difficult to implement > > O_DIRECT pread/pwrite from userland using splice... so there'd need to be > > some help there. > > You'd use vmsplice() to put the write buffers int

Re: O_DIRECT question

2007-01-12 Thread Linus Torvalds
On Thu, 11 Jan 2007, dean gaudet wrote: > > it seems to me that if splice and fadvise and related things are > sufficient for userland to take care of things "properly" then O_DIRECT > could be changed into splice/fadvise calls either by a library or in the > kernel directly... The problem i

Re: O_DIRECT question

2007-01-12 Thread Viktor
Linus Torvalds wrote: O_DIRECT is still crazily racy versus pagecache operations. >>> >>>Yes. O_DIRECT is really fundamentally broken. There's just no way to fix >>>it sanely. >> >>How about aliasing O_DIRECT to POSIX_FADV_NOREUSE (sortof) ? > > > That is what I think some users could do. If

Re: O_DIRECT question

2007-01-12 Thread Viktor
Linus Torvalds wrote: >>OK, madvise() used with mmap'ed file allows to have reads from a file >>with zero-copy between kernel/user buffers and don't pollute cache >>memory unnecessarily. But how about writes? How is to do zero-copy >>writes to a file and don't pollute cache memory without using O_D

Re: O_DIRECT question

2007-01-12 Thread Phillip Susi
dean gaudet wrote: it seems to me that if splice and fadvise and related things are sufficient for userland to take care of things "properly" then O_DIRECT could be changed into splice/fadvise calls either by a library or in the kernel directly... No, because the semantics are entirely differ

Re: O_DIRECT question

2007-01-12 Thread Phillip Susi
Hua Zhong wrote: The other problem besides the inability to handle IO errors is that mmap()+msync() is synchronous. You need to go async to keep the pipelines full. msync(addr, len, MS_ASYNC); doesn't do what you want? No, because there is no notification of completion. In fact, does this

Re: O_DIRECT question

2007-01-12 Thread Bill Davidsen
Aubrey wrote: On 1/12/07, Nick Piggin <[EMAIL PROTECTED]> wrote: Linus Torvalds wrote: > > On Fri, 12 Jan 2007, Nick Piggin wrote: > >>We are talking about fragmentation. And limiting pagecache to try to >>avoid fragmentation is a bandaid, especially when the problem can be solved >>(no

Re: O_DIRECT question

2007-01-11 Thread dean gaudet
On Thu, 11 Jan 2007, Linus Torvalds wrote: > On Thu, 11 Jan 2007, Viktor wrote: > > > > OK, madvise() used with mmap'ed file allows to have reads from a file > > with zero-copy between kernel/user buffers and don't pollute cache > > memory unnecessarily. But how about writes? How is to do zero-co

Re: O_DIRECT question

2007-01-11 Thread Aubrey
On 1/12/07, Nick Piggin <[EMAIL PROTECTED]> wrote: Linus Torvalds wrote: > > On Fri, 12 Jan 2007, Nick Piggin wrote: > >>We are talking about fragmentation. And limiting pagecache to try to >>avoid fragmentation is a bandaid, especially when the problem can be solved >>(not just papered ove

Re: O_DIRECT question

2007-01-11 Thread Linus Torvalds
On Fri, 12 Jan 2007, Nick Piggin wrote: > > Yeah *smallish* higher order allocations are fine, and we use them all the > time for things like stacks or networking. > > But Aubrey (who somehow got removed from the cc list) wants to do order 9 > allocations from userspace in his nommu environment

Re: O_DIRECT question

2007-01-11 Thread Nick Piggin
Nick Piggin wrote: Linus Torvalds wrote: Very basic issue: the perfect is the enemy of the good. Claiming that there is a "proper solution" is usually a total red herring. Quite often there isn't, and the "paper over" is actually not papering over, it's quite possibly the best solution there

Re: O_DIRECT question

2007-01-11 Thread Nick Piggin
Linus Torvalds wrote: On Fri, 12 Jan 2007, Nick Piggin wrote: We are talking about fragmentation. And limiting pagecache to try to avoid fragmentation is a bandaid, especially when the problem can be solved (not just papered over, but solved) in userspace. It's not clear that the prob

Re: O_DIRECT question

2007-01-11 Thread Linus Torvalds
On Fri, 12 Jan 2007, Nick Piggin wrote: > > We are talking about fragmentation. And limiting pagecache to try to > avoid fragmentation is a bandaid, especially when the problem can be solved > (not just papered over, but solved) in userspace. It's not clear that the problem _can_ be solved

Re: O_DIRECT question

2007-01-11 Thread Nick Piggin
Bill Davidsen wrote: Nick Piggin wrote: Aubrey wrote: Exactly, and the *real* fix is to modify userspace not to make > PAGE_SIZE mallocs[*] if it is to be nommu friendly. It is the kernel hacks to do things like limit cache size that are the bandaids. Tuning the system to work appropriat

Re: O_DIRECT question

2007-01-11 Thread Roy Huang
Limiting total page cache can be considered first. Only if total page cache overrun limit, check whether the file overrun its per-file limit. If it is true, release partial page cache and wake up kswapd at the same time. On 1/12/07, Aubrey <[EMAIL PROTECTED]> wrote: On 1/11/07, Roy Huang <[EMAIL

Re: O_DIRECT question

2007-01-11 Thread Nick Piggin
Aubrey wrote: On 1/11/07, Roy Huang <[EMAIL PROTECTED]> wrote: On an embedded system, limiting page cache can relieve memory fragmentation. There is a patch against 2.6.19, which limits every opened file's page cache and the total pagecache. When the limit is reached, it will release the page cache that overruns

Re: O_DIRECT question

2007-01-11 Thread Bill Davidsen
Nick Piggin wrote: Aubrey wrote: On 1/11/07, Nick Piggin <[EMAIL PROTECTED]> wrote: What you _really_ want to do is avoid large mallocs after boot, or use a CPU with an mmu. I don't think nommu linux was ever intended to be a simple drop in replacement for a normal unix kernel. Is there a

Re: O_DIRECT question

2007-01-11 Thread Aubrey
On 1/11/07, Roy Huang <[EMAIL PROTECTED]> wrote: On an embedded system, limiting page cache can relieve memory fragmentation. There is a patch against 2.6.19, which limits every opened file's page cache and the total pagecache. When the limit is reached, it will release the page cache that overruns the limit. Th

Re: O_DIRECT question

2007-01-11 Thread Bill Davidsen
linux-os (Dick Johnson) wrote: On Wed, 10 Jan 2007, Aubrey wrote: Hi all, Opening file with O_DIRECT flag can do the un-buffered read/write access. So if I need un-buffered access, I have to change all of my applications to add this flag. What's more, Some scripts like "cp oldfile newfile" sti
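As background to the original question, the following is a hedged sketch of what "opening a file with O_DIRECT" involves from userspace: the buffer, length and file offset generally have to be aligned. The 4096-byte `BLOCK` value and the function names are assumptions for illustration; the real alignment requirement depends on the device and filesystem. Some filesystems reject O_DIRECT altogether, so the sketch falls back to a buffered write.

```python
import mmap
import os

BLOCK = 4096  # assumed alignment; the real value depends on device/filesystem


def _write_once(path: str, buf, extra_flags: int) -> None:
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | extra_flags
    fd = os.open(path, flags, 0o644)
    try:
        if os.write(fd, buf) != len(buf):
            raise OSError("short write")
    finally:
        os.close(fd)


def write_direct(path: str, data: bytes) -> bool:
    """Write data with O_DIRECT if possible; return True if it was used."""
    if not data or len(data) % BLOCK:
        raise ValueError("length must be a non-zero multiple of BLOCK")
    direct = getattr(os, "O_DIRECT", 0)  # not defined on every platform
    # An anonymous mapping gives page-aligned memory for the transfer.
    with mmap.mmap(-1, len(data)) as buf:
        buf[:] = data
        if direct:
            try:
                _write_once(path, buf, direct)
                return True
            except OSError:
                pass  # e.g. EINVAL: this filesystem refuses O_DIRECT
        _write_once(path, buf, 0)  # buffered fallback
        return False
```

The alignment bookkeeping above is one reason a plain `cp oldfile newfile` cannot simply be switched to unbuffered access from the outside.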

RE: O_DIRECT question

2007-01-11 Thread Hua Zhong
> The other problem besides the inability to handle IO errors is that > mmap()+msync() is synchronous. You need to go async to keep > the pipelines full. msync(addr, len, MS_ASYNC); doesn't do what you want? > Now if someone wants to implement an aio version of msync and > mlock, that might do

Re: O_DIRECT question

2007-01-11 Thread Phillip Susi
Michael Tokarev wrote: Linus Torvalds wrote: On Thu, 11 Jan 2007, Viktor wrote: OK, madvise() used with mmap'ed file allows to have reads from a file with zero-copy between kernel/user buffers and don't pollute cache memory unnecessarily. But how about writes? How is to do zero-copy writes to a

Re: O_DIRECT question

2007-01-11 Thread Trond Myklebust
On Thu, 2007-01-11 at 11:00 -0800, Linus Torvalds wrote: > > On Thu, 11 Jan 2007, Trond Myklebust wrote: > > > > For NFS, the main feature of interest when it comes to O_DIRECT is > > strictly uncached I/O. Replacing it with POSIX_FADV_NOREUSE won't help > > because it can't guarantee that the pa

Re: O_DIRECT question

2007-01-11 Thread Linus Torvalds
On Thu, 11 Jan 2007, Trond Myklebust wrote: > > For NFS, the main feature of interest when it comes to O_DIRECT is > strictly uncached I/O. Replacing it with POSIX_FADV_NOREUSE won't help > because it can't guarantee that the page will be thrown out of the page > cache before some second process

Re: O_DIRECT question

2007-01-11 Thread Trond Myklebust
On Thu, 2007-01-11 at 09:04 -0800, Linus Torvalds wrote: > That is what I think some users could do. If the main issue with O_DIRECT > is the page cache allocations, if we instead had better (read: "any") > support for POSIX_FADV_NOREUSE, one class of reasons O_DIRECT usage would > just go away.

Re: O_DIRECT question

2007-01-11 Thread Linus Torvalds
On Thu, 11 Jan 2007, Alan wrote: > > Well you can - its called SG_IO and that really does get the OS out of > the way. O_DIRECT gets crazy when you stop using it on devices directly > and use it on files Well, on a raw disk, O_DIRECT is fine too, but yeah, you might as well use SG_IO at that p

Re: O_DIRECT question

2007-01-11 Thread Alan
> space, just as an example) is wrong in the first place, but the really > subtle problems come when you realize that you can't really just "bypass" > the OS. Well you can - its called SG_IO and that really does get the OS out of the way. O_DIRECT gets crazy when you stop using it on devices dir

Re: O_DIRECT question

2007-01-11 Thread Michael Tokarev
Linus Torvalds wrote: > > On Thu, 11 Jan 2007, Viktor wrote: >> OK, madvise() used with mmap'ed file allows to have reads from a file >> with zero-copy between kernel/user buffers and don't pollute cache >> memory unnecessarily. But how about writes? How is to do zero-copy >> writes to a file and

Re: O_DIRECT question

2007-01-11 Thread Linus Torvalds
On Thu, 11 Jan 2007, Xavier Bestel wrote: > On Thursday 11 January 2007 at 07:50 -0800, Linus Torvalds wrote: > > > O_DIRECT is still crazily racy versus pagecache operations. > > > > Yes. O_DIRECT is really fundamentally broken. There's just no way to fix > > it sanely. > > How about aliasing

Re: O_DIRECT question

2007-01-11 Thread Xavier Bestel
On Thursday 11 January 2007 at 07:50 -0800, Linus Torvalds wrote: > > O_DIRECT is still crazily racy versus pagecache operations. > > Yes. O_DIRECT is really fundamentally broken. There's just no way to fix > it sanely. How about aliasing O_DIRECT to POSIX_FADV_NOREUSE (sortof) ? Xav
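To make the fadvise idea concrete, here is a rough sketch of buffered I/O that afterwards asks the kernel to drop the pages it cached. It uses `POSIX_FADV_DONTNEED` rather than the `POSIX_FADV_NOREUSE` named in the message, because the thread itself suggests NOREUSE had little or no kernel support at the time; that substitution, and the function name `copy_without_caching`, are assumptions of this sketch. The advice is a hint only, so failures are ignored.

```python
import os


def copy_without_caching(src: str, dst: str, chunk: int = 1 << 20) -> int:
    """Copy src to dst with ordinary buffered I/O, then hint the kernel
    to drop the cached pages of both files. Returns bytes copied."""
    total = 0
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            block = fin.read(chunk)
            if not block:
                break
            fout.write(block)
            total += len(block)
        fout.flush()
        # Write back first: dirty pages are generally not dropped by DONTNEED.
        os.fsync(fout.fileno())
        if hasattr(os, "posix_fadvise"):
            for f in (fin, fout):
                try:
                    # offset 0, length 0 means "the whole file"
                    os.posix_fadvise(f.fileno(), 0, 0, os.POSIX_FADV_DONTNEED)
                except OSError:
                    pass  # purely advisory; ignore if unsupported
    return total
```

Unlike O_DIRECT, this still copies through the page cache, so it addresses cache pollution but not the copy overhead debated elsewhere in the thread.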

Re: O_DIRECT question

2007-01-11 Thread Linus Torvalds
On Thu, 11 Jan 2007, Roy Huang wrote: > > On an embedded system, limiting page cache can relieve memory > fragmentation. There is a patch against 2.6.19, which limits every > opened file's page cache and the total pagecache. When the limit is reached, it > will release the page cache that overruns the limit. I do
