Re: O_DIRECT question
Hi!

> > > Which shouldn't be true. There is no fundamental reason why
> > > ordinary writes should be slower than O_DIRECT.
> >
> > Again, there IS a reason: O_DIRECT eliminates the cpu overhead of the
> > kernel-user copy,
>
> You assume that ordinary read()/write() is *required* to do the copying.
> It doesn't. Kernel is allowed to do direct DMAing in this case too.

The kernel is allowed to, but it is practically impossible to code. It
would require slow MMU magic.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: O_DIRECT question
Phillip Susi wrote:
[]
> You seem to have missed the point of this thread. Denis Vlasenko's
> message that you replied to simply pointed out that they are
> semantically equivalent, so O_DIRECT can be dropped provided that O_SYNC
> + madvise could be fixed to perform as well. Several people including
> Linus seem to like this idea and think it is quite possible.

By the way, IF O_SYNC+madvise could be "fixed", couldn't O_DIRECT be
implemented internally using them? I mean, during open(O_DIRECT), do
open(O_SYNC) instead and call madvise() appropriately.

/mjt
Re: O_DIRECT question
On Tue, Jan 30, 2007 at 06:07:14PM -0500, Phillip Susi wrote:
> It most certainly matters where the error happened because "you are
> screwed" is not an acceptable outcome in a mission critical application.

An I/O error is not an acceptable outcome in a mission critical app
either: all mission critical setups should be fault tolerant, so if the
raid cannot recover at the first sign of error, the whole system should
instantly go down and let the secondary take over from it. See slony etc.

Trying to recover the recoverable by mucking about with the data, making
even _more_ writes on a failing disk before taking a physical mirror
image of the disk (the readable part), isn't a good idea IMHO. At best
you could retry writing to the same sector, hoping somebody disconnected
the scsi cable by mistake.

> A well engineered solution will deal with errors as best as possible,
> not simply give up and tell the user they are screwed because the
> designer was lazy. There is a reason that read and write return the
> number of bytes _actually_ transferred, and the application is supposed
> to check that result to verify proper operation.

You can track the range where it happened with fsync too, like I said in
a previous email, and you can take the big database lock and then
read-write every single block in that range until you find the failing
place, if you really want to. A read-write in place should be safe.

> No, there is a slight difference. An fsync() flushes all dirty buffers
> in an undefined order. Using O_DIRECT or O_SYNC, you can control the
> flush order because you can simply wait for one set of writes to
> complete before starting another set that must not be written until
> after the first are on the disk. You can emulate that by placing an
> fsync between both sets of writes, but that will flush any other dirty

Doing an fsync after every write will provide the same ordering
guarantee as O_SYNC; I thought it was obvious that this is what I meant
here. The whole point is that most of the time you don't need it: you
need an fsync after a couple of writes. All smtp servers use fsync for
the same reason; they also have to journal their writes to avoid losing
email when there is a power loss. If you use writev or aio pwrite you
can do well with O_SYNC too, though.

> buffers whose ordering you do not care about. Also there is no aio
> version of fsync.

Please have a second look at aio_abi.h:

	IOCB_CMD_FSYNC = 2,
	IOCB_CMD_FDSYNC = 3,

There must be a reason why they exist, right?

> sync has no effect on reading, so that test is pointless. direct saves
> the cpu overhead of the buffer copy, but isn't good if the cache isn't
> entirely cold. The large buffer size really has little to do with it,

direct bypasses the cache, so the cache is freezing, not just cold.

> rather it is the fact that the writes to null do not block dd from
> making the next read for any length of time. If dd were blocking on an
> actual output device, that would leave the input device idle for the
> portion of the time that dd were blocked.

The objective was to measure the pipeline stall; if you stall it for
other reasons anyway, what's the point?

> In any case, this is a totally different example than your previous one
> which had dd _writing_ to a disk, where it would block for long periods
> of time due to O_SYNC, thereby preventing it from reading from the input
> buffer in a timely manner. By not reading the input pipe frequently, it
> becomes full and thus, tar blocks. In that case the large buffer size
> is actually a detriment because with a smaller buffer size, dd would not
> be blocked as long and so it could empty the pipe more frequently
> allowing tar to block less.

It would run slower with a smaller buffer size, because it would block
too and it would read and write slower too. For my backup usage,
keeping tar blocked is actually a feature, so the load of the backup
decreases. To me what's important is the MB/sec of the writes and the
MB/sec of the reads (to lower the load); I don't care too much about how
long it takes, as long as things run as efficiently as possible while
they run. The rate limiting effect of the blocking isn't a problem for
me.

> You seem to have missed the point of this thread. Denis Vlasenko's
> message that you replied to simply pointed out that they are
> semantically equivalent, so O_DIRECT can be dropped provided that O_SYNC
> + madvise could be fixed to perform as well. Several people including
> Linus seem to like this idea and think it is quite possible.

I answered that email to point out the fundamental differences between
O_SYNC and O_DIRECT. If you don't like what I said I'm sorry, but that's
how things are running today, and I don't see it as quite possible to
change (unless of course we remove performance from the equation; then
indeed they'll be much the same). Perhaps a IOCB_CMD_PREADAHEAD plus
MAP_SHARED backed by largepages loaded with a new syscall that reads a
piece at ti
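[Editor's note: the two aio fsync opcodes Andrea points at do exist in the Linux uapi header with exactly the values quoted. A tiny Linux-only check (assumes kernel headers are installed); the function name is made up:]

```c
/* Confirm that include/linux/aio_abi.h really declares aio fsync opcodes,
 * i.e. that an aio version of fsync exists at the kernel ABI level. */
#include <linux/aio_abi.h>
#include <assert.h>

int aio_fsync_opcodes_present(void)
{
    /* IOCB_CMD_* is an enum in the uapi header */
    return IOCB_CMD_FSYNC == 2 && IOCB_CMD_FDSYNC == 3;
}
```

Whether glibc exposes a convenient wrapper is a separate question; these opcodes are submitted via io_submit(2) on a native aio context.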
Re: O_DIRECT question
Andrea Arcangeli wrote:
> When you have I/O errors during _writes_ (not reads!!) the raid must
> kick the disk out of the array before the OS ever notices. And if it's
> software raid that you're using, the OS should kick out the disk before
> your app ever notices any I/O error. When the write I/O error happens,
> it's not a problem for the application to solve.

I thought it obvious that we were talking about non recoverable errors
that then DO make it to the application. And any kind of mission
critical app most definitely does care about write errors. You don't
need your db completing the transaction when it was only half recorded.
It needs to know it failed so it can back out and/or recover the data
and record it elsewhere. You certainly don't want the users to think
everything is fine, walk away, and have the system continue to limp on,
making things worse by the second.

> When the I/O error reaches the filesystem, you're lucky if the OS won't
> crash (ext3 claims to handle it); if your app receives the I/O error,
> all you should be doing is to shut things down gracefully, sending all
> the errors you can to the admin.

If the OS crashes due to an IO error reading user data, then there is
something seriously wrong and beyond the scope of this discussion. It
suffices to say that due to the semantics of write() and sound
engineering practice, the application expects to be notified of errors
so it can try to recover, or fail gracefully. Whether it chooses to fail
gracefully as you say it should, or recovers from the error, it needs to
know that an error happened, and where it was.

> It doesn't matter much where the error happened; all that matters is
> that you didn't have a fault tolerant raid setup (your fault) and your
> primary disk just died and you're now screwed(tm). If you could trust
> that part of the disk is still sane you could perhaps attempt to avoid
> a restore from the last backup; otherwise all you can do is the
> equivalent of an e2fsck -f on the db metadata after copying what you
> can still read to the new device.

It most certainly matters where the error happened because "you are
screwed" is not an acceptable outcome in a mission critical application.
A well engineered solution will deal with errors as best as possible,
not simply give up and tell the user they are screwed because the
designer was lazy. There is a reason that read and write return the
number of bytes _actually_ transferred, and the application is supposed
to check that result to verify proper operation.

> Sorry, but as far as ordering is concerned, O_DIRECT, fsync and O_SYNC
> offer exactly the same guarantees. Feel free to check the real life db
> code. Even bdb uses fsync.

No, there is a slight difference. An fsync() flushes all dirty buffers
in an undefined order. Using O_DIRECT or O_SYNC, you can control the
flush order because you can simply wait for one set of writes to
complete before starting another set that must not be written until
after the first are on the disk. You can emulate that by placing an
fsync between both sets of writes, but that will flush any other dirty
buffers whose ordering you do not care about. Also there is no aio
version of fsync.

> Please try yourself, it's simple enough:
>
>	time dd if=/dev/hda of=/dev/null bs=16M count=100
>	time dd if=/dev/hda of=/dev/null bs=16M count=100 iflag=sync
>	time dd if=/dev/hda of=/dev/null bs=16M count=100 iflag=direct
>
> if you can measure any slowdown in the sync/direct you're welcome (it
> runs faster here... as it should). The pipeline stall is not measurable
> when it's so infrequent, and actually the pipeline stall is not a big
> issue when the I/O is contiguous and the dma commands are always large.
> aio is mandatory only while dealing with small buffers, especially
> while seeking, to take advantage of the elevator.

sync has no effect on reading, so that test is pointless. direct saves
the cpu overhead of the buffer copy, but isn't good if the cache isn't
entirely cold. The large buffer size really has little to do with it;
rather it is the fact that the writes to null do not block dd from
making the next read for any length of time. If dd were blocking on an
actual output device, that would leave the input device idle for the
portion of the time that dd were blocked.

In any case, this is a totally different example than your previous one,
which had dd _writing_ to a disk, where it would block for long periods
of time due to O_SYNC, thereby preventing it from reading from the input
buffer in a timely manner. By not reading the input pipe frequently, it
becomes full and thus, tar blocks. In that case the large buffer size is
actually a detriment because with a smaller buffer size, dd would not be
blocked as long and so it could empty the pipe more frequently, allowing
tar to block less.

This whole thing is about performance; if you remove performance factors
from the equation, you can stick to your O_SYNC 512
Re: O_DIRECT question
On Tue, Jan 30, 2007 at 08:57:20PM +0100, Andrea Arcangeli wrote:
> Please try yourself, it's simple enough:
>
>	time dd if=/dev/hda of=/dev/null bs=16M count=100
>	time dd if=/dev/hda of=/dev/null bs=16M count=100 iflag=sync

Sorry, reading won't help much to exercise sync ;). But the direct line
is enough to show the effect of an I/O pipeline stall. To effectively
test sync you of course want to write to a file instead (unless you want
to wipe out /dev/hda ;)
Re: O_DIRECT question
On Tue, Jan 30, 2007 at 01:50:41PM -0500, Phillip Susi wrote:
> It should return the number of bytes successfully written before the
> error, giving you the location of the first error. Also using smaller
> individual writes ( preferably issued in parallel ) allows the problem
> spot to be isolated.

When you have I/O errors during _writes_ (not reads!!) the raid must
kick the disk out of the array before the OS ever notices. And if it's
software raid that you're using, the OS should kick out the disk before
your app ever notices any I/O error. When the write I/O error happens,
it's not a problem for the application to solve.

When the I/O error reaches the filesystem, you're lucky if the OS won't
crash (ext3 claims to handle it); if your app receives the I/O error,
all you should be doing is to shut things down gracefully, sending all
the errors you can to the admin.

It doesn't matter much where the error happened; all that matters is
that you didn't have a fault tolerant raid setup (your fault) and your
primary disk just died and you're now screwed(tm). If you could trust
that part of the disk is still sane you could perhaps attempt to avoid a
restore from the last backup; otherwise all you can do is the equivalent
of an e2fsck -f on the db metadata after copying what you can still read
to the new device.

The only time I got an I/O error on writes, about 1G of the disk was
gone; not very useful to know the first 512byte region that failed...
unreadable and unwriteable. Every other time, writing to the disk
actually solved the read I/O error (they weren't write I/O errors of
course).

Now if you're careful enough you can track down which data generated the
I/O error by queuing the blocks that you write in between every fsync.
So you can still know if perhaps only the journal has generated write
I/O errors; in such a case you could tell the user that he can copy the
data files and let the journal be regenerated on the new disk. I doubt
it will help much in practice though (in such a case, I would always
restore the last backup just in case).

> >> Typically you only want one sector of data to be written before you
> >> continue. In the cases where you don't, this might be nice, but as I
> >> said above, you can't handle errors properly.
> >
> > Sorry, but you're dreaming if you're thinking anything in real life
> > writes at 512 bytes at a time with O_SYNC. Try that with any modern
> > harddisk.
>
> When you are writing a transaction log, you do; you don't need much
> data, but you do need to be sure it has hit the disk before continuing.
> You certainly aren't writing many mb across a dozen write() calls and
> only then care to make sure it is all flushed in an unknown order. When
> order matters, you can not use fsync, which is one of the reasons why
> databases use O_DIRECT; they care about the ordering.

Sorry, but as far as ordering is concerned, O_DIRECT, fsync and O_SYNC
offer exactly the same guarantees. Feel free to check the real life db
code. Even bdb uses fsync.

> >>> Just grep for fsync in the db code of your choice (try postgresql)
> >>> and then explain to me why they ever call fsync in their code, if
> >>> you know how to do better with O_SYNC ;).
> >> Doesn't sound like a very good idea to me.
> >
> > Why not a good idea to check any real life app?
>
> I meant it is not a good idea to use fsync as you can't properly handle
> errors. See above.

> >> The stalling is caused by cache pollution. Since you did not specify
> >> a block size dd uses the base block size of the output disk. When
> >> combined with sync, only one block is written at a time, and no more
> >> until the first block has been flushed. Only then does dd send down
> >> another block to write. Without dd the kernel is likely allowing
> >> many mb to be queued in the buffer cache. Limiting output to one
> >> block at a time is not good for throughput, but allowing half of ram
> >> to be used by dirty pages is not good either.
> >
> > Throughput is perfect. I forgot to tell I combine it with ibs=4k
> > obs=16M. Like it would be perfect with odirect too for the same
> > reason. Stalling the I/O pipeline once every 16M isn't measurable in
>
> Throughput is nowhere near perfect, as the pipeline is stalled for
> quite some time. The pipe fills up quickly while dd is blocked on the
> sync write, which then blocks tar until all 16 MB have hit the disk.
> Only then does dd go back to reading from the tar pipe, allowing it to
> continue. During the time it takes tar to archive another 16 MB of
> data, the write queue is empty. The only time that the tar process
> gets to continue running while data is written to disk is in the small
> time it takes for the pipe ( 4 KB isn't it? ) to fill up.

Please try yourself, it's simple enough:

	time dd if=/dev/hda of=/dev/null bs=16M count=100
	time dd if=/dev/hda of=/dev/null bs=16M count=100 iflag=sync
	time dd if=/dev/hda of=/dev/null bs=16M count=100 iflag=dire
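[Editor's note: Phillip's "smaller individual writes" idea from the quoted text can be sketched as a plain chunked-write loop that pins a failure to a file offset, which is exactly the information the rest of the thread argues a recovery path needs. Hypothetical helper; names are made up:]

```c
/* Issue one large logical write as smaller sequential writes, so that on
 * failure the caller learns the offset of the first byte that did not make
 * it.  write() may also return short without error, so the loop resumes
 * rather than treating a short write as a failure. */
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>

/* Returns 0 on success; on error returns -1 and sets *bad_off to the file
 * offset (relative to the start of buf) of the first unwritten byte. */
int write_chunked(int fd, const char *buf, size_t len, size_t chunk,
                  off_t *bad_off)
{
    size_t done = 0;
    while (done < len) {
        size_t n = (len - done < chunk) ? len - done : chunk;
        ssize_t w = write(fd, buf + done, n);
        if (w < 0) {
            *bad_off = (off_t)done;   /* first failing offset */
            return -1;
        }
        done += (size_t)w;
    }
    return 0;
}
```

Opening the fd with O_SYNC makes each chunk a durability point as well, at the throughput cost discussed above.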
Re: O_DIRECT question
Andrea Arcangeli wrote:
> On Tue, Jan 30, 2007 at 10:36:03AM -0500, Phillip Susi wrote:
> > Did you intentionally drop this reply off list?
>
> No.

Then I'll restore the lkml to the cc list.

> > No, it doesn't... or at least can't report WHERE the error is.
>
> O_SYNC doesn't report where the error is either, try a write(fd, buf,
> 10*1024*1024).

It should return the number of bytes successfully written before the
error, giving you the location of the first error. Also using smaller
individual writes ( preferably issued in parallel ) allows the problem
spot to be isolated.

> > Typically you only want one sector of data to be written before you
> > continue. In the cases where you don't, this might be nice, but as I
> > said above, you can't handle errors properly.
>
> Sorry, but you're dreaming if you're thinking anything in real life
> writes at 512 bytes at a time with O_SYNC. Try that with any modern
> harddisk.

When you are writing a transaction log, you do; you don't need much
data, but you do need to be sure it has hit the disk before continuing.
You certainly aren't writing many mb across a dozen write() calls and
only then care to make sure it is all flushed in an unknown order. When
order matters, you can not use fsync, which is one of the reasons why
databases use O_DIRECT; they care about the ordering.

> > > Just grep for fsync in the db code of your choice (try postgresql)
> > > and then explain to me why they ever call fsync in their code, if
> > > you know how to do better with O_SYNC ;).
> > Doesn't sound like a very good idea to me.
>
> Why not a good idea to check any real life app?

I meant it is not a good idea to use fsync as you can't properly handle
errors.

> > The stalling is caused by cache pollution. Since you did not specify
> > a block size dd uses the base block size of the output disk. When
> > combined with sync, only one block is written at a time, and no more
> > until the first block has been flushed. Only then does dd send down
> > another block to write. Without dd the kernel is likely allowing many
> > mb to be queued in the buffer cache. Limiting output to one block at
> > a time is not good for throughput, but allowing half of ram to be
> > used by dirty pages is not good either.
>
> Throughput is perfect. I forgot to tell I combine it with ibs=4k
> obs=16M. Like it would be perfect with odirect too for the same
> reason. Stalling the I/O pipeline once every 16M isn't measurable in

Throughput is nowhere near perfect, as the pipeline is stalled for quite
some time. The pipe fills up quickly while dd is blocked on the sync
write, which then blocks tar until all 16 MB have hit the disk. Only
then does dd go back to reading from the tar pipe, allowing it to
continue. During the time it takes tar to archive another 16 MB of data,
the write queue is empty. The only time that the tar process gets to
continue running while data is written to disk is in the small time it
takes for the pipe ( 4 KB isn't it? ) to fill up.

> > The semantics of the two are very much the same; they only differ in
> > the internal implementation. As far as the caller is concerned, in
> > both cases he is sure that writes are safely on the disk when they
> > return, and reads semantically are no different with either flag.
> > The internal implementations lead to different performance
> > characteristics, and the other post was simply commenting that the
> > performance characteristics of O_SYNC + madvise() are almost the
> > same as O_DIRECT, or even better in some cases ( since the data read
> > may already be in cache ).
>
> The semantics mandate the implementation, because the semantics make up
> the performance expectations. For the same reason you shouldn't write
> 512 bytes at a time with O_SYNC, you also shouldn't use O_SYNC if your
> device risks creating a bottleneck in the CPU and memory.

No, semantics have nothing to do with performance. Semantics deal with
the state of the machine after the call, not how quickly it got there.
Semantics is a question of correct operation, not optimal. With both
O_DIRECT and O_SYNC, the machine state is essentially the same after the
call: the data has hit the disk. Aside from the performance difference,
the application can not tell the difference between O_DIRECT and O_SYNC,
so if that performance difference can be resolved by changing the
implementation, Linus can be happy and get rid of O_DIRECT.
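[Editor's note: the fsync-as-ordering-barrier emulation both posters describe can be made concrete. A minimal sketch under the assumption that both descriptors refer to plain buffered files; the function and the journal/data roles are illustrative only:]

```c
/* Ordering barrier via fsync: the journal record must be durable before
 * the data write begins.  With buffered I/O this is the fsync between the
 * two write sets; the thread's caveat applies -- fsync also flushes other
 * dirty buffers on that file whose ordering you do not care about. */
#include <unistd.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>

int journal_then_data(int journal_fd, int data_fd,
                      const char *rec, size_t rlen,
                      const char *data, size_t dlen)
{
    if (write(journal_fd, rec, rlen) != (ssize_t)rlen)
        return -1;
    if (fsync(journal_fd) != 0)   /* barrier: journal is on disk first */
        return -1;
    if (write(data_fd, data, dlen) != (ssize_t)dlen)
        return -1;
    return fsync(data_fd);        /* then make the data itself durable */
}
```

With O_SYNC or O_DIRECT the explicit fsync calls disappear because each write() only returns once its data is on disk, which is the ordering property being debated.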
Re: O_DIRECT question
On Monday 29 January 2007 18:00, Andrea Arcangeli wrote:
> On Sun, Jan 28, 2007 at 06:03:08PM +0100, Denis Vlasenko wrote:
> > I still don't see much difference between O_SYNC and O_DIRECT write
> > semantics.
>
> O_DIRECT is about avoiding the copy_user between cache and userland,
> when working with devices that run faster than ram (think >=100M/sec,
> quite standard hardware unless you've only a desktop or you cannot
> afford raid).

Yes, I know that, but O_DIRECT is also "overloaded" with O_SYNC-like
semantics ("write doesn't return until data hits physical media"). To
have two orthogonal things "mixed together" in one flag feels "not
Unixy" to me. So I am trying to formulate a saner semantic. So far I
think that this looks good:

O_SYNC - usual meaning

O_STREAM - do not try hard to cache me. This includes "if you can
(buffer is sufficiently aligned, yadda, yadda), do not copy_user into
pagecache but just DMA from userspace pages" - exactly because the user
told us that he is not interested in caching!

Then O_DIRECT is approximately = O_SYNC + O_STREAM, and I think maybe
Linus will not hate this "new" O_DIRECT - it doesn't bypass pagecache.

> O_SYNC is about working around buggy or underperforming VM growing the
> dirty levels beyond optimal levels, or to open logfiles that you want
> to save to disk ASAP (most other journaling usages are better done
> with fsync instead).

I've got a feeling that db people use O_DIRECT (its O_SYNCy behaviour)
as a poor man's write barrier, when they must be sure that their redo
logs have hit storage before they start to modify datafiles. Another
reason why they want sync writes is write error detection. They cannot
afford delaying it.
--
vda
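[Editor's note: for contrast with the O_SYNC sketches, this is roughly what O_DIRECT demands of the caller today, i.e. the "peculiar requirements" the thread keeps returning to: buffer address, length and file offset must all be aligned. The 4096-byte alignment is an assumption (the real requirement is the device's logical block size, historically 512), and some filesystems (e.g. tmpfs) reject O_DIRECT entirely. Linux-only sketch with made-up names:]

```c
/* Minimal O_DIRECT write: one aligned block from an aligned buffer. */
#define _GNU_SOURCE               /* O_DIRECT is a Linux extension */
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>

ssize_t direct_write_block(const char *path, const char *src, size_t len)
{
    enum { ALIGN = 4096 };        /* assumed logical block size */
    void *buf;
    if (posix_memalign(&buf, ALIGN, ALIGN) != 0)  /* aligned buffer */
        return -1;
    memset(buf, 0, ALIGN);
    memcpy(buf, src, len < ALIGN ? len : ALIGN);
    int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) {
        free(buf);
        return -1;
    }
    ssize_t n = write(fd, buf, ALIGN);   /* aligned length: one block */
    close(fd);
    free(buf);
    return n;
}
```

The padding-to-a-block and the posix_memalign() call are exactly the burden Denis's O_STREAM proposal would turn from a hard requirement into a best-effort hint.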
Re: O_DIRECT question
On Sun, Jan 28, 2007 at 06:03:08PM +0100, Denis Vlasenko wrote:
> I still don't see much difference between O_SYNC and O_DIRECT write
> semantics.

O_DIRECT is about avoiding the copy_user between cache and userland,
when working with devices that run faster than ram (think >=100M/sec,
quite standard hardware unless you've only a desktop or you cannot
afford raid).

O_SYNC is about working around a buggy or underperforming VM growing the
dirty levels beyond optimal levels, or to open logfiles that you want to
save to disk ASAP (most other journaling usages are better done with
fsync instead). Or you can mount the fs in sync mode when you deal with
users not capable of unmounting devices before unplugging them.

Ideally you should never need O_SYNC; when you need O_SYNC it's usually
a very bad sign. If you need O_DIRECT it's not a bad sign (needing
O_DIRECT is mostly a sign you've got very fast storage).

The only case where I ever used O_SYNC myself is during backups (when
run on standard or mainline kernels that dirty half of ram during the
backup). For the logfiles I don't find it very useful; if anything I log
them remotely (when the system crashes the logs usually won't hit the
disk anyway, so it's just slower).

I use "tar | dd oflag=sync" and that generates a huge speedup for the
rest of the system (not necessarily for the backup itself). Yes, I could
even use oflag=direct, but I'm fine passing through the cache (the
backup device runs at 10M/sec through USB, so the copy_user is _sure_
worth it; if anything it will help, and it will never be a measurable
slowdown). What is not fine is to see half of the ram dirty the whole
time... (hence the need for O_SYNC).

O_SYNC and O_DIRECT are useful for different scenarios.
Re: O_DIRECT question
Denis Vlasenko wrote:
> I still don't see much difference between O_SYNC and O_DIRECT write
> semantics.

Yes, if you change the normal io paths to properly support playing
vmsplice games ( which have a number of corner cases ) to get the zero
copy, and support madvise() and O_SYNC to control caching behavior, and
fix all the error handling corner cases, then you may be able to do away
with O_DIRECT. I believe that doing all that will be much more complex
than O_DIRECT however.
Re: O_DIRECT question
On Sunday 28 January 2007 16:30, Bill Davidsen wrote:
> Denis Vlasenko wrote:
> > On Saturday 27 January 2007 15:01, Bodo Eggert wrote:
> > > Denis Vlasenko <[EMAIL PROTECTED]> wrote:
> > > > On Friday 26 January 2007 19:23, Bill Davidsen wrote:
> > > > > Denis Vlasenko wrote:
> > > > > > On Thursday 25 January 2007 21:45, Michael Tokarev wrote:
> > > > > > > But even single-threaded I/O but in large quantities
> > > > > > > benefits from O_DIRECT significantly, and I pointed this
> > > > > > > out before.
> > > > > > Which shouldn't be true. There is no fundamental reason why
> > > > > > ordinary writes should be slower than O_DIRECT.
> > > > > Other than the copy to buffer taking CPU and memory resources.
> > > > It is not required by any standard that I know. Kernel can be
> > > > smarter and avoid that if it can.
> > > The kernel can also solve the halting problem if it can.
> > >
> > > Do you really think an entropy estimation code on all access
> > > patterns in the system will be free as in beer,
> >
> > Actually I think we need this heuristic:
> >
> > if (opened_with_O_STREAM && buffer_is_aligned
> >     && io_size_is_a_multiple_of_sectorsize)
> >         do_IO_directly_to_user_buffer_without_memcpy
> >
> > is not *that* complicated.
> >
> > I think that we can get rid of O_DIRECT's peculiar requirements
> > "you *must* not cache me" + "you *must* write me directly to bare
> > metal" by replacing it with O_STREAM ("*advice* to not cache me") +
> > O_SYNC ("write() should return only when data is written to storage,
> > not sooner"). Why? Because these O_DIRECT "musts" are rather unusual
> > and overkill. Apps should not have that much control over what the
> > kernel does internally; and also O_DIRECT was mixing shampoo and
> > conditioner in one bottle (no-cache and sync writes) - bad API.
>
> What a shame that other operating systems can manage to really support
> O_DIRECT, and that major application software can use this api to
> write portable code that works even on Windows.
>
> You overlooked the problem that applications using this api assume
> that reads are on bare metal as well. How do you address the case
> where thread A does a write, thread B does a read? If you give thread
> B data from a buffer and it then does a write to another file (which
> completes before the write from thread A), and then the system
> crashes, you have just put the files out of sync.

Applications which synchronize their data integrity by keeping data on
the hard drive and relying on "read goes to bare metal, so it can't see
written data before it gets written to bare metal"? Wow, this is slow.
Are you talking about this scenario:

Bad:
    fd = open(..., O_SYNC);
    fork()
    write(fd, buf);  [1]
                          read(fd, buf2);  [starts after write 1 started]
                          write(somewhere_else, buf2);
                          (write returns)
    < crash point
    (write returns)

This will be *very* slow - if you use O_DIRECT and do what is depicted
above, you write data, then you read it back, which is slow. Why do you
want that? Isn't it much faster to just wait for the write to complete,
and allow the read to fetch (potentially) cached data?

Better:
    fd = open(..., O_SYNC);
    fork()
    write(fd, buf);  [1]
                          (wait for write to finish)
    < crash point
    (write returns)
                          read(fd, buf2);  [starts after write 1 finished]
                          write(somewhere_else, buf2);
                          (write returns)

> So you may have to block all i/o for all threads of the application to
> be sure that doesn't happen.

Not all, only related i/o.
--
vda
Re: O_DIRECT question
On Sunday 28 January 2007 16:18, Bill Davidsen wrote:
> Denis Vlasenko wrote:
> > On Friday 26 January 2007 19:23, Bill Davidsen wrote:
> > > Denis Vlasenko wrote:
> > > > On Thursday 25 January 2007 21:45, Michael Tokarev wrote:
> > > > > Phillip Susi wrote:
> > > > > [...]
> > > > > But even single-threaded I/O but in large quantities benefits
> > > > > from O_DIRECT significantly, and I pointed this out before.
> > > > Which shouldn't be true. There is no fundamental reason why
> > > > ordinary writes should be slower than O_DIRECT.
> > > Other than the copy to buffer taking CPU and memory resources.
> > It is not required by any standard that I know. Kernel can be
> > smarter and avoid that if it can.
>
> Actually, no, the whole idea of the page cache is that overall system
> i/o can be faster if data sits in the page cache for a while. But the
> real problem is that the application write is now disconnected from
> the physical write, both in time and order.

Not in the O_SYNC case.

> No standard says the kernel couldn't do direct DMA, but since having
> that required is needed to guarantee write order and error status
> linked to the actual application i/o, what a kernel "might do" is
> irrelevant.
>
> It's much easier to do O_DIRECT by actually doing the direct i/o than
> to try to catch all the corner cases which arise in faking it.

I still don't see much difference between O_SYNC and O_DIRECT write
semantics.
--
vda
Re: O_DIRECT question
Denis Vlasenko wrote:
> On Saturday 27 January 2007 15:01, Bodo Eggert wrote:
> > Denis Vlasenko <[EMAIL PROTECTED]> wrote:
> > > On Friday 26 January 2007 19:23, Bill Davidsen wrote:
> > > > Denis Vlasenko wrote:
> > > > > On Thursday 25 January 2007 21:45, Michael Tokarev wrote:
> > > > > > But even single-threaded I/O but in large quantities benefits
> > > > > > from O_DIRECT significantly, and I pointed this out before.
> > > > > Which shouldn't be true. There is no fundamental reason why
> > > > > ordinary writes should be slower than O_DIRECT.
> > > > Other than the copy to buffer taking CPU and memory resources.
> > > It is not required by any standard that I know. Kernel can be
> > > smarter and avoid that if it can.
> > The kernel can also solve the halting problem if it can.
> >
> > Do you really think an entropy estimation code on all access patterns
> > in the system will be free as in beer,
>
> Actually I think we need this heuristic:
>
> if (opened_with_O_STREAM && buffer_is_aligned
>     && io_size_is_a_multiple_of_sectorsize)
>         do_IO_directly_to_user_buffer_without_memcpy
>
> is not *that* complicated.
>
> I think that we can get rid of O_DIRECT's peculiar requirements
> "you *must* not cache me" + "you *must* write me directly to bare
> metal" by replacing it with O_STREAM ("*advice* to not cache me") +
> O_SYNC ("write() should return only when data is written to storage,
> not sooner"). Why? Because these O_DIRECT "musts" are rather unusual
> and overkill. Apps should not have that much control over what the
> kernel does internally; and also O_DIRECT was mixing shampoo and
> conditioner in one bottle (no-cache and sync writes) - bad API.

What a shame that other operating systems can manage to really support
O_DIRECT, and that major application software can use this api to write
portable code that works even on Windows.

You overlooked the problem that applications using this api assume that
reads are on bare metal as well. How do you address the case where
thread A does a write and thread B does a read? If you give thread B
data from a buffer and it then does a write to another file (which
completes before the write from thread A), and then the system crashes,
you have just put the files out of sync. So you may have to block all
i/o for all threads of the application to be sure that doesn't happen.
Or introduce some complex way to assure that all writes are physically
done in order... that sounds like a lock infested mess to me, assuming
that you could ever do it right.

Oracle has their own version of Linux now; do you think that they would
fork the application or the kernel?

--
Bill Davidsen <[EMAIL PROTECTED]>
"We have more to fear from the bungling of the incompetent than from
the machinations of the wicked." - from Slashdot
Re: O_DIRECT question
Denis Vlasenko wrote:
> On Friday 26 January 2007 19:23, Bill Davidsen wrote:
>> Denis Vlasenko wrote:
>>> On Thursday 25 January 2007 21:45, Michael Tokarev wrote:
>>>> Phillip Susi wrote:
>>>> [...]
>>>> But even single-threaded I/O but in large quantities benefits from
>>>> O_DIRECT significantly, and I pointed this out before.
>>> Which shouldn't be true. There is no fundamental reason why
>>> ordinary writes should be slower than O_DIRECT.
>> Other than the copy to buffer taking CPU and memory resources.
> It is not required by any standard that I know. Kernel can be smarter
> and avoid that if it can.

Actually, no, the whole idea of page cache is that overall system i/o
can be faster if data sit in the page cache for a while. But the real
problem is that the application write is now disconnected from the
physical write, both in time and order.

No standard says the kernel couldn't do direct DMA, but since having
that required is needed to guarantee write order and error status linked
to the actual application i/o, what a kernel "might do" is irrelevant.

It's much easier to do O_DIRECT by actually doing the direct i/o than to
try to catch all the corner cases which arise in faking it.

--
Bill Davidsen <[EMAIL PROTECTED]>
"We have more to fear from the bungling of the incompetent than from
the machinations of the wicked." - from Slashdot
Re: O_DIRECT question
On Saturday 27 January 2007 15:01, Bodo Eggert wrote:
> Denis Vlasenko <[EMAIL PROTECTED]> wrote:
> > On Friday 26 January 2007 19:23, Bill Davidsen wrote:
> >> Denis Vlasenko wrote:
> >> > On Thursday 25 January 2007 21:45, Michael Tokarev wrote:
> >> >> But even single-threaded I/O but in large quantities benefits from
> >> >> O_DIRECT significantly, and I pointed this out before.
> >> >
> >> > Which shouldn't be true. There is no fundamental reason why
> >> > ordinary writes should be slower than O_DIRECT.
> >> >
> >> Other than the copy to buffer taking CPU and memory resources.
> >
> > It is not required by any standard that I know. Kernel can be smarter
> > and avoid that if it can.
>
> The kernel can also solve the halting problem if it can.
>
> Do you really think an entropy estimation code on all access patterns
> in the system will be free as in beer,

Actually I think the heuristic we need:

if (opened_with_O_STREAM && buffer_is_aligned
    && io_size_is_a_multiple_of_sectorsize)
        do_IO_directly_to_user_buffer_without_memcpy

is not *that* complicated.

I think that we can get rid of O_DIRECT's peculiar requirements "you
*must* not cache me" + "you *must* write me directly to bare metal" by
replacing it with O_STREAM ("*advice* to not cache me") + O_SYNC
("write() should return only when data is written to storage, not
sooner").

Why? Because these O_DIRECT "musts" are rather unusual and overkill.
Apps should not have that much control over what kernel does internally;
and also O_DIRECT was mixing shampoo and conditioner in one bottle
(no-cache and sync writes) - bad API.
--
vda
Re: O_DIRECT question
Denis Vlasenko <[EMAIL PROTECTED]> wrote:
> On Friday 26 January 2007 19:23, Bill Davidsen wrote:
>> Denis Vlasenko wrote:
>> > On Thursday 25 January 2007 21:45, Michael Tokarev wrote:
>> >> But even single-threaded I/O but in large quantities benefits from
>> >> O_DIRECT significantly, and I pointed this out before.
>> >
>> > Which shouldn't be true. There is no fundamental reason why
>> > ordinary writes should be slower than O_DIRECT.
>> >
>> Other than the copy to buffer taking CPU and memory resources.
>
> It is not required by any standard that I know. Kernel can be smarter
> and avoid that if it can.

The kernel can also solve the halting problem if it can.

Do you really think an entropy estimation code on all access patterns
in the system will be free as in beer, or be able to predict the access
pattern of random applications?
--
Top 100 things you don't want the sysadmin to say:
86. What do you mean that wasn't a copy?

Friß, Spammer: [EMAIL PROTECTED] [EMAIL PROTECTED]
Re: O_DIRECT question
On Friday 26 January 2007 19:23, Bill Davidsen wrote:
> Denis Vlasenko wrote:
> > On Thursday 25 January 2007 21:45, Michael Tokarev wrote:
> >> Phillip Susi wrote:
> >>> Denis Vlasenko wrote:
> >>>> You mean "You can use aio_write" ?
> >>> Exactly. You generally don't use O_DIRECT without aio. Combining
> >>> the two is what gives the big win.
> >> Well, it's not only aio. Multithreaded I/O also helps a lot -- all
> >> this, say, to utilize a raid array with many spindles.
> >>
> >> But even single-threaded I/O but in large quantities benefits from
> >> O_DIRECT significantly, and I pointed this out before.
> >
> > Which shouldn't be true. There is no fundamental reason why
> > ordinary writes should be slower than O_DIRECT.
> >
> Other than the copy to buffer taking CPU and memory resources.

It is not required by any standard that I know. Kernel can be smarter
and avoid that if it can.
--
vda
Re: O_DIRECT question
On Friday 26 January 2007 18:05, Phillip Susi wrote:
> Denis Vlasenko wrote:
> > Which shouldn't be true. There is no fundamental reason why
> > ordinary writes should be slower than O_DIRECT.
>
> Again, there IS a reason: O_DIRECT eliminates the cpu overhead of the
> kernel-user copy,

You assume that ordinary read()/write() is *required* to do the copying.
It isn't. The kernel is allowed to do direct DMAing in this case too.

> and when coupled with multithreading or aio, allows
> the IO queues to be kept full with useful transfers at all times.

Again, ordinary I/O is no different. Especially on fds opened with
O_SYNC, write() will behave very similarly to an O_DIRECT one - data is
guaranteed to hit the disk before write() returns.

> Normal read/write requires the kernel to buffer and guess access

No, it doesn't *require* that.

> patterns correctly to perform read ahead and write behind perfectly to
> keep the queues full. In practice, this does not happen perfectly all
> of the time, or even most of the time, so it slows things down.

So let's fix the kernel for everyone's benefit instead of "give us an
API specifically for our needs".
--
vda
Re: O_DIRECT question
Denis Vlasenko wrote:
> On Thursday 25 January 2007 21:45, Michael Tokarev wrote:
>> Phillip Susi wrote:
>>> Denis Vlasenko wrote:
>>>> You mean "You can use aio_write" ?
>>> Exactly. You generally don't use O_DIRECT without aio. Combining the
>>> two is what gives the big win.
>> Well, it's not only aio. Multithreaded I/O also helps a lot -- all
>> this, say, to utilize a raid array with many spindles.
>>
>> But even single-threaded I/O but in large quantities benefits from
>> O_DIRECT significantly, and I pointed this out before.
>
> Which shouldn't be true. There is no fundamental reason why
> ordinary writes should be slower than O_DIRECT.

Other than the copy to buffer taking CPU and memory resources.

--
Bill Davidsen <[EMAIL PROTECTED]>
"We have more to fear from the bungling of the incompetent than from
the machinations of the wicked." - from Slashdot
Re: O_DIRECT question
Denis Vlasenko wrote:
> Which shouldn't be true. There is no fundamental reason why
> ordinary writes should be slower than O_DIRECT.

Again, there IS a reason: O_DIRECT eliminates the cpu overhead of the
kernel-user copy, and when coupled with multithreading or aio, allows
the IO queues to be kept full with useful transfers at all times.
Normal read/write requires the kernel to buffer and guess access
patterns correctly to perform read ahead and write behind perfectly to
keep the queues full. In practice, this does not happen perfectly all
of the time, or even most of the time, so it slows things down.
Re: O_DIRECT question
Mark Lord wrote:
> You guys need to backup in this thread.
>
> Every example of O_DIRECT here could be replaced with calls to mmap(),
> msync(), and madvise() (or posix_fadvise).
>
> In addition to being at least as fast as O_DIRECT, these have the
> added benefit of using the page cache (avoiding reads for data already
> present, handling multiple users of the same data, etc..).

Please actually _read_ the thread. In every one of my posts I have
shown why this is not the case. To briefly rehash the core of the
argument: there is no way to asynchronously manage IO with mmap, msync,
madvise -- instead you take page faults or otherwise block, thus
stalling the pipeline.
Re: O_DIRECT question
Mark Lord wrote:
> You guys need to backup in this thread.
>
> Every example of O_DIRECT here could be replaced with
> calls to mmap(), msync(), and madvise() (or posix_fadvise).

No. How about handling IO errors? There is no practical way to do that
with mmap().

> In addition to being at least as fast as O_DIRECT,
> these have the added benefit of using the page cache
> (avoiding reads for data already present, handling multiple
> users of the same data, etc..).
Re: O_DIRECT question
You guys need to backup in this thread.

Every example of O_DIRECT here could be replaced with
calls to mmap(), msync(), and madvise() (or posix_fadvise).

In addition to being at least as fast as O_DIRECT,
these have the added benefit of using the page cache
(avoiding reads for data already present, handling multiple
users of the same data, etc..).
Re: O_DIRECT question
Denis Vlasenko wrote:
> Well, I too currently work with Oracle. Apparently people who wrote
> damn thing have very, eh, Oracle-centric world-view. "We want direct
> writes to the disk. Period." Why? Does it make sense? Are there
> better ways? - nothing. They think they know better.

I fear you are taking the Windows approach, that the computer is there
to serve the o/s and applications have to do things the way the o/s
wants. As opposed to the UNIX way, where you can either be clever or
stupid; the o/s is there to allow you to use the hardware, not be your
mother.

Currently applications have the option of letting the o/s make decisions
via open/read/write, or letting the o/s make decisions and tell the
application via aio, or using O_DIRECT and having full control over the
process. And that's exactly as it should be. It's not up to the o/s to
be mother.

> (And let's not even start on why oracle ignores SIGTERM. Apparently
> Unix rules aren't for them. They're too big to play by rules.)

Any process can ignore SIGTERM, or do a significant amount of cleanup
before exit()ing. Complex operations need to be completed or unwound.
Why single out Oracle? Other applications may also do that, with more
or less valid reasons.

--
Bill Davidsen <[EMAIL PROTECTED]>
"We have more to fear from the bungling of the incompetent than from
the machinations of the wicked." - from Slashdot
Re: O_DIRECT question
On Thursday 25 January 2007 21:45, Michael Tokarev wrote:
> Phillip Susi wrote:
> > Denis Vlasenko wrote:
> >> You mean "You can use aio_write" ?
> >
> > Exactly. You generally don't use O_DIRECT without aio. Combining the
> > two is what gives the big win.
>
> Well, it's not only aio. Multithreaded I/O also helps a lot -- all
> this, say, to utilize a raid array with many spindles.
>
> But even single-threaded I/O but in large quantities benefits from
> O_DIRECT significantly, and I pointed this out before.

Which shouldn't be true. There is no fundamental reason why
ordinary writes should be slower than O_DIRECT.
--
vda
Re: O_DIRECT question
Phillip Susi wrote:
> Denis Vlasenko wrote:
>> You mean "You can use aio_write" ?
>
> Exactly. You generally don't use O_DIRECT without aio. Combining the
> two is what gives the big win.

Well, it's not only aio. Multithreaded I/O also helps a lot -- all this,
say, to utilize a raid array with many spindles.

But even single-threaded I/O but in large quantities benefits from
O_DIRECT significantly, and I pointed this out before. It's like
enabling a write cache on disk AND doing intensive random writes: the
cache, surprisingly, slows the whole thing down by 5..10%.

/mjt
Re: O_DIRECT question
Denis Vlasenko wrote:
> You mean "You can use aio_write" ?

Exactly. You generally don't use O_DIRECT without aio. Combining the
two is what gives the big win.
Re: O_DIRECT question
On Thursday 25 January 2007 20:28, Phillip Susi wrote:
> > Ahhh shit, are you saying that fdatasync will wait until writes
> > *by all other processes* to this file will hit the disk?
> > Is that true?
>
> I think all processes yes, but certainly all writes to this file by
> this process. That means you have to sync for every write, which
> means you block. Blocking stalls the pipeline.

I don't understand you here. Suppose fdatasync() is "do not return
until all cached writes to this file *done by current process* hit the
disk (i.e. cached write data from other concurrent processes is not
waited for), report success or error code". Then

write(fd_O_DIRECT, buf, sz) - will wait until buf's data hit the disk
write(fd, buf, sz)          - potentially will return sooner, but
fdatasync(fd)               - will wait until buf's data hit the disk

Looks the same to me.

> > If you opened a file and are doing only O_DIRECT writes, you
> > *always* have your written data flushed, by each write().
> > How is it different from writes done using
> > "normal" write() + fdatasync() pairs?
>
> Because you can do writes async, but not fdatasync ( unless there is
> an async version I don't know about ).

You mean "You can use aio_write" ?
--
vda
Re: O_DIRECT question
Denis Vlasenko wrote:
> If you opened a file and are doing only O_DIRECT writes, you
> *always* have your written data flushed, by each write().
> How is it different from writes done using
> "normal" write() + fdatasync() pairs?

Because you can do writes async, but not fdatasync ( unless there is an
async version I don't know about ).

> Ahhh shit, are you saying that fdatasync will wait until writes
> *by all other processes* to this file will hit the disk?
> Is that true?

I think all processes yes, but certainly all writes to this file by this
process. That means you have to sync for every write, which means you
block. Blocking stalls the pipeline.
Re: O_DIRECT question
On Thursday 25 January 2007 16:44, Phillip Susi wrote:
> Denis Vlasenko wrote:
> > I will still disagree on this point (on point "use O_DIRECT, it's
> > faster"). There is no reason why O_DIRECT should be faster than
> > "normal" read/write to large, aligned buffer. If O_DIRECT is faster
> > on today's kernel, then Linux' read()/write() can be optimized more.
>
> Ahh but there IS a reason for it to be faster: the application knows
> what data it will require, so it should tell the kernel rather than
> ask it to guess. Even if you had the kernel playing vmsplice games to
> avoid the copy to user space ( which still has a fair amount of
> overhead ), then you still have the problem of the kernel having to
> guess what data the application will require next, and try to fetch it
> early. Then when the application requests the data, if it is not
> already in memory, the application blocks until it is, and blocking
> stalls the pipeline.
>
> > (I hoped that they can be made even *faster* than O_DIRECT, but as I
> > said, you convinced me with your "error reporting" argument that
> > reads must still block until entire buffer is read. Writes can avoid
> > that - apps can do fdatasync/whatever to make sync writes & error
> > checks if they want).
>
> fdatasync() is not acceptable either because it flushes the entire
> file.

If you opened a file and are doing only O_DIRECT writes, you *always*
have your written data flushed, by each write(). How is it different
from writes done using "normal" write() + fdatasync() pairs?

> This does not allow the application to control the ordering of various
> writes unless it limits itself to a single write/fdatasync pair at a
> time. Further, fdatasync again blocks the application.

Ahhh shit, are you saying that fdatasync will wait until writes *by all
other processes* to this file will hit the disk? Is that true?
--
vda
Re: O_DIRECT question
Denis Vlasenko wrote:
> I will still disagree on this point (on point "use O_DIRECT, it's
> faster"). There is no reason why O_DIRECT should be faster than
> "normal" read/write to large, aligned buffer. If O_DIRECT is faster
> on today's kernel, then Linux' read()/write() can be optimized more.

Ahh but there IS a reason for it to be faster: the application knows
what data it will require, so it should tell the kernel rather than ask
it to guess. Even if you had the kernel playing vmsplice games to avoid
the copy to user space ( which still has a fair amount of overhead ),
then you still have the problem of the kernel having to guess what data
the application will require next, and try to fetch it early. Then when
the application requests the data, if it is not already in memory, the
application blocks until it is, and blocking stalls the pipeline.

> (I hoped that they can be made even *faster* than O_DIRECT, but as I
> said, you convinced me with your "error reporting" argument that reads
> must still block until entire buffer is read. Writes can avoid that -
> apps can do fdatasync/whatever to make sync writes & error checks if
> they want).

fdatasync() is not acceptable either because it flushes the entire file.
This does not allow the application to control the ordering of various
writes unless it limits itself to a single write/fdatasync pair at a
time. Further, fdatasync again blocks the application. With aio, the
application can keep several read/writes going in parallel, thus keeping
the pipeline full. Even if the io were not O_DIRECT, and the kernel
played vmsplice games to avoid the copy, it would still have more
overhead, complexity and, I think, very little gain in most cases.
Re: O_DIRECT question
On Monday 22 January 2007 17:17, Phillip Susi wrote:
> > You do not need to know which read() exactly failed due to bad disk.
> > Filename and offset from the start is enough. Right?
> >
> > So, SIGIO/SIGBUS can provide that, and if your handler is of
> > void (*sa_sigaction)(int, siginfo_t *, void *);
> > style, you can get fd, memory address of the fault, etc.
> > Probably kernel can even pass file offset somewhere in siginfo_t...
>
> Sure... now what does your signal handler have to do in order to
> handle this error in such a way as to allow the one request to be
> failed and the task to continue handling other requests? I don't
> think this is even possible, let alone clean.

Actually, you have convinced me on this. While it is possible to report
the error to userspace, it will be highly nontrivial (read: bug-prone)
for userspace to catch and act on the errors.

> > You think "Oracle". But this application may very well be
> > not Oracle, but diff, or dd, or KMail. I don't want to care.
> > I want all big writes to be efficient, not just those done by Oracle.
> > *Including* single threaded ones.
>
> Then redesign those applications to use aio and O_DIRECT. Incidentally
> I have hacked up dd to do just that and have some very nice performance
> numbers as a result.

I will still disagree on this point (on point "use O_DIRECT, it's
faster"). There is no reason why O_DIRECT should be faster than
"normal" read/write to large, aligned buffer. If O_DIRECT is faster on
today's kernel, then Linux' read()/write() can be optimized more.

(I hoped that they can be made even *faster* than O_DIRECT, but as I
said, you convinced me with your "error reporting" argument that reads
must still block until entire buffer is read. Writes can avoid that -
apps can do fdatasync/whatever to make sync writes & error checks if
they want).
--
vda
Re: O_DIRECT question
Denis Vlasenko wrote:
> The difference is that you block exactly when you try to access data
> which is not there yet, not sooner (potentially much sooner). If
> application (e.g. database) needs to know whether data is _really_
> there, it should use aio_read (or something better, something which
> doesn't use signals. Do we have this 'something'? I honestly don't
> know).

The application _IS_ using aio, which is why it can go and perform other
work while it waits to be told that the read has completed. This is not
possible with mmap because the task is blocked while faulting in pages,
and unless it tries to access the pages, they won't be faulted in.

> In some cases, even this is not needed because you don't have any
> other things to do, so you just do read() (which returns early), and
> chew on data. If your CPU is fast enough and processing of data is
> light enough so that it outruns disk - big deal, you block in page
> fault handler whenever a page is not read for you in time. If CPU
> isn't fast enough, your CPU and disk subsystem are nicely working in
> parallel.

Being blocked in the page fault handler means the cpu is now idle
because you can't go chew on data that _IS_ in core. The aio + O_DIRECT
combination allows you to control when IO is started rather than rely on
the kernel to decide when is a good time for readahead, and to KNOW when
that IO is done so you can chew on the data.

> With O_DIRECT, you alternate: "CPU is idle, disk is working" / "CPU is
> working, disk is idle".

You have this completely backwards. With mmap this is what you get,
because you chew data, page fault... chew data... page fault...

> What do you want to do on I/O error? I guess you cannot do much - any
> sensible db will shutdown itself. When your data storage starts to
> fail, it's pointless to continue running.

Ever hear of error recovery? A good db will be able to cope with one or
two bad blocks, or at the very least continue operating the other tables
or databases it is hosting, or flush transactions and switch to read
only mode, or any number of things other than abort().

> You do not need to know which read() exactly failed due to bad disk.
> Filename and offset from the start is enough. Right?
>
> So, SIGIO/SIGBUS can provide that, and if your handler is of
> void (*sa_sigaction)(int, siginfo_t *, void *);
> style, you can get fd, memory address of the fault, etc.
> Probably kernel can even pass file offset somewhere in siginfo_t...

Sure... now what does your signal handler have to do in order to handle
this error in such a way as to allow the one request to be failed and
the task to continue handling other requests? I don't think this is
even possible, let alone clean.

> You can still be multithreaded. The point is, with O_DIRECT you _are_
> _forced_ to be multithreaded, or else performance will suck.

Or use aio. Simple read/write with the kernel trying to outsmart the
application is nice for very simple applications, but it does not
provide very good performance. This is why we have aio and O_DIRECT:
because the application can manage the IO better than the kernel,
because it actually knows what it needs and when. Yes, the application
ends up being more complex, but that is the price you pay. You simply
can't get it perfect in a general purpose kernel that has to guess what
the application is really trying to do.

> You think "Oracle". But this application may very well be not Oracle,
> but diff, or dd, or KMail. I don't want to care. I want all big
> writes to be efficient, not just those done by Oracle. *Including*
> single threaded ones.

Then redesign those applications to use aio and O_DIRECT. Incidentally
I have hacked up dd to do just that and have some very nice performance
numbers as a result.

> Well, I too currently work with Oracle. Apparently people who wrote
> damn thing have very, eh, Oracle-centric world-view. "We want direct
> writes to the disk. Period." Why? Does it make sense? Are there
> better ways? - nothing. They think they know better.

Nobody has shown otherwise to date.
Re: O_DIRECT question
Andrea Arcangeli wrote:
> Linus may be right that perhaps one day the CPU will be so much faster
> than disk that such a copy will not be measurable and then O_DIRECT
> could be downgraded to O_STREAMING or an fadvise. If such a day will
> come by, probably that same day Dr. Tanenbaum will be finally right
> about his OS design too.

Dr. T. is probably right with his OS design; it's just that people
aren't ready for it yet.

Thanks!

--
Al
Re: O_DIRECT question
Denis Vlasenko wrote:
> What will happen if we just make open ignore O_DIRECT? ;)
> And then anyone who feels sad about it is advised to do it like
> described here: http://lkml.org/lkml/2002/5/11/58

Then databases and other high performance IO users will be broken.
Most of Linus's rant there is being rehashed now in this thread, and it
has been pointed out that using mmap instead is unacceptable because it
is inherently _synchronous_, the app can not tolerate the page faults
on read, and handling IO errors during the page fault is
impossible/highly problematic.
Re: O_DIRECT question
Hello everyone,

This is a long thread about O_DIRECT surprisingly without a single
bugreport in it; that's a good sign that O_DIRECT is starting to work
well in 2.6 too ;)

On Fri, Jan 12, 2007 at 02:47:48PM -0800, Andrew Morton wrote:
> On Fri, 12 Jan 2007 15:35:09 -0700
> Erik Andersen <[EMAIL PROTECTED]> wrote:
>
> > On Fri Jan 12, 2007 at 05:09:09PM -0500, Linus Torvalds wrote:
> > > I suspect a lot of people actually have other reasons to avoid
> > > caches.
> > >
> > > For example, the reason to do O_DIRECT may well not be that you
> > > want to avoid caching per se, but simply because you want to limit
> > > page cache activity. In which case O_DIRECT "works", but it's
> > > really the wrong thing to do. We could export other ways to do
> > > what people ACTUALLY want, that doesn't have the downsides.
> >
> > I was rather fond of the old O_STREAMING patch by Robert Love,
>
> That was an akpm patch which I did for the Digeo kernel. Robert
> picked it up to dehackify it and get it into mainline, but we ended up
> deciding that posix_fadvise() was the way to go because it's
> standards-based.
>
> It's a bit more work in the app to use posix_fadvise() well. But the
> results will be better. The app should also use sync_file_range()
> intelligently to control its pagecache use.
>
> The problem with all of these things is that the application needs to
> be changed, and people often cannot do that. If we want a general way
> of

And if the application needs to be changed then IMHO it sounds better to
go the last mile and to use O_DIRECT instead of O_STREAMING to run in
zerocopy. Benchmarks have been posted here as well to show what kind of
difference O_DIRECT can make. O_STREAMING really shouldn't exist and
all O_STREAMING users should be converted to O_DIRECT.
The only reason O_DIRECT exists is to bypass the pagecache and run in zerocopy: to avoid all pagecache lookups and locking, to preserve cpu caches, to avoid losing smp scalability on the memory bus in non-NUMA systems, and to avoid the general cpu overhead of copying the data with the cpu for no good reason. The cache-pollution avoidance that O_STREAMING and fadvise can also provide is an almost uninteresting feature by comparison. I'm afraid databases aren't totally stupid here in using O_DIRECT: the caches they keep in ram aren't necessarily a 1:1 mapping of the on-disk data, so replacing O_DIRECT with a MAP_SHARED of the source file wouldn't be the best even if they could be convinced to trust the OS instead of insisting on bypassing it (and if they could combine MAP_SHARED with asyncio somehow). They don't have problems trusting the OS when they map tmpfs as MAP_SHARED, after all... Why waste time copying the data through the pagecache if the pagecache itself won't be useful when the db is properly tuned?

Linus may be right that perhaps one day the CPU will be so much faster than disk that such a copy will not be measurable, and then O_DIRECT could be downgraded to O_STREAMING or an fadvise. If such a day comes, probably that same day Dr. Tanenbaum will finally be right about his OS design too. Storage speed is growing along with cpu speeds, especially with contiguous I/O and fast raid storage, so I don't see it as very likely that we can ignore those memory copies any time soon. Perhaps an average amd64 desktop system with a single sata disk may never get a real benefit from O_DIRECT compared to O_STREAMING, but that's not the point, as linux doesn't only run on desktops with a single SATA disk running at only 50M/sec (and abysmal performance while seeking).

With regard to the locking mess, O_DIRECT already falls back to buffered mode while creating new blocks and uses proper locking to serialize against i_size changes (by sct).
Filling holes and i_size changes are the forbidden sins of O_DIRECT. The rest is just a matter of cache invalidates or cache flushes run at the right time. With more recent 2.6 changes, even further complexity has been introduced to allow mapped cache to see O_DIRECT writes; I've never been convinced that this was really useful. There was nothing wrong in having a not-uptodate page mapped in userland (except to work around an artificial BUG_ON that tried to enforce that artificial invariant for no apparent required reason), but it should work ok and it can be seen as a new feature.
Re: O_DIRECT question
On Sunday 21 January 2007 13:09, Michael Tokarev wrote:
> Denis Vlasenko wrote:
> > On Saturday 20 January 2007 21:55, Michael Tokarev wrote:
> >> Denis Vlasenko wrote:
> >>> On Thursday 11 January 2007 18:13, Michael Tokarev wrote:
> example, which isn't quite possible now from userspace. But as long as
> O_DIRECT actually writes data before returning from write() call (as it
> seems to be the case at least with a normal filesystem on a real block
> device - I don't touch corner cases like nfs here), it's pretty much
> THE ideal solution, at least from the application (developer) standpoint.
> >>> Why do you want to wait while 100 megs of data are being written?
> >>> You _have to_ have threaded db code in order to not waste
> >>> gobs of CPU time on UP + even with that you eat context switch
> >>> penalty anyway.
> >> Usually it's done using aio ;)
> >>
> >> It's not that simple really.
> >>
> >> For reads, you have to wait for the data anyway before doing something
> >> with it. Omitting reads for now.
> >
> > Really? All 100 megs _at once_? Linus described a fairly simple (conceptually)
> > idea here: http://lkml.org/lkml/2002/5/11/58
> > In short, a page-aligned read buffer can be just unmapped,
> > with the page fault handler catching accesses to yet-unread data.
> > As data comes from disk, it gets mapped back into the process'
> > address space.
> > This way read() returns almost immediately and the CPU is free to do
> > something useful.
>
> And what does the application do during that page fault? Wait for the read
> to actually complete? How is it different from a regular (direct or not)
> read?

The difference is that you block exactly when you try to access data which is not there yet, not sooner (potentially much sooner). If the application (e.g. a database) needs to know whether data is _really_ there, it should use aio_read (or something better, something which doesn't use signals. Do we have this 'something'? I honestly don't know).
In some cases even this is not needed, because you don't have any other things to do, so you just do read() (which returns early) and chew on the data. If your CPU is fast enough and processing of the data is light enough that it outruns the disk - big deal, you block in the page fault handler whenever a page hasn't been read for you in time. If the CPU isn't fast enough, your CPU and disk subsystem are nicely working in parallel. With O_DIRECT, you alternate: "CPU is idle, disk is working" / "CPU is working, disk is idle".

> Well, it IS different: now we can't predict *when* exactly we'll sleep waiting
> for the read to complete. And also, now we're in an unknown-corner-case when
> an I/O error occurs, too (I/O errors interact badly with things like mmap, and
> this looks more like mmap than like an actual read).
>
> Yes, this way we'll fix the problems in the current O_DIRECT way of doing things -
> all those races and design stupidity etc. Yes it may work, provided those
> "corner cases" like the I/O error problems will be fixed.

What do you want to do on I/O error? I guess you cannot do much - any sensible db will shut itself down. When your data storage starts to fail, it's pointless to continue running. You do not need to know which read() exactly failed due to a bad disk. Filename and offset from the start is enough. Right? So, SIGIO/SIGBUS can provide that, and if your handler is of the void (*sa_sigaction)(int, siginfo_t *, void *); style, you can get the fd, the memory address of the fault, etc. Probably the kernel can even pass the file offset somewhere in siginfo_t...

> And yes, sometimes
> it's not really that interesting to know when exactly we'll sleep actually
> waiting for the I/O - during read or during some memory access...

It differs from a performance perspective, as discussed above.

> There may be other reasons to "want" those extra context switches.
> I mentioned above that oracle doesn't use threads, but processes.

You can still be multithreaded.
The point is, with O_DIRECT you _are forced to be_ multithreaded, or else performance will suck.

> > Assume that we have "clever writes" like Linus described.
> >
> > /* something like "caching i/o over this fd is mostly useless" */
> > /* (looks like this API is easier to transition to
> >  * than fadvise etc. - it's "looks like" O_DIRECT) */
> > fd = open(..., flags|O_STREAM);
> > ...
> > /* Starts writeout immediately due to O_STREAM,
> >  * marks buf100meg's pages R/O to catch modifications,
> >  * but doesn't block! */
> > write(fd, buf100meg, 100*1024*1024);
>
> And how do we know when the write completes?
>
> > /* We are free to do something useful in parallel */
> > sort();
>
> .. which is done in another process, already started.

You think "Oracle". But this application may very well be not Oracle, but diff, or dd, or KMail. I don't want to care. I want all big writes to be efficient, not just those done by Oracle. *Including* single-threaded ones.

> > Why we bothered to write Linux at
Re: O_DIRECT question
Denis Vlasenko wrote:
> On Saturday 20 January 2007 21:55, Michael Tokarev wrote:
>> Denis Vlasenko wrote:
>>> On Thursday 11 January 2007 18:13, Michael Tokarev wrote:
>>>> example, which isn't quite possible now from userspace. But as long as
>>>> O_DIRECT actually writes data before returning from write() call (as it
>>>> seems to be the case at least with a normal filesystem on a real block
>>>> device - I don't touch corner cases like nfs here), it's pretty much
>>>> THE ideal solution, at least from the application (developer) standpoint.
>>> Why do you want to wait while 100 megs of data are being written?
>>> You _have to_ have threaded db code in order to not waste
>>> gobs of CPU time on UP + even with that you eat context switch
>>> penalty anyway.
>> Usually it's done using aio ;)
>>
>> It's not that simple really.
>>
>> For reads, you have to wait for the data anyway before doing something
>> with it. Omitting reads for now.
>
> Really? All 100 megs _at once_? Linus described a fairly simple (conceptually)
> idea here: http://lkml.org/lkml/2002/5/11/58
> In short, a page-aligned read buffer can be just unmapped,
> with the page fault handler catching accesses to yet-unread data.
> As data comes from disk, it gets mapped back into the process'
> address space.
> This way read() returns almost immediately and the CPU is free to do
> something useful.

And what does the application do during that page fault? Wait for the read to actually complete? How is that different from a regular (direct or not) read?

Well, it IS different: now we can't predict *when* exactly we'll sleep waiting for the read to complete. And also, now we're in an unknown-corner-case when an I/O error occurs, too (I/O errors interact badly with things like mmap, and this looks more like mmap than like an actual read).

Yes, this way we'll fix the problems in the current O_DIRECT way of doing things - all those races and design stupidity etc. Yes it may work, provided those "corner cases" like the I/O error problems will be fixed.
And yes, sometimes it's not really that interesting to know when exactly we'll sleep actually waiting for the I/O - during read or during some memory access...

Now I wonder how it should look from an application's standpoint. It has its "smart" cache. A worker thread (a process in the case of oracle - there's a very good reason why they don't use threads, and this architecture saved our data several times already - but that's an entirely different topic and not really relevant here) -- so, a worker process which executes requests coming from a user application wants to have (read) access to a db block (usually 8Kb in size, but can be 4..32Kb - definitely not 100megs), where the requested data is located. It checks whether this block is in the cache, and if it's not, it is read from the disk and added to the cache. The cache resides in shared memory (so that other processes will be able to access it too). With the proposed solution, it looks even better - that `read()' operation returns immediately, so all other processes which want the same page at the same time will start "using" it immediately. Provided they all can access the memory. This is how a (large) index access or table-access-by-rowid (after an index lookup, for example) is done - requesting usually just a single block in some random place of a file.

There's another access pattern - like full table scans, where a lot of data is read sequentially. It's done in chunks, say, 64 blocks (8Kb each) at a time. We read a chunk of data, do something on it, and discard it (caching it isn't a very good idea). For this access pattern, the proposal should work fairly well. Except for the I/O error handling, maybe.

By the way - the *whole* cache thing may be implemented in the application *using the in-kernel page cache*, with clever usage of mmap() and friends.
Provided the whole database fits into an address space, or something like that ;)

>> For writes, it's not that problematic - even 10-15 threads is nothing
>> compared with the I/O (O in this case) itself -- that context switch
>> penalty.
>
> Well, if you have some CPU intensive thing to do (e.g. sort),
> why not benefit from lack of extra context switch?

There may be other reasons to "want" those extra context switches. I mentioned above that oracle doesn't use threads, but processes. I don't know why exactly it's done this way, but I know how it saved our data. The short answer is this: bugs ;) A process doing something with the data and generating write requests to the db goes crazy - some memory corruption, doing some bad things... But that process does not do any writes directly - instead, it generates those write requests in shared memory, and ANOTHER process actually does the writing. AND verifies that the requests actually look sane. And detects the "bad" writes, and immediately prevents data corruption. That other (dbwr) process does much simpler things, and has its own address space which isn
Re: O_DIRECT question
On Saturday 20 January 2007 21:55, Michael Tokarev wrote:
> Denis Vlasenko wrote:
> > On Thursday 11 January 2007 18:13, Michael Tokarev wrote:
> >> example, which isn't quite possible now from userspace. But as long as
> >> O_DIRECT actually writes data before returning from write() call (as it
> >> seems to be the case at least with a normal filesystem on a real block
> >> device - I don't touch corner cases like nfs here), it's pretty much
> >> THE ideal solution, at least from the application (developer) standpoint.
> >
> > Why do you want to wait while 100 megs of data are being written?
> > You _have to_ have threaded db code in order to not waste
> > gobs of CPU time on UP + even with that you eat context switch
> > penalty anyway.
>
> Usually it's done using aio ;)
>
> It's not that simple really.
>
> For reads, you have to wait for the data anyway before doing something
> with it. Omitting reads for now.

Really? All 100 megs _at once_? Linus described a fairly simple (conceptually) idea here: http://lkml.org/lkml/2002/5/11/58
In short, a page-aligned read buffer can be just unmapped, with the page fault handler catching accesses to yet-unread data. As data comes from disk, it gets mapped back into the process' address space. This way read() returns almost immediately and the CPU is free to do something useful.

> For writes, it's not that problematic - even 10-15 threads is nothing
> compared with the I/O (O in this case) itself -- that context switch
> penalty.

Well, if you have some CPU intensive thing to do (e.g. sort), why not benefit from the lack of an extra context switch?

Assume that we have "clever writes" like Linus described.

/* something like "caching i/o over this fd is mostly useless" */
/* (looks like this API is easier to transition to
 * than fadvise etc. - it's "looks like" O_DIRECT) */
fd = open(..., flags|O_STREAM);
...
/* Starts writeout immediately due to O_STREAM,
 * marks buf100meg's pages R/O to catch modifications,
 * but doesn't block!
 */
write(fd, buf100meg, 100*1024*1024);
/* We are free to do something useful in parallel */
sort();

> > I hope you agree that threaded code is not ideal performance-wise
> > - async IO is better. O_DIRECT is strictly sync IO.
>
> Hmm.. Now I'm confused.
>
> For example, oracle uses aio + O_DIRECT. It seems to be working... ;)
> As an alternative, there are multiple single-threaded db_writer processes.
> Why do you say O_DIRECT is strictly sync?

I mean that an O_DIRECT write() blocks until the I/O really is done. A normal write can block for much less, or not at all.

> In either case - I provided some real numbers in this thread before.
> Yes, O_DIRECT has its problems, even security problems. But the thing
> is - it is working, and working WAY better - from the performance point
> of view - than "indirect" I/O, and currently there's no alternative that
> works as good as O_DIRECT.

Why did we bother to write Linux at all? There were other Unixes which worked ok.
--
vda
Re: O_DIRECT question
Denis Vlasenko wrote:
> On Thursday 11 January 2007 18:13, Michael Tokarev wrote:
>> example, which isn't quite possible now from userspace. But as long as
>> O_DIRECT actually writes data before returning from write() call (as it
>> seems to be the case at least with a normal filesystem on a real block
>> device - I don't touch corner cases like nfs here), it's pretty much
>> THE ideal solution, at least from the application (developer) standpoint.
>
> Why do you want to wait while 100 megs of data are being written?
> You _have to_ have threaded db code in order to not waste
> gobs of CPU time on UP + even with that you eat context switch
> penalty anyway.

Usually it's done using aio ;)

It's not that simple really.

For reads, you have to wait for the data anyway before doing something with it. Omitting reads for now.

For writes, it's not that problematic - even 10-15 threads is nothing compared with the I/O (O in this case) itself -- that context switch penalty.

> I hope you agree that threaded code is not ideal performance-wise
> - async IO is better. O_DIRECT is strictly sync IO.

Hmm.. Now I'm confused.

For example, oracle uses aio + O_DIRECT. It seems to be working... ;) As an alternative, there are multiple single-threaded db_writer processes. Why do you say O_DIRECT is strictly sync?

In either case - I provided some real numbers in this thread before. Yes, O_DIRECT has its problems, even security problems. But the thing is - it is working, and working WAY better - from the performance point of view - than "indirect" I/O, and currently there's no alternative that works as well as O_DIRECT.

Thanks.

/mjt
Re: O_DIRECT question
On Sunday 14 January 2007 10:11, Nate Diller wrote:
> On 1/12/07, Andrew Morton <[EMAIL PROTECTED]> wrote:
> Most applications don't get the kind of performance analysis that
> Digeo was doing, and even then, it's rather lucky that we caught that.
> So I personally think it'd be best for libc or something to simulate
> the O_STREAM behavior if you ask for it. That would simplify things
> for the most common case, and have the side benefit of reducing the
> amount of extra code an application would need in order to take
> advantage of that feature.

Sounds like you are saying that making O_DIRECT really mean O_STREAM will work for everybody (including db people, except that they will moan a lot about "it isn't _real_ O_DIRECT!!! Linux suxxx"). I don't care about that.
--
vda
Re: O_DIRECT question
On Thursday 11 January 2007 18:13, Michael Tokarev wrote:
> example, which isn't quite possible now from userspace. But as long as
> O_DIRECT actually writes data before returning from write() call (as it
> seems to be the case at least with a normal filesystem on a real block
> device - I don't touch corner cases like nfs here), it's pretty much
> THE ideal solution, at least from the application (developer) standpoint.

Why do you want to wait while 100 megs of data are being written? You _have to_ have threaded db code in order to not waste gobs of CPU time on UP + even with that you eat context switch penalty anyway.

I hope you agree that threaded code is not ideal performance-wise - async IO is better. O_DIRECT is strictly sync IO.
--
vda
Re: O_DIRECT question
On Thursday 11 January 2007 16:50, Linus Torvalds wrote:
> On Thu, 11 Jan 2007, Nick Piggin wrote:
> >
> > Speaking of which, why did we obsolete raw devices? And/or why not just
> > go with a minimal O_DIRECT on block device support? Not a rhetorical
> > question -- I wasn't involved in the discussions when they happened, so
> > I would be interested.
>
> Lots of people want to put their databases in a file. Partitions really
> weren't nearly flexible enough. So the whole raw device or O_DIRECT just
> to the block device thing isn't really helping any.
>
> > O_DIRECT is still crazily racy versus pagecache operations.
>
> Yes. O_DIRECT is really fundamentally broken. There's just no way to fix
> it sanely. Except by teaching people not to use it, and making the normal
> paths fast enough (and that _includes_ doing things like dropping caches
> more aggressively, but it probably would include more work on the device
> queue merging stuff etc etc).

What will happen if we just make open ignore O_DIRECT? ;) And then anyone who feels sad about it is advised to do it as described here: http://lkml.org/lkml/2002/5/11/58
--
vda
Re: O_DIRECT question
On Tue, 16 Jan 2007, Arjan van de Ven wrote:
> On Tue, 2007-01-16 at 21:26 +0100, Bodo Eggert wrote:
> > Helge Hafting <[EMAIL PROTECTED]> wrote:
> > > Michael Tokarev wrote:
> > >> But seriously - what about just disallowing non-O_DIRECT opens together
> > >> with O_DIRECT ones ?
> > >
> > > Please do not create a new local DOS attack.
> > > I open some important file, say /etc/resolv.conf
> > > with O_DIRECT and just sit on the open handle.
> > > Now nobody else can open that file because
> > > it is "busy" with O_DIRECT ?
> >
> > Suspend O_DIRECT access while non-O_DIRECT-fds are open, fdatasync on close?
>
> .. then any user can impact the operation, performance and reliability
> of the database application of another user... sounds like plugging one
> hole by making a bigger hole ;)

Don't allow other users to access your raw database files then, and if backup kicks in, pausing the database would DTRT for the integrity of the backup. For other applications, paused O_DIRECT may very well be a problem, but I can't think of one right now.
--
Logic: The art of being wrong with confidence...
Re: O_DIRECT question
I think one problem with mmap/msync is that they can't maintain i_size atomically the way a regular write does. So one needs to implement one's own i_size management in userspace.

thanks, Alex

> Side note: the only reason O_DIRECT exists is because database people are
> too used to it, because other OS's haven't had enough taste to tell them
> to do it right, so they've historically hacked their OS to get out of the
> way.
> As a result, our madvise and/or posix_fadvise interfaces may not be all
> that strong, because people sadly don't use them that much. It's a sad
> example of a totally broken interface (O_DIRECT) resulting in better
> interfaces not getting used, and then not getting as much development
> effort put into them.
> So O_DIRECT not only is a total disaster from a design standpoint (just
> look at all the crap it results in), it also indirectly has hurt better
> interfaces. For example, POSIX_FADV_NOREUSE (which _could_ be a useful and
> clean interface to make sure we don't pollute memory unnecessarily with
> cached pages after they are all done) ends up being a no-op ;/
> Sad. And it's one of those self-fulfilling prophecies. Still, I hope some
> day we can just rip the damn disaster out.
Re: O_DIRECT question
On Tue, 2007-01-16 at 21:26 +0100, Bodo Eggert wrote:
> Helge Hafting <[EMAIL PROTECTED]> wrote:
> > Michael Tokarev wrote:
> > >> But seriously - what about just disallowing non-O_DIRECT opens together
> > >> with O_DIRECT ones ?
> >
> > Please do not create a new local DOS attack.
> > I open some important file, say /etc/resolv.conf
> > with O_DIRECT and just sit on the open handle.
> > Now nobody else can open that file because
> > it is "busy" with O_DIRECT ?
>
> Suspend O_DIRECT access while non-O_DIRECT-fds are open, fdatasync on close?

.. then any user can impact the operation, performance and reliability of the database application of another user... sounds like plugging one hole by making a bigger hole ;)
Re: O_DIRECT question
On 1/12/07, Linus Torvalds <[EMAIL PROTECTED]> wrote:
> On Thu, 11 Jan 2007, Roy Huang wrote:
> >
> > On an embedded system, limiting page cache can relieve memory
> > fragmentation. There is a patch against 2.6.19, which limits every
> > opened file's page cache and the total pagecache. When the limit is
> > reached, it will release the page cache over the limit.
>
> I do think that something like this is probably a good idea, even on
> non-embedded setups. We historically couldn't do this, because mapped
> pages were too damn hard to remove, but that's obviously not much of a
> problem any more.
>
> However, the page-cache limit should NOT be some compile-time constant.
> It should work the same way the "dirty page" limit works, and probably
> just default to "feel free to use 90% of memory for page cache".
>
> Linus

The attached patch limits the page cache in a simple way:

1) If requesting memory for the page cache, set a flag to mark this kind of allocation:

static inline struct page *page_cache_alloc(struct address_space *x)
{
-	return __page_cache_alloc(mapping_gfp_mask(x));
+	return __page_cache_alloc(mapping_gfp_mask(x)|__GFP_PAGECACHE);
}

2) Have zone_watermark_ok enforce this limit:

+	if (alloc_flags & ALLOC_PAGECACHE){
+		min = min + VFS_CACHE_LIMIT;
+	}
+
	if (free_pages <= min + z->lowmem_reserve[classzone_idx])
		return 0;

3) So, when __alloc_pages is called by the page cache, pass ALLOC_PAGECACHE into get_page_from_freelist to trigger the pagecache-limit branch in zone_watermark_ok.

This approach works on my side. I'll make a new patch to make the limit tunable in procfs soon.
The following is the patch:
===================================================================
Index: mm/page_alloc.c
===================================================================
--- mm/page_alloc.c	(revision 2645)
+++ mm/page_alloc.c	(working copy)
@@ -892,6 +892,9 @@ failed:
 #define ALLOC_HARDER	0x10 /* try to alloc harder */
 #define ALLOC_HIGH	0x20 /* __GFP_HIGH set */
 #define ALLOC_CPUSET	0x40 /* check for correct cpuset */
+#define ALLOC_PAGECACHE	0x80 /* __GFP_PAGECACHE set */
+
+#define VFS_CACHE_LIMIT	0x400 /* limit VFS cache page */
 
 /*
  * Return 1 if free pages are above 'mark'. This takes into account the order
@@ -910,6 +913,10 @@ int zone_watermark_ok(struct zone *z, in
 	if (alloc_flags & ALLOC_HARDER)
 		min -= min / 4;
 
+	if (alloc_flags & ALLOC_PAGECACHE){
+		min = min + VFS_CACHE_LIMIT;
+	}
+
 	if (free_pages <= min + z->lowmem_reserve[classzone_idx])
 		return 0;
 	for (o = 0; o < order; o++) {
@@ -1000,8 +1007,12 @@ restart:
 		return NULL;
 	}
 
-	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order,
-				zonelist, ALLOC_WMARK_LOW|ALLOC_CPUSET);
+	if (gfp_mask & __GFP_PAGECACHE)
+		page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order,
+				zonelist, ALLOC_WMARK_LOW|ALLOC_CPUSET|ALLOC_PAGECACHE);
+	else
+		page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order,
+				zonelist, ALLOC_WMARK_LOW|ALLOC_CPUSET);
 	if (page)
 		goto got_pg;
@@ -1027,6 +1038,9 @@ restart:
 	if (wait)
 		alloc_flags |= ALLOC_CPUSET;
 
+	if (gfp_mask & __GFP_PAGECACHE)
+		alloc_flags |= ALLOC_PAGECACHE;
+
 	/*
 	 * Go through the zonelist again. Let __GFP_HIGH and allocations
 	 * coming from realtime tasks go deeper into reserves.
Index: include/linux/gfp.h
===================================================================
--- include/linux/gfp.h	(revision 2645)
+++ include/linux/gfp.h	(working copy)
@@ -46,6 +46,7 @@ struct vm_area_struct;
 #define __GFP_NOMEMALLOC ((__force gfp_t)0x1u) /* Don't use emergency reserves */
 #define __GFP_HARDWALL	((__force gfp_t)0x2u) /* Enforce hardwall cpuset memory allocs */
 #define __GFP_THISNODE	((__force gfp_t)0x4u) /* No fallback, no policies */
+#define __GFP_PAGECACHE	((__force gfp_t)0x8u) /* Is page cache allocation ? */
 
 #define __GFP_BITS_SHIFT 20	/* Room for 20 __GFP_FOO bits */
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
Index: include/linux/pagemap.h
===================================================================
--- include/linux/pagemap.h	(revision 2645)
+++ include/linux/pagemap.h	(working copy)
@@ -62,7 +62,7 @@ static inline struct page *__page_cache_
 static inline struct page *page_cache_alloc(struct address_space *x)
 {
-	return __page_cache_alloc(mapping_gfp_mask(x));
+	return __page_cache_alloc(mapping_gfp_mask(x)|__GFP_PAGECACHE);
 }
 
 static inline struct page *page_cache_alloc_cold(struct address_space *x)
===================================================================
Welcome any comm
Re: O_DIRECT question
Helge Hafting <[EMAIL PROTECTED]> wrote:
> Michael Tokarev wrote:
>> But seriously - what about just disallowing non-O_DIRECT opens together
>> with O_DIRECT ones ?
>
> Please do not create a new local DOS attack.
> I open some important file, say /etc/resolv.conf
> with O_DIRECT and just sit on the open handle.
> Now nobody else can open that file because
> it is "busy" with O_DIRECT ?

Suspend O_DIRECT access while non-O_DIRECT-fds are open, fdatasync on close?
--
"Unix policy is to not stop root from doing stupid things because
 that would also stop him from doing clever things." - Andi Kleen
"It's such a fine line between stupid and clever" - Derek Smalls
Re: O_DIRECT question
On Fri, 12 January 2007 00:19:45 +0800, Aubrey wrote:
> Yes for desktop, server, but maybe not for embedded system, specially
> for no-mmu linux. In many embedded system cases, the whole system is
> running in the ram, including file system. So it's not necessary using
> page cache anymore. Page cache can't improve performance on these
> cases, but only fragment memory.

You were not very specific, so I have to guess that you're referring to the problem of having two copies of the same file in RAM - one in the page cache and one in the "backing store", which is just RAM.

There are two solutions to this problem. One is tmpfs, which doesn't use a backing store and keeps all data in the page cache. The other is xip, which doesn't use the page cache and goes directly to backing store. Unlike O_DIRECT, xip only works with a RAM or de-facto RAM backing store (NOR flash works read-only).

So if you really care about memory waste in embedded systems, you should have a look at mm/filemap_xip.c and continue Carsten Otte's work.

Jörn
--
Fantasy is more important than knowledge. Knowledge is limited, while fantasy embraces the whole world.
-- Albert Einstein
Re: O_DIRECT question
Michael Tokarev wrote:
> Chris Mason wrote:
> []
>> I recently spent some time trying to integrate O_DIRECT locking with
>> page cache locking. The basic theory is that instead of using semaphores
>> for solving O_DIRECT vs buffered races, you put something into the radix
>> tree (I call it a placeholder) to keep the page cache users out, and
>> lock any existing pages that are present.
>
> But seriously - what about just disallowing non-O_DIRECT opens together
> with O_DIRECT ones ?

Please do not create a new local DOS attack. I open some important file, say /etc/resolv.conf, with O_DIRECT and just sit on the open handle. Now nobody else can open that file because it is "busy" with O_DIRECT ?

Helge Hafting
Re: O_DIRECT question
Bill Davidsen <[EMAIL PROTECTED]> wrote: > My point is, that there is code to handle sparse data now, without > O_DIRECT involved, and if O_DIRECT bypasses that, it's not a problem > with the idea of O_DIRECT, the kernel has a security problem. The idea of O_DIRECT is to bypass the pagecache, and the pagecache is what provides the security against reading someone else's data using sparse files or partial-block-IO. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: O_DIRECT question
On Sat, 13 Jan 2007, Bill Davidsen wrote:
> Bodo Eggert wrote:
>
>> (*) This would allow fadvise_size(), too, which could reduce fragmentation
>> (and give an early warning on full disks) without forcing e.g. fat to
>> zero all blocks. OTOH, fadvise_size() would allow users to reserve the
>> complete disk space without his filesizes reflecting this.
>
> Please clarify how this would interact with quota, and why it wouldn't
> allow someone to run me out of disk.

I fell into the "write-will-never-fail"-pit. Therefore I have to talk about the original purpose, write with O_DIRECT, too.

- Reserved blocks should be taken out of the quota, since they are about to be written right now. If you emptied your quota doing this, it's your fault. If it was the group's quota, just run fast enough.-)

- If one write failed that extended the reserved range, the reserved area should be shrunk again. Obviously you'll need something clever here.
  * You can't shrink carelessly while there are O_DIRECT writes.
  * You can't just try to grab the semaphore[0] for writing; this will deadlock with other write()s.
  * If you drop the read lock, it will work out, because you aren't writing anymore, and if you get the write lock, there won't be anybody else writing. Therefore you can clear the reservation for the not-written blocks. You may unreserve blocks that should stay reserved, but that won't harm much. At worst, you'll get fragmentation, loss of speed and an aborted (because of no free space) write command. Document this, it's a feature.-)

- If you fadvise_size on a non-quota disk, you can possibly reserve it completely, without being the easy-to-spot offender. You can do the same by actually writing these files, keeping them open and unlinking them. The new quality is: you can't just look at the file sizes in /proc in order to spot the offender.
However, if you reflect the reserved blocks in the used-blocks-field of struct stat, du will work as expected and the BOFH will know whom to LART. BTW: If the fs supports holes, using du would be the right thing to do anyway. BTW2: I don't know if reserving without actually assigning blocks is supported or easy to support at all. These reservations are the result of "These blocks are not yet written, therefore they contain possibly secret data that would leak on failed writes, therefore they may not be actually assigned to the file before write finishes. They may not be on the free list either. And hey, if we support pre-reserving blocks to the file, we may additionally use it for fadvise_size. I'll mention that briefly." [0] r/w semaphore, read={r,w}_odirect, write=ftruncate -- Fun things to slip into your budget Paradigm pro-activator (a whole pack) (you mean beer?) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: O_DIRECT question
Michael Tokarev wrote:
> Bill Davidsen wrote:
>> [...]
> If I got it right (and please someone tell me if I *really* got it
> right!), the problem is elsewhere.
>
> Suppose you have a filesystem, not at all related to databases and stuff.
> Your usual root filesystem, with your /etc/ /var and so on directories.
> Some time ago you edited /etc/shadow, updating it by writing new file and
> renaming it to proper place. So you have that old content of your shadow
> file (now deleted) somewhere on the disk, but not accessible from the
> filesystem.
>
> Now, a bad guy deliberately tries to open some file on this filesystem,
> using O_DIRECT flag, ftruncates() it to some huge size (or does
> seek+write), and at the same time tries to use O_DIRECT read of the data.

Which should be identified and zeros returned. Consider: I open a file for database use, and legitimately seek to a location out at, say, 250MB, and then write at the location my hash says I should. That's all legitimate. Now when some backup program accesses the file sequentially, it gets a boatload of zeros, because Linux "knows" that is sparse data. Yes, the backup program should detect this as well, so what?

My point is, that there is code to handle sparse data now, without O_DIRECT involved, and if O_DIRECT bypasses that, it's not a problem with the idea of O_DIRECT; the kernel has a security problem.

> Due to all the races etc, it is possible for him to read that old content
> of /etc/shadow file you've deleted before.

>> I do have one thought, WRT reading uninitialized disk data. I would hope
>> that sparse files are handled right, and that when doing a write with
>> O_DIRECT the metadata is not updated until the write is done.
>
> "hope that sparse files are handled right" is a high hope. Exactly
> because this very place IS racy.

Other than assuring that a program can't read where no program has written, I don't see a problem. Anyone accessing the same file with multiple processes had better be doing user space coordination, and gets no sympathy from me if they don't. In this case, "works right" does not mean "works as expected," because the program has no right to assume the kernel will sort out poor implementations.

Without O_DIRECT the problem of doing ordered i/o in user space becomes very difficult, if not impossible, so "get rid of O_DIRECT" is the wrong direction. When the program can be sure the i/o is done, then cleverness in user space can see that it's done RIGHT.

--
bill davidsen <[EMAIL PROTECTED]> CTO TMR Associates, Inc Doing interesting things with small computers since 1979
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: O_DIRECT question
On 1/12/07, Andrew Morton <[EMAIL PROTECTED]> wrote:
> On Fri, 12 Jan 2007 15:35:09 -0700 Erik Andersen <[EMAIL PROTECTED]> wrote:
> > On Fri Jan 12, 2007 at 05:09:09PM -0500, Linus Torvalds wrote:
> > > I suspect a lot of people actually have other reasons to avoid caches.
> > >
> > > For example, the reason to do O_DIRECT may well not be that you want to
> > > avoid caching per se, but simply because you want to limit page cache
> > > activity. In which case O_DIRECT "works", but it's really the wrong thing
> > > to do. We could export other ways to do what people ACTUALLY want, that
> > > doesn't have the downsides.
> >
> > I was rather fond of the old O_STREAMING patch by Robert Love,
>
> That was an akpm patch which I did for the Digeo kernel. Robert picked it
> up to dehackify it and get it into mainline, but we ended up deciding that
> posix_fadvise() was the way to go because it's standards-based.
>
> It's a bit more work in the app to use posix_fadvise() well. But the
> results will be better. The app should also use sync_file_range()
> intelligently to control its pagecache use.

And there's an interesting note that I should add here, because there's a downside to using fadvise() instead of O_STREAMING when the programmer is not careful.

I spent at least a month doing some complex blktrace analysis to try to figure out why Digeo's new platform (which used the fadvise() call) didn't have the kind of streaming performance that it should have. One symptom I found was that even on the media partition, where I/O should have always been happening in nice 512K chunks (ra_pages == 128), it seemed to be happening in random sizes between 32K and 512K.

It turns out that the code pulls in some size chunk, maybe 32K, then does an fadvise DONTNEED on the fd, *with zero offset and zero length*, meaning that it wipes out *all* the pagecache for the file.

That means that the rest of the 512K from the readahead would get discarded before it got used, and later the remaining pages in the ra window would get faulted in again.

Most applications don't get the kind of performance analysis that Digeo was doing, and even then, it's rather lucky that we caught that. So I personally think it'd be best for libc or something to simulate the O_STREAMING behavior if you ask for it. That would simplify things for the most common case, and have the side benefit of reducing the amount of extra code an application would need in order to take advantage of that feature.

NATE
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: O_DIRECT question
Bill Davidsen wrote: > Linus Torvalds wrote: >> [] >> But what O_DIRECT does right now is _not_ really sensible, and the >> O_DIRECT propeller-heads seem to have some problem even admitting that >> there _is_ a problem, because they don't care. > > You say that as if it were a failing. Currently if you mix access via > O_DIRECT and non-DIRECT you can get unexpected results. You can screw > yourself, mangle your data, or have no problems at all if you avoid > trying to access the same bytes in multiple ways. There are lots of ways > to get or write stale data, not all involve O_DIRECT in any way, and the > people actually using O_DIRECT now are managing very well. > > I don't regard it as a system failing that I am allowed to shoot myself > in the foot, it's one of the benefits of Linux over Windows. Using > O_DIRECT now is like being your own lawyer, room for both creativity and > serious error. But what's there appears portable, which is important as > well. If I got it right (and please someone tell me if I *really* got it right!), the problem is elsewhere. Suppose you have a filesystem, not at all related to databases and stuff. Your usual root filesystem, with your /etc/ /var and so on directories. Some time ago you edited /etc/shadow, updating it by writing new file and renaming it to proper place. So you have that old content of your shadow file (now deleted) somewhere on the disk, but not accessible from the filesystem. Now, a bad guy deliberately tries to open some file on this filesystem, using O_DIRECT flag, ftruncates() it to some huge size (or does seek+write), and at the same time tries to use O_DIRECT read of the data. Due to all the races etc, it is possible for him to read that old content of /etc/shadow file you've deleted before. > I do have one thought, WRT reading uninitialized disk data. I would hope > that sparse files are handled right, and that when doing a write with > O_DIRECT the metadata is not updated until the write is done. 
"hope that sparse files are handled right" is a high hope. Exactly because this very place IS racy. Again, *IF* I got it correctly. /mjt - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: O_DIRECT question
Linus Torvalds wrote:
> On Sat, 13 Jan 2007, Michael Tokarev wrote:
>> (No, really - this load isn't entirely synthetic. It's a typical database
>> workload - random I/O all over, on a large file. If it can, it combines
>> several I/Os into one, by requesting more than a single block at a time,
>> but overall it is random.)
>
> My point is that you can get basically ALL THE SAME GOOD BEHAVIOUR without
> having all the BAD behaviour that O_DIRECT adds.
>
> For example, just the requirement that O_DIRECT can never create a file
> mapping, and can never interact with ftruncate would actually make
> O_DIRECT a lot more palatable to me. Together with just the requirement
> that an O_DIRECT open would literally disallow any non-O_DIRECT accesses,
> and flush the page cache entirely, would make all the aliases go away.
>
> At that point, O_DIRECT would be a way of saying "we're going to do
> uncached accesses to this pre-allocated file". Which is a half-way
> sensible thing to do.

But it's not necessary, it would break existing programs, and it would be incompatible with other OSes like AIX, BSD, and Solaris. And it doesn't provide the legitimate use for O_DIRECT in avoiding cache pollution when writing a LARGE file.

> But what O_DIRECT does right now is _not_ really sensible, and the
> O_DIRECT propeller-heads seem to have some problem even admitting that
> there _is_ a problem, because they don't care.

You say that as if it were a failing. Currently if you mix access via O_DIRECT and non-DIRECT you can get unexpected results. You can screw yourself, mangle your data, or have no problems at all if you avoid trying to access the same bytes in multiple ways. There are lots of ways to get or write stale data, not all involving O_DIRECT in any way, and the people actually using O_DIRECT now are managing very well.

I don't regard it as a system failing that I am allowed to shoot myself in the foot; it's one of the benefits of Linux over Windows. Using O_DIRECT now is like being your own lawyer: room for both creativity and serious error. But what's there appears portable, which is important as well.

I do have one thought, WRT reading uninitialized disk data. I would hope that sparse files are handled right, and that when doing a write with O_DIRECT the metadata is not updated until the write is done.

> A lot of DB people seem to simply not care about security or anything
> else. I'm trying to tell you that quoting numbers is pointless, when
> simply the CORRECTNESS of O_DIRECT is very much in doubt.

The guiding POSIX standard appears dead, and major DB programs which work on Linux run on AIX, Solaris, and BSD. That sounds like a good level of compatibility. I'm not sure what more correctness you would want beyond a proposed standard and common practice. It's tricky to use, like many other neat features.

I confess I have abused O_DIRECT by opening a file with O_DIRECT, fdopen()ing it for C, supplying my own large aligned buffer, and using that with an otherwise unmodified large program which uses fprintf(). That worked on all of the major UNIX variants as well.

--
bill davidsen <[EMAIL PROTECTED]> CTO TMR Associates, Inc Doing interesting things with small computers since 1979
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: O_DIRECT question
Bodo Eggert wrote: (*) This would allow fadvise_size(), too, which could reduce fragmentation (and give an early warning on full disks) without forcing e.g. fat to zero all blocks. OTOH, fadvise_size() would allow users to reserve the complete disk space without his filesizes reflecting this. Please clarify how this would interact with quota, and why it wouldn't allow someone to run me out of disk. -- bill davidsen <[EMAIL PROTECTED]> CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: O_DIRECT question
Linus Torvalds <[EMAIL PROTECTED]> wrote:
> On Sat, 13 Jan 2007, Michael Tokarev wrote:
>> (No, really - this load isn't entirely synthetic. It's a typical database
>> workload - random I/O all over, on a large file. If it can, it combines
>> several I/Os into one, by requesting more than a single block at a time,
>> but overall it is random.)
>
> My point is that you can get basically ALL THE SAME GOOD BEHAVIOUR without
> having all the BAD behaviour that O_DIRECT adds.
>
> For example, just the requirement that O_DIRECT can never create a file
> mapping,

That sounds sane, but the video streaming folks will be unhappy. Maybe you could do:

  reserve_space(); (*)
  do_write_odirect();
  update_filesize();

and only allow reads up to the current filesize? Of course, if you do ftruncate first and then write O_DIRECT, the holes will need to be filled before the corresponding blocks are assigned to the file. Either you'll zero them, or you can insert them into the file after the write.

Races:
- against other reads: May happen in any order; to-be-written pages are beyond the filesize (inaccessible), zeroed, or not yet assigned to the file.
- against other writes: No bad effect, since you don't unreserve mappings, and update_filesize won't shrink the file. You must, however, not reserve two chunks for the same location in the file unless you can handle replacing blocks of files. open(O_WRITE) without O_DIRECT is not allowed, therefore that can't race.
- against truncate: Yes, see below.

(*) This would allow fadvise_size(), too, which could reduce fragmentation (and give an early warning on full disks) without forcing e.g. fat to zero all blocks. OTOH, fadvise_size() would allow users to reserve the complete disk space without his filesizes reflecting this.

> and can never interact with ftruncate

ACK, r/w semaphore, read={r,w}_odirect, write=ftruncate?

> would actually make
> O_DIRECT a lot more palatable to me.
> Together with just the requirement
> that an O_DIRECT open would literally disallow any non-O_DIRECT accesses,
> and flush the page cache entirely, would make all the aliases go away.

That's probably the best semantics. Maybe you should allow O_READ for the backup people, maybe forcing O_DIRECT|O_ALLOWDOUBLEBUFFER (doing the extra copy in the kernel).

> At that point, O_DIRECT would be a way of saying "we're going to do
> uncached accesses to this pre-allocated file". Which is a half-way
> sensible thing to do.

And I'd bet nobody would notice these changes unless they try inherently stupid things.

> But what O_DIRECT does right now is _not_ really sensible, and the
> O_DIRECT propeller-heads seem to have some problem even admitting that
> there _is_ a problem, because they don't care.

It's a hammer - having it will make anything look like a nail, and there is nothing wrong with hammering a nail!!! .-)

> A lot of DB people seem to simply not care about security or anything
> else. I'm trying to tell you that quoting numbers is pointless, when
> simply the CORRECTNESS of O_DIRECT is very much in doubt.

The only thing you'll need for correct database behaviour is: if one process has completed its write and the next process opens that file, it must read the current contents. Races with normal reads and writes, races with truncate - don't do that then. You wouldn't expect "cat somefile > database.dat" on a running db to be a good thing either, no matter whether O_DIRECT is used or not.

--
Funny quotes: 3. On the other hand, you have different fingers.
Friß, Spammer: [EMAIL PROTECTED]
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: O_DIRECT question
Bill Davidsen wrote:
> The point is that if you want to be able to allocate at all, sometimes you
> will have to write dirty pages, garbage collect, and move or swap programs.
> The hardware is just too limited to do something less painful, and the user
> can't see memory to do things better. Linus is right, 'Claiming that there
> is a "proper solution" is usually a total red herring. Quite often there
> isn't, and the "paper over" is actually not papering over, it's quite
> possibly the best solution there is.'
>
> I think any solution is going to be ugly, unfortunately.

It seems quite robust and clean to me, actually. Any userspace memory that absolutely must be in large contiguous regions has to be allocated at boot, or from a pool reserved at boot. All other allocations can be broken into smaller ones.

Write dirty pages, garbage collect, move or swap programs isn't going to be robust, because there is lots of vital kernel memory that cannot be moved and will cause fragmentation.

The reclaimable-zone work that went on a while ago for hugepages is exactly how you would also fix this problem and still have a reasonable degree of flexibility at runtime. It isn't really ugly or hard, compared with some of the non-working "solutions" that have been proposed. The other good thing is that the core mm already has practically everything required, so the functionality is unintrusive.

--
SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: O_DIRECT question
On Fri, 12 Jan 2007 15:35:09 -0700 Erik Andersen <[EMAIL PROTECTED]> wrote:
> On Fri Jan 12, 2007 at 05:09:09PM -0500, Linus Torvalds wrote:
> > I suspect a lot of people actually have other reasons to avoid caches.
> >
> > For example, the reason to do O_DIRECT may well not be that you want to
> > avoid caching per se, but simply because you want to limit page cache
> > activity. In which case O_DIRECT "works", but it's really the wrong thing
> > to do. We could export other ways to do what people ACTUALLY want, that
> > doesn't have the downsides.
>
> I was rather fond of the old O_STREAMING patch by Robert Love,

That was an akpm patch which I did for the Digeo kernel. Robert picked it up to dehackify it and get it into mainline, but we ended up deciding that posix_fadvise() was the way to go because it's standards-based.

It's a bit more work in the app to use posix_fadvise() well. But the results will be better. The app should also use sync_file_range() intelligently to control its pagecache use.

The problem with all of these things is that the application needs to be changed, and people often cannot do that. If we want a general way of stopping particular apps from swamping pagecache then it'd really need to be an externally-imposed thing - probably via additional accounting and a new rlimit.
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: O_DIRECT question
On Fri Jan 12, 2007 at 05:09:09PM -0500, Linus Torvalds wrote:
> I suspect a lot of people actually have other reasons to avoid caches.
>
> For example, the reason to do O_DIRECT may well not be that you want to
> avoid caching per se, but simply because you want to limit page cache
> activity. In which case O_DIRECT "works", but it's really the wrong thing
> to do. We could export other ways to do what people ACTUALLY want, that
> doesn't have the downsides.

I was rather fond of the old O_STREAMING patch by Robert Love, which added an open() flag telling the kernel to not keep data from the current file in cache, by dropping pages from the pagecache before the current index.

O_STREAMING was very nice for when you know you want to read a large file sequentially, without polluting the rest of the cache with GB of data that you plan to read only once and then discard. It worked nicely at doing what many people want to use O_DIRECT for. Using O_STREAMING you would get normal read/write semantics, since you still had the pagecache caching your data - but only the not-yet-written write-behind data and the not-yet-read read-ahead data. With the additional hint, the kernel would drop free-able pages from the pagecache behind the current position, because we know we will never want them again. I thought that was a very nice way of handling things.

-Erik
--
Erik B. Andersen http://codepoet-consulting.com/ --This message was written using 73% post-consumer electrons--
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: O_DIRECT question
Linus Torvalds wrote:
> On Sat, 13 Jan 2007, Michael Tokarev wrote:
>>> At that point, O_DIRECT would be a way of saying "we're going to do
>>> uncached accesses to this pre-allocated file". Which is a half-way
>>> sensible thing to do.
>> Half-way?
>
> I suspect a lot of people actually have other reasons to avoid caches.
>
> For example, the reason to do O_DIRECT may well not be that you want to
> avoid caching per se, but simply because you want to limit page cache
> activity. In which case O_DIRECT "works", but it's really the wrong thing
> to do. We could export other ways to do what people ACTUALLY want, that
> doesn't have the downsides.
>
> For example, the page cache is absolutely required if you want to mmap.
> There's no way you can do O_DIRECT and mmap at the same time and expect
> any kind of sane behaviour. It may not be what a DB wants to use, but it's
> an example of where O_DIRECT really falls down.

That's only when the two are used on the same part of a file. If not, and if the file is "divided" on a proper boundary (sector/page/whatever-aligned), there are no issues - at least not if all the blocks of the file have been allocated (no gaps, that is). What I was referring to in my last email - and said it's a corner case - is: mmap() the start of a file, say the first megabyte of it, where some index/bitmap is located, and use direct I/O on the rest. So the two don't overlap. Still problematic?

>>> But what O_DIRECT does right now is _not_ really sensible, and the
>>> O_DIRECT propeller-heads seem to have some problem even admitting that
>>> there _is_ a problem, because they don't care.
>> Well. In fact, there's NO problems to admit.
>>
>> Yes, yes, yes yes - when you think about it from a general point of
>> view, and think how non-O_DIRECT and O_DIRECT access fits together,
>> it's a complete mess, and you're 100% right it's a mess.
>
> You can't admit that even O_DIRECT _without_ any non-O_DIRECT actually
> fails in many ways right now.
>
> I've already mentioned ftruncate and block allocation. You don't seem to
> understand that those are ALSO a problem.

I do understand this. And this, too, is solved right now in userspace. For example, when Oracle allocates a file for its data, or when it extends the file, it writes something to every block of the new space (using O_DIRECT while at it, but that's a different story). The thing is: while it is doing that, no process tries to do anything with that (part of a) file (not counting some external processes run by evil hackers ;) So there's still no race or fundamental brokenness *in usage*.

It uses ftruncate() to create or extend a file, *and* does O_DIRECT writes to force block allocations. That's probably not right, and that alone is probably difficult to implement in kernel (I just don't know; what I know for sure is that this way is very slow on ext3). Maybe because there's no way to tell the kernel something like "set the file size to this and actually *allocate* space for it" (short of writing something to the file).

What I dislike very much is half-solutions. And current O_DIRECT indeed looks like half a solution, because sometimes it works, and sometimes, in a *wrong* usage scenario, it doesn't, or is racy, etc. - and the kernel *allows* such a wrong scenario. A software should either work correctly, or disallow a usage where it can't guarantee correctness. Currently the kernel allows incorrect usage, and that, plus all the ugly things done in code in attempts to fix that, sucks. But the whole thing is not (fundamentally) broken.

/mjt
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: O_DIRECT question
On Sat, 13 Jan 2007, Michael Tokarev wrote: > > > > At that point, O_DIRECT would be a way of saying "we're going to do > > uncached accesses to this pre-allocated file". Which is a half-way > > sensible thing to do. > > Half-way? I suspect a lot of people actually have other reasons to avoid caches. For example, the reason to do O_DIRECT may well not be that you want to avoid caching per se, but simply because you want to limit page cache activity. In which case O_DIRECT "works", but it's really the wrong thing to do. We could export other ways to do what people ACTUALLY want, that doesn't have the downsides. For example, the page cache is absolutely required if you want to mmap. There's no way you can do O_DIRECT and mmap at the same time and expect any kind of sane behaviour. It may not be what a DB wants to use, but it's an example of where O_DIRECT really falls down. > > But what O_DIRECT does right now is _not_ really sensible, and the > > O_DIRECT propeller-heads seem to have some problem even admitting that > > there _is_ a problem, because they don't care. > > Well. In fact, there's NO problems to admit. > > Yes, yes, yes yes - when you think about it from a general point of > view, and think how non-O_DIRECT and O_DIRECT access fits together, > it's a complete mess, and you're 100% right it's a mess. You can't admit that even O_DIRECT _without_ any non-O_DIRECT actually fails in many ways right now. I've already mentioned ftruncate and block allocation. You don't seem to understand that those are ALSO a problem. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Disk Cache, Was: O_DIRECT question
Zan Lynx wrote:
> On Sat, 2007-01-13 at 00:03 +0300, Michael Tokarev wrote:
> [snip]
>> And sure thing, withOUT O_DIRECT, the whole system is almost dead under
>> this load - because everything is thrown away from the cache, even caches
>> of /bin /usr/bin etc... ;) (For that, fadvise() seems to help a bit, but
>> not a lot.)
>
> One thing that I've been using, and seems to work well, is a customized
> version of the readahead program several distros use during boot up.
[idea to lock some (commonly-used) cache pages in memory]
> Something like that could keep your system responsive no matter what the
> disk cache is doing otherwise.

Unfortunately it's not. Sure, things like libc.so etc. will be force-cached and will start fast. But not my data files and other stuff (what an unfortunate thing: memory usually is smaller in size than disks ;)

I can do usual work without noticing something's working with the disks intensively, doing O_DIRECT I/O. For example, I can run a large report on a database, which requires a lot of disk I/O, and run a kernel compile at the same time. Sure, disk access is a lot slower, but the disk cache helps a lot, too. My kernel compile will not be much slower than usual. But if I turn O_DIRECT off, the compile will take ages to finish. *And* the report running, too! Because the system tries hard to cache the WRONG pages!

(Yes, I remember fadvise & Co - which aren't used by the database(s) currently, and quite a lot of words has been said about that, too; I also noticed it's slower as well, at least currently.)

/mjt
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Disk Cache, Was: O_DIRECT question
On Sat, 2007-01-13 at 00:03 +0300, Michael Tokarev wrote: [snip] > And sure thing, withOUT O_DIRECT, the whole system is almost dead under this > load - because everything is thrown away from the cache, even caches of /bin > /usr/bin etc... ;) (For that, fadvise() seems to help a bit, but not a lot). One thing that I've been using, and seems to work well, is a customized version of the readahead program several distros use during boot up. Mine starts off doing: mlockall(MCL_CURRENT|MCL_FUTURE); ...yadda, yadda... and for each file listed: ...open, stat stuff... if( MAP_FAILED == mmap( NULL, stat_buf.st_size, PROT_READ, MAP_SHARED|MAP_LOCKED|MAP_POPULATE, fd, 0) ) { fprintf(stderr, "'%s' ", file); perror("mmap"); } ...more stuff... and then ends with: pause(); and it sits there forever. As far as I can tell, this makes the program and library code stay in RAM. At least, after a drop_caches nautilus doesn't load 12 MB off disk, it just starts. It has to be reloaded after software updates and after prelinking. I find the 250 MB used to be worthwhile, even if it's kinda Windowsey. Something like that could keep your system responsive no matter what the disk cache is doing otherwise. -- Zan Lynx <[EMAIL PROTECTED]>
Re: O_DIRECT question
Linus Torvalds wrote: [] > My point is that you can get basically ALL THE SAME GOOD BEHAVIOUR without > having all the BAD behaviour that O_DIRECT adds. *This* point I got from the beginning, once I tried to think how it all is done internally (I never thought about that, because I'm not a kernel hacker to start with) -- currently, linux has ugly/racy places which are either difficult or impossible to fix, all due to this O_DIRECT thing which interacts badly with other access "methods". > For example, just the requirement that O_DIRECT can never create a file > mapping, and can never interact with ftruncate would actually make > O_DIRECT a lot more palatable to me. Together with just the requirement > that an O_DIRECT open would literally disallow any non-O_DIRECT accesses, > and flush the page cache entirely, would make all the aliases go away. > > At that point, O_DIRECT would be a way of saying "we're going to do > uncached accesses to this pre-allocated file". Which is a half-way > sensible thing to do. Half-way? > But what O_DIRECT does right now is _not_ really sensible, and the > O_DIRECT propeller-heads seem to have some problem even admitting that > there _is_ a problem, because they don't care. Well. In fact, there's NO problem to admit. Yes, yes, yes yes - when you think about it from a general point of view, and think how non-O_DIRECT and O_DIRECT access fit together, it's a complete mess, and you're 100% right it's a mess. But. Those damn "database people" don't mix and match the two accesses together (I'm not one of them, either - I'm just trying to use a DB product on linux). So there's just no issue. The solution to in-kernel races and problems in this case is the usage scenario, and in following simple usage rules. Basically, the above requirement - "don't mix&match the two together" - is implemented in userspace (yes, there's no guarantee that someone/thing will not do some evil thing, but that's controlled by file permissions). 
That is, database software itself will not try to use the thing in a wrong way. Simple as that. > A lot of DB people seem to simply not care about security or anything > else. I'm trying to tell you that quoting numbers is > pointless, when simply the CORRECTNESS of O_DIRECT is very much in doubt. When done properly - be it in user- or kernel-space, it IS correct. No database people are ftruncating a file *and* reading past the end of it at the same time for example, and don't mix-n-match cached and direct io, at least not for the same part of a file (if there are, they're really braindead, or it's just a plain bug). > I can calculate PI to a billion decimal places in my head in .1 seconds. > If you don't care about the CORRECTNESS of the result, that is. > > See? It's not about performance. It's about O_DIRECT being fundamentally > broken as it behaves right now. I recall again the above: the actual USAGE of O_DIRECT, as implemented in database software, tries to ensure there's no brokenness, especially fundamental brokenness, just by not performing parallel direct/non-direct reads/writes/truncates. This way, the thing Just Works, works *correctly* (provided there's no bugs all the way down to a device), *and* works *fast*. By the way, I can think of some useful cases where *parts* of a file are mmap()ed (even for RW access), and parts are being read/written with O_DIRECT. But that's probably some corner cases. /mjt
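[Editor's note: the usage discipline described above - O_DIRECT only, aligned buffers, aligned offsets, nothing else touching the file - can be sketched as below. This is illustrative, not from the thread: `direct_write_block()` is a made-up helper, 512 bytes is a conservative alignment assumption (the real granularity depends on device and filesystem), and the open may simply fail with EINVAL on filesystems that don't support O_DIRECT.]

```c
#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLK 512                 /* assumed alignment granularity */

/* Write one BLK-sized block at a BLK-aligned offset; 0 on success.
 * Buffer, length and offset are all BLK-aligned, as O_DIRECT needs. */
static int direct_write_block(const char *path, off_t off, const void *data)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0)
        return -1;              /* e.g. O_DIRECT unsupported on this fs */

    void *buf;
    if (posix_memalign(&buf, BLK, BLK) != 0) { close(fd); return -1; }
    memcpy(buf, data, BLK);

    ssize_t n = pwrite(fd, buf, BLK, off);  /* off must be BLK-aligned */
    free(buf);
    close(fd);
    return n == (ssize_t)BLK ? 0 : -1;
}
```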
Re: O_DIRECT question
On Sat, 13 Jan 2007, Michael Tokarev wrote: > > (No, really - this load isn't entirely synthetic. It's a typical database > workload - random I/O all over, on a large file. If it can, it combines > several I/Os into one, by requesting more than a single block at a time, > but overall it is random.) My point is that you can get basically ALL THE SAME GOOD BEHAVIOUR without having all the BAD behaviour that O_DIRECT adds. For example, just the requirement that O_DIRECT can never create a file mapping, and can never interact with ftruncate would actually make O_DIRECT a lot more palatable to me. Together with just the requirement that an O_DIRECT open would literally disallow any non-O_DIRECT accesses, and flush the page cache entirely, would make all the aliases go away. At that point, O_DIRECT would be a way of saying "we're going to do uncached accesses to this pre-allocated file". Which is a half-way sensible thing to do. But what O_DIRECT does right now is _not_ really sensible, and the O_DIRECT propeller-heads seem to have some problem even admitting that there _is_ a problem, because they don't care. A lot of DB people seem to simply not care about security or anything else. I'm trying to tell you that quoting numbers is pointless, when simply the CORRECTNESS of O_DIRECT is very much in doubt. I can calculate PI to a billion decimal places in my head in .1 seconds. If you don't care about the CORRECTNESS of the result, that is. See? It's not about performance. It's about O_DIRECT being fundamentally broken as it behaves right now. Linus
Re: O_DIRECT question
Michael Tokarev wrote: > Michael Tokarev wrote: > By the way. I just ran - for fun - a read test of a raid array. > > Reading blocks of size 512kbytes, starting at random places on a 400Gb > array, doing 64threads. > > O_DIRECT: 336.73 MB/sec. > !O_DIRECT: 146.00 MB/sec. And when turning off read-ahead, the speed dropped to 30 MB/sec. Read-ahead should not help here, I think... But after analyzing the "randomness" a bit, it turned out a lot of requests were coming to places "near" the ones which had been read recently. After switching to another random number generator, speed in the case WITH readahead enabled dropped to almost 5 MB/sec ;) And sure thing, withOUT O_DIRECT, the whole system is almost dead under this load - because everything is thrown away from the cache, even caches of /bin /usr/bin etc... ;) (For that, fadvise() seems to help a bit, but not a lot). (No, really - this load isn't entirely synthetic. It's a typical database workload - random I/O all over, on a large file. If it can, it combines several I/Os into one, by requesting more than a single block at a time, but overall it is random.) /mjt
Re: O_DIRECT question
Michael Tokarev wrote: [] > After all the explanations, I still don't see anything wrong with the > interface itself. O_DIRECT isn't "different semantics" - we're still > writing and reading some data. Yes, O_DIRECT and non-O_DIRECT usages > somewhat contradict each other, but there are other ways to make > the two happy, instead of introducing a lot of stupid, complex, and racy > code all over. By the way. I just ran - for fun - a read test of a raid array. Reading blocks of size 512kbytes, starting at random places on a 400Gb array, doing 64threads. O_DIRECT: 336.73 MB/sec. !O_DIRECT: 146.00 MB/sec. Quite a... difference here. Using posix_fadvise() does not improve it. /mjt
Re: O_DIRECT question
Chris Mason wrote: [] > I recently spent some time trying to integrate O_DIRECT locking with > page cache locking. The basic theory is that instead of using > semaphores for solving O_DIRECT vs buffered races, you put something > into the radix tree (I call it a placeholder) to keep the page cache > users out, and lock any existing pages that are present. But seriously - what about just disallowing non-O_DIRECT opens together with O_DIRECT ones ? If the thing will allow non-DIRECT READ-ONLY opens, I personally see no problems whatsoever, at all. If non-DIRECT READ-ONLY opens will be disallowed too, -- well, a bit less nice, but still workable (allowing online backup of database files opened in O_DIRECT mode using other tools such as `cp' -- if non-direct opens aren't allowed, I'll switch to using dd or some such). Yes there may still be a race between ftruncate() and reads (either direct or not), or when filling gaps by writing into places which were skipped by using ftruncate. I don't know how serious those races are. That is to say - if the whole thing will be a bit more strict wrt the allowed set of operations, races (or some of them, anyway) will just go away (and maybe it will work even better due to quite some code and lock contention removal), and maybe after that, Linus will like the whole thing a bit better... ;) After all the explanations, I still don't see anything wrong with the interface itself. O_DIRECT isn't "different semantics" - we're still writing and reading some data. Yes, O_DIRECT and non-O_DIRECT usages somewhat contradict each other, but there are other ways to make the two happy, instead of introducing a lot of stupid, complex, and racy code all over. /mjt
Re: O_DIRECT question
On Fri, Jan 12, 2007 at 10:06:22AM -0800, Linus Torvalds wrote: > > looking at the splice(2) api it seems like it'll be difficult to implement > > O_DIRECT pread/pwrite from userland using splice... so there'd need to be > > some help there. > > You'd use vmsplice() to put the write buffers into kernel space (user > space sees it's a pipe file descriptor, but you should just ignore that: > it's really just a kernel buffer). And then splice the resulting kernel > buffers to the destination. I recently spent some time trying to integrate O_DIRECT locking with page cache locking. The basic theory is that instead of using semaphores for solving O_DIRECT vs buffered races, you put something into the radix tree (I call it a placeholder) to keep the page cache users out, and lock any existing pages that are present. O_DIRECT does save cpu from avoiding copies, but it also saves cpu from fewer radix tree operations during massive IOs. The cost of radix tree insertion/deletion on 1MB O_DIRECT ios added ~10% system time on my tiny little dual core box. I'm sure it would be much worse if there was lock contention on a big numa machine, and it grows as the io grows (SGI does massive O_DIRECT ios). To help reduce radix churn, I made it possible for a single placeholder entry to lock down a range in the radix: http://thread.gmane.org/gmane.linux.file-systems/12263 It looks to me as though vmsplice is going to have the same issues as my early patches. The current splice code can avoid the copy but is still working in page sized chunks. Also, splice doesn't support zero copy on things smaller than page sized chunks. The compromise my patch makes is to hide placeholders from almost everything except the DIO code. It may be worthwhile to turn the placeholders into an IO marker that can be useful to filemap_fdatawrite and friends. 
It should be able to:
- record the userland/kernel pages involved in a given io
- map blocks from the FS for making a bio
- start the io
- wake people up when the io is done
This would allow splice to operate without stealing the userland page (stealing would still be an option of course), and could get rid of big chunks of fs/direct-io.c. -chris
Re: O_DIRECT question
On Thu, 11 Jan 2007, dean gaudet wrote: > > it seems to me that if splice and fadvise and related things are > sufficient for userland to take care of things "properly" then O_DIRECT > could be changed into splice/fadvise calls either by a library or in the > kernel directly... The problem is two-fold: - the fact that databases use O_DIRECT and all the commercial people are perfectly happy to use a totally idiotic interface (and they don't care about the problems) means that things like fadvise() don't actually get the TLC. For example, the USEONCE thing isn't actually _implemented_, even though from a design standpoint, it would in many ways be preferable over O_DIRECT. It's not just fadvise. It's a general problem for any new interfaces where the old interfaces "just work" - never mind if they are nasty. And O_DIRECT isn't actually all that nasty for users (although the alignment restrictions are obviously irritating, but they are mostly fundamental _hardware_ alignment restrictions, so..). It's only nasty from a kernel internal security/serialization standpoint. So in many ways, apps don't want to change, because they don't really see the problems. (And, as seen in this thread: uses like NFS don't see the problems either, because there the serialization is done entirely somewhere *else*, so the NFS people don't even understand why the whole interface sucks in the first place) - a lot of the reasons for problems for O_DIRECT are the semantics. If we could easily implement the O_DIRECT semantics using something else, we would. But it's semantically not allowed to steal the user page, and it has to wait for it to be all done with, because those are the semantics of "write()". 
So one of the advantages of vmsplice() and friends is literally that it could allow page stealing, and allow the semantics where any changes to the page (in user space) might make it to disk _after_ vmsplice() has actually already returned, because we literally re-use the page (ie it's fundamentally an async interface). But again, fadvise and vmsplice etc aren't even getting the attention, because right now they are only used by small programs (and generally not done by people who also work on the kernel, and can see that it really would be better to use more natural interfaces). > looking at the splice(2) api it seems like it'll be difficult to implement > O_DIRECT pread/pwrite from userland using splice... so there'd need to be > some help there. You'd use vmsplice() to put the write buffers into kernel space (user space sees it's a pipe file descriptor, but you should just ignore that: it's really just a kernel buffer). And then splice the resulting kernel buffers to the destination. Linus
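[Editor's note: the vmsplice()-then-splice() write path Linus describes looks roughly like this. A minimal sketch, not from the thread: `splice_write()` is an illustrative name, the page-stealing gift semantics (SPLICE_F_GIFT) are deliberately left out, and as written it only handles payloads that fit in the pipe buffer (64 KB by default) in one shot.]

```c
#define _GNU_SOURCE             /* for vmsplice()/splice() */
#include <fcntl.h>
#include <sys/uio.h>
#include <unistd.h>

/* Push len bytes (len <= the pipe capacity) from a user buffer into
 * fd at offset off, via a pipe: user pages -> pipe -> file. */
static ssize_t splice_write(int fd, const void *buf, size_t len, off_t off)
{
    int p[2];
    if (pipe(p) < 0)
        return -1;

    struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
    loff_t o = off;
    ssize_t out = -1;

    ssize_t in = vmsplice(p[1], &iov, 1, 0);    /* user pages -> pipe */
    close(p[1]);
    if (in == (ssize_t)len)
        out = splice(p[0], NULL, fd, &o,        /* pipe -> file */
                     len, SPLICE_F_MOVE);
    close(p[0]);
    return out;
}
```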
Re: O_DIRECT question
Linus Torvalds wrote: O_DIRECT is still crazily racy versus pagecache operations. >>> >>>Yes. O_DIRECT is really fundamentally broken. There's just no way to fix >>>it sanely. >> >>How about aliasing O_DIRECT to POSIX_FADV_NOREUSE (sortof) ? > > > That is what I think some users could do. If the main issue with O_DIRECT > is the page cache allocations, if we instead had better (read: "any") > support for POSIX_FADV_NOREUSE, one class of reasons for O_DIRECT usage would > just go away. > > See also the patch that Roy Huang posted about another approach to the > same problem: just limiting page cache usage explicitly. > > That's not the _only_ issue with O_DIRECT, though. It's one big one, but > people like to think that the memory copy makes a difference when you do > IO too (I think it's likely pretty debatable in real life, but I'm totally > certain you can benchmark it, probably even pretty easily especially if > you have fairly studly IO capabilities and a CPU that isn't quite as > studly). > > So POSIX_FADV_NOREUSE kind of support is one _part_ of the O_DIRECT > picture, and depending on your problems (in this case, the embedded world) > it may even be the *biggest* part. But it's not the whole picture. From 2.6.19 sources it looks like POSIX_FADV_NOREUSE is a no-op there > Linus
Re: O_DIRECT question
Linus Torvalds wrote: >>OK, madvise() used with mmap'ed file allows to have reads from a file >>with zero-copy between kernel/user buffers and don't pollute cache >>memory unnecessarily. But how about writes? How is to do zero-copy >>writes to a file and don't pollute cache memory without using O_DIRECT? >>Do I miss the appropriate interface? > > > mmap()+msync() can do that too. Sorry, I wasn't sufficiently clear. Mmap()+msync() can't be used for that if the data to be written comes from some external source, like video capturing hardware, which DMAs data directly into user space buffers. Using the mmap'ed area for those DMA buffers doesn't look like a good idea, because, e.g., it will involve unneeded disk reads on the first page faults. So, some O_DIRECT-like interface should exist in the system. Also, as Michael Tokarev noted, operations over mmap'ed areas don't provide good ways for error handling, which effectively makes them unusable for something serious. > Also, regular user-space page-aligned data could easily just be moved into > the page cache. We actually have a lot of the infrastructure for it. See > the "splice()" system call. It's just not very widely used, and the > "drop-behind" behaviour (to then release the data) isn't there. And I bet > that there's lots of work needed to make it work well in practice, but > from a conceptual standpoint the O_DIRECT method really is just about the > *worst* way to do things. splice() needs 2 file descriptors, but looking at it I've found the vmsplice() syscall, which, it seems, can do the needed actions, although I'm not sure it can work with files and zero-copy. Thanks for pointing out those interfaces.
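[Editor's note: for reference, the mmap()+msync() write path being debated looks like this. A minimal sketch with an illustrative name (`mmap_write`): extend the file with ftruncate(), map it shared, write into the mapping, msync() it out. As the mail above notes, error reporting on this path is weaker than write() - msync() can succeed while a later background writeback fails.]

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Write len bytes to path via a shared mapping; 0 on success. */
static int mmap_write(const char *path, const void *data, size_t len)
{
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, len) < 0) { close(fd); return -1; }

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { close(fd); return -1; }

    memcpy(p, data, len);              /* dirty the pages */
    int rc = msync(p, len, MS_SYNC);   /* synchronous writeback */
    munmap(p, len);
    close(fd);
    return rc;
}
```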
Re: O_DIRECT question
dean gaudet wrote: it seems to me that if splice and fadvise and related things are sufficient for userland to take care of things "properly" then O_DIRECT could be changed into splice/fadvise calls either by a library or in the kernel directly... No, because the semantics are entirely different. An application using read/write with O_DIRECT expects read() to block until data is physically fetched from the device. fadvise() does not FORCE the kernel to discard cache, it only hints that it should, so a read() or mmap() very well may reuse a cached page instead of fetching from the disk again. The application also expects write() to block until the data is on the disk. In the case of a blocking write, you could splice/msync, but what about aio?
Re: O_DIRECT question
Hua Zhong wrote: The other problem besides the inability to handle IO errors is that mmap()+msync() is synchronous. You need to go async to keep the pipelines full. msync(addr, len, MS_ASYNC); doesn't do what you want? No, because there is no notification of completion. In fact, does this call actually even avoid blocking in the current code, while asking the kernel to flush the pages in the background? Even if it performs the sync in the background, what about faulting in the pages to be synced? For instance, if you splice pages from a source mmaped file into the destination mmap, then msync on the destination, doesn't the process still block to fault in the source pages?
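[Editor's note: one Linux-specific way to get the "start writeback now, wait for completion later" behaviour Phillip is asking for - which msync(MS_ASYNC) alone does not give - is sync_file_range(). This is a sketch of that swapped-in technique, not something proposed in the thread; the helper names are illustrative, and sync_file_range() is explicitly not a durability barrier (it flushes data pages only, no metadata or journal).]

```c
#define _GNU_SOURCE             /* for sync_file_range() */
#include <fcntl.h>

/* Kick off writeback of [off, off+len) without waiting for it. */
static int start_writeback(int fd, off_t off, off_t len)
{
    return sync_file_range(fd, off, len, SYNC_FILE_RANGE_WRITE);
}

/* Block until all dirty pages in the range have been written. */
static int wait_writeback(int fd, off_t off, off_t len)
{
    return sync_file_range(fd, off, len,
                           SYNC_FILE_RANGE_WAIT_BEFORE |
                           SYNC_FILE_RANGE_WRITE |
                           SYNC_FILE_RANGE_WAIT_AFTER);
}
```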
Re: O_DIRECT question
Aubrey wrote: On 1/12/07, Nick Piggin <[EMAIL PROTECTED]> wrote: Linus Torvalds wrote: > > On Fri, 12 Jan 2007, Nick Piggin wrote: > >>We are talking about about fragmentation. And limiting pagecache to try to >>avoid fragmentation is a bandaid, especially when the problem can be solved >>(not just papered over, but solved) in userspace. > > > It's not clear that the problem _can_ be solved in user space. > > It's easy enough to say "never allocate more than a page". But it's often > not REALISTIC. > > Very basic issue: the perfect is the enemy of the good. Claiming that > there is a "proper solution" is usually a total red herring. Quite often > there isn't, and the "paper over" is actually not papering over, it's > quite possibly the best solution there is. Yeah *smallish* higher order allocations are fine, and we use them all the time for things like stacks or networking. But Aubrey (who somehow got removed from the cc list) wants to do order 9 allocations from userspace in his nommu environment. I'm just trying to be realistic when I say that this isn't going to be robust and a userspace solution is needed. Hmm..., aside from big order allocations from user space, if there is a large application we need to run, it should be loaded into the memory, so we have to allocate a big block to accommodate it. kernel fun like load_elf_fdpic_binary() etc will request contiguous memory, then if vfs eat up free memory, loading fails. Before we had virtual memory we had only a base address register, start at this location and go thus far, and user program memory had to be contiguous. To change a program size, all other programs might be moved, either by memory copy or actual swap to disk if total memory became a problem. To minimize the pain, programs were loaded at one end of memory, and system buffers and such were allocated at the other. That allowed the most recently loaded program the best chance of being able to grow without thrashing. 
The point is that if you want to be able to allocate at all, sometimes you will have to write dirty pages, garbage collect, and move or swap programs. The hardware is just too limited to do something less painful, and the user can't see memory to do things better. Linus is right, 'Claiming that there is a "proper solution" is usually a total red herring. Quite often there isn't, and the "paper over" is actually not papering over, it's quite possibly the best solution there is.' I think any solution is going to be ugly, unfortunately. -- bill davidsen <[EMAIL PROTECTED]> CTO TMR Associates, Inc Doing interesting things with small computers since 1979
Re: O_DIRECT question
On Thu, 11 Jan 2007, Linus Torvalds wrote: > On Thu, 11 Jan 2007, Viktor wrote: > > > > OK, madvise() used with mmap'ed file allows to have reads from a file > > with zero-copy between kernel/user buffers and don't pollute cache > > memory unnecessarily. But how about writes? How is to do zero-copy > > writes to a file and don't pollute cache memory without using O_DIRECT? > > Do I miss the appropriate interface? > > mmap()+msync() can do that too. > > Also, regular user-space page-aligned data could easily just be moved into > the page cache. We actually have a lot of the infrastructure for it. See > the "splice()" system call. it seems to me that if splice and fadvise and related things are sufficient for userland to take care of things "properly" then O_DIRECT could be changed into splice/fadvise calls either by a library or in the kernel directly... looking at the splice(2) api it seems like it'll be difficult to implement O_DIRECT pread/pwrite from userland using splice... so there'd need to be some help there. i'm probably missing something. -dean
Re: O_DIRECT question
On 1/12/07, Nick Piggin <[EMAIL PROTECTED]> wrote: Linus Torvalds wrote: > > On Fri, 12 Jan 2007, Nick Piggin wrote: > >>We are talking about fragmentation. And limiting pagecache to try to >>avoid fragmentation is a bandaid, especially when the problem can be solved >>(not just papered over, but solved) in userspace. > > > It's not clear that the problem _can_ be solved in user space. > > It's easy enough to say "never allocate more than a page". But it's often > not REALISTIC. > > Very basic issue: the perfect is the enemy of the good. Claiming that > there is a "proper solution" is usually a total red herring. Quite often > there isn't, and the "paper over" is actually not papering over, it's > quite possibly the best solution there is. Yeah *smallish* higher order allocations are fine, and we use them all the time for things like stacks or networking. But Aubrey (who somehow got removed from the cc list) wants to do order 9 allocations from userspace in his nommu environment. I'm just trying to be realistic when I say that this isn't going to be robust and a userspace solution is needed. Hmm..., aside from big order allocations from user space, if there is a large application we need to run, it should be loaded into memory, so we have to allocate a big block to accommodate it. Kernel functions like load_elf_fdpic_binary() etc. will request contiguous memory; then if the vfs eats up free memory, loading fails. -Aubrey
Re: O_DIRECT question
On Fri, 12 Jan 2007, Nick Piggin wrote: > > Yeah *smallish* higher order allocations are fine, and we use them all the > time for things like stacks or networking. > > But Aubrey (who somehow got removed from the cc list) wants to do order 9 > allocations from userspace in his nommu environment. I'm just trying to be > realistic when I say that this isn't going to be robust and a userspace > solution is needed. I do agree that order-9 allocations are simply unlikely to work without some pre-allocation notion or some serious work at active de-fragmentation (and the page cache is likely to be the _least_ of the problems people will hit - slab and other kernel allocations are likely to be much much harder to handle, since you can't free them in quite as directed a manner). But for smallish-order (eg perhaps 3-4 possibly even more if you are careful in other places), the page cache limiter may well be a "good enough" solution in practice, especially if other allocations can be controlled by strict usage patterns (which is not realistic in a general-purpose kind of situation, but might be realistic in embedded). Linus
Re: O_DIRECT question
Nick Piggin wrote: Linus Torvalds wrote: Very basic issue: the perfect is the enemy of the good. Claiming that there is a "proper solution" is usually a total red herring. Quite often there isn't, and the "paper over" is actually not papering over, it's quite possibly the best solution there is. Yeah *smallish* higher order allocations are fine, and we use them all the time for things like stacks or networking. But Aubrey (who somehow got removed from the cc list) wants to do order 9 allocations from userspace in his nommu environment. I'm just trying to be realistic when I say that this isn't going to be robust and a userspace solution is needed. Oh, and also: I don't disagree that limiting pagecache to some % might be useful for other reasons. -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com
Re: O_DIRECT question
Linus Torvalds wrote: On Fri, 12 Jan 2007, Nick Piggin wrote: We are talking about fragmentation. And limiting pagecache to try to avoid fragmentation is a bandaid, especially when the problem can be solved (not just papered over, but solved) in userspace. It's not clear that the problem _can_ be solved in user space. It's easy enough to say "never allocate more than a page". But it's often not REALISTIC. > Very basic issue: the perfect is the enemy of the good. Claiming that there is a "proper solution" is usually a total red herring. Quite often there isn't, and the "paper over" is actually not papering over, it's quite possibly the best solution there is. Yeah *smallish* higher order allocations are fine, and we use them all the time for things like stacks or networking. But Aubrey (who somehow got removed from the cc list) wants to do order 9 allocations from userspace in his nommu environment. I'm just trying to be realistic when I say that this isn't going to be robust and a userspace solution is needed. -- SUSE Labs, Novell Inc.
Re: O_DIRECT question
On Fri, 12 Jan 2007, Nick Piggin wrote: > > We are talking about fragmentation. And limiting pagecache to try to > avoid fragmentation is a bandaid, especially when the problem can be solved > (not just papered over, but solved) in userspace. It's not clear that the problem _can_ be solved in user space. It's easy enough to say "never allocate more than a page". But it's often not REALISTIC. Very basic issue: the perfect is the enemy of the good. Claiming that there is a "proper solution" is usually a total red herring. Quite often there isn't, and the "paper over" is actually not papering over, it's quite possibly the best solution there is. Linus
Re: O_DIRECT question
Bill Davidsen wrote:
> Nick Piggin wrote:
> > Aubrey wrote:
> > [...]
> >
> > Exactly, and the *real* fix is to modify userspace not to make
> > mallocs of more than PAGE_SIZE[*] if it is to be nommu friendly. It
> > is the kernel hacks to do things like limit cache size that are the
> > bandaids.
>
> Tuning the system to work appropriately for a given load is not a
> band-aid.

We are talking about fragmentation. And limiting pagecache to try to
avoid fragmentation is a bandaid, especially when the problem can be
solved (not just papered over, but solved) in userspace.

--
SUSE Labs, Novell Inc.
Re: O_DIRECT question
Limiting total page cache can be considered first. Only if the total
page cache overruns its limit do we check whether the file overruns its
per-file limit. If it does, release part of the page cache and wake up
kswapd at the same time.

On 1/12/07, Aubrey <[EMAIL PROTECTED]> wrote:
> On 1/11/07, Roy Huang <[EMAIL PROTECTED]> wrote:
> > On an embedded system, limiting page cache can relieve memory
> > fragmentation. There is a patch against 2.6.19 which limits the page
> > cache of every opened file and the total pagecache. When the limit is
> > reached, it will release the page cache that overruns the limit.
>
> The patch seems to work for me. But some suggestions in my mind:
>
> 1) Can we limit the total page cache, not the page cache per file?
> Think about it: if total memory is 128M, 10% of it is 12.8M. If one
> application is running, it can use 12.8M of vfs cache, and performance
> will probably not be impacted. However, the current patch limits the
> page cache per file, which means if only one application runs it can
> only use CONFIG_PAGE_LIMIT pages of cache. That may be too small for
> the application.
>
> --snip---
> if (mapping->nrpages >= mapping->pages_limit)
>         balance_cache(mapping);
> --snip---
>
> 2) A percentage would be a better way to control the value. Can we add
> a proc interface to make the value tunable?
>
> Thanks,
> -Aubrey
Re: O_DIRECT question
Aubrey wrote:
> On 1/11/07, Roy Huang <[EMAIL PROTECTED]> wrote:
> > On an embedded system, limiting page cache can relieve memory
> > fragmentation. There is a patch against 2.6.19 which limits the page
> > cache of every opened file and the total pagecache. When the limit is
> > reached, it will release the page cache that overruns the limit.
>
> The patch seems to work for me. But some suggestions in my mind:
>
> 1) Can we limit the total page cache, not the page cache per file?
> Think about it: if total memory is 128M, 10% of it is 12.8M. If one
> application is running, it can use 12.8M of vfs cache, and performance
> will probably not be impacted. However, the current patch limits the
> page cache per file, which means if only one application runs it can
> only use CONFIG_PAGE_LIMIT pages of cache. That may be too small for
> the application.
>
> --snip---
> if (mapping->nrpages >= mapping->pages_limit)
>         balance_cache(mapping);
> --snip---
>
> 2) A percentage would be a better way to control the value. Can we add
> a proc interface to make the value tunable?

Even a global value isn't completely straightforward, and a per-file
value would be yet more work. You see, it is hard to do any sort of
directed reclaim on these pages.

--
SUSE Labs, Novell Inc.
Re: O_DIRECT question
Nick Piggin wrote:
> Aubrey wrote:
> > On 1/11/07, Nick Piggin <[EMAIL PROTECTED]> wrote:
> > > What you _really_ want to do is avoid large mallocs after boot, or
> > > use a CPU with an mmu. I don't think nommu linux was ever intended
> > > to be a simple drop-in replacement for a normal unix kernel.
> >
> > Is there a position available working on an mmu CPU? Joking, :)
> > Yes, some problems are serious on nommu linux. But I think we should
> > try to fix them, not avoid them.
>
> Exactly, and the *real* fix is to modify userspace not to make mallocs
> of more than PAGE_SIZE[*] if it is to be nommu friendly. It is the
> kernel hacks to do things like limit cache size that are the bandaids.

Tuning the system to work appropriately for a given load is not a
band-aid. I have been saying since 2.5.x times that filling memory with
cached writes was a bad thing, and filling with writes to a single file
was a doubly bad thing. Back in 2.4.NN-aa kernels there were some
tunables to address that, but other than adding your own, 2.6 just
behaves VERY badly for some loads. Of course, being an embedded system,
if they work for you then that's really fine and you can obviously ship
with them. But they don't need to go upstream.

Anyone who has a few processes which write a lot of data and many
processes with more modest i/o needs will see the overfilling of cache
with data from one process or even for one file, and the resulting
impact on the performance of all other processes, particularly if the
kernel decides to write all the data for one file at once, because it
avoids seeks, even if it uses the drive for seconds. The code has gone
too far in the direction of throughput, at the expense of response to
other processes, given the (common) behavior noted.
--
bill davidsen <[EMAIL PROTECTED]>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979
Re: O_DIRECT question
On 1/11/07, Roy Huang <[EMAIL PROTECTED]> wrote:
> On an embedded system, limiting page cache can relieve memory
> fragmentation. There is a patch against 2.6.19 which limits the page
> cache of every opened file and the total pagecache. When the limit is
> reached, it will release the page cache that overruns the limit.

The patch seems to work for me. But some suggestions in my mind:

1) Can we limit the total page cache, not the page cache per file?
Think about it: if total memory is 128M, 10% of it is 12.8M. If one
application is running, it can use 12.8M of vfs cache, and performance
will probably not be impacted. However, the current patch limits the
page cache per file, which means if only one application runs it can
only use CONFIG_PAGE_LIMIT pages of cache. That may be too small for the
application.

--snip---
if (mapping->nrpages >= mapping->pages_limit)
        balance_cache(mapping);
--snip---

2) A percentage would be a better way to control the value. Can we add a
proc interface to make the value tunable?

Thanks,
-Aubrey
Re: O_DIRECT question
linux-os (Dick Johnson) wrote:
> On Wed, 10 Jan 2007, Aubrey wrote:
> > Hi all,
> >
> > Opening a file with the O_DIRECT flag can do un-buffered read/write
> > access. So if I need un-buffered access, I have to change all of my
> > applications to add this flag. What's more, some scripts like "cp
> > oldfile newfile" still use pagecache and buffer. Now, my question is,
> > is there an existing way to mount a filesystem with the O_DIRECT
> > flag, so that I don't need to change anything in my system? If there
> > is no such option so far, what is the right way to achieve my
> > purpose?
> >
> > Thanks a lot.
> > -Aubrey
>
> I don't think O_DIRECT ever did what a lot of folks expect, i.e., write
> this buffer of data to the physical device _now_. All I/O ends up being
> buffered. The `man` page states that the I/O will be synchronous, that
> at the conclusion of the call, data will have been transferred.
> However, the data written probably will not be in the physical device,
> perhaps only in a DMA-able buffer with a promise to get it to the SCSI
> device, soon.

No one (who read the specs) ever thought the write was "right now," just
that it was direct from user buffers. So it is not buffered, but it is
queued through the elevator.

> Maybe you need to say why you want to use O_DIRECT with its terrible
> performance?

Because it doesn't have terrible performance, because the user knows
better than the o/s what is "right," etc. I used it to eliminate cache
impact from large but non-essential operations; others use it on slow
machines to avoid the CPU impact and bus bandwidth impact of extra
copies. Please don't assume that users are unable to understand how it
works because you believe some other feature which does something else
would be just as good. There is no other option which causes the writes
to be queued right now and not use any cache, and that is sometimes just
what you want.
I do like the patch to limit per-file and per-system cache, though; in
some cases I really would like the system to slow gradually rather than
fill 12GB of RAM with backlogged writes, then queue them and have other
i/o crawl or stop.

--
bill davidsen <[EMAIL PROTECTED]>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979
RE: O_DIRECT question
> The other problem besides the inability to handle IO errors is that
> mmap()+msync() is synchronous. You need to go async to keep
> the pipelines full.

msync(addr, len, MS_ASYNC); doesn't do what you want?

> Now if someone wants to implement an aio version of msync and
> mlock, that might do the trick. At least for MMU systems.
> Non MMU systems just can't play mmap type games.
Re: O_DIRECT question
Michael Tokarev wrote:
> Linus Torvalds wrote:
> > On Thu, 11 Jan 2007, Viktor wrote:
> > > OK, madvise() used with an mmap'ed file allows reads from a file
> > > with zero-copy between kernel/user buffers and doesn't pollute
> > > cache memory unnecessarily. But how about writes? How does one do
> > > zero-copy writes to a file without polluting cache memory and
> > > without using O_DIRECT? Do I miss the appropriate interface?
> >
> > mmap()+msync() can do that too.
>
> It can, somehow... until there's an I/O error. And *that* is just
> terrible.

The other problem besides the inability to handle IO errors is that
mmap()+msync() is synchronous. You need to go async to keep the
pipelines full.

Now if someone wants to implement an aio version of msync and mlock,
that might do the trick. At least for MMU systems. Non-MMU systems just
can't play mmap type games.
Re: O_DIRECT question
On Thu, 2007-01-11 at 11:00 -0800, Linus Torvalds wrote:
> On Thu, 11 Jan 2007, Trond Myklebust wrote:
> >
> > For NFS, the main feature of interest when it comes to O_DIRECT is
> > strictly uncached I/O. Replacing it with POSIX_FADV_NOREUSE won't help
> > because it can't guarantee that the page will be thrown out of the page
> > cache before some second process tries to read it. That is particularly
> > true if some dopey third party process has mmapped the file.
>
> You'd still be MUCH better off using the page cache, and just forcing the
> IO (but _with_ all the page cache synchronization still active). Which is
> trivial to do on the filesystem level, especially for something like NFS.
>
> If you bypass the page cache, you just make that "dopey third party
> process" problem worse. You now _guarantee_ that there are aliases with
> different data.

Quite, but that is sometimes an admissible state of affairs. One of the
things that was infuriating when we were trying to do shared databases
over the page cache was that someone would start some unsynchronised
process that had nothing to do with the database itself (it would
typically be a process that was backing up the rest of the disk or
something like that). Said process would end up pinning pages in memory,
and prevent the database itself from getting updated data from the
server.

IOW: the problem was not that of unsynchronised I/O per se. It was
rather that of allowing the application to set up its own
synchronisation barriers and to ensure that no pages are cached across
these barriers. POSIX_FADV_NOREUSE can't offer that guarantee.

> Of course, with NFS, the _server_ will resolve any aliases anyway, so at
> least you don't get file corruption, but you can get some really strange
> things (like the write of one process actually happening before, but being
> flushed _after_ and overriding the later write of the O_DIRECT process).
Writes are not the real problem here, since shared databases typically
do implement sufficient synchronisation, and NFS can guarantee that only
the dirty data will be written out. However, reading back the data is
problematic when you have insufficient control over the page cache.

The other issue is, of course, that databases don't _want_ to cache the
data in this situation, so the extra copy to the page cache is just a
bother. As you pointed out, that becomes less of an issue as processor
caches and memory speeds increase, but it is still apparently a
measurable effect.

Cheers
  Trond
Re: O_DIRECT question
On Thu, 11 Jan 2007, Trond Myklebust wrote:
>
> For NFS, the main feature of interest when it comes to O_DIRECT is
> strictly uncached I/O. Replacing it with POSIX_FADV_NOREUSE won't help
> because it can't guarantee that the page will be thrown out of the page
> cache before some second process tries to read it. That is particularly
> true if some dopey third party process has mmapped the file.

You'd still be MUCH better off using the page cache, and just forcing
the IO (but _with_ all the page cache synchronization still active).
Which is trivial to do on the filesystem level, especially for something
like NFS.

If you bypass the page cache, you just make that "dopey third party
process" problem worse. You now _guarantee_ that there are aliases with
different data.

Of course, with NFS, the _server_ will resolve any aliases anyway, so at
least you don't get file corruption, but you can get some really strange
things (like the write of one process actually happening before, but
being flushed _after_ and overriding the later write of the O_DIRECT
process).

And sure, the filesystem can have its own alias avoidance too (by just
probing the page cache all the time), but the fundamental fact remains:
the problem is that O_DIRECT as a page-cache-bypassing mechanism is
BROKEN. If you have issues with caching (but still have to allow it for
other things), the way to fix them is not to make uncached accesses,
it's to force the cache to be serialized. That's very fundamentally
true.

Linus
Re: O_DIRECT question
On Thu, 2007-01-11 at 09:04 -0800, Linus Torvalds wrote:
> That is what I think some users could do. If the main issue with
> O_DIRECT is the page cache allocations, and if we instead had better
> (read: "any") support for POSIX_FADV_NOREUSE, then one class of reasons
> for O_DIRECT usage would just go away.

For NFS, the main feature of interest when it comes to O_DIRECT is
strictly uncached I/O. Replacing it with POSIX_FADV_NOREUSE won't help,
because it can't guarantee that the page will be thrown out of the page
cache before some second process tries to read it. That is particularly
true if some dopey third party process has mmapped the file.

Trond
Re: O_DIRECT question
On Thu, 11 Jan 2007, Alan wrote:
>
> Well you can - it's called SG_IO and that really does get the OS out of
> the way. O_DIRECT gets crazy when you stop using it on devices directly
> and use it on files.

Well, on a raw disk, O_DIRECT is fine too, but yeah, you might as well
use SG_IO at that point. All of my issues are about filesystems. And
filesystems are where people use O_DIRECT most. Almost nobody puts their
database on a partition of its own these days, afaik. Perhaps for
benchmarking or some really high-end stuff. Not "normal users".

Linus
Re: O_DIRECT question
> space, just as an example) is wrong in the first place, but the really
> subtle problems come when you realize that you can't really just "bypass"
> the OS.

Well you can - it's called SG_IO, and that really does get the OS out of
the way. O_DIRECT gets crazy when you stop using it on devices directly
and use it on files.

You do need some way to avoid the copy cost of caches and get data
direct to user space. It also needs to be a way that works without MMU
tricks, because many of those that need it are embedded platforms.

Alan
Re: O_DIRECT question
Linus Torvalds wrote:
> On Thu, 11 Jan 2007, Viktor wrote:
> > OK, madvise() used with an mmap'ed file allows reads from a file with
> > zero-copy between kernel/user buffers and doesn't pollute cache
> > memory unnecessarily. But how about writes? How does one do zero-copy
> > writes to a file without polluting cache memory and without using
> > O_DIRECT? Do I miss the appropriate interface?
>
> mmap()+msync() can do that too.

It can, somehow... until there's an I/O error. And *that* is just
terrible.

Granted, I didn't check 2.6.x kernels, especially the latest ones. But
in 2.4, if an I/O space behind mmap becomes unREADable, the process gets
stuck in some unkillable state forever. I don't know what happens with
write errors, but that behaviour with read errors is just unacceptable.

Sure, it's not something like posix_madvise() (which is for reads
anyway, not writes). But I'd very strongly disagree about usage of mmap
for anything more-or-less serious, because of, umm... difficulties with
error recovery (if it's at all possible).

Note also that anything but O_DIRECT isn't... portable. O_DIRECT, with
all its shortcomings and ugliness, works, and works on quite some
systems. Having something else, especially with a very different usage
model, would be somewhat problematic -- I mean, if the whole I/O
subsystem in an application has to be redesigned and rewritten in order
to use that advanced (or just "right") mechanism. O_DIRECT is no
different from basic read()/write() -- just one extra bit at open() time
-- and all your code, which evolved during years and got years of
testing too, just works, at least in theory, if the O_DIRECT interface
is working (ok, ok, I know about alignment issues, but those are also
handled easily). *Unless* there's a very noticeable gain, redesigning
all that isn't worth it.

From my experience with databases (mostly Oracle, and some with Postgres
and Mysql), O_DIRECT has a *dramatic* impact on performance. You don't
use O_DIRECT, and you lose a lot.
O_DIRECT is *already* the fastest way possible, I think - for example,
it gives maximum speed when writing to or reading from a raw device
(/dev/sdb etc). I don't think there's a way to improve that
performance... Yes, there ARE, it seems, some ways for improvement, in
other areas - like utilizing write barriers, for example, which isn't
quite possible now from userspace. But as long as O_DIRECT actually
writes data before returning from the write() call (as seems to be the
case at least with a normal filesystem on a real block device - I don't
touch corner cases like nfs here), it's pretty much THE ideal solution,
at least from the application (developer) standpoint.

By the way, ext[23]fs is terribly slow with O_DIRECT writes - it gives
about 1/4 of the speed of the raw device when multiple concurrent direct
readers and writers are running. Xfs gives full raw device speed here. I
think that MAY be related to locking issues in ext[23], but I don't know
for sure. And another "btw" - when creating files, O_DIRECT is quite a
killer - each write takes a lot more time than "necessary". But once a
file has been written, re-writes are pretty fast.

Also, and it's quite... funny (to me at least): being curious, I
compared the write speed (random small-block I/O scattered all around
the disk) of modern disk drives with and without write cache (the
WCE=[01] bit in the SCSI "Cache control" page of every disk drive). The
fun is: with write cache turned on, actual speed is LOWER than without
cache. I don't remember the exact numbers, something like 120mb/sec vs
90mb/sec. And I think it's to be expected, as well - at first all writes
go to the cache, but since the data stream goes on and on, the cache
fills up quickly, and in order to accept the next data, the drive has to
free some place in its cache. So instead of just doing its work, it
spends its time bouncing data to/from the cache... Sure, it's not about
the linux pagecache or something like that, but it's still somehow
related.
:)

[]
> O_DIRECT - by bypassing the "real" kernel - very fundamentally breaks
> the whole _point_ of the kernel. There's tons of races where an
> O_DIRECT user (or other users that expect to see the O_DIRECT data)
> will now see the wrong data - including seeing uninitialized portions
> of the disk etc etc.

Huh? Well, I plug a shiny new harddisk into my computer and do an
O_DIRECT read of it - will I see uninitialized data? Sure I will (well,
in most cases the whole disk is filled with zeros anyway, so it isn't
uninitialized). The same applies to a regular read, too.

If what you're saying applies to an O_DIRECT read of a file on a
filesystem -- well, that's definitely a kernel bug. It should either not
allow reading the last part which isn't a whole sector (or whatever)
when the file size isn't sector-aligned, or it should ensure the "extra"
data is initialized. Yes, that's difficult to implement in the kernel.
But that's not an excuse not to do it. AND I think just failing the
Re: O_DIRECT question
On Thu, 11 Jan 2007, Xavier Bestel wrote:
> On Thursday 11 January 2007 at 07:50 -0800, Linus Torvalds wrote:
> >
> > > O_DIRECT is still crazily racy versus pagecache operations.
> >
> > Yes. O_DIRECT is really fundamentally broken. There's just no way to
> > fix it sanely.
>
> How about aliasing O_DIRECT to POSIX_FADV_NOREUSE (sortof)?

That is what I think some users could do. If the main issue with
O_DIRECT is the page cache allocations, and if we instead had better
(read: "any") support for POSIX_FADV_NOREUSE, then one class of reasons
for O_DIRECT usage would just go away.

See also the patch that Roy Huang posted about another approach to the
same problem: just limiting page cache usage explicitly.

That's not the _only_ issue with O_DIRECT, though. It's one big one, but
people like to think that the memory copy makes a difference when you do
IO too (I think it's likely pretty debatable in real life, but I'm
totally certain you can benchmark it, probably even pretty easily,
especially if you have fairly studly IO capabilities and a CPU that
isn't quite as studly).

So POSIX_FADV_NOREUSE kind of support is one _part_ of the O_DIRECT
picture, and depending on your problems (in this case, the embedded
world) it may even be the *biggest* part. But it's not the whole
picture.

Linus
Re: O_DIRECT question
On Thursday 11 January 2007 at 07:50 -0800, Linus Torvalds wrote:
>
> > O_DIRECT is still crazily racy versus pagecache operations.
>
> Yes. O_DIRECT is really fundamentally broken. There's just no way to
> fix it sanely.

How about aliasing O_DIRECT to POSIX_FADV_NOREUSE (sortof)?

Xav
Re: O_DIRECT question
On Thu, 11 Jan 2007, Roy Huang wrote:
>
> On an embedded system, limiting page cache can relieve memory
> fragmentation. There is a patch against 2.6.19 which limits the page
> cache of every opened file and the total pagecache. When the limit is
> reached, it will release the page cache that overruns the limit.

I do think that something like this is probably a good idea, even on
non-embedded setups. We historically couldn't do this, because mapped
pages were too damn hard to remove, but that's obviously not much of a
problem any more.

However, the page-cache limit should NOT be some compile-time constant.
It should work the same way the "dirty page" limit works, and probably
just default to "feel free to use 90% of memory for page cache".

Linus