RE: slow open() calls and o_nonblock

2007-06-04 Thread David Schwartz

Aaron Wiebe wrote:


> David Schwartz wrote:

> > There is no way you can re-try the request. The open must either
> > succeed or not return a handle. It is not like a 'read' operation
> > that has an "I didn't do anything, and you can retry this request"
> > option.

> > If 'open' returns a file handle, you can't retry it (since it must
> > succeed in order to do that, failure must not return a handle). If
> > your 'open' doesn't return a file handle, you can't retry it
> > (because, without a handle, there is no way to associate a future
> > request with this one; if it creates a file, the file must not be
> > created if you don't call 'open' again).

> I understand, but this is exactly the situation that I'm complaining
> about.  There is no functionality to provide a nonblocking open - no
> ability to come back around and retry a given open call.

I agree. I'm addressing why things can't "just work", not arguing that they
aren't broken or should stay broken. ;)

I think a good solution would be to re-use the 'connect' and 'shutdown'
calls. You would need a new asynchronous flag to 'open' that would mean,
*really* don't block. You would have to follow up with 'connect' to complete
the actual opening -- the 'open' would just assign a file descriptor (unless
it could complete or error immediately, of course).

To asynchronously close such a descriptor, you simply call 'shutdown'. Once
the 'shutdown' completes, 'close' would be guaranteed not to block.

Obviously, being able to 'poll' or 'select' on the descriptor would be a
huge plus (only while an 'open' or 'close' is in progress, of course;
otherwise it would always report immediate availability).

I think this covers all the bases and the only ugly API change is an extra
'open' flag. (Which I think is unavoidable.)

> I'm speaking to my ideal world view - but any application I write
> should not have to wait for the kernel if I don't want it to.  I
> should be able to submit my request, and come back to it later as I so
> decide.

A working generic asynchronous system call interface would be the best
solution, I think. But that may be further off than just an asynchronous
file open/close interface.

> (And I did actually consider writing my own NFS client for about
> 5 minutes.)

Yeah, what a pain that would be. The obvious counter-argument to what I
propose above is that it doesn't handle reads and writes, so why bother with
a complex partial solution?

DS


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: slow open() calls and o_nonblock

2007-06-04 Thread Trond Myklebust
On Mon, 2007-06-04 at 12:26 -0400, Aaron Wiebe wrote:
> Actually, let's see if I can summarize this more generically... I
> realize I'm suggesting something that probably would be a massive
> undertaking, but ..
> 
> Regular files are the only interface that requires an application to
> wait.  With any other case, the nonblocking interfaces are fairly
> complete and easy to work with.  If userspace could treat regular
> files in the same fashion as sockets, life would be good.
> 
> I admittedly do not understand internal kernel semantics in the
> differences between a socket and a regular file.  Why couldn't we just
> have a different 'socket type' like PF_FILE or something like this?
> 
> Abstracting any IO through the existing interfaces provided to sockets
> would be ideal from my perspective.  The code required to use a file
> through these interfaces would be more complex in userspace, but the
> abstraction of the current open() itself could simply be an aggregate
> of these interfaces without a nonblocking flag.
> 
> It would, however, fix the problems event-based applications have
> handling events from both disk and sockets.  I can't trigger disk
> read/write events in the same event handlers I use for sockets (i.e.,
> poll or epoll).  I end up having two separate event handlers - one
> for disk (currently using glibc's aio thread kludge), and one for
> sockets.
> 
> I'm sure this isn't a new idea.  Coming from my own development
> background that had little to do with disk, I was actually surprised
> when I first discovered that I couldn't edge-trigger disk IO through
> poll().
> 
> Thoughts, comments?

Unless you're planning on rearchitecting the entire VFS lookup and
permissions code, you would basically have to fall back onto having a
pool of service threads actually perform the I/O. That can just as
easily be done today in userland.

AFAICS, syslets should give you the means to implement a more scalable
scheme, but we'll have to wait and see if/when those are ready for
kernel inclusion.

Cheers
  Trond



Re: slow open() calls and o_nonblock

2007-06-04 Thread Aaron Wiebe

Actually, let's see if I can summarize this more generically... I
realize I'm suggesting something that probably would be a massive
undertaking, but ..

Regular files are the only interface that requires an application to
wait.  With any other case, the nonblocking interfaces are fairly
complete and easy to work with.  If userspace could treat regular
files in the same fashion as sockets, life would be good.

I admittedly do not understand internal kernel semantics in the
differences between a socket and a regular file.  Why couldn't we just
have a different 'socket type' like PF_FILE or something like this?

Abstracting any IO through the existing interfaces provided to sockets
would be ideal from my perspective.  The code required to use a file
through these interfaces would be more complex in userspace, but the
abstraction of the current open() itself could simply be an aggregate
of these interfaces without a nonblocking flag.

It would, however, fix the problems event-based applications have
handling events from both disk and sockets.  I can't trigger disk
read/write events in the same event handlers I use for sockets (i.e.,
poll or epoll).  I end up having two separate event handlers - one for
disk (currently using glibc's aio thread kludge), and one for sockets.

I'm sure this isn't a new idea.  Coming from my own development
background that had little to do with disk, I was actually surprised
when I first discovered that I couldn't edge-trigger disk IO through
poll().

Thoughts, comments?

-Aaron

On 6/4/07, Aaron Wiebe <[EMAIL PROTECTED]> wrote:

On 6/4/07, Trond Myklebust <[EMAIL PROTECTED]> wrote:
>
> So exactly how would you expect a nonblocking open to work? Should it be
> starting I/O? What if that involves blocking? How would you know when to
> try again?

Well, there's a bunch of options - some have been suggested in the
thread already.  The idea of an open with O_NONBLOCK (or a different
flag) returning a handle immediately, and subsequent calls returning
EAGAIN if the open is incomplete, or ESTALE if it fails (with some
auxiliary method of getting the reason why it failed), is not too far
a stretch from my perspective.

The other option that comes to mind would be to add an interface that
behaves like sockets - get a handle from one system call, set it
nonblocking using fcntl, and use another call to attach it to a
regular file.  This method would make the most sense to me - but that's
also because I've worked with sockets in the past far far more than
with regular files.

The one that would take the least amount of work from the application
perspective would be to simply reply to the nonblocking open call with
EAGAIN (or something), and have the kernel perform its work in the
background so a later open on the same file can complete.  I can
understand, given that no handle is provided to the application, that
this idea could be sloppy.

I'm still getting caught up on some of the other suggestions (I'm
currently reading about the syslets work that Zach and Ingo are
doing), and it sounds like this is a common complaint that is being
addressed through a number of initiatives.  I'm looking forward to
seeing where that work goes.

-Aaron




Re: slow open() calls and o_nonblock

2007-06-04 Thread Aaron Wiebe

On 6/4/07, Trond Myklebust <[EMAIL PROTECTED]> wrote:


So exactly how would you expect a nonblocking open to work? Should it be
starting I/O? What if that involves blocking? How would you know when to
try again?


Well, there's a bunch of options - some have been suggested in the
thread already.  The idea of an open with O_NONBLOCK (or a different
flag) returning a handle immediately, and subsequent calls returning
EAGAIN if the open is incomplete, or ESTALE if it fails (with some
auxiliary method of getting the reason why it failed), is not too far
a stretch from my perspective.

The other option that comes to mind would be to add an interface that
behaves like sockets - get a handle from one system call, set it
nonblocking using fcntl, and use another call to attach it to a
regular file.  This method would make the most sense to me - but that's
also because I've worked with sockets in the past far far more than
with regular files.

The one that would take the least amount of work from the application
perspective would be to simply reply to the nonblocking open call with
EAGAIN (or something), and have the kernel perform its work in the
background so a later open on the same file can complete.  I can
understand, given that no handle is provided to the application, that
this idea could be sloppy.

I'm still getting caught up on some of the other suggestions (I'm
currently reading about the syslets work that Zach and Ingo are
doing), and it sounds like this is a common complaint that is being
addressed through a number of initiatives.  I'm looking forward to
seeing where that work goes.

-Aaron


Re: slow open() calls and o_nonblock

2007-06-04 Thread Trond Myklebust
On Mon, 2007-06-04 at 10:20 -0400, Aaron Wiebe wrote:
> I understand, but this is exactly the situation that I'm complaining
> about.  There is no functionality to provide a nonblocking open - no
> ability to come back around and retry a given open call.

So exactly how would you expect a nonblocking open to work? Should it be
starting I/O? What if that involves blocking? How would you know when to
try again?

  Trond



Re: slow open() calls and o_nonblock

2007-06-04 Thread Aaron Wiebe

Sorry for the unthreaded responses - I wasn't cc'd here, so I'm
replying to these based on the mailing list archives.

Al Viro wrote:


BTW, why close these suckers all the time? It's not that the kernel
would be unable to hold thousands of open descriptors for your
process... Hash descriptors by pathname and be done with that; don't
bother with close unless you decide that you've got too many of them
(e.g. when you get a hash conflict).


A valid point - I currently keep a pool of 4000 descriptors open and
cycle them out based on inactivity.  I hadn't seriously considered
just keeping them all open, because I simply wasn't sure how well
things would go with 100,000 files open.  Would my backend storage
keep up... would the kernel mind maintaining 100,000 files open over
NFS?

The majority of the files would simply be idle - I would be keeping
file handles open for no reason.  Pooling allows me to substantially
drop the number of opens I require, but I am hesitant to blow the pool
size to substantially higher numbers.  Can anyone shed light on any
issues that may come up with a massive pool size, such as 128k?

-Aaron


Re: slow open() calls and o_nonblock

2007-06-04 Thread Aaron Wiebe

On 6/4/07, John Stoffel <[EMAIL PROTECTED]> wrote:


So how many files are in the directory where you're seeing the delays?
And what's the average size of the files in there?


The directories themselves will have a maximum of 160 files, and the
files are maybe a few megs each - the delays are (as you pointed out
earlier) due to the RAM restrictions and our filesystem design of very
deep directory structures that Netapps suck at.

My point is more generic though - I will come up with ways to handle
this problem in my application (probably with threads), but I'm
griping more about the lack of a kernel interface that would have
allowed me to avoid this.

-Aaron


Re: slow open() calls and o_nonblock

2007-06-04 Thread Aaron Wiebe

Replying to David Schwartz here.. (David, good to hear from you again
- haven't seen you around since the irc days :))

David Schwartz wrote:


There is no way you can re-try the request. The open must either succeed or
not return a handle. It is not like a 'read' operation that has an "I didn't
do anything, and you can retry this request" option.

If 'open' returns a file handle, you can't retry it (since it must succeed
in order to do that, failure must not return a handle). If your 'open'
doesn't return a file handle, you can't retry it (because, without a handle,
there is no way to associate a future request with this one; if it creates a
file, the file must not be created if you don't call 'open' again).


I understand, but this is exactly the situation that I'm complaining
about.  There is no functionality to provide a nonblocking open - no
ability to come back around and retry a given open call.


You need either threads or a working asynchronous system call interface.
Short of that, you need your own NFS client code.


This is exactly my point - there is no asynchronous system call to do
this work, to my knowledge.  I will likely fix this in my own code
using threads, but I see using threads in this case as working around
that lack of systems interface.  Threads, imho, should be limited to
cases where I'm using them to distribute load across multiple
processors, not because the kernel interfaces for IO cannot support
nonblocking calls.

I'm speaking to my ideal world view - but any application I write
should not have to wait for the kernel if I don't want it to.  I
should be able to submit my request, and come back to it later as I so
decide.

(And I did actually consider writing my own NFS client for about 5 minutes.)

Thanks for the response!
-Aaron


Re: slow open() calls and o_nonblock

2007-06-04 Thread John Stoffel
> "Aaron" == Aaron Wiebe <[EMAIL PROTECTED]> writes:

Aaron> On 6/4/07, Alan Cox <[EMAIL PROTECTED]> wrote:
>> 
>> > Now, I'm a userspace guy so I can be pretty dense, but shouldn't a
>> > call with a nonblocking flag return EAGAIN if it's going to take
>> > anywhere near 415ms?
>> 
>> Violation of causality. We don't know it will block for 415ms until 415ms
>> have elapsed.

Aaron> Understood - but what I'm getting at is more the fact that
Aaron> there really doesn't appear to be any real implementation of
Aaron> nonblocking open().  On the socket side of the fence, I would
Aaron> consider a regular file open() to be equivalent to a connect()
Aaron> call - the difference obviously being that we already have a
Aaron> handle for the socket.

Aaron> The end result, however, is roughly the same.  We have a file
Aaron> descriptor with the endpoint established.  In the socket world,
Aaron> we assume that a nonblocking request will always return
Aaron> immediately and the application is expected to come back around
Aaron> and see if the request has completed.  Regular files have no
Aaron> equivalent.

So how many files are in the directory where you're seeing the delays?
And what's the average size of the files in there?  

John


Re: slow open() calls and o_nonblock

2007-06-04 Thread Aaron Wiebe

On 6/4/07, Alan Cox <[EMAIL PROTECTED]> wrote:


> Now, I'm a userspace guy so I can be pretty dense, but shouldn't a
> call with a nonblocking flag return EAGAIN if it's going to take
> anywhere near 415ms?

Violation of causality. We don't know it will block for 415ms until 415ms
have elapsed.


Understood - but what I'm getting at is more the fact that there
really doesn't appear to be any real implementation of nonblocking
open().  On the socket side of the fence, I would consider a regular
file open() to be equivalent to a connect() call - the difference
obviously being that we already have a handle for the socket.

The end result, however, is roughly the same.  We have a file
descriptor with the endpoint established.  In the socket world, we
assume that a nonblocking request will always return immediately and
the application is expected to come back around and see if the request
has completed.  Regular files have no equivalent.

-Aaron


Re: slow open() calls and o_nonblock

2007-06-04 Thread Aaron Wiebe

On 6/3/07, Neil Brown <[EMAIL PROTECTED]> wrote:


Have you tried the "nocto" mount option for your NFS filesystems?

The cache-coherency rules of NFS require the client to check with the
server at each open.  If you are the sole client on this filesystem,
then you don't need the same cache-coherency, and "nocto" will tell
the NFS client not to bother checking with the server if information
is available in cache.


No I haven't - I will research this a little further today.  While
we're not the only client using these filesystems, this process is
(currently) the only process that writes to these files.  Thanks for
the suggestion.
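For reference, "nocto" is an ordinary NFS mount option, so trying it is a one-line change. The server and paths below are placeholders, not the actual setup discussed here:

```shell
# Relax close-to-open cache consistency: skip the per-open revalidation
# (GETATTR) round trip to the server.  Safe only when this client is
# effectively the sole writer of the files in question.
mount -t nfs -o nocto,rw filer:/vol/data /mnt/data
```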

-Aaron


Re: slow open() calls and o_nonblock

2007-06-04 Thread Alan Cox
> Now, I'm a userspace guy so I can be pretty dense, but shouldn't a
> call with a nonblocking flag return EAGAIN if it's going to take
> anywhere near 415ms?

Violation of causality. We don't know it will block for 415ms until 415ms
have elapsed. 

Alan


Re: slow open() calls and o_nonblock

2007-06-04 Thread Alan Cox
 Now, I'm a userspace guy so I can be pretty dense, but shouldn't a
 call with a nonblocking flag return EAGAIN if its going to take
 anywhere near 415ms?  

Violation of causality. We don't know it will block for 415ms until 415ms
have elapsed. 

Alan
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: slow open() calls and o_nonblock

2007-06-04 Thread Aaron Wiebe

On 6/3/07, Neil Brown [EMAIL PROTECTED] wrote:


Have you tried the nocto mount option for your NFS filesystems.

The cache-coherency rules of NFS require the client to check with the
server at each open.  If you are the sole client on this filesystem,
then you don't need the same cache-coherency, and nocto will tell
the NFS client not to both checking with the server in information is
available in cache.


No I haven't - I will research this a little further today.  While
we're not the only client using these filesystems, this process is
(currently) the only process that writes to these files.  Thanks for
the suggestion.

-Aaron
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: slow open() calls and o_nonblock

2007-06-04 Thread Aaron Wiebe

On 6/4/07, Alan Cox [EMAIL PROTECTED] wrote:


 Now, I'm a userspace guy so I can be pretty dense, but shouldn't a
 call with a nonblocking flag return EAGAIN if its going to take
 anywhere near 415ms?

Violation of causality. We don't know it will block for 415ms until 415ms
have elapsed.


Understood - but what I'm getting at is more the fact that there
really doesn't appear to be any real implementation of nonblocking
open().  On the socket side of the fence, I would consider a regular
file open() to be equivalent to a connect() call - the difference
obviously being that we already have a handle for the socket.

The end result, however, is roughly the same.  We have a file
descriptor with the endpoint established.  In the socket world, we
assume that a nonblocking request will always return immediately and
the application is expected to come back around and see if the request
has completed.  Regular files have no equivalent.

-Aaron
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: slow open() calls and o_nonblock

2007-06-04 Thread John Stoffel
 Aaron == Aaron Wiebe [EMAIL PROTECTED] writes:

Aaron On 6/4/07, Alan Cox [EMAIL PROTECTED] wrote:
 
  Now, I'm a userspace guy so I can be pretty dense, but shouldn't a
  call with a nonblocking flag return EAGAIN if its going to take
  anywhere near 415ms?
 
 Violation of causality. We don't know it will block for 415ms until 415ms
 have elapsed.

Aaron Understood - but what I'm getting at is more the fact that
Aaron there really doesn't appear to be any real implementation of
Aaron nonblocking open().  On the socket side of the fence, I would
Aaron consider a regular file open() to be equivalent to a connect()
Aaron call - the difference obviously being that we already have a
Aaron handle for the socket.

Aaron The end result, however, is roughly the same.  We have a file
Aaron descriptor with the endpoint established.  In the socket world,
Aaron we assume that a nonblocking request will always return
Aaron immediately and the application is expected to come back around
Aaron and see if the request has completed.  Regular files have no
Aaron equivalent.

So how many files are in the directory where you're seeing the delays?
And what's the average size of the files in there?  

John
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: slow open() calls and o_nonblock

2007-06-04 Thread Aaron Wiebe

Replying to David Schwartz here.. (David, good to hear from you again
- haven't seen you around since the irc days :))

David Schwartz wrote:


There is no way you can re-try the request. The open must either succeed or
not return a handle. It is not like a 'read' operation that has an I didn't
do anything, and you can retry this request option.

If 'open' returns a file handle, you can't retry it (since it must succeed
in order to do that, failure must not return a handle). If you 'open'
doesn't return a file handle, you can't retry it (because, without a handle,
there is no way to associate a future request with this one, if it creates a
file, the file must not be created if you don't call 'open' again).


I understand, but this is exactly the situation that I'm complaining
about.  There is no functionality to provide a nonblocking open - no
ability to come back around and retry a given open call.


You need either threads or a working asynchronous system call interface.
Short of that, you need your own NFS client code.


This is exactly my point - there is no asynchronous system call to do
this work, to my knowledge.  I will likely fix this in my own code
using threads, but I see using threads in this case as working around
that lack of systems interface.  Threads, imho, should be limited to
cases where I'm using them to distribute load across multiple
processors, not because the kernel interfaces for IO cannot support
nonblocking calls.

I'm speaking to my ideal world view - but any application I write
should not have to wait for the kernel if I don't want it to.   I
should be able to submit my request, and come back to it later as I so
decide.

(And I did actually consider writing my own NFS client for about 5 minutes.)

Thanks for the response!
-Aaron
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: slow open() calls and o_nonblock

2007-06-04 Thread Aaron Wiebe

On 6/4/07, John Stoffel [EMAIL PROTECTED] wrote:


So how many files are in the directory where you're seeing the delays?
And what's the average size of the files in there?


The directories themselves will have a maximum of 160 files, and the
files are maybe a few megs each - the delays are (as you pointed out
earlier) due to the ram restrictions and our filesystem design of very
deep directory structures that Netapps suck at.

My point is more generic though - I will come up with ways to handle
this problem in my application (probably with threads), but I'm
griping more about the lack of a kernel interface that would have
allowed me to avoid this.

-Aaron
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: slow open() calls and o_nonblock

2007-06-04 Thread Aaron Wiebe

Sorry for the unthreaded responses, I wasn't cc'd here, so I'm
replying to these based on mailing list archives

Al Viro wrote:


BTW, why close these suckers all the time? It's not that kernel would
be unable to hold thousands of open descriptors for your process...
Hash descriptors by pathname and be done with that; don't bother with
close unless you decide that you've got too many of them (e.g. when you
get a hash conflict).


A valid point - I currently keep a pool of 4000 descriptors open and
cycle them out based on inactivity.  I hadn't seriously considered
just keeping them all open, because I simply wasn't sure how well
things would go with 100,000 files open.  Would my backend storage
keep up... would the kernel mind maintaining 100,000 files open over
NFS?

The majority of the files would simply be idle - I would be keeping
file handles open for no reason.  Pooling allows me to substantially
drop the number of opens I require, but I am hesitant to blow the pool
size to substantially higher numbers.  Can anyone shed light on any
issues that may come up with a massive pool size, such as 128k?

-Aaron
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: slow open() calls and o_nonblock

2007-06-04 Thread Trond Myklebust
On Mon, 2007-06-04 at 10:20 -0400, Aaron Wiebe wrote:
 I understand, but this is exactly the situation that I'm complaining
 about.  There is no functionality to provide a nonblocking open - no
 ability to come back around and retry a given open call.

So exactly how would you expect a nonblocking open to work? Should it be
starting I/O? What if that involves blocking? How would you know when to
try again?

  Trond

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: slow open() calls and o_nonblock

2007-06-04 Thread Aaron Wiebe

On 6/4/07, Trond Myklebust [EMAIL PROTECTED] wrote:


So exactly how would you expect a nonblocking open to work? Should it be
starting I/O? What if that involves blocking? How would you know when to
try again?


Well, theres a bunch of options - some have been suggested in the
thread already.  The idea of an open with O_NONBLOCK (or a different
flag) returning a handle immediately, and subsequent calls returning
EAGAIN if the open is incomplete, or ESTALE if it fails (with some
auxiliary method of getting the reason why it failed) are not too far
a stretch from my perspective.

The other option that comes to mind would be to add an interface that
behaves like sockets - get a handle from one system call, set it
nonblocking using fcntl, and use another call to attach it to a
regular file.  This method would make the most sense to me - but its
also because I've worked with sockets in the past far far more than
with regular files.

The one that would take the least amount of work from the application
perspective would be to simply reply to the nonblocking open call with
EAGAIN (or something), and when an open on the same file is performed,
the kernel could have performed its work in the background.  I can
understand, given the fact that there is no handle provided to the
application, that this idea could be sloppy.

I'm still getting caught up on some of the other suggestions (I'm
currently reading about the syslets work that Zach and Ingo are
doing), and it sounds like this is a common complaint that is being
addressed through a number of initiatives.  I'm looking forward to
seeing where that work goes.

-Aaron
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: slow open() calls and o_nonblock

2007-06-04 Thread Aaron Wiebe

Actually, lets see if I can summarize this more generically... I
realize I'm suggesting something that probably would be a massive
undertaking, but ..

Regular files are the only interface that requires an application to
wait.  With any other case, the nonblocking interfaces are fairly
complete and easy to work with.  If userspace could treat regular
files in the same fashion as sockets, life would be good.

I admittedly do not understand internal kernel semantics in the
differences between a socket and a regular file.  Why couldn't we just
have a different 'socket type' like PF_FILE or something like this?

Abstracting any IO through the existing interfaces provided to sockets
would be ideal from my perspective.  The code required to use a file
through these interfaces would be more complex in userspace, but the
abstraction of the current open() itself could simply be an aggregate
of these interfaces without a nonblocking flag.

It would, however, fix problems with event-based applications
handling events from both disk and sockets.  I can't trigger disk
read/write events in the same event handlers I use for sockets
(i.e., poll or epoll).  I end up having two separate event handlers -
one for disk (currently using glibc's aio thread kludge), and one for
sockets.

I'm sure this isn't a new idea.  Coming from my own development
background that had little to do with disk, I was actually surprised
when I first discovered that I couldn't edge-trigger disk IO through
poll().

Thoughts, comments?

-Aaron

On 6/4/07, Aaron Wiebe [EMAIL PROTECTED] wrote:

> On 6/4/07, Trond Myklebust [EMAIL PROTECTED] wrote:
>
> > So exactly how would you expect a nonblocking open to work? Should it be
> > starting I/O? What if that involves blocking? How would you know when to
> > try again?
>
> Well, there's a bunch of options - some have been suggested in the
> thread already.  The idea of an open with O_NONBLOCK (or a different
> flag) returning a handle immediately, and subsequent calls returning
> EAGAIN if the open is incomplete, or ESTALE if it fails (with some
> auxiliary method of getting the reason why it failed) is not too far
> a stretch from my perspective.
>
> The other option that comes to mind would be to add an interface that
> behaves like sockets - get a handle from one system call, set it
> nonblocking using fcntl, and use another call to attach it to a
> regular file.  This method would make the most sense to me - but that's
> also because I've worked with sockets in the past far, far more than
> with regular files.
>
> The one that would take the least amount of work from the application
> perspective would be to simply reply to the nonblocking open call with
> EAGAIN (or something), and when an open on the same file is performed,
> the kernel could have performed its work in the background.  I can
> understand, given the fact that there is no handle provided to the
> application, that this idea could be sloppy.
>
> I'm still getting caught up on some of the other suggestions (I'm
> currently reading about the syslets work that Zach and Ingo are
> doing), and it sounds like this is a common complaint that is being
> addressed through a number of initiatives.  I'm looking forward to
> seeing where that work goes.
>
> -Aaron




Re: slow open() calls and o_nonblock

2007-06-04 Thread Trond Myklebust
On Mon, 2007-06-04 at 12:26 -0400, Aaron Wiebe wrote:
> Actually, let's see if I can summarize this more generically... I
> realize I'm suggesting something that probably would be a massive
> undertaking, but ..
>
> Regular files are the only interface that requires an application to
> wait.  With any other case, the nonblocking interfaces are fairly
> complete and easy to work with.  If userspace could treat regular
> files in the same fashion as sockets, life would be good.
>
> I admittedly do not understand internal kernel semantics in the
> differences between a socket and a regular file.  Why couldn't we just
> have a different 'socket type' like PF_FILE or something like this?
>
> Abstracting any IO through the existing interfaces provided to sockets
> would be ideal from my perspective.  The code required to use a file
> through these interfaces would be more complex in userspace, but the
> abstraction of the current open() itself could simply be an aggregate
> of these interfaces without a nonblocking flag.
>
> It would, however, fix problems with event-based applications
> handling events from both disk and sockets.  I can't trigger disk
> read/write events in the same event handlers I use for sockets
> (i.e., poll or epoll).  I end up having two separate event handlers -
> one for disk (currently using glibc's aio thread kludge), and one for
> sockets.
>
> I'm sure this isn't a new idea.  Coming from my own development
> background that had little to do with disk, I was actually surprised
> when I first discovered that I couldn't edge-trigger disk IO through
> poll().
>
> Thoughts, comments?

Unless you're planning on rearchitecting the entire VFS lookup and
permissions code, you would basically have to fall back onto having a
pool of service threads actually perform the I/O. That can just as
easily be done today in userland.

AFAICS, syslets should give you the means to implement a more scalable
scheme, but we'll have to wait and see if/when those are ready for
kernel inclusion.

Cheers
  Trond



RE: slow open() calls and o_nonblock

2007-06-04 Thread David Schwartz

Aaron Wiebe wrote:

> David Schwartz wrote:

> > There is no way you can re-try the request. The open must
> > either succeed or
> > not return a handle. It is not like a 'read' operation that has
> > an "I didn't
> > do anything, and you can retry this request" option.

> > If 'open' returns a file handle, you can't retry it (since it
> > must succeed
> > in order to do that, failure must not return a handle). If your 'open'
> > doesn't return a file handle, you can't retry it (because,
> > without a handle,
> > there is no way to associate a future request with this one, if
> > it creates a
> > file, the file must not be created if you don't call 'open' again).

> I understand, but this is exactly the situation that I'm complaining
> about.  There is no functionality to provide a nonblocking open - no
> ability to come back around and retry a given open call.

I agree. I'm addressing why things can't "just work", not arguing that they
aren't broken or should stay broken. ;)

I think a good solution would be to re-use the 'connect' and 'shutdown'
calls. You would need a new asynchronous flag to 'open' that would mean,
*really* don't block. You would have to follow up with 'connect' to complete
the actual opening -- the 'open' would just assign a file descriptor (unless
it could complete or error immediately, of course).

To asynchronously close such a socket, you simply call 'shutdown'. Once the
'shutdown' completes, 'close' would be guaranteed not to block.

Obviously, being able to 'poll' or 'select' would be a huge plus (while an
'open' or 'close' is in progress, of course, otherwise it would always
return immediate availability).

I think this covers all the bases and the only ugly API change is an extra
'open' flag. (Which I think is unavoidable.)

> I'm speaking to my ideal world view - but any application I write
> should not have to wait for the kernel if I don't want it to.   I
> should be able to submit my request, and come back to it later as I so
> decide.

A working generic asynchronous system call interface would be the best
solution, I think. But that may be further off than just an asynchronous
file open/close interface.

> (And I did actually consider writing my own NFS client for about
> 5 minutes.)

Yeah, what a pain that would be. The obvious counter-argument to what I
propose above is that it doesn't handle reads and writes, so why bother with
a complex partial solution?

DS




RE: slow open() calls and o_nonblock

2007-06-03 Thread Albert Cahalan

David Schwartz writes:

> [Aaron Wiebe]
>
> > open("/somefile", O_WRONLY|O_NONBLOCK|O_CREAT, 0644) = 1621 <0.415147>
>
> How could they make any difference? I can't think of any
> conceivable way they could.
>
> > Now, I'm a userspace guy so I can be pretty dense, but shouldn't a
> > call with a nonblocking flag return EAGAIN if it's going to take
> > anywhere near 415ms?  Is there a way I can force opens to EAGAIN if
> > they take more than 10ms?
>
> There is no way you can re-try the request. The open must either
> succeed or not return a handle. It is not like a 'read' operation
> that has an "I didn't do anything, and you can retry this request"
> option.
>
> If 'open' returns a file handle, you can't retry it (since it must
> succeed in order to do that, failure must not return a handle).
> If your 'open' doesn't return a file handle, you can't retry it
> (because, without a handle, there is no way to associate a future
> request with this one, if it creates a file, the file must not be
> created if you don't call 'open' again).
>
> The 'open' function must, at minimum, confirm that the file exists
> (or doesn't exist and can be created, or whatever). This takes
> however long it takes on NFS.


This is not the case, though we might need to allocate a new
flag to avoid breaking things.

Let open() with O_UNCHECKED always return a file descriptor,
except perhaps when failure can be identified without doing IO.
The "real" open then proceeds in the background.


From poll() or select(), you can see that the file descriptor is not
ready for anything. Eventually it becomes ready for IO
or reports an error condition. Both select() and poll() are
capable of reporting errors. If the "real" (background) open()
fails, then the only valid operation is close(). Attempts to
do anything else get EBADFD or ESTALE.

You'll also need a background close().


Re: slow open() calls and o_nonblock

2007-06-03 Thread Bernd Eckenfels
In article <[EMAIL PROTECTED]> you wrote:
> In short, I'm distributing logs in realtime for about 600,000
> websites.  The sources of the logs (http, ftp, realmedia, etc) are
> flexible, however the base framework was built around a large cluster
> of webservers.  The output can be to several hundred thousand files
> across about two dozen filers for user consumption - some can be very
> active, some can be completely inactive.

Assuming you have multiple request log summary files, I would just run
multiple "splitters".

> You can certainly open the file, but not block on the call to do it.
> What confuses me is why the kernel would "block" for 415ms on an open
> call.  Thats an eternity to suspend a process that has to distribute
> data such as this.

Because it has to, to return the result with the given API. 

But if you had an async interface, the operation would still take that
long, and your throughput would still be limited by the opens/sec your
filers support, no?

> Except I cant very well keep 600,000 files open over NFS.  :)  Pool
> and queue, and cycle through the pool.  I've managed to achieve a
> balance in my production deployment with this method - my email was
> more of a rant after months of trying to work around a problem (caused
> by a limitation in system calls),

I agree that a unified async layer is nice from the programmers POV, but I
disagree that it would help your performance problem which is caused by NFS
and/or NetApp (and I wont blame them).

Gruss
Bernd


Re: slow open() calls and o_nonblock

2007-06-03 Thread Neil Brown
On Sunday June 3, [EMAIL PROTECTED] wrote:
> 
> You can certainly open the file, but not block on the call to do it.
> What confuses me is why the kernel would "block" for 415ms on an open
> call.  Thats an eternity to suspend a process that has to distribute
> data such as this.

Have you tried the "nocto" mount option for your NFS filesystems?

The cache-coherency rules of NFS require the client to check with the
server at each open.  If you are the sole client on this filesystem,
then you don't need the same cache-coherency, and "nocto" will tell
the NFS client not to bother checking with the server if the
information is available in cache.

This should speed up the time for open considerably.

NeilBrown


Re: slow open() calls and o_nonblock

2007-06-03 Thread Bernd Eckenfels
In article <[EMAIL PROTECTED]> you wrote:
> (ps.  having come from the socket side of the fence, its incredibly
> frustrating to be unable to poll() or epoll regular file FDs --
> Especially knowing that the kernel is translating them into a TCP
> socket to do NFS anyway.  Please add regular files to epoll and give
> me a way to do the opens in the same fasion as connects!)

You might want to use Windows? :) 

Gruss
Bernd


Re: slow open() calls and o_nonblock

2007-06-03 Thread Aaron Wiebe

Hi John, thanks for responding.  I'm using kernel 2.6.20 on a
home-grown distro.

I've responded to a few specific points inline - but as a whole,
Davide directed me to work that is being done specifically to address
these issues in the kernel, as well as a userspace implementation that
would allow me to sidestep this failing for the time being.


On 6/3/07, John Stoffel <[EMAIL PROTECTED]> wrote:


> How large are these files?  Are they all in a single directory?  How
> many files are in the directory?
>
> Ugh. Why don't you just write to a DB instead?  It sounds like you're
> writing small records, with one record to a file.  It can work, but
> when you're doing thousands per-minute, the open/close overhead is
> starting to dominate.  Can you just amortize that overhead across a
> bunch of writes instead by writing to a single file which is more
> structured for your needs?


In short, I'm distributing logs in realtime for about 600,000
websites.  The sources of the logs (http, ftp, realmedia, etc) are
flexible, however the base framework was built around a large cluster
of webservers.  The output can be to several hundred thousand files
across about two dozen filers for user consumption - some can be very
active, some can be completely inactive.


> Netapps usually scream for NFS writes and such, so it sounds to me
> that you've blown out the NVRAM cache on the box.  Can you elaborate
> more on your hardware & Network & Netapp setup?


You're totally correct here - Netapp has told us as much about our
filesystem design; we use too much RAM on the filer itself.  It's true
that the application would handle just fine if our filesystem
structure were redesigned - I am approaching this from an application
perspective though.  These units are capable of the raw IO; it's the
simple fact that open calls are taking a while.  If I were to thread
off the application (which Davide has been kind enough to provide some
libraries which will make that substantially easier), the problem
wouldn't exist.


> The problem is that O_NONBLOCK on files open doesn't make sense.  You
> either open it, or you don't.  How long it takes to complete isn't part
> of the spec.


You can certainly open the file, but not block on the call to do it.
What confuses me is why the kernel would "block" for 415ms on an open
call.  That's an eternity to suspend a process that has to distribute
data such as this.


> But in this case, I think you're doing something hokey with your data
> design.  You should be opening just a handful of files and then
> streaming your writes to those files.   You'll get much more
> performance.


Except I can't very well keep 600,000 files open over NFS.  :)  Pool
and queue, and cycle through the pool.  I've managed to achieve a
balance in my production deployment with this method - my email was
more of a rant after months of trying to work around a problem (caused
by a limitation in system calls), only to have it present an order of
magnitude worse than I expected.  Sorry for not giving more
information off the line - and thanks for your time.

-Aaron


Re: slow open() calls and o_nonblock

2007-06-03 Thread Al Viro
On Sun, Jun 03, 2007 at 05:27:06PM -0700, David Schwartz wrote:
> 
> > Now, Netapp speed aside, O_NONBLOCK and O_DIRECT seem to make zero
> > difference to my open times.  Example:
> >
> > open("/somefile", O_WRONLY|O_NONBLOCK|O_CREAT, 0644) = 1621 <0.415147>
 
> The 'open' function must, at minimum, confirm that the file exists (or
> doesn't exist and can be created, or whatever). This takes however long it
> takes on NFS.
> 
> You need either threads or a working asynchronous system call interface.
> Short of that, you need your own NFS client code.

BTW, why close these suckers all the time?  It's not that kernel would
be unable to hold thousands of open descriptors for your process...
Hash descriptors by pathname and be done with that; don't bother with
close unless you decide that you've got too many of them (e.g. when you
get a hash conflict).


RE: slow open() calls and o_nonblock

2007-06-03 Thread David Schwartz

> Now, Netapp speed aside, O_NONBLOCK and O_DIRECT seem to make zero
> difference to my open times.  Example:
>
> open("/somefile", O_WRONLY|O_NONBLOCK|O_CREAT, 0644) = 1621 <0.415147>

How could they make any difference? I can't think of any conceivable way
they could.

> Now, I'm a userspace guy so I can be pretty dense, but shouldn't a
> call with a nonblocking flag return EAGAIN if its going to take
> anywhere near 415ms?  Is there a way I can force opens to EAGAIN if
> they take more than 10ms?

There is no way you can re-try the request. The open must either succeed or
not return a handle. It is not like a 'read' operation that has an "I didn't
do anything, and you can retry this request" option.

If 'open' returns a file handle, you can't retry it (since it must succeed
in order to do that, failure must not return a handle). If your 'open'
doesn't return a file handle, you can't retry it (because, without a handle,
there is no way to associate a future request with this one, if it creates a
file, the file must not be created if you don't call 'open' again).

The 'open' function must, at minimum, confirm that the file exists (or
doesn't exist and can be created, or whatever). This takes however long it
takes on NFS.

You need either threads or a working asynchronous system call interface.
Short of that, you need your own NFS client code.

DS




Re: slow open() calls and o_nonblock

2007-06-03 Thread John Stoffel
> "Aaron" == Aaron Wiebe <[EMAIL PROTECTED]> writes:

More details on which kernel you're using and which distro would be
helpful.  Also, more details on your App and reasons why you're
writing bunches of small files would help as well. 

Aaron> Greetings all.  I'm not on this list, so I apologize if this subject
Aaron> has been covered before.  (Also, please cc me in the response.)

Aaron> I've spent the last several months trying to work around the lack of a
Aaron> decent disk AIO interface.  I'm starting to wonder if one exists
Aaron> anywhere.  The short version:

Aaron> I have written a daemon that needs to open several thousand
Aaron> files a minute and write a small amount of data to each file.

How large are these files?  Are they all in a single directory?  How
many files are in the directory? 

Ugh. Why don't you just write to a DB instead?  It sounds like you're
writing small records, with one record to a file.  It can work, but
when you're doing thousands per-minute, the open/close overhead is
starting to dominate.  Can you just amortize that overhead across a
bunch of writes instead by writing to a single file which is more
structured for your needs?  

Aaron> After extensive research, I ended up going with the POSIX AIO
Aaron> kludgy pthreads wrapper in glibc to handle my writes due to the
Aaron> time constraints of writing my own pthreads handler into the
Aaron> application.

Aaron> The problem with this equation is that opens, closes and
Aaron> non-readwrite operations (fchmod, fcntl, etc) have no interface
Aaron> in posix aio.  Now I was under the assumption that given open
Aaron> and close operations are comparatively less common than the
Aaron> write operations, this wouldn't be a huge problem.  My tests
Aaron> seemed to reflect that.

Aaron> I went to production with this yesterday to discover that under
Aaron> production load, our filesystems (nfs on netapps) were
Aaron> substantially slower than I was expecting.  open() calls are
Aaron> taking upwards of 2 seconds on occation, and usually ~20ms.

Netapps usually scream for NFS writes and such, so it sounds to me
that you've blown out the NVRAM cache on the box.  Can you elaborate
more on your hardware & Network & Netapp setup?  

Of course, you could also be using sucky NFS configuration, so we need
to see your mount options as well.  You are using TCP and NFSv3,
right?  And a large wsize/rsize values too?  

Have you also checked your NetApp to make sure you have the following
options turned OFF:

nfs.per_client_stats.enable
nfs.mountd_trace

Seeing your exports file and output of 'options nfs' would help.

Aaron> Now, Netapp speed aside, O_NONBLOCK and O_DIRECT seem to make
Aaron> zero difference to my open times.  Example:

Aaron> open("/somefile", O_WRONLY|O_NONBLOCK|O_CREAT, 0644) = 1621 <0.415147>

Aaron> Now, I'm a userspace guy so I can be pretty dense, but
Aaron> shouldn't a call with a nonblocking flag return EAGAIN if its
Aaron> going to take anywhere near 415ms?  Is there a way I can force
Aaron> opens to EAGAIN if they take more than 10ms?

The problem is that O_NONBLOCK on files open doesn't make sense.  You
either open it, or you don't.  How long it takes to complete isn't part
of the spec.

But in this case, I think you're doing something hokey with your data
design.  You should be opening just a handful of files and then
streaming your writes to those files.   You'll get much more
performance.

Also, have you tried writing to a local disk instead of via NFS to see
how local disk speed is?  

Aaron> (ps.  having come from the socket side of the fence, its
Aaron> incredibly frustrating to be unable to poll() or epoll regular
Aaron> file FDs -- Especially knowing that the kernel is translating
Aaron> them into a TCP socket to do NFS anyway.  Please add regular
Aaron> files to epoll and give me a way to do the opens in the same
Aaron> fasion as connects!)

epoll isn't going to help you much here, it's the open which is
causing the delay, not the writing to the file itself.

Maybe you need to be caching more of your writes into memory on the
client side, and then streaming them to the NetApp later on when you
know you can write a bunch of data at once.

But honestly, I think you've done a bad job architecting your
application's backend data store and you really need to re-think it
through.  

Heck, I'm not even much of a programmer, I'm a SysAdmin who runs
Netapps and talks the users into more sane ways of getting better
performance out of their applications.  *grin*.

John


Re: slow open() calls and o_nonblock

2007-06-03 Thread Davide Libenzi
On Sun, 3 Jun 2007, Aaron Wiebe wrote:

> (ps.  having come from the socket side of the fence, its incredibly
> frustrating to be unable to poll() or epoll regular file FDs --
> Especially knowing that the kernel is translating them into a TCP
> socket to do NFS anyway.  Please add regular files to epoll and give
> me a way to do the opens in the same fasion as connects!)

You may want to follow Ingo and Zach work on syslets/threadlets. If that 
goes in, you can make *any* syscall asynchronous.
I ended up writing a userspace library, to cover the same exact problem 
you have:

http://www.xmailserver.org/guasi.html

I basically host an epoll_wait (containing all my sockets, pipes, etc) 
inside a GUASI async request, where other non-pollable async requests are 
hosted. So guasi_fetch() becomes my main event collector, and when the 
epoll_wait async request shows up, I handle all the events in there.
This is a *very-trivial* HTTP server using such solution (coroutines, 
epoll and GUASI):

http://www.xmailserver.org/cghttpd-home.html



- Davide




Re: slow open() calls and o_nonblock

2007-06-03 Thread Davide Libenzi
On Sun, 3 Jun 2007, Aaron Wiebe wrote:

 (ps.  having come from the socket side of the fence, its incredibly
 frustrating to be unable to poll() or epoll regular file FDs --
 Especially knowing that the kernel is translating them into a TCP
 socket to do NFS anyway.  Please add regular files to epoll and give
 me a way to do the opens in the same fasion as connects!)

You may want to follow Ingo and Zach work on syslets/threadlets. If that 
goes in, you can make *any* syscall asynchronous.
I ended up writing a userspace library, to cover the same exact problem 
you have:

http://www.xmailserver.org/guasi.html

I basically host an epoll_wait (containing all my sockets, pipes, etc) 
inside a GUASI async request, where other non-pollable async requests are 
hosted. So guasi_fetch() becomes my main event collector, and when the 
epoll_wait async request show up, I handle all the events in there.
This is a *very-trivial* HTTP server using such solution (coroutines, 
epoll and GUASI):

http://www.xmailserver.org/cghttpd-home.html



- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: slow open() calls and o_nonblock

2007-06-03 Thread John Stoffel
 Aaron == Aaron Wiebe [EMAIL PROTECTED] writes:

More details on which kernel you're using and which distro would be
helpful.  Also, more details on your App and reasons why you're
writing bunches of small files would help as well. 

Aaron Greetings all.  I'm not on this list, so I apologize if this subject
Aaron has been covered before.  (Also, please cc me in the response.)

Aaron I've spent the last several months trying to work around the lack of a
Aaron decent disk AIO interface.  I'm starting to wonder if one exists
Aaron anywhere.  The short version:

Aaron I have written a daemon that needs to open several thousand
Aaron files a minute and write a small amount of data to each file.

How large are these files?  Are they all in a single directory?  How
many files are in the directory? 

Ugh. Why don't you just write to a DB instead?  It sounds like you're
writing small records, with one record to a file.  It can work, but
when you're doing thousands per-minute, the open/close overhead is
starting to dominate.  Can you just amortize that overhead across a
bunch of writes instead by writing to a single file which is more
structured for your needs?  

Aaron After extensive research, I ended up going with the POSIX AIO
Aaron kludgy pthreads wrapper in glibc to handle my writes due to the
Aaron time constraints of writing my own pthreads handler into the
Aaron application.

Aaron The problem with this equation is that opens, closes and
Aaron non-readwrite operations (fchmod, fcntl, etc) have no interface
Aaron in posix aio.  Now I was under the assumption that given open
Aaron and close operations are comparatively less common than the
Aaron write operations, this wouldn't be a huge problem.  My tests
Aaron seemed to reflect that.

Aaron I went to production with this yesterday to discover that under
Aaron production load, our filesystems (nfs on netapps) were
Aaron substantially slower than I was expecting.  open() calls are
Aaron taking upwards of 2 seconds on occation, and usually ~20ms.

Netapps usually scream for NFS writes and such, so it sounds to me
that you've blown out the NVRAM cache on the box.  Can you elaborate
more on your hardware  Network  Netapp setup?  

Of course, you could also be using sucky NFS configuration, so we need
to see your mount options as well.  You are using TCP and NFSv3,
right?  And a large wsize/rsize values too?  

Have you also checked your NetApp to make sure you have the following
options turned OFF:

nfs.per_client_stats.enable
nfs.mountd_trace

Seeing your exports file and output of 'options nfs' would help.

Aaron Now, Netapp speed aside, O_NONBLOCK and O_DIRECT seem to make
Aaron zero difference to my open times.  Example:

Aaron open(/somefile, O_WRONLY|O_NONBLOCK|O_CREAT, 0644) = 1621 0.415147

Aaron Now, I'm a userspace guy so I can be pretty dense, but
Aaron shouldn't a call with a nonblocking flag return EAGAIN if its
Aaron going to take anywhere near 415ms?  Is there a way I can force
Aaron opens to EAGAIN if they take more than 10ms?

The problem is that O_NONBLOCK on files open doesn't make sense.  You
either open it, or you don't.  How long it takes to comlete isn't part
of the spec.

But in this case, I think you're doing something hokey with your data
design.  You should be opening just a handful of files and then
streaming your writes to those files.   You'll get much more
performance.

Also, have you tried writing to a local disk instead of via NFS to see
how local disk speed is?  

Aaron (ps.  having come from the socket side of the fence, its
Aaron incredibly frustrating to be unable to poll() or epoll regular
Aaron file FDs -- Especially knowing that the kernel is translating
Aaron them into a TCP socket to do NFS anyway.  Please add regular
Aaron files to epoll and give me a way to do the opens in the same
Aaron fasion as connects!)

epoll isn't going to help you much here, it's the open which is
causing the delay, not the writing to the file itself.

Maybe you need to be caching more of your writes into memory on the
client side, and then streaming them to the NetApp later on when you
know you can write a bunch of data at once.

But honestly, I think you've done a bad job architecting your
application's backend data store and you really need to re-think it
through.  

Heck, I'm not even much of a programmer, I'm a SysAdmin who runs
Netapps and talks the users into more sane ways of getting better
performance out of their applications.  *grin*.

John
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: slow open() calls and o_nonblock

2007-06-03 Thread David Schwartz

 Now, Netapp speed aside, O_NONBLOCK and O_DIRECT seem to make zero
 difference to my open times.  Example:

 open(/somefile, O_WRONLY|O_NONBLOCK|O_CREAT, 0644) = 1621 0.415147

How could they make any difference? I can't think of any conceivable way
they could.

 Now, I'm a userspace guy so I can be pretty dense, but shouldn't a
 call with a nonblocking flag return EAGAIN if it's going to take
 anywhere near 415ms?  Is there a way I can force opens to EAGAIN if
 they take more than 10ms?

There is no way you can re-try the request. The open must either succeed or
not return a handle. It is not like a 'read' operation that has an "I didn't
do anything, and you can retry this request" option.

If 'open' returns a file handle, you can't retry it (since it must succeed
in order to do that, failure must not return a handle). If your 'open'
doesn't return a file handle, you can't retry it (because, without a handle,
there is no way to associate a future request with this one; if it creates a
file, the file must not be created if you don't call 'open' again).

The 'open' function must, at minimum, confirm that the file exists (or
doesn't exist and can be created, or whatever). This takes however long it
takes on NFS.

You need either threads or a working asynchronous system call interface.
Short of that, you need your own NFS client code.
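The threads option can be sketched in a few lines (an illustrative
Python sketch, not a recommendation of a specific design; the helper name
open_async is made up for the example):

```python
# Sketch: offload a potentially slow open() to a worker thread so the
# main loop never blocks on it.  The path here is a temporary file
# created just for the demonstration.
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=4)

def open_async(path, flags, mode=0o644):
    """Submit the blocking open() to a worker; returns a Future yielding an fd."""
    return pool.submit(os.open, path, flags, mode)

path = os.path.join(tempfile.mkdtemp(), "somefile")
future = open_async(path, os.O_WRONLY | os.O_CREAT)
# ... the main loop keeps running; collect the descriptor when ready ...
fd = future.result()
os.write(fd, b"log line\n")
os.close(fd)
```

The open still takes as long as NFS takes; the win is only that the
caller's event loop is not suspended while it happens.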

DS




Re: slow open() calls and o_nonblock

2007-06-03 Thread Al Viro
On Sun, Jun 03, 2007 at 05:27:06PM -0700, David Schwartz wrote:
 
  Now, Netapp speed aside, O_NONBLOCK and O_DIRECT seem to make zero
  difference to my open times.  Example:
 
  open(/somefile, O_WRONLY|O_NONBLOCK|O_CREAT, 0644) = 1621 0.415147
 
 The 'open' function must, at minimum, confirm that the file exists (or
 doesn't exist and can be created, or whatever). This takes however long it
 takes on NFS.
 
 You need either threads or a working asynchronous system call interface.
 Short of that, you need your own NFS client code.

BTW, why close these suckers all the time?  It's not as if the kernel would
be unable to hold thousands of open descriptors for your process...
Hash descriptors by pathname and be done with it; don't bother with
close unless you decide that you've got too many of them (e.g. when you
get a hash conflict).
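A minimal sketch of that descriptor cache (Python for brevity; the limit
and the arbitrary-eviction policy are illustrative choices, not part of
the suggestion above):

```python
# Sketch: keep descriptors open, keyed by pathname, and only close one
# when the cache grows past a limit.  Repeated get() calls for the same
# path return the same fd, so the slow open() happens once per file.
import os

class FDCache:
    def __init__(self, limit=1024):
        self.limit = limit
        self.fds = {}          # pathname -> open descriptor

    def get(self, path):
        fd = self.fds.get(path)
        if fd is None:
            if len(self.fds) >= self.limit:
                old_path, old_fd = self.fds.popitem()   # evict an arbitrary entry
                os.close(old_fd)
            fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
            self.fds[path] = fd
        return fd
```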


Re: slow open() calls and o_nonblock

2007-06-03 Thread Aaron Wiebe

Hi John, thanks for responding.  I'm using kernel 2.6.20 on a
home-grown distro.

I've responded to a few specific points inline - but as a whole,
Davide directed me to work that is being done specifically to address
these issues in the kernel, as well as a userspace implementation that
would allow me to sidestep this failing for the time being.


On 6/3/07, John Stoffel [EMAIL PROTECTED] wrote:


How large are these files?  Are they all in a single directory?  How
many files are in the directory?

Ugh. Why don't you just write to a DB instead?  It sounds like you're
writing small records, with one record to a file.  It can work, but
when you're doing thousands per-minute, the open/close overhead is
starting to dominate.  Can you just amortize that overhead across a
bunch of writes instead by writing to a single file which is more
structured for your needs?


In short, I'm distributing logs in realtime for about 600,000
websites.  The sources of the logs (http, ftp, realmedia, etc) are
flexible, however the base framework was built around a large cluster
of webservers.  The output can be to several hundred thousand files
across about two dozen filers for user consumption - some can be very
active, some can be completely inactive.


Netapps usually scream for NFS writes and such, so it sounds to me
that you've blown out the NVRAM cache on the box.  Can you elaborate
more on your hardware  Network  Netapp setup?


You're totally correct here - Netapp has told us as much about our
filesystem design - we use too much RAM on the filer itself.  It's true
that the application would handle just fine if our filesystem
structure were redesigned - I am approaching this from an application
perspective though.  These units are capable of the raw IO; it's the
simple fact that open calls are taking a while.  If I were to thread
off the application (which Davide has been kind enough to provide some
libraries which will make that substantially easier), the problem
wouldn't exist.


The problem is that O_NONBLOCK on file opens doesn't make sense.  You
either open it, or you don't.  How long it takes to complete isn't part
of the spec.


You can certainly open the file, but not block on the call to do it.
What confuses me is why the kernel would block for 415ms on an open
call.  That's an eternity to suspend a process that has to distribute
data such as this.


But in this case, I think you're doing something hokey with your data
design.  You should be opening just a handful of files and then
streaming your writes to those files.  You'll get much better
performance.


Except I can't very well keep 600,000 files open over NFS.  :)  Pool
and queue, and cycle through the pool.  I've managed to achieve a
balance in my production deployment with this method - my email was
more of a rant after months of trying to work around a problem (caused
by a limitation in system calls), only to have it present an order of
magnitude worse than I expected.  Sorry for not giving more
information off the line - and thanks for your time.

-Aaron


Re: slow open() calls and o_nonblock

2007-06-03 Thread Bernd Eckenfels
In article [EMAIL PROTECTED] you wrote:
 (ps.  having come from the socket side of the fence, it's incredibly
 frustrating to be unable to poll() or epoll regular file FDs --
 especially knowing that the kernel is translating them into a TCP
 socket to do NFS anyway.  Please add regular files to epoll and give
 me a way to do the opens in the same fashion as connects!)

You might want to use Windows? :) 

Gruss
Bernd


Re: slow open() calls and o_nonblock

2007-06-03 Thread Neil Brown
On Sunday June 3, [EMAIL PROTECTED] wrote:
 
 You can certainly open the file, but not block on the call to do it.
 What confuses me is why the kernel would block for 415ms on an open
 call.  That's an eternity to suspend a process that has to distribute
 data such as this.

Have you tried the nocto mount option for your NFS filesystems?

The cache-coherency rules of NFS require the client to check with the
server at each open.  If you are the sole client on this filesystem,
then you don't need the same cache-coherency, and nocto will tell
the NFS client not to bother checking with the server if the information
is available in cache.

This should speed up the time for open considerably.
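For reference, nocto goes alongside the other NFS mount options, e.g. in
/etc/fstab (the server name, export path, and transfer sizes below are
placeholders, not values from this thread):

```
filer:/vol/logs  /mnt/logs  nfs  nocto,tcp,nfsvers=3,rsize=32768,wsize=32768  0 0
```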

NeilBrown


Re: slow open() calls and o_nonblock

2007-06-03 Thread Bernd Eckenfels
In article [EMAIL PROTECTED] you wrote:
 In short, I'm distributing logs in realtime for about 600,000
 websites.  The sources of the logs (http, ftp, realmedia, etc) are
 flexible, however the base framework was build around a large cluster
 of webservers.  The output can be to several hundred thousand files
 across about two dozen filers for user consumption - some can be very
 active, some can be completely inactive.

Assuming you have multiple request log summary files, I would just run
multiple splitters.

 You can certainly open the file, but not block on the call to do it.
 What confuses me is why the kernel would block for 415ms on an open
 call.  That's an eternity to suspend a process that has to distribute
 data such as this.

Because it has to, to return the result with the given API. 

But if you had an async interface, the operation would still take that
long, and your throughput would still be limited by the opens/sec your
filers support, no?

 Except I cant very well keep 600,000 files open over NFS.  :)  Pool
 and queue, and cycle through the pool.  I've managed to achieve a
 balance in my production deployment with this method - my email was
 more of a rant after months of trying to work around a problem (caused
 by a limitation in system calls),

I agree that a unified async layer is nice from the programmers POV, but I
disagree that it would help your performance problem which is caused by NFS
and/or NetApp (and I won't blame them).

Gruss
Bernd


RE: slow open() calls and o_nonblock

2007-06-03 Thread Albert Cahalan

David Schwartz writes:

[Aaron Wiebe]



open(/somefile, O_WRONLY|O_NONBLOCK|O_CREAT, 0644) = 1621 0.415147


How could they make any difference? I can't think of any
conceivable way they could.


Now, I'm a userspace guy so I can be pretty dense, but shouldn't a
call with a nonblocking flag return EAGAIN if it's going to take
anywhere near 415ms?  Is there a way I can force opens to EAGAIN if
they take more than 10ms?


There is no way you can re-try the request. The open must either
succeed or not return a handle. It is not like a 'read' operation
that has an "I didn't do anything, and you can retry this request"
option.

If 'open' returns a file handle, you can't retry it (since it must
succeed in order to do that, failure must not return a handle).
If your 'open' doesn't return a file handle, you can't retry it
(because, without a handle, there is no way to associate a future
request with this one; if it creates a file, the file must not be
created if you don't call 'open' again).

The 'open' function must, at minimum, confirm that the file exists
(or doesn't exist and can be created, or whatever). This takes
however long it takes on NFS.


This is not the case, though we might need to allocate a new
flag to avoid breaking things.

Let open() with O_UNCHECKED always return a file descriptor,
except perhaps when failure can be identified without doing IO.
The real open then proceeds in the background.


From poll() or select(), you can see that the file descriptor is not
ready for anything. Eventually it becomes ready for IO
or reports an error condition. Both select() and poll() are
capable of reporting errors. If the real (background) open()
fails, then the only valid operation is close(). Attempts to
do anything else get EBADFD or ESTALE.

You'll also need a background close().
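To make the proposal concrete, a caller might use it like this
(pseudocode only -- O_UNCHECKED is the hypothetical flag proposed above,
not a real kernel interface):

```
fd = open(path, O_WRONLY | O_CREAT | O_UNCHECKED, 0644);  /* returns at once */
poll for POLLOUT on fd;
if (revents & POLLERR)          /* background open failed */
        close(fd);
else if (revents & POLLOUT)     /* background open completed */
        write(fd, buf, len);
```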