IPv4 established connection paradox
I apologize for bringing this to the list, but after several hours of Google, testing, and so forth, I'm coming up blank. I've been unable to reproduce this state in targeted tests; however, the application itself gets into it on a semi-regular basis.

I currently have an application with the following state.

From netstat:

    tcp    0    0 127.0.0.1:51115    127.0.0.1:51115    ESTABLISHED 46965/python2.6

From lsof:

    python2.6 46965 root 14u IPv4 11218239 0t0 TCP localhost:51115->localhost:51115 (ESTABLISHED)

The application is blocked in recvfrom() on that socket.

I'm not looking for any specific assistance except with the basic question: how is it possible to get into this state? In my tests, binding an outgoing connection to the same port as a listener isn't possible (no surprise there), and if that's the case, how is it possible to ever have an established connection to... the same socket?

This is effectively blocking binds to the port (which is actually used by another application most of the time). The application would normally connect to this port to status-check the running service. In this case, the service is unable to start because of this state.

This is an older RHEL 6.6 kernel (2.6.32-431), but if this is a bug, I can't find any mention of it anywhere. And if it's not, I'm totally confused and hoping someone can explain it to me.

(Please cc me in response)

-Aaron

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
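[Editorial note: one known way such a state can arise, sketched here as an illustration rather than a claim about this particular application, is a TCP self-connect. If connect() targets a port inside the local ephemeral range with no listener on it, the kernel may assign that same port as the source, and TCP's simultaneous-open path then completes the handshake against itself, yielding exactly the ESTABLISHED 51115->51115 state shown above. The port number and attempt count below are assumptions; the port must lie inside /proc/sys/net/ipv4/ip_local_port_range.]

```python
import socket

# Sketch of a TCP self-connect: repeatedly connect() to an ephemeral-range
# port with no listener.  Most attempts fail with ECONNREFUSED, but if the
# kernel picks the destination port as the source port, the simultaneous-open
# path succeeds and the socket is ESTABLISHED to itself.
def try_self_connect(port=50001, attempts=20000):
    for _ in range(attempts):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            s.connect(("127.0.0.1", port))
        except OSError:
            s.close()  # ECONNREFUSED: source port didn't match, try again
            continue
        if s.getsockname() == s.getpeername():
            return s   # self-connected: local and remote are the same socket
        s.close()      # a real listener answered; not the paradox
        return None
    return None
```

A recvfrom() on such a socket blocks forever unless the process writes to it, since the only peer that could send data is the socket itself.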
Re: slow open() calls and o_nonblock
Actually, let's see if I can summarize this more generically. I realize I'm suggesting something that would probably be a massive undertaking, but...

Regular files are the only interface that requires an application to wait. In every other case, the nonblocking interfaces are fairly complete and easy to work with. If userspace could treat regular files in the same fashion as sockets, life would be good.

I admittedly do not understand the internal kernel semantics of the differences between a socket and a regular file. Why couldn't we just have a different 'socket type', like PF_FILE or something along those lines? Abstracting all IO through the existing interfaces provided for sockets would be ideal from my perspective. The code required to use a file through these interfaces would be more complex in userspace, but the current blocking open() could simply be implemented as an aggregate of these interfaces without a nonblocking flag.

It would, however, fix the problems event-based applications have handling events from both disk and sockets. I can't trigger disk read/write events in the same event handlers I use for sockets (ie, poll or epoll). I end up having two separate event handlers - one for disk (currently using glibc's aio thread kludge), and one for sockets.

I'm sure this isn't a new idea. Coming from my own development background, which had little to do with disk, I was actually surprised when I first discovered that I couldn't edge-trigger disk IO through poll().

Thoughts, comments?

-Aaron

On 6/4/07, Aaron Wiebe <[EMAIL PROTECTED]> wrote:
> On 6/4/07, Trond Myklebust <[EMAIL PROTECTED]> wrote:
> > So exactly how would you expect a nonblocking open to work? Should it be
> > starting I/O? What if that involves blocking? How would you know when to
> > try again?
>
> Well, there's a bunch of options - some have been suggested in the
> thread already.
>
> The idea of an open with O_NONBLOCK (or a different flag) returning a
> handle immediately, and subsequent calls returning EAGAIN if the open is
> incomplete, or ESTALE if it fails (with some auxiliary method of getting
> the reason why it failed), is not too far a stretch from my perspective.
>
> The other option that comes to mind would be to add an interface that
> behaves like sockets - get a handle from one system call, set it
> nonblocking using fcntl, and use another call to attach it to a regular
> file. This method would make the most sense to me - but that's also
> because I've worked with sockets far, far more than with regular files.
>
> The one that would take the least amount of work from the application
> perspective would be to simply reply to the nonblocking open call with
> EAGAIN (or something), and when an open on the same file is performed
> later, the kernel could have performed its work in the background. I can
> understand, given that no handle is provided to the application, that
> this idea could be sloppy.
>
> I'm still getting caught up on some of the other suggestions (I'm
> currently reading about the syslets work that Zach and Ingo are doing),
> and it sounds like this is a common complaint that is being addressed
> through a number of initiatives. I'm looking forward to seeing where
> that work goes.
>
> -Aaron
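[Editorial note: the epoll complaint above is easy to show concretely. This small sketch (illustrative, not the poster's code) registers two descriptors with epoll on Linux: the pipe is accepted, while epoll_ctl() rejects the regular file with EPERM, which is why disk IO cannot share the socket event loop.]

```python
import os
import select
import tempfile

# epoll accepts pipes and sockets, but epoll_ctl() returns EPERM for a
# regular-file descriptor, so regular files can't join the event loop.
def can_epoll(fd):
    ep = select.epoll()
    try:
        ep.register(fd, select.EPOLLIN)
        return True
    except PermissionError:  # EPERM from epoll_ctl()
        return False
    finally:
        ep.close()

r, w = os.pipe()
tmp = tempfile.NamedTemporaryFile()
print(can_epoll(r))             # pipe: accepted
print(can_epoll(tmp.fileno()))  # regular file: rejected with EPERM
```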
Re: slow open() calls and o_nonblock
On 6/4/07, Trond Myklebust <[EMAIL PROTECTED]> wrote:
> So exactly how would you expect a nonblocking open to work? Should it be
> starting I/O? What if that involves blocking? How would you know when to
> try again?

Well, there's a bunch of options - some have been suggested in the thread already.

The idea of an open with O_NONBLOCK (or a different flag) returning a handle immediately, and subsequent calls returning EAGAIN if the open is incomplete, or ESTALE if it fails (with some auxiliary method of getting the reason why it failed), is not too far a stretch from my perspective.

The other option that comes to mind would be to add an interface that behaves like sockets - get a handle from one system call, set it nonblocking using fcntl, and use another call to attach it to a regular file. This method would make the most sense to me - but that's also because I've worked with sockets far, far more than with regular files.

The one that would take the least amount of work from the application perspective would be to simply reply to the nonblocking open call with EAGAIN (or something), and when an open on the same file is performed later, the kernel could have performed its work in the background. I can understand, given that no handle is provided to the application, that this idea could be sloppy.

I'm still getting caught up on some of the other suggestions (I'm currently reading about the syslets work that Zach and Ingo are doing), and it sounds like this is a common complaint that is being addressed through a number of initiatives. I'm looking forward to seeing where that work goes.

-Aaron
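[Editorial note: for contrast, this is the socket-world pattern the second proposal above wants open() to mirror - a handle exists first, the operation is started nonblocking, and completion is discovered via poll() rather than by sleeping inside the system call. A sketch; the function name and timeout are mine.]

```python
import errno
import os
import select
import socket

# Nonblocking connect: start the operation, then poll the fd for
# writability to learn when it completes, instead of blocking in connect().
def connect_nonblocking(host, port, timeout_ms=1000):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setblocking(False)
    err = s.connect_ex((host, port))  # returns immediately
    if err not in (0, errno.EINPROGRESS):
        s.close()
        raise OSError(err, os.strerror(err))
    p = select.poll()
    p.register(s.fileno(), select.POLLOUT)  # writable == connect finished
    if not p.poll(timeout_ms):
        s.close()
        raise TimeoutError("connect did not complete in time")
    err = s.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
    if err:
        s.close()
        raise OSError(err, os.strerror(err))
    return s
```

The asymmetry in the thread is exactly that no such "come back later" step exists between asking for a regular file and having a usable descriptor.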
Re: slow open() calls and o_nonblock
Sorry for the unthreaded responses - I wasn't cc'd here, so I'm replying to these based on the mailing list archives.

Al Viro wrote:
> BTW, why close these suckers all the time? It's not that kernel would
> be unable to hold thousands of open descriptors for your process...
> Hash descriptors by pathname and be done with that; don't bother with
> close unless you decide that you've got too many of them (e.g. when you
> get a hash conflict).

A valid point - I currently keep a pool of 4000 descriptors open and cycle them out based on inactivity. I hadn't seriously considered just keeping them all open, because I simply wasn't sure how well things would go with 100,000 files open. Would my backend storage keep up? Would the kernel mind maintaining 100,000 open files over NFS?

The majority of the files would simply be idle - I would be keeping file handles open for no reason. Pooling allows me to substantially drop the number of opens I require, but I am hesitant to grow the pool to substantially higher numbers. Can anyone shed light on any issues that may come up with a massive pool size, such as 128k?

-Aaron
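[Editorial note: the cycle-out-by-inactivity pool described above can be sketched as a small LRU cache of descriptors keyed by pathname. This is an illustration under my own naming, not the actual daemon's code; flags and the pool size are assumptions.]

```python
import collections
import os

# An LRU pool of open file descriptors: reuse an fd if the path is already
# open, otherwise open it, evicting the least-recently-used fd when full.
class FdPool:
    def __init__(self, max_open=4000):
        self.max_open = max_open
        self.fds = collections.OrderedDict()  # path -> fd, oldest first

    def get(self, path):
        fd = self.fds.pop(path, None)
        if fd is None:
            if len(self.fds) >= self.max_open:
                _, old_fd = self.fds.popitem(last=False)  # evict LRU entry
                os.close(old_fd)
            fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self.fds[path] = fd  # re-insert as most recently used
        return fd

    def close_all(self):
        for fd in self.fds.values():
            os.close(fd)
        self.fds.clear()
```

With a structure like this, growing the pool is a one-line change, which is why the per-descriptor kernel and server cost is the real question.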
Re: slow open() calls and o_nonblock
On 6/4/07, John Stoffel <[EMAIL PROTECTED]> wrote:
> So how many files are in the directory where you're seeing the delays?
> And what's the average size of the files in there?

The directories themselves will have a maximum of 160 files, and the files are maybe a few megs each. The delays are (as you pointed out earlier) due to the RAM restrictions and our filesystem design of very deep directory structures, which Netapps are bad at.

My point is more generic, though - I will come up with ways to handle this problem in my application (probably with threads), but I'm griping about the lack of a kernel interface that would have allowed me to avoid it.

-Aaron
Re: slow open() calls and o_nonblock
Replying to David Schwartz here. (David, good to hear from you again - haven't seen you around since the IRC days :))

David Schwartz wrote:
> There is no way you can re-try the request. The open must either succeed
> or not return a handle. It is not like a 'read' operation that has an
> "I didn't do anything, and you can retry this request" option. If 'open'
> returns a file handle, you can't retry it (since it must succeed in
> order to do that, failure must not return a handle). If your 'open'
> doesn't return a file handle, you can't retry it (because, without a
> handle, there is no way to associate a future request with this one; if
> it creates a file, the file must not be created if you don't call 'open'
> again).

I understand, but this is exactly the situation I'm complaining about: there is no functionality to provide a nonblocking open, no ability to come back around and retry a given open call.

> You need either threads or a working asynchronous system call interface.
> Short of that, you need your own NFS client code.

This is exactly my point - there is no asynchronous system call to do this work, to my knowledge. I will likely fix this in my own code using threads, but I see using threads in this case as working around that missing system interface. Threads, imho, should be limited to cases where I'm using them to distribute load across multiple processors, not cases where the kernel's IO interfaces cannot support nonblocking calls.

I'm speaking to my ideal world view - but any application I write should not have to wait for the kernel if I don't want it to. I should be able to submit my request and come back to it later as I see fit. (And I did actually consider writing my own NFS client, for about 5 minutes.)

Thanks for the response!

-Aaron
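[Editorial note: the threads workaround conceded above usually takes this shape - blocking open() calls are pushed onto a small worker pool, and the caller gets a future it can check from its event loop. A sketch in modern Python; the pool size and names are mine.]

```python
import os
from concurrent.futures import ThreadPoolExecutor

# Hide blocking open() calls behind a worker pool: the caller submits the
# open and collects the descriptor later, which is the "retry with a
# handle" semantics the nonblocking-open discussion is missing.
_openers = ThreadPoolExecutor(max_workers=8)

def open_async(path, flags, mode=0o644):
    """Start an open() without blocking the caller; returns a Future of the fd."""
    return _openers.submit(os.open, path, flags, mode)
```

Usage: `fut = open_async(path, os.O_WRONLY | os.O_CREAT)`, then later `fut.done()` to poll or `fut.result()` to collect the descriptor.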
Re: slow open() calls and o_nonblock
On 6/4/07, Alan Cox <[EMAIL PROTECTED]> wrote:
> > Now, I'm a userspace guy so I can be pretty dense, but shouldn't a
> > call with a nonblocking flag return EAGAIN if its going to take
> > anywhere near 415ms?
>
> Violation of causality. We don't know it will block for 415ms until
> 415ms have elapsed.

Understood - but what I'm getting at is that there doesn't appear to be any real implementation of a nonblocking open(). On the socket side of the fence, I would consider a regular-file open() to be the equivalent of a connect() call - the difference obviously being that we already have a handle for the socket. The end result, however, is roughly the same: we have a file descriptor with the endpoint established.

In the socket world, we assume that a nonblocking request will always return immediately, and the application is expected to come back around and see whether the request has completed. Regular files have no equivalent.

-Aaron
Re: slow open() calls and o_nonblock
On 6/3/07, Neil Brown <[EMAIL PROTECTED]> wrote:
> Have you tried the "nocto" mount option for your NFS filesystems?
> The cache-coherency rules of NFS require the client to check with the
> server at each open. If you are the sole client on this filesystem,
> then you don't need the same cache-coherency, and "nocto" will tell the
> NFS client not to bother checking with the server if information is
> available in cache.

No, I haven't - I will research this a little further today. While we're not the only client using these filesystems, this process is (currently) the only one that writes to these files.

Thanks for the suggestion.

-Aaron
Re: slow open() calls and o_nonblock
Hi John, thanks for responding. I'm using kernel 2.6.20 on a home-grown distro. I've responded to a few specific points inline - but as a whole, Davide directed me to work that is being done specifically to address these issues in the kernel, as well as a userspace implementation that will let me sidestep this failing for the time being.

On 6/3/07, John Stoffel <[EMAIL PROTECTED]> wrote:
> How large are these files? Are they all in a single directory? How many
> files are in the directory?
>
> Ugh. Why don't you just write to a DB instead? It sounds like you're
> writing small records, with one record to a file. It can work, but when
> you're doing thousands per-minute, the open/close overhead is starting
> to dominate. Can you just amortize that overhead across a bunch of
> writes instead by writing to a single file which is more structured for
> your needs?

In short, I'm distributing logs in realtime for about 600,000 websites. The sources of the logs (http, ftp, realmedia, etc.) are flexible; however, the base framework was built around a large cluster of webservers. The output can be to several hundred thousand files across about two dozen filers for user consumption - some can be very active, some completely inactive.

> Netapps usually scream for NFS writes and such, so it sounds to me that
> you've blown out the NVRAM cache on the box. Can you elaborate more on
> your hardware & Network & Netapp setup?

You're totally correct here - Netapp has told us as much about our filesystem design; we use too much RAM on the filer itself. It's true that the application would handle just fine if our filesystem structure were redesigned - but I am approaching this from an application perspective. These units are capable of the raw IO; it's simply that the open calls are taking a while. If I were to thread off the application (and Davide has been kind enough to provide some libraries which will make that substantially easier), the problem wouldn't exist.

> The problem is that O_NONBLOCK on files open doesn't make sense. You
> either open it, or you don't. How long it takes to complete isn't part
> of the spec.

You can certainly open the file, but not block on the call to do it. What confuses me is why the kernel would "block" for 415ms on an open call. That's an eternity to suspend a process that has to distribute data such as this.

> But in this case, I think you're doing something hokey with your data
> design. You should be opening just a handful of files and then
> streaming your writes to those files. You'll get much more performance.

Except I can't very well keep 600,000 files open over NFS. :) Pool and queue, and cycle through the pool. I've managed to achieve a balance in my production deployment with this method - my email was more of a rant after months of trying to work around a problem (caused by a limitation in the system call interface), only to have it present an order of magnitude worse than I expected.

Sorry for not giving more information up front - and thanks for your time.

-Aaron
slow open() calls and o_nonblock
Greetings all. I'm not on this list, so I apologize if this subject has been covered before. (Also, please cc me in the response.)

I've spent the last several months trying to work around the lack of a decent disk AIO interface. I'm starting to wonder if one exists anywhere.

The short version: I have written a daemon that needs to open several thousand files a minute and write a small amount of data to each file. After extensive research, I ended up going with the kludgy POSIX AIO pthreads wrapper in glibc to handle my writes, due to the time constraints of writing my own pthreads handler into the application.

The problem with this equation is that opens, closes and non-read/write operations (fchmod, fcntl, etc.) have no interface in POSIX AIO. Now, I was under the assumption that, given that open and close operations are comparatively less common than the write operations, this wouldn't be a huge problem. My tests seemed to reflect that.

I went to production with this yesterday, only to discover that under production load, our filesystems (NFS on Netapps) were substantially slower than I was expecting. open() calls are taking upwards of 2 seconds on occasion, and usually ~20ms. And Netapp speed aside, O_NONBLOCK and O_DIRECT seem to make zero difference to my open times. Example:

    open("/somefile", O_WRONLY|O_NONBLOCK|O_CREAT, 0644) = 1621 <0.415147>

Now, I'm a userspace guy so I can be pretty dense, but shouldn't a call with a nonblocking flag return EAGAIN if it's going to take anywhere near 415ms? Is there a way I can force opens to EAGAIN if they take more than 10ms?

Thanks for any help you folks can offer.

-Aaron Wiebe

(ps. Having come from the socket side of the fence, it's incredibly frustrating to be unable to poll() or epoll() regular file FDs - especially knowing that the kernel is translating them into a TCP socket to do NFS anyway. Please add regular files to epoll and give me a way to do opens in the same fashion as connects!)
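[Editorial note: the strace figure above is easy to reproduce in userspace. This sketch (threshold and names are mine) times each open() the way the numbers above were gathered, and illustrates that O_NONBLOCK is accepted on a regular file but places no bound on how long the call may sleep, e.g. against a slow NFS server.]

```python
import os
import time

# Time a single open() call.  O_NONBLOCK is passed, but for regular files
# it does not limit how long open() may block; the elapsed time is what
# strace reports in angle brackets.
def timed_open(path, threshold_ms=10.0):
    t0 = time.monotonic()
    fd = os.open(path, os.O_WRONLY | os.O_NONBLOCK | os.O_CREAT, 0o644)
    elapsed_ms = (time.monotonic() - t0) * 1000.0
    if elapsed_ms > threshold_ms:
        print("slow open: %s took %.1f ms" % (path, elapsed_ms))
    return fd
```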
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: slow open() calls and o_nonblock
Hi John, thanks for responding. I'm using kernel 2.6.20 on a home-grown distro. I've responded to a few specific points inline - but as a whole, Davide directed me to work that is being done specifically to address these issues in the kernel, as well as a userspace implementation that would allow me to sidestep this failing for the time being.

On 6/3/07, John Stoffel [EMAIL PROTECTED] wrote:

> How large are these files? Are they all in a single directory? How many files are in the directory? Ugh. Why don't you just write to a DB instead? It sounds like you're writing small records, with one record to a file. It can work, but when you're doing thousands per minute, the open/close overhead is starting to dominate. Can you just amortize that overhead across a bunch of writes instead by writing to a single file which is more structured for your needs?

In short, I'm distributing logs in realtime for about 600,000 websites. The sources of the logs (http, ftp, realmedia, etc.) are flexible, but the base framework was built around a large cluster of webservers. The output can be to several hundred thousand files across about two dozen filers for user consumption - some can be very active, some completely inactive.

> Netapps usually scream for NFS writes and such, so it sounds to me that you've blown out the NVRAM cache on the box. Can you elaborate more on your hardware/network/Netapp setup?

You're totally correct here - Netapp has told us as much about our filesystem design; we use too much RAM on the filer itself. It's true that the application would handle just fine if our filesystem structure were redesigned - I am approaching this from an application perspective, though. These units are capable of the raw IO; it's the simple fact that open calls are taking a while. If I were to thread off the application (and Davide has been kind enough to provide some libraries which will make that substantially easier), the problem wouldn't exist.
> The problem is that O_NONBLOCK on file opens doesn't make sense. You either open it, or you don't. How long it takes to complete isn't part of the spec.

You can certainly open the file, but not block on the call to do it. What confuses me is why the kernel would "block" for 415ms on an open call. That's an eternity to suspend a process that has to distribute data such as this.

> But in this case, I think you're doing something hokey with your data design. You should be opening just a handful of files and then streaming your writes to those files. You'll get much more performance.

Except I can't very well keep 600,000 files open over NFS. :)

> Pool and queue, and cycle through the pool.

I've managed to achieve a balance in my production deployment with this method - my email was more of a rant after months of trying to work around a problem (caused by a limitation in system calls), only to have it present an order of magnitude worse than I expected. Sorry for not giving more information off the line - and thanks for your time.

-Aaron
Fwd: uninterruptable fcntl calls
Greetings, I've run into a situation where fcntl F_SETLKW calls lock up nearly completely. I've tried several approaches to handle this case, and have yet to come up with some method of handling it. I've never really ventured outside userspace, so I'm turning to this list to try and get a handle on this.

Over NFSv3 UDP, this situation takes place VERY rarely, but with the volume I do, it's creating a problem. In short, I am attempting to take a read or write lock, and the call hangs to the point where a SIGKILL is not captured - no signal is. I've tried alarming out and I've tried switching the socket to nonblocking - nothing I can think of prevents or even allows me to handle the case. I understand NFS locking can be rather sketchy at times - but all I need is the ability to handle the case.

I can force the process to die by sending a SIGKILL, then stracing. The strace reports the process as SIGSTOP, then processes the kill signal. All I need here is a method of capturing this case. I can "repair" the stuck lock by regenerating the file, but I can't capture the case in order to handle this in code.

Any help would be useful - I am currently running 2.6.15.6 compiled with the NFS patches from linux-nfs.org, but this case was happening before applying those patches. I'd be happy to provide any more information necessary. I've been struggling with this one for a few months now.
Thanks, -Aaron

Straces:

    rt_sigaction(SIGALRM, {0xb7f56640, [ALRM], 0}, {SIG_DFL}, 8) = 0
    alarm(120) = 0
    fcntl64(3, F_SETLKW, {type=F_RDLCK, whence=SEEK_SET, start=0, len=0} [hangs]

Or:

    fcntl64(3, F_GETFL) = 0x8002 (flags O_RDWR|O_LARGEFILE)
    fcntl64(3, F_SETFL, O_RDWR|O_NONBLOCK|O_LARGEFILE) = 0
    fcntl64(3, F_SETLKW, {type=F_RDLCK, whence=SEEK_SET, start=0, len=0}

Code used for locking:

    static int db_lock(int fd, int type)
    {
        struct flock fl;
        struct timespec *tv = (struct timespec *) malloc(sizeof(struct timespec));
        int ret, c = 0;

        if (!(fd > 0))
            return -1;

    #ifdef SIGALRM_HACK
        /* after two minutes, wig out */
        sigalrm_set();
        alarm(120);
    #endif

        fl.l_whence = SEEK_SET;
        fl.l_start = 0;
        fl.l_len = 0;
        fl.l_type = type;

    #ifdef NONBLOCKING_HACK
        set_nonblocking(fd);
    #endif

        while ((ret = fcntl(fd, F_SETLKW, &fl)) < 0) {
            c++;
            if (c > 600) {
                /* we've been waiting for 60 seconds... */
                my_error("stuck on fcntl request, aborting");
                return -1;
            }
            tv->tv_nsec = 100000000; /* 10th of a second wait */
            tv->tv_sec = 0;
            nanosleep(tv, NULL);
        }
        free(tv);

    #ifdef SIGALRM_HACK
        sigalrm_unset();
    #endif

    #ifdef NONBLOCKING_HACK
        unset_nonblocking(fd);
    #endif

        return ret;
    }