Re: kqueue microbenchmark results
Terry Lambert wrote:
> > > Which is precisely why you need to know where in the chain of events this
> > > happened. Otherwise if I see
> > >
> > > 'read on fd 5'
> > > 'read on fd 5'
> > >
> > > How do I know which read is for which fd in the multithreaded case
> >
> > That can't happen, can it?  Let's say the following happens:
> >    close(5)
> >    accept() = 5
> >    call kevent() and rebind fd 5
> > The 'close(5)' would remove the old fd 5 events.  Therefore,
> > any fd 5 events you see returned from kevent are for the new fd 5.
>
> Strictly speaking, it can happen in two cases:
>
> 1) single acceptor thread, multiple worker threads
> 2) multiple anonymous "work to do" threads
>
> In both these cases, the incoming requests from a client are
> given to any thread, rather than a particular thread.
>
> In the first case, we can have (id:execution order:event):
>
> 1:1:open 5
> 2:2:read 5
> 3:4:read 5
> 2:3:close 5
>
> If thread 2 processes the close event before thread 3 processes
> the read event, then when thread 3 attempts processing, it will
> fail.

You're not talking about kqueue() / kevent() here, are you? With that
interface, thread 2 would not see a close event; instead, the other
events for fd 5 would vanish from the queue. If you were indeed talking
about kqueue() / kevent(), please flesh out the example a bit more,
showing who calls kevent().

(A race that *can* happen is that fd 5 could be closed by another thread
after a 'read 5' event is pulled from the event queue and before it is
processed, but that could happen with any readiness notification API at
all.)

- Dan

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: kqueue microbenchmark results
> > Which is precisely why you need to know where in the chain of events this
> > happened. Otherwise if I see
> >
> > 'read on fd 5'
> > 'read on fd 5'
> >
> > How do I know which read is for which fd in the multithreaded case
>
> That can't happen, can it?  Let's say the following happens:
>    close(5)
>    accept() = 5
>    call kevent() and rebind fd 5
> The 'close(5)' would remove the old fd 5 events.  Therefore,
> any fd 5 events you see returned from kevent are for the new fd 5.
>
> (I suspect it helps that kevent() is both the only way to
> bind events and the only way to pick them up; makes it harder
> for one thread to sneak a new fd into the event list without
> the thread calling kevent() noticing.)

Strictly speaking, it can happen in two cases:

1) single acceptor thread, multiple worker threads
2) multiple anonymous "work to do" threads

In both these cases, the incoming requests from a client are given to
any thread, rather than a particular thread.

In the first case, we can have (id:execution order:event):

1:1:open 5
2:2:read 5
3:4:read 5
2:3:close 5

If thread 2 processes the close event before thread 3 processes the
read event, then when thread 3 attempts processing, it will fail.

Technically, this is a group ordering problem in the design of the
software, which should instead queue all events to a dispatch thread,
and the threads should use IPC to serialize processing of serial
events. This is similar to the problem with async mounted FS recovery
in the event of a crash: without ordering guarantees, you can only get
to a "good" state, not necessarily "the one correct state".

In the second case, we can have:

1:2:read 5
2:1:open 5
3:4:read 5
2:3:close 5

This is just a non-degenerate form of the first case, where we allow
thread 1 and all other threads to be identical, and don't serialize
open state initialization. The NetWare for UNIX system uses this model.
The benefit is that all user space threads can be identical.

This means that I can use either threads or processes, and it won't
matter, so my software can run on older systems that lack "perfect"
threads models, simply by using processes and putting client state into
shared memory.

In this case, there is no need for inter-thread synchronization;
instead, we must insist that events be dispatched sequentially, and
that the events be processed serially. This effectively requires event
processing completion notification from user space to kernel space.

In NetWare for UNIX, this was accomplished using a streams MUX which
knew that the NetWare protocol was request-response. This also
permitted "busy" responses to be turned around in kernel space, without
incurring a kernel-to-user space scheduling penalty. It also permitted
"piggyback", where an ioctl to the MUX was used to respond, combining
the sending of a response with the next read. This reduced protection
domain crossings and cut the context switch overhead by 50%.

Finally, the MUX sent requests to user space in LIFO order. This
approach is called "hot engine scheduling": the last reader in from
user space is the most likely to have its pages in core, and so not
need swapping to handle the next request.

I was architect of much of the process model discussed above; as you
can see, there are some significant performance wins to be had by
building the right interfaces, and putting the code on the right side
of the user/kernel boundary.

In any case, the answer is that you can not assume that the only
correct way to solve a problem like event inversion is serialization of
events in user space (or kernel space). This is not strictly a
"threaded application implementation" issue, and it is not strictly a
kernel serialization of event delivery issue.

Another case, which NetWare did not handle, is that of rejected
authentication.

Even if you went with the first model, and forced your programmers to
use expensive inter-thread synchronization, or worse, bound each client
to a single thread in the server (thus rendering the system likely to
have skewed thread load, getting worse the longer the connection was
up), you would still have the problem of rejected authentication.

A client might attempt to send authentication followed by commands in
the same packet series, without waiting for an explicit ACK after each
one (i.e. it might attempt to implement a sliding window over a virtual
circuit), and the system on the other end might diligently queue the
events, only to have the authentication be rejected, but with packets
already queued to user space for processing, assuming serialization in
user space.

You would then need a much more complex mechanism, to allow you to
invalidate an already queued event to another thread, which you don't
know about in your thread, before you release the interlock. Otherwise
the client may get responses without a valid authentication.

You need look no further than LDAPv3 for an example of a protocol where
this is possible.
Re: kqueue microbenchmark results
* Dan Kegel <[EMAIL PROTECTED]> [001027 09:40] wrote:
> Alan Cox wrote:
> > > > kqueue currently does this; a close() on an fd will remove any pending
> > > > events from the queues that they are on which correspond to that fd.
> > >
> > > the application of a close event. What can I say, "the fd formerly known
> > > as X" is now gone? It would be incorrect to say that "fd X was closed",
> > > since X no longer refers to anything, and the application may have reused
> > > that fd for another file.
> >
> > Which is precisely why you need to know where in the chain of events this
> > happened. Otherwise if I see
> >
> > 'read on fd 5'
> > 'read on fd 5'
> >
> > How do I know which read is for which fd in the multithreaded case
>
> That can't happen, can it?  Let's say the following happens:
>    close(5)
>    accept() = 5
>    call kevent() and rebind fd 5
> The 'close(5)' would remove the old fd 5 events.  Therefore,
> any fd 5 events you see returned from kevent are for the new fd 5.
>
> (I suspect it helps that kevent() is both the only way to
> bind events and the only way to pick them up; makes it harder
> for one thread to sneak a new fd into the event list without
> the thread calling kevent() noticing.)

Yes, that's how it does and should work. Noticing the close() should be
done via thread communication/IPC, not stuck into kqueue.

--
-Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
"I have the heart of a child; I keep it in a jar on my desk."
Re: kqueue microbenchmark results
Alan Cox wrote:
> > > kqueue currently does this; a close() on an fd will remove any pending
> > > events from the queues that they are on which correspond to that fd.
> >
> > the application of a close event. What can I say, "the fd formerly known
> > as X" is now gone? It would be incorrect to say that "fd X was closed",
> > since X no longer refers to anything, and the application may have reused
> > that fd for another file.
>
> Which is precisely why you need to know where in the chain of events this
> happened. Otherwise if I see
>
> 'read on fd 5'
> 'read on fd 5'
>
> How do I know which read is for which fd in the multithreaded case

That can't happen, can it?  Let's say the following happens:
   close(5)
   accept() = 5
   call kevent() and rebind fd 5
The 'close(5)' would remove the old fd 5 events.  Therefore, any fd 5
events you see returned from kevent are for the new fd 5.

(I suspect it helps that kevent() is both the only way to bind events
and the only way to pick them up; makes it harder for one thread to
sneak a new fd into the event list without the thread calling kevent()
noticing.)

- Dan
Re: kqueue microbenchmark results
* Jamie Lokier <[EMAIL PROTECTED]> [001027 08:21] wrote:
> Alfred Perlstein wrote:
> > > If a programmer does not ever wish to block under any circumstances, it's
> > > his obligation to communicate this desire to the implementation. Otherwise,
> > > the implementation can block if it doesn't have data or an error available
> > > at the instant 'read' is called, regardless of what it may have known or
> > > done in the past.
> >
> > Yes, and as you mentioned, it was _bugs_ in the operating system
> > that did this.
>
> Not for writes. POLLOUT may be returned when the kernel thinks you have
> enough memory to do a write, but someone else may allocate memory before
> you call write(). Or does POLLOUT not work this way?

POLLOUT checks the socket buffer (if we're talking about sockets), and
yes, you may still block on mbuf allocation (if we're talking about
FreeBSD) if the socket isn't set non-blocking. Actually, POLLOUT may be
set even if there isn't enough memory for a write in the network buffer
pool.

> For read, you still want to declare the sockets non-blocking so your
> code is robust on _other_ operating systems. It's pretty straightforward.

Yes, it's true: not using non-blocking sockets is like ignoring
friction in a physics problem, but assuming you have complete control
over the machine, it shouldn't trip you up that often. And we're
talking about readability, not writability, which as you mentioned may
block because of contention for the network buffer pool.

--
-Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
"I have the heart of a child; I keep it in a jar on my desk."
Re: kqueue microbenchmark results
Alfred Perlstein wrote:
> > If a programmer does not ever wish to block under any circumstances, it's
> > his obligation to communicate this desire to the implementation. Otherwise,
> > the implementation can block if it doesn't have data or an error available
> > at the instant 'read' is called, regardless of what it may have known or
> > done in the past.
>
> Yes, and as you mentioned, it was _bugs_ in the operating system
> that did this.

Not for writes. POLLOUT may be returned when the kernel thinks you have
enough memory to do a write, but someone else may allocate memory
before you call write(). Or does POLLOUT not work this way?

For read, you still want to declare the sockets non-blocking so your
code is robust on _other_ operating systems. It's pretty
straightforward.

--
Jamie
Re: kqueue microbenchmark results
* Alan Cox <[EMAIL PROTECTED]> [001026 18:33] wrote:
> > the application of a close event. What can I say, "the fd formerly known
> > as X" is now gone? It would be incorrect to say that "fd X was closed",
> > since X no longer refers to anything, and the application may have reused
> > that fd for another file.
>
> Which is precisely why you need to know where in the chain of events this
> happened. Otherwise if I see
>
> 'read on fd 5'
> 'read on fd 5'
>
> How do I know which read is for which fd in the multithreaded case

No, you don't: you don't see anything with the current code unless fd 5
is still around. What you're presenting to Jonathan is an application
threading problem, not something that needs to be resolved by the OS.

> > As for the multi-thread case, this would be a bug; if one thread closes
> > the descriptor, the other thread is going to get an EBADF when it goes
> > to perform the read.
>
> Another thread may already have reused the fd

This is another example of an application threading problem.

--
-Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
"I have the heart of a child; I keep it in a jar on my desk."
Re: kqueue microbenchmark results
> the application of a close event. What can I say, "the fd formerly known
> as X" is now gone? It would be incorrect to say that "fd X was closed",
> since X no longer refers to anything, and the application may have reused
> that fd for another file.

Which is precisely why you need to know where in the chain of events
this happened. Otherwise if I see

'read on fd 5'
'read on fd 5'

How do I know which read is for which fd in the multithreaded case?

> As for the multi-thread case, this would be a bug; if one thread closes
> the descriptor, the other thread is going to get an EBADF when it goes
> to perform the read.

Another thread may already have reused the fd.
Re: kqueue microbenchmark results
On Fri, Oct 27, 2000 at 01:50:40AM +0100, Alan Cox wrote:
> > kqueue currently does this; a close() on an fd will remove any pending
> > events from the queues that they are on which correspond to that fd.
>
> This seems an odd thing to do. Surely what you need to do is to post a
> 'close completed' event to the queue. This also makes more sense when you
> have a threaded app and another thread may well currently be in say a read
> at the time it is closed

Actually, it makes sense when you think about it. The `fd' is actually
a capability that the application uses to refer to the open file in the
kernel. If the app does a close() on the fd, it destroys this naming.
The application then has no capability left which refers to the
formerly open socket, and conversely, the kernel has no capability
(name) to notify the application of a close event.

What can I say, "the fd formerly known as X" is now gone? It would be
incorrect to say that "fd X was closed", since X no longer refers to
anything, and the application may have reused that fd for another file.

As for the multi-thread case, this would be a bug; if one thread closes
the descriptor, the other thread is going to get an EBADF when it goes
to perform the read.

--
Jonathan
Re: kqueue microbenchmark results
* Alan Cox <[EMAIL PROTECTED]> [001026 17:50] wrote:
> > kqueue currently does this; a close() on an fd will remove any pending
> > events from the queues that they are on which correspond to that fd.
>
> This seems an odd thing to do. Surely what you need to do is to post a
> 'close completed' event to the queue. This also makes more sense when you
> have a threaded app and another thread may well currently be in say a read
> at the time it is closed

Kqueue's flexibility could allow this to be implemented; all you would
need to do is make a new filter trigger. You might need a _bit_ of
hackery to make sure those aren't removed, or one could just add the
event after clearing all pending events.

Adding a filter to be informed when a specific fd is closed is
certainly an option. It doesn't make very much sense, because that fd
could then be reused quickly by something else... but anyhow:

The point of this interface is to ask kqueue to report only on the
things you are interested in, not to generate superfluous events that
you wouldn't care about. You could make such a flag if Linux adopted
this interface (and I'm sure we'd be forced to adopt it), but if you
make kqueue generate info an application won't care about, I don't
think that would be taken back.

--
-Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
"I have the heart of a child; I keep it in a jar on my desk."
Re: kqueue microbenchmark results
> kqueue currently does this; a close() on an fd will remove any pending
> events from the queues that they are on which correspond to that fd.

This seems an odd thing to do. Surely what you need to do is to post a
'close completed' event to the queue. This also makes more sense when
you have a threaded app and another thread may well currently be in,
say, a read at the time it is closed.
Re: kqueue microbenchmark results
This is a long posting, with a humble beginning, but it has a point.
I'm being complete so that no one is left in the dark, or in any doubt
as to what that point is. That means rehashing some history.

This posting is not really about select or Linux: it's about
interfaces. Like cached state, interfaces can often be harmful.

NB: I really should redirect this to FreeBSD, as well, since there are
people in that camp who haven't learned the lesson, either, but I'll
leave it in -chat, for now.

---

[ ... kqueue discussion ... ]

> Linux also thought it was OK to modify the contents of the
> timeval structure before returning it.

It's been pointed out that I should provide more context for this
statement, before people look at me strangely and make circling motions
with their index fingers around their ears (or whatever the
international sign for "crazy" is these days). So I'll start with a
brief history.

The context is this: the select API was designed with the idea that one
might wish to do non-I/O related background processing. Toward this
end, one could have several ways of using the API:

1) The (struct timeval *) could be NULL. This means "block until a
   signal or until a condition on which you are selecting is true";
   select is a BSD interface, and, until BSD 4.x and POSIX signals, the
   signal would actually call the handler and restart the select call,
   so in effect, this really meant "block until you longjmp out of a
   signal handler or until a condition on which you are selecting is
   true".

2) The (struct timeval *) could point to the address of a real timeval
   structure (i.e. not be NULL); in that case, the result depended on
   the contents:

   a) If the timeval struct was zero valued, it meant that the select
      should poll for one of the conditions being selected for in the
      descriptor set, and return a 0 if no conditions were true. The
      contents of the bitmaps and timeval struct were left alone.
   b) If the timeval struct was not zero valued, it meant that the select should wait until the time specified had expired since the system call was first started, or until one of the conditions being selected for was true. If the timeout expired, then a 0 would be returned, but if one or more of the conditions were true, the number of descriptors on which true conditions existed would be returned.

Wedging so much into a single interface was fraught with peril: it was undefined as to what would happen if the timeval specified an interval of 5 seconds, yet there was a persistently rescheduled alarm every 2 seconds, resulting in a signal handler call that did _not_ longjmp... would the timer expire after 5 seconds, or would the timer be considered to have been restarted along with the call? Implementations that went both ways existed. Mostly, programmers used longjmp in signal handlers, and it wasn't a portability issue.

More perilous was the question of what to do with a partially satisfied request that was interrupted by a timer or signal handler and longjmp (later, siginterrupt(2), and later still the POSIX non-restart default behaviour). This meant that the bitmap of select events might have been modified already, after the wakeup, but before the process was rescheduled to run.

Finally, the select manual page specifically reserved the right to modify the contents of the timeval struct; this was presumably so that you could either do accurate timekeeping by maintaining a running tally using the timeval deficit (a lot of math, that), or, more likely, to deal with system call restart, and ensure that signals would not prevent the select from ever exiting in the case of system call restart.

So this was the select API definition.

---

Being pragmatists, programmers programmed to the behaviour of the API in actual implementations, rather than to the strict "letter of the law" laid down by the man page.
This meant that select was called in loop control constructs, and that the bitmaps were reinitialized each time through the loop. It also meant that the timeval struct was not reinitialized, since that was more work, and no known implementations would modify it.

Pre-POSIX signals, signal handlers were handled on a signal stack, as a result of a kernel trampoline outcall, and that meant that a restarting system call would not impact the countdown.

---

Linux came along, and implemented the letter of the law; the machines were now sufficiently fast, and the math sufficiently cheap, that it was now possible to do usefully accurate timekeeping using the inverted math required for keeping a running tally using the timeval deficit. So they implemented it: it was more useful than the historical
Re: kqueue microbenchmark results
On Thu, Oct 26, 2000 at 02:16:28AM -0700, Gideon Glass wrote:
> Jonathan Lemon wrote:
> >
> > Also, consider the following scenario for the proposed get_event():
> >
> > 1. packet arrives, queues an event.
> > 2. user retrieves event.
> > 3. second packet arrives, queues event again.
> > 4. user reads() all data.
> >
> > Now, next time around the loop, we get a notification for an event
> > when there is no data to read. The application now must be prepared
> > to handle this case (meaning no blocking read() calls can be used).
> >
> > Also, what happens if the user closes the socket after step 4 above?
>
> Depends on the implementation. If the item in the queue is the
> struct file (or whatever an fd indexes to), then the implementation
> can only queue the fd once. This also avoids the problem with
> closing sockets - close() would naturally do a list_del() or whatever
> on the struct file.
>
> At least I think it could be implemented this way...

kqueue currently does this; a close() on an fd will remove any pending events from the queues that they are on which correspond to that fd.

I was trying to point out that it isn't as simple as it would seem at first glance, as you have to consider issues like this.

Also, if the implementation allows multiple event types per fd (leading to multiple queued events per fd), there no longer is a 1:1 mapping to something like 'struct file', and performing a list walk doesn't scale very well.
--
Jonathan
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: kqueue microbenchmark results
Jonathan Lemon wrote:
>
> Also, consider the following scenario for the proposed get_event():
>
> 1. packet arrives, queues an event.
> 2. user retrieves event.
> 3. second packet arrives, queues event again.
> 4. user reads() all data.
>
> Now, next time around the loop, we get a notification for an event
> when there is no data to read. The application now must be prepared
> to handle this case (meaning no blocking read() calls can be used).
>
> Also, what happens if the user closes the socket after step 4 above?

Depends on the implementation. If the item in the queue is the struct file (or whatever an fd indexes to), then the implementation can only queue the fd once. This also avoids the problem with closing sockets - close() would naturally do a list_del() or whatever on the struct file.

At least I think it could be implemented this way...

gid

> The user now receives a notification for a fd which no longer exists,
> or possibly has been reused for another connection. This may or may
> not make a difference to the application, but it must be prepared to
> handle it anyway. I believe that Zack Brown ran into this problem with
> one of the webservers he was writing.
>
> > > You can find my paper at http://people.freebsd.org/~jlemon
> >
> > I'll go and read it now. :)
>
> The paper talks about some of the issues we have been discussing, as
> well as the design rationale behind kqueue. I'd be happy to answer
> any questions about the paper.
> --
> Jonathan
RE: kqueue microbenchmark results
> * David Schwartz <[EMAIL PROTECTED]> [001025 15:35] wrote:
> >
> > If a programmer does not ever wish to block under any circumstances, it's
> > his obligation to communicate this desire to the implementation. Otherwise,
> > the implementation can block if it doesn't have data or an error available
> > at the instant 'read' is called, regardless of what it may have known or
> > done in the past. It's also just generally good programming practice. There
> > was a time when many operating systems had bugs that caused 'select loop'
> > type applications to hang if they didn't set all their descriptors
> > non-blocking.
>
> Yes, and as you mentioned, it was _bugs_ in the operating system
> that did this.

Right. I can't imagine a way in which this could happen for TCP without a bug. For other protocols, it's not so far fetched. For UDP, which is defined as lossy, I could imagine an implementation that changed its mind about accepting a packet due to memory demands.

> I don't think it's wise to continue speculating on this issue unless
> you can point to a specific document that says that it's OK for
> this type of behavior to happen.

SuS2 says that 'read' behaves like 'recv' with no flags for a socket. SuS2 says that for a socket, "If no messages are available at the socket and O_NONBLOCK is not set on the socket's file descriptor, recv() blocks until a message arrives."

> Let's take a look at the FreeBSD manpage for poll:
>
> POLLIN  Data other than high priority data may be read without
>         blocking.

At the time you return from poll. This says nothing about any later time.

[snip]

> #define POLLIN  0x0001  /* There is data to read */
>
> This seems to imply that it is one hell of a bug to block, returning
> an error would be acceptable, but surely not blocking.

This brief comment is not meant to be thorough. In fact, it says nothing about error conditions and implies that it's wrong to return POLLIN for an error.
> I know manpages are a poor source for references but you're the one
> putting up a big fight for blocking behavior from poll, perhaps you
> can point out a standard that contradicts the manpages?

When you code to a standard, your code must not fail under any conditions permitted by the standard. Failing to set your file descriptors non-blocking when you never want to block depends upon behavior not guaranteed.

Unfortunately, none of the standards provides sufficiently clear statements about this behavior. In fact, I can't even find any standard that says it's correct to signal POLLIN when there's an error.

DS
Re: kqueue microbenchmark results
[ ... blocking read after signalling that data is available ... ]

> Yes, and as you mentioned, it was _bugs_ in the operating system
> that did this.

I think it's reasonable for the OS to discard, for example, connection requests which are not serviced in a reasonable time window.

Likewise, it's reasonable to consider some protocol that would allow the sender to repudiate a packet that it decided that it didn't want to send; this would, in fact, be extremely useful in multicast protocols that signalled all available servers with a request, and then repudiated the request after receiving a response, on the theory that the server was too loaded, or the link too congested, or the programmer of the repudiated servers was such a bad coder that the server was too lazy to get off its butt and answer the request in a reasonable amount of time.

A protocol based on this second approach would actually be able to solve "the gnutella congestion problem" (quoted, as I believe it's simply a case of the universe and the laws of physics voting against gnutella as being a dumb idea, since it's just a repeat of the original NetWare and LANMan scaling problems).

The real problem is that the interface is making a potentially incorrect assumption about the underlying implementation, and that means that it won't be portable to systems whose underlying implementations don't satisfy the (undocumented and unwarranted) assumption.

People whine about WSOCK32 being "gratuitously different" with regard to resource tracking and implying a shutdown on a socket close or an application exit, but they forget that that all came about because the original interface, and the programmers who used it, assumed a kernel space implementation, and that the kernel would resource track sockets, as if they were file descriptors.

I think your Sun example:

> POLLIN  Data other than high priority data may be read
>         without blocking. For STREAMS, this flag is set in
>         revents even if the message is of zero length.
Implies that a recv or recvfrom is required, and use of a read after a POLLIN, which can't retrieve high priority data from a socket, may result in the process blocking. Well, "duh!", the read is on the normal data channel, and the POLLIN corresponds to the high priority channel... what did you expect, when you called the wrong system call on a socket?

> I see a trend here, let's try Linux:

Linux also thought it was OK to modify the contents of the timeval structure before returning it.

Terry Lambert
[EMAIL PROTECTED]
---
Any opinions in this posting are my own and not those of my present or previous employers.
Re: kqueue microbenchmark results
> kqueue currently does this; a close() on an fd will remove any pending
> events from the queues that they are on which correspond to that fd.

This seems an odd thing to do. Surely what you need to do is to post a 'close completed' event to the queue. This also makes more sense when you have a threaded app and another thread may well currently be in, say, a read at the time it is closed.
Re: kqueue microbenchmark results
* Alan Cox [EMAIL PROTECTED] [001026 17:50] wrote:
> > kqueue currently does this; a close() on an fd will remove any pending
> > events from the queues that they are on which correspond to that fd.
>
> This seems an odd thing to do. Surely what you need to do is to post a
> 'close completed' event to the queue. This also makes more sense when you
> have a threaded app and another thread may well currently be in say a read
> at the time it is closed

Kqueue's flexibility could allow this to be implemented; all you would need to do is make a new filter trigger. You might need a _bit_ of hackery to make sure those aren't removed, or one could just add the event after clearing all pending events.

Adding a filter to be informed when a specific fd is closed is certainly an option. It doesn't make very much sense, though, because that fd could then be reused quickly by something else... but anyhow:

The point of this interface is to ask kqueue to report only on the things you are interested in, not to generate superfluous events that you wouldn't care about. You could make such a flag if Linux adopted this interface, and I'm sure we'd be forced to adopt it, but if you make kqueue generate info an application won't care about, I don't think that would be taken back.

--
-Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
"I have the heart of a child; I keep it in a jar on my desk."
Re: kqueue microbenchmark results
On Fri, Oct 27, 2000 at 01:50:40AM +0100, Alan Cox wrote:
> > kqueue currently does this; a close() on an fd will remove any pending
> > events from the queues that they are on which correspond to that fd.
>
> This seems an odd thing to do. Surely what you need to do is to post a
> 'close completed' event to the queue. This also makes more sense when you
> have a threaded app and another thread may well currently be in say a read
> at the time it is closed

Actually, it makes sense when you think about it. The `fd' is actually a capability that the application uses to refer to the open file in the kernel. If the app does a close() on the fd, it destroys this naming. The application then has no capability left which refers to the formerly open socket, and conversely, the kernel has no capability (name) to notify the application of a close event. What can I say, "the fd formerly known as X" is now gone?

It would be incorrect to say that "fd X was closed", since X no longer refers to anything, and the application may have reused that fd for another file.

As for the multi-thread case, this would be a bug; if one thread closes the descriptor, the other thread is going to get an EBADF when it goes to perform the read.
--
Jonathan
Re: kqueue microbenchmark results
> the application of a close event. What can I say, "the fd formerly
> known as X" is now gone? It would be incorrect to say that "fd X was
> closed", since X no longer refers to anything, and the application may
> have reused that fd for another file.

Which is precisely why you need to know where in the chain of events this happened. Otherwise if I see

'read on fd 5'
'read on fd 5'

how do I know which read is for which fd in the multithreaded case?

> As for the multi-thread case, this would be a bug; if one thread closes
> the descriptor, the other thread is going to get an EBADF when it goes
> to perform the read.

Another thread may already have reused the fd.
Re: kqueue microbenchmark results
* Alan Cox [EMAIL PROTECTED] [001026 18:33] wrote:
> > the application of a close event. What can I say, "the fd formerly
> > known as X" is now gone? It would be incorrect to say that "fd X was
> > closed", since X no longer refers to anything, and the application may
> > have reused that fd for another file.
>
> Which is precisely why you need to know where in the chain of events this
> happened. Otherwise if I see
>
> 'read on fd 5'
> 'read on fd 5'
>
> How do I know which read is for which fd in the multithreaded case

No, you don't: you don't see anything with the current code unless fd 5 is still around. What you're presenting to Jonathan is an application threading problem, not something that needs to be resolved by the OS.

> > As for the multi-thread case, this would be a bug; if one thread closes
> > the descriptor, the other thread is going to get an EBADF when it goes
> > to perform the read.
>
> Another thread may already have reused the fd

This is another example of an application threading problem.

--
-Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
"I have the heart of a child; I keep it in a jar on my desk."
Re: kqueue microbenchmark results
* David Schwartz <[EMAIL PROTECTED]> [001025 15:35] wrote:
>
> If a programmer does not ever wish to block under any circumstances, it's
> his obligation to communicate this desire to the implementation. Otherwise,
> the implementation can block if it doesn't have data or an error available
> at the instant 'read' is called, regardless of what it may have known or
> done in the past. It's also just generally good programming practice. There
> was a time when many operating systems had bugs that caused 'select loop'
> type applications to hang if they didn't set all their descriptors
> non-blocking.

Yes, and as you mentioned, it was _bugs_ in the operating system that did this.

I don't think it's wise to continue speculating on this issue unless you can point to a specific document that says that it's OK for this type of behavior to happen.

Let's take a look at the FreeBSD manpage for poll:

    POLLIN  Data other than high priority data may be read without
            blocking.

ok, no one bothers to do *BSD compat anymore (*grumble*), so, Solaris:

    POLLIN  Data other than high priority data may be read
            without blocking. For STREAMS, this flag is set in
            revents even if the message is of zero length.

I see a trend here, let's try Linux:

    #define POLLIN  0x0001  /* There is data to read */

This seems to imply that it is one hell of a bug to block; returning an error would be acceptable, but surely not blocking.

I know manpages are a poor source for references, but you're the one putting up a big fight for blocking behavior from poll; perhaps you can point out a standard that contradicts the manpages?

--
-Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
"I have the heart of a child; I keep it in a jar on my desk."
Re: kqueue microbenchmark results
On Wed, Oct 25, 2000 at 11:27:09AM -0400, Simon Kirby wrote:
> On Wed, Oct 25, 2000 at 01:02:46AM -0500, Jonathan Lemon wrote:
>
> > ends up making the job of the application harder. A simple example
> > to illustrate the point: what if the application does not choose
> > to read all the data from an incoming packet? The app now has to
>
> What applications would do better by postponing some of the reading?
> I can't think of any reason off the top of my head why an application
> wouldn't want to read everything it can. Doing everything in smaller

I can see this happening if the application does not know how much data is in the buffer, or if the data is being read into a buffer that does not have much space left in it.

Jim
RE: kqueue microbenchmark results
> On Wed, Oct 25, 2000 at 03:11:37PM -0700, David Schwartz wrote:
> > > Now, next time around the loop, we get a notification for an event
> > > when there is no data to read. The application now must be prepared
> > > to handle this case (meaning no blocking read() calls can be used).
> > > --
> > > Jonathan
> >
> > If the programmer never wants to block in a read call, he should
> > never do a blocking read anyway. There's no standard that requires
> > readability at time X to imply readability at time X+1.
>
> Quite true on the surface. But taking that statement at face value
> implies that it is okay for poll() to return POLLIN on a descriptor
> even if there is no data to be read. I don't think this is the
> intention.

Never mind what it implies. Just stick to what it says. :)

In my opinion, it's perfectly reasonable for an implementation to show
POLLIN on a call to poll() and then later block in read(). As far as I
know no implementation does this, but no standard prevents an
implementation from, for example, swapping received TCP data out to disk
if it's not retrieved, and blocking later, when you ask for the data,
until it can get the data back.

I would even argue that it's possible for an implementation to decide
that a connection had errored (for example, due to a timeout) and signal
POLLIN. Then, before you call 'read', it gets a packet and decides that
the connection is actually fine, and so blocks in 'read'. This wouldn't
seem possible in TCP, but it's possible to imagine protocols where it's
sensible to do. And again, as far as I know, no standard prohibits it.

If a programmer does not ever wish to block under any circumstances,
it's his obligation to communicate this desire to the implementation.
Otherwise, the implementation can block if it doesn't have data or an
error available at the instant 'read' is called, regardless of what it
may have known or done in the past. It's also just generally good
programming practice.
There was a time when many operating systems had bugs that caused
'select loop' type applications to hang if they didn't set all their
descriptors non-blocking.

DS
Re: kqueue microbenchmark results
On Wed, Oct 25, 2000 at 03:11:37PM -0700, David Schwartz wrote: > > > Now, next time around the loop, we get a notification for an event > > when there is no data to read. The application now must be prepared > > to handle this case (meaning no blocking read() calls can be used). > > -- > > Jonathan > > If the programmer never wants to block in a read call, he should never do a > blocking read anyway. There's no standard that requires readability at time > X to imply readability at time X+1. Quite true on the surface. But taking that statement at face value implies that it is okay for poll() to return POLLIN on a descriptor even if there is no data to be read. I don't think this is the intention. -- Jonathan
RE: kqueue microbenchmark results
> Now, next time around the loop, we get a notification for an event > when there is no data to read. The application now must be prepared > to handle this case (meaning no blocking read() calls can be used). > -- > Jonathan If the programmer never wants to block in a read call, he should never do a blocking read anyway. There's no standard that requires readability at time X to imply readability at time X+1. DS
Re: kqueue microbenchmark results
On Wed, Oct 25, 2000 at 11:40:28AM -0700, Simon Kirby wrote:
> On Wed, Oct 25, 2000 at 12:23:07PM -0500, Jonathan Lemon wrote:
> > Consider a program which reads from point A, writes to point B. If
> > the buffer associated with B fills up, then we don't want to
> > continue reading from A.
> >
> > A/B may be network sockets, pipes, or ptys.
>
> Fine, but we can bind the event watching to the device or socket or
> pipe that will clog up, right? In which case, we'll later get a write
> event (just like with select()), and then once there is some progress
> you can go back to read()ing from the original descriptor. This is
> even easier than using select() because you don't have to take the
> descriptor out of the read set and put it in the write set
> temporarily -- it will automatically work that way.

Yes, but with the above, you can't use get_event() as your main
dispatching loop to do the read() call any more, since there may be no
notifications pending in the queue. So you have to expand your main
loop to include both get_event() and a walk of the "these descriptors
may have partial data" list.

Also, as Jamie pointed out, with kqueue/select you can do:

    kevent/read/write

while with a pure edge-triggered scheme, you either must do:

    bind_event/read/.../read == 0/write

or maintain your own "this descriptor may have data" list.

Also, consider the following scenario for the proposed get_event():

    1. packet arrives, queues an event.
    2. user retrieves event.
    3. second packet arrives, queues event again.
    4. user reads() all data.

Now, next time around the loop, we get a notification for an event
when there is no data to read. The application now must be prepared
to handle this case (meaning no blocking read() calls can be used).

Also, what happens if the user closes the socket after step 4 above?
The user now receives a notification for an fd which no longer exists,
or possibly has been reused for another connection.
This may or may not make a difference to the application, but it must
be prepared to handle it anyway. I believe that Zack Brown ran into
this problem with one of the webservers he was writing.

> > You can find my paper at http://people.freebsd.org/~jlemon
>
> I'll go and read it now. :)

The paper talks about some of the issues we have been discussing, as
well as the design rationale behind kqueue. I'd be happy to answer any
questions about the paper.
--
Jonathan
Re: kqueue microbenchmark results
On Wed, Oct 25, 2000 at 09:40:53PM +0200, Jamie Lokier wrote: > Simon Kirby wrote: > > And you'd need to take the descriptor out of the read() set in the > > select() case anyway, so I don't really see what's different. > > The difference is that taking a bit out of select()'s bitmap is > basically free. Whereas the equivalent with events is a bind_event() > system call. With the caveat that kevent() will take a changelist at the same time that it returns an eventlist, so while you do incur some kernel processing to temporarily disable the descriptor, the system call is essentially free. -- Jonathan
Re: kqueue microbenchmark results
> What applications would do better by postponing some of the reading?
> I can't think of any reason off the top of my head why an application
> wouldn't want to read everything it can. Doing everything in smaller
> chunks would increase overhead (but maybe reduce latencies very
> slightly -- albeit probably not much when using a get_events()-style
> interface).

Applications that:

o Want to limit their memory footprint by limiting the amount of
  process VM they consume, and so limit their buffer size to less than
  all the data the stacks might be capable of providing at one time

o Work with fixed-size messages, and want to operate on a message at a
  time, without restricting the sender to sending only a single message
  or whole messages at one time

o Want to limit their overall system processing overhead for
  irrelevant/stale data (example: one which implements state delta
  refresh events, such as "Bolo" or "Netrek")

o Have to implement "leaky bucket" algorithms, where it is permissible
  to drop some data on the floor, and assume it will be retransmitted
  later (e.g. ATM or user space protocols which want to implement QoS
  guarantees)

o Need to take advantage of kernel strategies for protection from
  denial of service attacks, without having to redo those strategies
  themselves (particularly flood attacks; this is the same reason inetd
  supports connection rate limiting on behalf of the programs it is
  responsible for starting)

o Listen on multiple data channels, some of which are more important
  than others (the Real Networks streaming media protocols are an
  example)

o Want to evaluate the contents of a security negotiation prior to
  accepting data that was sent using an expired certificate or
  otherwise bogus credentials

There are all sorts of good reasons a programmer would want to trust
the kernel, instead of having to build ring buffers into each and every
program they write to ensure they remember data which is irrelevant to
the processing at hand, or protect their code against buffer overflows
initiated by trusted applications.

> Isn't it probably better to keep the kernel implementation as
> efficient as possible so that the majority of applications which will
> read (and write) all data possible can do it as efficiently as
> possible? Queueing up the events, even as they are in the form
> received from the kernel, is pretty simple for a userspace program to
> do, and I think it's the best place for it.

Reading, yes. Writing, no. The buffers they are filling in the kernel
belong to the kernel, not the application, despite what Microsoft tells
you about WSOCK32 programming. The WSOCK32 model assumes that the
networking is implemented in another user space process, rather than in
the kernel. People who use the "async" WSOCK32 interface rarely
understand the implications, because they rarely understand how async
messages are built using a Windows data pump, which serializes all
requests through the moral equivalent of a select loop (which is why NT
supports async notification on socket I/O, but other versions of
Windows do not [NB: actually, they could, using an fd=-1 I/O completion
port, but the WSOCK32 programmers were a bit lazy and were also being
told to keep performance under that of NT]).

In any case, it's not just a matter of queueing up kernel events, it's
also a matter of partially instead of completely reacting to the
events, since if an event comes in that says you have 1k of data, and
you only read 128 bytes of it, you will have to requeue, in LIFO
instead of FIFO order, a modified event with 1k-128 bytes, so the next
read completes as expected. Very gross code, which must then be
duplicated in every user space program, and which either requires a
"tail minus one" pointer or requires doubly linking the user space
event queue.

> I know nothing about any other implementations, though, and I'm
> speaking mainly from the experiences I've had with coding daemons
> using select().

Programs which are select-based are usually DFAs (Deterministic Finite
State Automata), which operate on non-blocking file descriptors. This
means that I/O is not interleaved, and so is not necessarily as
efficient as it could be, should there ever occur a time when an I/O
completion posts sooner after being checked than the amount of time it
takes to complete 50% of an event processing cycle (the reasons for
this involve queueing theory algebra, and are easier to explain in
terms of the relative value of positive and negative caches in
situations where a cache miss results in the need to perform a linear
traversal). A lot of this can be "magically" alleviated using POSIX AIO
calls in the underlying implementation, instead of relying on
non-blocking I/O -- even then, don't expect a better
Re: kqueue microbenchmark results
Simon Kirby wrote: > And you'd need to take the descriptor out of the read() set in the > select() case anyway, so I don't really see what's different. The difference is that taking a bit out of select()'s bitmap is basically free. Whereas the equivalent with events is a bind_event() system call. -- Jamie
Re: kqueue microbenchmark results
On Wed, Oct 25, 2000 at 12:23:07PM -0500, Jonathan Lemon wrote: > Consider a program which reads from point A, writes to point B. If > the buffer associated with B fills up, then we don't want to continue > reading from A. > > A/B may be network sockets, pipes, or ptys. Fine, but we can bind the event watching to the device or socket or pipe that will clog up, right? In which case, we'll later get a write event (just like with select()), and then once there is some progress you can go back to read()ing from the original descriptor. This is even easier than using select() because you don't have to take the descriptor out of the read set and put it in the write set temporarily -- it will automatically work that way. > Or perhaps you receive a request to use a resource that is currently > busy. Does your application want to postpone the request, or read the > data immediately, even if the request can't be serviced yet? Assuming this "resource" has a way of waking up the process when it unclogs, then you can go back and read the remaining data later, which is what you would want to do anyway. > My point is that I can easily think of several examples as to where > this behavior may be beneficial to the application, and I use some of > them myself. You can indeed get the same result by forcing each and > every application that wants this behavior to implement their own > tracking mechanism, but this strikes me as error-prone and places an > undue burden on the application programmer. I can see that you could write it this way... I'm just trying to see if it's really needed. :) As I wrote in my last email to Jamie, you would need to implement a tracking mechanism in any case to avoid DoS attacks from clients or a case where a single client can clog up the reading from any other client. And you'd need to take the descriptor out of the read() set in the select() case anyway, so I don't really see what's different. 
> You can find my paper at http://people.freebsd.org/~jlemon

I'll go and read it now. :)

Simon-
[ Stormix Technologies Inc. ][ NetNation Communications Inc. ]
[ [EMAIL PROTECTED] ][ [EMAIL PROTECTED]]
[ Opinions expressed are not necessarily those of my employers. ]
Re: kqueue microbenchmark results
On Wed, Oct 25, 2000 at 07:08:48PM +0200, Jamie Lokier wrote:
> Simon Kirby wrote:
> > What applications would do better by postponing some of the reading?
> > I can't think of any reason off the top of my head why an
> > application wouldn't want to read everything it can.
>
> Pipelined server.
>
> 1. Wait for event.
> 2. Read block.
> 3. If EAGAIN, goto 1.
> 4. If next request in block is incomplete, goto 2.
> 5. Process next request in block.
> 6. Write response.
> 7. If EAGAIN, wait until output is ready for writing then goto 6.
> 8. Goto 1 or 2, your choice.
>    (Here I'd go to 2 if the last read was complete -- it avoids a
>    redundant call to poll()).
>
> If you simply read everything you can at step 2, you'll run out of
> memory the moment someone sends you 10 requests.
>
> This doesn't happen if you leave unread data in kernel space --
> TCP windows and all that.

Hmm, I don't understand. What happens at "wait until output is ready
for writing then goto 6"? You mean you would stop the main loop to wait
for a single client to unclog? Wouldn't you just do this?

 1. Wait for event (read and write queued).
    Event occurs: Incoming data available.
 2. Read a block.
 3. Process block just read: Does it contain a full request? If not,
    queue, goto 2, munge together. If no more data, queue beginning of
    request, if any, and goto 1.
 4. Walk over available requests in block just read. Process.
 5. Attempt to write response, if any.
 6. Attempted write: Did it all get out? If not, queue waiting writable
    data and goto 1 to wait for a write event.
 7. Goto 2.

Assume we got write clogged. Some loop later:

10. Wait for event (read and write queued).
    Event occurs: Write space available.
11. Write remaining available data.
12. Attempted write: Did it all get out? If not, queue remaining
    writable data and goto 1 to wait for another write event.
13. Goto 2.

(If we're some sort of forwarding daemon and the receiving end of our
forward has just unclogged, we want to read any readable data we had
waiting. Same with if we're just answering a request, though, as the
send direction could still get clogged.)

What can't you do here? What's wrong? Note that the write event will
let you read any remaining queued data. If you actually stop going back
to the main loop when you're write clogged, you will pause the daemon
and create an easy DoS problem. There's no way around needing to queue
writable data at least.

This is how I wrote my irc daemon a while back, and it works fine with
select(). I can't see what wouldn't work with edge-triggered events
except perhaps the write() event -- I'm not sure what would be
considered "triggered", perhaps when it goes under a watermark or
something. In any case, it should all still work assuming get_events()
offers the ability to receive "write space available" events.

You don't have to read all data if you don't want to, assuming you will
get another event later that will unclog the situation (meaning the
obstacle must also trigger an event when it is cleared). In fact, if
you did leave the read queued in a daemon using select() before, you'd
keep looping endlessly taking all CPU and never idle because there
would always be read data available. You'd have to take the descriptor
out of the read set and instead stick it in the write set so that you
can sleep waiting for the write set to become available, effectively
ignoring any further events on the read set until the write unclogs.
This sounds just like what would happen if you only got one
notification (edge triggered) in the first place.

Simon-
[ Stormix Technologies Inc. ][ NetNation Communications Inc. ]
[ [EMAIL PROTECTED] ][ [EMAIL PROTECTED]]
[ Opinions expressed are not necessarily those of my employers. ]
Re: kqueue microbenchmark results
On Wed, Oct 25, 2000 at 11:27:09AM -0400, Simon Kirby wrote: > On Wed, Oct 25, 2000 at 01:02:46AM -0500, Jonathan Lemon wrote: > > > Yes, someone pointed me to those today. I would suggest reading > > some of the relevant literature before embarking on a design. My > > paper discusses some of the issues, and Mogul/Banga make some good > > points too. > > > > While an 'edge-trigger' design is indeed simpler, I feel that it > > ends up making the job of the application harder. A simple example > > to illustrate the point: what if the application does not choose > > to read all the data from an incoming packet? The app now has to > > implement its own state mechanism to remember that there may be pending > > data in the buffer, since it will not get another event notification > > unless another packet arrives. > > What applications would do better by postponing some of the reading? > I can't think of any reason off the top of my head why an application > wouldn't want to read everything it can. Doing everything in smaller > chunks would increase overhead (but maybe reduce latencies very slightly > -- albeit probably not much when using a get_events()-style interface). Consider a program which reads from point A, writes to point B. If the buffer associated with B fills up, then we don't want to continue reading from A. A/B may be network sockets, pipes, or ptys. Or perhaps you receive a request to use a resource that is currently busy. Does your application want to postpone the request, or read the data immediately, even if the request can't be serviced yet? My point is that I can easily think of several examples as to where this behavior may be beneficial to the application, and I use some of them myself. You can indeed get the same result by forcing each and every application that wants this behavior to implement their own tracking mechanism, but this strikes me as error-prone and places an undue burden on the application programmer. 
> Isn't it probably better to keep the kernel implementation as
> efficient as possible so that the majority of applications which will
> read (and write) all data possible can do it as efficiently as
> possible? Queueing up the events, even as they are in the form
> received from the kernel, is pretty simple for a userspace program to
> do, and I think it's the best place for it.

I don't believe that you must sacrifice efficiency to achieve this
goal; I think that you can provide both forms in an efficient fashion.

> I know nothing about any other implementations, though, and I'm
> speaking mainly from the experiences I've had with coding daemons
> using select(). You mention you wrote a paper discussing this
> issue...Where could I find this?

I'm also speaking from experience, from using various forms of event
notification. kqueue() is actually a 3rd generation system, building
off the experience I had with the first two, along with other input.

You can find my paper at http://people.freebsd.org/~jlemon
--
Jonathan
Re: kqueue microbenchmark results
Simon Kirby wrote:
> > While an 'edge-trigger' design is indeed simpler, I feel that it
> > ends up making the job of the application harder. A simple example
> > to illustrate the point: what if the application does not choose
> > to read all the data from an incoming packet? The app now has to
> > implement its own state mechanism to remember that there may be
> > pending data in the buffer, since it will not get another event
> > notification unless another packet arrives.
>
> What applications would do better by postponing some of the reading?
> I can't think of any reason off the top of my head why an application
> wouldn't want to read everything it can.

Pipelined server.

1. Wait for event.
2. Read block.
3. If EAGAIN, goto 1.
4. If next request in block is incomplete, goto 2.
5. Process next request in block.
6. Write response.
7. If EAGAIN, wait until output is ready for writing then goto 6.
8. Goto 1 or 2, your choice.
   (Here I'd go to 2 if the last read was complete -- it avoids a
   redundant call to poll()).

If you simply read everything you can at step 2, you'll run out of
memory the moment someone sends you 10 requests.

This doesn't happen if you leave unread data in kernel space --
TCP windows and all that.

enjoy,
-- Jamie
Re: kqueue microbenchmark results
On Wed, Oct 25, 2000 at 01:02:46AM -0500, Jonathan Lemon wrote: > Yes, someone pointed me to those today. I would suggest reading > some of the relevant literature before embarking on a design. My > paper discusses some of the issues, and Mogul/Banga make some good > points too. > > While an 'edge-trigger' design is indeed simpler, I feel that it > ends up making the job of the application harder. A simple example > to illustrate the point: what if the application does not choose > to read all the data from an incoming packet? The app now has to > implement its own state mechanism to remember that there may be pending > data in the buffer, since it will not get another event notification > unless another packet arrives. What applications would do better by postponing some of the reading? I can't think of any reason off the top of my head why an application wouldn't want to read everything it can. Doing everything in smaller chunks would increase overhead (but maybe reduce latencies very slightly -- albeit probably not much when using a get_events()-style interface). Isn't it probably better to keep the kernel implementation as efficient as possible so that the majority of applications which will read (and write) all data possible can do it as efficiently as possible? Queueing up the events, even as they are in the form received from the kernel, is pretty simple for a userspace program to do, and I think it's the best place for it. I know nothing about any other implementations, though, and I'm speaking mainly from the experiences I've had with coding daemons using select(). You mention you wrote a paper discussing this issue...Where could I find this? Simon- [ Stormix Technologies Inc. ][ NetNation Communications Inc. ] [ [EMAIL PROTECTED] ][ [EMAIL PROTECTED]] [ Opinions expressed are not necessarily those of my employers. 
]
Re: kqueue microbenchmark results
On Wed, Oct 25, 2000 at 01:02:46AM -0500, Jonathan Lemon wrote: Yes, someone pointed me to those today. I would suggest reading some of the relevant literature before embarking on a design. My paper discusses some of the issues, and Mogul/Banga make some good points too. While an 'edge-trigger' design is indeed simpler, I feel that it ends up making the job of the application harder. A simple example to illustrate the point: what if the application does not choose to read all the data from an incoming packet? The app now has to implement its own state mechanism to remember that there may be pending data in the buffer, since it will not get another event notification unless another packet arrives. What applications would do better by postponing some of the reading? I can't think of any reason off the top of my head why an application wouldn't want to read everything it can. Doing everything in smaller chunks would increase overhead (but maybe reduce latencies very slightly -- albeit probably not much when using a get_events()-style interface). Isn't it probably better to keep the kernel implementation as efficient as possible so that the majority of applications which will read (and write) all data possible can do it as efficiently as possible? Queueing up the events, even as they are in the form received from the kernel, is pretty simple for a userspace program to do, and I think it's the best place for it. I know nothing about any other implementations, though, and I'm speaking mainly from the experiences I've had with coding daemons using select(). You mention you wrote a paper discussing this issue...Where could I find this? Simon- [ Stormix Technologies Inc. ][ NetNation Communications Inc. ] [ [EMAIL PROTECTED] ][ [EMAIL PROTECTED]] [ Opinions expressed are not necessarily those of my employers. 
] - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: kqueue microbenchmark results
Simon Kirby wrote: While an 'edge-trigger' design is indeed simpler, I feel that it ends up making the job of the application harder. A simple example to illustrate the point: what if the application does not choose to read all the data from an incoming packet? The app now has to implement its own state mechanism to remember that there may be pending data in the buffer, since it will not get another event notification unless another packet arrives. What applications would do better by postponing some of the reading? I can't think of any reason off the top of my head why an application wouldn't want to read everything it can. Pipelined server. 1. Wait for event. 2. Read block 3. If EAGAIN, goto 1. 4. If next request in block is incomplete, goto 2. 5. Process next request in block. 6. Write response. 7. If EAGAIN, wait until output is ready for writing then goto 6. 8. Goto 1 or 2, your choice. (Here I'd go to 2 if the last read was complete -- it avoids a redundant call to poll()). If you simply read everything you can at step 2, you'll run out of memory the moment someone sends you 10 requests. This doesn't happen if you leave unread data in kernel space -- TCP windows and all that. enjoy, -- Jamie - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: kqueue microbenchmark results
On Wed, Oct 25, 2000 at 11:27:09AM -0400, Simon Kirby wrote: On Wed, Oct 25, 2000 at 01:02:46AM -0500, Jonathan Lemon wrote: Yes, someone pointed me to those today. I would suggest reading some of the relevant literature before embarking on a design. My paper discusses some of the issues, and Mogul/Banga make some good points too. While an 'edge-trigger' design is indeed simpler, I feel that it ends up making the job of the application harder. A simple example to illustrate the point: what if the application does not choose to read all the data from an incoming packet? The app now has to implement its own state mechanism to remember that there may be pending data in the buffer, since it will not get another event notification unless another packet arrives. What applications would do better by postponing some of the reading? I can't think of any reason off the top of my head why an application wouldn't want to read everything it can. Doing everything in smaller chunks would increase overhead (but maybe reduce latencies very slightly -- albeit probably not much when using a get_events()-style interface). Consider a program which reads from point A, writes to point B. If the buffer associated with B fills up, then we don't want to continue reading from A. A/B may be network sockets, pipes, or ptys. Or perhaps you receive a request to use a resource that is currently busy. Does your application want to postpone the request, or read the data immediately, even if the request can't be serviced yet? My point is that I can easily think of several examples as to where this behavior may be beneficial to the application, and I use some of them myself. You can indeed get the same result by forcing each and every application that wants this behavior to implement their own tracking mechanism, but this strikes me as error-prone and places an undue burden on the application programmer. 
Isn't it probably better to keep the kernel implementation as efficient as possible so that the majority of applications which will read (and write) all data possible can do it as efficiently as possible? Queueing up the events, even as they are in the form received from the kernel, is pretty simple for a userspace program to do, and I think it's the best place for it. I don't believe that you must sacrifice efficiency to achieve this goal, I think that you can provide both forms in an efficent fashion. I know nothing about any other implementations, though, and I'm speaking mainly from the experiences I've had with coding daemons using select(). You mention you wrote a paper discussing this issue...Where could I find this? I'm also speaking from experience, from using various forms of event notification. kqueue() is actually a 3rd generation system, building off the experience I had with the first two, along with other input. You can find my paper at http://people.freebsd.org/~jlemon -- Jonathan - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: kqueue microbenchmark results
On Wed, Oct 25, 2000 at 07:08:48PM +0200, Jamie Lokier wrote:
> Simon Kirby wrote:
> > What applications would do better by postponing some of the reading?
> > I can't think of any reason off the top of my head why an application
> > wouldn't want to read everything it can.
>
> Pipelined server.
>
> 1. Wait for event.
> 2. Read block
> 3. If EAGAIN, goto 1.
> 4. If next request in block is incomplete, goto 2.
> 5. Process next request in block.
> 6. Write response.
> 7. If EAGAIN, wait until output is ready for writing then goto 6.
> 8. Goto 1 or 2, your choice.  (Here I'd go to 2 if the last read was
>    complete -- it avoids a redundant call to poll()).
>
> If you simply read everything you can at step 2, you'll run out of
> memory the moment someone sends you 10 requests.  This doesn't happen
> if you leave unread data in kernel space -- TCP windows and all that.

Hmm, I don't understand.  What happens at "wait until output is ready
for writing then goto 6"?  You mean you would stop the main loop to wait
for a single client to unclog?

Wouldn't you just do this?

1.  Wait for event (read and write queued).  Event occurs: Incoming data
    available.
2.  Read a block.
3.  Process block just read: Does it contain a full request?  If not,
    queue, goto 2, munge together.  If no more data, queue beginning of
    request, if any, and goto 1.
4.  Walk over available requests in block just read.  Process.
5.  Attempt to write response, if any.
6.  Attempted write: Did it all get out?  If not, queue waiting writable
    data and goto 1 to wait for a write event.
7.  Goto 2.

Assume we got write clogged.  Some loop later:

10. Wait for event (read and write queued).  Event occurs: Write space
    available.
11. Write remaining available data.
12. Attempted write: Did it all get out?  If not, queue remaining
    writable data and goto 1 to wait for another write event.
13. Goto 2.

(If we're some sort of forwarding daemon and the receiving end of our
forward has just unclogged, we want to read any readable data we had
waiting.
Same with if we're just answering a request, though, as the send
direction could still get clogged.)

What can't you do here?  What's wrong?  Note that the write event will
let you read any remaining queued data.  If you actually stop from going
back to the main loop when you're write clogged, you will pause the
daemon and create an easy DoS problem.  There's no way around needing to
queue writable data at least.

This is how I wrote my irc daemon a while back, and it works fine with
select().  I can't see what wouldn't work with edge-triggered events
except perhaps the write() event -- I'm not sure what would be considered
"triggered", perhaps when it goes under a watermark or something.  In
any case, it should all still work assuming get_events() offers the
ability to receive "write space available" events.  You don't have to
read all data if you don't want to, assuming you will get another event
later that will unclog the situation (meaning the obstacle must also
trigger an event when it is cleared).

In fact, if you did leave the read queued in a daemon using select()
before, you'd keep looping endlessly taking all CPU and never idle
because there would always be read data available.  You'd have to not
queue the descriptor into the read set and instead stick it in the write
set so that you can sleep waiting for the write set to become available,
effectively ignoring any further events on the read set until the write
unclogs.  This sounds just like what would happen if you only got one
notification (edge triggered) in the first place.

Simon-

[ Stormix Technologies Inc. ][ NetNation Communications Inc. ]
[ [EMAIL PROTECTED] ][ [EMAIL PROTECTED] ]
[ Opinions expressed are not necessarily those of my employers. ]
Re: kqueue microbenchmark results
On Wed, Oct 25, 2000 at 12:23:07PM -0500, Jonathan Lemon wrote:
> Consider a program which reads from point A, writes to point B.  If the
> buffer associated with B fills up, then we don't want to continue
> reading from A.  A/B may be network sockets, pipes, or ptys.

Fine, but we can bind the event watching to the device or socket or pipe
that will clog up, right?  In which case, we'll later get a write event
(just like with select()), and then once there is some progress you can
go back to read()ing from the original descriptor.  This is even easier
than using select() because you don't have to take the descriptor out of
the read set and put it in the write set temporarily -- it will
automatically work that way.

> Or perhaps you receive a request to use a resource that is currently
> busy.  Does your application want to postpone the request, or read the
> data immediately, even if the request can't be serviced yet?

Assuming this "resource" has a way of waking up the process when it
unclogs, then you can go back and read the remaining data later, which
is what you would want to do anyway.

> My point is that I can easily think of several examples as to where
> this behavior may be beneficial to the application, and I use some of
> them myself.  You can indeed get the same result by forcing each and
> every application that wants this behavior to implement their own
> tracking mechanism, but this strikes me as error-prone and places an
> undue burden on the application programmer.

I can see that you could write it this way...  I'm just trying to see if
it's really needed. :)  As I wrote in my last email to Jamie, you would
need to implement a tracking mechanism in any case to avoid DoS attacks
from clients or a case where a single client can clog up the reading
from any other client.  And you'd need to take the descriptor out of the
read() set in the select() case anyway, so I don't really see what's
different.

> You can find my paper at http://people.freebsd.org/~jlemon

I'll go and read it now. :)

Simon-

[ Stormix Technologies Inc. ][ NetNation Communications Inc. ]
[ [EMAIL PROTECTED] ][ [EMAIL PROTECTED] ]
[ Opinions expressed are not necessarily those of my employers. ]
Re: kqueue microbenchmark results
Simon Kirby wrote:
> And you'd need to take the descriptor out of the read() set in the
> select() case anyway, so I don't really see what's different.

The difference is that taking a bit out of select()'s bitmap is
basically free.  Whereas the equivalent with events is a bind_event()
system call.

-- Jamie
Re: kqueue microbenchmark results
> What applications would do better by postponing some of the reading?
> I can't think of any reason off the top of my head why an application
> wouldn't want to read everything it can.  Doing everything in smaller
> chunks would increase overhead (but maybe reduce latencies very
> slightly -- albeit probably not much when using a get_events()-style
> interface).

Applications that:

o	Want to limit their memory footprint by limiting the amount of
	process VM they consume, and so limit their buffer size to less
	than all the data the stacks might be capable of providing at
	one time

o	With fixed-size messages, which want to operate on a message at
	a time, without restricting the sender to sending only a single
	message or whole messages at one time

o	Want to limit their overall system processing overhead for
	irrelevant/stale data (example: one which implements state delta
	refresh events, such as "Bolo" or "Netrek")

o	Have to implement "leaky bucket" algorithms, where it is
	permissible to drop some data on the floor, and assume it will
	be retransmitted later (e.g. ATM or user space protocols which
	want to implement QoS guarantees)

o	Need to take advantage of kernel strategies for protection from
	denial of service attacks, without having to redo those
	strategies themselves (particularly, flood attacks; this is the
	same reason inetd supports connection rate limiting on behalf
	of the programs it is responsible for starting)

o	With multiple data channels to which they are listening, some of
	which are more important than others (e.g. the Real Networks
	streaming media protocols are an example)

o	Want to evaluate the contents of a security negotiation prior to
	accepting data that was sent using an expired certificate or
	otherwise bogus credentials

There are all sorts of good reasons a programmer would want to trust the
kernel, instead of having to build ring buffers into each and every
program they write to ensure they remember data which is irrelevant to
the processing at hand, or protect their code against buffer overflows
initiated by trusted applications.

> Isn't it probably better to keep the kernel implementation as efficient
> as possible so that the majority of applications which will read (and
> write) all data possible can do it as efficiently as possible?
> Queueing up the events, even as they are in the form received from the
> kernel, is pretty simple for a userspace program to do, and I think
> it's the best place for it.

Reading, yes.  Writing, no.  The buffers they are filling in the kernel
belong to the kernel, not the application, despite what Microsoft tells
you about WSOCK32 programming.  The WSOCK32 model assumes that the
networking is implemented in another user space process, rather than in
the kernel.  People who use the "async" WSOCK32 interface rarely
understand the implications, because they rarely understand how async
messages are built using a Windows data pump, which serializes all
requests through the moral equivalent of a select loop (which is why NT
supports async notification on socket I/O, but other versions of Windows
do not [NB: actually, they could, using an fd=-1 I/O completion port,
but the WSOCK32 programmers were a bit lazy and were also being told to
keep performance under that of NT]).

In any case, it's not just a matter of queueing up kernel events, it's
also a matter of partially instead of completely reacting to the events,
since if an event comes in that says you have 1k of data, and you only
read 128 bytes of it, you will have to requeue, in LIFO instead of FIFO
order, a modified event with 1k-128 bytes, so the next read completes as
expected.  Very gross code, which must then be duplicated in every user
space program, and either requires a "tail minus one" pointer, or
requires doubly linking the user space event queue.

> I know nothing about any other implementations, though, and I'm
> speaking mainly from the experiences I've had with coding daemons
> using select().

Programs which are select-based are usually DFAs (Deterministic Finite
State Automatons), which operate on non-blocking file descriptors.  This
means that I/O is not interleaved, and so is not necessarily as
efficient as it could be, should there ever occur a time when an I/O
completion posts sooner after being checked than the amount of time it
takes to complete 50% of an event processing cycle (the reasons for this
involve queueing theory algebra, and are easier to explain in terms of
the relative value of positive and negative caches in situations where a
cache miss results in the need to perform a linear traversal).  A lot of
this can be "magically" alleviated using POSIX AIO calls in the
underlying implementation, instead of relying on non-blocking I/O --
even then, don't expect a better than 50%
Re: kqueue microbenchmark results
On Wed, Oct 25, 2000 at 09:40:53PM +0200, Jamie Lokier wrote:
> Simon Kirby wrote:
> > And you'd need to take the descriptor out of the read() set in the
> > select() case anyway, so I don't really see what's different.
>
> The difference is that taking a bit out of select()'s bitmap is
> basically free.  Whereas the equivalent with events is a bind_event()
> system call.

With the caveat that kevent() will take a changelist at the same time
that it returns an eventlist, so while you do incur some kernel
processing to temporarily disable the descriptor, the system call is
essentially free.
--
Jonathan
Re: kqueue microbenchmark results
On Wed, Oct 25, 2000 at 11:40:28AM -0700, Simon Kirby wrote:
> On Wed, Oct 25, 2000 at 12:23:07PM -0500, Jonathan Lemon wrote:
> > Consider a program which reads from point A, writes to point B.  If
> > the buffer associated with B fills up, then we don't want to continue
> > reading from A.  A/B may be network sockets, pipes, or ptys.
>
> Fine, but we can bind the event watching to the device or socket or
> pipe that will clog up, right?  In which case, we'll later get a write
> event (just like with select()), and then once there is some progress
> you can go back to read()ing from the original descriptor.  This is
> even easier than using select() because you don't have to take the
> descriptor out of the read set and put it in the write set temporarily
> -- it will automatically work that way.

Yes, but with the above, you can't use get_event() as your main
dispatching loop to do the read() call any more, since there may be no
notifications pending in the queue.  So you have to expand your main
loop to include both get_event() as well as walk the "these descriptors
may have partial data" list.

Also, as Jamie pointed out, with kqueue/select you can do:

	kevent/read/write

while with a pure edge-triggered scheme, you either must do:

	bind_event/read/.../read == 0/write

Or maintain your own "this descriptor may have data" list.

Also, consider the following scenario for the proposed get_event():

   1. packet arrives, queues an event.
   2. user retrieves event.
   3. second packet arrives, queues event again.
   4. user reads() all data.

Now, next time around the loop, we get a notification for an event when
there is no data to read.  The application now must be prepared to
handle this case (meaning no blocking read() calls can be used).

Also, what happens if the user closes the socket after step 4 above?
The user now receives a notification for a fd which no longer exists, or
possibly has been reused for another connection.  This may or may not
make a difference to the application, but it must be prepared to handle
it anyway.  I believe that Zack Brown ran into this problem with one of
the webservers he was writing.

> > You can find my paper at http://people.freebsd.org/~jlemon
>
> I'll go and read it now. :)

The paper talks about some of the issues we have been discussing, as
well as the design rationale behind kqueue.  I'd be happy to answer any
questions about the paper.
--
Jonathan
RE: kqueue microbenchmark results
> Now, next time around the loop, we get a notification for an event when
> there is no data to read.  The application now must be prepared to
> handle this case (meaning no blocking read() calls can be used).
> --
> Jonathan

If the programmer never wants to block in a read call, he should never
do a blocking read anyway.  There's no standard that requires
readability at time X to imply readability at time X+1.

DS
RE: kqueue microbenchmark results
> On Wed, Oct 25, 2000 at 03:11:37PM -0700, David Schwartz wrote:
> > > Now, next time around the loop, we get a notification for an event
> > > when there is no data to read.  The application now must be
> > > prepared to handle this case (meaning no blocking read() calls can
> > > be used).
> >
> > If the programmer never wants to block in a read call, he should
> > never do a blocking read anyway.  There's no standard that requires
> > readability at time X to imply readability at time X+1.
>
> Quite true on the surface.  But taking that statement at face value
> implies that it is okay for poll() to return POLLIN on a descriptor
> even if there is no data to be read.  I don't think this is the
> intention.

Never mind what it implies.  Just stick to what it says. :)

In my opinion, it's perfectly reasonable for an implementation to show
POLLIN on a call to poll() and then later block in read().  As far as I
know no implementation does this, but no standard prevents an
implementation from, for example, swapping out received TCP data to disk
if it's not retrieved, and blocking later when you ask for the data
until it can get the data back.

I would even argue that it's possible for an implementation to decide
that a connection had errored (for example, due to a timeout) and signal
POLLIN.  Then before you call 'read', it gets a packet and decides that
the connection is actually fine, and so blocks in 'read'.  This wouldn't
seem possible in TCP, but it's possible to imagine protocols where it's
sensible to do.  And again, as far as I know, no standard prohibits it.

If a programmer does not ever wish to block under any circumstances,
it's his obligation to communicate this desire to the implementation.
Otherwise, the implementation can block if it doesn't have data or an
error available at the instant 'read' is called, regardless of what it
may have known or done in the past.

It's also just generally good programming practice.  There was a time
when many operating systems had bugs that caused 'select loop' type
applications to hang if they didn't set all their descriptors
non-blocking.

DS
Re: kqueue microbenchmark results
On Wed, Oct 25, 2000 at 11:27:09AM -0400, Simon Kirby wrote:
> On Wed, Oct 25, 2000 at 01:02:46AM -0500, Jonathan Lemon wrote:
> > ends up making the job of the application harder.  A simple example
> > to illustrate the point: what if the application does not choose to
> > read all the data from an incoming packet?  The app now has to
>
> What applications would do better by postponing some of the reading?
> I can't think of any reason off the top of my head why an application
> wouldn't want to read everything it can.  Doing everything in smaller

I can see this happening if the application does not know how much data
is in the buffer, or if the data is being read into a buffer that does
not have much space left in it.

Jim
Re: kqueue microbenchmark results
* David Schwartz [EMAIL PROTECTED] [001025 15:35] wrote:
> If a programmer does not ever wish to block under any circumstances,
> it's his obligation to communicate this desire to the implementation.
> Otherwise, the implementation can block if it doesn't have data or an
> error available at the instant 'read' is called, regardless of what it
> may have known or done in the past.
>
> It's also just generally good programming practice.  There was a time
> when many operating systems had bugs that caused 'select loop' type
> applications to hang if they didn't set all their descriptors
> non-blocking.

Yes, and as you mentioned, it was _bugs_ in the operating system that
did this.  I don't think it's wise to continue speculating on this issue
unless you can point to a specific document that says that it's OK for
this type of behavior to happen.

Let's take a look at the FreeBSD manpage for poll:

     POLLIN    Data other than high priority data may be read without
               blocking.

OK, no one bothers to do *BSD compat anymore (*grumble*), so, Solaris:

     POLLIN    Data other than high priority data may be read without
               blocking.  For STREAMS, this flag is set in revents even
               if the message is of zero length.

I see a trend here, let's try Linux:

     #define POLLIN      0x0001      /* There is data to read */

This seems to imply that it is one hell of a bug to block; returning an
error would be acceptable, but surely not blocking.

I know manpages are a poor source for references, but you're the one
putting up a big fight for blocking behavior from poll; perhaps you can
point out a standard that contradicts the manpages?

--
-Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
"I have the heart of a child; I keep it in a jar on my desk."
Re: kqueue microbenchmark results
On Tue, Oct 24, 2000 at 09:45:14PM -0700, Dan Kegel wrote:
> If you haven't already, you might peek at the discussion on
> linux-kernel.  Linus seems to be on the verge of adding something like
> kqueue() to Linux, but appears opposed to supporting level-triggering;
> he likes the simplicity of edge triggering (from the kernel's point of
> view!).  See
> http://boudicca.tux.org/hypermail/linux-kernel/2000week44/index.html#9

Yes, someone pointed me to those today.  I would suggest reading some of
the relevant literature before embarking on a design.  My paper
discusses some of the issues, and Mogul/Banga make some good points too.

While an 'edge-trigger' design is indeed simpler, I feel that it ends up
making the job of the application harder.  A simple example to
illustrate the point: what if the application does not choose to read
all the data from an incoming packet?  The app now has to implement its
own state mechanism to remember that there may be pending data in the
buffer, since it will not get another event notification unless another
packet arrives.

kqueue() provides the ability for the user to choose which model suits
their needs better, in keeping with the unix philosophy of tools, not
policies.
--
Jonathan
Re: kqueue microbenchmark results
Jonathan,

Thanks for running that test for me!  I've added your results (plus a
cautionary note about microbenchmarks and a link to your site) to
http://www.kegel.com/dkftpbench/Poller_bench.html

If you haven't already, you might peek at the discussion on
linux-kernel.  Linus seems to be on the verge of adding something like
kqueue() to Linux, but appears opposed to supporting level-triggering;
he likes the simplicity of edge triggering (from the kernel's point of
view!).  See
http://boudicca.tux.org/hypermail/linux-kernel/2000week44/index.html#9

Thanks,
Dan

Jonathan Lemon wrote:
> I recently stumbled across a message you posted asking for
> microbenchmarks on kqueue.  While I do think that microbenchmarks are
> partially misleading, I did run them on my machine for various numbers
> of connections, with varying number of active connections.  The
> results are shown below.
>
> The results dovetail with what I expect: kqueue scales depending on
> the number of active connections that it sees, not with the total
> number of connections.
>
> Also, I presented a paper/talk at the recent BSDCon 2000, these are
> available at http://www.freebsd.org/~jlemon if you're interested.
> --
> Jonathan
>
> This is on a single processor 600Mhz Pentium-III with 512MB of
> memory, running FreeBSD 4.x-STABLE:
>
> [ 1 active pipe ]
> cache[10:13pm]> ./Poller_bench 5 1 spk 100 1000 10000 30000
> pipes      100   1000   10000   30000
> select      54     --       -       -
> poll        50    552   11559   35178
> kqueue       8      8       8       8
>
> [ 10 active pipes ]
> cache[10:13pm]> ./Poller_bench 5 10 spk 100 1000 10000 30000
> pipes      100   1000   10000   30000
> select     100     --       -       -
> poll        95    571   11697   35499
> kqueue      52     52      55      56
>
> [ 100 active pipes ]
> cache[10:13pm]> ./Poller_bench 5 100 spk 100 1000 10000 30000
> pipes      100   1000   10000   30000
> select     542     --       -       -
> poll       528   1091   12440   36530
> kqueue     574    592     623     702