Re: kqueue microbenchmark results

2000-10-27 Thread Dan Kegel

Terry Lambert wrote:
> 
> > > Which is precisely why you need to know where in the chain of events this
> > > happened. Otherwise if I see
> > > 'read on fd 5'
> > > 'read on fd 5'
> > > How do I know which read is for which fd in the multithreaded case
> >
> > That can't happen, can it?  Let's say the following happens:
> >close(5)
> >accept() = 5
> >call kevent() and rebind fd 5
> > The 'close(5)' would remove the old fd 5 events.  Therefore,
> > any fd 5 events you see returned from kevent are for the new fd 5.
> 
> Strictly speaking, it can happen in two cases:
> 
> 1)  single acceptor thread, multiple worker threads
> 2)  multiple anonymous "work to do" threads
> 
> In both these cases, the incoming requests from a client are
> given to any thread, rather than a particular thread.
> 
> In the first case, we can have (id:execution order:event):
> 
> 1:1:open 5
> 2:2:read 5
> 3:4:read 5
> 2:3:close 5
> 
> If thread 2 processes the close event before thread 3 processes
> the read event, then when thread 3 attempts processing, it will
> fail.

You're not talking about kqueue() / kevent() here, are you?
With that interface, thread 2 would not see a close event;
instead, the other events for fd 5 would vanish from the queue.
If you were indeed talking about kqueue() / kevent(), please flesh
out the example a bit more, showing who calls kevent().

(A race that *can* happen is fd 5 could be closed by another
thread after a 'read 5' event is pulled from the event queue and
before it is processed, but that could happen with any
readiness notification API at all.)
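
For illustration, here is a minimal sketch (in C, names hypothetical)
of that window: the event is valid when kevent() returns it, but the
fd can be closed by another thread before the worker acts, so the
worker must tolerate EBADF.

    #include <sys/types.h>
    #include <sys/event.h>
    #include <errno.h>
    #include <unistd.h>

    /* Sketch only: the event was valid when kevent() returned, but
     * another thread may close the fd before read() runs. */
    void worker_loop(int kq)
    {
        struct kevent ev;
        char buf[4096];

        while (kevent(kq, NULL, 0, &ev, 1, NULL) > 0) {
            ssize_t n = read((int)ev.ident, buf, sizeof(buf));
            if (n < 0 && errno == EBADF)
                continue;   /* fd closed between dequeue and read */
            /* ... process n bytes ... */
        }
    }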

- Dan



Re: kqueue microbenchmark results

2000-10-27 Thread Terry Lambert

> > Which is precisely why you need to know where in the chain of events this
> > happened. Otherwise if I see
> > 
> > 'read on fd 5'
> > 'read on fd 5'
> > 
> > How do I know which read is for which fd in the multithreaded case
> 
> That can't happen, can it?  Let's say the following happens:
>close(5)
>accept() = 5
>call kevent() and rebind fd 5
> The 'close(5)' would remove the old fd 5 events.  Therefore,
> any fd 5 events you see returned from kevent are for the new fd 5.
> 
> (I suspect it helps that kevent() is both the only way to
> bind events and the only way to pick them up; makes it harder
> for one thread to sneak a new fd into the event list without
> the thread calling kevent() noticing.)

Strictly speaking, it can happen in two cases:

1)  single acceptor thread, multiple worker threads

2)  multiple anonymous "work to do" threads

In both these cases, the incoming requests from a client are
given to any thread, rather than a particular thread.

In the first case, we can have (id:execution order:event):

1:1:open 5
2:2:read 5
3:4:read 5
2:3:close 5

If thread 2 processes the close event before thread 3 processes
the read event, then when thread 3 attempts processing, it will
fail.

Technically, this is a group ordering problem in the design of
the software, which should instead queue all events to a dispatch
thread, and the threads should use IPC to serialize processing of
serial events.  This is similar to the problem with async mounted
FS recovery in event of a crash: without ordering guarantees, you
can only get to a "good" state, not necessarily "the one correct
state".

In the second case, we can have:

1:2:read 5
2:1:open 5
3:4:read 5
2:3:close 5

This is just a non-degenerate form of the first case, where we
allow thread 1 and all other threads to be identical, and don't
serialize open state initialization.

The NetWare for UNIX system uses this model.  The benefit is
that all user space threads can be identical.  This means that
I can use either threads or processes, and it won't matter, so
my software can run on older systems that lack "perfect" threads
models, simply by using processes, and putting client state into
shared memory.

In this case, there is no need for inter-thread synchronization;
instead, we must insist that events be dispatched sequentially,
and that the events be processed serially.  This effectively
requires event processing completion notification from user
space to kernel space.

In NetWare for UNIX, this was accomplished using a streams MUX
which knew that the NetWare protocol was request-response.  This
also permitted "busy" responses to be turned around in kernel
space, without incurring a kernel-to-user space scheduling
penalty.  It also permitted "piggyback", where an ioctl to the
mux was used to respond, and combined sending a response with
the next read.  This reduced protection domain crossing and the
context switch overhead by 50%.  Finally, the MUX sent requests
to user space in LIFO order.  This approach is called "hot engine
scheduling", in that the last reader in from user space is the
most likely to have its pages in core, so as to not need swapping
to handle the next request.

I was the architect of much of the process model discussed above; as
you can see, there are some significant performance wins to be
had by building the right interfaces, and putting the code on
the right side of the user/kernel boundary.

In any case, the answer is that you can not assume that the only
correct way to solve a problem like event inversion is serialization
of events in user space (or kernel space).  This is not strictly a
"threaded application implementation" issue, and it is not strictly
a kernel serialization of event delivery issue.

Another case, which NetWare did not handle, is that of rejected
authentication.  Even if you went with the first model, and forced
your programmers to use expensive inter-thread synchronization, or
worse, bound each client to a single thread in the server, thus
rendering the system likely to have skewed thread load, getting
worse the longer the connection was up, you would still have the
problem of rejected authentication.  A client might attempt to
send authentication followed by commands in the same packet series,
without waiting for an explicit ACK after each one (i.e. it might
attempt to implement a sliding window over a virtual circuit), and
the system on the other end might diligently queue the events,
only to have the authentication be rejected, but with packets
queued already to user space for processing, assuming serialization
in user space.  You would then need a much more complex mechanism,
to allow you to invalidate an already queued event to another
thread, which you don't know about in your thread, before you
release the interlock.  Otherwise the client may get responses
without a valid authentication.

You need look no further than LDAPv3 for an example of a protocol
where this is possible (assuming

Re: kqueue microbenchmark results

2000-10-27 Thread Alfred Perlstein

* Dan Kegel <[EMAIL PROTECTED]> [001027 09:40] wrote:
> Alan Cox wrote:
> > > > kqueue currently does this; a close() on an fd will remove any pending
> > > > events from the queues that they are on which correspond to that fd.
> > > 
> > > the application of a close event.  What can I say, "the fd formerly known
> > > as X" is now gone?  It would be incorrect to say that "fd X was closed",
> > > since X no longer refers to anything, and the application may have reused
> > > that fd for another file.
> > 
> > Which is precisely why you need to know where in the chain of events this
> > happened. Otherwise if I see
> > 
> > 'read on fd 5'
> > 'read on fd 5'
> > 
> > How do I know which read is for which fd in the multithreaded case
> 
> That can't happen, can it?  Let's say the following happens:
>close(5)
>accept() = 5
>call kevent() and rebind fd 5
> The 'close(5)' would remove the old fd 5 events.  Therefore,
> any fd 5 events you see returned from kevent are for the new fd 5.
> 
> (I suspect it helps that kevent() is both the only way to
> bind events and the only way to pick them up; makes it harder
> for one thread to sneak a new fd into the event list without
> the thread calling kevent() noticing.)

Yes, that's how it does work, and how it should work.  Noticing the
close() should be done via thread communication/IPC, not stuck into
kqueue.

-- 
-Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
"I have the heart of a child; I keep it in a jar on my desk."



Re: kqueue microbenchmark results

2000-10-27 Thread Dan Kegel

Alan Cox wrote:
> > > kqueue currently does this; a close() on an fd will remove any pending
> > > events from the queues that they are on which correspond to that fd.
> > 
> > the application of a close event.  What can I say, "the fd formerly known
> > as X" is now gone?  It would be incorrect to say that "fd X was closed",
> > since X no longer refers to anything, and the application may have reused
> > that fd for another file.
> 
> Which is precisely why you need to know where in the chain of events this
> happened. Otherwise if I see
> 
> 'read on fd 5'
> 'read on fd 5'
> 
> How do I know which read is for which fd in the multithreaded case

That can't happen, can it?  Let's say the following happens:
   close(5)
   accept() = 5
   call kevent() and rebind fd 5
The 'close(5)' would remove the old fd 5 events.  Therefore,
any fd 5 events you see returned from kevent are for the new fd 5.

(I suspect it helps that kevent() is both the only way to
bind events and the only way to pick them up; makes it harder
for one thread to sneak a new fd into the event list without
the thread calling kevent() noticing.)
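
To make the sequence concrete, a sketch using the kqueue calls
themselves (the fd numbers and listening socket are illustrative):

    #include <sys/types.h>
    #include <sys/event.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* close() drops any of oldfd's pending events from the kqueue,
     * so after the rebind every event reported for this fd number
     * belongs to the new socket. */
    void rebind(int kq, int listen_fd, int oldfd)
    {
        struct kevent ev;
        int newfd;

        close(oldfd);                           /* old events vanish */
        newfd = accept(listen_fd, NULL, NULL);  /* may reuse the number */

        EV_SET(&ev, newfd, EVFILT_READ, EV_ADD, 0, 0, NULL);
        kevent(kq, &ev, 1, NULL, 0, NULL);      /* rebind the new fd */
    }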

- Dan



Re: kqueue microbenchmark results

2000-10-27 Thread Alfred Perlstein

* Jamie Lokier <[EMAIL PROTECTED]> [001027 08:21] wrote:
> Alfred Perlstein wrote:
> > > If a programmer does not ever wish to block under any circumstances, it's
> > > his obligation to communicate this desire to the implementation. Otherwise,
> > > the implementation can block if it doesn't have data or an error available
> > > at the instant 'read' is called, regardless of what it may have known or
> > > done in the past.
> > 
> > Yes, and as you mentioned, it was _bugs_ in the operating system
> > that did this.
> 
> Not for writes.  POLLOUT may be returned when the kernel thinks you have
> enough memory to do a write, but someone else may allocate memory before
> you call write().  Or does POLLOUT not work this way?

POLLOUT checks the socket buffer (if we're talking about sockets),
and yes you may still block on mbuf allocation (if we're talking
about FreeBSD) if the socket isn't set non-blocking.  Actually
POLLOUT may be set even if there isn't enough memory for a write
in the network buffer pool.

> For read, you still want to declare the sockets non-blocking so your
> code is robust on _other_ operating systems.  It's pretty straightforward.

Yes, it's true, not using non-blocking sockets is like ignoring
friction in a physics problem, but assuming you have complete
control over the machine, it shouldn't trip you up that often.  And
we're talking about readability, not writeability, which, as you
mentioned, may block because of contention for the network buffer
pool.
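
The guard we are both assuming, spelled out as a sketch: mark the
socket non-blocking once, and a read() or write() that can't proceed
fails with EAGAIN instead of sleeping.

    #include <fcntl.h>

    /* Make poll()/kevent() results advisory rather than a promise:
     * a non-blocking socket returns EAGAIN instead of blocking if
     * the kernel changes its mind after the readiness report. */
    int set_nonblocking(int fd)
    {
        int flags = fcntl(fd, F_GETFL, 0);
        if (flags < 0)
            return -1;
        return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
    }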


-- 
-Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
"I have the heart of a child; I keep it in a jar on my desk."



Re: kqueue microbenchmark results

2000-10-27 Thread Jamie Lokier

Alfred Perlstein wrote:
> > If a programmer does not ever wish to block under any circumstances, it's
> > his obligation to communicate this desire to the implementation. Otherwise,
> > the implementation can block if it doesn't have data or an error available
> > at the instant 'read' is called, regardless of what it may have known or
> > done in the past.
> 
> Yes, and as you mentioned, it was _bugs_ in the operating system
> that did this.

Not for writes.  POLLOUT may be returned when the kernel thinks you have
enough memory to do a write, but someone else may allocate memory before
you call write().  Or does POLLOUT not work this way?

For read, you still want to declare the sockets non-blocking so your
code is robust on _other_ operating systems.  It's pretty straightforward.

-- Jamie



Re: kqueue microbenchmark results

2000-10-26 Thread Alfred Perlstein

* Alan Cox <[EMAIL PROTECTED]> [001026 18:33] wrote:
> > the application of a close event.  What can I say, "the fd formerly known
> > as X" is now gone?  It would be incorrect to say that "fd X was closed",
> > since X no longer refers to anything, and the application may have reused
> > that fd for another file.
> 
> Which is precisely why you need to know where in the chain of events this
> happened. Otherwise if I see
> 
>   'read on fd 5'
>   'read on fd 5'
> 
> How do I know which read is for which fd in the multithreaded case

No, you don't; you don't see anything with the current code unless
fd 5 is still around.  What you're presenting to Jonathan is an
application threading problem, not something that needs to be
resolved by the OS.

> > As for the multi-thread case, this would be a bug; if one thread closes
> > the descriptor, the other thread is going to get an EBADF when it goes 
> > to perform the read.
> 
> Another thread may already have reused the fd

This is another example of an application threading problem.

-- 
-Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
"I have the heart of a child; I keep it in a jar on my desk."



Re: kqueue microbenchmark results

2000-10-26 Thread Alan Cox

> the application of a close event.  What can I say, "the fd formerly known
> as X" is now gone?  It would be incorrect to say that "fd X was closed",
> since X no longer refers to anything, and the application may have reused
> that fd for another file.

Which is precisely why you need to know where in the chain of events this
happened. Otherwise if I see

'read on fd 5'
'read on fd 5'

How do I know which read is for which fd in the multithreaded case?

> As for the multi-thread case, this would be a bug; if one thread closes
> the descriptor, the other thread is going to get an EBADF when it goes 
> to perform the read.

Another thread may already have reused the fd




Re: kqueue microbenchmark results

2000-10-26 Thread Jonathan Lemon

On Fri, Oct 27, 2000 at 01:50:40AM +0100, Alan Cox wrote:
> > kqueue currently does this; a close() on an fd will remove any pending
> > events from the queues that they are on which correspond to that fd.
> 
> This seems an odd thing to do. Surely what you need to do is to post a
> 'close completed' event to the queue. This also makes more sense when you
> have a threaded app and another thread may well currently be in say a read
> at the time it is closed

Actually, it makes sense when you think about it.  The `fd' is actually
a capability that the application uses to refer to the open file in the
kernel.  If the app does a close() on the fd, it destroys this naming.

The application then has no capability left which refers to the formerly
open socket, and conversely, the kernel has no capability (name) to notify
the application of a close event.  What can I say, "the fd formerly known
as X" is now gone?  It would be incorrect to say that "fd X was closed",
since X no longer refers to anything, and the application may have reused
that fd for another file.

As for the multi-thread case, this would be a bug; if one thread closes
the descriptor, the other thread is going to get an EBADF when it goes 
to perform the read.
--
Jonathan



Re: kqueue microbenchmark results

2000-10-26 Thread Alfred Perlstein

* Alan Cox <[EMAIL PROTECTED]> [001026 17:50] wrote:
> > kqueue currently does this; a close() on an fd will remove any pending
> > events from the queues that they are on which correspond to that fd.
> 
> This seems an odd thing to do. Surely what you need to do is to post a
> 'close completed' event to the queue. This also makes more sense when you
> have a threaded app and another thread may well currently be in say a read
> at the time it is closed

Kqueue's flexibility could allow this to be implemented; all you
would need to do is add a new filter trigger.  You might need
a _bit_ of hackery to make sure those aren't removed, or one
could just add the event after clearing all pending events.

Adding a filter to be informed when a specific fd is closed is
certainly an option, but it doesn't make very much sense because that
fd could then be reused quickly by something else...

but anyhow:

The point of this interface is to ask kqueue to report only on the
things you are interested in, not to generate superfluous events that
you wouldn't care about.  You could make such a flag if Linux adopted
this interface and I'm sure we'd be forced to adopt it, but if you
make kqueue generate info an application won't care about, I don't
think that would be taken back.
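
If someone did add such a filter, registration would follow the
usual kevent() conventions.  The sketch below is purely hypothetical:
EVFILT_CLOSE does not exist, and is defined here only to make the
idea concrete.

    #include <sys/types.h>
    #include <sys/event.h>

    #define EVFILT_CLOSE (-15)  /* HYPOTHETICAL filter number; no
                                 * such filter exists in kqueue */

    /* Ask to be told when 'fd' is closed; a sketch of the proposal
     * above, not a real interface. */
    void watch_close(int kq, int fd)
    {
        struct kevent ev;

        EV_SET(&ev, fd, EVFILT_CLOSE, EV_ADD, 0, 0, NULL);
        kevent(kq, &ev, 1, NULL, 0, NULL);
    }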

-- 
-Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
"I have the heart of a child; I keep it in a jar on my desk."



Re: kqueue microbenchmark results

2000-10-26 Thread Alan Cox

> kqueue currently does this; a close() on an fd will remove any pending
> events from the queues that they are on which correspond to that fd.

This seems an odd thing to do. Surely what you need to do is to post a
'close completed' event to the queue. This also makes more sense when you
have a threaded app and another thread may well currently be in say a read
at the time it is closed




Re: kqueue microbenchmark results

2000-10-26 Thread Terry Lambert

This is a long posting, with a humble beginning, but it has
a point.  I'm being complete so that no one is left in the
dark, or in any doubt as to what that point is.  That means
rehashing some history.

This posting is not really about select or Linux: it's about
interfaces.  Like cached state, interfaces can often be
harmful.

NB: I really should redirect this to FreeBSD, as well, since
there are people in that camp who haven't learned the lesson,
either, but I'll leave it in -chat, for now.

---

[ ... kqueue discussion ... ]

> Linux also thought it was OK to modify the contents of the
> timeval structure before returning it.

It's been pointed out that I should provide more context
for this statement, before people look at me strangely and
make circling motions with their index fingers around
their ears (or whatever the international sign for "crazy"
is these days).  So I'll start with a brief history.

The context is this: the select API was designed with the
idea that one might wish to do non-I/O related background
processing.  Toward this end, one could have several ways
of using the API:

1)  The (struct timeval *) could be NULL.  This means
"block until a signal or until a condition on
which you are selecting is true"; select is a BSD
interface, and, until BSD 4.x and POSIX signals,
the signal would actually call the handler and
restart the select call, so in effect, this really
meant "block until you longjmp out of a signal
handler or until a condition on which you are
selecting is true".

2)  The (struct timeval *) could point to the address
of a real timeval structure (i.e. not be NULL); in
that case, the result depended on the contents:

a)  If the timeval struct was zero valued, it
meant that the select should poll for one
of the conditions being selected for in
the descriptor set, and return a 0 if no
conditions were true.  The contents of
the bitmaps and timeval struct were left
alone.

b)  If the timeval struct was not zero valued,
it meant that the select should wait until
the time specified had expired since the
system call was first started, or one of
the conditions being selected for was true.
If the timeout expired, then a 0 would be
returned, but if one or more of the conditions
were true, the number of descriptors on which
true conditions existed would be returned.

Wedging so much into a single interface was fraught with peril:
it was undefined as to what would happen if the timeval specified
an interval of 5 seconds, yet there was a persistently rescheduled
alarm every 2 seconds, resulting in a signal handler call that did
_not_ longjmp... would the timer expire after 5 seconds, or would
the timer be considered to have been restarted along with the call?
Implementations that went both ways existed.  Mostly, programmers
used longjmp in signal handlers, and it wasn't a portability issue.

More perilous was the question of what to do with a partially
satisfied request that was interrupted by a timer or signal
handler and longjmp (later, siginterrupt(2), and later POSIX
non-restart default behaviour).  This meant that the bitmap of
select events might have been modified already, after the
wakeup, but before the process was rescheduled to run.

Finally, the select manual page specifically reserved the right
to modify the contents of the timeval struct; this was presumably
so that you could either do accurate timekeeping by maintaining
a running tally using the timeval deficit (a lot of math, that),
or, more likely, to deal with the system call restart, and ensure
that signals would not prevent the select from ever exiting in
the case of system call restart.

So this was the select API definition.

---

Being pragmatists, programmers programmed to the behaviour of
the API in actual implementations, rather than to the strict
"letter of the law" laid down by the man page.  This meant
that select was called in loop control constructs, and that
the bitmaps were reinitialized each time through the loop.

It also meant that the timeval struct was not reinitialized,
since that was more work, and no known implementations would
modify it.  Pre-POSIX signals, signal handlers were handled on
a signal stack, as a result of a kernel trampoline outcall,
and that meant that a restarting system call would not impact
the countdown.
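
The defensive idiom that fell out of this history, sketched below:
rebuild both the bitmap and the timeval on every pass, so the code
is correct whether or not the implementation scribbles on them.

    #include <sys/time.h>
    #include <sys/select.h>

    /* Portable select loop: reinitialize the fd_set AND the timeval
     * each iteration, since an implementation is permitted to
     * modify both before returning. */
    void wait_readable(int fd)
    {
        for (;;) {
            fd_set rfds;
            struct timeval tv;

            FD_ZERO(&rfds);
            FD_SET(fd, &rfds);
            tv.tv_sec = 5;              /* reset every pass */
            tv.tv_usec = 0;

            if (select(fd + 1, &rfds, NULL, NULL, &tv) > 0 &&
                FD_ISSET(fd, &rfds))
                break;                  /* fd is readable */
        }
    }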

---

Linux came along, and implemented the letter of the law; the
machines were now sufficiently fast, and the math sufficiently
cheap, that it was now possible to do usefully accurate timekeeping
using the inverted math required to keep a running tally
using the timeval deficit.  So they implemented it: it was
more useful than the historical 

Re: kqueue microbenchmark results

2000-10-26 Thread Jonathan Lemon

On Thu, Oct 26, 2000 at 02:16:28AM -0700, Gideon Glass wrote:
> Jonathan Lemon wrote:
> > 
> > Also, consider the following scenario for the proposed get_event():
> > 
> >1. packet arrives, queues an event.
> >2. user retrieves event.
> >3. second packet arrives, queues event again.
> >4. user reads() all data.
> > 
> > Now, next time around the loop, we get a notification for an event
> > when there is no data to read.  The application now must be prepared
> > to handle this case (meaning no blocking read() calls can be used).
> > 
> > Also, what happens if the user closes the socket after step 4 above?
> 
> Depends on the implementation.  If the item in the queue is the
> struct file (or whatever an fd indexes to), then the implementation
> can only queue the fd once.  This also avoids the problem with
> closing sockets - close() would naturally do a list_del() or whatever
> on the struct file.
> 
> At least I think it could be implemented this way...

kqueue currently does this; a close() on an fd will remove any pending
events from the queues that they are on which correspond to that fd.
I was trying to point out that it isn't as simple as it would seem at
first glance, as you have to consider issues like this.  Also, if the
implementation allows multiple event types per fd (leading to multiple
queued events per fd) there no longer is a 1:1 mapping to something like
'struct file', and performing a list walk doesn't scale very well.
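
The consequence for the application, sketched (an illustration, not
code from the paper): since a second packet can queue an event whose
data an earlier read() already drained, each notification is only a
hint; the read side must be non-blocking and treat EAGAIN as routine.

    #include <sys/types.h>
    #include <errno.h>
    #include <unistd.h>

    /* 'fd' is assumed to be O_NONBLOCK.  A stale event (steps 1-4
     * in the scenario above) just means read() returns EAGAIN. */
    void on_readable(int fd)
    {
        char buf[4096];
        ssize_t n;

        while ((n = read(fd, buf, sizeof(buf))) > 0) {
            /* ... consume n bytes ... */
        }
        if (n < 0 && errno != EAGAIN && errno != EWOULDBLOCK) {
            /* real error: handle or close */
        }
    }
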
--
Jonathan



Re: kqueue microbenchmark results

2000-10-26 Thread Gideon Glass

Jonathan Lemon wrote:
> 
> Also, consider the following scenario for the proposed get_event():
> 
>1. packet arrives, queues an event.
>2. user retrieves event.
>3. second packet arrives, queues event again.
>4. user reads() all data.
> 
> Now, next time around the loop, we get a notification for an event
> when there is no data to read.  The application now must be prepared
> to handle this case (meaning no blocking read() calls can be used).
> 
> Also, what happens if the user closes the socket after step 4 above?

Depends on the implementation.  If the item in the queue is the
struct file (or whatever an fd indexes to), then the implementation
can only queue the fd once.  This also avoids the problem with
closing sockets - close() would naturally do a list_del() or whatever
on the struct file.

At least I think it could be implemented this way...

gid



> 
> The user now receives a notification for a fd which no longer exists,
> or possibly has been reused for another connection.  This may or may
> not make a difference to the application, but it must be prepared to
> handle it anyway.  I believe that Zack Brown ran into this problem with
> one of the webservers he was writing.
> 
> > > You can find my paper at http://people.freebsd.org/~jlemon
> >
> > I'll go and read it now. :)
> 
> The paper talks about some of the issues we have been discussing, as
> well as the design rationale behind kqueue.  I'd be happy to answer
> any questions about the paper.
> --
> Jonathan



RE: kqueue microbenchmark results

2000-10-26 Thread David Schwartz

> * David Schwartz <[EMAIL PROTECTED]> [001025 15:35] wrote:
> >
> > If a programmer does not ever wish to block under any circumstances, it's
> > his obligation to communicate this desire to the implementation. Otherwise,
> > the implementation can block if it doesn't have data or an error available
> > at the instant 'read' is called, regardless of what it may have known or
> > done in the past. It's also just generally good programming practice. There
> > was a time when many operating systems had bugs that caused 'select loop'
> > type applications to hang if they didn't set all their descriptors
> > non-blocking.
>
> Yes, and as you mentioned, it was _bugs_ in the operating system
> that did this.

Right. I can't imagine a way in which this could happen for TCP without a
bug. For other protocols, it's not so far fetched. For UDP, which is defined
as lossy, I could imagine an implementation that changed its mind about
accepting a packet due to memory demands.

> I don't think it's wise to continue speculating on this issue unless
> you can point to a specific document that says that it's OK for
> this type of behavior to happen.

SuS2 says that 'read' behaves like 'recv' with no flags for a socket. SuS2
says that for a socket, "If no messages are available at the socket and
O_NONBLOCK is not set on the socket's file descriptor, recv() blocks until a
message arrives."

> Let's take a look at the FreeBSD manpage for poll:
>
>  POLLIN Data other than high priority data may be read without
> blocking.

At the time you return from poll. This says nothing about any later time.

[snip]
>   #define POLLIN  0x0001    /* There is data to read */
>
> This seems to imply that it is one hell of a bug to block, returning
> an error would be acceptable, but surely not blocking.

This brief comment is not meant to be thorough. In fact, it says nothing
about error conditions and implies that it's wrong to return POLLIN for an
error.

> I know manpages are a poor source for references but you're the one
> putting up a big fight for blocking behavior from poll, perhaps you
> can point out a standard that contradicts the manpages?

When you code to a standard, your code must not fail under any conditions
permitted by the standard. Failing to set your file descriptors non-blocking
when you never want to block depends upon behavior not guaranteed.

Unfortunately, none of the standards provides sufficiently clear statements
about this behavior. In fact, I can't even find any standard that says it's
correct to signal POLLIN when there's an error.

DS




Re: kqueue microbenchmark results

2000-10-26 Thread Terry Lambert

[ ... blocking read after signalling that data is available ... ]

> Yes, and as you mentioned, it was _bugs_ in the operating system
> that did this.

I think it's reasonable for the OS to discard, for example,
connection requests which are not serviced in a reasonable
time window.  Likewise, it's reasonable to consider some
protocol that would allow the sender to repudiate a packet
that it decided it didn't want to send; this would, in
fact, be extremely useful in multicast protocols that signalled
all available servers with a request, and then repudiated the
request after receiving a response, on the theory that the
server was too loaded, or the link too congested, or the
programmer of the repudiated servers was such a bad coder that
the server was too lazy to get off its butt and answer the
request in a reasonable amount of time.

A protocol based on this second approach would actually be
able to solve "the gnutella congestion problem" (quoted, as
I believe it's simply a case of the universe and the laws of
physics voting against gnutella as being a dumb idea, since
it's just a repeat of the original NetWare and LANMan scaling
problems).

The real problem is that the interface is making a potentially
incorrect assumption about the underlying implementation, and
that means that it won't be portable to systems whose underlying
implementations don't satisfy the (undocumented and unwarranted)
assumption.

People whine about WSOCK32 being "gratuitously different" with
regard to resource tracking and implying a shutdown on a socket
close or an application exit, but they forget that that all
came about because the original interface, and the programmers
who used it, assumed a kernel space implementation, and that
the kernel would resource track sockets, as if they were file
descriptors.

I think your Sun example:

>   POLLIN  Data other than high priority data may be read
>           without blocking. For STREAMS, this flag is set in
>           revents even if the message is of zero length.

Implies that a recv or recvfrom is required, and use of a
read after a POLLIN, which can't retrieve high priority data
from a socket, may result in the process blocking.  Well, "duh!",
the read is on the normal data channel, and the POLLIN corresponds
to the high priority channel ...what did you expect, when you
called the wrong system call on a socket?


> I see a trend here, let's try Linux:

Linux also thought it was OK to modify the contents of the
timeval structure before returning it.


Terry Lambert
[EMAIL PROTECTED]
---
Any opinions in this posting are my own and not those of my present
or previous employers.



Re: kqueue microbenchmark results

2000-10-26 Thread Terry Lambert

[ ... blocking read after signalling that data is available ... ]

 Yes, and as you mentioned, it was _bugs_ in the operating system
 that did this.

I think it's reasonable for the OS to discard, for example,
connection requests which are not serviced in a reasonable
time window.  Likewise, it's reasonable to consider some
protocol that would allow the sender to repudiate a packet
that it decided that it didn't want to send; this would, in
fact, be extremely useful in multicast protocols that signalled
all available servers with a request, and then repudiated the
request after receiving a response, on the theory that the
server was too loaded, or the link to congested, or the
programmer of the repudiated servers was such a bad coder that
the server was too lazy to get off its butt and answer the
request in a reasonable amount of time.

A protocol based on this second approach would actually be
able to solve "the gnutella congestion problem" (quoted, as
I believe it's simply a case of the universe and the laws of
physics voting against gnutella as being a dumb idea, since
it's just a repeat of the original NetWare and LANMan scaling
problems).

The real problem is that the interface is making a potentially
incorrect assumption about the underlying implementation, and
that means that it won't be portable to systems whose underlying
implementations don't satify the (undocumented and unwarranted)
assumption.

People whine about WSOCK32 being "gratuitously different" with
regard to resource tracking and implying a shutdown on a socket
close or an application exit, but they forget that that all
came about because the original interface, and the programmers
who used it, assumed a kernel space implementation, and that
the kernel would resource track sockets, as if they were file
descriptors.

I think your Sun example:

  POLLINData other than high priority  data  may  be  read
without blocking. For STREAMS, this flag is set in
revents even if the message is of zero length.

Implies that a recv or recvfrom is required, and use of a
read after a POLLIN, which can't retrieve high priority data
from a socket, may result in the process blocking.  Well, "duh!",
the read is on the normal data channel, and the POLLIN corresponds
to the high priority channel ...what did you expect, when you
called the wrong system call on a socket?


 I see a trend here, let's try Linux:

Linux also thought it was OK to modify the contents of the
timeval structure before returning it.


Terry Lambert
[EMAIL PROTECTED]
---
Any opinions in this posting are my own and not those of my present
or previous employers.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



RE: kqueue microbenchmark results

2000-10-26 Thread David Schwartz

 * David Schwartz [EMAIL PROTECTED] [001025 15:35] wrote:
 
  If a programmer does not ever wish to block under any
 circumstances, it's
  his obligation to communicate this desire to the
 implementation. Otherwise,
  the implementation can block if it doesn't have data or an
 error available
  at the instant 'read' is called, regardless of what it may have known or
  done in the past. It's also just generally good programming
 practice. There
  was a time when many operating systems had bugs that caused
 'select loop'
  type applications to hang if they didn't set all their descriptors
  non-blocking.

 Yes, and as you mentioned, it was _bugs_ in the operating system
 that did this.

Right. I can't imagine a way in which this could happen for TCP without a
bug. For other protocols, it's not so far fetched. For UDP, which is defined
as lossy, I could imagine an implementation that changed its mind about
accepting a packet due to memory demands.

 I don't think it's wise to continue speculating on this issue unless
 you can point to a specific document that says that it's OK for
 this type of behavior to happen.

SuS2 says that 'read' behaves like 'recv' with no flags for a socket. SuS2
says that for a socket, "If no messages are available at the socket and
O_NONBLOCK is not set on the socket's file descriptor, recv() blocks until a
message arrives."

 Let's take a look at the FreeBSD manpage for poll:

  POLLIN Data other than high priority data may be read without
 blocking.

At the time you return from poll. This says nothing about any later time.

[snip]
#define POLLIN  0x0001/* There is data to read */

 This seems to imply that it is one hell of a bug to block, returning
 an error would be acceptable, but surely not blocking.

This brief comment is not meant to be thorough. In fact, it says nothing
about error conditions and implies that it's wrong to return POLLIN for an
error.

 I know manpages are a poor source for references but you're the one
 putting up a big fight for blocking behavior from poll, perhaps you
 can point out a standard that contradicts the manpages?

When you code to a standard, your code must not fail under any conditions
permitted by the standard. Failing to set your file descriptors non-blocking
when you never want to block depends upon behavior not guaranteed.

Unfortunately, none of the standards provides sufficiently clear statements
about this behavior. In fact, I can't even find any standard that says it's
correct to signal POLLIN when there's an error.

DS

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: kqueue microbenchmark results

2000-10-26 Thread Jonathan Lemon

On Thu, Oct 26, 2000 at 02:16:28AM -0700, Gideon Glass wrote:
 Jonathan Lemon wrote:
  
  Also, consider the following scenario for the proposed get_event():
  
 1. packet arrives, queues an event.
 2. user retrieves event.
 3. second packet arrives, queues event again.
 4. user reads() all data.
  
  Now, next time around the loop, we get a notification for an event
  when there is no data to read.  The application now must be prepared
  to handle this case (meaning no blocking read() calls can be used).
  
  Also, what happens if the user closes the socket after step 4 above?
 
 Depends on the implementation.  If the item in the queue is the
 struct file (or whatever an fd indexes to), then the implementation
 can only queue the fd once.  This also avoids the problem with
 closing sockets - close() would naturally do a list_del() or whatever
 on the struct file.
 
 At least I think it could be implemented this way...

kqueue currently does this; a close() on an fd will remove any pending
events from the queues that they are on which correspond to that fd.
I was trying to point out that it isn't as simple as it would seem at
first glance, as you have to consider an issues like this.  Also, if the 
implementation allows multiple event types per fd, (leading to multiple
queued events per fd) there no longer is a 1:1 mapping to something like
'struct file', and performing a list walk doesn't scale very well.
--
Jonathan
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: kqueue microbenchmark results

2000-10-26 Thread Terry Lambert

This is a long posting, with a humble beginning, but it has
a point.  I'm being complete so that no one is left in the
dark, or in any doubt as to what that point is.  That means
rehashing some history.

This posting is not really about select or Linux: it's about
interfaces.  Like cached state, interfaces can often be
harmful.

NB: I really should redirect this to FreeBSD, as well, since
there are people in that camp who haven't learned the lesson,
either, but I'll leave it in -chat, for now.

---

[ ... kqueue discussion ... ]

 Linux also thought it was OK to modify the contents of the
 timeval structure before returning it.

It's been pointed out that I should provide more context
for this statement, before people look at me strangely and
make circling motions with their index fingers around
their ears (or whatever the international sign for "crazy"
is these days).  So I'll start with a brief history.

The context is this: the select API was designed with the
idea that one might wish to do non-I/O related background
processing.  Toward this end, one could have several ways
of using the API:

1)  The (struct timeval *) could be NULL.  This means
"block until a signal or until a condition on
which you are selecting is true"; select is a BSD
interface, and, until BSD 4.x and POSIX signals,
the signal would actually call the handler and
restart the select call, so in effect, this really
meant "block until you longjmp out of a signal
handler or until a condition on which you are
selecting is true".

2)  The (struct timeval *) could point to the address
of a real timeval structure (i.e. not be NULL); in
that case, the result depended on the contents:

a)  If the timeval struct was zero valued, it
meant that the select should poll for one
of the conditions being selected for in
the descriptor set, and return a 0 if no
conditions were true.  The contents of
the bitmaps and timeval struct were left
alone.

b)  If the timeval struct was not zero valued,
it meant that the select should wait until
the time specified had expired since the
system call was first started, or one of
the conditions being selected for was true.
If the timeout expired, then a 0 would be
returned, but if one or more of the conditions
were true, the number of descriptors on which
true conditions existed would be returned.

Wedging so much into a single interface was fraught with peril:
it was undefined as to what would happen if the timeval specified
an interval of 5 seconds, yet there was a persistently rescheduled
alarm every 2 seconds, resulting in a signal handler call that did
_not_ longjmp... would the timer expire after 5 seconds, or would
the timer be considered to have been restarted along with the call?
Implementations that went both ways existed.  Mostly, programmers
used longjmp in signal handlers, and it wasn't a portability issue.

More perilous, the question of what to do with a partially
satisfied request that was interrupted with a timer or signal
handler and longjump (later, siginterrupt(2), and later POSIX
non-restart default behaviour).  This meant that the bitmap of
select events might have been modified already, after the
wakeup, but before the process was rescheduled to run.

Finally, the select manual page specifically reserved the right
to modify the contents of the timeval struct; this was presumably
so that you could either do accurate timekeeping by maintaining
a running tally using the timeval deficit (a lot of math, that),
or, more likely, to deal with the system call restart, and ensure
that signals would not prevent the select from ever exiting in
the case of system call restart.

So this was the select API definition.

---

Being pragmatists, programmers programmed to the behaviour of
the API in actual implementations, rather than to the strict
"letter of the law" laid down by the man page.  This meant
that select was called in loop control constructs, and that
the bitmaps were reinitialized each time through the loop.

It also meant that the timeval struct was not reinitialized,
since that was more work, and no known implementations would
modify it.  Pre-POSIX signals, signal handlers were handled on
a signal stack, as a result of a kernel trampoline outcall,
and that meant that a restarting system call would not impact
the countdown.
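
To make the pragmatics concrete, here is a minimal sketch of
such a loop in C, with both the bitmap and the timeval rebuilt
on every pass, so the code behaves the same whether or not the
implementation modifies them (the fd and handle_input() are
placeholders):

    #include <sys/types.h>
    #include <sys/time.h>
    #include <unistd.h>

    void handle_input(int fd);          /* placeholder */

    void event_loop(int fd)
    {
        for (;;) {
            fd_set rfds;
            struct timeval tv;

            FD_ZERO(&rfds);             /* rebuild the bitmap each pass  */
            FD_SET(fd, &rfds);
            tv.tv_sec = 5;              /* rebuild the timeout each pass */
            tv.tv_usec = 0;

            int n = select(fd + 1, &rfds, NULL, NULL, &tv);
            if (n > 0 && FD_ISSET(fd, &rfds))
                handle_input(fd);
            /* n == 0: timeout expired; n < 0: check errno (e.g. EINTR) */
        }
    }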

---

Linux came along and implemented the letter of the law; the
machines were now sufficiently fast, and the math sufficiently
cheap, that usefully accurate timekeeping was now possible via
the inverted math required to keep a running tally using the
timeval deficit.  So they implemented it: it was
more useful than the historical 

Re: kqueue microbenchmark results

2000-10-26 Thread Alan Cox

> kqueue currently does this; a close() on an fd will remove any pending
> events from the queues that they are on which correspond to that fd.

This seems an odd thing to do. Surely what you need to do is to post a
'close completed' event to the queue. This also makes more sense when you
have a threaded app and another thread may well currently be in say a read
at the time it is closed




Re: kqueue microbenchmark results

2000-10-26 Thread Alfred Perlstein

* Alan Cox [EMAIL PROTECTED] [001026 17:50] wrote:
> > kqueue currently does this; a close() on an fd will remove any pending
> > events from the queues that they are on which correspond to that fd.
>
> This seems an odd thing to do. Surely what you need to do is to post a
> 'close completed' event to the queue. This also makes more sense when you
> have a threaded app and another thread may well currently be in say a read
> at the time it is closed

Kqueue's flexibility could allow this to be implemented; all you
would need to do is add a new filter trigger.  You might need
a _bit_ of hackery to make sure those events aren't removed, or one
could just add the event after clearing all pending events.

Adding a filter to be informed when a specific fd is closed is
certainly an option, but it doesn't make much sense, because that
fd could then be reused quickly by something else...

but anyhow:

The point of this interface is to ask kqueue to report only on the
things you are interested in, not to generate superfluous events that
you wouldn't care about.  You could add such a flag if Linux adopted
this interface, and I'm sure we'd be forced to adopt it, but if you
make kqueue generate info an application won't care about, I don't
think that would be taken back.

-- 
-Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
"I have the heart of a child; I keep it in a jar on my desk."



Re: kqueue microbenchmark results

2000-10-26 Thread Jonathan Lemon

On Fri, Oct 27, 2000 at 01:50:40AM +0100, Alan Cox wrote:
> > kqueue currently does this; a close() on an fd will remove any pending
> > events from the queues that they are on which correspond to that fd.
>
> This seems an odd thing to do. Surely what you need to do is to post a
> 'close completed' event to the queue. This also makes more sense when you
> have a threaded app and another thread may well currently be in say a read
> at the time it is closed

Actually, it makes sense when you think about it.  The `fd' is actually
a capability that the application uses to refer to the open file in the
kernel.  If the app does a close() on the fd, it destroys this naming.

The application then has no capability left which refers to the formerly
open socket, and conversely, the kernel has no capability (name) to notify
the application of a close event.  What can I say, "the fd formerly known
as X" is now gone?  It would be incorrect to say that "fd X was closed",
since X no longer refers to anything, and the application may have reused
that fd for another file.

As for the multi-thread case, this would be a bug; if one thread closes
the descriptor, the other thread is going to get an EBADF when it goes 
to perform the read.
--
Jonathan



Re: kqueue microbenchmark results

2000-10-26 Thread Alan Cox

> the application of a close event.  What can I say, "the fd formerly known
> as X" is now gone?  It would be incorrect to say that "fd X was closed",
> since X no longer refers to anything, and the application may have reused
> that fd for another file.

Which is precisely why you need to know where in the chain of events this
happened. Otherwise if I see

'read on fd 5'
'read on fd 5'

How do I know which read is for which fd in the multithreaded case

> As for the multi-thread case, this would be a bug; if one thread closes
> the descriptor, the other thread is going to get an EBADF when it goes
> to perform the read.

Another thread may already have reused the fd




Re: kqueue microbenchmark results

2000-10-26 Thread Alfred Perlstein

* Alan Cox [EMAIL PROTECTED] [001026 18:33] wrote:
> > the application of a close event.  What can I say, "the fd formerly known
> > as X" is now gone?  It would be incorrect to say that "fd X was closed",
> > since X no longer refers to anything, and the application may have reused
> > that fd for another file.
>
> Which is precisely why you need to know where in the chain of events this
> happened. Otherwise if I see
>
>   'read on fd 5'
>   'read on fd 5'
>
> How do I know which read is for which fd in the multithreaded case

No you don't; you don't see anything with the current code unless
fd 5 is still around.  What you're presenting to Jonathan is an
application threading problem, not something that needs to be
resolved by the OS.

> > As for the multi-thread case, this would be a bug; if one thread closes
> > the descriptor, the other thread is going to get an EBADF when it goes
> > to perform the read.
>
> Another thread may already have reused the fd

This is another example of an application threading problem.

-- 
-Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
"I have the heart of a child; I keep it in a jar on my desk."



Re: kqueue microbenchmark results

2000-10-25 Thread Alfred Perlstein

* David Schwartz <[EMAIL PROTECTED]> [001025 15:35] wrote:
> 
> If a programmer does not ever wish to block under any circumstances, it's
> his obligation to communicate this desire to the implementation. Otherwise,
> the implementation can block if it doesn't have data or an error available
> at the instant 'read' is called, regardless of what it may have known or
> done in the past. It's also just generally good programming practice. There
> was a time when many operating systems had bugs that caused 'select loop'
> type applications to hang if they didn't set all their descriptors
> non-blocking.

Yes, and as you mentioned, it was _bugs_ in the operating system
that did this.

I don't think it's wise to continue speculating on this issue unless
you can point to a specific document that says that it's OK for
this type of behavior to happen.

Let's take a look at the FreeBSD manpage for poll:

     POLLIN         Data other than high priority data may be read
                    without blocking.

ok no one bothers to do *BSD compat anymore (*grumble*),
so, Solaris:

     POLLIN         Data other than high priority data may be read
                    without blocking. For STREAMS, this flag is set in
                    revents even if the message is of zero length.

I see a trend here, let's try Linux:

   #define POLLIN  0x0001/* There is data to read */

This seems to imply that it would be one hell of a bug to block;
returning an error would be acceptable, but surely not blocking.

I know manpages are a poor source for references but you're the one
putting up a big fight for blocking behavior from poll, perhaps you
can point out a standard that contradicts the manpages?
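
For concreteness, a minimal sketch of the usage pattern those
manpage excerpts describe; whether the read() may legally block
is exactly the point in dispute, so the sketch assumes the fd
has already been made non-blocking:

    #include <poll.h>
    #include <unistd.h>

    /* Wait for POLLIN, then read.  With a non-blocking fd the read
       can simply return -1/EAGAIN if the readiness report was stale. */
    ssize_t poll_then_read(int fd, char *buf, size_t len)
    {
        struct pollfd pfd;

        pfd.fd = fd;
        pfd.events = POLLIN;
        if (poll(&pfd, 1, -1) < 0)
            return -1;              /* interrupted or worse */
        if (pfd.revents & POLLIN)
            return read(fd, buf, len);
        return 0;                   /* no data indicated */
    }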

-- 
-Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
"I have the heart of a child; I keep it in a jar on my desk."



Re: kqueue microbenchmark results

2000-10-25 Thread James Lewis Nance

On Wed, Oct 25, 2000 at 11:27:09AM -0400, Simon Kirby wrote:
> On Wed, Oct 25, 2000 at 01:02:46AM -0500, Jonathan Lemon wrote:
> 
> > ends up making the job of the application harder.  A simple example
> > to illustrate the point: what if the application does not choose 
> > to read all the data from an incoming packet?  The app now has to 

> What applications would do better by postponing some of the reading? 
> I can't think of any reason off the top of my head why an application
> wouldn't want to read everything it can.  Doing everything in smaller

I can see this happening if the application does not know how much data
is in the buffer, or if the data is being read into a buffer that does
not have much space left in it.

Jim



RE: kqueue microbenchmark results

2000-10-25 Thread David Schwartz


> On Wed, Oct 25, 2000 at 03:11:37PM -0700, David Schwartz wrote:
> >
> > > Now, next time around the loop, we get a notification for an event
> > > when there is no data to read.  The application now must be prepared
> > > to handle this case (meaning no blocking read() calls can be used).
> > > --
> > > Jonathan
> >
> > If the programmer never wants to block in a read call, he
> > should never do a
> > blocking read anyway. There's no standard that requires
> > readability at time
> > X to imply readability at time X+1.
>
> Quite true on the surface.  But taking that statement at face value
> implies that it is okay for poll() to return POLLIN on a descriptor
> even if there is no data to be read.  I don't think this is the intention.

Never mind what it implies. Just stick to what it says. :)

In my opinion, it's perfectly reasonable for an implementation to show
POLLIN on a call to poll() and then later block in read(). As far as I know
no implementation does this, but no standard prevents an implementation
from, for example, swapping out received TCP to disk if it's not retrieved
and blocking later when you ask for the data until it can get the data back.

I would even argue that it's possible for an implementation to decide that
a connection had errored (for example, due to a timeout) and signal
POLLIN. Then before you call 'read', it gets a packet and decides that the
connection is actually fine and so blocks in 'read'. This wouldn't seem
possible in TCP, but it's possible to imagine protocols where it's sensible
to do. And again, as far as I know, no standard prohibits it.

If a programmer does not ever wish to block under any circumstances, it's
his obligation to communicate this desire to the implementation. Otherwise,
the implementation can block if it doesn't have data or an error available
at the instant 'read' is called, regardless of what it may have known or
done in the past. It's also just generally good programming practice. There
was a time when many operating systems had bugs that caused 'select loop'
type applications to hang if they didn't set all their descriptors
non-blocking.
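
A minimal sketch of the usual way to communicate that desire,
using fcntl() and O_NONBLOCK:

    #include <fcntl.h>

    /* Mark fd non-blocking; returns 0 on success, -1 on error. */
    int set_nonblocking(int fd)
    {
        int flags = fcntl(fd, F_GETFL, 0);

        if (flags < 0)
            return -1;
        return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
    }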

DS




Re: kqueue microbenchmark results

2000-10-25 Thread Jonathan Lemon

On Wed, Oct 25, 2000 at 03:11:37PM -0700, David Schwartz wrote:
> 
> > Now, next time around the loop, we get a notification for an event
> > when there is no data to read.  The application now must be prepared
> > to handle this case (meaning no blocking read() calls can be used).
> > --
> > Jonathan
> 
>   If the programmer never wants to block in a read call, he should never do a
> blocking read anyway. There's no standard that requires readability at time
> X to imply readability at time X+1.

Quite true on the surface.  But taking that statement at face value
implies that it is okay for poll() to return POLLIN on a descriptor
even if there is no data to be read.  I don't think this is the intention.
--
Jonathan



RE: kqueue microbenchmark results

2000-10-25 Thread David Schwartz


> Now, next time around the loop, we get a notification for an event
> when there is no data to read.  The application now must be prepared
> to handle this case (meaning no blocking read() calls can be used).
> --
> Jonathan

If the programmer never wants to block in a read call, he should never do a
blocking read anyway. There's no standard that requires readability at time
X to imply readability at time X+1.

DS




Re: kqueue microbenchmark results

2000-10-25 Thread Jonathan Lemon

On Wed, Oct 25, 2000 at 11:40:28AM -0700, Simon Kirby wrote:
> On Wed, Oct 25, 2000 at 12:23:07PM -0500, Jonathan Lemon wrote:
> 
> > Consider a program which reads from point A, writes to point B.  If
> > the buffer associated with B fills up, then we don't want to continue
> > reading from A.
> > 
> > A/B may be network sockets, pipes, or ptys. 
> 
> Fine, but we can bind the event watching to the device or socket or pipe
> that will clog up, right?  In which case, we'll later get a write event
> (just like with select()), and then once there is some progress you can
> go back to read()ing from the original descriptor.  This is even easier
> than using select() because you don't have to take the descriptor out of
> the read set and put it in the write set temporarily -- it will
> automatically work that way.

Yes, but with the above, you can't use get_event() as your main
dispatching loop to do the read() call any more, since there may
be no notifications pending in the queue.  So you have to expand 
your main loop to include both get_event() as well as walk the 
"these descriptors may have partial data" list.


Also, as Jamie pointed out, with kqueue/select you can do:

kevent/read/write

while with a pure edge-triggered scheme, you either must do:

bind_event/read/.../read == 0/write

Or maintain your own "this descriptor may have data" list.

Also, consider the following scenario for the proposed get_event():

   1. packet arrives, queues an event.
   2. user retrieves event.
   3. second packet arrives, queues event again.
   4. user reads() all data.

Now, next time around the loop, we get a notification for an event
when there is no data to read.  The application now must be prepared
to handle this case (meaning no blocking read() calls can be used).

Also, what happens if the user closes the socket after step 4 above?

The user now receives a notification for a fd which no longer exists,
or possibly has been reused for another connection.  This may or may
not make a difference to the application, but it must be prepared to
handle it anyway.  I believe that Zack Brown ran into this problem with
one of the webservers he was writing.
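
A sketch of the defensive dispatch loop this implies.  get_event()
and its event structure are hypothetical, standing in for the
proposed interface; the point is that every read must tolerate
having nothing to read, and a stale event for a closed fd shows
up only as EBADF, while a reused fd cannot be detected at all:

    #include <unistd.h>
    #include <errno.h>

    struct event { int fd; };           /* invented for illustration */
    int get_event(struct event *ev);    /* the *proposed* call, hypothetical */

    void dispatch_loop(void)
    {
        struct event ev;
        char buf[4096];

        while (get_event(&ev) == 0) {
            ssize_t n = read(ev.fd, buf, sizeof(buf));
            if (n > 0) {
                /* process n bytes ... */
            } else if (n == 0) {
                /* EOF: close and forget the descriptor */
            } else if (errno == EAGAIN) {
                /* stale notification, no data: must be tolerated */
            } else if (errno == EBADF) {
                /* event for an fd we already closed; a *reused* fd
                   would not even be detectable here */
            }
        }
    }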


> > You can find my paper at http://people.freebsd.org/~jlemon
> 
> I'll go and read it now. :)

The paper talks about some of the issues we have been discussing, as
well as the design rationale behind kqueue.  I'd be happy to answer 
any questions about the paper.
--
Jonathan



Re: kqueue microbenchmark results

2000-10-25 Thread Jonathan Lemon

On Wed, Oct 25, 2000 at 09:40:53PM +0200, Jamie Lokier wrote:
> Simon Kirby wrote:
> > And you'd need to take the descriptor out of the read() set in the
> > select() case anyway, so I don't really see what's different.
> 
> The difference is that taking a bit out of select()'s bitmap is
> basically free.  Whereas the equivalent with events is a bind_event()
> system call.

With the caveat that kevent() will take a changelist at the same time
that it returns an eventlist, so while you do incur some kernel processing
to temporarily disable the descriptor, the system call is essentially
free.
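
A minimal sketch of that call shape (real kqueue API, abbreviated):
the changelist and the eventlist travel in a single kevent() call,
so temporarily disabling a descriptor costs no extra system call:

    #include <sys/types.h>
    #include <sys/event.h>
    #include <sys/time.h>

    /* Disable further EVFILT_READ events on 'fd' and collect whatever
       is already pending, all in one kevent() invocation. */
    int disable_and_fetch(int kq, int fd, struct kevent *evs, int nevs)
    {
        struct kevent change;

        EV_SET(&change, fd, EVFILT_READ, EV_DISABLE, 0, 0, NULL);
        return kevent(kq, &change, 1, evs, nevs, NULL);
    }
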
--
Jonathan



Re: kqueue microbenchmark results

2000-10-25 Thread Terry Lambert

> What applications would do better by postponing some of the reading? 
> I can't think of any reason off the top of my head why an application
> wouldn't want to read everything it can.  Doing everything in smaller
> chunks would increase overhead (but maybe reduce latencies very slightly
> -- albeit probably not much when using a get_events()-style interface).

Applications that:

o   Want to limit their memory footprint by limiting the
amount of process VM they consume, and so limit their
buffer size to less than all the data the stacks might
be capable of providing at one time

o   With fixed-size messages, which want to operate on a
message at a time, without restricting the sender to
sending only a single message or whole messages at
one time

o   Want to limit their overall system processing overhead
for irrelevant/stale data (example: one which implements
state delta refresh events, such as "Bolo" or "Netrek")

o   Have to implement "leaky bucket" algorithms, where it
is permissible to drop some data on the floor, and
assume it will be retransmitted later (e.g. ATM or user
space protocols which want to implement QoS guarantees)

o   Need to take advantage of kernel strategies for protection
from denial of service attacks, without having to redo
those strategies themselves (particularly, flood attacks;
this is the same reason inetd supports connection rate
limiting on behalf of the programs it is responsible for
starting)

o   With multiple data channels to which they are listening,
some of which are more important than others (e.g. the
Real Networks streaming media protocols are an example)

o   Want to evaluate the contents of a security negotiation
prior to accepting data that was sent using an expired
certificate or otherwise bogus credentials

There are all sorts of good reasons a programmer would want to
trust the kernel, instead of having to build ring buffers into
each and every program they write to ensure they remember data
which is irrelevant to the processing at hand, or protect their
code against buffer overflows initiated by trusted applications.


> Isn't it probably better to keep the kernel implementation as efficient
> as possible so that the majority of applications which will read (and
> write) all data possible can do it as efficiently as possible?  Queueing
> up the events, even as they are in the form received from the kernel, is
> pretty simple for a userspace program to do, and I think it's the best
> place for it.

Reading, yes.  Writing, no.  The buffers they are filling in
the kernel belong to the kernel, not the application, despite
what Microsoft tells you about WSOCK32 programming.  The WSOCK32
model assumes that the networking is implemented in another user
space process, rather than in the kernel.  People who use the
"async" WSOCK32 interface rarely understand the implications
because they rarely understand how async messages are built
using a Windows data pump, which serializes all requests through
the moral equivalent of a select loop (which is why NT supports
async notification on socket I/O, but other versions of Windows
do not [NB: actually, they could, using an fd=-1 I/O completion
port, but the WSOCK32 programmers were a bit lazy and were also
being told to keep performance under that of NT]).

In any case, it's not just a matter of queueing up kernel events,
it's also a matter of partially instead of completely reacting to
the events, since if an event comes in that says you have 1k of
data, and you only read 128 bytes of it, you will have to requeue,
in LIFO instead of FIFO order, a modified event with 1k-128 bytes,
so the next read completes as expected.  Very gross code, which
must then be duplicated in every user space program, and either
requires a "tail minus one" pointer, or requires doubly linking
the user space event queue.
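
A sketch of what such a user-space queue might look like; all
names are invented.  The partially consumed event is pushed back
at the head (LIFO) with its byte count reduced, which is why the
doubly linked structure is needed:

    #include <stddef.h>

    struct uev {                        /* invented structure */
        struct uev *next, *prev;
        int         fd;
        size_t      bytes;              /* data believed still unread */
    };

    struct uevq { struct uev *head, *tail; };

    /* Push a partially consumed event back at the *head* (LIFO), with
       its byte count reduced, so the next read sees what remains. */
    void uevq_pushback(struct uevq *q, struct uev *ev, size_t consumed)
    {
        ev->bytes -= consumed;
        ev->prev = NULL;
        ev->next = q->head;
        if (q->head)
            q->head->prev = ev;
        else
            q->tail = ev;
        q->head = ev;
    }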


> I know nothing about any other implementations, though, and I'm speaking
> mainly from the experiences I've had with coding daemons using select(). 

Programs which are select-based are usually DFAs (Deterministic
Finite State Automatons), which operate on non-blocking file
descriptors.  This means that I/O is not interleaved, and so is
not necessarily as efficient as it could be, should there ever
occur a time when an I/O completion posts sooner after being
checked than the amount of time it takes to complete 50% of an
event processing cycle (the reasons for this involve queueing
theory algebra, and are easier to explain in terms of the relative
value of positive and negative caches in situations where a cache
miss results in the need to perform a linear traversal).  A lot
of this can be "magically" alleviated using POSIX AIO calls in
the underlying implementation, instead of relying on non-blocking
I/O -- even then, don't expect a better than 50% 

Re: kqueue microbenchmark results

2000-10-25 Thread Jamie Lokier

Simon Kirby wrote:
> And you'd need to take the descriptor out of the read() set in the
> select() case anyway, so I don't really see what's different.

The difference is that taking a bit out of select()'s bitmap is
basically free.  Whereas the equivalent with events is a bind_event()
system call.

-- Jamie



Re: kqueue microbenchmark results

2000-10-25 Thread Simon Kirby

On Wed, Oct 25, 2000 at 12:23:07PM -0500, Jonathan Lemon wrote:

> Consider a program which reads from point A, writes to point B.  If
> the buffer associated with B fills up, then we don't want to continue
> reading from A.
> 
> A/B may be network sockets, pipes, or ptys. 

Fine, but we can bind the event watching to the device or socket or pipe
that will clog up, right?  In which case, we'll later get a write event
(just like with select()), and then once there is some progress you can
go back to read()ing from the original descriptor.  This is even easier
than using select() because you don't have to take the descriptor out of
the read set and put it in the write set temporarily -- it will
automatically work that way.

> Or perhaps you receive a request to use a resource that is currently
> busy.  Does your application want to postpone the request, or read the
> data immediately, even if the request can't be serviced yet?

Assuming this "resource" has a way of waking up the process when it
unclogs, then you can go back and read the remaining data later, which is
what you would want to do anyway.

> My point is that I can easily think of several examples as to where
> this behavior may be beneficial to the application, and I use some of 
> them myself.  You can indeed get the same result by forcing each and
> every application that wants this behavior to implement their own
> tracking mechanism, but this strikes me as error-prone and places an 
> undue burden on the application programmer.

I can see that you could write it this way... I'm just trying to see if
it's really needed. :)

As I wrote in my last email to Jamie, you would need to implement a
tracking mechanism in any case to avoid DoS attacks from clients or a
case where a single client can clog up the reading from any other client. 
And you'd need to take the descriptor out of the read() set in the
select() case anyway, so I don't really see what's different.

> You can find my paper at http://people.freebsd.org/~jlemon

I'll go and read it now. :)

Simon-

[  Stormix Technologies Inc.  ][  NetNation Communications Inc. ]
[   [EMAIL PROTECTED]   ][   [EMAIL PROTECTED]]
[ Opinions expressed are not necessarily those of my employers. ]



Re: kqueue microbenchmark results

2000-10-25 Thread Simon Kirby

On Wed, Oct 25, 2000 at 07:08:48PM +0200, Jamie Lokier wrote:

> Simon Kirby wrote:
> 
> > What applications would do better by postponing some of the reading? 
> > I can't think of any reason off the top of my head why an application
> > wouldn't want to read everything it can.
> 
> Pipelined server.
> 
> 1. Wait for event.
> 2. Read block
> 3. If EAGAIN, goto 1.
> 4. If next request in block is incomplete, goto 2.
> 5. Process next request in block.
> 6. Write response.
> 7. If EAGAIN, wait until output is ready for writing then goto 6.
> 8. Goto 1 or 2, your choice.
>(Here I'd go to 2 if the last read was complete -- it avoids a
>redundant call to poll()).
> 
> If you simply read everything you can at step 2, you'll run out of
> memory the moment someone sends you 10 requests.
> 
> This doesn't happen if you leave unread data in kernel space --
> TCP windows and all that.

Hmm, I don't understand.

What happens at "wait until output is ready for writing then goto 6"?
You mean you would stop the main loop to wait for a single client to
unclog?  Wouldn't you just do this? ->

1. Wait for event (read and write queued).  Event occurs: Incoming
   data available.
2. Read a block.
3. Process block just read: Does it contain a full request?  If not,
   queue, goto 2, munge together.  If no more data, queue beginning
   of request, if any, and goto 1.
4. Walk over available requests in block just read.  Process.
5. Attempt to write response, if any.
6. Attempted write: Did it all get out?  If not, queue waiting
   writable data and goto 1 to wait for a write event.
7. Goto 2.

Assume we got write clogged.  Some loop later:

10. Wait for event (read and write queued).  Event occurs: Write
space available.
11. Write remaining available data.
12. Attempted write: Did it all get out?  If not, queue remaining
writable data and goto 1 to wait for another write event.
13. Goto 2.

(If we're some sort of forwarding daemon and the receiving end
of our forward has just unclogged, we want to read any readable
data we had waiting.  Same with if we're just answering a
request, though, as the send direction could still get clogged.)

What can't you do here?  What's wrong?  Note that the write event will
let you read any remaining queued data.  If you actually stop from going
back to the main loop when you're write clogged, you will pause the
daemon and create an easy DoS problem.  There's no way around needing to
queue writable data at least.
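
A minimal sketch of that write-side queueing (structure and names
invented); leftover bytes stay buffered until the next write event:

    #include <unistd.h>
    #include <errno.h>
    #include <string.h>

    struct conn {                       /* invented structure */
        int    fd;
        char   outbuf[8192];
        size_t outlen;                  /* bytes still waiting to go out */
    };

    /* Try to flush; whatever does not fit stays queued, and the caller
       waits for a write event before trying again.  Returns 1 if still
       clogged, 0 when drained, -1 on real error. */
    int conn_flush(struct conn *c)
    {
        while (c->outlen > 0) {
            ssize_t n = write(c->fd, c->outbuf, c->outlen);
            if (n < 0)
                return errno == EAGAIN ? 1 : -1;
            memmove(c->outbuf, c->outbuf + n, c->outlen - n);
            c->outlen -= (size_t)n;
        }
        return 0;
    }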

This is how I wrote my irc daemon a while back, and it works fine with
select().  I can't see what wouldn't work with edge-triggered events
except perhaps the write() event -- I'm not sure what would be considered
"triggered", perhaps when it goes under a watermark or something.  In any
case, it should all still work assuming get_events() offers the ability
to receive "write space available" events.

You don't have to read all data if you don't want to, assuming you will
get another event later that will unclog the situation (meaning the
obstacle must also trigger an event when it is cleared).

In fact, if you did leave the read queued in a daemon using select()
before, you'd keep looping endlessly taking all CPU and never idle
because there would always be read data available.  You'd have to not
queue the descriptor into the read set and instead stick it in the write
set so that you can sleep waiting for the write set to become available,
effectively ignoring any further events on the read set until the write
unclogs.  This sounds just like what would happen if you only got one
notification (edge triggered) in the first place.

Simon-

[  Stormix Technologies Inc.  ][  NetNation Communications Inc. ]
[   [EMAIL PROTECTED]   ][   [EMAIL PROTECTED]]
[ Opinions expressed are not necessarily those of my employers. ]



Re: kqueue microbenchmark results

2000-10-25 Thread Jonathan Lemon

On Wed, Oct 25, 2000 at 11:27:09AM -0400, Simon Kirby wrote:
> On Wed, Oct 25, 2000 at 01:02:46AM -0500, Jonathan Lemon wrote:
> 
> > Yes, someone pointed me to those today.  I would suggest reading
> > some of the relevant literature before embarking on a design.  My
> > paper discusses some of the issues, and Mogul/Banga make some good
> > points too.
> > 
> > While an 'edge-trigger' design is indeed simpler, I feel that it 
> > ends up making the job of the application harder.  A simple example
> > to illustrate the point: what if the application does not choose 
> > to read all the data from an incoming packet?  The app now has to 
> > implement its own state mechanism to remember that there may be pending
> > data in the buffer, since it will not get another event notification
> > unless another packet arrives.
> 
> What applications would do better by postponing some of the reading? 
> I can't think of any reason off the top of my head why an application
> wouldn't want to read everything it can.  Doing everything in smaller
> chunks would increase overhead (but maybe reduce latencies very slightly
> -- albeit probably not much when using a get_events()-style interface).

Consider a program which reads from point A, writes to point B.  If
the buffer associated with B fills up, then we don't want to continue
reading from A.

A/B may be network sockets, pipes, or ptys. 

Or perhaps you receive a request to use a resource that is currently
busy.  Does your application want to postpone the request, or read the
data immediately, even if the request can't be serviced yet?

My point is that I can easily think of several examples as to where
this behavior may be beneficial to the application, and I use some of 
them myself.  You can indeed get the same result by forcing each and
every application that wants this behavior to implement their own
tracking mechanism, but this strikes me as error-prone and places an 
undue burden on the application programmer.
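
As a sketch of this flow control in select() terms (outbuf_full is
an invented predicate): while B's buffer is full, A leaves the read
set and B enters the write set:

    #include <sys/types.h>
    #include <sys/time.h>
    #include <unistd.h>

    /* Rebuild the fd sets for one pass of the A -> B relay loop. */
    void build_fdsets(int a, int b, int outbuf_full,
                      fd_set *rfds, fd_set *wfds)
    {
        FD_ZERO(rfds);
        FD_ZERO(wfds);
        if (outbuf_full)
            FD_SET(b, wfds);            /* wait for B to drain */
        else
            FD_SET(a, rfds);            /* safe to take more input from A */
    }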


> Isn't it probably better to keep the kernel implementation as efficient
> as possible so that the majority of applications which will read (and
> write) all data possible can do it as efficiently as possible?  Queueing
> up the events, even as they are in the form received from the kernel, is
> pretty simple for a userspace program to do, and I think it's the best
> place for it.

I don't believe that you must sacrifice efficiency to achieve this
goal; I think that you can provide both forms in an efficient fashion.


> I know nothing about any other implementations, though, and I'm speaking
> mainly from the experiences I've had with coding daemons using select(). 
> You mention you wrote a paper discussing this issue...Where could I find
> this?

I'm also speaking from experience, from using various forms of 
event notification.  kqueue() is actually a 3rd generation system,
building off the experience I had with the first two, along with other
input.

You can find my paper at http://people.freebsd.org/~jlemon
--
Jonathan



Re: kqueue microbenchmark results

2000-10-25 Thread Jamie Lokier

Simon Kirby wrote:
> > While an 'edge-trigger' design is indeed simpler, I feel that it 
> > ends up making the job of the application harder.  A simple example
> > to illustrate the point: what if the application does not choose 
> > to read all the data from an incoming packet?  The app now has to 
> > implement its own state mechanism to remember that there may be pending
> > data in the buffer, since it will not get another event notification
> > unless another packet arrives.
> 
> What applications would do better by postponing some of the reading? 
> I can't think of any reason off the top of my head why an application
> wouldn't want to read everything it can.

Pipelined server.

1. Wait for event.
2. Read block
3. If EAGAIN, goto 1.
4. If next request in block is incomplete, goto 2.
5. Process next request in block.
6. Write response.
7. If EAGAIN, wait until output is ready for writing then goto 6.
8. Goto 1 or 2, your choice.
   (Here I'd go to 2 if the last read was complete -- it avoids a
   redundant call to poll()).

If you simply read everything you can at step 2, you'll run out of
memory the moment someone sends you 10 requests.

This doesn't happen if you leave unread data in kernel space --
TCP windows and all that.
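
A sketch of that loop in C, with wait_for_event(), parse_request(),
and process_request() as hypothetical helpers.  Step 2 reads at most
one buffer's worth, so unread data stays in the kernel and the TCP
window throttles the sender:

    #include <unistd.h>
    #include <errno.h>
    #include <string.h>

    /* Hypothetical helpers, not part of any real API: */
    void   wait_for_event(void);                         /* step 1 */
    size_t parse_request(const char *buf, size_t len);   /* 0 = incomplete */
    void   process_request(const char *req, size_t len); /* steps 5-7 */

    void pipelined_loop(int fd)             /* fd is non-blocking */
    {
        char   buf[4096];
        size_t have = 0;

        for (;;) {
            wait_for_event();                            /* 1 */
            for (;;) {
                /* (an over-long request that fills buf is not
                   handled in this sketch) */
                ssize_t n = read(fd, buf + have,         /* 2: one block */
                                 sizeof(buf) - have);
                if (n < 0 && errno == EAGAIN)
                    break;                               /* 3 */
                if (n <= 0)
                    return;                              /* EOF or error */
                have += (size_t)n;

                size_t used;
                while ((used = parse_request(buf, have)) > 0) {
                    process_request(buf, used);          /* 5-7 */
                    memmove(buf, buf + used, have - used);
                    have -= used;                        /* 4: else read on */
                }
            }
        }
    }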

enjoy,
-- Jamie



Re: kqueue microbenchmark results

2000-10-25 Thread Simon Kirby

On Wed, Oct 25, 2000 at 01:02:46AM -0500, Jonathan Lemon wrote:

> Yes, someone pointed me to those today.  I would suggest reading
> some of the relevant literature before embarking on a design.  My
> paper discusses some of the issues, and Mogul/Banga make some good
> points too.
> 
> While an 'edge-trigger' design is indeed simpler, I feel that it 
> ends up making the job of the application harder.  A simple example
> to illustrate the point: what if the application does not choose 
> to read all the data from an incoming packet?  The app now has to 
> implement its own state mechanism to remember that there may be pending
> data in the buffer, since it will not get another event notification
> unless another packet arrives.

What applications would do better by postponing some of the reading? 
I can't think of any reason off the top of my head why an application
wouldn't want to read everything it can.  Doing everything in smaller
chunks would increase overhead (but maybe reduce latencies very slightly
-- albeit probably not much when using a get_events()-style interface).

Isn't it probably better to keep the kernel implementation as efficient
as possible so that the majority of applications which will read (and
write) all data possible can do it as efficiently as possible?  Queueing
up the events, even as they are in the form received from the kernel, is
pretty simple for a userspace program to do, and I think it's the best
place for it.

I know nothing about any other implementations, though, and I'm speaking
mainly from the experiences I've had with coding daemons using select(). 
You mention you wrote a paper discussing this issue...Where could I find
this?

Simon-

[  Stormix Technologies Inc.  ][  NetNation Communications Inc. ]
[   [EMAIL PROTECTED]   ][   [EMAIL PROTECTED]]
[ Opinions expressed are not necessarily those of my employers. ]
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: kqueue microbenchmark results

2000-10-25 Thread Simon Kirby

On Wed, Oct 25, 2000 at 01:02:46AM -0500, Jonathan Lemon wrote:

 Yes, someone pointed me to those today.  I would suggest reading
 some of the relevant literature before embarking on a design.  My
 paper discusses some of the issues, and Mogul/Banga make some good
 points too.
 
 While an 'edge-trigger' design is indeed simpler, I feel that it 
 ends up making the job of the application harder.  A simple example
 to illustrate the point: what if the application does not choose 
 to read all the data from an incoming packet?  The app now has to 
 implement its own state mechanism to remember that there may be pending
 data in the buffer, since it will not get another event notification
 unless another packet arrives.

What applications would do better by postponing some of the reading? 
I can't think of any reason off the top of my head why an application
wouldn't want to read everything it can.  Doing everything in smaller
chunks would increase overhead (but maybe reduce latencies very slightly
-- albeit probably not much when using a get_events()-style interface).

Isn't it probably better to keep the kernel implementation as efficient
as possible so that the majority of applications which will read (and
write) all data possible can do it as efficiently as possible?  Queueing
up the events, even as they are in the form received from the kernel, is
pretty simple for a userspace program to do, and I think it's the best
place for it.

I know nothing about any other implementations, though, and I'm speaking
mainly from the experiences I've had with coding daemons using select(). 
You mention you wrote a paper discussing this issue...Where could I find
this?

Simon-

[  Stormix Technologies Inc.  ][  NetNation Communications Inc. ]
[   [EMAIL PROTECTED]   ][   [EMAIL PROTECTED]]
[ Opinions expressed are not necessarily those of my employers. ]
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: kqueue microbenchmark results

2000-10-25 Thread Jamie Lokier

Simon Kirby wrote:
  While an 'edge-trigger' design is indeed simpler, I feel that it 
  ends up making the job of the application harder.  A simple example
  to illustrate the point: what if the application does not choose 
  to read all the data from an incoming packet?  The app now has to 
  implement its own state mechanism to remember that there may be pending
  data in the buffer, since it will not get another event notification
  unless another packet arrives.
 
 What applications would do better by postponing some of the reading? 
 I can't think of any reason off the top of my head why an application
 wouldn't want to read everything it can.

Pipelined server.

1. Wait for event.
2. Read block
3. If EAGAIN, goto 1.
4. If next request in block is incomplete, goto 2.
5. Process next request in block.
6. Write response.
7. If EAGAIN, wait until output is ready for writing then goto 6.
8. Goto 1 or 2, your choice.
   (Here I'd go to 2 if the last read was complete -- it avoids a
   redundant call to poll()).

If you simply read everything you can at step 2, you'll run out of
memory the moment someone sends you 10 requests.

This doesn't happen if you leave unread data in kernel space --
TCP windows and all that.

enjoy,
-- Jamie
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: kqueue microbenchmark results

2000-10-25 Thread Jonathan Lemon

On Wed, Oct 25, 2000 at 11:27:09AM -0400, Simon Kirby wrote:
 On Wed, Oct 25, 2000 at 01:02:46AM -0500, Jonathan Lemon wrote:
 
  Yes, someone pointed me to those today.  I would suggest reading
  some of the relevant literature before embarking on a design.  My
  paper discusses some of the issues, and Mogul/Banga make some good
  points too.
  
  While an 'edge-trigger' design is indeed simpler, I feel that it 
  ends up making the job of the application harder.  A simple example
  to illustrate the point: what if the application does not choose 
  to read all the data from an incoming packet?  The app now has to 
  implement its own state mechanism to remember that there may be pending
  data in the buffer, since it will not get another event notification
  unless another packet arrives.
 
 What applications would do better by postponing some of the reading? 
 I can't think of any reason off the top of my head why an application
 wouldn't want to read everything it can.  Doing everything in smaller
 chunks would increase overhead (but maybe reduce latencies very slightly
 -- albeit probably not much when using a get_events()-style interface).

Consider a program which reads from point A, writes to point B.  If
the buffer associated with B fills up, then we don't want to continue
reading from A.

A/B may be network sockets, pipes, or ptys. 

Or perhaps you receive a request to use a resource that is currently
busy.  Does your application want to postpone the request, or read the
data immediately, even if the request can't be serviced yet?

My point is that I can easily think of several examples as to where
this behavior may be beneficial to the application, and I use some of 
them myself.  You can indeed get the same result by forcing each and
every application that wants this behavior to implement their own
tracking mechanism, but this strikes me as error-prone and places an 
undue burden on the application programmer.


 Isn't it probably better to keep the kernel implementation as efficient
 as possible so that the majority of applications which will read (and
 write) all data possible can do it as efficiently as possible?  Queueing
 up the events, even as they are in the form received from the kernel, is
 pretty simple for a userspace program to do, and I think it's the best
 place for it.

I don't believe that you must sacrifice efficiency to achieve this
goal, I think that you can provide both forms in an efficent fashion.


 I know nothing about any other implementations, though, and I'm speaking
 mainly from the experiences I've had with coding daemons using select(). 
 You mention you wrote a paper discussing this issue...Where could I find
 this?

I'm also speaking from experience, from using various forms of 
event notification.  kqueue() is actually a 3rd generation system,
building off the experience I had with the first two, along with other
input.

You can find my paper at http://people.freebsd.org/~jlemon
--
Jonathan
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: kqueue microbenchmark results

2000-10-25 Thread Simon Kirby

On Wed, Oct 25, 2000 at 07:08:48PM +0200, Jamie Lokier wrote:

 Simon Kirby wrote:
 
  What applications would do better by postponing some of the reading? 
  I can't think of any reason off the top of my head why an application
  wouldn't want to read everything it can.
 
 Pipelined server.
 
 1. Wait for event.
 2. Read block
 3. If EAGAIN, goto 1.
 4. If next request in block is incomplete, goto 2.
 5. Process next request in block.
 6. Write response.
 7. If EAGAIN, wait until output is ready for writing then goto 6.
 8. Goto 1 or 2, your choice.
(Here I'd go to 2 if the last read was complete -- it avoids a
redundant call to poll()).
 
 If you simply read everything you can at step 2, you'll run out of
 memory the moment someone sends you 10 requests.
 
 This doesn't happen if you leave unread data in kernel space --
 TCP windows and all that.

Hmm, I don't understand.

What happens at "wait until output is ready for writing then goto 6"?
You mean you would stop the main loop to wait for a single client to
unclog?  Wouldn't you just do this? -

1. Wait for event (read and write queued).  Event occurs: Incoming
   data available.
2. Read a block.
3. Process block just read: Does it contain a full request?  If not,
   queue, goto 2, munge together.  If no more data, queue beginning
   of request, if any, and goto 1.
4. Walk over available requests in block just read.  Process.
5. Attempt to write response, if any.
6. Attempted write: Did it all get out?  If not, queue waiting
   writable data and goto 1 to wait for a write event.
7. Goto 2.

Assume we got write clogged.  Some loop later:

10. Wait for event (read and write queued).  Event occurs: Write
space available.
11. Write remaining available data.
12. Attempted write: Did it all get out?  If not, queue remaining
writable data and goto 1 to wait for another write event.
13. Goto 2.

(If we're some sort of forwarding daemon and the receiving end
of our forward has just unclogged, we want to read any readable
data we had waiting.  Same with if we're just answering a
request, though, as the send direction could still get clogged.)

What can't you do here?  What's wrong?  Note that the write event will
let you read any remaining queued data.  If you actually stop from going
back to the main loop when you're write clogged, you will pause the
daemon and create an easy DoS problem.  There's no way around needing to
queue writable data at least.

This is how I wrote my irc daemon a while back, and it works fine with
select().  I can't see what wouldn't work with edge-triggered events
except perhaps the write() event -- I'm not sure what would be considered
"triggered", perhaps when it goes under a watermark or something.  In any
case, it should all still work assuming get_events() offers the ability
to receive "write space available" events.

You don't have to read all data if you don't want to, assuming you will
get another event later that will unclog the situation (meaning the
obstacle must also trigger an event when it is cleared).

In fact, if you did leave the read queued in a daemon using select()
before, you'd keep looping endlessly taking all CPU and never idle
because there would always be read data available.  You'd have to not
queue the descriptor into the read set and instead stick it in the write
set so that you can sleep waiting for the write set to become available,
effectively ignorning any further events on the read set until the write
unclogs.  This sounds just like what would happen if you only got one
notification (edge triggered) in the first place.

Simon-

[  Stormix Technologies Inc.  ][  NetNation Communications Inc. ]
[   [EMAIL PROTECTED]   ][   [EMAIL PROTECTED]]
[ Opinions expressed are not necessarily those of my employers. ]
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: kqueue microbenchmark results

2000-10-25 Thread Simon Kirby

On Wed, Oct 25, 2000 at 12:23:07PM -0500, Jonathan Lemon wrote:

 Consider a program which reads from point A, writes to point B.  If
 the buffer associated with B fills up, then we don't want to continue
 reading from A.
 
 A/B may be network sockets, pipes, or ptys. 

Fine, but we can bind the event watching to the device or socket or pipe
that will clog up, right?  In which case, we'll later get a write event
(just like with select()), and then once there is some progress you can
go back to read()ing from the original descriptor.  This is even easier
than using select() because you don't have to take the descriptor out of
the read set and put it in the write set temporarily -- it will
automatically work that way.

 Or perhaps you receive a request to use a resource that is currently
 busy.  Does your application want to postpone the request, or read the
 data immediately, even if the request can't be serviced yet?

Assuming this "resource" has a way of waking up the process when it
unclogs, then you can go back and read the remaining data later, which is
what you would want to do anyway.

 My point is that I can easily think of several examples as to where
 this behavior may be beneficial to the application, and I use some of 
 them myself.  You can indeed get the same result by forcing each and
 every application that wants this behavior to implement their own
 tracking mechanism, but this strikes me as error-prone and places an 
 undue burden on the application programmer.

I can see that you could write it this way... I'm just trying to see if
it's really needed. :)

As I wrote in my last email to Jamie, you would need to implement a
tracking mechanism in any case to avoid DoS attacks from clients or a
case where a single client can clog up the reading from any other client. 
And you'd need to take the descriptor out of the read() set in the
select() case anyway, so I don't really see what's different.

 You can find my paper at http://people.freebsd.org/~jlemon

I'll go and read it now. :)

Simon-

[  Stormix Technologies Inc.  ][  NetNation Communications Inc. ]
[   [EMAIL PROTECTED]   ][   [EMAIL PROTECTED]]
[ Opinions expressed are not necessarily those of my employers. ]
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: kqueue microbenchmark results

2000-10-25 Thread Jamie Lokier

Simon Kirby wrote:
 And you'd need to take the descriptor out of the read() set in the
 select() case anyway, so I don't really see what's different.

The difference is that taking a bit out of select()'s bitmap is
basically free.  Whereas the equivalent with events is a bind_event()
system call.

-- Jamie
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: kqueue microbenchmark results

2000-10-25 Thread Terry Lambert

 What applications would do better by postponing some of the reading? 
 I can't think of any reason off the top of my head why an application
 wouldn't want to read everything it can.  Doing everything in smaller
 chunks would increase overhead (but maybe reduce latencies very slightly
 -- albeit probably not much when using a get_events()-style interface).

Applications that:

o   Want to limit their memory footprint by limiting the
amount of process VM they consume, and so limit their
buffer size to less than all the data the stacks might
be capable of providing at one time

o   With fixed-size messages, which want to operate on a
message at a time, without restricting the sender to
sending only a single message or whole messages at
one time

o   Want to limit their overall system processing overhead
for irrelevent/stale data (example: one which implements
state delta referesh events, such as "Bolo" or "Netrek")

o   Have to implement "leaky bucket" algorithms, where it
is permissible to drop some data on the floor, and
assume it will be retransmited later (e.g. ATM or user
space protocols which want to implement QoS guarantees)

o   Need to take advantage of kernel strategies for protection
from denial of service attacks, without having to redo
those strategies themselves (particularly, flood attacks;
this is the same reason inetd supports connection rate
limiting on behalf of the programs it is responsible for
starting)

o   With multiple data channels to which they are listening,
some of which are more important than others (e.g. the
Real Networks streaming media protocols are an example)

o   Want to evealuate the contents of a security negotiation
prior to accepting data that was sent using an expired
certificate or otherwise bogus credentials

There are all sorts of good reasons a programmer would want to
trust the kernel, instead of having to build ring buffers into
each and every program they write to ensure they remember data
which is irrelevent to the processing at hand, or protect their
code against buffer overflows initiated by trusted applications.


 Isn't it probably better to keep the kernel implementation as efficient
 as possible so that the majority of applications which will read (and
 write) all data possible can do it as efficiently as possible?  Queueing
 up the events, even as they are in the form received from the kernel, is
 pretty simple for a userspace program to do, and I think it's the best
 place for it.

Reading, yes.  Writing, no.  The buffers they are filling in
the kernel belong to the kernel, not the application, despite
what Microsoft tells you about WSOCK32 programming.  The WSOCK32
model assumes that the networking is implemented in another user
space process, rather than in the kernel.  People who use the
"async" WSOCK32 interface rarely understand the implications
because they rarely understand how async messages are built
using a Windows data pump, which serializes all requests through
the moral equivalent of a select loop (which is why NT supports
async notification on socket I/O, but other versions of Windows
does not [NB: actually, it could, using an fd=-1 I/O completion
port, but the WSOCK32 programmers were a bit lazy and were also
being told to keep performance under that of NT]).

In any case, it's not just a matter of queueing up kernel events,
it's also a matter of partially instead of completely reacting to
the events, since if an event comes in that says you have 1k of
data, and you only read 128 bytes of it, you will have to requeue,
in LIFO instead of FIFO order, a modified event with 1k-128 bytes,
so the next read completes as expected.  Very gross code, which
must be then duplicated in every iser space program, and either
requires a "tail minus one" pointer, or requires doubly linking
the user space event queue.


 I know nothing about any other implementations, though, and I'm speaking
 mainly from the experiences I've had with coding daemons using select(). 

Programs which are select-based are usually DFAs (Deterministic
Finite State Automatons), which operate on non-blocking file
descriptors.  This means that I/O is not interleaved, and so is
not necessarily as efficient as it could be, should there ever
occur a time when an I/O completion posts sooner after being
checked than the amount of time it takes to complete 50% of an
event processing cycle (the reasons for this involve queueing
theory algebra, and are easier to explain in terms of the relative
value of positive and negative caches in situations where a cache
miss results in the need to perform a linear traversal).  A lot
of this can be "magically" alleviated using POSIX AIO calls in
the underlying implementation, instead of relying on non-blocking
I/O -- even then, don't expect a better than 50% 

Re: kqueue microbenchmark results

2000-10-25 Thread Jonathan Lemon

On Wed, Oct 25, 2000 at 09:40:53PM +0200, Jamie Lokier wrote:
 Simon Kirby wrote:
  And you'd need to take the descriptor out of the read() set in the
  select() case anyway, so I don't really see what's different.
 
 The difference is that taking a bit out of select()'s bitmap is
 basically free.  Whereas the equivalent with events is a bind_event()
 system call.

With the caveat that kevent() will take a changelist at the same time
that it returns an eventlist, so while you do incur some kernel processing
to temporarily disable the descriptor, the system call is essentially
free.
--
Jonathan
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: kqueue microbenchmark results

2000-10-25 Thread Jonathan Lemon

On Wed, Oct 25, 2000 at 11:40:28AM -0700, Simon Kirby wrote:
 On Wed, Oct 25, 2000 at 12:23:07PM -0500, Jonathan Lemon wrote:
 
  Consider a program which reads from point A, writes to point B.  If
  the buffer associated with B fills up, then we don't want to continue
  reading from A.
  
  A/B may be network sockets, pipes, or ptys. 
 
 Fine, but we can bind the event watching to the device or socket or pipe
 that will clog up, right?  In which case, we'll later get a write event
 (just like with select()), and then once there is some progress you can
 go back to read()ing from the original descriptor.  This is even easier
 than using select() because you don't have to take the descriptor out of
 the read set and put it in the write set temporarily -- it will
 automatically work that way.

Yes, but with the above, you can't use get_event() as your main
dispatching loop to do the read() call any more, since there may
be no notifications pending in the queue.  So you have to expand 
your main loop to include both get_event() as well as walk the 
"these descriptors may have partial data" list.


Also, as Jamie pointed out, with kqueue/select you can do:

kevent/read/write

while with a pure edge-triggered scheme, you must either do:

bind_event/read/.../read == 0/write

Or maintain your own "this descriptor may have data" list.
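
A sketch of that last option; get_event() stands in for the proposed
interface, and every helper here is invented for illustration:

    /* Invented helpers: */
    int  get_event_wait(void);   /* block until some fd has an event */
    int  get_event_poll(void);   /* non-blocking check; -1 if none */
    int  ready_list_empty(void);
    int  ready_list_pop(void);   /* -1 when the list is empty */
    void ready_list_push(int fd);
    int  consume_some(int fd);   /* returns bytes left unread */

    /* Main loop for a pure edge-triggered scheme: a descriptor that
     * is not fully drained must be parked on an application-managed
     * list, since no new notification comes until new data does. */
    void
    event_loop(void)
    {
        int fd;

        for (;;) {
            /* Block only when nothing is parked. */
            fd = ready_list_empty() ? get_event_wait() : get_event_poll();
            if (fd >= 0 && consume_some(fd) > 0)
                ready_list_push(fd);

            /* Service one parked descriptor per pass. */
            if ((fd = ready_list_pop()) >= 0 && consume_some(fd) > 0)
                ready_list_push(fd);
        }
    }
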

Also, consider the following scenario for the proposed get_event():

   1. packet arrives, queues an event.
   2. user retrieves event.
   3. second packet arrives, queues event again.
   4. user reads() all data.

Now, next time around the loop, we get a notification for an event
when there is no data to read.  The application now must be prepared
to handle this case (meaning no blocking read() calls can be used).
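
For illustration, the usual defense, assuming the descriptor has been
set non-blocking beforehand:

    #include <sys/types.h>
    #include <errno.h>
    #include <unistd.h>

    /* fd is assumed to have O_NONBLOCK set, so a stale notification
     * costs one read() returning -1/EAGAIN instead of a hung thread.
     * (A real program would also treat n == 0 as EOF.) */
    void
    handle_read_event(int fd)
    {
        char buf[4096];
        ssize_t n;

        while ((n = read(fd, buf, sizeof(buf))) > 0)
            ;   /* ... process buf ... */
        if (n < 0 && errno == EAGAIN)
            return;   /* spurious event: data consumed on an earlier pass */
    }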

Also, what happens if the user closes the socket after step 4 above?

The user now receives a notification for an fd which no longer exists,
or possibly has been reused for another connection.  This may or may
not make a difference to the application, but it must be prepared to
handle it anyway.  I believe that Zack Brown ran into this problem with
one of the webservers he was writing.


  You can find my paper at http://people.freebsd.org/~jlemon
 
 I'll go and read it now. :)

The paper talks about some of the issues we have been discussing, as
well as the design rationale behind kqueue.  I'd be happy to answer 
any questions about the paper.
--
Jonathan
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



RE: kqueue microbenchmark results

2000-10-25 Thread David Schwartz


> Now, next time around the loop, we get a notification for an event
> when there is no data to read.  The application now must be prepared
> to handle this case (meaning no blocking read() calls can be used).
> --
> Jonathan

If the programmer never wants to block in a read call, he should never do a
blocking read anyway. There's no standard that requires readability at time
X to imply readability at time X+1.

DS

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



RE: kqueue microbenchmark results

2000-10-25 Thread David Schwartz


> On Wed, Oct 25, 2000 at 03:11:37PM -0700, David Schwartz wrote:
> 
> > > Now, next time around the loop, we get a notification for an event
> > > when there is no data to read.  The application now must be prepared
> > > to handle this case (meaning no blocking read() calls can be used).
> > > --
> > > Jonathan
> > 
> > If the programmer never wants to block in a read call, he should
> > never do a blocking read anyway. There's no standard that requires
> > readability at time X to imply readability at time X+1.
> 
> Quite true on the surface.  But taking that statement at face value
> implies that it is okay for poll() to return POLLIN on a descriptor
> even if there is no data to be read.  I don't think this is the intention.

Never mind what it implies. Just stick to what it says. :)

In my opinion, it's perfectly reasonable for an implementation to show
POLLIN on a call to poll() and then later block in read(). As far as I know
no implementation does this, but no standard prevents an implementation
from, for example, swapping received TCP data out to disk if it's not retrieved
and blocking later when you ask for the data until it can get the data back.

I would even argue that it's possible for an implementation to decide that
a connection had errored (for example, due to a timeout) and signal
POLLIN. Then before you call 'read', it gets a packet and decides that the
connection is actually fine and so blocks in 'read'. This wouldn't seem
possible in TCP, but it's possible to imagine protocols where it's sensible
to do. And again, as far as I know, no standard prohibits it.

If a programmer does not ever wish to block under any circumstances, it's
his obligation to communicate this desire to the implementation. Otherwise,
the implementation can block if it doesn't have data or an error available
at the instant 'read' is called, regardless of what it may have known or
done in the past. It's also just generally good programming practice. There
was a time when many operating systems had bugs that caused 'select loop'
type applications to hang if they didn't set all their descriptors
non-blocking.
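
Communicating that desire is a two-call fcntl() idiom, for example:

    #include <fcntl.h>

    /* Tell the implementation this descriptor must never block. */
    int
    set_nonblocking(int fd)
    {
        int flags = fcntl(fd, F_GETFL, 0);

        if (flags == -1)
            return -1;
        return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
    }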

DS

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: kqueue microbenchmark results

2000-10-25 Thread James Lewis Nance

On Wed, Oct 25, 2000 at 11:27:09AM -0400, Simon Kirby wrote:
> On Wed, Oct 25, 2000 at 01:02:46AM -0500, Jonathan Lemon wrote:
> 
> > ends up making the job of the application harder.  A simple example
> > to illustrate the point: what if the application does not choose
> > to read all the data from an incoming packet?  The app now has to
> 
> What applications would do better by postponing some of the reading?
> I can't think of any reason off the top of my head why an application
> wouldn't want to read everything it can.  Doing everything in smaller

I can see this happening if the application does not know how much data
is in the buffer, or if the data is being read into a buffer that does
not have much space left in it.

Jim
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: kqueue microbenchmark results

2000-10-25 Thread Alfred Perlstein

* David Schwartz [EMAIL PROTECTED] [001025 15:35] wrote:
 
> If a programmer does not ever wish to block under any circumstances, it's
> his obligation to communicate this desire to the implementation. Otherwise,
> the implementation can block if it doesn't have data or an error available
> at the instant 'read' is called, regardless of what it may have known or
> done in the past. It's also just generally good programming practice. There
> was a time when many operating systems had bugs that caused 'select loop'
> type applications to hang if they didn't set all their descriptors
> non-blocking.

Yes, and as you mentioned, it was _bugs_ in the operating system
that did this.

I don't think it's wise to continue speculating on this issue unless
you can point to a specific document that says that it's OK for
this type of behavior to happen.

Let's take a look at the FreeBSD manpage for poll:

     POLLIN         Data other than high priority data may be read
                    without blocking.

ok no one bothers to do *BSD compat anymore (*grumble*),
so, Solaris:

     POLLIN         Data other than high priority  data  may  be  read
                    without blocking. For STREAMS, this flag is set in
                    revents even if the message is of zero length.

I see a trend here, let's try Linux:

   #define POLLIN  0x0001/* There is data to read */

This seems to imply that it is one hell of a bug to block; returning
an error would be acceptable, but surely not blocking.

I know manpages are a poor source for references but you're the one
putting up a big fight for blocking behavior from poll, perhaps you
can point out a standard that contradicts the manpages?
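
For reference, a minimal sketch of the contract the manpages describe:

    #include <poll.h>
    #include <unistd.h>

    /* The disputed guarantee: when revents contains POLLIN, the
     * read() that follows is expected not to block. */
    void
    wait_and_read(int fd)
    {
        struct pollfd pfd;
        char buf[4096];

        pfd.fd = fd;
        pfd.events = POLLIN;
        if (poll(&pfd, 1, -1) > 0 && (pfd.revents & POLLIN))
            (void) read(fd, buf, sizeof(buf));
    }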

-- 
-Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
"I have the heart of a child; I keep it in a jar on my desk."
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: kqueue microbenchmark results

2000-10-24 Thread Jonathan Lemon

On Tue, Oct 24, 2000 at 09:45:14PM -0700, Dan Kegel wrote:
> If you haven't already, you might peek at the discussion on
> linux-kernel.  Linus seems to be on the verge of adding
> something like kqueue() to Linux, but appears opposed to
> supporting level-triggering; he likes the simplicity of
> edge triggering (from the kernel's point of view!).  See
> http://boudicca.tux.org/hypermail/linux-kernel/2000week44/index.html#9

Yes, someone pointed me to those today.  I would suggest reading
some of the relevant literature before embarking on a design.  My
paper discusses some of the issues, and Mogul/Banga make some good
points too.

While an 'edge-trigger' design is indeed simpler, I feel that it 
ends up making the job of the application harder.  A simple example
to illustrate the point: what if the application does not choose 
to read all the data from an incoming packet?  The app now has to 
implement its own state mechanism to remember that there may be pending
data in the buffer, since it will not get another event notification
unless another packet arrives.

kqueue() provides the ability for the user to choose which model suits
their needs better, in keeping with the unix philosophy of tools, not
policies.
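
The choice is a single flag at registration time; a sketch against the
kqueue interface, with kq assumed to be an existing kqueue descriptor:

    #include <sys/types.h>
    #include <sys/event.h>
    #include <sys/time.h>

    /* Register fd for read events.  Level-triggered (the default)
     * keeps reporting the event while unread data remains; with
     * EV_CLEAR the state resets after each delivery, so a new event
     * fires only when new data arrives (edge-triggered). */
    int
    watch_read(int kq, int fd, int edge_triggered)
    {
        struct kevent kev;

        EV_SET(&kev, fd, EVFILT_READ,
               EV_ADD | (edge_triggered ? EV_CLEAR : 0), 0, 0, NULL);
        return kevent(kq, &kev, 1, NULL, 0, NULL);
    }
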
--
Jonathan
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: kqueue microbenchmark results

2000-10-24 Thread Dan Kegel

Jonathan,
Thanks for running that test for me!  I've added your results
(plus a cautionary note about microbenchmarks and a link to
your site) to http://www.kegel.com/dkftpbench/Poller_bench.html

If you haven't already, you might peek at the discussion on
linux-kernel.  Linus seems to be on the verge of adding
something like kqueue() to Linux, but appears opposed to
supporting level-triggering; he likes the simplicity of
edge triggering (from the kernel's point of view!).  See
http://boudicca.tux.org/hypermail/linux-kernel/2000week44/index.html#9

Thanks,
Dan

Jonathan Lemon wrote:
> I recently stumbled across a message you posted asking for
> microbenchmarks on kqueue.  While I do think that microbenchmarks
> are partially misleading, I did run them on my machine for
> various numbers of connections, with varying number of active
> connections.  The results are shown below.
> 
> The results dovetail with what I expect: kqueue scales depending
> on the number of active connections that it sees, not with the
> total number of connections.
> 
> Also, I presented a paper/talk at the recent BSDCon 2000, these
> are available at http://www.freebsd.org/~jlemon if you're interested.
> --
> Jonathan
> 
> This is on a single processor 600MHz Pentium-III with 512MB of
> memory, running FreeBSD 4.x-STABLE:

[ 1 active pipe ]

> cache[10:13pm]> ./Poller_bench 5 1 spk 100 1000 10000 30000
>  pipes        100     1000    10000    30000
> select         54        -        -        -
>   poll         50      552    11559    35178
> kqueue          8        8        8        8

[ 10 active pipes ]

> cache[10:13pm]> ./Poller_bench 5 10 spk 100 1000 10000 30000
>  pipes        100     1000    10000    30000
> select        100        -        -        -
>   poll         95      571    11697    35499
> kqueue         52       52       55       56

[ 100 active pipes ]

> cache[10:13pm]> ./Poller_bench 5 100 spk 100 1000 10000 30000
>  pipes        100     1000    10000    30000
> select        542        -        -        -
>   poll        528     1091    12440    36530
> kqueue        574      592      623      702
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/


