Re: [patch 2/5] signalfd v2 - signalfd core ...
On Thursday 08 March 2007 18:28, Linus Torvalds wrote: > The sad part is that there really is no reason why the BSD crowd couldn't > have done recvmsg() as an "extended read with per-system call flags", > which would have made things like O_NONBLOCK etc unnecessary, because you > could do it just with MSG_DONTWAIT.. Wait a second here... O_NONBLOCK is not just unnecessary - it's buggy! Try to do nonblocking read from stdin (fd #0) - * setting O_NONBLOCK with fcntl will set it for all other processes which has the same stdin! * trying to reset O_NONBLOCK after the read doesn't help (think kill -9) * duping fd #0 doesn't help because O_NONBLOCK is not per-fd, it's shared just like filepos. I really like that trick with recvmsg + MSG_DONTWAIT instead. -- vda - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/5] signalfd v2 - signalfd core ...
On Thursday 08 March 2007 18:28, Linus Torvalds wrote: The sad part is that there really is no reason why the BSD crowd couldn't have done recvmsg() as an extended read with per-system call flags, which would have made things like O_NONBLOCK etc unnecessary, because you could do it just with MSG_DONTWAIT.. Wait a second here... O_NONBLOCK is not just unnecessary - it's buggy! Try to do nonblocking read from stdin (fd #0) - * setting O_NONBLOCK with fcntl will set it for all other processes which has the same stdin! * trying to reset O_NONBLOCK after the read doesn't help (think kill -9) * duping fd #0 doesn't help because O_NONBLOCK is not per-fd, it's shared just like filepos. I really like that trick with recvmsg + MSG_DONTWAIT instead. -- vda - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/5] signalfd v2 - signalfd core ...
On Thu, 8 Mar 2007, David M. Lloyd wrote: > On Wed, 2007-03-07 at 17:21 -0800, Davide Libenzi wrote: > > int signalfd_dequeue(int fd, siginfo_t *info, long timeo); > > > > The "fd" parameter must ba a signalfd file descriptor. The "info" parameter > > is a pointer to the siginfo that will receive the dequeued signal, and > > "timeo" is a timeout in milliseconds, or -1 for infinite. > > The signalfd_dequeue function returns 0 if successfull. > > Does this support non-blocking mode? It doesn't seem to at my level of > understanding anyway. If I use this with EPOLLET for example, I'd > expect to get a single EPOLLIN when a signal arrives, which would > indicate to me that I must call signalfd_dequeue() in a loop until I get > EAGAIN in order to be sure I've consumed all the outstanding signals so > that the edge-triggered notification can be "re-armed". > > Make sense? Yes, the new version can use the O_NONBLOCK flag and read(2) will return EAGAIN. Will post a newer version later, together with timerfd ... - Davide - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/5] signalfd v2 - signalfd core ...
On 3/8/07, Linus Torvalds <[EMAIL PROTECTED]> wrote: Which is why you introduced a new system call, but that leads to all the problems with the file descriptor no longer being *usable*. Think scripts. It's easy to do reads in perl scripts, and parse the output. In contrast, making perl use a new system call is quite challenging. If you're going to allow reads from the fd, then it makes sense to let processes get the fd by opening a file; if you've got to call siginfo_create() or what have you, the fact that you can do reads isn't terribly useful to perl or what have you. Inotify even makes you get events with read() today; but you have to use system calls to get the fd and to send events (inotify_add_watch, inotify_rm_watch), which is something I've never understood. It would, to me, be very cool if everything that was a stream of bytes really was a file; this is something I've always thought plan 9 got very right. Epoll and inotify could work this way, and I'm sure there's others I'm forgetting. If people are interested in this sort of thing, I'll start working on a patch. Are there any thoughts on where these sorts of pseudodevices should be these days? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/5] signalfd v2 - signalfd core ...
On 3/8/07, Linus Torvalds [EMAIL PROTECTED] wrote: Which is why you introduced a new system call, but that leads to all the problems with the file descriptor no longer being *usable*. Think scripts. It's easy to do reads in perl scripts, and parse the output. In contrast, making perl use a new system call is quite challenging. If you're going to allow reads from the fd, then it makes sense to let processes get the fd by opening a file; if you've got to call siginfo_create() or what have you, the fact that you can do reads isn't terribly useful to perl or what have you. Inotify even makes you get events with read() today; but you have to use system calls to get the fd and to send events (inotify_add_watch, inotify_rm_watch), which is something I've never understood. It would, to me, be very cool if everything that was a stream of bytes really was a file; this is something I've always thought plan 9 got very right. Epoll and inotify could work this way, and I'm sure there's others I'm forgetting. If people are interested in this sort of thing, I'll start working on a patch. Are there any thoughts on where these sorts of pseudodevices should be these days? - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/5] signalfd v2 - signalfd core ...
On Thu, 8 Mar 2007, David M. Lloyd wrote: On Wed, 2007-03-07 at 17:21 -0800, Davide Libenzi wrote: int signalfd_dequeue(int fd, siginfo_t *info, long timeo); The fd parameter must ba a signalfd file descriptor. The info parameter is a pointer to the siginfo that will receive the dequeued signal, and timeo is a timeout in milliseconds, or -1 for infinite. The signalfd_dequeue function returns 0 if successfull. Does this support non-blocking mode? It doesn't seem to at my level of understanding anyway. If I use this with EPOLLET for example, I'd expect to get a single EPOLLIN when a signal arrives, which would indicate to me that I must call signalfd_dequeue() in a loop until I get EAGAIN in order to be sure I've consumed all the outstanding signals so that the edge-triggered notification can be re-armed. Make sense? Yes, the new version can use the O_NONBLOCK flag and read(2) will return EAGAIN. Will post a newer version later, together with timerfd ... - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/5] signalfd v2 - signalfd core ...
On Fri, 9 Mar 2007, Oleg Nesterov wrote: > Ugh. Still can't understand, probably I missed something or misread this > patch. > > If we shift signalfd_notify() from > specific_send_sig_info/__group_send_sig_info > to send_signal(), we have the same "list_empty()" fastpath if no signalfds are > attached to the sighand. The difference is that we don't count sig_ignored() > signals, which looks right to me. Now I shifted the check into send_signal(), since signalfd now uses the standard dequeue_signal, and hence compete with standard signal delivery against the queue. This was initial Linus design. - Davide - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/5] signalfd v2 - signalfd core ...
On 03/08, Davide Libenzi wrote: > > On Fri, 9 Mar 2007, Oleg Nesterov wrote: > > > > Logic is, if it's not an RT signal, queue only one, otherwise multiple. > > > The bit on the ->pending mask is clealer only when the queue slot becomes > > > empty. > > > > Yes, I see what the code does, but I don't undestand why. For example, > > SIGCHLD was > > delivered to the process _and_ handled several times, then > > sys_signalfd_dequeue() > > comes and finds only one siginfo. Isn't this strange? > > That's the same logic the kernel folows for non-RT signals. Yes, specific_send_sig_info/__group_send_sig_info "ignores" the non-RT signal - if we already have a pending one (blocked, or not yet handled) - the only reason is backward compatibility signalfd: - if the same signr was not fetched via sys_signalfd_dequeue() - the reason is Suppose we are doing a simple application which just logs all signals which were sent to this process. Now, we can miss the signal if the logging blocks. Think about sub-threads. We can send the same non-RT signal to T1, T2, T3 via specific_send_sig_info(). All 3 signals will be delivered, but signalfd (which is process wide) will see only the first. > > > The two trasports can rely on different masks. The signalfd_notify() does > > > not even go in signalfd_deliver() if no signalfds are attached to the > > > sighand. > > > > Sorry, I don't understand. The masks are different, yes, but ->sighand is > > the > > same? How this can make any difference for "if no signalfds are attached" ? > > The list_empty() che would not make you fall inside signalfd_deliver(), > hence the fast path really lives up to its name ;) Ugh. Still can't understand, probably I missed something or misread this patch. If we shift signalfd_notify() from specific_send_sig_info/__group_send_sig_info to send_signal(), we have the same "list_empty()" fastpath if no signalfds are attached to the sighand. The difference is that we don't count sig_ignored() signals, which looks right to me. Oleg. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/5] signalfd v2 - signalfd core ...
On Thu, 8 Mar 2007, Jeremy Fitzhardinge wrote: > > The difficulty is that there are 4 different formats of signal structure > you could get: (traditional|siginfo) x (32bit|64bit). See my suggestion of a fixed-format (and much cleaned up) pseudo-siginfo thing earlier in this thread, and also actually mentioned as a "good to do" in my original email from 2003 that did the original patch ;) > (Hey, can you send signals by writing into the signalfd? Very plan9...) It's an obvious extension. Whether there is any real point to it or not, I dunno. After all, you could just send the signal instead. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/5] signalfd v2 - signalfd core ...
Linus Torvalds wrote: > So I think you should get rid of signalfd_dequeue(), and just replace it > with a "read()" function. > The difficulty is that there are 4 different formats of signal structure you could get: (traditional|siginfo) x (32bit|64bit). What happens if you're a 32 bit process, you fork and exec a 64bit process who inherits the signalfd, and they start reading? What format of signal structure do they get? What do you get? What if you start doing partial reads? I think signalfd_dequeue() is warty, but read() has has a number of details to sort out. (Hey, can you send signals by writing into the signalfd? Very plan9...) J - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/5] signalfd v2 - signalfd core ...
On Fri, 9 Mar 2007, Oleg Nesterov wrote: > > Logic is, if it's not an RT signal, queue only one, otherwise multiple. > > The bit on the ->pending mask is clealer only when the queue slot becomes > > empty. > > Yes, I see what the code does, but I don't undestand why. For example, > SIGCHLD was > delivered to the process _and_ handled several times, then > sys_signalfd_dequeue() > comes and finds only one siginfo. Isn't this strange? That's the same logic the kernel folows for non-RT signals. > > The two trasports can rely on different masks. The signalfd_notify() does > > not even go in signalfd_deliver() if no signalfds are attached to the > > sighand. > > Sorry, I don't understand. The masks are different, yes, but ->sighand is the > same? How this can make any difference for "if no signalfds are attached" ? The list_empty() che would not make you fall inside signalfd_deliver(), hence the fast path really lives up to its name ;) > Also. A malicious user can eat all memory, > signalfd_deliver()->kmem_cache_alloc() > doesn't check any limits. I'll make that use the std dequeu_signal, so everything is handled in there. - Davide - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/5] signalfd v2 - signalfd core ...
On Thu, 8 Mar 2007, Linus Torvalds wrote: > > > On Fri, 9 Mar 2007, Oleg Nesterov wrote: > > > > Also. A malicious user can eat all memory, > > signalfd_deliver()->kmem_cache_alloc() > > doesn't check any limits. > > This, btw, is one reason I *really* think signalfd() should just use the > same old signal queue, and not try to make its own. > > Signal queueing and unqueuing simply isn't that simple. Alrighty. Back to the dequeue_signal ... - Davide - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/5] signalfd v2 - signalfd core ...
On Fri, 9 Mar 2007, Oleg Nesterov wrote: > > Also. A malicious user can eat all memory, > signalfd_deliver()->kmem_cache_alloc() > doesn't check any limits. This, btw, is one reason I *really* think signalfd() should just use the same old signal queue, and not try to make its own. Signal queueing and unqueuing simply isn't that simple. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/5] signalfd v2 - signalfd core ...
On 03/08, Davide Libenzi wrote: > > On Thu, 8 Mar 2007, Oleg Nesterov wrote: > > > Davide Libenzi wrote: > > > > > > +int signalfd_deliver(struct sighand_struct *sighand, int sig, struct > > > siginfo *info) > > > +{ > > > + int nsig = 0; > > > + struct list_head *pos; > > > + struct signalfd_ctx *ctx; > > > + struct signalfd_sq *sq; > > > + > > > + list_for_each(pos, >sfdlist) { > > > + ctx = list_entry(pos, struct signalfd_ctx, lnk); > > > + /* > > > + * We use a negative signal value as a way to broadcast that the > > > + * sighand has been orphaned, so that we can notify all the > > > + * listeners about this. > > > + */ > > > + if (sig < 0) > > > + __wake_up_locked(>wqh, TASK_UNINTERRUPTIBLE | > > > TASK_INTERRUPTIBLE); > > > + else if (sigismember(>sigmask, sig) && > > > + (sig >= SIGRTMIN || !sigismember(>pending, sig))) > > > { > > > + sigaddset(>pending, sig); > > > > I don't understand the "(sig >= SIGRTMIN || !sigismember(>pending, > > sig))" > > check. This mimics the LEGACY_QUEUE() check, but seems strange. The signal > > may > > be pending in ctx->pending just because it was not signalfd_fetchsig()ed, > > yes? > > Logic is, if it's not an RT signal, queue only one, otherwise multiple. > The bit on the ->pending mask is clealer only when the queue slot becomes > empty. Yes, I see what the code does, but I don't undestand why. For example, SIGCHLD was delivered to the process _and_ handled several times, then sys_signalfd_dequeue() comes and finds only one siginfo. Isn't this strange? > > > @@ -780,6 +785,11 @@ > > > BUG_ON(!irqs_disabled()); > > > assert_spin_locked(>sighand->siglock); > > > > > > + /* > > > + * Deliver the signal to listening signalfds ... > > > + */ > > > + signalfd_notify(t->sighand, sig, info); > > > + > > > /* Short-circuit ignored signals. */ > > > if (sig_ignored(t, sig)) > > > goto out; > > > @@ -968,6 +978,11 @@ > > > assert_spin_locked(>sighand->siglock); > > > handle_stop_signal(sig, p); > > > > > > + /* > > > + * Deliver the signal to listening signalfds ... > > > + */ > > > + signalfd_notify(p->sighand, sig, info); > > > + > > > /* Short-circuit ignored signals. */ > > > if (sig_ignored(p, sig)) > > > return ret; > > > > It is strange that we are doing signalfd_notify() even if the signal is > > ignored. > > Isn't it better to shift signalfd_notify() into send_signal() ? This way we > > do > > not need the special check in signalfd_deliver() above. > > The two trasports can rely on different masks. The signalfd_notify() does > not even go in signalfd_deliver() if no signalfds are attached to the > sighand. Sorry, I don't understand. The masks are different, yes, but ->sighand is the same? How this can make any difference for "if no signalfds are attached" ? OK. What is the purpose of signalfd? Should it record the signals which were sent to the process, or only those which were delivered? The latter looks more natural for me. But inn any case, I don't see a reason to check ->pending mask in signalfd_deliver(). BTW, sys_signalfd_dequeue() re-queues signalfd_sq if copy_siginfo_to_user() fails. Why not -EFAULT? Also. A malicious user can eat all memory, signalfd_deliver()->kmem_cache_alloc() doesn't check any limits. > > Also, this patch doesn't take send_sigqueue/send_group_sigqueue into > > account. > > I added that too. but I noticed something strange, dunno if intentional or > not. In send_sigqueue and send_group_sigqueue, the check for the > timer-special and the ignored is inverted. This lead to two different > behaviours. Is there a reason for that? I was wondering about that too. Oleg. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/5] signalfd v2 - signalfd core ...
Linus Torvalds wrote: On Thu, 8 Mar 2007, Davide Libenzi wrote: So, to cut it short, I can do the pseudo-siginfo read(2), but I don't like it too much (little, actually). The siginfo, as bad as it is, is a standard used in many POSIX APIs (hence even in kernel), and IMO if we want to send that back, a struct siginfo should be. No? I think it's perfectly fine if you make it "struct siginfo" (even though I think it's a singularly ugly struct). It's just that then you'd have to make your read() know whether it's a compat-read or not, which you really can't. Which is why you introduced a new system call, but that leads to all the problems with the file descriptor no longer being *usable*. Think scripts. It's easy to do reads in perl scripts, and parse the output. In contrast, making perl use a new system call is quite challenging. Probably, but someone will have to add the 'signalfd' system call anyway. And *that* is why "everything is a stream of bytes" is so important. You don't know where the file descriptor has been, or who uses it. Special system calls for special file descriptors are just *wrong*. After all, that's why we'd have a signalfd() in the first place: exactly so that you do *not* have to use special system calls, but can just pass it on to any event waiting mechanism like select, poll, epoll. The same is just *even*more*true* when it comes to reading the data! The problem with read() returning arbitrary unstructured data is that there is hard to do versioning/extensibility, since the userspace can't specify the requested/expected format. The only way it could be done is by the (nbytes) parameter to read() which is not very nice (and useless for scripts). This is the same problem that makes sysfs/procfs fragile unless the file format is very well specified for extensibility (and it's easy to f-it up, since there seems to be little consistency there... using the XML horror would actually be an improvement). Breaking sysfs/procfs might be acceptable once every few years, but signal handling will be part of every application event loop and there is no room for breaking anything. (although, one could to the versioning the ugly way by creating the new 'signalfd' syscall instead). I'd say: make read() return the signal number (probably as 4-byte int, in network byte order), but for everything else, use the system call. Mark P.S. I'm currently worried if the fact that FUTEX_FD is being deprecated is a problem :) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/5] signalfd v2 - signalfd core ...
On Thu, 8 Mar 2007, Oleg Nesterov wrote: > Davide Libenzi wrote: > > > > +int signalfd_deliver(struct sighand_struct *sighand, int sig, struct > > siginfo *info) > > +{ > > + int nsig = 0; > > + struct list_head *pos; > > + struct signalfd_ctx *ctx; > > + struct signalfd_sq *sq; > > + > > + list_for_each(pos, >sfdlist) { > > + ctx = list_entry(pos, struct signalfd_ctx, lnk); > > + /* > > +* We use a negative signal value as a way to broadcast that the > > +* sighand has been orphaned, so that we can notify all the > > +* listeners about this. > > +*/ > > + if (sig < 0) > > + __wake_up_locked(>wqh, TASK_UNINTERRUPTIBLE | > > TASK_INTERRUPTIBLE); > > + else if (sigismember(>sigmask, sig) && > > +(sig >= SIGRTMIN || !sigismember(>pending, sig))) > > { > > + sigaddset(>pending, sig); > > I don't understand the "(sig >= SIGRTMIN || !sigismember(>pending, sig))" > check. This mimics the LEGACY_QUEUE() check, but seems strange. The signal may > be pending in ctx->pending just because it was not signalfd_fetchsig()ed, yes? Logic is, if it's not an RT signal, queue only one, otherwise multiple. The bit on the ->pending mask is clealer only when the queue slot becomes empty. > Please also see below. > > > +asmlinkage long sys_signalfd(int ufd, sigset_t __user *user_mask, size_t > > sizemask) > > +{ > > > > [...snip...] > > > > + } else { > > + error = -EBADF; > > + file = fget(ufd); > > + if (!file) > > + goto err_exit; > > + ctx = file->private_data; > > + error = -EINVAL; > > + if (file->f_op != _fops) { > > + fput(file); > > + goto err_exit; > > + } > > + spin_lock_irq(>sighand->siglock); > > + ctx->sigmask = sigmask; > > + spin_unlock_irq(>sighand->siglock); > > + wake_up(>wqh); > > Can't this race with sys_signalfd_dequeue() which use lockless > __add_wait_queue()? > Looks like we should do __wake_up_locked() under ctx->sighand->siglock. Yes, good catch. Fixed. > > --- linux-2.6.20.ep2.orig/kernel/signal.c 2007-03-07 15:55:43.0 > > -0800 > > +++ linux-2.6.20.ep2/kernel/signal.c2007-03-07 15:59:01.0 > > -0800 > > > > [...snip...] > > > > @@ -780,6 +785,11 @@ > > BUG_ON(!irqs_disabled()); > > assert_spin_locked(>sighand->siglock); > > > > + /* > > +* Deliver the signal to listening signalfds ... > > +*/ > > + signalfd_notify(t->sighand, sig, info); > > + > > /* Short-circuit ignored signals. */ > > if (sig_ignored(t, sig)) > > goto out; > > @@ -968,6 +978,11 @@ > > assert_spin_locked(>sighand->siglock); > > handle_stop_signal(sig, p); > > > > + /* > > +* Deliver the signal to listening signalfds ... > > +*/ > > + signalfd_notify(p->sighand, sig, info); > > + > > /* Short-circuit ignored signals. */ > > if (sig_ignored(p, sig)) > > return ret; > > It is strange that we are doing signalfd_notify() even if the signal is > ignored. > Isn't it better to shift signalfd_notify() into send_signal() ? This way we do > not need the special check in signalfd_deliver() above. The two trasports can rely on different masks. The signalfd_notify() does not even go in signalfd_deliver() if no signalfds are attached to the sighand. > Also, this patch doesn't take send_sigqueue/send_group_sigqueue into account. I added that too. but I noticed something strange, dunno if intentional or not. In send_sigqueue and send_group_sigqueue, the check for the timer-special and the ignored is inverted. This lead to two different behaviours. Is there a reason for that? - Davide - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/5] signalfd v2 - signalfd core ...
Linus Torvalds wrote: On Thu, 8 Mar 2007, Davide Libenzi wrote: So, to cut it short, I can do the pseudo-siginfo read(2), but I don't like it too much (little, actually). The siginfo, as bad as it is, is a standard used in many POSIX APIs (hence even in kernel), and IMO if we want to send that back, a struct siginfo should be. No? I think it's perfectly fine if you make it "struct siginfo" (even though I think it's a singularly ugly struct). It's just that then you'd have to make your read() know whether it's a compat-read or not, which you really can't. Which is why you introduced a new system call, but that leads to all the problems with the file descriptor no longer being *usable*. Think scripts. It's easy to do reads in perl scripts, and parse the output. In contrast, making perl use a new system call is quite challenging. Probably, but someone will have to add the 'signalfd' system call anyway. And *that* is why "everything is a stream of bytes" is so important. You don't know where the file descriptor has been, or who uses it. Special system calls for special file descriptors are just *wrong*. After all, that's why we'd have a signalfd() in the first place: exactly so that you do *not* have to use special system calls, but can just pass it on to any event waiting mechanism like select, poll, epoll. The same is just *even*more*true* when it comes to reading the data! The problem with read() returning arbitrary unstructured data is that there is hard to do versioning/extensibility, since the userspace can't specify the requested/expected format. The only way it could be done is by the (nbytes) parameter to read() which is not very nice (and useless for scripts). This is the same problem that makes sysfs/procfs fragile unless the file format is very well specified for extensibility (and it's easy to f-it up, since there seems to be little consistency there... using the XML horror would actually be an improvement). Breaking sysfs/procfs might be acceptable once every few years, but signal handling will be part of every application event loop and there is no room for breaking anything. (although, one could to the versioning the ugly way by creating the new 'signalfd' syscall instead). I'd say: make read() return the signal number (probably as 4-byte int, in network byte order), but for everything else, use the system call. Mark P.S. I'm currently worried if the fact that FUTEX_FD is being deprecated is a problem :) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/5] signalfd v2 - signalfd core ...
On 3/8/07, Linus Torvalds <[EMAIL PROTECTED]> wrote: So anybody who would "go with the Berkeley crowd" really shows a lot of bad taste, I'm afraid. The Berkeley crowd really messed up here, and it's so long ago that I don't think there is any point in us trying to fix it any more. Well, they did invent the socket, which sucks less than a lot of other things. You would prefer STREAMS, perhaps? :-) My point is that this is a message-oriented problem, not a stream-oriented problem, and read() just isn't a very good interface for passing structured messages around. I agree that the details of recvmsg() ancillary data are fairly grotty (optimization based on micro-benchmarks, as usual, back in the PDP/VAX days), but the concept isn't that bad; you let netlink sockets into the kernel, didn't you? (But if somebody makes recvmgs a general VFS interface and makes it just work for everything, I'd probably take the patch as a cleanup. I really think it should have been a "struct file_operations" thing rather than being a socket-only thing.. But since you couldn't portably use it anyway, the thing is pretty moot) sendmsg()/recvmsg() to a file makes perfect sense, if you put the equivalent of an llseek tuple into ancillary data. And it's also IMHO the right way to do file AIO -- open up a netlink socket to the entity that's scheduling the IOs for a given filesystem, use the SCM_RIGHTS mechanism to do indirect opens (your credentials could come over an AF_UNIX socket from a completely separate "open server" process), and multiplex all the AIO to that filesystem across the netlink socket. If adding this to VFS is something you would seriously consider, I'll do the work. But I will need some coaching to work around my relative inexperience with the Linux core code and my lack of taste. :-) Cheers, - Michael - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/5] signalfd v2 - signalfd core ...
Davide Libenzi wrote: > > +int signalfd_deliver(struct sighand_struct *sighand, int sig, struct siginfo > *info) > +{ > + int nsig = 0; > + struct list_head *pos; > + struct signalfd_ctx *ctx; > + struct signalfd_sq *sq; > + > + list_for_each(pos, >sfdlist) { > + ctx = list_entry(pos, struct signalfd_ctx, lnk); > + /* > + * We use a negative signal value as a way to broadcast that the > + * sighand has been orphaned, so that we can notify all the > + * listeners about this. > + */ > + if (sig < 0) > + __wake_up_locked(>wqh, TASK_UNINTERRUPTIBLE | > TASK_INTERRUPTIBLE); > + else if (sigismember(>sigmask, sig) && > + (sig >= SIGRTMIN || !sigismember(>pending, sig))) > { > + sigaddset(>pending, sig); I don't understand the "(sig >= SIGRTMIN || !sigismember(>pending, sig))" check. This mimics the LEGACY_QUEUE() check, but seems strange. The signal may be pending in ctx->pending just because it was not signalfd_fetchsig()ed, yes? Please also see below. > +asmlinkage long sys_signalfd(int ufd, sigset_t __user *user_mask, size_t > sizemask) > +{ > > [...snip...] > > + } else { > + error = -EBADF; > + file = fget(ufd); > + if (!file) > + goto err_exit; > + ctx = file->private_data; > + error = -EINVAL; > + if (file->f_op != _fops) { > + fput(file); > + goto err_exit; > + } > + spin_lock_irq(>sighand->siglock); > + ctx->sigmask = sigmask; > + spin_unlock_irq(>sighand->siglock); > + wake_up(>wqh); Can't this race with sys_signalfd_dequeue() which use lockless __add_wait_queue()? Looks like we should do __wake_up_locked() under ctx->sighand->siglock. > --- linux-2.6.20.ep2.orig/kernel/signal.c 2007-03-07 15:55:43.0 > -0800 > +++ linux-2.6.20.ep2/kernel/signal.c 2007-03-07 15:59:01.0 -0800 > > [...snip...] > > @@ -780,6 +785,11 @@ > BUG_ON(!irqs_disabled()); > assert_spin_locked(>sighand->siglock); > > + /* > + * Deliver the signal to listening signalfds ... > + */ > + signalfd_notify(t->sighand, sig, info); > + > /* Short-circuit ignored signals. */ > if (sig_ignored(t, sig)) > goto out; > @@ -968,6 +978,11 @@ > assert_spin_locked(>sighand->siglock); > handle_stop_signal(sig, p); > > + /* > + * Deliver the signal to listening signalfds ... > + */ > + signalfd_notify(p->sighand, sig, info); > + > /* Short-circuit ignored signals. */ > if (sig_ignored(p, sig)) > return ret; It is strange that we are doing signalfd_notify() even if the signal is ignored. Isn't it better to shift signalfd_notify() into send_signal() ? This way we do not need the special check in signalfd_deliver() above. Also, this patch doesn't take send_sigqueue/send_group_sigqueue into account. Oleg. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/5] signalfd v2 - signalfd core ...
On Thu, 8 Mar 2007, Avi Kivity wrote: > Davide Libenzi wrote: > > So, to cut it short, I can do the pseudo-siginfo read(2), but I don't like > > it too much (little, actually). The siginfo, as bad as it is, is a standard > > used in many POSIX APIs (hence even in kernel), and IMO if we want to send > > that back, a struct siginfo should be. > > > > You can send the data in the 32/64 neutral format, and have glibc convert it > to a siginfo, and everybody's happy. Except Uli :) The problem with the siginfo is that it is a brain-dead structure with unions and data dependent on types (and pointers). Pretty fugly. - Davide - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/5] signalfd v2 - signalfd core ...
Davide Libenzi wrote: So, to cut it short, I can do the pseudo-siginfo read(2), but I don't like it too much (little, actually). The siginfo, as bad as it is, is a standard used in many POSIX APIs (hence even in kernel), and IMO if we want to send that back, a struct siginfo should be. You can send the data in the 32/64 neutral format, and have glibc convert it to a siginfo, and everybody's happy. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/5] signalfd v2 - signalfd core ...
On Thu, 8 Mar 2007, Linus Torvalds wrote: > > > On Thu, 8 Mar 2007, Davide Libenzi wrote: > > > > So, to cut it short, I can do the pseudo-siginfo read(2), but I don't > > like it too much (little, actually). The siginfo, as bad as it is, is a > > standard used in many POSIX APIs (hence even in kernel), and IMO if we > > want to send that back, a struct siginfo should be. > > No? > > I think it's perfectly fine if you make it "struct siginfo" (even though I > think it's a singularly ugly struct). It's just that then you'd have to > make your read() know whether it's a compat-read or not, which you really > can't. > > Which is why you introduced a new system call, but that leads to all the > problems with the file descriptor no longer being *usable*. > > Think scripts. It's easy to do reads in perl scripts, and parse the > output. In contrast, making perl use a new system call is quite > challenging. > > And *that* is why "everything is a stream of bytes" is so important. You > don't know where the file descriptor has been, or who uses it. Special > system calls for special file descriptors are just *wrong*. > > After all, that's why we'd have a signalfd() in the first place: exactly > so that you do *not* have to use special system calls, but can just pass > it on to any event waiting mechanism like select, poll, epoll. The same is > just *even*more*true* when it comes to reading the data! "Cheeseburger it is!" ;) I'll revert back to read(2) with pseudo-siginfo and O_NONBLOCK handling... - Davide - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/5] signalfd v2 - signalfd core ...
On Thu, 8 Mar 2007, Davide Libenzi wrote: > > So, to cut it short, I can do the pseudo-siginfo read(2), but I don't > like it too much (little, actually). The siginfo, as bad as it is, is a > standard used in many POSIX APIs (hence even in kernel), and IMO if we > want to send that back, a struct siginfo should be. > No? I think it's perfectly fine if you make it "struct siginfo" (even though I think it's a singularly ugly struct). It's just that then you'd have to make your read() know whether it's a compat-read or not, which you really can't. Which is why you introduced a new system call, but that leads to all the problems with the file descriptor no longer being *usable*. Think scripts. It's easy to do reads in perl scripts, and parse the output. In contrast, making perl use a new system call is quite challenging. And *that* is why "everything is a stream of bytes" is so important. You don't know where the file descriptor has been, or who uses it. Special system calls for special file descriptors are just *wrong*. After all, that's why we'd have a signalfd() in the first place: exactly so that you do *not* have to use special system calls, but can just pass it on to any event waiting mechanism like select, poll, epoll. The same is just *even*more*true* when it comes to reading the data! Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/5] signalfd v2 - signalfd core ...
On Thu, 8 Mar 2007, Linus Torvalds wrote: > > > On Thu, 8 Mar 2007, Davide Libenzi wrote: > > > > The reason for the special function, was not to provide a non-blocking > > behaviour with zero timeout (that just a side effect), but to read the > > siginfo. I was all about using read(2) (and v1 used it), but when you have > > to transfer complex structures over it, it becomes hell. How do you > > cleanly compat over a f_op->read callback for example? > > I agree that it gets a bit "interesting", and one option might be that the > "read()" interface just gets the signal number and the minimal siginfo > information, which is, after all, what 99% of all apps actually care > about. > > But "siginfo_t" is really a *horrible* structure. Nobody sane should ever > use siginfo_t, and the designer of that thing was so high on LSD that it's > not even funny. Re-using fields in a union? Values that depend on other > bits in the thing in random manners? > > In other words, I bet that we could just make it a *lot* better by making > the read structure be: > > - 16 4-byte fields (fixed 64-byte packet), where each field is an >uint32_t (we could even do it in network byte order if we care and if >you want to just pipe the information from one machine to another, but >that sounds a bit excessive ;) > > - Just put the fields people actually use at fixed offsets: si_signo, >si_errno, si_pid, si_uid, si_band, si_fd. > > - that still leaves room for the other cases if anybody ever wants them >(but I doubt it - things like si_addr are really only useful for >synchronous signals that are actually done as *signals*, since you >cannot defer a SIGBUS/SIGSEGV/SIGILL *anyway*). > > So I bet 99% of users actually just want si_signo, while some small subset > might want the SIGCHLD info and some of the special cases (eg we migth > want to add si_addr as a 64-bit thing just because the USB stack sends a > SI_ASYNCIO thing for completed URB's, so a libusb might want it, but > that's probably the only such user). > > And it would be *cleaner* than the mess that is siginfo_t.. > > (I realize that siginfo_t is ugly because it built up over time, using the > same buffer for many different things. I'm just saying that we can > probably do better by *not* using it, and just laying things out in a > cleaner manner to begin with, which also solves any compatibility issues) I can do that, no problem. But isn't it better to recognize that this kind of data just can't be shipped through a non compat-able function? Like, for example, the current trend to say "just use u64 for a pointer, it'll be fine". I remeber, a long time ago when the i386 architecture came out, to say "Wow! 4GB is gonna last *forever*!", let's use u32 for pointers. Well, forever is almost here in my watches. And all the userspace code using APIs assuming to cleanly store a pointer in a u32 will have to be re-factored. So, to cut it short, I can do the pseudo-siginfo read(2), but I don't like it too much (little, actually). The siginfo, as bad as it is, is a standard used in many POSIX APIs (hence even in kernel), and IMO if we want to send that back, a struct siginfo should be. No? - Davide - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/5] signalfd v2 - signalfd core ...
On Thu, 8 Mar 2007, Michael K. Edwards wrote: > > Make it a netlink socket and fetch your structures using recvmsg(). > siginfo_t belongs in ancillary data. Gaah. That interface is horrible. > The UNIX philosophy is "everything's a file". The Berkeley philosophy > is "everything's a socket, except for files, which are feeble > mini-sockets". I'd go with the Berkeley crowd here. No, the berkeley crowd is totally out to lunch. I might agree with you *if* you could actually do "recvmsg()" on arbitrary file descriptors, but you cannot. We could fix that in Linux, of course, but the fact is, "recvmsg()" is *not* a superset of "read()". In general, it's a *subset*, exactly because very few file descriptors support it. So the normal way to read from a file descriptor (and the *only* way in any generic select loop) is to use "read()". That's the only thing that works for everything. And we shouldn't break that. The sad part is that there really is no reason why the BSD crowd couldn't have done recvmsg() as an "extended read with per-system call flags", which would have made things like O_NONBLOCK etc unnecessary, because you could do it just with MSG_DONTWAIT.. So anybody who would "go with the Berkeley crowd" really shows a lot of bad taste, I'm afraid. The Berkeley crowd really messed up here, and it's so long ago that I don't think there is any point in us trying to fix it any more. (But if somebody makes recvmgs a general VFS interface and makes it just work for everything, I'd probably take the patch as a cleanup. I really think it should have been a "struct file_operations" thing rather than being a socket-only thing.. But since you couldn't portably use it anyway, the thing is pretty moot) Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/5] signalfd v2 - signalfd core ...
On Thu, 8 Mar 2007, Davide Libenzi wrote: > > The reason for the special function, was not to provide a non-blocking > behaviour with zero timeout (that just a side effect), but to read the > siginfo. I was all about using read(2) (and v1 used it), but when you have > to transfer complex structures over it, it becomes hell. How do you > cleanly compat over a f_op->read callback for example? I agree that it gets a bit "interesting", and one option might be that the "read()" interface just gets the signal number and the minimal siginfo information, which is, after all, what 99% of all apps actually care about. But "siginfo_t" is really a *horrible* structure. Nobody sane should ever use siginfo_t, and the designer of that thing was so high on LSD that it's not even funny. Re-using fields in a union? Values that depend on other bits in the thing in random manners? In other words, I bet that we could just make it a *lot* better by making the read structure be: - 16 4-byte fields (fixed 64-byte packet), where each field is an uint32_t (we could even do it in network byte order if we care and if you want to just pipe the information from one machine to another, but that sounds a bit excessive ;) - Just put the fields people actually use at fixed offsets: si_signo, si_errno, si_pid, si_uid, si_band, si_fd. - that still leaves room for the other cases if anybody ever wants them (but I doubt it - things like si_addr are really only useful for synchronous signals that are actually done as *signals*, since you cannot defer a SIGBUS/SIGSEGV/SIGILL *anyway*). So I bet 99% of users actually just want si_signo, while some small subset might want the SIGCHLD info and some of the special cases (eg we migth want to add si_addr as a 64-bit thing just because the USB stack sends a SI_ASYNCIO thing for completed URB's, so a libusb might want it, but that's probably the only such user). And it would be *cleaner* than the mess that is siginfo_t.. (I realize that siginfo_t is ugly because it built up over time, using the same buffer for many different things. I'm just saying that we can probably do better by *not* using it, and just laying things out in a cleaner manner to begin with, which also solves any compatibility issues) Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/5] signalfd v2 - signalfd core ...
On 3/8/07, Davide Libenzi wrote: The reason for the special function, was not to provide a non-blocking behaviour with zero timeout (that just a side effect), but to read the siginfo. I was all about using read(2) (and v1 used it), but when you have to transfer complex structures over it, it becomes hell. How do you cleanly compat over a f_op->read callback for example? Make it a netlink socket and fetch your structures using recvmsg(). siginfo_t belongs in ancillary data. The UNIX philosophy is "everything's a file". The Berkeley philosophy is "everything's a socket, except for files, which are feeble mini-sockets". I'd go with the Berkeley crowd here. Cheers, - Michael - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/5] signalfd v2 - signalfd core ...
On Thu, 8 Mar 2007, Linus Torvalds wrote: > You missed David's worry, I think. > > Not only is POLLIN potentially an edge event (depending on the interface > you use to fetch it), but even as a level-triggered one you generally want > to read as much as possible per POLLIN event, and go back to the event > loop only when you get EAGAIN. > > So that's in addition to the read/signal race with other > threads/processes. > > You solved it by having a separate system call, but since it's a regular > file descriptor, why have a new system call at all, and not just make it > be a "read()"? In which case you definitely want O_NONBLOCK support. > > The UNIX philosophy is often quoted as "everything is a file", but that > really means "everything is a stream of bytes". > > In Windows, you have 15 different versions of "read()" with sockets and > files and pipes all having strange special cases and special system calls. > That's not the UNIX way. It should be just a "read()", and then people can > use general libraries and treat all sources the same. > > For example, the main select/poll/epoll loop may be the one doing all the > reading, and then pass off "full buffers only" to the individual per-fd > "action routines". And that kind of model really very fundamentally wants > an fd to be an fd to be an fd - not "some file descriptors need > 'read_from_sigfd()', and some file descriptors need 'read()', and some > file descriptors need 'recvmsg()'" etc. > > So I think you should get rid of signalfd_dequeue(), and just replace it > with a "read()" function. The reason for the special function, was not to provide a non-blocking behaviour with zero timeout (that just a side effect), but to read the siginfo. I was all about using read(2) (and v1 used it), but when you have to transfer complex structures over it, it becomes hell. How do you cleanly compat over a f_op->read callback for example? - Davide - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/5] signalfd v2 - signalfd core ...
On Thu, 8 Mar 2007, Davide Libenzi wrote: > > This patch, if you get a POLLIN, you have a signal to read for sure (well, > unless you another thread/task reads it before you - but that's just > somthing you have to take care). There is not explicit check for > O_NONBLOCK now, but a zero timeout would do exactly the same thing. You missed David's worry, I think. Not only is POLLIN potentially an edge event (depending on the interface you use to fetch it), but even as a level-triggered one you generally want to read as much as possible per POLLIN event, and go back to the event loop only when you get EAGAIN. So that's in addition to the read/signal race with other threads/processes. You solved it by having a separate system call, but since it's a regular file descriptor, why have a new system call at all, and not just make it be a "read()"? In which case you definitely want O_NONBLOCK support. The UNIX philosophy is often quoted as "everything is a file", but that really means "everything is a stream of bytes". In Windows, you have 15 different versions of "read()" with sockets and files and pipes all having strange special cases and special system calls. That's not the UNIX way. It should be just a "read()", and then people can use general libraries and treat all sources the same. For example, the main select/poll/epoll loop may be the one doing all the reading, and then pass off "full buffers only" to the individual per-fd "action routines". And that kind of model really very fundamentally wants an fd to be an fd to be an fd - not "some file descriptors need 'read_from_sigfd()', and some file descriptors need 'read()', and some file descriptors need 'recvmsg()'" etc. So I think you should get rid of signalfd_dequeue(), and just replace it with a "read()" function. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/5] signalfd v2 - signalfd core ...
On Thu, 8 Mar 2007, David M. Lloyd wrote: > On Wed, 2007-03-07 at 17:21 -0800, Davide Libenzi wrote: > > int signalfd_dequeue(int fd, siginfo_t *info, long timeo); > > > > The "fd" parameter must ba a signalfd file descriptor. The "info" parameter > > is a pointer to the siginfo that will receive the dequeued signal, and > > "timeo" is a timeout in milliseconds, or -1 for infinite. > > The signalfd_dequeue function returns 0 if successfull. > > Does this support non-blocking mode? It doesn't seem to at my level of > understanding anyway. If I use this with EPOLLET for example, I'd > expect to get a single EPOLLIN when a signal arrives, which would > indicate to me that I must call signalfd_dequeue() in a loop until I get > EAGAIN in order to be sure I've consumed all the outstanding signals so > that the edge-triggered notification can be "re-armed". This patch, if you get a POLLIN, you have a signal to read for sure (well, unless you another thread/task reads it before you - but that's just somthing you have to take care). There is not explicit check for O_NONBLOCK now, but a zero timeout would do exactly the same thing. - Davide - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/5] signalfd v2 - signalfd core ...
On Wed, 2007-03-07 at 17:21 -0800, Davide Libenzi wrote: > int signalfd_dequeue(int fd, siginfo_t *info, long timeo); > > The "fd" parameter must ba a signalfd file descriptor. The "info" parameter > is a pointer to the siginfo that will receive the dequeued signal, and > "timeo" is a timeout in milliseconds, or -1 for infinite. > The signalfd_dequeue function returns 0 if successfull. Does this support non-blocking mode? It doesn't seem to at my level of understanding anyway. If I use this with EPOLLET for example, I'd expect to get a single EPOLLIN when a signal arrives, which would indicate to me that I must call signalfd_dequeue() in a loop until I get EAGAIN in order to be sure I've consumed all the outstanding signals so that the edge-triggered notification can be "re-armed". Make sense? - DML - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/5] signalfd v2 - signalfd core ...
On Thu, 8 Mar 2007, David M. Lloyd wrote: On Wed, 2007-03-07 at 17:21 -0800, Davide Libenzi wrote: int signalfd_dequeue(int fd, siginfo_t *info, long timeo); The fd parameter must ba a signalfd file descriptor. The info parameter is a pointer to the siginfo that will receive the dequeued signal, and timeo is a timeout in milliseconds, or -1 for infinite. The signalfd_dequeue function returns 0 if successfull. Does this support non-blocking mode? It doesn't seem to at my level of understanding anyway. If I use this with EPOLLET for example, I'd expect to get a single EPOLLIN when a signal arrives, which would indicate to me that I must call signalfd_dequeue() in a loop until I get EAGAIN in order to be sure I've consumed all the outstanding signals so that the edge-triggered notification can be re-armed. This patch, if you get a POLLIN, you have a signal to read for sure (well, unless you another thread/task reads it before you - but that's just somthing you have to take care). There is not explicit check for O_NONBLOCK now, but a zero timeout would do exactly the same thing. - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/5] signalfd v2 - signalfd core ...
On Thu, 8 Mar 2007, Davide Libenzi wrote: This patch, if you get a POLLIN, you have a signal to read for sure (well, unless you another thread/task reads it before you - but that's just somthing you have to take care). There is not explicit check for O_NONBLOCK now, but a zero timeout would do exactly the same thing. You missed David's worry, I think. Not only is POLLIN potentially an edge event (depending on the interface you use to fetch it), but even as a level-triggered one you generally want to read as much as possible per POLLIN event, and go back to the event loop only when you get EAGAIN. So that's in addition to the read/signal race with other threads/processes. You solved it by having a separate system call, but since it's a regular file descriptor, why have a new system call at all, and not just make it be a read()? In which case you definitely want O_NONBLOCK support. The UNIX philosophy is often quoted as everything is a file, but that really means everything is a stream of bytes. In Windows, you have 15 different versions of read() with sockets and files and pipes all having strange special cases and special system calls. That's not the UNIX way. It should be just a read(), and then people can use general libraries and treat all sources the same. For example, the main select/poll/epoll loop may be the one doing all the reading, and then pass off full buffers only to the individual per-fd action routines. And that kind of model really very fundamentally wants an fd to be an fd to be an fd - not some file descriptors need 'read_from_sigfd()', and some file descriptors need 'read()', and some file descriptors need 'recvmsg()' etc. So I think you should get rid of signalfd_dequeue(), and just replace it with a read() function. Linus - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/5] signalfd v2 - signalfd core ...
On Thu, 8 Mar 2007, Linus Torvalds wrote: You missed David's worry, I think. Not only is POLLIN potentially an edge event (depending on the interface you use to fetch it), but even as a level-triggered one you generally want to read as much as possible per POLLIN event, and go back to the event loop only when you get EAGAIN. So that's in addition to the read/signal race with other threads/processes. You solved it by having a separate system call, but since it's a regular file descriptor, why have a new system call at all, and not just make it be a read()? In which case you definitely want O_NONBLOCK support. The UNIX philosophy is often quoted as everything is a file, but that really means everything is a stream of bytes. In Windows, you have 15 different versions of read() with sockets and files and pipes all having strange special cases and special system calls. That's not the UNIX way. It should be just a read(), and then people can use general libraries and treat all sources the same. For example, the main select/poll/epoll loop may be the one doing all the reading, and then pass off full buffers only to the individual per-fd action routines. And that kind of model really very fundamentally wants an fd to be an fd to be an fd - not some file descriptors need 'read_from_sigfd()', and some file descriptors need 'read()', and some file descriptors need 'recvmsg()' etc. So I think you should get rid of signalfd_dequeue(), and just replace it with a read() function. The reason for the special function, was not to provide a non-blocking behaviour with zero timeout (that just a side effect), but to read the siginfo. I was all about using read(2) (and v1 used it), but when you have to transfer complex structures over it, it becomes hell. How do you cleanly compat over a f_op-read callback for example? - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/5] signalfd v2 - signalfd core ...
On 3/8/07, Davide Libenzi davidel@xmailserver.org wrote: The reason for the special function, was not to provide a non-blocking behaviour with zero timeout (that just a side effect), but to read the siginfo. I was all about using read(2) (and v1 used it), but when you have to transfer complex structures over it, it becomes hell. How do you cleanly compat over a f_op-read callback for example? Make it a netlink socket and fetch your structures using recvmsg(). siginfo_t belongs in ancillary data. The UNIX philosophy is everything's a file. The Berkeley philosophy is everything's a socket, except for files, which are feeble mini-sockets. I'd go with the Berkeley crowd here. Cheers, - Michael - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/5] signalfd v2 - signalfd core ...
On Thu, 8 Mar 2007, Davide Libenzi wrote: The reason for the special function, was not to provide a non-blocking behaviour with zero timeout (that just a side effect), but to read the siginfo. I was all about using read(2) (and v1 used it), but when you have to transfer complex structures over it, it becomes hell. How do you cleanly compat over a f_op-read callback for example? I agree that it gets a bit interesting, and one option might be that the read() interface just gets the signal number and the minimal siginfo information, which is, after all, what 99% of all apps actually care about. But siginfo_t is really a *horrible* structure. Nobody sane should ever use siginfo_t, and the designer of that thing was so high on LSD that it's not even funny. Re-using fields in a union? Values that depend on other bits in the thing in random manners? In other words, I bet that we could just make it a *lot* better by making the read structure be: - 16 4-byte fields (fixed 64-byte packet), where each field is an uint32_t (we could even do it in network byte order if we care and if you want to just pipe the information from one machine to another, but that sounds a bit excessive ;) - Just put the fields people actually use at fixed offsets: si_signo, si_errno, si_pid, si_uid, si_band, si_fd. - that still leaves room for the other cases if anybody ever wants them (but I doubt it - things like si_addr are really only useful for synchronous signals that are actually done as *signals*, since you cannot defer a SIGBUS/SIGSEGV/SIGILL *anyway*). So I bet 99% of users actually just want si_signo, while some small subset might want the SIGCHLD info and some of the special cases (eg we migth want to add si_addr as a 64-bit thing just because the USB stack sends a SI_ASYNCIO thing for completed URB's, so a libusb might want it, but that's probably the only such user). And it would be *cleaner* than the mess that is siginfo_t.. (I realize that siginfo_t is ugly because it built up over time, using the same buffer for many different things. I'm just saying that we can probably do better by *not* using it, and just laying things out in a cleaner manner to begin with, which also solves any compatibility issues) Linus - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/5] signalfd v2 - signalfd core ...
On Thu, 8 Mar 2007, Michael K. Edwards wrote: Make it a netlink socket and fetch your structures using recvmsg(). siginfo_t belongs in ancillary data. Gaah. That interface is horrible. The UNIX philosophy is everything's a file. The Berkeley philosophy is everything's a socket, except for files, which are feeble mini-sockets. I'd go with the Berkeley crowd here. No, the berkeley crowd is totally out to lunch. I might agree with you *if* you could actually do recvmsg() on arbitrary file descriptors, but you cannot. We could fix that in Linux, of course, but the fact is, recvmsg() is *not* a superset of read(). In general, it's a *subset*, exactly because very few file descriptors support it. So the normal way to read from a file descriptor (and the *only* way in any generic select loop) is to use read(). That's the only thing that works for everything. And we shouldn't break that. The sad part is that there really is no reason why the BSD crowd couldn't have done recvmsg() as an extended read with per-system call flags, which would have made things like O_NONBLOCK etc unnecessary, because you could do it just with MSG_DONTWAIT.. So anybody who would go with the Berkeley crowd really shows a lot of bad taste, I'm afraid. The Berkeley crowd really messed up here, and it's so long ago that I don't think there is any point in us trying to fix it any more. (But if somebody makes recvmgs a general VFS interface and makes it just work for everything, I'd probably take the patch as a cleanup. I really think it should have been a struct file_operations thing rather than being a socket-only thing.. But since you couldn't portably use it anyway, the thing is pretty moot) Linus - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/5] signalfd v2 - signalfd core ...
On Thu, 8 Mar 2007, Linus Torvalds wrote: On Thu, 8 Mar 2007, Davide Libenzi wrote: The reason for the special function, was not to provide a non-blocking behaviour with zero timeout (that just a side effect), but to read the siginfo. I was all about using read(2) (and v1 used it), but when you have to transfer complex structures over it, it becomes hell. How do you cleanly compat over a f_op-read callback for example? I agree that it gets a bit interesting, and one option might be that the read() interface just gets the signal number and the minimal siginfo information, which is, after all, what 99% of all apps actually care about. But siginfo_t is really a *horrible* structure. Nobody sane should ever use siginfo_t, and the designer of that thing was so high on LSD that it's not even funny. Re-using fields in a union? Values that depend on other bits in the thing in random manners? In other words, I bet that we could just make it a *lot* better by making the read structure be: - 16 4-byte fields (fixed 64-byte packet), where each field is an uint32_t (we could even do it in network byte order if we care and if you want to just pipe the information from one machine to another, but that sounds a bit excessive ;) - Just put the fields people actually use at fixed offsets: si_signo, si_errno, si_pid, si_uid, si_band, si_fd. - that still leaves room for the other cases if anybody ever wants them (but I doubt it - things like si_addr are really only useful for synchronous signals that are actually done as *signals*, since you cannot defer a SIGBUS/SIGSEGV/SIGILL *anyway*). So I bet 99% of users actually just want si_signo, while some small subset might want the SIGCHLD info and some of the special cases (eg we migth want to add si_addr as a 64-bit thing just because the USB stack sends a SI_ASYNCIO thing for completed URB's, so a libusb might want it, but that's probably the only such user). And it would be *cleaner* than the mess that is siginfo_t.. (I realize that siginfo_t is ugly because it built up over time, using the same buffer for many different things. I'm just saying that we can probably do better by *not* using it, and just laying things out in a cleaner manner to begin with, which also solves any compatibility issues) I can do that, no problem. But isn't it better to recognize that this kind of data just can't be shipped through a non compat-able function? Like, for example, the current trend to say just use u64 for a pointer, it'll be fine. I remeber, a long time ago when the i386 architecture came out, to say Wow! 4GB is gonna last *forever*!, let's use u32 for pointers. Well, forever is almost here in my watches. And all the userspace code using APIs assuming to cleanly store a pointer in a u32 will have to be re-factored. So, to cut it short, I can do the pseudo-siginfo read(2), but I don't like it too much (little, actually). The siginfo, as bad as it is, is a standard used in many POSIX APIs (hence even in kernel), and IMO if we want to send that back, a struct siginfo should be. No? - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/5] signalfd v2 - signalfd core ...
On Thu, 8 Mar 2007, Davide Libenzi wrote: So, to cut it short, I can do the pseudo-siginfo read(2), but I don't like it too much (little, actually). The siginfo, as bad as it is, is a standard used in many POSIX APIs (hence even in kernel), and IMO if we want to send that back, a struct siginfo should be. No? I think it's perfectly fine if you make it struct siginfo (even though I think it's a singularly ugly struct). It's just that then you'd have to make your read() know whether it's a compat-read or not, which you really can't. Which is why you introduced a new system call, but that leads to all the problems with the file descriptor no longer being *usable*. Think scripts. It's easy to do reads in perl scripts, and parse the output. In contrast, making perl use a new system call is quite challenging. And *that* is why everything is a stream of bytes is so important. You don't know where the file descriptor has been, or who uses it. Special system calls for special file descriptors are just *wrong*. After all, that's why we'd have a signalfd() in the first place: exactly so that you do *not* have to use special system calls, but can just pass it on to any event waiting mechanism like select, poll, epoll. The same is just *even*more*true* when it comes to reading the data! Linus - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/5] signalfd v2 - signalfd core ...
On Thu, 8 Mar 2007, Linus Torvalds wrote: On Thu, 8 Mar 2007, Davide Libenzi wrote: So, to cut it short, I can do the pseudo-siginfo read(2), but I don't like it too much (little, actually). The siginfo, as bad as it is, is a standard used in many POSIX APIs (hence even in kernel), and IMO if we want to send that back, a struct siginfo should be. No? I think it's perfectly fine if you make it struct siginfo (even though I think it's a singularly ugly struct). It's just that then you'd have to make your read() know whether it's a compat-read or not, which you really can't. Which is why you introduced a new system call, but that leads to all the problems with the file descriptor no longer being *usable*. Think scripts. It's easy to do reads in perl scripts, and parse the output. In contrast, making perl use a new system call is quite challenging. And *that* is why everything is a stream of bytes is so important. You don't know where the file descriptor has been, or who uses it. Special system calls for special file descriptors are just *wrong*. After all, that's why we'd have a signalfd() in the first place: exactly so that you do *not* have to use special system calls, but can just pass it on to any event waiting mechanism like select, poll, epoll. The same is just *even*more*true* when it comes to reading the data! Cheeseburger it is! ;) I'll revert back to read(2) with pseudo-siginfo and O_NONBLOCK handling... - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/5] signalfd v2 - signalfd core ...
Davide Libenzi wrote: So, to cut it short, I can do the pseudo-siginfo read(2), but I don't like it too much (little, actually). The siginfo, as bad as it is, is a standard used in many POSIX APIs (hence even in kernel), and IMO if we want to send that back, a struct siginfo should be. You can send the data in the 32/64 neutral format, and have glibc convert it to a siginfo, and everybody's happy. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/5] signalfd v2 - signalfd core ...
On Thu, 8 Mar 2007, Avi Kivity wrote: Davide Libenzi wrote: So, to cut it short, I can do the pseudo-siginfo read(2), but I don't like it too much (little, actually). The siginfo, as bad as it is, is a standard used in many POSIX APIs (hence even in kernel), and IMO if we want to send that back, a struct siginfo should be. You can send the data in the 32/64 neutral format, and have glibc convert it to a siginfo, and everybody's happy. Except Uli :) The problem with the siginfo is that it is a brain-dead structure with unions and data dependent on types (and pointers). Pretty fugly. - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/5] signalfd v2 - signalfd core ...
Davide Libenzi wrote: +int signalfd_deliver(struct sighand_struct *sighand, int sig, struct siginfo *info) +{ + int nsig = 0; + struct list_head *pos; + struct signalfd_ctx *ctx; + struct signalfd_sq *sq; + + list_for_each(pos, sighand-sfdlist) { + ctx = list_entry(pos, struct signalfd_ctx, lnk); + /* + * We use a negative signal value as a way to broadcast that the + * sighand has been orphaned, so that we can notify all the + * listeners about this. + */ + if (sig 0) + __wake_up_locked(ctx-wqh, TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE); + else if (sigismember(ctx-sigmask, sig) + (sig = SIGRTMIN || !sigismember(ctx-pending, sig))) { + sigaddset(ctx-pending, sig); I don't understand the (sig = SIGRTMIN || !sigismember(ctx-pending, sig)) check. This mimics the LEGACY_QUEUE() check, but seems strange. The signal may be pending in ctx-pending just because it was not signalfd_fetchsig()ed, yes? Please also see below. +asmlinkage long sys_signalfd(int ufd, sigset_t __user *user_mask, size_t sizemask) +{ [...snip...] + } else { + error = -EBADF; + file = fget(ufd); + if (!file) + goto err_exit; + ctx = file-private_data; + error = -EINVAL; + if (file-f_op != signalfd_fops) { + fput(file); + goto err_exit; + } + spin_lock_irq(ctx-sighand-siglock); + ctx-sigmask = sigmask; + spin_unlock_irq(ctx-sighand-siglock); + wake_up(ctx-wqh); Can't this race with sys_signalfd_dequeue() which use lockless __add_wait_queue()? Looks like we should do __wake_up_locked() under ctx-sighand-siglock. --- linux-2.6.20.ep2.orig/kernel/signal.c 2007-03-07 15:55:43.0 -0800 +++ linux-2.6.20.ep2/kernel/signal.c 2007-03-07 15:59:01.0 -0800 [...snip...] @@ -780,6 +785,11 @@ BUG_ON(!irqs_disabled()); assert_spin_locked(t-sighand-siglock); + /* + * Deliver the signal to listening signalfds ... + */ + signalfd_notify(t-sighand, sig, info); + /* Short-circuit ignored signals. */ if (sig_ignored(t, sig)) goto out; @@ -968,6 +978,11 @@ assert_spin_locked(p-sighand-siglock); handle_stop_signal(sig, p); + /* + * Deliver the signal to listening signalfds ... + */ + signalfd_notify(p-sighand, sig, info); + /* Short-circuit ignored signals. */ if (sig_ignored(p, sig)) return ret; It is strange that we are doing signalfd_notify() even if the signal is ignored. Isn't it better to shift signalfd_notify() into send_signal() ? This way we do not need the special check in signalfd_deliver() above. Also, this patch doesn't take send_sigqueue/send_group_sigqueue into account. Oleg. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/5] signalfd v2 - signalfd core ...
On 3/8/07, Linus Torvalds [EMAIL PROTECTED] wrote: So anybody who would go with the Berkeley crowd really shows a lot of bad taste, I'm afraid. The Berkeley crowd really messed up here, and it's so long ago that I don't think there is any point in us trying to fix it any more. Well, they did invent the socket, which sucks less than a lot of other things. You would prefer STREAMS, perhaps? :-) My point is that this is a message-oriented problem, not a stream-oriented problem, and read() just isn't a very good interface for passing structured messages around. I agree that the details of recvmsg() ancillary data are fairly grotty (optimization based on micro-benchmarks, as usual, back in the PDP/VAX days), but the concept isn't that bad; you let netlink sockets into the kernel, didn't you? (But if somebody makes recvmgs a general VFS interface and makes it just work for everything, I'd probably take the patch as a cleanup. I really think it should have been a struct file_operations thing rather than being a socket-only thing.. But since you couldn't portably use it anyway, the thing is pretty moot) sendmsg()/recvmsg() to a file makes perfect sense, if you put the equivalent of an llseek tuple into ancillary data. And it's also IMHO the right way to do file AIO -- open up a netlink socket to the entity that's scheduling the IOs for a given filesystem, use the SCM_RIGHTS mechanism to do indirect opens (your credentials could come over an AF_UNIX socket from a completely separate open server process), and multiplex all the AIO to that filesystem across the netlink socket. If adding this to VFS is something you would seriously consider, I'll do the work. But I will need some coaching to work around my relative inexperience with the Linux core code and my lack of taste. :-) Cheers, - Michael - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/5] signalfd v2 - signalfd core ...
Linus Torvalds wrote: On Thu, 8 Mar 2007, Davide Libenzi wrote: So, to cut it short, I can do the pseudo-siginfo read(2), but I don't like it too much (little, actually). The siginfo, as bad as it is, is a standard used in many POSIX APIs (hence even in kernel), and IMO if we want to send that back, a struct siginfo should be. No? I think it's perfectly fine if you make it struct siginfo (even though I think it's a singularly ugly struct). It's just that then you'd have to make your read() know whether it's a compat-read or not, which you really can't. Which is why you introduced a new system call, but that leads to all the problems with the file descriptor no longer being *usable*. Think scripts. It's easy to do reads in perl scripts, and parse the output. In contrast, making perl use a new system call is quite challenging. Probably, but someone will have to add the 'signalfd' system call anyway. And *that* is why everything is a stream of bytes is so important. You don't know where the file descriptor has been, or who uses it. Special system calls for special file descriptors are just *wrong*. After all, that's why we'd have a signalfd() in the first place: exactly so that you do *not* have to use special system calls, but can just pass it on to any event waiting mechanism like select, poll, epoll. The same is just *even*more*true* when it comes to reading the data! The problem with read() returning arbitrary unstructured data is that there is hard to do versioning/extensibility, since the userspace can't specify the requested/expected format. The only way it could be done is by the (nbytes) parameter to read() which is not very nice (and useless for scripts). This is the same problem that makes sysfs/procfs fragile unless the file format is very well specified for extensibility (and it's easy to f-it up, since there seems to be little consistency there... using the XML horror would actually be an improvement). Breaking sysfs/procfs might be acceptable once every few years, but signal handling will be part of every application event loop and there is no room for breaking anything. (although, one could to the versioning the ugly way by creating the new 'signalfd' syscall instead). I'd say: make read() return the signal number (probably as 4-byte int, in network byte order), but for everything else, use the system call. Mark P.S. I'm currently worried if the fact that FUTEX_FD is being deprecated is a problem :) - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/5] signalfd v2 - signalfd core ...
On Thu, 8 Mar 2007, Oleg Nesterov wrote: Davide Libenzi wrote: +int signalfd_deliver(struct sighand_struct *sighand, int sig, struct siginfo *info) +{ + int nsig = 0; + struct list_head *pos; + struct signalfd_ctx *ctx; + struct signalfd_sq *sq; + + list_for_each(pos, sighand-sfdlist) { + ctx = list_entry(pos, struct signalfd_ctx, lnk); + /* +* We use a negative signal value as a way to broadcast that the +* sighand has been orphaned, so that we can notify all the +* listeners about this. +*/ + if (sig 0) + __wake_up_locked(ctx-wqh, TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE); + else if (sigismember(ctx-sigmask, sig) +(sig = SIGRTMIN || !sigismember(ctx-pending, sig))) { + sigaddset(ctx-pending, sig); I don't understand the (sig = SIGRTMIN || !sigismember(ctx-pending, sig)) check. This mimics the LEGACY_QUEUE() check, but seems strange. The signal may be pending in ctx-pending just because it was not signalfd_fetchsig()ed, yes? Logic is, if it's not an RT signal, queue only one, otherwise multiple. The bit on the -pending mask is clealer only when the queue slot becomes empty. Please also see below. +asmlinkage long sys_signalfd(int ufd, sigset_t __user *user_mask, size_t sizemask) +{ [...snip...] + } else { + error = -EBADF; + file = fget(ufd); + if (!file) + goto err_exit; + ctx = file-private_data; + error = -EINVAL; + if (file-f_op != signalfd_fops) { + fput(file); + goto err_exit; + } + spin_lock_irq(ctx-sighand-siglock); + ctx-sigmask = sigmask; + spin_unlock_irq(ctx-sighand-siglock); + wake_up(ctx-wqh); Can't this race with sys_signalfd_dequeue() which use lockless __add_wait_queue()? Looks like we should do __wake_up_locked() under ctx-sighand-siglock. Yes, good catch. Fixed. --- linux-2.6.20.ep2.orig/kernel/signal.c 2007-03-07 15:55:43.0 -0800 +++ linux-2.6.20.ep2/kernel/signal.c2007-03-07 15:59:01.0 -0800 [...snip...] @@ -780,6 +785,11 @@ BUG_ON(!irqs_disabled()); assert_spin_locked(t-sighand-siglock); + /* +* Deliver the signal to listening signalfds ... +*/ + signalfd_notify(t-sighand, sig, info); + /* Short-circuit ignored signals. */ if (sig_ignored(t, sig)) goto out; @@ -968,6 +978,11 @@ assert_spin_locked(p-sighand-siglock); handle_stop_signal(sig, p); + /* +* Deliver the signal to listening signalfds ... +*/ + signalfd_notify(p-sighand, sig, info); + /* Short-circuit ignored signals. */ if (sig_ignored(p, sig)) return ret; It is strange that we are doing signalfd_notify() even if the signal is ignored. Isn't it better to shift signalfd_notify() into send_signal() ? This way we do not need the special check in signalfd_deliver() above. The two trasports can rely on different masks. The signalfd_notify() does not even go in signalfd_deliver() if no signalfds are attached to the sighand. Also, this patch doesn't take send_sigqueue/send_group_sigqueue into account. I added that too. but I noticed something strange, dunno if intentional or not. In send_sigqueue and send_group_sigqueue, the check for the timer-special and the ignored is inverted. This lead to two different behaviours. Is there a reason for that? - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/5] signalfd v2 - signalfd core ...
Linus Torvalds wrote: On Thu, 8 Mar 2007, Davide Libenzi wrote: So, to cut it short, I can do the pseudo-siginfo read(2), but I don't like it too much (little, actually). The siginfo, as bad as it is, is a standard used in many POSIX APIs (hence even in kernel), and IMO if we want to send that back, a struct siginfo should be. No? I think it's perfectly fine if you make it struct siginfo (even though I think it's a singularly ugly struct). It's just that then you'd have to make your read() know whether it's a compat-read or not, which you really can't. Which is why you introduced a new system call, but that leads to all the problems with the file descriptor no longer being *usable*. Think scripts. It's easy to do reads in perl scripts, and parse the output. In contrast, making perl use a new system call is quite challenging. Probably, but someone will have to add the 'signalfd' system call anyway. And *that* is why everything is a stream of bytes is so important. You don't know where the file descriptor has been, or who uses it. Special system calls for special file descriptors are just *wrong*. After all, that's why we'd have a signalfd() in the first place: exactly so that you do *not* have to use special system calls, but can just pass it on to any event waiting mechanism like select, poll, epoll. The same is just *even*more*true* when it comes to reading the data! The problem with read() returning arbitrary unstructured data is that there is hard to do versioning/extensibility, since the userspace can't specify the requested/expected format. The only way it could be done is by the (nbytes) parameter to read() which is not very nice (and useless for scripts). This is the same problem that makes sysfs/procfs fragile unless the file format is very well specified for extensibility (and it's easy to f-it up, since there seems to be little consistency there... using the XML horror would actually be an improvement). Breaking sysfs/procfs might be acceptable once every few years, but signal handling will be part of every application event loop and there is no room for breaking anything. (although, one could to the versioning the ugly way by creating the new 'signalfd' syscall instead). I'd say: make read() return the signal number (probably as 4-byte int, in network byte order), but for everything else, use the system call. Mark P.S. I'm currently worried if the fact that FUTEX_FD is being deprecated is a problem :) - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/5] signalfd v2 - signalfd core ...
On 03/08, Davide Libenzi wrote: On Thu, 8 Mar 2007, Oleg Nesterov wrote: Davide Libenzi wrote: +int signalfd_deliver(struct sighand_struct *sighand, int sig, struct siginfo *info) +{ + int nsig = 0; + struct list_head *pos; + struct signalfd_ctx *ctx; + struct signalfd_sq *sq; + + list_for_each(pos, sighand-sfdlist) { + ctx = list_entry(pos, struct signalfd_ctx, lnk); + /* + * We use a negative signal value as a way to broadcast that the + * sighand has been orphaned, so that we can notify all the + * listeners about this. + */ + if (sig 0) + __wake_up_locked(ctx-wqh, TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE); + else if (sigismember(ctx-sigmask, sig) + (sig = SIGRTMIN || !sigismember(ctx-pending, sig))) { + sigaddset(ctx-pending, sig); I don't understand the (sig = SIGRTMIN || !sigismember(ctx-pending, sig)) check. This mimics the LEGACY_QUEUE() check, but seems strange. The signal may be pending in ctx-pending just because it was not signalfd_fetchsig()ed, yes? Logic is, if it's not an RT signal, queue only one, otherwise multiple. The bit on the -pending mask is clealer only when the queue slot becomes empty. Yes, I see what the code does, but I don't undestand why. For example, SIGCHLD was delivered to the process _and_ handled several times, then sys_signalfd_dequeue() comes and finds only one siginfo. Isn't this strange? @@ -780,6 +785,11 @@ BUG_ON(!irqs_disabled()); assert_spin_locked(t-sighand-siglock); + /* + * Deliver the signal to listening signalfds ... + */ + signalfd_notify(t-sighand, sig, info); + /* Short-circuit ignored signals. */ if (sig_ignored(t, sig)) goto out; @@ -968,6 +978,11 @@ assert_spin_locked(p-sighand-siglock); handle_stop_signal(sig, p); + /* + * Deliver the signal to listening signalfds ... + */ + signalfd_notify(p-sighand, sig, info); + /* Short-circuit ignored signals. */ if (sig_ignored(p, sig)) return ret; It is strange that we are doing signalfd_notify() even if the signal is ignored. Isn't it better to shift signalfd_notify() into send_signal() ? This way we do not need the special check in signalfd_deliver() above. The two trasports can rely on different masks. The signalfd_notify() does not even go in signalfd_deliver() if no signalfds are attached to the sighand. Sorry, I don't understand. The masks are different, yes, but -sighand is the same? How this can make any difference for if no signalfds are attached ? OK. What is the purpose of signalfd? Should it record the signals which were sent to the process, or only those which were delivered? The latter looks more natural for me. But inn any case, I don't see a reason to check -pending mask in signalfd_deliver(). BTW, sys_signalfd_dequeue() re-queues signalfd_sq if copy_siginfo_to_user() fails. Why not -EFAULT? Also. A malicious user can eat all memory, signalfd_deliver()-kmem_cache_alloc() doesn't check any limits. Also, this patch doesn't take send_sigqueue/send_group_sigqueue into account. I added that too. but I noticed something strange, dunno if intentional or not. In send_sigqueue and send_group_sigqueue, the check for the timer-special and the ignored is inverted. This lead to two different behaviours. Is there a reason for that? I was wondering about that too. Oleg. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/5] signalfd v2 - signalfd core ...
On Fri, 9 Mar 2007, Oleg Nesterov wrote: Also. A malicious user can eat all memory, signalfd_deliver()-kmem_cache_alloc() doesn't check any limits. This, btw, is one reason I *really* think signalfd() should just use the same old signal queue, and not try to make its own. Signal queueing and unqueuing simply isn't that simple. Linus - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/5] signalfd v2 - signalfd core ...
On Thu, 8 Mar 2007, Linus Torvalds wrote: On Fri, 9 Mar 2007, Oleg Nesterov wrote: Also. A malicious user can eat all memory, signalfd_deliver()-kmem_cache_alloc() doesn't check any limits. This, btw, is one reason I *really* think signalfd() should just use the same old signal queue, and not try to make its own. Signal queueing and unqueuing simply isn't that simple. Alrighty. Back to the dequeue_signal ... - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/5] signalfd v2 - signalfd core ...
On Fri, 9 Mar 2007, Oleg Nesterov wrote: Logic is, if it's not an RT signal, queue only one, otherwise multiple. The bit on the -pending mask is clealer only when the queue slot becomes empty. Yes, I see what the code does, but I don't undestand why. For example, SIGCHLD was delivered to the process _and_ handled several times, then sys_signalfd_dequeue() comes and finds only one siginfo. Isn't this strange? That's the same logic the kernel folows for non-RT signals. The two trasports can rely on different masks. The signalfd_notify() does not even go in signalfd_deliver() if no signalfds are attached to the sighand. Sorry, I don't understand. The masks are different, yes, but -sighand is the same? How this can make any difference for if no signalfds are attached ? The list_empty() che would not make you fall inside signalfd_deliver(), hence the fast path really lives up to its name ;) Also. A malicious user can eat all memory, signalfd_deliver()-kmem_cache_alloc() doesn't check any limits. I'll make that use the std dequeu_signal, so everything is handled in there. - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/5] signalfd v2 - signalfd core ...
Linus Torvalds wrote: So I think you should get rid of signalfd_dequeue(), and just replace it with a read() function. The difficulty is that there are 4 different formats of signal structure you could get: (traditional|siginfo) x (32bit|64bit). What happens if you're a 32 bit process, you fork and exec a 64bit process who inherits the signalfd, and they start reading? What format of signal structure do they get? What do you get? What if you start doing partial reads? I think signalfd_dequeue() is warty, but read() has has a number of details to sort out. (Hey, can you send signals by writing into the signalfd? Very plan9...) J - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/5] signalfd v2 - signalfd core ...
On Thu, 8 Mar 2007, Jeremy Fitzhardinge wrote: The difficulty is that there are 4 different formats of signal structure you could get: (traditional|siginfo) x (32bit|64bit). See my suggestion of a fixed-format (and much cleaned up) pseudo-siginfo thing earlier in this thread, and also actually mentioned as a good to do in my original email from 2003 that did the original patch ;) (Hey, can you send signals by writing into the signalfd? Very plan9...) It's an obvious extension. Whether there is any real point to it or not, I dunno. After all, you could just send the signal instead. Linus - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/5] signalfd v2 - signalfd core ...
On 03/08, Davide Libenzi wrote: On Fri, 9 Mar 2007, Oleg Nesterov wrote: Logic is, if it's not an RT signal, queue only one, otherwise multiple. The bit on the -pending mask is clealer only when the queue slot becomes empty. Yes, I see what the code does, but I don't undestand why. For example, SIGCHLD was delivered to the process _and_ handled several times, then sys_signalfd_dequeue() comes and finds only one siginfo. Isn't this strange? That's the same logic the kernel folows for non-RT signals. Yes, specific_send_sig_info/__group_send_sig_info ignores the non-RT signal - if we already have a pending one (blocked, or not yet handled) - the only reason is backward compatibility signalfd: - if the same signr was not fetched via sys_signalfd_dequeue() - the reason is Suppose we are doing a simple application which just logs all signals which were sent to this process. Now, we can miss the signal if the logging blocks. Think about sub-threads. We can send the same non-RT signal to T1, T2, T3 via specific_send_sig_info(). All 3 signals will be delivered, but signalfd (which is process wide) will see only the first. The two trasports can rely on different masks. The signalfd_notify() does not even go in signalfd_deliver() if no signalfds are attached to the sighand. Sorry, I don't understand. The masks are different, yes, but -sighand is the same? How this can make any difference for if no signalfds are attached ? The list_empty() che would not make you fall inside signalfd_deliver(), hence the fast path really lives up to its name ;) Ugh. Still can't understand, probably I missed something or misread this patch. If we shift signalfd_notify() from specific_send_sig_info/__group_send_sig_info to send_signal(), we have the same list_empty() fastpath if no signalfds are attached to the sighand. The difference is that we don't count sig_ignored() signals, which looks right to me. Oleg. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/5] signalfd v2 - signalfd core ...
On Fri, 9 Mar 2007, Oleg Nesterov wrote: Ugh. Still can't understand, probably I missed something or misread this patch. If we shift signalfd_notify() from specific_send_sig_info/__group_send_sig_info to send_signal(), we have the same list_empty() fastpath if no signalfds are attached to the sighand. The difference is that we don't count sig_ignored() signals, which looks right to me. Now I shifted the check into send_signal(), since signalfd now uses the standard dequeue_signal, and hence compete with standard signal delivery against the queue. This was initial Linus design. - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/5] signalfd v2 - signalfd core ...
On Wed, 2007-03-07 at 17:21 -0800, Davide Libenzi wrote: int signalfd_dequeue(int fd, siginfo_t *info, long timeo); The fd parameter must ba a signalfd file descriptor. The info parameter is a pointer to the siginfo that will receive the dequeued signal, and timeo is a timeout in milliseconds, or -1 for infinite. The signalfd_dequeue function returns 0 if successfull. Does this support non-blocking mode? It doesn't seem to at my level of understanding anyway. If I use this with EPOLLET for example, I'd expect to get a single EPOLLIN when a signal arrives, which would indicate to me that I must call signalfd_dequeue() in a loop until I get EAGAIN in order to be sure I've consumed all the outstanding signals so that the edge-triggered notification can be re-armed. Make sense? - DML - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/