[patch 11/13] signal/timer/event fds v10 - eventfd wire up i386 arch ...
This patch wire the eventfd system call to the i386 architecture. Signed-off-by: Davide Libenzi davidel@xmailserver.org - Davide Index: linux-2.6.21-rc5.fds/arch/i386/kernel/syscall_table.S === --- linux-2.6.21-rc5.fds.orig/arch/i386/kernel/syscall_table.S 2007-04-02 15:06:39.0 -0700 +++ linux-2.6.21-rc5.fds/arch/i386/kernel/syscall_table.S 2007-04-02 15:06:44.0 -0700 @@ -321,3 +321,4 @@ .long sys_epoll_pwait .long sys_signalfd /* 320 */ .long sys_timerfd + .long sys_eventfd Index: linux-2.6.21-rc5.fds/include/asm-i386/unistd.h === --- linux-2.6.21-rc5.fds.orig/include/asm-i386/unistd.h 2007-04-02 15:06:39.0 -0700 +++ linux-2.6.21-rc5.fds/include/asm-i386/unistd.h 2007-04-02 15:06:44.0 -0700 @@ -327,10 +327,11 @@ #define __NR_epoll_pwait 319 #define __NR_signalfd 320 #define __NR_timerfd 321 +#define __NR_eventfd 322 #ifdef __KERNEL__ -#define NR_syscalls 322 +#define NR_syscalls 323 #define __ARCH_WANT_IPC_PARSE_VERSION #define __ARCH_WANT_OLD_READDIR - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch 2/13] signal/timer/event fds v10 - signalfd core ...
ChangeLog: v10 - Renamed from aino to anon_inode -- This patch series implements the new signalfd() system call. I took part of the original Linus code (and you know how badly it can be broken :), and I added even more breakage ;) Signals are fetched from the same signal queue used by the process, so signalfd will compete with standard kernel delivery in dequeue_signal(). If you want to reliably fetch signals on the signalfd file, you need to block them with sigprocmask(SIG_BLOCK). This seems to be working fine on my Dual Opteron machine. I made a quick test program for it: http://www.xmailserver.org/signafd-test.c The signalfd() system call implements signal delivery into a file descriptor receiver. The signalfd file descriptor if created with the following API: int signalfd(int ufd, const sigset_t *mask, size_t masksize); The ufd parameter allows to change an existing signalfd sigmask, w/out going to close/create cycle (Linus idea). Use ufd == -1 if you want a brand new signalfd file. The mask allows to specify the signal mask of signals that we are interested in. The masksize parameter is the size of mask. The signalfd fd supports the poll(2) and read(2) system calls. The poll(2) will return POLLIN when signals are available to be dequeued. As a direct consequence of supporting the Linux poll subsystem, the signalfd fd can use used together with epoll(2) too. The read(2) system call will return a struct signalfd_siginfo structure in the userspace supplied buffer. The return value is the number of bytes copied in the supplied buffer, or -1 in case of error. The read(2) call can also return 0, in case the sighand structure to which the signalfd was attached, has been orphaned. The O_NONBLOCK flag is also supported, and read(2) will return -EAGAIN in case no signal is available. If the size of the buffer passed to read(2) is lower than sizeof(struct signalfd_siginfo), -EINVAL is returned. A read from the signalfd can also return -ERESTARTSYS in case a signal hits the process. The format of the struct signalfd_siginfo is, and the valid fields depends of the (-code __SI_MASK) value, in the same way a struct siginfo would: struct signalfd_siginfo { __u32 signo;/* si_signo */ __s32 err; /* si_errno */ __s32 code; /* si_code */ __u32 pid; /* si_pid */ __u32 uid; /* si_uid */ __s32 fd; /* si_fd */ __u32 tid; /* si_fd */ __u32 band; /* si_band */ __u32 overrun; /* si_overrun */ __u32 trapno; /* si_trapno */ __s32 status; /* si_status */ __s32 svint;/* si_int */ __u64 svptr;/* si_ptr */ __u64 utime;/* si_utime */ __u64 stime;/* si_stime */ __u64 addr; /* si_addr */ }; Signed-off-by: Davide Libenzi davidel@xmailserver.org - Davide Index: linux-2.6.21-rc5.fds/fs/signalfd.c === --- /dev/null 1970-01-01 00:00:00.0 + +++ linux-2.6.21-rc5.fds/fs/signalfd.c 2007-04-02 15:06:29.0 -0700 @@ -0,0 +1,353 @@ +/* + * fs/signalfd.c + * + * Copyright (C) 2003 Linus Torvalds + * + * Mon Mar 5, 2007: Davide Libenzi davidel@xmailserver.org + * Changed -read() to return a siginfo strcture instead of signal number. + * Fixed locking in -poll(). + * Added sighand-detach notification. + * Added fd re-use in sys_signalfd() syscall. + * Now using anonymous inode source. + * Thanks to Oleg Nesterov for useful code review and suggestions. + * More comments and suggestions from Arnd Bergmann. + */ + +#include linux/file.h +#include linux/poll.h +#include linux/slab.h +#include linux/init.h +#include linux/fs.h +#include linux/mount.h +#include linux/module.h +#include linux/kernel.h +#include linux/signal.h +#include linux/list.h +#include linux/anon_inodes.h +#include linux/signalfd.h + +#include asm/uaccess.h + +struct signalfd_ctx { + struct list_head lnk; + wait_queue_head_t wqh; + sigset_t sigmask; + struct task_struct *tsk; +}; + +struct signalfd_lockctx { + struct task_struct *tsk; + unsigned long flags; +}; + +/* + * Tries to acquire the sighand lock. We do not increment the sighand + * use count, and we do not even pin the task struct, so we need to + * do it inside an RCU read lock, and we must be prepared for the + * ctx-tsk going to NULL (in signalfd_deliver()), and for the sighand + * being detached. We return 0 if the sighand has been detached, or + * 1 if we were able to pin the sighand lock. + */ +static int signalfd_lock(struct signalfd_ctx *ctx, struct signalfd_lockctx *lk) +{ + struct sighand_struct *sighand = NULL; + + rcu_read_lock(); + lk-tsk = rcu_dereference(ctx-tsk); + if (likely(lk-tsk != NULL)) + sighand = lock_task_sighand(lk-tsk, lk-flags); + rcu_read_unlock(); + + if (sighand !ctx-tsk
Re: getting processor numbers
On Tue, 3 Apr 2007, Ulrich Drepper wrote: More and more code depends on knowing the number of processors in the system to efficiently scale the code. E.g., in OpenMP it is used by default to determine how many threads to create. Creating more threads than there are processors/cores doesn't make sense. glibc for a long time provides functionality to retrieve the number through sysconf() and this is what fortunately most programs use. The problem is that we are currently using /proc/cpuinfo since this is all there was available at that time. Creating /proc/cpuinfo takes the kernel quite a long time, unfortunately (I think Jakub said it is mainly the interrupt information). The alternative today is to use /sys/devices/system/cpu and count the number of cpu* directories in it. This is somewhat faster. But there would be another possibility: simply stat /sys/devices/system/cpu and use st_nlink - 2. This last step unfortunately it made impossible by recent changes: http://article.gmane.org/gmane.linux.kernel/413178 I would like to propose changing that patch, move the sched_* pseudo-files in some other directly and permanently ban putting any new file into /sys/devices/system/cpu. To get some numbers, you can try http://people.redhat.com/drepper/nproc-timing.c The numbers I see on x86-64: cpuinfo 10145810 cycles for 100 accesses readdir /sys 3113870 cycles for 100 accesses stat /sys 741070 cycles for 100 accesses It sucks when seen from a micro-bench POV, but does it really matter overall? The vast majority of software usually calls sysconf(_SC_NPROCESSORS_*) with very little frequency (mostly once at initialization time) anyway. That's what 50us / call? - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.21-rc5-mm4
On Tue, 3 Apr 2007, Randy Dunlap wrote: On Mon, 2 Apr 2007 22:47:45 -0700 Andrew Morton wrote: ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.21-rc5/2.6.21-rc5-mm4/ - The oops in git-net.patch has been fixed, so that tree has been restored. It is huge. - Added the device-mapper development tree to the -mm lineup (Alasdair Kergon). It is a quilt tree, living at ftp://ftp.kernel.org/pub/linux/kernel/people/agk/patches/2.6/editing/. - Added davidel's signalfd stuff. ~~ with those new CONFIGs=n: (we are discussing it) arch/x86_64/kernel/built-in.o:(.rodata+0x18a8): undefined reference to `sys_signal fd' arch/x86_64/kernel/built-in.o:(.rodata+0x18b0): undefined reference to `sys_timerf d' arch/x86_64/kernel/built-in.o:(.rodata+0x18b8): undefined reference to `sys_eventf d' arch/x86_64/ia32/ia32entry.S:(.rodata+0xa08): undefined reference to `sys_signalfd ' arch/x86_64/ia32/ia32entry.S:(.rodata+0xa10): undefined reference to `sys_timerfd' arch/x86_64/ia32/ia32entry.S:(.rodata+0xa18): undefined reference to `sys_eventfd' make: *** [.tmp_vmlinux1] Error 1 I'll be sending Andrew patches against -mm4. The ones that fix the above, and the one for the include-files diet. - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch 5/5] new fds fixes for 2.6.21-rc5.mm4 - eventfd include files diet ...
Removes a few unneeded include files from the eventfd code. Signed-off-by: Davide Libenzi davidel@xmailserver.org - Davide Index: linux-2.6.21-rc5.mm4/fs/eventfd.c === --- linux-2.6.21-rc5.mm4.orig/fs/eventfd.c 2007-04-03 13:17:25.0 -0700 +++ linux-2.6.21-rc5.mm4/fs/eventfd.c 2007-04-03 13:17:39.0 -0700 @@ -7,20 +7,15 @@ #include linux/file.h #include linux/poll.h -#include linux/slab.h #include linux/init.h #include linux/fs.h -#include linux/mount.h -#include linux/module.h +#include linux/sched.h #include linux/kernel.h #include linux/list.h #include linux/spinlock.h -#include linux/jiffies.h #include linux/anon_inodes.h #include linux/eventfd.h -#include asm/uaccess.h - struct eventfd_ctx { spinlock_t lock; wait_queue_head_t wqh; - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch 2/5] new fds fixes for 2.6.21-rc5.mm4 - fix a spelling error ...
Fixes a spelling error inside init/Kconfig. Signed-off-by: Davide Libenzi davidel@xmailserver.org - Davide Index: linux-2.6.21-rc5.mm4/init/Kconfig === --- linux-2.6.21-rc5.mm4.orig/init/Kconfig 2007-04-03 13:17:25.0 -0700 +++ linux-2.6.21-rc5.mm4/init/Kconfig 2007-04-03 13:17:35.0 -0700 @@ -479,7 +479,7 @@ run glibc-based applications correctly. config ANON_INODES - bool Enable anynimous inode source if EMBEDDED + bool Enable anonymous inode source if EMBEDDED default y help Anonymous inode source for pseudo-files like epoll, signalfd, - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch 4/5] new fds fixes for 2.6.21-rc5.mm4 - timerfd include files diet ...
Removes a few unneeded include files from the timerfd code. Signed-off-by: Davide Libenzi davidel@xmailserver.org - Davide Index: linux-2.6.21-rc5.mm4/fs/timerfd.c === --- linux-2.6.21-rc5.mm4.orig/fs/timerfd.c 2007-04-03 13:17:25.0 -0700 +++ linux-2.6.21-rc5.mm4/fs/timerfd.c 2007-04-03 13:17:38.0 -0700 @@ -10,23 +10,17 @@ #include linux/file.h #include linux/poll.h -#include linux/slab.h #include linux/init.h #include linux/fs.h -#include linux/mount.h -#include linux/module.h +#include linux/sched.h #include linux/kernel.h -#include linux/signal.h #include linux/list.h #include linux/spinlock.h #include linux/time.h #include linux/hrtimer.h -#include linux/jiffies.h #include linux/anon_inodes.h #include linux/timerfd.h -#include asm/uaccess.h - struct timerfd_ctx { struct hrtimer tmr; ktime_t tintv; - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch 1/5] new fds fixes for 2.6.21-rc5.mm4 - fix a build error on x86_64 when the new CONFIG_* are not set ...
Fixes a build error on x86_64 that happens when the new CONFIG_* options for the new fds are not set Signed-off-by: Davide Libenzi davidel@xmailserver.org - Davide Index: linux-2.6.21-rc5.mm4/kernel/sys_ni.c === --- linux-2.6.21-rc5.mm4.orig/kernel/sys_ni.c 2007-04-03 13:17:25.0 -0700 +++ linux-2.6.21-rc5.mm4/kernel/sys_ni.c2007-04-03 13:17:30.0 -0700 @@ -146,3 +146,9 @@ cond_syscall(sys_bdflush); cond_syscall(sys_ioprio_set); cond_syscall(sys_ioprio_get); + +/* New file descriptors */ +cond_syscall(sys_signalfd); +cond_syscall(sys_timerfd); +cond_syscall(sys_eventfd); + - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch 3/5] new fds fixes for 2.6.21-rc5.mm4 - signalfd include files diet ...
Removes a few unneeded include files from the signalfd code. Signed-off-by: Davide Libenzi davidel@xmailserver.org - Davide Index: linux-2.6.21-rc5.mm4/fs/signalfd.c === --- linux-2.6.21-rc5.mm4.orig/fs/signalfd.c 2007-04-03 13:17:25.0 -0700 +++ linux-2.6.21-rc5.mm4/fs/signalfd.c 2007-04-03 13:17:36.0 -0700 @@ -15,19 +15,15 @@ #include linux/file.h #include linux/poll.h -#include linux/slab.h #include linux/init.h #include linux/fs.h -#include linux/mount.h -#include linux/module.h +#include linux/sched.h #include linux/kernel.h #include linux/signal.h #include linux/list.h #include linux/anon_inodes.h #include linux/signalfd.h -#include asm/uaccess.h - struct signalfd_ctx { struct list_head lnk; wait_queue_head_t wqh; - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch 1/3] epoll cleanups - epoll include diet ...
Remove some unneeded include files from epoll code. Signed-off-by: Davide Libenzi davidel@xmailserver.org - Davide Index: linux-2.6.21-rc5.mm4/fs/eventpoll.c === --- linux-2.6.21-rc5.mm4.orig/fs/eventpoll.c2007-04-03 17:59:54.0 -0700 +++ linux-2.6.21-rc5.mm4/fs/eventpoll.c 2007-04-03 18:33:30.0 -0700 @@ -1,6 +1,6 @@ /* - * fs/eventpoll.c ( Efficent event polling implementation ) - * Copyright (C) 2001,...,2006 Davide Libenzi + * fs/eventpoll.c (Efficent event notification implementation) + * Copyright (C) 2001,...,2007 Davide Libenzi * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by @@ -17,30 +17,21 @@ #include linux/sched.h #include linux/fs.h #include linux/file.h -#include linux/signal.h #include linux/errno.h -#include linux/mm.h #include linux/slab.h #include linux/poll.h #include linux/string.h #include linux/list.h #include linux/hash.h #include linux/spinlock.h -#include linux/syscalls.h #include linux/rwsem.h #include linux/rbtree.h #include linux/wait.h #include linux/eventpoll.h -#include linux/mount.h -#include linux/bitops.h #include linux/mutex.h #include linux/anon_inodes.h #include asm/uaccess.h -#include asm/system.h -#include asm/io.h -#include asm/mman.h #include asm/atomic.h -#include asm/semaphore.h /* - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch 2/3] epoll cleanups - epoll no module ...
Epoll is either compiled it, or not (if EMBEDDED). Remove the module code and use fs_initcall(). Signed-off-by: Davide Libenzi davidel@xmailserver.org - Davide Index: linux-2.6.21-rc5.mm4/fs/eventpoll.c === --- linux-2.6.21-rc5.mm4.orig/fs/eventpoll.c2007-04-03 18:33:30.0 -0700 +++ linux-2.6.21-rc5.mm4/fs/eventpoll.c 2007-04-03 18:33:40.0 -0700 @@ -11,7 +11,6 @@ * */ -#include linux/module.h #include linux/init.h #include linux/kernel.h #include linux/sched.h @@ -1462,16 +1461,5 @@ return 0; } +fs_initcall(eventpoll_init); - -static void __exit eventpoll_exit(void) -{ - /* Undo all operations done inside eventpoll_init() */ - kmem_cache_destroy(pwq_cache); - kmem_cache_destroy(epi_cache); -} - -module_init(eventpoll_init); -module_exit(eventpoll_exit); - -MODULE_LICENSE(GPL); - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch 3/3] epoll cleanups - epoll remove static pre-declarations and akpm-ize the code ...
Re-arrange epoll code to avoid static functions pre-declarations, and apply akpm-filter on it. Signed-off-by: Davide Libenzi davidel@xmailserver.org - Davide Index: linux-2.6.21-rc5.mm4/fs/eventpoll.c === --- linux-2.6.21-rc5.mm4.orig/fs/eventpoll.c2007-04-03 18:33:40.0 -0700 +++ linux-2.6.21-rc5.mm4/fs/eventpoll.c 2007-04-03 18:33:42.0 -0700 @@ -32,7 +32,6 @@ #include asm/uaccess.h #include asm/atomic.h - /* * LOCKING: * There are three level of locking required by epoll : @@ -65,7 +64,6 @@ * a greater scalability. */ - #define DEBUG_EPOLL 0 #if DEBUG_EPOLL 0 @@ -95,7 +93,6 @@ #define EP_MAX_EVENTS (INT_MAX / sizeof(struct epoll_event)) - struct epoll_filefd { struct file *file; int fd; @@ -213,36 +210,6 @@ struct epitem *epi; }; - - -static void ep_poll_safewake_init(struct poll_safewake *psw); -static void ep_poll_safewake(struct poll_safewake *psw, wait_queue_head_t *wq); -static int ep_alloc(struct eventpoll **pep); -static void ep_free(struct eventpoll *ep); -static struct epitem *ep_find(struct eventpoll *ep, struct file *file, int fd); -static void ep_use_epitem(struct epitem *epi); -static void ep_release_epitem(struct epitem *epi); -static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead, -poll_table *pt); -static void ep_rbtree_insert(struct eventpoll *ep, struct epitem *epi); -static int ep_insert(struct eventpoll *ep, struct epoll_event *event, -struct file *tfile, int fd); -static int ep_modify(struct eventpoll *ep, struct epitem *epi, -struct epoll_event *event); -static void ep_unregister_pollwait(struct eventpoll *ep, struct epitem *epi); -static int ep_unlink(struct eventpoll *ep, struct epitem *epi); -static int ep_remove(struct eventpoll *ep, struct epitem *epi); -static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *key); -static int ep_eventpoll_close(struct inode *inode, struct file *file); -static unsigned int ep_eventpoll_poll(struct file *file, poll_table *wait); -static int ep_send_events(struct eventpoll *ep, struct list_head *txlist, - struct epoll_event __user *events, int maxevents); -static int ep_events_transfer(struct eventpoll *ep, - struct epoll_event __user *events, - int maxevents); -static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events, - int maxevents, long timeout); - /* * This semaphore is used to serialize ep_free() and eventpoll_release_file(). */ @@ -257,19 +224,6 @@ /* Slab cache used to allocate struct eppoll_entry */ static struct kmem_cache *pwq_cache __read_mostly; -/* File callbacks that implement the eventpoll file behaviour */ -static const struct file_operations eventpoll_fops = { - .release= ep_eventpoll_close, - .poll = ep_eventpoll_poll -}; - - - -/* Fast test to see if the file is an evenpoll file */ -static inline int is_file_epoll(struct file *f) -{ - return f-f_op == eventpoll_fops; -} /* Setup the structure that is used as key for the rb-tree */ static inline void ep_set_ffd(struct epoll_filefd *ffd, @@ -338,7 +292,6 @@ spin_lock_init(psw-lock); } - /* * Perform a safe wake up of the poll wait list. The problem is that * with the new callback'd wake up system, it is possible that the @@ -393,303 +346,265 @@ spin_unlock_irqrestore(psw-lock, flags); } - /* - * This is called from eventpoll_release() to unlink files from the eventpoll - * interface. We need to have this facility to cleanup correctly files that are - * closed without being removed from the eventpoll interface. + * This function unregister poll callbacks from the associated file descriptor. + * Since this must be called without holding ep-lock the atomic exchange trick + * will protect us from multiple unregister. */ -void eventpoll_release_file(struct file *file) +static void ep_unregister_pollwait(struct eventpoll *ep, struct epitem *epi) { - struct list_head *lsthead = file-f_ep_links; - struct eventpoll *ep; - struct epitem *epi; + int nwait; + struct list_head *lsthead = epi-pwqlist; + struct eppoll_entry *pwq; - /* -* We don't want to get file-f_ep_lock because it is not -* necessary. It is not necessary because we're in the struct file -* cleanup path, and this means that noone is using this file anymore. -* The only hit might come from ep_free() but by holding the semaphore -* will correctly serialize the operation. We do need to acquire -* ep-sem after epmutex because ep_remove() requires it when called -* from anywhere but ep_free(). -*/ - mutex_lock(epmutex); + /* This is called without locks, so we
Re: 2.6.24-rc6: possible recursive locking detected
On Sun, 6 Jan 2008, Christian Kujau wrote: On Sat, 5 Jan 2008, Davide Libenzi wrote: A solution may be to move the call to ep_poll_safewake() (that'd become a simple wake_up()) inside a tasklet or whatever is today trendy for delayed work. But his kinda scares me to be honest, since epoll has already a bunch of places where it could be asynchronously hit (plus performance regression will need to be verified). Although I'm not able to reproduce this one right now, I'm happy to test any patches you guys come up with. There's no need to reproduce it. It's there, it's among us ;) - Davide -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Improve scalability of epoll_ctl
On Tue, 8 Jan 2008, Eric Dumazet wrote: Changli Gao a écrit : Replace the epitem rbtree with a dynamic array to get the constant insertion/deletion/modification time of the file descriptors. Reuse the size argument of epoll_create, however the size must be smaller than the max file descriptor number: ether the resource limitation or the compiling time limitation. Hum, you should read this : http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.7-rc3/2.6.7-rc3-mm2/broken-out/epoll-uses-rbtrees.patch Your patch is a revert of this change, it probably wont be accepted. epoll_ctl() is scalable, since it's O(log2(N)) : With 1 millions files descriptors, that means less than 20 nodes to lookup in the tree. Indeed, and we aren't going back. In what situation do you think epoll_ctl() performance is bad ? I'm sure you can cook up a microbench that only does epoll_ctl(), and with a good hash implementation you get better numbers. This is what I did before I changed to rbtree, and the numbers did not justify the code (the size in epoll_create() is just an hint, and that means you need to handle resizing - plus other things like locking up memory when the number of fds shrinks, and other stuff I don't even remember). - Davide
Re: [PATCH 5/6] syslets: add generic syslets infrastructure
On Thu, 10 Jan 2008, Rusty Russell wrote: On Thursday 10 January 2008 05:16:57 Zach Brown wrote: The latter. A ring is optimal for processing a huge number of requests, but if you're really going to be firing off syslet threads all over the place you're not going to be optimal anyway. And being able to point the return value to the stack or into some datastructure is way nicer to code (zero setup == easy to understand and easy to convert). One of Linus' rhetorical requirements for the syslets work is that we be able to wait for the result without spending overhead building up state in some completion context. The userland rings in the current syslet patches achieve that and don't seem outrageously complicated. I'd have to read his original statement, but eventfd doesn't build up state, so I think it qualifies. I think you and Zach are talking about different issues ;) Eventfd born for two different reasons. First, to be able to have userspace to signal to a poll/select/epoll based listener an event. This can elso be done with pipes, but eventfd has a few advantages over pipes (3-4 times faster and *a lot* less memory footprint). Second, as a generic way for kernel subsystems to signal completions to a poll/select/epoll userspace listener. And this is what is used in the new KAIO eventfd feature (patch was like 5 lines IIRC). This allow for KAIO events to be signaled to poll/select/epoll in a pretty easy way, using a simple extension of the AIO API. What we talked originally with Ingo, when the first syslet code came up, was the ability to do the reverse thing. That is, host an epoll_wait() inside a syslet, and gather the completion using whatever the syslet code was/is going to use for it. Given that 1) the eventfd completion patch was trivial and immediately available 2) the future of the whole syslet concept was/is still unclear, I believe it makes/made sense. If the syslets will become mainline, it'll mean that userspace will then be able to select the event-completion hosting method that better suits their needs. That are, either AIO completion hosted inside an epoll_wait() via an eventfd, or an epoll_wait() hosted inside a syslet. - Davide -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: RFC: A revised timerfd API
On Tue, 18 Sep 2007, Michael Kerrisk wrote: The four designs are: a) A multiplexing timerfd() system call. b) Creating three syscalls analogous to the POSIX timers API (i.e., timerfd_create/timerfd_settime/timerfd_gettime). c) Creating a simplified timerfd() system call that is integrated with the POSIX timers API. d) Extending the POSIX timers API to support the timerfd concept. If you really want to shoot yourself in your foot, I'd pick bullet B. Bullet A makes me sea-sick, and bullets C and D, well, let's leave POSIX APIs being *POSIX* APIs. Once you remove all the ifs and elses that resulted from your previous bullet A multiplexing implementation, timerfd_gettime and timerfd_settime should result in being pretty slick. I still think we could have survived w/out all this done inside the kernel though. - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2.6.23-rc6] NETWORKING : Edge Triggered EPOLLOUT events get missed for TCP sockets
On Wed, 19 Sep 2007, Nagendra Tomar wrote: [ ... ] Seems like epoll ate your message too :) Can you repost your bug report and the patch? - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2.6.23-rc6 Resending] NETWORKING : Edge Triggered EPOLLOUT events get missed for TCP sockets
On Wed, 19 Sep 2007, David Miller wrote: From: Nagendra Tomar [EMAIL PROTECTED] Date: Wed, 19 Sep 2007 15:37:09 -0700 (PDT) With the SOCK_NOSPACE check in tcp_check_space(), this epoll_wait call will not return, even when the incoming acks free the buffers. Note that this patch assumes that the SOCK_NOSPACE check in tcp_check_space is a trivial optimization which can be safely removed. I already replied to your patch posting explaining that whatever is not setting SOCK_NOSPACE should be fixed instead. Please address that, thanks. You're not planning of putting the notion of a SOCK_NOSPACE bit inside a completely device-unaware interface like epoll, I hope? - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2.6.23-rc6 Resending] NETWORKING : Edge Triggered EPOLLOUT events get missed for TCP sockets
On Wed, 19 Sep 2007, Nagendra Tomar wrote: Definitely not ! The point is that the tcp write space available wakeup does not get called if SOCK_NOSPACE bit is not set. This was fine when the wakeup was merely a wakeup (since SOCK_NOSPACE bit indicated that someone really cared abt the wakeup). Now after the introduction of callback'ed wakeups, we might have some work to do inside the callback even if there is nobody interested in the wakeup at that point of time. In this particular case the ep_poll_callback is not getting called and hence the socket fd is not getting added to the ready list. I know, I saw the patch. I was just commenting the point where DaveM was heading to ;) This things needs to be looked at a little bit more deeply. - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2.6.23-rc6 Resending] NETWORKING : Edge Triggered EPOLLOUT events get missed for TCP sockets
On Wed, 19 Sep 2007, Nagendra Tomar wrote: The tcp_check_space() function calls tcp_new_space() only if the SOCK_NOSPACE bit is set in the socket flags. This is causing Edge Triggered EPOLLOUT events to be missed for TCP sockets, as the ep_poll_callback() is not called from the wakeup routine. The SOCK_NOSPACE bit indicates the user's intent to perform writes on that socket (set in tcp_sendmsg and tcp_poll). I believe the idea behind the SOCK_NOSPACE check is to optimize away the tcp_new_space call in cases when user is not interested in writing to the socket. These two take care of all possible scenarios in which a user can convey his intent to write on that socket. Case 1: tcp_sendmsg detects lack of sndbuf space Case 2: tcp_poll returns not writable This is fine if we do not deal with epoll's Edge Triggered events (EPOLLET). With ET events we can have a scenario where the SOCK_NOSPACE bit is not set, as the user has neither done a sendmsg nor a poll/epoll call that returned with the POLLOUT condition not set. Looking back at it, I think the current TCP code is right, once you look at the event to be a output buffer full-with_space transition. If you drop an fd inside epoll with EPOLLOUT|EPOLLET and you get an event (free space on the output buffer), if you do not consume it (say a tcp_sendmsg that re-fill the buffer), you can't see other OUT event anymore since they happen on the full-with_space transition. Yes, I know, the read size (EPOLLIN) works differently and you get an event for every packet you receive. And yes, I do not like asymmetric things. But that does not make the EPOLLOUT|EPOLLET wrong IMO. - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2.6.23-rc6 Resending] NETWORKING : Edge Triggered EPOLLOUT events get missed for TCP sockets
On Thu, 20 Sep 2007, Eric Dumazet wrote: Does it means that with your patch each ACK on a ET managed socket will trigger an epoll event ? Maybe your very sensitive high throuput appication needs to set a flag or something at socket level to ask for such a behavior. The default should stay as is. That is an event should be sent only if someone cared about the wakeup. Unfortunately f_op-poll() does not let the caller to specify the events it's interested in, that would allow to split send/recevie wait queues and better detect read/write cases. The detection of a waitqueue_active(-sk_wr_sleep) would work fine in detecting is someone is actually waiting for a write, w/out the false positives triggered by the read-waiters. That would be a very sane thing to do, but would require a bigdumb change to all the -poll around (that could be automated by a script - devices not caring about the events hint can just continue to use the single queue like they currently do), and a more critical and gradual change of all the devices that wants to take advantage of it. That way, no more magic bits are needed, and a simple waitqueue_active() would tell you if someone is waiting for write-space events. - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch] signalfd simplification
The following patch simplifies signalfd code, by avoiding it to remain attached to the sighand during its lifetime. In this way, the signalfd remain attached to the sighand only during poll(2) (and select and epoll) and read(2). This also allows to remove all the custom tsk == current checks in kernel/signal.c, since dequeue_signal() will only be called by current. I think this is also what Ben was suggesting time ago. The external effect of this, is that a thread can extract only its own private signals and the group ones. I think this is an acceptable behaviour, in that those are the signals the thread would be able to fetch w/out signalfd. Oleg, as far as the smart wakeup thing. Can't be done w/out huge revolutions to the kernel code. It can easily be done for the -read(), but not for the -poll(). You need to be the provider of the poll_table in order to install custom wakeups, and inside -poll(), the providers are either fs/select.c or fs/eventpoll.c (or fs/aio.c). We could add a new poll_wait_func() to pass the function to be called, and that would work for fs/select.c (in there you can filter based on the signal and eventually issue the real wakeup). But it won't work for epoll, since it already expect to install its own callback, and ATM callbacks can't be chained. Since the thundering herd effect on signalfds should be pretty limited, I think it's not worth the change. Signed-off-by: Davide Libenzi [EMAIL PROTECTED] - Davide --- fs/exec.c |3 fs/signalfd.c | 190 +++--- include/linux/init_task.h |2 include/linux/sched.h |2 include/linux/signalfd.h | 40 - kernel/exit.c |9 -- kernel/fork.c |2 kernel/signal.c |8 - 8 files changed, 39 insertions(+), 217 deletions(-) Index: linux-2.6.mod/fs/signalfd.c === --- linux-2.6.mod.orig/fs/signalfd.c2007-09-19 15:54:21.0 -0700 +++ linux-2.6.mod/fs/signalfd.c 2007-09-19 15:54:42.0 -0700 @@ -11,8 +11,10 @@ * Now using anonymous inode source. * Thanks to Oleg Nesterov for useful code review and suggestions. * More comments and suggestions from Arnd Bergmann. - * Sat May 19, 2007: Davi E. M. Arnaut [EMAIL PROTECTED] + * Sat May 19, 2007: Davi E. M. Arnaut [EMAIL PROTECTED] * Retrieve multiple signals with one read() call + * Sun Jul 15, 2007: Davide Libenzi [EMAIL PROTECTED] + * Attach to the sighand only during read() and poll(). */ #include linux/file.h @@ -27,102 +29,12 @@ #include linux/signalfd.h struct signalfd_ctx { - struct list_head lnk; - wait_queue_head_t wqh; sigset_t sigmask; - struct task_struct *tsk; }; -struct signalfd_lockctx { - struct task_struct *tsk; - unsigned long flags; -}; - -/* - * Tries to acquire the sighand lock. We do not increment the sighand - * use count, and we do not even pin the task struct, so we need to - * do it inside an RCU read lock, and we must be prepared for the - * ctx-tsk going to NULL (in signalfd_deliver()), and for the sighand - * being detached. We return 0 if the sighand has been detached, or - * 1 if we were able to pin the sighand lock. - */ -static int signalfd_lock(struct signalfd_ctx *ctx, struct signalfd_lockctx *lk) -{ - struct sighand_struct *sighand = NULL; - - rcu_read_lock(); - lk-tsk = rcu_dereference(ctx-tsk); - if (likely(lk-tsk != NULL)) - sighand = lock_task_sighand(lk-tsk, lk-flags); - rcu_read_unlock(); - - if (!sighand) - return 0; - - if (!ctx-tsk) { - unlock_task_sighand(lk-tsk, lk-flags); - return 0; - } - - if (lk-tsk-tgid == current-tgid) - lk-tsk = current; - - return 1; -} - -static void signalfd_unlock(struct signalfd_lockctx *lk) -{ - unlock_task_sighand(lk-tsk, lk-flags); -} - -/* - * This must be called with the sighand lock held. - */ -void signalfd_deliver(struct task_struct *tsk, int sig) -{ - struct sighand_struct *sighand = tsk-sighand; - struct signalfd_ctx *ctx, *tmp; - - BUG_ON(!sig); - list_for_each_entry_safe(ctx, tmp, sighand-signalfd_list, lnk) { - /* -* We use a negative signal value as a way to broadcast that the -* sighand has been orphaned, so that we can notify all the -* listeners about this. Remember the ctx-sigmask is inverted, -* so if the user is interested in a signal, that corresponding -* bit will be zero. -*/ - if (sig 0) { - if (ctx-tsk == tsk) { - ctx-tsk = NULL; - list_del_init(ctx-lnk); - wake_up(ctx-wqh
Re: Tesing of / bugs in new timerfd API
On Mon, 17 Dec 2007, Michael Kerrisk wrote: Can you try the two patches below? I tried them on my 32 bit box (one of the rare beasts still lingering around here) and it seems to be working fine (those go on top of the previous ones). Against 2.6.24-rc5, I applied first your earlier patches (v3) and then the newest patch. My tests confirm that: This fixed the 32 bit tick-count truncation, and makes the time returned to be the remaining time till the next expiration. Are you going to resubmit a new patch set that includes these latest changes? Yes, I'll send them out to Andrew today. - Davide -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch 1/2] timerfd - make hrtimer_forward() to return a u64
This patch makes hrtimer_forward() to return a u64 instead of unsigned long. Since timerfd returns the number of timer ticks in a u64 variable, and hrtimer_forward() is used to calculate the timer ticks, the patch allow full 64 bit usage even on 32 bit platforms. The core of the hrtimer_forward() ticks calculation, ktime_divns(), was already having the result in u64 and it was chopping it to unsigned long. Andrew, this goes on top of the ones you already have in -mm. Signed-off-by: Davide Libenzi [EMAIL PROTECTED] - Davide --- fs/timerfd.c|6 +++--- include/linux/hrtimer.h | 10 +- kernel/hrtimer.c|9 - kernel/posix-timers.c |9 + 4 files changed, 17 insertions(+), 17 deletions(-) Index: linux-2.6.mod/include/linux/hrtimer.h === --- linux-2.6.mod.orig/include/linux/hrtimer.h 2007-12-13 16:37:36.0 -0800 +++ linux-2.6.mod/include/linux/hrtimer.h 2007-12-14 16:05:23.0 -0800 @@ -295,12 +295,12 @@ } /* Forward a hrtimer so it expires after now: */ -extern unsigned long +extern u64 hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval); /* Forward a hrtimer so it expires after the hrtimer's current now */ -static inline unsigned long hrtimer_forward_now(struct hrtimer *timer, - ktime_t interval) +static inline u64 hrtimer_forward_now(struct hrtimer *timer, + ktime_t interval) { return hrtimer_forward(timer, timer-base-get_time(), interval); } @@ -322,9 +322,9 @@ extern void __init hrtimers_init(void); #if BITS_PER_LONG 64 -extern unsigned long ktime_divns(const ktime_t kt, s64 div); +extern u64 ktime_divns(const ktime_t kt, s64 div); #else /* BITS_PER_LONG 64 */ -# define ktime_divns(kt, div) (unsigned long)((kt).tv64 / (div)) +# define ktime_divns(kt, div) (u64)((kt).tv64 / (div)) #endif /* Show pending timers: */ Index: linux-2.6.mod/kernel/hrtimer.c === --- linux-2.6.mod.orig/kernel/hrtimer.c 2007-12-13 16:37:53.0 -0800 +++ linux-2.6.mod/kernel/hrtimer.c 2007-12-13 16:41:42.0 -0800 @@ -306,7 +306,7 @@ /* * Divide a ktime value by a nanosecond value */ -unsigned long ktime_divns(const ktime_t kt, s64 div) +u64 ktime_divns(const ktime_t kt, s64 div) { u64 dclc, inc, dns; int sft = 0; @@ -321,7 +321,7 @@ dclc = sft; do_div(dclc, (unsigned long) div); - return (unsigned long) dclc; + return dclc; } #endif /* BITS_PER_LONG = 64 */ @@ -655,10 +655,9 @@ * Forward the timer expiry so it will expire in the future. * Returns the number of overruns. */ -unsigned long -hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval) +u64 hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval) { - unsigned long orun = 1; + u64 orun = 1; ktime_t delta; delta = ktime_sub(now, timer-expires); Index: linux-2.6.mod/kernel/posix-timers.c === --- linux-2.6.mod.orig/kernel/posix-timers.c2007-12-13 16:38:15.0 -0800 +++ linux-2.6.mod/kernel/posix-timers.c 2007-12-13 16:45:20.0 -0800 @@ -256,8 +256,9 @@ if (timr-it.real.interval.tv64 == 0) return; - timr-it_overrun += hrtimer_forward(timer, timer-base-get_time(), - timr-it.real.interval); + timr-it_overrun += (unsigned int) hrtimer_forward(timer, + timer-base-get_time(), + timr-it.real.interval); timr-it_overrun_last = timr-it_overrun; timr-it_overrun = -1; @@ -386,7 +387,7 @@ now = ktime_add(now, kj); } #endif - timr-it_overrun += + timr-it_overrun += (unsigned int) hrtimer_forward(timer, now, timr-it.real.interval); ret = HRTIMER_RESTART; @@ -662,7 +663,7 @@ */ if (iv.tv64 (timr-it_requeue_pending REQUEUE_PENDING || (timr-it_sigev_notify ~SIGEV_THREAD_ID) == SIGEV_NONE)) - timr-it_overrun += hrtimer_forward(timer, now, iv); + timr-it_overrun += (unsigned int) hrtimer_forward(timer, now, iv); remaining = ktime_sub(timer-expires, now); /* Return 0 only, when the timer is expired and not pending */ Index: linux-2.6.mod/fs/timerfd.c === --- linux-2.6.mod.orig/fs/timerfd.c 2007-12-13 16:49:46.0 -0800 +++ linux-2.6.mod/fs/timerfd.c 2007-12-14 16:04:36.0
[patch 2/2] timerfd - make the returned time to be the remaining time till the next expiration
Make the returned time to be the remaining time till the next expiration. If the timer is already expired, and there's no next expiration, zero will be returned. Andrew, this goes on top of the ones you already have in -mm. Signed-off-by: Davide Libenzi [EMAIL PROTECTED] - Davide --- fs/timerfd.c | 13 +++-- 1 file changed, 11 insertions(+), 2 deletions(-) Index: linux-2.6.mod/fs/timerfd.c === --- linux-2.6.mod.orig/fs/timerfd.c 2007-12-14 16:04:36.0 -0800 +++ linux-2.6.mod/fs/timerfd.c 2007-12-14 16:05:32.0 -0800 @@ -49,6 +49,15 @@ return HRTIMER_NORESTART; } +static ktime_t timerfd_get_remaining(struct timerfd_ctx *ctx) { + ktime_t now, remaining; + + now = ctx-tmr.base-get_time(); + remaining = ktime_sub(ctx-tmr.expires, now); + + return remaining.tv64 0 ? ktime_set(0, 0): remaining; +} + static void timerfd_setup(struct timerfd_ctx *ctx, int flags, const struct itimerspec *ktmr) { @@ -240,7 +249,7 @@ if (ctx-expired ctx-tintv.tv64) hrtimer_forward_now(ctx-tmr, ctx-tintv); - kotmr.it_value = ktime_to_timespec(ctx-tmr.expires); + kotmr.it_value = ktime_to_timespec(timerfd_get_remaining(ctx)); kotmr.it_interval = ktime_to_timespec(ctx-tintv); /* @@ -274,7 +283,7 @@ hrtimer_forward_now(ctx-tmr, ctx-tintv) - 1; hrtimer_restart(ctx-tmr); } - kotmr.it_value = ktime_to_timespec(ctx-tmr.expires); + kotmr.it_value = ktime_to_timespec(timerfd_get_remaining(ctx)); kotmr.it_interval = ktime_to_timespec(ctx-tintv); spin_unlock_irq(ctx-wqh.lock); fput(file); -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] fix group stop with exit race
On Wed, 5 Dec 2007, Oleg Nesterov wrote: do_signal_stop() counts all sub-thread and sets -group_stop_count accordingly. Every thread should decrement -group_stop_count and stop, the last one should notify the parent. However a sub-thread can exit before it notices the signal_pending(), or it may be somewhere in do_exit() already. In that case the group stop never finishes properly. Note: this is a minimal fix, we can add some optimizations later. Say we can return quickly if thread_group_empty(). Also, we can move some signal related code from exit_notify() to exit_signals(). Signed-off-by: Oleg Nesterov [EMAIL PROTECTED] Looks OK for me, even though we're doing more work on the exit path. OTOH I don't see a non-racy way of doing it w/out grabbing the lock. Did you try to bench how much this change costs? Anyway, looks sane to me... Acked-by: Davide Libenzi [EMAIL PROTECTED] - Davide -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] move the related code from exit_notify() to exit_signals()
On Thu, 6 Dec 2007, Oleg Nesterov wrote: The previous bugfix was not optimal, we shouldn't care about group stop when we are the only thread or the group stop is in progress. In that case nothing special is needed, just set PF_EXITING and return. Also, take the related TIF_SIGPENDING re-targeting code from exit_notify(). So, from the performance POV the only difference is that we don't trust !signal_pending() until we take -siglock. But this in fact fixes another ___pure___ theoretical minor race. __group_complete_signal() finds the task without PF_EXITING and chooses it as the target for signal_wake_up(). But nothing prevents this task from exiting in between without noticing the pending signal and thus unpredictably delaying the actual delivery. For a second there, you got me confused, since it wasn't clear to me this patch was going over the previous one. When posting patches that are not in any tree, it may be wise to include the set they nest onto ;) + if (!signal_pending(tsk)) + goto out; + + /* It could be that __group_complete_signal() choose us to + * notify about group-wide signal. Another thread should be + * woken now to take the signal since we will not. + */ + for (t = tsk; (t = next_thread(t)) != tsk; ) + if (!signal_pending(t) !(t-flags PF_EXITING)) + recalc_sigpending_and_wake(t); + + if (unlikely(tsk-signal-group_stop_count) + !--tsk-signal-group_stop_count) { + tsk-signal-flags = SIGNAL_STOP_STOPPED; + group_stop = 1; + } +out: Looks fine to me, even though the one above could simply be an: if (signal_pending(tsk)) { ... } (since the out label is used by that if only). - Davide -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets
On Tue, 20 Nov 2007, Ingo Molnar wrote: * H. Peter Anvin [EMAIL PROTECTED] wrote: It seems that you're doing the same thing in both cases, except you're now extending it to include other random functionality, which means other things than syslets are suddenly affected. syslets are arguably a little bit different, since what you're effectively doing there is running a miniature interpreted language in kernel space. A higher startup overhead should be acceptable, since you're amortizing it over a larger number of calls. Extending that mechanism suddenly means you HAVE to use that interpreted language message mechanism to access certain system calls, which really does not seem like a good thing neither for performance nor for encouraging sane design of interfaces. whether that interpreted syslet language survives is still an open question - it was extremely ugly when i wrote the first version of it and it only got uglier since then :-) Aha! You admitted it finally :) - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Where is the new timerfd?
On Thu, 22 Nov 2007, Michael Kerrisk wrote: Hey Davide, Where is the new timerfd API. In 2.6.24-rc3, I see the *old* API... Maybe Andrew stuffed the turkey with it? :) It was there. I remeber it was merged. Some screw up reverted it? - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Where is the new timerfd?
On Thu, 22 Nov 2007, Michael Kerrisk wrote: On Nov 22, 2007 6:34 PM, Davide Libenzi [EMAIL PROTECTED] wrote: On Thu, 22 Nov 2007, Michael Kerrisk wrote: Hey Davide, Where is the new timerfd API. In 2.6.24-rc3, I see the *old* API... Maybe Andrew stuffed the turkey with it? :) It was there. I remeber it was merged. Some screw up reverted it? t looks that way. I'm looking at the log now. It never went in actually. Andrew-san, what happened? - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Where is the new timerfd?
On Thu, 22 Nov 2007, Andrew Morton wrote: On Thu, 22 Nov 2007 11:46:13 -0800 (PST) Davide Libenzi [EMAIL PROTECTED] wrote: On Thu, 22 Nov 2007, Michael Kerrisk wrote: On Nov 22, 2007 6:34 PM, Davide Libenzi [EMAIL PROTECTED] wrote: On Thu, 22 Nov 2007, Michael Kerrisk wrote: Hey Davide, Where is the new timerfd API. In 2.6.24-rc3, I see the *old* API... Maybe Andrew stuffed the turkey with it? :) It was there. I remeber it was merged. Some screw up reverted it? t looks that way. I'm looking at the log now. It never went in actually. Andrew-san, what happened? Last I recall, we removed the API for 2.6.23 because we intended to do a different interface for 2.6.24. But I don't recall seeing any timerfd patches in maybe a month. Was sent on Sep 23, Subject: new timerfd API Do you want me to repost? - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch 4/4] Timerfd v2 - un-break CONFIG_TIMERFD
Remove the broken status to CONFIG_TIMERFD. Signed-off-by: Davide Libenzi [EMAIL PROTECTED] - Davide --- init/Kconfig |1 - 1 file changed, 1 deletion(-) Index: linux-2.6.mod/init/Kconfig === --- linux-2.6.mod.orig/init/Kconfig 2007-11-23 13:13:16.0 -0800 +++ linux-2.6.mod/init/Kconfig 2007-11-23 13:36:42.0 -0800 @@ -566,7 +566,6 @@ config TIMERFD bool Enable timerfd() system call if EMBEDDED select ANON_INODES - depends on BROKEN default y help Enable the timerfd() system call that allows to receive timer - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch 1/4] Timerfd v2 - introduce a new hrtimer_forward_now() function
I think that advancing the timer against the timer's current now can be a pretty common usage, so, w/out exposing hrtimer's internals, we add a new hrtimer_forward_now() function. Signed-off-by: Davide Libenzi [EMAIL PROTECTED] - Davide --- include/linux/hrtimer.h |7 +++ 1 file changed, 7 insertions(+) Index: linux-2.6.mod/include/linux/hrtimer.h === --- linux-2.6.mod.orig/include/linux/hrtimer.h 2007-11-23 13:13:21.0 -0800 +++ linux-2.6.mod/include/linux/hrtimer.h 2007-11-23 13:36:36.0 -0800 @@ -298,6 +298,13 @@ extern unsigned long hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval); +/* Forward a hrtimer so it expires after the hrtimer's current now */ +static inline unsigned long hrtimer_forward_now(struct hrtimer *timer, + ktime_t interval) +{ + return hrtimer_forward(timer, timer-base-get_time(), interval); +} + /* Precise sleep: */ extern long hrtimer_nanosleep(struct timespec *rqtp, struct timespec *rmtp, - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Where is the new timerfd?
On Fri, 23 Nov 2007, Andrew Morton wrote: I suppose this means that timerfd will only go in for 2.6.25. I don't have a problem with that, but we better make sure that the existing timerfd in 2.6.24 is still disabled. (Andrew had a one liner for that, but I haven't checked if it's in place.) I have no timerfd patches here. Yes, it's disabled, and yes, I'll repost today ... - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Where is the new timerfd?
On Fri, 23 Nov 2007, Ulrich Drepper wrote: On Nov 23, 2007 9:29 AM, Davide Libenzi [EMAIL PROTECTED] wrote: Yes, it's disabled, and yes, I'll repost today ... I haven't seen the patch and don't feel like searching. So I say it here: please mak sure you add a flags parameter to the system call itself (instead of adding it on as for eventfd and signalfd). We need to be able to use O_CLOEXEC some way or another. I'm more then OK about adding a flags parameter. If it was for me, I'd add it even to eventfd and signalfd. I asked Linus if he was OK about adding the flags parameter to all. He didn't reply, and I read that as no. - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch 2/4] Timerfd v2 - new timerfd API
This is the new timerfd API as it is implemented by the following patch: int timerfd_create(int clockid, int flags); int timerfd_settime(int ufd, int flags, const struct itimerspec *utmr, struct itimerspec *otmr); int timerfd_gettime(int ufd, struct itimerspec *otmr); The timerfd_create() API creates an un-programmed timerfd fd. The clockid parameter can be either CLOCK_MONOTONIC or CLOCK_REALTIME. The timerfd_settime() API give new settings by the timerfd fd, by optionally retrieving the previous expiration time (in case the otmr parameter is not NULL). The time value specified in utmr is absolute, if the TFD_TIMER_ABSTIME bit is set in the flags parameter. Otherwise it's a relative time. The timerfd_gettime() API returns the next expiration time of the timer, or {0, 0} if the timerfd has not been set yet. Like the previous timerfd API implementation, read(2) and poll(2) are supported (with the same interface). Here's a simple test program I used to exercise the new timerfd APIs: http://www.xmailserver.org/timerfd-test2.c Signed-off-by: Davide Libenzi [EMAIL PROTECTED] - Davide --- fs/compat.c | 32 ++- fs/timerfd.c | 197 ++- include/linux/compat.h |7 + include/linux/syscalls.h |7 + 4 files changed, 164 insertions(+), 79 deletions(-) Index: linux-2.6.mod/fs/timerfd.c === --- linux-2.6.mod.orig/fs/timerfd.c 2007-11-23 13:13:19.0 -0800 +++ linux-2.6.mod/fs/timerfd.c 2007-11-23 13:36:39.0 -0800 @@ -25,13 +25,15 @@ struct hrtimer tmr; ktime_t tintv; wait_queue_head_t wqh; + u64 ticks; int expired; + int clockid; }; /* * This gets called when the timer event triggers. We set the expired * flag, but we do not re-arm the timer (in case it's necessary, - * tintv.tv64 != 0) until the timer is read. + * tintv.tv64 != 0) until the timer is accessed. */ static enum hrtimer_restart timerfd_tmrproc(struct hrtimer *htmr) { @@ -40,13 +42,14 @@ spin_lock_irqsave(ctx-wqh.lock, flags); ctx-expired = 1; + ctx-ticks++; wake_up_locked(ctx-wqh); spin_unlock_irqrestore(ctx-wqh.lock, flags); return HRTIMER_NORESTART; } -static void timerfd_setup(struct timerfd_ctx *ctx, int clockid, int flags, +static void timerfd_setup(struct timerfd_ctx *ctx, int flags, const struct itimerspec *ktmr) { enum hrtimer_mode htmode; @@ -57,8 +60,9 @@ texp = timespec_to_ktime(ktmr-it_value); ctx-expired = 0; + ctx-ticks = 0; ctx-tintv = timespec_to_ktime(ktmr-it_interval); - hrtimer_init(ctx-tmr, clockid, htmode); + hrtimer_init(ctx-tmr, ctx-clockid, htmode); ctx-tmr.expires = texp; ctx-tmr.function = timerfd_tmrproc; if (texp.tv64 != 0) @@ -83,7 +87,7 @@ poll_wait(file, ctx-wqh, wait); spin_lock_irqsave(ctx-wqh.lock, flags); - if (ctx-expired) + if (ctx-ticks) events |= POLLIN; spin_unlock_irqrestore(ctx-wqh.lock, flags); @@ -102,11 +106,11 @@ return -EINVAL; spin_lock_irq(ctx-wqh.lock); res = -EAGAIN; - if (!ctx-expired !(file-f_flags O_NONBLOCK)) { + if (!ctx-ticks !(file-f_flags O_NONBLOCK)) { __add_wait_queue(ctx-wqh, wait); for (res = 0;;) { set_current_state(TASK_INTERRUPTIBLE); - if (ctx-expired) { + if (ctx-ticks) { res = 0; break; } @@ -121,22 +125,21 @@ __remove_wait_queue(ctx-wqh, wait); __set_current_state(TASK_RUNNING); } - if (ctx-expired) { - ctx-expired = 0; - if (ctx-tintv.tv64 != 0) { + if (ctx-ticks) { + ticks = ctx-ticks; + if (ctx-expired ctx-tintv.tv64) { /* * If tintv.tv64 != 0, this is a periodic timer that * needs to be re-armed. We avoid doing it in the timer * callback to avoid DoS attacks specifying a very * short timer period. */ - ticks = (u64) - hrtimer_forward(ctx-tmr, - hrtimer_cb_get_time(ctx-tmr), - ctx-tintv); + ticks += (u64) hrtimer_forward_now(ctx-tmr, + ctx-tintv) - 1; hrtimer_restart(ctx-tmr); - } else - ticks = 1; + } + ctx-expired
[patch 3/4] Timerfd v2 - wire the new timerfd API to the x86 family
Wires up the new timerfd API to the x86 family. Signed-off-by: Davide Libenzi [EMAIL PROTECTED] - Davide --- arch/x86/ia32/ia32entry.S |4 +++- arch/x86/kernel/syscall_table_32.S |4 +++- include/asm-x86/unistd_32.h|6 -- include/asm-x86/unistd_64.h|9 +++-- 4 files changed, 17 insertions(+), 6 deletions(-) Index: linux-2.6.mod/include/asm-x86/unistd_32.h === --- linux-2.6.mod.orig/include/asm-x86/unistd_32.h 2007-11-23 13:13:18.0 -0800 +++ linux-2.6.mod/include/asm-x86/unistd_32.h 2007-11-23 13:36:40.0 -0800 @@ -327,13 +327,15 @@ #define __NR_epoll_pwait 319 #define __NR_utimensat 320 #define __NR_signalfd 321 -#define __NR_timerfd 322 +#define __NR_timerfd_create322 #define __NR_eventfd 323 #define __NR_fallocate 324 +#define __NR_timerfd_settime 325 +#define __NR_timerfd_gettime 326 #ifdef __KERNEL__ -#define NR_syscalls 325 +#define NR_syscalls 327 #define __ARCH_WANT_IPC_PARSE_VERSION #define __ARCH_WANT_OLD_READDIR Index: linux-2.6.mod/include/asm-x86/unistd_64.h === --- linux-2.6.mod.orig/include/asm-x86/unistd_64.h 2007-11-23 13:13:18.0 -0800 +++ linux-2.6.mod/include/asm-x86/unistd_64.h 2007-11-23 13:36:40.0 -0800 @@ -629,12 +629,17 @@ __SYSCALL(__NR_epoll_pwait, sys_epoll_pwait) #define __NR_signalfd 282 __SYSCALL(__NR_signalfd, sys_signalfd) -#define __NR_timerfd 283 -__SYSCALL(__NR_timerfd, sys_timerfd) +#define __NR_timerfd_create283 +__SYSCALL(__NR_timerfd_create, sys_timerfd_create) #define __NR_eventfd 284 __SYSCALL(__NR_eventfd, sys_eventfd) #define __NR_fallocate 285 __SYSCALL(__NR_fallocate, sys_fallocate) +#define __NR_timerfd_settime 286 +__SYSCALL(__NR_timerfd_settime, sys_timerfd_settime) +#define __NR_timerfd_gettime 287 +__SYSCALL(__NR_timerfd_gettime, sys_timerfd_gettime) + #ifndef __NO_STUBS #define __ARCH_WANT_OLD_READDIR Index: linux-2.6.mod/arch/x86/kernel/syscall_table_32.S === --- linux-2.6.mod.orig/arch/x86/kernel/syscall_table_32.S 2007-11-23 13:13:18.0 -0800 +++ linux-2.6.mod/arch/x86/kernel/syscall_table_32.S2007-11-23 13:36:40.0 -0800 @@ -321,6 +321,8 @@ .long sys_epoll_pwait .long sys_utimensat /* 320 */ .long sys_signalfd - .long sys_timerfd + .long sys_timerfd_create .long sys_eventfd .long sys_fallocate + .long sys_timerfd_settime /* 325 */ + .long sys_timerfd_gettime Index: linux-2.6.mod/arch/x86/ia32/ia32entry.S === --- linux-2.6.mod.orig/arch/x86/ia32/ia32entry.S2007-11-23 13:13:18.0 -0800 +++ linux-2.6.mod/arch/x86/ia32/ia32entry.S 2007-11-23 13:36:40.0 -0800 @@ -723,7 +723,9 @@ .quad sys_epoll_pwait .quad compat_sys_utimensat /* 320 */ .quad compat_sys_signalfd - .quad compat_sys_timerfd + .quad sys_timerfd_create .quad sys_eventfd .quad sys32_fallocate + .quad compat_sys_timerfd_settime/* 325 */ + .quad compat_sys_timerfd_gettime ia32_syscall_end: - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/4] Timerfd v2 - new timerfd API
On Sat, 24 Nov 2007, Michael Kerrisk wrote: +asmlinkage long sys_timerfd_create(int clockid, int flags) { - int error; + int error, ufd; struct timerfd_ctx *ctx; struct file *file; struct inode *inode; - struct itimerspec ktmr; - - if (copy_from_user(ktmr, utmr, sizeof(ktmr))) - return -EFAULT; if (clockid != CLOCK_MONOTONIC clockid != CLOCK_REALTIME) return -EINVAL; Could I suggest here, the following placeholder addition: if (flags != 0) return -EINVAL; Make sense, will repost. - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch 1/4] Timerfd v3 - introduce a new hrtimer_forward_now() function
I think that advancing the timer against the timer's current now can be a pretty common usage, so, w/out exposing hrtimer's internals, we add a new hrtimer_forward_now() function. Signed-off-by: Davide Libenzi [EMAIL PROTECTED] - Davide --- include/linux/hrtimer.h |7 +++ 1 file changed, 7 insertions(+) Index: linux-2.6.mod/include/linux/hrtimer.h === --- linux-2.6.mod.orig/include/linux/hrtimer.h 2007-11-23 13:55:16.0 -0800 +++ linux-2.6.mod/include/linux/hrtimer.h 2007-11-24 12:48:05.0 -0800 @@ -298,6 +298,13 @@ extern unsigned long hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval); +/* Forward a hrtimer so it expires after the hrtimer's current now */ +static inline unsigned long hrtimer_forward_now(struct hrtimer *timer, + ktime_t interval) +{ + return hrtimer_forward(timer, timer-base-get_time(), interval); +} + /* Precise sleep: */ extern long hrtimer_nanosleep(struct timespec *rqtp, struct timespec *rmtp, - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch 2/4] Timerfd v3 - new timerfd API
This is the new timerfd API as it is implemented by the following patch: int timerfd_create(int clockid, int flags); int timerfd_settime(int ufd, int flags, const struct itimerspec *utmr, struct itimerspec *otmr); int timerfd_gettime(int ufd, struct itimerspec *otmr); The timerfd_create() API creates an un-programmed timerfd fd. The clockid parameter can be either CLOCK_MONOTONIC or CLOCK_REALTIME. The timerfd_settime() API give new settings by the timerfd fd, by optionally retrieving the previous expiration time (in case the otmr parameter is not NULL). The time value specified in utmr is absolute, if the TFD_TIMER_ABSTIME bit is set in the flags parameter. Otherwise it's a relative time. The timerfd_gettime() API returns the next expiration time of the timer, or {0, 0} if the timerfd has not been set yet. Like the previous timerfd API implementation, read(2) and poll(2) are supported (with the same interface). Here's a simple test program I used to exercise the new timerfd APIs: http://www.xmailserver.org/timerfd-test2.c Signed-off-by: Davide Libenzi [EMAIL PROTECTED] - Davide --- fs/compat.c | 32 ++- fs/timerfd.c | 199 ++- include/linux/compat.h |7 + include/linux/syscalls.h |7 + 4 files changed, 166 insertions(+), 79 deletions(-) Index: linux-2.6.mod/fs/timerfd.c === --- linux-2.6.mod.orig/fs/timerfd.c 2007-11-23 13:55:16.0 -0800 +++ linux-2.6.mod/fs/timerfd.c 2007-11-24 12:49:21.0 -0800 @@ -25,13 +25,15 @@ struct hrtimer tmr; ktime_t tintv; wait_queue_head_t wqh; + u64 ticks; int expired; + int clockid; }; /* * This gets called when the timer event triggers. We set the expired * flag, but we do not re-arm the timer (in case it's necessary, - * tintv.tv64 != 0) until the timer is read. + * tintv.tv64 != 0) until the timer is accessed. */ static enum hrtimer_restart timerfd_tmrproc(struct hrtimer *htmr) { @@ -40,13 +42,14 @@ spin_lock_irqsave(ctx-wqh.lock, flags); ctx-expired = 1; + ctx-ticks++; wake_up_locked(ctx-wqh); spin_unlock_irqrestore(ctx-wqh.lock, flags); return HRTIMER_NORESTART; } -static void timerfd_setup(struct timerfd_ctx *ctx, int clockid, int flags, +static void timerfd_setup(struct timerfd_ctx *ctx, int flags, const struct itimerspec *ktmr) { enum hrtimer_mode htmode; @@ -57,8 +60,9 @@ texp = timespec_to_ktime(ktmr-it_value); ctx-expired = 0; + ctx-ticks = 0; ctx-tintv = timespec_to_ktime(ktmr-it_interval); - hrtimer_init(ctx-tmr, clockid, htmode); + hrtimer_init(ctx-tmr, ctx-clockid, htmode); ctx-tmr.expires = texp; ctx-tmr.function = timerfd_tmrproc; if (texp.tv64 != 0) @@ -83,7 +87,7 @@ poll_wait(file, ctx-wqh, wait); spin_lock_irqsave(ctx-wqh.lock, flags); - if (ctx-expired) + if (ctx-ticks) events |= POLLIN; spin_unlock_irqrestore(ctx-wqh.lock, flags); @@ -102,11 +106,11 @@ return -EINVAL; spin_lock_irq(ctx-wqh.lock); res = -EAGAIN; - if (!ctx-expired !(file-f_flags O_NONBLOCK)) { + if (!ctx-ticks !(file-f_flags O_NONBLOCK)) { __add_wait_queue(ctx-wqh, wait); for (res = 0;;) { set_current_state(TASK_INTERRUPTIBLE); - if (ctx-expired) { + if (ctx-ticks) { res = 0; break; } @@ -121,22 +125,21 @@ __remove_wait_queue(ctx-wqh, wait); __set_current_state(TASK_RUNNING); } - if (ctx-expired) { - ctx-expired = 0; - if (ctx-tintv.tv64 != 0) { + if (ctx-ticks) { + ticks = ctx-ticks; + if (ctx-expired ctx-tintv.tv64) { /* * If tintv.tv64 != 0, this is a periodic timer that * needs to be re-armed. We avoid doing it in the timer * callback to avoid DoS attacks specifying a very * short timer period. */ - ticks = (u64) - hrtimer_forward(ctx-tmr, - hrtimer_cb_get_time(ctx-tmr), - ctx-tintv); + ticks += (u64) hrtimer_forward_now(ctx-tmr, + ctx-tintv) - 1; hrtimer_restart(ctx-tmr); - } else - ticks = 1; + } + ctx-expired
[patch 3/4] Timerfd v3 - wire the new timerfd API to the x86 family
Wires up the new timerfd API to the x86 family. Signed-off-by: Davide Libenzi [EMAIL PROTECTED] - Davide --- arch/x86/ia32/ia32entry.S |4 +++- arch/x86/kernel/syscall_table_32.S |4 +++- include/asm-x86/unistd_32.h|6 -- include/asm-x86/unistd_64.h|9 +++-- 4 files changed, 17 insertions(+), 6 deletions(-) Index: linux-2.6.mod/include/asm-x86/unistd_32.h === --- linux-2.6.mod.orig/include/asm-x86/unistd_32.h 2007-11-23 13:55:15.0 -0800 +++ linux-2.6.mod/include/asm-x86/unistd_32.h 2007-11-24 12:49:28.0 -0800 @@ -327,13 +327,15 @@ #define __NR_epoll_pwait 319 #define __NR_utimensat 320 #define __NR_signalfd 321 -#define __NR_timerfd 322 +#define __NR_timerfd_create322 #define __NR_eventfd 323 #define __NR_fallocate 324 +#define __NR_timerfd_settime 325 +#define __NR_timerfd_gettime 326 #ifdef __KERNEL__ -#define NR_syscalls 325 +#define NR_syscalls 327 #define __ARCH_WANT_IPC_PARSE_VERSION #define __ARCH_WANT_OLD_READDIR Index: linux-2.6.mod/include/asm-x86/unistd_64.h === --- linux-2.6.mod.orig/include/asm-x86/unistd_64.h 2007-11-23 13:55:15.0 -0800 +++ linux-2.6.mod/include/asm-x86/unistd_64.h 2007-11-24 12:49:28.0 -0800 @@ -629,12 +629,17 @@ __SYSCALL(__NR_epoll_pwait, sys_epoll_pwait) #define __NR_signalfd 282 __SYSCALL(__NR_signalfd, sys_signalfd) -#define __NR_timerfd 283 -__SYSCALL(__NR_timerfd, sys_timerfd) +#define __NR_timerfd_create283 +__SYSCALL(__NR_timerfd_create, sys_timerfd_create) #define __NR_eventfd 284 __SYSCALL(__NR_eventfd, sys_eventfd) #define __NR_fallocate 285 __SYSCALL(__NR_fallocate, sys_fallocate) +#define __NR_timerfd_settime 286 +__SYSCALL(__NR_timerfd_settime, sys_timerfd_settime) +#define __NR_timerfd_gettime 287 +__SYSCALL(__NR_timerfd_gettime, sys_timerfd_gettime) + #ifndef __NO_STUBS #define __ARCH_WANT_OLD_READDIR Index: linux-2.6.mod/arch/x86/kernel/syscall_table_32.S === --- linux-2.6.mod.orig/arch/x86/kernel/syscall_table_32.S 2007-11-23 13:55:16.0 -0800 +++ linux-2.6.mod/arch/x86/kernel/syscall_table_32.S2007-11-24 12:49:28.0 -0800 @@ -321,6 +321,8 @@ .long sys_epoll_pwait .long sys_utimensat /* 320 */ .long sys_signalfd - .long sys_timerfd + .long sys_timerfd_create .long sys_eventfd .long sys_fallocate + .long sys_timerfd_settime /* 325 */ + .long sys_timerfd_gettime Index: linux-2.6.mod/arch/x86/ia32/ia32entry.S === --- linux-2.6.mod.orig/arch/x86/ia32/ia32entry.S2007-11-23 13:55:16.0 -0800 +++ linux-2.6.mod/arch/x86/ia32/ia32entry.S 2007-11-24 12:49:28.0 -0800 @@ -723,7 +723,9 @@ .quad sys_epoll_pwait .quad compat_sys_utimensat /* 320 */ .quad compat_sys_signalfd - .quad compat_sys_timerfd + .quad sys_timerfd_create .quad sys_eventfd .quad sys32_fallocate + .quad compat_sys_timerfd_settime/* 325 */ + .quad compat_sys_timerfd_gettime ia32_syscall_end: - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch 4/4] Timerfd v3 - un-break CONFIG_TIMERFD
Remove the broken status to CONFIG_TIMERFD. Signed-off-by: Davide Libenzi [EMAIL PROTECTED] - Davide --- init/Kconfig |1 - 1 file changed, 1 deletion(-) Index: linux-2.6.mod/init/Kconfig === --- linux-2.6.mod.orig/init/Kconfig 2007-11-23 13:55:15.0 -0800 +++ linux-2.6.mod/init/Kconfig 2007-11-24 12:49:30.0 -0800 @@ -566,7 +566,6 @@ config TIMERFD bool Enable timerfd() system call if EMBEDDED select ANON_INODES - depends on BROKEN default y help Enable the timerfd() system call that allows to receive timer - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets
On Mon, 26 Nov 2007, H. Peter Anvin wrote: Ingo Molnar wrote: So it's not like sys_indirect() would break some magic pristine state of a flat parameter space - on the contrary, most of the nontrivial syscalls take pointers to structures or pointers to streams of information. The parameter count histogram i believe further underlines this point: #args #syscalls - 0 22 1 51 2 83 3 85 4 40 5 23 68 the natural 'center' of function call parameter counts is around 1-4 parameters, and that is natural. (most operators that the human brain prefers to operate with are like that - having higher complexity than that often defeats the purpose of getting an API used by ... humans.) I was preparing a response to Linus' email, but I really feel this needs to be addressed specifically. When it comes to dealing with the operator-visible state, what matters is what happens on the API level, NOT on the system call level. Furthermore, the proposed sys_indirect interface just means that there are parameters hidden from immediately view, even though they fundamentally change the operation performed, and that it is much harder to correlate, say, the output of strace(1) with what actually happened in the program. So from a *psychological* point of view, this seems to be an insane design choice. I think there are two different issues. One is the proliferation of system calls, and the other is the sane design of internal kernel APIs. The first one is not very interesting to me, since I don't have any strong opinions in either cases. The second is the one I'd care most. I think that, whatever is the solution used to address the first, internal kernel APIs should be designed so that parameters flow down from the system call to the parameter's user code. IMO, besides very few cases where it could make some sense [*], setting some thread-global bits in the upper layer, to be magically picked up by code in the lower layers, does not lead to readable interfaces. [*] Things that already read from a shared context, that is already exposed to the user through some sort of set/get APIs. - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/4] Timerfd v3 - new timerfd API
On Tue, 27 Nov 2007, Andrew Morton wrote: On Sun, 25 Nov 2007 14:14:19 -0800 Davide Libenzi [EMAIL PROTECTED] wrote: +static struct file *timerfd_fget(int fd) +{ + struct file *file; + + file = fget(fd); + if (!file) + return ERR_PTR(-EBADF); + if (file-f_op != timerfd_fops) { + fput(file); + return ERR_PTR(-EINVAL); + } + + return file; +} I suppose we could use fget_light() in here sometime. It is significantly quicker in microbenchmarks. Or it was - nobody has checked that in a few years afaik. Should I now? Half of the internet traffic of the last three month was generated by me posting those timerfd patches :) - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/4] Timerfd v3 - new timerfd API
On Tue, 27 Nov 2007, Andrew Morton wrote: On Tue, 27 Nov 2007 12:47:46 -0800 (PST) Davide Libenzi [EMAIL PROTECTED] wrote: On Tue, 27 Nov 2007, Andrew Morton wrote: On Sun, 25 Nov 2007 14:14:19 -0800 Davide Libenzi [EMAIL PROTECTED] wrote: +static struct file *timerfd_fget(int fd) +{ + struct file *file; + + file = fget(fd); + if (!file) + return ERR_PTR(-EBADF); + if (file-f_op != timerfd_fops) { + fput(file); + return ERR_PTR(-EINVAL); + } + + return file; +} I suppose we could use fget_light() in here sometime. It is significantly quicker in microbenchmarks. Or it was - nobody has checked that in a few years afaik. Should I now? No great rush. It'd be fun to see if it actually makes any measurable difference on modern hardware (hint ;)). I was going to say BS, given that at the time of the tests, the files struct was protected by an rwlock whereas now there's a rcu one. But that seems not the case: http://www.xmailserver.org/fget-light-test.c $ fget-light-test -r 9 warming up ... testing non-shared ... time = 314.354000 ms testing shared ... time = 390.781000 ms And here is the oprofile output: [SHARED CASE] samples %app name symbol name 7436 28.9339 vmlinux __clear_user 2369 9.2179 vmlinux system_call 1710 6.6537 vmlinux fget_light 1244 4.8405 vmlinux inotify_dentry_parent_queue_event 1128 4.3891 vmlinux sys_read 1041 4.0506 libpthread-2.6.1.so __read_nocancel 1027 3.9961 Xorg (no symbols) 978 3.8054 vmlinux read_zero 755 2.9377 vmlinux vfs_read 545 2.1206 vmlinux inotify_inode_queue_event 507 1.9728 vmlinux sysret_check 414 1.6109 vmlinux rw_verify_area 405 1.5759 vmlinux unix_poll 371 1.4436 nvidia (no symbols) 333 1.2957 vmlinux acpi_pm_read 311 1.2101 nvidia_drv.so(no symbols) 290 1.1284 vmlinux do_select 288 1.1206 vmlinux dnotify_parent 253 0.9844 libc-2.6.1.so(no symbols) 232 0.9027 bash (no symbols) 216 0.8405 libpthread-2.6.1.so __pthread_enable_asynccancel [UNSHARED CASE] samples %app name symbol name 6542 27.6922 vmlinux __clear_user 4074 17.2452 vmlinux vfs_read 1266 5.3590 vmlinux inotify_inode_queue_event 1091 4.6182 Xorg (no symbols) 1059 4.4827 vmlinux system_call 937 3.9663 libpthread-2.6.1.so __read_nocancel 544 2.3027 vmlinux clear_user 484 2.0488 vmlinux dnotify_parent 445 1.8837 vmlinux read_zero 438 1.8540 vmlinux sysret_check 414 1.7525 vmlinux unix_poll 407 1.7228 nvidia (no symbols) 389 1.6466 vmlinux acpi_pm_read 315 1.3334 nvidia_drv.so(no symbols) 312 1.3207 vmlinux inotify_dentry_parent_queue_event 312 1.3207 vmlinux sys_read 305 1.2911 vmlinux rw_verify_area 304 1.2868 vmlinux do_select 222 0.9397 libc-2.6.1.so(no symbols) 214 0.9059 bash (no symbols) 196 0.8297 fget-light-test read_test 185 0.7831 vmlinux fget_light You can clearly notice the fget_light() drop out of the relevance window. BTW, are all those notify noises supposed to be there? - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.24-rc3-mm2 - Build Failure on powerpc timerfd() undeclared
On Wed, 28 Nov 2007, Andrew Morton wrote: On Wed, 28 Nov 2007 14:32:07 +0100 Arnd Bergmann [EMAIL PROTECTED] wrote: On Wednesday 28 November 2007, Kamalesh Babulal wrote: Kernel build fails, with build error CC arch/powerpc/platforms/cell/spu_callbacks.o In file included from arch/powerpc/platforms/cell/spu_callbacks.c:49: include/asm/systbl.h:312: error: ‘sys_timerfd’ undeclared here (not in a function) make[2]: *** [arch/powerpc/platforms/cell/spu_callbacks.o] Error 1 make[1]: *** [arch/powerpc/platforms/cell] Error 2 make: *** [arch/powerpc/platforms] Error 2 I guess all architectures except x86 are currently broken because they reference the old sys_timerfd function. None of them were broken in my testing and I'm unsure why powerpc broke here. This patch should add the missing bits to powerpc. Because the patches in -mm left the stubs in place in sys_ni.c and powerpc _should_ have (incorrectly) picked those up. My fault. I forgot to update sys_ni.c with the new functions (and with the sys_timerfd-sys_timerfd_create name change). Do you want a patch Andrew? - Davide
[patch] update sys_ni.c with the new timerfd syscalls
Update sys_ni.c with the new timerfd syscalls. Signed-off-by: Davide Libenzi [EMAIL PROTECTED] - Davide --- kernel/sys_ni.c |4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) Index: linux-2.6.mod/kernel/sys_ni.c === --- linux-2.6.mod.orig/kernel/sys_ni.c 2007-11-28 11:44:48.0 -0800 +++ linux-2.6.mod/kernel/sys_ni.c 2007-11-28 11:46:01.0 -0800 @@ -153,7 +153,9 @@ /* New file descriptors */ cond_syscall(sys_signalfd); -cond_syscall(sys_timerfd); +cond_syscall(sys_timerfd_create); +cond_syscall(sys_timerfd_settime); +cond_syscall(sys_timerfd_gettime); cond_syscall(compat_sys_signalfd); cond_syscall(compat_sys_timerfd); cond_syscall(sys_eventfd); - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Tesing of / bugs in new timerfd API
On Thu, 13 Dec 2007, Michael Kerrisk wrote: Davide, Andrew, I applied Davide's v3 patchset (sent into LKML on 25 Nov) against 2.4.24-rc3, and did various tests (all on x86). Several tests were done using the program at the foot of this mail. Various others were done by cobbling together bits of code that I haven't included here. Thanks for such a thorough test Michael. Tested: after setting a CLOCK_REALTIME timer with the TFD_TIMER_ABSTIME flag to expire at some time in the past with a non-zero interval (e.g., setting 100 seconds in the past, with a 5 second interval), read() from the file descriptor returns the correct number of expirations (e.g., 20). This seems a reasonable thing to do, I suppose. However, while playing around to test this, I found what looks like a bug (see below). BUG 1: However, this test exposed what looks like a bug: if I set a CLOCK_REALTIME clock to expire in the past, with a very small interval, then the maximum number of expirations that can be returned via read seems to be limited to 32 bits, even though we have a 64-bit value for returning this information. I haven't checked the kernel source to determine where this bug is. Yes, true. Right now hrtimer_forward() (that we use to get the missed ticks) returns an unsigned long, that on 32 bit is (duh!) 32 bit :) But hrtimer_forward() has all in place to just return an u64. So, Thomas, would it be OK to have hrtimer_forward() (and the new hrtimer_forward_now()) to return a u64? BUG 2: The last sentence does not match the implementation. (Nor is it consistent with the behavior of POSIX timers. And I *think* things did work correctly in the original timerfd() implementation, but I have not gone back to check.) Suppose that we set an absolute timer to expire 100 seconds in the future. Then according to this sentence of the man page then each subsequent call to timerfd_gettime() should retrun an itimerspec structure whose it_value steadily decreases from 100 to 0 (when the timer expires). (This is the behavior in the analogous situation with POSIX timers and with setitimer()/getitimer().) However, the implementation of timerfd_gettime() always returns the time when the timer would next expire, and this value depends on whether TFD_TIMER_ABSTIME was specified when setting the timer. This is been like that from the beginning of the new API. So no, the previous was behaving exactly the same WRT this feature. Is this something really needed? - Davide -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Tesing of / bugs in new timerfd API
On Thu, 13 Dec 2007, Michael Kerrisk wrote: BUG 2: The last sentence does not match the implementation. (Nor is it consistent with the behavior of POSIX timers. And I *think* things did work correctly in the original timerfd() implementation, but I have not gone back to check.) Suppose that we set an absolute timer to expire 100 seconds in the future. Then according to this sentence of the man page then each subsequent call to timerfd_gettime() should retrun an itimerspec structure whose it_value steadily decreases from 100 to 0 (when the timer expires). (This is the behavior in the analogous situation with POSIX timers and with setitimer()/getitimer().) However, the implementation of timerfd_gettime() always returns the time when the timer would next expire, and this value depends on whether TFD_TIMER_ABSTIME was specified when setting the timer. This is been like that from the beginning of the new API. So no, the previous was behaving exactly the same WRT this feature. Is this something really needed? Three reasons that I think of off the top of my head (and there might well be more reasosn) why this should change: a) consistency with the other two timer APIs (POSIX timers (timer_create(), etc.), and setitimer()/getitimer()). b) Returning the amount of time until the next expiration is more useful to userland: I'd say the most common case is for userland to want to know how long until the next expiration occurs, or to adjust that time by adding/subtracting some value to the existing setting. That is difficult to with the current implementation: the userland app must use timer_gettime(), call clock_gettime(), and calculate the difference between the two, in order to know how much time remains until the next timer expiration. Bah, I don't want to argue with that because otherwise it starts going towards the typical use cases listing, that can be found the same exact reasons to have one or the other way. And we end up wasting lots of time. We'd just have to move another thing that userspace could do, inside the kernel (subtract the current value returned by hrtimer_forward() in -expires with now). c) Currently, the information returned differs depending on whether TFD_TIMER_ABSTIME is specified -- this is not the case for the analogous situation for POSIX timers. For POSIX timers, the returned setting is always the amount of time until the timer next expires. This inconsistency is messy for applications -- the application may not (be able to) know whether or not the timer it is examining was set using TFD_TIMER_ABSTIME (the timerfd might have been created by a library, for example). Hmm... the time returned is always the next absolute time when the next timer tick will go off. It does not depend on TFD_TIMER_ABSTIME. I return the -expires field of the timer, and hrtimer_forward() sets it to the next absolute time. - Davide -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Tesing of / bugs in new timerfd API
On Fri, 14 Dec 2007, Michael Kerrisk wrote: You snipped my example that demonstrated the problem. Both of the following runs create a timer that expires 10 seconds from now, but observe the difference in the value returned by timerfd_gettime(): $ ./timerfd_test 10 # does not use TFD_TIMER_ABSTIME Initial setting for settime: value=10.000, interval=0.000 ./timerfd_test g (elapsed time= 1) Current value: value=346.448, interval=0.000 $ ./timerfd_test -a 10 # uses TFD_TIMER_ABSTIME Initial setting for settime: value=1197630855.254, interval=0.000 ./timerfd_test g (elapsed time= 1) Current value: value=1197630855.254, interval=0.000 Either there's an inconsistency here depending on the use of TFD_TIMER_ABSTIME, or there is a bug in my understanding or my test program (but so far I haven't spotted that bug ;-).). Can you try the two patches below? I tried them on my 32 bit box (one of the rare beasts still lingering around here) and it seems to be working fine (those go on top of the previous ones). This fixed the 32 bit tick-count truncation, and makes the time returned to be the remaining time till the next expiration. - Davide --- fs/timerfd.c|6 +++--- include/linux/hrtimer.h | 10 +- kernel/hrtimer.c|9 - kernel/posix-timers.c |9 + 4 files changed, 17 insertions(+), 17 deletions(-) Index: linux-2.6.mod/include/linux/hrtimer.h === --- linux-2.6.mod.orig/include/linux/hrtimer.h 2007-12-13 16:37:36.0 -0800 +++ linux-2.6.mod/include/linux/hrtimer.h 2007-12-14 16:05:23.0 -0800 @@ -295,12 +295,12 @@ } /* Forward a hrtimer so it expires after now: */ -extern unsigned long +extern u64 hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval); /* Forward a hrtimer so it expires after the hrtimer's current now */ -static inline unsigned long hrtimer_forward_now(struct hrtimer *timer, - ktime_t interval) +static inline u64 hrtimer_forward_now(struct hrtimer *timer, + ktime_t interval) { return hrtimer_forward(timer, timer-base-get_time(), interval); } @@ -322,9 +322,9 @@ extern void __init hrtimers_init(void); #if BITS_PER_LONG 64 -extern unsigned long ktime_divns(const ktime_t kt, s64 div); +extern u64 ktime_divns(const ktime_t kt, s64 div); #else /* BITS_PER_LONG 64 */ -# define ktime_divns(kt, div) (unsigned long)((kt).tv64 / (div)) +# define ktime_divns(kt, div) (u64)((kt).tv64 / (div)) #endif /* Show pending timers: */ Index: linux-2.6.mod/kernel/hrtimer.c === --- linux-2.6.mod.orig/kernel/hrtimer.c 2007-12-13 16:37:53.0 -0800 +++ linux-2.6.mod/kernel/hrtimer.c 2007-12-13 16:41:42.0 -0800 @@ -306,7 +306,7 @@ /* * Divide a ktime value by a nanosecond value */ -unsigned long ktime_divns(const ktime_t kt, s64 div) +u64 ktime_divns(const ktime_t kt, s64 div) { u64 dclc, inc, dns; int sft = 0; @@ -321,7 +321,7 @@ dclc = sft; do_div(dclc, (unsigned long) div); - return (unsigned long) dclc; + return dclc; } #endif /* BITS_PER_LONG = 64 */ @@ -655,10 +655,9 @@ * Forward the timer expiry so it will expire in the future. * Returns the number of overruns. */ -unsigned long -hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval) +u64 hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval) { - unsigned long orun = 1; + u64 orun = 1; ktime_t delta; delta = ktime_sub(now, timer-expires); Index: linux-2.6.mod/kernel/posix-timers.c === --- linux-2.6.mod.orig/kernel/posix-timers.c2007-12-13 16:38:15.0 -0800 +++ linux-2.6.mod/kernel/posix-timers.c 2007-12-13 16:45:20.0 -0800 @@ -256,8 +256,9 @@ if (timr-it.real.interval.tv64 == 0) return; - timr-it_overrun += hrtimer_forward(timer, timer-base-get_time(), - timr-it.real.interval); + timr-it_overrun += (unsigned int) hrtimer_forward(timer, + timer-base-get_time(), + timr-it.real.interval); timr-it_overrun_last = timr-it_overrun; timr-it_overrun = -1; @@ -386,7 +387,7 @@ now = ktime_add(now, kj); } #endif - timr-it_overrun += + timr-it_overrun += (unsigned int) hrtimer_forward(timer, now, timr-it.real.interval); ret =
Re: 2.6.24-rc6: possible recursive locking detected
On Sat, 5 Jan 2008, Peter Zijlstra wrote: On Sat, 2008-01-05 at 17:53 +0100, Peter Zijlstra wrote: On Sat, 2008-01-05 at 18:12 +1100, Herbert Xu wrote: On Fri, Jan 04, 2008 at 09:30:49AM +0100, Ingo Molnar wrote: [ 1310.670986] = [ 1310.671690] [ INFO: possible recursive locking detected ] [ 1310.672097] 2.6.24-rc6 #1 [ 1310.672421] - [ 1310.672828] FahCore_a0.exe/3692 is trying to acquire lock: [ 1310.673238] (q-lock){++..}, at: [c011544b] __wake_up+0x1b/0x50 [ 1310.673869] [ 1310.673870] but task is already holding lock: [ 1310.674567] (q-lock){++..}, at: [c011544b] __wake_up+0x1b/0x50 [ 1310.675267] [ 1310.675268] other info that might help us debug this: [ 1310.675952] 5 locks held by FahCore_a0.exe/3692: [ 1310.676334] #0: (rcu_read_lock){..--}, at: [c038b620] net_rx_action+0x60/0x1b0 [ 1310.677251] #1: (rcu_read_lock){..--}, at: [c0388d60] netif_receive_skb+0x100/0x470 [ 1310.677924] #2: (rcu_read_lock){..--}, at: [c03a7fb2] ip_local_deliver_finish+0x32/0x210 [ 1310.678460] #3: (clock-AF_INET){-.-?}, at: [c038164e] sock_def_readable+0x1e/0x80 [ 1310.679250] #4: (q-lock){++..}, at: [c011544b] __wake_up+0x1b/0x50 The net part might just be a red herring, since the problem is that __wake_up is somehow reentering itself. /* * Perform a safe wake up of the poll wait list. The problem is that * with the new callback'd wake up system, it is possible that the * poll callback is reentered from inside the call to wake_up() done * on the poll wait queue head. The rule is that we cannot reenter the * wake up code from the same task more than EP_MAX_POLLWAKE_NESTS times, * and we cannot reenter the same wait queue head at all. This will * enable to have a hierarchy of epoll file descriptor of no more than * EP_MAX_POLLWAKE_NESTS deep. We need the irq version of the spin lock * because this one gets called by the poll callback, that in turn is called * from inside a wake_up(), that might be called from irq context. */ Seems to suggest that the epoll code can indeed recurse into wakeup. Davide, Johannes, any ideas? Since EP_MAX_POLLWAKE_NESTS MAX_LOCKDEP_SUBCLASSES we could perhaps do something like: wake_up_nested(..., wake_nests); although I'm not quite sure that is correct, my understanding of this code is still fragile at best. I remember I talked with Arjan about this time ago. Basically, since 1) you can drop an epoll fd inside another epoll fd 2) callback-based wakeups are used, you can see a wake_up() from inside another wake_up(), but they will never refer to the same lock instance. Think about: dfd = socket(...); efd1 = epoll_create(); efd2 = epoll_create(); epoll_ctl(efd1, EPOLL_CTL_ADD, dfd, ...); epoll_ctl(efd2, EPOLL_CTL_ADD, efd1, ...); When a packet arrives to the device underneath dfd, the net code will issue a wake_up() on its poll wake list. Epoll (efd1) has installed a callback wakeup entry on that queue, and the wake_up() performed by the dfd net code will end up in ep_poll_callback(). At this point epoll (efd1) notices that it may have some event ready, so it needs to wake up the waiters on its poll wait list (efd2). So it calls ep_poll_safewake() that ends up in another wake_up(), after having checked about the recursion constraints. That are, no more than EP_MAX_POLLWAKE_NESTS, to avoid stack blasting. Never hit the same queue, to avoid loops like: epoll_ctl(efd2, EPOLL_CTL_ADD, efd1, ...); epoll_ctl(efd3, EPOLL_CTL_ADD, efd2, ...); epoll_ctl(efd4, EPOLL_CTL_ADD, efd3, ...); epoll_ctl(efd1, EPOLL_CTL_ADD, efd4, ...); The code if (tncur-wq == wq || ... prevents re-entering the same queue/lock. I don't know how the lockdep code works, so I can't say about wake_up_nested(). Although I have a feeling is not enough in this case. A solution may be to move the call to ep_poll_safewake() (that'd become a simple wake_up()) inside a tasklet or whatever is today trendy for delayed work. But his kinda scares me to be honest, since epoll has already a bunch of places where it could be asynchronously hit (plus performance regression will need to be verified). - Davide -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: epoll design problems with common fork/exec patterns
On Sat, 27 Oct 2007, Marc Lehmann wrote: Please provide some code to illustrate one exact problem you have. // assume there is an open epoll set that listens for events on fd 5 if (fork () = 0) { close (5); // fd 5 is now removed from the epoll set of the parent. _exit (0); } Hmmm ... what? I assume you know that: 1) A file descriptor is a userspace view/handle of a kernel object 2) The kernel object has a use-count for as many file descriptors that have been handed out to userspace 3) A close() decreases the internal counter by one 4) The kernel object gets effectively closed when the internal counter goes to zero 5) A fork() acts as a dup() on the file descriptors by hence bumping up its internal counter 6) Epoll removes the file from the set, when the *kernel* object gets closed (internal use-count goes to zero) With that in mind, how can the code snippet above trigger a removal from the epoll set? - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: epoll design problems with common fork/exec patterns
On Sat, 27 Oct 2007, Willy Tarreau wrote: On Sat, Oct 27, 2007 at 09:59:07AM -0700, Davide Libenzi wrote: On Sat, 27 Oct 2007, Marc Lehmann wrote: Please provide some code to illustrate one exact problem you have. // assume there is an open epoll set that listens for events on fd 5 if (fork () = 0) { close (5); // fd 5 is now removed from the epoll set of the parent. _exit (0); } Hmmm ... what? I assume you know that: 1) A file descriptor is a userspace view/handle of a kernel object 2) The kernel object has a use-count for as many file descriptors that have been handed out to userspace 3) A close() decreases the internal counter by one 4) The kernel object gets effectively closed when the internal counter goes to zero 5) A fork() acts as a dup() on the file descriptors by hence bumping up its internal counter 6) Epoll removes the file from the set, when the *kernel* object gets closed (internal use-count goes to zero) With that in mind, how can the code snippet above trigger a removal from the epoll set? Davide, from what I understand, Marc is not asking for the code above to remove the fd from the epoll set, but he's in fact complaining that he *observed* that the fd was removed from the epoll set in the *parent* process when the child closes it, which is of course not expected at all. As strange as it looks like, this might need investigation. It is possible that there is some strange bug somewhere in some kernel versions. That would be *really* strange, since epoll hooks in __fput() in order to perform proper cleanup. This means that, in the case above, the file will be really closed in the parent too. That, I think, would trigger way more serious problems in userspace. Marc, I think that if you indicate the last kernel version on which you observed this and provide a very short and easy reproducer, it would help everyone investigating this. Basically something which reports OK or KO. Of course. That'd be great. - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: epoll design problems with common fork/exec patterns
On Sat, 27 Oct 2007, David Schwartz wrote: I don't see how that can be. Suppose I add fd 8 to an epoll set. Suppose fd 5 is a dup of fd 8. Now, I close fd 8. How can fd 8 remain in my epoll set, since there no longer is an fd 8? Events on files registered for epoll notification are reported by descriptor, so the set membership has to be associated (as reflected into userspace) with the descriptor, not the file. Eric already answered to your question (epoll deals with internal kernel objects - aka file*). I just want to answer this one for another reason. WTF is wrong with all of you Cc-list-trimmers? Could you *please* stop trimming Cc-lists? - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: epoll design problems with common fork/exec patterns
On Sun, 28 Oct 2007, David Schwartz wrote: Eric Dumazet wrote: Events are not necessarly reported by descriptors. epoll uses an opaque field provided by the user. It's up to the user to properly chose a tag that will makes sense if the user app is playing dup()/close() games for example. Great. So the only issue then is that the documentation is confusing. It frequently uses the term fd where it means file. For example, it says: Q1 What happens if you add the same fd to an epoll_set twice? A1 You will probably get EEXIST. However, it is possible that two threads may add the same fd twice. This is a harmless condition. This gives no reason to think there's anything wrong with adding the same file twice so long as you do so through different descriptors. (One can imagine an application that does this to segregate read and write operations to avoid a race where the descriptor is closed from under a writer due to handling a fatal read error.) Obviously, that won't work. I agree, that is confusing. However, you can safely add two different file descriptors pointing to the same file*, with different event masks, and that will work as expected. And this part: Q6 Will the close of an fd cause it to be removed from all epoll sets automatically? A6 Yes. This is incorrect. Closing an fd will not cause it to be removed from all epoll sets automatically. Only closing a file will. This is what caused the OP's confusion, and it is at best imprecise and, at worst, flat out wrong. OTOH you cannot list *every* possible scenario in a man page, otherwise you end up writing a book instead of a man page. I will try to find some time with Michael to refine the man page. PS: It is customary to trim individuals off of CC lists when replying to a list when the subject matter of the post is squarely inside the subject of the list. If the person CC'd was interested in the list's subject, he or she would presumably subscribe to the list. Not everyone wants two copies of every post. Not everyone wants a personal copy of every sub-thread that results from a post they make. In the past few years, I've received approximately an equal number of complaints about trimming CC's on posts to LKML and not trimming CC's on such posts. Does anyone that in 2007 still did not manage to find a way to avoid dups in hitting his mailbox, deserve any consideration at all? OTOH many ppl, like myself, uses To and Cc header to direct email to proper folders, where they are treated with a different level of attention. And your stripp-all-headers mania screws that up badly. - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Revised signalfd man-page
On Wed, 17 Oct 2007, Michael Kerrisk wrote: With the new code Linus already merged, signalfd does not attach to the sighand anymore, so the orphaned sighand behaviour is no more there. An exec() will carry the fd over, and you will be able to use the fd in the same way you did before the exec(). If sigpending()/sigwaitinfo() will show signals available, so it should signalfd. So I wrote: execve(2) semantics Just like any other file descriptor, a signalfd file descriptor remains open across an execve(2), unless it has been marked for close-on-exec (see fcntl(2)). Any signals that were available for reading before the execve(2) remain available to the newly loaded program. (This is analogous to traditional signal semantics, where a blocked signal that is pending remains pending across an execve(2).) (This is analogous to traditional signal semantics, where a blocked signal that is pending remains pending across an execve(2).) Okay? Ok It'll return the signals that would be normally returned to the thread with the standard signal delivery methods. That is, calling thread private signals, and calling thread-group shared signals. So I wrote: Thread semantics The semantics of signalfd file descriptors in a multi- threaded program mirror the standard semantics for sig- nals. In other words, when a thread reads from a sig- nalfd file descriptor, it will read the signals that are directed to the thread itself and the signals that are directed to the process (i.e., the entire thread group). (A thread will not be able to read signals that are directed to other threads in the process.) Okay? Ok, although this looks more specific: (A thread will not be able to read other threads private signals) - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch] avoid kmemcheck warning in epoll
Epoll calls rb_set_parent(n, n) to initialize the rb-tree node, but rb_set_parent() accesses node's pointer in its code. This creates a warning in kmemcheck (reported by Vegard Nossum) about an uninitialized memory access. The warning is harmless since the following rb-tree node insert is going to overwrite the node data. In any case I think it's better to not have that happening at all, and fix it by properly initializing the data. Signed-off-by: Davide Libenzi [EMAIL PROTECTED] - Davide --- fs/eventpoll.c |2 +- include/linux/rbtree.h | 12 2 files changed, 13 insertions(+), 1 deletion(-) Index: linux-2.6.mod/fs/eventpoll.c === --- linux-2.6.mod.orig/fs/eventpoll.c 2008-02-10 12:36:20.0 -0800 +++ linux-2.6.mod/fs/eventpoll.c2008-02-10 12:50:41.0 -0800 @@ -260,7 +260,7 @@ /* Special initialization for the RB tree node to detect linkage */ static inline void ep_rb_initnode(struct rb_node *n) { - rb_set_parent(n, n); + rb_init_node(n, n); } /* Removes a node from the RB tree and marks it for a fast is-linked check */ Index: linux-2.6.mod/include/linux/rbtree.h === --- linux-2.6.mod.orig/include/linux/rbtree.h 2008-02-10 12:36:13.0 -0800 +++ linux-2.6.mod/include/linux/rbtree.h2008-02-10 12:51:57.0 -0800 @@ -112,6 +112,18 @@ struct rb_node *rb_node; }; +/** + * rb_init_node - Initializes the node internal data + * + * @node: Pointer to the RB-Tree node + * @parent: Pointer to the parent node, or NULL + * + */ +static inline void rb_init_node(struct rb_node *node, struct rb_node *parent) +{ + node-rb_parent_color = (unsigned long) parent; + node-rb_left = node-rb_right = NULL; +} #define rb_parent(r) ((struct rb_node *)((r)-rb_parent_color ~3)) #define rb_color(r) ((r)-rb_parent_color 1) -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch] avoid kmemcheck warning in epoll
Epoll calls rb_set_parent(n, n) to initialize the rb-tree node, but rb_set_parent() accesses node's pointer in its code. This creates a warning in kmemcheck (reported by Vegard Nossum) about an uninitialized memory access. The warning is harmless since the following rb-tree node insert is going to overwrite the node data. In any case I think it's better to not have that happening at all, and fix it by simplifying the code to get rid of a few lines that became superfluous after the previous epoll changes. Signed-off-by: Davide Libenzi [EMAIL PROTECTED] - Davide --- fs/eventpoll.c | 27 +++ 1 file changed, 3 insertions(+), 24 deletions(-) Index: linux-2.6.mod/fs/eventpoll.c === --- linux-2.6.mod.orig/fs/eventpoll.c 2008-02-11 15:31:02.0 -0800 +++ linux-2.6.mod/fs/eventpoll.c2008-02-11 15:32:46.0 -0800 @@ -257,25 +257,6 @@ (p1-file p2-file ? -1 : p1-fd - p2-fd)); } -/* Special initialization for the RB tree node to detect linkage */ -static inline void ep_rb_initnode(struct rb_node *n) -{ - rb_set_parent(n, n); -} - -/* Removes a node from the RB tree and marks it for a fast is-linked check */ -static inline void ep_rb_erase(struct rb_node *n, struct rb_root *r) -{ - rb_erase(n, r); - rb_set_parent(n, n); -} - -/* Fast check to verify that the item is linked to the main RB tree */ -static inline int ep_rb_linked(struct rb_node *n) -{ - return rb_parent(n) != n; -} - /* Tells us if the item is currently linked */ static inline int ep_is_linked(struct list_head *p) { @@ -283,13 +264,13 @@ } /* Get the struct epitem from a wait queue pointer */ -static inline struct epitem * ep_item_from_wait(wait_queue_t *p) +static inline struct epitem *ep_item_from_wait(wait_queue_t *p) { return container_of(p, struct eppoll_entry, wait)-base; } /* Get the struct epitem from an epoll queue wrapper */ -static inline struct epitem * ep_item_from_epqueue(poll_table *p) +static inline struct epitem *ep_item_from_epqueue(poll_table *p) { return container_of(p, struct ep_pqueue, pt)-epi; } @@ -411,8 +392,7 @@ list_del_init(epi-fllink); spin_unlock(file-f_ep_lock); - if (ep_rb_linked(epi-rbn)) - ep_rb_erase(epi-rbn, ep-rbr); + rb_erase(epi-rbn, ep-rbr); spin_lock_irqsave(ep-lock, flags); if (ep_is_linked(epi-rdllink)) @@ -728,7 +708,6 @@ goto error_return; /* Item initialization follow here ... */ - ep_rb_initnode(epi-rbn); INIT_LIST_HEAD(epi-rdllink); INIT_LIST_HEAD(epi-fllink); INIT_LIST_HEAD(epi-pwqlist); -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch] avoid kmemcheck warning in epoll
On Mon, 11 Feb 2008, Andrew Morton wrote: On Sun, 10 Feb 2008 13:32:01 -0800 (PST) Davide Libenzi [EMAIL PROTECTED] wrote: Epoll calls rb_set_parent(n, n) to initialize the rb-tree node, but rb_set_parent() accesses node's pointer in its code. This creates a warning in kmemcheck (reported by Vegard Nossum) about an uninitialized memory access. The warning is harmless since the following rb-tree node insert is going to overwrite the node data. In any case I think it's better to not have that happening at all, and fix it by properly initializing the data. Signed-off-by: Davide Libenzi [EMAIL PROTECTED] - Davide --- fs/eventpoll.c |2 +- include/linux/rbtree.h | 12 2 files changed, 13 insertions(+), 1 deletion(-) Index: linux-2.6.mod/fs/eventpoll.c === --- linux-2.6.mod.orig/fs/eventpoll.c 2008-02-10 12:36:20.0 -0800 +++ linux-2.6.mod/fs/eventpoll.c2008-02-10 12:50:41.0 -0800 @@ -260,7 +260,7 @@ /* Special initialization for the RB tree node to detect linkage */ static inline void ep_rb_initnode(struct rb_node *n) { - rb_set_parent(n, n); + rb_init_node(n, n); } /* Removes a node from the RB tree and marks it for a fast is-linked check */ Index: linux-2.6.mod/include/linux/rbtree.h === --- linux-2.6.mod.orig/include/linux/rbtree.h 2008-02-10 12:36:13.0 -0800 +++ linux-2.6.mod/include/linux/rbtree.h2008-02-10 12:51:57.0 -0800 @@ -112,6 +112,18 @@ struct rb_node *rb_node; }; +/** + * rb_init_node - Initializes the node internal data + * + * @node: Pointer to the RB-Tree node + * @parent: Pointer to the parent node, or NULL + * + */ +static inline void rb_init_node(struct rb_node *node, struct rb_node *parent) +{ + node-rb_parent_color = (unsigned long) parent; + node-rb_left = node-rb_right = NULL; +} Is epoll the only rbtree-using code which exhibits this problem? If so, what is epoll doing differently from all the others? Dunno, but I don't think so. Epoll initializes the node to a state so that later on can check if the file-fd item is inserted or not. And it uses the parent information for that. But rb_set_parent() (that is used in the current initialization code) uses the data in the node, and this triggers the uninitialized memory access. Taking a better look at it, the code (after the latest changes) has no more point-of-failures after the rb-tree insert, so I can probably avoid doing the node's parent-initialization and check altogether. Yeah, let's that ;) eventpoll.c | 27 +++ 1 file changed, 3 insertions(+), 24 deletions(-) Andrew, drop that one. I'm gonna send the new one after some test... - Davide -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2.6.23-rc6 Resending] NETWORKING : Edge Triggered EPOLLOUT events get missed for TCP sockets
On Thu, 20 Sep 2007, Nagendra Tomar wrote: --- Davide Libenzi [EMAIL PROTECTED] wrote: Looking back at it, I think the current TCP code is right, once you look at the event to be a output buffer full-with_space transition. If you drop an fd inside epoll with EPOLLOUT|EPOLLET and you get an event (free space on the output buffer), if you do not consume it (say a tcp_sendmsg that re-fill the buffer), you can't see other OUT event anymore since they happen on the full-with_space transition. Yes, I know, the read size (EPOLLIN) works differently and you get an event for every packet you receive. And yes, I do not like asymmetric things. But that does not make the EPOLLOUT|EPOLLET wrong IMO. I agree that ET means the event should happen at the transition from nospace-space condition, but isn't the other case (event is delivered every time the event actually happens) more usable. Also the epoll man page says so ... Edge Triggered event distribution delivers events only when events happens on the monitored file. This serves the purpose of ET (reducing the number of poll events) and at the same time makes userspace coding easier. My userspace code has the liberty of deciding when it can write to the socket. f.e. the sendfile buffer management example that I quoted in my earlier post will be difficult with the current ET|POLLOUT behaviour. I cannot write in full-buffer units. I'll ve to write partial buffers just to fill the TCP writeq which is needed to trigger the event. That's not what POLLOUT means in the Unix meaning. POLLOUT indicates the ability to write, and it is not meant as to signal every time a packet (skb) is sent on the wire (and the buffer released). In your particular application, you could simply split the sendfile into appropriately sized chunks, and handle the buffer realease in there. - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2.6.23-rc6 Resending] NETWORKING : Edge Triggered EPOLLOUT events get missed for TCP sockets
On Thu, 20 Sep 2007, Nagendra Tomar wrote: That's not what POLLOUT means in the Unix meaning. POLLOUT indicates the ability to write, and it is not meant as to signal every time a packet (skb) is sent on the wire (and the buffer released). Aren't they both the same ? Everytime an incoming ACK frees up a buffer from the retransmit queue, the writability condition is freshly asserted, much the same way as the readability condition is asserted everytime a new data is queued in the socket receive queue (irrespective of whether there was data already waiting to be read in the receive queue). This difference in meaning of POLLOUT only arises in the ET case, which was not what traditional Unix poll referred to. Again, events here are readability and writeability and the fact that the lower network layer freed a buffer is not, per se, an event (unless before, writeability was tested and reported as unavailable). In you case the solution looks pretty simple. Just create appropriately sized buffers, split the single sendfile into multiple buffer-sized ones, and recycle the buffer once each of them completes. - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: RFC: A revised timerfd API
On Sat, 22 Sep 2007, Michael Kerrisk wrote: So I'm inclined to implement option (b), unless someone has strong objections. Davide, could I persuade you to help? I guess I better do, otherwise you'll continue to stress me ;) int timerfd_create(int clockid); int timerfd_settime(int ufd, int flags, const struct itimerspec *utmr, struct itimerspec *otmr); int timerfd_gettime(int ufd, struct itimerspec *otmr); Patch below. Builds, not tested yet (you need to remove the broken status from CONFIG_TIMERFD in case you want to test - and plug the new syscall to arch/xxx). May that work for you? Thomas-san, hrtimer_try_to_cancel() does not touch -expires and I assume it'll never do, granted? - Davide --- fs/compat.c | 32 -- fs/timerfd.c | 144 +-- include/linux/compat.h |7 +- include/linux/syscalls.h |7 +- 4 files changed, 126 insertions(+), 64 deletions(-) Index: linux-2.6.mod/fs/timerfd.c === --- linux-2.6.mod.orig/fs/timerfd.c 2007-09-22 12:22:19.0 -0700 +++ linux-2.6.mod/fs/timerfd.c 2007-09-22 13:21:21.0 -0700 @@ -23,6 +23,7 @@ struct timerfd_ctx { struct hrtimer tmr; + int clockid; ktime_t tintv; wait_queue_head_t wqh; int expired; @@ -46,7 +47,7 @@ return HRTIMER_NORESTART; } -static void timerfd_setup(struct timerfd_ctx *ctx, int clockid, int flags, +static void timerfd_setup(struct timerfd_ctx *ctx, int flags, const struct itimerspec *ktmr) { enum hrtimer_mode htmode; @@ -58,7 +59,7 @@ texp = timespec_to_ktime(ktmr-it_value); ctx-expired = 0; ctx-tintv = timespec_to_ktime(ktmr-it_interval); - hrtimer_init(ctx-tmr, clockid, htmode); + hrtimer_init(ctx-tmr, ctx-clockid, htmode); ctx-tmr.expires = texp; ctx-tmr.function = timerfd_tmrproc; if (texp.tv64 != 0) @@ -150,76 +151,109 @@ .read = timerfd_read, }; -asmlinkage long sys_timerfd(int ufd, int clockid, int flags, - const struct itimerspec __user *utmr) +asmlinkage long sys_timerfd_create(int clockid) { - int error; + int error, ufd; struct timerfd_ctx *ctx; struct file *file; struct inode *inode; - struct itimerspec ktmr; - - if (copy_from_user(ktmr, utmr, sizeof(ktmr))) - return -EFAULT; if (clockid != CLOCK_MONOTONIC clockid != CLOCK_REALTIME) return -EINVAL; + + ctx = kzalloc(sizeof(*ctx), GFP_KERNEL); + if (!ctx) + return -ENOMEM; + + init_waitqueue_head(ctx-wqh); + ctx-clockid = clockid; + + error = anon_inode_getfd(ufd, inode, file, [timerfd], +timerfd_fops, ctx); + if (error) + goto err_kfree_ctx; + + return ufd; + +err_kfree_ctx: + kfree(ctx); + return error; +} + +asmlinkage long sys_timerfd_settime(int ufd, int flags, + const struct itimerspec __user *utmr, + struct itimerspec __user *otmr) +{ + struct file *file; + struct timerfd_ctx *ctx; + struct itimerspec ktmr, kotmr; + + if (copy_from_user(ktmr, utmr, sizeof(ktmr))) + return -EFAULT; + if (!timespec_valid(ktmr.it_value) || !timespec_valid(ktmr.it_interval)) return -EINVAL; - if (ufd == -1) { - ctx = kmalloc(sizeof(*ctx), GFP_KERNEL); - if (!ctx) - return -ENOMEM; - - init_waitqueue_head(ctx-wqh); - - timerfd_setup(ctx, clockid, flags, ktmr); - - /* -* When we call this, the initialization must be complete, since -* anon_inode_getfd() will install the fd. -*/ - error = anon_inode_getfd(ufd, inode, file, [timerfd], -timerfd_fops, ctx); - if (error) - goto err_tmrcancel; - } else { - file = fget(ufd); - if (!file) - return -EBADF; - ctx = file-private_data; - if (file-f_op != timerfd_fops) { - fput(file); - return -EINVAL; - } - /* -* We need to stop the existing timer before reprogramming -* it to the new values. -*/ - for (;;) { - spin_lock_irq(ctx-wqh.lock); - if (hrtimer_try_to_cancel(ctx-tmr) = 0) - break; - spin_unlock_irq(ctx-wqh.lock); - cpu_relax(); -
Re: RFC: A revised timerfd API
On Sat, 22 Sep 2007, Thomas Gleixner wrote: On Sat, 2007-09-22 at 14:07 -0700, Davide Libenzi wrote: On Sat, 22 Sep 2007, Michael Kerrisk wrote: So I'm inclined to implement option (b), unless someone has strong objections. Davide, could I persuade you to help? I guess I better do, otherwise you'll continue to stress me ;) int timerfd_create(int clockid); int timerfd_settime(int ufd, int flags, const struct itimerspec *utmr, struct itimerspec *otmr); int timerfd_gettime(int ufd, struct itimerspec *otmr); Patch below. Builds, not tested yet (you need to remove the broken status from CONFIG_TIMERFD in case you want to test - and plug the new syscall to arch/xxx). May that work for you? Thomas-san, hrtimer_try_to_cancel() does not touch -expires and I assume it'll never do, granted? Davide-san, I have no intention to change that, but remember there is this file Documentation/stable_api_nonsense.txt :) Heh, I guess that'll work then ;) - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: RFC: A revised timerfd API
On Sun, 23 Sep 2007, Michael Kerrisk wrote: I applied this patch against 2.6.27-rc7, and wired up the syscalls as shown in the definitions below. When I ran the the program below, my system immediately froze. Can you try it on your system please. There's an hrtimer_init() missing in timerfd_create(). I'll refactor the patch. - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: RFC: A revised timerfd API
On Sun, 23 Sep 2007, Davide Libenzi wrote: On Sun, 23 Sep 2007, Michael Kerrisk wrote: I applied this patch against 2.6.27-rc7, and wired up the syscalls as shown in the definitions below. When I ran the the program below, my system immediately froze. Can you try it on your system please. There's an hrtimer_init() missing in timerfd_create(). I'll refactor the patch. There's the case of a timerfd_gettime return status when the timerfd has not been set yet (ie, soon after a timerfd_create), to handle. Current way is to return an (itimerspec) { 0, 0 }. Ok? - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch 1/3] new timerfd API - new timerfd API
This is the new timerfd API as it is implemented by the following patch: int timerfd_create(int clockid); int timerfd_settime(int ufd, int flags, const struct itimerspec *utmr, struct itimerspec *otmr); int timerfd_gettime(int ufd, struct itimerspec *otmr); The timerfd_create() API creates an un-programmed timerfd fd. The clockid parameter can be either CLOCK_MONOTONIC or CLOCK_REALTIME. The timerfd_settime() API give new settings by the timerfd fd, by optionally retrieving the previous expiration time (in case the otmr parameter is not NULL). The time value specified in utmr is absolute, if the TFD_TIMER_ABSTIME bit is set in the flags parameter. Otherwise it's a relative time. The timerfd_gettime() API returns the next expiration time of the timer, or {0, 0} if the timerfd has not been set yet. Like the previous timerfd API implementation, read(2) and poll(2) are supported (with the same interface). Here's a simple test program I used to exercise the new timerfd APIs: http://www.xmailserver.org/timerfd-test2.c Signed-off-by: Davide Libenzi [EMAIL PROTECTED] - Davide --- fs/compat.c | 32 ++- fs/timerfd.c | 199 ++- include/linux/compat.h |7 + include/linux/syscalls.h |7 + 4 files changed, 168 insertions(+), 77 deletions(-) Index: linux-2.6.mod/fs/timerfd.c === --- linux-2.6.mod.orig/fs/timerfd.c 2007-09-23 15:18:09.0 -0700 +++ linux-2.6.mod/fs/timerfd.c 2007-09-23 15:25:55.0 -0700 @@ -23,15 +23,17 @@ struct timerfd_ctx { struct hrtimer tmr; + int clockid; ktime_t tintv; wait_queue_head_t wqh; int expired; + u64 ticks; }; /* * This gets called when the timer event triggers. We set the expired * flag, but we do not re-arm the timer (in case it's necessary, - * tintv.tv64 != 0) until the timer is read. + * tintv.tv64 != 0) until the timer is accessed. */ static enum hrtimer_restart timerfd_tmrproc(struct hrtimer *htmr) { @@ -40,13 +42,14 @@ spin_lock_irqsave(ctx-wqh.lock, flags); ctx-expired = 1; + ctx-ticks++; wake_up_locked(ctx-wqh); spin_unlock_irqrestore(ctx-wqh.lock, flags); return HRTIMER_NORESTART; } -static void timerfd_setup(struct timerfd_ctx *ctx, int clockid, int flags, +static void timerfd_setup(struct timerfd_ctx *ctx, int flags, const struct itimerspec *ktmr) { enum hrtimer_mode htmode; @@ -57,8 +60,9 @@ texp = timespec_to_ktime(ktmr-it_value); ctx-expired = 0; + ctx-ticks = 0; ctx-tintv = timespec_to_ktime(ktmr-it_interval); - hrtimer_init(ctx-tmr, clockid, htmode); + hrtimer_init(ctx-tmr, ctx-clockid, htmode); ctx-tmr.expires = texp; ctx-tmr.function = timerfd_tmrproc; if (texp.tv64 != 0) @@ -83,7 +87,7 @@ poll_wait(file, ctx-wqh, wait); spin_lock_irqsave(ctx-wqh.lock, flags); - if (ctx-expired) + if (ctx-ticks) events |= POLLIN; spin_unlock_irqrestore(ctx-wqh.lock, flags); @@ -102,11 +106,11 @@ return -EINVAL; spin_lock_irq(ctx-wqh.lock); res = -EAGAIN; - if (!ctx-expired !(file-f_flags O_NONBLOCK)) { + if (!ctx-ticks !(file-f_flags O_NONBLOCK)) { __add_wait_queue(ctx-wqh, wait); for (res = 0;;) { set_current_state(TASK_INTERRUPTIBLE); - if (ctx-expired) { + if (ctx-ticks) { res = 0; break; } @@ -121,22 +125,23 @@ __remove_wait_queue(ctx-wqh, wait); __set_current_state(TASK_RUNNING); } - if (ctx-expired) { - ctx-expired = 0; - if (ctx-tintv.tv64 != 0) { + if (ctx-ticks) { + ticks = ctx-ticks; + if (ctx-expired ctx-tintv.tv64) { /* * If tintv.tv64 != 0, this is a periodic timer that * needs to be re-armed. We avoid doing it in the timer * callback to avoid DoS attacks specifying a very * short timer period. */ - ticks = (u64) + ticks += (u64) hrtimer_forward(ctx-tmr, hrtimer_cb_get_time(ctx-tmr), - ctx-tintv); + ctx-tintv) - 1; hrtimer_restart(ctx-tmr); - } else - ticks = 1; + } + ctx-expired = 0; + ctx
[patch 2/3] new timerfd API - wire the new timerfd API to the x86 family
Wires up the new timerfd API to the x86 family. Signed-off-by: Davide Libenzi [EMAIL PROTECTED] - Davide --- arch/i386/kernel/syscall_table.S |5 - arch/x86_64/ia32/ia32entry.S |4 +++- include/asm-i386/unistd.h|6 -- include/asm-x86_64/unistd.h |8 ++-- 4 files changed, 17 insertions(+), 6 deletions(-) Index: linux-2.6.mod/arch/i386/kernel/syscall_table.S === --- linux-2.6.mod.orig/arch/i386/kernel/syscall_table.S 2007-09-23 15:28:48.0 -0700 +++ linux-2.6.mod/arch/i386/kernel/syscall_table.S 2007-09-23 15:28:52.0 -0700 @@ -321,6 +321,9 @@ .long sys_epoll_pwait .long sys_utimensat /* 320 */ .long sys_signalfd - .long sys_timerfd + .long sys_timerfd_create .long sys_eventfd .long sys_fallocate + .long sys_timerfd_settime /* 325 */ + .long sys_timerfd_gettime + Index: linux-2.6.mod/arch/x86_64/ia32/ia32entry.S === --- linux-2.6.mod.orig/arch/x86_64/ia32/ia32entry.S 2007-09-23 15:28:48.0 -0700 +++ linux-2.6.mod/arch/x86_64/ia32/ia32entry.S 2007-09-23 15:28:52.0 -0700 @@ -730,7 +730,9 @@ .quad sys_epoll_pwait .quad compat_sys_utimensat /* 320 */ .quad compat_sys_signalfd - .quad compat_sys_timerfd + .quad sys_timerfd_create .quad sys_eventfd .quad sys32_fallocate + .quad compat_sys_timerfd_settime/* 325 */ + .quad compat_sys_timerfd_gettime ia32_syscall_end: Index: linux-2.6.mod/include/asm-i386/unistd.h === --- linux-2.6.mod.orig/include/asm-i386/unistd.h2007-09-23 15:28:48.0 -0700 +++ linux-2.6.mod/include/asm-i386/unistd.h 2007-09-23 15:28:52.0 -0700 @@ -327,13 +327,15 @@ #define __NR_epoll_pwait 319 #define __NR_utimensat 320 #define __NR_signalfd 321 -#define __NR_timerfd 322 +#define __NR_timerfd_create322 #define __NR_eventfd 323 #define __NR_fallocate 324 +#define __NR_timerfd_settime 325 +#define __NR_timerfd_gettime 326 #ifdef __KERNEL__ -#define NR_syscalls 325 +#define NR_syscalls 327 #define __ARCH_WANT_IPC_PARSE_VERSION #define __ARCH_WANT_OLD_READDIR Index: linux-2.6.mod/include/asm-x86_64/unistd.h === --- linux-2.6.mod.orig/include/asm-x86_64/unistd.h 2007-09-23 15:28:48.0 -0700 +++ linux-2.6.mod/include/asm-x86_64/unistd.h 2007-09-23 15:28:52.0 -0700 @@ -626,12 +626,16 @@ __SYSCALL(__NR_epoll_pwait, sys_epoll_pwait) #define __NR_signalfd 282 __SYSCALL(__NR_signalfd, sys_signalfd) -#define __NR_timerfd 283 -__SYSCALL(__NR_timerfd, sys_timerfd) +#define __NR_timerfd_create283 +__SYSCALL(__NR_timerfd_create, sys_timerfd_create) #define __NR_eventfd 284 __SYSCALL(__NR_eventfd, sys_eventfd) #define __NR_fallocate 285 __SYSCALL(__NR_fallocate, sys_fallocate) +#define __NR_timerfd_settime 286 +__SYSCALL(__NR_timerfd_settime, sys_timerfd_settime) +#define __NR_timerfd_gettime 287 +__SYSCALL(__NR_timerfd_gettime, sys_timerfd_gettime) #ifndef __NO_STUBS #define __ARCH_WANT_OLD_READDIR - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch 3/3] new timerfd API - un-break CONFIG_TIMERFD
Remove the broken status to CONFIG_TIMERFD. Signed-off-by: Davide Libenzi [EMAIL PROTECTED] - Davide --- init/Kconfig |1 - 1 file changed, 1 deletion(-) Index: linux-2.6.mod/init/Kconfig === --- linux-2.6.mod.orig/init/Kconfig 2007-09-23 15:18:06.0 -0700 +++ linux-2.6.mod/init/Kconfig 2007-09-23 15:28:54.0 -0700 @@ -488,7 +488,6 @@ config TIMERFD bool Enable timerfd() system call if EMBEDDED select ANON_INODES - depends on BROKEN default y help Enable the timerfd() system call that allows to receive timer - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/3] new timerfd API - wire the new timerfd API to the x86 family
On Mon, 24 Sep 2007, Michael Kerrisk wrote: Davide, Is it perhaps not better to group the three syscalls contiguously with respect to syscall numbers? The old timerfd slot can be re-used for some other syscall later. There's no problem if they're not contiguous. Holes, unless filled immediately, need to be remembered to be filled. - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 1/3] new timerfd API - new timerfd API
On Mon, 24 Sep 2007, Thomas Gleixner wrote: struct timerfd_ctx { struct hrtimer tmr; + int clockid; ktime_t tintv; wait_queue_head_t wqh; int expired; + u64 ticks; }; Can you please restructure the struct in a way which does not result in padding by the compiler ? struct timerfd_ctx { struct hrtimer tmr; ktime_t tintv; wait_queue_head_t wqh; u64 ticks; int expired; int clockid; }; Sure. + ticks += (u64) hrtimer_forward(ctx-tmr, hrtimer_cb_get_time(ctx-tmr), You need to use ctx-tmr.base-get_time() here, otherwise you might read a stale time value (in case that CONFIG_HIGH_RES_TIMERS is off). Is the particular position of hrtimer_cb_get_time() in the code that would break here? Because function was added by your patch ;) Did something change later? +err_kfree_ctx: + kfree(ctx); + return error; You really can avoid the goto here. Ack. - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/3] new timerfd API - wire the new timerfd API to the x86 family
On Mon, 24 Sep 2007, Michael Kerrisk wrote: Hi Davide, Davide Libenzi wrote: On Mon, 24 Sep 2007, Michael Kerrisk wrote: Is it perhaps not better to group the three syscalls contiguously with respect to syscall numbers? The old timerfd slot can be re-used for some other syscall later. There's no problem if they're not contiguous. I realise there is no problem, in a technical sense. But it strikes me as more aesthetic to make related syscalls numerically contiguous. Thus, we see such as the following in the kernel source #define __NR_epoll_create 254 #define __NR_epoll_ctl 255 #define __NR_epoll_wait 256 and #define __NR_timer_create 259 #define __NR_timer_settime (__NR_timer_create+1) #define __NR_timer_gettime (__NR_timer_create+2) #define __NR_timer_getoverrun (__NR_timer_create+3) #define __NR_timer_delete (__NR_timer_create+4) and #define __NR_inotify_init 291 #define __NR_inotify_add_watch 292 #define __NR_inotify_rm_watch 293 Holes, unless filled immediately, need to be remembered to be filled. Well, in the past it seems they do get filled soon enough though. There's fair odds that you'll be the one to fill it with the next syscall you write ;-). You have to talk to arch mantainers. I do not care. I simply provided the x86 hooks because I tested on x86. - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch 1/4] new timerfd API v2 - introduce a new hrtimer_forward_now() function
I think that advancing the timer against the timer's current now can be a pretty common usage, so, w/out exposing hrtimer's internals, we add a new hrtimer_forward_now() function. Signed-off-by: Davide Libenzi [EMAIL PROTECTED] - Davide --- include/linux/hrtimer.h |7 +++ 1 file changed, 7 insertions(+) Index: linux-2.6.mod/include/linux/hrtimer.h === --- linux-2.6.mod.orig/include/linux/hrtimer.h 2007-09-24 12:27:20.0 -0700 +++ linux-2.6.mod/include/linux/hrtimer.h 2007-09-24 12:29:39.0 -0700 @@ -298,6 +298,13 @@ extern unsigned long hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval); +/* Forward a hrtimer so it expires after the hrtimer's current now */ +static inline unsigned long hrtimer_forward_now(struct hrtimer *timer, + ktime_t interval) +{ + return hrtimer_forward(timer, timer-base-get_time(), interval); +} + /* Precise sleep: */ extern long hrtimer_nanosleep(struct timespec *rqtp, struct timespec __user *rmtp, - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch 2/4] new timerfd API v2 - new timerfd API
This is the new timerfd API as it is implemented by the following patch: int timerfd_create(int clockid); int timerfd_settime(int ufd, int flags, const struct itimerspec *utmr, struct itimerspec *otmr); int timerfd_gettime(int ufd, struct itimerspec *otmr); The timerfd_create() API creates an un-programmed timerfd fd. The clockid parameter can be either CLOCK_MONOTONIC or CLOCK_REALTIME. The timerfd_settime() API give new settings by the timerfd fd, by optionally retrieving the previous expiration time (in case the otmr parameter is not NULL). The time value specified in utmr is absolute, if the TFD_TIMER_ABSTIME bit is set in the flags parameter. Otherwise it's a relative time. The timerfd_gettime() API returns the next expiration time of the timer, or {0, 0} if the timerfd has not been set yet. Like the previous timerfd API implementation, read(2) and poll(2) are supported (with the same interface). Here's a simple test program I used to exercise the new timerfd APIs: http://www.xmailserver.org/timerfd-test2.c Signed-off-by: Davide Libenzi [EMAIL PROTECTED] - Davide --- fs/compat.c | 32 ++- fs/timerfd.c | 197 ++- include/linux/compat.h |7 + include/linux/syscalls.h |7 + 4 files changed, 164 insertions(+), 79 deletions(-) Index: linux-2.6.mod/fs/timerfd.c === --- linux-2.6.mod.orig/fs/timerfd.c 2007-09-24 12:26:19.0 -0700 +++ linux-2.6.mod/fs/timerfd.c 2007-09-24 12:31:22.0 -0700 @@ -25,13 +25,15 @@ struct hrtimer tmr; ktime_t tintv; wait_queue_head_t wqh; + u64 ticks; int expired; + int clockid; }; /* * This gets called when the timer event triggers. We set the expired * flag, but we do not re-arm the timer (in case it's necessary, - * tintv.tv64 != 0) until the timer is read. + * tintv.tv64 != 0) until the timer is accessed. */ static enum hrtimer_restart timerfd_tmrproc(struct hrtimer *htmr) { @@ -40,13 +42,14 @@ spin_lock_irqsave(ctx-wqh.lock, flags); ctx-expired = 1; + ctx-ticks++; wake_up_locked(ctx-wqh); spin_unlock_irqrestore(ctx-wqh.lock, flags); return HRTIMER_NORESTART; } -static void timerfd_setup(struct timerfd_ctx *ctx, int clockid, int flags, +static void timerfd_setup(struct timerfd_ctx *ctx, int flags, const struct itimerspec *ktmr) { enum hrtimer_mode htmode; @@ -57,8 +60,9 @@ texp = timespec_to_ktime(ktmr-it_value); ctx-expired = 0; + ctx-ticks = 0; ctx-tintv = timespec_to_ktime(ktmr-it_interval); - hrtimer_init(ctx-tmr, clockid, htmode); + hrtimer_init(ctx-tmr, ctx-clockid, htmode); ctx-tmr.expires = texp; ctx-tmr.function = timerfd_tmrproc; if (texp.tv64 != 0) @@ -83,7 +87,7 @@ poll_wait(file, ctx-wqh, wait); spin_lock_irqsave(ctx-wqh.lock, flags); - if (ctx-expired) + if (ctx-ticks) events |= POLLIN; spin_unlock_irqrestore(ctx-wqh.lock, flags); @@ -102,11 +106,11 @@ return -EINVAL; spin_lock_irq(ctx-wqh.lock); res = -EAGAIN; - if (!ctx-expired !(file-f_flags O_NONBLOCK)) { + if (!ctx-ticks !(file-f_flags O_NONBLOCK)) { __add_wait_queue(ctx-wqh, wait); for (res = 0;;) { set_current_state(TASK_INTERRUPTIBLE); - if (ctx-expired) { + if (ctx-ticks) { res = 0; break; } @@ -121,22 +125,21 @@ __remove_wait_queue(ctx-wqh, wait); __set_current_state(TASK_RUNNING); } - if (ctx-expired) { - ctx-expired = 0; - if (ctx-tintv.tv64 != 0) { + if (ctx-ticks) { + ticks = ctx-ticks; + if (ctx-expired ctx-tintv.tv64) { /* * If tintv.tv64 != 0, this is a periodic timer that * needs to be re-armed. We avoid doing it in the timer * callback to avoid DoS attacks specifying a very * short timer period. */ - ticks = (u64) - hrtimer_forward(ctx-tmr, - hrtimer_cb_get_time(ctx-tmr), - ctx-tintv); + ticks += (u64) hrtimer_forward_now(ctx-tmr, + ctx-tintv) - 1; hrtimer_restart(ctx-tmr); - } else - ticks = 1; + } + ctx-expired = 0
[patch 3/4] new timerfd API v2 - wire the new timerfd API to the x86 family
Wires up the new timerfd API to the x86 family. Signed-off-by: Davide Libenzi [EMAIL PROTECTED] - Davide --- arch/i386/kernel/syscall_table.S |5 - arch/x86_64/ia32/ia32entry.S |4 +++- include/asm-i386/unistd.h|6 -- include/asm-x86_64/unistd.h |8 ++-- 4 files changed, 17 insertions(+), 6 deletions(-) Index: linux-2.6.mod/arch/i386/kernel/syscall_table.S === --- linux-2.6.mod.orig/arch/i386/kernel/syscall_table.S 2007-09-24 12:13:28.0 -0700 +++ linux-2.6.mod/arch/i386/kernel/syscall_table.S 2007-09-24 12:31:39.0 -0700 @@ -321,6 +321,9 @@ .long sys_epoll_pwait .long sys_utimensat /* 320 */ .long sys_signalfd - .long sys_timerfd + .long sys_timerfd_create .long sys_eventfd .long sys_fallocate + .long sys_timerfd_settime /* 325 */ + .long sys_timerfd_gettime + Index: linux-2.6.mod/arch/x86_64/ia32/ia32entry.S === --- linux-2.6.mod.orig/arch/x86_64/ia32/ia32entry.S 2007-09-24 12:13:28.0 -0700 +++ linux-2.6.mod/arch/x86_64/ia32/ia32entry.S 2007-09-24 12:31:39.0 -0700 @@ -730,7 +730,9 @@ .quad sys_epoll_pwait .quad compat_sys_utimensat /* 320 */ .quad compat_sys_signalfd - .quad compat_sys_timerfd + .quad sys_timerfd_create .quad sys_eventfd .quad sys32_fallocate + .quad compat_sys_timerfd_settime/* 325 */ + .quad compat_sys_timerfd_gettime ia32_syscall_end: Index: linux-2.6.mod/include/asm-i386/unistd.h === --- linux-2.6.mod.orig/include/asm-i386/unistd.h2007-09-24 12:13:28.0 -0700 +++ linux-2.6.mod/include/asm-i386/unistd.h 2007-09-24 12:31:39.0 -0700 @@ -327,13 +327,15 @@ #define __NR_epoll_pwait 319 #define __NR_utimensat 320 #define __NR_signalfd 321 -#define __NR_timerfd 322 +#define __NR_timerfd_create322 #define __NR_eventfd 323 #define __NR_fallocate 324 +#define __NR_timerfd_settime 325 +#define __NR_timerfd_gettime 326 #ifdef __KERNEL__ -#define NR_syscalls 325 +#define NR_syscalls 327 #define __ARCH_WANT_IPC_PARSE_VERSION #define __ARCH_WANT_OLD_READDIR Index: linux-2.6.mod/include/asm-x86_64/unistd.h === --- linux-2.6.mod.orig/include/asm-x86_64/unistd.h 2007-09-24 12:13:28.0 -0700 +++ linux-2.6.mod/include/asm-x86_64/unistd.h 2007-09-24 12:31:39.0 -0700 @@ -626,12 +626,16 @@ __SYSCALL(__NR_epoll_pwait, sys_epoll_pwait) #define __NR_signalfd 282 __SYSCALL(__NR_signalfd, sys_signalfd) -#define __NR_timerfd 283 -__SYSCALL(__NR_timerfd, sys_timerfd) +#define __NR_timerfd_create283 +__SYSCALL(__NR_timerfd_create, sys_timerfd_create) #define __NR_eventfd 284 __SYSCALL(__NR_eventfd, sys_eventfd) #define __NR_fallocate 285 __SYSCALL(__NR_fallocate, sys_fallocate) +#define __NR_timerfd_settime 286 +__SYSCALL(__NR_timerfd_settime, sys_timerfd_settime) +#define __NR_timerfd_gettime 287 +__SYSCALL(__NR_timerfd_gettime, sys_timerfd_gettime) #ifndef __NO_STUBS #define __ARCH_WANT_OLD_READDIR - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch 4/4] new timerfd API v2 - un-break CONFIG_TIMERFD
Remove the broken status to CONFIG_TIMERFD. Signed-off-by: Davide Libenzi [EMAIL PROTECTED] - Davide --- init/Kconfig |1 - 1 file changed, 1 deletion(-) Index: linux-2.6.mod/init/Kconfig === --- linux-2.6.mod.orig/init/Kconfig 2007-09-24 12:13:25.0 -0700 +++ linux-2.6.mod/init/Kconfig 2007-09-24 12:31:40.0 -0700 @@ -488,7 +488,6 @@ config TIMERFD bool Enable timerfd() system call if EMBEDDED select ANON_INODES - depends on BROKEN default y help Enable the timerfd() system call that allows to receive timer - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 1/2] Enable link power management for ata drivers
On Tue, 25 Sep 2007, roel wrote: + if (!(ap-flags ATA_FLAG_IPM) || !ata_dev_enabled(dev)) { if (!((ap-flags ATA_FLAG_IPM) ata_dev_enabled(dev))) { int foo(int i, int j) { return !(i 8) || !j; } int moo(int i, int j) { return !((i 8) j); } gcc -O2 -S: .globl foo .type foo, @function foo: shrl$3, %edi xorl$1, %edi testl %esi, %esi sete%al orl %eax, %edi andl$1, %edi movl%edi, %eax ret .globl moo .type moo, @function moo: shrl$3, %edi xorl$1, %edi testl %esi, %esi sete%al orl %eax, %edi andl$1, %edi movl%edi, %eax ret - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2/4] new timerfd API v2 - new timerfd API
On Tue, 25 Sep 2007, Jonathan Corbet wrote: One quick question: Like the previous timerfd API implementation, read(2) and poll(2) are supported (with the same interface). Looking at that interface, it appears that a process doing a read() on a timerfd with no timer set will block for a very long time. It's an obvious don't do that situation, but perhaps we could help an occasional developer get a clue by returning something like -EINVAL when the timer has not been set? That is the same as you try to read once more after an expired timer. You won't wake up until the next timer event will show up. That is, after at most TP time for periodic timers, or after the time the next timerfd_settime() will setup. I'd like to keep the timerfd not set yet and the timerfd already expired and not re-armed acting the same way. That is, wait till next event happen (unless O_NONBLOCK of course). - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Man page for revised timerfd API
Michael, SCB ... On Wed, 26 Sep 2007, Michael Kerrisk wrote: .TH TIMERFD_CREATE 2 2007-09-26 Linux Linux Programmer's Manual .SH NAME timerfd_create, timerfd_settime, timer_gettime \- timers that notify via file descriptors .SH SYNOPSIS .\ FIXME . This header file may well change .\ FIXME . Probably _GNU_SOURCE will be required .\ FIXME . May require: Link with \fI\-lrt\f .nf .B #include sys/timerfd.h .sp .BI int timerfd_create(int clockid ); .sp .BI int timerfd_settime(int fd , int flags , .BI const struct itimerspec * new_value , .BI struct itimerspec * curr_value ); .sp .BI int timerfd_gettime(int fd , struct itimerspec * curr_value ); .fi .SH DESCRIPTION These system calls create and operate on a timer that delivers timer expiration notifications via a file descriptor. They provide an alternative to the use of .BR setitimer (2) or .BR timer_create (3), with the advantage that the file descriptor may be monitored by .BR poll (2) and .BR select (2). epoll, no? The use of these three system calls is analogous to the use of .BR timer_create (2), .BR timer_settime (2), and .BR timer_gettime (2). .\ .SS timerfd_create() .BR timerfd_create () creates a new timer object, and returns a file descriptor that refers to that timer. The .I clockid argument specifies the clock that is used to mark the progress of the timer, and must be either .B CLOCK_REALTIME or .BR CLOCK_MONOTONIC . .B CLOCK_REALTIME is a settable system-wide clock. .B CLOCK_MONOTONIC is a non-settable clock that is not affected by discontinuous changes in the system clock (e.g., manual changes to system time). The current value of each of these clocks can be retrieved using .BR clock_gettime (3). .\ .SS timerfd_settime() .BR timerfd_settime () arms (starts) or disarms (stops) the timer referred to by the file descriptor .IR fd . The .I new_value argument specifies the initial expiration and interval for the timer. The .I itimer structure used for this argument contains two fields, each of which is in turn a structure of type .IR timespec : .in +0.25i .nf struct timespec { time_t tv_sec;/* Seconds */ long tv_nsec; /* Nanoseconds */ }; struct itimerspec { struct timespec it_interval; /* Interval for periodic timer */ struct timespec it_value; /* Initial expiration */ }; .fi .in .PP .I new_value.it_value specifies the initial expiration of the timer, in seconds and nanoseconds. Setting either field of .I new_value.it_value to a non-zero value arms the timer. Setting both fields of .I new_value.it_value to zero disarms the timer. Setting one or both fields of .I new_value.it_interval to non-zero values specifies the period, in seconds and nanoseconds, for repeated timer expirations after the initial expiration. If both fields of .I new_value.it_interval are zero, the timer expires just once, at the time specified by .IR new_value.it_value . The .I flags argument is either 0, to start a relative timer .RI ( new_value.it_interval specifies a time relative to the current value of the clock specified by .IR clockid ), or .BR TFD_TIMER_ABSTIME , to start an absolute timer .RI ( new_value.it_interval specifies an absolute time for the clock specified by .IR clockid ; that is, the timer will expire when the value of that clock reaches the value specified in .IR new_value.it_interval ). The .I curr_value argument returns a structure containing the setting of the timer that was current at the time of the call; see the description of .BR timerfd_gettime () following. .\ .SS timerfd_gettime() .BR timerfd_gettime () returns, in .IR curr_value , an .IR itimerspec that contains the current setting of the timer referred to by the file descriptor .IR fd . The .I it_value field returns the amount of time until the timer will next expire. If both fields of this structure are zero, then the timer is currently disarmed. This field always contains a relative value, regardless of whether the .BR TFD_TIMER_ABSTIME flag was specified when setting the timer. The .I it_interval field returns the interval of the timer. If both fields of this structure are zero, then the timer is set to expire just once, at the time specified by .IR curr_value.it_value . .SS Operating on a timer file descriptor The file descriptor returned by .BR timerfd_create (2) supports the following operations: .TP .BR read (2) If the timer has already expired one or more times since it was created, or since the last .BR read (2), then the buffer given to .BR read (2) returns an unsigned 8-byte integer .RI ( uint64_t ) containing the number of expirations that have occurred. .IP If no timer expirations have occurred at the time of the .BR read (2), then the call either blocks until the next timer expiration, or fails with the
Re: Man page for revised timerfd API
On Thu, 27 Sep 2007, Michael Kerrisk wrote: Hi Davide, A follow up to the man page text. Does passing a timerfd file descriptor via a Unix domain socket to another process do the expected thing? That is, the receiving process will be able to read from the file descriptor in order to obtain notification of timer expirations that are occurring for the process that sent the file descriptor, right? Yup. - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Man page for revised timerfd API
On Thu, 27 Sep 2007, Michael Kerrisk wrote: Davide, A further question: what is the expected behavior in the following scenario: 1. Create a timerfd and arm it. 2. Wait until M timer expirations have occurred 3. Modify the settings of the timer 4. Wait for N further timer expirations have occurred 5. read() from the timerfd Does the buffer returned by the read() contain the value N or (M+N)? In other words, should modifying the timer settings reset the expiration count to zero? Every timerfd_settime() zeroes the tick counter. So in your scenario it'll return N. - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: eventfd.2 man page
On Thu, 27 Sep 2007, Michael Kerrisk wrote: Hi Davide, I've slightly tweaked the eventfd.2 man page in preparation for adding it to the man-pages set. Could you please review the text below, and confirm that it correctly describes intended behavior. Looks good to me. At the time that I was tsting it, I cooked an example about how to use an eventfd with KAIO: http://www.xmailserver.org/eventfd-aio-test.c In case you'd like to add a sketch to the man page. - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Revised signalfd man-page
On Thu, 27 Sep 2007, Michael Kerrisk wrote: .\ Copyright (C) 2007 Michael Kerrisk [EMAIL PROTECTED] .\ starting from a version by Davide Libenzi [EMAIL PROTECTED] .\ .\ This program is free software; you can redistribute it and/or modify .\ it under the terms of the GNU General Public License as published by .\ the Free Software Foundation; either version 2 of the License, or .\ (at your option) any later version. .\ .\ This program is distributed in the hope that it will be useful, .\ but WITHOUT ANY WARRANTY; without even the implied warranty of .\ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the .\ GNU General Public License for more details. .\ .\ You should have received a copy of the GNU General Public License .\ along with this program; if not, write to the Free Software .\ Foundation, Inc., 59 Temple Place, Suite 330, Boston, .\ MA 02111-1307 USA .\ .TH SIGNALFD 2 2007-09-27 Linux Linux Programmer's Manual .SH NAME signalfd \- create a file descriptor for accepting signals .SH SYNOPSIS .\ FIXME . This header file may well change .\ FIXME . Probably _GNU_SOURCE will be required .\ FIXME . May require: Link with \fI\-lrt\f .B #include sys/signalfd.h .sp .BI int signalfd(int fd , const sigset_t * mask ); .\ Almost certainly the glibc wrapper will hide this argument: .\ , size_t sizemask .SH DESCRIPTION .BR signalfd (2) creates a file descriptor that can be used to accept signals targeted at the caller. This provides an alternative to the use of a signal handler or .BR sigwaitinfo (2), and has the advantage that the file descriptor may be monitored by .BR select (2), .BR poll (2), and .BR epoll (7). The .I mask argument specifies the set of signals that the caller wishes to accept via the file descriptor. This argument is a signal set whose contents can be initialized using the macros described in .BR sigsetops (3). Normally, the set of signals to be received via the file descriptor should be blocked using .BR sigprocmask (2), to prevent the signals being handled according to their default dispositions. It is not possible to receive .B SIGKILL or .B SIGSTOP signals via a signalfd file descriptor; these signals are silently ignored if specified in .IR mask . If the .I fd argument is \-1, then the call creates a new file descriptor and associates the signal set specified in .I mask with that descriptor. If .I fd is not \-1, then it must specify a valid existing signalfd file descriptor, and .I mask is used to replace the signal set associated with that descriptor. .BR signalfd (2) returns a file descriptor that supports the following operations: .TP .BR read (2) If one or more of the signals specified in .I mask is pending for the process, then the buffer supplied to .BR read (2) is used to return one or more .I signalfd_siginfo structures (see below) that describe the signals. The .BR read (2) returns information for as many signals as are pending and will fit in the supplied buffer. The buffer must be at least .I sizeof(struct signalfd_siginfo) bytes. The return value of the .BR read (2) is the total number of bytes read. .IP As a consequence of the .BR read (2), the signals are consumed, so that they are no longer pending for the process (i.e., will not be caught by signal handlers, and cannot be accepted using .BR sigwaitinfo (2)). .IP If none of the signals in .I mask is pending for the process, then the .BR read (2) either blocks until one of the signals in .I mask is generated for the process, or fails with the error .B EAGAIN if the file descriptor has been made non-blocking (via the use of the .BR fcntl (2) .B F_SETFL operation to set the .B O_NONBLOCK flag). .\ FIXME Davide, what does the following mean? How (in userspace .\ terms) does a sighand structure become orphaned? The .BR read (2) call can also return 0, in case the sighand structure to which the signalfd was attached, has been orphaned. You can remove the five lines above, in virtue of the fact that Linus merged my simplification patch. .TP .BR poll (2), select (2) (and similar) The file descriptor is readable (the .BR select (2) .I readfds argument; the .BR poll (2) .B POLLIN flag) if one or more of the signals in .I mask is pending for the process. .IP The signalfd file descriptor also supports the other file-descriptor multiplexing APIs: .BR pselect (2), .BR ppoll (2), and .BR epoll (7). .TP .BR close (2) When the file descriptor is no longer required it should be closed. When all file descriptors associated with the same signalfd object have been closed, the resources for object are freed by the kernel. .SS The signalfd_siginfo structure The format of the .I signalfd_siginfo structure(s) returned by .BR read (2)s from a signalfd file descriptor is as follows: .in +0.5i .nf .\ FIXME Davide, a question: why rename these fields
[patch] rename signalfd_siginfo fields
For Michael Kerrisk request, the following patch renames signalfd_siginfo fields in order to keep them consistent with the siginfo_t ones. Signed-off-by: Davide Libenzi [EMAIL PROTECTED] - Davide --- fs/signalfd.c| 44 ++-- include/linux/signalfd.h | 32 2 files changed, 38 insertions(+), 38 deletions(-) Index: linux-2.6.mod/fs/signalfd.c === --- linux-2.6.mod.orig/fs/signalfd.c2007-09-27 10:31:48.0 -0700 +++ linux-2.6.mod/fs/signalfd.c 2007-09-27 10:33:45.0 -0700 @@ -74,45 +74,45 @@ * If you change siginfo_t structure, please be sure * this code is fixed accordingly. */ - err |= __put_user(kinfo-si_signo, uinfo-signo); - err |= __put_user(kinfo-si_errno, uinfo-err); - err |= __put_user((short)kinfo-si_code, uinfo-code); + err |= __put_user(kinfo-si_signo, uinfo-ssi_signo); + err |= __put_user(kinfo-si_errno, uinfo-ssi_errno); + err |= __put_user((short) kinfo-si_code, uinfo-ssi_code); switch (kinfo-si_code __SI_MASK) { case __SI_KILL: - err |= __put_user(kinfo-si_pid, uinfo-pid); - err |= __put_user(kinfo-si_uid, uinfo-uid); + err |= __put_user(kinfo-si_pid, uinfo-ssi_pid); + err |= __put_user(kinfo-si_uid, uinfo-ssi_uid); break; case __SI_TIMER: -err |= __put_user(kinfo-si_tid, uinfo-tid); -err |= __put_user(kinfo-si_overrun, uinfo-overrun); -err |= __put_user((long)kinfo-si_ptr, uinfo-svptr); +err |= __put_user(kinfo-si_tid, uinfo-ssi_tid); +err |= __put_user(kinfo-si_overrun, uinfo-ssi_overrun); +err |= __put_user((long) kinfo-si_ptr, uinfo-ssi_ptr); break; case __SI_POLL: - err |= __put_user(kinfo-si_band, uinfo-band); - err |= __put_user(kinfo-si_fd, uinfo-fd); + err |= __put_user(kinfo-si_band, uinfo-ssi_band); + err |= __put_user(kinfo-si_fd, uinfo-ssi_fd); break; case __SI_FAULT: - err |= __put_user((long)kinfo-si_addr, uinfo-addr); + err |= __put_user((long) kinfo-si_addr, uinfo-ssi_addr); #ifdef __ARCH_SI_TRAPNO - err |= __put_user(kinfo-si_trapno, uinfo-trapno); + err |= __put_user(kinfo-si_trapno, uinfo-ssi_trapno); #endif break; case __SI_CHLD: - err |= __put_user(kinfo-si_pid, uinfo-pid); - err |= __put_user(kinfo-si_uid, uinfo-uid); - err |= __put_user(kinfo-si_status, uinfo-status); - err |= __put_user(kinfo-si_utime, uinfo-utime); - err |= __put_user(kinfo-si_stime, uinfo-stime); + err |= __put_user(kinfo-si_pid, uinfo-ssi_pid); + err |= __put_user(kinfo-si_uid, uinfo-ssi_uid); + err |= __put_user(kinfo-si_status, uinfo-ssi_status); + err |= __put_user(kinfo-si_utime, uinfo-ssi_utime); + err |= __put_user(kinfo-si_stime, uinfo-ssi_stime); break; case __SI_RT: /* This is not generated by the kernel as of now. */ case __SI_MESGQ: /* But this is */ - err |= __put_user(kinfo-si_pid, uinfo-pid); - err |= __put_user(kinfo-si_uid, uinfo-uid); - err |= __put_user((long)kinfo-si_ptr, uinfo-svptr); + err |= __put_user(kinfo-si_pid, uinfo-ssi_pid); + err |= __put_user(kinfo-si_uid, uinfo-ssi_uid); + err |= __put_user((long) kinfo-si_ptr, uinfo-ssi_ptr); break; default: /* this is just in case for now ... */ - err |= __put_user(kinfo-si_pid, uinfo-pid); - err |= __put_user(kinfo-si_uid, uinfo-uid); + err |= __put_user(kinfo-si_pid, uinfo-ssi_pid); + err |= __put_user(kinfo-si_uid, uinfo-ssi_uid); break; } Index: linux-2.6.mod/include/linux/signalfd.h === --- linux-2.6.mod.orig/include/linux/signalfd.h 2007-09-27 10:28:59.0 -0700 +++ linux-2.6.mod/include/linux/signalfd.h 2007-09-27 10:31:31.0 -0700 @@ -10,22 +10,22 @@ struct signalfd_siginfo { - __u32 signo; - __s32 err; - __s32 code; - __u32 pid; - __u32 uid; - __s32 fd; - __u32 tid; - __u32 band; - __u32 overrun; - __u32 trapno; - __s32 status; - __s32 svint; - __u64 svptr; - __u64 utime; - __u64 stime; - __u64 addr; + __u32 ssi_signo; + __s32 ssi_errno; + __s32 ssi_code; + __u32 ssi_pid; + __u32 ssi_uid; + __s32 ssi_fd; + __u32 ssi_tid; + __u32
[patch] anon-inodes use open coded atomic_inc for the shared inode
Since we know the shared inode count is always 0, we can avoid igrab() and use an open coded atomic_inc(). Signed-off-by: Davide Libenzi [EMAIL PROTECTED] - Davide --- fs/anon_inodes.c | 25 - 1 file changed, 12 insertions(+), 13 deletions(-) Index: linux-2.6.mod/fs/anon_inodes.c === --- linux-2.6.mod.orig/fs/anon_inodes.c 2007-09-27 11:26:57.0 -0700 +++ linux-2.6.mod/fs/anon_inodes.c 2007-09-27 11:51:42.0 -0700 @@ -76,7 +76,6 @@ { struct qstr this; struct dentry *dentry; - struct inode *inode; struct file *file; int error, fd; @@ -86,15 +85,9 @@ if (!file) return -ENFILE; - inode = igrab(anon_inode_inode); - if (IS_ERR(inode)) { - error = PTR_ERR(inode); - goto err_put_filp; - } - error = get_unused_fd(); if (error 0) - goto err_iput; + goto err_put_filp; fd = error; /* @@ -108,14 +101,22 @@ dentry = d_alloc(anon_inode_mnt-mnt_sb-s_root, this); if (!dentry) goto err_put_unused_fd; + + /* +* We know the anon_inode inode count is always greater than zero, +* so we can avoid doing an igrab() and we can use an open-coded +* atomic_inc(). +*/ + atomic_inc(anon_inode_inode-i_count); + dentry-d_op = anon_inodefs_dentry_operations; /* Do not publish this dentry inside the global dentry hash table */ dentry-d_flags = ~DCACHE_UNHASHED; - d_instantiate(dentry, inode); + d_instantiate(dentry, anon_inode_inode); file-f_path.mnt = mntget(anon_inode_mnt); file-f_path.dentry = dentry; - file-f_mapping = inode-i_mapping; + file-f_mapping = anon_inode_inode-i_mapping; file-f_pos = 0; file-f_flags = O_RDWR; @@ -127,14 +128,12 @@ fd_install(fd, file); *pfd = fd; - *pinode = inode; + *pinode = anon_inode_inode; *pfile = file; return 0; err_put_unused_fd: put_unused_fd(fd); -err_iput: - iput(inode); err_put_filp: put_filp(file); return error; - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: F_DUPFD_CLOEXEC implementation
On Fri, 28 Sep 2007, Ulrich Drepper wrote: One more small change to extend the availability of creation of file descriptors with FD_CLOEXEC set. Adding a new command to fcntl() requires no new system call and the overall impact on code size if minimal. If this patch gets accepted we will also add this change to the next revision of the POSIX spec. To test the patch, use the following little program. Adjust the value of F_DUPFD_CLOEXEC appropriately. I think new system calls would have been a cleaner way to accomplish this. The small pill at a time may have better chance to go in, but will likely result in an uglier userspace interface. In any case, this is better than *nothing*, if it makes it easier to use fds inside system libraries. - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [rfc][patch] i386: remove comment about barriers
On Sat, 29 Sep 2007, Nick Piggin wrote: [ This is true for x86's sfence/lfence, but raises a question about Linux's memory barriers. Does anybody expect that a sequence of smp_wmb and smp_rmb would ever provide a full smp_mb barrier? I've always assumed no, but I don't know if it is actually documented? ] No, that can't be. rmb+wmb can't be considered a full mb. There was a recent discussion about this in the thread originated by peterz scalable rw_mutex patches. - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC 3/9] RCU: Preemptible RCU
On Sun, 30 Sep 2007, Oleg Nesterov wrote: Ah, but I asked the different question. We must see CPU 1's stores by definition, but what about CPU 0's stores (which could be seen by CPU 1)? Let's take a real life example, A = B = X = 0; P = Q = A; CPU_0 CPU_1 CPU_2 P = B; *P = 1; if (X) { wmb(); rmb(); X = 1; BUG_ON(*P != 1 *Q != 1); } So, is it possible that CPU_1 sees P == B, but CPU_2 sees P == A ? That can't be. CPU_2 sees X=1, that happened after (or same time at most - from a cache inv. POV) to *P=1, that must have happened after P=B (in order for *P to assign B). So P=B happened, from a pure time POV, before the rmb(), and the rmb() should guarantee that CPU_2 sees P=B too. - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: F_DUPFD_CLOEXEC implementation
On Sun, 30 Sep 2007, Denys Vlasenko wrote: Hi Ulrich, On Friday 28 September 2007 18:34, Ulrich Drepper wrote: One more small change to extend the availability of creation of file descriptors with FD_CLOEXEC set. Adding a new command to fcntl() requires no new system call and the overall impact on code size if minimal. Tangential question: do you have any idea how userspace can safely do nonblocking read or write on a potentially-shared fd? IIUC, currently it cannot be done without races: old_flags = fcntl(fd, F_GETFL); ...other process may change flags!... fcntl(fd, F_SETFL, old_flags | O_NONBLOCK); read(fd, ...) ...other process may see flags changed under its feet!... fcntl(fd, F_SETFL, old_flags); Can this be fixed? I'm not sure I understood correctly your use case. But, if you have two processes/threads randomly switching O_NONBLOCK on/off, your problems arise not only the F_SETFL time. If one of the tasks is not expecting an fd to be O_NONBLOCK, that will likely end up not handling correctly read/write-miss situations. In that case it'd be better to keep the fd as O_NONBLOCK, and manually create blocking behaviour (when needed) with poll+read/write. - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: F_DUPFD_CLOEXEC implementation
On Mon, 1 Oct 2007, Denys Vlasenko wrote: My use case is: I want to do a nonblocking read on descriptor 0 (stdin). It may be a pipe or a socket. There may be other processes which share this descriptor with me, I simply cannot know that. And they, too, may want to do reads on it. I want to do nonblocking read in such a way that neither those other processes will ever see fd switching to O_NONBLOCK and back, and I also want to be safe from other processes doing the same. I don't see how this can be done using standard unix primitives. Indeed. You could simulate non-blocking using poll with zero timeout, but if another task may read/write on it, your following read/write may end up blocking even after a poll returned the required events. One way to solve this would be some sort of readx/writex where you pass an extra flags parameter (this could be done with sys_indirect, assuming we'll ever get that mainline) where you specify the non-blocking requirement for-this-call, and not as global per-file flag. Then, of course, you'll have to modify all the file-f_flags O_NONBLOCK tests (and there are many of them) to check for that flag too (that can be a per task_struct flag). - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC 3/9] RCU: Preemptible RCU
On Sun, 30 Sep 2007, Paul E. McKenney wrote: On Sun, Sep 30, 2007 at 04:02:09PM -0700, Davide Libenzi wrote: On Sun, 30 Sep 2007, Oleg Nesterov wrote: Ah, but I asked the different question. We must see CPU 1's stores by definition, but what about CPU 0's stores (which could be seen by CPU 1)? Let's take a real life example, A = B = X = 0; P = Q = A; CPU_0 CPU_1 CPU_2 P = B; *P = 1; if (X) { wmb(); rmb(); X = 1; BUG_ON(*P != 1 *Q != 1); } So, is it possible that CPU_1 sees P == B, but CPU_2 sees P == A ? That can't be. CPU_2 sees X=1, that happened after (or same time at most - from a cache inv. POV) to *P=1, that must have happened after P=B (in order for *P to assign B). So P=B happened, from a pure time POV, before the rmb(), and the rmb() should guarantee that CPU_2 sees P=B too. Actually, CPU designers have to go quite a ways out of their way to prevent this BUG_ON from happening. One way that it would happen naturally would be if the cache line containing P were owned by CPU 2, and if CPUs 0 and 1 shared a store buffer that they both snooped. So, here is what could happen given careless or sadistic CPU designers: Ohh, I misinterpreted that rmb(), sorry. Somehow I gave it for granted that it was a cross-CPU sync point (ala read_barrier_depends). If that's a local CPU load ordering only, things are different, clearly. But ... o CPU 0 stores B to P, but misses the cache, so puts the result in the store buffer. This means that only CPUs 0 and 1 can see it. o CPU 1 fetches P, and sees B, so stores a 1 to B. Again, this value for P is visible only to CPUs 0 and 1. o CPU 1 executes a wmb(), which forces CPU 1's stores to happen in order. But it does nothing about CPU 0's stores, nor about CPU 1's loads, for that matter (and the only reason that POWER ends up working the way you would like is because wmb() turns into sync rather than the eieio instruction that would have been used for smp_wmb() -- which is maybe what Oleg was thinking of, but happened to abbreviate. If my analysis is buggy, Anton and Paulus will no doubt correct me...) If a store buffer is shared between CPU_0 and CPU_1, it is very likely that a sync done on CPU_1 is going to sync even CPU_0 stores that are held in the buffer at the time of CPU_1's sync. - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: F_DUPFD_CLOEXEC implementation
On Mon, 1 Oct 2007, Denys Vlasenko wrote: On Monday 01 October 2007 19:16, Al Viro wrote: * it's on a bunch of cyclic lists. Have its neighbor go away while you are doing all that crap = boom * there's that thing call current position... It gets buggered. * overwriting it while another task might be in the middle of syscall involving it = boom Hm, I suspected that it's herecy. Any idea how to do it cleanly? * non-cooperative tasks reading *in* *parallel* from the same opened file are going to have a lot more serious problems than agreeing on O_NONBLOCK anyway, so I really don't understand what the hell is that for. They don't even need to read in parallel, just having shared fd is enough. Think about pipes, sockets and terminals. A real-world scenario: * a process started from shell (interactive or shell script) * it sets O_NONBLOCK and does a read from fd 0... * it gets killed (kill -9, whatever) * shell suddenly has it's fd 0 in O_NONBLOCK mode * shell and all subsequent commands started from it unexpectedly have O_NONBLOCKed stdin. I told you how in the previous email. You cannot use the: 1) set O_NONBLOCK 2) read/write 3) unset O_NONBLOCK in a racy-free fashion, w/out wrapping it with a lock (thing that we don't want to do). PS: send/recv are socket functions, and you really don't want to overload them for other fds. - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC 3/9] RCU: Preemptible RCU
On Mon, 1 Oct 2007, Paul E. McKenney wrote: That would indeed be one approach that CPU designers could take to avoid being careless or sadistic. ;-) That'd be the easier (unique maybe) approach too for them, from an silicon complexity POV. Distinguishing between different CPUs stores once inside a shared store buffer, would require tagging them in some way. That'd defeat most of the pros of having a shared store buffer ;) - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: F_DUPFD_CLOEXEC implementation
On Tue, 2 Oct 2007, Denys Vlasenko wrote: I have following proposals: * make recv(..., MSG_DONTWAIT) work on any fd Sounds neat, but not trivial to implement in current kernel. This is mildly ugly, if you ask me. Those are socket functions, and the flags parameter contain some pretty specific network meanings. * new fcntl command F_DUPFL: fcntl(fd, F_DUPFL, n): Analogous to F_DUPFD, but gives you *unshared* copy of the fd. Further seeks, fcntl(fd, F_SETFL, O_NONBLOCK), etc won't affect any other process. You'll need an ad-hoc copy function though, since your memcpy-based one is gonna explode even before memcpy returns ;) You'll have problems with ref-counting too. And that layer is not designed to cleanly support that operation. Unfortunately the clean solution would involve changing a whole bunch of code, and I don't feel exactly sure it'd be worth it. - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC + PATCH] signalfd simplification
On Sun, 2 Sep 2007, Oleg Nesterov wrote: We can optimize this later, using a clever wait_queue_func_t if needed. Good idea. Will do ... - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Revised timerfd() interface
On Tue, 4 Sep 2007, Andrew Morton wrote: On Tue, 04 Sep 2007 10:03:56 +0200 Michael Kerrisk [EMAIL PROTECTED] wrote: Davide, Davide -- ping! Can you please offer your comments about this change, and also thoughts on Jon's and my comments about a more radical API change later in this thread. IMO the complexity of the resulting API (and resulting patch), and the ABI change, is not justified by the added value. Neither of the proposed APIs (either my multiplexed version of timerfd() or Jon's/my idea of using three system calls (like POSIX timers), or the notion of timerfd() integrated with POSIX timers) is more complicated than the existing POSIX timers API. The ABI change doesn't really matter, since timerfd() was broken in 2.6.22 anyway. Both previous APIs provided the features I have described provide: * the ability to fetch the old timer value when applying a new setting * the ability to non-destructively fetch the amount of time remaining on a timer. This is clearly useful for timers -- but you have not explained why you think this is not necessary for timerfd timers. wakes up I'd have thought that the existing stuff would be near-useless without the capabilities which you describe? Useless like it'd be a motorcycle w/out a cup-holder :) Seriously, the ability to get the previous values from something could have a meaning if this something is a shared global resource (like signals for example). In the timerfd case this makes little sense, since you can create as many timerfd as you like and you do not need to share a single one by changing/restoring the original context. On top of that, the cup-holder addition would cost in terms of API clarity (or in terms of two additional system calls in the other case), and in terms of kernel code footprint. Costs that IMO are not balanced by the added values. - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Revised timerfd() interface
On Tue, 4 Sep 2007, Michael Kerrisk wrote: Hi Davide, wakes up I'd have thought that the existing stuff would be near-useless without the capabilities which you describe? Useless like it'd be a motorcycle w/out a cup-holder :) Seriously, the ability to get the previous values from something could have a meaning if this something is a shared global resource (like signals for example). In the timerfd case this makes little sense, since you can create as many timerfd as you like and you do not need to share a single one by changing/restoring the original context. However, one can have multipe POSIX timers, just as you can have multiple timerfd timers; nevertheless POSIX timers provide the get and get-while-setting functionality. The fact that POSIX defined a certain API in a given way, does not automatically mean that every other API has to look exactly like that. POSIX has the tendency to bloat things up at times ;) and in terms of kernel code footprint. Not sure what your concern is here. The total amount of new code for all of these options is pretty small. From your patch: fs/compat.c | 34 -- fs/timerfd.c | 147 +++ include/linux/compat.h |3 include/linux/syscalls.h |3 4 files changed, 153 insertions(+), 34 deletions(-) And the API definition becomes pretty messy. The other way is to add new system calls. 120+ lines of code more of new system calls wouldn't even be a problem in itself, if the added value was there. IMO, as I already said, the added value does not justify them. - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Revised timerfd() interface
On Tue, 4 Sep 2007, Michael Kerrisk wrote: Useless like it'd be a motorcycle w/out a cup-holder :) Seriously, the ability to get the previous values from something could have a meaning if this something is a shared global resource (like signals for example). In the timerfd case this makes little sense, since you can create as many timerfd as you like and you do not need to share a single one by changing/restoring the original context. Davide, As I think about this more, I see more problems with your argument. timerfd needs the ability to get and get-while-setting just as much as the earlier APIs. Consider a library that creates a timerfd file descriptor that is handed off to an application: that library may want to modify the timer settings without having to create a new file descriptor (the app mey not be able to be told about the new fd). Your argument just doesn't hold, AFAICS. Such hypotethical library, in case it really wanted to offer such functionality, could simply return an handle instead of the raw fd, and take care of all that stuff in userspace. Again, mimicking POSIX APIs doesn't always take you in the right place. - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Revised timerfd() interface
On Wed, 5 Sep 2007, Michael Kerrisk wrote: Davide, A Michael! As I think about this more, I see more problems with your argument. timerfd needs the ability to get and get-while-setting just as much as the earlier APIs. Consider a library that creates a timerfd file descriptor that is handed off to an application: that library may want to modify the timer settings without having to create a new file descriptor (the app mey not be able to be told about the new fd). Your argument just doesn't hold, AFAICS. Such hypotethical library, in case it really wanted to offer such functionality, could simply return an handle instead of the raw fd, and take care of all that stuff in userspace. Did I miss something? Is it not the case that as soon as the library returns a handle, rather than an fd, then the whole advantage of timerfd() (being able to select/poll/epoll on the timer as well as other fds) is lost? Why? The handle would simply be a little struct where the timerfd fd is stored, and a XXX_getfd() would return it. So my point is, I doubt such functionalities are really needed, and I also argue that the kernel is the best place for such wrapper code to go. - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Revised timerfd() interface
On Wed, 5 Sep 2007, Michael Kerrisk wrote: Hi Davide, As I think about this more, I see more problems with your argument. timerfd needs the ability to get and get-while-setting just as much as the earlier APIs. Consider a library that creates a timerfd file descriptor that is handed off to an application: that library may want to modify the timer settings without having to create a new file descriptor (the app mey not be able to be told about the new fd). Your argument just doesn't hold, AFAICS. Such hypotethical library, in case it really wanted to offer such functionality, could simply return an handle instead of the raw fd, and take care of all that stuff in userspace. Did I miss something? Is it not the case that as soon as the library returns a handle, rather than an fd, then the whole advantage of timerfd() (being able to select/poll/epoll on the timer as well as other fds) is lost? Why? The handle would simply be a little struct where the timerfd fd is stored, and a XXX_getfd() would return it. So my point is, I doubt such functionalities are really needed, and I also argue that the kernel is the best place for such wrapper code to go. So what happens if one thread (via the library) wants modify a timer's settings at the same timer as another thread is select()ing on it? The first thread can't do this by creating a new timerfd timer, since it wants to affect the select() in the other thread? It can be done w/out any problems. The select thread will be notified whenever the new timer setting expires. - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Revised timerfd() interface
On Thu, 6 Sep 2007, Michael Kerrisk wrote: Hi Davide, As I think about this more, I see more problems with your argument. timerfd needs the ability to get and get-while-setting just as much as the earlier APIs. Consider a library that creates a timerfd file descriptor that is handed off to an application: that library may want to modify the timer settings without having to create a new file descriptor (the app mey not be able to be told about the new fd). Your argument just doesn't hold, AFAICS. Such hypotethical library, in case it really wanted to offer such functionality, could simply return an handle instead of the raw fd, and take care of all that stuff in userspace. Did I miss something? Is it not the case that as soon as the library returns a handle, rather than an fd, then the whole advantage of timerfd() (being able to select/poll/epoll on the timer as well as other fds) is lost? Why? The handle would simply be a little struct where the timerfd fd is stored, and a XXX_getfd() would return it. So my point is, I doubt such functionalities are really needed, and I also argue that the kernel is the best place for such wrapper code to go. So what happens if one thread (via the library) wants modify a timer's settings at the same timer as another thread is select()ing on it? The first thread can't do this by creating a new timerfd timer, since it wants to affect the select() in the other thread? It can be done w/out any problems. The select thread will be notified whenever the new timer setting expires. We are going in circles here. I think you are missing my point. Consider the following [[ Thread A: calls library function which creates a timerfd file descriptor. Thread B: calls select() on the timerfd file descriptor. Thread A: calls library function which wants to: a) modify timer settings, and retrieve copy of current timer settings, and later b) restore old timer settings. ]] This seems a quite reasonable use-case to me, and the existing interface simply can't support it. Quite reasonable? :) I honestly doubt it, but anyway. Modulo error checking: struct tfd { int fd, clockid; struct itimerspec ts; }; struct tfd *tfd_create(int clockid, int flags, const struct itimerspec *ts) { struct tfd *th; th = malloc(sizeof(*th)); th-clockid = clockid; th-ts = *ts; th-fd = timerfd(-1, clockid, flags, ts); return th; } void tfd_close(struct tfd *th) { close(th-fd); free(th); } int tfd_getfd(const struct tfd *th) { return th-fd; } int tfd_gettime(const struct tfd *th, int *clockid, struct itimerspec *ts) { *clockid = th-clockid; *ts = th-ts; return 0; } int tfd_settime(struct tfd *th, int clockid, int flags, const struct itimerspec *ts) { th-fd = timerfd(th-fd, clockid, flags, ts); th-clockid = clockid; th-ts = *ts; return 0; } Wrap the get/set with a mutex in case you plan to shoot yourself in a foot by doing get/set from multiple threads ;) So, once again: - I sincerly doubt the above is common usage/design patters for timerfds * timerfds are not a common global resource, ala signals, that requires get+set+restore pattern - you can have many of them set to different times - Those IMO *very* special use cases can be handled in userspace with few lines of code, *if* really needed - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/